# Quick Status - Post NRW Merge (2025-11-19)

## Current Dataset Totals

| Country | Institutions | Sources | Status |
|---------|--------------|---------|--------|
| **Netherlands** | 1,351 | Dutch Orgs CSV | ✅ Complete |
| **Germany** | **20,846** | ISIL + DDB + **NRW** ⭐ | ✅ Complete |
| **Other Priority 1** | 16,282 | ISIL registries | 🔄 In Progress |
| **TOTAL** | **38,479** | Multiple | **39.7% of 97K goal** |

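The totals row can be re-derived from the per-country counts in the table above. A minimal sketch in Python, using only figures from that table; the variable names are illustrative and are not taken from the project's scripts:

```python
# Per-country institution counts, copied from the totals table.
counts = {
    "Netherlands": 1_351,
    "Germany": 20_846,
    "Other Priority 1": 16_282,
}
GOAL = 97_000  # Phase 1 target

total = sum(counts.values())
progress_pct = round(100 * total / GOAL, 1)

print(f"{total:,} / {GOAL:,} ({progress_pct}%)")  # 38,479 / 97,000 (39.7%)
```
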
## What Changed Today

### NRW Archives Integration ⭐

- **Discovered**: archive.nrw.de portal (523+ archives)
- **Harvested**: 441 NRW archives in 9.3 seconds
- **Merged**: 85 new institutions (356 duplicates filtered)
- **Geocoded**: 53 new cities
- **Impact**: NRW coverage 26 → 441 (+1600%)

### German Dataset Growth

| Version | Institutions | Sources / Change |
|---------|--------------|------------------|
| Before (v1) | 20,761 | ISIL + DDB |
| After (v2) | **20,846** | **+85 (NRW)** |

## Key Files

### Production Scripts
- ✅ `scripts/scrapers/harvest_nrw_archives_fast.py`
- ✅ `scripts/scrapers/merge_nrw_to_german_dataset.py`

### Data Files
- ✅ `data/isil/germany/nrw_archives_fast_20251119_203700.json` (441 archives)
- ✅ `data/isil/germany/german_institutions_unified_v2_20251119_211132.json` (20,846 institutions) ⭐

### Documentation
- ✅ `NRW_HARVEST_COMPLETE_20251119.md` (technical details)
- ✅ `SESSION_SUMMARY_20251119_NRW_MERGE_COMPLETE.md` (full summary)

## Next Steps

### Continue Phase 1 Harvest

**Priority 1 Countries** (Target: 97,000 institutions):
- ✅ Netherlands - 1,351 institutions
- ✅ Germany - 20,846 institutions
- 🔄 Denmark - TBD
- 🔄 Austria - TBD
- 🔄 Belgium - TBD
- 🔄 Czech Republic - TBD
- 🔄 France - TBD
- 🔄 Switzerland - TBD

**Current Progress**: 38,479 / 97,000 (39.7%)

## Statistics Summary

### Merge Results
- Input (Unified): 20,761 institutions
- Input (NRW): 441 archives
- Duplicates: 356 (80.7%)
- New Added: 85 (19.3%)
- **Output: 20,846 institutions**

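The merge counts follow a simple pattern: each incoming NRW record is either matched to an existing institution (a duplicate) or appended as new. The sketch below illustrates that bookkeeping on toy data. The matching rule (normalized name plus city) and the helper names `normalize` and `merge_new` are assumptions made for illustration; the actual logic in `merge_nrw_to_german_dataset.py` is not shown in this document, and the reported 441 × 20,761 comparisons suggest the real script compares records pairwise rather than by exact key.

```python
def normalize(record):
    """Build a comparison key from name and city (hypothetical matching rule)."""
    name = record["name"].strip().casefold()
    city = record.get("city", "").strip().casefold()
    return (name, city)


def merge_new(existing, incoming):
    """Append incoming records whose key is not already present.

    Returns (merged_list, duplicate_count, added_count).
    """
    seen = {normalize(r) for r in existing}
    merged = list(existing)
    duplicates = added = 0
    for record in incoming:
        key = normalize(record)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
            merged.append(record)
            added += 1
    return merged, duplicates, added


# Toy data, not real records from the dataset.
existing = [{"name": "Stadtarchiv Köln", "city": "Köln"}]
incoming = [
    {"name": "stadtarchiv köln ", "city": "Köln"},  # duplicate after normalization
    {"name": "Stadtarchiv Bonn", "city": "Bonn"},   # new
]
merged, duplicates, added = merge_new(existing, incoming)
print(len(merged), duplicates, added)  # 2 1 1
```

With the reported figures, the same accounting reads 356 duplicates + 85 added = 441 inputs, and 20,761 + 85 = 20,846 outputs.
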
### Geocoding
- Successfully Geocoded: 53
- Failed: 2
- No City Data: 30
- **Coverage: 71.3%** (stable)

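The three geocoding outcomes appear to account for all 85 newly added institutions, and the 96.4% success figure reported under Data Quality matches a rate measured against attempted lookups only. A small sketch of that arithmetic, with illustrative variable names:

```python
geocoded = 53  # lookups that returned coordinates
failed = 2     # lookups that returned no result
no_city = 30   # records with no city to look up

attempted = geocoded + failed    # lookups actually made
total_new = attempted + no_city  # all newly merged institutions
success_rate = round(100 * geocoded / attempted, 1)

print(attempted, total_new, success_rate)  # 55 85 96.4
```
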
### Institution Types (NRW)
- Archive: 416
- Education Provider: 7
- Corporation: 6
- Official Institution: 5
- Holy Sites: 4
- Research Center: 3

## Performance Metrics

### Harvester (v3.0)
- **Runtime**: 9.3 seconds
- **Archives**: 441
- **Rate**: 47.4 archives/second

### Merge
- **Runtime**: ~8 minutes
- **Comparisons**: ~9.2M (441 × 20,761)
- **Geocoding**: 55 API calls
- **Output Size**: 39 MB

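The harvest rate and the comparison count are a simple ratio and product of the numbers listed above. A quick check in Python, with illustrative names; the comparisons-per-second value is derived from the approximate 8-minute runtime (which also includes geocoding calls), so treat it as a rough estimate only:

```python
harvest_seconds = 9.3
archives = 441
unified_records = 20_761
merge_minutes = 8  # approximate, as reported

harvest_rate = round(archives / harvest_seconds, 1)  # archives per second
comparisons = archives * unified_records             # one check per (NRW, unified) pair
comparisons_per_sec = comparisons / (merge_minutes * 60)

print(harvest_rate)                   # 47.4
print(f"{comparisons:,}")             # 9,155,601
print(f"{comparisons_per_sec:,.0f}")  # 19,074
```
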
## Data Quality

| Metric | Value |
|--------|-------|
| Duplicate Detection | 80.7% |
| Geocoding Success | 96.4% |
| City Completeness | 83.7% |
| Type Classification | 100% |

## Ready for Next Session

- ✅ All code production-ready
- ✅ German dataset complete (ISIL + DDB + NRW)
- ✅ Documentation complete
- ✅ Ready to continue Phase 1 harvests

---

**Last Updated**: 2025-11-19 22:15 UTC
**Status**: ✅ Session Complete