glam/QUICK_STATUS_20251119_POST_NRW.md
2025-11-19 23:25:22 +01:00

# Quick Status - Post NRW Merge (2025-11-19)
## Current Dataset Totals
| Country | Institutions | Sources | Status |
|---------|--------------|---------|--------|
| **Netherlands** | 1,351 | Dutch Orgs CSV | ✅ Complete |
| **Germany** | **20,846** | ISIL + DDB + **NRW** ⭐ | ✅ Complete |
| **Other Priority 1** | 16,282 | ISIL registries | 🔄 In Progress |
| **TOTAL** | **38,479** | Multiple | **39.7% of 97K goal** |
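
The totals row can be re-derived from the per-country rows. The snippet below is a minimal sanity check, not part of the project scripts; the variable names are illustrative and the numbers are copied from the table above.

```python
# Institution counts copied from the totals table above.
totals = {
    "Netherlands": 1_351,
    "Germany": 20_846,
    "Other Priority 1": 16_282,
}
GOAL = 97_000  # the "97K goal" from the table

grand_total = sum(totals.values())
progress_pct = round(100 * grand_total / GOAL, 1)

print(f"{grand_total:,} / {GOAL:,} ({progress_pct}%)")
# prints: 38,479 / 97,000 (39.7%)
```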
## What Changed Today
### NRW Archives Integration ⭐
- **Discovered**: archive.nrw.de portal (523+ archives)
- **Harvested**: 441 NRW archives in 9.3 seconds
- **Merged**: 85 new institutions (356 duplicates filtered)
- **Geocoded**: 53 new cities
- **Impact**: NRW coverage 26 → 441 (≈ +1,600%)
### German Dataset Growth
| Version | Institutions | Sources | Change |
|---------|--------------|---------|--------|
| Before (v1) | 20,761 | ISIL + DDB | baseline |
| After (v2) | **20,846** | ISIL + DDB + NRW | **+85 (NRW)** |
## Key Files
### Production Scripts
- `scripts/scrapers/harvest_nrw_archives_fast.py`
- `scripts/scrapers/merge_nrw_to_german_dataset.py`
### Data Files
- `data/isil/germany/nrw_archives_fast_20251119_203700.json` (441 archives)
- `data/isil/germany/german_institutions_unified_v2_20251119_211132.json` (20,846 institutions) ⭐
### Documentation
- `NRW_HARVEST_COMPLETE_20251119.md` (technical details)
- `SESSION_SUMMARY_20251119_NRW_MERGE_COMPLETE.md` (full summary)
## Next Steps
### Continue Phase 1 Harvest
**Priority 1 Countries** (Target: 97,000 institutions):
- ✅ Netherlands - 1,351 institutions
- ✅ Germany - 20,846 institutions
- 🔄 Denmark - TBD
- 🔄 Austria - TBD
- 🔄 Belgium - TBD
- 🔄 Czech Republic - TBD
- 🔄 France - TBD
- 🔄 Switzerland - TBD

**Current Progress**: 38,479 / 97,000 (39.7%)
## Statistics Summary
### Merge Results
- Input (Unified): 20,761 institutions
- Input (NRW): 441 archives
- Duplicates: 356 (80.7%)
- New Added: 85 (19.3%)
- **Output: 20,846 institutions**
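
The merge figures above are internally consistent, which can be checked with a few lines of arithmetic. This is only a verification sketch using the reported numbers; the variable names are not taken from the merge script.

```python
unified_before = 20_761  # Input (Unified)
nrw_harvested = 441      # Input (NRW)
duplicates = 356         # NRW records already present in the unified set

# Records that survive deduplication are the ones actually added.
new_added = nrw_harvested - duplicates
unified_after = unified_before + new_added

dup_pct = round(100 * duplicates / nrw_harvested, 1)
new_pct = round(100 * new_added / nrw_harvested, 1)

print(new_added, unified_after, dup_pct, new_pct)
# prints: 85 20846 80.7 19.3
```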
### Geocoding
- Successfully Geocoded: 53
- Failed: 2
- No City Data: 30
- **Coverage: 71.3%** (stable)
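
The three geocoding outcomes add up to the 85 newly added institutions. The sketch below assumes that records without city data were never sent to the geocoder, which is consistent with the 55 API calls and the 96.4% success rate reported elsewhere in this file; that reading is an inference from the numbers, not something the scripts confirm.

```python
geocoded = 53  # Successfully Geocoded
failed = 2     # Failed
no_city = 30   # No City Data

# Assumption: the three outcomes partition the newly added records.
new_records = geocoded + failed + no_city

# Assumption: one API call per record that had a city to look up.
api_calls = geocoded + failed
success_pct = round(100 * geocoded / api_calls, 1)

print(new_records, api_calls, success_pct)
# prints: 85 55 96.4
```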
### Institution Types (NRW)
- Archive: 416
- Education Provider: 7
- Corporation: 6
- Official Institution: 5
- Holy Sites: 4
- Research Center: 3
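
As a quick check, the type counts sum to the 441 harvested archives, and the share of each type follows directly. The dictionary below simply restates the list above; nothing here is read from the data files.

```python
type_counts = {
    "Archive": 416,
    "Education Provider": 7,
    "Corporation": 6,
    "Official Institution": 5,
    "Holy Sites": 4,
    "Research Center": 3,
}

total = sum(type_counts.values())

# Percentage share of each institution type, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in type_counts.items()}

print(total, shares["Archive"])
# prints: 441 94.3
```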
## Performance Metrics
### Harvester (v3.0)
- **Runtime**: 9.3 seconds
- **Archives**: 441
- **Rate**: 47.4 archives/second
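
The reported rate follows from the runtime and archive count; the per-archive time is derived here for convenience and is not a separately measured figure.

```python
runtime_s = 9.3  # Runtime in seconds
archives = 441   # Archives harvested

rate = round(archives / runtime_s, 1)                 # archives per second
ms_per_archive = round(1000 * runtime_s / archives, 1)  # average milliseconds per archive

print(rate, ms_per_archive)
# prints: 47.4 21.1
```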
### Merge
- **Runtime**: ~8 minutes
- **Comparisons**: ~9.2M (441 × 20,761 = 9,155,601)
- **Geocoding**: 55 API calls
- **Output Size**: 39 MB
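
The comparison count is what a brute-force pairwise merge produces: every incoming record is checked against every existing one. The sketch below illustrates that pattern only. The matching rule (same city plus a name-similarity threshold), the function names, and the sample records are all hypothetical; the actual logic lives in `merge_nrw_to_german_dataset.py` and is not reproduced here.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Hypothetical rule: cities must not conflict and names must be similar."""
    if a.get("city") and b.get("city") and a["city"].lower() != b["city"].lower():
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def merge(existing: list[dict], incoming: list[dict]) -> tuple[list[dict], int]:
    """Brute-force merge: each incoming record is compared against existing ones."""
    merged = list(existing)
    duplicates = 0
    for record in incoming:
        if any(is_duplicate(record, known) for known in existing):
            duplicates += 1
        else:
            merged.append(record)
    return merged, duplicates


# Upper bound on comparisons for the sizes reported above.
comparisons = 441 * 20_761

# Made-up sample records for illustration, not taken from the dataset.
existing = [{"name": "Stadtarchiv Köln", "city": "Köln"}]
incoming = [
    {"name": "stadtarchiv  köln", "city": "Köln"},
    {"name": "Stadtarchiv Bonn", "city": "Bonn"},
]
merged, dupes = merge(existing, incoming)
print(len(merged), dupes, f"{comparisons:,}")
# prints: 2 1 9,155,601
```

If the merge runtime becomes a bottleneck for larger countries, grouping records by city before comparing is one common way to cut the number of pairs.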
## Data Quality
| Metric | Value |
|--------|-------|
| Duplicates Detected (share of NRW input) | 80.7% |
| Geocoding Success | 96.4% |
| City Completeness | 83.7% |
| Type Classification | 100% |
## Ready for Next Session
- ✅ All code production-ready
- ✅ German dataset complete (ISIL + DDB + NRW)
- ✅ Documentation complete
- ✅ Ready to continue Phase 1 harvests

---

**Last Updated**: 2025-11-19 22:15 UTC

**Status**: ✅ Session Complete