# Quick Action Plan - German Regional Archive Harvests

**Context**: Discovered 12+ regional archive portals after the NRW success
**Current German Dataset**: 20,846 institutions (ISIL + DDB + NRW)
**Expected Growth**: +1,280 new archives → ~22,100 total

---

## Immediate Priorities (Next 4 Hours)

### Priority 1: Thüringen (30 min) ⭐ **START HERE**

**Why**: 149 archives CONFIRMED, simple portal structure
**URL**: https://www.archive-in-thueringen.de/
**Expected**: 119 new archives (after ~20% duplicates)

**Steps**:
1. Inspect portal structure (archive directory page)
2. Build Playwright scraper (similar to NRW)
3. Extract archive names, cities, types
4. Geocode with Nominatim
5. Merge with German dataset

**Script to Create**: `scripts/scrapers/harvest_thueringen_archives.py`

---

### Priority 2: Niedersachsen & Bremen (1 hour) ⭐

**Why**: Arcinsys platform, 300+ archives expected
**URL**: https://arcinsys.niedersachsen.de/
**Expected**: 280 new archives

**Steps**:
1. Navigate to the archive directory: https://www.arcinsys.de/archive/archive_niedersachsen_bremen.html
2. Extract all participating archives (Landesarchiv, Kreisarchive, Stadtarchive, etc.)
3. Parse institution names and locations
4. Geocode and merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_niedersachsen.py`

---

### Priority 3: Schleswig-Holstein (45 min)

**Why**: Same Arcinsys platform as Niedersachsen
**URL**: https://arcinsys.schleswig-holstein.de/
**Expected**: 120 new archives

**Steps**:
1. Reuse the Arcinsys scraper logic from Niedersachsen
2. Adapt for Schleswig-Holstein URLs
3. Extract and merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_schleswig_holstein.py`

---

### Priority 4: Hessen (45 min)

**Why**: Arcinsys platform (original developer)
**URL**: https://arcinsys.hessen.de/
**Expected**: 160 new archives

**Steps**:
1. Reuse the Arcinsys scraper
2. Extract Hessen archives
3. Merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_hessen.py`

---

## Medium Priority (Next 8 Hours)

### Priority 5: Baden-Württemberg (1.5 hours)
**URL**: https://www.landesarchiv-bw.de/
**Expected**: 200 new archives

### Priority 6: Bayern (1 hour)
**URL**: https://www.gda.bayern.de/archive
**Expected**: 40 new archives (9 state + municipal)

### Priority 7: Sachsen (1 hour)
**URL**: https://www.staatsarchiv.sachsen.de/
**Expected**: 120 new archives

### Priority 8: Sachsen-Anhalt (1 hour)
**URL**: https://landesarchiv.sachsen-anhalt.de/
**Expected**: 80 new archives

---

## Harvest Strategy

### Step 1: Build Arcinsys Master Scraper (2 hours)

**Target**: Niedersachsen, Bremen, Schleswig-Holstein, Hessen (4 states)
**Expected**: 560 new archives
**Logic**:

```python
# Arcinsys shared structure (Bremen is served via the Niedersachsen portal)
base_urls = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/",
}

# Archive directory pattern (consistent across Arcinsys)
# /archive/archive_{state}.html or similar

# Extract:
# - Archive name
# - Archive type (Landesarchiv, Kreisarchiv, Stadtarchiv, etc.)
# - City/location
# - Contact info
```

---

### Step 2: Custom Scrapers for High-Impact States (3 hours)

1. **Thüringen** - 149 archives
2. **Baden-Württemberg** - 200+ archives
3. **Bayern** - 9 state archives

---

### Step 3: Merge and Deduplicate (1 hour)

- Fuzzy match against the existing 20,846 German institutions
- Use a 90% similarity threshold (validated by NRW: 80.7% duplicate rate)
- Geocode new cities with Nominatim
- Generate unified German dataset v3.0

---

## Expected Timeline

| Task | Time | Cumulative | New Archives |
|------|------|------------|--------------|
| **Thüringen** | 30 min | 0:30 | +119 |
| **Niedersachsen & Bremen** | 1 hour | 1:30 | +280 |
| **Schleswig-Holstein** | 45 min | 2:15 | +120 |
| **Hessen** | 45 min | 3:00 | +160 |
| **Merge & Deduplicate** | 30 min | 3:30 | - |
| **PHASE 1 COMPLETE** | **3.5 hours** | - | **+679** |
| | | | |
| **Baden-Württemberg** | 1.5 hours | 5:00 | +200 |
| **Bayern** | 1 hour | 6:00 | +40 |
| **Sachsen** | 1 hour | 7:00 | +120 |
| **Sachsen-Anhalt** | 1 hour | 8:00 | +80 |
| **Merge & Deduplicate** | 30 min | 8:30 | - |
| **PHASE 2 COMPLETE** | **8.5 hours** | - | **+1,119** |

---

## German Dataset Evolution

| Version | Date | Institutions | Sources | Change |
|---------|------|--------------|---------|--------|
| v1.0 | 2025-11-19 13:49 | 8,129 | ISIL | - |
| v1.1 | 2025-11-19 18:18 | 20,761 | ISIL + DDB | +12,632 |
| v2.0 | 2025-11-19 21:11 | 20,846 | ISIL + DDB + NRW | +85 |
| **v3.0** | **2025-11-20 02:00** | **~21,525** | **+ Arcinsys (4 states)** | **+679** ⭐ |
| **v4.0** | **2025-11-20 08:00** | **~21,965** | **+ Regional (4 more states)** | **+440** |

---

## Output Files to Create

### Harvest Outputs (JSON)

1. `data/isil/germany/thueringen_archives_20251120_*.json` (149 archives)
2. `data/isil/germany/arcinsys_niedersachsen_20251120_*.json` (350 archives)
3. `data/isil/germany/arcinsys_schleswig_holstein_20251120_*.json` (150 archives)
4. `data/isil/germany/arcinsys_hessen_20251120_*.json` (200 archives)
5. `data/isil/germany/baden_wuerttemberg_archives_20251120_*.json` (250 archives)
6. `data/isil/germany/bayern_archives_20251120_*.json` (50 archives)
7. `data/isil/germany/sachsen_archives_20251120_*.json` (150 archives)
8. `data/isil/germany/sachsen_anhalt_archives_20251120_*.json` (100 archives)

### Merged Outputs

9. `data/isil/germany/german_institutions_unified_v3_20251120_*.json` (after Arcinsys merge)
10. `data/isil/germany/german_institutions_unified_v4_20251120_*.json` (after all regional merges)

### Scripts

11. `scripts/scrapers/harvest_thueringen_archives.py`
12. `scripts/scrapers/harvest_arcinsys_unified.py` (handles all 4 Arcinsys states)
13. `scripts/scrapers/harvest_baden_wuerttemberg_archives.py`
14. `scripts/scrapers/harvest_bayern_archives.py`
15. `scripts/scrapers/harvest_sachsen_archives.py`
16. `scripts/scrapers/harvest_sachsen_anhalt_archives.py`
17. `scripts/scrapers/merge_regional_to_german_dataset.py` (unified merger)

---

## Success Criteria

### Phase 1 (3.5 hours)
✅ Thüringen harvested (149 archives)
✅ Arcinsys consortium harvested (700+ raw archives)
✅ 679 new archives added to the German dataset (after deduplication)
✅ German dataset v3.0 created (~21,525 institutions)

### Phase 2 (8.5 hours)
✅ Baden-Württemberg harvested (250 archives)
✅ Bayern harvested (50 archives)
✅ Sachsen harvested (150 archives)
✅ Sachsen-Anhalt harvested (100 archives)
✅ 1,119 new archives added to the German dataset
✅ German dataset v4.0 created (~21,965 institutions)

---

## Phase 1 Progress Impact

| Metric | Before | After v3.0 | After v4.0 | Goal |
|--------|--------|------------|------------|------|
| **German Institutions** | 20,846 | 21,525 | 21,965 | ~22,000 |
| **Phase 1 Total** | 38,479 | 39,158 | 39,598 | 97,000 |
| **Progress %** | 39.7% | 40.4% | 40.8% | 100% |

**Impact**: +0.7 pp (v3.0) or +1.1 pp (v4.0) toward the Phase 1 goal

---

## Key Technical Notes

### Reuse NRW Pattern

The NRW harvest showed excellent results:
- **Fast text extraction** (no clicking)
- **Regex city parsing** (German archive name patterns)
- **Fuzzy deduplication** (90% threshold)
- **Nominatim geocoding** (1 req/sec)

**Apply the same approach** to all regional portals.
### Arcinsys Advantage

All 4 Arcinsys states share:
- The same portal structure
- The same archive directory format
- The same HTML/CSS patterns

**Build ONE scraper**, deploy it to 4 states → 700+ archives in ~3 hours

---

## Next Agent Instructions

**START WITH THÜRINGEN** (30 minutes, 119 new archives)

1. Navigate to: https://www.archive-in-thueringen.de/en/
2. Find the archive directory (likely under an "Archives" or "Institutions" menu)
3. Extract the 149 listed archives
4. Parse names, cities, types
5. Geocode and merge

**Then move to the Arcinsys consortium** (3 hours, 560 new archives)

---

**Ready to Execute**: YES
**Expected Total Time**: 3.5 hours (Phase 1) or 8.5 hours (Phases 1 + 2)
**Expected New Archives**: +679 (Phase 1) or +1,119 (Phases 1 + 2)
**Priority Level**: HIGH ⭐

---

**Generated**: 2025-11-19 22:35 UTC
**Reference**: GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md
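As a closing sketch, the 90% fuzzy-match threshold used for deduplication throughout this plan can be prototyped with the standard library alone. The normalization rules here (lowercasing, stripping dots, collapsing whitespace) are an assumption for illustration; the production merger may use a dedicated fuzzy-matching library instead.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so legal-form noise
    (e.g. a trailing 'e.V.') does not dominate the similarity score."""
    return " ".join(name.lower().replace(".", " ").split())

def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.90) -> bool:
    """True if the candidate matches any known institution at >= threshold."""
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(known)).ratio() >= threshold
        for known in existing
    )

known = ["Stadtarchiv Erfurt", "Kreisarchiv Gotha"]
print(is_duplicate("Stadtarchiv Erfurt e.V.", known))  # near-duplicate of an existing entry
print(is_duplicate("Landesarchiv Weimar", known))      # genuinely new institution
```

Checking each candidate against all 20,846 existing institutions this way is O(n·m); if it proves too slow, pre-bucketing by city before fuzzy matching is a cheap optimization.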