Quick Action Plan - German Regional Archive Harvests
Context: Discovered 12+ regional archive portals after NRW success
Current German Dataset: 20,846 institutions (ISIL + DDB + NRW)
Expected Growth: +1,280 new archives → ~22,100 total
Immediate Priorities (Next 4 Hours)
Priority 1: Thüringen (30 min) ⭐ START HERE
Why: 149 archives CONFIRMED, simple portal structure
URL: https://www.archive-in-thueringen.de/
Expected: 119 new archives (after 20% duplicates)
Steps:
- Inspect portal structure (archive directory page)
- Build Playwright scraper (similar to NRW)
- Extract archive names, cities, types
- Geocode with Nominatim
- Merge with German dataset
Script to Create: scripts/scrapers/harvest_thueringen_archives.py
Priority 2: Niedersachsen & Bremen (1 hour) ⭐
Why: Arcinsys platform, 300+ archives expected
URL: https://arcinsys.niedersachsen.de/
Expected: 280 new archives
Steps:
- Navigate to archive directory: https://www.arcinsys.de/archive/archive_niedersachsen_bremen.html
- Extract all participating archives (Landesarchiv, Kreisarchive, Stadtarchive, etc.)
- Parse institution names and locations
- Geocode and merge
Script to Create: scripts/scrapers/harvest_arcinsys_niedersachsen.py
Priority 3: Schleswig-Holstein (45 min)
Why: Same Arcinsys platform as Niedersachsen
URL: https://arcinsys.schleswig-holstein.de/
Expected: 120 new archives
Steps:
- Reuse Arcinsys scraper logic from Niedersachsen
- Adapt for Schleswig-Holstein URLs
- Extract and merge
Script to Create: scripts/scrapers/harvest_arcinsys_schleswig_holstein.py
Priority 4: Hessen (45 min)
Why: Arcinsys platform (original developer)
URL: https://arcinsys.hessen.de/
Expected: 160 new archives
Steps:
- Reuse Arcinsys scraper
- Extract Hessen archives
- Merge
Script to Create: scripts/scrapers/harvest_arcinsys_hessen.py
Medium Priority (Next 8 Hours)
Priority 5: Baden-Württemberg (1.5 hours)
URL: https://www.landesarchiv-bw.de/
Expected: 200 new archives
Priority 6: Bayern (1 hour)
URL: https://www.gda.bayern.de/archive
Expected: 40 new archives (9 state + municipal)
Priority 7: Sachsen (1 hour)
URL: https://www.staatsarchiv.sachsen.de/
Expected: 120 new archives
Priority 8: Sachsen-Anhalt (1 hour)
URL: https://landesarchiv.sachsen-anhalt.de/
Expected: 80 new archives
Harvest Strategy
Step 1: Build Arcinsys Master Scraper (2 hours)
Target: Niedersachsen, Bremen, Schleswig-Holstein, Hessen (4 states)
Expected: 560 new archives
Logic:
# Arcinsys shared structure: one base URL per portal.
# The Niedersachsen portal also covers Bremen.
base_urls = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/",
}
# Archive directory pattern (expected to be consistent across Arcinsys):
# /archive/archive_{state}.html or similar
# Extract per archive:
# - Archive name
# - Archive type (Landesarchiv, Kreisarchiv, Stadtarchiv, etc.)
# - City/location
# - Contact info
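The fields listed above could be derived from a directory entry roughly as follows. This is a minimal sketch, not the NRW implementation: `ArchiveRecord`, `parse_archive_name`, and the keyword list in `ARCHIVE_TYPES` are hypothetical names, and the regex assumes entries of the form "<type> <place>", which has to be checked against the real portal text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed type keywords; extend after inspecting the real directory pages.
ARCHIVE_TYPES = ("Landesarchiv", "Staatsarchiv", "Kreisarchiv", "Stadtarchiv")

# Matches names such as "Stadtarchiv Erfurt": a known type keyword,
# whitespace, then the remainder of the name as the place.
NAME_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ARCHIVE_TYPES) + r")\s+(?P<city>.+)$"
)


@dataclass
class ArchiveRecord:
    name: str
    archive_type: Optional[str]
    city: Optional[str]


def parse_archive_name(raw_name: str) -> ArchiveRecord:
    """Split an archive name into type and place; leave both None if unknown."""
    cleaned = " ".join(raw_name.split())  # collapse stray whitespace
    match = NAME_PATTERN.match(cleaned)
    if match is None:
        return ArchiveRecord(cleaned, None, None)
    return ArchiveRecord(cleaned, match.group("type"), match.group("city"))
```

Names that do not start with a known keyword keep `archive_type` and `city` empty, so they can be reviewed by hand rather than guessed. For a Landesarchiv the captured place may be a state rather than a city.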
Step 2: Custom Scrapers for High-Impact States (3 hours)
- Thüringen - 149 archives
- Baden-Württemberg - 200+ archives
- Bayern - 9 state archives
Step 3: Merge and Deduplicate (1 hour)
- Fuzzy match against existing 20,846 German institutions
- Use 90% similarity threshold (validated by NRW: 80.7% duplicate rate)
- Geocode new cities with Nominatim
- Generate unified German dataset v3.0
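The fuzzy-matching step might look like the sketch below. The plan does not say which similarity measure the NRW merge used, so `difflib.SequenceMatcher` from the standard library stands in here as an assumption; only the 90% threshold comes from the plan. The function names are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.90  # 90% threshold from the plan


def normalize(name: str) -> str:
    """Casefold and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.casefold().split())


def is_duplicate(candidate: str, existing_names: list[str],
                 threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True if the candidate is at least `threshold` similar to any existing name."""
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(existing)).ratio() >= threshold
        for existing in existing_names
    )


def split_new_and_duplicates(harvested: list[str], existing_names: list[str]):
    """Partition harvested names into (new, duplicates)."""
    new, duplicates = [], []
    for name in harvested:
        (duplicates if is_duplicate(name, existing_names) else new).append(name)
    return new, duplicates
```

Comparing every harvested name against all 20,846 existing institutions is quadratic; restricting candidates to the same city first would reduce the work, if city data is reliable enough.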
Expected Timeline
| Task | Time | Cumulative | New Archives |
|---|---|---|---|
| Thüringen | 30 min | 0:30 | +119 |
| Niedersachsen & Bremen | 1 hour | 1:30 | +280 |
| Schleswig-Holstein | 45 min | 2:15 | +120 |
| Hessen | 45 min | 3:00 | +160 |
| Merge & Deduplicate | 30 min | 3:30 | - |
| PHASE 1 COMPLETE | 3.5 hours | - | +679 |
| Baden-Württemberg | 1.5 hours | 5:00 | +200 |
| Bayern | 1 hour | 6:00 | +40 |
| Sachsen | 1 hour | 7:00 | +120 |
| Sachsen-Anhalt | 1 hour | 8:00 | +80 |
| Merge & Deduplicate | 30 min | 8:30 | - |
| PHASE 2 COMPLETE | 8.5 hours | - | +1,119 |
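The cumulative columns in the table can be rechecked with a few lines of Python. The durations and expected counts are copied from the table; `TASKS` and `cumulative` are illustrative names only.

```python
# (task, duration in minutes, expected new archives), in execution order
TASKS = [
    ("Thüringen", 30, 119),
    ("Niedersachsen & Bremen", 60, 280),
    ("Schleswig-Holstein", 45, 120),
    ("Hessen", 45, 160),
    ("Merge & Deduplicate", 30, 0),
    ("Baden-Württemberg", 90, 200),
    ("Bayern", 60, 40),
    ("Sachsen", 60, 120),
    ("Sachsen-Anhalt", 60, 80),
    ("Merge & Deduplicate", 30, 0),
]


def cumulative(tasks):
    """Return (task, elapsed as H:MM, running total of new archives) per row."""
    rows = []
    minutes = archives = 0
    for name, duration, new_archives in tasks:
        minutes += duration
        archives += new_archives
        rows.append((name, f"{minutes // 60}:{minutes % 60:02d}", archives))
    return rows


for row in cumulative(TASKS):
    print(row)
```

The fifth row ends Phase 1 at 3:30 with 679 new archives, and the last row ends Phase 2 at 8:30 with 1,119.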
German Dataset Evolution
| Version | Date | Institutions | Sources | Change |
|---|---|---|---|---|
| v1.0 | 2025-11-19 13:49 | 8,129 | ISIL | - |
| v1.1 | 2025-11-19 18:18 | 20,761 | ISIL + DDB | +12,632 |
| v2.0 | 2025-11-19 21:11 | 20,846 | ISIL + DDB + NRW | +85 |
| v3.0 | 2025-11-20 02:00 | ~21,525 | + Thüringen + Arcinsys (4 states) | +679 ⭐ |
| v4.0 | 2025-11-20 08:00 | ~21,965 | + Regional (4 more states) | +440 |
Output Files to Create
Harvest Outputs (JSON)
- data/isil/germany/thueringen_archives_20251120_*.json (149 archives)
- data/isil/germany/arcinsys_niedersachsen_20251120_*.json (350 archives)
- data/isil/germany/arcinsys_schleswig_holstein_20251120_*.json (150 archives)
- data/isil/germany/arcinsys_hessen_20251120_*.json (200 archives)
- data/isil/germany/baden_wuerttemberg_archives_20251120_*.json (250 archives)
- data/isil/germany/bayern_archives_20251120_*.json (50 archives)
- data/isil/germany/sachsen_archives_20251120_*.json (150 archives)
- data/isil/germany/sachsen_anhalt_archives_20251120_*.json (100 archives)
Merged Outputs
- data/isil/germany/german_institutions_unified_v3_20251120_*.json (after Arcinsys merge)
- data/isil/germany/german_institutions_unified_v4_20251120_*.json (after all regional merges)
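Because the file names above end in a wildcard, a merge script needs some way to pick the right file when several runs exist. A minimal sketch, assuming the wildcard part is a zero-padded timestamp so that lexicographic order equals chronological order; `latest_output` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_output(directory: Path, pattern: str) -> Optional[Path]:
    """Return the lexicographically last file matching the glob pattern, if any."""
    matches = sorted(directory.glob(pattern))
    return matches[-1] if matches else None


# Example call with a pattern from the list above:
# latest_output(Path("data/isil/germany"), "thueringen_archives_20251120_*.json")
```

If the suffix turns out not to sort chronologically, sorting by file modification time would be the safer choice.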
Scripts
- scripts/scrapers/harvest_thueringen_archives.py
- scripts/scrapers/harvest_arcinsys_unified.py (handles all 4 Arcinsys states)
- scripts/scrapers/harvest_baden_wuerttemberg_archives.py
- scripts/scrapers/harvest_bayern_archives.py
- scripts/scrapers/harvest_sachsen_archives.py
- scripts/scrapers/harvest_sachsen_anhalt_archives.py
- scripts/scrapers/merge_regional_to_german_dataset.py (unified merger)
Success Criteria
Phase 1 (3.5 hours)
✅ Thüringen harvested (149 archives)
✅ Arcinsys consortium harvested (700+ archives)
✅ 679 new archives added to German dataset
✅ German dataset v3.0 created (~21,525 institutions)
Phase 2 (8.5 hours)
✅ Baden-Württemberg harvested (250 archives)
✅ Bayern harvested (50 archives)
✅ Sachsen harvested (150 archives)
✅ Sachsen-Anhalt harvested (100 archives)
✅ 1,119 new archives added to German dataset in total (679 in Phase 1 + 440 in Phase 2)
✅ German dataset v4.0 created (~21,965 institutions)
Phase 1 Progress Impact
| Metric | Before | After v3.0 | After v4.0 | Goal |
|---|---|---|---|---|
| German Institutions | 20,846 | 21,525 | 21,965 | ~22,000 |
| Phase 1 Total | 38,479 | 39,158 | 39,598 | 97,000 |
| Progress % | 39.7% | 40.4% | 40.8% | 100% |
Impact: +0.7pp (v3.0) or +1.1pp (v4.0) toward Phase 1 goal
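The percentages in the table follow directly from the totals and the 97,000 goal. A short check, with illustrative constant and function names:

```python
PHASE1_GOAL = 97_000
BASELINE_TOTAL = 38_479   # Phase 1 total before the regional harvests
NEW_V3 = 679              # Thüringen + Arcinsys
NEW_V4 = 440              # four further regional portals


def progress_pct(total: int, goal: int = PHASE1_GOAL) -> float:
    """Progress toward the goal in percent, rounded to one decimal place."""
    return round(100 * total / goal, 1)


before = progress_pct(BASELINE_TOTAL)
after_v3 = progress_pct(BASELINE_TOTAL + NEW_V3)
after_v4 = progress_pct(BASELINE_TOTAL + NEW_V3 + NEW_V4)

print(before, after_v3, after_v4)
print(round(after_v3 - before, 1), round(after_v4 - before, 1))
```

Both deltas are measured against the 39.7% baseline, which is why v4.0 shows +1.1pp rather than an additional +1.1pp on top of v3.0.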
Key Technical Notes
Reuse NRW Pattern
The NRW harvest worked well with the following techniques:
- Fast text extraction (no clicking)
- Regex city parsing (German archive name patterns)
- Fuzzy deduplication (90% threshold)
- Nominatim geocoding (1 req/sec)
Apply same approach to all regional portals.
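For the geocoding step, the 1 req/sec limit can be enforced with a small throttle in front of each Nominatim call. This is a sketch using only the standard library, not the NRW code: the `Throttle` class, the function names, and the choice of the public `nominatim.openstreetmap.org` instance are assumptions. The caller supplies the `User-Agent` string that identifies the harvester.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"
MIN_INTERVAL_SECONDS = 1.0  # 1 req/sec, as in the plan


class Throttle:
    """Blocks so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_search_url(city: str) -> str:
    """Search URL restricted to Germany, returning at most one JSON result."""
    query = urllib.parse.urlencode(
        {"q": city, "countrycodes": "de", "format": "jsonv2", "limit": 1}
    )
    return f"{NOMINATIM_SEARCH}?{query}"


def geocode(city: str, throttle: Throttle, user_agent: str):
    """Return (lat, lon) for the city, or None if Nominatim finds nothing."""
    throttle.wait()
    request = urllib.request.Request(
        build_search_url(city), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Caching results per city name avoids repeated lookups, since many archives share a city.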
Arcinsys Advantage
All 4 Arcinsys states share:
- Same portal structure
- Same archive directory format
- Same HTML/CSS patterns
Build ONE scraper, deploy to 4 states → 700+ archives in 3 hours
Next Agent Instructions
START WITH THÜRINGEN (30 minutes, 119 new archives)
- Navigate to: https://www.archive-in-thueringen.de/en/
- Find archive directory (likely under "Archives" or "Institutions" menu)
- Extract 149 archives listed
- Parse names, cities, types
- Geocode and merge
Then move to Arcinsys consortium (3 hours, 560 new archives)
Ready to Execute: YES
Expected Total Time: 3.5 hours (Phase 1) or 8.5 hours cumulative (through Phase 2)
Expected New Archives: +679 (Phase 1) or +1,119 cumulative (through Phase 2)
Priority Level: HIGH ⭐
Generated: 2025-11-19 22:35 UTC
Reference: GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md