glam/QUICK_ACTION_PLAN_GERMAN_REGIONAL_HARVESTS.md
2025-11-19 23:25:22 +01:00

Quick Action Plan - German Regional Archive Harvests

Context: Discovered 12+ regional archive portals after NRW success
Current German Dataset: 20,846 institutions (ISIL + DDB + NRW)
Expected Growth: +1,280 new archives → ~22,100 total


Immediate Priorities (Next 4 Hours)

Priority 1: Thüringen (30 min) START HERE

Why: 149 archives CONFIRMED, simple portal structure
URL: https://www.archive-in-thueringen.de/
Expected: 119 new archives (after 20% duplicates)
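
The expected figure follows from the confirmed count and the assumed duplicate rate. A minimal sketch of that arithmetic (the function name is illustrative, not part of the existing scripts):

```python
def expected_new(confirmed: int, duplicate_rate: float) -> int:
    """Estimate how many harvested archives are not yet in the dataset."""
    return round(confirmed * (1 - duplicate_rate))


# 149 confirmed archives with an assumed 20% overlap with the existing dataset
print(expected_new(149, 0.20))
```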

Steps:

  1. Inspect portal structure (archive directory page)
  2. Build Playwright scraper (similar to NRW)
  3. Extract archive names, cities, types
  4. Geocode with Nominatim
  5. Merge with German dataset

Script to Create: scripts/scrapers/harvest_thueringen_archives.py
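
Step 3 depends on pulling a city out of each archive name. The sketch below shows one way to do that with a regular expression. The prefixes and the "der Stadt" / "des Landkreises" variants are assumptions for illustration; the patterns actually used in the NRW harvest are not reproduced in this plan.

```python
import re
from typing import Optional

# Assumed name patterns such as "Stadtarchiv Erfurt" or
# "Kreisarchiv des Landkreises Gotha"; extend after inspecting the portal.
CITY_PATTERN = re.compile(
    r"^(?:Stadtarchiv|Kreisarchiv|Gemeindearchiv|Universitätsarchiv)\s+"
    r"(?:der\s+Stadt\s+|des\s+Landkreises\s+)?"
    r"(?P<city>[A-ZÄÖÜ][\w.\- ]+)$"
)


def parse_city(name: str) -> Optional[str]:
    """Return the city embedded in an archive name, or None if no pattern fits."""
    match = CITY_PATTERN.match(name.strip())
    return match.group("city").strip() if match else None


for name in ["Stadtarchiv Erfurt", "Kreisarchiv des Landkreises Gotha"]:
    print(name, "->", parse_city(name))
```

Names that match no pattern return None, so they can be queued for manual review instead of being geocoded with a wrong city.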


Priority 2: Niedersachsen & Bremen (1 hour)

Why: Arcinsys platform, 300+ archives expected
URL: https://arcinsys.niedersachsen.de/
Expected: 280 new archives

Steps:

  1. Navigate to archive directory: https://www.arcinsys.de/archive/archive_niedersachsen_bremen.html
  2. Extract all participating archives (Landesarchiv, Kreisarchive, Stadtarchive, etc.)
  3. Parse institution names and locations
  4. Geocode and merge

Script to Create: scripts/scrapers/harvest_arcinsys_niedersachsen.py
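
Steps 2 and 3 could be sketched with the standard-library HTML parser, as below. The markup is hypothetical: the sketch assumes each participating archive appears as link text containing "archiv", which must be checked against the real directory page before use.

```python
from html.parser import HTMLParser


class ArchiveLinkParser(HTMLParser):
    """Collect the text of <a> elements whose text mentions 'archiv'."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace inside the link text
            text = " ".join("".join(self._buffer).split())
            if "archiv" in text.lower():
                self.names.append(text)
            self._in_link = False


def extract_archive_names(html: str) -> list[str]:
    parser = ArchiveLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Hypothetical sample markup, not copied from the portal
sample = (
    '<ul><li><a href="/a/1">Niedersächsisches Landesarchiv</a></li>'
    '<li><a href="/a/2">Stadtarchiv  Göttingen</a></li>'
    '<li><a href="/impressum">Impressum</a></li></ul>'
)
print(extract_archive_names(sample))
```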


Priority 3: Schleswig-Holstein (45 min)

Why: Same Arcinsys platform as Niedersachsen
URL: https://arcinsys.schleswig-holstein.de/
Expected: 120 new archives

Steps:

  1. Reuse Arcinsys scraper logic from Niedersachsen
  2. Adapt for Schleswig-Holstein URLs
  3. Extract and merge

Script to Create: scripts/scrapers/harvest_arcinsys_schleswig_holstein.py


Priority 4: Hessen (45 min)

Why: Arcinsys platform (Hessen is the original developer)
URL: https://arcinsys.hessen.de/
Expected: 160 new archives

Steps:

  1. Reuse Arcinsys scraper
  2. Extract Hessen archives
  3. Merge

Script to Create: scripts/scrapers/harvest_arcinsys_hessen.py


Medium Priority (Next 8 Hours)

Priority 5: Baden-Württemberg (1.5 hours)

URL: https://www.landesarchiv-bw.de/
Expected: 200 new archives

Priority 6: Bayern (1 hour)

URL: https://www.gda.bayern.de/archive
Expected: 40 new archives (9 state + municipal)

Priority 7: Sachsen (1 hour)

URL: https://www.staatsarchiv.sachsen.de/
Expected: 120 new archives

Priority 8: Sachsen-Anhalt (1 hour)

URL: https://landesarchiv.sachsen-anhalt.de/
Expected: 80 new archives


Harvest Strategy

Step 1: Build Arcinsys Master Scraper (2 hours)

Target: Niedersachsen, Bremen, Schleswig-Holstein, Hessen (4 states)
Expected: 560 new archives

Logic:

# Arcinsys shared structure (the Niedersachsen portal also covers Bremen)
base_urls = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/"
}

# Archive directory pattern (consistent across Arcinsys)
# /archive/archive_{state}.html or similar

# Extract:
# - Archive name
# - Archive type (Landesarchiv, Kreisarchiv, Stadtarchiv, etc.)
# - City/location
# - Contact info

Step 2: Custom Scrapers for High-Impact States (3 hours)

  1. Thüringen - 149 archives
  2. Baden-Württemberg - 200+ archives
  3. Bayern - 9 state archives

Step 3: Merge and Deduplicate (1 hour)

  • Fuzzy match against existing 20,846 German institutions
  • Use 90% similarity threshold (validated by NRW: 80.7% duplicate rate)
  • Geocode new cities with Nominatim
  • Generate unified German dataset v3.0
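
The deduplication step can be sketched as follows. The plan does not say which fuzzy-matching library the NRW merge used, so the standard-library difflib.SequenceMatcher stands in here as an assumption; the function names are illustrative.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def split_new_and_duplicates(candidates, existing, threshold=0.90):
    """Split harvested names into new ones and likely duplicates of existing ones."""
    new, duplicates = [], []
    for candidate in candidates:
        best = max((similarity(candidate, name) for name in existing), default=0.0)
        (duplicates if best >= threshold else new).append(candidate)
    return new, duplicates


existing = ["Stadtarchiv Erfurt", "Kreisarchiv Gotha"]
harvested = ["Stadtarchiv  Erfurt", "Stadtarchiv Weimar"]
print(split_new_and_duplicates(harvested, existing))
```

Comparing every candidate against all 20,846 existing names is quadratic, which should be acceptable for a few hundred candidates per state; matching only within the same city would cut the work further.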

Expected Timeline

| Task | Time | Cumulative | New Archives |
|---|---|---|---|
| Thüringen | 30 min | 0:30 | +119 |
| Niedersachsen & Bremen | 1 hour | 1:30 | +280 |
| Schleswig-Holstein | 45 min | 2:15 | +120 |
| Hessen | 45 min | 3:00 | +160 |
| Merge & Deduplicate | 30 min | 3:30 | - |
| PHASE 1 COMPLETE | 3.5 hours | - | +679 |
| Baden-Württemberg | 1.5 hours | 5:00 | +200 |
| Bayern | 1 hour | 6:00 | +40 |
| Sachsen | 1 hour | 7:00 | +120 |
| Sachsen-Anhalt | 1 hour | 8:00 | +80 |
| Merge & Deduplicate | 30 min | 8:30 | - |
| PHASE 2 COMPLETE | 8.5 hours | - | +1,119 |

German Dataset Evolution

| Version | Date | Institutions | Sources | Change |
|---|---|---|---|---|
| v1.0 | 2025-11-19 13:49 | 8,129 | ISIL | - |
| v1.1 | 2025-11-19 18:18 | 20,761 | ISIL + DDB | +12,632 |
| v2.0 | 2025-11-19 21:11 | 20,846 | ISIL + DDB + NRW | +85 |
| v3.0 | 2025-11-20 02:00 | ~21,525 | + Thüringen + Arcinsys (4 states) | +679 |
| v4.0 | 2025-11-20 08:00 | ~21,965 | + Regional (4 more states) | +440 |

Output Files to Create

Harvest Outputs (JSON)

  1. data/isil/germany/thueringen_archives_20251120_*.json (149 archives)
  2. data/isil/germany/arcinsys_niedersachsen_20251120_*.json (350 archives)
  3. data/isil/germany/arcinsys_schleswig_holstein_20251120_*.json (150 archives)
  4. data/isil/germany/arcinsys_hessen_20251120_*.json (200 archives)
  5. data/isil/germany/baden_wuerttemberg_archives_20251120_*.json (250 archives)
  6. data/isil/germany/bayern_archives_20251120_*.json (50 archives)
  7. data/isil/germany/sachsen_archives_20251120_*.json (150 archives)
  8. data/isil/germany/sachsen_anhalt_archives_20251120_*.json (100 archives)
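
A harvest writer producing files that match these patterns might look like the sketch below. The wildcard part of each filename is not specified in this plan; the sketch assumes it is an HHMMSS time suffix, and the function name is hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path


def write_harvest(records, stem, out_dir, now=None):
    """Write records as UTF-8 JSON to <out_dir>/<stem>_<YYYYMMDD>_<HHMMSS>.json."""
    now = now or datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed suffix format: date, underscore, time of day
    path = out_dir / f"{stem}_{now:%Y%m%d_%H%M%S}.json"
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps umlauts readable in the output file
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return path
```

A call such as write_harvest(records, "thueringen_archives", "data/isil/germany") would then produce a file matching the first pattern in the list.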

Merged Outputs

  1. data/isil/germany/german_institutions_unified_v3_20251120_*.json (after Thüringen and Arcinsys merge)
  2. data/isil/germany/german_institutions_unified_v4_20251120_*.json (after all regional merges)

Scripts

  1. scripts/scrapers/harvest_thueringen_archives.py
  2. scripts/scrapers/harvest_arcinsys_unified.py (handles all 4 Arcinsys states)
  3. scripts/scrapers/harvest_baden_wuerttemberg_archives.py
  4. scripts/scrapers/harvest_bayern_archives.py
  5. scripts/scrapers/harvest_sachsen_archives.py
  6. scripts/scrapers/harvest_sachsen_anhalt_archives.py
  7. scripts/scrapers/merge_regional_to_german_dataset.py (unified merger)

Success Criteria

Phase 1 (3.5 hours)

  • Thüringen harvested (149 archives)
  • Arcinsys consortium harvested (700+ archives)
  • 679 new archives added to German dataset
  • German dataset v3.0 created (~21,525 institutions)

Phase 2 (8.5 hours cumulative)

  • Baden-Württemberg harvested (250 archives)
  • Bayern harvested (50 archives)
  • Sachsen harvested (150 archives)
  • Sachsen-Anhalt harvested (100 archives)
  • 1,119 new archives added to German dataset
  • German dataset v4.0 created (~21,965 institutions)


Phase 1 Progress Impact

| Metric | Before | After v3.0 | After v4.0 | Goal |
|---|---|---|---|---|
| German Institutions | 20,846 | 21,525 | 21,965 | ~22,000 |
| Phase 1 Total | 38,479 | 39,158 | 39,598 | 97,000 |
| Progress % | 39.7% | 40.4% | 40.8% | 100% |

Impact: +0.7pp (v3.0) or +1.1pp (v4.0) toward Phase 1 goal
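
The percentages can be reproduced from the totals in the table. A short sketch of the arithmetic (constant and function names are illustrative):

```python
GOAL = 97_000            # Phase 1 goal
BASELINE_TOTAL = 38_479  # Phase 1 total before the regional harvests


def progress_pct(total: int, goal: int = GOAL) -> float:
    """Share of the Phase 1 goal reached, in percent, rounded to one decimal."""
    return round(100 * total / goal, 1)


for label, added in [("before", 0), ("v3.0", 679), ("v4.0", 1_119)]:
    total = BASELINE_TOTAL + added
    print(f"{label}: {total:,} institutions, {progress_pct(total)}%")
```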


Key Technical Notes

Reuse NRW Pattern

The NRW harvest showed excellent results:

  • Fast text extraction (no clicking)
  • Regex city parsing (German archive name patterns)
  • Fuzzy deduplication (90% threshold)
  • Nominatim geocoding (1 req/sec)

Apply same approach to all regional portals.
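
A rate-limited geocoding loop in this style could look like the sketch below. It calls the public Nominatim search endpoint through urllib; the User-Agent string is a placeholder to replace with real contact details, and the fetch and sleep parameters are injectable so the loop can be exercised without network access. It is not the code used in the NRW harvest.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
USER_AGENT = "glam-harvest-example/0.1 (contact: you@example.org)"  # placeholder


def fetch_json(url):
    """Fetch a URL and decode the JSON body, identifying the client."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def geocode_cities(cities, fetch=fetch_json, sleep=time.sleep, min_interval=1.0):
    """Geocode each distinct city once, pausing min_interval seconds between calls."""
    results = {}
    for index, city in enumerate(dict.fromkeys(cities)):  # dedupe, keep order
        if index:
            sleep(min_interval)  # stay at or below 1 request per second
        query = urllib.parse.urlencode(
            {"q": f"{city}, Germany", "format": "json", "limit": 1}
        )
        hits = fetch(f"{NOMINATIM_URL}?{query}")
        results[city] = (float(hits[0]["lat"]), float(hits[0]["lon"])) if hits else None
    return results
```

Geocoding each distinct city only once matters here: several archives often share a city, and every avoided request saves a full second.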

Arcinsys Advantage

All 4 Arcinsys states share:

  • Same portal structure
  • Same archive directory format
  • Same HTML/CSS patterns

Build ONE scraper, deploy to 4 states → 700+ archives harvested (about 560 new after deduplication) in roughly 3 hours
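
One way to structure the shared scraper is a single loop over the portal URLs with the fetching and parsing steps passed in. This is a sketch under assumptions: the function name is hypothetical, and because the exact directory path under each base URL is not confirmed, resolving it is left to the caller-supplied fetch function.

```python
from typing import Callable, Dict, List, Tuple

# Base URLs as listed in this plan; the Niedersachsen portal also covers Bremen
ARCINSYS_BASE_URLS: Dict[str, str] = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/",
}


def harvest_arcinsys(
    fetch_directory: Callable[[str], str],
    parse_names: Callable[[str], List[str]],
    base_urls: Dict[str, str] = ARCINSYS_BASE_URLS,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Run the same fetch and parse steps per portal, recording failures by state."""
    harvested: Dict[str, List[str]] = {}
    failures: Dict[str, str] = {}
    for state, base_url in base_urls.items():
        try:
            harvested[state] = parse_names(fetch_directory(base_url))
        except Exception as error:  # one failing portal should not block the rest
            failures[state] = str(error)
    return harvested, failures
```

Returning failures separately lets a rerun target only the portals that did not respond.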


Next Agent Instructions

START WITH THÜRINGEN (30 minutes, 119 new archives)

  1. Navigate to: https://www.archive-in-thueringen.de/en/
  2. Find archive directory (likely under "Archives" or "Institutions" menu)
  3. Extract 149 archives listed
  4. Parse names, cities, types
  5. Geocode and merge

Then move to Arcinsys consortium (3 hours, 560 new archives)


Ready to Execute: YES
Expected Total Time: 3.5 hours (Phase 1 only) or 8.5 hours (through Phase 2)
Expected New Archives: +679 (Phase 1 only) or +1,119 (through Phase 2)
Priority Level: HIGH


Generated: 2025-11-19 22:35 UTC
Reference: GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md