glam/QUICK_ACTION_PLAN_GERMAN_REGIONAL_HARVESTS.md
2025-11-19 23:25:22 +01:00


# Quick Action Plan - German Regional Archive Harvests
**Context**: Discovered 12+ regional archive portals after NRW success
**Current German Dataset**: 20,846 institutions (ISIL + DDB + NRW)
**Expected Growth**: +1,119 new archives (after dedup) → ~21,965 total

---
## Immediate Priorities (Next 4 Hours)
### Priority 1: Thüringen (30 min) ⭐ **START HERE**
**Why**: 149 archives CONFIRMED, simple portal structure
**URL**: https://www.archive-in-thueringen.de/
**Expected**: 119 new archives (after 20% duplicates)
**Steps**:
1. Inspect portal structure (archive directory page)
2. Build Playwright scraper (similar to NRW)
3. Extract archive names, cities, types
4. Geocode with Nominatim
5. Merge with German dataset
**Script to Create**: `scripts/scrapers/harvest_thueringen_archives.py`
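The Thüringen portal's markup is unknown until step 1 is done, so the extraction step can only be sketched. Assuming (hypothetically) the directory lists each archive as a link plus a city span, parsing the Playwright-rendered HTML might look like this — the selectors and `SAMPLE_HTML` are placeholders, not the portal's real structure:

```python
import re

# Hypothetical markup -- the real structure must come from inspecting
# archive-in-thueringen.de first (step 1 above). Playwright supplies the
# rendered HTML; extraction is then plain text work, as in the NRW harvest.
SAMPLE_HTML = """
<li class="archiv"><a href="/archiv/1">Stadtarchiv Erfurt</a>
  <span class="ort">Erfurt</span></li>
<li class="archiv"><a href="/archiv/2">Kreisarchiv Gotha</a>
  <span class="ort">Gotha</span></li>
"""

ENTRY_RE = re.compile(
    r'<li class="archiv"><a href="[^"]*">(?P<name>[^<]+)</a>\s*'
    r'<span class="ort">(?P<city>[^<]+)</span>'
)

def parse_archives(html: str) -> list[dict]:
    """Extract archive name and city from a directory page."""
    return [m.groupdict() for m in ENTRY_RE.finditer(html)]
```

Once the real directory page is inspected, only `ENTRY_RE` should need changing.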
---
### Priority 2: Niedersachsen & Bremen (1 hour) ⭐
**Why**: Arcinsys platform, 300+ archives expected
**URL**: https://arcinsys.niedersachsen.de/
**Expected**: 280 new archives
**Steps**:
1. Navigate to archive directory: https://www.arcinsys.de/archive/archive_niedersachsen_bremen.html
2. Extract all participating archives (Landesarchiv, Kreisarchive, Stadtarchive, etc.)
3. Parse institution names and locations
4. Geocode and merge
**Script to Create**: `scripts/scrapers/harvest_arcinsys_niedersachsen.py`
---
### Priority 3: Schleswig-Holstein (45 min)
**Why**: Same Arcinsys platform as Niedersachsen
**URL**: https://arcinsys.schleswig-holstein.de/
**Expected**: 120 new archives
**Steps**:
1. Reuse Arcinsys scraper logic from Niedersachsen
2. Adapt for Schleswig-Holstein URLs
3. Extract and merge
**Script to Create**: `scripts/scrapers/harvest_arcinsys_schleswig_holstein.py`
---
### Priority 4: Hessen (45 min)
**Why**: Arcinsys platform (original developer)
**URL**: https://arcinsys.hessen.de/
**Expected**: 160 new archives
**Steps**:
1. Reuse Arcinsys scraper
2. Extract Hessen archives
3. Merge
**Script to Create**: `scripts/scrapers/harvest_arcinsys_hessen.py`
---
## Medium Priority (Next 8 Hours)
### Priority 5: Baden-Württemberg (1.5 hours)
**URL**: https://www.landesarchiv-bw.de/
**Expected**: 200 new archives
### Priority 6: Bayern (1 hour)
**URL**: https://www.gda.bayern.de/archive
**Expected**: 40 new archives (9 state + municipal)
### Priority 7: Sachsen (1 hour)
**URL**: https://www.staatsarchiv.sachsen.de/
**Expected**: 120 new archives
### Priority 8: Sachsen-Anhalt (1 hour)
**URL**: https://landesarchiv.sachsen-anhalt.de/
**Expected**: 80 new archives
---
## Harvest Strategy
### Step 1: Build Arcinsys Master Scraper (2 hours)
**Target**: Niedersachsen, Bremen, Schleswig-Holstein, Hessen (4 states)
**Expected**: 560 new archives
**Logic**:
```python
# Arcinsys shared structure; Bremen is served via the Niedersachsen instance.
base_urls = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/",
}

# Archive directory pattern (consistent across Arcinsys):
#   /archive/archive_{state}.html or similar
#
# Fields to extract per archive:
#   - archive name
#   - archive type (Landesarchiv, Kreisarchiv, Stadtarchiv, etc.)
#   - city/location
#   - contact info
```
---
### Step 2: Custom Scrapers for High-Impact States (3 hours)
1. **Thüringen** - 149 archives
2. **Baden-Württemberg** - 200+ archives
3. **Bayern** - 9 state archives
---
### Step 3: Merge and Deduplicate (1 hour)
- Fuzzy match against existing 20,846 German institutions
- Use a 90% similarity threshold (validated by the NRW harvest's 80.7% duplicate rate)
- Geocode new cities with Nominatim
- Generate unified German dataset v3.0
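The fuzzy-match step can be sketched with the stdlib's `difflib` (the actual pipeline may use a faster library such as rapidfuzz); `0.90` mirrors the threshold above, and the sample names are illustrative only:

```python
from difflib import SequenceMatcher

def is_duplicate(name: str, existing: list[str], threshold: float = 0.90) -> bool:
    """True if `name` fuzzy-matches any existing institution name."""
    norm = name.casefold().strip()
    return any(
        SequenceMatcher(None, norm, e.casefold().strip()).ratio() >= threshold
        for e in existing
    )

# Illustrative data: spelling variants should collapse, new archives survive.
existing = ["Stadtarchiv Münster", "Landesarchiv NRW Abteilung Rheinland"]
new = ["Stadtarchiv Muenster", "Kreisarchiv Warendorf"]
fresh = [n for n in new if not is_duplicate(n, existing)]
```

At 20,846 existing institutions a linear scan per candidate is slow but workable for one-off merges; blocking by city first would cut the comparisons substantially.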
---
## Expected Timeline
| Task | Time | Cumulative | New Archives |
|------|------|------------|--------------|
| **Thüringen** | 30 min | 0:30 | +119 |
| **Niedersachsen & Bremen** | 1 hour | 1:30 | +280 |
| **Schleswig-Holstein** | 45 min | 2:15 | +120 |
| **Hessen** | 45 min | 3:00 | +160 |
| **Merge & Deduplicate** | 30 min | 3:30 | - |
| **PHASE 1 COMPLETE** | **3.5 hours** | - | **+679** |
| | | | |
| **Baden-Württemberg** | 1.5 hours | 5:00 | +200 |
| **Bayern** | 1 hour | 6:00 | +40 |
| **Sachsen** | 1 hour | 7:00 | +120 |
| **Sachsen-Anhalt** | 1 hour | 8:00 | +80 |
| **Merge & Deduplicate** | 30 min | 8:30 | - |
| **PHASE 2 COMPLETE** | **8.5 hours** | - | **+1,119** |
---
## German Dataset Evolution
| Version | Date | Institutions | Sources | Change |
|---------|------|--------------|---------|--------|
| v1.0 | 2025-11-19 13:49 | 8,129 | ISIL | - |
| v1.1 | 2025-11-19 18:18 | 20,761 | ISIL + DDB | +12,632 |
| v2.0 | 2025-11-19 21:11 | 20,846 | ISIL + DDB + NRW | +85 |
| **v3.0** | **2025-11-20 02:00** | **~21,525** | **+ Arcinsys (4 states)** | **+679** ⭐ |
| **v4.0** | **2025-11-20 08:00** | **~21,965** | **+ Regional (4 more states)** | **+440** |
---
## Output Files to Create
### Harvest Outputs (JSON)
1. `data/isil/germany/thueringen_archives_20251120_*.json` (149 archives)
2. `data/isil/germany/arcinsys_niedersachsen_20251120_*.json` (350 archives)
3. `data/isil/germany/arcinsys_schleswig_holstein_20251120_*.json` (150 archives)
4. `data/isil/germany/arcinsys_hessen_20251120_*.json` (200 archives)
5. `data/isil/germany/baden_wuerttemberg_archives_20251120_*.json` (250 archives)
6. `data/isil/germany/bayern_archives_20251120_*.json` (50 archives)
7. `data/isil/germany/sachsen_archives_20251120_*.json` (150 archives)
8. `data/isil/germany/sachsen_anhalt_archives_20251120_*.json` (100 archives)
### Merged Outputs
9. `data/isil/germany/german_institutions_unified_v3_20251120_*.json` (after Arcinsys merge)
10. `data/isil/germany/german_institutions_unified_v4_20251120_*.json` (after all regional merges)
### Scripts
11. `scripts/scrapers/harvest_thueringen_archives.py`
12. `scripts/scrapers/harvest_arcinsys_unified.py` (handles all 4 Arcinsys states)
13. `scripts/scrapers/harvest_baden_wuerttemberg_archives.py`
14. `scripts/scrapers/harvest_bayern_archives.py`
15. `scripts/scrapers/harvest_sachsen_archives.py`
16. `scripts/scrapers/harvest_sachsen_anhalt_archives.py`
17. `scripts/scrapers/merge_regional_to_german_dataset.py` (unified merger)
---
## Success Criteria
### Phase 1 (3.5 hours)
✅ Thüringen harvested (149 archives)
✅ Arcinsys consortium harvested (700+ raw entries, ~560 new after dedup)
✅ 679 new archives added to German dataset
✅ German dataset v3.0 created (~21,525 institutions)
### Phase 2 (8.5 hours)
✅ Baden-Württemberg harvested (250 archives)
✅ Bayern harvested (50 archives)
✅ Sachsen harvested (150 archives)
✅ Sachsen-Anhalt harvested (100 archives)
✅ 1,119 new archives added to German dataset
✅ German dataset v4.0 created (~21,965 institutions)
---
## Phase 1 Progress Impact
| Metric | Before | After v3.0 | After v4.0 | Goal |
|--------|--------|------------|------------|------|
| **German Institutions** | 20,846 | 21,525 | 21,965 | ~22,000 |
| **Phase 1 Total** | 38,479 | 39,158 | 39,598 | 97,000 |
| **Progress %** | 39.7% | 40.4% | 40.8% | 100% |
**Impact**: +0.7pp (v3.0) or +1.1pp (v4.0) toward Phase 1 goal
---
## Key Technical Notes
### Reuse NRW Pattern
The NRW harvest validated these techniques:
- **Fast text extraction** (no clicking)
- **Regex city parsing** (German archive name patterns)
- **Fuzzy deduplication** (90% threshold)
- **Nominatim geocoding** (1 req/sec)
**Apply same approach** to all regional portals.
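The geocoding step above, respecting Nominatim's 1 request/second usage policy, can be sketched as follows. The network call itself is omitted (`urllib.request.urlopen` on the built URL would complete it); the query parameters are standard Nominatim search parameters:

```python
import time
import urllib.parse

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
MIN_INTERVAL = 1.0  # Nominatim usage policy: at most 1 request per second
_last_request = 0.0

def build_query_url(city: str) -> str:
    """Build a Nominatim search URL restricted to Germany."""
    params = {"q": city, "countrycodes": "de", "format": "json", "limit": 1}
    return f"{NOMINATIM_URL}?{urllib.parse.urlencode(params)}"

def throttle() -> None:
    """Block until at least MIN_INTERVAL has passed since the last call."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
```

Calling `throttle()` before each request keeps the harvest inside the rate limit regardless of how fast parsing runs; caching already-geocoded cities avoids repeat lookups entirely.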
### Arcinsys Advantage
All 4 Arcinsys states share:
- Same portal structure
- Same archive directory format
- Same HTML/CSS patterns
**Build ONE scraper**, deploy to 4 states → 700+ archives in 3 hours
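The "one scraper, four states" deployment can be driven by a single loop, since the Arcinsys hosts follow the same subdomain pattern. `harvest_state` here is a hypothetical stub standing in for the shared fetch-and-parse logic:

```python
# Arcinsys hosts share a subdomain pattern; Bremen rides on Niedersachsen.
ARCINSYS_STATES = ["niedersachsen", "schleswig-holstein", "hessen"]

def harvest_state(state: str) -> list[dict]:
    """Stub driver: a real implementation would fetch and parse the
    archive directory at the state's Arcinsys host."""
    base_url = f"https://arcinsys.{state}.de/"
    # Fetching/parsing omitted; return a placeholder record per state.
    return [{"state": state, "source": base_url}]

all_archives = [rec for s in ARCINSYS_STATES for rec in harvest_state(s)]
```

Swapping the stub body for the shared parser is the only per-deployment change, which is what makes the 3-hour estimate for four states plausible.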
---
## Next Agent Instructions
**START WITH THÜRINGEN** (30 minutes, 119 new archives)
1. Navigate to: https://www.archive-in-thueringen.de/en/
2. Find archive directory (likely under "Archives" or "Institutions" menu)
3. Extract 149 archives listed
4. Parse names, cities, types
5. Geocode and merge
**Then move to Arcinsys consortium** (3 hours, 560 new archives)
---
**Ready to Execute**: YES
**Expected Total Time**: 3.5 hours (Phase 1) or 8.5 hours (Phase 2)
**Expected New Archives**: +679 (Phase 1) or +1,119 (Phase 2)
**Priority Level**: HIGH ⭐
---
**Generated**: 2025-11-19 22:35 UTC
**Reference**: GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md