# Quick Action Plan - German Regional Archive Harvests

**Context**: Discovered 12+ regional archive portals after NRW success
**Current German Dataset**: 20,846 institutions (ISIL + DDB + NRW)
**Expected Growth**: +1,280 new archives → ~22,100 total

---

## Immediate Priorities (Next 4 Hours)

### Priority 1: Thüringen (30 min) ⭐ **START HERE**

**Why**: 149 archives CONFIRMED, simple portal structure
**URL**: https://www.archive-in-thueringen.de/
**Expected**: 119 new archives (149 listed, assuming a 20% duplicate rate)

**Steps**:
1. Inspect portal structure (archive directory page)
2. Build Playwright scraper (similar to NRW)
3. Extract archive names, cities, types
4. Geocode with Nominatim
5. Merge with German dataset

**Script to Create**: `scripts/scrapers/harvest_thueringen_archives.py`

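
Step 3 above (extract archive names, cities, types) could be sketched roughly as follows. This is a minimal sketch, assuming directory entries arrive as plain name strings such as "Stadtarchiv Erfurt"; `ArchiveRecord`, `ARCHIVE_TYPES` and `parse_archive_name` are hypothetical names introduced for illustration and are not taken from the existing NRW scraper. Names that do not start with a known type prefix are kept with empty type and city so they can be reviewed by hand.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of archive-type prefixes; extend it after inspecting the portal.
ARCHIVE_TYPES = (
    "Landesarchiv", "Hauptstaatsarchiv", "Staatsarchiv", "Kreisarchiv",
    "Stadtarchiv", "Gemeindearchiv", "Universitätsarchiv",
)

# "<type> <place>" pattern, e.g. "Stadtarchiv Erfurt".
_NAME_RE = re.compile(r"^(?P<type>" + "|".join(ARCHIVE_TYPES) + r")\s+(?P<city>.+)$")


@dataclass
class ArchiveRecord:
    name: str
    archive_type: Optional[str]
    city: Optional[str]


def parse_archive_name(name: str) -> ArchiveRecord:
    """Split an archive name into type and city when it follows '<type> <place>'."""
    cleaned = " ".join(name.split())  # collapse stray whitespace from scraping
    match = _NAME_RE.match(cleaned)
    if match is None:
        # Unknown pattern: keep the name, leave type and city for manual review.
        return ArchiveRecord(cleaned, None, None)
    return ArchiveRecord(cleaned, match.group("type"), match.group("city"))
```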

---

### Priority 2: Niedersachsen & Bremen (1 hour) ⭐

**Why**: Arcinsys platform, 300+ archives expected
**URL**: https://arcinsys.niedersachsen.de/
**Expected**: 280 new archives

**Steps**:
1. Navigate to archive directory: https://www.arcinsys.de/archive/archive_niedersachsen_bremen.html
2. Extract all participating archives (Landesarchiv, Kreisarchive, Stadtarchive, etc.)
3. Parse institution names and locations
4. Geocode and merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_niedersachsen.py`

---

### Priority 3: Schleswig-Holstein (45 min)

**Why**: Same Arcinsys platform as Niedersachsen
**URL**: https://arcinsys.schleswig-holstein.de/
**Expected**: 120 new archives

**Steps**:
1. Reuse Arcinsys scraper logic from Niedersachsen
2. Adapt for Schleswig-Holstein URLs
3. Extract and merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_schleswig_holstein.py`

---

### Priority 4: Hessen (45 min)

**Why**: Arcinsys platform (Hessen is its original developer)
**URL**: https://arcinsys.hessen.de/
**Expected**: 160 new archives

**Steps**:
1. Reuse Arcinsys scraper
2. Extract Hessen archives
3. Merge

**Script to Create**: `scripts/scrapers/harvest_arcinsys_hessen.py`

---

## Medium Priority (Next 8 Hours)

### Priority 5: Baden-Württemberg (1.5 hours)

**URL**: https://www.landesarchiv-bw.de/
**Expected**: 200 new archives

### Priority 6: Bayern (1 hour)

**URL**: https://www.gda.bayern.de/archive
**Expected**: 40 new archives (9 state + municipal)

### Priority 7: Sachsen (1 hour)

**URL**: https://www.staatsarchiv.sachsen.de/
**Expected**: 120 new archives

### Priority 8: Sachsen-Anhalt (1 hour)

**URL**: https://landesarchiv.sachsen-anhalt.de/
**Expected**: 80 new archives

---

## Harvest Strategy

### Step 1: Build Arcinsys Master Scraper (2 hours)

**Target**: Niedersachsen, Bremen, Schleswig-Holstein, Hessen (4 states)
**Expected**: 560 new archives

**Logic**:
```python
# Arcinsys shared structure
# (the Niedersachsen portal also covers Bremen, hence 3 URLs for 4 states)
base_urls = {
    "niedersachsen": "https://arcinsys.niedersachsen.de/",
    "schleswig-holstein": "https://arcinsys.schleswig-holstein.de/",
    "hessen": "https://arcinsys.hessen.de/",
}

# Archive directory pattern (consistent across Arcinsys)
# /archive/archive_{state}.html or similar

# Extract:
# - Archive name
# - Archive type (Landesarchiv, Kreisarchiv, Stadtarchiv, etc.)
# - City/location
# - Contact info
```

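
The extraction step could start from a state-agnostic link collector, so the same function serves every Arcinsys portal. This is a minimal sketch using only the standard library; it assumes (not yet verified against the live portals) that each participating archive appears as the text of an `<a>` element on the directory page. `LinkCollector`, `extract_archive_links` and the `keyword` filter are hypothetical names for illustration. If the pages turn out to be rendered by JavaScript, the HTML string would have to come from a Playwright page instead of a plain download.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (text, href) tuples
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalise whitespace
            if text:
                self.links.append((text, self._href))
            self._href = None


def extract_archive_links(html, keyword="archiv"):
    """Return links whose text mentions `keyword` (assumed to mark archive entries)."""
    parser = LinkCollector()
    parser.feed(html)
    return [(text, href) for text, href in parser.links if keyword in text.lower()]
```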

---

### Step 2: Custom Scrapers for High-Impact States (3 hours)

1. **Thüringen** - 149 archives
2. **Baden-Württemberg** - 200+ archives
3. **Bayern** - 9 state archives

---

### Step 3: Merge and Deduplicate (1 hour)

- Fuzzy match against existing 20,846 German institutions
- Use 90% similarity threshold (validated by NRW: 80.7% duplicate rate)
- Geocode new cities with Nominatim
- Generate unified German dataset v3.0

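
The fuzzy-match step above could look roughly like this. It is a minimal sketch: the matcher used in the NRW harvest is not shown in this plan, so `difflib.SequenceMatcher` from the standard library stands in for it, and `normalize`, `is_duplicate` and `split_new_and_duplicates` are hypothetical names. Every candidate is compared against every existing name, which is acceptable for a few hundred candidates; for larger batches, comparing only within the same city would cut the work considerably.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.90) -> bool:
    """True if the candidate is at least `threshold` similar to any existing name."""
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(name)).ratio() >= threshold
        for name in existing
    )


def split_new_and_duplicates(candidates, existing, threshold=0.90):
    """Partition harvested names into (new, duplicates) against the existing dataset."""
    new, duplicates = [], []
    for name in candidates:
        target = duplicates if is_duplicate(name, existing, threshold) else new
        target.append(name)
    return new, duplicates
```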

---

## Expected Timeline

| Task | Time | Cumulative | New Archives |
|------|------|------------|--------------|
| **Thüringen** | 30 min | 0:30 | +119 |
| **Niedersachsen & Bremen** | 1 hour | 1:30 | +280 |
| **Schleswig-Holstein** | 45 min | 2:15 | +120 |
| **Hessen** | 45 min | 3:00 | +160 |
| **Merge & Deduplicate** | 30 min | 3:30 | - |
| **PHASE 1 COMPLETE** | **3.5 hours** | - | **+679** |
| | | | |
| **Baden-Württemberg** | 1.5 hours | 5:00 | +200 |
| **Bayern** | 1 hour | 6:00 | +40 |
| **Sachsen** | 1 hour | 7:00 | +120 |
| **Sachsen-Anhalt** | 1 hour | 8:00 | +80 |
| **Merge & Deduplicate** | 30 min | 8:30 | - |
| **PHASE 2 COMPLETE** | **8.5 hours** | - | **+1,119** |

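
The "New Archives" figures can be reproduced from the listed counts under "Output Files to Create" and the plan's 20% duplicate assumption. The sketch below only re-derives the plan's own estimates; `LISTED`, `PHASE_1` and `expected_new` are hypothetical names, and the actual duplicate rate per portal will only be known after each harvest.

```python
# Listed archive counts per portal, taken from the "Output Files to Create" section.
LISTED = {
    "Thüringen": 149,
    "Niedersachsen & Bremen": 350,
    "Schleswig-Holstein": 150,
    "Hessen": 200,
    "Baden-Württemberg": 250,
    "Bayern": 50,
    "Sachsen": 150,
    "Sachsen-Anhalt": 100,
}
PHASE_1 = ("Thüringen", "Niedersachsen & Bremen", "Schleswig-Holstein", "Hessen")
DUPLICATE_PCT = 20   # planning assumption, not a measured rate
BASELINE = 20_846    # current German dataset (v2.0)


def expected_new(listed: int, duplicate_pct: int = DUPLICATE_PCT) -> int:
    """Listed archives minus the assumed share already present in the dataset."""
    return round(listed * (100 - duplicate_pct) / 100)


phase_1_new = sum(expected_new(LISTED[state]) for state in PHASE_1)
total_new = sum(expected_new(count) for count in LISTED.values())

print(f"Phase 1: +{phase_1_new} -> ~{BASELINE + phase_1_new:,}")
print(f"Phase 2: +{total_new} -> ~{BASELINE + total_new:,}")
```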

---

## German Dataset Evolution

| Version | Date | Institutions | Sources | Change |
|---------|------|--------------|---------|--------|
| v1.0 | 2025-11-19 13:49 | 8,129 | ISIL | - |
| v1.1 | 2025-11-19 18:18 | 20,761 | ISIL + DDB | +12,632 |
| v2.0 | 2025-11-19 21:11 | 20,846 | ISIL + DDB + NRW | +85 |
| **v3.0** | **2025-11-20 02:00** | **~21,525** | **+ Arcinsys (4 states)** | **+679** ⭐ |
| **v4.0** | **2025-11-20 08:00** | **~21,965** | **+ Regional (4 more states)** | **+440** |

---

## Output Files to Create

### Harvest Outputs (JSON)

1. `data/isil/germany/thueringen_archives_20251120_*.json` (149 archives)
2. `data/isil/germany/arcinsys_niedersachsen_20251120_*.json` (350 archives)
3. `data/isil/germany/arcinsys_schleswig_holstein_20251120_*.json` (150 archives)
4. `data/isil/germany/arcinsys_hessen_20251120_*.json` (200 archives)
5. `data/isil/germany/baden_wuerttemberg_archives_20251120_*.json` (250 archives)
6. `data/isil/germany/bayern_archives_20251120_*.json` (50 archives)
7. `data/isil/germany/sachsen_archives_20251120_*.json` (150 archives)
8. `data/isil/germany/sachsen_anhalt_archives_20251120_*.json` (100 archives)

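
Writing a harvest file with this naming scheme could be sketched as below. The plan does not say what the `*` suffix stands for; this sketch assumes an `HHMMSS` time stamp, and both the record layout (a list of dicts) and the names `output_path` and `write_harvest` are hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path


def output_path(prefix: str, when: datetime, base_dir: str = "data/isil/germany") -> Path:
    """Build '<base_dir>/<prefix>_<YYYYMMDD>_<HHMMSS>.json' (time suffix is an assumption)."""
    return Path(base_dir) / f"{prefix}_{when:%Y%m%d}_{when:%H%M%S}.json"


def write_harvest(records: list[dict], prefix: str, when: datetime,
                  base_dir: str = "data/isil/germany") -> Path:
    """Write harvested records as UTF-8 JSON and return the path that was written."""
    path = output_path(prefix, when, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps umlauts readable in the output file
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path
```

For example, `write_harvest(records, "thueringen_archives", datetime(2025, 11, 20, 1, 5, 9))` would produce a file named `thueringen_archives_20251120_010509.json` under `data/isil/germany/`.
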
### Merged Outputs

9. `data/isil/germany/german_institutions_unified_v3_20251120_*.json` (after Arcinsys merge)
10. `data/isil/germany/german_institutions_unified_v4_20251120_*.json` (after all regional merges)

### Scripts

11. `scripts/scrapers/harvest_thueringen_archives.py`
12. `scripts/scrapers/harvest_arcinsys_unified.py` (handles all 4 Arcinsys states)
13. `scripts/scrapers/harvest_baden_wuerttemberg_archives.py`
14. `scripts/scrapers/harvest_bayern_archives.py`
15. `scripts/scrapers/harvest_sachsen_archives.py`
16. `scripts/scrapers/harvest_sachsen_anhalt_archives.py`
17. `scripts/scrapers/merge_regional_to_german_dataset.py` (unified merger)

---

## Success Criteria

### Phase 1 (3.5 hours)

✅ Thüringen harvested (149 archives)
✅ Arcinsys consortium harvested (700+ archives)
✅ 679 new archives added to German dataset
✅ German dataset v3.0 created (~21,525 institutions)

### Phase 2 (8.5 hours cumulative)

✅ Baden-Württemberg harvested (250 archives)
✅ Bayern harvested (50 archives)
✅ Sachsen harvested (150 archives)
✅ Sachsen-Anhalt harvested (100 archives)
✅ 1,119 new archives added to German dataset (cumulative, including Phase 1)
✅ German dataset v4.0 created (~21,965 institutions)

---

## Phase 1 Progress Impact

| Metric | Before | After v3.0 | After v4.0 | Goal |
|--------|--------|------------|------------|------|
| **German Institutions** | 20,846 | 21,525 | 21,965 | ~22,000 |
| **Phase 1 Total** | 38,479 | 39,158 | 39,598 | 97,000 |
| **Progress %** | 39.7% | 40.4% | 40.8% | 100% |

**Impact**: +0.7pp (v3.0) or +1.1pp (v4.0) toward Phase 1 goal

---

## Key Technical Notes

### Reuse NRW Pattern

The NRW harvest showed excellent results:
- **Fast text extraction** (no clicking)
- **Regex city parsing** (German archive name patterns)
- **Fuzzy deduplication** (90% threshold)
- **Nominatim geocoding** (1 req/sec)

**Apply same approach** to all regional portals.

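
The 1 req/sec geocoding limit could be enforced with a small throttle in front of each request. This is a minimal sketch using only the standard library, not the NRW implementation; `Throttle`, `build_search_url` and `geocode` are hypothetical names. The query parameters (`q`, `format`, `limit`, `countrycodes`) belong to the public Nominatim search endpoint, and its usage policy expects an identifying `User-Agent`, so the value passed in should name this project.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(city: str) -> str:
    """Build a Nominatim search URL restricted to Germany, returning one JSON result."""
    query = urllib.parse.urlencode(
        {"q": city, "format": "json", "limit": 1, "countrycodes": "de"}
    )
    return f"{NOMINATIM_SEARCH}?{query}"


class Throttle:
    """Block so that consecutive wait() calls return at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def geocode(city: str, throttle: Throttle, user_agent: str):
    """Return (lat, lon) for a city, or None if Nominatim finds nothing."""
    throttle.wait()
    request = urllib.request.Request(
        build_search_url(city), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])
```

Caching results per city name before calling `geocode` would avoid repeated lookups, since many archives share the same city.
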
### Arcinsys Advantage

All 4 Arcinsys states share:
- Same portal structure
- Same archive directory format
- Same HTML/CSS patterns

**Build ONE scraper**, deploy to 4 states → 700+ archives in 3 hours

---

## Next Agent Instructions

**START WITH THÜRINGEN** (30 minutes, 119 new archives)

1. Navigate to: https://www.archive-in-thueringen.de/en/
2. Find archive directory (likely under "Archives" or "Institutions" menu)
3. Extract the 149 listed archives
4. Parse names, cities, types
5. Geocode and merge

**Then move to Arcinsys consortium** (3 hours, 560 new archives)

---

**Ready to Execute**: YES
**Expected Total Time**: 3.5 hours (Phase 1) or 8.5 hours (Phases 1 and 2 combined)
**Expected New Archives**: +679 (Phase 1) or +1,119 (Phases 1 and 2 combined)
**Priority Level**: HIGH ⭐

---

**Generated**: 2025-11-19 22:35 UTC
**Reference**: GERMAN_REGIONAL_ARCHIVE_PORTALS_DISCOVERY.md