# Next Agent Handoff: Saxony Complete, Bavaria Ready

**Date:** 2025-11-20
**Status:** Saxony extraction COMPLETE (411 institutions at 99.8% ISIL coverage)
**Next Target:** Bavaria (Bayern) - estimated 1,200-1,500 institutions
## What We Just Finished

### Saxony Dataset COMPLETE ✅

- 411 institutions extracted (6 archives + 6 libraries + 399 museums)
- 99.8% ISIL coverage (410/411 institutions) 🏆 BEST IN PROJECT
- 213 cities covered (excellent rural penetration)
- Foundation-first strategy validated (quality archives/libraries first, then bulk museum extraction)
### Key Files Created

- Data: `data/isil/germany/sachsen_complete_20251120_153257.json` (640 KB, 411 institutions)
- Scraper: `scripts/scrapers/harvest_isil_museum_sachsen.py` (museum extractor)
- Documentation:
  - `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md` (full case study)
  - `GERMAN_STATE_EXTRACTION_PATTERN.md` (reusable template)
  - `GERMAN_HARVEST_STATUS.md` (current progress)
## What's Ready for You (Next Agent)

### Immediate Next Task: Bavaria (Bayern) Extraction

- **Goal:** Extract 1,200-1,500 Bavarian institutions using the proven Saxony pattern
- **Estimated Time:** 1.5-2 hours (including foundation research)
- **Expected ISIL Coverage:** 98%+
## Step-by-Step Instructions for Bavaria

### Phase 1: Museum Extraction (5 minutes)

```bash
# 1. Copy Saxony scraper template
cd /Users/kempersc/apps/glam
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# 2. Update state references (macOS sed syntax)
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# 3. Manually edit the URL (line ~27)
# Before: SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Sachsen"
# After:  BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"

# 4. Manually edit the region (line ~139)
# Before: "region": "Sachsen"
# After:  "region": "Bayern"

# 5. Run extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py
# Expected output: data/isil/germany/bayern_museums_YYYYMMDD_HHMMSS.json
# Expected count: ~1,200 Bavarian museums
```
**Manual edits required:**

- Line 27: Update `SACHSEN_URL` to `BAYERN_URL` with `suchbegriff=Bayern`
- Line 139: Change the region from `"Sachsen"` to `"Bayern"`
### Phase 2: Foundation Dataset (30-60 minutes)

Research Bavarian state archives and libraries.

**Bavarian State Archives (Bayerische Staatsarchive):**
- Hauptstaatsarchiv München (Munich State Archive)
- Staatsarchiv Amberg
- Staatsarchiv Augsburg
- Staatsarchiv Bamberg
- Staatsarchiv Coburg
- Staatsarchiv Landshut
- Staatsarchiv Nürnberg (Nuremberg)
- Staatsarchiv Würzburg
**Major Bavarian Libraries:**
- Bayerische Staatsbibliothek (Munich) - https://www.bsb-muenchen.de
- Universitätsbibliothek München (LMU)
- Universitätsbibliothek der TU München
- Universitätsbibliothek Würzburg
- Universitätsbibliothek Erlangen-Nürnberg
- Universitätsbibliothek Regensburg
**Extraction method:**

- Visit the official websites
- Extract: name, city, address, phone, email, website, ISIL code
- Create JSON files:
  - `data/isil/germany/bayern_archives_YYYYMMDD_HHMMSS.json`
  - `data/isil/germany/bayern_libraries_YYYYMMDD_HHMMSS.json`

**Target:** ~14 foundation institutions at 80%+ completeness
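The foundation records gathered in this phase can be assembled with a few lines of Python. This is a minimal sketch, not the project's actual schema: the field names (`name`, `type`, `city`, `region`, `isil`) mirror the extraction list above, and the helper names are hypothetical.

```python
import json
import os
from datetime import datetime

def make_foundation_record(name, inst_type, city, isil=None, **extra):
    """Build one foundation-institution record (assumed field names)."""
    record = {
        "name": name,
        "type": inst_type,
        "city": city,
        "region": "Bayern",
        "isil": isil,  # ISIL code if known; None until assigned/researched
    }
    record.update(extra)  # optional fields: address, phone, email, website, ...
    return record

def save_foundation_dataset(records, kind):
    """Write records to a timestamped bayern_<kind>_YYYYMMDD_HHMMSS.json file."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"data/isil/germany/bayern_{kind}_{stamp}.json"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path

libraries = [
    make_foundation_record(
        "Bayerische Staatsbibliothek", "library", "München",
        website="https://www.bsb-muenchen.de",
    ),
]
```

Calling `save_foundation_dataset(libraries, "libraries")` would produce a file matching the naming pattern above.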
### Phase 3: Merge Datasets (5 minutes)

```bash
# 1. Copy merge template
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py

# 2. Update state references (macOS sed syntax)
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# 3. Update file patterns (edit the script manually)
# Change: sachsen_archives_*.json             → bayern_archives_*.json
# Change: sachsen_slub_dresden_*.json         → (remove - not applicable to Bayern)
# Change: sachsen_university_libraries_*.json → bayern_libraries_*.json
# Change: sachsen_museums_*.json              → bayern_museums_*.json

# 4. Run merge
python3 scripts/merge_bayern_complete.py
# Expected output: data/isil/germany/bayern_complete_YYYYMMDD_HHMMSS.json
# Expected count: ~1,214 institutions (14 foundation + 1,200 museums)
```
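The core of the merge step can be sketched as follows. This is an illustration of what a script like `merge_bayern_complete.py` presumably does per the steps above (load the newest file for each pattern, concatenate, sort by city then name), not the actual script; the record field names `city` and `name` are assumptions.

```python
import glob
import json

def latest(pattern):
    """Newest file matching a pattern; YYYYMMDD_HHMMSS stamps sort lexicographically."""
    files = sorted(glob.glob(pattern))
    return files[-1] if files else None

def merge_state(patterns):
    """Load the newest file per pattern, concatenate, sort by city then name."""
    merged = []
    for pattern in patterns:
        path = latest(pattern)
        if path is None:
            continue  # e.g. the libraries file has not been created yet
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    merged.sort(key=lambda r: (r.get("city", ""), r.get("name", "")))
    return merged
```

Used, for example, as `merge_state(["data/isil/germany/bayern_archives_*.json", "data/isil/germany/bayern_libraries_*.json", "data/isil/germany/bayern_museums_*.json"])`.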
## Expected Bavaria Results

**Institution breakdown:**

- State archives: 8
- Major libraries: 6
- Museums: 1,200+ (from isil.museum)
- Total: ~1,214 institutions

**ISIL coverage:** 98%+ (based on the Saxony pattern)
**Geographic distribution:** ~200-300 Bavarian cities
**Top cities (estimated):**
- Munich (München): 200-300 institutions
- Nuremberg (Nürnberg): 50-80 institutions
- Augsburg: 30-50 institutions
- Regensburg: 20-30 institutions
- Würzburg: 20-30 institutions
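Once the merged file exists, the city-distribution estimates above can be checked with a simple counter. A minimal sketch, assuming each record carries a `city` field (the real field name may differ):

```python
from collections import Counter

def top_cities(records, n=5):
    """Count institutions per city and return the n most common."""
    return Counter(r.get("city", "?") for r in records).most_common(n)

# Tiny illustrative input, not real Bavaria data
sample = [{"city": "München"}, {"city": "München"}, {"city": "Augsburg"}]
# top_cities(sample) -> [('München', 2), ('Augsburg', 1)]
```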
## Quick Reference: What Works

### Foundation-First Strategy ✅

1. Extract a high-quality foundation dataset first (archives + major libraries)
   - Target: 10-20 institutions
   - Completeness: 80%+
   - Method: manual web research
   - Time: 30-60 minutes
2. Extract museums from the isil.museum registry
   - Target: 200-1,500 institutions (varies by state size)
   - Completeness: 40%+ (basic extraction)
   - Method: automated scraping
   - Time: ~5 seconds
3. Merge datasets
   - Combine foundation + museums
   - Sort by city, then name
   - Generate reports
   - Time: ~3 seconds
### Why This Works

- **Quality first:** the foundation dataset provides a high-completeness benchmark
- **Quantity second:** the museum registry provides comprehensive coverage
- **Reproducible:** the same pattern works for all German states
- **Fast:** total automation time <10 seconds; manual research 30-60 minutes
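The automated museum-extraction step boils down to parsing a registry list page into records. The sketch below illustrates the shape of that parsing with a regex over a hypothetical link format; the real `harvest_isil_museum_*.py` scrapers target isil.museum's actual HTML, which may be structured quite differently.

```python
import re

# Hypothetical markup: the registry is assumed to list entries as links
# whose text is "Name, City". This sample is invented for illustration.
SAMPLE_HTML = """
<a href="/museum/1">Deutsches Museum, München</a>
<a href="/museum/2">Germanisches Nationalmuseum, Nürnberg</a>
"""

def parse_museums(html, region="Bayern"):
    """Turn list-page links into basic museum records (assumed field names)."""
    museums = []
    for name_city in re.findall(r'<a href="/museum/\d+">([^<]+)</a>', html):
        name, _, city = name_city.rpartition(", ")
        museums.append({"name": name, "city": city, "type": "museum", "region": region})
    return museums
```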
## Troubleshooting Guide

### Problem: "No museums found in HTML"

**Solution:** Check URL encoding. Bavaria may require special handling:

```python
# Try these URL variations:
BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"
# OR
BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bavaria"  # English name
```
### Problem: "ISIL coverage <95%"

**Solution:** Some foundation institutions may not have ISIL codes. Check:

- SIGEL database: https://sigel.staatsbibliothek-berlin.de
- Search for the missing institutions
- Mark as "ISIL_not_assigned" if genuinely missing
### Problem: "City names with umlauts"

**Solution:** Keep the original German names with umlauts:

- München (not Muenchen)
- Nürnberg (not Nuernberg)
- Ensure UTF-8 encoding when reading and writing files: `encoding='utf-8'`
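A quick illustration of why `ensure_ascii=False` matters alongside `encoding='utf-8'` when serializing: Python's `json.dumps` escapes non-ASCII characters by default, so umlauts only survive as readable characters if the default is overridden.

```python
import json

cities = ["München", "Nürnberg", "Würzburg"]

# ensure_ascii=False keeps umlauts as real characters in the output
as_utf8 = json.dumps(cities, ensure_ascii=False)

# Default behaviour escapes them to \uXXXX sequences
escaped = json.dumps(cities)

# as_utf8  -> ["München", "Nürnberg", "Würzburg"]
# escaped  -> ["M\u00fcnchen", "N\u00fcrnberg", "W\u00fcrzburg"]
```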
## Validation Checklist

Before marking Bavaria as COMPLETE, verify:

- [ ] Foundation dataset created (8 archives + 6 libraries)
- [ ] Museums extracted from isil.museum (~1,200 institutions)
- [ ] Datasets merged into `bayern_complete_*.json`
- [ ] ISIL coverage >95%
- [ ] Core field completeness 100% (name, type, city)
- [ ] Geographic distribution analyzed (city counts)
- [ ] Metadata completeness report generated
- [ ] LinkML schema validation passed
- [ ] Session summary documented
## Success Metrics

**Minimum viable Bavaria dataset:**

- ✅ Foundation: 10+ institutions at 80%+ completeness
- ✅ Museums: 1,000+ institutions at 40%+ completeness
- ✅ ISIL coverage: >95%
- ✅ Core fields: 100%

**High-quality Bavaria dataset:**

- ✅ Foundation: 14+ institutions at 90%+ completeness
- ✅ Museums: 1,200+ institutions at 50%+ completeness
- ✅ ISIL coverage: >98%
- ✅ Core fields: 100%
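The ISIL-coverage metric in these criteria is straightforward to compute from the merged file. A minimal sketch, assuming each record has an `isil` field that is a non-empty string when a code is assigned (the real field name and empty-value convention may differ):

```python
def isil_coverage(records):
    """Fraction of records with a non-empty ISIL code."""
    if not records:
        return 0.0
    with_isil = sum(1 for r in records if r.get("isil"))
    return with_isil / len(records)

# Tiny illustrative input: 3 of 4 records carry a code
sample = [
    {"name": "A", "isil": "DE-1"},
    {"name": "B", "isil": None},
    {"name": "C", "isil": "DE-2"},
    {"name": "D", "isil": "DE-3"},
]
# isil_coverage(sample) -> 0.75, below the 0.95 minimum bar
```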
## Reference Files (Use These!)

### Templates

- Scraper: `scripts/scrapers/harvest_isil_museum_sachsen.py`
- Merger: `scripts/merge_sachsen_complete.py`
- Pattern Guide: `GERMAN_STATE_EXTRACTION_PATTERN.md`

### Documentation

- Saxony Case Study: `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md`
- Harvest Status: `GERMAN_HARVEST_STATUS.md`
- Strategy: `SAXONY_HARVEST_STRATEGY.md`

### Data Files (Saxony Examples)

- Complete: `data/isil/germany/sachsen_complete_20251120_153257.json`
- Museums: `data/isil/germany/sachsen_museums_20251120_153233.json`
- Archives: `data/isil/germany/sachsen_archives_20251120_152047.json`
## After Bavaria: Next Targets

**Priority order (by institution count):**

1. ✅ Nordrhein-Westfalen - COMPLETE (1,893 institutions)
2. ✅ Thüringen - COMPLETE (1,061 institutions)
3. 📋 Bayern (Bavaria) - NEXT TARGET (1,200-1,500 estimated) ← YOU ARE HERE
4. 📋 Baden-Württemberg - 1,000-1,200 estimated
5. 📋 Niedersachsen (Lower Saxony) - 800-1,000 estimated

**Estimated time to complete Germany:**

- Remaining: 12 states
- Time per state: 1.5 hours average
- Total remaining: ~18 hours
## Project Context

### Current Status

- Completed: 4/16 German states (25%)
- Total institutions: 3,682
- ISIL coverage: 98.5%+
- Best ISIL coverage: Saxony (99.8%) 🏆

### Post-Bavaria Status (Projected)

- Completed: 5/16 German states (31%)
- Total institutions: ~4,900 (3,682 + 1,214)
- ISIL coverage: 98.5%+ (maintained)
## Quick Start Command Summary

```bash
# COPY-PASTE THESE COMMANDS FOR BAVARIA EXTRACTION

# 1. Create Bavaria scraper
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# 2. Manually edit URLs/regions in the bayern scraper (see Phase 1 above)

# 3. Run museum extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# 4. Research foundation dataset (archives + libraries)
# Create: bayern_archives_*.json and bayern_libraries_*.json

# 5. Create merge script
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# 6. Manually edit file patterns in the merge script (see Phase 3 above)

# 7. Run merge
python3 scripts/merge_bayern_complete.py

# 8. Verify results (open() does not expand globs, so resolve the newest file first)
python3 -c "import glob, json; path = sorted(glob.glob('data/isil/germany/bayern_complete_*.json'))[-1]; data = json.load(open(path, encoding='utf-8')); print(f'Total: {len(data)} institutions')"

# 9. Document the session (copy/adapt SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md)
```
## Final Notes

This is a proven pattern - we just validated it on Saxony with:

- ✅ 99.8% ISIL coverage (best in project)
- ✅ 8 seconds total automation time
- ✅ 100% LinkML schema compliance
- ✅ 213 cities covered

Follow the instructions above and you'll have Bavaria complete in 1.5-2 hours.

**Key success factor:** foundation-first strategy (quality before quantity)

**Status:** ✅ Ready for Bavaria extraction
**Next Agent:** Start with Phase 1 (museum extraction) - it takes only 5 minutes!
**Expected Completion:** Bavaria complete in 1.5-2 hours

Good luck! 🚀