# Next Agent Handoff: Saxony Complete, Bavaria Ready

**Date**: 2025-11-20
**Status**: Saxony extraction COMPLETE (411 institutions at 99.8% ISIL coverage)
**Next Target**: Bavaria (Bayern) - estimated 1,200-1,500 institutions

---
## What We Just Finished

### Saxony Dataset COMPLETE ✅

- **411 institutions** extracted (6 archives + 6 libraries + 399 museums)
- **99.8% ISIL coverage** (410/411 institutions) 🏆 **BEST IN PROJECT**
- **213 cities** covered (excellent rural penetration)
- **Foundation-first strategy** validated (quality archives/libraries first, then bulk museum extraction)

### Key Files Created

1. **Data**: `data/isil/germany/sachsen_complete_20251120_153257.json` (640 KB, 411 institutions)
2. **Scraper**: `scripts/scrapers/harvest_isil_museum_sachsen.py` (museum extractor)
3. **Documentation**:
   - `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md` (full case study)
   - `GERMAN_STATE_EXTRACTION_PATTERN.md` (reusable template)
   - `GERMAN_HARVEST_STATUS.md` (current progress)

---
## What's Ready for You (Next Agent)

### Immediate Next Task: Bavaria (Bayern) Extraction

**Goal**: Extract 1,200-1,500 Bavarian institutions using the proven Saxony pattern

**Estimated Time**: 1.5-2 hours (including foundation research)

**Expected ISIL Coverage**: 98%+

---
## Step-by-Step Instructions for Bavaria

### Phase 1: Museum Extraction (5 minutes)

```bash
# 1. Copy Saxony scraper template
cd /Users/kempersc/apps/glam
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# 2. Update state references (macOS sed syntax)
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# 3. Manually rename the URL constant (line ~27) - sed matches case-sensitively,
#    so the all-caps SACHSEN_URL identifier is not renamed automatically
# Before: SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Sachsen"
# After:  BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"

# 4. Verify the region value (line ~139) - the sed commands above should
#    already have rewritten the string
# Before: "region": "Sachsen"
# After:  "region": "Bayern"

# 5. Run extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# Expected output: data/isil/germany/bayern_museums_YYYYMMDD_HHMMSS.json
# Expected count: ~1,200 Bavarian museums
```

**Manual Edits Required**:

- Line 27: Update `SACHSEN_URL` to `BAYERN_URL` with `suchbegriff=Bayern`
- Line 139: Confirm region changed from `"Sachsen"` to `"Bayern"`
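The heart of the museum scraper is fetching the state list page and pulling institution entries out of the HTML. A minimal sketch of that parsing step using the standard library, assuming the registry renders one `<a>` link per museum - the actual markup handled by `harvest_isil_museum_sachsen.py` may differ, so treat this as an illustration rather than the scraper's real logic:

```python
from html.parser import HTMLParser

class MuseumListParser(HTMLParser):
    """Collects (href, link text) pairs from anchor tags.

    Assumption: the isil.museum list page shows one <a> per museum;
    the real page may need extra filtering (navigation links etc.).
    """

    def __init__(self):
        super().__init__()
        self.links = []            # list of (href, name) tuples
        self._current_href = None  # set while inside an <a> tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href", "")
            self._chunks = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            name = "".join(self._chunks).strip()
            if name:
                self.links.append((self._current_href, name))
            self._current_href = None

# Tiny inline sample instead of a live fetch
sample = ('<ul><li><a href="/?id=1">Deutsches Museum</a></li>'
          '<li><a href="/?id=2">Stadtmuseum Augsburg</a></li></ul>')
parser = MuseumListParser()
parser.feed(sample)
print(parser.links)
```

In the real scraper the `sample` string would be the response body for `BAYERN_URL`, and each extracted link would be followed up for the detail fields.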
### Phase 2: Foundation Dataset (30-60 minutes)

**Research Bavarian State Archives and Libraries**:

**Bavarian State Archives** (Bayerische Staatsarchive):

1. Hauptstaatsarchiv München (Munich State Archive)
2. Staatsarchiv Amberg
3. Staatsarchiv Augsburg
4. Staatsarchiv Bamberg
5. Staatsarchiv Coburg
6. Staatsarchiv Landshut
7. Staatsarchiv Nürnberg (Nuremberg)
8. Staatsarchiv Würzburg

**Major Bavarian Libraries**:

1. Bayerische Staatsbibliothek (Munich) - https://www.bsb-muenchen.de
2. Universitätsbibliothek München (LMU)
3. Universitätsbibliothek der TU München
4. Universitätsbibliothek Würzburg
5. Universitätsbibliothek Erlangen-Nürnberg
6. Universitätsbibliothek Regensburg

**Extraction Method**:

- Visit official websites
- Extract: name, city, address, phone, email, website, ISIL code
- Create JSON files:
  - `data/isil/germany/bayern_archives_YYYYMMDD_HHMMSS.json`
  - `data/isil/germany/bayern_libraries_YYYYMMDD_HHMMSS.json`

**Target**: ~14 foundation institutions at 80%+ completeness
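Foundation records should follow the same shape as the museum records so the merge step stays trivial. A sketch of one entry, with field names taken from the extraction list above - the actual keys are defined by the Saxony JSON files, which remain the authoritative reference, and the ISIL shown should be verified against the SIGEL database:

```python
import json

# Hypothetical example record; field names mirror the extraction list
# above, not necessarily the exact keys used in the Saxony files.
record = {
    "name": "Bayerische Staatsbibliothek",
    "type": "library",
    "city": "München",
    "address": "Ludwigstraße 16, 80539 München",
    "phone": None,            # fill in during web research
    "email": None,
    "website": "https://www.bsb-muenchen.de",
    "isil": "DE-12",          # verify against the SIGEL database
    "region": "Bayern",
}

# ensure_ascii=False keeps umlauts (München) readable in the output file
print(json.dumps(record, ensure_ascii=False, indent=2))
```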
### Phase 3: Merge Datasets (5 minutes)

```bash
# 1. Copy merge template
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py

# 2. Update state references (macOS sed syntax)
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# 3. Update file patterns (edit script manually; the sed commands above will
#    already have rewritten "sachsen" to "bayern" inside the patterns)
# Change: sachsen_archives_*.json → bayern_archives_*.json (handled by sed)
# Change: sachsen_slub_dresden_*.json → (remove this - not applicable to Bayern)
# Change: sachsen_university_libraries_*.json → bayern_libraries_*.json
# Change: sachsen_museums_*.json → bayern_museums_*.json (handled by sed)

# 4. Run merge
python3 scripts/merge_bayern_complete.py

# Expected output: data/isil/germany/bayern_complete_YYYYMMDD_HHMMSS.json
# Expected count: ~1,214 institutions (14 foundation + 1,200 museums)
```
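Conceptually the merge script reduces to: glob each component pattern, take the newest file per pattern, concatenate, and sort by city then name. A minimal sketch under those assumptions - `scripts/merge_sachsen_complete.py` is the authoritative template, and the patterns and field names below are inferred from this document, not read from that script:

```python
import glob
import json

# Assumed component patterns (see step 3 above)
PATTERNS = [
    "data/isil/germany/bayern_archives_*.json",
    "data/isil/germany/bayern_libraries_*.json",
    "data/isil/germany/bayern_museums_*.json",
]

def newest(pattern):
    """Latest file for a pattern; YYYYMMDD_HHMMSS names sort lexically."""
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None

def merge(record_lists):
    """Concatenate record lists and sort by (city, name)."""
    merged = [rec for records in record_lists for rec in records]
    merged.sort(key=lambda r: (r.get("city", ""), r.get("name", "")))
    return merged

if __name__ == "__main__":
    lists = []
    for pattern in PATTERNS:
        path = newest(pattern)
        if path:
            with open(path, encoding="utf-8") as fh:
                lists.append(json.load(fh))
    print(f"Total: {len(merge(lists))} institutions")
```

Sorting on a `(city, name)` tuple key is what gives the "sort by city, then name" ordering in one pass.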
---

## Expected Bavaria Results

**Institution Breakdown**:

- State archives: 8
- Major libraries: 6
- Museums: 1,200+ (from isil.museum)
- **Total**: ~1,214 institutions

**ISIL Coverage**: 98%+ (based on Saxony pattern)

**Geographic Distribution**: ~200-300 Bavarian cities

**Top Cities** (estimated):

- Munich (München): 200-300 institutions
- Nuremberg (Nürnberg): 50-80 institutions
- Augsburg: 30-50 institutions
- Regensburg: 20-30 institutions
- Würzburg: 20-30 institutions

---
## Quick Reference: What Works

### Foundation-First Strategy ✅

1. **Extract high-quality foundation dataset first** (archives + major libraries)
   - Target: 10-20 institutions
   - Completeness: 80%+
   - Method: Manual web research
   - Time: 30-60 minutes

2. **Extract museums from isil.museum registry**
   - Target: 200-1,500 institutions (varies by state size)
   - Completeness: 40%+ (basic extraction)
   - Method: Automated scraping
   - Time: ~5 seconds

3. **Merge datasets**
   - Combine foundation + museums
   - Sort by city, then name
   - Generate reports
   - Time: ~3 seconds

### Why This Works

- **Quality first**: Foundation dataset provides a high-completeness benchmark
- **Quantity second**: Museum registry provides comprehensive coverage
- **Reproducible**: Same pattern works for all German states
- **Fast**: Total automation time <10 seconds, manual research 30-60 minutes

---
## Troubleshooting Guide

### Problem: "No museums found in HTML"

**Solution**: Check the URL's query parameter. The registry may expect a different state name:

```python
# Try these URL variations:
BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"
# OR
BAYERN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bavaria"  # English name
```
### Problem: "ISIL coverage <95%"

**Solution**: Some foundation institutions may not have ISIL codes. Check:

1. SIGEL database: https://sigel.staatsbibliothek-berlin.de
2. Search for missing institutions
3. Mark as "ISIL_not_assigned" if genuinely missing
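When coverage comes up short, it helps to distinguish records with no ISIL at all from records with a malformed one. A rough format check, assuming each record carries an `isil` field - the regex only approximates the ISO 15511 shape (1-4 character prefix, hyphen, local identifier, 16 characters total) and is not a registry lookup:

```python
import re

# Approximate ISIL shape: 1-4 alphanumeric prefix chars, hyphen, local id;
# ISO 15511 caps the whole identifier at 16 characters.
ISIL_RE = re.compile(r"^[A-Za-z0-9]{1,4}-[A-Za-z0-9:/\-]{1,11}$")

def classify_isil(value):
    """Return 'missing', 'malformed', or 'ok' for a record's isil field."""
    if not value or value == "ISIL_not_assigned":
        return "missing"
    return "ok" if ISIL_RE.match(value) else "malformed"

print(classify_isil("DE-12"))   # well-formed German ISIL
print(classify_isil(None))      # no code at all
print(classify_isil("DE 12"))   # space instead of hyphen
```

Running this classification over a merged file quickly shows whether the gap is genuinely unassigned codes or extraction noise.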
### Problem: "City names with umlauts"

**Solution**: Keep original German names with umlauts:

- München (not Muenchen)
- Nürnberg (not Nuernberg)
- Ensure UTF-8 encoding: `encoding='utf-8'`
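In practice that means passing `encoding='utf-8'` on every `open()` and `ensure_ascii=False` when dumping, so files store `München` rather than the escaped `M\u00fcnchen`. A minimal round-trip sketch (the record content is illustrative):

```python
import json
import tempfile

records = [{"name": "Stadtmuseum", "city": "München", "region": "Bayern"}]

# Write: ensure_ascii=False keeps umlauts as real characters in the file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as fh:
    json.dump(records, fh, ensure_ascii=False, indent=2)
    path = fh.name

# Read back with explicit UTF-8 so the umlaut survives on any platform
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(loaded[0]["city"])
```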
---

## Validation Checklist

Before marking Bavaria as COMPLETE, verify:

- [ ] Foundation dataset created (8 archives + 6 libraries)
- [ ] Museums extracted from isil.museum (~1,200 institutions)
- [ ] Datasets merged into `bayern_complete_*.json`
- [ ] ISIL coverage >95%
- [ ] Core field completeness 100% (name, type, city)
- [ ] Geographic distribution analyzed (city counts)
- [ ] Metadata completeness report generated
- [ ] LinkML schema validation passed
- [ ] Session summary documented
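The quantitative items on the checklist reduce to a few passes over the merged list. A sketch, assuming each record has `name`, `type`, `city`, and (optionally) `isil` fields as described in Phase 2:

```python
from collections import Counter

def validate(records):
    """Compute ISIL coverage, core-field completeness, and city spread."""
    total = len(records)
    with_isil = sum(1 for r in records if r.get("isil"))
    core_complete = sum(
        1 for r in records
        if all(r.get(field) for field in ("name", "type", "city"))
    )
    cities = Counter(r.get("city", "") for r in records)
    return {
        "total": total,
        "isil_coverage_pct": round(100 * with_isil / total, 1) if total else 0.0,
        "core_complete_pct": round(100 * core_complete / total, 1) if total else 0.0,
        "city_count": len(cities),
        "top_cities": cities.most_common(5),
    }

sample = [
    {"name": "A", "type": "museum", "city": "München", "isil": "DE-1"},
    {"name": "B", "type": "museum", "city": "München", "isil": "DE-2"},
    {"name": "C", "type": "archive", "city": "Augsburg"},  # no ISIL yet
]
print(validate(sample))
```

Pointing `validate` at the loaded `bayern_complete_*.json` gives the numbers needed for the ISIL-coverage, core-field, and geographic-distribution checkboxes in one call.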
---

## Success Metrics

**Minimum Viable Bavaria Dataset**:

- ✅ Foundation: 10+ institutions at 80%+ completeness
- ✅ Museums: 1,000+ institutions at 40%+ completeness
- ✅ ISIL coverage: >95%
- ✅ Core fields: 100%

**High-Quality Bavaria Dataset**:

- ✅ Foundation: 14+ institutions at 90%+ completeness
- ✅ Museums: 1,200+ institutions at 50%+ completeness
- ✅ ISIL coverage: >98%
- ✅ Core fields: 100%

---
## Reference Files (Use These!)

### Templates

- **Scraper**: `scripts/scrapers/harvest_isil_museum_sachsen.py`
- **Merger**: `scripts/merge_sachsen_complete.py`
- **Pattern Guide**: `GERMAN_STATE_EXTRACTION_PATTERN.md`

### Documentation

- **Saxony Case Study**: `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md`
- **Harvest Status**: `GERMAN_HARVEST_STATUS.md`
- **Strategy**: `SAXONY_HARVEST_STRATEGY.md`

### Data Files (Saxony Examples)

- **Complete**: `data/isil/germany/sachsen_complete_20251120_153257.json`
- **Museums**: `data/isil/germany/sachsen_museums_20251120_153233.json`
- **Archives**: `data/isil/germany/sachsen_archives_20251120_152047.json`

---
## After Bavaria: Next Targets

**Priority Order** (by institution count):

1. ✅ **Nordrhein-Westfalen** - COMPLETE (1,893 institutions)
2. ✅ **Thüringen** - COMPLETE (1,061 institutions)
3. 📋 **Bayern (Bavaria)** - NEXT TARGET (1,200-1,500 estimated) ← **YOU ARE HERE**
4. 📋 **Baden-Württemberg** - 1,000-1,200 estimated
5. 📋 **Niedersachsen (Lower Saxony)** - 800-1,000 estimated

**Estimated Time to Complete Germany**:

- Remaining: 12 states
- Time per state: 1.5 hours average
- Total remaining: ~18 hours

---
## Project Context

### Current Status

- **Completed**: 4/16 German states (25%)
- **Total Institutions**: 3,682
- **ISIL Coverage**: 98.5%+
- **Best ISIL Coverage**: Saxony (99.8%) 🏆

### Post-Bavaria Status (Projected)

- **Completed**: 5/16 German states (31%)
- **Total Institutions**: ~4,900 (3,682 + 1,214)
- **ISIL Coverage**: 98.5%+ (maintained)

---
## Quick Start Command Summary

```bash
# COPY-PASTE THESE COMMANDS FOR BAVARIA EXTRACTION

# 1. Create Bavaria scraper
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# 2. Manually edit URLs/regions in bayern scraper (see Phase 1 above)

# 3. Run museum extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# 4. Research foundation dataset (archives + libraries)
# Create: bayern_archives_*.json and bayern_libraries_*.json

# 5. Create merge script
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# 6. Manually edit file patterns in merge script (see Phase 3 above)

# 7. Run merge
python3 scripts/merge_bayern_complete.py

# 8. Verify results (open() does not expand globs, so resolve the path first)
python3 -c "import glob, json; path = sorted(glob.glob('data/isil/germany/bayern_complete_*.json'))[-1]; data = json.load(open(path)); print(f'Total: {len(data)} institutions')"

# 9. Document session (copy/adapt SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md)
```

---
## Final Notes

**This is a proven pattern** - we just validated it on Saxony with:

- ✅ 99.8% ISIL coverage (best in project)
- ✅ 8 seconds total automation time
- ✅ 100% LinkML schema compliance
- ✅ 213 cities covered

**Just follow the instructions** and you'll have Bavaria complete in 1.5-2 hours!

**Key Success Factor**: Foundation-first strategy (quality before quantity)

---

**Status**: ✅ Ready for Bavaria extraction
**Next Agent**: Start with Phase 1 (museum extraction) - takes only 5 minutes!
**Expected Completion**: Bavaria complete in 1.5-2 hours

**Good luck! 🚀**