# German State Extraction Pattern (Reusable Template)

**Last Updated**: 2025-11-20
**Status**: Production-ready pattern validated on Saxony (411 institutions)

---

## Overview

This document provides a **copy-paste template** for extracting heritage institutions from any German state (Bundesland) using the proven Saxony pattern.

**Success Rate**: 99.8% ISIL coverage, 100% core metadata completeness
**Time Required**: 1.5-2 hours per state (foundation + museums)
**Difficulty**: Easy (automated scraping, minimal manual curation)

---

## Two-Phase Extraction Strategy

### Phase 1: Foundation Dataset (Archives + Libraries)

Extract **high-quality records** from major state institutions:

- State archives (Staatsarchiv, Landesarchiv)
- State/university libraries (Staatsbibliothek, Universitätsbibliothek)
- Research libraries (specialized subject libraries)

**Target**: 10-20 institutions with 80%+ completeness

### Phase 2: Museum Registry Extraction

Extract **comprehensive museum coverage** from isil.museum:

- Official German museum ISIL registry (Institut für Museumsforschung)
- ~6,300 total German museums
- Filter by state name (Bundesland)

**Target**: 200-1,500 museums per state (varies by state size)

---

## Quick Start Template

### Step 1: Create State Scraper (5 minutes)

```bash
# Replace STATE_NAME and state_name throughout
# Example: Bayern (Bavaria), Baden-Württemberg, etc.

# Copy the Saxony museum scraper template
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Update state references
# macOS:
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i '' 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
# Linux:
sed -i 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
```
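Before running the copied scraper, it is worth confirming that the `sed` pass caught every old-state reference. A minimal sketch of such a check (the script path is the Step 1 placeholder; adapt it to your state):

```python
from pathlib import Path


def find_stale_refs(script_path, old_names=("Sachsen", "sachsen", "SACHSEN")):
    """Return (line_number, line) pairs that still mention the old state."""
    text = Path(script_path).read_text(encoding="utf-8")
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(name in line for name in old_names):
            stale.append((lineno, line.strip()))
    return stale


# Usage (path assumed from Step 1); any hits need a manual edit:
# for lineno, line in find_stale_refs("scripts/scrapers/harvest_isil_museum_STATE_NAME.py"):
#     print(f"line {lineno}: {line}")
```

Note that `SACHSEN_URL` will still show up here until you apply the manual edits below, since `sed` matching is case-sensitive.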
**Manual Edits Required**:

1. Update `SACHSEN_URL` to point to your state:

   ```python
   # Before:
   SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Sachsen"

   # After:
   STATE_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"  # Example: Bayern
   ```

2. Update the region in `convert_to_linkml()`:

   ```python
   # Before:
   "region": "Sachsen"

   # After:
   "region": "Bayern"  # Use the official German state name
   ```

### Step 2: Extract Museums (2 minutes)

```bash
# Run the state museum scraper
python3 scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Output: data/isil/germany/state_name_museums_YYYYMMDD_HHMMSS.json
```

**Expected Results**:

- 200-1,500 museums (depends on state size)
- 100% ISIL coverage (DE-MUS-* codes)
- 100% name/city coverage
- 0% address coverage (requires detail page scraping)

### Step 3: Extract Foundation Dataset (30-60 minutes)

**Option A: Manual Web Research** (recommended for the first iteration)

1. Search for state archives (e.g., "Bayerisches Hauptstaatsarchiv")
2. Visit the official websites
3. Extract contact info, ISIL codes, and descriptions
4. Create a JSON file: `data/isil/germany/state_name_archives_YYYYMMDD_HHMMSS.json`

**Option B: Automated Scraping** (if an API or structured data is available)

1. Create a custom scraper for the state archive portal
2. Parse the institution listings
3. Extract metadata
4. Export to JSON
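With either option, each foundation record should land in the JSON shape the merge step expects. A hypothetical entry is sketched below — field names follow the LinkML schema referenced later in this document, while the `id` slug and ISIL value are illustrative, not verified data:

```python
import json

# Illustrative foundation record -- slug and ISIL are example values only.
record = {
    "id": "https://w3id.org/heritage/custodian/de/bayerische-staatsbibliothek",
    "name": "Bayerische Staatsbibliothek",
    "institution_type": "LIBRARY",
    "locations": [{"city": "München", "region": "Bayern", "country": "DE"}],
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "DE-12"}],
}

# ensure_ascii=False keeps umlauts (München) readable in the output file.
print(json.dumps(record, ensure_ascii=False, indent=2))
```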
**Foundation Institutions to Target**:

- State archives (Staatsarchiv, Landesarchiv)
- State library (Staatsbibliothek, if separate from the state archive)
- Major university libraries (technical universities, research universities)
- Specialized research libraries

### Step 4: Merge Datasets (2 minutes)

```bash
# Copy the merge template
cp scripts/merge_sachsen_complete.py scripts/merge_STATE_NAME_complete.py

# Update state references
# macOS (Linux: drop the '' after -i):
sed -i '' 's/sachsen/state_name/g' scripts/merge_STATE_NAME_complete.py
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/merge_STATE_NAME_complete.py

# Run the merge
python3 scripts/merge_STATE_NAME_complete.py

# Output: data/isil/germany/state_name_complete_YYYYMMDD_HHMMSS.json
```

---

## German States Priority List

### High Priority (Large States, High Institution Count)

| State | German Name | Estimated Institutions | Difficulty |
|-------|-------------|------------------------|------------|
| Bavaria | Bayern | 1,200-1,500 | Medium |
| Baden-Württemberg | Baden-Württemberg | 1,000-1,200 | Medium |
| Lower Saxony | Niedersachsen | 800-1,000 | Medium |
| Hesse | Hessen | 500-700 | Easy |
| Rhineland-Palatinate | Rheinland-Pfalz | 400-600 | Easy |

### Medium Priority

| State | German Name | Estimated Institutions | Difficulty |
|-------|-------------|------------------------|------------|
| Berlin | Berlin | 300-400 | Easy |
| Brandenburg | Brandenburg | 300-400 | Easy |
| Schleswig-Holstein | Schleswig-Holstein | 250-350 | Easy |
| Mecklenburg-Vorpommern | Mecklenburg-Vorpommern | 200-300 | Easy |

### Already Complete ✅

| State | Status | Institutions | ISIL Coverage |
|-------|--------|--------------|---------------|
| Saxony | ✅ COMPLETE | 411 | 99.8% |
| Thuringia | ✅ COMPLETE | 1,061 | 97.8% |
| Saxony-Anhalt | ✅ COMPLETE | 317 | 98.4% |
| North Rhine-Westphalia | ✅ COMPLETE | 1,893 | 99.2% |

---
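The Step 4 merge above amounts to concatenating the foundation and museum files and de-duplicating on ISIL. A minimal sketch of that logic — the record shape with an `identifiers` list follows the LinkML schema described later; the real implementation lives in `merge_sachsen_complete.py` and may differ in detail:

```python
import json


def isil_of(record):
    """Return the record's ISIL code, or None if it has no ISIL identifier."""
    for ident in record.get("identifiers", []):
        if ident.get("identifier_scheme") == "ISIL":
            return ident.get("identifier_value")
    return None


def merge_datasets(foundation, museums):
    """Concatenate both datasets; foundation records win on duplicate ISILs."""
    merged, seen = [], set()
    for record in foundation + museums:
        isil = isil_of(record)
        if isil and isil in seen:
            continue  # duplicate ISIL -- keep the earlier (foundation) record
        if isil:
            seen.add(isil)
        merged.append(record)
    return merged


# Usage sketch (file names assumed from Steps 2-3):
# with open("data/isil/germany/state_name_archives_....json", encoding="utf-8") as f:
#     foundation = json.load(f)
# with open("data/isil/germany/state_name_museums_....json", encoding="utf-8") as f:
#     museums = json.load(f)
# merged = merge_datasets(foundation, museums)
```

Records without any ISIL are kept as-is, since there is nothing to de-duplicate on.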
## Data Quality Expectations

### Phase 1 (Foundation Dataset)

| Field | Expected Coverage |
|-------|-------------------|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 80-100% |
| Postal Code | 80-100% |
| Phone | 80-100% |
| Email | 60-80% |
| Website | 90-100% |
| ISIL Code | 90-100% |
| Description | 100% |

**Average Completeness**: 80-90%

### Phase 2 (Museum Registry)

| Field | Expected Coverage |
|-------|-------------------|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 0% (requires detail page scraping) |
| Postal Code | 0% (requires detail page scraping) |
| Phone | 0% (requires detail page scraping) |
| Email | 0% (requires detail page scraping) |
| Website | 0% (requires detail page scraping) |
| ISIL Code | 100% |
| Description | 100% (generic) |

**Average Completeness**: 40-50% (basic extraction)

### Phase 3 (Optional Enrichment)

**Detail Page Scraping** (adds 2-3 hours):

- Extract addresses, phone, email, and website from individual museum pages
- Expected completeness gain: 40% → 75%

**Wikidata Enrichment** (adds 1-2 hours):

- SPARQL query for state museums
- Fuzzy match against the extracted museums
- Expected Wikidata coverage: 0% → 50-60%

---

## Validation Checklist

Before marking a state as "COMPLETE", verify:

- [ ] Foundation dataset created (10-20 institutions)
- [ ] Museums extracted from isil.museum (200+ institutions)
- [ ] Datasets merged into `state_name_complete_*.json`
- [ ] ISIL coverage >95%
- [ ] Core field completeness 100% (name, type, city)
- [ ] Geographic distribution analyzed (city counts)
- [ ] Metadata completeness report generated
- [ ] LinkML schema validation passed
- [ ] Session summary documented

---

## Example Workflow: Bavaria (Bayern)

### Step 1: Museum Extraction

```bash
# Create the scraper
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# Update references (macOS sed shown)
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
```

Then manually update the URL and the region in the copied scraper:
```python
# Line 27:
SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"

# Line 139:
"region": "Bayern"
```

```bash
# Run the extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# Expected output: ~1,200 Bavarian museums
```

### Step 2: Foundation Dataset

**Bavarian Foundation Institutions** (research manually):

1. **Bavarian State Archives**
   - Hauptstaatsarchiv München
   - Staatsarchiv Amberg
   - Staatsarchiv Augsburg
   - Staatsarchiv Bamberg
   - Staatsarchiv Coburg
   - Staatsarchiv Landshut
   - Staatsarchiv Nürnberg
   - Staatsarchiv Würzburg
2. **Major Libraries**
   - Bayerische Staatsbibliothek (Munich)
   - Universitätsbibliothek München (LMU)
   - Universitätsbibliothek der TU München
   - Universitätsbibliothek Würzburg
   - Universitätsbibliothek Erlangen-Nürnberg
   - Universitätsbibliothek Regensburg

**Total**: ~14 foundation institutions

### Step 3: Merge

```bash
# Create the merge script
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py

# Update references
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# Run the merge
python3 scripts/merge_bayern_complete.py

# Expected output: ~1,214 Bavarian institutions (14 + 1,200)
```

---

## Troubleshooting

### Problem: "No museums found in HTML"

**Cause**: State name not recognized by the isil.museum registry

**Solution**:

1. Manually browse to http://www.museen-in-deutschland.de
2. Search for the state name in the dropdown menu
3. Copy the exact search parameter from the URL (e.g., `suchbegriff=Baden-W%C3%BCrttemberg`)
4. Update `STATE_URL` in the scraper with the correctly URL-encoded parameter

### Problem: "ISIL coverage <95%"

**Cause**: Some foundation institutions may not have ISIL codes

**Solution**:

1. Check the SIGEL database: https://sigel.staatsbibliothek-berlin.de
2. Search for the missing institutions
3. Manually add ISIL codes where found
4. Mark as "ISIL_not_assigned" if genuinely missing
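The >95% threshold from the Validation Checklist can be verified mechanically rather than by eyeballing the merged file. A small sketch, assuming the `identifiers` record shape from the LinkML schema section:

```python
def isil_coverage(records):
    """Percentage of records carrying at least one ISIL identifier."""
    if not records:
        return 0.0
    with_isil = sum(
        1
        for r in records
        if any(i.get("identifier_scheme") == "ISIL" for i in r.get("identifiers", []))
    )
    return 100.0 * with_isil / len(records)


# Example gate before marking a state COMPLETE:
# assert isil_coverage(records) > 95, "ISIL coverage below the 95% target"
```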
### Problem: "Merge fails with FileNotFoundError"

**Cause**: Museum extraction output not found

**Solution**:

1. Verify the museum scraper ran successfully: `ls -la data/isil/germany/*museums*.json`
2. Check the scraper output for errors
3. Ensure the output directory exists: `mkdir -p data/isil/germany`
4. Re-run the museum extraction if needed

### Problem: "City names have special characters"

**Cause**: German umlauts and special characters in city names

**Solution**:

- Keep the original German names (do not transliterate)
- Ensure UTF-8 encoding: `encoding='utf-8'` in all file operations
- Example: München (not Muenchen), Köln (not Koln), Düsseldorf (not Duesseldorf)

---

## Performance Benchmarks

| State | Institutions | Extraction Time | Merge Time | Total Time |
|-------|--------------|-----------------|------------|------------|
| Saxony | 411 | 5 seconds | 3 seconds | 8 seconds |
| Bavaria (est.) | 1,214 | 15 seconds | 5 seconds | 20 seconds |
| Baden-Württemberg (est.) | 1,100 | 12 seconds | 5 seconds | 17 seconds |

**Note**: Foundation dataset research adds 30-60 minutes of manual web research.

---

## LinkML Schema Compliance

All extracted records MUST conform to:

- `schemas/core.yaml` - HeritageCustodian, Location, Identifier classes
- `schemas/enums.yaml` - InstitutionTypeEnum (ARCHIVE, LIBRARY, MUSEUM)
- `schemas/provenance.yaml` - Provenance, data_source, data_tier

**Required Fields**:

```yaml
id: https://w3id.org/heritage/custodian/de/...
name:
institution_type: MUSEUM | LIBRARY | ARCHIVE
locations:
  - city:
    region:
    country: DE
identifiers:
  - identifier_scheme: ISIL
    identifier_value: DE-MUS-* | DE-* | ...
```
```yaml
provenance:
  data_source: WEB_SCRAPING
  data_tier: TIER_2_VERIFIED
  extraction_date:
  extraction_method:
  confidence_score: 0.90
  source_url:
```

---

## Success Metrics

### Minimum Viable Dataset

- ✅ Foundation dataset: 10+ institutions at 80%+ completeness
- ✅ Museums: 200+ institutions at 40%+ completeness
- ✅ ISIL coverage: >95%
- ✅ Core fields: 100% (name, type, city)

### High-Quality Dataset

- ✅ Foundation dataset: 15+ institutions at 90%+ completeness
- ✅ Museums: 500+ institutions at 50%+ completeness
- ✅ ISIL coverage: >98%
- ✅ Core fields: 100%
- ✅ Wikidata enrichment: >50% for major institutions

---

## Related Documentation

- `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md` - Full Saxony extraction case study
- `SAXONY_HARVEST_STRATEGY.md` - Strategic planning document
- `SESSION_SUMMARY_20251120_THUERINGEN_V4_COMPLETE.md` - Thuringia case study (enrichment example)
- `AGENTS.md` - AI agent instructions for extraction

---

## Contact & Support

**Questions?** Check the existing session summaries in the project root for similar extraction patterns.

**Bugs in the scraper?** The isil.museum HTML structure may change. If extraction fails:

1. Inspect the current HTML: `curl 'http://www.museen-in-deutschland.de/?t=liste&mode=land&suchbegriff=YOUR_STATE' > debug.html` (quote the URL so the shell does not treat `&` as a background operator)
2. Open `debug.html` in a browser
3. Update the BeautifulSoup selectors in the `parse_museum_table()` function
4. Test with a small sample before running the full extraction

---

**Pattern Status**: ✅ Production-ready (validated on Saxony with 99.8% ISIL coverage)
**Reusability**: High (copy-paste template with minimal edits)
**Scalability**: Excellent (handles 200-1,500 institutions per state)
**Maintenance**: Low (official registry rarely changes structure)
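As a final gate before marking a state COMPLETE, the "core field completeness 100%" item from the Validation Checklist can also be automated. A minimal sketch, assuming the record shape from the Required Fields block above:

```python
def missing_core_fields(record):
    """Return the names of missing core fields (name, institution_type, city)."""
    missing = []
    if not record.get("name"):
        missing.append("name")
    if record.get("institution_type") not in {"MUSEUM", "LIBRARY", "ARCHIVE"}:
        missing.append("institution_type")
    locations = record.get("locations", [])
    if not locations or not locations[0].get("city"):
        missing.append("city")
    return missing


# A dataset passes the core-field check when every record returns [].
# bad = [r for r in records if missing_core_fields(r)]
```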