# German State Extraction Pattern (Reusable Template)

**Last Updated:** 2025-11-20
**Status:** Production-ready pattern validated on Saxony (411 institutions)
## Overview
This document provides a copy-paste template for extracting heritage institutions from any German state (Bundesland) using the proven Saxony pattern.
**Success Rate:** 99.8% ISIL coverage, 100% core metadata completeness
**Time Required:** 1.5-2 hours per state (foundation + museums)
**Difficulty:** Easy (automated scraping, minimal manual curation)
## Two-Phase Extraction Strategy

### Phase 1: Foundation Dataset (Archives + Libraries)
Extract high-quality records from major state institutions:
- State archives (Staatsarchiv, Landesarchiv)
- State/university libraries (Staatsbibliothek, Universitätsbibliothek)
- Research libraries (specialized subject libraries)
**Target:** 10-20 institutions with 80%+ completeness
### Phase 2: Museum Registry Extraction
Extract comprehensive museum coverage from isil.museum:
- Official German museum ISIL registry (Institut für Museumsforschung)
- ~6,300 total German museums
- Filter by state name (Bundesland)
**Target:** 200-1,500 museums per state (varies by state size)
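The state filter at the heart of Phase 2 can be sketched as follows. This is illustrative only: the `bundesland` key is a hypothetical field name, standing in for whatever the real scraper parses out of the registry's HTML listing.

```python
# Sketch of the Phase 2 filter step. Rows are assumed to be dicts
# already parsed from the registry; "bundesland" is an illustrative
# field name, not the registry's actual schema.
rows = [
    {"name": "Stadtmuseum Dresden", "bundesland": "Sachsen"},
    {"name": "Deutsches Museum", "bundesland": "Bayern"},
    {"name": "Albertinum", "bundesland": "Sachsen"},
]

def filter_by_state(rows, state):
    """Keep only rows belonging to the given Bundesland."""
    return [r for r in rows if r.get("bundesland") == state]

saxony = filter_by_state(rows, "Sachsen")
print(len(saxony))  # 2
```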
## Quick Start Template

### Step 1: Create State Scraper (5 minutes)
```bash
# Replace STATE_NAME and state_name throughout
# Example: Bayern (Bavaria), Baden-Württemberg, etc.

# Copy the Saxony museum scraper template
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Update state references
# macOS:
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i '' 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Linux:
sed -i 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
```
**Manual Edits Required:**

1. Update `SACHSEN_URL` to point to your state:

   ```python
   # Before:
   SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Sachsen"
   # After (example: Bayern):
   STATE_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"
   ```

2. Update the region in `convert_to_linkml()`:

   ```python
   # Before:
   "region": "Sachsen"
   # After (use the official German state name):
   "region": "Bayern"
   ```
### Step 2: Extract Museums (2 minutes)
```bash
# Run the state museum scraper
python3 scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Output: data/isil/germany/state_name_museums_YYYYMMDD_HHMMSS.json
```
**Expected Results:**
- 200-1,500 museums (depends on state size)
- 100% ISIL coverage (DE-MUS-* codes)
- 100% name/city coverage
- 0% address coverage (requires detail page scraping)
### Step 3: Extract Foundation Dataset (30-60 minutes)

**Option A: Manual Web Research** (recommended for first iteration)
- Search for state archives (e.g., "Bayerisches Hauptstaatsarchiv")
- Visit official websites
- Extract contact info, ISIL codes, descriptions
- Create a JSON file: `data/isil/germany/state_name_archives_YYYYMMDD_HHMMSS.json`
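For reference, a single foundation record might look like the following sketch. Every value is a placeholder; the field names follow the LinkML shape shown later in this document.

```python
import json

# Illustrative foundation record; all values here are placeholders,
# not real institution data.
record = {
    "id": "https://w3id.org/heritage/custodian/de/example-archive",
    "name": "Beispiel-Staatsarchiv",
    "institution_type": "ARCHIVE",
    "locations": [{"city": "Dresden", "region": "Sachsen", "country": "DE"}],
    "identifiers": [{"identifier_scheme": "ISIL", "identifier_value": "DE-XXXX"}],
}

# ensure_ascii=False keeps umlauts readable in the JSON file
text = json.dumps([record], ensure_ascii=False, indent=2)
```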
**Option B: Automated Scraping** (if API/structured data available)
- Create custom scraper for state archive portal
- Parse institution listings
- Extract metadata
- Export to JSON
**Foundation Institutions to Target:**
- State archives (Staatsarchiv, Landesarchiv)
- State library (Staatsbibliothek, if separate from state archive)
- Major university libraries (technical universities, research universities)
- Specialized research libraries
### Step 4: Merge Datasets (2 minutes)

```bash
# Copy the merge template
cp scripts/merge_sachsen_complete.py scripts/merge_STATE_NAME_complete.py

# Update state references (macOS sed; drop the '' argument on Linux)
sed -i '' 's/sachsen/state_name/g' scripts/merge_STATE_NAME_complete.py
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/merge_STATE_NAME_complete.py

# Run the merge
python3 scripts/merge_STATE_NAME_complete.py

# Output: data/isil/germany/state_name_complete_YYYYMMDD_HHMMSS.json
```
## German States Priority List

### High Priority (Large States, High Institution Count)
| State | German Name | Estimated Institutions | Difficulty |
|---|---|---|---|
| Bavaria | Bayern | 1,200-1,500 | Medium |
| Baden-Württemberg | Baden-Württemberg | 1,000-1,200 | Medium |
| Lower Saxony | Niedersachsen | 800-1,000 | Medium |
| Hesse | Hessen | 500-700 | Easy |
| Rhineland-Palatinate | Rheinland-Pfalz | 400-600 | Easy |
### Medium Priority
| State | German Name | Estimated Institutions | Difficulty |
|---|---|---|---|
| Berlin | Berlin | 300-400 | Easy |
| Brandenburg | Brandenburg | 300-400 | Easy |
| Schleswig-Holstein | Schleswig-Holstein | 250-350 | Easy |
| Mecklenburg-Vorpommern | Mecklenburg-Vorpommern | 200-300 | Easy |
### Already Complete ✅
| State | Status | Institutions | ISIL Coverage |
|---|---|---|---|
| Saxony | ✅ COMPLETE | 411 | 99.8% |
| Thuringia | ✅ COMPLETE | 1,061 | 97.8% |
| Saxony-Anhalt | ✅ COMPLETE | 317 | 98.4% |
| North Rhine-Westphalia | ✅ COMPLETE | 1,893 | 99.2% |
## Data Quality Expectations

### Phase 1 (Foundation Dataset)
| Field | Expected Coverage |
|---|---|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 80-100% |
| Postal Code | 80-100% |
| Phone | 80-100% |
| Email | 60-80% |
| Website | 90-100% |
| ISIL Code | 90-100% |
| Description | 100% |
**Average Completeness:** 80-90%
### Phase 2 (Museum Registry)
| Field | Expected Coverage |
|---|---|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 0% (requires detail page scraping) |
| Postal Code | 0% (requires detail page scraping) |
| Phone | 0% (requires detail page scraping) |
| Email | 0% (requires detail page scraping) |
| Website | 0% (requires detail page scraping) |
| ISIL Code | 100% |
| Description | 100% (generic) |
**Average Completeness:** 40-50% (basic extraction)
### Phase 3 (Optional Enrichment)

**Detail Page Scraping** (adds 2-3 hours):
- Extract addresses, phone, email, website from individual museum pages
- Expected completeness gain: 40% → 75%
**Wikidata Enrichment** (adds 1-2 hours):
- SPARQL query for state museums
- Fuzzy match to extracted museums
- Expected Wikidata coverage: 0% → 50-60%
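The fuzzy-match step can be prototyped with the standard library's `difflib` before reaching for a dedicated matching library. A minimal sketch; the 0.85 threshold is a starting assumption to tune against your own false-positive rate, not a validated value:

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate name, or None if nothing
    clears the threshold. Sketch of the Wikidata fuzzy-match step."""
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best

print(best_match("Deutsches Hygiene-Museum",
                 ["Deutsches Hygiene-Museum Dresden", "Albertinum"]))
```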
## Validation Checklist
Before marking a state as "COMPLETE", verify:
- Foundation dataset created (10-20 institutions)
- Museums extracted from isil.museum (200+ institutions)
- Datasets merged into `state_name_complete_*.json`
- ISIL coverage >95%
- Core field completeness 100% (name, type, city)
- Geographic distribution analyzed (city counts)
- Metadata completeness report generated
- LinkML schema validation passed
- Session summary documented
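The ISIL- and core-field-coverage checks in this list are easy to script. A minimal sketch, assuming records in the LinkML shape used in this document:

```python
def coverage_report(records):
    """Compute the checklist's coverage numbers for a merged dataset.
    Sketch: assumes records follow the LinkML shape in this document."""
    total = len(records)

    def has_isil(r):
        return any(i.get("identifier_scheme") == "ISIL"
                   for i in r.get("identifiers", []))

    def has_core(r):
        # Core fields: name, institution type, and at least one city
        return bool(r.get("name")) and bool(r.get("institution_type")) \
            and any(loc.get("city") for loc in r.get("locations", []))

    return {
        "total": total,
        "isil_coverage": sum(map(has_isil, records)) / total,
        "core_coverage": sum(map(has_core, records)) / total,
    }
```

Run it over the merged file and verify `isil_coverage` exceeds 0.95 and `core_coverage` equals 1.0 before marking the state complete.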
## Example Workflow: Bavaria (Bayern)

### Step 1: Museum Extraction
```bash
# Create the scraper
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# Update references (macOS sed; drop the '' argument on Linux)
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# Manually update the URL and region:
# Line 27:  SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"
# Line 139: "region": "Bayern"

# Run the extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# Expected output: ~1,200 Bavarian museums
```
### Step 2: Foundation Dataset

**Bavarian Foundation Institutions** (research manually):
- **Bavarian State Archives**
  - Hauptstaatsarchiv München
  - Staatsarchiv Amberg
  - Staatsarchiv Augsburg
  - Staatsarchiv Bamberg
  - Staatsarchiv Coburg
  - Staatsarchiv Landshut
  - Staatsarchiv Nürnberg
  - Staatsarchiv Würzburg
- **Major Libraries**
  - Bayerische Staatsbibliothek (Munich)
  - Universitätsbibliothek München (LMU)
  - Universitätsbibliothek der TU München
  - Universitätsbibliothek Würzburg
  - Universitätsbibliothek Erlangen-Nürnberg
  - Universitätsbibliothek Regensburg
**Total:** ~14 foundation institutions
### Step 3: Merge

```bash
# Create the merge script
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py

# Update references (macOS sed; drop the '' argument on Linux)
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# Run the merge
python3 scripts/merge_bayern_complete.py

# Expected output: ~1,214 Bavarian institutions (14 foundation + 1,200 museums)
```
## Troubleshooting

### Problem: "No museums found in HTML"

**Cause:** State name not recognized by the isil.museum registry

**Solution:**
- Manually browse to http://www.museen-in-deutschland.de
- Search for the state name in the dropdown menu
- Copy the exact search parameter from the URL (e.g., `suchbegriff=Baden-W%C3%BCrttemberg`)
- Update `STATE_URL` in the scraper with the correctly URL-encoded parameter
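Python's standard library produces exactly this encoding, so you can generate the parameter instead of copying it from the browser:

```python
from urllib.parse import quote

# URL-encode the state name for the registry's query string;
# quote() percent-encodes non-ASCII characters as UTF-8 bytes.
state = "Baden-Württemberg"
url = f"http://www.museen-in-deutschland.de/?t=liste&mode=land&suchbegriff={quote(state)}"
print(url)  # ...&suchbegriff=Baden-W%C3%BCrttemberg
```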
### Problem: "ISIL coverage <95%"

**Cause:** Some foundation institutions may not have ISIL codes

**Solution:**
- Check SIGEL database: https://sigel.staatsbibliothek-berlin.de
- Search for missing institutions
- Manually add ISIL codes if found
- Mark as "ISIL_not_assigned" if genuinely missing
### Problem: "Merge fails with FileNotFoundError"

**Cause:** Museum extraction output not found

**Solution:**
- Verify the museum scraper ran successfully: `ls -la data/isil/germany/*museums*.json`
- Check the scraper output for errors
- Ensure the output directory exists: `mkdir -p data/isil/germany`
- Re-run the museum extraction if needed
### Problem: "City names have special characters"

**Cause:** German umlauts and special characters in city names

**Solution:**
- Keep the original German names (don't transliterate)
- Ensure UTF-8 encoding: pass `encoding='utf-8'` in all file operations
- Examples: München (not Muenchen), Köln (not Koeln), Düsseldorf (not Duesseldorf)
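A quick round-trip check confirms the encoding is handled correctly. A sketch that writes a throwaway file to the system temp directory:

```python
import json
import os
import tempfile

cities = ["München", "Köln", "Düsseldorf"]

# Pass encoding="utf-8" explicitly; the platform default may differ,
# and ensure_ascii=False keeps the umlauts as literal characters.
path = os.path.join(tempfile.gettempdir(), "cities_demo.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(cities, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    assert json.load(f) == cities  # umlauts survive the round trip
os.remove(path)
```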
## Performance Benchmarks
| State | Institutions | Extraction Time | Merge Time | Total Time |
|---|---|---|---|---|
| Saxony | 411 | 5 seconds | 3 seconds | 8 seconds |
| Bavaria (est.) | 1,214 | 15 seconds | 5 seconds | 20 seconds |
| Baden-Württemberg (est.) | 1,100 | 12 seconds | 5 seconds | 17 seconds |
*Note: Foundation dataset research adds 30-60 minutes of manual web research.*
## LinkML Schema Compliance

All extracted records MUST conform to:

- `schemas/core.yaml` - HeritageCustodian, Location, Identifier classes
- `schemas/enums.yaml` - InstitutionTypeEnum (ARCHIVE, LIBRARY, MUSEUM)
- `schemas/provenance.yaml` - Provenance, data_source, data_tier
**Required Fields:**

```yaml
id: https://w3id.org/heritage/custodian/de/...
name: <institution name>
institution_type: MUSEUM | LIBRARY | ARCHIVE
locations:
  - city: <city name>
    region: <state name>
    country: DE
identifiers:
  - identifier_scheme: ISIL
    identifier_value: DE-MUS-* | DE-* | ...
provenance:
  data_source: WEB_SCRAPING
  data_tier: TIER_2_VERIFIED
  extraction_date: <ISO 8601 timestamp>
  extraction_method: <description>
  confidence_score: 0.90
  source_url: <URL>
```
## Success Metrics

### Minimum Viable Dataset
- ✅ Foundation dataset: 10+ institutions at 80%+ completeness
- ✅ Museums: 200+ institutions at 40%+ completeness
- ✅ ISIL coverage: >95%
- ✅ Core fields: 100% (name, type, city)
### High-Quality Dataset
- ✅ Foundation dataset: 15+ institutions at 90%+ completeness
- ✅ Museums: 500+ institutions at 50%+ completeness
- ✅ ISIL coverage: >98%
- ✅ Core fields: 100%
- ✅ Wikidata enrichment: >50% for major institutions
## Related Documentation

- `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md` - Full Saxony extraction case study
- `SAXONY_HARVEST_STRATEGY.md` - Strategic planning document
- `SESSION_SUMMARY_20251120_THUERINGEN_V4_COMPLETE.md` - Thuringia case study (enrichment example)
- `AGENTS.md` - AI agent instructions for extraction
## Contact & Support

**Questions?** Check the existing session summaries in the project root for similar extraction patterns.

**Bugs in the scraper?** The isil.museum HTML structure may change. If extraction fails:

- Inspect the current HTML: `curl 'http://www.museen-in-deutschland.de/?t=liste&mode=land&suchbegriff=YOUR_STATE' > debug.html` (quote the URL so the shell doesn't split it on `&`)
- Open `debug.html` in a browser
- Update the BeautifulSoup selectors in the `parse_museum_table()` function
- Test with a small sample before running the full extraction
**Pattern Status:** ✅ Production-ready (validated on Saxony with 99.8% ISIL coverage)
**Reusability:** High (copy-paste template with minimal edits)
**Scalability:** Excellent (handles 200-1,500 institutions per state)
**Maintenance:** Low (official registry rarely changes structure)