# German State Extraction Pattern (Reusable Template)
**Last Updated**: 2025-11-20
**Status**: Production-ready pattern validated on Saxony (411 institutions)

---

## Overview

This document provides a **copy-paste template** for extracting heritage institutions from any German state (Bundesland) using the proven Saxony pattern.

**Success Rate**: 99.8% ISIL coverage, 100% core metadata completeness
**Time Required**: 1.5-2 hours per state (foundation + museums)
**Difficulty**: Easy (automated scraping, minimal manual curation)

---

## Two-Phase Extraction Strategy

### Phase 1: Foundation Dataset (Archives + Libraries)

Extract **high-quality records** from major state institutions:

- State archives (Staatsarchiv, Landesarchiv)
- State/university libraries (Staatsbibliothek, Universitätsbibliothek)
- Research libraries (specialized subject libraries)

**Target**: 10-20 institutions with 80%+ completeness

### Phase 2: Museum Registry Extraction

Extract **comprehensive museum coverage** from isil.museum:

- Official German museum ISIL registry (Institut für Museumsforschung)
- ~6,300 total German museums
- Filter by state name (Bundesland)

**Target**: 200-1,500 museums per state (varies by state size)

---

## Quick Start Template

### Step 1: Create State Scraper (5 minutes)

```bash
# Replace STATE_NAME and state_name throughout
# Example: Bayern (Bavaria), Baden-Württemberg, etc.

# Copy Saxony museum scraper template
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Update state references
# macOS:
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i '' 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Linux:
sed -i 's/Sachsen/STATE_NAME/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
sed -i 's/sachsen/state_name/g' scripts/scrapers/harvest_isil_museum_STATE_NAME.py
```

**Manual Edits Required**:

1. Update `SACHSEN_URL` to point to your state:

   ```python
   # Before:
   SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Sachsen"

   # After:
   STATE_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"  # Example: Bayern
   ```

2. Update region in `convert_to_linkml()`:

   ```python
   # Before:
   "region": "Sachsen"

   # After:
   "region": "Bayern"  # Use official German state name
   ```

### Step 2: Extract Museums (2 minutes)

```bash
# Run state museum scraper
python3 scripts/scrapers/harvest_isil_museum_STATE_NAME.py

# Output: data/isil/germany/state_name_museums_YYYYMMDD_HHMMSS.json
```

**Expected Results**:

- 200-1,500 museums (depends on state size)
- 100% ISIL coverage (DE-MUS-* codes)
- 100% name/city coverage
- 0% address coverage (requires detail page scraping)
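
A quick sanity check of the scraper output can confirm these expectations before merging. This is a hedged sketch: it assumes the output JSON is a list of LinkML-style records with `name`, `locations`, and `identifiers` keys as shown later in this document; adjust the key paths if your scraper emits a different shape.

```python
import json  # used in the usage sketch below

def sanity_check(records):
    """Return a list of (problem, record) pairs mirroring the expected results above."""
    problems = []
    for rec in records:
        if not rec.get("name"):
            problems.append(("missing name", rec))
        # First location's city should always be present
        if not rec.get("locations") or not rec["locations"][0].get("city"):
            problems.append(("missing city", rec))
        # Museums from isil.museum should carry a DE-MUS-* ISIL code
        isils = [i.get("identifier_value", "")
                 for i in rec.get("identifiers", [])
                 if i.get("identifier_scheme") == "ISIL"]
        if not any(v.startswith("DE-MUS-") for v in isils):
            problems.append(("no DE-MUS-* ISIL", rec))
    return problems

# Usage (path is an example):
# records = json.load(open("data/isil/germany/state_name_museums_20251120_120000.json",
#                          encoding="utf-8"))
# assert not sanity_check(records)
```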

### Step 3: Extract Foundation Dataset (30-60 minutes)

**Option A: Manual Web Research** (recommended for first iteration)

1. Search for state archives (e.g., "Bayerisches Hauptstaatsarchiv")
2. Visit official websites
3. Extract contact info, ISIL codes, descriptions
4. Create JSON file: `data/isil/germany/state_name_archives_YYYYMMDD_HHMMSS.json`

**Option B: Automated Scraping** (if API/structured data available)

1. Create custom scraper for state archive portal
2. Parse institution listings
3. Extract metadata
4. Export to JSON

**Foundation Institutions to Target**:

- State archives (Staatsarchiv, Landesarchiv)
- State library (Staatsbibliothek, if separate from state archive)
- Major university libraries (technical universities, research universities)
- Specialized research libraries

### Step 4: Merge Datasets (2 minutes)

```bash
# Copy merge template
cp scripts/merge_sachsen_complete.py scripts/merge_STATE_NAME_complete.py

# Update state references
# macOS (on Linux, omit the '' after -i):
sed -i '' 's/sachsen/state_name/g' scripts/merge_STATE_NAME_complete.py
sed -i '' 's/Sachsen/STATE_NAME/g' scripts/merge_STATE_NAME_complete.py

# Run merge
python3 scripts/merge_STATE_NAME_complete.py

# Output: data/isil/germany/state_name_complete_YYYYMMDD_HHMMSS.json
```
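
The merge script itself is not reproduced here; the following is a minimal sketch of its core step under stated assumptions: both inputs are JSON lists of records shaped like the LinkML example later in this document, deduplication is keyed on the ISIL value, and foundation records win on collisions because they are richer. Helper names and file paths are illustrative.

```python
import json  # used in the usage sketch below

def primary_isil(record):
    """Return the record's first ISIL identifier value, or None."""
    for ident in record.get("identifiers", []):
        if ident.get("identifier_scheme") == "ISIL":
            return ident.get("identifier_value")
    return None

def merge_datasets(foundation, museums):
    """Merge two record lists; foundation records overwrite museum stubs on ISIL collisions."""
    merged = {}
    for rec in museums + foundation:  # foundation processed last => wins
        key = primary_isil(rec) or rec.get("name")
        merged[key] = rec
    return list(merged.values())

# Usage sketch (file names are examples):
# foundation = json.load(open("data/isil/germany/state_name_archives_20251120.json", encoding="utf-8"))
# museums = json.load(open("data/isil/germany/state_name_museums_20251120.json", encoding="utf-8"))
# with open("data/isil/germany/state_name_complete_20251120.json", "w", encoding="utf-8") as fh:
#     json.dump(merge_datasets(foundation, museums), fh, ensure_ascii=False, indent=2)
```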

---

## German States Priority List

### High Priority (Large States, High Institution Count)

| State | German Name | Estimated Institutions | Difficulty |
|-------|-------------|------------------------|------------|
| Bavaria | Bayern | 1,200-1,500 | Medium |
| Baden-Württemberg | Baden-Württemberg | 1,000-1,200 | Medium |
| Lower Saxony | Niedersachsen | 800-1,000 | Medium |
| Hesse | Hessen | 500-700 | Easy |
| Rhineland-Palatinate | Rheinland-Pfalz | 400-600 | Easy |

### Medium Priority

| State | German Name | Estimated Institutions | Difficulty |
|-------|-------------|------------------------|------------|
| Berlin | Berlin | 300-400 | Easy |
| Brandenburg | Brandenburg | 300-400 | Easy |
| Schleswig-Holstein | Schleswig-Holstein | 250-350 | Easy |
| Mecklenburg-Vorpommern | Mecklenburg-Vorpommern | 200-300 | Easy |

### Already Complete ✅

| State | Status | Institutions | ISIL Coverage |
|-------|--------|--------------|---------------|
| Saxony | ✅ COMPLETE | 411 | 99.8% |
| Thuringia | ✅ COMPLETE | 1,061 | 97.8% |
| Saxony-Anhalt | ✅ COMPLETE | 317 | 98.4% |
| North Rhine-Westphalia | ✅ COMPLETE | 1,893 | 99.2% |

---

## Data Quality Expectations

### Phase 1 (Foundation Dataset)

| Field | Expected Coverage |
|-------|-------------------|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 80-100% |
| Postal Code | 80-100% |
| Phone | 80-100% |
| Email | 60-80% |
| Website | 90-100% |
| ISIL Code | 90-100% |
| Description | 100% |

**Average Completeness**: 80-90%
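
Coverage tables like the one above can be generated mechanically. A sketch of a per-field completeness report, assuming the LinkML-style record shape used in this document (the field accessors are illustrative and should be extended to match your schema):

```python
def field_coverage(records, getters):
    """Percentage of records with a non-empty value for each field accessor."""
    total = len(records) or 1  # avoid division by zero on empty input
    report = {}
    for field, getter in getters.items():
        filled = sum(1 for r in records if getter(r))
        report[field] = round(100 * filled / total, 1)
    return report

# Field accessors for the record shape used in this project (adjust as needed)
GETTERS = {
    "name": lambda r: r.get("name"),
    "city": lambda r: (r.get("locations") or [{}])[0].get("city"),
    "isil": lambda r: any(i.get("identifier_scheme") == "ISIL"
                          for i in r.get("identifiers", [])),
}

# Usage: field_coverage(records, GETTERS) returns e.g. a dict of percentages per field
```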

### Phase 2 (Museum Registry)

| Field | Expected Coverage |
|-------|-------------------|
| Name | 100% |
| Institution Type | 100% |
| City | 100% |
| Street Address | 0% (requires detail page scraping) |
| Postal Code | 0% (requires detail page scraping) |
| Phone | 0% (requires detail page scraping) |
| Email | 0% (requires detail page scraping) |
| Website | 0% (requires detail page scraping) |
| ISIL Code | 100% |
| Description | 100% (generic) |

**Average Completeness**: 40-50% (basic extraction)

### Phase 3 (Optional Enrichment)

**Detail Page Scraping** (adds 2-3 hours):

- Extract addresses, phone, email, website from individual museum pages
- Expected completeness gain: 40% → 75%

**Wikidata Enrichment** (adds 1-2 hours):

- SPARQL query for state museums
- Fuzzy match to extracted museums
- Expected Wikidata coverage: 0% → 50-60%
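
The fuzzy-match step can be done with the standard library alone. A sketch using `difflib.SequenceMatcher`; the 0.85 threshold and whitespace normalization are assumptions to tune against your data, not part of the validated pattern:

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and collapse whitespace for comparison."""
    return " ".join(name.lower().split())

def best_match(museum_name, wikidata_labels, threshold=0.85):
    """Return (label, score) of the closest Wikidata label, or None if below threshold."""
    target = normalize(museum_name)
    best, best_score = None, 0.0
    for label in wikidata_labels:
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if score > best_score:
            best, best_score = label, score
    return (best, best_score) if best_score >= threshold else None
```

In practice, review matches just above the threshold manually before writing Wikidata QIDs into the dataset.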

---

## Validation Checklist

Before marking a state as "COMPLETE", verify:

- [ ] Foundation dataset created (10-20 institutions)
- [ ] Museums extracted from isil.museum (200+ institutions)
- [ ] Datasets merged into `state_name_complete_*.json`
- [ ] ISIL coverage >95%
- [ ] Core field completeness 100% (name, type, city)
- [ ] Geographic distribution analyzed (city counts)
- [ ] Metadata completeness report generated
- [ ] LinkML schema validation passed
- [ ] Session summary documented

---

## Example Workflow: Bavaria (Bayern)

### Step 1: Museum Extraction

```bash
# Create scraper
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# Update references (macOS sed syntax; on Linux, omit the '' after -i)
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# Manually update URL and region
# Line 27: SACHSEN_URL = f"{BASE_URL}/?t=liste&mode=land&suchbegriff=Bayern"
# Line 139: "region": "Bayern"

# Run extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# Expected output: ~1,200 Bavarian museums
```
### Step 2: Foundation Dataset
|
|
|
|
**Bavarian Foundation Institutions** (research manually):
|
|
|
|
1. **Bavarian State Archives**
|
|
- Hauptstaatsarchiv München
|
|
- Staatsarchiv Amberg
|
|
- Staatsarchiv Augsburg
|
|
- Staatsarchiv Bamberg
|
|
- Staatsarchiv Coburg
|
|
- Staatsarchiv Landshut
|
|
- Staatsarchiv Nürnberg
|
|
- Staatsarchiv Würzburg
|
|
|
|
2. **Major Libraries**
|
|
- Bayerische Staatsbibliothek (Munich)
|
|
- Universitätsbibliothek München (LMU)
|
|
- Universitätsbibliothek der TU München
|
|
- Universitätsbibliothek Würzburg
|
|
- Universitätsbibliothek Erlangen-Nürnberg
|
|
- Universitätsbibliothek Regensburg
|
|
|
|
**Total**: ~14 foundation institutions
|
|
|
|

### Step 3: Merge

```bash
# Create merge script
cp scripts/merge_sachsen_complete.py scripts/merge_bayern_complete.py

# Update references (macOS sed syntax; on Linux, omit the '' after -i)
sed -i '' 's/sachsen/bayern/g' scripts/merge_bayern_complete.py
sed -i '' 's/Sachsen/Bayern/g' scripts/merge_bayern_complete.py

# Run merge
python3 scripts/merge_bayern_complete.py

# Expected output: ~1,214 Bavarian institutions (14 + 1,200)
```

---

## Troubleshooting

### Problem: "No museums found in HTML"

**Cause**: State name not recognized by the isil.museum registry

**Solution**:

1. Manually browse to http://www.museen-in-deutschland.de
2. Search for the state name in the dropdown menu
3. Copy the exact search parameter from the URL (e.g., `suchbegriff=Baden-W%C3%BCrttemberg`)
4. Update `STATE_URL` in the scraper with the correct URL-encoded parameter

### Problem: "ISIL coverage <95%"

**Cause**: Some foundation institutions may not have ISIL codes

**Solution**:

1. Check the SIGEL database: https://sigel.staatsbibliothek-berlin.de
2. Search for the missing institutions
3. Manually add ISIL codes if found
4. Mark as "ISIL_not_assigned" if genuinely missing

### Problem: "Merge fails with FileNotFoundError"

**Cause**: Museum extraction output not found

**Solution**:

1. Verify the museum scraper ran successfully: `ls -la data/isil/germany/*museums*.json`
2. Check the scraper output for errors
3. Ensure the output directory exists: `mkdir -p data/isil/germany`
4. Re-run the museum extraction if needed

### Problem: "City names have special characters"

**Cause**: German umlauts and special characters in city names

**Solution**:

- Keep original German names (don't transliterate)
- Ensure UTF-8 encoding: `encoding='utf-8'` in all file operations
- Example: München (not Muenchen), Köln (not Koln), Düsseldorf (not Duesseldorf)
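
In Python, also pass `ensure_ascii=False` when writing JSON; otherwise the standard library escapes umlauts to `\uXXXX` sequences even when the file encoding is UTF-8:

```python
import json

record = {"name": "Stadtmuseum München", "city": "München"}

# Default behaviour: non-ASCII characters are escaped (e.g. M\u00fcnchen)
escaped = json.dumps(record)

# With ensure_ascii=False the umlauts stay readable
readable = json.dumps(record, ensure_ascii=False)

# Writing a dataset file: combine both encoding settings
with open("example.json", "w", encoding="utf-8") as fh:
    json.dump(record, fh, ensure_ascii=False, indent=2)
```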

---

## Performance Benchmarks

| State | Institutions | Extraction Time | Merge Time | Total Time |
|-------|--------------|-----------------|------------|------------|
| Saxony | 411 | 5 seconds | 3 seconds | 8 seconds |
| Bavaria (est.) | 1,214 | 15 seconds | 5 seconds | 20 seconds |
| Baden-Württemberg (est.) | 1,100 | 12 seconds | 5 seconds | 17 seconds |

**Note**: Foundation dataset research adds 30-60 minutes (manual web research)

---

## LinkML Schema Compliance

All extracted records MUST conform to:

- `schemas/core.yaml` - HeritageCustodian, Location, Identifier classes
- `schemas/enums.yaml` - InstitutionTypeEnum (ARCHIVE, LIBRARY, MUSEUM)
- `schemas/provenance.yaml` - Provenance, data_source, data_tier

**Required Fields**:

```yaml
id: https://w3id.org/heritage/custodian/de/...
name: <institution name>
institution_type: MUSEUM | LIBRARY | ARCHIVE
locations:
  - city: <city name>
    region: <state name>
    country: DE
identifiers:
  - identifier_scheme: ISIL
    identifier_value: DE-MUS-* | DE-* | ...
provenance:
  data_source: WEB_SCRAPING
  data_tier: TIER_2_VERIFIED
  extraction_date: <ISO 8601 timestamp>
  extraction_method: <description>
  confidence_score: 0.90
  source_url: <URL>
```
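
Before running the full LinkML toolchain, a lightweight pre-check against the required fields above can catch obvious gaps early. This is a plain-Python sketch under the record shape shown above, not a replacement for real schema validation:

```python
# Top-level fields required by the record shape above
REQUIRED_TOP = ("id", "name", "institution_type", "locations", "identifiers", "provenance")
VALID_TYPES = {"MUSEUM", "LIBRARY", "ARCHIVE"}  # InstitutionTypeEnum values used here

def missing_fields(record):
    """Return a list of human-readable problems for one record."""
    problems = [f"missing {f}" for f in REQUIRED_TOP if not record.get(f)]
    if record.get("institution_type") not in VALID_TYPES:
        problems.append("institution_type not in InstitutionTypeEnum")
    for loc in record.get("locations", []):
        if loc.get("country") != "DE":
            problems.append("location country != DE")
    return problems

# Usage: report = {r.get("id"): missing_fields(r) for r in records if missing_fields(r)}
```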

---

## Success Metrics

### Minimum Viable Dataset

- ✅ Foundation dataset: 10+ institutions at 80%+ completeness
- ✅ Museums: 200+ institutions at 40%+ completeness
- ✅ ISIL coverage: >95%
- ✅ Core fields: 100% (name, type, city)

### High-Quality Dataset

- ✅ Foundation dataset: 15+ institutions at 90%+ completeness
- ✅ Museums: 500+ institutions at 50%+ completeness
- ✅ ISIL coverage: >98%
- ✅ Core fields: 100%
- ✅ Wikidata enrichment: >50% for major institutions

---

## Related Documentation

- `SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md` - Full Saxony extraction case study
- `SAXONY_HARVEST_STRATEGY.md` - Strategic planning document
- `SESSION_SUMMARY_20251120_THUERINGEN_V4_COMPLETE.md` - Thuringia case study (enrichment example)
- `AGENTS.md` - AI agent instructions for extraction

---

## Contact & Support

**Questions?** Check existing session summaries in the project root for similar extraction patterns.

**Bugs in scraper?** The isil.museum HTML structure may change. If extraction fails:

1. Inspect the current HTML: `curl 'http://www.museen-in-deutschland.de/?t=liste&mode=land&suchbegriff=YOUR_STATE' > debug.html` (quote the URL, or the shell treats `&` as a background operator)
2. Open `debug.html` in a browser
3. Update the BeautifulSoup selectors in the `parse_museum_table()` function
4. Test with a small sample before running the full extraction
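
The production scraper uses BeautifulSoup; for quick structural debugging of a saved `debug.html` without extra dependencies, a stdlib `html.parser` sketch that dumps table cell text row by row can help you see what the selectors should target (the assumption that the listing is a plain `<table>` of `<td>` cells should be checked against the real page):

```python
from html.parser import HTMLParser

class TableCellDumper(HTMLParser):
    """Collect stripped text found inside <td> cells, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh row
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)  # keep only non-empty rows

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# Usage:
# parser = TableCellDumper()
# parser.feed(open("debug.html", encoding="utf-8").read())
# print(parser.rows[:5])  # eyeball the structure before fixing the real selectors
```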

---

**Pattern Status**: ✅ Production-ready (validated on Saxony with 99.8% ISIL coverage)
**Reusability**: High (copy-paste template with minimal edits)
**Scalability**: Excellent (handles 200-1,500 institutions per state)
**Maintenance**: Low (official registry rarely changes structure)