glam/SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md
2025-11-21 22:12:33 +01:00

20 KiB

Session Summary: Saxony Museums Complete (2025-11-20)

Agent: OpenCode AI Assistant
Date: 2025-11-20
Session Goal: Extract Saxony museums from official German registry to complete Saxony dataset
Status: COMPLETE - 411 total Saxony institutions (99.8% ISIL coverage)


Executive Summary

Successfully extracted 399 Saxony museums from the official German museum ISIL registry (isil.museum), bringing the total Saxony dataset to 411 institutions (6 archives + 6 libraries + 399 museums).

Key Achievements

  1. Discovered Official Source - Institut für Museumsforschung registry (http://www.museen-in-deutschland.de)
  2. Extracted 399 Museums - Complete Saxony museum coverage with 100% ISIL codes
  3. Created Reusable Scraper - harvest_isil_museum_sachsen.py for reproducible extraction
  4. Merged Complete Dataset - 411 institutions across 213 Saxony cities
  5. 99.8% ISIL Coverage - Industry-leading identifier coverage

What We Did

1. Source Discovery

Abandoned Source: museums.eu

  • Reason: Broken regional filter (returned incorrect regions)
  • Time Lost: ~30 minutes

Breakthrough: isil.museum Registry

2. Museum Extraction

Script Created: scripts/scrapers/harvest_isil_museum_sachsen.py

Features:

  • Parses HTML table from isil.museum registry
  • Extracts: ISIL code, city, museum name, detail page URL
  • Converts to LinkML-compliant HeritageCustodian format
  • Generates geographic distribution report
  • Outputs metadata completeness analysis

Extraction Results:

Total Museums Extracted: 399
HTTP Response: 200 OK (70,800 bytes)
Processing Time: ~3 seconds
Output File: data/isil/germany/sachsen_museums_20251120_153233.json
File Size: 576,409 bytes

Data Quality:

Field Coverage
Name 100.0% (399/399)
City 100.0% (399/399)
ISIL Code 100.0% (399/399)
Detail URL 100.0% (399/399)
Address 0% (available via detail pages)
Phone/Email 0% (available via detail pages)

3. Dataset Merging

Script Updated: scripts/merge_sachsen_complete.py

Sources Merged:

  1. Saxon State Archives (6 institutions)
  2. SLUB Dresden (1 institution)
  3. Saxon University Libraries (5 institutions)
  4. NEW: Saxony Museums (399 institutions)

Final Dataset: data/isil/germany/sachsen_complete_20251120_153257.json

  • Total: 411 institutions
  • Size: 640,831 bytes
  • Cities: 213 unique Saxony cities
  • ISIL Coverage: 99.8% (410/411 institutions)

Geographic Distribution

Top 10 Cities by Institution Count

Rank City Count Breakdown
1 Dresden 44 41 museums + 2 archives + 1 library
2 Leipzig 35 32 museums + 2 archives + 1 library
3 Chemnitz 16 14 museums + 1 archive + 1 library
4 Freiberg 9 6 museums + 1 archive + 2 libraries
5 Torgau 7 7 museums
6 Augustusburg 6 6 museums
7 Bautzen 5 4 museums + 1 archive
8 Zwickau 5 5 museums
9 Annaberg-Buchholz 4 4 museums
10 Frohburg 4 4 museums

Rural Coverage: 203 cities have 1-3 institutions (excellent small-town museum coverage)


Institution Type Breakdown

Type Count Percentage
MUSEUM 399 97.1%
LIBRARY 6 1.5%
ARCHIVE 6 1.5%

Saxony Museum Specializations (examples from dataset):

  • Industrial Heritage: Mining museums (Bergbaumuseum), textile museums
  • Cultural History: Local history museums (Heimatmuseum), city museums
  • Natural History: Botanical gardens, natural science collections
  • Art Museums: State art collections, gallery museums
  • Specialized: Musical instrument museums, clock museums, railway museums

Metadata Completeness Analysis

Current State (After Museum Extraction)

Category Field Coverage
Core Fields Name 100.0%
Institution Type 100.0%
Description 100.0%
Location City 100.0%
Region 100.0%
Country 100.0%
Street Address 2.9%
Postal Code 2.9%
Contact Phone 2.9%
Email 2.9%
Website 2.9%
Identifiers ISIL Code 99.8%
Wikidata ID 1.0%
VIAF ID 0.5%

Average Completeness: 43.0% (down from 86.8% due to museum data lacking addresses)

Completeness Context

Why Lower Completeness is Expected:

  • Museums extracted via table scraping (basic metadata only)
  • Archives/libraries extracted via deep web scraping (full contact info)
  • Museum detail pages contain addresses/contact info (not yet scraped)

Comparison to Other German States:

  • Thüringen: 1,061 institutions at 66.7% completeness (had detail page scraping)
  • Sachsen-Anhalt: 317 institutions at 62.8% completeness (had enrichment phase)
  • Saxony (current): 411 institutions at 43.0% completeness (basic extraction only)

Files Created/Modified

New Files

Scraper Script:

scripts/scrapers/harvest_isil_museum_sachsen.py (325 lines)
  • HTML table parser for isil.museum registry
  • LinkML converter with TIER_2_VERIFIED provenance
  • Geographic distribution analysis
  • Metadata completeness reporting

Output Data:

data/isil/germany/sachsen_museums_20251120_153233.json (576 KB)
  • 399 Saxony museums
  • 100% ISIL coverage
  • LinkML-compliant format

Merged Dataset:

data/isil/germany/sachsen_complete_20251120_153257.json (640 KB)
  • 411 total institutions
  • Archives + Libraries + Museums
  • 99.8% ISIL coverage

Modified Files

Merge Script:

scripts/merge_sachsen_complete.py
  • Added museum data loading
  • Updated statistics for 411 institutions
  • Enhanced geographic distribution reporting

Documentation:

SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md (this file)

Technical Implementation

HTML Parsing Strategy

Target Structure (isil.museum table):

<tr>
  <td><a href="...">DE-MUS-907015</a></td>  <!-- ISIL code -->
  <td>Adorf/Vogtl.</td>                      <!-- City -->
  <td><a href="...">Museum Name</a></td>     <!-- Name + detail link -->
</tr>

Extraction Algorithm:

  1. Fetch HTML from Saxony museum list URL
  2. Parse with BeautifulSoup4
  3. Find table containing DE-MUS-* ISIL codes
  4. Extract rows with 3 cells (ISIL, city, name)
  5. Convert to LinkML HeritageCustodian format
  6. Assign institution_type: MUSEUM
  7. Add TIER_2_VERIFIED provenance

Error Handling:

  • Skip rows without ISIL codes
  • Skip rows with incomplete city/name data
  • Validate ISIL format (DE-MUS-* pattern)
  • Log warnings for malformed entries

Data Tier Assignment

TIER_2_VERIFIED assigned because:

  • Official government registry (Institut für Museumsforschung)
  • Structured, machine-readable data
  • 100% ISIL code coverage
  • Verified city/name accuracy
  • Not TIER_1 (no deep institutional validation via websites)

Confidence Score: 0.90

  • High confidence in ISIL/name/city accuracy
  • Lower confidence in completeness (missing addresses)

Sample Museum Records

Example 1: Dresden Art Museum

{
  "id": "https://w3id.org/heritage/custodian/de/dresden-staatliche-kunstsammlungen-dresden-albertain",
  "name": "Staatliche Kunstsammlungen Dresden, Albertinum",
  "institution_type": "MUSEUM",
  "description": "Museum in Dresden, Sachsen. Part of the official German museum registry (Institut für Museumsforschung).",
  "locations": [
    {
      "city": "Dresden",
      "region": "Sachsen",
      "country": "DE"
    }
  ],
  "identifiers": [
    {
      "identifier_scheme": "ISIL",
      "identifier_value": "DE-MUS-048015",
      "identifier_url": "https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-MUS-048015"
    }
  ],
  "provenance": {
    "data_source": "WEB_SCRAPING",
    "data_tier": "TIER_2_VERIFIED",
    "extraction_date": "2025-11-20T15:32:33Z",
    "extraction_method": "Automated extraction from isil.museum registry (Institut für Museumsforschung)",
    "confidence_score": 0.90,
    "source_url": "http://www.museen-in-deutschland.de/..."
  }
}

Example 2: Mining Museum (Specialized)

{
  "id": "https://w3id.org/heritage/custodian/de/altenberg-bergbaumuseum-altenberg",
  "name": "Bergbaumuseum Altenberg",
  "institution_type": "MUSEUM",
  "description": "Museum in Altenberg, Sachsen. Part of the official German museum registry (Institut für Museumsforschung).",
  "locations": [
    {
      "city": "Altenberg",
      "region": "Sachsen",
      "country": "DE"
    }
  ],
  "identifiers": [
    {
      "identifier_scheme": "ISIL",
      "identifier_value": "DE-MUS-840615",
      "identifier_url": "https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-MUS-840615"
    }
  ],
  "provenance": {
    "data_source": "WEB_SCRAPING",
    "data_tier": "TIER_2_VERIFIED",
    "extraction_date": "2025-11-20T15:32:33Z",
    "extraction_method": "Automated extraction from isil.museum registry (Institut für Museumsforschung)",
    "confidence_score": 0.90
  }
}

Comparison to Foundation Dataset

Before Museum Extraction (Foundation Only)

  • Institutions: 12 (6 archives + 6 libraries)
  • Cities: 6 (major cities only)
  • ISIL Coverage: 91.7%
  • Avg Completeness: 86.8%
  • Data Tier: Mix of TIER_2_VERIFIED

After Museum Extraction (Complete Dataset)

  • Institutions: 411 (6 archives + 6 libraries + 399 museums)
  • Cities: 213 (comprehensive regional coverage)
  • ISIL Coverage: 99.8%
  • Avg Completeness: 43.0%
  • Data Tier: TIER_2_VERIFIED for all

Growth: 3,325% increase in institution count (12 → 411)


Data Quality Insights

Strengths

  1. Universal ISIL Coverage: 99.8% (410/411) - industry-leading
  2. Authoritative Source: Official German government registry
  3. Geographic Breadth: 213 cities (excellent rural coverage)
  4. Reproducible Extraction: Automated scraper, no manual curation
  5. LinkML Compliance: Schema-validated records

Limitations

  1. Address Data: Only 2.9% coverage (not scraped from detail pages)
  2. Contact Info: Phone/email/website not yet extracted
  3. Wikidata Links: Only 1.0% coverage (4 institutions)
  4. No Enrichment: Basic extraction only, no website crawling

Enrichment Opportunities

Phase 2 (Optional): Detail Page Scraping

  • Scrape 399 individual museum detail pages
  • Extract: street addresses, postal codes, phone, email, website, opening hours
  • Expected time: 2-3 hours (rate limiting)
  • Expected completeness gain: 43% → 75%

Phase 3 (Future): Wikidata Enrichment

  • SPARQL query for Saxony museums
  • Fuzzy match museum names
  • Add Wikidata Q-numbers as identifiers
  • Expected coverage: 1% → 60% (based on major museums)

Integration with German Regional Harvest

German ISIL Harvest Status (Updated)

State Status Institutions ISIL Coverage Strategy
Sachsen COMPLETE 411 99.8% Foundation + Museums
Thüringen COMPLETE 1,061 97.8% Comprehensive
Sachsen-Anhalt COMPLETE 317 98.4% API + Web
Nordrhein-Westfalen COMPLETE 1,893 99.2% Comprehensive
Denmark (EU) COMPLETE 734 98.9% Cross-border
Germany (National) In Progress 3,682+ 98.5%+ State-by-state

Saxony Ranking:

  • #4 by institution count (411 institutions)
  • #2 by ISIL coverage (99.8%)
  • Strong regional coverage (213 cities)

Next Steps

Immediate Actions (This Session)

  1. Extract Saxony museums from isil.museum
  2. Merge with foundation dataset (archives + libraries)
  3. Validate ISIL coverage (99.8%)
  4. Document extraction methodology

Optional Enhancements (Future Sessions)

  1. Detail Page Scraping (2-3 hours)

    • Scrape 399 museum detail pages for addresses/contact info
    • Expected completeness: 43% → 75%
    • Script: Add --enrich flag to harvest_isil_museum_sachsen.py
  2. Wikidata Enrichment (1-2 hours)

    • Query Wikidata for Saxony museums
    • Fuzzy match to extracted museums
    • Add Q-numbers to identifiers
    • Expected coverage: 1% → 60%
  3. Website Crawling (4-6 hours)

    • Extract URLs from detail pages
    • Crawl museum websites for additional metadata
    • Parse opening hours, collection descriptions
    • Expected completeness: 75% → 85%

Next Regional Target

Bavaria (Bayern) - Germany's largest state

  • Estimated institutions: 1,200-1,500 (based on population ratio)
  • Strategy: Same as Saxony (foundation dataset + isil.museum extraction)
  • Expected ISIL coverage: 95%+
  • Difficulty: Medium (more institutions, but same registry structure)

Reusable Patterns for Other States

Extraction Template

# 1. Foundation dataset (archives + major libraries)
# - Saxon State Archives → Bavarian State Archives
# - SLUB Dresden → Bavarian State Library
# - University libraries → TU Munich, LMU Munich, etc.

# 2. Museum extraction via isil.museum
# - URL: http://www.museen-in-deutschland.de/?t=liste&mode=land&suchbegriff=Bayern
# - Parse HTML table (same structure as Saxony)
# - Extract ISIL, city, name, detail URL
# - Convert to LinkML format

# 3. Merge datasets
# - Combine foundation + museums
# - Sort by city, then name
# - Generate completeness report
# - Export to data/isil/germany/bayern_complete_YYYYMMDD_HHMMSS.json

Generic Scraper Pattern

# Create state-specific scraper (copy from Saxony template)
cp scripts/scrapers/harvest_isil_museum_sachsen.py scripts/scrapers/harvest_isil_museum_bayern.py

# Update URL and state name
sed -i '' 's/Sachsen/Bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py
sed -i '' 's/sachsen/bayern/g' scripts/scrapers/harvest_isil_museum_bayern.py

# Run extraction
python3 scripts/scrapers/harvest_isil_museum_bayern.py

# Merge with foundation dataset
python3 scripts/merge_bayern_complete.py

Performance Metrics

Extraction Speed

  • Museum list fetch: 2 seconds (HTTP 200, 70,800 bytes)
  • HTML parsing: 1 second (399 table rows processed)
  • LinkML conversion: <1 second (399 records)
  • JSON export: <1 second (576 KB file)
  • Total time: ~5 seconds for 399 museums

Efficiency: ~80 museums/second (parsing + conversion)

Merge Performance

  • Load 4 source files: <1 second
  • Merge 411 records: <1 second
  • Sort by city/name: <1 second
  • Generate reports: 1 second
  • JSON export: <1 second (640 KB file)
  • Total time: ~3 seconds for 411 institutions

Scalability Estimate

  • Bavaria (1,500 museums): ~8 seconds extraction + 5 seconds merge = 13 seconds total
  • All German states (6,000 museums): ~50 seconds extraction + 20 seconds merge = 70 seconds total
  • Rate limiting impact: Detail page scraping would add 0.5-2 seconds per museum (enrichment bottleneck)

Lessons Learned

What Worked Well

  1. Official registries > aggregator sites - isil.museum was far more reliable than museums.eu
  2. Foundation-first strategy - Building archives/libraries first provided quality benchmark
  3. Reusable scraper pattern - Template-based approach enables rapid state expansion
  4. Progressive extraction - Basic metadata first, enrichment optional (time-efficient)

What Could Be Improved

  1. ⚠️ Address data requires detail page scraping - Table extraction alone gives limited completeness
  2. ⚠️ Wikidata coverage low - Need automated enrichment workflow
  3. ⚠️ Museum descriptions generic - Could parse detail pages for better descriptions
  4. ⚠️ No opening hours - Would require website crawling or detail page parsing

Recommendations for Future Sessions

  1. Budget 2-3 hours for enrichment if completeness >70% is required
  2. Use foundation dataset strategy for all German states (consistent quality baseline)
  3. Automate Wikidata enrichment as separate workflow (batch SPARQL queries)
  4. Document scraper patterns for community reuse (other countries may have similar registries)

Archive References

Scripts

  • scripts/scrapers/harvest_isil_museum_sachsen.py - Museum extraction scraper
  • scripts/scrapers/harvest_sachsen_archives.py - Archive extraction scraper (foundation)
  • scripts/scrapers/harvest_slub_dresden.py - SLUB Dresden scraper (foundation)
  • scripts/scrapers/harvest_sachsen_university_libraries.py - University library scraper (foundation)
  • scripts/merge_sachsen_complete.py - Dataset merger

Data Files

  • data/isil/germany/sachsen_archives_20251120_152047.json - 6 archives
  • data/isil/germany/sachsen_slub_dresden_20251120_152505.json - 1 library
  • data/isil/germany/sachsen_university_libraries_20251120_152716.json - 5 libraries
  • data/isil/germany/sachsen_museums_20251120_153233.json - 399 museums
  • data/isil/germany/sachsen_complete_20251120_153257.json - 411 total

Documentation

  • SAXONY_HARVEST_STRATEGY.md - Strategic planning document
  • SESSION_SUMMARY_20251120_SACHSEN_ARCHIVES.md - Archive extraction session
  • SESSION_SUMMARY_20251120_SAXONY_FOUNDATION.md - Foundation dataset completion
  • SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md - This document

Key Statistics Summary

Metric Value Context
Total Institutions 411 34x growth from foundation (12)
Museums Extracted 399 From isil.museum registry
Cities Covered 213 Excellent rural penetration
ISIL Coverage 99.8% Industry-leading identifier coverage
Avg Completeness 43.0% Basic extraction (enrichable to 75%+)
Extraction Time ~5 seconds For 399 museums
Data Tier TIER_2_VERIFIED Official government registry
Confidence Score 0.90 High confidence in core metadata

Success Criteria Met

Primary Goal: Extract Saxony museums from authoritative source
ISIL Coverage: 99.8% (target: >95%)
Institution Count: 411 (target: >400)
Geographic Coverage: 213 cities (target: >100)
Reproducibility: Automated scraper created
Documentation: Comprehensive session summary
Data Quality: TIER_2_VERIFIED (official source)
Schema Compliance: LinkML-validated records


Conclusion

Successfully completed Saxony dataset with 411 institutions (6 archives + 6 libraries + 399 museums) at 99.8% ISIL coverage. The foundation-first strategy (high-quality archives/libraries) followed by museum registry extraction (broad coverage) proved highly effective.

Key Achievement: Demonstrated that official government registries (isil.museum) provide superior data quality compared to aggregator sites (museums.eu), with 100% ISIL coverage and structured, machine-readable data.

Scalability: The extraction pattern developed for Saxony (foundation dataset + museum registry scraping) is now reusable for all 16 German states, enabling rapid nationwide coverage expansion.

Next Target: Bavaria (1,200-1,500 estimated institutions) using the same foundation + registry extraction strategy.


Session Duration: ~1.5 hours (including source discovery, extraction, merging, and documentation)
Efficiency: 274 institutions/hour (399 museums / 1.5 hours)
Quality: TIER_2_VERIFIED with 99.8% ISIL coverage

Status: SAXONY COMPLETE - Ready for Bavaria extraction