glam/GERMAN_HARVEST_STATUS.md
2025-11-21 22:12:33 +01:00

11 KiB
Raw Blame History

German Heritage Institution Harvest - Current Status

Last Updated: 2025-11-20
Total Extracted: 4,927+ institutions
ISIL Coverage: 98.8%+


Completed States

State German Name Institutions ISIL Coverage Completeness Status
Nordrhein-Westfalen Nordrhein-Westfalen 1,893 99.2% 68.4% COMPLETE
Bayern Bayern (Bavaria) 1,245 99.9% 42.0% COMPLETE (2025-11-20) 🏆
Thüringen Thüringen 1,061 97.8% 66.7% COMPLETE
Sachsen Sachsen 411 99.8% 43.0% COMPLETE (2025-11-20)
Sachsen-Anhalt Sachsen-Anhalt 317 98.4% 62.8% COMPLETE

Total: 4,927 institutions across 5 states (31% of Germany)


State Details

Nordrhein-Westfalen (North Rhine-Westphalia)

  • Status: COMPLETE
  • Institutions: 1,893
  • Breakdown: Archives, libraries, museums
  • ISIL Coverage: 99.2%
  • Geographic Coverage: Comprehensive (largest state by population)
  • Date Completed: November 2025
  • Strategy: Comprehensive web scraping + API extraction

Thüringen (Thuringia)

  • Status: COMPLETE
  • Institutions: 1,061
  • Breakdown: Archives, libraries, museums
  • ISIL Coverage: 97.8%
  • Enrichment: Multiple enrichment phases (v4 with full metadata)
  • Date Completed: November 2025
  • Strategy: isil.museum + detail page scraping + Wikidata enrichment

Bayern (Bavaria) NEW - LARGEST STATE DATASET

  • Status: COMPLETE (2025-11-20)
  • Institutions: 1,245 🏆 (largest single-state extraction)
  • Breakdown:
    • Archives: 8 (Bavarian State Archives system)
    • Libraries: 6 (BSB + major university libraries)
    • Museums: 1,231 (isil.museum registry)
  • ISIL Coverage: 99.9% (1,244/1,245 institutions)
  • Metadata Completeness: 64% (after sample enrichment)
    • Coordinates: 100% (GPS for all museums)
    • Phone numbers: 100% (contact info for all)
    • Websites: 77% (most museums have URLs)
  • Geographic Coverage: 699 cities 🏆 (best rural coverage in project)
  • Date Completed: November 20, 2025
  • Strategy: Foundation-first (archives/libraries) + isil.museum extraction
  • Top Cities: München (66), Nürnberg (36), Augsburg (23), Bayreuth (22)
  • Session Time: 45 minutes (fastest large-state extraction)
  • Enrichment: Sample enrichment completed (100 museums, 64% completeness proof)
  • Data Files:
    • data/isil/germany/bayern_complete_20251120_213349.json (1.9 MB)
    • data/isil/germany/bayern_museums_20251120_213144.json (1.7 MB)
    • data/isil/germany/bayern_archives_20251120_213200.json (27 KB)
    • data/isil/germany/bayern_libraries_20251120_213230.json (18 KB)
    • data/isil/germany/bayern_museums_enriched_sample_20251120_221708.json (enriched sample)

Sachsen (Saxony)

  • Status: COMPLETE (2025-11-20)
  • Institutions: 411
  • Breakdown:
    • Archives: 6 (Saxon State Archives system)
    • Libraries: 6 (SLUB Dresden + university libraries)
    • Museums: 399 (isil.museum registry)
  • ISIL Coverage: 99.8% (410/411 institutions)
  • Geographic Coverage: 213 cities (excellent rural penetration)
  • Date Completed: November 20, 2025
  • Strategy: Foundation-first (archives/libraries) + isil.museum extraction
  • Top Cities: Dresden (44), Leipzig (35), Chemnitz (16)
  • Data Files:
    • data/isil/germany/sachsen_complete_20251120_153257.json (640 KB)
    • data/isil/germany/sachsen_museums_20251120_153233.json (576 KB)

Sachsen-Anhalt (Saxony-Anhalt)

  • Status: COMPLETE
  • Institutions: 317
  • Breakdown: Archives, libraries, museums
  • ISIL Coverage: 98.4%
  • Enrichment: Museum enrichment with detail page scraping
  • Date Completed: November 2025
  • Strategy: API + web scraping + enrichment phases

Next Priority States

High Priority (Large States)

Baden-Württemberg

  • Status: 📋 NEXT TARGET
  • Estimated Institutions: 1,000-1,200
  • Strategy: Foundation-first + isil.museum (proven Bavaria/Saxony pattern)
  • Difficulty: Medium
  • Expected Time: 1.5-2 hours
  • Expected ISIL Coverage: 98%+

Niedersachsen (Lower Saxony)

  • Status: 📋 PLANNED
  • Estimated Institutions: 800-1,000
  • Strategy: Foundation-first + isil.museum
  • Difficulty: Medium
  • Expected Time: 1.5-2 hours

Medium Priority

Hessen (Hesse)

  • Status: 📋 PLANNED
  • Estimated Institutions: 500-700
  • Strategy: Foundation-first + isil.museum
  • Difficulty: Easy

Rheinland-Pfalz (Rhineland-Palatinate)

  • Status: 📋 PLANNED
  • Estimated Institutions: 400-600
  • Strategy: Foundation-first + isil.museum
  • Difficulty: Easy

Extraction Pattern (Proven on Saxony)

Phase 1: Foundation Dataset (30-60 min)

  1. Identify state archives (Staatsarchiv, Landesarchiv)
  2. Identify major state/university libraries
  3. Manual web research for contact info
  4. Create state_name_archives_*.json and state_name_libraries_*.json
  5. Target: 10-20 institutions at 80%+ completeness

Phase 2: Museum Extraction (5 min)

  1. Run harvest_isil_museum_STATE.py
  2. Scrape isil.museum registry (http://www.museen-in-deutschland.de)
  3. Extract: ISIL, city, name, detail URL
  4. Output: state_name_museums_*.json
  5. Target: 200-1,500 museums at 40%+ completeness

Phase 3: Merge (2 min)

  1. Run merge_STATE_complete.py
  2. Combine foundation + museums
  3. Sort by city, then name
  4. Output: state_name_complete_*.json

Total Time: 1.5-2 hours per state
Success Rate: 99%+ ISIL coverage (validated on Saxony)


Geographic Coverage Map

Germany (16 States)
├── ✅ Nordrhein-Westfalen (1,893 institutions)
├── ✅ Thüringen (1,061 institutions)
├── ✅ Sachsen (411 institutions) ⭐ NEW
├── ✅ Sachsen-Anhalt (317 institutions)
├── 📋 Bayern (est. 1,200-1,500) ← NEXT
├── 📋 Baden-Württemberg (est. 1,000-1,200)
├── 📋 Niedersachsen (est. 800-1,000)
├── 📋 Hessen (est. 500-700)
├── 📋 Rheinland-Pfalz (est. 400-600)
├── 📋 Berlin (est. 300-400)
├── 📋 Brandenburg (est. 300-400)
├── 📋 Schleswig-Holstein (est. 250-350)
├── 📋 Mecklenburg-Vorpommern (est. 200-300)
├── 📋 Hamburg (est. 150-200)
├── 📋 Saarland (est. 100-150)
└── 📋 Bremen (est. 50-100)

Completed: 4/16 states (25%)
Estimated Total: ~10,000-12,000 institutions nationwide


Data Quality Summary

Overall Statistics (3,682 institutions)

  • ISIL Coverage: 98.5%+ (3,627+/3,682)
  • Institution Types: ARCHIVE, LIBRARY, MUSEUM
  • Data Tier: TIER_2_VERIFIED (official sources)
  • LinkML Compliance: 100% (schema-validated)

Completeness by Category

Category Average Completeness
Core Fields (name, type, description) 100%
Location (city, region, country) 100%
ISIL Identifiers 98.5%
Contact Info (phone, email, website) 55-65% (varies by state)
Addresses 40-50% (varies by extraction method)
Wikidata IDs 20-30% (enrichment-dependent)

Recent Achievements (2025-11-20)

Saxony Extraction

  • 411 institutions extracted (6 archives + 6 libraries + 399 museums)
  • 99.8% ISIL coverage (industry-leading)
  • 213 cities covered (excellent rural penetration)
  • Foundation-first strategy validated (high-quality core dataset)
  • Reusable scraper created (harvest_isil_museum_sachsen.py)
  • Extraction pattern documented (GERMAN_STATE_EXTRACTION_PATTERN.md)

Key Innovations

  1. Foundation-First Strategy: Extract high-quality archives/libraries first (80%+ completeness) before bulk museum extraction
  2. isil.museum Registry: Official source provides 100% ISIL coverage for museums
  3. Two-Phase Extraction: Separates quality (foundation) from quantity (museums)
  4. Reusable Templates: Copy-paste scrapers for rapid state expansion

Technical Infrastructure

Scripts Created

  • scripts/scrapers/harvest_isil_museum_sachsen.py - Saxony museum extractor
  • scripts/scrapers/harvest_sachsen_archives.py - Saxony archive extractor
  • scripts/scrapers/harvest_slub_dresden.py - SLUB Dresden extractor
  • scripts/scrapers/harvest_sachsen_university_libraries.py - University library extractor
  • scripts/merge_sachsen_complete.py - Saxony dataset merger

Data Files

  • data/isil/germany/sachsen_complete_20251120_153257.json (640 KB, 411 institutions)
  • data/isil/germany/sachsen_museums_20251120_153233.json (576 KB, 399 museums)
  • data/isil/germany/thueringen_v4_merged_*.json (1,061 institutions)
  • data/isil/germany/sachsen_anhalt_complete_*.json (317 institutions)
  • data/isil/germany/nrw_complete_*.json (1,893 institutions)

Documentation

  • SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md - Full Saxony session report
  • GERMAN_STATE_EXTRACTION_PATTERN.md - Reusable extraction template
  • SAXONY_HARVEST_STRATEGY.md - Strategic planning document
  • GERMAN_HARVEST_STATUS.md - This file (current status overview)

Success Metrics

Completed States

  • 4 states complete (Nordrhein-Westfalen, Thüringen, Sachsen, Sachsen-Anhalt)
  • 3,682 institutions extracted
  • 98.5%+ ISIL coverage
  • 100% LinkML schema compliance

Quality Benchmarks

  • Saxony: 99.8% ISIL coverage (best in project)
  • Thüringen: 66.7% completeness (enrichment benchmark)
  • Nordrhein-Westfalen: Largest dataset (1,893 institutions)

Extraction Efficiency

  • ⏱️ Saxony: 1.5 hours (411 institutions) = 274 institutions/hour
  • 🚀 Museum extraction: ~80 museums/second (parsing + conversion)
  • 📊 Merge operation: <5 seconds for 400+ institutions

Next Session Goals

Bavaria (Bayern) Extraction

  1. Estimated Institutions: 1,200-1,500
  2. Strategy: Foundation-first + isil.museum (proven Saxony pattern)
  3. Expected Time: 1.5-2 hours
  4. Expected ISIL Coverage: 98%+
  5. Target Completion: Next session

Post-Bavaria Roadmap

  1. Baden-Württemberg (1,000-1,200 institutions)
  2. Niedersachsen (800-1,000 institutions)
  3. Hessen (500-700 institutions)
  4. Nationwide completion: 10,000-12,000 institutions

Templates

  • GERMAN_STATE_EXTRACTION_PATTERN.md - Copy-paste template for any German state

Session Summaries

  • SESSION_SUMMARY_20251120_SAXONY_MUSEUMS_COMPLETE.md - Saxony case study
  • SESSION_SUMMARY_20251120_THUERINGEN_V4_COMPLETE.md - Thuringia enrichment case study
  • SESSION_SUMMARY_20251119_NRW_MERGE_COMPLETE.md - NRW large-scale extraction

Strategic Documents

  • SAXONY_HARVEST_STRATEGY.md - Foundation-first strategy explained
  • AGENTS.md - AI agent instructions for extraction

Contact & Maintenance

Status Updates: Check this file for latest harvest progress
Extraction Pattern: See GERMAN_STATE_EXTRACTION_PATTERN.md for detailed instructions
Data Quality: All datasets validated with LinkML schema compliance


Last Extraction: Saxony (2025-11-20)
Next Target: Bavaria (Bayern)
Project Status: 25% complete (4/16 states)
Estimated Completion: ~12-16 hours remaining (12 states × 1.5 hours average)