glam/SAXONY_HARVEST_STRATEGY.md
2025-11-21 22:12:33 +01:00

Saxony (Sachsen) GLAM Harvest Strategy

Session: 2025-11-20
Status: PLANNING
Target: 400-600 institutions with 95%+ metadata completeness


Source Analysis Results

1. SLUB Dresden (Digital Collections)

URL: https://digital.slub-dresden.de/kollektionen/
Type: Single institution (State and University Library Dresden)
Status: Accessible
Content: 88,000+ digitized titles in collections

Assessment:

  • NOT an institution aggregator - this is SLUB's own digital collection portal
  • Focus: Digital objects (manuscripts, photos, maps, newspapers)
  • Use case: Extract SLUB Dresden as a single LIBRARY institution
  • Metadata: Available (name, address, collections, website)

Action: Manual extraction of SLUB Dresden metadata (1 institution)


2. Sachsen.digital

URL: http://www.sachsendigital.de/startseite/
Status: 404 (redirects to saxorum.de 404 page)
Assessment: Portal no longer operational or moved

Action: Archive this source (portal defunct)


3. Saxorum (Regional Studies Portal)

URL: https://www.saxorum.de/
Type: Research database for Saxony regional studies
Status: Accessible
Content: Persons, places, themes, historical resources

Assessment:

  • NOT an institution directory - this is a historical research portal
  • Focus: Historical persons, places, bibliographies, digitized materials
  • No institution listings found in navigation
  • Use case: Potential source for institutional history research (secondary)

Action: Low priority for institution harvesting (not a directory)


4. Sächsisches Staatsarchiv (Saxon State Archives)

URL: https://www.archiv.sachsen.de/
Type: Archive network (multiple locations)
Status: Accessible
Content: State archives across Saxony

Assessment:

  • HIGH PRIORITY - State archives are major heritage institutions
  • Expected: 6-8 archive locations (Dresden, Leipzig, Chemnitz, Bautzen, Freiberg, Plauen, etc.)
  • Metadata available: Addresses, opening hours, contact info, holdings descriptions

Action: Scrape archive locations from staatsarchiv.sachsen.de


🔍 5. Museumsverband Sachsen (NOT YET CHECKED)

Expected URL: https://www.museen-in-sachsen.de/
Type: Museum association directory (if exists)
Status: NOT reachable in the initial test (the request returned no output); directory not yet verified

Assessment:

  • CRITICAL - This is likely the primary source for Saxony museums
  • Expected: 300-500 museum listings with comprehensive metadata
  • Similar to Sachsen-Anhalt's museum portal model

Action: PRIORITY 1 - Verify the Museumsverband URL and locate the Saxony museum directory


Missing Sources to Identify

High Priority

  1. Saxony Museum Association Directory

    • Search for: "Museumsverband Sachsen", "Museen in Sachsen"
    • Expected institutions: 300-500 museums
    • Must have: Museum names, cities, addresses, websites
  2. University Libraries

    • TU Dresden library
    • Leipzig University library (UB Leipzig)
    • TU Chemnitz library
    • TU Bergakademie Freiberg library
  3. Major Museums

    • Staatliche Kunstsammlungen Dresden (Dresden State Art Collections)
    • GRASSI Museum Leipzig
    • Museum für Völkerkunde Dresden
    • Deutsches Hygiene-Museum Dresden
  4. City Archives

    • Stadtarchiv Dresden
    • Stadtarchiv Leipzig
    • Stadtarchiv Chemnitz

Medium Priority

  1. Specialized Archives
    • Church archives (Evangelisch-Lutherische Landeskirche Sachsens)
    • University archives
    • Corporate archives

Estimated Institution Count

| Institution Type | Estimated Count | Confidence |
|------------------|-----------------|------------|
| Museums | 300-500 | High (based on Sachsen-Anhalt ratio) |
| Archives | 20-30 | Medium (state + city + specialized) |
| Libraries | 40-60 | Medium (public + university + specialized) |
| Galleries | 20-40 | Low (need source identification) |
| Research Centers | 10-20 | Low (need source identification) |
| TOTAL | 390-650 | Medium |

Note: Sachsen-Anhalt (roughly half of Saxony's population, though slightly larger in area) yielded 166 institutions. Saxony, being considerably more populous, is expected to yield 400-600.
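
The TOTAL row can be cross-checked by summing the per-type ranges. A minimal sketch; the names `ESTIMATES` and `total_range` are illustrative and not part of any existing script:

```python
# Per-type (low, high) estimates copied from the estimate table above.
ESTIMATES = {
    "Museums": (300, 500),
    "Archives": (20, 30),
    "Libraries": (40, 60),
    "Galleries": (20, 40),
    "Research Centers": (10, 20),
}


def total_range(estimates):
    """Sum the low and high bounds across all institution types."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


print(total_range(ESTIMATES))  # (390, 650)
```

The summed range (390-650) is slightly wider than the 400-600 target, which is consistent with the target being a rounded planning figure.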


Harvest Strategy (Priority Order)

Phase 1: Source Discovery (CURRENT)

Status: IN PROGRESS
Tasks:

  1. Test provided URLs accessibility
  2. Classify sources (aggregator vs. single institution)
  3. 🔄 Find Saxony museum association directory
  4. 🔄 Find university library consortium
  5. 🔄 Identify major museum websites

Next Action: Search for Saxony museum directory
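
Tasks 1 and 2 amount to mapping each fetch result to a coarse status. The sketch below shows one possible classification; the status labels and the function name `classify_source` are assumptions made for illustration. The inputs would come from whatever HTTP client is used (for example the `status_code` and `url` attributes of a `requests` response), with `None` standing for a request that produced no response.

```python
from urllib.parse import urlparse


def _host(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    return urlparse(url).netloc.lower().removeprefix("www.")


def classify_source(requested_url, status_code, final_url):
    """Map a fetch result to a coarse source status (labels are hypothetical).

    status_code is None when the request produced no response at all.
    """
    if status_code is None:
        return "UNREACHABLE"
    if status_code == 404:
        return "DEFUNCT"
    if 200 <= status_code < 300:
        # A successful response from a different host means the portal moved.
        return "ACCESSIBLE" if _host(requested_url) == _host(final_url) else "MOVED"
    return "ERROR"


print(classify_source("https://www.saxorum.de/", 200, "https://www.saxorum.de/"))
```

Under these labels, the Sachsen.digital result (a 404 after redirect) would be classified as DEFUNCT and the museum directory test (no output) as UNREACHABLE.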


Phase 2: Scraper Development

Depends on: Phase 1 completion
Tasks:

  1. Build museum directory scraper (if HTML directory exists)
  2. Build archive location scraper (staatsarchiv.sachsen.de)
  3. Build library scraper (if consortium website exists)
  4. Build detail page enrichment scrapers

Reusable from Sachsen-Anhalt:

  • Rate limiting: 1 req/sec
  • Address extraction patterns (German format)
  • LinkML data model
  • Merge/deduplication logic
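
The Sachsen-Anhalt limiter itself is not reproduced in this document. As a rough illustration of the 1 req/sec rule, a limiter could look like the sketch below; the class name `RateLimiter` and its injectable `clock` and `sleep` parameters are assumptions, chosen so the behaviour can be tested without real delays.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls (default: 1 request per second)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A scraper would call `limiter.wait()` immediately before each HTTP request, so the first request goes out at once and every later one is spaced at least one second apart.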

Phase 3: Data Enrichment

Depends on: Phase 2 completion
Tasks:

  1. Scrape detail pages for full metadata
  2. Geocode addresses (Nominatim)
  3. Extract contact info (phone, email)
  4. Extract ISIL codes (if available)
  5. Cross-reference with Wikidata

Target Completeness: 95%+ (based on Sachsen-Anhalt success)
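
Geocoding depends on clean address fields, so detail-page text first has to be split into street, postal code, and city. The pattern below is a simplified sketch for the common "Street Number, PLZ City" layout, not the Sachsen-Anhalt extraction code; `ADDRESS_RE` and `parse_german_address` are hypothetical names, and unusual street names (for example ones containing digits) are not handled.

```python
import re

ADDRESS_RE = re.compile(
    r"(?P<street>[^\d,\n]+?\s\d+\s?[a-zA-Z]?)"  # street name, house number, optional letter
    r"\s*,?\s*"                                  # optional comma between street and postal code
    r"(?P<postal_code>\d{5})\s+"                 # five-digit German postal code
    r"(?P<city>[^\d,\n]+)"                       # city name up to a digit, comma, or newline
)


def parse_german_address(text):
    """Return street_address, postal_code, and city, or None if no address is found."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return {
        "street_address": match.group("street").strip(),
        # Kept as a string so leading zeros such as "01069" are preserved.
        "postal_code": match.group("postal_code"),
        "city": match.group("city").strip(),
    }


print(parse_german_address("Zellescher Weg 18, 01069 Dresden"))
```

The returned keys mirror the `locations` fields of the data model, so a parsed address can be dropped into a record without renaming.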


Phase 4: Merge & Validation

Depends on: Phase 3 completion
Tasks:

  1. Merge all sources into unified Saxony dataset
  2. Deduplicate institutions (fuzzy matching)
  3. Validate LinkML compliance
  4. Generate completeness report
  5. Export final JSON

Output: data/isil/germany/sachsen_complete_[timestamp].json
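
The fuzzy-matching method for step 2 is not specified here. One simple option, shown as a sketch, is to compare normalized names with the standard library's `difflib.SequenceMatcher` and only consider records in the same city. The flat record shape (`name`, `city`), the 0.9 threshold, and the function names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_duplicate(a, b, threshold=0.9):
    """Treat two records as one institution if the city matches and the names are similar."""
    if normalize(a["city"]) != normalize(b["city"]):
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first record of each duplicate group (quadratic, fine for a few hundred rows)."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, threshold) for kept in unique):
            unique.append(record)
    return unique
```

Keeping the first record means source order matters: loading the most trusted source first lets its version of a record win. A real merge step would probably also combine fields from the discarded duplicates rather than dropping them.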


Technical Architecture

Data Model (LinkML v0.2.2)

- id: https://w3id.org/heritage/custodian/de/slub-dresden
  name: Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
  institution_type: LIBRARY
  alternative_names:
    - SLUB Dresden
    - Saxon State and University Library Dresden
  description: >-
    The Saxon State and University Library Dresden (SLUB) is both the state 
    library of Saxony and the university library for TU Dresden. Founded in 
    1556, it holds over 9 million volumes.    
  
  locations:
    - city: Dresden
      street_address: Zellescher Weg 18
      postal_code: "01069"
      region: Sachsen
      country: DE
  
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: DE-14
      identifier_url: https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-14
    - identifier_scheme: Wikidata
      identifier_value: Q700566
      identifier_url: https://www.wikidata.org/wiki/Q700566
    - identifier_scheme: Website
      identifier_value: https://www.slub-dresden.de
      identifier_url: https://www.slub-dresden.de
  
  digital_platforms:
    - platform_name: SLUB Digital Collections
      platform_url: https://digital.slub-dresden.de
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - METS/MODS
        - Dublin Core
  
  provenance:
    data_source: WEB_SCRAPING
    data_tier: TIER_2_VERIFIED
    extraction_date: "2025-11-20T..."
    extraction_method: "Manual extraction from official website"
    confidence_score: 0.98
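
Before full LinkML validation, a lightweight pre-check can catch the most common extraction mistakes in a record shaped like the one above. This sketch is not a substitute for schema validation; the list in `REQUIRED_FIELDS` and the function name `validate_record` are assumptions, since the schema's actual required slots are not listed in this document.

```python
import re

# Assumed minimum set of fields; the real LinkML schema may require more or fewer.
REQUIRED_FIELDS = ("id", "name", "institution_type", "locations", "provenance")


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the pre-check passed."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    for location in record.get("locations", []):
        postal_code = location.get("postal_code")
        # Postal codes must be strings so leading zeros ("01069") survive serialization.
        if not isinstance(postal_code, str) or not re.fullmatch(r"\d{5}", postal_code):
            problems.append(f"invalid postal_code: {postal_code!r}")
    score = record.get("provenance", {}).get("confidence_score")
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append(f"confidence_score out of range: {score}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easy to feed the results into a completeness or quality report.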

Scripts to Create

scripts/scrapers/
├── harvest_sachsen_museums.py (museum directory scraper)
├── harvest_sachsen_archives.py (state archives scraper)
├── harvest_sachsen_libraries.py (library consortium scraper)
├── enrich_sachsen_details.py (detail page metadata enrichment)
└── merge_sachsen_complete.py (merge all sources)

Success Criteria

Minimum Viable Dataset

  • 300+ institutions extracted
  • 90%+ metadata completeness (name, type, city, website)
  • Geographic coverage across all major Saxony cities
  • LinkML schema validation passes
  • Integration-ready for German national dataset v5
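
How "metadata completeness" is computed is not defined in this document. One plausible reading, sketched below, is the share of filled slots across the four core fields named above. The flat record shape, the field keys in `CORE_FIELDS`, and the function name are assumptions.

```python
# Core fields from the minimum viable dataset criteria (key names are assumed).
CORE_FIELDS = ("name", "institution_type", "city", "website")


def completeness(records, fields=CORE_FIELDS):
    """Fraction of (record, field) slots that hold a non-empty value."""
    if not records:
        return 0.0
    filled = sum(1 for record in records for field in fields if record.get(field))
    return filled / (len(records) * len(fields))


sample = [
    {"name": "SLUB Dresden", "institution_type": "LIBRARY", "city": "Dresden",
     "website": "https://www.slub-dresden.de"},
    {"name": "Stadtarchiv Chemnitz", "institution_type": "ARCHIVE", "city": "Chemnitz"},
]
print(completeness(sample))  # 0.875, below the 0.90 minimum
```

A stricter alternative would count only records in which every core field is filled; the two definitions can differ noticeably, so the report should state which one it uses.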

Target Dataset (Ideal)

  • 400-600 institutions extracted
  • 95%+ metadata completeness (including addresses, phone, email)
  • ISIL codes for major institutions
  • Wikidata cross-references
  • Collection descriptions where available

Risk Assessment

HIGH RISK

  • No centralized museum directory found
    • Mitigation: Search alternative sources (tourism websites, regional portals)
    • Fallback: Manual extraction from individual museum websites

MEDIUM RISK

  • Fragmented data sources (no single aggregator)
    • Mitigation: Multi-source harvest strategy (archives, libraries, museums separately)
    • Impact: Longer development time

LOW RISK

  • Website blocking/rate limiting
    • Mitigation: Proven 1 req/sec rate limiting from Sachsen-Anhalt
    • Impact: Minimal (harvest takes longer but succeeds)

Timeline Estimate

| Phase | Duration | Depends On |
|-------|----------|------------|
| Phase 1: Source Discovery | 2-4 hours | Current session |
| Phase 2: Scraper Development | 4-6 hours | Phase 1 complete |
| Phase 3: Data Enrichment | 6-10 hours | Phase 2 complete |
| Phase 4: Merge & Validation | 2-3 hours | Phase 3 complete |
| TOTAL | 14-23 hours | Continuous work |

Note: Timeline assumes sources are identified. If no museum directory exists, add 4-8 hours for alternative sourcing.


Next Immediate Actions

Action 1: Search for Saxony Museum Directory (PRIORITY 1)

URLs and search queries to test:

  1. https://www.museen-in-sachsen.de/
  2. https://www.kulturraum-sachsen.de/
  3. https://www.smwk.sachsen.de/museen (Ministry of Culture)
  4. Search: "Museumsverband Sachsen" + "Liste" + "Mitglieder"

Expected outcome: Find authoritative source with 300-500 museum listings


Action 2: Extract Saxon State Archives Locations

Source: https://www.archiv.sachsen.de/
Expected data:

  • 6-8 archive locations
  • Addresses, phone, email, opening hours
  • Holdings descriptions
  • ISIL codes (likely format: DE-Dd*, DE-L*, etc.)

Script to create: scripts/scrapers/harvest_sachsen_archives.py
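
Because the expected ISIL formats above are guesses, scraped codes are worth a syntax check before they are stored. The sketch below assumes the general ISIL shape from ISO 15511 as understood here: the country prefix `DE`, a hyphen, then a unit identifier of at most 11 characters drawn from letters, digits, `-`, `/`, and `:`. It only tests plausibility; it does not confirm that a code is actually registered. The names `ISIL_DE_RE` and `is_plausible_german_isil` are illustrative.

```python
import re

# Assumed shape: "DE-" plus 1 to 11 characters from letters, digits, "-", "/" and ":".
ISIL_DE_RE = re.compile(r"DE-[A-Za-z0-9:/-]{1,11}")


def is_plausible_german_isil(value):
    """Return True if value has the syntactic shape of a German ISIL."""
    return ISIL_DE_RE.fullmatch(value) is not None


for candidate in ("DE-14", "DE-", "14"):
    print(candidate, is_plausible_german_isil(candidate))
```

Codes that pass this check could then be verified against the German ISIL registry during the enrichment phase.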


Action 3: Identify University Libraries

Search queries:

  1. "TU Dresden Bibliothek" + "SLUB"
  2. "Universitätsbibliothek Leipzig"
  3. "TU Chemnitz Bibliothek"
  4. "TU Bergakademie Freiberg Bibliothek"

Expected outcome: 4-6 major university libraries with complete metadata


Questions for User

  1. Should I search for the Saxony museum directory now?

    • This is CRITICAL for reaching the 300+ institution target
  2. Should I prioritize breadth (all institution types) or depth (museums only)?

    • Breadth: Harvest all types (museums, archives, libraries) with 90% completeness
    • Depth: Focus on museums with 95%+ completeness (like Sachsen-Anhalt)
  3. Do you have additional Saxony GLAM sources not listed?

    • Any known museum directories, library consortia, or regional portals?

Session Status

Current State: Source analysis complete
Blockers: Need to find Saxony museum directory
Ready to proceed with: Archive harvesting (staatsarchiv.sachsen.de)

Awaiting user input: Confirm next action priority