# Saxony (Sachsen) GLAM Harvest Strategy

**Session**: 2025-11-20  
**Status**: PLANNING  
**Target**: 400-600 institutions with 95%+ metadata completeness

---

## Source Analysis Results

### ✅ 1. SLUB Dresden (Digital Collections)

**URL**: https://digital.slub-dresden.de/kollektionen/  
**Type**: Single institution (State and University Library Dresden)  
**Status**: Accessible  
**Content**: 88,000+ digitized titles in collections

**Assessment**:
- **NOT** an institution aggregator - this is SLUB's own digital collection portal
- Focus: Digital objects (manuscripts, photos, maps, newspapers)
- Use case: Extract SLUB Dresden as a single LIBRARY institution
- Metadata: Available (name, address, collections, website)

**Action**: Manual extraction of SLUB Dresden metadata (1 institution)

---

### ❌ 2. Sachsen.digital

**URL**: http://www.sachsendigital.de/startseite/  
**Status**: 404 (redirects to saxorum.de 404 page)  
**Assessment**: Portal no longer operational or moved

**Action**: Archive this source (portal defunct)

---

### ✅ 3. Saxorum (Regional Studies Portal)

**URL**: https://www.saxorum.de/  
**Type**: Research database for Saxony regional studies  
**Status**: Accessible  
**Content**: Persons, places, themes, historical resources

**Assessment**:
- **NOT** an institution directory - this is a historical research portal
- Focus: Historical persons, places, bibliographies, digitized materials
- No institution listings found in navigation
- Use case: Potential source for institutional history research (secondary)

**Action**: Low priority for institution harvesting (not a directory)

---

### ✅ 4. Sächsisches Staatsarchiv (Saxon State Archives)

**URL**: https://www.archiv.sachsen.de/  
**Type**: Archive network (multiple locations)  
**Status**: Accessible  
**Content**: State archives across Saxony

**Assessment**:
- **HIGH PRIORITY** - State archives are major heritage institutions
- Expected: 6-8 archive locations (Dresden, Leipzig, Chemnitz, Bautzen, Freiberg, Plauen, etc.)
- Metadata available: Addresses, opening hours, contact info, holdings descriptions

**Action**: Scrape archive locations from staatsarchiv.sachsen.de

---

### 🔍 5. Museumsverband Sachsen (NOT YET CHECKED)

**Expected URL**: https://www.museen-in-sachsen.de/  
**Type**: Museum association directory (if it exists)  
**Status**: Not accessible in initial test (no output); URL unverified

**Assessment**:
- **CRITICAL** - This is likely the primary source for Saxony museums
- Expected: 300-500 museum listings with comprehensive metadata
- Similar to Sachsen-Anhalt's museum portal model

**Action**: PRIORITY 1 - Investigate the Museumsverband URL and find the Saxony museum directory

---

## Missing Sources to Identify

### High Priority

1. **Saxony Museum Association Directory**
   - Search for: "Museumsverband Sachsen", "Museen in Sachsen"
   - Expected institutions: 300-500 museums
   - Must have: Museum names, cities, addresses, websites

2. **University Libraries**
   - TU Dresden library
   - Leipzig University library (UB Leipzig)
   - TU Chemnitz library
   - TU Bergakademie Freiberg library

3. **Major Museums**
   - Staatliche Kunstsammlungen Dresden (Dresden State Art Collections)
   - GRASSI Museum Leipzig
   - Museum für Völkerkunde Dresden
   - Deutsches Hygiene-Museum Dresden

4. **City Archives**
   - Stadtarchiv Dresden
   - Stadtarchiv Leipzig
   - Stadtarchiv Chemnitz

### Medium Priority

5. **Specialized Archives**
   - Church archives (Evangelisch-Lutherische Landeskirche Sachsen)
   - University archives
   - Corporate archives

---

## Estimated Institution Count

| Institution Type | Estimated Count | Confidence |
|------------------|-----------------|------------|
| Museums | 300-500 | High (based on Sachsen-Anhalt ratio) |
| Archives | 20-30 | Medium (state + city + specialized) |
| Libraries | 40-60 | Medium (public + university + specialized) |
| Galleries | 20-40 | Low (need source identification) |
| Research Centers | 10-20 | Low (need source identification) |
| **TOTAL** | **390-650** | **Medium** |

**Note**: Sachsen-Anhalt (a smaller state) yielded 166 institutions. Saxony (larger, more populous) should yield 400-600.

---

## Harvest Strategy (Priority Order)

### Phase 1: Source Discovery (CURRENT)

**Status**: IN PROGRESS

**Tasks**:
1. ✅ Test accessibility of the provided URLs
2. ✅ Classify sources (aggregator vs. single institution)
3. 🔄 Find Saxony museum association directory
4. 🔄 Find university library consortium
5. 🔄 Identify major museum websites

**Next Action**: Search for Saxony museum directory

---

### Phase 2: Scraper Development

**Depends on**: Phase 1 completion

**Tasks**:
1. Build museum directory scraper (if an HTML directory exists)
2. Build archive location scraper (staatsarchiv.sachsen.de)
3. Build library scraper (if a consortium website exists)
4. Build detail page enrichment scrapers

**Reusable from Sachsen-Anhalt**:
- Rate limiting: 1 req/sec
- Address extraction patterns (German format)
- LinkML data model
- Merge/deduplication logic

---

### Phase 3: Data Enrichment

**Depends on**: Phase 2 completion

**Tasks**:
1. Scrape detail pages for full metadata
2. Geocode addresses (Nominatim)
3. Extract contact info (phone, email)
4. Extract ISIL codes (if available)
5. Cross-reference with Wikidata

**Target Completeness**: 95%+ (based on Sachsen-Anhalt success)

---

### Phase 4: Merge & Validation

**Depends on**: Phase 3 completion

**Tasks**:
1. Merge all sources into unified Saxony dataset
2. Deduplicate institutions (fuzzy matching)
3. Validate LinkML compliance
4. Generate completeness report
5. Export final JSON

**Output**: `data/isil/germany/sachsen_complete_[timestamp].json`

---

## Technical Architecture

### Data Model (LinkML v0.2.2)

```yaml
- id: https://w3id.org/heritage/custodian/de/slub-dresden
  name: Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
  institution_type: LIBRARY
  alternative_names:
    - SLUB Dresden
    - Saxon State and University Library Dresden
  description: >-
    The Saxon State and University Library Dresden (SLUB) is both the state
    library of Saxony and the university library for TU Dresden. Founded in
    1556, it holds over 9 million volumes.
  locations:
    - city: Dresden
      street_address: Zellescher Weg 18
      postal_code: "01069"
      region: Sachsen
      country: DE
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: DE-D161
      identifier_url: https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-D161
    - identifier_scheme: Wikidata
      identifier_value: Q700566
      identifier_url: https://www.wikidata.org/wiki/Q700566
    - identifier_scheme: Website
      identifier_value: https://www.slub-dresden.de
      identifier_url: https://www.slub-dresden.de
  digital_platforms:
    - platform_name: SLUB Digital Collections
      platform_url: https://digital.slub-dresden.de
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - METS/MODS
        - Dublin Core
  provenance:
    data_source: WEB_SCRAPING
    data_tier: TIER_2_VERIFIED
    extraction_date: "2025-11-20T..."
    extraction_method: "Manual extraction from official website"
    confidence_score: 0.98
```

### Scripts to Create

```
scripts/scrapers/
├── harvest_sachsen_museums.py     (museum directory scraper)
├── harvest_sachsen_archives.py    (state archives scraper)
├── harvest_sachsen_libraries.py   (library consortium scraper)
├── enrich_sachsen_details.py      (detail page metadata enrichment)
└── merge_sachsen_complete.py      (merge all sources)
```

---

## Success Criteria

### Minimum Viable Dataset

- ✅ 300+ institutions extracted
- ✅ 90%+ metadata completeness (name, type, city, website)
- ✅ Geographic coverage across all major Saxony cities
- ✅ LinkML schema validation passes
- ✅ Integration-ready for German national dataset v5

### Target Dataset (Ideal)

- ✅ 400-600 institutions extracted
- ✅ 95%+ metadata completeness (including addresses, phone, email)
- ✅ ISIL codes for major institutions
- ✅ Wikidata cross-references
- ✅ Collection descriptions where available

---

## Risk Assessment

### HIGH RISK

- **No centralized museum directory found**
  - Mitigation: Search alternative sources (tourism websites, regional portals)
  - Fallback: Manual extraction from individual museum websites

### MEDIUM RISK

- **Fragmented data sources** (no single aggregator)
  - Mitigation: Multi-source harvest strategy (archives, libraries, museums separately)
  - Impact: Longer development time

### LOW RISK

- **Website blocking/rate limiting**
  - Mitigation: Proven 1 req/sec rate limiting from Sachsen-Anhalt
  - Impact: Minimal (harvest takes longer but succeeds)

---

## Timeline Estimate

| Phase | Duration | Depends On |
|-------|----------|------------|
| Phase 1: Source Discovery | 2-4 hours | Current session |
| Phase 2: Scraper Development | 4-6 hours | Phase 1 complete |
| Phase 3: Data Enrichment | 6-10 hours | Phase 2 complete |
| Phase 4: Merge & Validation | 2-3 hours | Phase 3 complete |
| **TOTAL** | **14-23 hours** | **Continuous work** |

**Note**: Timeline assumes sources are identified.
If no museum directory exists, add 4-8 hours for alternative sourcing.

---

## Next Immediate Actions

### Action 1: Search for Saxony Museum Directory (PRIORITY 1)

**URLs and queries to test**:
1. https://www.museen-in-sachsen.de/
2. https://www.kulturraum-sachsen.de/
3. https://www.smwk.sachsen.de/museen (Ministry of Culture)
4. Search: "Museumsverband Sachsen" + "Liste" + "Mitglieder"

**Expected outcome**: Find an authoritative source with 300-500 museum listings

---

### Action 2: Extract Saxon State Archives Locations

**Source**: https://www.archiv.sachsen.de/

**Expected data**:
- 6-8 archive locations
- Addresses, phone, email, opening hours
- Holdings descriptions
- ISIL codes (likely format: DE-Dd*, DE-L*, etc.)

**Script to create**: `scripts/scrapers/harvest_sachsen_archives.py`

---

### Action 3: Identify University Libraries

**Search queries**:
1. "TU Dresden Bibliothek" + "SLUB"
2. "Universitätsbibliothek Leipzig"
3. "TU Chemnitz Bibliothek"
4. "TU Bergakademie Freiberg Bibliothek"

**Expected outcome**: 4-6 major university libraries with complete metadata

---

## Questions for User

1. **Should I search for the Saxony museum directory now?**
   - This is CRITICAL for achieving the 300+ institution target

2. **Should I prioritize breadth (all institution types) or depth (museums only)?**
   - Breadth: Harvest all types (museums, archives, libraries) with 90% completeness
   - Depth: Focus on museums with 95%+ completeness (like Sachsen-Anhalt)

3. **Do you have additional Saxony GLAM sources not listed?**
   - Any known museum directories, library consortia, or regional portals?

---

## Session Status

**Current State**: Source analysis complete  
**Blockers**: Need to find Saxony museum directory  
**Ready to proceed with**: Archive harvesting (staatsarchiv.sachsen.de)  
**Awaiting user input**: Confirm next action priority
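
The 90%+ and 95%+ completeness targets in the success criteria are stated without a formula. One possible reading, sketched below, is the share of (record, field) pairs that hold a non-empty value. The field sets `CORE_FIELDS` and `EXTENDED_FIELDS` and the helpers `extract_field` and `completeness` are hypothetical names introduced for illustration; the lookup rules assume records shaped like the SLUB Dresden sample in the data model section and are not a confirmed part of the Sachsen-Anhalt pipeline.

```python
from typing import Any

# Assumption: field sets derived from the success criteria; names follow the
# sample record in the data model section, not a confirmed schema contract.
CORE_FIELDS = ("name", "institution_type", "city", "website")
EXTENDED_FIELDS = CORE_FIELDS + ("street_address", "postal_code")


def extract_field(record: dict[str, Any], field: str) -> Any:
    """Find a field on the record, its first location, or its identifiers."""
    if field == "website":
        # In the sample record the website is stored as an identifier entry.
        for ident in record.get("identifiers", []):
            if ident.get("identifier_scheme") == "Website":
                return ident.get("identifier_value")
        return None
    if field in record:
        return record[field]
    # Address-like fields live on the first entry of `locations`.
    locations = record.get("locations") or [{}]
    return locations[0].get(field)


def completeness(records: list[dict[str, Any]], fields: tuple[str, ...]) -> float:
    """Return the fraction of (record, field) pairs with a non-empty value."""
    if not records:
        return 0.0
    filled = sum(
        1
        for rec in records
        for field in fields
        if extract_field(rec, field) not in (None, "", [])
    )
    return filled / (len(records) * len(fields))


if __name__ == "__main__":
    # Trimmed version of the sample record, used only to exercise the helpers.
    sample = [
        {
            "name": "SLUB Dresden",
            "institution_type": "LIBRARY",
            "locations": [{"city": "Dresden", "postal_code": "01069"}],
            "identifiers": [
                {
                    "identifier_scheme": "Website",
                    "identifier_value": "https://www.slub-dresden.de",
                }
            ],
        }
    ]
    print(f"core: {completeness(sample, CORE_FIELDS):.0%}")
    print(f"extended: {completeness(sample, EXTENDED_FIELDS):.0%}")
```

Averaging over all record-field pairs keeps the metric simple, but it lets a few very complete records mask many sparse ones. If the target is meant per institution, a stricter variant would count only records in which every required field is present.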