Saxony (Sachsen) GLAM Harvest Strategy
Session: 2025-11-20
Status: PLANNING
Target: 400-600 institutions with 95%+ metadata completeness
Source Analysis Results
✅ 1. SLUB Dresden (Digital Collections)
URL: https://digital.slub-dresden.de/kollektionen/
Type: Single institution (State and University Library Dresden)
Status: Accessible
Content: 88,000+ digitized titles in collections
Assessment:
- NOT an institution aggregator - this is SLUB's own digital collection portal
- Focus: Digital objects (manuscripts, photos, maps, newspapers)
- Use case: Extract SLUB Dresden as a single LIBRARY institution
- Metadata: Available (name, address, collections, website)
Action: Manual extraction of SLUB Dresden metadata (1 institution)
❌ 2. Sachsen.digital
URL: http://www.sachsendigital.de/startseite/
Status: 404 (redirects to saxorum.de 404 page)
Assessment: Portal no longer operational or moved
Action: Archive this source (portal defunct)
✅ 3. Saxorum (Regional Studies Portal)
URL: https://www.saxorum.de/
Type: Research database for Saxony regional studies
Status: Accessible
Content: Persons, places, themes, historical resources
Assessment:
- NOT an institution directory - this is a historical research portal
- Focus: Historical persons, places, bibliographies, digitized materials
- No institution listings found in navigation
- Use case: Potential source for institutional history research (secondary)
Action: Low priority for institution harvesting (not a directory)
✅ 4. Sächsisches Staatsarchiv (Saxon State Archives)
URL: https://www.archiv.sachsen.de/
Type: Archive network (multiple locations)
Status: Accessible
Content: State archives across Saxony
Assessment:
- HIGH PRIORITY - State archives are major heritage institutions
- Expected: 6-8 archive locations (Dresden, Leipzig, Chemnitz, Bautzen, Freiberg, Plauen, etc.)
- Metadata available: Addresses, opening hours, contact info, holdings descriptions
Action: Scrape archive locations from archiv.sachsen.de
🔍 5. Museumsverband Sachsen (NOT YET CHECKED)
Expected URL: https://www.museen-in-sachsen.de/
Type: Museum association directory (if it exists)
Status: Not accessible in test (request returned no output); existence unconfirmed
Assessment:
- CRITICAL - This is likely the primary source for Saxony museums
- Expected: 300-500 museum listings with comprehensive metadata
- Similar to Sachsen-Anhalt's museum portal model
Action: PRIORITY 1 - Investigate museumsverband URL and find Saxony museum directory
Missing Sources to Identify
High Priority
- Saxony Museum Association Directory
  - Search for: "Museumsverband Sachsen", "Museen in Sachsen"
  - Expected institutions: 300-500 museums
  - Must have: Museum names, cities, addresses, websites
- University Libraries
  - TU Dresden library
  - Leipzig University library (UB Leipzig)
  - TU Chemnitz library
  - TU Bergakademie Freiberg library
- Major Museums
  - Staatliche Kunstsammlungen Dresden (Dresden State Art Collections)
  - GRASSI Museum Leipzig
  - Museum für Völkerkunde Dresden
  - Deutsches Hygiene-Museum Dresden
- City Archives
  - Stadtarchiv Dresden
  - Stadtarchiv Leipzig
  - Stadtarchiv Chemnitz
Medium Priority
- Specialized Archives
  - Church archives (Evangelisch-Lutherische Landeskirche Sachsen)
  - University archives
  - Corporate archives
Estimated Institution Count
| Institution Type | Estimated Count | Confidence |
|---|---|---|
| Museums | 300-500 | High (based on Sachsen-Anhalt ratio) |
| Archives | 20-30 | Medium (state + city + specialized) |
| Libraries | 40-60 | Medium (public + university + specialized) |
| Galleries | 20-40 | Low (need source identification) |
| Research Centers | 10-20 | Low (need source identification) |
| TOTAL | 390-650 | Medium |
Note: Sachsen-Anhalt (smaller state) yielded 166 institutions. Saxony (larger, more populous) should yield 400-600.
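As a quick check of the table's arithmetic, the sketch below sums the per-type ranges and computes the museum share of the total. The dictionary and function names are illustrative only and not part of the project code.

```python
# Per-type (low, high) estimates copied from the table above.
ESTIMATES = {
    "Museums": (300, 500),
    "Archives": (20, 30),
    "Libraries": (40, 60),
    "Galleries": (20, 40),
    "Research Centers": (10, 20),
}


def total_range(estimates):
    """Sum the low and high bounds across all institution types."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


def museum_share(estimates):
    """Return the museum share of the total at the low and high bounds."""
    low, high = total_range(estimates)
    museums_low, museums_high = estimates["Museums"]
    return museums_low / low, museums_high / high


if __name__ == "__main__":
    print(total_range(ESTIMATES))
    print(museum_share(ESTIMATES))
```

Museums account for roughly 77% of the estimate at both bounds, so the overall total depends mostly on whether a museum directory can be found.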
Harvest Strategy (Priority Order)
Phase 1: Source Discovery (CURRENT)
Status: IN PROGRESS
Tasks:
- ✅ Test provided URLs accessibility
- ✅ Classify sources (aggregator vs. single institution)
- 🔄 Find Saxony museum association directory
- 🔄 Find university library consortium
- 🔄 Identify major museum websites
Next Action: Search for Saxony museum directory
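The accessibility test and source classification in Phase 1 could be sketched roughly as follows. This is a minimal illustration using only the standard library; the status labels, the function names `classify` and `probe`, and the User-Agent string are assumptions, not the project's actual tooling.

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def classify(status, requested_url, final_url):
    """Map an HTTP outcome to a coarse source status (labels are hypothetical)."""
    if status is None:
        return "UNREACHABLE"
    if status >= 400:
        return "DEFUNCT"
    if urlparse(requested_url).hostname != urlparse(final_url).hostname:
        return "MOVED"
    return "ACCESSIBLE"


def probe(url, timeout=10.0):
    """Fetch a URL once and classify the result. Performs a network call."""
    request = Request(url, headers={"User-Agent": "glam-harvest-probe/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return classify(response.status, url, response.geturl())
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses, after following redirects.
        return classify(err.code, url, err.url or url)
    except OSError:
        # Covers URLError (DNS failure, refused connection) and socket timeouts.
        return "UNREACHABLE"
```

A 404 reached through a redirect, as seen with Sachsen.digital, is classified as defunct because the status code is checked before the hostname comparison.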
Phase 2: Scraper Development
Depends on: Phase 1 completion
Tasks:
- Build museum directory scraper (if HTML directory exists)
- Build archive location scraper (archiv.sachsen.de)
- Build library scraper (if consortium website exists)
- Build detail page enrichment scrapers
Reusable from Sachsen-Anhalt:
- Rate limiting: 1 req/sec
- Address extraction patterns (German format)
- LinkML data model
- Merge/deduplication logic
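The 1 req/sec rate limit could look something like the sketch below. The Sachsen-Anhalt implementation is not shown in this document, so the class name `RateLimiter` and its interface are assumptions; the clock and sleep functions are injectable only to make the behavior easy to test.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `limiter.wait()` immediately before each HTTP request, so the first request goes out without delay and later ones are spaced at least one second apart.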
Phase 3: Data Enrichment
Depends on: Phase 2 completion
Tasks:
- Scrape detail pages for full metadata
- Geocode addresses (Nominatim)
- Extract contact info (phone, email)
- Extract ISIL codes (if available)
- Cross-reference with Wikidata
Target Completeness: 95%+ (based on Sachsen-Anhalt success)
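As an illustration of the address and contact extraction steps, the sketch below parses a single-line German address ("Street Number, Postal code City") and collects email addresses. The regular expressions and function names are assumptions, not the patterns used for Sachsen-Anhalt, and they will miss less regular formats (P.O. boxes, multi-line addresses). Phone extraction is left out because German phone formats vary too much for a short pattern.

```python
import re

# Assumed layout: "<street> <house number>, <5-digit postal code> <city>"
ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\s?[a-zA-Z]?)\s*,\s*"
    r"(?P<postal_code>\d{5})\s+(?P<city>.+)$"
)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_address(line):
    """Return address fields named as in the data model, or None if no match."""
    match = ADDRESS_RE.match(line.strip())
    if match is None:
        return None
    return {
        "street_address": f"{match['street']} {match['number']}",
        "postal_code": match["postal_code"],  # kept as a string to preserve leading zeros
        "city": match["city"].strip(),
    }


def find_emails(text):
    """Return all email-like substrings in the order they appear."""
    return EMAIL_RE.findall(text)
```

Keeping the postal code as a string matters for Saxony, where codes such as 01069 begin with a zero.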
Phase 4: Merge & Validation
Depends on: Phase 3 completion
Tasks:
- Merge all sources into unified Saxony dataset
- Deduplicate institutions (fuzzy matching)
- Validate LinkML compliance
- Generate completeness report
- Export final JSON
Output: data/isil/germany/sachsen_complete_[timestamp].json
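The fuzzy deduplication step might be sketched as follows, using `difflib.SequenceMatcher` from the standard library. The 0.9 threshold, the same-city requirement, and the flat `name`/`city` record shape are assumptions chosen for illustration; the existing Sachsen-Anhalt merge logic may work differently.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", name.casefold())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def is_duplicate(a, b, threshold=0.9):
    """Treat two records as duplicates if same city and very similar names."""
    if a["city"].casefold() != b["city"].casefold():
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first record of each duplicate group, preserving input order."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, threshold) for kept in unique):
            unique.append(record)
    return unique
```

The pairwise comparison is quadratic in the number of records, which should be acceptable at the expected scale of a few hundred institutions.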
Technical Architecture
Data Model (LinkML v0.2.2)
- id: https://w3id.org/heritage/custodian/de/slub-dresden
  name: Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
  institution_type: LIBRARY
  alternative_names:
    - SLUB Dresden
    - Saxon State and University Library Dresden
  description: >-
    The Saxon State and University Library Dresden (SLUB) is both the state
    library of Saxony and the university library for TU Dresden. Founded in
    1556, it holds over 9 million volumes.
  locations:
    - city: Dresden
      street_address: Zellescher Weg 18
      postal_code: "01069"
      region: Sachsen
      country: DE
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: DE-D161
      identifier_url: https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-D161
    - identifier_scheme: Wikidata
      identifier_value: Q700566
      identifier_url: https://www.wikidata.org/wiki/Q700566
    - identifier_scheme: Website
      identifier_value: https://www.slub-dresden.de
      identifier_url: https://www.slub-dresden.de
  digital_platforms:
    - platform_name: SLUB Digital Collections
      platform_url: https://digital.slub-dresden.de
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - METS/MODS
        - Dublin Core
  provenance:
    data_source: WEB_SCRAPING
    data_tier: TIER_2_VERIFIED
    extraction_date: "2025-11-20T..."
    extraction_method: "Manual extraction from official website"
    confidence_score: 0.98
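Before LinkML validation, a lightweight shape check on ISIL identifiers can catch obvious extraction mistakes. The sketch below assumes a simplified ISIL shape (two-letter country prefix, hyphen, 1 to 11 characters from letters, digits, colon, slash, and hyphen) and does not check whether a code is actually registered. The function name `invalid_isils` is hypothetical.

```python
import re

# Simplified assumption about the ISIL shape; not a full ISO 15511 validator.
ISIL_RE = re.compile(r"[A-Z]{2}-[A-Za-z0-9:/-]{1,11}")


def invalid_isils(record):
    """Return ISIL values in a record's identifiers that fail the shape check."""
    return [
        ident.get("identifier_value", "")
        for ident in record.get("identifiers", [])
        if ident.get("identifier_scheme") == "ISIL"
        and not ISIL_RE.fullmatch(ident.get("identifier_value", ""))
    ]


if __name__ == "__main__":
    record = {
        "identifiers": [
            {"identifier_scheme": "ISIL", "identifier_value": "DE-D161"},
            {"identifier_scheme": "Wikidata", "identifier_value": "Q700566"},
        ]
    }
    print(invalid_isils(record))
```

Identifiers with other schemes are ignored, so the check can run on any record that follows the structure above.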
Scripts to Create
scripts/scrapers/
├── harvest_sachsen_museums.py (museum directory scraper)
├── harvest_sachsen_archives.py (state archives scraper)
├── harvest_sachsen_libraries.py (library consortium scraper)
├── enrich_sachsen_details.py (detail page metadata enrichment)
└── merge_sachsen_complete.py (merge all sources)
Success Criteria
Minimum Viable Dataset
- ✅ 300+ institutions extracted
- ✅ 90%+ metadata completeness (name, type, city, website)
- ✅ Geographic coverage across all major Saxony cities
- ✅ LinkML schema validation passes
- ✅ Integration-ready for German national dataset v5
Target Dataset (Ideal)
- ✅ 400-600 institutions extracted
- ✅ 95%+ metadata completeness (including addresses, phone, email)
- ✅ ISIL codes for major institutions
- ✅ Wikidata cross-references
- ✅ Collection descriptions where available
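The completeness percentages above could be measured as the share of non-empty required fields across all records. That definition, the flat record shape, and the function names below are assumptions made for this sketch; the field list and thresholds come from the Minimum Viable Dataset criteria.

```python
# Core fields named in the minimum-viable criterion.
REQUIRED_FIELDS = ("name", "institution_type", "city", "website")


def completeness(records, fields=REQUIRED_FIELDS):
    """Fraction of (record, field) cells that hold a non-empty value."""
    if not records:
        return 0.0
    filled = sum(1 for record in records for field in fields if record.get(field))
    return filled / (len(records) * len(fields))


def meets_minimum(records, min_count=300, min_completeness=0.90):
    """Check the count and completeness thresholds of the minimum viable dataset."""
    return len(records) >= min_count and completeness(records) >= min_completeness
```

For example, two records with one empty `website` field have 7 of 8 cells filled, giving a completeness of 0.875, which would fall short of the 90% threshold.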
Risk Assessment
HIGH RISK
- No centralized museum directory found
  - Mitigation: Search alternative sources (tourism websites, regional portals)
  - Fallback: Manual extraction from individual museum websites
MEDIUM RISK
- Fragmented data sources (no single aggregator)
  - Mitigation: Multi-source harvest strategy (archives, libraries, museums separately)
  - Impact: Longer development time
LOW RISK
- Website blocking/rate limiting
  - Mitigation: Proven 1 req/sec rate limiting from Sachsen-Anhalt
  - Impact: Minimal (harvest takes longer but succeeds)
Timeline Estimate
| Phase | Duration | Depends On |
|---|---|---|
| Phase 1: Source Discovery | 2-4 hours | Current session |
| Phase 2: Scraper Development | 4-6 hours | Phase 1 complete |
| Phase 3: Data Enrichment | 6-10 hours | Phase 2 complete |
| Phase 4: Merge & Validation | 2-3 hours | Phase 3 complete |
| TOTAL | 14-23 hours | Continuous work |
Note: Timeline assumes sources are identified. If no museum directory exists, add 4-8 hours for alternative sourcing.
Next Immediate Actions
Action 1: Search for Saxony Museum Directory (PRIORITY 1)
URLs and queries to test:
- https://www.museen-in-sachsen.de/
- https://www.kulturraum-sachsen.de/
- https://www.smwk.sachsen.de/museen (Ministry of Culture)
- Search: "Museumsverband Sachsen" + "Liste" + "Mitglieder"
Expected outcome: Find authoritative source with 300-500 museum listings
Action 2: Extract Saxon State Archives Locations
Source: https://www.archiv.sachsen.de/
Expected data:
- 6-8 archive locations
- Addresses, phone, email, opening hours
- Holdings descriptions
- ISIL codes (likely format: DE-Dd*, DE-L*, etc.)
Script to create: scripts/scrapers/harvest_sachsen_archives.py
Action 3: Identify University Libraries
Search queries:
- "TU Dresden Bibliothek" + "SLUB"
- "Universitätsbibliothek Leipzig"
- "TU Chemnitz Bibliothek"
- "TU Bergakademie Freiberg Bibliothek"
Expected outcome: 4-6 major university libraries with complete metadata
Questions for User
- Should I search for the Saxony museum directory now?
  - This is CRITICAL for achieving the 300+ institution target
- Should I prioritize breadth (all institution types) or depth (museums only)?
  - Breadth: Harvest all types (museums, archives, libraries) with 90% completeness
  - Depth: Focus on museums with 95%+ completeness (like Sachsen-Anhalt)
- Do you have additional Saxony GLAM sources not listed?
  - Any known museum directories, library consortia, or regional portals?
Session Status
Current State: Source analysis complete
Blockers: Need to find Saxony museum directory
Ready to proceed with: Archive harvesting (archiv.sachsen.de)
Awaiting user input: Confirm next action priority