# Saxony (Sachsen) GLAM Harvest Strategy

**Session**: 2025-11-20
**Status**: PLANNING
**Target**: 400-600 institutions with 95%+ metadata completeness

---

## Source Analysis Results

### ✅ 1. SLUB Dresden (Digital Collections)
**URL**: https://digital.slub-dresden.de/kollektionen/
**Type**: Single institution (State and University Library Dresden)
**Status**: Accessible
**Content**: 88,000+ digitized titles in collections

**Assessment**:
- **NOT** an institution aggregator - this is SLUB's own digital collection portal
- Focus: Digital objects (manuscripts, photos, maps, newspapers)
- Use case: Extract SLUB Dresden as a single LIBRARY institution
- Metadata: Available (name, address, collections, website)

**Action**: Manual extraction of SLUB Dresden metadata (1 institution)

---

### ❌ 2. Sachsen.digital
**URL**: http://www.sachsendigital.de/startseite/
**Status**: 404 (redirects to saxorum.de 404 page)
**Assessment**: Portal is no longer operational or has moved

**Action**: Archive this source (portal appears defunct)

---

### ✅ 3. Saxorum (Regional Studies Portal)
**URL**: https://www.saxorum.de/
**Type**: Research database for Saxony regional studies
**Status**: Accessible
**Content**: Persons, places, themes, historical resources

**Assessment**:
- **NOT** an institution directory - this is a historical research portal
- Focus: Historical persons, places, bibliographies, digitized materials
- No institution listings found in navigation
- Use case: Potential source for institutional history research (secondary)

**Action**: Low priority for institution harvesting (not a directory)

---

### ✅ 4. Sächsisches Staatsarchiv (Saxon State Archives)
**URL**: https://www.archiv.sachsen.de/
**Type**: Archive network (multiple locations)
**Status**: Accessible
**Content**: State archives across Saxony

**Assessment**:
- **HIGH PRIORITY** - State archives are major heritage institutions
- Expected: 6-8 archive locations (Dresden, Leipzig, Chemnitz, Bautzen, Freiberg, Plauen, etc.)
- Metadata available: Addresses, opening hours, contact info, holdings descriptions

**Action**: Scrape archive locations from staatsarchiv.sachsen.de

---

### 🔍 5. Museumsverband Sachsen (NOT YET CHECKED)
**Expected URL**: https://www.museen-in-sachsen.de/
**Type**: Museum association directory (if it exists)
**Status**: Not accessible in test (request returned no output)

**Assessment**:
- **CRITICAL** - This is likely the primary source for Saxony museums
- Expected: 300-500 museum listings with comprehensive metadata
- Similar to Sachsen-Anhalt's museum portal model

**Action**: PRIORITY 1 - Investigate the Museumsverband URL and find the Saxony museum directory

---

## Missing Sources to Identify

### High Priority
1. **Saxony Museum Association Directory**
   - Search for: "Museumsverband Sachsen", "Museen in Sachsen"
   - Expected institutions: 300-500 museums
   - Must have: Museum names, cities, addresses, websites

2. **University Libraries**
   - TU Dresden library
   - Leipzig University library (UB Leipzig)
   - TU Chemnitz library
   - TU Bergakademie Freiberg library

3. **Major Museums**
   - Staatliche Kunstsammlungen Dresden (Dresden State Art Collections)
   - GRASSI Museum Leipzig
   - Museum für Völkerkunde Dresden
   - Deutsches Hygiene-Museum Dresden

4. **City Archives**
   - Stadtarchiv Dresden
   - Stadtarchiv Leipzig
   - Stadtarchiv Chemnitz

### Medium Priority
5. **Specialized Archives**
   - Church archives (Evangelisch-Lutherische Landeskirche Sachsen)
   - University archives
   - Corporate archives

---

## Estimated Institution Count

| Institution Type | Estimated Count | Confidence |
|------------------|-----------------|------------|
| Museums | 300-500 | High (based on Sachsen-Anhalt ratio) |
| Archives | 20-30 | Medium (state + city + specialized) |
| Libraries | 40-60 | Medium (public + university + specialized) |
| Galleries | 20-40 | Low (need source identification) |
| Research Centers | 10-20 | Low (need source identification) |
| **TOTAL** | **390-650** | **Medium** |

**Note**: Sachsen-Anhalt (smaller state) yielded 166 institutions. Saxony (larger, more populous) should yield 400-600.

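
The TOTAL row is the sum of the per-type lower and upper bounds. A minimal sketch of that arithmetic, with the ranges copied from the table (the names `ESTIMATES` and `total_range` are illustrative, not part of any existing script):

```python
# Per-type estimates copied from the table above, as (low, high) pairs.
ESTIMATES = {
    "Museums": (300, 500),
    "Archives": (20, 30),
    "Libraries": (40, 60),
    "Galleries": (20, 40),
    "Research Centers": (10, 20),
}


def total_range(estimates):
    """Sum the lower bounds and the upper bounds independently."""
    low = sum(low for low, _ in estimates.values())
    high = sum(high for _, high in estimates.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(ESTIMATES)
    print(f"Total: {low}-{high}")
```

Summing the bounds this way reproduces the 390-650 range in the table; the 400-600 target sits inside it.
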
---

## Harvest Strategy (Priority Order)

### Phase 1: Source Discovery (CURRENT)
**Status**: IN PROGRESS
**Tasks**:
1. ✅ Test accessibility of the provided URLs
2. ✅ Classify sources (aggregator vs. single institution)
3. 🔄 Find Saxony museum association directory
4. 🔄 Find university library consortium
5. 🔄 Identify major museum websites

**Next Action**: Search for Saxony museum directory

---

### Phase 2: Scraper Development
**Depends on**: Phase 1 completion
**Tasks**:
1. Build museum directory scraper (if HTML directory exists)
2. Build archive location scraper (staatsarchiv.sachsen.de)
3. Build library scraper (if consortium website exists)
4. Build detail page enrichment scrapers

**Reusable from Sachsen-Anhalt**:
- Rate limiting: 1 req/sec
- Address extraction patterns (German format)
- LinkML data model
- Merge/deduplication logic

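
As an illustration of the address extraction item, the sketch below parses a single-line German address of the form `Street Number, PLZ City`. The pattern and the function name `parse_german_address` are assumptions for illustration; the actual Sachsen-Anhalt patterns are not shown in this document and may differ.

```python
import re

# Assumed input shape: "<street> <house number>, <5-digit postal code> <city>".
ADDRESS_RE = re.compile(
    r"""
    ^\s*(?P<street>.+?)\s+                    # street name, may contain spaces
    (?P<house_number>\d+(?:\s?[a-zA-Z])?)     # house number, optional letter (12a)
    \s*,\s*
    (?P<postal_code>\d{5})\s+                 # German postal codes have five digits
    (?P<city>.+?)\s*$                         # city name up to end of line
    """,
    re.VERBOSE,
)


def parse_german_address(text):
    """Return street_address, postal_code and city, or None if no match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    return {
        "street_address": f"{match['street']} {match['house_number']}",
        "postal_code": match["postal_code"],
        "city": match["city"],
    }
```

The postal code is kept as a string so that leading zeros, common in Saxony (for example `01069`), are not lost.
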
---

### Phase 3: Data Enrichment
**Depends on**: Phase 2 completion
**Tasks**:
1. Scrape detail pages for full metadata
2. Geocode addresses (Nominatim)
3. Extract contact info (phone, email)
4. Extract ISIL codes (if available)
5. Cross-reference with Wikidata

**Target Completeness**: 95%+ (based on Sachsen-Anhalt success)

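
For the geocoding task, a request to the public Nominatim search endpoint could be built as sketched below. This assumes the public OpenStreetMap instance and its structured query parameters (`street`, `postalcode`, `city`); the function name `build_geocode_url` is hypothetical, and the sketch only builds the URL without sending it.

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint (assumed; a self-hosted instance would differ).
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(street_address, postal_code, city, country_code="de"):
    """Build a structured Nominatim search URL for one institution address."""
    params = {
        "street": street_address,
        "postalcode": postal_code,
        "city": city,
        "countrycodes": country_code,  # restrict results to Germany
        "format": "jsonv2",
        "limit": 1,                    # only the best match is needed
    }
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"
```

The public instance expects an identifying `User-Agent` header and at most one request per second, which lines up with the 1 req/sec rate limit already planned for the scrapers.
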
---

### Phase 4: Merge & Validation
**Depends on**: Phase 3 completion
**Tasks**:
1. Merge all sources into unified Saxony dataset
2. Deduplicate institutions (fuzzy matching)
3. Validate LinkML compliance
4. Generate completeness report
5. Export final JSON

**Output**: `data/isil/germany/sachsen_complete_[timestamp].json`

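
The fuzzy-matching step could look roughly like the following sketch, which uses `difflib.SequenceMatcher` from the standard library. The 0.9 threshold, the same-city precondition, and the flat `name`/`city` record shape are assumptions for illustration, not the existing Sachsen-Anhalt merge logic.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Case-fold and collapse whitespace so trivial differences do not count."""
    return " ".join(name.casefold().split())


def is_duplicate(a, b, threshold=0.9):
    """Treat two records as duplicates if they share a city and have similar names."""
    if a.get("city", "").casefold() != b.get("city", "").casefold():
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first occurrence of each institution, dropping later near-duplicates."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, threshold) for kept in unique):
            unique.append(record)
    return unique
```

This pairwise comparison is quadratic in the number of records, which should be acceptable for a few hundred institutions; comparing only within the same city would be an obvious refinement.
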
---

## Technical Architecture

### Data Model (LinkML v0.2.2)
```yaml
- id: https://w3id.org/heritage/custodian/de/slub-dresden
  name: Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
  institution_type: LIBRARY
  alternative_names:
    - SLUB Dresden
    - Saxon State and University Library Dresden
  description: >-
    The Saxon State and University Library Dresden (SLUB) is both the state
    library of Saxony and the university library for TU Dresden. Founded in
    1556, it holds over 9 million volumes.

  locations:
    - city: Dresden
      street_address: Zellescher Weg 18
      postal_code: "01069"
      region: Sachsen
      country: DE

  identifiers:
    - identifier_scheme: ISIL
      identifier_value: DE-D161
      identifier_url: https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-D161
    - identifier_scheme: Wikidata
      identifier_value: Q700566
      identifier_url: https://www.wikidata.org/wiki/Q700566
    - identifier_scheme: Website
      identifier_value: https://www.slub-dresden.de
      identifier_url: https://www.slub-dresden.de

  digital_platforms:
    - platform_name: SLUB Digital Collections
      platform_url: https://digital.slub-dresden.de
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - METS/MODS
        - Dublin Core

  provenance:
    data_source: WEB_SCRAPING
    data_tier: TIER_2_VERIFIED
    extraction_date: "2025-11-20T..."
    extraction_method: "Manual extraction from official website"
    confidence_score: 0.98
```

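
Identifiers in records shaped like the one above could be sanity-checked before export. The sketch below tests only the syntax of ISIL values, assuming the general ISIL shape of a country prefix, a hyphen, and a unit identifier of up to 11 characters (letters, digits, `:`, `/`, `-`). It does not confirm that a code is actually registered, and the helper names are hypothetical.

```python
import re

# Assumed syntax for German ISILs: "DE-" plus 1 to 11 unit-identifier characters.
ISIL_DE_RE = re.compile(r"^DE-[A-Za-z0-9:/-]{1,11}$")


def isil_values(record):
    """Collect all identifier values whose scheme is ISIL."""
    return [
        item["identifier_value"]
        for item in record.get("identifiers", [])
        if item.get("identifier_scheme") == "ISIL"
    ]


def has_plausible_isil(record):
    """True if the record has at least one ISIL and all of them match the pattern."""
    values = isil_values(record)
    return bool(values) and all(ISIL_DE_RE.match(value) for value in values)
```

A record without any ISIL returns `False` here, so the same helper can feed the "ISIL codes for major institutions" check in the completeness report.
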
### Scripts to Create
```
scripts/scrapers/
├── harvest_sachsen_museums.py      (museum directory scraper)
├── harvest_sachsen_archives.py     (state archives scraper)
├── harvest_sachsen_libraries.py    (library consortium scraper)
├── enrich_sachsen_details.py       (detail page metadata enrichment)
└── merge_sachsen_complete.py       (merge all sources)
```

---

## Success Criteria

### Minimum Viable Dataset
- ✅ 300+ institutions extracted
- ✅ 90%+ metadata completeness (name, type, city, website)
- ✅ Geographic coverage across all major Saxony cities
- ✅ LinkML schema validation passes
- ✅ Integration-ready for German national dataset v5

### Target Dataset (Ideal)
- ✅ 400-600 institutions extracted
- ✅ 95%+ metadata completeness (including addresses, phone, email)
- ✅ ISIL codes for major institutions
- ✅ Wikidata cross-references
- ✅ Collection descriptions where available

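
The completeness thresholds above need a concrete metric. One possible definition, sketched here as an assumption, is the share of required fields that hold a non-empty value across all records. The flat field names (`website`, `phone`, `email`, and so on) are hypothetical and would have to be mapped from the nested LinkML record.

```python
# Field sets mirroring the two criteria above (names are illustrative).
MINIMUM_FIELDS = ("name", "institution_type", "city", "website")
TARGET_FIELDS = MINIMUM_FIELDS + ("street_address", "phone", "email")


def completeness(records, fields):
    """Fraction of (record, field) pairs that hold a non-empty value."""
    if not records:
        return 0.0
    filled = sum(1 for record in records for field in fields if record.get(field))
    return filled / (len(records) * len(fields))


def meets_threshold(records, fields, threshold):
    """True if the fill rate over the given fields reaches the threshold."""
    return completeness(records, fields) >= threshold
```

With this definition, `meets_threshold(records, MINIMUM_FIELDS, 0.90)` would correspond to the minimum viable dataset and `meets_threshold(records, TARGET_FIELDS, 0.95)` to the target dataset.
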
---

## Risk Assessment

### HIGH RISK
- **No centralized museum directory found**
  - Mitigation: Search alternative sources (tourism websites, regional portals)
  - Fallback: Manual extraction from individual museum websites

### MEDIUM RISK
- **Fragmented data sources** (no single aggregator)
  - Mitigation: Multi-source harvest strategy (archives, libraries, museums separately)
  - Impact: Longer development time

### LOW RISK
- **Website blocking/rate limiting**
  - Mitigation: Proven 1 req/sec rate limiting from Sachsen-Anhalt
  - Impact: Minimal (harvest takes longer but succeeds)

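
The 1 req/sec mitigation can be expressed as a small helper that enforces a minimum interval between requests. This is a sketch under the assumption of a single-threaded scraper; the class name `RateLimiter` is illustrative and not taken from the Sachsen-Anhalt code. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `limiter.wait()` immediately before each HTTP request, so a slow response naturally shortens or removes the following pause.
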
---

## Timeline Estimate

| Phase | Duration | Depends On |
|-------|----------|------------|
| Phase 1: Source Discovery | 2-4 hours | Current session |
| Phase 2: Scraper Development | 4-6 hours | Phase 1 complete |
| Phase 3: Data Enrichment | 6-10 hours | Phase 2 complete |
| Phase 4: Merge & Validation | 2-3 hours | Phase 3 complete |
| **TOTAL** | **14-23 hours** | **Continuous work** |

**Note**: Timeline assumes sources are identified. If no museum directory exists, add 4-8 hours for alternative sourcing.

---

## Next Immediate Actions

### Action 1: Search for Saxony Museum Directory (PRIORITY 1)
**URLs and queries to test**:
1. https://www.museen-in-sachsen.de/
2. https://www.kulturraum-sachsen.de/
3. https://www.smwk.sachsen.de/museen (Ministry of Culture)
4. Search: "Museumsverband Sachsen" + "Liste" + "Mitglieder"

**Expected outcome**: Find authoritative source with 300-500 museum listings

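
The candidate URLs above could be checked with a small probe that records the same kind of status used in the source analysis (accessible, 404, redirected). This is a minimal sketch using only the standard library; the status labels, the `User-Agent` string, and the function names are assumptions.

```python
import urllib.error
import urllib.request


def classify(status, requested_url, final_url):
    """Map an HTTP outcome to a coarse accessibility label."""
    if status is None:
        return "unreachable"
    if status == 404:
        return "not_found"
    if 200 <= status < 300:
        return "redirected" if final_url != requested_url else "accessible"
    return "error"


def probe(url, timeout=10.0):
    """Fetch a URL once and classify the result (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "glam-harvest-probe/0.1"}  # hypothetical UA
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, url, response.url)
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses; must be caught before the broader OSError.
        return classify(err.code, url, url)
    except OSError:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return classify(None, url, url)
```

Keeping `classify` separate from `probe` means the labeling rules can be adjusted and checked without touching the network.
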
---

### Action 2: Extract Saxon State Archives Locations
**Source**: https://www.archiv.sachsen.de/
**Expected data**:
- 6-8 archive locations
- Addresses, phone, email, opening hours
- Holdings descriptions
- ISIL codes (likely format: DE-Dd*, DE-L*, etc.)

**Script to create**: `scripts/scrapers/harvest_sachsen_archives.py`

---

### Action 3: Identify University Libraries
**Search queries**:
1. "TU Dresden Bibliothek" + "SLUB"
2. "Universitätsbibliothek Leipzig"
3. "TU Chemnitz Bibliothek"
4. "TU Bergakademie Freiberg Bibliothek"

**Expected outcome**: 4-6 major university libraries with complete metadata

---

## Questions for User

1. **Should I search for the Saxony museum directory now?**
   - This is CRITICAL for achieving the 300+ institution target

2. **Should I prioritize breadth (all institution types) or depth (museums only)?**
   - Breadth: Harvest all types (museums, archives, libraries) with 90% completeness
   - Depth: Focus on museums with 95%+ completeness (like Sachsen-Anhalt)

3. **Do you have additional Saxony GLAM sources not listed?**
   - Any known museum directories, library consortia, or regional portals?

---

## Session Status

**Current State**: Source analysis complete
**Blockers**: Need to find Saxony museum directory
**Ready to proceed with**: Archive harvesting (staatsarchiv.sachsen.de)

**Awaiting user input**: Confirm next action priority