glam/SAXONY_HARVEST_STRATEGY.md
2025-11-21 22:12:33 +01:00


# Saxony (Sachsen) GLAM Harvest Strategy
**Session**: 2025-11-20
**Status**: PLANNING
**Target**: 400-600 institutions with 95%+ metadata completeness
---
## Source Analysis Results
### ✅ 1. SLUB Dresden (Digital Collections)
**URL**: https://digital.slub-dresden.de/kollektionen/
**Type**: Single institution (State and University Library Dresden)
**Status**: Accessible
**Content**: 88,000+ digitized titles in collections
**Assessment**:
- **NOT** an institution aggregator - this is SLUB's own digital collection portal
- Focus: Digital objects (manuscripts, photos, maps, newspapers)
- Use case: Extract SLUB Dresden as a single LIBRARY institution
- Metadata: Available (name, address, collections, website)
**Action**: Manual extraction of SLUB Dresden metadata (1 institution)
---
### ❌ 2. Sachsen.digital
**URL**: http://www.sachsendigital.de/startseite/
**Status**: 404 (redirects to saxorum.de 404 page)
**Assessment**: Portal no longer operational or moved
**Action**: Archive this source (portal defunct)
---
### ✅ 3. Saxorum (Regional Studies Portal)
**URL**: https://www.saxorum.de/
**Type**: Research database for Saxony regional studies
**Status**: Accessible
**Content**: Persons, places, themes, historical resources
**Assessment**:
- **NOT** an institution directory - this is a historical research portal
- Focus: Historical persons, places, bibliographies, digitized materials
- No institution listings found in navigation
- Use case: Potential source for institutional history research (secondary)
**Action**: Low priority for institution harvesting (not a directory)
---
### ✅ 4. Sächsisches Staatsarchiv (Saxon State Archives)
**URL**: https://www.archiv.sachsen.de/
**Type**: Archive network (multiple locations)
**Status**: Accessible
**Content**: State archives across Saxony
**Assessment**:
- **HIGH PRIORITY** - State archives are major heritage institutions
- Expected: 6-8 archive locations (Dresden, Leipzig, Chemnitz, Bautzen, Freiberg, Plauen, etc.)
- Metadata available: Addresses, opening hours, contact info, holdings descriptions
**Action**: Scrape archive locations from staatsarchiv.sachsen.de
---
### 🔍 5. Museumsverband Sachsen (NOT YET CHECKED)
**Expected URL**: https://www.museen-in-sachsen.de/
**Type**: Museum association directory (if it exists)
**Status**: Not reachable in test (the request returned no output)
**Assessment**:
- **CRITICAL** - This is likely the primary source for Saxony museums
- Expected: 300-500 museum listings with comprehensive metadata
- Similar to Sachsen-Anhalt's museum portal model
**Action**: PRIORITY 1 - Investigate museumsverband URL and find Saxony museum directory
---
## Missing Sources to Identify
### High Priority
1. **Saxony Museum Association Directory**
   - Search for: "Museumsverband Sachsen", "Museen in Sachsen"
   - Expected institutions: 300-500 museums
   - Must have: Museum names, cities, addresses, websites
2. **University Libraries**
   - TU Dresden library
   - Leipzig University library (UB Leipzig)
   - TU Chemnitz library
   - TU Bergakademie Freiberg library
3. **Major Museums**
   - Staatliche Kunstsammlungen Dresden (Dresden State Art Collections)
   - GRASSI Museum Leipzig
   - Museum für Völkerkunde Dresden
   - Deutsches Hygiene-Museum Dresden
4. **City Archives**
   - Stadtarchiv Dresden
   - Stadtarchiv Leipzig
   - Stadtarchiv Chemnitz
### Medium Priority
5. **Specialized Archives**
   - Church archives (Evangelisch-Lutherische Landeskirche Sachsen)
   - University archives
   - Corporate archives
---
## Estimated Institution Count
| Institution Type | Estimated Count | Confidence |
|------------------|-----------------|------------|
| Museums | 300-500 | High (based on Sachsen-Anhalt ratio) |
| Archives | 20-30 | Medium (state + city + specialized) |
| Libraries | 40-60 | Medium (public + university + specialized) |
| Galleries | 20-40 | Low (need source identification) |
| Research Centers | 10-20 | Low (need source identification) |
| **TOTAL** | **390-650** | **Medium** |
**Note**: Sachsen-Anhalt (smaller state) yielded 166 institutions. Saxony (larger, more populous) should yield 400-600.
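The TOTAL row is the column-wise sum of the per-type ranges. A quick check, with the figures copied from the table above (the names `ESTIMATES` and `total_range` are illustrative only):

```python
# Estimated (low, high) counts per institution type, copied from the table.
ESTIMATES = {
    "Museums": (300, 500),
    "Archives": (20, 30),
    "Libraries": (40, 60),
    "Galleries": (20, 40),
    "Research Centers": (10, 20),
}


def total_range(estimates):
    """Sum the lower and upper bounds separately."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


print(total_range(ESTIMATES))  # (390, 650)
```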
---
## Harvest Strategy (Priority Order)
### Phase 1: Source Discovery (CURRENT)
**Status**: IN PROGRESS
**Tasks**:
1. ✅ Test provided URLs accessibility
2. ✅ Classify sources (aggregator vs. single institution)
3. 🔄 Find Saxony museum association directory
4. 🔄 Find university library consortium
5. 🔄 Identify major museum websites
**Next Action**: Search for Saxony museum directory
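The accessibility test in task 1 can be sketched with the standard library alone. This is a minimal sketch, not the script used in this session: the function name `check_url`, the label strings, and the injectable `opener` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return a coarse accessibility label for a candidate source URL.

    `opener` defaults to urllib.request.urlopen, which follows redirects,
    so a redirect that ends on a 404 page is reported as NOT_FOUND.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses; must be caught before URLError.
        status = err.code
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, refused connection, timeout, etc.
        return "UNREACHABLE"
    if status == 404:
        return "NOT_FOUND"
    if 200 <= status < 300:
        return "ACCESSIBLE"
    return f"HTTP_{status}"


# Usage (performs a real request):
#   check_url("https://www.saxorum.de/")
```

Classifying a source as aggregator versus single institution (task 2) still needs a manual look at the page; the check above only covers reachability.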
---
### Phase 2: Scraper Development
**Depends on**: Phase 1 completion
**Tasks**:
1. Build museum directory scraper (if HTML directory exists)
2. Build archive location scraper (staatsarchiv.sachsen.de)
3. Build library scraper (if consortium website exists)
4. Build detail page enrichment scrapers
**Reusable from Sachsen-Anhalt**:
- Rate limiting: 1 req/sec
- Address extraction patterns (German format)
- LinkML data model
- Merge/deduplication logic
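As an illustration of the German address pattern, here is a minimal sketch. It is not the Sachsen-Anhalt code, which is not reproduced in this document; the regex and the function name `parse_german_address` are assumptions, and the output keys follow the `locations` fields of the data model.

```python
import re

# Assumed layout: "<street> <house number>[letter], <5-digit postal code> <city>".
ADDRESS_RE = re.compile(
    r"(?P<street_address>[^,\n]+?\s\d+\s?[a-zA-Z]?)"  # street name + house number
    r"\s*,?\s*"                                        # optional comma separator
    r"(?P<postal_code>\d{5})\s+"                       # German postal codes have 5 digits
    r"(?P<city>[^,\n]+)"                               # city name up to comma/newline
)


def parse_german_address(text):
    """Return street_address, postal_code and city, or None if nothing matches."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


print(parse_german_address("Zellescher Weg 18, 01069 Dresden"))
```

The postal code stays a string so that leading zeros (common in Saxony) survive. Text preceding the street on the same line, such as a label, would end up in `street_address`, so input should be trimmed to the address line first.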
---
### Phase 3: Data Enrichment
**Depends on**: Phase 2 completion
**Tasks**:
1. Scrape detail pages for full metadata
2. Geocode addresses (Nominatim)
3. Extract contact info (phone, email)
4. Extract ISIL codes (if available)
5. Cross-reference with Wikidata
**Target Completeness**: 95%+ (based on Sachsen-Anhalt success)
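Tasks 3 and 4 could start from a few regular expressions applied to the text of a detail page. The patterns below are a rough sketch under assumptions: German phone numbers starting with `+49` or `0`, and ISIL codes with the `DE-` prefix. The sample phone number and email address are made up.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Starts with +49 or 0, then digits and common separators, ending on a digit.
PHONE_RE = re.compile(r"(?:\+49|0)[\d\s/()-]{6,}\d")
# German ISIL: "DE-" followed by up to 11 identifier characters.
ISIL_RE = re.compile(r"\bDE-[A-Za-z0-9:/-]{1,11}\b")


def extract_contact_info(text):
    """Return the first email, phone number and ISIL found in text (None if absent)."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(0).strip() if match else None

    return {
        "email": first(EMAIL_RE),
        "phone": first(PHONE_RE),
        "isil": first(ISIL_RE),
    }


sample = "Tel.: +49 351 000000, E-Mail: info@example.org, ISIL: DE-D161"
print(extract_contact_info(sample))
```

Only the first hit per field is kept; pages listing several departments would need per-section parsing instead.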
---
### Phase 4: Merge & Validation
**Depends on**: Phase 3 completion
**Tasks**:
1. Merge all sources into unified Saxony dataset
2. Deduplicate institutions (fuzzy matching)
3. Validate LinkML compliance
4. Generate completeness report
5. Export final JSON
**Output**: `data/isil/germany/sachsen_complete_[timestamp].json`
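The fuzzy deduplication step could look roughly like this. It is a sketch using `difflib` from the standard library, not the existing Sachsen-Anhalt merge logic; the 0.9 threshold, the same-city requirement, and the flat `name`/`city` record shape are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Case-fold, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.casefold().replace("-", " ").split())


def is_duplicate(a, b, threshold=0.9):
    """Two records match if they share a city and have near-identical names."""
    if a["city"].casefold() != b["city"].casefold():
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep the first occurrence of each institution, in input order."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, threshold) for kept in unique):
            unique.append(record)
    return unique


records = [
    {"name": "Stadtarchiv Dresden", "city": "Dresden"},
    {"name": "Stadtarchiv  Dresden ", "city": "dresden"},
    {"name": "Stadtarchiv Leipzig", "city": "Leipzig"},
]
print(len(deduplicate(records)))  # 2
```

The pairwise comparison is quadratic in the number of records, which should be acceptable for a dataset of a few hundred institutions.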
---
## Technical Architecture
### Data Model (LinkML v0.2.2)
```yaml
- id: https://w3id.org/heritage/custodian/de/slub-dresden
  name: Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden
  institution_type: LIBRARY
  alternative_names:
    - SLUB Dresden
    - Saxon State and University Library Dresden
  description: >-
    The Saxon State and University Library Dresden (SLUB) is both the state
    library of Saxony and the university library for TU Dresden. Founded in
    1556, it holds over 9 million volumes.
  locations:
    - city: Dresden
      street_address: Zellescher Weg 18
      postal_code: "01069"
      region: Sachsen
      country: DE
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: DE-D161
      identifier_url: https://sigel.staatsbibliothek-berlin.de/suche/?isil=DE-D161
    - identifier_scheme: Wikidata
      identifier_value: Q700566
      identifier_url: https://www.wikidata.org/wiki/Q700566
    - identifier_scheme: Website
      identifier_value: https://www.slub-dresden.de
      identifier_url: https://www.slub-dresden.de
  digital_platforms:
    - platform_name: SLUB Digital Collections
      platform_url: https://digital.slub-dresden.de
      platform_type: DISCOVERY_PORTAL
      metadata_standards:
        - METS/MODS
        - Dublin Core
  provenance:
    data_source: WEB_SCRAPING
    data_tier: TIER_2_VERIFIED
    extraction_date: "2025-11-20T..."
    extraction_method: "Manual extraction from official website"
    confidence_score: 0.98
```
### Scripts to Create
```
scripts/scrapers/
├── harvest_sachsen_museums.py (museum directory scraper)
├── harvest_sachsen_archives.py (state archives scraper)
├── harvest_sachsen_libraries.py (library consortium scraper)
├── enrich_sachsen_details.py (detail page metadata enrichment)
└── merge_sachsen_complete.py (merge all sources)
```
---
## Success Criteria
### Minimum Viable Dataset
- ✅ 300+ institutions extracted
- ✅ 90%+ metadata completeness (name, type, city, website)
- ✅ Geographic coverage across all major Saxony cities
- ✅ LinkML schema validation passes
- ✅ Integration-ready for German national dataset v5
### Target Dataset (Ideal)
- ✅ 400-600 institutions extracted
- ✅ 95%+ metadata completeness (including addresses, phone, email)
- ✅ ISIL codes for major institutions
- ✅ Wikidata cross-references
- ✅ Collection descriptions where available
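The completeness percentages need a concrete definition before they can be reported. One possible reading, assumed here, is the share of non-empty values across the listed fields over all records. The flat record shape and the names `CORE_FIELDS` and `completeness` are illustrative, not part of the LinkML model.

```python
# Fields named in the minimum viable criterion (name, type, city, website).
CORE_FIELDS = ("name", "institution_type", "city", "website")


def completeness(records, fields=CORE_FIELDS):
    """Fraction of (record, field) pairs that hold a non-empty value."""
    if not records:
        return 0.0
    filled = sum(1 for record in records for field in fields if record.get(field))
    return filled / (len(records) * len(fields))


records = [
    {"name": "SLUB Dresden", "institution_type": "LIBRARY",
     "city": "Dresden", "website": "https://www.slub-dresden.de"},
    {"name": "Stadtarchiv Leipzig", "institution_type": "ARCHIVE",
     "city": "Leipzig", "website": ""},
]
print(f"{completeness(records):.1%}")  # 87.5%
```

For the target dataset, the same function can be called with a longer `fields` tuple that includes address, phone and email.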
---
## Risk Assessment
### HIGH RISK
- **No centralized museum directory found**
  - Mitigation: Search alternative sources (tourism websites, regional portals)
  - Fallback: Manual extraction from individual museum websites
### MEDIUM RISK
- **Fragmented data sources** (no single aggregator)
  - Mitigation: Multi-source harvest strategy (archives, libraries, museums separately)
  - Impact: Longer development time
### LOW RISK
- **Website blocking/rate limiting**
  - Mitigation: Proven 1 req/sec rate limiting from Sachsen-Anhalt
  - Impact: Minimal (harvest takes longer but succeeds)
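A 1 req/sec limit can be enforced with a small helper that sleeps until the minimum interval since the previous request has passed. This is a sketch of the idea rather than the Sachsen-Anhalt implementation; the class name `RateLimiter` and its injectable `clock` and `sleep` parameters are assumptions.

```python
import time


class RateLimiter:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call limiter.wait() immediately before each HTTP request.
limiter = RateLimiter(min_interval=1.0)
```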
---
## Timeline Estimate
| Phase | Duration | Depends On |
|-------|----------|------------|
| Phase 1: Source Discovery | 2-4 hours | Current session |
| Phase 2: Scraper Development | 4-6 hours | Phase 1 complete |
| Phase 3: Data Enrichment | 6-10 hours | Phase 2 complete |
| Phase 4: Merge & Validation | 2-3 hours | Phase 3 complete |
| **TOTAL** | **14-23 hours** | **Continuous work** |
**Note**: The timeline assumes sources have been identified. If no museum directory exists, add 4-8 hours for alternative sourcing (18-31 hours total).
---
## Next Immediate Actions
### Action 1: Search for Saxony Museum Directory (PRIORITY 1)
**URLs and queries to test**:
1. https://www.museen-in-sachsen.de/
2. https://www.kulturraum-sachsen.de/
3. https://www.smwk.sachsen.de/museen (Ministry of Culture)
4. Search: "Museumsverband Sachsen" + "Liste" + "Mitglieder"
**Expected outcome**: Find authoritative source with 300-500 museum listings
---
### Action 2: Extract Saxon State Archives Locations
**Source**: https://www.archiv.sachsen.de/
**Expected data**:
- 6-8 archive locations
- Addresses, phone, email, opening hours
- Holdings descriptions
- ISIL codes (likely format: DE-Dd*, DE-L*, etc.)
**Script to create**: `scripts/scrapers/harvest_sachsen_archives.py`
---
### Action 3: Identify University Libraries
**Search queries**:
1. "TU Dresden Bibliothek" + "SLUB"
2. "Universitätsbibliothek Leipzig"
3. "TU Chemnitz Bibliothek"
4. "TU Bergakademie Freiberg Bibliothek"
**Expected outcome**: 4-6 major university libraries with complete metadata
---
## Questions for User
1. **Should I search for the Saxony museum directory now?**
   - This is CRITICAL for reaching the 300+ institution target
2. **Should I prioritize breadth (all institution types) or depth (museums only)?**
   - Breadth: Harvest all types (museums, archives, libraries) with 90% completeness
   - Depth: Focus on museums with 95%+ completeness (like Sachsen-Anhalt)
3. **Do you have additional Saxony GLAM sources not listed?**
   - Any known museum directories, library consortia, or regional portals?
---
## Session Status
**Current State**: Source analysis complete
**Blockers**: Need to find Saxony museum directory
**Ready to proceed with**: Archive harvesting (staatsarchiv.sachsen.de)
**Awaiting user input**: Confirm next action priority