# German Archive Completeness Plan - Archivportal-D Integration
**Date**: November 19, 2025
**Status**: Discovery → Planning → Implementation
**Goal**: Achieve 100% coverage of German archives
---
## Current Status
### What We Have ✅
- **16,979 ISIL-registered institutions** (Tier 1 data)
- Libraries, archives, museums
- Excellent metadata (87% geocoded)
- Source: Deutsche Nationalbibliothek SRU API
### What We're Missing ⚠️
- **~5,000-10,000 archives without ISIL codes**
- Smaller municipal archives
- Specialized archives (family, business, church)
- Newly founded archives
- Archives choosing not to register for ISIL
### Gap Example: North Rhine-Westphalia
- **ISIL registry**: 301 archives (63%)
- **archive.nrw.de portal**: 477 archives (100%)
- **Missing**: 176 archives (37%)
---
## Solution: Archivportal-D Harvest
### What is Archivportal-D?
**URL**: https://www.archivportal-d.de/
**Operator**: Deutsche Digitale Bibliothek (German Digital Library)
**Coverage**: ALL archives across Germany (national aggregator)
**Key Features**:
- **16 federal states** - Complete national coverage
- **~10,000-20,000 archives** - Comprehensive archive listings
- **9 archive sectors** - State, municipal, church, business, etc.
- **API access available** - Machine-readable data via DDB API
- **CC0 metadata** - Open data license for reuse
- **Actively maintained** - Updated by national library infrastructure
---
## Archive Sectors in Archivportal-D
1. **State archives** (Landesarchive) - Federal state archives
2. **Local/municipal archives** (Kommunalarchive) - City, county archives
3. **Church archives** (Kirchenarchive) - Catholic, Protestant, Jewish
4. **Nobility and family archives** (Adelsarchive, Familienarchive)
5. **Business archives** (Wirtschaftsarchive) - Corporate, economic
6. **Political archives** (Politische Archive) - Parties, movements, foundations
7. **Media archives** (Medienarchive) - Broadcasting, film, press
8. **University archives** (Hochschularchive) - Academic institutions
9. **Other archives** (Sonstige Archive) - Specialized collections
---
## DDB API Access
### API Documentation
**URL**: https://api.deutsche-digitale-bibliothek.de/
**Format**: REST API with OpenAPI 3.0 specification
**Authentication**: API key (free for registered users)
**License**: CC0 for metadata
### Registration Process
1. Create account at https://www.deutsche-digitale-bibliothek.de/
2. Log in to "My DDB" (Meine DDB)
3. Generate API key in account settings
4. Use API key in requests (Authorization header)
### API Endpoints (Relevant for Archives)
```http
# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
    ?query=*
    &sector=arc_archives
    &rows=100
    &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
    ?facetValues[]=federalState-Nordrhein-Westfalen
    &sector=arc_archives
```
### Response Format
```json
{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123" // if available
    }
  ]
}
```
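As a sketch, one raw result from the response above could be normalized into a flat record like this. The field names follow the illustrative example; the live DDB response schema may differ:

```python
def normalize_result(raw):
    """Map one raw API result (shape as in the example above) to a flat record."""
    has_coords = raw.get("latitude") and raw.get("longitude")
    return {
        "id": raw.get("id"),
        "name": raw.get("title"),
        "state": raw.get("federalState"),
        "city": raw.get("place"),
        "coordinates": (float(raw["latitude"]), float(raw["longitude"])) if has_coords else None,
        "isil": raw.get("isil"),  # None for archives without an ISIL code
    }
```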
---
## Harvest Strategy - Three-Phase Approach
### Phase 1: API Harvest (RECOMMENDED) 🚀
**Method**: Use DDB API to fetch all archive records
**Estimated time**: 1-2 hours
**Coverage**: ~10,000-20,000 archives
```python
import time
import requests

API_URL = "https://api.deutsche-digitale-bibliothek.de/search"

def harvest_archivportal_d(api_key):
    """Fetch all archive records from the DDB API, paging until exhausted."""
    archives = []
    offset = 0
    batch_size = 100
    while True:
        response = requests.get(
            API_URL,
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        archives.extend(results)
        if len(results) < batch_size:
            break  # last page reached
        offset += batch_size
        time.sleep(0.5)  # rate limiting
    return archives
```
### Phase 2: Web Scraping Fallback (if API limited)
**Method**: Scrape archive list pages
**Estimated time**: 3-4 hours
**Coverage**: Same as API
```python
# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
```
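If the fallback is needed, the listing pages could be parsed with the standard library alone. This is a minimal sketch: the `/organization/` link pattern is an assumption about the portal's markup, which must be verified against the real pages before use:

```python
from html.parser import HTMLParser

class ArchiveLinkParser(HTMLParser):
    """Collect (name, href) pairs from anchor tags on a listing page.
    Assumption: archive detail pages live under /organization/ paths;
    the real archivportal-d.de markup may differ."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._href = None
        self._text = []
        self.archives = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/organization/"):
                self._in_link = True
                self._href = href

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.archives.append(("".join(self._text).strip(), self._href))
            self._in_link = False
            self._href, self._text = None, []
```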
### Phase 3: Regional Portal Enrichment (OPTIONAL)
**Method**: Scrape individual state portals for detailed metadata
**Estimated time**: 8-12 hours (16 states)
**Coverage**: Enrichment only (addresses, phone numbers, etc.)
**State Portals**:
- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
- Bavaria, Saxony, etc.: Various systems
---
## Data Integration Plan
### Step 1: Harvest Archivportal-D
- Fetch all ~10,000-20,000 archive records via API
- Store in JSON format
- Include: name, location, federal state, archive type, ISIL (if present)
### Step 2: Cross-Reference with ISIL Dataset
```python
# Match archives by ISIL code
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True  # ISIL not yet in our dataset
    else:
        archive['no_isil_code'] = True  # New discovery
```
### Step 3: Identify New Discoveries
- Archives in Archivportal-D **without** ISIL codes
- Archives in Archivportal-D **with** ISIL codes not in our dataset
- Expected: ~5,000-10,000 new archives
### Step 4: Merge Datasets
```yaml
# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
```
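A merge step following this schema might look like the sketch below. The input field names (`id`, `name`, `city`, `state`, `coordinates`) are assumptions about the normalized records, not a fixed interface:

```python
from datetime import datetime, timezone

def build_unified_record(archive, isil_match=None):
    """Compose one unified record per the schema sketched above.
    `archive` is a normalized Archivportal-D record; `isil_match` is the
    matching ISIL registry entry, if any."""
    sources = ["ARCHIVPORTAL_D"]
    if isil_match:
        sources.insert(0, "ISIL_REGISTRY")  # Tier 1 source takes precedence
    return {
        "id": archive["id"],
        "name": (isil_match or archive).get("name"),
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("city"),
            "state": archive.get("state"),
            "coordinates": archive.get("coordinates"),
        },
        "data_sources": sources,
        "provenance": {
            "tier": "TIER_1" if isil_match else "TIER_2",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }
```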
### Step 5: Quality Validation
- Check for duplicates (same name + city)
- Verify ISIL code matches
- Geocode addresses (Nominatim API)
- Validate institution types
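The name+city duplicate check needs a normalization step so that umlaut and casing variants collapse to the same key. One possible sketch:

```python
import unicodedata

def dedup_key(name, city):
    """Case-, accent-, and whitespace-insensitive key for the name+city duplicate check."""
    def norm(s):
        s = unicodedata.normalize("NFKD", s or "")
        s = "".join(c for c in s if not unicodedata.combining(c))  # strip diacritics
        return " ".join(s.lower().split())  # collapse case and whitespace
    return (norm(name), norm(city))
```

Records sharing a key become candidates for manual review rather than being dropped automatically.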
---
## Expected Results
### Before Integration
| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,000-3,000 | 20-30% of archives |
| **Total** | **16,979** | **~2,500** | **Partial** |
### After Integration
| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,500 | Core authoritative data |
| Archivportal-D | ~12,000 | ~12,000 | Complete archive coverage |
| **Merged (deduplicated)** | **~25,000** | **~12,000** | **~100% archives** |
### Coverage Improvement: NRW Example
- **Before**: 301 archives (63%)
- **After**: ~477 archives (100%)
- **Gain**: +176 archives (+58%)
---
## Timeline and Resources
### Time Estimate
- **API registration**: 10 minutes
- **API harvester development**: 2 hours
- **Data harvest**: 1-2 hours
- **Cross-referencing**: 1 hour
- **Deduplication**: 1 hour
- **Validation**: 2 hours
- **Documentation**: 1 hour
- **Total**: ~8-10 hours
### Required Tools
- Python 3.9+
- `requests` library (API calls)
- `beautifulsoup4` (if web scraping needed)
- Nominatim API (geocoding)
- LinkML validator (schema validation)
### Output Files
```
/data/isil/germany/
├── archivportal_d_raw_20251119.json # Raw API data
├── archivportal_d_archives_20251119.json # Processed archives
├── german_isil_complete_20251119_134939.json # Existing ISIL data
├── german_unified_20251119.json # Merged dataset
├── german_unified_20251119.jsonl # Line-delimited
├── german_new_discoveries_20251119.json # Archives without ISIL
├── ARCHIVPORTAL_D_HARVEST_REPORT.md # Documentation
└── GERMAN_UNIFIED_STATISTICS.json # Comprehensive stats
```
---
## Next Steps (Priority Order)
### 1. Register for DDB API ✅ (Immediate)
- Go to https://www.deutsche-digitale-bibliothek.de/
- Create free account
- Generate API key in "My DDB"
- Test API access
### 2. Develop Archivportal-D Harvester ✅ (Today)
- Script: `scripts/scrapers/harvest_archivportal_d.py`
- Use DDB API (preferred) or web scraping (fallback)
- Implement rate limiting (1 req/second)
- Add error handling and retry logic
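The retry logic could follow a standard exponential-backoff pattern; this is a generic sketch, not tied to any specific DDB error codes:

```python
import time

def get_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Call `fetch()` (any zero-argument callable, e.g. a wrapped requests.get)
    with exponential backoff; re-raise after max_retries failures."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```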
### 3. Harvest All German Archives ✅ (Today)
- Run harvester for ~10,000-20,000 archives
- Save raw data + processed records
- Generate harvest report
### 4. Cross-Reference and Merge ✅ (Today)
- Match by ISIL codes
- Identify new discoveries
- Create unified dataset
- Generate statistics
### 5. Validate and Document ✅ (Today)
- Check for duplicates
- Validate data quality
- Create comprehensive documentation
- Update progress reports
### 6. OPTIONAL: Regional Portal Enrichment (Future)
- Scrape NRW portal for detailed metadata
- Repeat for other states
- Enrich merged dataset with contact info
---
## Success Criteria
**Complete German Archive Coverage**
- All ~12,000 German archives harvested
- 100% of Archivportal-D listings
- Cross-referenced with ISIL registry
**High Data Quality**
- Deduplication complete (< 1% duplicates)
- Geocoding for 80%+ of archives
- Institution types classified
**Comprehensive Documentation**
- Harvest reports for all sources
- Integration methodology documented
- Data quality metrics published
**Ready for GLAM Integration**
- LinkML format conversion complete
- GLAMORCUBESFIXPHDNT taxonomy applied
- GHCIDs generated
---
## Risk Mitigation
### Risk 1: API Rate Limits
- **Mitigation**: Implement 1 req/second limit, use batch requests
- **Fallback**: Web scraping if API unavailable
### Risk 2: Data Quality Issues
- **Mitigation**: Validation scripts, manual review of suspicious records
- **Fallback**: Flag low-quality records, document issues
### Risk 3: Duplicates
- **Mitigation**: Fuzzy matching by name+city, ISIL code verification
- **Fallback**: Manual deduplication for high-confidence matches
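The fuzzy name+city matching could be sketched with the standard library's `difflib`; the 0.9 threshold is a tunable assumption to be calibrated against real duplicates:

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a, b, threshold=0.9):
    """Fuzzy name+city match between two records (dicts with 'name' and 'city')."""
    def ratio(x, y):
        return SequenceMatcher(None, (x or "").lower(), (y or "").lower()).ratio()
    return (ratio(a["name"], b["name"]) >= threshold
            and ratio(a["city"], b["city"]) >= threshold)
```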
### Risk 4: Missing Metadata
- **Mitigation**: Use multiple sources (ISIL + Archivportal-D + regional portals)
- **Fallback**: Mark incomplete records, enrich later
---
## References
### Primary Sources
- **ISIL Registry**: https://services.dnb.de/sru/bib (16,979 records)
- **Archivportal-D**: https://www.archivportal-d.de/ (~12,000 archives) 🔄
- **DDB API**: https://api.deutsche-digitale-bibliothek.de/ 🔄
### Regional Portals
- **NRW**: https://www.archive.nrw.de/ (477 archives)
- **Baden-Württemberg**: https://www.landesarchiv-bw.de/
- **Niedersachsen**: https://www.arcinsys.niedersachsen.de/
### Documentation
- ISIL Harvest: `/data/isil/germany/HARVEST_REPORT.md`
- Comprehensiveness Check: `/data/isil/germany/COMPREHENSIVENESS_REPORT.md`
- Archivportal-D Discovery: `/data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md`
- This Plan: `/data/isil/germany/COMPLETENESS_PLAN.md`
---
**Status**: Plan complete, ready for implementation
**Priority**: HIGH - Critical for 100% German coverage
**Next Action**: Register for DDB API key, develop harvester
**Estimated Completion**: Same day (8-10 hours work)
---
**End of Plan**