# German Archive Completeness Plan - Archivportal-D Integration
**Date**: November 19, 2025
**Status**: Discovery → Planning → Implementation
**Goal**: Achieve 100% coverage of German archives

---

## Current Status

### What We Have ✅

- **16,979 ISIL-registered institutions** (Tier 1 data)
  - Libraries, archives, museums
  - Excellent metadata (87% geocoded)
  - Source: Deutsche Nationalbibliothek SRU API

### What We're Missing ⚠️

- **~5,000-10,000 archives without ISIL codes**
  - Smaller municipal archives
  - Specialized archives (family, business, church)
  - Newly founded archives
  - Archives that choose not to register for an ISIL

### Gap Example: North Rhine-Westphalia

- **ISIL registry**: 301 archives (63%)
- **archive.nrw.de portal**: 477 archives (100%)
- **Missing**: 176 archives (37%)

---

## Solution: Archivportal-D Harvest

### What is Archivportal-D?

**URL**: https://www.archivportal-d.de/
**Operator**: Deutsche Digitale Bibliothek (German Digital Library)
**Coverage**: All archives across Germany (national aggregator)

**Key Features**:

- ✅ **16 federal states** - Complete national coverage
- ✅ **~10,000-20,000 archives** - Comprehensive archive listings
- ✅ **9 archive sectors** - State, municipal, church, business, etc.
- ✅ **API access available** - Machine-readable data via the DDB API
- ✅ **CC0 metadata** - Open data license for reuse
- ✅ **Actively maintained** - Updated by national library infrastructure

---

## Archive Sectors in Archivportal-D

1. **State archives** (Landesarchive) - Federal state archives
2. **Local/municipal archives** (Kommunalarchive) - City, county archives
3. **Church archives** (Kirchenarchive) - Catholic, Protestant, Jewish
4. **Nobility and family archives** (Adelsarchive, Familienarchive)
5. **Business archives** (Wirtschaftsarchive) - Corporate, economic
6. **Political archives** (Politische Archive) - Parties, movements, foundations
7. **Media archives** (Medienarchive) - Broadcasting, film, press
8. **University archives** (Hochschularchive) - Academic institutions
9. **Other archives** (Sonstige Archive) - Specialized collections

---

## DDB API Access

### API Documentation

**URL**: https://api.deutsche-digitale-bibliothek.de/
**Format**: REST API with OpenAPI 3.0 specification
**Authentication**: API key (free for registered users)
**License**: CC0 for metadata

### Registration Process

1. Create an account at https://www.deutsche-digitale-bibliothek.de/
2. Log in to "My DDB" (Meine DDB)
3. Generate an API key in the account settings
4. Use the API key in requests (Authorization header)

### API Endpoints (Relevant for Archives)

```http
# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
    ?query=*
    &sector=arc_archives
    &rows=100
    &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
    ?facetValues[]=federalState-Nordrhein-Westfalen
    &sector=arc_archives
```

### Response Format

```json
{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123"
    }
  ]
}
```

The `isil` field is present only when the archive has one.
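Once a response comes back, each result entry can be mapped onto our unified record shape. A minimal sketch, assuming the field names shown in the sample response above (real responses should be verified against the DDB API documentation); `normalize_record` is a hypothetical helper, not part of any existing codebase:

```python
def normalize_record(raw: dict) -> dict:
    """Map a DDB search result entry to the unified dataset fields."""
    lat = raw.get("latitude")
    lon = raw.get("longitude")
    return {
        "id": raw["id"],
        "name": raw.get("title"),
        "archive_type": raw.get("label"),
        "state": raw.get("federalState"),
        "city": raw.get("place"),
        # Coordinates arrive as strings in the sample; convert when present
        "coordinates": [float(lat), float(lon)] if lat and lon else None,
        "isil": raw.get("isil"),  # May be absent for unregistered archives
    }
```

Records missing coordinates or ISIL codes come through as `None`, which keeps them easy to flag during validation.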
|
|
|
|
---

## Harvest Strategy - Three-Phase Approach

### Phase 1: API Harvest (RECOMMENDED) 🚀

**Method**: Use the DDB API to fetch all archive records
**Estimated time**: 1-2 hours
**Coverage**: ~10,000-20,000 archives

```python
import os
import time

import requests


def harvest_archivportal_d():
    """Fetch all archive records from the DDB API, page by page."""
    api_key = os.environ["DDB_API_KEY"]
    archives = []
    offset = 0
    batch_size = 100

    while True:
        response = requests.get(
            "https://api.deutsche-digitale-bibliothek.de/search",
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
        )
        response.raise_for_status()

        data = response.json()
        archives.extend(data["results"])

        # Last page reached when fewer results than a full batch come back
        if len(data["results"]) < batch_size:
            break

        offset += batch_size
        time.sleep(0.5)  # Rate limiting

    return archives
```
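The next steps below call for error handling and retry logic around the harvest. One minimal sketch is an exponential-backoff wrapper; `with_retries` is a hypothetical helper that would wrap the `requests.get()` call in the harvester above, not an existing utility:

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**n seconds and retry.

    A sketch of the retry logic mentioned in the next steps. In the real
    harvester this would wrap the HTTP request for each page.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # Exponential backoff
```

Keeping the retry logic separate from the pagination loop makes both easier to test in isolation.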

### Phase 2: Web Scraping Fallback (if API limited)

**Method**: Scrape the archive list pages
**Estimated time**: 3-4 hours
**Coverage**: Same as the API

```python
# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
```

### Phase 3: Regional Portal Enrichment (OPTIONAL)

**Method**: Scrape individual state portals for detailed metadata
**Estimated time**: 8-12 hours (16 states)
**Coverage**: Enrichment only (addresses, phone numbers, etc.)

**State Portals**:

- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
- Bavaria, Saxony, etc.: Various systems

---

## Data Integration Plan

### Step 1: Harvest Archivportal-D

- Fetch all ~10,000-20,000 archive records via the API
- Store in JSON format
- Include: name, location, federal state, archive type, ISIL (if present)

### Step 2: Cross-Reference with ISIL Dataset

```python
# Match archives by ISIL code; find_in_isil_dataset() looks up a record
# in the existing ISIL dataset
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True  # ISIL not yet in our dataset
    else:
        archive['no_isil_code'] = True  # New discovery
```

### Step 3: Identify New Discoveries

- Archives in Archivportal-D **without** ISIL codes
- Archives in Archivportal-D **with** ISIL codes not in our dataset
- Expected: ~5,000-10,000 new archives

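With the flags set during cross-referencing (Step 2), this split becomes mechanical. A small sketch; `split_discoveries` is a hypothetical helper built on those flags:

```python
def split_discoveries(archives):
    """Split cross-referenced records into known archives and new discoveries."""
    known, new = [], []
    for a in archives:
        if a.get("no_isil_code") or a.get("new_isil_discovery"):
            new.append(a)    # Not in the ISIL dataset yet
        else:
            known.append(a)  # Matched an existing ISIL record
    return known, new
```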
### Step 4: Merge Datasets

```yaml
# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
```

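The merge itself could be sketched as a function that builds one unified record per the schema above. This is a hypothetical implementation (`to_unified` and its argument shapes are assumptions, not existing code); the tier and source list follow the provenance rules in the schema:

```python
from datetime import datetime, timezone


def to_unified(archive, isil_data=None):
    """Build a unified record following the merge schema (sketch)."""
    sources = ["ARCHIVPORTAL_D (Tier 2)"]
    if isil_data:
        sources.insert(0, "ISIL_REGISTRY (Tier 1)")
    return {
        "id": archive["id"],
        "name": archive.get("name"),
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("city"),
            "state": archive.get("state"),
            "coordinates": archive.get("coordinates"),
        },
        "archive_type": archive.get("archive_type"),
        "data_sources": sources,
        # Tier 1 when the record is backed by authoritative ISIL metadata
        "provenance": {
            "tier": "TIER_1" if isil_data else "TIER_2",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }
```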
### Step 5: Quality Validation

- Check for duplicates (same name + city)
- Verify ISIL code matches
- Geocode addresses (Nominatim API)
- Validate institution types

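The duplicate check above (same name + city) can be sketched as a pass over normalized keys; `find_duplicates` is a hypothetical helper, and real data would also need the fuzzy matching described under Risk Mitigation:

```python
def find_duplicates(records):
    """Flag records sharing a normalized (name, city) key."""
    seen, dupes = set(), []
    for r in records:
        key = (
            (r.get("name") or "").strip().lower(),
            (r.get("city") or "").strip().lower(),
        )
        if key in seen:
            dupes.append(key)  # Second and later occurrences only
        else:
            seen.add(key)
    return dupes
```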
---

## Expected Results

### Before Integration

| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,000-3,000 | 20-30% of archives |
| **Total** | **16,979** | **~2,500** | **Partial** |

### After Integration

| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,500 | Core authoritative data |
| Archivportal-D | ~12,000 | ~12,000 | Complete archive coverage |
| **Merged (deduplicated)** | **~25,000** | **~12,000** | **~100% of archives** |

### Coverage Improvement: NRW Example

- **Before**: 301 archives (63%)
- **After**: ~477 archives (100%)
- **Gain**: +176 archives (+58%)

---

## Timeline and Resources

### Time Estimate

- **API registration**: 10 minutes
- **API harvester development**: 2 hours
- **Data harvest**: 1-2 hours
- **Cross-referencing**: 1 hour
- **Deduplication**: 1 hour
- **Validation**: 2 hours
- **Documentation**: 1 hour
- **Total**: ~8-10 hours

### Required Tools

- Python 3.9+
- `requests` library (API calls)
- `beautifulsoup4` (if web scraping is needed)
- Nominatim API (geocoding)
- LinkML validator (schema validation)

### Output Files

```
/data/isil/germany/
├── archivportal_d_raw_20251119.json            # Raw API data
├── archivportal_d_archives_20251119.json       # Processed archives
├── german_isil_complete_20251119_134939.json   # Existing ISIL data
├── german_unified_20251119.json                # Merged dataset
├── german_unified_20251119.jsonl               # Line-delimited
├── german_new_discoveries_20251119.json        # Archives without ISIL
├── ARCHIVPORTAL_D_HARVEST_REPORT.md            # Documentation
└── GERMAN_UNIFIED_STATISTICS.json              # Comprehensive stats
```

---

## Next Steps (Priority Order)

### 1. Register for the DDB API ✅ (Immediate)

- Go to https://www.deutsche-digitale-bibliothek.de/
- Create a free account
- Generate an API key in "My DDB"
- Test API access

### 2. Develop the Archivportal-D Harvester ✅ (Today)

- Script: `scripts/scrapers/harvest_archivportal_d.py`
- Use the DDB API (preferred) or web scraping (fallback)
- Implement rate limiting (1 req/second)
- Add error handling and retry logic

### 3. Harvest All German Archives ✅ (Today)

- Run the harvester for ~10,000-20,000 archives
- Save raw data + processed records
- Generate a harvest report

### 4. Cross-Reference and Merge ✅ (Today)

- Match by ISIL codes
- Identify new discoveries
- Create the unified dataset
- Generate statistics

### 5. Validate and Document ✅ (Today)

- Check for duplicates
- Validate data quality
- Create comprehensive documentation
- Update progress reports

### 6. OPTIONAL: Regional Portal Enrichment (Future)

- Scrape the NRW portal for detailed metadata
- Repeat for other states
- Enrich the merged dataset with contact info

---

## Success Criteria

✅ **Complete German Archive Coverage**

- All ~12,000 German archives harvested
- 100% of Archivportal-D listings
- Cross-referenced with the ISIL registry

✅ **High Data Quality**

- Deduplication complete (< 1% duplicates)
- Geocoding for 80%+ of archives
- Institution types classified

✅ **Comprehensive Documentation**

- Harvest reports for all sources
- Integration methodology documented
- Data quality metrics published

✅ **Ready for GLAM Integration**

- LinkML format conversion complete
- GLAMORCUBESFIXPHDNT taxonomy applied
- GHCIDs generated

---

## Risk Mitigation

### Risk 1: API Rate Limits

- **Mitigation**: Implement a 1 req/second limit; use batch requests
- **Fallback**: Web scraping if the API is unavailable

### Risk 2: Data Quality Issues

- **Mitigation**: Validation scripts; manual review of suspicious records
- **Fallback**: Flag low-quality records and document issues

### Risk 3: Duplicates

- **Mitigation**: Fuzzy matching by name + city; ISIL code verification
- **Fallback**: Manual deduplication for high-confidence matches

### Risk 4: Missing Metadata

- **Mitigation**: Use multiple sources (ISIL + Archivportal-D + regional portals)
- **Fallback**: Mark incomplete records and enrich them later

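The fuzzy matching for Risk 3 could be sketched with the standard library's `difflib`; the 0.9 threshold is an assumption that would need tuning against real archive names:

```python
from difflib import SequenceMatcher


def is_fuzzy_match(a, b, threshold=0.9):
    """Treat two archive names as the same institution above a similarity threshold."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold
```

Combining the similarity score with a city match and, where present, ISIL verification keeps false merges rare.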
---

## References

### Primary Sources

- **ISIL Registry**: https://services.dnb.de/sru/bib (16,979 records) ✅
- **Archivportal-D**: https://www.archivportal-d.de/ (~12,000 archives) 🔄
- **DDB API**: https://api.deutsche-digitale-bibliothek.de/ 🔄

### Regional Portals

- **NRW**: https://www.archive.nrw.de/ (477 archives)
- **Baden-Württemberg**: https://www.landesarchiv-bw.de/
- **Niedersachsen**: https://www.arcinsys.niedersachsen.de/

### Documentation

- ISIL Harvest: `/data/isil/germany/HARVEST_REPORT.md`
- Comprehensiveness Check: `/data/isil/germany/COMPREHENSIVENESS_REPORT.md`
- Archivportal-D Discovery: `/data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md`
- This Plan: `/data/isil/germany/COMPLETENESS_PLAN.md`

---

**Status**: Plan complete, ready for implementation
**Priority**: HIGH - Critical for 100% German coverage
**Next Action**: Register for a DDB API key, develop the harvester
**Estimated Completion**: Same day (8-10 hours of work)

---

**End of Plan**