German Archive Completeness Plan - Archivportal-D Integration
Date: November 19, 2025
Status: Discovery → Planning → Implementation
Goal: Achieve 100% coverage of German archives
Current Status
What We Have ✅
- 16,979 ISIL-registered institutions (Tier 1 data)
- Libraries, archives, museums
- Excellent metadata (87% geocoded)
- Source: Deutsche Nationalbibliothek SRU API
What We're Missing ⚠️
- ~5,000-10,000 archives without ISIL codes
- Smaller municipal archives
- Specialized archives (family, business, church)
- Newly founded archives
- Archives choosing not to register for ISIL
Gap Example: North Rhine-Westphalia
- ISIL registry: 301 archives (63%)
- archive.nrw.de portal: 477 archives (100%)
- Missing: 176 archives (37%)
Solution: Archivportal-D Harvest
What is Archivportal-D?
URL: https://www.archivportal-d.de/
Operator: Deutsche Digitale Bibliothek (German Digital Library)
Coverage: ALL archives across Germany (national aggregator)
Key Features:
- ✅ 16 federal states - Complete national coverage
- ✅ ~10,000-20,000 archives - Comprehensive archive listings
- ✅ 9 archive sectors - State, municipal, church, business, etc.
- ✅ API access available - Machine-readable data via DDB API
- ✅ CC0 metadata - Open data license for reuse
- ✅ Actively maintained - Updated by national library infrastructure
Archive Sectors in Archivportal-D
- State archives (Landesarchive) - Federal state archives
- Local/municipal archives (Kommunalarchive) - City, county archives
- Church archives (Kirchenarchive) - Catholic, Protestant, Jewish
- Nobility and family archives (Adelsarchive, Familienarchive)
- Business archives (Wirtschaftsarchive) - Corporate, economic
- Political archives (Politische Archive) - Parties, movements, foundations
- Media archives (Medienarchive) - Broadcasting, film, press
- University archives (Hochschularchive) - Academic institutions
- Other archives (Sonstige Archive) - Specialized collections
DDB API Access
API Documentation
URL: https://api.deutsche-digitale-bibliothek.de/
Format: REST API with OpenAPI 3.0 specification
Authentication: API key (free for registered users)
License: CC0 for metadata
Registration Process
- Create account at https://www.deutsche-digitale-bibliothek.de/
- Log in to "My DDB" (Meine DDB)
- Generate API key in account settings
- Use API key in requests (Authorization header)
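To confirm the key works before running a full harvest, here is a minimal smoke-test sketch, assuming the Bearer-style Authorization header described above; DDB_API_KEY is an assumed environment variable name:
import os

import requests

api_key = os.environ["DDB_API_KEY"]  # assumed env var holding the key from "My DDB"
response = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={"query": "*", "rows": 1},
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=30,
)
response.raise_for_status()
print(response.json().get("numberOfResults"))  # any number confirms access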
API Endpoints (Relevant for Archives)
# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
    ?query=*
    &sector=arc_archives
    &rows=100
    &offset=0
# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}
# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
    ?facetValues[]=federalState-Nordrhein-Westfalen
    &sector=arc_archives
Response Format
{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123"  // if available
    }
  ]
}
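Before merging, each result can be flattened into a normalized record. A minimal sketch, assuming the field names in the sample response above (they are illustrative, not a verified schema):
def normalize_result(result: dict) -> dict:
    # Field names follow the sample response above (assumptions)
    return {
        "ddb_id": result.get("id"),
        "name": result.get("title"),
        "archive_type": result.get("label"),
        "state": result.get("federalState"),
        "city": result.get("place"),
        "coordinates": [result.get("latitude"), result.get("longitude")],
        "isil": result.get("isil"),  # often absent
    }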
Harvest Strategy - Three-Phase Approach
Phase 1: API Harvest (RECOMMENDED) 🚀
Method: Use DDB API to fetch all archive records
Estimated time: 1-2 hours
Coverage: ~10,000-20,000 archives
# Pseudocode, fleshed out into a runnable sketch (DDB_API_KEY is an assumed env var)
import os
import time

import requests

def harvest_archivportal_d():
    api_key = os.environ["DDB_API_KEY"]  # key generated in "My DDB"
    archives = []
    offset = 0
    batch_size = 100
    while True:
        response = requests.get(
            "https://api.deutsche-digitale-bibliothek.de/search",
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset
            },
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30
        )
        response.raise_for_status()  # fail fast on HTTP errors
        data = response.json()
        archives.extend(data['results'])
        if len(data['results']) < batch_size:
            break  # last (partial) page reached
        offset += batch_size
        time.sleep(0.5)  # Rate limiting
    return archives
Phase 2: Web Scraping Fallback (if API limited)
Method: Scrape archive list pages
Estimated time: 3-4 hours
Coverage: Same as API
# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
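If the API route falls through, a scraping pass along these lines could work. A minimal sketch, assuming the paginated listing URL above; the CSS selectors are hypothetical and must be replaced after inspecting the real page structure:
import time

import requests
from bs4 import BeautifulSoup

def scrape_listing_pages(max_pages=800):
    archives = []
    for page in range(1, max_pages + 1):
        response = requests.get(
            "https://www.archivportal-d.de/struktur",
            params={"page": page},
            timeout=30,
        )
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        entries = soup.select(".archive-entry")  # hypothetical selector
        if not entries:
            break  # past the last page
        for entry in entries:
            name = entry.select_one(".archive-name")  # hypothetical selector
            if name:
                archives.append({"name": name.get_text(strip=True)})
        time.sleep(1.0)  # be polite to the portal
    return archives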
Phase 3: Regional Portal Enrichment (OPTIONAL)
Method: Scrape individual state portals for detailed metadata
Estimated time: 8-12 hours (16 states)
Coverage: Enrichment only (addresses, phone numbers, etc.)
State Portals:
- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
- Bavaria, Saxony, etc.: Various systems
Data Integration Plan
Step 1: Harvest Archivportal-D
- Fetch all ~10,000-20,000 archive records via API
- Store in JSON format
- Include: name, location, federal state, archive type, ISIL (if present)
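A sketch of the storage step, writing raw records to a dated file; the directory and filename pattern follow the output layout planned below:
import json
from datetime import date
from pathlib import Path

def save_raw_harvest(archives, out_dir="data/isil/germany"):
    stamp = date.today().strftime("%Y%m%d")
    path = Path(out_dir) / f"archivportal_d_raw_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(archives, ensure_ascii=False, indent=2), encoding="utf-8")
    return path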
Step 2: Cross-Reference with ISIL Dataset
# Match archives by ISIL code
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True
    else:
        archive['no_isil_code'] = True  # New discovery
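The find_in_isil_dataset call above is left undefined. A minimal sketch treats it as a dictionary lookup over the Tier 1 records, assuming each record carries an isil field:
def build_isil_index(isil_records):
    # Index existing Tier 1 records by ISIL code for O(1) lookup
    return {r["isil"]: r for r in isil_records if r.get("isil")}

ISIL_INDEX = {}  # populated once from the 16,979-record Tier 1 dataset

def find_in_isil_dataset(isil):
    return ISIL_INDEX.get(isil)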
Step 3: Identify New Discoveries
- Archives in Archivportal-D without ISIL codes
- Archives in Archivportal-D with ISIL codes not in our dataset
- Expected: ~5,000-10,000 new archives
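Given the flags set in Step 2, the new discoveries fall out as two simple filters. A minimal sketch:
def split_discoveries(archives):
    # Archives with no ISIL code at all
    no_isil = [a for a in archives if a.get("no_isil_code")]
    # Archives whose ISIL code is missing from our Tier 1 dataset
    unknown_isil = [a for a in archives if a.get("new_isil_discovery")]
    return no_isil, unknown_isil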
Step 4: Merge Datasets
# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
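A sketch of building one unified record from a normalized Archivportal-D entry, following the schema above; the field mapping and ID scheme are assumptions:
from datetime import datetime, timezone

def to_unified_record(archive):
    matched = "isil_data" in archive  # set during Step 2 cross-referencing
    return {
        "id": f"archivportal-d:{archive['ddb_id']}",  # hypothetical ID scheme
        "name": archive["name"],
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("city"),
            "state": archive.get("state"),
            "coordinates": archive.get("coordinates"),
        },
        "archive_type": archive.get("archive_type"),
        "data_sources": ["ISIL_REGISTRY", "ARCHIVPORTAL_D"] if matched else ["ARCHIVPORTAL_D"],
        "provenance": {
            "tier": "TIER_1" if matched else "TIER_2",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }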
Step 5: Quality Validation
- Check for duplicates (same name + city)
- Verify ISIL code matches
- Geocode addresses (Nominatim API)
- Validate institution types
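The name+city duplicate check can run over a normalized key. A minimal sketch; the normalization rules are an assumption, and fuzzy matching would catch more variants:
import unicodedata

def dedupe_key(record):
    # Strip accents and case-fold so e.g. "Köln" and "koln" compare equal
    def norm(s):
        s = unicodedata.normalize("NFKD", s or "")
        return "".join(c for c in s if not unicodedata.combining(c)).casefold().strip()
    return (norm(record.get("name", "")),
            norm(record.get("location", {}).get("city", "")))

def find_duplicates(records):
    groups = {}
    for r in records:
        groups.setdefault(dedupe_key(r), []).append(r)
    return {k: v for k, v in groups.items() if len(v) > 1}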
Expected Results
Before Integration
| Source | Institutions | Archives | Coverage |
|---|---|---|---|
| ISIL Registry | 16,979 | ~2,000-3,000 | 20-30% of archives |
| Total | 16,979 | ~2,500 | Partial |
After Integration
| Source | Institutions | Archives | Coverage |
|---|---|---|---|
| ISIL Registry | 16,979 | ~2,500 | Core authoritative data |
| Archivportal-D | ~12,000 | ~12,000 | Complete archive coverage |
| Merged (deduplicated) | ~25,000 | ~12,000 | ~100% archives |
Coverage Improvement: NRW Example
- Before: 301 archives (63%)
- After: ~477 archives (100%)
- Gain: +176 archives (+58%)
Timeline and Resources
Time Estimate
- API registration: 10 minutes
- API harvester development: 2 hours
- Data harvest: 1-2 hours
- Cross-referencing: 1 hour
- Deduplication: 1 hour
- Validation: 2 hours
- Documentation: 1 hour
- Total: ~8-10 hours
Required Tools
- Python 3.9+
- requests library (API calls)
- beautifulsoup4 (if web scraping is needed)
- Nominatim API (geocoding)
- LinkML validator (schema validation)
Output Files
/data/isil/germany/
├── archivportal_d_raw_20251119.json # Raw API data
├── archivportal_d_archives_20251119.json # Processed archives
├── german_isil_complete_20251119_134939.json # Existing ISIL data
├── german_unified_20251119.json # Merged dataset
├── german_unified_20251119.jsonl # Line-delimited
├── german_new_discoveries_20251119.json # Archives without ISIL
├── ARCHIVPORTAL_D_HARVEST_REPORT.md # Documentation
└── GERMAN_UNIFIED_STATISTICS.json # Comprehensive stats
Next Steps (Priority Order)
1. Register for DDB API ✅ (Immediate)
- Go to https://www.deutsche-digitale-bibliothek.de/
- Create free account
- Generate API key in "My DDB"
- Test API access
2. Develop Archivportal-D Harvester ✅ (Today)
- Script: scripts/scrapers/harvest_archivportal_d.py
- Use DDB API (preferred) or web scraping (fallback)
- Implement rate limiting (1 req/second)
- Add error handling and retry logic (see the retry sketch after this list)
3. Harvest All German Archives ✅ (Today)
- Run harvester for ~10,000-20,000 archives
- Save raw data + processed records
- Generate harvest report
4. Cross-Reference and Merge ✅ (Today)
- Match by ISIL codes
- Identify new discoveries
- Create unified dataset
- Generate statistics
5. Validate and Document ✅ (Today)
- Check for duplicates
- Validate data quality
- Create comprehensive documentation
- Update progress reports
6. OPTIONAL: Regional Portal Enrichment (Future)
- Scrape NRW portal for detailed metadata
- Repeat for other states
- Enrich merged dataset with contact info
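For the error handling and retry logic in step 2, a minimal exponential-backoff sketch wrapping the API call; the attempt count and delays are illustrative:
import time

import requests

def get_with_retry(url, params, headers, max_attempts=5):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, params=params, headers=headers, timeout=30)
            # Retry on rate-limit or server-side errors
            if response.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"retryable status {response.status_code}")
            response.raise_for_status()
            return response
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_attempts:
                raise
            time.sleep(delay)  # backoff: 1s, 2s, 4s, ...
            delay *= 2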
Success Criteria
✅ Complete German Archive Coverage
- All ~12,000 German archives harvested
- 100% of Archivportal-D listings
- Cross-referenced with ISIL registry
✅ High Data Quality
- Deduplication complete (< 1% duplicates)
- Geocoding for 80%+ of archives
- Institution types classified
✅ Comprehensive Documentation
- Harvest reports for all sources
- Integration methodology documented
- Data quality metrics published
✅ Ready for GLAM Integration
- LinkML format conversion complete
- GLAMORCUBESFIXPHDNT taxonomy applied
- GHCIDs generated
Risk Mitigation
Risk 1: API Rate Limits
- Mitigation: Implement 1 req/second limit, use batch requests
- Fallback: Web scraping if API unavailable
Risk 2: Data Quality Issues
- Mitigation: Validation scripts, manual review of suspicious records
- Fallback: Flag low-quality records, document issues
Risk 3: Duplicates
- Mitigation: Fuzzy matching by name+city, ISIL code verification
- Fallback: Manual deduplication for high-confidence matches
Risk 4: Missing Metadata
- Mitigation: Use multiple sources (ISIL + Archivportal-D + regional portals)
- Fallback: Mark incomplete records, enrich later
References
Primary Sources
- ISIL Registry: https://services.dnb.de/sru/bib (16,979 records) ✅
- Archivportal-D: https://www.archivportal-d.de/ (~12,000 archives) 🔄
- DDB API: https://api.deutsche-digitale-bibliothek.de/ 🔄
Regional Portals
- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
Documentation
- ISIL Harvest: /data/isil/germany/HARVEST_REPORT.md
- Comprehensiveness Check: /data/isil/germany/COMPREHENSIVENESS_REPORT.md
- Archivportal-D Discovery: /data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md
- This Plan: /data/isil/germany/COMPLETENESS_PLAN.md
Status: Plan complete, ready for implementation
Priority: HIGH - Critical for 100% German coverage
Next Action: Register for DDB API key, develop harvester
Estimated Completion: Same day (8-10 hours work)
End of Plan