# German Archive Completeness Plan - Archivportal-D Integration

**Date**: November 19, 2025
**Status**: Discovery → Planning → Implementation
**Goal**: Achieve 100% coverage of German archives

---

## Current Status

### What We Have ✅

- **16,979 ISIL-registered institutions** (Tier 1 data)
  - Libraries, archives, museums
  - Excellent metadata (87% geocoded)
  - Source: Deutsche Nationalbibliothek SRU API

### What We're Missing ⚠️

- **~5,000-10,000 archives without ISIL codes**
  - Smaller municipal archives
  - Specialized archives (family, business, church)
  - Newly founded archives
  - Archives choosing not to register for an ISIL

### Gap Example: North Rhine-Westphalia

- **ISIL registry**: 301 archives (63%)
- **archive.nrw.de portal**: 477 archives (100%)
- **Missing**: 176 archives (37%)

---

## Solution: Archivportal-D Harvest

### What is Archivportal-D?

**URL**: https://www.archivportal-d.de/
**Operator**: Deutsche Digitale Bibliothek (German Digital Library)
**Coverage**: ALL archives across Germany (national aggregator)

**Key Features**:

- ✅ **16 federal states** - Complete national coverage
- ✅ **~10,000-20,000 archives** - Comprehensive archive listings
- ✅ **9 archive sectors** - State, municipal, church, business, etc.
- ✅ **API access available** - Machine-readable data via DDB API
- ✅ **CC0 metadata** - Open data license for reuse
- ✅ **Actively maintained** - Updated by national library infrastructure

---

## Archive Sectors in Archivportal-D

1. **State archives** (Landesarchive) - Federal state archives
2. **Local/municipal archives** (Kommunalarchive) - City, county archives
3. **Church archives** (Kirchenarchive) - Catholic, Protestant, Jewish
4. **Nobility and family archives** (Adelsarchive, Familienarchive)
5. **Business archives** (Wirtschaftsarchive) - Corporate, economic
6. **Political archives** (Politische Archive) - Parties, movements, foundations
7. **Media archives** (Medienarchive) - Broadcasting, film, press
8. **University archives** (Hochschularchive) - Academic institutions
9. **Other archives** (Sonstige Archive) - Specialized collections

---
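For downstream classification, the nine sectors above can be collapsed into a small lookup table. A minimal sketch, assuming our own category codes (the German labels come from the list; the English codes and the `normalize_sector` helper are hypothetical names for this pipeline, not part of the DDB API):

```python
# Assumed normalization table: German Archivportal-D sector labels
# mapped to category codes of our own naming for the unified dataset.
ARCHIVE_SECTORS = {
    "Landesarchive": "STATE",
    "Kommunalarchive": "MUNICIPAL",
    "Kirchenarchive": "CHURCH",
    "Adelsarchive": "NOBILITY_FAMILY",
    "Familienarchive": "NOBILITY_FAMILY",
    "Wirtschaftsarchive": "BUSINESS",
    "Politische Archive": "POLITICAL",
    "Medienarchive": "MEDIA",
    "Hochschularchive": "UNIVERSITY",
    "Sonstige Archive": "OTHER",
}

def normalize_sector(label):
    """Map an Archivportal-D sector label to a category code; default OTHER."""
    return ARCHIVE_SECTORS.get(label, "OTHER")
```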
## DDB API Access

### API Documentation

**URL**: https://api.deutsche-digitale-bibliothek.de/
**Format**: REST API with OpenAPI 3.0 specification
**Authentication**: API key (free for registered users)
**License**: CC0 for metadata

### Registration Process

1. Create an account at https://www.deutsche-digitale-bibliothek.de/
2. Log in to "My DDB" (Meine DDB)
3. Generate an API key in the account settings
4. Use the API key in requests (Authorization header)

### API Endpoints (Relevant for Archives)

```http
# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
    ?query=*
    &sector=arc_archives
    &rows=100
    &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
    ?facetValues[]=federalState-Nordrhein-Westfalen
    &sector=arc_archives
```

### Response Format

```json
{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123"
    }
  ]
}
```

(The `isil` field is only present for archives with a registered ISIL code.)

---

## Harvest Strategy - Three-Phase Approach

### Phase 1: API Harvest (RECOMMENDED) 🚀

**Method**: Use the DDB API to fetch all archive records
**Estimated time**: 1-2 hours
**Coverage**: ~10,000-20,000 archives

```python
import os
import time

import requests

def harvest_archivportal_d():
    api_key = os.environ["DDB_API_KEY"]
    archives = []
    offset = 0
    batch_size = 100

    while True:
        response = requests.get(
            "https://api.deutsche-digitale-bibliothek.de/search",
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
        )
        response.raise_for_status()
        data = response.json()
        archives.extend(data["results"])

        if len(data["results"]) < batch_size:
            break
        offset += batch_size
        time.sleep(0.5)  # Rate limiting

    return archives
```
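Before cross-referencing, each raw search result can be normalized into a flat record. A minimal sketch, assuming the field names shown in the sample response above (`to_record` and the output field names are our own; lat/lon arrive as strings in the sample, so they are converted to floats when present):

```python
def to_record(result):
    """Normalize one raw DDB search result into a flat archive record.

    Input field names follow the sample response; missing fields become None.
    """
    def as_float(value):
        # Latitude/longitude are strings in the sample response.
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    return {
        "id": result.get("id"),
        "name": result.get("title"),
        "archive_type": result.get("label"),
        "state": result.get("federalState"),
        "city": result.get("place"),
        "coordinates": [as_float(result.get("latitude")),
                        as_float(result.get("longitude"))],
        "isil": result.get("isil"),  # None for archives without an ISIL
    }
```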
### Phase 2: Web Scraping Fallback (if API limited)

**Method**: Scrape the archive list pages
**Estimated time**: 3-4 hours
**Coverage**: Same as API

```python
# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
```

### Phase 3: Regional Portal Enrichment (OPTIONAL)

**Method**: Scrape individual state portals for detailed metadata
**Estimated time**: 8-12 hours (16 states)
**Coverage**: Enrichment only (addresses, phone numbers, etc.)

**State Portals**:

- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
- Bavaria, Saxony, etc.: Various systems

---

## Data Integration Plan

### Step 1: Harvest Archivportal-D

- Fetch all ~10,000-20,000 archive records via the API
- Store in JSON format
- Include: name, location, federal state, archive type, ISIL (if present)

### Step 2: Cross-Reference with ISIL Dataset

```python
# Match archives by ISIL code
for archive in archivportal_archives:
    isil = archive.get("isil")
    if isil:
        isil_match = find_in_isil_dataset(isil)
        if isil_match:
            archive["isil_data"] = isil_match  # Merge metadata
        else:
            archive["new_isil_discovery"] = True
    else:
        archive["no_isil_code"] = True  # New discovery
```

### Step 3: Identify New Discoveries

- Archives in Archivportal-D **without** ISIL codes
- Archives in Archivportal-D **with** ISIL codes not in our dataset
- Expected: ~5,000-10,000 new archives

### Step 4: Merge Datasets

```yaml
# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
```
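The merge step can be sketched as a record builder that stamps data sources and tier depending on whether an ISIL match was found. A sketch under assumptions: the function name and the flat input field names are our own, but the output mirrors the schema sketch above:

```python
from datetime import datetime, timezone

def build_unified_record(archive, isil_data=None):
    """Assemble a unified GLAM record, stamping provenance by source.

    `archive` is a flat Archivportal-D record (assumed field names);
    `isil_data` is the matched ISIL registry entry, or None.
    """
    sources = ["ARCHIVPORTAL_D (Tier 2)"]
    if isil_data:
        sources.insert(0, "ISIL_REGISTRY (Tier 1)")
    return {
        "id": archive["id"],
        "name": archive["name"],
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("city"),
            "state": archive.get("state"),
            "coordinates": archive.get("coordinates"),
        },
        "archive_type": archive.get("archive_type"),
        "data_sources": sources,
        "provenance": {
            "tier": "TIER_1" if isil_data else "TIER_2",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }
```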
### Step 5: Quality Validation

- Check for duplicates (same name + city)
- Verify ISIL code matches
- Geocode addresses (Nominatim API)
- Validate institution types

---

## Expected Results

### Before Integration

| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,000-3,000 | 20-30% of archives |
| **Total** | **16,979** | **~2,500** | **Partial** |

### After Integration

| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,500 | Core authoritative data |
| Archivportal-D | ~12,000 | ~12,000 | Complete archive coverage |
| **Merged (deduplicated)** | **~25,000** | **~12,000** | **~100% archives** |

### Coverage Improvement: NRW Example

- **Before**: 301 archives (63%)
- **After**: ~477 archives (100%)
- **Gain**: +176 archives (+58%)

---

## Timeline and Resources

### Time Estimate

- **API registration**: 10 minutes
- **API harvester development**: 2 hours
- **Data harvest**: 1-2 hours
- **Cross-referencing**: 1 hour
- **Deduplication**: 1 hour
- **Validation**: 2 hours
- **Documentation**: 1 hour
- **Total**: ~8-10 hours

### Required Tools

- Python 3.9+
- `requests` library (API calls)
- `beautifulsoup4` (if web scraping is needed)
- Nominatim API (geocoding)
- LinkML validator (schema validation)

### Output Files

```
/data/isil/germany/
├── archivportal_d_raw_20251119.json          # Raw API data
├── archivportal_d_archives_20251119.json     # Processed archives
├── german_isil_complete_20251119_134939.json # Existing ISIL data
├── german_unified_20251119.json              # Merged dataset
├── german_unified_20251119.jsonl             # Line-delimited
├── german_new_discoveries_20251119.json      # Archives without ISIL
├── ARCHIVPORTAL_D_HARVEST_REPORT.md          # Documentation
└── GERMAN_UNIFIED_STATISTICS.json            # Comprehensive stats
```

---
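The line-delimited variant in the output list can be produced with a short helper. A minimal sketch (`write_jsonl` is our own name; `ensure_ascii=False` keeps umlauts readable in the output):

```python
import json

def write_jsonl(records, path):
    """Write one JSON object per line (UTF-8, umlauts kept unescaped)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```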
## Next Steps (Priority Order)

### 1. Register for DDB API ✅ (Immediate)

- Go to https://www.deutsche-digitale-bibliothek.de/
- Create a free account
- Generate an API key in "My DDB"
- Test API access

### 2. Develop Archivportal-D Harvester ✅ (Today)

- Script: `scripts/scrapers/harvest_archivportal_d.py`
- Use the DDB API (preferred) or web scraping (fallback)
- Implement rate limiting (1 req/second)
- Add error handling and retry logic

### 3. Harvest All German Archives ✅ (Today)

- Run the harvester for ~10,000-20,000 archives
- Save raw data + processed records
- Generate a harvest report

### 4. Cross-Reference and Merge ✅ (Today)

- Match by ISIL codes
- Identify new discoveries
- Create a unified dataset
- Generate statistics

### 5. Validate and Document ✅ (Today)

- Check for duplicates
- Validate data quality
- Create comprehensive documentation
- Update progress reports

### 6. OPTIONAL: Regional Portal Enrichment (Future)

- Scrape the NRW portal for detailed metadata
- Repeat for other states
- Enrich the merged dataset with contact info

---

## Success Criteria

✅ **Complete German Archive Coverage**
- All ~12,000 German archives harvested
- 100% of Archivportal-D listings
- Cross-referenced with the ISIL registry

✅ **High Data Quality**
- Deduplication complete (< 1% duplicates)
- Geocoding for 80%+ of archives
- Institution types classified

✅ **Comprehensive Documentation**
- Harvest reports for all sources
- Integration methodology documented
- Data quality metrics published

✅ **Ready for GLAM Integration**
- LinkML format conversion complete
- GLAMORCUBESFIXPHDNT taxonomy applied
- GHCIDs generated

---

## Risk Mitigation

### Risk 1: API Rate Limits

- **Mitigation**: Implement a 1 req/second limit, use batch requests
- **Fallback**: Web scraping if the API is unavailable

### Risk 2: Data Quality Issues

- **Mitigation**: Validation scripts, manual review of suspicious records
- **Fallback**: Flag low-quality records, document issues

### Risk 3: Duplicates

- **Mitigation**: Fuzzy matching by name + city, ISIL code verification
- **Fallback**: Manual deduplication for high-confidence matches
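The name + city fuzzy matching mentioned for Risk 3 can be sketched with the standard library's `difflib`; the 0.9 similarity cutoff and the helper names are assumptions for illustration, not a tuned value:

```python
from difflib import SequenceMatcher

def dedup_key(name, city):
    """Normalize name + city into a lowercase comparable key."""
    return f"{name} {city}".lower().strip()

def is_probable_duplicate(a, b, threshold=0.9):
    """Flag two records as likely duplicates when their name+city keys
    are at least `threshold` similar (0.9 is an assumed cutoff)."""
    ratio = SequenceMatcher(
        None,
        dedup_key(a["name"], a["city"]),
        dedup_key(b["name"], b["city"]),
    ).ratio()
    return ratio >= threshold
```

Records flagged here would still go through the manual review step for borderline scores, per the fallback above.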
### Risk 4: Missing Metadata

- **Mitigation**: Use multiple sources (ISIL + Archivportal-D + regional portals)
- **Fallback**: Mark incomplete records, enrich later

---

## References

### Primary Sources

- **ISIL Registry**: https://services.dnb.de/sru/bib (16,979 records) ✅
- **Archivportal-D**: https://www.archivportal-d.de/ (~12,000 archives) 🔄
- **DDB API**: https://api.deutsche-digitale-bibliothek.de/ 🔄

### Regional Portals

- **NRW**: https://www.archive.nrw.de/ (477 archives)
- **Baden-Württemberg**: https://www.landesarchiv-bw.de/
- **Niedersachsen**: https://www.arcinsys.niedersachsen.de/

### Documentation

- ISIL Harvest: `/data/isil/germany/HARVEST_REPORT.md`
- Comprehensiveness Check: `/data/isil/germany/COMPREHENSIVENESS_REPORT.md`
- Archivportal-D Discovery: `/data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md`
- This Plan: `/data/isil/germany/COMPLETENESS_PLAN.md`

---

**Status**: Plan complete, ready for implementation
**Priority**: HIGH - Critical for 100% German coverage
**Next Action**: Register for DDB API key, develop harvester
**Estimated Completion**: Same day (8-10 hours of work)

---

**End of Plan**