# German Archive Completion - Quick Start Guide

**Goal:** Achieve 100% German archive coverage by harvesting Archivportal-D via the DDB API
## Current Status

- ✅ ISIL Registry: 16,979 institutions harvested
- 🔄 Archivportal-D: awaiting API access (~10,000-20,000 archives)
- 🎯 Target: ~25,000-27,000 total German institutions
## Step-by-Step: Complete German Archive Harvest

### Step 1: Register for a DDB API Key (10 minutes)
1. Visit https://www.deutsche-digitale-bibliothek.de/
2. Click "Registrieren" (register button, top right)
3. Fill in the form:
   - Email: [your-email]
   - Username: [choose username]
   - Password: [secure password]
4. Verify your email: check your inbox and click the confirmation link
5. Log in to the DDB portal with your credentials
6. Navigate to "Meine DDB" (My DDB) in the account menu
7. Generate an API key: look for the "API-Schlüssel" (API key) section
8. Copy the key and save it somewhere secure (e.g., a password manager)
**Result:** You now have a DDB API key (e.g., `ddb_abc123xyz456...`)
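Before wiring the key into the harvester, it is worth a quick sanity check. The sketch below is an assumption-heavy helper, not part of the repository: the `/search` endpoint and Bearer-token header mirror what the harvester script in Step 2 uses, so adjust them if the DDB API docs for your account say otherwise.

```python
# Minimal sanity check for a freshly generated DDB API key.
# check_key() and auth_headers() are hypothetical helpers for illustration.
import requests

API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"

def auth_headers(api_key: str) -> dict:
    """Build the request headers the harvester assumes the DDB API expects."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }

def check_key(api_key: str) -> bool:
    """Return True if the key is not rejected outright (no 401/403)."""
    response = requests.get(
        f"{API_BASE_URL}/search",
        headers=auth_headers(api_key),
        params={"query": "*", "rows": 1},
        timeout=30,
    )
    return response.status_code not in (401, 403)

if __name__ == "__main__":
    print("Key OK" if check_key("YOUR_API_KEY_HERE") else "Key rejected")
```

If this prints "Key rejected", fix the key before spending time on the harvester (see Troubleshooting below for common causes).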
### Step 2: Create the API Harvester (2 hours)

**File:** `/Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d_api.py`
```python
#!/usr/bin/env python3
"""
Archivportal-D API Harvester

Fetches all German archives via the Deutsche Digitale Bibliothek REST API.
"""
import json
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

import requests

# Configuration
API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your DDB API key
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
BATCH_SIZE = 100     # Archives per request
REQUEST_DELAY = 0.5  # Seconds between requests
MAX_RETRIES = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_archives_batch(offset: int = 0, rows: int = 100) -> Optional[Dict]:
    """
    Fetch a batch of archives via the DDB API.

    Args:
        offset: Starting record number
        rows: Number of records to fetch

    Returns:
        API response dict, or None on error
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json",
    }
    params = {
        "query": "*",              # All archives
        "sector": "arc_archives",  # Archives sector only
        "rows": rows,
        "offset": offset,
    }
    for attempt in range(MAX_RETRIES):
        try:
            print(f"Fetching archives {offset}-{offset + rows - 1}...", end=" ")
            response = requests.get(
                f"{API_BASE_URL}/search",
                headers=headers,
                params=params,
                timeout=30,
            )
            response.raise_for_status()
            data = response.json()
            total = data.get("numberOfResults", 0)
            print(f"OK (total: {total})")
            return data
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1}/{MAX_RETRIES} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(REQUEST_DELAY * (attempt + 1))
    return None


def parse_archive_record(record: Dict) -> Dict:
    """
    Parse a DDB API archive record into a simplified format.

    Args:
        record: Raw API record

    Returns:
        Parsed archive dictionary
    """
    record_id = record.get("id")
    return {
        "id": record_id,
        "name": record.get("title"),
        "location": record.get("place"),
        "federal_state": record.get("federalState"),
        "archive_type": record.get("label"),
        "isil": record.get("isil"),
        "latitude": record.get("latitude"),
        "longitude": record.get("longitude"),
        "thumbnail": record.get("thumbnail"),
        "profile_url": f"https://www.archivportal-d.de/item/{record_id}" if record_id else None,
    }


def harvest_all_archives() -> List[Dict]:
    """
    Harvest all archives from the DDB API.

    Returns:
        List of parsed archive records
    """
    print(f"\n{'=' * 70}")
    print("Harvesting Archivportal-D via DDB API")
    print(f"Endpoint: {API_BASE_URL}/search")
    print(f"{'=' * 70}\n")

    all_archives = []
    offset = 0
    while True:
        # Fetch one batch
        data = fetch_archives_batch(offset, BATCH_SIZE)
        if not data:
            print(f"Warning: Failed to fetch batch at offset {offset}. Stopping.")
            break

        # Parse results
        results = data.get("results", [])
        for result in results:
            all_archives.append(parse_archive_record(result))
        print(f"Progress: {len(all_archives)} archives collected")

        # Check if done
        total = data.get("numberOfResults", 0)
        if len(all_archives) >= total or len(results) < BATCH_SIZE:
            break
        offset += BATCH_SIZE
        time.sleep(REQUEST_DELAY)

    print(f"\n{'=' * 70}")
    print(f"Harvest complete: {len(all_archives)} archives")
    print(f"{'=' * 70}\n")
    return all_archives


def save_archives(archives: List[Dict]):
    """Save archives to a timestamped JSON file."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = OUTPUT_DIR / f"archivportal_d_api_{timestamp}.json"
    output = {
        "metadata": {
            "source": "Archivportal-D via DDB API",
            "source_url": "https://www.archivportal-d.de",
            "api_endpoint": f"{API_BASE_URL}/search",
            "operator": "Deutsche Digitale Bibliothek",
            "harvest_date": datetime.utcnow().isoformat() + "Z",
            "total_archives": len(archives),
            "method": "REST API",
            "license": "CC0 1.0 Universal (Public Domain)",
        },
        "archives": archives,
    }
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB\n")


def generate_statistics(archives: List[Dict]):
    """Generate and print summary statistics."""
    stats = {
        "total": len(archives),
        "by_state": {},
        "by_type": {},
        "with_isil": 0,
        "with_coordinates": 0,
    }
    for archive in archives:
        # By state ("or" also catches records where the field is present but None)
        state = archive.get("federal_state") or "Unknown"
        stats["by_state"][state] = stats["by_state"].get(state, 0) + 1
        # By type
        arch_type = archive.get("archive_type") or "Unknown"
        stats["by_type"][arch_type] = stats["by_type"].get(arch_type, 0) + 1
        # Completeness
        if archive.get("isil"):
            stats["with_isil"] += 1
        if archive.get("latitude"):
            stats["with_coordinates"] += 1

    print(f"\n{'=' * 70}")
    print("Statistics:")
    print(f"{'=' * 70}")
    print(f"Total archives: {stats['total']}")
    print(f"With ISIL: {stats['with_isil']} ({stats['with_isil'] / stats['total'] * 100:.1f}%)")
    print(f"With coordinates: {stats['with_coordinates']} ({stats['with_coordinates'] / stats['total'] * 100:.1f}%)")
    print("\nTop 10 federal states:")
    for state, count in sorted(stats["by_state"].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {state}: {count}")
    print("\nTop 10 archive types:")
    for arch_type, count in sorted(stats["by_type"].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {arch_type}: {count}")
    print(f"{'=' * 70}\n")
    return stats


def main():
    """Main execution."""
    print(f"\n{'#' * 70}")
    print("# Archivportal-D API Harvester")
    print("# Deutsche Digitale Bibliothek REST API")
    print(f"{'#' * 70}\n")

    if API_KEY == "YOUR_API_KEY_HERE":
        print("ERROR: Please set your DDB API key in the script!")
        print("Edit the API_KEY constant near the top of the file.")
        return

    # Harvest
    archives = harvest_all_archives()
    if not archives:
        print("No archives harvested. Exiting.")
        return

    # Save
    save_archives(archives)

    # Statistics
    stats = generate_statistics(archives)

    # Save stats
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    stats_file = OUTPUT_DIR / f"archivportal_d_api_stats_{timestamp}.json"
    with open(stats_file, "w", encoding="utf-8") as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    print(f"✓ Statistics saved to: {stats_file}\n")
    print("✓ Harvest complete!\n")


if __name__ == "__main__":
    main()
```
**Edit the `API_KEY` constant:** replace `YOUR_API_KEY_HERE` with your actual DDB API key.
### Step 3: Test Harvest (30 minutes)

```bash
cd /Users/kempersc/apps/glam

# Test run: verify the first few batches look correct before
# committing to the full harvest
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Verify the output:**
- JSON file created in `data/isil/germany/`
- 100-10,000+ archives (depending on the total)
- Statistics show a reasonable distribution by state
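The verification steps above can also be done programmatically. The sketch below is a hypothetical helper, not an existing script in the repository; the field names (`archives`, `metadata`, `federal_state`, `isil`, `latitude`) mirror what the harvester in Step 2 writes.

```python
# Quick structural check of a harvest file before committing to the full run.
# validate_harvest() is a hypothetical helper for illustration.
import json
from pathlib import Path
from typing import Dict

def validate_harvest(data: Dict) -> Dict:
    """Summarise a harvest dict: counts, states seen, ISIL/coordinate coverage."""
    archives = data.get("archives", [])
    states = {a.get("federal_state") for a in archives if a.get("federal_state")}
    return {
        "total": len(archives),
        "metadata_total_matches": data.get("metadata", {}).get("total_archives") == len(archives),
        "states_seen": len(states),
        "with_isil": sum(1 for a in archives if a.get("isil")),
        "with_coordinates": sum(1 for a in archives if a.get("latitude")),
    }

if __name__ == "__main__":
    # Point this at the newest harvest file in data/isil/germany/
    path = sorted(Path("data/isil/germany").glob("archivportal_d_api_*.json"))[-1]
    print(validate_harvest(json.loads(path.read_text(encoding="utf-8"))))
```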
### Step 4: Full Harvest (1-2 hours)

If the test is successful, run the full harvest (the script automatically pages through all results):

```bash
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Monitor progress:**
- Watch the console output for batch progress
- Estimated time: 1-2 hours for ~10,000-20,000 archives
- Final output: JSON file with the complete archive listing
### Step 5: Cross-Reference with ISIL (1 hour)

Create a merge script, `scripts/merge_archivportal_isil.py`:

```bash
# Match by ISIL code, identify new discoveries
python3 scripts/merge_archivportal_isil.py
```

**Expected results:**
- 30-50% of archives match an ISIL record by code
- 50-70% are new discoveries (no ISIL)
- Report: overlap statistics, new archive count
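Since the merge script does not exist yet, here is a sketch of its core matching step: partition harvested archives into ISIL matches and new discoveries by comparing normalised ISIL codes. The record shapes are assumptions based on the harvester output and the ISIL dataset; `partition_by_isil` is a hypothetical name.

```python
# Possible core of merge_archivportal_isil.py: split archives into those that
# match an ISIL record by code and those that are new discoveries.
from typing import Dict, List, Tuple

def partition_by_isil(
    archives: List[Dict], isil_records: List[Dict]
) -> Tuple[List[Dict], List[Dict]]:
    """Return (matched, new) archive lists, matching on normalised ISIL code."""
    known = {r["isil"].strip().upper() for r in isil_records if r.get("isil")}
    matched, new = [], []
    for archive in archives:
        code = (archive.get("isil") or "").strip().upper()
        (matched if code and code in known else new).append(archive)
    return matched, new

archives = [
    {"name": "Stadtarchiv A", "isil": "DE-1a"},
    {"name": "Kreisarchiv B", "isil": None},
]
isil_records = [{"isil": "DE-1a"}]
matched, new = partition_by_isil(archives, isil_records)
print(f"{len(matched)} matched, {len(new)} new")  # → 1 matched, 1 new
```

Normalising case and whitespace before comparing avoids spurious "new discoveries" caused by formatting differences between the two sources.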
### Step 6: Create Unified Dataset (1 hour)

Merge the ISIL and Archivportal-D data:

```bash
python3 scripts/create_german_unified_dataset.py
```

**Output:** `data/isil/germany/german_unified_TIMESTAMP.json`
- ~25,000-27,000 total institutions
- ~12,000-15,000 archives (complete coverage)
- ~12,000-15,000 libraries/museums (from ISIL)
## Success Criteria

✅ **Harvest successful** when:
- 10,000-20,000 archives fetched
- All 16 federal states represented
- 30-50% have ISIL codes
- 80%+ have geographic coordinates
- JSON file size: 5-20 MB

✅ **Integration complete** when:
- Cross-referenced with the ISIL dataset
- Duplicates removed (< 1%)
- Unified dataset created
- Statistics generated
- Documentation updated
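The harvest criteria above are mechanical enough to check in code. The thresholds below come straight from this guide; `check_criteria` is a hypothetical helper, not part of the repository.

```python
# Mechanical check of the harvest success criteria listed above.
# check_criteria() is a hypothetical helper for illustration.
from typing import Dict, List

def check_criteria(archives: List[Dict]) -> Dict[str, bool]:
    """Evaluate the guide's harvest criteria against a list of parsed archives."""
    total = len(archives)
    states = {a.get("federal_state") for a in archives if a.get("federal_state")}
    isil_share = sum(1 for a in archives if a.get("isil")) / total if total else 0.0
    coord_share = sum(1 for a in archives if a.get("latitude")) / total if total else 0.0
    return {
        "count_in_range": 10_000 <= total <= 20_000,   # 10,000-20,000 archives
        "all_states": len(states) == 16,               # all 16 federal states
        "isil_share_ok": 0.30 <= isil_share <= 0.50,   # 30-50% with ISIL
        "coords_ok": coord_share >= 0.80,              # 80%+ with coordinates
    }
```

Any `False` in the result is worth investigating before moving on to the integration steps (the file-size criterion is easier to check on the file itself).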
## Troubleshooting

### Issue: API Key Invalid

**Error:** `401 Unauthorized` or `403 Forbidden`

**Fix:**
- Verify the API key was copied correctly (no extra spaces)
- Check that the key is active in your DDB account
- Ensure the `Authorization` header uses the `Bearer` token format

### Issue: No Results Returned

**Error:** `numberOfResults: 0`

**Fix:**
- Try a different `query` parameter (e.g., `query: "archiv*"`)
- Check that the `sector` parameter is correct (`arc_archives`)
- Verify the API endpoint URL is correct

### Issue: Rate Limit Exceeded

**Error:** `429 Too Many Requests`

**Fix:**
- Increase `REQUEST_DELAY` from 0.5s to 1.0s or 2.0s
- Reduce `BATCH_SIZE` from 100 to 50
- Wait a few minutes before retrying
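The harvester retries with a linearly growing delay. If 429 responses persist even after the fixes above, exponential backoff with jitter is the usual remedy; the sketch below shows only the delay schedule, not the request loop, and is an illustration rather than code from the repository.

```python
# Exponential backoff with full jitter, as an alternative to the harvester's
# linear retry delay. backoff_delay() is a hypothetical helper.
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds for a given retry attempt (0-based), with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Upper bounds of the schedule: 0.5s, 1s, 2s, 4s, ... capped at 30s.
```

Jitter spreads out retries from concurrent clients, which matters if you ever run more than one harvester against the same API.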
## Files You'll Create

```
/data/isil/germany/
├── archivportal_d_api_TIMESTAMP.json        # Harvested archives
├── archivportal_d_api_stats_TIMESTAMP.json  # Statistics
├── german_unified_TIMESTAMP.json            # Merged dataset
└── GERMAN_UNIFIED_REPORT.md                 # Final report

/scripts/scrapers/
├── harvest_archivportal_d_api.py            # API harvester
└── merge_archivportal_isil.py               # Merge script
```
## Estimated Timeline

| Task | Time | Status |
|---|---|---|
| DDB API registration | 10 min | ⏳ To do |
| Create API harvester | 2 hours | ⏳ To do |
| Test harvest | 30 min | ⏳ To do |
| Full harvest | 1-2 hours | ⏳ To do |
| Cross-reference | 1 hour | ⏳ To do |
| Create unified dataset | 1 hour | ⏳ To do |
| Documentation | 1 hour | ⏳ To do |
| **Total** | **~6-9 hours** | |
## After Completion

The German harvest will be 100% complete:
- ✅ All ISIL-registered institutions (16,979)
- ✅ All Archivportal-D archives (~10,000-20,000)
- ✅ Unified, deduplicated dataset (~25,000-27,000)
- ✅ Ready for LinkML conversion
- ✅ Ready for GHCID generation

**Next countries to harvest:**
- Czech Republic: ISIL + Caslin registry
- Austria: ISIL + BiPHAN
- France: ISIL + Archives de France
- Belgium: ISIL + LOCUS
## Quick Links

- DDB portal: https://www.deutsche-digitale-bibliothek.de/
- API docs: https://api.deutsche-digitale-bibliothek.de/ (after login)
- Archivportal-D: https://www.archivportal-d.de/
- ISIL registry: https://sigel.staatsbibliothek-berlin.de/

**Ready to start?** ✅ Register for the DDB API now!

**Questions?** Review `/data/isil/germany/SESSION_SUMMARY_20251119_ARCHIVPORTAL_D.md`