# German Archive Completion - Quick Start Guide
**Goal**: Achieve 100% German archive coverage by harvesting Archivportal-D via the DDB API

---

## Current Status

✅ **ISIL Registry**: 16,979 institutions harvested
🔄 **Archivportal-D**: Awaiting API access (~10,000-20,000 archives)
🎯 **Target**: ~25,000-27,000 total German institutions
---

## Step-by-Step: Complete German Archive Harvest

### Step 1: Register for DDB API (10 minutes)

1. **Visit**: https://www.deutsche-digitale-bibliothek.de/
2. **Click**: "Registrieren" (register button, top right)
3. **Fill in the form**:
   - Email: [your-email]
   - Username: [choose username]
   - Password: [secure password]
4. **Verify your email**: Check your inbox and click the confirmation link
5. **Log in**: Use your credentials to log into the DDB portal
6. **Navigate to**: "Meine DDB" (My DDB) in the account menu
7. **Generate an API key**: Look for the "API-Schlüssel" (API key) section
8. **Copy the key**: Save it to a secure location (e.g., a password manager)

**Result**: You now have a DDB API key (e.g., `ddb_abc123xyz456...`)
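Before wiring the key into the harvester, you can sanity-check the request shape it will send, offline and without a network call. This is a minimal sketch: the endpoint, Bearer auth scheme, and parameters mirror the harvester script's configuration, and the key shown is the placeholder from the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "ddb_abc123xyz456"  # placeholder key; substitute your real one

# Build (but do not send) the same GET request the harvester issues
params = urlencode({"query": "*", "sector": "arc_archives", "rows": 1, "offset": 0})
req = Request(
    f"{API_BASE_URL}/search?{params}",
    headers={"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"},
)
print(req.full_url)
print(req.get_header("Authorization"))
```

If the printed URL and header look right, the same values can be pasted into a `curl` call for a live test.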
---

### Step 2: Create API Harvester (2 hours)

**File**: `/Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d_api.py`

```python
#!/usr/bin/env python3
"""
Archivportal-D API Harvester

Fetches all German archives via the Deutsche Digitale Bibliothek REST API.
"""

import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional

import requests

# Configuration
API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your DDB API key
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
BATCH_SIZE = 100      # Archives per request
REQUEST_DELAY = 0.5   # Seconds between requests
MAX_RETRIES = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_archives_batch(offset: int = 0, rows: int = 100) -> Optional[Dict]:
    """
    Fetch a batch of archives via the DDB API.

    Args:
        offset: Starting record number
        rows: Number of records to fetch

    Returns:
        API response dict, or None on error
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json",
    }
    params = {
        "query": "*",              # All archives
        "sector": "arc_archives",  # Archives sector only
        "rows": rows,
        "offset": offset,
    }

    for attempt in range(MAX_RETRIES):
        try:
            print(f"Fetching archives {offset}-{offset + rows - 1}...", end=' ')
            response = requests.get(
                f"{API_BASE_URL}/search",
                headers=headers,
                params=params,
                timeout=30,
            )
            response.raise_for_status()

            data = response.json()
            total = data.get('numberOfResults', 0)
            print(f"OK (total: {total})")
            return data

        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1}/{MAX_RETRIES} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(REQUEST_DELAY * (attempt + 1))  # Linear backoff

    return None


def parse_archive_record(record: Dict) -> Dict:
    """
    Parse a raw DDB API archive record into a simplified format.

    Args:
        record: Raw API record

    Returns:
        Parsed archive dictionary
    """
    record_id = record.get('id')
    return {
        'id': record_id,
        'name': record.get('title'),
        'location': record.get('place'),
        'federal_state': record.get('federalState'),
        'archive_type': record.get('label'),
        'isil': record.get('isil'),
        'latitude': record.get('latitude'),
        'longitude': record.get('longitude'),
        'thumbnail': record.get('thumbnail'),
        'profile_url': f"https://www.archivportal-d.de/item/{record_id}" if record_id else None,
    }


def harvest_all_archives() -> List[Dict]:
    """
    Harvest all archives from the DDB API.

    Returns:
        List of parsed archive records
    """
    print(f"\n{'=' * 70}")
    print("Harvesting Archivportal-D via DDB API")
    print(f"Endpoint: {API_BASE_URL}/search")
    print(f"{'=' * 70}\n")

    all_archives = []
    offset = 0

    while True:
        # Fetch batch
        data = fetch_archives_batch(offset, BATCH_SIZE)
        if not data:
            print(f"Warning: Failed to fetch batch at offset {offset}. Stopping.")
            break

        # Parse results
        results = data.get('results', [])
        for result in results:
            all_archives.append(parse_archive_record(result))

        print(f"Progress: {len(all_archives)} archives collected")

        # Stop once every record is fetched or the last page came back short
        total = data.get('numberOfResults', 0)
        if len(all_archives) >= total or len(results) < BATCH_SIZE:
            break

        offset += BATCH_SIZE
        time.sleep(REQUEST_DELAY)

    print(f"\n{'=' * 70}")
    print(f"Harvest complete: {len(all_archives)} archives")
    print(f"{'=' * 70}\n")

    return all_archives


def save_archives(archives: List[Dict]) -> None:
    """Save archives to a timestamped JSON file."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_file = OUTPUT_DIR / f"archivportal_d_api_{timestamp}.json"

    output = {
        'metadata': {
            'source': 'Archivportal-D via DDB API',
            'source_url': 'https://www.archivportal-d.de',
            'api_endpoint': f'{API_BASE_URL}/search',
            'operator': 'Deutsche Digitale Bibliothek',
            'harvest_date': datetime.now(timezone.utc).isoformat(),
            'total_archives': len(archives),
            'method': 'REST API',
            'license': 'CC0 1.0 Universal (Public Domain)',
        },
        'archives': archives,
    }

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)

    print(f"✓ Saved to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB\n")


def generate_statistics(archives: List[Dict]) -> Dict:
    """Generate and print summary statistics."""
    stats = {
        'total': len(archives),
        'by_state': {},
        'by_type': {},
        'with_isil': 0,
        'with_coordinates': 0,
    }

    for archive in archives:
        # By federal state (fall back to 'Unknown' for missing/None values)
        state = archive.get('federal_state') or 'Unknown'
        stats['by_state'][state] = stats['by_state'].get(state, 0) + 1

        # By archive type
        arch_type = archive.get('archive_type') or 'Unknown'
        stats['by_type'][arch_type] = stats['by_type'].get(arch_type, 0) + 1

        # Completeness
        if archive.get('isil'):
            stats['with_isil'] += 1
        if archive.get('latitude'):
            stats['with_coordinates'] += 1

    print(f"\n{'=' * 70}")
    print("Statistics:")
    print(f"{'=' * 70}")
    print(f"Total archives: {stats['total']}")
    print(f"With ISIL: {stats['with_isil']} ({stats['with_isil'] / stats['total'] * 100:.1f}%)")
    print(f"With coordinates: {stats['with_coordinates']} ({stats['with_coordinates'] / stats['total'] * 100:.1f}%)")

    print("\nTop 10 federal states:")
    for state, count in sorted(stats['by_state'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {state}: {count}")

    print("\nTop 10 archive types:")
    for arch_type, count in sorted(stats['by_type'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {arch_type}: {count}")

    print(f"{'=' * 70}\n")
    return stats


def main() -> None:
    """Main execution."""
    print(f"\n{'#' * 70}")
    print("# Archivportal-D API Harvester")
    print("# Deutsche Digitale Bibliothek REST API")
    print(f"{'#' * 70}\n")

    if API_KEY == "YOUR_API_KEY_HERE":
        print("ERROR: Please set your DDB API key in the script!")
        print("Edit the API_KEY constant near the top of this file.")
        return

    # Harvest
    archives = harvest_all_archives()
    if not archives:
        print("No archives harvested. Exiting.")
        return

    # Save
    save_archives(archives)

    # Statistics
    stats = generate_statistics(archives)

    # Save stats
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    stats_file = OUTPUT_DIR / f"archivportal_d_api_stats_{timestamp}.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    print(f"✓ Statistics saved to: {stats_file}\n")

    print("✓ Harvest complete!\n")


if __name__ == "__main__":
    main()
```
**Set the key**: Replace `YOUR_API_KEY_HERE` in the `API_KEY` constant with your actual DDB API key.
---

### Step 3: Test Harvest (30 minutes)

```bash
cd /Users/kempersc/apps/glam

# Smoke test: run the harvester and watch the first few batches
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Verify the output**:
- JSON file created in `data/isil/germany/`
- 100-10,000+ archives (depending on the total)
- Statistics show a plausible distribution across federal states
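A quick way to eyeball the newest harvest file is a small summary script. This is a sketch, not part of the harvester; the field names (`metadata.total_archives`, `archives`, `federal_state`, `isil`) mirror what `save_archives()` writes, and it assumes you run it from the repo root.

```python
import json
from pathlib import Path

def summarize(harvest: dict) -> dict:
    """Summarize a harvest file produced by save_archives()."""
    archives = harvest["archives"]
    states = {a.get("federal_state") or "Unknown" for a in archives}
    return {
        "total": harvest["metadata"]["total_archives"],
        "parsed": len(archives),
        "states": len(states),
        "with_isil": sum(1 for a in archives if a.get("isil")),
    }

# Load the newest harvest file, if one exists
files = sorted(Path("data/isil/germany").glob("archivportal_d_api_*.json"))
if files:
    print(summarize(json.loads(files[-1].read_text(encoding="utf-8"))))
```

A healthy harvest should show `parsed` equal to `total` and close to 16 distinct states.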
---

### Step 4: Full Harvest (1-2 hours)

If the test was successful, run the full harvest (the script automatically fetches all batches):

```bash
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Monitor progress**:
- Watch the console output for batch progress
- Estimated time: 1-2 hours for ~10,000-20,000 archives
- Final output: JSON file with the complete archive listing
---

### Step 5: Cross-Reference with ISIL (1 hour)

Create a merge script, `scripts/merge_archivportal_isil.py`:

```bash
# Match by ISIL code, identify new discoveries
python3 scripts/merge_archivportal_isil.py
```

**Expected results**:
- 30-50% of archives match the ISIL registry by code
- 50-70% are new discoveries (no ISIL code)
- Report: overlap statistics, new-archive count
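The merge script does not exist yet; the core matching step could look like the sketch below. It assumes both datasets expose an `isil` field per record (the harvester's output does; the ISIL registry field name is an assumption to adapt).

```python
from typing import Dict, List, Tuple

def cross_reference(
    archivportal: List[Dict], isil_registry: List[Dict]
) -> Tuple[List[Dict], List[Dict]]:
    """Split Archivportal-D records into ISIL matches and new discoveries."""
    known_isils = {rec["isil"] for rec in isil_registry if rec.get("isil")}
    matched = [a for a in archivportal if a.get("isil") in known_isils]
    new = [a for a in archivportal if a.get("isil") not in known_isils]
    return matched, new

archivportal = [
    {"name": "Stadtarchiv A", "isil": "DE-1"},
    {"name": "Kreisarchiv B", "isil": None},  # no ISIL -> new discovery
]
isil_registry = [{"name": "Stadtarchiv A", "isil": "DE-1"}]

matched, new = cross_reference(archivportal, isil_registry)
print(len(matched), len(new))  # 1 1
```

Records without an ISIL code always land in `new`; a production script would add fuzzy name/place matching on top of this exact-code pass.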
---

### Step 6: Create Unified Dataset (1 hour)

Merge the ISIL and Archivportal-D data:

```bash
python3 scripts/create_german_unified_dataset.py
```

**Output**:
- `data/isil/germany/german_unified_TIMESTAMP.json`
- ~25,000-27,000 total institutions
- ~12,000-15,000 archives (complete coverage)
- ~12,000-15,000 libraries/museums (from ISIL)
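A sketch of the deduplication idea behind the unified dataset. The policy here (ISIL records win on conflicts, Archivportal-D fills in missing fields, ISIL-less archives are appended) is a design assumption for illustration, not the project's fixed rule.

```python
def unify(isil_records, archivportal_records):
    """Merge two record lists, deduplicating on ISIL code."""
    by_isil = {r["isil"]: dict(r) for r in isil_records if r.get("isil")}
    unified = list(by_isil.values())
    for rec in archivportal_records:
        isil = rec.get("isil")
        if isil and isil in by_isil:
            # Duplicate: enrich the ISIL record with any fields it lacks
            for key, value in rec.items():
                by_isil[isil].setdefault(key, value)
        else:
            unified.append(rec)  # New discovery: keep as-is
    return unified

isil = [{"isil": "DE-1", "name": "Stadtarchiv A"}]
arc = [
    {"isil": "DE-1", "name": "Stadtarchiv A", "latitude": 48.1},
    {"isil": None, "name": "Kreisarchiv B"},
]
print(len(unify(isil, arc)))  # 2
```

`setdefault` only fills gaps, so the ISIL record's existing values are never overwritten; this is what keeps the duplicate rate under the < 1% target below.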
---

## Success Criteria

✅ **Harvest successful** when:

- [ ] 10,000-20,000 archives fetched
- [ ] All 16 federal states represented
- [ ] 30-50% have ISIL codes
- [ ] 80%+ have geographic coordinates
- [ ] JSON file size: 5-20 MB

✅ **Integration complete** when:

- [ ] Cross-referenced with the ISIL dataset
- [ ] Duplicates removed (< 1%)
- [ ] Unified dataset created
- [ ] Statistics generated
- [ ] Documentation updated
---

## Troubleshooting

### Issue: API Key Invalid

**Error**: `401 Unauthorized` or `403 Forbidden`

**Fix**:
- Verify the API key was copied correctly (no extra spaces)
- Check that the key is active in your DDB account
- Ensure the `Authorization` header uses the `Bearer` token format

### Issue: No Results Returned

**Error**: `numberOfResults: 0`

**Fix**:
- Try a different query parameter (e.g., `query: "archiv*"`)
- Check that the `sector` parameter is correct (`arc_archives`)
- Verify the API endpoint URL is correct

### Issue: Rate Limit Exceeded

**Error**: `429 Too Many Requests`

**Fix**:
- Increase `REQUEST_DELAY` from 0.5s to 1.0s or 2.0s
- Reduce `BATCH_SIZE` from 100 to 50
- Wait a few minutes before retrying
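If 429s persist even with a larger `REQUEST_DELAY`, the harvester's linear retry delay can be swapped for exponential backoff with jitter, a common pattern for rate-limited APIs. `backoff_delay` below is a hypothetical helper, not part of the script:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Upper bounds grow 0.5s, 1s, 2s, 4s, ... capped at 30s
for attempt in range(4):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.5 * 2 ** attempt)}s")
```

Replacing `time.sleep(REQUEST_DELAY * (attempt + 1))` in `fetch_archives_batch` with `time.sleep(backoff_delay(attempt))` spreads retries out and avoids hammering the API in lockstep.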
---

## Files You'll Create

```
/data/isil/germany/
├── archivportal_d_api_TIMESTAMP.json        # Harvested archives
├── archivportal_d_api_stats_TIMESTAMP.json  # Statistics
├── german_unified_TIMESTAMP.json            # Merged dataset
└── GERMAN_UNIFIED_REPORT.md                 # Final report

/scripts/scrapers/
├── harvest_archivportal_d_api.py            # API harvester
└── merge_archivportal_isil.py               # Merge script
```
---

## Estimated Timeline

| Task | Time | Status |
|------|------|--------|
| **DDB API Registration** | 10 min | ⏳ To do |
| **Create API Harvester** | 2 hours | ⏳ To do |
| **Test Harvest** | 30 min | ⏳ To do |
| **Full Harvest** | 1-2 hours | ⏳ To do |
| **Cross-Reference** | 1 hour | ⏳ To do |
| **Create Unified Dataset** | 1 hour | ⏳ To do |
| **Documentation** | 1 hour | ⏳ To do |
| **TOTAL** | **~6-9 hours** | |
---

## After Completion

**The German harvest will be 100% complete**:
- ✅ All ISIL-registered institutions (16,979)
- ✅ All Archivportal-D archives (~10,000-20,000)
- ✅ Unified, deduplicated dataset (~25,000-27,000)
- ✅ Ready for LinkML conversion
- ✅ Ready for GHCID generation

**Next countries to harvest**:
1. **Czech Republic** - ISIL + Caslin registry
2. **Austria** - ISIL + BiPHAN
3. **France** - ISIL + Archives de France
4. **Belgium** - ISIL + LOCUS
---

## Quick Links

- **DDB Portal**: https://www.deutsche-digitale-bibliothek.de/
- **API Docs**: https://api.deutsche-digitale-bibliothek.de/ (after login)
- **Archivportal-D**: https://www.archivportal-d.de/
- **ISIL Registry**: https://sigel.staatsbibliothek-berlin.de/

---

**Ready to start?** ✅ Register for the DDB API now!

**Questions?** Review `/data/isil/germany/SESSION_SUMMARY_20251119_ARCHIVPORTAL_D.md`