# Archivportal-D Harvester - README

**Date:** November 19, 2025
**Status:** Ready for testing
**Script:** `/Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d.py`
## Overview

This harvester collects archive listings from Archivportal-D, the national German archive portal operated by the Deutsche Digitale Bibliothek.

- **Portal:** https://www.archivportal-d.de/
- **Coverage:** all archives across Germany (~10,000-20,000 archives)
- **Method:** web scraping (fallback while API access is unavailable)
## Current Status: Web Scraping Implementation

### Why Web Scraping?

1. **DDB API registration required**
   - API access requires a free account at https://www.deutsche-digitale-bibliothek.de/
   - The API key must be generated in the "My DDB" area
   - Not immediately available without manual registration

2. **Fallback strategy**
   - Web scraping provides immediate data access
   - Allows initial testing and data collection
   - Can be upgraded to API access later

3. **Production recommendation**
   - For large-scale harvests, register for the DDB API
   - The API provides structured data (JSON)
   - Web scraping requires ongoing HTML selector maintenance
## Installation

### Prerequisites

```bash
# Python 3.9+
python3 --version

# Required libraries (already installed)
pip install requests beautifulsoup4
```

### Verify Installation

```bash
cd /Users/kempersc/apps/glam
python3 -c "import requests, bs4; print('✓ Dependencies OK')"
```
## Usage

### Basic Usage (Test Mode)

```bash
# Run harvester in test mode (10 pages, no profile enrichment)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/harvest_archivportal_d.py
```

**Expected Output:**
- JSON file: `data/isil/germany/archivportal_d_archives_TIMESTAMP.json`
- Statistics: `data/isil/germany/archivportal_d_stats_TIMESTAMP.json`
- ~100-500 archives (depending on pagination)
### Full Harvest (Production Mode)

Edit the script to remove page limits:

```python
# Line ~394: Change max_pages=10 to max_pages=None
archives = harvest_archive_list(max_pages=None, enrich_profiles=False)
```

**Estimated Time:** 2-4 hours (depending on total archives)
**Estimated Output:** ~10,000-20,000 archive records
### Profile Enrichment (Optional)

Enable detailed profile page scraping:

```python
# Line ~394: Set enrich_profiles=True
archives = harvest_archive_list(max_pages=None, enrich_profiles=True)
```

**Enriched Data:**
- Email addresses
- Phone numbers
- Websites
- Finding aid counts
- Digital copy counts
- Geographic coordinates

**Warning:** Profile enrichment significantly increases harvest time (1-2 extra requests per archive).
## Output Format

### JSON Structure

```json
{
  "metadata": {
    "source": "Archivportal-D",
    "source_url": "https://www.archivportal-d.de",
    "operator": "Deutsche Digitale Bibliothek",
    "harvest_date": "2025-11-19T14:30:00Z",
    "total_archives": 10000,
    "method": "Web scraping",
    "license": "CC0 1.0 Universal (Public Domain)"
  },
  "archives": [
    {
      "name": "Stadtarchiv Köln",
      "location": "Köln",
      "federal_state": "Nordrhein-Westfalen",
      "archive_type": "Kommunalarchiv",
      "archive_id": "stadtarchiv-koeln",
      "profile_url": "https://www.archivportal-d.de/struktur/stadtarchiv-koeln",
      "description": "Das Historische Archiv der Stadt Köln...",
      "isil": "DE-KN"
    }
  ]
}
```
### Archive Record Fields

| Field | Type | Description | Required |
|---|---|---|---|
| `name` | string | Archive name | ✓ |
| `location` | string | City/municipality | ✓ |
| `federal_state` | string | German federal state | ✓ |
| `archive_type` | string | Archive sector/category | ◯ |
| `archive_id` | string | Portal-internal ID | ◯ |
| `profile_url` | string | Link to archive profile | ◯ |
| `description` | string | Archive description | ◯ |
| `isil` | string | ISIL identifier (if available) | ◯ |
| `contact` | object | Email, phone, website | ◯ |
| `finding_aids` | integer | Number of finding aids | ◯ |
| `digital_copies` | integer | Number of digitized items | ◯ |
| `coordinates` | object | Latitude/longitude | ◯ |
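The required/optional split above can be checked mechanically after a harvest. A minimal sketch (the `missing_required` helper and the sample record are illustrative, not part of the harvester script):

```python
# Check a harvested record against the required fields from the table above.
REQUIRED = ("name", "location", "federal_state")

def missing_required(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in `record`."""
    return [field for field in REQUIRED if not record.get(field)]

record = {"name": "Stadtarchiv Köln", "location": "Köln",
          "federal_state": "Nordrhein-Westfalen"}
print(missing_required(record))               # → []
print(missing_required({"name": "X"}))        # → ['location', 'federal_state']
```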
## HTML Selector Strategy

### Current Selectors (Subject to Change)

The script uses multiple fallback selectors to maximize compatibility:

```python
# Archive name
['h2', 'h3', 'h4'], class_=['title', 'heading', 'name']

# Location
class_=['location', 'place', 'city']

# Federal state
class_=['state', 'federal-state', 'bundesland']

# Archive type
class_=['type', 'sector', 'category']

# Description
class_=['description', 'abstract', 'summary']

# Profile links
'a', href=re.compile(r'/struktur/[A-Za-z0-9_-]+')
```
### Updating Selectors

If harvest results are poor, inspect the HTML and update the selectors:

1. Navigate to https://www.archivportal-d.de/struktur
2. Open Inspect Element (browser dev tools)
3. Find the archive listing containers and note their CSS classes
4. Update the selectors in `extract_archive_from_listing()`
5. Rerun a test harvest to verify
## Data Quality Validation

### Expected Data Completeness

After a harvest, validate using the statistics output:

```bash
# Check statistics file
cat data/isil/germany/archivportal_d_stats_TIMESTAMP.json
```

**Expected Metrics:**
- Total archives: 10,000-20,000
- With ISIL code: 30-50% (archives already in the ISIL registry)
- With profile URL: 95-100% (portal listings)
- With email: 20-40% (if profile enrichment enabled)
- With coordinates: 10-30% (if profile enrichment enabled)
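If the statistics file is missing or stale, the same percentages can be recomputed directly from the harvest JSON. A sketch (the two sample records are hypothetical):

```python
# Compute field-completeness percentages like the expected metrics above.
def completeness(archives: list[dict], field: str) -> float:
    """Fraction of records with a non-empty value for `field`."""
    if not archives:
        return 0.0
    return sum(1 for a in archives if a.get(field)) / len(archives)

archives = [
    {"isil": "DE-KN", "profile_url": "https://www.archivportal-d.de/struktur/stadtarchiv-koeln"},
    {"isil": None, "profile_url": "https://www.archivportal-d.de/struktur/example"},
]
print(f"With ISIL code:   {completeness(archives, 'isil'):.0%}")          # → 50%
print(f"With profile URL: {completeness(archives, 'profile_url'):.0%}")   # → 100%
```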
### Archive Distribution by Federal State

**Expected Top States (by archive count):**
1. Nordrhein-Westfalen (NRW): ~2,000-3,000 archives
2. Bayern (Bavaria): ~1,500-2,500 archives
3. Baden-Württemberg: ~1,000-1,500 archives
4. Niedersachsen: ~1,000-1,500 archives
5. Hessen: ~800-1,200 archives

If the distribution is skewed, check that:
- pagination is working correctly
- all federal states are being harvested
- no selector issues are filtering out results
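The per-state tally used for this check is a one-liner with the standard library. A sketch over hypothetical sample records:

```python
# Tally harvested archives per federal state to spot a skewed harvest.
from collections import Counter

def state_distribution(archives: list[dict]) -> Counter:
    """Count records per `federal_state` ('unknown' if the field is missing)."""
    return Counter(a.get("federal_state") or "unknown" for a in archives)

archives = [
    {"federal_state": "Bayern"},
    {"federal_state": "Bayern"},
    {"federal_state": "Hessen"},
    {},  # missing field
]
print(state_distribution(archives).most_common())
# → [('Bayern', 2), ('Hessen', 1), ('unknown', 1)]
```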
## Next Steps After Harvest

### 1. Cross-Reference with ISIL Dataset

```bash
# Match archives by ISIL code
python3 scripts/merge_archivportal_isil.py
```

**Expected Matches:**
- ~30-50% of Archivportal-D archives have ISIL codes
- These will match entries in the existing ISIL dataset
- ~50-70% are new discoveries (archives without an ISIL)

### 2. Deduplication

```bash
# Find duplicates (same name + city)
python3 scripts/deduplicate_german_archives.py
```

**Expected Duplicates:** < 1% (archives listed multiple times)
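The "same name + city" rule above can be sketched as a first-seen-wins filter over normalized keys (the `deduplicate` helper is illustrative; the actual logic lives in `deduplicate_german_archives.py`, which is not shown here):

```python
# Keep the first record per normalized (name, location) pair.
def deduplicate(archives: list[dict]) -> list[dict]:
    """Drop later records whose name + city match an earlier one."""
    seen, unique = set(), []
    for a in archives:
        key = (a.get("name", "").strip().lower(),
               a.get("location", "").strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique

records = [
    {"name": "Stadtarchiv Köln", "location": "Köln"},
    {"name": "Stadtarchiv köln", "location": "Köln"},   # duplicate, different casing
    {"name": "Stadtarchiv Bonn", "location": "Bonn"},
]
print(len(deduplicate(records)))  # → 2
```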
### 3. Geocoding

```bash
# Add coordinates to archives missing them
python3 scripts/geocode_german_archives.py
```

**Expected Geocoding Success:** 80-90% (using Nominatim)

### 4. Create Unified Dataset

```bash
# Merge ISIL + Archivportal-D data
python3 scripts/create_german_unified_dataset.py
```

**Expected Output:**
- ~25,000-27,000 total institutions (deduplicated)
- ~12,000-15,000 archives (complete coverage)
- ~12,000-15,000 libraries/museums (from ISIL)
## Troubleshooting

### Issue: No archives found on page 0

**Cause:** HTML selectors don't match the current website structure

**Fix:**
1. Visit https://www.archivportal-d.de/struktur
2. Inspect the page source
3. Update the selectors in `extract_archive_from_listing()`
4. Test with `max_pages=1`

### Issue: Request timeout or connection errors

**Cause:** Server rate limiting or network issues

**Fix:**
1. Increase `REQUEST_DELAY` (line 19) from 1.5 s to 3.0 s
2. Reduce `BATCH_SIZE` if using pagination
3. Check your internet connection
4. Verify that Archivportal-D is accessible

### Issue: Incomplete data (many missing fields)

**Cause:** Portal structure changed, or fields are not present in the listings

**Fix:**
1. Enable profile enrichment: `enrich_profiles=True`
2. Profiles carry more detailed information than list pages
3. This increases harvest time but improves data quality

### Issue: HTTP 403 Forbidden

**Cause:** User-Agent blocking or bot detection

**Fix:**
1. Update `USER_AGENT` (line 20) to look more like a real browser
2. Add additional headers (Accept-Language, Referer, etc.)
3. Implement random delays between requests
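The three 403 fixes can be combined in one place. A standard-library sketch (the header values are illustrative browser strings, not the script's actual `USER_AGENT`, and `polite_request` is a hypothetical helper):

```python
# Browser-style headers plus a randomized delay, per the 403 fixes above.
import random
import time
import urllib.request

BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.archivportal-d.de/",
}

def polite_request(url: str, delay_range=(1.5, 3.0)) -> urllib.request.Request:
    """Build a request with browser-style headers after a random pause."""
    time.sleep(random.uniform(*delay_range))  # avoid a fixed request rhythm
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

# Demo with a zero delay so the example runs instantly
req = polite_request("https://www.archivportal-d.de/struktur", delay_range=(0, 0))
print(req.get_header("Referer"))  # → https://www.archivportal-d.de/
```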
## API Migration (Future)

### When to Upgrade to the DDB API

- Production-scale harvests are needed (10,000+ archives)
- Web scraping becomes unreliable (frequent HTML changes)
- Structured data is required (JSON responses instead of HTML)
- Rate limits become a problem (the API has higher quotas)

### API Registration Steps

1. Create an account at https://www.deutsche-digitale-bibliothek.de/
2. Verify your email and log in
3. Navigate to "My DDB" (Meine DDB)
4. Generate an API key in the account settings
5. Copy the API key to an environment variable or config file

### API Endpoint Documentation

- **Base URL:** https://api.deutsche-digitale-bibliothek.de/
- **Documentation:** https://api.deutsche-digitale-bibliothek.de/ (requires login)
- **OpenAPI Spec:** available after API key generation
### Example API Request (Future Implementation)

```python
import requests

API_KEY = "your-api-key-here"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Search for archives
response = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={
        "query": "*",
        "sector": "arc_archives",  # Archives only
        "rows": 100,
        "offset": 0,
    },
    headers=headers,
)
archives = response.json()["results"]
```
## Performance Optimization

### Current Performance

- Request rate: 1 request per 1.5 seconds
- Throughput: ~40 archives/minute (listing pages only)
- Full harvest time: 4-6 hours (10,000 archives)

### With Profile Enrichment

- Request rate: 1 request per 1.5 seconds (per profile)
- Throughput: ~40 profiles/minute
- Full harvest time: 4-6 hours (listing) + 4-8 hours (profiles) = 8-14 hours
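These durations follow directly from the ~40 archives/minute throughput figure. A back-of-envelope sketch (treat the result as a rough lower bound, since parsing, retries, and enrichment add time):

```python
# Estimate listing-harvest duration from the ~40 archives/minute figure above.
ARCHIVES_PER_MINUTE = 40

def estimate_hours(total_archives: int) -> float:
    """Rough wall-clock hours to list `total_archives` records."""
    return total_archives / ARCHIVES_PER_MINUTE / 60

print(f"{estimate_hours(10_000):.1f} h")  # → 4.2 h
print(f"{estimate_hours(20_000):.1f} h")  # → 8.3 h
```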
### Optimization Strategies

1. **Parallel requests** (use `asyncio` + `aiohttp`)
   - Potential throughput: 100-200 archives/minute
   - Full harvest time: 1-2 hours

2. **Batch processing** (if API available)
   - Fetch 100 archives per request
   - Dramatically reduces total requests

3. **Caching** (avoid re-fetching known archives)
   - Store archive IDs in a database
   - Only fetch new/updated archives
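Strategy 1 amounts to bounded concurrency: a semaphore caps how many requests are in flight at once. A standard-library sketch in which `fetch_page` is a stand-in for the real HTTP call (actual code would use `aiohttp`, not shown here):

```python
# Bounded-concurrency harvest skeleton: at most MAX_CONCURRENT fetches at once.
import asyncio

MAX_CONCURRENT = 5

async def fetch_page(page: int, sem: asyncio.Semaphore) -> str:
    async with sem:                   # blocks while MAX_CONCURRENT are in flight
        await asyncio.sleep(0.01)     # placeholder for the HTTP round-trip
        return f"page-{page}"

async def harvest(pages: int) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_page(p, sem) for p in range(pages)))

results = asyncio.run(harvest(20))
print(len(results))  # → 20
```

`asyncio.gather` preserves argument order, so results line up with page numbers even though fetches overlap.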
## Legal and Ethical Considerations

### Robots.txt Compliance

**Check:** https://www.archivportal-d.de/robots.txt

**Current Policy** (verify before running a full harvest):
- Respect crawl delays
- Avoid overloading the server
- Use a descriptive User-Agent
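The check can be automated with the standard library. A sketch in which the rules are an invented example, not the portal's actual policy (fetch the real file before a full harvest):

```python
# Evaluate paths against robots.txt rules with urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.modified()  # mark the rules as loaded so can_fetch() consults them
rp.parse([
    "User-agent: *",
    "Disallow: /suche",      # hypothetical rule for illustration
    "Crawl-delay: 2",
])

print(rp.can_fetch("glam-harvester", "https://www.archivportal-d.de/struktur"))  # → True
print(rp.can_fetch("glam-harvester", "https://www.archivportal-d.de/suche"))     # → False
print(rp.crawl_delay("glam-harvester"))  # → 2
```

In production, `rp.set_url("https://www.archivportal-d.de/robots.txt")` followed by `rp.read()` fetches the live policy instead.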
### Data License

**Archivportal-D Metadata:** CC0 1.0 Universal (Public Domain)
- Free to reuse, remix, and redistribute
- No attribution required (but appreciated)
- Safe for commercial use

### Rate Limiting

**Current Implementation:** 1.5 seconds between requests
**Justification:** respectful to server resources
**Do NOT** reduce the delay below 1.0 second without permission.
## Support and Maintenance

### Script Maintenance

**Frequency:** check quarterly (every 3 months)
**Reason:** the HTML structure may change

**Maintenance Tasks:**
1. Test harvest with `max_pages=1`
2. Verify that the selectors still match
3. Check for new fields in profile pages
4. Update documentation

### Issue Reporting

If you encounter problems:
1. Check the documentation first (this file)
2. Verify dependencies (`pip list`)
3. Test with a small dataset (`max_pages=1`)
4. Inspect the HTML manually (browser dev tools)
5. Document the issue with example URLs
## Success Criteria

✅ **Harvest Complete** when:
- All pages fetched (no "Next" button found)
- Total archives: 10,000-20,000 (expected range)
- Archives distributed across all 16 federal states
- < 1% duplicates (same name + city)

✅ **High Data Quality** when:
- 95%+ of archives have a name
- 90%+ of archives have a location
- 80%+ of archives have a federal state
- 30%+ of archives have an ISIL code
- 50%+ of archives have a profile URL

✅ **Ready for Integration** when:
- JSON format is valid
- Statistics are generated
- Cross-referenced with the ISIL dataset
- Duplicates removed
- Geocoding complete (80%+)
## References

- Archivportal-D: https://www.archivportal-d.de/
- Deutsche Digitale Bibliothek: https://www.deutsche-digitale-bibliothek.de/
- DDB API: https://api.deutsche-digitale-bibliothek.de/
- ISIL Registry: https://sigel.staatsbibliothek-berlin.de/
- ISIL Harvest Report: `/data/isil/germany/HARVEST_REPORT.md`
- Completeness Plan: `/data/isil/germany/COMPLETENESS_PLAN.md`
---

**Version:** 1.0
**Last Updated:** November 19, 2025
**Maintainer:** GLAM Data Extraction Project
**License:** MIT (script), CC0 (data)

*End of README*