
Archivportal-D Harvester - README

Date: November 19, 2025
Status: Ready for testing
Script: /Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d.py


Overview

This harvester collects archive listings from Archivportal-D, the national German archive portal operated by Deutsche Digitale Bibliothek.

Portal: https://www.archivportal-d.de/
Coverage: All archives across Germany (~10,000-20,000 archives)
Method: Web scraping (fallback if API access unavailable)


Current Status: Web Scraping Implementation

Why Web Scraping?

  1. DDB API Registration Required:

    • The DDB API requires an account and an API key (see "API Registration Steps" below)
    • No key is available yet, so the API cannot be used for this harvest
  2. Fallback Strategy:

    • Web scraping provides immediate data access
    • Allows initial testing and data collection
    • Can be upgraded to API access later
  3. Production Recommendation:

    • For large-scale harvests, register for DDB API
    • API provides structured data (JSON)
    • Web scraping requires HTML selector maintenance

Installation

Prerequisites

# Python 3.9+
python3 --version

# Required libraries (already installed)
pip install requests beautifulsoup4

Verify Installation

cd /Users/kempersc/apps/glam
python3 -c "import requests, bs4; print('✓ Dependencies OK')"

Usage

Basic Usage (Test Mode)

# Run harvester in test mode (10 pages, no profile enrichment)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/harvest_archivportal_d.py

Expected Output:

  • JSON file: data/isil/germany/archivportal_d_archives_TIMESTAMP.json
  • Statistics: data/isil/germany/archivportal_d_stats_TIMESTAMP.json
  • ~100-500 archives (depending on pagination)

Full Harvest (Production Mode)

Edit the script to remove page limits:

# Line ~394: Change max_pages=10 to max_pages=None
archives = harvest_archive_list(max_pages=None, enrich_profiles=False)

Estimated Time: 2-4 hours (depending on total archives)
Estimated Output: ~10,000-20,000 archive records

Profile Enrichment (Optional)

Enable detailed profile page scraping:

# Line ~394: Set enrich_profiles=True
archives = harvest_archive_list(max_pages=None, enrich_profiles=True)

Enriched Data:

  • Email addresses
  • Phone numbers
  • Websites
  • Finding aids counts
  • Digital copies counts
  • Geographic coordinates

Warning: Profile enrichment significantly increases harvest time (1-2 requests per archive).


Output Format

JSON Structure

{
  "metadata": {
    "source": "Archivportal-D",
    "source_url": "https://www.archivportal-d.de",
    "operator": "Deutsche Digitale Bibliothek",
    "harvest_date": "2025-11-19T14:30:00Z",
    "total_archives": 10000,
    "method": "Web scraping",
    "license": "CC0 1.0 Universal (Public Domain)"
  },
  "archives": [
    {
      "name": "Stadtarchiv Köln",
      "location": "Köln",
      "federal_state": "Nordrhein-Westfalen",
      "archive_type": "Kommunalarchiv",
      "archive_id": "stadtarchiv-koeln",
      "profile_url": "https://www.archivportal-d.de/struktur/stadtarchiv-koeln",
      "description": "Das Historische Archiv der Stadt Köln...",
      "isil": "DE-KN"
    }
  ]
}
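A quick way to sanity-check a harvest file is to load it and keep only records that carry a minimal field set. This is a sketch: the required-field set below is an assumption for illustration, not something the harvester itself enforces.

```python
import json

# Assumed minimal field set for this sketch; the harvester may require less or more
REQUIRED_FIELDS = {"name", "location", "profile_url"}

def load_harvest(path):
    """Load a harvest JSON file and return the parsed document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def validate_harvest(data):
    """Return only the archive records carrying the minimal required fields."""
    assert "metadata" in data and "archives" in data, "unexpected top-level structure"
    return [a for a in data["archives"] if REQUIRED_FIELDS <= a.keys()]
```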

Archive Record Fields

Field | Type | Description
----- | ---- | -----------
name | string | Archive name
location | string | City/municipality
federal_state | string | German federal state
archive_type | string | Archive sector/category
archive_id | string | Portal-internal ID
profile_url | string | Link to archive profile
description | string | Archive description
isil | string | ISIL identifier (if available)
contact | object | Email, phone, website
finding_aids | integer | Number of finding aids
digital_copies | integer | Count of digitized items
coordinates | object | Latitude/longitude

HTML Selector Strategy

Current Selectors (Subject to Change)

The script uses multiple fallback selectors to maximize compatibility:

# Archive name
['h2', 'h3', 'h4'], class_=['title', 'heading', 'name']

# Location
class_=['location', 'place', 'city']

# Federal state
class_=['state', 'federal-state', 'bundesland']

# Archive type
class_=['type', 'sector', 'category']

# Description
class_=['description', 'abstract', 'summary']

# Profile links
'a', href=re.compile(r'/struktur/[A-Za-z0-9_-]+')
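In BeautifulSoup, these fallback lists translate into find() calls where both the tag-name and class_ arguments accept lists (matching any entry). A minimal sketch of that pattern, using the selectors above; the real extract_archive_from_listing() in the script may differ:

```python
import re

from bs4 import BeautifulSoup

def first_text(container, names, classes):
    """Return the text of the first element matching any tag name and any class."""
    el = container.find(names, class_=classes)
    return el.get_text(strip=True) if el else None

def extract_listing(html):
    """Pull one archive record out of a listing fragment using fallback selectors."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", href=re.compile(r"/struktur/[A-Za-z0-9_-]+"))
    return {
        "name": first_text(soup, ["h2", "h3", "h4"], ["title", "heading", "name"]),
        "location": first_text(soup, None, ["location", "place", "city"]),
        "profile_url": link["href"] if link else None,
    }
```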

Updating Selectors

If harvest results are poor, inspect the HTML and update selectors:

  1. Navigate to: https://www.archivportal-d.de/struktur
  2. Inspect Element (browser dev tools)
  3. Find archive listing containers and note CSS classes
  4. Update selectors in extract_archive_from_listing() function
  5. Rerun test harvest to verify

Data Quality Validation

Expected Data Completeness

After harvest, validate using the statistics output:

# Check statistics file
cat data/isil/germany/archivportal_d_stats_TIMESTAMP.json

Expected Metrics:

  • Total archives: 10,000-20,000
  • With ISIL code: 30-50% (archives already in ISIL registry)
  • With profile URL: 95-100% (portal listings)
  • With email: 20-40% (if profile enrichment enabled)
  • With coordinates: 10-30% (if profile enrichment enabled)
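The percentages above can be computed directly from the harvested records rather than read off the stats file. A small helper, assuming the field names from the record table (the stats script itself may compute these differently):

```python
def completeness(archives, fields=("isil", "profile_url", "email", "coordinates")):
    """Percentage of records with a non-empty value for each field."""
    total = len(archives) or 1  # avoid division by zero on an empty harvest
    return {f: round(100 * sum(1 for a in archives if a.get(f)) / total, 1)
            for f in fields}
```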

Archive Distribution by Federal State

Expected Top States (by archive count):

  1. Nordrhein-Westfalen (NRW) - ~2,000-3,000 archives
  2. Bayern (Bavaria) - ~1,500-2,500 archives
  3. Baden-Württemberg - ~1,000-1,500 archives
  4. Niedersachsen - ~1,000-1,500 archives
  5. Hessen - ~800-1,200 archives

If distribution is skewed, check:

  • Pagination working correctly
  • All federal states being harvested
  • No selector issues filtering results

Next Steps After Harvest

1. Cross-Reference with ISIL Dataset

# Match archives by ISIL code
python3 scripts/merge_archivportal_isil.py

Expected Matches:

  • ~30-50% of Archivportal-D archives have ISIL codes
  • These will match with existing ISIL dataset
  • ~50-70% are new discoveries (archives without ISIL)
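The matching logic reduces to a set lookup on ISIL codes. A sketch of that split (not the actual merge_archivportal_isil.py script):

```python
def cross_reference(portal_archives, isil_index):
    """Split portal records into ISIL-registry matches and new discoveries.

    isil_index maps ISIL code -> existing registry record.
    """
    matched, new = [], []
    for record in portal_archives:
        code = (record.get("isil") or "").strip()
        (matched if code and code in isil_index else new).append(record)
    return matched, new
```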

2. Deduplication

# Find duplicates (same name + city)
python3 scripts/deduplicate_german_archives.py

Expected Duplicates: < 1% (archives listed multiple times)
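Deduplication on (name, city) amounts to keeping the first record per case-insensitive key. A minimal sketch of that logic (the actual deduplicate_german_archives.py may normalize more aggressively):

```python
def deduplicate(archives):
    """Keep the first record seen for each (name, location) pair, case-insensitively."""
    seen, unique = set(), []
    for a in archives:
        key = (a.get("name", "").casefold(), a.get("location", "").casefold())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique
```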

3. Geocoding

# Add coordinates to archives missing them
python3 scripts/geocode_german_archives.py

Expected Geocoding Success: 80-90% (using Nominatim)
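A Nominatim lookup is a single GET against its public search endpoint. A sketch of the query, assuming name-plus-city strings as input (geocode_german_archives.py may build queries differently); note Nominatim's usage policy requires an identifying User-Agent and at most one request per second:

```python
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def nominatim_params(name, city):
    """Build a Nominatim search query; country is pinned to Germany."""
    return {"q": f"{name}, {city}, Germany", "format": "json", "limit": 1}

def geocode(name, city, user_agent="glam-harvester/1.0 (contact email here)"):
    """Return (lat, lon) for the best match, or None if nothing was found."""
    resp = requests.get(NOMINATIM_URL, params=nominatim_params(name, city),
                        headers={"User-Agent": user_agent}, timeout=10)
    resp.raise_for_status()
    results = resp.json()  # Nominatim returns lat/lon as strings
    return (float(results[0]["lat"]), float(results[0]["lon"])) if results else None
```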

4. Create Unified Dataset

# Merge ISIL + Archivportal-D data
python3 scripts/create_german_unified_dataset.py

Expected Output:

  • ~25,000-27,000 total institutions (deduplicated)
  • ~12,000-15,000 archives (complete coverage)
  • ~12,000-15,000 libraries/museums (from ISIL)

Troubleshooting

Issue: No archives found on page 0

Cause: HTML selectors don't match current website structure

Fix:

  1. Visit https://www.archivportal-d.de/struktur
  2. Inspect page source
  3. Update selectors in extract_archive_from_listing()
  4. Test with max_pages=1

Issue: Request timeout or connection errors

Cause: Server rate limiting or network issues

Fix:

  1. Increase REQUEST_DELAY (line 19) from 1.5s to 3.0s
  2. Reduce BATCH_SIZE if using pagination
  3. Check internet connection
  4. Verify Archivportal-D is accessible

Issue: Incomplete data (many missing fields)

Cause: Portal structure changed or fields not present in listings

Fix:

  1. Enable profile enrichment: enrich_profiles=True
  2. Profiles have more detailed information than list pages
  3. Increases harvest time but improves data quality

Issue: HTTP 403 Forbidden

Cause: User-Agent blocking or bot detection

Fix:

  1. Update USER_AGENT (line 20) to look more like a real browser
  2. Add additional headers (Accept-Language, Referer, etc.)
  3. Implement random delays between requests
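The three fixes combine naturally into one request wrapper. An illustrative sketch; the header values are examples, not the script's actual USER_AGENT configuration:

```python
import random
import time

import requests

# Example browser-like headers; adapt to whatever the live site expects
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.archivportal-d.de/",
}

def jittered_delay(base=1.5, jitter=1.0):
    """Sleep for base seconds plus random jitter so timing looks less mechanical."""
    time.sleep(base + random.uniform(0, jitter))

def polite_get(session, url, **kwargs):
    """GET with browser-like headers and a randomized pause before each request."""
    jittered_delay()
    return session.get(url, headers=BROWSER_HEADERS, timeout=30, **kwargs)
```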

API Migration (Future)

When to Upgrade to DDB API

  1. Need for production-scale harvest (10,000+ archives)
  2. Web scraping becomes unreliable (frequent HTML changes)
  3. Structured data required (JSON responses instead of HTML)
  4. Rate limits are problematic (API has higher quotas)

API Registration Steps

  1. Create account: https://www.deutsche-digitale-bibliothek.de/
  2. Verify email and log in
  3. Navigate to "My DDB" (Meine DDB)
  4. Generate API key in account settings
  5. Copy API key to environment variable or config file

API Endpoint Documentation

Base URL: https://api.deutsche-digitale-bibliothek.de/
Documentation: https://api.deutsche-digitale-bibliothek.de/ (requires login)
OpenAPI Spec: Available after API key generation

Example API Request (Future Implementation)

import requests

API_KEY = "your-api-key-here"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Search for archives
response = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={
        "query": "*",
        "sector": "arc_archives",  # Archives only
        "rows": 100,
        "offset": 0
    },
    headers=headers
)

archives = response.json()['results']

Performance Optimization

Current Performance

  • Request rate: 1 request per 1.5 seconds
  • Throughput: ~40 archives/minute (listing pages only)
  • Full harvest time: 4-6 hours (10,000 archives)

With Profile Enrichment

  • Request rate: 1 request per 1.5 seconds (per profile)
  • Throughput: ~40 profiles/minute
  • Full harvest time: 4-6 hours (listing) + 4-8 hours (profiles) = 8-14 hours

Optimization Strategies

  1. Parallel requests (use asyncio + aiohttp)

    • Potential throughput: 100-200 archives/minute
    • Full harvest time: 1-2 hours
  2. Batch processing (if API available)

    • Fetch 100 archives per request
    • Dramatically reduces total requests
  3. Caching (avoid re-fetching known archives)

    • Store archive IDs in database
    • Only fetch new/updated archives
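The parallel-requests idea boils down to a semaphore-bounded gather. A stdlib-only sketch with a pluggable fetch coroutine (in practice that would be an aiohttp-based fetcher; a stub keeps the example self-contained):

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrent=5):
    """Run a fetch coroutine over many URLs with a hard concurrency cap."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:          # at most max_concurrent fetches in flight
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```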

Robots.txt Compliance

Check: https://www.archivportal-d.de/robots.txt

Current Policy: (Verify before running full harvest)

  • Respect crawl delays
  • Avoid overloading server
  • Use descriptive User-Agent
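The check can be automated with the standard library's robots.txt parser. A sketch that parses rules from text (in production you would instead point RobotFileParser at the live file via set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Check whether `url` may be fetched under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```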

Data License

Archivportal-D Metadata: CC0 1.0 Universal (Public Domain)

  • Free to reuse, remix, and redistribute
  • No attribution required (but appreciated)
  • Safe for commercial use

Rate Limiting

Current Implementation: 1.5 seconds between requests
Justification: Respectful to server resources
Do NOT: Reduce delay below 1.0 second without permission


Support and Maintenance

Script Maintenance

Frequency: Check quarterly (every 3 months)
Reason: HTML structure may change

Maintenance Tasks:

  1. Test harvest with max_pages=1
  2. Verify selectors still match
  3. Check for new fields in profile pages
  4. Update documentation

Issue Reporting

If you encounter problems:

  1. Check documentation first (this file)
  2. Verify dependencies (pip list)
  3. Test with small dataset (max_pages=1)
  4. Inspect HTML manually (browser dev tools)
  5. Document the issue with example URLs

Success Criteria

Harvest Complete when:

  • Fetched all pages (no "Next" button found)
  • Total archives: 10,000-20,000 (expected range)
  • Archive distribution across all 16 federal states
  • < 1% duplicates (same name + city)

High Data Quality when:

  • 95%+ archives have name
  • 90%+ archives have location
  • 80%+ archives have federal state
  • 30%+ archives have ISIL code
  • 50%+ archives have profile URL

Ready for Integration when:

  • JSON format valid
  • Statistics generated
  • Cross-referenced with ISIL dataset
  • Duplicates removed
  • Geocoding complete (80%+)
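The criteria above can be turned into an automated gate over the statistics output. A sketch; the stats key names here are assumptions and must be adapted to the actual archivportal_d_stats file:

```python
def harvest_ready(stats):
    """Apply the integration thresholds to a statistics dict (key names assumed)."""
    checks = {
        "total_in_range": 10_000 <= stats.get("total_archives", 0) <= 20_000,
        "all_states": stats.get("federal_states", 0) == 16,
        "duplicates_ok": stats.get("duplicate_rate", 1.0) < 0.01,
        "geocoded_ok": stats.get("geocoded_rate", 0.0) >= 0.80,
    }
    return all(checks.values()), checks
```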


Version: 1.0
Last Updated: November 19, 2025
Maintainer: GLAM Data Extraction Project
License: MIT (script), CC0 (data)


End of README