
Archivportal-D Harvester - README

Date: November 19, 2025
Status: Ready for testing
Script: /Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d.py


Overview

This harvester collects archive listings from Archivportal-D, the national German archive portal operated by Deutsche Digitale Bibliothek.

Portal: https://www.archivportal-d.de/
Coverage: All archives across Germany (~10,000-20,000 archives)
Method: Web scraping (fallback if API access unavailable)


Current Status: Web Scraping Implementation

Why Web Scraping?

  1. DDB API Registration Required:

    • The DDB API requires an account and an API key (see "API Registration Steps" below)
    • No key is available yet, so the API cannot be used for this harvest
  2. Fallback Strategy:

    • Web scraping provides immediate data access
    • Allows initial testing and data collection
    • Can be upgraded to API access later
  3. Production Recommendation:

    • For large-scale harvests, register for DDB API
    • API provides structured data (JSON)
    • Web scraping requires HTML selector maintenance

Installation

Prerequisites

# Python 3.9+
python3 --version

# Required libraries (already installed)
pip install requests beautifulsoup4

Verify Installation

cd /Users/kempersc/apps/glam
python3 -c "import requests, bs4; print('✓ Dependencies OK')"

Usage

Basic Usage (Test Mode)

# Run harvester in test mode (10 pages, no profile enrichment)
cd /Users/kempersc/apps/glam
python3 scripts/scrapers/harvest_archivportal_d.py

Expected Output:

  • JSON file: data/isil/germany/archivportal_d_archives_TIMESTAMP.json
  • Statistics: data/isil/germany/archivportal_d_stats_TIMESTAMP.json
  • ~100-500 archives (depending on pagination)

Full Harvest (Production Mode)

Edit the script to remove page limits:

# Line ~394: Change max_pages=10 to max_pages=None
archives = harvest_archive_list(max_pages=None, enrich_profiles=False)

Estimated Time: 2-4 hours (depending on total archives)
Estimated Output: ~10,000-20,000 archive records

Profile Enrichment (Optional)

Enable detailed profile page scraping:

# Line ~394: Set enrich_profiles=True
archives = harvest_archive_list(max_pages=None, enrich_profiles=True)

Enriched Data:

  • Email addresses
  • Phone numbers
  • Websites
  • Finding aids counts
  • Digital copies counts
  • Geographic coordinates

Warning: Profile enrichment significantly increases harvest time (1-2 requests per archive).


Output Format

JSON Structure

{
  "metadata": {
    "source": "Archivportal-D",
    "source_url": "https://www.archivportal-d.de",
    "operator": "Deutsche Digitale Bibliothek",
    "harvest_date": "2025-11-19T14:30:00Z",
    "total_archives": 10000,
    "method": "Web scraping",
    "license": "CC0 1.0 Universal (Public Domain)"
  },
  "archives": [
    {
      "name": "Stadtarchiv Köln",
      "location": "Köln",
      "federal_state": "Nordrhein-Westfalen",
      "archive_type": "Kommunalarchiv",
      "archive_id": "stadtarchiv-koeln",
      "profile_url": "https://www.archivportal-d.de/struktur/stadtarchiv-koeln",
      "description": "Das Historische Archiv der Stadt Köln...",
      "isil": "DE-KN"
    }
  ]
}
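A quick way to sanity-check a harvest file is to load it and keep only records that carry a minimal field set. This is a sketch: the required-field set below is an assumption for illustration, not something the harvester itself enforces.

```python
import json

# Assumed minimal field set for this sketch; the harvester may require less or more
REQUIRED_FIELDS = {"name", "location", "profile_url"}

def load_harvest(path):
    """Load a harvest JSON file and return the parsed document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def validate_harvest(data):
    """Return only the archive records carrying the minimal required fields."""
    assert "metadata" in data and "archives" in data, "unexpected top-level structure"
    return [a for a in data["archives"] if REQUIRED_FIELDS <= a.keys()]
```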

Archive Record Fields

Field | Type | Description
----- | ---- | -----------
name | string | Archive name
location | string | City/municipality
federal_state | string | German federal state
archive_type | string | Archive sector/category
archive_id | string | Portal-internal ID
profile_url | string | Link to archive profile
description | string | Archive description
isil | string | ISIL identifier (if available)
contact | object | Email, phone, website
finding_aids | integer | Number of finding aids
digital_copies | integer | Count of digitized items
coordinates | object | Latitude/longitude

HTML Selector Strategy

Current Selectors (Subject to Change)

The script uses multiple fallback selectors to maximize compatibility:

# Archive name
['h2', 'h3', 'h4'], class_=['title', 'heading', 'name']

# Location
class_=['location', 'place', 'city']

# Federal state
class_=['state', 'federal-state', 'bundesland']

# Archive type
class_=['type', 'sector', 'category']

# Description
class_=['description', 'abstract', 'summary']

# Profile links
'a', href=re.compile(r'/struktur/[A-Za-z0-9_-]+')
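In BeautifulSoup, these fallback lists translate into find() calls where both the tag-name and class_ arguments accept lists (matching any entry). A minimal sketch of that pattern, using the selectors above; the real extract_archive_from_listing() in the script may differ:

```python
import re

from bs4 import BeautifulSoup

def first_text(container, names, classes):
    """Return the text of the first element matching any tag name and any class."""
    el = container.find(names, class_=classes)
    return el.get_text(strip=True) if el else None

def extract_listing(html):
    """Pull one archive record out of a listing fragment using fallback selectors."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", href=re.compile(r"/struktur/[A-Za-z0-9_-]+"))
    return {
        "name": first_text(soup, ["h2", "h3", "h4"], ["title", "heading", "name"]),
        "location": first_text(soup, None, ["location", "place", "city"]),
        "profile_url": link["href"] if link else None,
    }
```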

Updating Selectors

If harvest results are poor, inspect the HTML and update selectors:

  1. Navigate to: https://www.archivportal-d.de/struktur
  2. Inspect Element (browser dev tools)
  3. Find archive listing containers and note CSS classes
  4. Update selectors in extract_archive_from_listing() function
  5. Rerun test harvest to verify

Data Quality Validation

Expected Data Completeness

After harvest, validate using the statistics output:

# Check statistics file
cat data/isil/germany/archivportal_d_stats_TIMESTAMP.json

Expected Metrics:

  • Total archives: 10,000-20,000
  • With ISIL code: 30-50% (archives already in ISIL registry)
  • With profile URL: 95-100% (portal listings)
  • With email: 20-40% (if profile enrichment enabled)
  • With coordinates: 10-30% (if profile enrichment enabled)
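The percentages above can be computed directly from the harvested records rather than read off the stats file. A small helper, assuming the field names from the record table (the stats script itself may compute these differently):

```python
def completeness(archives, fields=("isil", "profile_url", "email", "coordinates")):
    """Percentage of records with a non-empty value for each field."""
    total = len(archives) or 1  # avoid division by zero on an empty harvest
    return {f: round(100 * sum(1 for a in archives if a.get(f)) / total, 1)
            for f in fields}
```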

Archive Distribution by Federal State

Expected Top States (by archive count):

  1. Nordrhein-Westfalen (NRW) - ~2,000-3,000 archives
  2. Bayern (Bavaria) - ~1,500-2,500 archives
  3. Baden-Württemberg - ~1,000-1,500 archives
  4. Niedersachsen - ~1,000-1,500 archives
  5. Hessen - ~800-1,200 archives

If distribution is skewed, check:

  • Pagination working correctly
  • All federal states being harvested
  • No selector issues filtering results

Next Steps After Harvest

1. Cross-Reference with ISIL Dataset

# Match archives by ISIL code
python3 scripts/merge_archivportal_isil.py

Expected Matches:

  • ~30-50% of Archivportal-D archives have ISIL codes
  • These will match with existing ISIL dataset
  • ~50-70% are new discoveries (archives without ISIL)
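The matching logic reduces to a set lookup on ISIL codes. A sketch of that split (not the actual merge_archivportal_isil.py script):

```python
def cross_reference(portal_archives, isil_index):
    """Split portal records into ISIL-registry matches and new discoveries.

    isil_index maps ISIL code -> existing registry record.
    """
    matched, new = [], []
    for record in portal_archives:
        code = (record.get("isil") or "").strip()
        (matched if code and code in isil_index else new).append(record)
    return matched, new
```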

2. Deduplication

# Find duplicates (same name + city)
python3 scripts/deduplicate_german_archives.py

Expected Duplicates: < 1% (archives listed multiple times)
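Deduplication on (name, city) amounts to keeping the first record per case-insensitive key. A minimal sketch of that logic (the actual deduplicate_german_archives.py may normalize more aggressively):

```python
def deduplicate(archives):
    """Keep the first record seen for each (name, location) pair, case-insensitively."""
    seen, unique = set(), []
    for a in archives:
        key = (a.get("name", "").casefold(), a.get("location", "").casefold())
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique
```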

3. Geocoding

# Add coordinates to archives missing them
python3 scripts/geocode_german_archives.py

Expected Geocoding Success: 80-90% (using Nominatim)
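A Nominatim lookup is a single GET against its public search endpoint. A sketch of the query, assuming name-plus-city strings as input (geocode_german_archives.py may build queries differently); note Nominatim's usage policy requires an identifying User-Agent and at most one request per second:

```python
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def nominatim_params(name, city):
    """Build a Nominatim search query; country is pinned to Germany."""
    return {"q": f"{name}, {city}, Germany", "format": "json", "limit": 1}

def geocode(name, city, user_agent="glam-harvester/1.0 (contact email here)"):
    """Return (lat, lon) for the best match, or None if nothing was found."""
    resp = requests.get(NOMINATIM_URL, params=nominatim_params(name, city),
                        headers={"User-Agent": user_agent}, timeout=10)
    resp.raise_for_status()
    results = resp.json()  # Nominatim returns lat/lon as strings
    return (float(results[0]["lat"]), float(results[0]["lon"])) if results else None
```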

4. Create Unified Dataset

# Merge ISIL + Archivportal-D data
python3 scripts/create_german_unified_dataset.py

Expected Output:

  • ~25,000-27,000 total institutions (deduplicated)
  • ~12,000-15,000 archives (complete coverage)
  • ~12,000-15,000 libraries/museums (from ISIL)

Troubleshooting

Issue: No archives found on page 0

Cause: HTML selectors don't match current website structure

Fix:

  1. Visit https://www.archivportal-d.de/struktur
  2. Inspect page source
  3. Update selectors in extract_archive_from_listing()
  4. Test with max_pages=1

Issue: Request timeout or connection errors

Cause: Server rate limiting or network issues

Fix:

  1. Increase REQUEST_DELAY (line 19) from 1.5s to 3.0s
  2. Reduce BATCH_SIZE if using pagination
  3. Check internet connection
  4. Verify Archivportal-D is accessible

Issue: Incomplete data (many missing fields)

Cause: Portal structure changed or fields not present in listings

Fix:

  1. Enable profile enrichment: enrich_profiles=True
  2. Profiles have more detailed information than list pages
  3. Increases harvest time but improves data quality

Issue: HTTP 403 Forbidden

Cause: User-Agent blocking or bot detection

Fix:

  1. Update USER_AGENT (line 20) to look more like a real browser
  2. Add additional headers (Accept-Language, Referer, etc.)
  3. Implement random delays between requests
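The three fixes combine naturally into one request wrapper. An illustrative sketch; the header values are examples, not the script's actual USER_AGENT configuration:

```python
import random
import time

import requests

# Example browser-like headers; adapt to whatever the live site expects
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.archivportal-d.de/",
}

def jittered_delay(base=1.5, jitter=1.0):
    """Sleep for base seconds plus random jitter so timing looks less mechanical."""
    time.sleep(base + random.uniform(0, jitter))

def polite_get(session, url, **kwargs):
    """GET with browser-like headers and a randomized pause before each request."""
    jittered_delay()
    return session.get(url, headers=BROWSER_HEADERS, timeout=30, **kwargs)
```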

API Migration (Future)

When to Upgrade to DDB API

  1. Need for production-scale harvest (10,000+ archives)
  2. Web scraping becomes unreliable (frequent HTML changes)
  3. Structured data required (JSON responses instead of HTML)
  4. Rate limits are problematic (API has higher quotas)

API Registration Steps

  1. Create account: https://www.deutsche-digitale-bibliothek.de/
  2. Verify email and log in
  3. Navigate to "My DDB" (Meine DDB)
  4. Generate API key in account settings
  5. Copy API key to environment variable or config file

API Endpoint Documentation

Base URL: https://api.deutsche-digitale-bibliothek.de/
Documentation: https://api.deutsche-digitale-bibliothek.de/ (requires login)
OpenAPI Spec: Available after API key generation

Example API Request (Future Implementation)

import requests

API_KEY = "your-api-key-here"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Search for archives
response = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={
        "query": "*",
        "sector": "arc_archives",  # Archives only
        "rows": 100,
        "offset": 0
    },
    headers=headers
)

archives = response.json()['results']

Performance Optimization

Current Performance

  • Request rate: 1 request per 1.5 seconds
  • Throughput: ~40 archives/minute (listing pages only)
  • Full harvest time: 4-6 hours (10,000 archives)

With Profile Enrichment

  • Request rate: 1 request per 1.5 seconds (per profile)
  • Throughput: ~40 profiles/minute
  • Full harvest time: 4-6 hours (listing) + 4-8 hours (profiles) = 8-14 hours

Optimization Strategies

  1. Parallel requests (use asyncio + aiohttp)

    • Potential throughput: 100-200 archives/minute
    • Full harvest time: 1-2 hours
  2. Batch processing (if API available)

    • Fetch 100 archives per request
    • Dramatically reduces total requests
  3. Caching (avoid re-fetching known archives)

    • Store archive IDs in database
    • Only fetch new/updated archives
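The parallel-requests idea boils down to a semaphore-bounded gather. A stdlib-only sketch with a pluggable fetch coroutine (in practice that would be an aiohttp-based fetcher; a stub keeps the example self-contained):

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrent=5):
    """Run a fetch coroutine over many URLs with a hard concurrency cap."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:          # at most max_concurrent fetches in flight
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```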

Robots.txt Compliance

Check: https://www.archivportal-d.de/robots.txt

Current Policy: (Verify before running full harvest)

  • Respect crawl delays
  • Avoid overloading server
  • Use descriptive User-Agent
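The check can be automated with the standard library's robots.txt parser. A sketch that parses rules from text (in production you would instead point RobotFileParser at the live file via set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Check whether `url` may be fetched under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```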

Data License

Archivportal-D Metadata: CC0 1.0 Universal (Public Domain)

  • Free to reuse, remix, and redistribute
  • No attribution required (but appreciated)
  • Safe for commercial use

Rate Limiting

Current Implementation: 1.5 seconds between requests
Justification: Respectful to server resources
Do NOT: Reduce delay below 1.0 second without permission


Support and Maintenance

Script Maintenance

Frequency: Check quarterly (every 3 months)
Reason: HTML structure may change

Maintenance Tasks:

  1. Test harvest with max_pages=1
  2. Verify selectors still match
  3. Check for new fields in profile pages
  4. Update documentation

Issue Reporting

If you encounter problems:

  1. Check documentation first (this file)
  2. Verify dependencies (pip list)
  3. Test with small dataset (max_pages=1)
  4. Inspect HTML manually (browser dev tools)
  5. Document the issue with example URLs

Success Criteria

Harvest Complete when:

  • Fetched all pages (no "Next" button found)
  • Total archives: 10,000-20,000 (expected range)
  • Archive distribution across all 16 federal states
  • < 1% duplicates (same name + city)

High Data Quality when:

  • 95%+ archives have name
  • 90%+ archives have location
  • 80%+ archives have federal state
  • 30%+ archives have ISIL code
  • 50%+ archives have profile URL

Ready for Integration when:

  • JSON format valid
  • Statistics generated
  • Cross-referenced with ISIL dataset
  • Duplicates removed
  • Geocoding complete (80%+)
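The criteria above can be turned into an automated gate over the statistics output. A sketch; the stats key names here are assumptions and must be adapted to the actual archivportal_d_stats file:

```python
def harvest_ready(stats):
    """Apply the integration thresholds to a statistics dict (key names assumed)."""
    checks = {
        "total_in_range": 10_000 <= stats.get("total_archives", 0) <= 20_000,
        "all_states": stats.get("federal_states", 0) == 16,
        "duplicates_ok": stats.get("duplicate_rate", 1.0) < 0.01,
        "geocoded_ok": stats.get("geocoded_rate", 0.0) >= 0.80,
    }
    return all(checks.values()), checks
```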


Version: 1.0
Last Updated: November 19, 2025
Maintainer: GLAM Data Extraction Project
License: MIT (script), CC0 (data)


End of README