glam/data/isil/germany/NEXT_SESSION_QUICK_START.md

German Archive Completion - Quick Start Guide

Goal: Achieve 100% German archive coverage by harvesting Archivportal-D via DDB API


Current Status

✅ ISIL Registry: 16,979 institutions harvested
🔄 Archivportal-D: Awaiting API access (~10,000-20,000 archives)
🎯 Target: ~25,000-27,000 total German institutions


Step-by-Step: Complete German Archive Harvest

Step 1: Register for DDB API (10 minutes)

  1. Visit: https://www.deutsche-digitale-bibliothek.de/
  2. Click: "Registrieren" (Register button, top right)
  3. Fill form:
    • Email: [your-email]
    • Username: [choose username]
    • Password: [secure password]
  4. Verify email: Check inbox and click confirmation link
  5. Log in: Use credentials to log into DDB portal
  6. Navigate to: "Meine DDB" (My DDB) in account menu
  7. Generate API key: Look for "API-Schlüssel" or "API Key" section
  8. Copy key: Save to secure location (e.g., password manager)

Result: You now have a DDB API key (e.g., ddb_abc123xyz456...)


Step 2: Create API Harvester (2 hours)

File: /Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d_api.py

#!/usr/bin/env python3
"""
Archivportal-D API Harvester
Fetches all German archives via Deutsche Digitale Bibliothek REST API
"""

import requests
import json
import time
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional

# Configuration
API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your DDB API key
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
BATCH_SIZE = 100  # Archives per request
REQUEST_DELAY = 0.5  # Seconds between requests
MAX_RETRIES = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_archives_batch(offset: int = 0, rows: int = 100) -> Optional[Dict]:
    """
    Fetch a batch of archives via DDB API.
    
    Args:
        offset: Starting record number
        rows: Number of records to fetch
        
    Returns:
        API response dict or None on error
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json"
    }
    
    params = {
        "query": "*",  # All archives
        "sector": "arc_archives",  # Archives sector only
        "rows": rows,
        "offset": offset
    }
    
    for attempt in range(MAX_RETRIES):
        try:
            print(f"Fetching archives {offset}-{offset+rows-1}...", end=' ')
            response = requests.get(
                f"{API_BASE_URL}/search",
                headers=headers,
                params=params,
                timeout=30
            )
            response.raise_for_status()
            
            data = response.json()
            total = data.get('numberOfResults', 0)
            print(f"OK (total: {total})")
            
            return data
        
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt+1}/{MAX_RETRIES} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(REQUEST_DELAY * (attempt + 1))
            else:
                return None


def parse_archive_record(record: Dict) -> Dict:
    """
    Parse DDB API archive record into simplified format.
    
    Args:
        record: Raw API record
        
    Returns:
        Parsed archive dictionary
    """
    return {
        'id': record.get('id'),
        'name': record.get('title'),
        'location': record.get('place'),
        'federal_state': record.get('federalState'),
        'archive_type': record.get('label'),
        'isil': record.get('isil'),
        'latitude': record.get('latitude'),
        'longitude': record.get('longitude'),
        'thumbnail': record.get('thumbnail'),
        'profile_url': f"https://www.archivportal-d.de/item/{record.get('id')}" if record.get('id') else None
    }


def harvest_all_archives() -> List[Dict]:
    """
    Harvest all archives from DDB API.
    
    Returns:
        List of parsed archive records
    """
    print(f"\n{'='*70}")
    print(f"Harvesting Archivportal-D via DDB API")
    print(f"Endpoint: {API_BASE_URL}/search")
    print(f"{'='*70}\n")
    
    all_archives = []
    offset = 0
    
    while True:
        # Fetch batch
        data = fetch_archives_batch(offset, BATCH_SIZE)
        if not data:
            print(f"Warning: Failed to fetch batch at offset {offset}. Stopping.")
            break
        
        # Parse results
        results = data.get('results', [])
        for result in results:
            archive = parse_archive_record(result)
            all_archives.append(archive)
        
        print(f"Progress: {len(all_archives)} archives collected")
        
        # Check if done
        total = data.get('numberOfResults', 0)
        if len(all_archives) >= total or len(results) < BATCH_SIZE:
            break
        
        offset += BATCH_SIZE
        time.sleep(REQUEST_DELAY)
    
    print(f"\n{'='*70}")
    print(f"Harvest complete: {len(all_archives)} archives")
    print(f"{'='*70}\n")
    
    return all_archives


def save_archives(archives: List[Dict]):
    """Save archives to JSON file."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_file = OUTPUT_DIR / f"archivportal_d_api_{timestamp}.json"
    
    output = {
        'metadata': {
            'source': 'Archivportal-D via DDB API',
            'source_url': 'https://www.archivportal-d.de',
            'api_endpoint': f'{API_BASE_URL}/search',
            'operator': 'Deutsche Digitale Bibliothek',
            'harvest_date': datetime.utcnow().isoformat() + 'Z',
            'total_archives': len(archives),
            'method': 'REST API',
            'license': 'CC0 1.0 Universal (Public Domain)'
        },
        'archives': archives
    }
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    
    print(f"✓ Saved to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB\n")


def generate_statistics(archives: List[Dict]):
    """Generate statistics."""
    stats = {
        'total': len(archives),
        'by_state': {},
        'by_type': {},
        'with_isil': 0,
        'with_coordinates': 0
    }
    
    for archive in archives:
        # By state
        state = archive.get('federal_state', 'Unknown')
        stats['by_state'][state] = stats['by_state'].get(state, 0) + 1
        
        # By type
        arch_type = archive.get('archive_type', 'Unknown')
        stats['by_type'][arch_type] = stats['by_type'].get(arch_type, 0) + 1
        
        # Completeness
        if archive.get('isil'):
            stats['with_isil'] += 1
        if archive.get('latitude'):
            stats['with_coordinates'] += 1
    
    print(f"\n{'='*70}")
    print("Statistics:")
    print(f"{'='*70}")
    print(f"Total archives: {stats['total']}")
    print(f"With ISIL: {stats['with_isil']} ({stats['with_isil']/stats['total']*100:.1f}%)")
    print(f"With coordinates: {stats['with_coordinates']} ({stats['with_coordinates']/stats['total']*100:.1f}%)")
    
    print(f"\nTop 10 federal states:")
    for state, count in sorted(stats['by_state'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {state}: {count}")
    
    print(f"\nTop 10 archive types:")
    for arch_type, count in sorted(stats['by_type'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {arch_type}: {count}")
    
    print(f"{'='*70}\n")
    
    return stats


def main():
    """Main execution."""
    print(f"\n{'#'*70}")
    print(f"# Archivportal-D API Harvester")
    print(f"# Deutsche Digitale Bibliothek REST API")
    print(f"{'#'*70}\n")
    
    if API_KEY == "YOUR_API_KEY_HERE":
        print("ERROR: Please set your DDB API key in the script!")
        print("Set the API_KEY constant near the top of this file: API_KEY = 'your-actual-api-key'")
        return
    
    # Harvest
    archives = harvest_all_archives()
    
    if not archives:
        print("No archives harvested. Exiting.")
        return
    
    # Save
    save_archives(archives)
    
    # Statistics
    stats = generate_statistics(archives)
    
    # Save stats
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    stats_file = OUTPUT_DIR / f"archivportal_d_api_stats_{timestamp}.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    print(f"✓ Statistics saved to: {stats_file}\n")
    
    print("✓ Harvest complete!\n")


if __name__ == "__main__":
    main()

Edit the API_KEY constant near the top of the script: replace YOUR_API_KEY_HERE with your actual DDB API key.


Step 3: Test Harvest (30 minutes)

cd /Users/kempersc/apps/glam

# Test with first 100 archives
python3 scripts/scrapers/harvest_archivportal_d_api.py

Verify Output:

  • JSON file created in data/isil/germany/
  • 100-10,000+ archives (depending on total)
  • Statistics show reasonable distribution by state
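The checks above can also be run programmatically. A minimal sketch (the field names follow the harvester's output format, and the filename pattern is the one save_archives uses; adjust paths to your setup):

```python
import json
from pathlib import Path

def check_harvest(harvest: dict) -> dict:
    """Summarize a harvest file: total records, distinct states, ISIL coverage."""
    archives = harvest.get("archives", [])
    states = {a.get("federal_state") for a in archives if a.get("federal_state")}
    return {
        "total": len(archives),
        "states": len(states),
        "with_isil": sum(1 for a in archives if a.get("isil")),
    }

# Load the newest harvest file, if one exists
files = sorted(Path("data/isil/germany").glob("archivportal_d_api_*.json"))
if files:
    with open(files[-1], encoding="utf-8") as f:
        print(check_harvest(json.load(f)))
```

A healthy test run should report a non-trivial total and several distinct federal states.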

Step 4: Full Harvest (1-2 hours)

If the test is successful, run the full harvest (the script automatically pages through all results):

python3 scripts/scrapers/harvest_archivportal_d_api.py

Monitor Progress:

  • Watch console output for batch progress
  • Estimated time: 1-2 hours for ~10,000-20,000 archives
  • Final output: JSON file with complete archive listing

Step 5: Cross-Reference with ISIL (1 hour)

Create merge script: scripts/merge_archivportal_isil.py

# Match by ISIL code, identify new discoveries
python3 scripts/merge_archivportal_isil.py

Expected Results:

  • 30-50% of archives match ISIL by code
  • 50-70% are new discoveries (no ISIL)
  • Report: Overlap statistics, new archive count
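The merge script does not exist yet; a minimal sketch of the ISIL-matching logic it might contain (field names follow the harvester output above; the structure of the ISIL dataset's records is an assumption):

```python
from typing import Dict, List, Tuple

def match_by_isil(archivportal: List[Dict], isil_records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition Archivportal-D records into ISIL matches and new discoveries."""
    known = {r["isil"] for r in isil_records if r.get("isil")}
    matched = [a for a in archivportal if a.get("isil") in known]
    new = [a for a in archivportal if a.get("isil") not in known]
    return matched, new

# Hypothetical usage once both datasets are loaded:
# matched, new = match_by_isil(archives, isil_data["institutions"])
# print(f"Matched by ISIL: {len(matched)}, new discoveries: {len(new)}")
```

Records without any ISIL code fall into the "new" bucket by construction, which matches the expectation that 50-70% of archives are new discoveries.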

Step 6: Create Unified Dataset (1 hour)

Merge ISIL + Archivportal-D data:

python3 scripts/create_german_unified_dataset.py

Output:

  • data/isil/germany/german_unified_TIMESTAMP.json
  • ~25,000-27,000 total institutions
  • ~12,000-15,000 archives (complete coverage)
  • ~12,000-15,000 libraries/museums (from ISIL)
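The unified-dataset script is also still to be written. One way to sketch the deduplication step, keeping the ISIL record as the base when both sources share an ISIL code (field names are assumptions):

```python
from typing import Dict, List

def build_unified(isil_records: List[Dict], archivportal: List[Dict]) -> List[Dict]:
    """Union of both sources, deduplicated by ISIL code where one exists."""
    unified = {}
    for i, rec in enumerate(isil_records):
        key = rec.get("isil") or f"isil:{i}"  # fall back to a positional key
        unified[key] = {**rec, "sources": ["isil"]}
    for rec in archivportal:
        key = rec.get("isil") or f"apd:{rec.get('id')}"
        if key in unified:
            unified[key]["sources"].append("archivportal-d")  # same institution
        else:
            unified[key] = {**rec, "sources": ["archivportal-d"]}
    return list(unified.values())
```

Tracking the "sources" list per record makes the duplicate count (target: < 1%) easy to report afterwards.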

Success Criteria

Harvest Successful when:

  • 10,000-20,000 archives fetched
  • All 16 federal states represented
  • 30-50% have ISIL codes
  • 80%+ have geographic coordinates
  • JSON file size: 5-20 MB

Integration Complete when:

  • Cross-referenced with ISIL dataset
  • Duplicates removed (< 1%)
  • Unified dataset created
  • Statistics generated
  • Documentation updated

Troubleshooting

Issue: API Key Invalid

Error: 401 Unauthorized or 403 Forbidden

Fix:

  • Verify API key copied correctly (no extra spaces)
  • Check key is active in DDB account
  • Ensure using Bearer token format in Authorization header

Issue: No Results Returned

Error: numberOfResults: 0

Fix:

  • Try different query parameter (e.g., query: "archiv*")
  • Check sector parameter is correct (arc_archives)
  • Verify API endpoint URL is correct

Issue: Rate Limit Exceeded

Error: 429 Too Many Requests

Fix:

  • Increase REQUEST_DELAY from 0.5s to 1.0s or 2.0s
  • Reduce BATCH_SIZE from 100 to 50
  • Wait a few minutes before retrying
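If 429s persist, honouring the server's Retry-After header (when present) is more robust than a fixed delay. A sketch of a backoff helper; whether the DDB API actually sends Retry-After is an assumption:

```python
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[str], base: float = 0.5) -> float:
    """Seconds to wait before retrying: server hint if given, else exponential backoff."""
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall through to backoff
    return base * (2 ** attempt)

# In the harvester's retry loop, one could replace the fixed sleep with e.g.:
# time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))
```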

Files You'll Create

/data/isil/germany/
├── archivportal_d_api_TIMESTAMP.json      # Harvested archives
├── archivportal_d_api_stats_TIMESTAMP.json  # Statistics
├── german_unified_TIMESTAMP.json          # Merged dataset
└── GERMAN_UNIFIED_REPORT.md               # Final report

/scripts/scrapers/
├── harvest_archivportal_d_api.py          # API harvester
└── merge_archivportal_isil.py             # Merge script

Estimated Timeline

Task                    Time        Status
DDB API Registration    10 min      To-do
Create API Harvester    2 hours     To-do
Test Harvest            30 min      To-do
Full Harvest            1-2 hours   To-do
Cross-Reference         1 hour      To-do
Create Unified Dataset  1 hour      To-do
Documentation           1 hour      To-do
TOTAL                   ~6-9 hours

After Completion

German harvest will be 100% complete:

  • All ISIL-registered institutions (16,979)
  • All Archivportal-D archives (~10,000-20,000)
  • Unified, deduplicated dataset (~25,000-27,000)
  • Ready for LinkML conversion
  • Ready for GHCID generation

Next countries to harvest:

  1. Czech Republic - ISIL + Caslin registry
  2. Austria - ISIL + BiPHAN
  3. France - ISIL + Archives de France
  4. Belgium - ISIL + LOCUS


Ready to Start? Register for DDB API now!

Questions? Review /data/isil/germany/SESSION_SUMMARY_20251119_ARCHIVPORTAL_D.md