glam/data/isil/germany/NEXT_SESSION_QUICK_START.md

German Archive Completion - Quick Start Guide

Goal: Achieve 100% German archive coverage by harvesting Archivportal-D via DDB API


Current Status

✅ ISIL Registry: 16,979 institutions harvested
🔄 Archivportal-D: Awaiting API access (~10,000-20,000 archives)
🎯 Target: ~25,000-27,000 total German institutions


Step-by-Step: Complete German Archive Harvest

Step 1: Register for DDB API (10 minutes)

  1. Visit: https://www.deutsche-digitale-bibliothek.de/
  2. Click: "Registrieren" (Register button, top right)
  3. Fill form:
    • Email: [your-email]
    • Username: [choose username]
    • Password: [secure password]
  4. Verify email: Check inbox and click confirmation link
  5. Log in: Use credentials to log into DDB portal
  6. Navigate to: "Meine DDB" (My DDB) in account menu
  7. Generate API key: Look for "API-Schlüssel" or "API Key" section
  8. Copy key: Save to secure location (e.g., password manager)

Result: You now have a DDB API key (e.g., ddb_abc123xyz456...)


Step 2: Create API Harvester (2 hours)

File: /Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d_api.py

#!/usr/bin/env python3
"""
Archivportal-D API Harvester
Fetches all German archives via Deutsche Digitale Bibliothek REST API
"""

import requests
import json
import time
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional

# Configuration
API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your DDB API key
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
BATCH_SIZE = 100  # Archives per request
REQUEST_DELAY = 0.5  # Seconds between requests
MAX_RETRIES = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_archives_batch(offset: int = 0, rows: int = 100) -> Optional[Dict]:
    """
    Fetch a batch of archives via DDB API.
    
    Args:
        offset: Starting record number
        rows: Number of records to fetch
        
    Returns:
        API response dict or None on error
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json"
    }
    
    params = {
        "query": "*",  # All archives
        "sector": "arc_archives",  # Archives sector only
        "rows": rows,
        "offset": offset
    }
    
    for attempt in range(MAX_RETRIES):
        try:
            print(f"Fetching archives {offset}-{offset+rows-1}...", end=' ')
            response = requests.get(
                f"{API_BASE_URL}/search",
                headers=headers,
                params=params,
                timeout=30
            )
            response.raise_for_status()
            
            data = response.json()
            total = data.get('numberOfResults', 0)
            print(f"OK (total: {total})")
            
            return data
        
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt+1}/{MAX_RETRIES} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(REQUEST_DELAY * (attempt + 1))
            else:
                return None


def parse_archive_record(record: Dict) -> Dict:
    """
    Parse DDB API archive record into simplified format.
    
    Args:
        record: Raw API record
        
    Returns:
        Parsed archive dictionary
    """
    return {
        'id': record.get('id'),
        'name': record.get('title'),
        'location': record.get('place'),
        'federal_state': record.get('federalState'),
        'archive_type': record.get('label'),
        'isil': record.get('isil'),
        'latitude': record.get('latitude'),
        'longitude': record.get('longitude'),
        'thumbnail': record.get('thumbnail'),
        'profile_url': f"https://www.archivportal-d.de/item/{record.get('id')}" if record.get('id') else None
    }


def harvest_all_archives() -> List[Dict]:
    """
    Harvest all archives from DDB API.
    
    Returns:
        List of parsed archive records
    """
    print(f"\n{'='*70}")
    print(f"Harvesting Archivportal-D via DDB API")
    print(f"Endpoint: {API_BASE_URL}/search")
    print(f"{'='*70}\n")
    
    all_archives = []
    offset = 0
    
    while True:
        # Fetch batch
        data = fetch_archives_batch(offset, BATCH_SIZE)
        if not data:
            print(f"Warning: Failed to fetch batch at offset {offset}. Stopping.")
            break
        
        # Parse results
        results = data.get('results', [])
        for result in results:
            archive = parse_archive_record(result)
            all_archives.append(archive)
        
        print(f"Progress: {len(all_archives)} archives collected")
        
        # Check if done
        total = data.get('numberOfResults', 0)
        if len(all_archives) >= total or len(results) < BATCH_SIZE:
            break
        
        offset += BATCH_SIZE
        time.sleep(REQUEST_DELAY)
    
    print(f"\n{'='*70}")
    print(f"Harvest complete: {len(all_archives)} archives")
    print(f"{'='*70}\n")
    
    return all_archives


def save_archives(archives: List[Dict]):
    """Save archives to JSON file."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_file = OUTPUT_DIR / f"archivportal_d_api_{timestamp}.json"
    
    output = {
        'metadata': {
            'source': 'Archivportal-D via DDB API',
            'source_url': 'https://www.archivportal-d.de',
            'api_endpoint': f'{API_BASE_URL}/search',
            'operator': 'Deutsche Digitale Bibliothek',
            'harvest_date': datetime.utcnow().isoformat() + 'Z',
            'total_archives': len(archives),
            'method': 'REST API',
            'license': 'CC0 1.0 Universal (Public Domain)'
        },
        'archives': archives
    }
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    
    print(f"✓ Saved to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB\n")


def generate_statistics(archives: List[Dict]):
    """Generate statistics."""
    stats = {
        'total': len(archives),
        'by_state': {},
        'by_type': {},
        'with_isil': 0,
        'with_coordinates': 0
    }
    
    for archive in archives:
        # By state
        state = archive.get('federal_state', 'Unknown')
        stats['by_state'][state] = stats['by_state'].get(state, 0) + 1
        
        # By type
        arch_type = archive.get('archive_type', 'Unknown')
        stats['by_type'][arch_type] = stats['by_type'].get(arch_type, 0) + 1
        
        # Completeness
        if archive.get('isil'):
            stats['with_isil'] += 1
        if archive.get('latitude'):
            stats['with_coordinates'] += 1
    
    print(f"\n{'='*70}")
    print("Statistics:")
    print(f"{'='*70}")
    print(f"Total archives: {stats['total']}")
    print(f"With ISIL: {stats['with_isil']} ({stats['with_isil']/stats['total']*100:.1f}%)")
    print(f"With coordinates: {stats['with_coordinates']} ({stats['with_coordinates']/stats['total']*100:.1f}%)")
    
    print(f"\nTop 10 federal states:")
    for state, count in sorted(stats['by_state'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {state}: {count}")
    
    print(f"\nTop 10 archive types:")
    for arch_type, count in sorted(stats['by_type'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {arch_type}: {count}")
    
    print(f"{'='*70}\n")
    
    return stats


def main():
    """Main execution."""
    print(f"\n{'#'*70}")
    print(f"# Archivportal-D API Harvester")
    print(f"# Deutsche Digitale Bibliothek REST API")
    print(f"{'#'*70}\n")
    
    if API_KEY == "YOUR_API_KEY_HERE":
        print("ERROR: Please set your DDB API key in the script!")
        print("Set the API_KEY constant near the top of this file: API_KEY = 'your-actual-api-key'")
        return
    
    # Harvest
    archives = harvest_all_archives()
    
    if not archives:
        print("No archives harvested. Exiting.")
        return
    
    # Save
    save_archives(archives)
    
    # Statistics
    stats = generate_statistics(archives)
    
    # Save stats
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    stats_file = OUTPUT_DIR / f"archivportal_d_api_stats_{timestamp}.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    print(f"✓ Statistics saved to: {stats_file}\n")
    
    print("✓ Harvest complete!\n")


if __name__ == "__main__":
    main()

Edit the API_KEY constant near the top of the script: replace YOUR_API_KEY_HERE with your actual DDB API key.


Step 3: Test Harvest (30 minutes)

cd /Users/kempersc/apps/glam

# Test with first 100 archives
python3 scripts/scrapers/harvest_archivportal_d_api.py

Verify Output:

  • JSON file created in data/isil/germany/
  • 100-10,000+ archives (depending on total)
  • Statistics show reasonable distribution by state
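The checks above can also be run programmatically. A minimal sketch (the field names follow the harvester's output format, and the filename pattern is the one save_archives uses; adjust paths to your setup):

```python
import json
from pathlib import Path

def check_harvest(harvest: dict) -> dict:
    """Summarize a harvest file: total records, distinct states, ISIL coverage."""
    archives = harvest.get("archives", [])
    states = {a.get("federal_state") for a in archives if a.get("federal_state")}
    return {
        "total": len(archives),
        "states": len(states),
        "with_isil": sum(1 for a in archives if a.get("isil")),
    }

# Load the newest harvest file, if one exists
files = sorted(Path("data/isil/germany").glob("archivportal_d_api_*.json"))
if files:
    with open(files[-1], encoding="utf-8") as f:
        print(check_harvest(json.load(f)))
```

A healthy test run should report a non-trivial total and several distinct federal states.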

Step 4: Full Harvest (1-2 hours)

If the test is successful, run the full harvest (the script automatically pages through all results):

python3 scripts/scrapers/harvest_archivportal_d_api.py

Monitor Progress:

  • Watch console output for batch progress
  • Estimated time: 1-2 hours for ~10,000-20,000 archives
  • Final output: JSON file with complete archive listing

Step 5: Cross-Reference with ISIL (1 hour)

Create merge script: scripts/merge_archivportal_isil.py

# Match by ISIL code, identify new discoveries
python3 scripts/merge_archivportal_isil.py

Expected Results:

  • 30-50% of archives match ISIL by code
  • 50-70% are new discoveries (no ISIL)
  • Report: Overlap statistics, new archive count
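The merge script does not exist yet; a minimal sketch of the ISIL-matching logic it might contain (field names follow the harvester output above; the structure of the ISIL dataset's records is an assumption):

```python
from typing import Dict, List, Tuple

def match_by_isil(archivportal: List[Dict], isil_records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition Archivportal-D records into ISIL matches and new discoveries."""
    known = {r["isil"] for r in isil_records if r.get("isil")}
    matched = [a for a in archivportal if a.get("isil") in known]
    new = [a for a in archivportal if a.get("isil") not in known]
    return matched, new

# Hypothetical usage once both datasets are loaded:
# matched, new = match_by_isil(archives, isil_data["institutions"])
# print(f"Matched by ISIL: {len(matched)}, new discoveries: {len(new)}")
```

Records without any ISIL code fall into the "new" bucket by construction, which matches the expectation that 50-70% of archives are new discoveries.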

Step 6: Create Unified Dataset (1 hour)

Merge ISIL + Archivportal-D data:

python3 scripts/create_german_unified_dataset.py

Output:

  • data/isil/germany/german_unified_TIMESTAMP.json
  • ~25,000-27,000 total institutions
  • ~12,000-15,000 archives (complete coverage)
  • ~12,000-15,000 libraries/museums (from ISIL)
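The unified-dataset script is also still to be written. One way to sketch the deduplication step, keeping the ISIL record as the base when both sources share an ISIL code (field names are assumptions):

```python
from typing import Dict, List

def build_unified(isil_records: List[Dict], archivportal: List[Dict]) -> List[Dict]:
    """Union of both sources, deduplicated by ISIL code where one exists."""
    unified = {}
    for i, rec in enumerate(isil_records):
        key = rec.get("isil") or f"isil:{i}"  # fall back to a positional key
        unified[key] = {**rec, "sources": ["isil"]}
    for rec in archivportal:
        key = rec.get("isil") or f"apd:{rec.get('id')}"
        if key in unified:
            unified[key]["sources"].append("archivportal-d")  # same institution
        else:
            unified[key] = {**rec, "sources": ["archivportal-d"]}
    return list(unified.values())
```

Tracking the "sources" list per record makes the duplicate count (target: < 1%) easy to report afterwards.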

Success Criteria

Harvest Successful when:

  • 10,000-20,000 archives fetched
  • All 16 federal states represented
  • 30-50% have ISIL codes
  • 80%+ have geographic coordinates
  • JSON file size: 5-20 MB

Integration Complete when:

  • Cross-referenced with ISIL dataset
  • Duplicates removed (< 1%)
  • Unified dataset created
  • Statistics generated
  • Documentation updated

Troubleshooting

Issue: API Key Invalid

Error: 401 Unauthorized or 403 Forbidden

Fix:

  • Verify API key copied correctly (no extra spaces)
  • Check key is active in DDB account
  • Ensure using Bearer token format in Authorization header

Issue: No Results Returned

Error: numberOfResults: 0

Fix:

  • Try different query parameter (e.g., query: "archiv*")
  • Check sector parameter is correct (arc_archives)
  • Verify API endpoint URL is correct

Issue: Rate Limit Exceeded

Error: 429 Too Many Requests

Fix:

  • Increase REQUEST_DELAY from 0.5s to 1.0s or 2.0s
  • Reduce BATCH_SIZE from 100 to 50
  • Wait a few minutes before retrying
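If 429s persist, honouring the server's Retry-After header (when present) is more robust than a fixed delay. A sketch of a backoff helper; whether the DDB API actually sends Retry-After is an assumption:

```python
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[str], base: float = 0.5) -> float:
    """Seconds to wait before retrying: server hint if given, else exponential backoff."""
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall through to backoff
    return base * (2 ** attempt)

# In the harvester's retry loop, one could replace the fixed sleep with e.g.:
# time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))
```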

Files You'll Create

/data/isil/germany/
├── archivportal_d_api_TIMESTAMP.json      # Harvested archives
├── archivportal_d_api_stats_TIMESTAMP.json  # Statistics
├── german_unified_TIMESTAMP.json          # Merged dataset
└── GERMAN_UNIFIED_REPORT.md               # Final report

/scripts/scrapers/
├── harvest_archivportal_d_api.py          # API harvester
└── merge_archivportal_isil.py             # Merge script

Estimated Timeline

Task                    Time        Status
DDB API Registration    10 min      To-do
Create API Harvester    2 hours     To-do
Test Harvest            30 min      To-do
Full Harvest            1-2 hours   To-do
Cross-Reference         1 hour      To-do
Create Unified Dataset  1 hour      To-do
Documentation           1 hour      To-do
TOTAL                   ~6-9 hours

After Completion

German harvest will be 100% complete:

  • All ISIL-registered institutions (16,979)
  • All Archivportal-D archives (~10,000-20,000)
  • Unified, deduplicated dataset (~25,000-27,000)
  • Ready for LinkML conversion
  • Ready for GHCID generation

Next countries to harvest:

  1. Czech Republic - ISIL + Caslin registry
  2. Austria - ISIL + BiPHAN
  3. France - ISIL + Archives de France
  4. Belgium - ISIL + LOCUS


Ready to Start? Register for DDB API now!

Questions? Review /data/isil/germany/SESSION_SUMMARY_20251119_ARCHIVPORTAL_D.md