# German Archive Completion - Quick Start Guide
**Goal**: Achieve 100% German archive coverage by harvesting Archivportal-D via the DDB API

---

## Current Status

✅ **ISIL Registry**: 16,979 institutions harvested
🔄 **Archivportal-D**: Awaiting API access (~10,000-20,000 archives)
🎯 **Target**: ~25,000-27,000 total German institutions
---

## Step-by-Step: Complete German Archive Harvest

### Step 1: Register for DDB API (10 minutes)

1. **Visit**: https://www.deutsche-digitale-bibliothek.de/
2. **Click**: "Registrieren" (register button, top right)
3. **Fill in the form**:
   - Email: [your-email]
   - Username: [choose username]
   - Password: [secure password]
4. **Verify your email**: Check your inbox and click the confirmation link
5. **Log in**: Use your credentials to log into the DDB portal
6. **Navigate to**: "Meine DDB" (My DDB) in the account menu
7. **Generate an API key**: Look for the "API-Schlüssel" (API key) section
8. **Copy the key**: Save it to a secure location (e.g., a password manager)

**Result**: You now have a DDB API key (e.g., `ddb_abc123xyz456...`)
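Before wiring the key into the harvester, you can sanity-check the request shape it will send, offline and without a network call. This is a minimal sketch: the endpoint, Bearer auth scheme, and parameters mirror the harvester script's configuration, and the key shown is the placeholder from the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "ddb_abc123xyz456"  # placeholder key; substitute your real one

# Build (but do not send) the same GET request the harvester issues
params = urlencode({"query": "*", "sector": "arc_archives", "rows": 1, "offset": 0})
req = Request(
    f"{API_BASE_URL}/search?{params}",
    headers={"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"},
)
print(req.full_url)
print(req.get_header("Authorization"))
```

If the printed URL and header look right, the same values can be pasted into a `curl` call for a live test.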
---

### Step 2: Create API Harvester (2 hours)

**File**: `/Users/kempersc/apps/glam/scripts/scrapers/harvest_archivportal_d_api.py`

```python
#!/usr/bin/env python3
"""
Archivportal-D API Harvester

Fetches all German archives via the Deutsche Digitale Bibliothek REST API.
"""

import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional

import requests

# Configuration
API_BASE_URL = "https://api.deutsche-digitale-bibliothek.de"
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your DDB API key
OUTPUT_DIR = Path("/Users/kempersc/apps/glam/data/isil/germany")
BATCH_SIZE = 100      # Archives per request
REQUEST_DELAY = 0.5   # Seconds between requests
MAX_RETRIES = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def fetch_archives_batch(offset: int = 0, rows: int = 100) -> Optional[Dict]:
    """
    Fetch a batch of archives via the DDB API.

    Args:
        offset: Starting record number
        rows: Number of records to fetch

    Returns:
        API response dict, or None on error
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/json",
    }
    params = {
        "query": "*",              # All archives
        "sector": "arc_archives",  # Archives sector only
        "rows": rows,
        "offset": offset,
    }

    for attempt in range(MAX_RETRIES):
        try:
            print(f"Fetching archives {offset}-{offset + rows - 1}...", end=' ')
            response = requests.get(
                f"{API_BASE_URL}/search",
                headers=headers,
                params=params,
                timeout=30,
            )
            response.raise_for_status()

            data = response.json()
            total = data.get('numberOfResults', 0)
            print(f"OK (total: {total})")
            return data

        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1}/{MAX_RETRIES} failed: {e}")
            if attempt < MAX_RETRIES - 1:
                time.sleep(REQUEST_DELAY * (attempt + 1))  # Linear backoff

    return None


def parse_archive_record(record: Dict) -> Dict:
    """
    Parse a raw DDB API archive record into a simplified format.

    Args:
        record: Raw API record

    Returns:
        Parsed archive dictionary
    """
    record_id = record.get('id')
    return {
        'id': record_id,
        'name': record.get('title'),
        'location': record.get('place'),
        'federal_state': record.get('federalState'),
        'archive_type': record.get('label'),
        'isil': record.get('isil'),
        'latitude': record.get('latitude'),
        'longitude': record.get('longitude'),
        'thumbnail': record.get('thumbnail'),
        'profile_url': f"https://www.archivportal-d.de/item/{record_id}" if record_id else None,
    }


def harvest_all_archives() -> List[Dict]:
    """
    Harvest all archives from the DDB API.

    Returns:
        List of parsed archive records
    """
    print(f"\n{'=' * 70}")
    print("Harvesting Archivportal-D via DDB API")
    print(f"Endpoint: {API_BASE_URL}/search")
    print(f"{'=' * 70}\n")

    all_archives = []
    offset = 0

    while True:
        # Fetch batch
        data = fetch_archives_batch(offset, BATCH_SIZE)
        if not data:
            print(f"Warning: Failed to fetch batch at offset {offset}. Stopping.")
            break

        # Parse results
        results = data.get('results', [])
        for result in results:
            all_archives.append(parse_archive_record(result))

        print(f"Progress: {len(all_archives)} archives collected")

        # Stop once every record is fetched or the last page came back short
        total = data.get('numberOfResults', 0)
        if len(all_archives) >= total or len(results) < BATCH_SIZE:
            break

        offset += BATCH_SIZE
        time.sleep(REQUEST_DELAY)

    print(f"\n{'=' * 70}")
    print(f"Harvest complete: {len(all_archives)} archives")
    print(f"{'=' * 70}\n")

    return all_archives


def save_archives(archives: List[Dict]) -> None:
    """Save archives to a timestamped JSON file."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_file = OUTPUT_DIR / f"archivportal_d_api_{timestamp}.json"

    output = {
        'metadata': {
            'source': 'Archivportal-D via DDB API',
            'source_url': 'https://www.archivportal-d.de',
            'api_endpoint': f'{API_BASE_URL}/search',
            'operator': 'Deutsche Digitale Bibliothek',
            'harvest_date': datetime.now(timezone.utc).isoformat(),
            'total_archives': len(archives),
            'method': 'REST API',
            'license': 'CC0 1.0 Universal (Public Domain)',
        },
        'archives': archives,
    }

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)

    print(f"✓ Saved to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB\n")


def generate_statistics(archives: List[Dict]) -> Dict:
    """Generate and print summary statistics."""
    stats = {
        'total': len(archives),
        'by_state': {},
        'by_type': {},
        'with_isil': 0,
        'with_coordinates': 0,
    }

    for archive in archives:
        # By federal state (fall back to 'Unknown' for missing/None values)
        state = archive.get('federal_state') or 'Unknown'
        stats['by_state'][state] = stats['by_state'].get(state, 0) + 1

        # By archive type
        arch_type = archive.get('archive_type') or 'Unknown'
        stats['by_type'][arch_type] = stats['by_type'].get(arch_type, 0) + 1

        # Completeness
        if archive.get('isil'):
            stats['with_isil'] += 1
        if archive.get('latitude'):
            stats['with_coordinates'] += 1

    print(f"\n{'=' * 70}")
    print("Statistics:")
    print(f"{'=' * 70}")
    print(f"Total archives: {stats['total']}")
    print(f"With ISIL: {stats['with_isil']} ({stats['with_isil'] / stats['total'] * 100:.1f}%)")
    print(f"With coordinates: {stats['with_coordinates']} ({stats['with_coordinates'] / stats['total'] * 100:.1f}%)")

    print("\nTop 10 federal states:")
    for state, count in sorted(stats['by_state'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {state}: {count}")

    print("\nTop 10 archive types:")
    for arch_type, count in sorted(stats['by_type'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {arch_type}: {count}")

    print(f"{'=' * 70}\n")
    return stats


def main() -> None:
    """Main execution."""
    print(f"\n{'#' * 70}")
    print("# Archivportal-D API Harvester")
    print("# Deutsche Digitale Bibliothek REST API")
    print(f"{'#' * 70}\n")

    if API_KEY == "YOUR_API_KEY_HERE":
        print("ERROR: Please set your DDB API key in the script!")
        print("Edit the API_KEY constant near the top of this file.")
        return

    # Harvest
    archives = harvest_all_archives()
    if not archives:
        print("No archives harvested. Exiting.")
        return

    # Save
    save_archives(archives)

    # Statistics
    stats = generate_statistics(archives)

    # Save stats
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    stats_file = OUTPUT_DIR / f"archivportal_d_api_stats_{timestamp}.json"
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    print(f"✓ Statistics saved to: {stats_file}\n")

    print("✓ Harvest complete!\n")


if __name__ == "__main__":
    main()
```
**Set the key**: Replace `YOUR_API_KEY_HERE` in the `API_KEY` constant with your actual DDB API key.
---

### Step 3: Test Harvest (30 minutes)

```bash
cd /Users/kempersc/apps/glam

# Smoke test: run the harvester and watch the first few batches
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Verify the output**:
- JSON file created in `data/isil/germany/`
- 100-10,000+ archives (depending on the total)
- Statistics show a plausible distribution across federal states
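A quick way to eyeball the newest harvest file is a small summary script. This is a sketch, not part of the harvester; the field names (`metadata.total_archives`, `archives`, `federal_state`, `isil`) mirror what `save_archives()` writes, and it assumes you run it from the repo root.

```python
import json
from pathlib import Path

def summarize(harvest: dict) -> dict:
    """Summarize a harvest file produced by save_archives()."""
    archives = harvest["archives"]
    states = {a.get("federal_state") or "Unknown" for a in archives}
    return {
        "total": harvest["metadata"]["total_archives"],
        "parsed": len(archives),
        "states": len(states),
        "with_isil": sum(1 for a in archives if a.get("isil")),
    }

# Load the newest harvest file, if one exists
files = sorted(Path("data/isil/germany").glob("archivportal_d_api_*.json"))
if files:
    print(summarize(json.loads(files[-1].read_text(encoding="utf-8"))))
```

A healthy harvest should show `parsed` equal to `total` and close to 16 distinct states.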
---

### Step 4: Full Harvest (1-2 hours)

If the test was successful, run the full harvest (the script automatically fetches all batches):

```bash
python3 scripts/scrapers/harvest_archivportal_d_api.py
```

**Monitor progress**:
- Watch the console output for batch progress
- Estimated time: 1-2 hours for ~10,000-20,000 archives
- Final output: JSON file with the complete archive listing
---

### Step 5: Cross-Reference with ISIL (1 hour)

Create a merge script, `scripts/merge_archivportal_isil.py`:

```bash
# Match by ISIL code, identify new discoveries
python3 scripts/merge_archivportal_isil.py
```

**Expected results**:
- 30-50% of archives match the ISIL registry by code
- 50-70% are new discoveries (no ISIL code)
- Report: overlap statistics, new-archive count
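The merge script does not exist yet; the core matching step could look like the sketch below. It assumes both datasets expose an `isil` field per record (the harvester's output does; the ISIL registry field name is an assumption to adapt).

```python
from typing import Dict, List, Tuple

def cross_reference(
    archivportal: List[Dict], isil_registry: List[Dict]
) -> Tuple[List[Dict], List[Dict]]:
    """Split Archivportal-D records into ISIL matches and new discoveries."""
    known_isils = {rec["isil"] for rec in isil_registry if rec.get("isil")}
    matched = [a for a in archivportal if a.get("isil") in known_isils]
    new = [a for a in archivportal if a.get("isil") not in known_isils]
    return matched, new

archivportal = [
    {"name": "Stadtarchiv A", "isil": "DE-1"},
    {"name": "Kreisarchiv B", "isil": None},  # no ISIL -> new discovery
]
isil_registry = [{"name": "Stadtarchiv A", "isil": "DE-1"}]

matched, new = cross_reference(archivportal, isil_registry)
print(len(matched), len(new))  # 1 1
```

Records without an ISIL code always land in `new`; a production script would add fuzzy name/place matching on top of this exact-code pass.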
---

### Step 6: Create Unified Dataset (1 hour)

Merge the ISIL and Archivportal-D data:

```bash
python3 scripts/create_german_unified_dataset.py
```

**Output**:
- `data/isil/germany/german_unified_TIMESTAMP.json`
- ~25,000-27,000 total institutions
- ~12,000-15,000 archives (complete coverage)
- ~12,000-15,000 libraries/museums (from ISIL)
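A sketch of the deduplication idea behind the unified dataset. The policy here (ISIL records win on conflicts, Archivportal-D fills in missing fields, ISIL-less archives are appended) is a design assumption for illustration, not the project's fixed rule.

```python
def unify(isil_records, archivportal_records):
    """Merge two record lists, deduplicating on ISIL code."""
    by_isil = {r["isil"]: dict(r) for r in isil_records if r.get("isil")}
    unified = list(by_isil.values())
    for rec in archivportal_records:
        isil = rec.get("isil")
        if isil and isil in by_isil:
            # Duplicate: enrich the ISIL record with any fields it lacks
            for key, value in rec.items():
                by_isil[isil].setdefault(key, value)
        else:
            unified.append(rec)  # New discovery: keep as-is
    return unified

isil = [{"isil": "DE-1", "name": "Stadtarchiv A"}]
arc = [
    {"isil": "DE-1", "name": "Stadtarchiv A", "latitude": 48.1},
    {"isil": None, "name": "Kreisarchiv B"},
]
print(len(unify(isil, arc)))  # 2
```

`setdefault` only fills gaps, so the ISIL record's existing values are never overwritten; this is what keeps the duplicate rate under the < 1% target below.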
---

## Success Criteria

✅ **Harvest successful** when:

- [ ] 10,000-20,000 archives fetched
- [ ] All 16 federal states represented
- [ ] 30-50% have ISIL codes
- [ ] 80%+ have geographic coordinates
- [ ] JSON file size: 5-20 MB

✅ **Integration complete** when:

- [ ] Cross-referenced with the ISIL dataset
- [ ] Duplicates removed (< 1%)
- [ ] Unified dataset created
- [ ] Statistics generated
- [ ] Documentation updated
---

## Troubleshooting

### Issue: API Key Invalid

**Error**: `401 Unauthorized` or `403 Forbidden`

**Fix**:
- Verify the API key was copied correctly (no extra spaces)
- Check that the key is active in your DDB account
- Ensure the `Authorization` header uses the `Bearer` token format

### Issue: No Results Returned

**Error**: `numberOfResults: 0`

**Fix**:
- Try a different query parameter (e.g., `query: "archiv*"`)
- Check that the `sector` parameter is correct (`arc_archives`)
- Verify the API endpoint URL is correct

### Issue: Rate Limit Exceeded

**Error**: `429 Too Many Requests`

**Fix**:
- Increase `REQUEST_DELAY` from 0.5s to 1.0s or 2.0s
- Reduce `BATCH_SIZE` from 100 to 50
- Wait a few minutes before retrying
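If 429s persist even with a larger `REQUEST_DELAY`, the harvester's linear retry delay can be swapped for exponential backoff with jitter, a common pattern for rate-limited APIs. `backoff_delay` below is a hypothetical helper, not part of the script:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Upper bounds grow 0.5s, 1s, 2s, 4s, ... capped at 30s
for attempt in range(4):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.5 * 2 ** attempt)}s")
```

Replacing `time.sleep(REQUEST_DELAY * (attempt + 1))` in `fetch_archives_batch` with `time.sleep(backoff_delay(attempt))` spreads retries out and avoids hammering the API in lockstep.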
---

## Files You'll Create

```
/data/isil/germany/
├── archivportal_d_api_TIMESTAMP.json        # Harvested archives
├── archivportal_d_api_stats_TIMESTAMP.json  # Statistics
├── german_unified_TIMESTAMP.json            # Merged dataset
└── GERMAN_UNIFIED_REPORT.md                 # Final report

/scripts/scrapers/
├── harvest_archivportal_d_api.py            # API harvester
└── merge_archivportal_isil.py               # Merge script
```
---

## Estimated Timeline

| Task | Time | Status |
|------|------|--------|
| **DDB API Registration** | 10 min | ⏳ To do |
| **Create API Harvester** | 2 hours | ⏳ To do |
| **Test Harvest** | 30 min | ⏳ To do |
| **Full Harvest** | 1-2 hours | ⏳ To do |
| **Cross-Reference** | 1 hour | ⏳ To do |
| **Create Unified Dataset** | 1 hour | ⏳ To do |
| **Documentation** | 1 hour | ⏳ To do |
| **TOTAL** | **~6-9 hours** | |
---

## After Completion

**The German harvest will be 100% complete**:
- ✅ All ISIL-registered institutions (16,979)
- ✅ All Archivportal-D archives (~10,000-20,000)
- ✅ Unified, deduplicated dataset (~25,000-27,000)
- ✅ Ready for LinkML conversion
- ✅ Ready for GHCID generation

**Next countries to harvest**:
1. **Czech Republic** - ISIL + Caslin registry
2. **Austria** - ISIL + BiPHAN
3. **France** - ISIL + Archives de France
4. **Belgium** - ISIL + LOCUS
---

## Quick Links

- **DDB Portal**: https://www.deutsche-digitale-bibliothek.de/
- **API Docs**: https://api.deutsche-digitale-bibliothek.de/ (after login)
- **Archivportal-D**: https://www.archivportal-d.de/
- **ISIL Registry**: https://sigel.staatsbibliothek-berlin.de/

---

**Ready to start?** ✅ Register for the DDB API now!

**Questions?** Review `/data/isil/germany/SESSION_SUMMARY_20251119_ARCHIVPORTAL_D.md`