# German Archive Completeness Plan - Archivportal-D Integration
**Date**: November 19, 2025
**Status**: Discovery → Planning → Implementation
**Goal**: Achieve 100% coverage of German archives
---
## Current Status
### What We Have ✅
- **16,979 ISIL-registered institutions** (Tier 1 data)
- Libraries, archives, museums
- Excellent metadata (87% geocoded)
- Source: Deutsche Nationalbibliothek SRU API
### What We're Missing ⚠️
- **~5,000-10,000 archives without ISIL codes**
- Smaller municipal archives
- Specialized archives (family, business, church)
- Newly founded archives
- Archives choosing not to register for ISIL
### Gap Example: North Rhine-Westphalia
- **ISIL registry**: 301 archives (63%)
- **archive.nrw.de portal**: 477 archives (100%)
- **Missing**: 176 archives (37%)
---
## Solution: Archivportal-D Harvest
### What is Archivportal-D?
**URL**: https://www.archivportal-d.de/
**Operator**: Deutsche Digitale Bibliothek (German Digital Library)
**Coverage**: ALL archives across Germany (national aggregator)
**Key Features**:
- **16 federal states** - Complete national coverage
- **~10,000-20,000 archives** - Comprehensive archive listings
- **9 archive sectors** - State, municipal, church, business, etc.
- **API access available** - Machine-readable data via DDB API
- **CC0 metadata** - Open data license for reuse
- **Actively maintained** - Updated by national library infrastructure
---
## Archive Sectors in Archivportal-D
1. **State archives** (Landesarchive) - Federal state archives
2. **Local/municipal archives** (Kommunalarchive) - City, county archives
3. **Church archives** (Kirchenarchive) - Catholic, Protestant, Jewish
4. **Nobility and family archives** (Adelsarchive, Familienarchive)
5. **Business archives** (Wirtschaftsarchive) - Corporate, economic
6. **Political archives** (Politische Archive) - Parties, movements, foundations
7. **Media archives** (Medienarchive) - Broadcasting, film, press
8. **University archives** (Hochschularchive) - Academic institutions
9. **Other archives** (Sonstige Archive) - Specialized collections
---
## DDB API Access
### API Documentation
**URL**: https://api.deutsche-digitale-bibliothek.de/
**Format**: REST API with OpenAPI 3.0 specification
**Authentication**: API key (free for registered users)
**License**: CC0 for metadata
### Registration Process
1. Create account at https://www.deutsche-digitale-bibliothek.de/
2. Log in to "My DDB" (Meine DDB)
3. Generate API key in account settings
4. Use API key in requests (Authorization header)
### API Endpoints (Relevant for Archives)
```http
# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
    ?query=*
    &sector=arc_archives
    &rows=100
    &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
    ?facetValues[]=federalState-Nordrhein-Westfalen
    &sector=arc_archives
```
### Response Format
```json
{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123" // if available
    }
  ]
}
```
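As a sketch, one raw result from the response above could be normalized into a flat record like this. The field names follow the illustrative example; the live DDB response schema may differ:

```python
def normalize_result(raw):
    """Map one raw API result (shape as in the example above) to a flat record."""
    has_coords = raw.get("latitude") and raw.get("longitude")
    return {
        "id": raw.get("id"),
        "name": raw.get("title"),
        "state": raw.get("federalState"),
        "city": raw.get("place"),
        "coordinates": (float(raw["latitude"]), float(raw["longitude"])) if has_coords else None,
        "isil": raw.get("isil"),  # None for archives without an ISIL code
    }
```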
---
## Harvest Strategy - Three-Phase Approach
### Phase 1: API Harvest (RECOMMENDED) 🚀
**Method**: Use DDB API to fetch all archive records
**Estimated time**: 1-2 hours
**Coverage**: ~10,000-20,000 archives
```python
import time
import requests

API_URL = "https://api.deutsche-digitale-bibliothek.de/search"

def harvest_archivportal_d(api_key):
    """Fetch all archive records from the DDB API, paging until exhausted."""
    archives = []
    offset = 0
    batch_size = 100
    while True:
        response = requests.get(
            API_URL,
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        archives.extend(results)
        if len(results) < batch_size:
            break  # last page reached
        offset += batch_size
        time.sleep(0.5)  # rate limiting
    return archives
```
### Phase 2: Web Scraping Fallback (if API limited)
**Method**: Scrape archive list pages
**Estimated time**: 3-4 hours
**Coverage**: Same as API
```python
# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
```
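If the fallback is needed, the listing pages could be parsed with the standard library alone. This is a minimal sketch: the `/organization/` link pattern is an assumption about the portal's markup, which must be verified against the real pages before use:

```python
from html.parser import HTMLParser

class ArchiveLinkParser(HTMLParser):
    """Collect (name, href) pairs from anchor tags on a listing page.
    Assumption: archive detail pages live under /organization/ paths;
    the real archivportal-d.de markup may differ."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._href = None
        self._text = []
        self.archives = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/organization/"):
                self._in_link = True
                self._href = href

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.archives.append(("".join(self._text).strip(), self._href))
            self._in_link = False
            self._href, self._text = None, []
```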
### Phase 3: Regional Portal Enrichment (OPTIONAL)
**Method**: Scrape individual state portals for detailed metadata
**Estimated time**: 8-12 hours (16 states)
**Coverage**: Enrichment only (addresses, phone numbers, etc.)
**State Portals**:
- NRW: https://www.archive.nrw.de/ (477 archives)
- Baden-Württemberg: https://www.landesarchiv-bw.de/
- Niedersachsen: https://www.arcinsys.niedersachsen.de/
- Bavaria, Saxony, etc.: Various systems
---
## Data Integration Plan
### Step 1: Harvest Archivportal-D
- Fetch all ~10,000-20,000 archive records via API
- Store in JSON format
- Include: name, location, federal state, archive type, ISIL (if present)
### Step 2: Cross-Reference with ISIL Dataset
```python
# Match archives by ISIL code
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True  # ISIL not yet in our dataset
    else:
        archive['no_isil_code'] = True  # New discovery
```
### Step 3: Identify New Discoveries
- Archives in Archivportal-D **without** ISIL codes
- Archives in Archivportal-D **with** ISIL codes not in our dataset
- Expected: ~5,000-10,000 new archives
### Step 4: Merge Datasets
```yaml
# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
```
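A merge step following this schema might look like the sketch below. The input field names (`id`, `name`, `city`, `state`, `coordinates`) are assumptions about the normalized records, not a fixed interface:

```python
from datetime import datetime, timezone

def build_unified_record(archive, isil_match=None):
    """Compose one unified record per the schema sketched above.
    `archive` is a normalized Archivportal-D record; `isil_match` is the
    matching ISIL registry entry, if any."""
    sources = ["ARCHIVPORTAL_D"]
    if isil_match:
        sources.insert(0, "ISIL_REGISTRY")  # Tier 1 source takes precedence
    return {
        "id": archive["id"],
        "name": (isil_match or archive).get("name"),
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("city"),
            "state": archive.get("state"),
            "coordinates": archive.get("coordinates"),
        },
        "data_sources": sources,
        "provenance": {
            "tier": "TIER_1" if isil_match else "TIER_2",
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }
```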
### Step 5: Quality Validation
- Check for duplicates (same name + city)
- Verify ISIL code matches
- Geocode addresses (Nominatim API)
- Validate institution types
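The name+city duplicate check needs a normalization step so that umlaut and casing variants collapse to the same key. One possible sketch:

```python
import unicodedata

def dedup_key(name, city):
    """Case-, accent-, and whitespace-insensitive key for the name+city duplicate check."""
    def norm(s):
        s = unicodedata.normalize("NFKD", s or "")
        s = "".join(c for c in s if not unicodedata.combining(c))  # strip diacritics
        return " ".join(s.lower().split())  # collapse case and whitespace
    return (norm(name), norm(city))
```

Records sharing a key become candidates for manual review rather than being dropped automatically.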
---
## Expected Results
### Before Integration
| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,000-3,000 | 20-30% of archives |
| **Total** | **16,979** | **~2,500** | **Partial** |
### After Integration
| Source | Institutions | Archives | Coverage |
|--------|--------------|----------|----------|
| ISIL Registry | 16,979 | ~2,500 | Core authoritative data |
| Archivportal-D | ~12,000 | ~12,000 | Complete archive coverage |
| **Merged (deduplicated)** | **~25,000** | **~12,000** | **~100% archives** |
### Coverage Improvement: NRW Example
- **Before**: 301 archives (63%)
- **After**: ~477 archives (100%)
- **Gain**: +176 archives (+58%)
---
## Timeline and Resources
### Time Estimate
- **API registration**: 10 minutes
- **API harvester development**: 2 hours
- **Data harvest**: 1-2 hours
- **Cross-referencing**: 1 hour
- **Deduplication**: 1 hour
- **Validation**: 2 hours
- **Documentation**: 1 hour
- **Total**: ~8-10 hours
### Required Tools
- Python 3.9+
- `requests` library (API calls)
- `beautifulsoup4` (if web scraping needed)
- Nominatim API (geocoding)
- LinkML validator (schema validation)
### Output Files
```
/data/isil/germany/
├── archivportal_d_raw_20251119.json # Raw API data
├── archivportal_d_archives_20251119.json # Processed archives
├── german_isil_complete_20251119_134939.json # Existing ISIL data
├── german_unified_20251119.json # Merged dataset
├── german_unified_20251119.jsonl # Line-delimited
├── german_new_discoveries_20251119.json # Archives without ISIL
├── ARCHIVPORTAL_D_HARVEST_REPORT.md # Documentation
└── GERMAN_UNIFIED_STATISTICS.json # Comprehensive stats
```
---
## Next Steps (Priority Order)
### 1. Register for DDB API ✅ (Immediate)
- Go to https://www.deutsche-digitale-bibliothek.de/
- Create free account
- Generate API key in "My DDB"
- Test API access
### 2. Develop Archivportal-D Harvester ✅ (Today)
- Script: `scripts/scrapers/harvest_archivportal_d.py`
- Use DDB API (preferred) or web scraping (fallback)
- Implement rate limiting (1 req/second)
- Add error handling and retry logic
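The retry logic could follow a standard exponential-backoff pattern; this is a generic sketch, not tied to any specific DDB error codes:

```python
import time

def get_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Call `fetch()` (any zero-argument callable, e.g. a wrapped requests.get)
    with exponential backoff; re-raise after max_retries failures."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```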
### 3. Harvest All German Archives ✅ (Today)
- Run harvester for ~10,000-20,000 archives
- Save raw data + processed records
- Generate harvest report
### 4. Cross-Reference and Merge ✅ (Today)
- Match by ISIL codes
- Identify new discoveries
- Create unified dataset
- Generate statistics
### 5. Validate and Document ✅ (Today)
- Check for duplicates
- Validate data quality
- Create comprehensive documentation
- Update progress reports
### 6. OPTIONAL: Regional Portal Enrichment (Future)
- Scrape NRW portal for detailed metadata
- Repeat for other states
- Enrich merged dataset with contact info
---
## Success Criteria
**Complete German Archive Coverage**
- All ~12,000 German archives harvested
- 100% of Archivportal-D listings
- Cross-referenced with ISIL registry
**High Data Quality**
- Deduplication complete (< 1% duplicates)
- Geocoding for 80%+ of archives
- Institution types classified
**Comprehensive Documentation**
- Harvest reports for all sources
- Integration methodology documented
- Data quality metrics published
**Ready for GLAM Integration**
- LinkML format conversion complete
- GLAMORCUBESFIXPHDNT taxonomy applied
- GHCIDs generated
---
## Risk Mitigation
### Risk 1: API Rate Limits
- **Mitigation**: Implement 1 req/second limit, use batch requests
- **Fallback**: Web scraping if API unavailable
### Risk 2: Data Quality Issues
- **Mitigation**: Validation scripts, manual review of suspicious records
- **Fallback**: Flag low-quality records, document issues
### Risk 3: Duplicates
- **Mitigation**: Fuzzy matching by name+city, ISIL code verification
- **Fallback**: Manual deduplication for high-confidence matches
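The fuzzy name+city matching could be sketched with the standard library's `difflib`; the 0.9 threshold is a tunable assumption to be calibrated against real duplicates:

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a, b, threshold=0.9):
    """Fuzzy name+city match between two records (dicts with 'name' and 'city')."""
    def ratio(x, y):
        return SequenceMatcher(None, (x or "").lower(), (y or "").lower()).ratio()
    return (ratio(a["name"], b["name"]) >= threshold
            and ratio(a["city"], b["city"]) >= threshold)
```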
### Risk 4: Missing Metadata
- **Mitigation**: Use multiple sources (ISIL + Archivportal-D + regional portals)
- **Fallback**: Mark incomplete records, enrich later
---
## References
### Primary Sources
- **ISIL Registry**: https://services.dnb.de/sru/bib (16,979 records)
- **Archivportal-D**: https://www.archivportal-d.de/ (~12,000 archives) 🔄
- **DDB API**: https://api.deutsche-digitale-bibliothek.de/ 🔄
### Regional Portals
- **NRW**: https://www.archive.nrw.de/ (477 archives)
- **Baden-Württemberg**: https://www.landesarchiv-bw.de/
- **Niedersachsen**: https://www.arcinsys.niedersachsen.de/
### Documentation
- ISIL Harvest: `/data/isil/germany/HARVEST_REPORT.md`
- Comprehensiveness Check: `/data/isil/germany/COMPREHENSIVENESS_REPORT.md`
- Archivportal-D Discovery: `/data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md`
- This Plan: `/data/isil/germany/COMPLETENESS_PLAN.md`
---
**Status**: Plan complete, ready for implementation
**Priority**: HIGH - Critical for 100% German coverage
**Next Action**: Register for DDB API key, develop harvester
**Estimated Completion**: Same day (8-10 hours work)
---
**End of Plan**