
German Archive Completeness Plan - Archivportal-D Integration

Date: November 19, 2025
Status: Discovery → Planning → Implementation
Goal: Achieve 100% coverage of German archives


Current Status

What We Have

  • 16,979 ISIL-registered institutions (Tier 1 data)
    • Libraries, archives, museums
    • Excellent metadata (87% geocoded)
    • Source: Deutsche Nationalbibliothek SRU API

What We're Missing ⚠️

  • ~5,000-10,000 archives without ISIL codes
    • Smaller municipal archives
    • Specialized archives (family, business, church)
    • Newly founded archives
    • Archives choosing not to register for ISIL

Gap Example: North Rhine-Westphalia

  • ISIL registry: 301 archives (63%)
  • archive.nrw.de portal: 477 archives (100%)
  • Missing: 176 archives (37%)

Solution: Archivportal-D Harvest

What is Archivportal-D?

URL: https://www.archivportal-d.de/
Operator: Deutsche Digitale Bibliothek (German Digital Library)
Coverage: ALL archives across Germany (national aggregator)

Key Features:

  • 16 federal states - Complete national coverage
  • ~10,000-20,000 archives - Comprehensive archive listings
  • 9 archive sectors - State, municipal, church, business, etc.
  • API access available - Machine-readable data via DDB API
  • CC0 metadata - Open data license for reuse
  • Actively maintained - Updated by national library infrastructure

Archive Sectors in Archivportal-D

  1. State archives (Landesarchive) - Federal state archives
  2. Local/municipal archives (Kommunalarchive) - City, county archives
  3. Church archives (Kirchenarchive) - Catholic, Protestant, Jewish
  4. Nobility and family archives (Adelsarchive, Familienarchive)
  5. Business archives (Wirtschaftsarchive) - Corporate, economic
  6. Political archives (Politische Archive) - Parties, movements, foundations
  7. Media archives (Medienarchive) - Broadcasting, film, press
  8. University archives (Hochschularchive) - Academic institutions
  9. Other archives (Sonstige Archive) - Specialized collections

DDB API Access

API Documentation

URL: https://api.deutsche-digitale-bibliothek.de/
Format: REST API with OpenAPI 3.0 specification
Authentication: API key (free for registered users)
License: CC0 for metadata

Registration Process

  1. Create account at https://www.deutsche-digitale-bibliothek.de/
  2. Log in to "My DDB" (Meine DDB)
  3. Generate API key in account settings
  4. Use API key in requests (Authorization header)

API Endpoints (Relevant for Archives)

# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
  ?query=*
  &sector=arc_archives
  &rows=100
  &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
  ?facetValues[]=federalState-Nordrhein-Westfalen
  &sector=arc_archives
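
As a concrete sketch, the federal-state filter above could be exercised like this with the requests library (the Bearer header follows the registration notes above; the exact auth scheme should be verified against the DDB API documentation):

import os
import requests

# Hypothetical probe: count NRW archives (auth scheme assumed, verify in DDB docs)
resp = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={
        "query": "*",
        "sector": "arc_archives",
        "facetValues[]": "federalState-Nordrhein-Westfalen",
        "rows": 1,
    },
    headers={"Authorization": f"Bearer {os.environ['DDB_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("numberOfResults"))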

Response Format

{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123" // if available
    }
  ]
}
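
For downstream type safety, the record shape above can be captured as a TypedDict (a sketch; total=False because fields such as isil appear only when the archive has one, and all names should be verified against live responses):

from typing import TypedDict

class ArchiveRecord(TypedDict, total=False):
    """Shape of one result record, per the example response above."""
    id: str
    title: str
    label: str
    federalState: str
    place: str
    latitude: str   # coordinates arrive as strings in the example
    longitude: str
    isil: str       # present only if the archive has an ISIL code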

Harvest Strategy - Three-Phase Approach

Phase 1: DDB API Harvest (Preferred)

Method: Use the DDB API to fetch all archive records
Estimated time: 1-2 hours
Coverage: ~10,000-20,000 archives

# Harvester sketch (DDB API)
import os
import time

import requests

def harvest_archivportal_d():
    api_key = os.environ["DDB_API_KEY"]  # generated in "Meine DDB" (see above)
    archives = []
    offset = 0
    batch_size = 100

    while True:
        response = requests.get(
            "https://api.deutsche-digitale-bibliothek.de/search",
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()

        results = response.json().get("results", [])
        archives.extend(results)

        if len(results) < batch_size:
            break  # final (partial) page reached

        offset += batch_size
        time.sleep(1.0)  # rate limit: 1 request/second (see Risk 1)

    return archives
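
Usage could then be a short driver that persists the raw data (the file name follows the Output Files convention below):

import json

archives = harvest_archivportal_d()
with open("archivportal_d_raw_20251119.json", "w", encoding="utf-8") as f:
    json.dump(archives, f, ensure_ascii=False, indent=2)
print(f"Harvested {len(archives)} archive records")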

Phase 2: Web Scraping Fallback (if API limited)

Method: Scrape archive list pages
Estimated time: 3-4 hours
Coverage: Same as API

# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
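
Should scraping become necessary, a minimal sketch follows; the CSS selector and stop conditions are placeholders that must be verified against the live markup:

import time
import requests
from bs4 import BeautifulSoup

def scrape_struktur_pages(max_pages=800):
    archives = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://www.archivportal-d.de/struktur",
            params={"page": page},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # ".institution-item" is a placeholder selector; inspect the real
        # page structure before relying on it
        items = soup.select(".institution-item")
        if not items:
            break  # ran past the last page
        archives.extend(item.get_text(strip=True) for item in items)
        time.sleep(1.0)  # same 1 req/s politeness as the API path
    return archives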

Phase 3: Regional Portal Enrichment (OPTIONAL)

Method: Scrape individual state portals for detailed metadata
Estimated time: 8-12 hours (16 states)
Coverage: Enrichment only (addresses, phone numbers, etc.)

State Portals: archive.nrw.de (North Rhine-Westphalia) and the equivalent portals of the other 15 federal states.


Data Integration Plan

Step 1: Harvest Archivportal-D

  • Fetch all ~10,000-20,000 archive records via API
  • Store in JSON format
  • Include: name, location, federal state, archive type, ISIL (if present)

Step 2: Cross-Reference with ISIL Dataset

# Match archives by ISIL code
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True
    else:
        archive['no_isil_code'] = True  # New discovery
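
The find_in_isil_dataset helper above is assumed; indexing the existing ISIL dataset by code makes each lookup O(1). A sketch (the input file layout, a list of records with an isil field, is an assumption):

import json

def load_isil_index(path="german_isil_complete_20251119_134939.json"):
    """Index ISIL records by code for O(1) lookups (record layout assumed)."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return {rec["isil"]: rec for rec in records if rec.get("isil")}

isil_index = load_isil_index()

def find_in_isil_dataset(isil_code):
    return isil_index.get(isil_code)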

Step 3: Identify New Discoveries

  • Archives in Archivportal-D without ISIL codes
  • Archives in Archivportal-D with ISIL codes not in our dataset
  • Expected: ~5,000-10,000 new archives

Step 4: Merge Datasets

# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
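
A sketch of assembling one unified record under this layout (the helper name and tier rules are illustrative, not final):

from datetime import datetime, timezone

def build_unified_record(archive, isil_match=None):
    """Assemble one unified record following the schema sketched above."""
    sources, tier = ["ARCHIVPORTAL_D"], "TIER_2"
    if isil_match:
        sources.insert(0, "ISIL_REGISTRY")
        tier = "TIER_1"  # ISIL-backed records keep Tier 1 provenance
    return {
        "id": archive["id"],
        "name": archive.get("title"),
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("place"),
            "state": archive.get("federalState"),
            "coordinates": [archive.get("latitude"), archive.get("longitude")],
        },
        "archive_type": archive.get("label"),
        "data_sources": sources,
        "provenance": {
            "tier": tier,
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }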

Step 5: Quality Validation

  • Check for duplicates (same name + city; see the fuzzy-matching sketch after this list)
  • Verify ISIL code matches
  • Geocode addresses (Nominatim API)
  • Validate institution types
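
The duplicate check can start from normalized name + city fuzzy matching with the standard library (a sketch; the 0.9 threshold is a starting assumption to tune on real data):

from difflib import SequenceMatcher

def _normalize(text):
    return " ".join((text or "").lower().split())

def is_probable_duplicate(a, b, threshold=0.9):
    """Flag two records whose normalized name+city are near-identical."""
    key_a = f"{_normalize(a.get('name'))}|{_normalize(a.get('city'))}"
    key_b = f"{_normalize(b.get('name'))}|{_normalize(b.get('city'))}"
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold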

Expected Results

Before Integration

Source          Institutions   Archives        Coverage
ISIL Registry   16,979         ~2,000-3,000    20-30% of archives
Total           16,979         ~2,500          Partial

After Integration

Source                  Institutions   Archives   Coverage
ISIL Registry           16,979         ~2,500     Core authoritative data
Archivportal-D          ~12,000        ~12,000    Complete archive coverage
Merged (deduplicated)   ~25,000        ~12,000    ~100% archives

Coverage Improvement: NRW Example

  • Before: 301 archives (63%)
  • After: ~477 archives (100%)
  • Gain: +176 archives (+58%)

Timeline and Resources

Time Estimate

  • API registration: 10 minutes
  • API harvester development: 2 hours
  • Data harvest: 1-2 hours
  • Cross-referencing: 1 hour
  • Deduplication: 1 hour
  • Validation: 2 hours
  • Documentation: 1 hour
  • Total: ~8-10 hours

Required Tools

  • Python 3.9+
  • requests library (API calls)
  • beautifulsoup4 (if web scraping needed)
  • Nominatim API (geocoding; see the sketch after this list)
  • LinkML validator (schema validation)
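
For the Nominatim step, the public instance's usage policy requires a descriptive User-Agent and at most one request per second. A minimal lookup sketch (the User-Agent string is a placeholder):

import time
import requests

def geocode(address):
    """Resolve an address to (lat, lon) via the public Nominatim API."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        headers={"User-Agent": "german-glam-harvester/0.1 (contact@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    time.sleep(1.0)  # Nominatim usage policy: max 1 request/second
    results = resp.json()
    if results:
        return float(results[0]["lat"]), float(results[0]["lon"])
    return None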

Output Files

/data/isil/germany/
  ├── archivportal_d_raw_20251119.json           # Raw API data
  ├── archivportal_d_archives_20251119.json      # Processed archives
  ├── german_isil_complete_20251119_134939.json  # Existing ISIL data
  ├── german_unified_20251119.json               # Merged dataset
  ├── german_unified_20251119.jsonl              # Line-delimited
  ├── german_new_discoveries_20251119.json       # Archives without ISIL
  ├── ARCHIVPORTAL_D_HARVEST_REPORT.md           # Documentation
  └── GERMAN_UNIFIED_STATISTICS.json             # Comprehensive stats

Next Steps (Priority Order)

1. Register for DDB API (Immediate)

2. Develop Archivportal-D Harvester (Today)

  • Script: scripts/scrapers/harvest_archivportal_d.py
  • Use DDB API (preferred) or web scraping (fallback)
  • Implement rate limiting (1 req/second)
  • Add error handling and retry logic (sketched below)
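
A minimal sketch of the pacing and retry behavior (backoff parameters are starting assumptions):

import time
import requests

def get_with_retry(url, *, params=None, headers=None, retries=3):
    """GET with 1 req/s pacing and exponential backoff on transient failures."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, headers=headers, timeout=30)
        except (requests.ConnectionError, requests.Timeout):
            resp = None  # network-level failure: back off and retry below
        if resp is not None and resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # non-transient 4xx errors surface immediately
            time.sleep(1.0)          # rate limit: 1 request/second
            return resp
        time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"giving up on {url} after {retries} attempts")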

3. Harvest All German Archives (Today)

  • Run harvester for ~10,000-20,000 archives
  • Save raw data + processed records
  • Generate harvest report

4. Cross-Reference and Merge (Today)

  • Match by ISIL codes
  • Identify new discoveries
  • Create unified dataset
  • Generate statistics

5. Validate and Document (Today)

  • Check for duplicates
  • Validate data quality
  • Create comprehensive documentation
  • Update progress reports

6. OPTIONAL: Regional Portal Enrichment (Future)

  • Scrape NRW portal for detailed metadata
  • Repeat for other states
  • Enrich merged dataset with contact info

Success Criteria

Complete German Archive Coverage

  • All ~12,000 German archives harvested
  • 100% of Archivportal-D listings
  • Cross-referenced with ISIL registry

High Data Quality

  • Deduplication complete (< 1% duplicates)
  • Geocoding for 80%+ of archives
  • Institution types classified

Comprehensive Documentation

  • Harvest reports for all sources
  • Integration methodology documented
  • Data quality metrics published

Ready for GLAM Integration

  • LinkML format conversion complete
  • GLAMORCUBESFIXPHDNT taxonomy applied
  • GHCIDs generated

Risk Mitigation

Risk 1: API Rate Limits

  • Mitigation: Implement 1 req/second limit, use batch requests
  • Fallback: Web scraping if API unavailable

Risk 2: Data Quality Issues

  • Mitigation: Validation scripts, manual review of suspicious records
  • Fallback: Flag low-quality records, document issues

Risk 3: Duplicates

  • Mitigation: Fuzzy matching by name+city, ISIL code verification
  • Fallback: Manual review of borderline matches

Risk 4: Missing Metadata

  • Mitigation: Use multiple sources (ISIL + Archivportal-D + regional portals)
  • Fallback: Mark incomplete records, enrich later

References

Primary Sources

  • Archivportal-D: https://www.archivportal-d.de/
  • DDB API: https://api.deutsche-digitale-bibliothek.de/
  • Deutsche Digitale Bibliothek: https://www.deutsche-digitale-bibliothek.de/

Regional Portals

  • archive.nrw.de (North Rhine-Westphalia)

Documentation

  • ISIL Harvest: /data/isil/germany/HARVEST_REPORT.md
  • Comprehensiveness Check: /data/isil/germany/COMPREHENSIVENESS_REPORT.md
  • Archivportal-D Discovery: /data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md
  • This Plan: /data/isil/germany/COMPLETENESS_PLAN.md

Status: Plan complete, ready for implementation
Priority: HIGH - Critical for 100% German coverage
Next Action: Register for DDB API key, develop harvester
Estimated Completion: Same day (8-10 hours work)


End of Plan