
German Archive Completeness Plan - Archivportal-D Integration

Date: November 19, 2025
Status: Discovery → Planning → Implementation
Goal: Achieve 100% coverage of German archives


Current Status

What We Have

  • 16,979 ISIL-registered institutions (Tier 1 data)
    • Libraries, archives, museums
    • Excellent metadata (87% geocoded)
    • Source: Deutsche Nationalbibliothek SRU API

What We're Missing ⚠️

  • ~5,000-10,000 archives without ISIL codes
    • Smaller municipal archives
    • Specialized archives (family, business, church)
    • Newly founded archives
    • Archives choosing not to register for ISIL

Gap Example: North Rhine-Westphalia

  • ISIL registry: 301 archives (63%)
  • archive.nrw.de portal: 477 archives (100%)
  • Missing: 176 archives (37%)

Solution: Archivportal-D Harvest

What is Archivportal-D?

URL: https://www.archivportal-d.de/
Operator: Deutsche Digitale Bibliothek (German Digital Library)
Coverage: ALL archives across Germany (national aggregator)

Key Features:

  • 16 federal states - Complete national coverage
  • ~10,000-20,000 archives - Comprehensive archive listings
  • 9 archive sectors - State, municipal, church, business, etc.
  • API access available - Machine-readable data via DDB API
  • CC0 metadata - Open data license for reuse
  • Actively maintained - Updated by national library infrastructure

Archive Sectors in Archivportal-D

  1. State archives (Landesarchive) - Federal state archives
  2. Local/municipal archives (Kommunalarchive) - City, county archives
  3. Church archives (Kirchenarchive) - Catholic, Protestant, Jewish
  4. Nobility and family archives (Adelsarchive, Familienarchive)
  5. Business archives (Wirtschaftsarchive) - Corporate, economic
  6. Political archives (Politische Archive) - Parties, movements, foundations
  7. Media archives (Medienarchive) - Broadcasting, film, press
  8. University archives (Hochschularchive) - Academic institutions
  9. Other archives (Sonstige Archive) - Specialized collections

DDB API Access

API Documentation

URL: https://api.deutsche-digitale-bibliothek.de/
Format: REST API with OpenAPI 3.0 specification
Authentication: API key (free for registered users)
License: CC0 for metadata

Registration Process

  1. Create account at https://www.deutsche-digitale-bibliothek.de/
  2. Log in to "My DDB" (Meine DDB)
  3. Generate API key in account settings
  4. Use API key in requests (Authorization header)

API Endpoints (Relevant for Archives)

# Search archives
GET https://api.deutsche-digitale-bibliothek.de/search
  ?query=*
  &sector=arc_archives
  &rows=100
  &offset=0

# Get archive details
GET https://api.deutsche-digitale-bibliothek.de/items/{archive_id}

# Filter by federal state
GET https://api.deutsche-digitale-bibliothek.de/search
  ?facetValues[]=federalState-Nordrhein-Westfalen
  &sector=arc_archives
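
As a concrete sketch, the federal-state filter above could be exercised like this with the requests library (the Bearer header follows the registration notes above; the exact auth scheme should be verified against the DDB API documentation):

import os
import requests

# Hypothetical probe: count NRW archives (auth scheme assumed, verify in DDB docs)
resp = requests.get(
    "https://api.deutsche-digitale-bibliothek.de/search",
    params={
        "query": "*",
        "sector": "arc_archives",
        "facetValues[]": "federalState-Nordrhein-Westfalen",
        "rows": 1,
    },
    headers={"Authorization": f"Bearer {os.environ['DDB_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("numberOfResults"))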

Response Format

{
  "numberOfResults": 10000,
  "results": [
    {
      "id": "archive_id",
      "title": "Archive Name",
      "label": "Archive Type",
      "federalState": "Nordrhein-Westfalen",
      "place": "City",
      "latitude": "51.5",
      "longitude": "7.2",
      "isil": "DE-123" // if available
    }
  ]
}
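
For downstream type safety, the record shape above can be captured as a TypedDict (a sketch; total=False because fields such as isil appear only when the archive has one, and all names should be verified against live responses):

from typing import TypedDict

class ArchiveRecord(TypedDict, total=False):
    """Shape of one result record, per the example response above."""
    id: str
    title: str
    label: str
    federalState: str
    place: str
    latitude: str   # coordinates arrive as strings in the example
    longitude: str
    isil: str       # present only if the archive has an ISIL code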

Harvest Strategy - Three-Phase Approach

Phase 1: DDB API Harvest (Preferred)

Method: Use the DDB API to fetch all archive records
Estimated time: 1-2 hours
Coverage: ~10,000-20,000 archives

# Harvester sketch (DDB API)
import os
import time

import requests

def harvest_archivportal_d():
    api_key = os.environ["DDB_API_KEY"]  # generated in "Meine DDB" (see above)
    archives = []
    offset = 0
    batch_size = 100

    while True:
        response = requests.get(
            "https://api.deutsche-digitale-bibliothek.de/search",
            params={
                "query": "*",
                "sector": "arc_archives",
                "rows": batch_size,
                "offset": offset,
            },
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()

        results = response.json().get("results", [])
        archives.extend(results)

        if len(results) < batch_size:
            break  # final (partial) page reached

        offset += batch_size
        time.sleep(1.0)  # rate limit: 1 request/second (see Risk 1)

    return archives
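
Usage could then be a short driver that persists the raw data (the file name follows the Output Files convention below):

import json

archives = harvest_archivportal_d()
with open("archivportal_d_raw_20251119.json", "w", encoding="utf-8") as f:
    json.dump(archives, f, ensure_ascii=False, indent=2)
print(f"Harvested {len(archives)} archive records")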

Phase 2: Web Scraping Fallback (if API limited)

Method: Scrape archive list pages
Estimated time: 3-4 hours
Coverage: Same as API

# Scrape archive listing pages
# URL: https://www.archivportal-d.de/struktur?page=X
# Pagination: ~400-800 pages (10-20 results per page)
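
Should scraping become necessary, a minimal sketch follows; the CSS selector and stop conditions are placeholders that must be verified against the live markup:

import time
import requests
from bs4 import BeautifulSoup

def scrape_struktur_pages(max_pages=800):
    archives = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://www.archivportal-d.de/struktur",
            params={"page": page},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # ".institution-item" is a placeholder selector; inspect the real
        # page structure before relying on it
        items = soup.select(".institution-item")
        if not items:
            break  # ran past the last page
        archives.extend(item.get_text(strip=True) for item in items)
        time.sleep(1.0)  # same 1 req/s politeness as the API path
    return archives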

Phase 3: Regional Portal Enrichment (OPTIONAL)

Method: Scrape individual state portals for detailed metadata
Estimated time: 8-12 hours (16 states)
Coverage: Enrichment only (addresses, phone numbers, etc.)

State Portals: archive.nrw.de (North Rhine-Westphalia) and the equivalent portals of the other 15 federal states.


Data Integration Plan

Step 1: Harvest Archivportal-D

  • Fetch all ~10,000-20,000 archive records via API
  • Store in JSON format
  • Include: name, location, federal state, archive type, ISIL (if present)

Step 2: Cross-Reference with ISIL Dataset

# Match archives by ISIL code
for archive in archivportal_archives:
    if archive.get('isil'):
        isil_match = find_in_isil_dataset(archive['isil'])
        if isil_match:
            archive['isil_data'] = isil_match  # Merge metadata
        else:
            archive['new_isil_discovery'] = True
    else:
        archive['no_isil_code'] = True  # New discovery
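
The find_in_isil_dataset helper above is assumed; indexing the existing ISIL dataset by code makes each lookup O(1). A sketch (the input file layout, a list of records with an isil field, is an assumption):

import json

def load_isil_index(path="german_isil_complete_20251119_134939.json"):
    """Index ISIL records by code for O(1) lookups (record layout assumed)."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return {rec["isil"]: rec for rec in records if rec.get("isil")}

isil_index = load_isil_index()

def find_in_isil_dataset(isil_code):
    return isil_index.get(isil_code)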

Step 3: Identify New Discoveries

  • Archives in Archivportal-D without ISIL codes
  • Archives in Archivportal-D with ISIL codes not in our dataset
  • Expected: ~5,000-10,000 new archives

Step 4: Merge Datasets

# Unified German GLAM Dataset
- id: unique_id
  name: Institution name
  institution_type: ARCHIVE/LIBRARY/MUSEUM
  isil: ISIL code (if available)
  location:
    city: City
    state: Federal state
    coordinates: [lat, lon]
  archive_type: Sector category (if archive)
  data_sources:
    - ISIL_REGISTRY (Tier 1)
    - ARCHIVPORTAL_D (Tier 2)
  provenance:
    tier: TIER_1 or TIER_2
    extraction_date: ISO timestamp
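
A sketch of assembling one unified record under this layout (the helper name and tier rules are illustrative, not final):

from datetime import datetime, timezone

def build_unified_record(archive, isil_match=None):
    """Assemble one unified record following the schema sketched above."""
    sources, tier = ["ARCHIVPORTAL_D"], "TIER_2"
    if isil_match:
        sources.insert(0, "ISIL_REGISTRY")
        tier = "TIER_1"  # ISIL-backed records keep Tier 1 provenance
    return {
        "id": archive["id"],
        "name": archive.get("title"),
        "institution_type": "ARCHIVE",
        "isil": archive.get("isil"),
        "location": {
            "city": archive.get("place"),
            "state": archive.get("federalState"),
            "coordinates": [archive.get("latitude"), archive.get("longitude")],
        },
        "archive_type": archive.get("label"),
        "data_sources": sources,
        "provenance": {
            "tier": tier,
            "extraction_date": datetime.now(timezone.utc).isoformat(),
        },
    }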

Step 5: Quality Validation

  • Check for duplicates (same name + city; see the fuzzy-matching sketch after this list)
  • Verify ISIL code matches
  • Geocode addresses (Nominatim API)
  • Validate institution types
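
The duplicate check can start from normalized name + city fuzzy matching with the standard library (a sketch; the 0.9 threshold is a starting assumption to tune on real data):

from difflib import SequenceMatcher

def _normalize(text):
    return " ".join((text or "").lower().split())

def is_probable_duplicate(a, b, threshold=0.9):
    """Flag two records whose normalized name+city are near-identical."""
    key_a = f"{_normalize(a.get('name'))}|{_normalize(a.get('city'))}"
    key_b = f"{_normalize(b.get('name'))}|{_normalize(b.get('city'))}"
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold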

Expected Results

Before Integration

Source          Institutions   Archives        Coverage
ISIL Registry   16,979         ~2,000-3,000    20-30% of archives
Total           16,979         ~2,500          Partial

After Integration

Source                  Institutions   Archives   Coverage
ISIL Registry           16,979         ~2,500     Core authoritative data
Archivportal-D          ~12,000        ~12,000    Complete archive coverage
Merged (deduplicated)   ~25,000        ~12,000    ~100% archives

Coverage Improvement: NRW Example

  • Before: 301 archives (63%)
  • After: ~477 archives (100%)
  • Gain: +176 archives (+58%)

Timeline and Resources

Time Estimate

  • API registration: 10 minutes
  • API harvester development: 2 hours
  • Data harvest: 1-2 hours
  • Cross-referencing: 1 hour
  • Deduplication: 1 hour
  • Validation: 2 hours
  • Documentation: 1 hour
  • Total: ~8-10 hours

Required Tools

  • Python 3.9+
  • requests library (API calls)
  • beautifulsoup4 (if web scraping needed)
  • Nominatim API (geocoding; see the sketch after this list)
  • LinkML validator (schema validation)
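
For the Nominatim step, the public instance's usage policy requires a descriptive User-Agent and at most one request per second. A minimal lookup sketch (the User-Agent string is a placeholder):

import time
import requests

def geocode(address):
    """Resolve an address to (lat, lon) via the public Nominatim API."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        headers={"User-Agent": "german-glam-harvester/0.1 (contact@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    time.sleep(1.0)  # Nominatim usage policy: max 1 request/second
    results = resp.json()
    if results:
        return float(results[0]["lat"]), float(results[0]["lon"])
    return None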

Output Files

/data/isil/germany/
  ├── archivportal_d_raw_20251119.json           # Raw API data
  ├── archivportal_d_archives_20251119.json      # Processed archives
  ├── german_isil_complete_20251119_134939.json  # Existing ISIL data
  ├── german_unified_20251119.json               # Merged dataset
  ├── german_unified_20251119.jsonl              # Line-delimited
  ├── german_new_discoveries_20251119.json       # Archives without ISIL
  ├── ARCHIVPORTAL_D_HARVEST_REPORT.md           # Documentation
  └── GERMAN_UNIFIED_STATISTICS.json             # Comprehensive stats

Next Steps (Priority Order)

1. Register for DDB API (Immediate)

2. Develop Archivportal-D Harvester (Today)

  • Script: scripts/scrapers/harvest_archivportal_d.py
  • Use DDB API (preferred) or web scraping (fallback)
  • Implement rate limiting (1 req/second)
  • Add error handling and retry logic (sketched below)
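
A minimal sketch of the pacing and retry behavior (backoff parameters are starting assumptions):

import time
import requests

def get_with_retry(url, *, params=None, headers=None, retries=3):
    """GET with 1 req/s pacing and exponential backoff on transient failures."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, headers=headers, timeout=30)
        except (requests.ConnectionError, requests.Timeout):
            resp = None  # network-level failure: back off and retry below
        if resp is not None and resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # non-transient 4xx errors surface immediately
            time.sleep(1.0)          # rate limit: 1 request/second
            return resp
        time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s
    raise RuntimeError(f"giving up on {url} after {retries} attempts")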

3. Harvest All German Archives (Today)

  • Run harvester for ~10,000-20,000 archives
  • Save raw data + processed records
  • Generate harvest report

4. Cross-Reference and Merge (Today)

  • Match by ISIL codes
  • Identify new discoveries
  • Create unified dataset
  • Generate statistics

5. Validate and Document (Today)

  • Check for duplicates
  • Validate data quality
  • Create comprehensive documentation
  • Update progress reports

6. OPTIONAL: Regional Portal Enrichment (Future)

  • Scrape NRW portal for detailed metadata
  • Repeat for other states
  • Enrich merged dataset with contact info

Success Criteria

Complete German Archive Coverage

  • All ~12,000 German archives harvested
  • 100% of Archivportal-D listings
  • Cross-referenced with ISIL registry

High Data Quality

  • Deduplication complete (< 1% duplicates)
  • Geocoding for 80%+ of archives
  • Institution types classified

Comprehensive Documentation

  • Harvest reports for all sources
  • Integration methodology documented
  • Data quality metrics published

Ready for GLAM Integration

  • LinkML format conversion complete
  • GLAMORCUBESFIXPHDNT taxonomy applied
  • GHCIDs generated

Risk Mitigation

Risk 1: API Rate Limits

  • Mitigation: Implement 1 req/second limit, use batch requests
  • Fallback: Web scraping if API unavailable

Risk 2: Data Quality Issues

  • Mitigation: Validation scripts, manual review of suspicious records
  • Fallback: Flag low-quality records, document issues

Risk 3: Duplicates

  • Mitigation: Fuzzy matching by name+city, ISIL code verification
  • Fallback: Manual review of borderline matches

Risk 4: Missing Metadata

  • Mitigation: Use multiple sources (ISIL + Archivportal-D + regional portals)
  • Fallback: Mark incomplete records, enrich later

References

Primary Sources

  • Archivportal-D: https://www.archivportal-d.de/
  • DDB API: https://api.deutsche-digitale-bibliothek.de/
  • Deutsche Digitale Bibliothek: https://www.deutsche-digitale-bibliothek.de/

Regional Portals

  • archive.nrw.de (North Rhine-Westphalia)

Documentation

  • ISIL Harvest: /data/isil/germany/HARVEST_REPORT.md
  • Comprehensiveness Check: /data/isil/germany/COMPREHENSIVENESS_REPORT.md
  • Archivportal-D Discovery: /data/isil/germany/ARCHIVPORTAL_D_DISCOVERY.md
  • This Plan: /data/isil/germany/COMPLETENESS_PLAN.md

Status: Plan complete, ready for implementation
Priority: HIGH - Critical for 100% German coverage
Next Action: Register for DDB API key, develop harvester
Estimated Completion: Same day (8-10 hours work)


End of Plan