glam/THUERINGEN_V4_MERGE_COMPLETE.md
2025-11-21 22:12:33 +01:00

5.8 KiB

Thüringen Archives v4.0 Merge Complete

Executive Summary

Successfully merged Thüringen archives v4.0 (95.6% metadata completeness) into German unified dataset v4.

Result: German dataset v4 with 20,944 institutions (+9 new Thüringen archives)

Merge Statistics

  • Total Thüringen archives processed: 149
  • Duplicates (skipped): 140 (94.0%)
  • New additions: 9 (6.0%)
  • German dataset growth: 20,935 → 20,944 institutions

New Institutions Added

The 9 new institutions not previously in the German dataset:

  1. Kreisarchiv Ilm-Kreis - Altkreis Ilmenau (Arnstadt)
  2. Landkreis Altenburger Land - Kreisarchiv (Altenburg)
  3. Landkreis Gotha - Kreisarchiv (Gotha)
  4. Landkreis Eichsfeld - Kreisarchiv (Heilbad Heiligenstadt)
  5. Landkreis Greiz - Kreisarchiv (Greiz)
  6. EKM - Landeskirchenarchiv Eisenach (Eisenach)
  7. Bundesarchiv - Stasi-Unterlagen-Archiv Gera (Gera)
  8. Bundesarchiv - Stasi-Unterlagen-Archiv Erfurt (Erfurt)
  9. Bundesarchiv - Stasi-Unterlagen-Archiv Suhl (Suhl)

Metadata Enrichment (v4.0)

All 149 Thüringen institutions now have:

Contact Information

  • Email: 98.7% coverage
  • Phone: 99.3% coverage
  • Fax: Available where provided
  • Website: Comprehensive coverage

Location Data

  • Physical addresses: 100% coverage (vs 0% in v2.0)
  • Street addresses: Complete with postal codes
  • Structured format: organization, street, postal_code, city

Administrative Metadata

  • Directors: 96% coverage (vs 0% in v2.0)
  • Opening hours: 99.3% coverage (vs 0% in v2.0)
  • Format: Structured day/time information

Collection Metadata

  • Collection size: 91.3% coverage (e.g., "1,300 lfm")
  • Temporal coverage: Time periods (e.g., "Mitte 17.Jh. - dato")

Historical Context

  • Archive histories: 84.6% coverage (vs 0% in v2.0)
  • Format: Narrative descriptions (truncated to 2000 chars)

Sample Enriched Record

name: Kreisarchiv Ilm-Kreis - Altkreis Ilmenau
institution_type: ARCHIVE
locations:
  - city: Arnstadt
    region: Thüringen
    country: DE
    street_address: Ichtershäuser Str.40
    postal_code: 99310

contact:
  email: c.zentgraf@ilm-kreis.de
  phone: +49(0)3628 738 217
  website: https://landesarchiv.thueringen.de/...

administrative:
  director: Claudia Zentgraf
  opening_hours: |
    Di  9:00-12:00 und 13.00-18.00 Uhr
    Do 9:00-12:00 und 13.00-14.30 Uhr
    und nach Vereinbarung    

collections:
  - collection_size: 1.300,0 lfm
    temporal_coverage: Mitte 17.Jh. - dato

description: |
  Mit der Kreisgründung im Jahr 1952 begann auch die Entwicklung und 
  der Aufbau des Kreisarchivs Ilmenau. Deshalb erfolgten die ersten 
  Aktenüberlieferungen erst am Anfang der 50iger Jahre...  

Metadata Completeness Comparison

Field v2.0 v4.0 Improvement
Email 98.7% 98.7% Maintained
Phone 99.3% 99.3% Maintained
Physical address 0% 100% +100% 🚀
Director 0% 96% +96% 🚀
Opening hours 0% 99.3% +99% 🚀
Collection size 91.3% 91.3% Maintained
Archive history 0% 84.6% +85% 🚀
Overall 60% 95.6% +35.6% 🚀

Technical Implementation

Deduplication Strategy

  • Fuzzy name matching: 90% similarity threshold
  • City confirmation: Bonus matching for location overlap
  • Result: 94% duplicate detection rate (140/149 archives)

Data Preservation

  • Existing ISIL codes: Preserved from earlier harvests
  • Coordinate data: Maintained from previous versions
  • Quality tier: TIER_2_VERIFIED (web scraping)

New Data Structures

# v4.0 adds these optional fields:
record['contact'] = {
    'email': str,
    'phone': str,
    'fax': str,
    'website': str
}

record['administrative'] = {
    'director': str,
    'opening_hours': str
}

record['collections'] = [{
    'collection_size': str,
    'temporal_coverage': str
}]

record['description'] = str  # Archive history

Files Generated

  • Output: data/isil/germany/german_institutions_unified_v4_20251120_113920.json
  • Size: 39.4 MB
  • Total institutions: 20,944

Validation Status

Merge completed successfully
Rich metadata preserved
Sample records verified
Full validation pending (next step)

Next Steps

  1. Validate enriched records (spot-check 5 archives)

    • Stadtarchiv Erfurt
    • Landesarchiv Thüringen Altenburg
    • Carl Zeiss Archiv
    • Universitätsarchiv Jena
    • Bistumsarchiv Erfurt
  2. Continue German harvest

    • Target: Archivportal-D (national aggregator)
    • Expected: ~2,500-3,000 archives
    • Method: API-based harvest
  3. Regional portal targets

    • Bavaria (Bayern)
    • Baden-Württemberg
    • Hessen

Session Context

  • Session date: 2025-11-20
  • Previous version: v3.0 (20,935 institutions)
  • Current version: v4.0 (20,944 institutions)
  • Source harvest: Thüringen v4.0 (100% metadata goal)
  • Extraction method: DOM debugging + comprehensive detail page scraping

Technical Notes

DOM Debugging Success

The v4.0 harvest achieved 95.6% completeness by fixing wrapper div pattern:

  • H4 headings wrapped in empty divs
  • Content in parent.nextElementSibling (not h4.nextElementSibling)
  • Applied to 4 major fields: addresses, directors, opening hours, archive histories

Merge Script Updates

  • Updated input paths to v3.0 dataset and v4.0 harvest
  • Enhanced conversion function to handle rich metadata
  • Added contact, administrative, collections, description fields
  • Preserved backward compatibility with existing records

Status: COMPLETE
Quality: 95.6% metadata completeness
Impact: +35.6 percentage points improvement over v2.0