Thüringen Archives v4.0 Merge Complete
Executive Summary
Successfully merged Thüringen archives v4.0 (95.6% metadata completeness) into German unified dataset v4.
Result: German dataset v4 with 20,944 institutions (+9 new Thüringen archives)
Merge Statistics
- Total Thüringen archives processed: 149
- Duplicates (skipped): 140 (94.0%)
- New additions: 9 (6.0%)
- German dataset growth: 20,935 → 20,944 institutions
New Institutions Added
The 9 new institutions not previously in the German dataset:
- Kreisarchiv Ilm-Kreis - Altkreis Ilmenau (Arnstadt)
- Landkreis Altenburger Land - Kreisarchiv (Altenburg)
- Landkreis Gotha - Kreisarchiv (Gotha)
- Landkreis Eichsfeld - Kreisarchiv (Heilbad Heiligenstadt)
- Landkreis Greiz - Kreisarchiv (Greiz)
- EKM - Landeskirchenarchiv Eisenach (Eisenach)
- Bundesarchiv - Stasi-Unterlagen-Archiv Gera (Gera)
- Bundesarchiv - Stasi-Unterlagen-Archiv Erfurt (Erfurt)
- Bundesarchiv - Stasi-Unterlagen-Archiv Suhl (Suhl)
Metadata Enrichment (v4.0)
Field coverage across the 149 Thüringen institutions:
Contact Information
- Email: 98.7% coverage
- Phone: 99.3% coverage
- Fax: Available where provided
- Website: Comprehensive coverage
Location Data
- Physical addresses: 100% coverage (vs 0% in v2.0) ✅
- Street addresses: Complete with postal codes
- Structured format: organization, street, postal_code, city
Administrative Metadata
- Directors: 96% coverage (vs 0% in v2.0) ✅
- Opening hours: 99.3% coverage (vs 0% in v2.0) ✅
- Format: Structured day/time information
Collection Metadata
- Collection size: 91.3% coverage (e.g., "1.300,0 lfm"; lfm = laufende Meter, i.e. linear metres)
- Temporal coverage: Time periods (e.g., "Mitte 17.Jh. - dato", i.e. mid-17th century to present)
Historical Context
- Archive histories: 84.6% coverage (vs 0% in v2.0) ✅
- Format: Narrative descriptions (truncated to 2000 chars)
Sample Enriched Record
name: Kreisarchiv Ilm-Kreis - Altkreis Ilmenau
institution_type: ARCHIVE
locations:
  - city: Arnstadt
    region: Thüringen
    country: DE
    street_address: Ichtershäuser Str.40
    postal_code: 99310
contact:
  email: c.zentgraf@ilm-kreis.de
  phone: +49(0)3628 738 217
  website: https://landesarchiv.thueringen.de/...
administrative:
  director: Claudia Zentgraf
  opening_hours: |
    Di 9:00-12:00 und 13.00-18.00 Uhr
    Do 9:00-12:00 und 13.00-14.30 Uhr
    und nach Vereinbarung
collections:
  - collection_size: 1.300,0 lfm
    temporal_coverage: Mitte 17.Jh. - dato
description: |
  Mit der Kreisgründung im Jahr 1952 begann auch die Entwicklung und
  der Aufbau des Kreisarchivs Ilmenau. Deshalb erfolgten die ersten
  Aktenüberlieferungen erst am Anfang der 50iger Jahre...
Metadata Completeness Comparison
| Field | v2.0 | v4.0 | Improvement |
|---|---|---|---|
| Email | 98.7% | 98.7% | Maintained |
| Phone | 99.3% | 99.3% | Maintained |
| Physical address | 0% | 100% | +100 pp 🚀 |
| Director | 0% | 96% | +96 pp 🚀 |
| Opening hours | 0% | 99.3% | +99.3 pp 🚀 |
| Collection size | 91.3% | 91.3% | Maintained |
| Archive history | 0% | 84.6% | +84.6 pp 🚀 |
| Overall | 60% | 95.6% | +35.6 pp 🚀 |
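The per-field percentages in the table can be reproduced with a small coverage calculation. The sketch below is illustrative and is not the script used for this report: `get_path`, `field_coverage`, and the sample records are hypothetical, and it assumes each record is a dict using the `contact` / `administrative` layout described in this report, with a field counting as covered when it holds a non-empty value.

```python
from typing import Any


def get_path(record: dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'contact.email'; return None if a step is missing."""
    value: Any = record
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def field_coverage(records: list[dict[str, Any]], path: str) -> float:
    """Percentage of records with a non-empty value at the given path (one decimal)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if get_path(r, path) not in (None, "", [], {}))
    return round(100.0 * filled / len(records), 1)


# Hypothetical records, only to exercise the function.
sample = [
    {"name": "Archive A", "contact": {"email": "a@example.org"}},
    {"name": "Archive B", "contact": {"email": "b@example.org"}},
    {"name": "Archive C", "contact": {"email": ""}},
    {"name": "Archive D", "contact": {"email": "d@example.org"}},
]

email_coverage = field_coverage(sample, "contact.email")
director_coverage = field_coverage(sample, "administrative.director")
```

Treating empty strings and empty containers as missing keeps a field from counting as covered when the scraper stored a placeholder instead of real data.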
Technical Implementation
Deduplication Strategy
- Fuzzy name matching: 90% similarity threshold
- City confirmation: Bonus matching for location overlap
- Result: 140 of 149 archives (94.0%) matched existing records and were skipped as duplicates
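The strategy above can be sketched as follows. This is a minimal illustration under stated assumptions, not the merge script itself: the report does not name the fuzzy-matching library, the normalisation steps, or the size of the city bonus, so the sketch uses the standard library's `difflib.SequenceMatcher`, a hypothetical `CITY_BONUS` of 0.05, and flat records with hypothetical `name` and `city` keys.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.90  # 90% similarity threshold from the report
CITY_BONUS = 0.05      # assumed value; the report does not state the bonus size


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not lower the score."""
    return " ".join(text.lower().split())


def match_score(candidate: dict, existing: dict) -> float:
    """Name similarity in [0, 1], plus a small bonus when both records share a city."""
    score = SequenceMatcher(
        None, normalize(candidate["name"]), normalize(existing["name"])
    ).ratio()
    candidate_city = normalize(candidate.get("city", ""))
    if candidate_city and candidate_city == normalize(existing.get("city", "")):
        score += CITY_BONUS
    return min(score, 1.0)


def is_duplicate(candidate: dict, existing_records: list[dict]) -> bool:
    """True when any existing record reaches the similarity threshold."""
    return any(match_score(candidate, e) >= NAME_THRESHOLD for e in existing_records)


# Hypothetical usage: one record already in the dataset, two harvested candidates.
existing = [{"name": "Stadtarchiv Erfurt", "city": "Erfurt"}]
same_archive = {"name": "Stadtarchiv  ERFURT", "city": "Erfurt"}
other_archive = {"name": "Carl Zeiss Archiv", "city": "Jena"}

skip_same = is_duplicate(same_archive, existing)
skip_other = is_duplicate(other_archive, existing)
```

Adding the city bonus to the name score, rather than requiring an exact city match, lets a near-miss on the name still count as a duplicate when the location agrees.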
Data Preservation
- Existing ISIL codes: Preserved from earlier harvests
- Coordinate data: Maintained from previous versions
- Quality tier: TIER_2_VERIFIED (web scraping)
New Data Structures
# v4.0 adds these optional fields:
record['contact'] = {
'email': str,
'phone': str,
'fax': str,
'website': str
}
record['administrative'] = {
'director': str,
'opening_hours': str
}
record['collections'] = [{
'collection_size': str,
'temporal_coverage': str
}]
record['description'] = str # Archive history
Files Generated
- Output: data/isil/germany/german_institutions_unified_v4_20251120_113920.json
- Size: 39.4 MB
- Total institutions: 20,944
Validation Status
✅ Merge completed successfully
✅ Rich metadata preserved
✅ Sample records verified
⏳ Full validation pending (next step)
Next Steps
- Validate enriched records (spot-check 5 archives)
  - Stadtarchiv Erfurt
  - Landesarchiv Thüringen Altenburg
  - Carl Zeiss Archiv
  - Universitätsarchiv Jena
  - Bistumsarchiv Erfurt
- Continue German harvest
  - Target: Archivportal-D (national aggregator)
  - Expected: ~2,500-3,000 archives
  - Method: API-based harvest
- Regional portal targets
  - Bavaria (Bayern)
  - Baden-Württemberg
  - Hessen
Session Context
- Session date: 2025-11-20
- Previous version: v3.0 (20,935 institutions)
- Current version: v4.0 (20,944 institutions)
- Source harvest: Thüringen v4.0 (100% metadata goal)
- Extraction method: DOM debugging + comprehensive detail page scraping
Technical Notes
DOM Debugging Success
The v4.0 harvest reached 95.6% completeness after the scraper was fixed to handle a wrapper-div pattern on the detail pages:
- H4 headings are wrapped in otherwise empty divs
- The content is in parent.nextElementSibling (not h4.nextElementSibling)
- The fix was applied to 4 major fields: addresses, directors, opening hours, archive histories
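The wrapper pattern can be illustrated with a short sketch. It is an approximation under stated assumptions: the markup and heading labels below are hypothetical, the real pages would likely need a tolerant HTML parser, and the standard library's `xml.etree.ElementTree` is used here only because it runs without extra dependencies. Since ElementTree has no `nextElementSibling`, the sketch emulates it with a child-to-parent map.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed markup mimicking the wrapper pattern:
# each h4 sits alone in a div, and the value lives in the div that follows.
SAMPLE = """
<section>
  <div><h4>Anschrift</h4></div>
  <div>Ichtershäuser Str.40, 99310 Arnstadt</div>
  <div><h4>Archivleiter/in</h4></div>
  <div>Claudia Zentgraf</div>
</section>
"""


def field_after_heading(root: ET.Element, heading: str) -> Optional[str]:
    """Return the text of the element that follows the heading's wrapper div."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for h4 in root.iter("h4"):
        if (h4.text or "").strip() != heading:
            continue
        wrapper = parent_of.get(h4)
        container = parent_of.get(wrapper) if wrapper is not None else None
        if container is None:
            return None
        siblings = list(container)
        index = siblings.index(wrapper)
        if index + 1 < len(siblings):
            # Equivalent of parent.nextElementSibling in the browser DOM.
            return "".join(siblings[index + 1].itertext()).strip()
        return None
    return None


root = ET.fromstring(SAMPLE)
address = field_after_heading(root, "Anschrift")
director = field_after_heading(root, "Archivleiter/in")
```

In this markup the `h4` has no siblings inside its wrapper, which is why looking at the heading's own next sibling returns nothing and the lookup has to start from the parent div.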
Merge Script Updates
- Updated input paths to v3.0 dataset and v4.0 harvest
- Enhanced conversion function to handle rich metadata
- Added contact, administrative, collections, description fields
- Preserved backward compatibility with existing records
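A conversion step along these lines could look like the sketch below. It is a hedged illustration rather than the actual merge script: `convert_harvest_record`, the flat input keys, and the `archive_history` key are hypothetical, while the output field names and the 2000-character limit come from this report. Rich fields are added only when data exists, so records without them keep the earlier shape.

```python
from typing import Any

MAX_DESCRIPTION_CHARS = 2000  # truncation limit stated in this report


def pick(source: dict[str, Any], keys: tuple[str, ...]) -> dict[str, Any]:
    """Copy only the keys that hold a non-empty value."""
    return {k: source[k] for k in keys if source.get(k)}


def convert_harvest_record(harvested: dict[str, Any]) -> dict[str, Any]:
    """Build a unified-dataset record, adding the optional v4.0 fields when present."""
    record: dict[str, Any] = {
        "name": harvested["name"],
        "institution_type": "ARCHIVE",
    }
    contact = pick(harvested, ("email", "phone", "fax", "website"))
    if contact:
        record["contact"] = contact
    administrative = pick(harvested, ("director", "opening_hours"))
    if administrative:
        record["administrative"] = administrative
    collection = pick(harvested, ("collection_size", "temporal_coverage"))
    if collection:
        record["collections"] = [collection]
    history = harvested.get("archive_history")  # hypothetical input key
    if history:
        record["description"] = history[:MAX_DESCRIPTION_CHARS]
    return record


# Hypothetical inputs: one bare record and one with rich metadata.
bare = convert_harvest_record({"name": "Archive A"})
rich = convert_harvest_record({
    "name": "Archive B",
    "email": "b@example.org",
    "director": "Jane Doe",
    "collection_size": "100 lfm",
    "archive_history": "x" * 3000,
})
```

Omitting empty sections entirely, instead of writing empty dicts, means consumers written against earlier versions never see keys they do not expect.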
Status: ✅ COMPLETE
Quality: 95.6% metadata completeness
Impact: +35.6 percentage points improvement over v2.0