# Thüringen Archives v4.0 Merge Complete

## Executive Summary

Merged the Thüringen archives v4.0 harvest (95.6% metadata completeness) into the German unified dataset v4.

**Result**: German dataset v4 with 20,944 institutions (+9 new Thüringen archives)

## Merge Statistics

- **Total Thüringen archives processed**: 149
- **Duplicates (skipped)**: 140 (94.0%)
- **New additions**: 9 (6.0%)
- **German dataset growth**: 20,935 → 20,944 institutions

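
The figures above follow from three reported counts. A short arithmetic check, using only numbers stated in this report (the variable names are illustrative):

```python
# Counts reported in the merge statistics.
processed = 149          # Thüringen archives in the v4.0 harvest
duplicates = 140         # already present in the German dataset, skipped
previous_total = 20_935  # institutions before the merge

new_additions = processed - duplicates
duplicate_share = round(100 * duplicates / processed, 1)
new_share = round(100 * new_additions / processed, 1)
new_total = previous_total + new_additions

print(f"new: {new_additions} ({new_share}%), duplicates: {duplicate_share}%, total: {new_total}")
```
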
## New Institutions Added

The 9 new institutions not previously in the German dataset:

1. **Kreisarchiv Ilm-Kreis - Altkreis Ilmenau** (Arnstadt)
2. **Landkreis Altenburger Land - Kreisarchiv** (Altenburg)
3. **Landkreis Gotha - Kreisarchiv** (Gotha)
4. **Landkreis Eichsfeld - Kreisarchiv** (Heilbad Heiligenstadt)
5. **Landkreis Greiz - Kreisarchiv** (Greiz)
6. **EKM - Landeskirchenarchiv Eisenach** (Eisenach)
7. **Bundesarchiv - Stasi-Unterlagen-Archiv Gera** (Gera)
8. **Bundesarchiv - Stasi-Unterlagen-Archiv Erfurt** (Erfurt)
9. **Bundesarchiv - Stasi-Unterlagen-Archiv Suhl** (Suhl)

## Metadata Enrichment (v4.0)

Field coverage across the 149 Thüringen institutions after the v4.0 harvest:

### Contact Information
- **Email**: 98.7% coverage
- **Phone**: 99.3% coverage
- **Fax**: Available where provided
- **Website**: Comprehensive coverage

### Location Data
- **Physical addresses**: 100% coverage (vs 0% in v2.0) ✅
- **Street addresses**: Complete with postal codes
- **Structured format**: organization, street, postal_code, city

### Administrative Metadata
- **Directors**: 96% coverage (vs 0% in v2.0) ✅
- **Opening hours**: 99.3% coverage (vs 0% in v2.0) ✅
- **Format**: Structured day/time information

### Collection Metadata
- **Collection size**: 91.3% coverage (e.g., "1.300,0 lfm", i.e. 1,300 linear metres)
- **Temporal coverage**: Time periods (e.g., "Mitte 17.Jh. - dato", i.e. mid-17th century to the present)

### Historical Context
- **Archive histories**: 84.6% coverage (vs 0% in v2.0) ✅
- **Format**: Narrative descriptions (truncated to 2000 chars)

## Sample Enriched Record

```yaml
name: Kreisarchiv Ilm-Kreis - Altkreis Ilmenau
institution_type: ARCHIVE
locations:
  - city: Arnstadt
    region: Thüringen
    country: DE
    street_address: Ichtershäuser Str.40
    postal_code: 99310

contact:
  email: c.zentgraf@ilm-kreis.de
  phone: +49(0)3628 738 217
  website: https://landesarchiv.thueringen.de/...

administrative:
  director: Claudia Zentgraf
  opening_hours: |
    Di 9:00-12:00 und 13.00-18.00 Uhr
    Do 9:00-12:00 und 13.00-14.30 Uhr
    und nach Vereinbarung

collections:
  - collection_size: 1.300,0 lfm
    temporal_coverage: Mitte 17.Jh. - dato

description: |
  Mit der Kreisgründung im Jahr 1952 begann auch die Entwicklung und
  der Aufbau des Kreisarchivs Ilmenau. Deshalb erfolgten die ersten
  Aktenüberlieferungen erst am Anfang der 50iger Jahre...
```

## Metadata Completeness Comparison

| Field | v2.0 | v4.0 | Change (percentage points) |
|-------|------|------|----------------------------|
| Email | 98.7% | 98.7% | Maintained |
| Phone | 99.3% | 99.3% | Maintained |
| **Physical address** | **0%** | **100%** | **+100 pp** 🚀 |
| **Director** | **0%** | **96%** | **+96 pp** 🚀 |
| **Opening hours** | **0%** | **99.3%** | **+99.3 pp** 🚀 |
| Collection size | 91.3% | 91.3% | Maintained |
| **Archive history** | **0%** | **84.6%** | **+84.6 pp** 🚀 |
| **Overall** | **60%** | **95.6%** | **+35.6 pp** 🚀 |

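
Per-field coverage of the kind reported above can be computed by counting records with a non-empty value at a given path. The sketch below is a minimal illustration, not the script used for this report: `get_path`, `is_filled`, and `coverage` are hypothetical helpers, the dotted-path notation is an assumption, and the sketch does not attempt to reproduce how the "Overall" figure was weighted.

```python
def get_path(record, path):
    """Collect all values found at a dotted path such as 'contact.email'.

    Lists along the way (e.g. 'locations') are searched item by item.
    """
    values = [record]
    for key in path.split("."):
        next_values = []
        for value in values:
            candidates = value if isinstance(value, list) else [value]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


def is_filled(value):
    """Treat None and empty strings/containers as missing."""
    return value not in (None, "", [], {})


def coverage(records, path):
    """Percentage of records with at least one non-empty value at `path`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if any(is_filled(v) for v in get_path(r, path)))
    return round(100 * filled / len(records), 1)


# Made-up records, for illustration only.
records = [
    {"contact": {"email": "a@example.org"}, "locations": [{"street_address": "Musterstr. 1"}]},
    {"contact": {"email": ""}, "locations": [{"city": "Erfurt"}]},
    {"locations": []},
    {"contact": {"email": "b@example.org"}, "locations": [{"street_address": "Beispielweg 2"}]},
]
email_coverage = coverage(records, "contact.email")
address_coverage = coverage(records, "locations.street_address")
```
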
## Technical Implementation

### Deduplication Strategy
- **Fuzzy name matching**: 90% similarity threshold
- **City confirmation**: Bonus matching for location overlap
- **Result**: 140 of 149 harvested archives (94%) matched existing records and were skipped as duplicates

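
The matching rule can be sketched as follows. This is an illustration under stated assumptions, not the merge script itself: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure, the size of the city bonus (`CITY_BONUS`) is invented, and records are simplified to flat `name`/`city` keys. Only the 90% threshold comes from this report.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.90  # similarity threshold stated in the report
CITY_BONUS = 0.05      # assumption: the actual bonus value is not documented


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.casefold().split())


def match_score(candidate, existing):
    """Name similarity in [0, 1], plus a small bonus when the cities agree."""
    score = SequenceMatcher(
        None, normalize(candidate["name"]), normalize(existing["name"])
    ).ratio()
    city = candidate.get("city")
    if city and normalize(city) == normalize(existing.get("city", "")):
        score = min(1.0, score + CITY_BONUS)
    return score


def find_duplicate(candidate, existing_records):
    """Return the best-matching existing record, or None if below the threshold."""
    best = max(existing_records, key=lambda r: match_score(candidate, r), default=None)
    if best is not None and match_score(candidate, best) >= NAME_THRESHOLD:
        return best
    return None


existing = [
    {"name": "Stadtarchiv  erfurt", "city": "Erfurt"},
    {"name": "Universitätsarchiv Jena", "city": "Jena"},
]
hit = find_duplicate({"name": "Stadtarchiv Erfurt", "city": "Erfurt"}, existing)
miss = find_duplicate({"name": "Carl Zeiss Archiv", "city": "Jena"}, existing)
```

A shared city alone is not enough to clear the threshold here: the bonus only tips borderline name matches, which is one plausible reading of "bonus matching for location overlap".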
### Data Preservation
- **Existing ISIL codes**: Preserved from earlier harvests
- **Coordinate data**: Maintained from previous versions
- **Quality tier**: TIER_2_VERIFIED (web scraping)

### New Data Structures
```python
# v4.0 adds these optional fields. Each value below names the expected type;
# `record` stands in for a single institution record.
record = {}

record['contact'] = {
    'email': str,
    'phone': str,
    'fax': str,
    'website': str,
}

record['administrative'] = {
    'director': str,
    'opening_hours': str,
}

record['collections'] = [{
    'collection_size': str,
    'temporal_coverage': str,
}]

record['description'] = str  # Archive history
```

## Files Generated

- **Output**: `data/isil/germany/german_institutions_unified_v4_20251120_113920.json`
- **Size**: 39.4 MB
- **Total institutions**: 20,944

## Validation Status

✅ **Merge completed successfully**
✅ **Rich metadata preserved**
✅ **Sample records verified**
⏳ **Full validation pending** (next step)

## Next Steps

1. **Validate enriched records** (spot-check 5 archives)
   - Stadtarchiv Erfurt
   - Landesarchiv Thüringen Altenburg
   - Carl Zeiss Archiv
   - Universitätsarchiv Jena
   - Bistumsarchiv Erfurt

2. **Continue German harvest**
   - Target: Archivportal-D (national aggregator)
   - Expected: ~2,500-3,000 archives
   - Method: API-based harvest

3. **Regional portal targets**
   - Bavaria (Bayern)
   - Baden-Württemberg
   - Hessen

## Session Context

- **Session date**: 2025-11-20
- **Previous version**: v3.0 (20,935 institutions)
- **Current version**: v4.0 (20,944 institutions)
- **Source harvest**: Thüringen v4.0 (target: 100% metadata completeness)
- **Extraction method**: DOM debugging + comprehensive detail page scraping

## Technical Notes

### DOM Debugging Success
The v4.0 harvest reached 95.6% completeness once the scraper handled a wrapper-div pattern on the detail pages:
- H4 headings are wrapped in otherwise empty divs
- Content sits in `parent.nextElementSibling` (not `h4.nextElementSibling`)
- Applied to 4 major fields: addresses, directors, opening hours, archive histories

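
The sibling lookup described above can be reproduced with the standard library. The sketch below is an illustration only: the markup in `SNIPPET` and the `Anschrift` label are made up to imitate the wrapper-div pattern, the helper names are hypothetical, and `xml.dom.minidom` (which requires well-formed markup) stands in for whatever DOM the actual scraper used.

```python
from xml.dom import minidom

# Hypothetical, well-formed fragment imitating the wrapper-div pattern.
SNIPPET = """
<section>
  <div><h4>Anschrift</h4></div>
  <div>Ichtershäuser Str.40, 99310 Arnstadt</div>
</section>
"""


def next_element_sibling(node):
    """Equivalent of DOM nextElementSibling: skip text and whitespace nodes."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


def text_of(node):
    """Concatenate all text beneath a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def field_after_heading(doc, label):
    """Return the text of the block that follows the heading's wrapper div."""
    for h4 in doc.getElementsByTagName("h4"):
        if text_of(h4).strip() == label:
            # The h4 has no element sibling of its own; the content sits
            # next to the wrapper div, i.e. next to the heading's parent.
            content = next_element_sibling(h4.parentNode)
            return text_of(content).strip() if content is not None else None
    return None


doc = minidom.parseString(SNIPPET)
heading = doc.getElementsByTagName("h4")[0]
address = field_after_heading(doc, "Anschrift")
```

In this fragment `next_element_sibling(heading)` is `None`, which is the failure mode the fix addresses; stepping up to the parent first yields the address block.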
### Merge Script Updates
- Updated input paths to v3.0 dataset and v4.0 harvest
- Enhanced conversion function to handle rich metadata
- Added contact, administrative, collections, description fields
- Preserved backward compatibility with existing records

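
A conversion step of this kind might look like the sketch below. It is a hedged illustration rather than the actual merge script: the flat input keys (including `archive_history`) and the function name `convert_harvest_record` are assumptions. The output layout follows the sample record and the optional v4.0 fields listed earlier; each optional block is added only when data exists, so records without rich metadata keep their earlier shape.

```python
MAX_DESCRIPTION_CHARS = 2000  # truncation limit stated in this report


def pick(source, keys):
    """Copy only the keys that carry a non-empty value."""
    return {key: source[key] for key in keys if source.get(key)}


def convert_harvest_record(harvested):
    """Map a harvested archive (hypothetical flat keys) to the unified layout."""
    record = {
        "name": harvested["name"],
        "institution_type": "ARCHIVE",
        "locations": [
            {"city": harvested.get("city"), "region": "Thüringen", "country": "DE"}
        ],
    }

    contact = pick(harvested, ("email", "phone", "fax", "website"))
    if contact:
        record["contact"] = contact

    administrative = pick(harvested, ("director", "opening_hours"))
    if administrative:
        record["administrative"] = administrative

    collection = pick(harvested, ("collection_size", "temporal_coverage"))
    if collection:
        record["collections"] = [collection]

    history = harvested.get("archive_history")
    if history:
        record["description"] = history[:MAX_DESCRIPTION_CHARS]

    return record


# Made-up inputs, for illustration only.
minimal = convert_harvest_record({"name": "Beispielarchiv", "city": "Gera"})
rich = convert_harvest_record({
    "name": "Beispielarchiv",
    "city": "Gera",
    "email": "info@example.org",
    "director": "N. N.",
    "collection_size": "100 lfm",
    "archive_history": "x" * 2500,
})
```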

---

**Status**: ✅ COMPLETE
**Quality**: 95.6% metadata completeness
**Impact**: +35.6 percentage points over v2.0