glam/THUERINGEN_V4_MERGE_COMPLETE.md
2025-11-21 22:12:33 +01:00


# Thüringen Archives v4.0 Merge Complete
## Executive Summary
Merged the Thüringen archives v4.0 harvest (95.6% metadata completeness) into the German unified dataset v4.
**Result**: German dataset v4 with 20,944 institutions (+9 new Thüringen archives)
## Merge Statistics
- **Total Thüringen archives processed**: 149
- **Duplicates (skipped)**: 140 (94.0%)
- **New additions**: 9 (6.0%)
- **German dataset growth**: 20,935 → 20,944 institutions
## New Institutions Added
The 9 new institutions not previously in the German dataset:
1. **Kreisarchiv Ilm-Kreis - Altkreis Ilmenau** (Arnstadt)
2. **Landkreis Altenburger Land - Kreisarchiv** (Altenburg)
3. **Landkreis Gotha - Kreisarchiv** (Gotha)
4. **Landkreis Eichsfeld - Kreisarchiv** (Heilbad Heiligenstadt)
5. **Landkreis Greiz - Kreisarchiv** (Greiz)
6. **EKM - Landeskirchenarchiv Eisenach** (Eisenach)
7. **Bundesarchiv - Stasi-Unterlagen-Archiv Gera** (Gera)
8. **Bundesarchiv - Stasi-Unterlagen-Archiv Erfurt** (Erfurt)
9. **Bundesarchiv - Stasi-Unterlagen-Archiv Suhl** (Suhl)
## Metadata Enrichment (v4.0)
Field coverage across the 149 Thüringen institutions:
### Contact Information
- **Email**: 98.7% coverage
- **Phone**: 99.3% coverage
- **Fax**: Available where provided
- **Website**: Comprehensive coverage
### Location Data
- **Physical addresses**: 100% coverage (vs 0% in v2.0) ✅
- **Street addresses**: Complete with postal codes
- **Structured format**: organization, street, postal_code, city
### Administrative Metadata
- **Directors**: 96% coverage (vs 0% in v2.0) ✅
- **Opening hours**: 99.3% coverage (vs 0% in v2.0) ✅
- **Format**: Structured day/time information
### Collection Metadata
- **Collection size**: 91.3% coverage (e.g., "1,300 lfm")
- **Temporal coverage**: Time periods (e.g., "Mitte 17.Jh. - dato")
### Historical Context
- **Archive histories**: 84.6% coverage (vs 0% in v2.0) ✅
- **Format**: Narrative descriptions (truncated to 2000 chars)
## Sample Enriched Record
```yaml
name: Kreisarchiv Ilm-Kreis - Altkreis Ilmenau
institution_type: ARCHIVE
locations:
  - city: Arnstadt
    region: Thüringen
    country: DE
    street_address: Ichtershäuser Str.40
    postal_code: 99310
contact:
  email: c.zentgraf@ilm-kreis.de
  phone: +49(0)3628 738 217
  website: https://landesarchiv.thueringen.de/...
administrative:
  director: Claudia Zentgraf
  opening_hours: |
    Di 9:00-12:00 und 13.00-18.00 Uhr
    Do 9:00-12:00 und 13.00-14.30 Uhr
    und nach Vereinbarung
collections:
  - collection_size: 1.300,0 lfm
    temporal_coverage: Mitte 17.Jh. - dato
description: |
  Mit der Kreisgründung im Jahr 1952 begann auch die Entwicklung und
  der Aufbau des Kreisarchivs Ilmenau. Deshalb erfolgten die ersten
  Aktenüberlieferungen erst am Anfang der 50iger Jahre...
```
## Metadata Completeness Comparison
| Field | v2.0 | v4.0 | Improvement |
|-------|------|------|-------------|
| Email | 98.7% | 98.7% | Maintained |
| Phone | 99.3% | 99.3% | Maintained |
| **Physical address** | **0%** | **100%** | **+100 pp** 🚀 |
| **Director** | **0%** | **96%** | **+96 pp** 🚀 |
| **Opening hours** | **0%** | **99.3%** | **+99.3 pp** 🚀 |
| Collection size | 91.3% | 91.3% | Maintained |
| **Archive history** | **0%** | **84.6%** | **+84.6 pp** 🚀 |
| **Overall** | **60%** | **95.6%** | **+35.6 pp** 🚀 |
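
Per-field coverage figures like those in the table can be computed by counting records with a non-empty value at a given path. The following is a minimal sketch, not the script used for this report: the helper names `get_path` and `field_coverage` and the sample records are hypothetical, and only the nested field names (`contact.email`, `administrative.director`) follow the v4.0 layout.

```python
from typing import Any, Iterable, Mapping


def get_path(record: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'contact.email'; return None if a step is missing."""
    value: Any = record
    for key in path.split("."):
        if not isinstance(value, Mapping) or key not in value:
            return None
        value = value[key]
    return value


def field_coverage(records: Iterable[Mapping[str, Any]], path: str) -> float:
    """Percentage of records whose field at `path` is present and non-empty."""
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for r in records if get_path(r, path) not in (None, "", [], {}))
    return round(100.0 * filled / len(records), 1)


# Hypothetical sample data, for illustration only.
sample = [
    {"contact": {"email": "a@example.org"}, "administrative": {"director": "A"}},
    {"contact": {"email": ""}, "administrative": {}},
    {"contact": {"email": "c@example.org"}},
    {},
]
print(field_coverage(sample, "contact.email"))            # 50.0
print(field_coverage(sample, "administrative.director"))  # 25.0
```

Empty strings and empty containers count as missing here, which is an assumption about how "coverage" was defined.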
## Technical Implementation
### Deduplication Strategy
- **Fuzzy name matching**: 90% similarity threshold
- **City confirmation**: Bonus matching for location overlap
- **Result**: 140 of 149 archives (94.0%) matched existing records and were skipped as duplicates
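
The matching rule above could look roughly like the sketch below. It is an illustration under assumptions, not the merge script itself: the report does not name the similarity library or the size of the city bonus, so `difflib.SequenceMatcher`, the `CITY_BONUS` value, and the flat `name`/`city` keys are all stand-ins.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.90  # the 90% similarity threshold stated in the report
CITY_BONUS = 0.05      # assumed value; the report does not state the bonus size


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(text.lower().split())


def match_score(candidate: dict, existing: dict) -> float:
    """Name similarity in [0, 1], plus a small bonus when both records name the same city."""
    score = SequenceMatcher(
        None, normalize(candidate["name"]), normalize(existing["name"])
    ).ratio()
    city = candidate.get("city")
    if city and normalize(city) == normalize(existing.get("city", "")):
        score += CITY_BONUS
    return min(score, 1.0)


def is_duplicate(candidate: dict, existing_records: list[dict]) -> bool:
    """True if any existing record reaches the threshold against the candidate."""
    return any(match_score(candidate, e) >= NAME_THRESHOLD for e in existing_records)


existing = [{"name": "Stadtarchiv Erfurt", "city": "Erfurt"}]
print(is_duplicate({"name": "Stadtarchiv  erfurt", "city": "Erfurt"}, existing))          # True
print(is_duplicate({"name": "Landkreis Gotha - Kreisarchiv", "city": "Gotha"}, existing))  # False
```

With an additive bonus, a name score slightly below 0.90 can still count as a match when the cities agree. Whether the real script treats the bonus this way is not stated in the report.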
### Data Preservation
- **Existing ISIL codes**: Preserved from earlier harvests
- **Coordinate data**: Maintained from previous versions
- **Quality tier**: TIER_2_VERIFIED (web scraping)
### New Data Structures
```python
# v4.0 adds these optional fields (the values shown are the expected types).
record = {}  # stands in for an existing institution record

record['contact'] = {
    'email': str,
    'phone': str,
    'fax': str,
    'website': str,
}
record['administrative'] = {
    'director': str,
    'opening_hours': str,
}
record['collections'] = [{
    'collection_size': str,
    'temporal_coverage': str,
}]
record['description'] = str  # Archive history
```
## Files Generated
- **Output**: `data/isil/germany/german_institutions_unified_v4_20251120_113920.json`
- **Size**: 39.4 MB
- **Total institutions**: 20,944
## Validation Status
- **Merge completed successfully**
- **Rich metadata preserved**
- **Sample records verified**
- **Full validation pending** (next step)
## Next Steps
1. **Validate enriched records** (spot-check 5 archives)
   - Stadtarchiv Erfurt
   - Landesarchiv Thüringen Altenburg
   - Carl Zeiss Archiv
   - Universitätsarchiv Jena
   - Bistumsarchiv Erfurt
2. **Continue German harvest**
   - Target: Archivportal-D (national aggregator)
   - Expected: ~2,500-3,000 archives
   - Method: API-based harvest
3. **Regional portal targets**
   - Bavaria (Bayern)
   - Baden-Württemberg
   - Hessen
## Session Context
- **Session date**: 2025-11-20
- **Previous version**: v3.0 (20,935 institutions)
- **Current version**: v4.0 (20,944 institutions)
- **Source harvest**: Thüringen v4.0 (100% metadata goal)
- **Extraction method**: DOM debugging + comprehensive detail page scraping
## Technical Notes
### DOM Debugging Success
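The v4.0 harvest reached 95.6% completeness after the scraper was fixed to handle a wrapper-div pattern on the detail pages:
- Each H4 heading is wrapped in a div that contains nothing else
- The field content sits in the wrapper's next sibling, `parent.nextElementSibling`, not in `h4.nextElementSibling`
- The fix was applied to 4 major fields: addresses, directors, opening hours, archive histories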
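
The sibling lookup can be illustrated in Python with the standard library. This is a sketch on a hypothetical, well-formed fragment: the heading labels `Anschrift` and `Leitung` and the helper `field_after_heading` are assumptions, and the real pages were presumably handled through a browser DOM or an HTML parser rather than `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical fragment that mimics the wrapper-div pattern; not copied from the portal.
SNIPPET = """
<section>
  <div><h4>Anschrift</h4></div>
  <div>Ichtershäuser Str.40, 99310 Arnstadt</div>
  <div><h4>Leitung</h4></div>
  <div>Claudia Zentgraf</div>
</section>
""".strip()


def field_after_heading(root: ET.Element, heading: str) -> Optional[str]:
    """Return the text of the element that follows the heading's wrapper div."""
    for container in root.iter():
        children = list(container)
        for index, wrapper in enumerate(children):
            h4 = wrapper.find("h4")  # direct child only: the wrapper holds the heading
            if h4 is not None and (h4.text or "").strip() == heading:
                # Counterpart of parent.nextElementSibling in the browser DOM.
                if index + 1 < len(children):
                    return "".join(children[index + 1].itertext()).strip()
                return None
    return None


root = ET.fromstring(SNIPPET)
print(field_after_heading(root, "Leitung"))    # Claudia Zentgraf
print(field_after_heading(root, "Anschrift"))  # Ichtershäuser Str.40, 99310 Arnstadt
```

Looking at the heading's own next sibling would find nothing in this fragment, because the `h4` is the only child of its wrapper. That matches the 0% coverage reported for these fields before the fix.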
### Merge Script Updates
- Updated input paths to v3.0 dataset and v4.0 harvest
- Enhanced conversion function to handle rich metadata
- Added contact, administrative, collections, description fields
- Preserved backward compatibility with existing records
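
A conversion step along these lines might look like the sketch below. The flat input keys (`street`, `archive_history`, and so on) and the function name `convert_harvested` are hypothetical; only the output layout and the 2000-character truncation come from this report.

```python
from typing import Any

MAX_DESCRIPTION_CHARS = 2000  # truncation limit stated in the report


def convert_harvested(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a flat harvested entry (hypothetical keys) onto the v4.0 record layout."""
    record: dict[str, Any] = {
        "name": raw["name"],
        "institution_type": "ARCHIVE",
        "locations": [{
            "city": raw.get("city"),
            "region": "Thüringen",
            "country": "DE",
            "street_address": raw.get("street"),
            "postal_code": raw.get("postal_code"),
        }],
    }
    # Optional groups are added only when at least one value is present,
    # so records without rich metadata keep their earlier shape.
    contact = {k: raw[k] for k in ("email", "phone", "fax", "website") if raw.get(k)}
    if contact:
        record["contact"] = contact
    administrative = {k: raw[k] for k in ("director", "opening_hours") if raw.get(k)}
    if administrative:
        record["administrative"] = administrative
    collection = {k: raw[k] for k in ("collection_size", "temporal_coverage") if raw.get(k)}
    if collection:
        record["collections"] = [collection]
    if raw.get("archive_history"):
        record["description"] = raw["archive_history"][:MAX_DESCRIPTION_CHARS]
    return record
```

Omitting empty groups rather than writing empty dicts is one way to stay backward compatible with consumers that only know the older fields. The actual script may handle this differently.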
---
**Status**: ✅ COMPLETE
**Quality**: 95.6% metadata completeness
**Impact**: +35.6 percentage points improvement over v2.0