# Migration Guide: UN/LOCODE to GeoNames for GHCID
**Version**: 1.0
**Migration Date**: 2025-11-05
**Status**: Pre-Migration Planning
---
## Overview
This guide explains how to migrate existing GHCID identifiers from UN/LOCODE-based city codes to GeoNames-based city abbreviations.
**Why migrate?**
- UN/LOCODE covers only 10.5% of Dutch cities (50/475)
- GeoNames provides 100% coverage (475+ Dutch cities)
- Enables global expansion to 60+ countries
**Impact**:
- Some of the 152 existing GHCIDs will change (an estimated 10-30%)
- New GHCIDs generated for 212 previously uncovered institutions
- Overall GHCID coverage increases from 41.8% to >95%
---
## Breaking Changes
### 1. City Code Source Change
**Before** (UN/LOCODE):
```
Amsterdam → AMS (from UN/LOCODE registry)
Rotterdam → RTM (from UN/LOCODE registry)
```
**After** (GeoNames):
```
Amsterdam → AMS (first 3 letters of city name)
Rotterdam → ROT (first 3 letters of city name)
```
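**Impact**: A city code changes whenever its UN/LOCODE differs from the first 3 letters of the city name (e.g. RTM → ROT); codes such as AMS stay the same.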
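The "first 3 letters" rule can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `city_abbreviation` is hypothetical, and the handling of diacritics, spaces and punctuation (folded or skipped before taking letters) is an assumption that this guide does not specify.

```python
import unicodedata


def city_abbreviation(city_name: str) -> str:
    """Return the first three letters of a city name, uppercased.

    Assumption: diacritics are folded to ASCII and non-letters are skipped.
    """
    # Decompose accented characters so the base letter survives the ASCII filter.
    decomposed = unicodedata.normalize("NFKD", city_name)
    letters = [ch for ch in decomposed if ch.isascii() and ch.isalpha()]
    return "".join(letters[:3]).upper()


print(city_abbreviation("Amsterdam"))  # AMS
print(city_abbreviation("Rotterdam"))  # ROT
```

Under this rule, names that share their first three letters produce the same abbreviation, which is why collision handling matters later in the guide.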
### 2. GHCID String Changes
**Example institution**: Science Museum in Rotterdam
**Old GHCID**:
```
NL-ZH-RTM-M-SM
```
**New GHCID**:
```
NL-ZH-ROT-M-SM
```
**Note**: Rotterdam's UN/LOCODE is "RTM", but GeoNames abbreviation is "ROT" (first 3 letters).
### 3. Numeric Hash Changes
**Critical**: The `ghcid_numeric` field (SHA256 hash) will **change** because it's computed from the GHCID string.
**Old hash**:
```python
SHA256("NL-ZH-RTM-M-SM")[:8] → 12345678901234567890
```
**New hash**:
```python
SHA256("NL-ZH-ROT-M-SM")[:8] → 98765432109876543210
```
**Impact**: Any systems referencing `ghcid_numeric` must update references.
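To make the dependency concrete, here is a hedged sketch of how a numeric ID could be derived from a GHCID string. The function name `ghcid_numeric` and the big-endian interpretation of the first 8 digest bytes are assumptions; the project's actual derivation may differ in byte order or truncation.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """Derive a 64-bit integer from the first 8 bytes of SHA-256(ghcid).

    Assumption: the bytes are read as a big-endian unsigned integer.
    """
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


old_numeric = ghcid_numeric("NL-ZH-RTM-M-SM")
new_numeric = ghcid_numeric("NL-ZH-ROT-M-SM")
# Changing a single component of the string yields an unrelated number.
print(old_numeric, new_numeric)
```

Because the hash input is the full GHCID string, any change to the city code produces a completely different numeric value; there is no way to derive the new number from the old one without the mapping table.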
---
## Migration Strategy
### Phase 1: Preparation (Pre-Migration)
#### 1.1 Export Current GHCIDs
```python
# scripts/export_current_ghcids.py
import csv

from glam_extractor.parsers.isil_registry import ISILRegistryParser

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

# Export current GHCIDs to CSV; csv.writer quotes names that contain commas
with open("data/migration/ghcids_before_migration.csv", "w",
          newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["isil_code", "institution_name", "ghcid_current", "ghcid_numeric"])
    for record in records:
        if record.ghcid_current:
            writer.writerow([
                record.identifiers[0].identifier_value,
                record.name,
                record.ghcid_current,
                record.ghcid_numeric,
            ])
```
**Output**: 152 records with current GHCIDs saved to CSV.
#### 1.2 Identify Affected Records
```python
# scripts/identify_ghcid_changes.py
def compare_locodes_vs_geonames():
    """Compare old LOCODE-based GHCIDs with new GeoNames-based GHCIDs."""
    old_records = load_csv("data/migration/ghcids_before_migration.csv")
    changes = []
    for record in old_records:
        old_ghcid = record['ghcid_current']
        new_ghcid = generate_new_ghcid(record)  # Using GeoNames
        if old_ghcid != new_ghcid:
            changes.append({
                'isil_code': record['isil_code'],
                'institution': record['institution_name'],
                'old_ghcid': old_ghcid,
                'new_ghcid': new_ghcid,
                'reason': 'City code changed (LOCODE → GeoNames)'
            })
    # Save report
    save_csv("data/migration/ghcid_changes.csv", changes)
    print(f"Found {len(changes)} GHCIDs that will change")
    return changes
```
**Expected**: 10-30% of existing GHCIDs will change.
#### 1.3 Create Mapping Table
```python
# data/migration/ghcid_mapping.csv
# Maps old GHCID → new GHCID for backward compatibility
old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason
NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,12345678901234567890,98765432109876543210,City code RTM → ROT
NL-ZH-HAG-A-NA,NL-ZH-DEN-A-NA,11111111111111111111,22222222222222222222,City code HAG → DEN
```
**Purpose**: Support lookups by old GHCID during transition period.
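A mapping file in this layout can be loaded into a dictionary for old-to-new lookups. The sketch below is illustrative only: the name `load_mapping` matches the helper used later in this guide, but its signature here (taking an open file object rather than a path) and the skipping of `#` comment lines are assumptions, and the numeric values in the sample are placeholders.

```python
import csv
import io
from typing import Dict, TextIO


def load_mapping(source: TextIO) -> Dict[str, str]:
    """Read old_ghcid → new_ghcid pairs from an open CSV file."""
    # Drop comment lines so the first remaining line is the CSV header.
    rows = (line for line in source if not line.startswith("#"))
    return {row["old_ghcid"]: row["new_ghcid"] for row in csv.DictReader(rows)}


sample = io.StringIO(
    "# Maps old GHCID to new GHCID\n"
    "old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason\n"
    "NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,1,2,City code RTM → ROT\n"
)
mapping = load_mapping(sample)
print(mapping["NL-ZH-RTM-M-SM"])  # NL-ZH-ROT-M-SM
```

Keeping only the two GHCID columns in memory is enough for redirects; the numeric columns would be needed only if consumers look records up by `ghcid_numeric`.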
---
### Phase 2: Implementation (Migration Day)
#### 2.1 Build GeoNames Database
```bash
# Download GeoNames data
wget http://download.geonames.org/export/dump/NL.zip
unzip NL.zip
# Build SQLite database
python scripts/build_geonames_db.py \
  --input NL.txt \
  --output data/reference/geonames.db
# Validate completeness
python scripts/validate_geonames_db.py
# Expected: 475+ Dutch cities loaded
```
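The contents of `build_geonames_db.py` are not shown in this guide. As a rough sketch of what such a build step might do, the code below loads populated places from a GeoNames dump (tab-separated, 19 columns per row) into SQLite. The function name `build_city_table`, the table schema, and the choice to keep only feature class `P` are assumptions, and the sample rows use made-up values.

```python
import sqlite3
from typing import Iterable


def build_city_table(lines: Iterable[str], conn: sqlite3.Connection) -> int:
    """Load populated places from GeoNames dump lines; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities ("
        "geonameid INTEGER PRIMARY KEY, name TEXT, ascii_name TEXT, "
        "country_code TEXT, admin1_code TEXT, population INTEGER)"
    )
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Column 6 is the feature class; "P" marks cities, towns and villages.
        if len(cols) < 19 or cols[6] != "P":
            continue
        rows.append((int(cols[0]), cols[1], cols[2], cols[8], cols[10],
                     int(cols[14] or 0)))
    conn.executemany("INSERT OR REPLACE INTO cities VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def sample_row(geonameid: str, name: str, feature_class: str) -> str:
    """Build a made-up 19-column row (illustrative values only)."""
    cols = [""] * 19
    cols[0], cols[1], cols[2] = geonameid, name, name
    cols[6], cols[8], cols[10], cols[14] = feature_class, "NL", "07", "1000"
    return "\t".join(cols)


conn = sqlite3.connect(":memory:")
loaded = build_city_table(
    [sample_row("1", "Amsterdam", "P"), sample_row("2", "Noord-Holland", "A")], conn
)
print(loaded)  # 1: only the feature-class "P" row is kept
```

For a real build, the lines would come from `open("NL.txt", encoding="utf-8")` and the connection would point at `data/reference/geonames.db`.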
#### 2.2 Update Code
**Update `lookups.py`**:
```python
from typing import Optional

# OLD (deprecated)
def get_city_locode(city: str, country: str = "NL") -> Optional[str]:
    return _NL_CITY_LOCODES.get("cities", {}).get(city)

# NEW (GeoNames-based)
from glam_extractor.geocoding.geonames_lookup import GeoNamesDB

_geonames_db = GeoNamesDB("data/reference/geonames.db")

def get_city_abbreviation(city: str, country: str = "NL") -> Optional[str]:
    """Get 3-letter city abbreviation from GeoNames."""
    result = _geonames_db.get_city_details(city, country)
    if result:
        return result['abbreviation']  # First 3 letters, uppercase
    return None
```
**Update GHCID generation in `isil_registry.py`**:
```python
# OLD
city_locode = get_city_locode(record.plaats, "NL")
if not city_locode:
    return None  # Skip GHCID generation
# NEW
city_abbr = get_city_abbreviation(record.plaats, "NL")
if not city_abbr:
    print(f"Warning: No GeoNames match for city: {record.plaats}")
    return None
```
#### 2.3 Regenerate All GHCIDs
```python
# scripts/regenerate_ghcids_with_geonames.py
from datetime import datetime, timezone

from glam_extractor.parsers.isil_registry import ISILRegistryParser
# generate_ghcid_components, GHCIDHistoryEntry and export_to_jsonld are project
# helpers; their import paths are not shown in this guide.

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")
migration_date = datetime.now(timezone.utc)

for record in records:
    # Remember the old GHCID (empty for previously uncovered institutions)
    old_ghcid = record.ghcid_current
    old_numeric = record.ghcid_numeric

    # Generate new GHCID (using GeoNames)
    new_components = generate_ghcid_components(record)  # Now uses GeoNames
    if new_components is None:
        continue  # Assumes None means "no GeoNames match"; leave record as-is
    new_ghcid = new_components.to_string()
    new_numeric = new_components.to_numeric()

    if old_ghcid and old_ghcid != new_ghcid:
        # Add history entry for change
        history_entry = GHCIDHistoryEntry(
            ghcid=old_ghcid,
            ghcid_numeric=old_numeric,
            valid_from=record.ghcid_history[0].valid_from,  # Original date
            valid_to=migration_date,
            reason="Migrated from UN/LOCODE to GeoNames city abbreviation",
            institution_name=record.name,
            location_city=record.locations[0].city,
            location_country=record.locations[0].country
        )
        record.ghcid_history.append(history_entry)

    # Update current GHCID
    record.ghcid_current = new_ghcid
    record.ghcid_numeric = new_numeric
    # Keep ghcid_original unchanged (immutable)

# Save updated records
export_to_jsonld(records, "output/heritage_custodians_geonames.jsonld")
```
---
### Phase 3: Validation (Post-Migration)
#### 3.1 Verify Coverage Improvement
```python
# scripts/validate_migration.py
def validate_migration():
    """Verify GHCID generation improved after GeoNames migration."""
    # Before migration
    old_stats = load_csv("data/migration/ghcids_before_migration.csv")
    old_coverage = len(old_stats)  # 152 records
    # After migration
    new_records = parse_isil_registry()
    new_coverage = sum(1 for r in new_records if r.ghcid_current)
    print(f"Coverage before: {old_coverage}/364 ({old_coverage/364*100:.1f}%)")
    print(f"Coverage after: {new_coverage}/364 ({new_coverage/364*100:.1f}%)")
    print(f"Improvement: +{new_coverage - old_coverage} records")
    assert new_coverage > old_coverage, "Migration should improve coverage"
    # 345/364 is only 94.8%, so >95% requires at least 346 records
    assert new_coverage >= 346, "Should reach >95% coverage"
```
**Expected results**:
- Before: 152/364 (41.8%)
- After: 346+/364 (>95%)
- Improvement: at least +194 records
#### 3.2 Verify Mapping Table
```python
def verify_ghcid_mapping():
    """Ensure all old GHCIDs map to new GHCIDs."""
    mapping = load_csv("data/migration/ghcid_mapping.csv")
    for row in mapping:
        old_ghcid = row['old_ghcid']
        new_ghcid = row['new_ghcid']
        # Verify new GHCID exists
        record = lookup_by_ghcid(new_ghcid)
        assert record, f"New GHCID not found: {new_ghcid}"
        # Verify history entry exists
        assert any(h.ghcid == old_ghcid for h in record.ghcid_history), \
            f"Old GHCID not in history: {old_ghcid}"
```
#### 3.3 Test Suite
```bash
# Run full test suite
pytest tests/ -v
# Expected: All 150+ tests passing
# New tests for GeoNames integration should be added
```
---
### Phase 4: Backward Compatibility
#### 4.1 Support Old GHCID Lookups
```python
# src/glam_extractor/identifiers/ghcid_lookup.py
from typing import Optional

class GHCIDLookup:
    """Lookup institutions by GHCID (supports old + new identifiers)."""

    def __init__(self, mapping_file: str = "data/migration/ghcid_mapping.csv"):
        self.mapping = load_mapping(mapping_file)

    def lookup(self, ghcid: str) -> Optional[HeritageCustodian]:
        """
        Lookup institution by GHCID.
        Supports:
        - Current GHCID (GeoNames-based)
        - Legacy GHCID (UN/LOCODE-based, via mapping table)
        """
        # Try current GHCID first
        record = self._lookup_by_current_ghcid(ghcid)
        if record:
            return record
        # Check if it's a legacy GHCID
        new_ghcid = self.mapping.get(ghcid)
        if new_ghcid:
            print(f"Info: Old GHCID {ghcid} → new GHCID {new_ghcid}")
            return self._lookup_by_current_ghcid(new_ghcid)
        return None  # Not found
```
**Purpose**: Existing systems can continue using old GHCIDs during the 6-month transition period (until 2026-05-05).
#### 4.2 Deprecation Timeline
| Date | Action |
|------|--------|
| **2025-11-05** | Migration complete, mapping table created |
| **2025-11-05 - 2026-05-05** | Support both old + new GHCIDs (6 months) |
| **2026-02-05** | Send deprecation warnings for old GHCIDs (3 months notice) |
| **2026-05-05** | Remove old GHCID support, mapping table archived |
**Communication**: Notify users via:
- Documentation updates
- Log warnings when old GHCID used
- Email to API consumers (if applicable)
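The timeline above could be enforced in code along these lines. This is a sketch under stated assumptions: the function name `check_legacy_ghcid_use` and the use of `DeprecationWarning` are hypothetical choices, and only the two dates come from the table.

```python
import warnings
from datetime import date

WARNINGS_START = date(2026, 2, 5)  # deprecation warnings begin
SUPPORT_END = date(2026, 5, 5)     # legacy lookups are removed


def check_legacy_ghcid_use(ghcid: str, today: date) -> bool:
    """Return True if a legacy GHCID may still be resolved on `today`."""
    if today >= SUPPORT_END:
        return False
    if today >= WARNINGS_START:
        warnings.warn(
            f"GHCID {ghcid} is UN/LOCODE-based; support ends {SUPPORT_END}",
            DeprecationWarning,
            stacklevel=2,
        )
    return True


print(check_legacy_ghcid_use("NL-ZH-RTM-M-SM", date(2026, 3, 1)))  # True
```

Passing the date in as a parameter, rather than reading the clock inside the function, keeps each phase of the timeline easy to test.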
---
## City Code Changes Reference
### Common Changes
| City | UN/LOCODE (Old) | GeoNames (New) | Change? |
|------|----------------|----------------|---------|
| Amsterdam | AMS | AMS | No ✅ |
| Rotterdam | RTM | ROT | Yes ⚠️ |
| Den Haag | HAG | DEN | Yes ⚠️ |
| Utrecht | UTC | UTR | Yes ⚠️ |
| Eindhoven | EIN | EIN | No ✅ |
| Groningen | GRQ | GRO | Yes ⚠️ |
| Tilburg | TIL | TIL | No ✅ |
| Almere | ALM | ALM | No ✅ |
| Breda | BRE | BRE | No ✅ |
| Nijmegen | NIM | NIJ | Yes ⚠️ |
**Pattern**: A city's code changes when its UN/LOCODE differs from the first 3 letters of the city name.
### Newly Covered Cities
Cities **not in UN/LOCODE** but **now in GeoNames**:
| City | GeoNames Abbr | New GHCIDs Generated |
|------|--------------|---------------------|
| Achtkarspelen | ACH | ~2 institutions |
| Almkerk | ALM | ~1 institution |
| Ameland | AME | ~1 institution |
| Bunschoten | BUN | ~3 institutions |
| ... | ... | ... |
**Total**: +212 institutions can now generate GHCIDs.
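Note that Almere and Almkerk both abbreviate to ALM in the tables above, so different cities can share a city code. Whether two full GHCIDs actually collide depends on the other components. A hedged sketch of a duplicate check follows; the function name `find_duplicate_ghcids` and all ISIL codes and GHCIDs in the sample are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ghcids(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each GHCID shared by several institutions to their ISIL codes."""
    by_ghcid: Dict[str, List[str]] = defaultdict(list)
    for isil_code, ghcid in records:
        by_ghcid[ghcid].append(isil_code)
    return {g: codes for g, codes in by_ghcid.items() if len(codes) > 1}


# Hypothetical (isil_code, ghcid) pairs, for illustration only.
sample_records = [
    ("NL-EXAMPLE-1", "NL-XX-ALM-M-AB"),
    ("NL-EXAMPLE-2", "NL-XX-ALM-M-AB"),
    ("NL-EXAMPLE-3", "NL-XX-AME-M-AB"),
]
print(find_duplicate_ghcids(sample_records))
# {'NL-XX-ALM-M-AB': ['NL-EXAMPLE-1', 'NL-EXAMPLE-2']}
```

Any GHCID reported here would need to go through the collision resolution process listed in the References section.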
---
## Rollback Plan
If migration causes critical issues:
### 1. Immediate Rollback (Emergency)
```bash
# Restore old code
git checkout tags/pre-geonames-migration
# Restore old data
cp data/migration/ghcids_before_migration.csv data/ghcids_current.csv
# Restart services
systemctl restart glam-extractor
```
**Downtime**: <5 minutes
### 2. Keep GeoNames, Revert GHCIDs
If GeoNames database is fine but GHCIDs need adjustment:
```python
# Restore old GHCIDs from backup
restore_ghcids_from_backup("data/migration/ghcids_before_migration.csv")
# Keep GeoNames database for new institutions
# Only update new institutions, leave existing unchanged
```
**Downtime**: <1 hour
---
## Testing Checklist
Pre-Migration Testing:
- [ ] GeoNames database contains all 475 Dutch cities
- [ ] City abbreviation algorithm tested (Amsterdam → AMS)
- [ ] Province code mapping works (Amsterdam → NH)
- [ ] Mapping table created (old GHCID → new GHCID)
Migration Testing:
- [ ] All 150+ tests pass
- [ ] GHCID coverage increased from 41.8% to >95%
- [ ] No duplicate GHCIDs generated
- [ ] History entries correctly capture old GHCIDs
Post-Migration Testing:
- [ ] Old GHCID lookup works (via mapping table)
- [ ] New GHCID lookup works (direct)
- [ ] Export formats valid (JSON-LD, RDF, CSV)
- [ ] No data loss (all original ISIL records present)
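The "no data loss" item can be checked with a simple set difference over ISIL codes. The sketch below is only an illustration: the function name `missing_isil_codes` and the sample codes are hypothetical.

```python
from typing import Iterable, Set


def missing_isil_codes(before: Iterable[str], after: Iterable[str]) -> Set[str]:
    """Return ISIL codes present before the migration but absent afterwards."""
    return set(before) - set(after)


# Hypothetical ISIL codes, for illustration only.
before_codes = ["NL-EXAMPLE-1", "NL-EXAMPLE-2", "NL-EXAMPLE-3"]
after_codes = ["NL-EXAMPLE-1", "NL-EXAMPLE-3"]
print(missing_isil_codes(before_codes, after_codes))  # {'NL-EXAMPLE-2'}
```

An empty result means every original record survived the migration.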
---
## Communication Plan
### Internal Team
**Email template**:
```
Subject: GHCID Migration to GeoNames - Action Required
Team,
On 2025-11-05, we're migrating GHCID city codes from UN/LOCODE to GeoNames.
BENEFITS:
- Coverage increases from 41.8% to >95%
- +212 institutions can now generate GHCIDs
- Enables global expansion to 60+ countries
BREAKING CHANGES:
- Some GHCIDs will change (10-30% of existing)
- Numeric hashes will change (computed from GHCID)
- Mapping table provided for backward compatibility
ACTIONS:
1. Review migration guide: docs/migration/ghcid_locode_to_geonames.md
2. Test with new GHCIDs in staging environment
3. Update any hardcoded GHCID references
4. Plan to migrate to new GHCIDs by 2026-05-05
Questions? Reply to this email or see documentation.
```
### External Users (if applicable)
**API changelog**:
```markdown
## Version 2.0.0 - 2025-11-05
### BREAKING CHANGES
- GHCID city codes now use GeoNames abbreviations instead of UN/LOCODE
- Some existing GHCIDs have changed format (see mapping table)
- `ghcid_numeric` field may have new values
### Upgrade Guide
- Download GHCID mapping table: https://example.org/ghcid_mapping.csv
- Update your local GHCID references
- Old GHCIDs supported until 2026-05-05
### New Features
- GHCID coverage increased to >95% for Dutch institutions
- Support for 475+ Dutch cities (previously 50)
- Global expansion enabled for 60+ countries
```
---
## Success Criteria
Migration is successful if:
- ✅ GHCID coverage >95% (>345/364 ISIL records)
- ✅ All 150+ tests passing
- ✅ Mapping table covers all changed GHCIDs
- ✅ Zero data loss (all ISIL records present)
- ✅ History entries capture old GHCIDs
- ✅ Old GHCID lookup works (6-month compatibility)
- ✅ No duplicate GHCIDs generated
- ✅ GeoNames database <10MB (NL-only)
---
## Post-Migration Tasks
1. **Monitor for issues** (first 2 weeks)
   - Check error logs daily
   - Track old GHCID lookup usage
   - Collect user feedback
2. **Update documentation** (within 1 week)
   - Mark UN/LOCODE approach as deprecated
   - Update all examples to use GeoNames
   - Add migration date to CHANGELOG
3. **Performance monitoring** (first month)
   - Measure city lookup latency (<1ms target)
   - Check database query performance
   - Monitor storage usage
4. **Quarterly GeoNames updates** (ongoing)
   - Download latest GeoNames dump
   - Rebuild database
   - Validate no regressions
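For the latency target in task 3, a small timing harness such as the following could be used. It is a sketch, not project code: the function name `mean_lookup_latency_ms` is hypothetical, and a plain dictionary stands in for the real GeoNames lookup.

```python
import time
from typing import Callable, Iterable


def mean_lookup_latency_ms(
    lookup: Callable[[str], object], cities: Iterable[str], repeats: int = 100
) -> float:
    """Return the mean time per lookup call, in milliseconds."""
    names = list(cities)
    if not names or repeats < 1:
        raise ValueError("need at least one city and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for name in names:
            lookup(name)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(names))


# A dictionary stands in for the real database lookup in this example.
stand_in = {"Amsterdam": "AMS", "Rotterdam": "ROT"}
latency = mean_lookup_latency_ms(stand_in.get, ["Amsterdam", "Rotterdam"])
print(f"{latency:.6f} ms per lookup")
```

Averaging over many repeats smooths out timer noise; for the real check, `lookup` would wrap the actual city lookup call against the SQLite database.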
---
## Lessons Learned (Post-Migration)
_To be filled in after the migration is complete:_
**What went well**:
- TBD
**What could be improved**:
- TBD
**Unexpected issues**:
- TBD
**Recommendations for future migrations**:
- TBD
---
## References
- **GeoNames Integration Design**: `docs/plan/global_glam/08-geonames-integration.md`
- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GeoNames Official**: https://www.geonames.org
---
**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Pre-Migration Planning
**Next Review**: After migration complete