# Migration Guide: UN/LOCODE to GeoNames for GHCID

**Version**: 1.0
**Migration Date**: 2025-11-05
**Status**: Pre-Migration Planning

---

## Overview

This guide explains how to migrate existing GHCID identifiers from UN/LOCODE-based city codes to GeoNames-based city abbreviations.

**Why migrate?**
- UN/LOCODE covers only 10.5% of Dutch cities (50/475)
- GeoNames provides 100% coverage (475+ Dutch cities)
- Enables global expansion to 60+ countries

**Impact**:
- Some of the 152 existing GHCIDs will change (an estimated 10-30%)
- New GHCIDs can be generated for up to 212 previously uncovered institutions
- Overall GHCID coverage increases from 41.8% to >95%

---

## Breaking Changes

### 1. City Code Source Change

**Before** (UN/LOCODE):
```
Amsterdam → AMS (from UN/LOCODE registry)
Rotterdam → RTM (from UN/LOCODE registry)
```

**After** (GeoNames):
```
Amsterdam → AMS (first 3 letters of city name)
Rotterdam → ROT (first 3 letters of city name)
```

**Impact**: City codes change wherever the UN/LOCODE code differs from the first three letters of the city name.
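
The "first 3 letters" rule can be sketched as a small function. This is a minimal illustration, not the project's implementation: the function name `city_abbreviation` is hypothetical, and the handling of accents, spaces, and punctuation is an assumption (the guide only states the rule for plain names).

```python
import unicodedata


def city_abbreviation(city_name: str) -> str:
    """Return the first three letters of a city name, uppercased.

    Assumption: accents are stripped and non-letter characters
    (spaces, hyphens, apostrophes) are skipped. The project's real
    rule for such names may differ.
    """
    # Decompose accented characters so the base letter survives the filter.
    decomposed = unicodedata.normalize("NFKD", city_name)
    letters = [ch for ch in decomposed if ch.isalpha() and ch.isascii()]
    if len(letters) < 3:
        raise ValueError(f"City name too short to abbreviate: {city_name!r}")
    return "".join(letters[:3]).upper()


for name in ("Amsterdam", "Rotterdam", "Den Haag"):
    print(name, "→", city_abbreviation(name))
# Amsterdam → AMS
# Rotterdam → ROT
# Den Haag → DEN
```

Skipping non-letters keeps every code at exactly three letters, which matters for names with very short first words.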

### 2. GHCID String Changes

**Example institution**: Science Museum in Rotterdam

**Old GHCID**:
```
NL-ZH-RTM-M-SM
```

**New GHCID**:
```
NL-ZH-ROT-M-SM
```

**Note**: Rotterdam's UN/LOCODE is "RTM", but its GeoNames abbreviation is "ROT" (first 3 letters).

### 3. Numeric Hash Changes

**Critical**: The `ghcid_numeric` field (SHA256 hash) will **change** because it is computed from the GHCID string.

**Old hash** (placeholder value, not a real digest):
```
SHA256("NL-ZH-RTM-M-SM")[:8] → 12345678901234567890
```

**New hash** (placeholder value, not a real digest):
```
SHA256("NL-ZH-ROT-M-SM")[:8] → 98765432109876543210
```

**Impact**: Any system that references `ghcid_numeric` must update its stored values.
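
To see why a single changed city code produces an unrelated numeric value, the derivation can be sketched with `hashlib`. The byte order and the function name `ghcid_numeric` are assumptions here; the GHCID specification listed under References is the authority on the exact truncation.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """Derive a 64-bit integer from a GHCID string.

    Assumption: the first 8 bytes of the SHA-256 digest are read as a
    big-endian unsigned integer. Check the GHCID specification for the
    actual byte order and truncation before relying on this.
    """
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


old_numeric = ghcid_numeric("NL-ZH-RTM-M-SM")
new_numeric = ghcid_numeric("NL-ZH-ROT-M-SM")
print(old_numeric, new_numeric)
```

An 8-byte truncation always fits in an unsigned 64-bit column, so real values never exceed 18446744073709551615.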

---

## Migration Strategy

### Phase 1: Preparation (Pre-Migration)

#### 1.1 Export Current GHCIDs

```python
# scripts/export_current_ghcids.py

import csv

from glam_extractor.parsers.isil_registry import ISILRegistryParser

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

# Export current GHCIDs to CSV. csv.writer quotes institution names that
# contain commas, which plain string concatenation would not.
with open("data/migration/ghcids_before_migration.csv", "w",
          newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["isil_code", "institution_name", "ghcid_current", "ghcid_numeric"])
    for record in records:
        if record.ghcid_current:
            writer.writerow([
                record.identifiers[0].identifier_value,
                record.name,
                record.ghcid_current,
                record.ghcid_numeric,
            ])
```

**Output**: 152 records with current GHCIDs saved to CSV.

#### 1.2 Identify Affected Records

```python
# scripts/identify_ghcid_changes.py

def compare_locodes_vs_geonames():
    """Compare old LOCODE-based GHCIDs with new GeoNames-based GHCIDs."""

    old_records = load_csv("data/migration/ghcids_before_migration.csv")

    changes = []
    for record in old_records:
        old_ghcid = record['ghcid_current']
        new_ghcid = generate_new_ghcid(record)  # Using GeoNames

        if old_ghcid != new_ghcid:
            changes.append({
                'isil_code': record['isil_code'],
                'institution': record['institution_name'],
                'old_ghcid': old_ghcid,
                'new_ghcid': new_ghcid,
                'reason': 'City code changed (LOCODE → GeoNames)'
            })

    # Save report
    save_csv("data/migration/ghcid_changes.csv", changes)
    print(f"Found {len(changes)} GHCIDs that will change")
    return changes
```

**Expected**: 10-30% of existing GHCIDs will change.
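
The script above calls `load_csv` and `save_csv` without defining them. A minimal sketch of what such helpers might look like, using only the standard library, follows; the signatures are assumptions inferred from how the script calls them, not the project's actual code.

```python
import csv
from pathlib import Path


def load_csv(path: str) -> list[dict[str, str]]:
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def save_csv(path: str, rows: list[dict[str, str]]) -> None:
    """Write a list of dicts to CSV, using the first row's keys as the header."""
    # Create the target directory (e.g. data/migration/) if it is missing.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Using `csv.DictReader` and `csv.DictWriter` keeps institution names that contain commas intact across a save and reload.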

#### 1.3 Create Mapping Table

The mapping file `data/migration/ghcid_mapping.csv` maps each old GHCID to its new GHCID for backward compatibility. The numeric values below are placeholders:

```
old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason
NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,12345678901234567890,98765432109876543210,City code RTM→ROT
NL-ZH-HAG-A-NA,NL-ZH-DEN-A-NA,11111111111111111111,22222222222222222222,City code HAG→DEN
```

**Purpose**: Support lookups by old GHCID during the transition period.

---

### Phase 2: Implementation (Migration Day)

#### 2.1 Build GeoNames Database

```bash
# Download GeoNames data
wget https://download.geonames.org/export/dump/NL.zip
unzip NL.zip

# Build SQLite database
python scripts/build_geonames_db.py \
  --input NL.txt \
  --output data/reference/geonames.db

# Validate completeness
python scripts/validate_geonames_db.py
# Expected: 475+ Dutch cities loaded
```
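
The guide does not show `scripts/build_geonames_db.py` itself. As a rough sketch of what the build step could do, the function below loads populated places from the tab-separated GeoNames dump into SQLite. The table name `cities`, its columns, and the inline abbreviation rule are all assumptions for illustration; only the 19-column dump layout comes from GeoNames.

```python
import csv
import sqlite3

# Column positions in the tab-separated GeoNames dump (19 columns per row).
NAME, ASCIINAME, FEATURE_CLASS, COUNTRY, ADMIN1 = 1, 2, 6, 8, 10


def build_geonames_db(dump_path: str, db_path: str) -> int:
    """Load populated places from a GeoNames dump into SQLite.

    Returns the number of rows inserted. The table layout and the
    abbreviation rule are assumptions made for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS cities")
    conn.execute(
        "CREATE TABLE cities ("
        "geonameid INTEGER PRIMARY KEY, name TEXT, asciiname TEXT, "
        "country TEXT, admin1 TEXT, abbreviation TEXT)"
    )
    inserted = 0
    with open(dump_path, newline="", encoding="utf-8") as f:
        # The dump is unquoted, so disable quote handling.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 19 or row[FEATURE_CLASS] != "P":
                continue  # keep populated places only
            letters = [ch for ch in row[ASCIINAME] if ch.isalpha()]
            if len(letters) < 3:
                continue  # cannot form a 3-letter abbreviation
            conn.execute(
                "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
                (int(row[0]), row[NAME], row[ASCIINAME], row[COUNTRY],
                 row[ADMIN1], "".join(letters[:3]).upper()),
            )
            inserted += 1
    conn.commit()
    conn.close()
    return inserted
```

Filtering on feature class `P` drops rivers, administrative areas, and other non-city features that share the same dump file.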

#### 2.2 Update Code

**Update `lookups.py`**:
```python
from typing import Optional

# OLD (deprecated)
def get_city_locode(city: str, country: str = "NL") -> Optional[str]:
    return _NL_CITY_LOCODES.get("cities", {}).get(city)

# NEW (GeoNames-based)
from glam_extractor.geocoding.geonames_lookup import GeoNamesDB

_geonames_db = GeoNamesDB("data/reference/geonames.db")

def get_city_abbreviation(city: str, country: str = "NL") -> Optional[str]:
    """Get 3-letter city abbreviation from GeoNames."""
    result = _geonames_db.get_city_details(city, country)
    if result:
        return result['abbreviation']  # First 3 letters, uppercase
    return None
```

**Update GHCID generation in `isil_registry.py`**:
```python
# OLD
city_locode = get_city_locode(record.plaats, "NL")
if not city_locode:
    return None  # Skip GHCID generation

# NEW
city_abbr = get_city_abbreviation(record.plaats, "NL")
if not city_abbr:
    print(f"Warning: No GeoNames match for city: {record.plaats}")
    return None
```

#### 2.3 Regenerate All GHCIDs

```python
# scripts/regenerate_ghcids_with_geonames.py
# generate_ghcid_components, GHCIDHistoryEntry and export_to_jsonld are
# project helpers; import them from wherever they live in the codebase.

from glam_extractor.parsers.isil_registry import ISILRegistryParser
from datetime import datetime, timezone

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

migration_date = datetime.now(timezone.utc)

for record in records:
    # Remember the old GHCID (None for previously uncovered records)
    old_ghcid = record.ghcid_current
    old_numeric = record.ghcid_numeric

    # Generate new GHCID (using GeoNames)
    new_components = generate_ghcid_components(record)  # Now uses GeoNames
    if new_components is None:
        # Assumes None is returned when no city match exists
        continue
    new_ghcid = new_components.to_string()
    new_numeric = new_components.to_numeric()

    if old_ghcid and old_ghcid != new_ghcid:
        # Add history entry for change
        history_entry = GHCIDHistoryEntry(
            ghcid=old_ghcid,
            ghcid_numeric=old_numeric,
            valid_from=record.ghcid_history[0].valid_from,  # Original date
            valid_to=migration_date,
            reason="Migrated from UN/LOCODE to GeoNames city abbreviation",
            institution_name=record.name,
            location_city=record.locations[0].city,
            location_country=record.locations[0].country
        )
        record.ghcid_history.append(history_entry)

    # Update current GHCID
    record.ghcid_current = new_ghcid
    record.ghcid_numeric = new_numeric
    # Keep ghcid_original unchanged (immutable)

# Save updated records
export_to_jsonld(records, "output/heritage_custodians_geonames.jsonld")
```

---

### Phase 3: Validation (Post-Migration)

#### 3.1 Verify Coverage Improvement

```python
# scripts/validate_migration.py

def validate_migration():
    """Verify GHCID generation improved after GeoNames migration."""

    # Before migration
    old_stats = load_csv("data/migration/ghcids_before_migration.csv")
    old_coverage = len(old_stats)  # 152 records

    # After migration
    new_records = parse_isil_registry()
    new_coverage = sum(1 for r in new_records if r.ghcid_current)

    print(f"Coverage before: {old_coverage}/364 ({old_coverage/364*100:.1f}%)")
    print(f"Coverage after:  {new_coverage}/364 ({new_coverage/364*100:.1f}%)")
    print(f"Improvement: +{new_coverage - old_coverage} records")

    assert new_coverage > old_coverage, "Migration should improve coverage"
    # 345/364 is only 94.8%; 346 is the smallest count above 95%
    assert new_coverage >= 346, "Should reach >95% coverage"
```

**Expected results**:
- Before: 152/364 (41.8%)
- After: 346+/364 (>95%)
- Improvement: +194 records or more
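
The coverage threshold is easy to get off by one, so it is worth checking the arithmetic directly. With 364 records, a share strictly above 95% requires at least 346 covered records, since 345/364 is about 94.8%. The constant names below are illustrative only.

```python
TOTAL_RECORDS = 364    # ISIL records in the registry
COVERED_BEFORE = 152   # records with a GHCID before migration
TARGET_PERCENT = 95    # coverage must be strictly above this

# Smallest integer count whose share of the total exceeds the target.
# Integer arithmetic avoids floating-point rounding at the boundary.
min_covered = TOTAL_RECORDS * TARGET_PERCENT // 100 + 1

print(f"before: {COVERED_BEFORE / TOTAL_RECORDS:.1%}")
print(f"minimum after: {min_covered} ({min_covered / TOTAL_RECORDS:.1%})")
print(f"minimum improvement: +{min_covered - COVERED_BEFORE}")
# before: 41.8%
# minimum after: 346 (95.1%)
# minimum improvement: +194
```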

#### 3.2 Verify Mapping Table

```python
def verify_ghcid_mapping():
    """Ensure all old GHCIDs map to new GHCIDs."""

    mapping = load_csv("data/migration/ghcid_mapping.csv")

    for row in mapping:
        old_ghcid = row['old_ghcid']
        new_ghcid = row['new_ghcid']

        # Verify new GHCID exists
        record = lookup_by_ghcid(new_ghcid)
        assert record, f"New GHCID not found: {new_ghcid}"

        # Verify history entry exists
        assert any(h.ghcid == old_ghcid for h in record.ghcid_history), \
            f"Old GHCID not in history: {old_ghcid}"
```

#### 3.3 Test Suite

```bash
# Run full test suite
pytest tests/ -v

# Expected: All 150+ tests passing
# New tests for GeoNames integration should be added
```

---

### Phase 4: Backward Compatibility

#### 4.1 Support Old GHCID Lookups

```python
# src/glam_extractor/identifiers/ghcid_lookup.py

from typing import Optional


class GHCIDLookup:
    """Lookup institutions by GHCID (supports old + new identifiers)."""

    def __init__(self, mapping_file: str = "data/migration/ghcid_mapping.csv"):
        self.mapping = load_mapping(mapping_file)

    def lookup(self, ghcid: str) -> Optional[HeritageCustodian]:
        """
        Lookup institution by GHCID.

        Supports:
        - Current GHCID (GeoNames-based)
        - Legacy GHCID (UN/LOCODE-based, via mapping table)
        """
        # Try current GHCID first
        record = self._lookup_by_current_ghcid(ghcid)
        if record:
            return record

        # Check if it's a legacy GHCID
        new_ghcid = self.mapping.get(ghcid)
        if new_ghcid:
            print(f"Info: Old GHCID {ghcid} → new GHCID {new_ghcid}")
            return self._lookup_by_current_ghcid(new_ghcid)

        return None  # Not found
```

**Purpose**: Existing systems can continue using old GHCIDs for 6 months after the migration (until 2026-05-05).
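
`GHCIDLookup` relies on a `load_mapping` helper that the guide does not show. A minimal sketch, assuming the column names from the mapping table in section 1.3 and a plain `dict` as the return type (which is what `self.mapping.get(ghcid)` implies), might look like this:

```python
import csv


def load_mapping(mapping_file: str) -> dict[str, str]:
    """Return {old_ghcid: new_ghcid} from the mapping CSV.

    Assumes the column names shown in section 1.3
    (old_ghcid, new_ghcid, ...). Raises on conflicting rows.
    """
    mapping: dict[str, str] = {}
    with open(mapping_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            old, new = row["old_ghcid"], row["new_ghcid"]
            # The same old GHCID must never point at two different targets.
            if mapping.get(old, new) != new:
                raise ValueError(f"Conflicting targets for {old}")
            mapping[old] = new
    return mapping
```

Failing loudly on conflicting rows is a design choice for this sketch: a legacy identifier that resolves to two institutions is worse than one that does not resolve at all.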

#### 4.2 Deprecation Timeline

| Date | Action |
|------|--------|
| **2025-11-05** | Migration complete, mapping table created |
| **2025-11-05 - 2026-05-05** | Support both old + new GHCIDs (6 months) |
| **2026-02-05** | Send deprecation warnings for old GHCIDs (3 months notice) |
| **2026-05-05** | Remove old GHCID support, mapping table archived |

**Communication**: Notify users via:
- Documentation updates
- Log warnings when an old GHCID is used
- Email to API consumers (if applicable)
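
The timeline above can be encoded so that lookup code decides by date whether a legacy GHCID still resolves and whether to warn. This is only a sketch: the phase names and both function names are hypothetical, and treating 2026-05-05 itself as the first day without support is an assumption.

```python
import warnings
from datetime import date

MIGRATION_DATE = date(2025, 11, 5)
WARNINGS_START = date(2026, 2, 5)
SUPPORT_END = date(2026, 5, 5)


def legacy_ghcid_phase(today: date) -> str:
    """Classify a date against the deprecation timeline."""
    if today < MIGRATION_DATE:
        return "pre-migration"
    if today < WARNINGS_START:
        return "supported"
    if today < SUPPORT_END:
        return "deprecated"
    return "removed"


def check_legacy_lookup(ghcid: str, today: date) -> bool:
    """Return True if a legacy GHCID may still be resolved on `today`."""
    phase = legacy_ghcid_phase(today)
    if phase == "deprecated":
        # Warn during the final three months of the compatibility window.
        warnings.warn(
            f"Legacy GHCID {ghcid} stops resolving on {SUPPORT_END.isoformat()}",
            DeprecationWarning,
            stacklevel=2,
        )
    return phase in ("supported", "deprecated")
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the cutoff behaviour testable before the dates arrive.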

---

## City Code Changes Reference

### Common Changes

| City | UN/LOCODE (Old) | GeoNames (New) | Change? |
|------|----------------|----------------|---------|
| Amsterdam | AMS | AMS | No ✅ |
| Rotterdam | RTM | ROT | Yes ⚠️ |
| Den Haag | HAG | DEN | Yes ⚠️ |
| Utrecht | UTC | UTR | Yes ⚠️ |
| Eindhoven | EIN | EIN | No ✅ |
| Groningen | GRQ | GRO | Yes ⚠️ |
| Tilburg | TIL | TIL | No ✅ |
| Almere | ALM | ALM | No ✅ |
| Breda | BRE | BRE | No ✅ |
| Nijmegen | NIM | NIJ | Yes ⚠️ |

**Pattern**: City codes change wherever UN/LOCODE uses an abbreviation other than the first 3 letters of the city name.

### Newly Covered Cities

Cities **not in UN/LOCODE** but **now in GeoNames**:

| City | GeoNames Abbr | New GHCIDs Generated |
|------|--------------|---------------------|
| Achtkarspelen | ACH | ~2 institutions |
| Almkerk | ALM | ~1 institution |
| Ameland | AME | ~1 institution |
| Bunschoten | BUN | ~3 institutions |
| ... | ... | ... |

**Total**: up to 212 additional institutions can now generate GHCIDs.
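
Distinct cities can share an abbreviation under the new scheme (Almere and Almkerk both map to ALM in the tables above), so the city code alone no longer identifies a place. Whether two full GHCIDs collide depends on the remaining components. A duplicate check along the following lines can flag any clashes after regeneration. The input shape and every sample value are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_ghcids(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group ISIL codes by GHCID and return only GHCIDs used more than once.

    `records` is a list of (isil_code, ghcid) pairs; this shape is an
    assumption for the sketch, not the project's record model.
    """
    by_ghcid: dict[str, list[str]] = defaultdict(list)
    for isil_code, ghcid in records:
        by_ghcid[ghcid].append(isil_code)
    return {g: codes for g, codes in by_ghcid.items() if len(codes) > 1}


# Hypothetical identifiers, chosen only to show one clash.
sample = [
    ("NL-0000000001", "NL-XX-ALM-M-SM"),
    ("NL-0000000002", "NL-XX-ALM-M-SM"),
    ("NL-0000000003", "NL-ZH-ROT-M-SM"),
]
duplicates = find_duplicate_ghcids(sample)
print(duplicates)
```

Any non-empty result should be handled according to the collision resolution document listed under References.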

---

## Rollback Plan

If the migration causes critical issues:

### 1. Immediate Rollback (Emergency)

```bash
# Restore old code
git checkout tags/pre-geonames-migration

# Restore old data
cp data/migration/ghcids_before_migration.csv data/ghcids_current.csv

# Restart services
systemctl restart glam-extractor
```

**Downtime**: <5 minutes

### 2. Keep GeoNames, Revert GHCIDs

If the GeoNames database is fine but GHCIDs need adjustment:

```python
# Restore old GHCIDs from backup
restore_ghcids_from_backup("data/migration/ghcids_before_migration.csv")

# Keep GeoNames database for new institutions
# Only update new institutions, leave existing unchanged
```

**Downtime**: <1 hour

---

## Testing Checklist

Pre-Migration Testing:
- [ ] GeoNames database contains all 475 Dutch cities
- [ ] City abbreviation algorithm tested (Amsterdam → AMS)
- [ ] Province code mapping works (Amsterdam → NH)
- [ ] Mapping table created (old GHCID → new GHCID)

Migration Testing:
- [ ] All 150+ tests pass
- [ ] GHCID coverage increased from 41.8% to >95%
- [ ] No duplicate GHCIDs generated
- [ ] History entries correctly capture old GHCIDs

Post-Migration Testing:
- [ ] Old GHCID lookup works (via mapping table)
- [ ] New GHCID lookup works (direct)
- [ ] Export formats valid (JSON-LD, RDF, CSV)
- [ ] No data loss (all original ISIL records present)

---

## Communication Plan

### Internal Team

**Email template**:
```
Subject: GHCID Migration to GeoNames - Action Required

Team,

On 2025-11-05, we're migrating GHCID city codes from UN/LOCODE to GeoNames.

BENEFITS:
- Coverage increases from 41.8% to >95%
- Up to 212 more institutions can now generate GHCIDs
- Enables global expansion to 60+ countries

BREAKING CHANGES:
- Some GHCIDs will change (10-30% of existing)
- Numeric hashes will change (computed from GHCID)
- Mapping table provided for backward compatibility

ACTIONS:
1. Review migration guide: docs/migration/ghcid_locode_to_geonames.md
2. Test with new GHCIDs in staging environment
3. Update any hardcoded GHCID references
4. Plan to migrate to new GHCIDs by 2026-05-05

Questions? Reply to this email or see documentation.
```

### External Users (if applicable)

**API changelog**:
```markdown
## Version 2.0.0 - 2025-11-05

### BREAKING CHANGES
- GHCID city codes now use GeoNames abbreviations instead of UN/LOCODE
- Some existing GHCIDs have changed format (see mapping table)
- `ghcid_numeric` field may have new values

### Upgrade Guide
- Download GHCID mapping table: https://example.org/ghcid_mapping.csv
- Update your local GHCID references
- Old GHCIDs supported until 2026-05-05

### New Features
- GHCID coverage increased to >95% for Dutch institutions
- Support for 475+ Dutch cities (previously 50)
- Global expansion enabled for 60+ countries
```

---

## Success Criteria

Migration is successful if:
- ✅ GHCID coverage >95% (>345/364 ISIL records)
- ✅ All 150+ tests passing
- ✅ Mapping table covers all changed GHCIDs
- ✅ Zero data loss (all ISIL records present)
- ✅ History entries capture old GHCIDs
- ✅ Old GHCID lookup works (6-month compatibility)
- ✅ No duplicate GHCIDs generated
- ✅ GeoNames database <10MB (NL-only)

---

## Post-Migration Tasks

1. **Monitor for issues** (first 2 weeks)
   - Check error logs daily
   - Track old GHCID lookup usage
   - Collect user feedback

2. **Update documentation** (within 1 week)
   - Mark UN/LOCODE approach as deprecated
   - Update all examples to use GeoNames
   - Add migration date to CHANGELOG

3. **Performance monitoring** (first month)
   - Measure city lookup latency (<1ms target)
   - Check database query performance
   - Monitor storage usage

4. **Quarterly GeoNames updates** (ongoing)
   - Download latest GeoNames dump
   - Rebuild database
   - Validate no regressions
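
For the latency target in task 3, a small timing harness is usually enough. The sketch below times any lookup callable with `time.perf_counter`; the function name and the stand-in dictionary are hypothetical, and in practice `get_city_abbreviation` would be passed in instead.

```python
import statistics
import time
from typing import Callable, Iterable, Optional


def measure_lookup_latency_ms(
    lookup: Callable[[str], Optional[str]],
    cities: Iterable[str],
    repeats: int = 100,
) -> dict[str, float]:
    """Time `lookup` over `cities` and summarise per-call latency in ms."""
    samples = []
    city_list = list(cities)
    for _ in range(repeats):
        for city in city_list:
            start = time.perf_counter()
            lookup(city)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "max_ms": samples[-1],
    }


# Stand-in lookup for demonstration; swap in the real city lookup.
fake_table = {"Amsterdam": "AMS", "Rotterdam": "ROT"}
stats = measure_lookup_latency_ms(fake_table.get, ["Amsterdam", "Rotterdam", "Utrecht"])
print(stats)
```

Comparing the 95th percentile rather than the mean against the 1 ms target makes occasional slow queries visible.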

---

## Lessons Learned (Post-Migration)

_To be filled in after the migration is complete:_

**What went well**:
- TBD

**What could be improved**:
- TBD

**Unexpected issues**:
- TBD

**Recommendations for future migrations**:
- TBD

---

## References

- **GeoNames Integration Design**: `docs/plan/global_glam/08-geonames-integration.md`
- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GeoNames Official**: https://www.geonames.org

---

**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Pre-Migration Planning
**Next Review**: After migration complete