# Migration Guide: UN/LOCODE to GeoNames for GHCID

**Version**: 1.0
**Migration Date**: 2025-11-05
**Status**: Pre-Migration Planning

---

## Overview

This guide explains how to migrate existing GHCID identifiers from UN/LOCODE-based city codes to GeoNames-based city abbreviations.

**Why migrate?**
- UN/LOCODE covers only 10.5% of Dutch cities (50/475)
- GeoNames provides 100% coverage (475+ Dutch cities)
- Enables global expansion to 60+ countries

**Impact**:
- Existing 152 GHCIDs may change
- New GHCIDs generated for up to 212 previously uncovered institutions (364 ISIL records minus the 152 already covered)
- Overall GHCID coverage increases from 41.8% to >95%

---

## Breaking Changes

### 1. City Code Source Change

**Before** (UN/LOCODE):
```
Amsterdam → AMS (from UN/LOCODE registry)
Rotterdam → RTM (from UN/LOCODE registry)
```

**After** (GeoNames):
```
Amsterdam → AMS (first 3 letters of city name)
Rotterdam → ROT (first 3 letters of city name)
```

**Impact**: Some city codes will change (for example, RTM → ROT), while others, such as AMS, stay the same.

### 2. GHCID String Changes

**Example institution**: Science Museum in Rotterdam

**Old GHCID**:
```
NL-ZH-RTM-M-SM
```

**New GHCID**:
```
NL-ZH-ROT-M-SM
```

**Note**: Rotterdam's UN/LOCODE is "RTM", but its GeoNames abbreviation is "ROT" (first 3 letters).

### 3. Numeric Hash Changes

**Critical**: The `ghcid_numeric` field (SHA256 hash) will **change** whenever the GHCID string changes, because it is computed from that string.

**Old hash**:
```python
# Pseudocode; the number is an illustrative placeholder, not a real digest
SHA256("NL-ZH-RTM-M-SM")[:8] → 12345678901234567890
```

**New hash**:
```python
# Pseudocode; the number is an illustrative placeholder, not a real digest
SHA256("NL-ZH-ROT-M-SM")[:8] → 98765432109876543210
```

**Impact**: Any system that stores or references `ghcid_numeric` must update those references.

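
To make the effect on `ghcid_numeric` concrete, here is a minimal sketch. It assumes the numeric ID is the first 8 bytes of the SHA-256 digest of the UTF-8 encoded GHCID string, read as an unsigned big-endian integer. The function name `ghcid_to_numeric` and the byte order are assumptions made for illustration; the GHCID specification remains the authoritative definition.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """Hypothetical helper: first 8 bytes of SHA-256 as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Big-endian byte order is an assumption, not confirmed by this guide
    return int.from_bytes(digest[:8], byteorder="big")


old_numeric = ghcid_to_numeric("NL-ZH-RTM-M-SM")
new_numeric = ghcid_to_numeric("NL-ZH-ROT-M-SM")

# The city code is part of the hashed string, so the two values differ
print(old_numeric, new_numeric, old_numeric != new_numeric)
```

Because the hash covers the whole string, a change to the city code alone is enough to produce an unrelated numeric value, which is why the old numeric ID also needs to be recorded in the mapping table.
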

---

## Migration Strategy

### Phase 1: Preparation (Pre-Migration)

#### 1.1 Export Current GHCIDs

```python
# scripts/export_current_ghcids.py
import csv

from glam_extractor.parsers.isil_registry import ISILRegistryParser

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

# Export current GHCIDs to CSV. csv.writer quotes fields, so institution
# names that contain commas do not break the column layout.
with open("data/migration/ghcids_before_migration.csv", "w",
          newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["isil_code", "institution_name", "ghcid_current", "ghcid_numeric"])
    for record in records:
        if record.ghcid_current:
            writer.writerow([
                record.identifiers[0].identifier_value,
                record.name,
                record.ghcid_current,
                record.ghcid_numeric,
            ])
```

**Output**: 152 records with current GHCIDs saved to CSV.

#### 1.2 Identify Affected Records

```python
# scripts/identify_ghcid_changes.py
# load_csv, save_csv and generate_new_ghcid are project helpers not shown in this guide.

def compare_locodes_vs_geonames():
    """Compare old LOCODE-based GHCIDs with new GeoNames-based GHCIDs."""
    old_records = load_csv("data/migration/ghcids_before_migration.csv")

    changes = []
    for record in old_records:
        old_ghcid = record['ghcid_current']
        new_ghcid = generate_new_ghcid(record)  # Using GeoNames

        if old_ghcid != new_ghcid:
            changes.append({
                'isil_code': record['isil_code'],
                'institution': record['institution_name'],
                'old_ghcid': old_ghcid,
                'new_ghcid': new_ghcid,
                'reason': 'City code changed (LOCODE → GeoNames)'
            })

    # Save report
    save_csv("data/migration/ghcid_changes.csv", changes)
    print(f"Found {len(changes)} GHCIDs that will change")
    return changes
```

**Expected**: 10-30% of existing GHCIDs will change.

#### 1.3 Create Mapping Table

The file `data/migration/ghcid_mapping.csv` maps each old GHCID to its new GHCID for backward compatibility. The numeric values below are placeholders.

```csv
old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason
NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,12345678901234567890,98765432109876543210,City code RTM→ROT
NL-ZH-HAG-A-NA,NL-ZH-DEN-A-NA,11111111111111111111,22222222222222222222,City code HAG→DEN
```

**Purpose**: Support lookups by old GHCID during the transition period.

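
The transition-period lookup that this table enables can be sketched as follows. This is a minimal illustration using only the standard library: `load_mapping` and `resolve_ghcid` are hypothetical helper names, and the inline CSV stands in for `data/migration/ghcid_mapping.csv` with placeholder numeric values.

```python
import csv
import io
from typing import Dict

# Stand-in for data/migration/ghcid_mapping.csv (numeric values are placeholders)
SAMPLE_MAPPING_CSV = """old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason
NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,1,2,City code RTM->ROT
"""


def load_mapping(csv_file) -> Dict[str, str]:
    """Hypothetical helper: read the mapping table into {old_ghcid: new_ghcid}."""
    return {row["old_ghcid"]: row["new_ghcid"] for row in csv.DictReader(csv_file)}


def resolve_ghcid(ghcid: str, mapping: Dict[str, str]) -> str:
    """Translate a legacy GHCID to its current form; other IDs pass through unchanged."""
    return mapping.get(ghcid, ghcid)


mapping = load_mapping(io.StringIO(SAMPLE_MAPPING_CSV))
print(resolve_ghcid("NL-ZH-RTM-M-SM", mapping))  # legacy ID is translated
print(resolve_ghcid("NL-ZH-ROT-M-SM", mapping))  # current ID is returned as-is
```

Keeping the mapping as a plain dictionary keyed by the old GHCID means a legacy lookup costs one extra dictionary access, and the table can be dropped once the deprecation window closes.
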

---

### Phase 2: Implementation (Migration Day)

#### 2.1 Build GeoNames Database

```bash
# Download GeoNames data
wget https://download.geonames.org/export/dump/NL.zip
unzip NL.zip

# Build SQLite database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate completeness
python scripts/validate_geonames_db.py
# Expected: 475+ Dutch cities loaded
```

#### 2.2 Update Code

**Update `lookups.py`**:

```python
from typing import Optional

# OLD (deprecated)
def get_city_locode(city: str, country: str = "NL") -> Optional[str]:
    return _NL_CITY_LOCODES.get("cities", {}).get(city)

# NEW (GeoNames-based)
from glam_extractor.geocoding.geonames_lookup import GeoNamesDB

_geonames_db = GeoNamesDB("data/reference/geonames.db")

def get_city_abbreviation(city: str, country: str = "NL") -> Optional[str]:
    """Get 3-letter city abbreviation from GeoNames."""
    result = _geonames_db.get_city_details(city, country)
    if result:
        return result['abbreviation']  # First 3 letters, uppercase
    return None
```

**Update GHCID generation in `isil_registry.py`**:

```python
# OLD
city_locode = get_city_locode(record.plaats, "NL")
if not city_locode:
    return None  # Skip GHCID generation

# NEW
city_abbr = get_city_abbreviation(record.plaats, "NL")
if not city_abbr:
    print(f"Warning: No GeoNames match for city: {record.plaats}")
    return None
```

#### 2.3 Regenerate All GHCIDs

```python
# scripts/regenerate_ghcids_with_geonames.py
# generate_ghcid_components, GHCIDHistoryEntry and export_to_jsonld are
# project helpers; their imports are not shown in this guide.
from glam_extractor.parsers.isil_registry import ISILRegistryParser
from datetime import datetime, timezone

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

migration_date = datetime.now(timezone.utc)

for record in records:
    # Remember the pre-migration values (None if the record had no GHCID),
    # so a value from a previous iteration is never reused by accident
    old_ghcid = record.ghcid_current
    old_numeric = record.ghcid_numeric

    # Generate new GHCID (using GeoNames)
    new_components = generate_ghcid_components(record)  # Now uses GeoNames
    new_ghcid = new_components.to_string()
    new_numeric = new_components.to_numeric()

    # Add a history entry only when an existing GHCID actually changed
    if old_ghcid and old_ghcid != new_ghcid:
        history_entry = GHCIDHistoryEntry(
            ghcid=old_ghcid,
            ghcid_numeric=old_numeric,
            valid_from=record.ghcid_history[0].valid_from,  # Original date
            valid_to=migration_date,
            reason="Migrated from UN/LOCODE to GeoNames city abbreviation",
            institution_name=record.name,
            location_city=record.locations[0].city,
            location_country=record.locations[0].country
        )
        record.ghcid_history.append(history_entry)

    # Update current GHCID
    record.ghcid_current = new_ghcid
    record.ghcid_numeric = new_numeric
    # Keep ghcid_original unchanged (immutable)

# Save updated records
export_to_jsonld(records, "output/heritage_custodians_geonames.jsonld")
```

---

### Phase 3: Validation (Post-Migration)

#### 3.1 Verify Coverage Improvement

```python
# scripts/validate_migration.py
# load_csv and parse_isil_registry are project helpers not shown in this guide.

def validate_migration():
    """Verify GHCID generation improved after GeoNames migration."""
    total = 364  # ISIL records in the registry export

    # Before migration
    old_stats = load_csv("data/migration/ghcids_before_migration.csv")
    old_coverage = len(old_stats)  # 152 records

    # After migration
    new_records = parse_isil_registry()
    new_coverage = sum(1 for r in new_records if r.ghcid_current)

    print(f"Coverage before: {old_coverage}/{total} ({old_coverage/total*100:.1f}%)")
    print(f"Coverage after: {new_coverage}/{total} ({new_coverage/total*100:.1f}%)")
    print(f"Improvement: +{new_coverage - old_coverage} records")

    assert new_coverage > old_coverage, "Migration should improve coverage"
    # 345/364 is 94.8%, so at least 346 records are needed to exceed 95%
    assert new_coverage >= 346, "Should reach >95% coverage"
```

**Expected results**:
- Before: 152/364 (41.8%)
- After: 346+/364 (>95%)
- Improvement: +194 records or more

#### 3.2 Verify Mapping Table

```python
def verify_ghcid_mapping():
    """Ensure all old GHCIDs map to new GHCIDs."""
    mapping = load_csv("data/migration/ghcid_mapping.csv")

    for row in mapping:
        old_ghcid = row['old_ghcid']
        new_ghcid = row['new_ghcid']

        # Verify new GHCID exists
        record = lookup_by_ghcid(new_ghcid)
        assert record, f"New GHCID not found: {new_ghcid}"

        # Verify history entry exists
        assert any(h.ghcid == old_ghcid for h in record.ghcid_history), \
            f"Old GHCID not in history: {old_ghcid}"
```

#### 3.3 Test Suite

```bash
# Run full test suite
pytest tests/ -v

# Expected: All 150+ tests passing
# New tests for GeoNames integration should be added
```

---

### Phase 4: Backward Compatibility

#### 4.1 Support Old GHCID Lookups

```python
# src/glam_extractor/identifiers/ghcid_lookup.py
# HeritageCustodian and load_mapping come from the project and are not shown here.
from typing import Optional

class GHCIDLookup:
    """Lookup institutions by GHCID (supports old + new identifiers)."""

    def __init__(self, mapping_file: str = "data/migration/ghcid_mapping.csv"):
        self.mapping = load_mapping(mapping_file)

    def lookup(self, ghcid: str) -> Optional[HeritageCustodian]:
        """
        Lookup institution by GHCID.

        Supports:
        - Current GHCID (GeoNames-based)
        - Legacy GHCID (UN/LOCODE-based, via mapping table)
        """
        # Try current GHCID first
        record = self._lookup_by_current_ghcid(ghcid)
        if record:
            return record

        # Check if it's a legacy GHCID
        new_ghcid = self.mapping.get(ghcid)
        if new_ghcid:
            print(f"Info: Old GHCID {ghcid} → new GHCID {new_ghcid}")
            return self._lookup_by_current_ghcid(new_ghcid)

        return None  # Not found
```

**Purpose**: Existing systems can continue using old GHCIDs for 6 months, until 2026-05-05 (see the timeline in 4.2).

#### 4.2 Deprecation Timeline

| Date | Action |
|------|--------|
| **2025-11-05** | Migration complete, mapping table created |
| **2025-11-05 - 2026-05-05** | Support both old + new GHCIDs (6 months) |
| **2026-02-05** | Send deprecation warnings for old GHCIDs (3 months notice) |
| **2026-05-05** | Remove old GHCID support, mapping table archived |

**Communication**: Notify users via:
- Documentation updates
- Log warnings when old GHCID used
- Email to API consumers (if applicable)

---

## City Code Changes Reference

### Common Changes

| City | UN/LOCODE (Old) | GeoNames (New) | Change? |
|------|----------------|----------------|---------|
| Amsterdam | AMS | AMS | No ✅ |
| Rotterdam | RTM | ROT | Yes ⚠️ |
| Den Haag | HAG | DEN | Yes ⚠️ |
| Utrecht | UTC | UTR | Yes ⚠️ |
| Eindhoven | EIN | EIN | No ✅ |
| Groningen | GRQ | GRO | Yes ⚠️ |
| Tilburg | TIL | TIL | No ✅ |
| Almere | ALM | ALM | No ✅ |
| Breda | BRE | BRE | No ✅ |
| Nijmegen | NIM | NIJ | Yes ⚠️ |

**Pattern**: A city code changes whenever its UN/LOCODE differs from the first 3 letters of the city name.

### Newly Covered Cities

Cities **not in UN/LOCODE** but **now in GeoNames**:

| City | GeoNames Abbr | New GHCIDs Generated |
|------|--------------|---------------------|
| Achtkarspelen | ACH | ~2 institutions |
| Almkerk | ALM | ~1 institution |
| Ameland | AME | ~1 institution |
| Bunschoten | BUN | ~3 institutions |
| ... | ... | ... |

**Total**: Up to 212 additional institutions can now generate GHCIDs.

---

## Rollback Plan

If migration causes critical issues:

### 1. Immediate Rollback (Emergency)

```bash
# Restore old code
git checkout tags/pre-geonames-migration

# Restore old data
cp data/migration/ghcids_before_migration.csv data/ghcids_current.csv

# Restart services
systemctl restart glam-extractor
```

**Downtime**: <5 minutes

### 2. Keep GeoNames, Revert GHCIDs

If the GeoNames database is fine but GHCIDs need adjustment:

```python
# Restore old GHCIDs from backup
restore_ghcids_from_backup("data/migration/ghcids_before_migration.csv")

# Keep GeoNames database for new institutions
# Only update new institutions, leave existing unchanged
```

**Downtime**: <1 hour

---

## Testing Checklist

Pre-Migration Testing:
- [ ] GeoNames database contains all 475 Dutch cities
- [ ] City abbreviation algorithm tested (Amsterdam → AMS)
- [ ] Province code mapping works (Amsterdam → NH)
- [ ] Mapping table created (old GHCID → new GHCID)

Migration Testing:
- [ ] All 150+ tests pass
- [ ] GHCID coverage increased from 41.8% to >95%
- [ ] No duplicate GHCIDs generated
- [ ] History entries correctly capture old GHCIDs

Post-Migration Testing:
- [ ] Old GHCID lookup works (via mapping table)
- [ ] New GHCID lookup works (direct)
- [ ] Export formats valid (JSON-LD, RDF, CSV)
- [ ] No data loss (all original ISIL records present)

---

## Communication Plan

### Internal Team

**Email template**:

```
Subject: GHCID Migration to GeoNames - Action Required

Team,

On 2025-11-05, we're migrating GHCID city codes from UN/LOCODE to GeoNames.

BENEFITS:
- Coverage increases from 41.8% to >95%
- Up to 212 more institutions can now generate GHCIDs
- Enables global expansion to 60+ countries

BREAKING CHANGES:
- Some GHCIDs will change (10-30% of existing)
- Numeric hashes will change (computed from GHCID)
- Mapping table provided for backward compatibility

ACTIONS:
1. Review migration guide: docs/migration/ghcid_locode_to_geonames.md
2. Test with new GHCIDs in staging environment
3. Update any hardcoded GHCID references
4. Plan to migrate to new GHCIDs by 2026-05-05

Questions? Reply to this email or see documentation.
```

### External Users (if applicable)

**API changelog**:

```markdown
## Version 2.0.0 - 2025-11-05

### BREAKING CHANGES
- GHCID city codes now use GeoNames abbreviations instead of UN/LOCODE
- Some existing GHCIDs have changed format (see mapping table)
- `ghcid_numeric` field may have new values

### Upgrade Guide
- Download GHCID mapping table: https://example.org/ghcid_mapping.csv
- Update your local GHCID references
- Old GHCIDs supported until 2026-05-05

### New Features
- GHCID coverage increased to >95% for Dutch institutions
- Support for 475+ Dutch cities (previously 50)
- Global expansion enabled for 60+ countries
```

---

## Success Criteria

Migration is successful if:

- ✅ GHCID coverage >95% (>345/364 ISIL records)
- ✅ All 150+ tests passing
- ✅ Mapping table covers all changed GHCIDs
- ✅ Zero data loss (all ISIL records present)
- ✅ History entries capture old GHCIDs
- ✅ Old GHCID lookup works (6-month compatibility)
- ✅ No duplicate GHCIDs generated
- ✅ GeoNames database <10MB (NL-only)

---

## Post-Migration Tasks

1. **Monitor for issues** (first 2 weeks)
   - Check error logs daily
   - Track old GHCID lookup usage
   - Collect user feedback

2. **Update documentation** (within 1 week)
   - Mark UN/LOCODE approach as deprecated
   - Update all examples to use GeoNames
   - Add migration date to CHANGELOG

3. **Performance monitoring** (first month)
   - Measure city lookup latency (<1ms target)
   - Check database query performance
   - Monitor storage usage

4. **Quarterly GeoNames updates** (ongoing)
   - Download latest GeoNames dump
   - Rebuild database
   - Validate no regressions

---

## Lessons Learned (Post-Migration)

_To be filled after migration complete:_

**What went well**:
- TBD

**What could be improved**:
- TBD

**Unexpected issues**:
- TBD

**Recommendations for future migrations**:
- TBD

---

## References

- **GeoNames Integration Design**: `docs/plan/global_glam/08-geonames-integration.md`
- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GeoNames Official**: https://www.geonames.org

---

**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Pre-Migration Planning
**Next Review**: After migration complete