glam/docs/migration/ghcid_locode_to_geonames.md
2025-11-19 23:25:22 +01:00

Migration Guide: UN/LOCODE to GeoNames for GHCID

Version: 1.0
Migration Date: 2025-11-05
Status: Pre-Migration Planning


Overview

This guide explains how to migrate existing GHCID identifiers from UN/LOCODE-based city codes to GeoNames-based city abbreviations.

Why migrate?

  • UN/LOCODE covers only 10.5% of Dutch cities (50/475)
  • GeoNames provides 100% coverage (475+ Dutch cities)
  • Enables global expansion to 60+ countries

Impact:

  • Some of the 152 existing GHCIDs will change (an estimated 10-30%)
  • New GHCIDs generated for 212 previously uncovered institutions
  • Overall GHCID coverage increases from 41.8% to >95%

Breaking Changes

1. City Code Source Change

Before (UN/LOCODE):

Amsterdam → AMS (from UN/LOCODE registry)
Rotterdam → RTM (from UN/LOCODE registry)

After (GeoNames):

Amsterdam → AMS (first 3 letters of city name)
Rotterdam → ROT (first 3 letters of city name)

Impact: A city code changes wherever the UN/LOCODE differs from the first 3 letters of the city name.
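
A minimal sketch of the first-3-letters rule, for illustration only. The function name `city_abbreviation` is hypothetical, and the handling of diacritics, spaces and punctuation is an assumption; the project's GeoNames lookup may normalise names differently.

```python
import unicodedata
from typing import Optional


def city_abbreviation(city: str) -> Optional[str]:
    """Return the first three letters of a city name, uppercased."""
    # Decompose accented characters so "É" becomes "E" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", city)
    # Keep ASCII letters only; this drops combining marks, spaces and punctuation.
    letters = [ch for ch in decomposed if ch.isascii() and ch.isalpha()]
    if len(letters) < 3:
        return None  # too short to abbreviate under this rule
    return "".join(letters[:3]).upper()


for name in ("Amsterdam", "Rotterdam", "Den Haag"):
    print(name, "→", city_abbreviation(name))
# Amsterdam → AMS
# Rotterdam → ROT
# Den Haag → DEN
```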

2. GHCID String Changes

Example institution: Science Museum in Rotterdam

Old GHCID:

NL-ZH-RTM-M-SM

New GHCID:

NL-ZH-ROT-M-SM

Note: Rotterdam's UN/LOCODE is "RTM", but GeoNames abbreviation is "ROT" (first 3 letters).

3. Numeric Hash Changes

Critical: The ghcid_numeric field (SHA256 hash) will change because it's computed from the GHCID string.

Old hash:

SHA256("NL-ZH-RTM-M-SM")[:8] → 12345678901234567890  (placeholder value)

New hash:

SHA256("NL-ZH-ROT-M-SM")[:8] → 98765432109876543210  (placeholder value)

Impact: Any systems referencing ghcid_numeric must update references.
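
A hedged sketch of how such a numeric identifier could be derived. The guide only states that the value comes from SHA256 of the GHCID string truncated with `[:8]`; reading that as the first 8 bytes of the digest, interpreted as a big-endian unsigned integer, is an assumption, as is the function name `ghcid_numeric`.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """Interpret the first 8 bytes of SHA-256(ghcid) as an unsigned integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Assumption: big-endian byte order; the real implementation may differ.
    return int.from_bytes(digest[:8], byteorder="big")


old_numeric = ghcid_numeric("NL-ZH-RTM-M-SM")
new_numeric = ghcid_numeric("NL-ZH-ROT-M-SM")
print(old_numeric, new_numeric)
```

Because the hash input is the full GHCID string, changing a single city code yields an unrelated numeric value, so old and new numerics cannot be derived from each other without the mapping table.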


Migration Strategy

Phase 1: Preparation (Pre-Migration)

1.1 Export Current GHCIDs

# scripts/export_current_ghcids.py

import csv

from glam_extractor.parsers.isil_registry import ISILRegistryParser

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

# Export current GHCIDs to CSV; csv.writer quotes institution names that contain commas
with open("data/migration/ghcids_before_migration.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["isil_code", "institution_name", "ghcid_current", "ghcid_numeric"])
    for record in records:
        if record.ghcid_current:
            writer.writerow([
                record.identifiers[0].identifier_value,
                record.name,
                record.ghcid_current,
                record.ghcid_numeric,
            ])

Output: 152 records with current GHCIDs saved to CSV.

1.2 Identify Affected Records

# scripts/identify_ghcid_changes.py

def compare_locodes_vs_geonames():
    """Compare old LOCODE-based GHCIDs with new GeoNames-based GHCIDs."""
    
    old_records = load_csv("data/migration/ghcids_before_migration.csv")
    
    changes = []
    for record in old_records:
        old_ghcid = record['ghcid_current']
        new_ghcid = generate_new_ghcid(record)  # Using GeoNames
        
        if old_ghcid != new_ghcid:
            changes.append({
                'isil_code': record['isil_code'],
                'institution': record['institution_name'],
                'old_ghcid': old_ghcid,
                'new_ghcid': new_ghcid,
                'reason': 'City code changed (LOCODE → GeoNames)'
            })
    
    # Save report
    save_csv("data/migration/ghcid_changes.csv", changes)
    print(f"Found {len(changes)} GHCIDs that will change")
    return changes

Expected: 10-30% of existing GHCIDs will change.
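
The script above calls `load_csv` and `save_csv` without defining them. A minimal sketch of what such helpers might look like, assuming rows are plain dictionaries keyed by column name (the actual project helpers may differ):

```python
import csv
from typing import Dict, List


def load_csv(path: str) -> List[Dict[str, str]]:
    """Read a CSV file with a header row into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def save_csv(path: str, rows: List[Dict[str, str]]) -> None:
    """Write dictionaries to CSV, taking the header from the first row."""
    if not rows:
        return  # nothing to write; avoids guessing the header
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```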

1.3 Create Mapping Table

# data/migration/ghcid_mapping.csv
# Maps old GHCID → new GHCID for backward compatibility

old_ghcid,new_ghcid,old_numeric,new_numeric,change_reason
NL-ZH-RTM-M-SM,NL-ZH-ROT-M-SM,12345678901234567890,98765432109876543210,City code RTM→ROT
NL-ZH-HAG-A-NA,NL-ZH-DEN-A-NA,11111111111111111111,22222222222222222222,City code HAG→DEN

Purpose: Support lookups by old GHCID during transition period.


Phase 2: Implementation (Migration Day)

2.1 Build GeoNames Database

# Download GeoNames data
wget http://download.geonames.org/export/dump/NL.zip
unzip NL.zip

# Build SQLite database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate completeness
python scripts/validate_geonames_db.py
# Expected: 475+ Dutch cities loaded
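
For orientation, a sketch of what `scripts/build_geonames_db.py` might do. The GeoNames country dump is a tab-separated file with 19 columns per row (geonameid, name, asciiname, ..., feature class at index 6, country code at index 8, admin1 code at index 10, population at index 14). The table name `cities`, the chosen columns and the function name are assumptions, not the project's actual schema.

```python
import sqlite3
from typing import Iterable


def build_city_table(lines: Iterable[str], conn: sqlite3.Connection) -> int:
    """Load populated places (feature class 'P') from dump lines into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities ("
        "geonameid INTEGER PRIMARY KEY, name TEXT, asciiname TEXT, "
        "country_code TEXT, admin1_code TEXT, population INTEGER)"
    )
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 19 or cols[6] != "P":
            continue  # skip malformed rows and non-populated-place features
        rows.append(
            (int(cols[0]), cols[1], cols[2], cols[8], cols[10], int(cols[14] or 0))
        )
    conn.executemany("INSERT OR REPLACE INTO cities VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Synthetic row in the 19-column layout; values are placeholders, not real GeoNames data.
sample = "\t".join(
    ["1", "Rotterdam", "Rotterdam", "", "0", "0", "P", "PPL", "NL", "",
     "00", "", "", "", "0", "", "", "Europe/Amsterdam", "2025-01-01"]
)
conn = sqlite3.connect(":memory:")
print(build_city_table([sample], conn))  # 1
```

In a real run the lines would come from `open("NL.txt", encoding="utf-8")` and the connection from `sqlite3.connect("data/reference/geonames.db")`.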

2.2 Update Code

Update lookups.py:

from typing import Optional

# OLD (deprecated)
def get_city_locode(city: str, country: str = "NL") -> Optional[str]:
    return _NL_CITY_LOCODES.get("cities", {}).get(city)

# NEW (GeoNames-based)
from glam_extractor.geocoding.geonames_lookup import GeoNamesDB

_geonames_db = GeoNamesDB("data/reference/geonames.db")

def get_city_abbreviation(city: str, country: str = "NL") -> Optional[str]:
    """Get 3-letter city abbreviation from GeoNames."""
    result = _geonames_db.get_city_details(city, country)
    if result:
        return result['abbreviation']  # First 3 letters, uppercase
    return None

Update GHCID generation in isil_registry.py:

# OLD
city_locode = get_city_locode(record.plaats, "NL")
if not city_locode:
    return None  # Skip GHCID generation

# NEW
city_abbr = get_city_abbreviation(record.plaats, "NL")
if not city_abbr:
    print(f"Warning: No GeoNames match for city: {record.plaats}")
    return None

2.3 Regenerate All GHCIDs

# scripts/regenerate_ghcids_with_geonames.py

from glam_extractor.parsers.isil_registry import ISILRegistryParser
from datetime import datetime, timezone

parser = ISILRegistryParser()
records = parser.parse_and_convert("data/ISIL-codes_2025-08-01.csv")

migration_date = datetime.now(timezone.utc)

for record in records:
    # Save old GHCID to history
    if record.ghcid_current:
        old_ghcid = record.ghcid_current
        old_numeric = record.ghcid_numeric
        
        # Generate new GHCID (using GeoNames)
        new_components = generate_ghcid_components(record)  # Now uses GeoNames
        new_ghcid = new_components.to_string()
        new_numeric = new_components.to_numeric()
        
        if old_ghcid != new_ghcid:
            # Add history entry for change
            history_entry = GHCIDHistoryEntry(
                ghcid=old_ghcid,
                ghcid_numeric=old_numeric,
                valid_from=record.ghcid_history[0].valid_from,  # Original date
                valid_to=migration_date,
                reason="Migrated from UN/LOCODE to GeoNames city abbreviation",
                institution_name=record.name,
                location_city=record.locations[0].city,
                location_country=record.locations[0].country
            )
            record.ghcid_history.append(history_entry)
            
            # Update current GHCID
            record.ghcid_current = new_ghcid
            record.ghcid_numeric = new_numeric
            # Keep ghcid_original unchanged (immutable)

# Save updated records
export_to_jsonld(records, "output/heritage_custodians_geonames.jsonld")

Phase 3: Validation (Post-Migration)

3.1 Verify Coverage Improvement

# scripts/validate_migration.py

def validate_migration():
    """Verify GHCID generation improved after GeoNames migration."""
    
    # Before migration
    old_stats = load_csv("data/migration/ghcids_before_migration.csv")
    old_coverage = len(old_stats)  # 152 records
    
    # After migration
    new_records = parse_isil_registry()
    new_coverage = sum(1 for r in new_records if r.ghcid_current)
    
    print(f"Coverage before: {old_coverage}/364 ({old_coverage/364*100:.1f}%)")
    print(f"Coverage after: {new_coverage}/364 ({new_coverage/364*100:.1f}%)")
    print(f"Improvement: +{new_coverage - old_coverage} records")
    
    assert new_coverage > old_coverage, "Migration should improve coverage"
    assert new_coverage >= 346, "Should reach >95% coverage"  # 345/364 is only 94.8%

Expected results:

  • Before: 152/364 (41.8%)
  • After: 346+/364 (>95%)
  • Improvement: +194 records or more

3.2 Verify Mapping Table

def verify_ghcid_mapping():
    """Ensure all old GHCIDs map to new GHCIDs."""
    
    mapping = load_csv("data/migration/ghcid_mapping.csv")
    
    for row in mapping:
        old_ghcid = row['old_ghcid']
        new_ghcid = row['new_ghcid']
        
        # Verify new GHCID exists
        record = lookup_by_ghcid(new_ghcid)
        assert record, f"New GHCID not found: {new_ghcid}"
        
        # Verify history entry exists
        assert any(h.ghcid == old_ghcid for h in record.ghcid_history), \
            f"Old GHCID not in history: {old_ghcid}"

3.3 Test Suite

# Run full test suite
pytest tests/ -v

# Expected: All 150+ tests passing
# New tests for GeoNames integration should be added

Phase 4: Backward Compatibility

4.1 Support Old GHCID Lookups

# src/glam_extractor/identifiers/ghcid_lookup.py

from typing import Optional

class GHCIDLookup:
    """Lookup institutions by GHCID (supports old + new identifiers)."""
    
    def __init__(self, mapping_file: str = "data/migration/ghcid_mapping.csv"):
        self.mapping = load_mapping(mapping_file)
    
    def lookup(self, ghcid: str) -> Optional[HeritageCustodian]:
        """
        Lookup institution by GHCID.
        
        Supports:
        - Current GHCID (GeoNames-based)
        - Legacy GHCID (UN/LOCODE-based, via mapping table)
        """
        # Try current GHCID first
        record = self._lookup_by_current_ghcid(ghcid)
        if record:
            return record
        
        # Check if it's a legacy GHCID
        new_ghcid = self.mapping.get(ghcid)
        if new_ghcid:
            print(f"Info: Old GHCID {ghcid} → new GHCID {new_ghcid}")
            return self._lookup_by_current_ghcid(new_ghcid)
        
        return None  # Not found

Purpose: Existing systems can continue using old GHCIDs for 6 months (until 2026-05-05).
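
`load_mapping` is not defined in this guide. A possible sketch, assuming the mapping file uses the `old_ghcid` and `new_ghcid` columns shown in section 1.3 and may start with `#` comment lines:

```python
import csv
from typing import Dict


def load_mapping(path: str) -> Dict[str, str]:
    """Return {old_ghcid: new_ghcid} from the migration mapping CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        # Skip blank lines and '#' comments so the header row is read correctly.
        data_lines = (
            line for line in f if line.strip() and not line.startswith("#")
        )
        return {
            row["old_ghcid"]: row["new_ghcid"] for row in csv.DictReader(data_lines)
        }
```

A plain dictionary keeps the legacy lookup to a single hash probe, which should be adequate for a few hundred records.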

4.2 Deprecation Timeline

| Date | Action |
|------|--------|
| 2025-11-05 | Migration complete, mapping table created |
| 2025-11-05 to 2026-05-05 | Support both old and new GHCIDs (6 months) |
| 2026-02-05 | Send deprecation warnings for old GHCIDs (3 months' notice) |
| 2026-05-05 | Remove old GHCID support, archive mapping table |

Communication: Notify users via:

  • Documentation updates
  • Log warnings when old GHCID used
  • Email to API consumers (if applicable)
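
One way the timeline and the log warnings could be wired together, as a sketch only. The function name `resolve_legacy_ghcid`, the logger name, and the choice to stay silent before 2026-02-05 are assumptions; the dates come from the timeline above.

```python
import logging
from datetime import date
from typing import Dict, Optional

logger = logging.getLogger("ghcid.migration")

WARN_FROM = date(2026, 2, 5)   # deprecation warnings start
REMOVED_ON = date(2026, 5, 5)  # legacy GHCID support ends


def resolve_legacy_ghcid(
    ghcid: str, mapping: Dict[str, str], today: date
) -> Optional[str]:
    """Translate an old GHCID to its replacement while legacy support lasts."""
    new_ghcid = mapping.get(ghcid)
    if new_ghcid is None or today >= REMOVED_ON:
        return None  # unknown identifier, or legacy support has been removed
    if today >= WARN_FROM:
        logger.warning(
            "GHCID %s is deprecated; use %s (legacy support ends %s)",
            ghcid, new_ghcid, REMOVED_ON.isoformat(),
        )
    return new_ghcid
```

Passing `today` explicitly keeps the function easy to test against each phase of the timeline.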

City Code Changes Reference

Common Changes

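| City | UN/LOCODE (Old) | GeoNames (New) | Change? |
|------|-----------------|----------------|---------|
| Amsterdam | AMS | AMS | No |
| Rotterdam | RTM | ROT | Yes ⚠️ |
| Den Haag | HAG | DEN | Yes ⚠️ |
| Utrecht | UTC | UTR | Yes ⚠️ |
| Eindhoven | EIN | EIN | No |
| Groningen | GRQ | GRO | Yes ⚠️ |
| Tilburg | TIL | TIL | No |
| Almere | ALM | ALM | No |
| Breda | BRE | BRE | No |
| Nijmegen | NIM | NIJ | Yes ⚠️ |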

Pattern: Cities whose UN/LOCODE differs from the first 3 letters of the city name will change.

Newly Covered Cities

Cities not in UN/LOCODE but now in GeoNames:

| City | GeoNames Abbr | New GHCIDs Generated |
|------|---------------|----------------------|
| Achtkarspelen | ACH | ~2 institutions |
| Almkerk | ALM | ~1 institution |
| Ameland | AME | ~1 institution |
| Bunschoten | BUN | ~3 institutions |
| ... | ... | ... |

Total: +212 institutions can now generate GHCIDs.


Rollback Plan

If migration causes critical issues:

1. Immediate Rollback (Emergency)

# Restore old code
git checkout tags/pre-geonames-migration

# Restore old data
cp data/migration/ghcids_before_migration.csv data/ghcids_current.csv

# Restart services
systemctl restart glam-extractor

Downtime: <5 minutes

2. Keep GeoNames, Revert GHCIDs

If GeoNames database is fine but GHCIDs need adjustment:

# Restore old GHCIDs from backup
restore_ghcids_from_backup("data/migration/ghcids_before_migration.csv")

# Keep GeoNames database for new institutions
# Only update new institutions, leave existing unchanged

Downtime: <1 hour


Testing Checklist

Pre-Migration Testing:

  • GeoNames database contains all 475 Dutch cities
  • City abbreviation algorithm tested (Amsterdam → AMS)
  • Province code mapping works (Amsterdam → NH)
  • Mapping table created (old GHCID → new GHCID)

Migration Testing:

  • All 150+ tests pass
  • GHCID coverage increased from 41.8% to >95%
  • No duplicate GHCIDs generated
  • History entries correctly capture old GHCIDs
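
The duplicate check in this list can be sketched as follows. The function name and the sample identifiers are illustrative only; in practice the input would be the `ghcid_current` values of all regenerated records.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_ghcids(ghcids: Iterable[str]) -> Dict[str, int]:
    """Return each GHCID that occurs more than once, with its count."""
    counts = Counter(ghcids)
    return {ghcid: n for ghcid, n in counts.items() if n > 1}


# Made-up sample identifiers for illustration.
sample_ghcids = ["NL-ZH-ROT-M-SM", "NL-ZH-DEN-A-NA", "NL-ZH-ROT-M-SM"]
print(find_duplicate_ghcids(sample_ghcids))  # {'NL-ZH-ROT-M-SM': 2}
```

Since the new city code is just the first 3 letters, different cities can share a code (Almere and Almkerk both abbreviate to ALM), so this check is worth running on the full record set before publishing.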

Post-Migration Testing:

  • Old GHCID lookup works (via mapping table)
  • New GHCID lookup works (direct)
  • Export formats valid (JSON-LD, RDF, CSV)
  • No data loss (all original ISIL records present)

Communication Plan

Internal Team

Email template:

Subject: GHCID Migration to GeoNames - Action Required

Team,

On 2025-11-05, we're migrating GHCID city codes from UN/LOCODE to GeoNames.

BENEFITS:
- Coverage increases from 41.8% to >95%
- +212 institutions can now generate GHCIDs
- Enables global expansion to 60+ countries

BREAKING CHANGES:
- Some GHCIDs will change (10-30% of existing)
- Numeric hashes will change (computed from GHCID)
- Mapping table provided for backward compatibility

ACTIONS:
1. Review migration guide: docs/migration/ghcid_locode_to_geonames.md
2. Test with new GHCIDs in staging environment
3. Update any hardcoded GHCID references
4. Plan to migrate to new GHCIDs by 2026-05-05

Questions? Reply to this email or see documentation.

External Users (if applicable)

API changelog:

## Version 2.0.0 - 2025-11-05

### BREAKING CHANGES
- GHCID city codes now use GeoNames abbreviations instead of UN/LOCODE
- Some existing GHCIDs have changed format (see mapping table)
- `ghcid_numeric` field may have new values

### Upgrade Guide
- Download GHCID mapping table: https://example.org/ghcid_mapping.csv
- Update your local GHCID references
- Old GHCIDs supported until 2026-05-05

### New Features
- GHCID coverage increased to >95% for Dutch institutions
- Support for 475+ Dutch cities (previously 50)
- Global expansion enabled for 60+ countries

Success Criteria

Migration is successful if:

  • GHCID coverage >95% (at least 346 of 364 ISIL records)
  • All 150+ tests passing
  • Mapping table covers all changed GHCIDs
  • Zero data loss (all ISIL records present)
  • History entries capture old GHCIDs
  • Old GHCID lookup works (6-month compatibility)
  • No duplicate GHCIDs generated
  • GeoNames database <10MB (NL-only)

Post-Migration Tasks

  1. Monitor for issues (first 2 weeks)

    • Check error logs daily
    • Track old GHCID lookup usage
    • Collect user feedback
  2. Update documentation (within 1 week)

    • Mark UN/LOCODE approach as deprecated
    • Update all examples to use GeoNames
    • Add migration date to CHANGELOG
  3. Performance monitoring (first month)

    • Measure city lookup latency (<1ms target)
    • Check database query performance
    • Monitor storage usage
  4. Quarterly GeoNames updates (ongoing)

    • Download latest GeoNames dump
    • Rebuild database
    • Validate no regressions

Lessons Learned (Post-Migration)

To be filled in after the migration is complete:

What went well:

  • TBD

What could be improved:

  • TBD

Unexpected issues:

  • TBD

Recommendations for future migrations:

  • TBD

References

  • GeoNames Integration Design: docs/plan/global_glam/08-geonames-integration.md
  • GHCID Specification: docs/plan/global_glam/06-global-identifier-system.md
  • Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
  • GeoNames Official: https://www.geonames.org

Version: 1.0
Last Updated: 2025-11-05
Status: Pre-Migration Planning
Next Review: After migration complete