kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

GeoNames Integration for GHCID City Abbreviations

Version: 1.0
Date: 2025-11-05
Status: Design Complete - Implementation Pending
Decision: Option B - GeoNames Local Database

Executive Summary

To generate Global Heritage Custodian Identifiers (GHCIDs) for heritage institutions worldwide, we need a comprehensive source of city abbreviations. This document explains why we are replacing UN/LOCODE with a local GeoNames database and provides implementation guidelines.

Key Decision: Use an offline GeoNames database (SQLite) rather than the GeoNames API or UN/LOCODE.

Problem Statement

Current Limitation: UN/LOCODE Coverage

The GHCID format requires 3-letter city codes:

{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
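
A minimal sketch of how the five components could be parsed and reassembled. The regular expression is an assumption made for illustration (single-letter type, alphanumeric institution abbreviation); the project's actual validation regex lives in ghcid.py and may differ. The names GHCIDParts, parse_ghcid, and format_ghcid are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed pattern for illustration only; not the project's real validation regex.
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})"
    r"-(?P<region>[A-Z0-9]{1,3})"
    r"-(?P<city>[A-Z]{3})"
    r"-(?P<inst_type>[A-Z])"
    r"-(?P<abbreviation>[A-Z0-9]+)$"
)


class GHCIDParts(NamedTuple):
    country: str       # ISO 3166-1 alpha-2, e.g. "NL"
    region: str        # ISO 3166-2 subdivision part, e.g. "NH"
    city: str          # 3-letter city code, e.g. "AMS"
    inst_type: str     # institution type, e.g. "M" for museum
    abbreviation: str  # institution abbreviation, e.g. "RM"


def parse_ghcid(ghcid: str) -> GHCIDParts:
    """Split a GHCID string into its components, or raise ValueError."""
    match = GHCID_PATTERN.match(ghcid)
    if match is None:
        raise ValueError(f"Not a valid GHCID: {ghcid!r}")
    return GHCIDParts(**match.groupdict())


def format_ghcid(parts: GHCIDParts) -> str:
    """Join the components back into the hyphen-separated form."""
    return "-".join(parts)


parts = parse_ghcid("NL-NH-AMS-M-RM")
print(parts.city, format_ghcid(parts))
```

Keeping the city code as its own field makes it straightforward to swap the source of that one component without touching the rest of the identifier.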

The current approach uses UN/LOCODE (United Nations Code for Trade and Transport Locations):

Coverage: Only 50 Dutch cities in data/reference/nl_city_locodes.json
Required: 475 Dutch cities (based on Dutch heritage datasets)
Gap: 89.5% of cities missing from lookup table
Impact: Only 41.8% GHCID generation rate (152/364 ISIL records)

Why UN/LOCODE Falls Short

Trade-focused: UN/LOCODE prioritizes ports, airports, rail terminals
Major cities only: Small/medium cities often omitted
Incomplete coverage: Many heritage-rich towns excluded
Not heritage-centric: No correlation with museum/archive density
Limited maintenance: Updates infrequent, focused on logistics

Example Gap

Dutch ISIL registry includes institutions in:

Aalten ✅ (in UN/LOCODE as AAT)
Achtkarspelen ❌ (not in UN/LOCODE)
Almkerk ❌ (not in UN/LOCODE)
Ameland ❌ (not in UN/LOCODE)

Result: 212 out of 364 records (58.2%) could not generate GHCIDs.

Solution: GeoNames Geographic Database

What is GeoNames?

GeoNames (https://www.geonames.org) is a comprehensive geographical database covering:

11+ million place names worldwide
Hierarchical data: City → Province/State → Country
Multilingual names: Local + alternative names in multiple languages
Geographic coordinates: Latitude/longitude for all places
Population data: Size/importance ranking
Administrative divisions: ISO 3166-2 integration
Free & open: Creative Commons Attribution 4.0 license (attribution required)

Why GeoNames for GHCID?

Criterion	UN/LOCODE	GeoNames	Winner
Dutch city coverage	50 cities (10.5%)	475+ cities (100%)	GeoNames ✅
Global coverage	~100,000 locations	11+ million	GeoNames ✅
Heritage relevance	Trade/transport focus	All populated places	GeoNames ✅
Coordinates included	Some	All	GeoNames ✅
Hierarchical data	No	Yes (city→province→country)	GeoNames ✅
Update frequency	Twice yearly	Daily	GeoNames ✅
API availability	No official API	Yes (free tier)	GeoNames ✅
Offline usage	CSV dumps	Tab-delimited text dumps (loadable into SQLite)	Tie ✅
Cost	Free	Free	Tie ✅

Conclusion: GeoNames is superior for heritage institution geolocation.

Implementation Options Considered

Option A: GeoNames API Integration

Approach: Query GeoNames web API in real-time for city lookups.

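# Sketch of the API approach (requires the third-party "requests" package)
from typing import Optional

import requests

class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor"
            },
            timeout=10,
        )
        response.raise_for_status()
        matches = response.json().get("geonames", [])
        if not matches:
            return None  # city not found
        # Generate 3-letter abbreviation from the letters of the city name
        letters = [ch for ch in city_name if ch.isalpha()]
        return {"abbreviation": "".join(letters[:3]).upper(), "geonames_id": matches[0]["geonameId"]}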

Pros:

✅ No local storage required
✅ Always up-to-date data
✅ Fast to implement (~2 hours)
✅ Simple integration

Cons:

❌ Rate limits: 1,000 credits/hour and 20,000 credits/day per username (free tier, at the time of writing)
❌ Requires network connectivity
❌ Latency on each lookup (~200-500ms)
❌ Single point of failure (API downtime)
❌ Processing 1,351 Dutch orgs = could hit rate limit

Verdict: Good for prototyping, not suitable for production.

Option B: GeoNames Local Database ✅ SELECTED

Approach: Download GeoNames data dump once, store in SQLite database for offline lookups.

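# Sketch of the local-database approach
import sqlite3
from typing import Optional

class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row  # allow access by column name
    
    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ? "
            "ORDER BY population DESC",  # prefer the largest place on duplicate names
            (city_name, country)
        )
        result = cursor.fetchone()
        if result is None:
            return None  # city not in the database
        return {"abbreviation": self._generate_abbreviation(result["name"]), **dict(result)}
    
    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name (spaces/punctuation ignored), uppercase
        letters = [ch for ch in city_name if ch.isalpha()]
        return "".join(letters[:3]).upper()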

Pros:

✅ No rate limits - unlimited local queries
✅ Fast lookups - <1ms with indexes
✅ Offline operation - no network dependency
✅ Predictable performance - no API downtime risk
✅ Cost-effective - free download, ~200MB storage
✅ Reproducible - same data snapshot across environments
✅ Cacheable - SQLite file easy to distribute

Cons:

⚠️ Initial setup - requires download + database creation (~30 mins)
⚠️ Storage - ~200MB for worldwide data, ~5MB for NL-only
⚠️ Staleness - need periodic updates (monthly/quarterly)
⚠️ Implementation time - ~4 hours total

Verdict: BEST for production - predictable, fast, no dependencies.

Option C: Hybrid Approach (API + Local Cache)

Approach: Use local database, fallback to API for missing cities, cache results.

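from typing import Optional

class GeoNamesHybrid:
    def __init__(self, local_db, api_client):
        self.local_db = local_db      # object exposing lookup() and insert()
        self.api_client = api_client  # object exposing lookup()

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result
        
        # Fallback to API
        result = self.api_client.lookup(city_name, country)
        if result is None:
            return None  # unknown to both sources; nothing to cache
        
        # Cache to local DB
        self.local_db.insert(result)
        return result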

Pros:

✅ Best of both worlds
✅ Auto-updates for missing cities
✅ Resilient to data gaps

Cons:

❌ Most complex implementation
❌ Still subject to rate limits (for misses)
❌ Two failure modes (DB + API)
❌ Longer implementation time (~5-6 hours)

Verdict: Over-engineered for current needs, defer to future.

Selected Solution: Option B Implementation Plan

Architecture

┌─────────────────────────────────────────────────────┐
│                 GHCID Generator                      │
│  (src/glam_extractor/identifiers/ghcid.py)          │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│            GeoNames Lookup Service                   │
│  (src/glam_extractor/geocoding/geonames_lookup.py)  │
│                                                      │
│  - get_city_abbreviation(city, country)              │
│  - get_city_details(city, country)                   │
│  - get_geonames_id(city, country)                    │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│               SQLite Database                        │
│       (data/reference/geonames.db)                   │
│                                                      │
│  Tables:                                             │
│  - cities (geonames_id, name, country_code, ...)     │
│  - admin1_codes (province mappings)                  │
│  - alternate_names (multilingual)                    │
└──────────────────────────────────────────────────────┘

Database Schema

-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT,          -- GeoNames admin1 code (e.g. "07"); mapped to ISO 3166-2 separately
    admin2_code TEXT,          -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT,         -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);

-- Index for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);

-- Province/state codes (GeoNames admin1 codes, keyed as "<country>.<admin1>")
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY,     -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL,        -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);

-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT,          -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);

CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
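
As a sketch of how the two main tables relate, the query below joins a city to its province name by rebuilding the "<country>.<admin1>" key used in admin1_codes. The demo uses reduced versions of the tables (only the columns the query needs) and illustrative rows; the geonames_id and population values are placeholders, not real GeoNames data, and province_for_city is a hypothetical helper.

```python
import sqlite3
from typing import Optional, Tuple


def province_for_city(
    conn: sqlite3.Connection, city_name: str, country_code: str
) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (city name, admin1 code, province name) or None if not found."""
    return conn.execute(
        """
        SELECT c.name, c.admin1_code, a.name
        FROM cities AS c
        LEFT JOIN admin1_codes AS a
               ON a.code = c.country_code || '.' || c.admin1_code
        WHERE c.name = ? AND c.country_code = ?
        ORDER BY c.population DESC
        LIMIT 1
        """,
        (city_name, country_code),
    ).fetchone()


# Demo with reduced tables and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country_code TEXT NOT NULL,
        admin1_code TEXT,
        population INTEGER
    );
    CREATE TABLE admin1_codes (code TEXT PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO cities VALUES (1, 'Amsterdam', 'NL', '07', 1000);
    INSERT INTO admin1_codes VALUES ('NL.07', 'North Holland');
    """
)
print(province_for_city(conn, "Amsterdam", "NL"))
```

The LEFT JOIN keeps a city in the result even when its admin1 code has no matching row, in which case the province name comes back as None.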

Data Source

GeoNames Download URL:

http://download.geonames.org/export/dump/

Files needed:

Cities: cities15000.zip (cities with >15,000 population) - ~25MB
- Alternative: allCountries.zip (all places) - ~350MB
Admin codes: admin1CodesASCII.txt - province/state mappings
Alternate names: alternateNamesV2.zip - multilingual names

Netherlands-specific:

NL.zip - All Dutch places (~200KB uncompressed)
Includes all 475+ cities in Dutch heritage datasets

Database Build Process

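# scripts/build_geonames_db.py

import csv
import sqlite3

def _to_int(value: str):
    """GeoNames leaves missing numeric fields empty; store those as NULL."""
    return int(value) if value else None

def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db"
):
    """
    Build SQLite database from GeoNames text files.
    
    Steps:
    1. Download GeoNames data (if not exists; not shown here)
    2. Create SQLite database
    3. Parse GeoNames TSV files
    4. Insert data with indexes
    5. Validate completeness (row counts are printed at the end)
    """
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()
    
    # Create tables (see schema above). create_tables() is assumed to run
    # those CREATE TABLE statements and is defined elsewhere in this script.
    create_tables(cursor)
    
    # Parse cities file. GeoNames dumps are tab-separated without quoting,
    # so quote handling is disabled (names may contain '"').
    with open(geonames_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            # OR REPLACE lets the script be re-run against an existing database
            cursor.execute(
                "INSERT OR REPLACE INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    int(row[0]),       # geonames_id
                    row[1],            # name
                    row[2],            # ascii_name
                    row[8],            # country_code
                    row[10],           # admin1_code
                    row[11],           # admin2_code
                    float(row[4]),     # latitude
                    float(row[5]),     # longitude
                    _to_int(row[14]),  # population
                    _to_int(row[15]),  # elevation (row[16] is the DEM value)
                    row[7],            # feature_code
                    row[17],           # timezone
                    row[18],           # modification_date
                )
            )
    
    # Parse admin1 codes (province mappings)
    with open(admin1_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT OR REPLACE INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], _to_int(row[3]))
            )
    
    # Create indexes (IF NOT EXISTS: the schema may already have created them)
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")
    
    # Collect row counts before closing the connection
    city_count = cursor.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    country_count = cursor.execute("SELECT COUNT(DISTINCT country_code) FROM cities").fetchone()[0]
    
    conn.commit()
    conn.close()
    
    print(f"✅ GeoNames database created: {output_db}")
    print(f"   Cities: {city_count}")
    print(f"   Countries: {country_count}")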
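
The "validate completeness" step could look roughly like the sketch below: given the list of city names that appear in the heritage datasets, report which ones have no row in the cities table. The function name find_missing_cities and the demo rows are assumptions for illustration, and the case-insensitive comparison uses SQLite's NOCASE collation, which only folds ASCII letters.

```python
import sqlite3
from typing import Iterable


def find_missing_cities(
    conn: sqlite3.Connection,
    required: Iterable[str],
    country_code: str = "NL",
) -> list[str]:
    """Return the required city names that have no match in the cities table."""
    missing = []
    for name in required:
        row = conn.execute(
            "SELECT 1 FROM cities "
            "WHERE name = ? COLLATE NOCASE AND country_code = ? LIMIT 1",
            (name, country_code),
        ).fetchone()
        if row is None:
            missing.append(name)
    return missing


# Demo with a reduced table and placeholder IDs ("Atlantis" stands in for
# a name that is absent from the database).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities ("
    "geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, country_code TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [(1, "Aalten", "NL"), (2, "Ameland", "NL")],
)
print(find_missing_cities(conn, ["Aalten", "ameland", "Atlantis"]))
```

A non-empty result would feed directly into the list of cities that still need an alternate-name or manual match.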

City Abbreviation Strategy

Decision: Use first 3 letters of city name (uppercase).

Rationale:

✅ Simplicity: Easy to understand and implement
✅ Compatibility: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
✅ Predictability: No complex rules, consistent output
✅ Readability: Often recognizable (Rotterdam → ROT, Utrecht → UTR)

Algorithm:

import unicodedata

def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.
    
    Rules:
    1. Strip accents and remove spaces/punctuation
    2. Take first 3 letters
    3. Convert to uppercase
    
    Examples:
        Amsterdam → AMS
        Rotterdam → ROT
        Den Haag → DEN
        's-Hertogenbosch → SHE
        Aalst → AAL
    """
    # Normalize: decompose accented characters, keep ASCII letters only
    decomposed = unicodedata.normalize("NFKD", city_name)
    letters = [ch for ch in decomposed if ch.isascii() and ch.isalpha()]
    
    if len(letters) < 3:
        raise ValueError(f"Cannot build a 3-letter code from {city_name!r}")
    
    # Take first 3 letters, uppercase
    abbreviation = "".join(letters[:3]).upper()
    
    return abbreviation

Alternative considered: First 3 consonants

Amsterdam → MST (M-S-T)
Rotterdam → RTT (R-T-T)
Rejected: Less intuitive, harder to reverse-lookup

Integration Points

Files to modify:

src/glam_extractor/geocoding/geonames_lookup.py (NEW)
- SQLite client for city lookups
- Abbreviation generation logic
- Province code mapping (ISO 3166-2)
src/glam_extractor/identifiers/lookups.py (UPDATE)
- Replace get_city_locode() with get_city_abbreviation()
- Update get_ghcid_components_for_dutch_city() to use GeoNames
src/glam_extractor/identifiers/ghcid.py (UPDATE)
- Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
- Update validation regex (same format, 3 uppercase letters)
data/reference/geonames.db (NEW)
- SQLite database with Dutch cities
- Initial build: ~5MB for Netherlands
- Global expansion: ~200MB for all countries
scripts/build_geonames_db.py (NEW)
- Downloads GeoNames data
- Builds SQLite database
- Validates completeness
data/reference/nl_city_locodes.json (DEPRECATE)
- Mark as deprecated in comments
- Keep for backward compatibility (short-term)
- Remove after GHCID migration complete

GHCID Format Changes

Before (UN/LOCODE):

NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City LOCODE (Amsterdam)
│  └────────── Province (North Holland)
└───────────── Country (Netherlands)

After (GeoNames):

NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City abbreviation (Amsterdam, from GeoNames)
│  └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)

Note: Format is identical, only the source of city abbreviation changes.

Migration Impact

Breaking Changes:

⚠️ Some city codes may change (if GeoNames differs from UN/LOCODE)
⚠️ Existing 152 GHCIDs may need regeneration
⚠️ Numeric hashes will change (computed from GHCID string)

Migration strategy:

Document changes in GHCID history
Preserve old GHCIDs in ghcid_original field
Update ghcid_current with new abbreviation
Regenerate ghcid_numeric from new GHCID

Add history entry:

GHCIDHistoryEntry(
    ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
    valid_from="2025-10-01",
    valid_to="2025-11-05",
    reason="Migrated from UN/LOCODE to GeoNames abbreviation"
)

Backward compatibility:

Keep mapping table: old_ghcid → new_ghcid
Support lookups by both old and new identifiers
Phase out old format over 6-12 months
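
A rough sketch of the migration step for a single record: swap the city component, keep the old identifier in a history entry, and record the old → new mapping. The dataclass below is a stand-in that only mirrors the fields shown in the history entry above; the project's real GHCIDHistoryEntry may differ, and migrate_city_code is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class GHCIDHistoryEntry:
    """Stand-in with the fields used in this document (not the real class)."""
    ghcid: str
    valid_from: str
    valid_to: Optional[str]
    reason: str


def migrate_city_code(
    old_ghcid: str, new_city_code: str, valid_from: str, migrated_on: str
) -> Tuple[str, GHCIDHistoryEntry]:
    """Replace the city component and return (new GHCID, history entry)."""
    # maxsplit=4 keeps any hyphens inside the final abbreviation intact
    parts = old_ghcid.split("-", 4)
    if len(parts) != 5:
        raise ValueError(f"Unexpected GHCID shape: {old_ghcid!r}")
    parts[2] = new_city_code
    entry = GHCIDHistoryEntry(
        ghcid=old_ghcid,
        valid_from=valid_from,
        valid_to=migrated_on,
        reason="Migrated from UN/LOCODE to GeoNames abbreviation",
    )
    return "-".join(parts), entry


# Mapping table old_ghcid -> new_ghcid, using the placeholder code from above.
old_to_new: Dict[str, str] = {}
new_ghcid, history = migrate_city_code(
    "NL-NH-XXX-M-RM", "AMS", valid_from="2025-10-01", migrated_on="2025-11-05"
)
old_to_new[history.ghcid] = new_ghcid
print(new_ghcid, history.valid_to)
```

Lookups by the old identifier can then go through old_to_new for as long as the old format is supported.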

Implementation Checklist

Phase 1: Database Setup (Est. 1-2 hours)

Download GeoNames NL.zip dataset
Download admin1CodesASCII.txt for province codes
Create scripts/build_geonames_db.py script
Generate data/reference/geonames.db SQLite database
Validate: All 475 Dutch cities present
Create indexes for fast lookups
Document database schema in code comments

Phase 2: Lookup Module (Est. 1-2 hours)

Create src/glam_extractor/geocoding/geonames_lookup.py
Implement GeoNamesDB class with SQLite connection
Implement get_city_abbreviation(city, country) method
Implement get_city_details(city, country) method
Implement get_geonames_id(city, country) method
Add support for alternate names (multilingual)
Handle city name normalization (case, accents, hyphens)

Phase 3: Integration (Est. 30 mins)

Update src/glam_extractor/identifiers/lookups.py
- Replace _load_json("nl_city_locodes.json") with GeoNames DB
- Update get_city_locode() → get_city_abbreviation()
- Update get_province_code() to use GeoNames admin1_code
Update src/glam_extractor/identifiers/ghcid.py
- Update docstrings (UN/LOCODE → GeoNames)
- Keep validation regex unchanged (format identical)

Phase 4: Testing (Est. 1 hour)

Unit tests for GeoNamesDB class
- Test city lookup (Amsterdam → AMS)
- Test missing city handling
- Test province code mapping
- Test GeoNames ID retrieval
Integration tests with ISIL parser
- Re-run ISIL registry parsing
- Verify GHCID generation rate >95% (vs 41.8% before)
- Compare old vs new GHCIDs for changes
Edge case tests
- Cities with special characters ('s-Hertogenbosch)
- Cities with spaces (Den Haag)
- Multilingual city names

Phase 5: Documentation (Est. 30 mins)

Update docs/plan/global_glam/06-global-identifier-system.md
- Replace UN/LOCODE references with GeoNames
- Document city abbreviation algorithm
Update AGENTS.md
- Add GeoNames database instructions
- Explain migration from UN/LOCODE
Add migration guide: docs/migration/ghcid_locode_to_geonames.md

Phase 6: Validation (Est. 30 mins)

Run full test suite (target: 150+ tests passing)
Generate GHCIDs for all 364 ISIL records
Generate GHCIDs for all 1,351 Dutch orgs
Identify any remaining cities without GeoNames match
Document edge cases and resolution strategy

Total Estimated Time: ~4.5-6.5 hours (sum of the phase estimates above)

Performance Benchmarks

Expected Performance (SQLite Local DB)

Operation	Time	Notes
Database load	10ms	One-time per process
City lookup	<1ms	With index on (name, country_code)
Batch lookup (1,000 cities)	~100ms	0.1ms per city
Batch lookup (10,000 cities)	~1s	0.1ms per city

Comparison to API:

API request latency: 200-500ms per city
Batch 1,000 cities via API: ~200-500 seconds (with rate limits)
Batch 1,000 cities via SQLite: ~0.1 seconds

Winner: SQLite is 2,000-5,000x faster than API.
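
The figures above are expectations rather than measurements. One way to check them locally is a small timing harness such as the sketch below, which fills an in-memory table with synthetic rows, creates the (name, country_code) index, and times repeated lookups. The function name, row counts, and synthetic city names are all assumptions; results will vary by machine and by whether the database is on disk.

```python
import sqlite3
import time


def benchmark_lookups(n_rows: int = 5000, n_lookups: int = 1000) -> float:
    """Return the mean lookup time in milliseconds on synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cities ("
        "geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, country_code TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?)",
        ((i, f"City{i}", "NL") for i in range(n_rows)),
    )
    conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")

    start = time.perf_counter()
    for i in range(n_lookups):
        row = conn.execute(
            "SELECT geonames_id FROM cities WHERE name = ? AND country_code = ?",
            (f"City{i % n_rows}", "NL"),
        ).fetchone()
        assert row is not None  # every synthetic name exists
    elapsed = time.perf_counter() - start

    conn.close()
    return elapsed / n_lookups * 1000.0


print(f"mean lookup: {benchmark_lookups():.4f} ms")
```

Running the same harness against the real data/reference/geonames.db would give numbers that can replace the estimates in the table.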

Storage Requirements

Scope	Database Size	Records
Netherlands only	~5 MB	~5,000 places
Western Europe	~50 MB	~50,000 places
Global (cities >15k pop)	~200 MB	~200,000 cities
Global (all places)	~1.5 GB	~11 million places

Decision: Start with Netherlands only (5MB), expand as needed.

Data Freshness & Maintenance

Update Frequency

GeoNames: Updated daily (new cities, name changes, population updates)

Our database: Update quarterly (every 3 months)

Rationale:

Heritage institutions rarely relocate
City names stable over years
Population changes irrelevant for GHCID
Quarterly updates sufficient for accuracy

Update Process

#!/bin/bash
# scripts/update_geonames_db.sh
set -euo pipefail  # stop on the first failed step

# Download latest GeoNames data (overwrite any previous download)
wget -O NL.zip http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip

# Rebuild database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate
python scripts/validate_geonames_db.py

echo "✅ GeoNames database updated"

Automation: Run via cron job or GitHub Actions on the quarterly schedule.

Comparison: Before & After

Before (UN/LOCODE)

Coverage:

50 Dutch cities (10.5%)
152/364 ISIL records with GHCIDs (41.8%)
212 records cannot generate GHCID (58.2%)

Data source:

JSON file: data/reference/nl_city_locodes.json
Manual curation required
No automation for updates

Limitations:

Cannot expand to global institutions
Missing most small/medium cities
No multilingual support

After (GeoNames)

Coverage:

475+ Dutch cities (100%)
Expected: 346+/364 ISIL records with GHCIDs (>95%)
<20 records without GHCID (edge cases)

Data source:

SQLite database: data/reference/geonames.db
Automated build from GeoNames dumps
Quarterly updates via script

Benefits:

✅ Ready for global expansion (139 conversation files)
✅ Covers all heritage institution locations
✅ Multilingual name support
✅ Coordinates for mapping
✅ Province/state codes included

Alternatives Rejected

1. OpenStreetMap Nominatim

Pros: Free geocoding API, comprehensive
Cons: Rate limits (1 req/sec on the public server), intended for geocoding rather than lookup, offline use requires self-hosting a full OSM import

Why rejected: Same API limitations as GeoNames, less structured data

2. Google Maps Geocoding API

Pros: High quality, fast, reliable
Cons: Costs money ($5 per 1,000 requests), vendor lock-in, requires API key

Why rejected: Not free/open, unsustainable for large-scale extraction

3. Custom City Abbreviation Registry

Pros: Full control, optimized for heritage sector
Cons: Months of manual curation, ongoing maintenance burden, duplication of effort

Why rejected: Reinventing the wheel, GeoNames already exists and is maintained

4. Keep UN/LOCODE + Manual Additions

Pros: Minimal code changes
Cons: Doesn't solve scalability, still manual curation, not global-ready

Why rejected: Doesn't address root problem, technical debt

Risk Assessment

Risk 1: GeoNames Service Discontinuation

Likelihood: Low (operated by GeoNames since 2005, widely used)
Impact: Medium (need alternative source)
Mitigation:

We use offline database, not dependent on API
GeoNames data dumps archived by multiple organizations
Could switch to OpenStreetMap/WikiData if needed

Risk 2: City Name Ambiguity

Scenario: Multiple cities with same name (e.g., "Portland" in USA, UK, Australia)

Likelihood: Medium (common in global dataset)
Impact: Medium (wrong GHCID if not disambiguated)
Mitigation:

Always provide country code in lookups
Use population ranking (larger city preferred)
Validate with province/state code match
Log warnings for ambiguous matches
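
The mitigation steps above could be combined in a single lookup along these lines: filter by country, optionally by admin1 code, rank by population, and log a warning when more than one candidate remains. This is a sketch; resolve_city is a hypothetical name, and the demo rows (IDs, admin1 codes, populations) are placeholders rather than real GeoNames values.

```python
import logging
import sqlite3
from typing import Optional, Tuple

logger = logging.getLogger("geonames_lookup")


def resolve_city(
    conn: sqlite3.Connection,
    city_name: str,
    country_code: str,
    admin1_code: Optional[str] = None,
) -> Optional[Tuple[int, str, str, int]]:
    """Return the best (geonames_id, name, admin1_code, population) match."""
    sql = (
        "SELECT geonames_id, name, admin1_code, population "
        "FROM cities WHERE name = ? AND country_code = ?"
    )
    params = [city_name, country_code]
    if admin1_code is not None:
        sql += " AND admin1_code = ?"
        params.append(admin1_code)
    sql += " ORDER BY population DESC"  # larger place preferred

    rows = conn.execute(sql, params).fetchall()
    if not rows:
        return None
    if len(rows) > 1:
        logger.warning(
            "Ambiguous match for %r in %s: %d candidates, using geonames_id=%s",
            city_name, country_code, len(rows), rows[0][0],
        )
    return rows[0]


# Demo with a reduced table and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, admin1_code TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?)",
    [(1, "Portland", "US", "OR", 600000), (2, "Portland", "US", "ME", 60000)],
)
print(resolve_city(conn, "Portland", "US"))
print(resolve_city(conn, "Portland", "US", admin1_code="ME"))
```

Passing the province/state code whenever the source record has one avoids relying on the population heuristic at all.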

Risk 3: Database Corruption

Likelihood: Low (SQLite very stable)
Impact: High (all GHCIDs incorrect)
Mitigation:

Checksum validation after build
Version control: geonames-v2025.11.db
Keep backup copies
Automated tests verify data integrity
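
Checksum validation can be as simple as storing a SHA-256 digest next to the database after each build and comparing it before use. The sketch below assumes a sidecar file in the same "<hex digest>  <filename>" layout that sha256sum produces; the helper names and the sidecar convention are assumptions, not part of the current design.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(db_path: str, checksum_path: str) -> None:
    """Record the digest of a freshly built database."""
    line = f"{sha256_of_file(db_path)}  {Path(db_path).name}\n"
    Path(checksum_path).write_text(line, encoding="utf-8")


def verify_database(db_path: str, checksum_path: str) -> bool:
    """Return True when the database still matches its recorded digest."""
    recorded = Path(checksum_path).read_text(encoding="utf-8").split()[0]
    return sha256_of_file(db_path) == recorded
```

A typical flow would call write_checksum at the end of the build script and verify_database when the lookup service opens the file.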

Risk 4: Abbreviation Collisions

Scenario: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" → "AMS")

Likelihood: Medium (common with 3-letter codes)
Impact: Medium (institutions in different cities share a city code; their GHCIDs stay distinct only if province, type, or institution abbreviation differs)
Mitigation:

Province code in GHCID separates cities in different provinces: NL-NH-AMS vs NL-XX-AMS
Many collisions are expected to fall in different provinces
If same province (as with Amsterdam and Amstelveen, both in North Holland): Wikidata Q-number resolves (collision resolution)
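
To see how often the same-province case actually occurs, the city list can be grouped by (province, abbreviation) before any GHCIDs are issued. The sketch below uses a simplified abbreviation rule (first three letters, non-letters ignored) and hypothetical helper names; the input pairs are examples only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def abbreviate(city_name: str) -> str:
    """Simplified rule: first three letters, ignoring spaces and punctuation."""
    letters = [ch for ch in city_name if ch.isalpha()]
    return "".join(letters[:3]).upper()


def find_collisions(
    cities: Iterable[Tuple[str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group (province_code, city_name) pairs that share a city code.

    Only groups with more than one distinct city name are returned.
    """
    groups = defaultdict(set)
    for province, name in cities:
        groups[(province, abbreviate(name))].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


collisions = find_collisions(
    [("NH", "Amsterdam"), ("NH", "Amstelveen"), ("ZH", "Rotterdam")]
)
print(collisions)
```

Each key in the result is a province and city code that needs the Wikidata-based collision resolution described in the collision resolution document.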

Future Enhancements

1. Global Expansion (Priority: High)

Goal: Support 139 conversation files covering 60+ countries

Tasks:

Download GeoNames global dataset (cities15000.zip, or allCountries.zip for full coverage)
Build SQLite database for all countries (~200MB for cities >15k population, ~1.5GB for all places)
Test with conversation JSON extraction
Validate GHCID generation for international institutions

Timeline: After Phase 1 complete (Dutch institutions)

2. Multilingual City Name Matching (Priority: Medium)

Goal: Match cities by local names (e.g., "Den Haag" = "The Hague")

Tasks:

Load GeoNames alternateNamesV2.txt into database
Support lookup by any alternate name
Return canonical name + GeoNames ID

Use case: Conversation JSONs mention cities in local language
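
Once alternate names are loaded, a lookup by any name could be a join against the alternate_names table, as in the sketch below. The tables are reduced to the columns the query needs, the rows are placeholders, and find_by_any_name is a hypothetical helper; the case-insensitive match relies on SQLite's NOCASE collation, which only folds ASCII letters.

```python
import sqlite3
from typing import List, Tuple


def find_by_any_name(
    conn: sqlite3.Connection, name: str, country_code: str
) -> List[Tuple[int, str]]:
    """Return (geonames_id, canonical name) for places matching any known name."""
    return conn.execute(
        """
        SELECT DISTINCT c.geonames_id, c.name
        FROM cities AS c
        LEFT JOIN alternate_names AS a ON a.geonames_id = c.geonames_id
        WHERE c.country_code = ?
          AND (c.name = ? COLLATE NOCASE OR a.alternate_name = ? COLLATE NOCASE)
        ORDER BY c.geonames_id
        """,
        (country_code, name, name),
    ).fetchall()


# Demo with reduced tables and placeholder IDs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, country_code TEXT NOT NULL
    );
    CREATE TABLE alternate_names (
        alternate_name_id INTEGER PRIMARY KEY,
        geonames_id INTEGER NOT NULL,
        isolanguage TEXT,
        alternate_name TEXT NOT NULL
    );
    INSERT INTO cities VALUES (1, 'The Hague', 'NL');
    INSERT INTO alternate_names VALUES (10, 1, 'nl', 'Den Haag');
    """
)
print(find_by_any_name(conn, "Den Haag", "NL"))
```

Because the canonical name and GeoNames ID are returned, the abbreviation can always be derived from one agreed spelling regardless of which name the source used.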

3. Geocoding Integration (Priority: Medium)

Goal: Populate location.latitude and location.longitude fields

Tasks:

Add get_coordinates(city, country) method
Integrate with HeritageCustodian.locations list
Support reverse geocoding (lat/lon → city name)

Benefit: Enable map visualizations, geographic analysis

4. Province/State Inference (Priority: Low)

Goal: Auto-detect province from city name (if not provided)

Tasks:

Use GeoNames admin1_code field
Map to ISO 3166-2 province codes
Handle edge cases (disputed territories)

Use case: ISIL registry doesn't always specify province
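
GeoNames admin1 codes are not the same strings as ISO 3166-2 codes (North Holland is "07" in GeoNames but "NH" in the GHCID), so inference needs an explicit mapping table. The sketch below shows the shape of such a mapping. Only the North Holland entry comes from this document; the other two entries are assumptions that should be checked against admin1CodesASCII.txt, and the helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (country code, GeoNames admin1 code) -> ISO 3166-2 subdivision suffix
ADMIN1_TO_ISO: Dict[Tuple[str, str], str] = {
    ("NL", "07"): "NH",  # North Holland (from this document)
    ("NL", "11"): "ZH",  # South Holland (assumption, verify before use)
    ("NL", "09"): "UT",  # Utrecht (assumption, verify before use)
}


def iso_region_code(country_code: str, admin1_code: Optional[str]) -> Optional[str]:
    """Translate a GeoNames admin1 code into the ISO 3166-2 suffix, if known."""
    if not admin1_code:
        return None
    return ADMIN1_TO_ISO.get((country_code.upper(), admin1_code))


def ghcid_prefix(country_code: str, admin1_code: Optional[str]) -> Optional[str]:
    """Build the '{ISO-3166-1}-{ISO-3166-2}' prefix, or None when unmapped."""
    region = iso_region_code(country_code, admin1_code)
    if region is None:
        return None
    return f"{country_code.upper()}-{region}"


print(ghcid_prefix("NL", "07"))
```

Returning None for unmapped codes makes gaps in the table visible instead of silently producing a malformed identifier.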

5. City Similarity Matching (Priority: Low)

Goal: Handle typos, spelling variations

Tasks:

Implement fuzzy matching (Levenshtein distance)
Suggest corrections for unmatched cities
Confidence scoring for matches

Example: "Amsterdm" → "Amsterdam" (confidence: 0.95)
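
A minimal sketch of the fuzzy-matching idea using a plain Levenshtein distance. The confidence formula (one minus distance divided by the longer length) and the 0.8 threshold are assumptions chosen for illustration; with this formula the example above scores roughly 0.89 rather than 0.95, so the final scoring rule would still need to be decided.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def suggest_city(
    query: str, known: Iterable[str], min_confidence: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return the closest known city and a 0..1 confidence, or None."""
    best: Optional[Tuple[str, float]] = None
    for name in known:
        longest = max(len(query), len(name)) or 1
        confidence = 1.0 - levenshtein(query.lower(), name.lower()) / longest
        if confidence >= min_confidence and (best is None or confidence > best[1]):
            best = (name, confidence)
    return best


print(suggest_city("Amsterdm", ["Rotterdam", "Amsterdam", "Utrecht"]))
```

For the full GeoNames table a linear scan like this would be slow, so candidates would likely be narrowed first, for example by country and first letter.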

Success Metrics

Coverage Targets

Metric	Before (UN/LOCODE)	Target (GeoNames)	Status
Dutch city coverage	50 (10.5%)	475 (100%)	Pending
GHCID generation rate (ISIL)	152/364 (41.8%)	>345/364 (>95%)	Pending
GHCID generation rate (Dutch orgs)	Unknown	≥1,284/1,351 (>95%)	Pending
Global city coverage	N/A	>200,000 cities	Future 🔮

Performance Targets

Metric	Target	Status
City lookup latency	<1ms	Pending
Database load time	<10ms	Pending
Database size (NL-only)	<10MB	Pending
Test coverage	>90%	Pending

Quality Targets

Metric	Target	Status
City name accuracy	>99%	Pending
Province code accuracy	>95%	Pending
GeoNames ID linkage	>90%	Pending
Database integrity checks passing	100%	Pending

References

External Resources

GeoNames Official: https://www.geonames.org
GeoNames Downloads: http://download.geonames.org/export/dump/
GeoNames Documentation: https://www.geonames.org/export/
GeoNames Web Services: https://www.geonames.org/export/web-services.html
ISO 3166-2 (Provinces): https://en.wikipedia.org/wiki/ISO_3166-2

Internal Documentation

GHCID Specification: docs/plan/global_glam/06-global-identifier-system.md
Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
Architecture: docs/plan/global_glam/02-architecture.md
Schema: schemas/heritage_custodian.yaml

Implementation Files

GHCID Generator: src/glam_extractor/identifiers/ghcid.py
Lookups: src/glam_extractor/identifiers/lookups.py
GeoNames Client (to be created): src/glam_extractor/geocoding/geonames_lookup.py
Database (to be created): data/reference/geonames.db
Build Script (to be created): scripts/build_geonames_db.py

Approval & Sign-off

Decision Made: 2025-11-05
Approved By: GLAM Data Extraction Project Team
Implementation Owner: GeoNames Integration Team
Review Date: 2025-12-05 (1 month post-implementation)

Status: ✅ Design Approved - Ready for Implementation

Last Updated: 2025-11-05
Version: 1.0
Next Review: After Phase 1 implementation complete
