glam/docs/plan/global_glam/08-geonames-integration.md

GeoNames Integration for GHCID City Abbreviations

Version: 1.0
Date: 2025-11-05
Status: Design Complete - Implementation Pending
Decision: Option B - GeoNames Local Database


Executive Summary

To generate Global Heritage Custodian Identifiers (GHCIDs) for heritage institutions worldwide, we need a comprehensive source of city abbreviations. This document explains why we are replacing UN/LOCODE with a local GeoNames database and provides implementation guidelines.

Key Decision: Use an offline GeoNames database (SQLite) rather than the GeoNames API or UN/LOCODE.


Problem Statement

Current Limitation: UN/LOCODE Coverage

The GHCID format requires 3-letter city codes:

{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
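
As a rough illustration of the format above, a GHCID string can be split into its five components. The `GHCID` dataclass, the `parse_ghcid` function, and the regular expression are hypothetical; the pattern is inferred from the single example `NL-NH-AMS-M-RM` and may be stricter or looser than the real specification.

```python
import re
from dataclasses import dataclass

# Assumed pattern, inferred only from the example NL-NH-AMS-M-RM.
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]{1,3})-(?P<city>[A-Z]{3})"
    r"-(?P<inst_type>[A-Z])-(?P<abbreviation>[A-Z0-9]+)$"
)


@dataclass(frozen=True)
class GHCID:
    country: str       # ISO 3166-1 country code, e.g. "NL"
    region: str        # ISO 3166-2 subdivision part, e.g. "NH"
    city: str          # 3-letter city code, e.g. "AMS"
    inst_type: str     # institution type, e.g. "M" for museum
    abbreviation: str  # institution abbreviation, e.g. "RM"

    def __str__(self) -> str:
        return "-".join(
            [self.country, self.region, self.city, self.inst_type, self.abbreviation]
        )


def parse_ghcid(value: str) -> GHCID:
    """Split a GHCID string into components, or raise ValueError."""
    match = GHCID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Not a valid GHCID: {value!r}")
    return GHCID(**match.groupdict())
```

Keeping the components in a small immutable object makes it easy to swap only the city segment during migration while leaving the other segments untouched.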

The current approach uses UN/LOCODE (United Nations Code for Trade and Transport Locations):

  • Coverage: Only 50 Dutch cities in data/reference/nl_city_locodes.json
  • Required: 475 Dutch cities (based on Dutch heritage datasets)
  • Gap: 89.5% of cities missing from lookup table
  • Impact: Only 41.8% GHCID generation rate (152/364 ISIL records)
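
The percentages above follow directly from the counts. A minimal check, with the helper name `percentage` chosen only for this sketch:

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


city_coverage = percentage(50, 475)        # cities present in the lookup table
city_gap = round(100 - city_coverage, 1)   # cities missing from the lookup table
ghcid_rate = percentage(152, 364)          # ISIL records that received a GHCID
ghcid_missing = percentage(364 - 152, 364) # ISIL records without a GHCID

print(city_coverage, city_gap, ghcid_rate, ghcid_missing)
```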

Why UN/LOCODE Falls Short

  1. Trade-focused: UN/LOCODE prioritizes ports, airports, rail terminals
  2. Major cities only: Small/medium cities often omitted
  3. Incomplete coverage: Many heritage-rich towns excluded
  4. Not heritage-centric: No correlation with museum/archive density
  5. Limited maintenance: Updates infrequent, focused on logistics

Example Gap

Dutch ISIL registry includes institutions in:

  • Aalten (in UN/LOCODE as AAT)
  • Achtkarspelen (not in UN/LOCODE)
  • Almkerk (not in UN/LOCODE)
  • Ameland (not in UN/LOCODE)

Result: 212 out of 364 records (58.2%) could not generate GHCIDs.


Solution: GeoNames Geographic Database

What is GeoNames?

GeoNames (https://www.geonames.org) is a comprehensive geographical database covering:

  • 11+ million place names worldwide
  • Hierarchical data: City → Province/State → Country
  • Multilingual names: Local + alternative names in multiple languages
  • Geographic coordinates: Latitude/longitude for all places
  • Population data: Size/importance ranking
  • Administrative divisions: ISO 3166-2 integration
  • Free & open: Creative Commons Attribution license (credit to GeoNames required)

Why GeoNames for GHCID?

| Criterion | UN/LOCODE | GeoNames | Winner |
|---|---|---|---|
| Dutch city coverage | 50 cities (10.5%) | 475+ cities (100%) | GeoNames |
| Global coverage | ~100,000 locations | 11+ million | GeoNames |
| Heritage relevance | Trade/transport focus | All populated places | GeoNames |
| Coordinates included | Some | All | GeoNames |
| Hierarchical data | No | Yes (city→province→country) | GeoNames |
| Update frequency | Twice a year | Daily | GeoNames |
| API availability | No official API | Yes (free tier) | GeoNames |
| Offline usage | CSV dumps | Tab-delimited text dumps | Tie |
| Cost | Free | Free | Tie |

Conclusion: GeoNames is superior for heritage institution geolocation.


Implementation Options Considered

Option A: GeoNames API Integration

Approach: Query GeoNames web API in real-time for city lookups.

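# Sketch (requires the third-party "requests" package)
from typing import Optional

import requests

class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor"
            },
            timeout=10,
        )
        response.raise_for_status()
        matches = response.json().get("geonames", [])
        if not matches:
            return None  # no such city in GeoNames
        # Generate 3-letter abbreviation from city name
        return {
            "abbreviation": city_name[:3].upper(),
            "geonames_id": matches[0]["geonameId"],
        }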

Pros:

  • No local storage required
  • Always up-to-date data
  • Fast to implement (~2 hours)
  • Simple integration

Cons:

  • Rate limits: 2,000 requests/hour (free tier), 30,000/day
  • Requires network connectivity
  • Latency on each lookup (~200-500ms)
  • Single point of failure (API downtime)
  • Processing 1,351 Dutch orgs in one run uses most of the hourly quota; retries or repeated runs could hit the rate limit

Verdict: Good for prototyping, not suitable for production.


Option B: GeoNames Local Database (SELECTED)

Approach: Download GeoNames data dump once, store in SQLite database for offline lookups.

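# Sketch
import sqlite3
from typing import Optional

class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row  # allow access by column name

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ?",
            (city_name, country)
        )
        result = cursor.fetchone()
        if result is None:
            return None  # city not in the database
        return {
            "abbreviation": self._generate_abbreviation(result["name"]),
            "geonames_id": result["geonames_id"],
            "admin1_code": result["admin1_code"],
            "latitude": result["latitude"],
            "longitude": result["longitude"],
        }

    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name, uppercase
        return city_name[:3].upper()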

Pros:

  • No rate limits - unlimited local queries
  • Fast lookups - <1ms with indexes
  • Offline operation - no network dependency
  • Predictable performance - no API downtime risk
  • Cost-effective - free download, ~200MB storage
  • Reproducible - same data snapshot across environments
  • Cacheable - SQLite file easy to distribute

Cons:

  • ⚠️ Initial setup - requires download + database creation (~30 mins)
  • ⚠️ Storage - ~200MB for worldwide data, ~5MB for NL-only
  • ⚠️ Staleness - need periodic updates (monthly/quarterly)
  • ⚠️ Implementation time - ~4 hours total

Verdict: BEST for production - predictable, fast, no network dependencies.


Option C: Hybrid Approach (API + Local Cache)

Approach: Use local database, fallback to API for missing cities, cache results.

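# Sketch: local_db and api_client are the Option B and Option A clients
from typing import Optional

class GeoNamesHybrid:
    def __init__(self, local_db, api_client):
        self.local_db = local_db
        self.api_client = api_client

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result

        # Fallback to API
        result = self.api_client.lookup(city_name, country)

        # Cache to local DB (only when the API found a match)
        if result:
            self.local_db.insert(result)
        return result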

Pros:

  • Best of both worlds
  • Auto-updates for missing cities
  • Resilient to data gaps

Cons:

  • Most complex implementation
  • Still subject to rate limits (for misses)
  • Two failure modes (DB + API)
  • Longer implementation time (~5-6 hours)

Verdict: Over-engineered for current needs, defer to future.


Selected Solution: Option B Implementation Plan

Architecture

┌─────────────────────────────────────────────────────┐
│                 GHCID Generator                      │
│  (src/glam_extractor/identifiers/ghcid.py)          │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│            GeoNames Lookup Service                   │
│  (src/glam_extractor/geocoding/geonames_lookup.py)  │
│                                                      │
│  - get_city_abbreviation(city, country)              │
│  - get_city_details(city, country)                   │
│  - get_geonames_id(city, country)                    │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│               SQLite Database                        │
│       (data/reference/geonames.db)                   │
│                                                      │
│  Tables:                                             │
│  - cities (geonames_id, name, country_code, ...)     │
│  - admin1_codes (province mappings)                  │
│  - alternate_names (multilingual)                    │
└──────────────────────────────────────────────────────┘

Database Schema

-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT,          -- GeoNames admin1 code (e.g. "07"); mapped to ISO 3166-2 separately
    admin2_code TEXT,          -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT,         -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);

-- Index for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);

-- Province/state codes (GeoNames admin1 keys, used to derive ISO 3166-2 codes)
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY,     -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL,        -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);

-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT,          -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);

CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);

Data Source

GeoNames Download URL:

http://download.geonames.org/export/dump/

Files needed:

  1. Cities: cities15000.zip (cities with >15,000 population) - ~25MB
    • Alternative: allCountries.zip (all places) - ~350MB
  2. Admin codes: admin1CodesASCII.txt - province/state mappings
  3. Alternate names: alternateNamesV2.zip - multilingual names

Netherlands-specific:

  • NL.zip - All Dutch places (~200KB uncompressed)
  • Includes all 475+ cities in Dutch heritage datasets
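
The country dumps are zip archives that contain a tab-delimited text file. A minimal sketch for reading rows straight from the archive without unpacking it to disk follows; the function name `iter_geonames_rows` is hypothetical, and the assumption that `NL.zip` contains a member named `NL.txt` is based on the file names used elsewhere in this document.

```python
import io
import zipfile
from typing import Iterator, List


def iter_geonames_rows(zip_path, member: str = "NL.txt") -> Iterator[List[str]]:
    """Yield one list of column values per line of a GeoNames dump in a zip.

    zip_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # GeoNames dumps are UTF-8, tab-delimited, and unquoted.
            for line in io.TextIOWrapper(raw, encoding="utf-8", newline=""):
                yield line.rstrip("\r\n").split("\t")
```

Splitting on tabs directly avoids any quote handling, which matters because place names can contain literal quote characters.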

Database Build Process

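# scripts/build_geonames_db.py

import argparse
import csv
import sqlite3


def create_tables(cursor):
    """Create the cities and admin1_codes tables (columns as in the schema above)."""
    cursor.executescript("""
        CREATE TABLE IF NOT EXISTS cities (
            geonames_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            ascii_name TEXT NOT NULL,
            country_code TEXT NOT NULL,
            admin1_code TEXT,
            admin2_code TEXT,
            latitude REAL NOT NULL,
            longitude REAL NOT NULL,
            population INTEGER,
            elevation INTEGER,
            feature_code TEXT,
            timezone TEXT,
            modification_date TEXT
        );
        CREATE TABLE IF NOT EXISTS admin1_codes (
            code TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            ascii_name TEXT NOT NULL,
            geonames_id INTEGER
        );
    """)


def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db"
):
    """
    Build SQLite database from GeoNames text files.

    Steps:
    1. Create SQLite database and tables
    2. Parse GeoNames TSV files (downloaded and unzipped beforehand)
    3. Insert data, then create indexes
    4. Report row counts (completeness is validated separately)
    """
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()

    # Create tables (see schema above)
    create_tables(cursor)

    # Parse cities file. QUOTE_NONE: GeoNames fields are never quoted,
    # and names may contain literal double quotes.
    with open(geonames_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            # OR REPLACE keeps a rebuild over an existing database idempotent
            cursor.execute(
                "INSERT OR REPLACE INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row[0],           # geonames_id
                    row[1],           # name
                    row[2],           # ascii_name
                    row[8],           # country_code
                    row[10],          # admin1_code
                    row[11],          # admin2_code
                    row[4],           # latitude
                    row[5],           # longitude
                    row[14],          # population
                    row[15] or None,  # elevation (often empty; column 16 is the DEM value)
                    row[7],           # feature_code
                    row[17],          # timezone
                    row[18],          # modification_date
                )
            )

    # Parse admin1 codes (province mappings)
    with open(admin1_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT OR REPLACE INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], row[3])
            )

    # Create indexes (IF NOT EXISTS: safe when the schema already created them)
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")

    city_count = cursor.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    country_count = cursor.execute("SELECT COUNT(DISTINCT country_code) FROM cities").fetchone()[0]

    conn.commit()
    conn.close()

    print(f"✅ GeoNames database created: {output_db}")
    print(f"   Cities: {city_count}")
    print(f"   Countries: {country_count}")


if __name__ == "__main__":
    # --input and --output match the update script; --admin1 is an added option
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="NL.txt")
    parser.add_argument("--admin1", default="admin1CodesASCII.txt")
    parser.add_argument("--output", default="data/reference/geonames.db")
    args = parser.parse_args()
    build_geonames_database(args.input, args.admin1, args.output)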

City Abbreviation Strategy

Decision: Use first 3 letters of city name (uppercase).

Rationale:

  1. Simplicity: Easy to understand and implement
  2. Compatibility: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
  3. Predictability: No complex rules, consistent output
  4. Readability: Often recognizable (Rotterdam → ROT, Utrecht → UTR)

Algorithm:

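import unicodedata

def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.

    Rules:
    1. Fold accents and remove spaces/punctuation (keep ASCII letters only)
    2. Take first 3 letters
    3. Convert to uppercase

    Examples:
        Amsterdam → AMS
        Rotterdam → ROT
        Den Haag → DEN
        's-Hertogenbosch → SHE
        Aalst → AAL
    """
    # Normalize: split accented letters into base letter + mark (é → e + ´),
    # then keep only ASCII letters, which drops marks, spaces and punctuation.
    # Letters without a decomposition (e.g. Ø) are dropped; pass the GeoNames
    # ascii_name instead of the local name to avoid that.
    decomposed = unicodedata.normalize("NFKD", city_name)
    letters = "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())

    # Take first 3 letters, uppercase
    abbreviation = letters[:3].upper()

    return abbreviation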

Alternative considered: First 3 consonants

  • Amsterdam → MST (M-S-T)
  • Rotterdam → RTT (R-T-T)
  • Rejected: Less intuitive, harder to reverse-lookup

Integration Points

Files to modify:

  1. src/glam_extractor/geocoding/geonames_lookup.py (NEW)
    • SQLite client for city lookups
    • Abbreviation generation logic
    • Province code mapping (ISO 3166-2)

  2. src/glam_extractor/identifiers/lookups.py (UPDATE)
    • Replace get_city_locode() with get_city_abbreviation()
    • Update get_ghcid_components_for_dutch_city() to use GeoNames

  3. src/glam_extractor/identifiers/ghcid.py (UPDATE)
    • Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
    • Keep validation regex unchanged (same format, 3 uppercase letters)

  4. data/reference/geonames.db (NEW)
    • SQLite database with Dutch cities
    • Initial build: ~5MB for Netherlands
    • Global expansion: ~200MB for all countries

  5. scripts/build_geonames_db.py (NEW)
    • Downloads GeoNames data
    • Builds SQLite database
    • Validates completeness

  6. data/reference/nl_city_locodes.json (DEPRECATE)
    • Mark as deprecated in comments
    • Keep for backward compatibility (short-term)
    • Remove after GHCID migration complete

GHCID Format Changes

Before (UN/LOCODE):

NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City LOCODE (Amsterdam)
│  └────────── Province (North Holland)
└───────────── Country (Netherlands)

After (GeoNames):

NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City abbreviation (Amsterdam, from GeoNames)
│  └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)

Note: Format is identical, only the source of city abbreviation changes.

Migration Impact

Breaking Changes:

  • ⚠️ Some city codes may change (if GeoNames differs from UN/LOCODE)
  • ⚠️ Existing 152 GHCIDs may need regeneration
  • ⚠️ Numeric hashes will change for every regenerated GHCID (they are computed from the GHCID string)

Migration strategy:

  1. Document changes in GHCID history
  2. Preserve old GHCIDs in ghcid_original field
  3. Update ghcid_current with new abbreviation
  4. Regenerate ghcid_numeric from new GHCID
  5. Add history entry:
    GHCIDHistoryEntry(
        ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
        valid_from="2025-10-01",
        valid_to="2025-11-05",
        reason="Migrated from UN/LOCODE to GeoNames abbreviation"
    )
    

Backward compatibility:

  • Keep mapping table: old_ghcid → new_ghcid
  • Support lookups by both old and new identifiers
  • Phase out old format over 6-12 months
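
The migration steps and the old-to-new mapping table could be sketched as follows. `GHCIDHistoryEntry` mirrors the fields shown above; `CustodianRecord`, `migrate_ghcid`, `resolve_ghcid`, and the plain-dict alias table are hypothetical stand-ins for the real record model. Regenerating `ghcid_numeric` is left out because the hash function is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GHCIDHistoryEntry:
    ghcid: str
    valid_from: str
    valid_to: Optional[str]
    reason: str


@dataclass
class CustodianRecord:
    # Hypothetical minimal record; the real model has many more fields.
    ghcid_current: str
    ghcid_original: Optional[str] = None
    ghcid_history: List[GHCIDHistoryEntry] = field(default_factory=list)


def migrate_ghcid(
    record: CustodianRecord,
    new_ghcid: str,
    valid_from: str,
    migrated_on: str,
    aliases: Dict[str, str],
) -> bool:
    """Switch a record to new_ghcid; return False when nothing changed."""
    old = record.ghcid_current
    if old == new_ghcid:
        return False
    if record.ghcid_original is None:
        record.ghcid_original = old  # preserve the first identifier ever issued
    record.ghcid_history.append(
        GHCIDHistoryEntry(
            ghcid=old,
            valid_from=valid_from,
            valid_to=migrated_on,
            reason="Migrated from UN/LOCODE to GeoNames abbreviation",
        )
    )
    record.ghcid_current = new_ghcid
    aliases[old] = new_ghcid  # mapping table: old_ghcid -> new_ghcid
    return True


def resolve_ghcid(ghcid: str, aliases: Dict[str, str]) -> str:
    """Accept either an old or a new identifier and return the current one."""
    return aliases.get(ghcid, ghcid)
```

Records whose city code is identical under both schemes return `False` and get no history entry, so only genuinely changed identifiers end up in the mapping table.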

Implementation Checklist

Phase 1: Database Setup (Est. 1-2 hours)

  • Download GeoNames NL.zip dataset
  • Download admin1CodesASCII.txt for province codes
  • Create scripts/build_geonames_db.py script
  • Generate data/reference/geonames.db SQLite database
  • Validate: All 475 Dutch cities present
  • Create indexes for fast lookups
  • Document database schema in code comments

Phase 2: Lookup Module (Est. 1-2 hours)

  • Create src/glam_extractor/geocoding/geonames_lookup.py
  • Implement GeoNamesDB class with SQLite connection
  • Implement get_city_abbreviation(city, country) method
  • Implement get_city_details(city, country) method
  • Implement get_geonames_id(city, country) method
  • Add support for alternate names (multilingual)
  • Handle city name normalization (case, accents, hyphens)
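
For the normalization item above, one possible approach is to compare names through a canonical key rather than the raw string. This is a sketch only; the function name `normalize_city_name` and the exact folding rules are assumptions, not the module's agreed behavior.

```python
import unicodedata


def normalize_city_name(name: str) -> str:
    """Build a comparison key: accent-free, case-folded, letters and digits only."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() handles case; isalnum() drops spaces, hyphens, apostrophes.
    return "".join(ch for ch in without_marks.casefold() if ch.isalnum())
```

With such a key, "'s-Hertogenbosch", "s Hertogenbosch" and "S-HERTOGENBOSCH" all compare equal, which would let the lookup tolerate the spelling variants found in registry data.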

Phase 3: Integration (Est. 30 mins)

  • Update src/glam_extractor/identifiers/lookups.py
    • Replace _load_json("nl_city_locodes.json") with GeoNames DB
    • Rename get_city_locode() → get_city_abbreviation()
    • Update get_province_code() to use GeoNames admin1_code
  • Update src/glam_extractor/identifiers/ghcid.py
    • Update docstrings (UN/LOCODE → GeoNames)
    • Keep validation regex unchanged (format identical)

Phase 4: Testing (Est. 1 hour)

  • Unit tests for GeoNamesDB class
    • Test city lookup (Amsterdam → AMS)
    • Test missing city handling
    • Test province code mapping
    • Test GeoNames ID retrieval
  • Integration tests with ISIL parser
    • Re-run ISIL registry parsing
    • Verify GHCID generation rate >95% (vs 41.8% before)
    • Compare old vs new GHCIDs for changes
  • Edge case tests
    • Cities with special characters ('s-Hertogenbosch)
    • Cities with spaces (Den Haag)
    • Multilingual city names

Phase 5: Documentation (Est. 30 mins)

  • Update docs/plan/global_glam/06-global-identifier-system.md
    • Replace UN/LOCODE references with GeoNames
    • Document city abbreviation algorithm
  • Update AGENTS.md
    • Add GeoNames database instructions
    • Explain migration from UN/LOCODE
  • Add migration guide: docs/migration/ghcid_locode_to_geonames.md

Phase 6: Validation (Est. 30 mins)

  • Run full test suite (target: 150+ tests passing)
  • Generate GHCIDs for all 364 ISIL records
  • Generate GHCIDs for all 1,351 Dutch orgs
  • Identify any remaining cities without GeoNames match
  • Document edge cases and resolution strategy

Total Estimated Time: ~4.5-6.5 hours (sum of the six phase estimates)


Performance Benchmarks

Expected Performance (SQLite Local DB)

| Operation | Time | Notes |
|---|---|---|
| Database load | 10ms | One-time per process |
| City lookup | <1ms | With index on (name, country_code) |
| Batch lookup (1,000 cities) | ~100ms | 0.1ms per city |
| Batch lookup (10,000 cities) | ~1s | 0.1ms per city |

Comparison to API:

  • API request latency: 200-500ms per city
  • Batch 1,000 cities via API: ~200-500 seconds (with rate limits)
  • Batch 1,000 cities via SQLite: ~0.1 seconds

Winner: SQLite is 2,000-5,000x faster than API.
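
The figures above are estimates. A small timing harness such as the following could be used to measure indexed lookups on real hardware; it is a sketch that uses an in-memory table with synthetic city names (an assumption made so it runs without the GeoNames data), and `time_lookups` is a hypothetical helper name.

```python
import sqlite3
import time


def time_lookups(n_cities: int = 1000) -> float:
    """Return the mean time in seconds of one indexed lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cities (name TEXT NOT NULL, country_code TEXT NOT NULL)"
    )
    names = [f"City{i:05d}" for i in range(n_cities)]  # synthetic names
    conn.executemany("INSERT INTO cities VALUES (?, 'NL')", [(n,) for n in names])
    conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")

    start = time.perf_counter()
    for name in names:
        conn.execute(
            "SELECT name FROM cities WHERE name = ? AND country_code = ?",
            (name, "NL"),
        ).fetchone()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed / n_cities


if __name__ == "__main__":
    per_lookup = time_lookups()
    print(f"{per_lookup * 1000:.4f} ms per lookup")
```

An on-disk database with the full schema will behave differently from this in-memory table, so the result is only a lower bound on what to expect in production.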

Storage Requirements

| Scope | Database Size | Records |
|---|---|---|
| Netherlands only | ~5 MB | ~5,000 places |
| Western Europe | ~50 MB | ~50,000 places |
| Global (cities >15k pop) | ~200 MB | ~200,000 cities |
| Global (all places) | ~1.5 GB | ~11 million places |

Decision: Start with Netherlands only (5MB), expand as needed.


Data Freshness & Maintenance

Update Frequency

GeoNames: Updated daily (new cities, name changes, population updates)

Our database: Update quarterly (every 3 months)

Rationale:

  • Heritage institutions rarely relocate
  • City names stable over years
  • Population changes irrelevant for GHCID
  • Quarterly updates sufficient for accuracy

Update Process

#!/bin/bash
# scripts/update_geonames_db.sh
# Stop on the first failed command, unset variable, or failed pipe stage
set -euo pipefail

# Download latest GeoNames data (overwrite any previous download)
wget -O NL.zip http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip

# Rebuild database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate
python scripts/validate_geonames_db.py

echo "✅ GeoNames database updated"

Automation: Run via cron job or GitHub Actions on the quarterly schedule above.


Comparison: Before & After

Before (UN/LOCODE)

Coverage:

  • 50 Dutch cities (10.5%)
  • 152/364 ISIL records with GHCIDs (41.8%)
  • 212 records cannot generate GHCID (58.2%)

Data source:

  • JSON file: data/reference/nl_city_locodes.json
  • Manual curation required
  • No automation for updates

Limitations:

  • Cannot expand to global institutions
  • Missing most small/medium cities
  • No multilingual support

After (GeoNames)

Coverage:

  • 475+ Dutch cities (100%)
  • Expected: 346+/364 ISIL records with GHCIDs (>95%)
  • <20 records without GHCID (edge cases)

Data source:

  • SQLite database: data/reference/geonames.db
  • Automated build from GeoNames dumps
  • Quarterly updates via script

Benefits:

  • Ready for global expansion (139 conversation files)
  • Covers all heritage institution locations
  • Multilingual name support
  • Coordinates for mapping
  • Province/state codes included

Alternatives Rejected

1. OpenStreetMap Nominatim

Pros: Free geocoding API, comprehensive
Cons: Rate limits (1 req/sec), intended for geocoding not lookup, no offline dumps

Why rejected: Same limitations as the GeoNames API (Option A), less structured data

2. Google Maps Geocoding API

Pros: High quality, fast, reliable
Cons: Costs money ($5 per 1,000 requests), vendor lock-in, requires API key

Why rejected: Not free/open, unsustainable for large-scale extraction

3. Custom City Abbreviation Registry

Pros: Full control, optimized for heritage sector
Cons: Months of manual curation, ongoing maintenance burden, duplication of effort

Why rejected: Reinventing the wheel, GeoNames already exists and is maintained

4. Keep UN/LOCODE + Manual Additions

Pros: Minimal code changes
Cons: Doesn't solve scalability, still manual curation, not global-ready

Why rejected: Doesn't address root problem, technical debt


Risk Assessment

Risk 1: GeoNames Service Discontinuation

Likelihood: Low (operated by GeoNames since 2005, widely used)
Impact: Medium (need alternative source)
Mitigation:

  • We use offline database, not dependent on API
  • GeoNames data dumps archived by multiple organizations
  • Could switch to OpenStreetMap/Wikidata if needed

Risk 2: City Name Ambiguity

Scenario: Multiple cities with same name (e.g., "Portland" in USA, UK, Australia)

Likelihood: Medium (common in global dataset)
Impact: Medium (wrong GHCID if not disambiguated)
Mitigation:

  • Always provide country code in lookups
  • Use population ranking (larger city preferred)
  • Validate with province/state code match
  • Log warnings for ambiguous matches
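
A lookup that applies these mitigations might look roughly like this. The function name `find_city` and the fallback behavior (use all candidates when no province matches) are assumptions; the table and column names follow the schema in this document.

```python
import logging
import sqlite3
from typing import Optional

logger = logging.getLogger(__name__)


def find_city(
    conn: sqlite3.Connection,
    name: str,
    country: str,
    admin1_code: Optional[str] = None,
) -> Optional[sqlite3.Row]:
    """Return the best match: province match first, then largest population."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # access columns by name
    rows = cursor.execute(
        "SELECT geonames_id, name, admin1_code, population FROM cities "
        "WHERE name = ? AND country_code = ? "
        "ORDER BY COALESCE(population, 0) DESC",
        (name, country),
    ).fetchall()

    if admin1_code is not None:
        in_province = [r for r in rows if r["admin1_code"] == admin1_code]
        rows = in_province or rows  # assumed fallback: ignore province if no match

    if len(rows) > 1:
        logger.warning(
            "Ambiguous match for %s (%s): %d candidates, using geonames_id=%s",
            name, country, len(rows), rows[0]["geonames_id"],
        )
    return rows[0] if rows else None
```

Passing the province whenever the source record provides one removes most ambiguity; the population ordering is only a tie-breaker of last resort.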

Risk 3: Database Corruption

Likelihood: Low (SQLite very stable)
Impact: High (all GHCIDs incorrect)
Mitigation:

  • Checksum validation after build
  • Version control: geonames-v2025.11.db
  • Keep backup copies
  • Automated tests verify data integrity
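
Checksum validation and a basic integrity test can both be done with the standard library. The sketch below is one possible shape; the function names are hypothetical, and where the expected checksum is stored (for example next to the versioned database file) is left open.

```python
import hashlib
import sqlite3
from typing import Optional


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def database_is_intact(path: str, expected_sha256: Optional[str] = None) -> bool:
    """Check the optional checksum, then run SQLite's own integrity check."""
    if expected_sha256 is not None and sha256_of_file(path) != expected_sha256:
        return False
    conn = sqlite3.connect(path)
    try:
        (status,) = conn.execute("PRAGMA integrity_check").fetchone()
    except sqlite3.DatabaseError:
        return False  # e.g. the file is not a SQLite database at all
    finally:
        conn.close()
    return status == "ok"
```

The checksum detects any byte-level change to a released snapshot, while `PRAGMA integrity_check` catches structural damage even when no reference checksum is available.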

Risk 4: Abbreviation Collisions

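Scenario: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" → "AMS")

Likelihood: Medium (common with 3-letter codes)
Impact: Medium (same city code; GHCIDs must be differentiated by the other segments)
Mitigation:

  • Province code in GHCID separates cities in different provinces: NL-NH-AMS vs NL-XX-AMS
  • Most collisions will be in different provinces, but not all: Amsterdam and Amstelveen are both in North Holland, so both yield NL-NH-AMS
  • If same province: the type and institution abbreviation segments may still differ; when the full GHCID collides, the Wikidata Q-number resolves it (collision resolution)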
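
To see how often same-province collisions actually occur, the city list can be grouped by province and abbreviation. This is a sketch: `find_abbreviation_collisions` is a hypothetical helper, it uses a simplified letters-only abbreviation rule, and the province codes in the usage example are ISO 3166-2 codes supplied by hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_abbreviation_collisions(
    cities: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Group (province_code, city_name) pairs that share a 3-letter code.

    Returns only the groups with more than one city.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for province, name in cities:
        letters = "".join(ch for ch in name if ch.isalpha())
        groups[(province, letters[:3].upper())].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


collisions = find_abbreviation_collisions(
    [("NH", "Amsterdam"), ("NH", "Amstelveen"), ("ZH", "Rotterdam")]
)
print(collisions)
```

Running this over the full Dutch city list during Phase 6 would give a concrete count of the cases that need the collision-resolution path.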

Future Enhancements

1. Global Expansion (Priority: High)

Goal: Support 139 conversation files covering 60+ countries

Tasks:

  • Download GeoNames global dataset (allCountries.zip)
  • Build SQLite database for all countries (~200MB for cities only, ~1.5 GB for all places)
  • Test with conversation JSON extraction
  • Validate GHCID generation for international institutions

Timeline: After Phase 1 complete (Dutch institutions)

2. Multilingual City Name Matching (Priority: Medium)

Goal: Match cities by local names (e.g., "Den Haag" = "The Hague")

Tasks:

  • Load GeoNames alternateNamesV2.txt into database
  • Support lookup by any alternate name
  • Return canonical name + GeoNames ID

Use case: Conversation JSONs mention cities in local language
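
With the `alternate_names` table loaded, a lookup by any known name could be expressed as a single join. This is a sketch against the schema described earlier; `lookup_by_any_name` is a hypothetical function name, and `COLLATE NOCASE` only folds ASCII case, so accented variants would still need separate normalization.

```python
import sqlite3
from typing import Optional


def lookup_by_any_name(
    conn: sqlite3.Connection, name: str, country: str
) -> Optional[dict]:
    """Match on the canonical name or any alternate name; return the canonical one."""
    row = conn.execute(
        """
        SELECT c.geonames_id, c.name
        FROM cities AS c
        LEFT JOIN alternate_names AS a ON a.geonames_id = c.geonames_id
        WHERE c.country_code = ?
          AND (c.name = ? COLLATE NOCASE OR a.alternate_name = ? COLLATE NOCASE)
        ORDER BY COALESCE(c.population, 0) DESC
        LIMIT 1
        """,
        (country, name, name),
    ).fetchone()
    if row is None:
        return None
    return {"geonames_id": row[0], "name": row[1]}
```

Returning the canonical name together with the GeoNames ID means the abbreviation is always derived from one spelling, regardless of which language the source text used.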

3. Geocoding Integration (Priority: Medium)

Goal: Populate location.latitude and location.longitude fields

Tasks:

  • Add get_coordinates(city, country) method
  • Integrate with HeritageCustodian.locations list
  • Support reverse geocoding (lat/lon → city name)

Benefit: Enable map visualizations, geographic analysis
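
A first version of the coordinate lookup and of reverse geocoding could be as simple as the following sketch. `get_coordinates` is the method name proposed above; `haversine_km` and `nearest_city` are hypothetical helpers. The reverse lookup scans all cities of one country, which is an assumption that holds for a Netherlands-sized table but would need a spatial index at global scale.

```python
import math
import sqlite3
from typing import Optional, Tuple


def get_coordinates(
    conn: sqlite3.Connection, city: str, country: str
) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) of the most populous match, or None."""
    row = conn.execute(
        "SELECT latitude, longitude FROM cities "
        "WHERE name = ? AND country_code = ? "
        "ORDER BY COALESCE(population, 0) DESC LIMIT 1",
        (city, country),
    ).fetchone()
    return None if row is None else (row[0], row[1])


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_city(
    conn: sqlite3.Connection, lat: float, lon: float, country: str
) -> Optional[str]:
    """Reverse geocoding by linear scan over one country's cities."""
    rows = conn.execute(
        "SELECT name, latitude, longitude FROM cities WHERE country_code = ?",
        (country,),
    ).fetchall()
    if not rows:
        return None
    return min(rows, key=lambda r: haversine_km(lat, lon, r[1], r[2]))[0]
```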

4. Province/State Inference (Priority: Low)

Goal: Auto-detect province from city name (if not provided)

Tasks:

  • Use GeoNames admin1_code field
  • Map to ISO 3166-2 province codes
  • Handle edge cases (disputed territories)

Use case: ISIL registry doesn't always specify province
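
Because GeoNames admin1 keys such as "NL.07" are not the same strings as the ISO 3166-2 codes used in GHCIDs, the inference needs an explicit mapping. The sketch below is deliberately partial: only the "NL.07" entry for North Holland comes from this document, and the remaining provinces would have to be filled in and verified against admin1CodesASCII.txt. All names are hypothetical.

```python
from typing import Dict, Optional

# Partial, illustrative mapping from GeoNames admin1 keys to ISO 3166-2 codes.
# Only the North Holland entry is taken from this document; extend and verify
# the other provinces before relying on it.
ADMIN1_TO_ISO: Dict[str, str] = {
    "NL.07": "NL-NH",
}


def province_for_city(
    country_code: str,
    admin1_code: Optional[str],
    mapping: Dict[str, str] = ADMIN1_TO_ISO,
) -> Optional[str]:
    """Return the ISO 3166-2 code for a city's admin1 code, or None if unknown."""
    if not admin1_code:
        return None
    return mapping.get(f"{country_code}.{admin1_code}")


def ghcid_province_segment(iso_code: str) -> str:
    """Extract the part used in a GHCID, e.g. "NL-NH" -> "NH"."""
    return iso_code.split("-", 1)[1]
```

Returning `None` for unmapped codes, instead of guessing, keeps disputed or missing subdivisions visible so they can be handled as explicit edge cases.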

5. City Similarity Matching (Priority: Low)

Goal: Handle typos, spelling variations

Tasks:

  • Implement fuzzy matching (Levenshtein distance)
  • Suggest corrections for unmatched cities
  • Confidence scoring for matches

Example: "Amsterdm" → "Amsterdam" (confidence: 0.95)
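
A small, dependency-free version of this idea is sketched below. The Levenshtein distance is the standard dynamic-programming formulation; the confidence formula (one minus distance divided by the longer length) and the `min_confidence` threshold are assumptions made for this sketch, so its scores will not necessarily match the illustrative 0.95 above.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def suggest_city(
    query: str, known_cities: Iterable[str], min_confidence: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return (best_match, confidence) or None when nothing is close enough."""
    best: Optional[Tuple[str, float]] = None
    for city in known_cities:
        distance = levenshtein(query.casefold(), city.casefold())
        confidence = 1 - distance / max(len(query), len(city), 1)
        if confidence >= min_confidence and (best is None or confidence > best[1]):
            best = (city, round(confidence, 2))
    return best
```

Comparing against every known city is acceptable for a few thousand Dutch places; a global table would need candidate pre-filtering, for example by first letter or name length.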


Success Metrics

Coverage Targets

| Metric | Before (UN/LOCODE) | Target (GeoNames) | Status |
|---|---|---|---|
| Dutch city coverage | 50 (10.5%) | 475 (100%) | Pending |
| GHCID generation rate (ISIL) | 152/364 (41.8%) | >345/364 (>95%) | Pending |
| GHCID generation rate (Dutch orgs) | Unknown | >1,283/1,351 (>95%) | Pending |
| Global city coverage | N/A | >200,000 cities | Future 🔮 |

Performance Targets

| Metric | Target | Status |
|---|---|---|
| City lookup latency | <1ms | Pending |
| Database load time | <10ms | Pending |
| Database size (NL-only) | <10MB | Pending |
| Test coverage | >90% | Pending |

Quality Targets

| Metric | Target | Status |
|---|---|---|
| City name accuracy | >99% | Pending |
| Province code accuracy | >95% | Pending |
| GeoNames ID linkage | >90% | Pending |
| Zero database corruption | 100% | Pending |

References

External Resources

Internal Documentation

  • GHCID Specification: docs/plan/global_glam/06-global-identifier-system.md
  • Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
  • Architecture: docs/plan/global_glam/02-architecture.md
  • Schema: schemas/heritage_custodian.yaml

Implementation Files

  • GHCID Generator: src/glam_extractor/identifiers/ghcid.py
  • Lookups: src/glam_extractor/identifiers/lookups.py
  • GeoNames Client (to be created): src/glam_extractor/geocoding/geonames_lookup.py
  • Database (to be created): data/reference/geonames.db
  • Build Script (to be created): scripts/build_geonames_db.py

Approval & Sign-off

Decision Made: 2025-11-05
Approved By: GLAM Data Extraction Project Team
Implementation Owner: GeoNames Integration Team
Review Date: 2025-12-05 (1 month post-implementation)

Status: Design Approved - Ready for Implementation


Last Updated: 2025-11-05
Version: 1.0
Next Review: After Phase 1 implementation complete