# GeoNames Integration for GHCID City Abbreviations

**Version**: 1.0
**Date**: 2025-11-05
**Status**: Design Complete - Implementation Pending
**Decision**: **Option B - GeoNames Local Database**

---

## Executive Summary

To generate Global Heritage Custodian Identifiers (GHCID) for worldwide heritage institutions, we need a comprehensive source of city abbreviations. This document explains why we're replacing UN/LOCODE with **GeoNames local database** and provides implementation guidelines.

**Key Decision**: Use GeoNames offline database (SQLite) rather than API or UN/LOCODE.

---

## Problem Statement

### Current Limitation: UN/LOCODE Coverage

The GHCID format requires 3-letter city codes:

```
{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
```

**Current approach** uses UN/LOCODE (United Nations Code for Trade and Transport Locations):

- **Coverage**: Only 50 Dutch cities in `data/reference/nl_city_locodes.json`
- **Required**: 475 Dutch cities (based on Dutch heritage datasets)
- **Gap**: 89.5% of cities missing from lookup table
- **Impact**: Only 41.8% GHCID generation rate (152/364 ISIL records)

### Why UN/LOCODE Falls Short

1. **Trade-focused**: UN/LOCODE prioritizes ports, airports, rail terminals
2. **Major cities only**: Small/medium cities often omitted
3. **Incomplete coverage**: Many heritage-rich towns excluded
4. **Not heritage-centric**: No correlation with museum/archive density
5. **Limited maintenance**: Updates infrequent, focused on logistics

### Example Gap

Dutch ISIL registry includes institutions in:

- **Aalten** ✅ (in UN/LOCODE as AAT)
- **Achtkarspelen** ❌ (not in UN/LOCODE)
- **Almkerk** ❌ (not in UN/LOCODE)
- **Ameland** ❌ (not in UN/LOCODE)

Result: **212 out of 364 records** (58.2%) could not generate GHCIDs.

---

## Solution: GeoNames Geographic Database

### What is GeoNames?
**GeoNames** (https://www.geonames.org) is a comprehensive geographical database covering:

- **11+ million place names** worldwide
- **Hierarchical data**: City → Province/State → Country
- **Multilingual names**: Local + alternative names in multiple languages
- **Geographic coordinates**: Latitude/longitude for all places
- **Population data**: Size/importance ranking
- **Administrative divisions**: First-level (admin1) codes that can be mapped to ISO 3166-2
- **Free & open**: Creative Commons license

### Why GeoNames for GHCID?

| Criterion | UN/LOCODE | GeoNames | Winner |
|-----------|-----------|----------|--------|
| **Dutch city coverage** | 50 cities (10.5%) | 475+ cities (100%) | GeoNames ✅ |
| **Global coverage** | ~100,000 locations | 11+ million | GeoNames ✅ |
| **Heritage relevance** | Trade/transport focus | All populated places | GeoNames ✅ |
| **Coordinates included** | Some | All | GeoNames ✅ |
| **Hierarchical data** | No | Yes (city→province→country) | GeoNames ✅ |
| **Update frequency** | Quarterly | Daily | GeoNames ✅ |
| **API availability** | No official API | Yes (free tier) | GeoNames ✅ |
| **Offline usage** | CSV dumps | Tab-separated text dumps (loadable into SQLite) | Tie ✅ |
| **Cost** | Free | Free | Tie ✅ |

**Conclusion**: GeoNames is the better fit for heritage institution geolocation.

---

## Implementation Options Considered

### Option A: GeoNames API Integration

**Approach**: Query the GeoNames web API in real time for city lookups.
```python
# Sketch of a real-time API lookup
from typing import Optional

import requests


class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor",
            },
            timeout=10,
        )
        response.raise_for_status()
        matches = response.json().get("geonames", [])
        if not matches:
            return None  # City not found
        # Generate 3-letter abbreviation from city name
        return {
            "abbreviation": city_name[:3].upper(),
            "geonames_id": matches[0]["geonameId"],
        }
```

**Pros**:

- ✅ No local storage required
- ✅ Always up-to-date data
- ✅ Fast to implement (~2 hours)
- ✅ Simple integration

**Cons**:

- ❌ **Rate limits**: 2,000 requests/hour (free tier), 30,000/day
- ❌ Requires network connectivity
- ❌ Latency on each lookup (~200-500ms)
- ❌ Single point of failure (API downtime)
- ❌ Processing 1,351 Dutch organizations in one run could approach the hourly rate limit

**Verdict**: Good for prototyping, **not suitable for production**.

---

### Option B: GeoNames Local Database ✅ **SELECTED**

**Approach**: Download the GeoNames data dump once and store it in a SQLite database for offline lookups.

```python
# Sketch of an offline lookup
import sqlite3
from typing import Optional


class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        # Return rows that support access by column name
        self.conn.row_factory = sqlite3.Row

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ?",
            (city_name, country),
        )
        result = cursor.fetchone()
        if result is None:
            return None  # City not in the local database
        return {
            "abbreviation": self._generate_abbreviation(result["name"]),
            "geonames_id": result["geonames_id"],
            "admin1_code": result["admin1_code"],
        }

    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name, uppercase
        return city_name[:3].upper()
```

**Pros**:

- ✅ **No rate limits** - unlimited local queries
- ✅ **Fast lookups** - <1ms with indexes
- ✅ **Offline operation** - no network dependency
- ✅ **Predictable performance** - no API downtime risk
- ✅ **Cost-effective** - free download, ~200MB storage
- ✅ **Reproducible** - same data snapshot across environments
- ✅ **Cacheable** - SQLite file easy to distribute

**Cons**:

- ⚠️ **Initial setup** - requires download + database creation (~30 mins)
- ⚠️ **Storage** - ~200MB for worldwide data, ~5MB for NL-only
- ⚠️ **Staleness** - need periodic updates (monthly/quarterly)
- ⚠️ **Implementation time** - ~4 hours total

**Verdict**: **BEST for production** - predictable, fast, no runtime network dependency.

---

### Option C: Hybrid Approach (API + Local Cache)

**Approach**: Use the local database, fall back to the API for missing cities, and cache the results.

```python
from typing import Optional


class GeoNamesHybrid:
    def __init__(self, local_db, api_client):
        self.local_db = local_db
        self.api_client = api_client

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result

        # Fallback to API
        result = self.api_client.lookup(city_name, country)
        if result:
            # Cache to local DB
            self.local_db.insert(result)
        return result
```

**Pros**:

- ✅ Best of both worlds
- ✅ Auto-updates for missing cities
- ✅ Resilient to data gaps

**Cons**:

- ❌ **Most complex** implementation
- ❌ Still subject to rate limits (for misses)
- ❌ Two failure modes (DB + API)
- ❌ Longer implementation time (~5-6 hours)

**Verdict**: Over-engineered for current needs, **defer to future**.
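To make the selected local-database approach concrete, the following minimal sketch runs an Option B style lookup end to end against an in-memory SQLite database. The `make_demo_db` and `lookup_city` helpers, the reduced column set, the population tie-break, and all sample rows (IDs, admin1 codes, populations) are assumptions made for illustration; they are not real GeoNames values or part of the planned module.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allow row["column"] access
    conn.execute(
        "CREATE TABLE cities ("
        "geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "country_code TEXT NOT NULL, admin1_code TEXT, population INTEGER)"
    )
    conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")
    # Placeholder IDs, admin1 codes and populations: not real GeoNames data.
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Amsterdam", "NL", "07", 700000),
            (2, "Aalten", "NL", "03", 27000),
            (3, "Amsterdam", "US", "NY", 18000),
        ],
    )
    return conn


def lookup_city(conn: sqlite3.Connection, city_name: str, country: str) -> Optional[dict]:
    """Return abbreviation details for a city, or None if it is unknown."""
    row = conn.execute(
        "SELECT geonames_id, name, admin1_code FROM cities "
        "WHERE name = ? AND country_code = ? "
        "ORDER BY population DESC LIMIT 1",  # assumed tie-break: largest place wins
        (city_name, country),
    ).fetchone()
    if row is None:
        return None
    return {
        "abbreviation": row["name"][:3].upper(),
        "geonames_id": row["geonames_id"],
        "admin1_code": row["admin1_code"],
    }


if __name__ == "__main__":
    demo = make_demo_db()
    print(lookup_city(demo, "Amsterdam", "NL"))
    print(lookup_city(demo, "Atlantis", "NL"))  # unknown city
```

Returning `None` for an unknown city, rather than raising, lets the caller decide whether a missing match should be logged, skipped, or escalated.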

---

## Selected Solution: Option B Implementation Plan

### Architecture

```
┌─────────────────────────────────────────────────────┐
│ GHCID Generator                                     │
│ (src/glam_extractor/identifiers/ghcid.py)           │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│ GeoNames Lookup Service                             │
│ (src/glam_extractor/geocoding/geonames_lookup.py)   │
│                                                     │
│ - get_city_abbreviation(city, country)              │
│ - get_city_details(city, country)                   │
│ - get_geonames_id(city, country)                    │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│ SQLite Database                                     │
│ (data/reference/geonames.db)                        │
│                                                     │
│ Tables:                                             │
│ - cities (geonames_id, name, country_code, ...)     │
│ - admin1_codes (province mappings)                  │
│ - alternate_names (multilingual)                    │
└─────────────────────────────────────────────────────┘
```

### Database Schema

```sql
-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT,           -- GeoNames admin1 code (e.g., "07"); must be mapped to ISO 3166-2
    admin2_code TEXT,           -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT,          -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);

-- Indexes for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);

-- Province/state codes (GeoNames admin1 codes)
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY,      -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL,         -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);

-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT,           -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);

CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
```

### Data Source

**GeoNames Download URL**:

```
http://download.geonames.org/export/dump/
```

**Files needed**:

1. **Cities**: `cities15000.zip` (cities with >15,000 population) - ~25MB
   - Alternative: `allCountries.zip` (all places) - ~350MB
2. **Admin codes**: `admin1CodesASCII.txt` - province/state mappings
3. **Alternate names**: `alternateNamesV2.zip` - multilingual names

**Netherlands-specific**:

- `NL.zip` - All Dutch places (~200KB uncompressed)
- Includes all 475+ cities in Dutch heritage datasets

### Database Build Process

```python
# scripts/build_geonames_db.py

import csv
import sqlite3
from pathlib import Path


def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db",
):
    """
    Build SQLite database from GeoNames text files.

    Steps:
    1. Download GeoNames data (if not exists)
    2. Create SQLite database
    3. Parse GeoNames TSV files
    4. Insert data with indexes
    5. Validate completeness

    Steps 1 and 5 are not shown here; the input files are expected to exist locally.
    """
    Path(output_db).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()

    # Create tables (see schema above); create_tables is a helper defined elsewhere in this script
    create_tables(cursor)

    # Parse cities file. GeoNames dumps are unquoted TSV, so quote handling is disabled.
    with open(geonames_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row[0],           # geonames_id
                    row[1],           # name
                    row[2],           # ascii_name
                    row[8],           # country_code
                    row[10],          # admin1_code
                    row[11],          # admin2_code
                    row[4],           # latitude
                    row[5],           # longitude
                    row[14],          # population
                    row[15] or None,  # elevation (often empty; column 16 is the DEM value)
                    row[7],           # feature_code
                    row[17],          # timezone
                    row[18],          # modification_date
                ),
            )

    # Parse admin1 codes (province mappings)
    with open(admin1_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], row[3]),
            )

    # Create indexes (IF NOT EXISTS, in case create_tables already created them)
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")

    city_count = cursor.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    country_count = cursor.execute("SELECT COUNT(DISTINCT country_code) FROM cities").fetchone()[0]

    conn.commit()
    conn.close()

    print(f"✅ GeoNames database created: {output_db}")
    print(f"   Cities: {city_count}")
    print(f"   Countries: {country_count}")
```

### City Abbreviation Strategy

**Decision**: Use **first 3 letters** of city name (uppercase).

**Rationale**:

1. ✅ **Simplicity**: Easy to understand and implement
2. ✅ **Compatibility**: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
3. ✅ **Predictability**: No complex rules, consistent output
4. ✅ **Readability**: Often recognizable (Rotterdam → ROT, Utrecht → UTR)

**Algorithm**:

```python
def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.

    Rules:
    1. Remove apostrophes, hyphens and spaces
    2. Take first 3 characters
    3. Convert to uppercase

    Examples:
        Amsterdam → AMS
        Rotterdam → ROT
        Den Haag → DEN
        's-Hertogenbosch → SHE
        Aalst → AAL
    """
    # Normalize: remove special chars, spaces
    normalized = city_name.replace("'", "").replace("-", "").replace(" ", "")

    # Take first 3 letters, uppercase
    abbreviation = normalized[:3].upper()

    return abbreviation
```

**Alternative considered**: First 3 consonants

- Amsterdam → MST (M-S-T)
- Rotterdam → RTT (R-T-T)
- **Rejected**: Less intuitive, harder to reverse-lookup

### Integration Points

**Files to create or modify**:

1. **`src/glam_extractor/geocoding/geonames_lookup.py`** (NEW)
   - SQLite client for city lookups
   - Abbreviation generation logic
   - Province code mapping (GeoNames admin1 code → ISO 3166-2)
2. **`src/glam_extractor/identifiers/lookups.py`** (UPDATE)
   - Replace `get_city_locode()` with `get_city_abbreviation()`
   - Update `get_ghcid_components_for_dutch_city()` to use GeoNames
3. **`src/glam_extractor/identifiers/ghcid.py`** (UPDATE)
   - Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
   - Keep the validation regex (same format, 3 uppercase letters)
4. **`data/reference/geonames.db`** (NEW)
   - SQLite database with Dutch cities
   - Initial build: ~5MB for Netherlands
   - Global expansion: ~200MB for all countries
5. **`scripts/build_geonames_db.py`** (NEW)
   - Downloads GeoNames data
   - Builds SQLite database
   - Validates completeness
6. **`data/reference/nl_city_locodes.json`** (DEPRECATE)
   - Mark as deprecated in comments
   - Keep for backward compatibility (short-term)
   - Remove after GHCID migration complete

### GHCID Format Changes

**Before (UN/LOCODE)**:

```
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City LOCODE (Amsterdam)
│  └────────── Province (North Holland)
└───────────── Country (Netherlands)
```

**After (GeoNames)**:

```
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City abbreviation (Amsterdam, from GeoNames)
│  └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)
```

**Note**: The format is **identical**; only the source of the city abbreviation changes.

### Migration Impact

**Breaking Changes**:

- ⚠️ Some city codes may change (if the GeoNames-derived abbreviation differs from the UN/LOCODE code)
- ⚠️ Existing 152 GHCIDs may need regeneration
- ⚠️ Numeric hashes will change for any GHCID whose string changes (they are computed from the GHCID string)

**Migration strategy**:

1. **Document changes** in GHCID history
2. **Preserve old GHCIDs** in `ghcid_original` field
3. **Update `ghcid_current`** with new abbreviation
4. **Regenerate `ghcid_numeric`** from new GHCID
5. **Add history entry**:

   ```python
   GHCIDHistoryEntry(
       ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
       valid_from="2025-10-01",
       valid_to="2025-11-05",
       reason="Migrated from UN/LOCODE to GeoNames abbreviation"
   )
   ```

**Backward compatibility**:

- Keep mapping table: `old_ghcid → new_ghcid`
- Support lookups by both old and new identifiers
- Phase out old format over 6-12 months

---

## Implementation Checklist

### Phase 1: Database Setup (Est. 1-2 hours)

- [ ] Download GeoNames `NL.zip` dataset
- [ ] Download `admin1CodesASCII.txt` for province codes
- [ ] Create `scripts/build_geonames_db.py` script
- [ ] Generate `data/reference/geonames.db` SQLite database
- [ ] Validate: All 475 Dutch cities present
- [ ] Create indexes for fast lookups
- [ ] Document database schema in code comments

### Phase 2: Lookup Module (Est. 1-2 hours)

- [ ] Create `src/glam_extractor/geocoding/geonames_lookup.py`
- [ ] Implement `GeoNamesDB` class with SQLite connection
- [ ] Implement `get_city_abbreviation(city, country)` method
- [ ] Implement `get_city_details(city, country)` method
- [ ] Implement `get_geonames_id(city, country)` method
- [ ] Add support for alternate names (multilingual)
- [ ] Handle city name normalization (case, accents, hyphens)

### Phase 3: Integration (Est. 30 mins)

- [ ] Update `src/glam_extractor/identifiers/lookups.py`
  - [ ] Replace `_load_json("nl_city_locodes.json")` with GeoNames DB
  - [ ] Update `get_city_locode()` → `get_city_abbreviation()`
  - [ ] Update `get_province_code()` to use GeoNames admin1_code
- [ ] Update `src/glam_extractor/identifiers/ghcid.py`
  - [ ] Update docstrings (UN/LOCODE → GeoNames)
  - [ ] Keep validation regex unchanged (format identical)

### Phase 4: Testing (Est. 1 hour)

- [ ] Unit tests for `GeoNamesDB` class
  - [ ] Test city lookup (Amsterdam → AMS)
  - [ ] Test missing city handling
  - [ ] Test province code mapping
  - [ ] Test GeoNames ID retrieval
- [ ] Integration tests with ISIL parser
  - [ ] Re-run ISIL registry parsing
  - [ ] Verify GHCID generation rate >95% (vs 41.8% before)
  - [ ] Compare old vs new GHCIDs for changes
- [ ] Edge case tests
  - [ ] Cities with special characters ('s-Hertogenbosch)
  - [ ] Cities with spaces (Den Haag)
  - [ ] Multilingual city names

### Phase 5: Documentation (Est. 30 mins)

- [ ] Update `docs/plan/global_glam/06-global-identifier-system.md`
  - [ ] Replace UN/LOCODE references with GeoNames
  - [ ] Document city abbreviation algorithm
- [ ] Update `AGENTS.md`
  - [ ] Add GeoNames database instructions
  - [ ] Explain migration from UN/LOCODE
- [ ] Add migration guide: `docs/migration/ghcid_locode_to_geonames.md`

### Phase 6: Validation (Est. 30 mins)

- [ ] Run full test suite (target: 150+ tests passing)
- [ ] Generate GHCIDs for all 364 ISIL records
- [ ] Generate GHCIDs for all 1,351 Dutch orgs
- [ ] Identify any remaining cities without GeoNames match
- [ ] Document edge cases and resolution strategy

**Total Estimated Time**: ~4.5-6.5 hours (sum of the phase estimates)

---

## Performance Benchmarks

### Expected Performance (SQLite Local DB)

| Operation | Time | Notes |
|-----------|------|-------|
| Database load | 10ms | One-time per process |
| City lookup | <1ms | With index on (name, country_code) |
| Batch lookup (1,000 cities) | ~100ms | 0.1ms per city |
| Batch lookup (10,000 cities) | ~1s | 0.1ms per city |

**Comparison to API**:

- API request latency: 200-500ms per city
- Batch 1,000 cities via API: ~200-500 seconds (sequential requests; rate limits can add further delay)
- Batch 1,000 cities via SQLite: ~0.1 seconds

**Winner**: SQLite is expected to be **2,000-5,000x faster** than the API.

### Storage Requirements

| Scope | Database Size | Records |
|-------|---------------|---------|
| Netherlands only | ~5 MB | ~5,000 places |
| Western Europe | ~50 MB | ~50,000 places |
| Global (cities >15k pop) | ~200 MB | ~200,000 cities |
| Global (all places) | ~1.5 GB | ~11 million places |

**Decision**: Start with **Netherlands only** (5MB), expand as needed.
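
The lookup timings in the benchmark tables are expectations rather than measurements. One rough way to check them locally is to time indexed lookups against a synthetic table, as in the sketch below. The table contents, the row count, and the `build_synthetic_db` and `time_lookups` helpers are assumptions for illustration; actual numbers depend on hardware and on the real database.

```python
import sqlite3
import time


def build_synthetic_db(n_rows: int = 5000) -> sqlite3.Connection:
    """Create an in-memory table of made-up city names with the lookup index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cities ("
        "geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, country_code TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?)",
        ((i, f"City{i:05d}", "NL") for i in range(n_rows)),
    )
    conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")
    conn.commit()
    return conn


def time_lookups(conn: sqlite3.Connection, names, country: str = "NL"):
    """Run one indexed lookup per name; return (elapsed_seconds, hit_count)."""
    hits = 0
    start = time.perf_counter()
    for name in names:
        row = conn.execute(
            "SELECT geonames_id FROM cities WHERE name = ? AND country_code = ?",
            (name, country),
        ).fetchone()
        if row is not None:
            hits += 1
    elapsed = time.perf_counter() - start
    return elapsed, hits


if __name__ == "__main__":
    db = build_synthetic_db()
    sample = [f"City{i:05d}" for i in range(1000)]
    elapsed, hits = time_lookups(db, sample)
    per_lookup_ms = elapsed / len(sample) * 1000
    print(f"{hits} hits in {elapsed * 1000:.1f} ms ({per_lookup_ms:.3f} ms per lookup)")
```

An in-memory database avoids disk I/O, so this measures a best case; repeating the run against the on-disk `geonames.db` would give a more representative figure.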

---

## Data Freshness & Maintenance

### Update Frequency

**GeoNames**: Updated daily (new cities, name changes, population updates)

**Our database**: Update **quarterly** (every 3 months)

**Rationale**:

- Heritage institutions rarely relocate
- City names stable over years
- Population changes irrelevant for GHCID
- Quarterly updates sufficient for accuracy

### Update Process

```bash
#!/bin/bash
# scripts/update_geonames_db.sh
set -euo pipefail

# Download latest GeoNames data
wget -N http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip

# Rebuild database (assumes the build script exposes --input/--output flags)
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate
python scripts/validate_geonames_db.py

echo "✅ GeoNames database updated"
```

**Automation**: Run via cron job or GitHub Actions on the quarterly schedule above.

---

## Comparison: Before & After

### Before (UN/LOCODE)

**Coverage**:

- 50 Dutch cities (10.5%)
- 152/364 ISIL records with GHCIDs (41.8%)
- 212 records **cannot generate GHCID** (58.2%)

**Data source**:

- JSON file: `data/reference/nl_city_locodes.json`
- Manual curation required
- No automation for updates

**Limitations**:

- Cannot expand to global institutions
- Missing most small/medium cities
- No multilingual support

### After (GeoNames)

**Coverage**:

- 475+ Dutch cities (100%)
- Expected: 346+/364 ISIL records with GHCIDs (≥95%)
- <20 records without GHCID (edge cases)

**Data source**:

- SQLite database: `data/reference/geonames.db`
- Automated build from GeoNames dumps
- Quarterly updates via script

**Benefits**:

- ✅ Ready for global expansion (139 conversation files)
- ✅ Covers all heritage institution locations
- ✅ Multilingual name support
- ✅ Coordinates for mapping
- ✅ Province/state codes included

---

## Alternatives Rejected

### 1. OpenStreetMap Nominatim

**Pros**: Free geocoding API, comprehensive
**Cons**: Rate limits (1 req/sec), intended for geocoding rather than lookup, offline use requires self-hosting a Nominatim instance
**Why rejected**: Same limitations as the GeoNames API (Option A), less structured data

### 2. Google Maps Geocoding API

**Pros**: High quality, fast, reliable
**Cons**: **Costs money** ($5 per 1,000 requests), vendor lock-in, requires API key
**Why rejected**: Not free/open, unsustainable for large-scale extraction

### 3. Custom City Abbreviation Registry

**Pros**: Full control, optimized for heritage sector
**Cons**: **Months of manual curation**, ongoing maintenance burden, duplication of effort
**Why rejected**: Reinventing the wheel; GeoNames already exists and is maintained

### 4. Keep UN/LOCODE + Manual Additions

**Pros**: Minimal code changes
**Cons**: Doesn't solve scalability, still manual curation, not global-ready
**Why rejected**: Doesn't address the root problem and adds technical debt

---

## Risk Assessment

### Risk 1: GeoNames Service Discontinuation

**Likelihood**: Low (operated by GeoNames since 2005, widely used)
**Impact**: Medium (need alternative source)
**Mitigation**:

- We use an **offline database**, so lookups do not depend on the API
- GeoNames data dumps archived by multiple organizations
- Could switch to OpenStreetMap/Wikidata if needed

### Risk 2: City Name Ambiguity

**Scenario**: Multiple cities with same name (e.g., "Portland" in USA, UK, Australia)
**Likelihood**: Medium (common in global dataset)
**Impact**: Medium (wrong GHCID if not disambiguated)
**Mitigation**:

- Always provide **country code** in lookups
- Use **population ranking** (larger city preferred)
- Validate with **province/state code** match
- Log warnings for ambiguous matches

### Risk 3: Database Corruption

**Likelihood**: Low (SQLite very stable)
**Impact**: High (all GHCIDs incorrect)
**Mitigation**:

- **Checksum validation** after build
- Version control: `geonames-v2025.11.db`
- Keep backup copies
- Automated tests verify data integrity

### Risk 4: Abbreviation Collisions

**Scenario**: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" → "AMS")
**Likelihood**: Medium (common with 3-letter codes)
**Impact**: Medium (same city code; GHCIDs are differentiated by province where the provinces differ)
**Mitigation**:

- **Province code** in GHCID prevents collision when provinces differ: `NL-NH-AMS` vs `NL-XX-AMS`
- Most collisions are expected to be in different provinces
- If same province (as with Amsterdam and Amstelveen, both in North Holland): Wikidata Q-number resolves (collision resolution)

---

## Future Enhancements

### 1. Global Expansion (Priority: High)

**Goal**: Support 139 conversation files covering 60+ countries

**Tasks**:

- Download GeoNames global dataset (`allCountries.zip`)
- Build SQLite database for all countries (~1.5GB for all places per the storage estimates; ~200MB if restricted to cities over 15,000 population)
- Test with conversation JSON extraction
- Validate GHCID generation for international institutions

**Timeline**: After Phase 1 complete (Dutch institutions)

### 2. Multilingual City Name Matching (Priority: Medium)

**Goal**: Match cities by local names (e.g., "Den Haag" = "The Hague")

**Tasks**:

- Load GeoNames `alternateNamesV2.txt` into database
- Support lookup by any alternate name
- Return canonical name + GeoNames ID

**Use case**: Conversation JSONs mention cities in local language

### 3. Geocoding Integration (Priority: Medium)

**Goal**: Populate `location.latitude` and `location.longitude` fields

**Tasks**:

- Add `get_coordinates(city, country)` method
- Integrate with `HeritageCustodian.locations` list
- Support reverse geocoding (lat/lon → city name)

**Benefit**: Enable map visualizations, geographic analysis

### 4. Province/State Inference (Priority: Low)

**Goal**: Auto-detect province from city name (if not provided)

**Tasks**:

- Use GeoNames `admin1_code` field
- Map to ISO 3166-2 province codes
- Handle edge cases (disputed territories)

**Use case**: ISIL registry doesn't always specify province

### 5. City Similarity Matching (Priority: Low)

**Goal**: Handle typos, spelling variations

**Tasks**:

- Implement fuzzy matching (Levenshtein distance)
- Suggest corrections for unmatched cities
- Confidence scoring for matches

**Example**: "Amsterdm" → "Amsterdam" (confidence: 0.95)

---

## Success Metrics

### Coverage Targets

| Metric | Before (UN/LOCODE) | Target (GeoNames) | Status |
|--------|-------------------|-------------------|--------|
| Dutch city coverage | 50 (10.5%) | 475 (100%) | Pending |
| GHCID generation rate (ISIL) | 152/364 (41.8%) | ≥346/364 (≥95%) | Pending |
| GHCID generation rate (Dutch orgs) | Unknown | ≥1,284/1,351 (≥95%) | Pending |
| Global city coverage | N/A | >200,000 cities | Future 🔮 |

### Performance Targets

| Metric | Target | Status |
|--------|--------|--------|
| City lookup latency | <1ms | Pending |
| Database load time | <10ms | Pending |
| Database size (NL-only) | <10MB | Pending |
| Test coverage | >90% | Pending |

### Quality Targets

| Metric | Target | Status |
|--------|--------|--------|
| City name accuracy | >99% | Pending |
| Province code accuracy | >95% | Pending |
| GeoNames ID linkage | >90% | Pending |
| Database integrity checks passing | 100% | Pending |

---

## References

### External Resources

- **GeoNames Official**: https://www.geonames.org
- **GeoNames Downloads**: http://download.geonames.org/export/dump/
- **GeoNames Documentation**: https://www.geonames.org/export/
- **GeoNames Web Services**: https://www.geonames.org/export/web-services.html
- **ISO 3166-2 (Provinces)**: https://en.wikipedia.org/wiki/ISO_3166-2

### Internal Documentation

- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **Architecture**: `docs/plan/global_glam/02-architecture.md`
- **Schema**: `schemas/heritage_custodian.yaml`

### Implementation Files

- **GHCID Generator**: `src/glam_extractor/identifiers/ghcid.py`
- **Lookups**: `src/glam_extractor/identifiers/lookups.py`
- **GeoNames Client** (to be created): `src/glam_extractor/geocoding/geonames_lookup.py`
- **Database** (to be created): `data/reference/geonames.db`
- **Build Script** (to be created): `scripts/build_geonames_db.py`

---

## Approval & Sign-off

**Decision Made**: 2025-11-05
**Approved By**: GLAM Data Extraction Project Team
**Implementation Owner**: GeoNames Integration Team
**Review Date**: 2025-12-05 (1 month post-implementation)

**Status**: ✅ **Design Approved - Ready for Implementation**

---

**Last Updated**: 2025-11-05
**Version**: 1.0
**Next Review**: After Phase 1 implementation complete