GeoNames Integration for GHCID City Abbreviations
Version: 1.0
Date: 2025-11-05
Status: Design Complete - Implementation Pending
Decision: Option B - GeoNames Local Database
Executive Summary
To generate Global Heritage Custodian Identifiers (GHCIDs) for heritage institutions worldwide, we need a comprehensive source of city abbreviations. This document explains why we are replacing UN/LOCODE with a local GeoNames database and provides implementation guidelines.
Key Decision: Use an offline GeoNames database (SQLite) rather than the GeoNames API or UN/LOCODE.
Problem Statement
Current Limitation: UN/LOCODE Coverage
The GHCID format requires 3-letter city codes:
{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
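The format above can be checked mechanically. The following is a minimal sketch, not the project's validator: the regular expression, the `GHCID` tuple, and `parse_ghcid` are hypothetical names, and the allowed lengths for the province, type, and abbreviation parts are assumptions inferred from the single example.

```python
import re
from typing import NamedTuple

# Assumed shape: 2-letter country, 1-3 character province, 3-letter city,
# 1-letter institution type, uppercase institution abbreviation.
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]{1,3})-(?P<city>[A-Z]{3})"
    r"-(?P<inst_type>[A-Z])-(?P<abbreviation>[A-Z0-9]+)$"
)


class GHCID(NamedTuple):
    country: str
    region: str
    city: str
    inst_type: str
    abbreviation: str

    def __str__(self) -> str:
        # Re-join the five components with hyphens.
        return "-".join(self)


def parse_ghcid(value: str) -> GHCID:
    """Split a GHCID string into its five components or raise ValueError."""
    match = GHCID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Not a valid GHCID: {value!r}")
    return GHCID(**match.groupdict())


example = parse_ghcid("NL-NH-AMS-M-RM")
```

Parsing and re-serializing the example round-trips to the same string, which is a cheap sanity check when regenerating identifiers.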
Current approach uses UN/LOCODE (United Nations Code for Trade and Transport Locations):
- Coverage: Only 50 Dutch cities in data/reference/nl_city_locodes.json
- Required: 475 Dutch cities (based on Dutch heritage datasets)
- Gap: 89.5% of cities missing from lookup table
- Impact: Only 41.8% GHCID generation rate (152/364 ISIL records)
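The percentages in this list follow directly from the counts. A short worked check (the helper name `share` is only for illustration):

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


covered_cities, required_cities = 50, 475
ghcid_ok, isil_records = 152, 364

city_coverage = share(covered_cities, required_cities)               # 10.5
city_gap = share(required_cities - covered_cities, required_cities)  # 89.5
generation_rate = share(ghcid_ok, isil_records)                      # 41.8
failure_rate = share(isil_records - ghcid_ok, isil_records)          # 58.2
```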
Why UN/LOCODE Falls Short
- Trade-focused: UN/LOCODE prioritizes ports, airports, rail terminals
- Major cities only: Small/medium cities often omitted
- Incomplete coverage: Many heritage-rich towns excluded
- Not heritage-centric: No correlation with museum/archive density
- Limited maintenance: Updates infrequent, focused on logistics
Example Gap
Dutch ISIL registry includes institutions in:
- Aalten ✅ (in UN/LOCODE as AAT)
- Achtkarspelen ❌ (not in UN/LOCODE)
- Almkerk ❌ (not in UN/LOCODE)
- Ameland ❌ (not in UN/LOCODE)
Result: 212 out of 364 records (58.2%) could not generate GHCIDs.
Solution: GeoNames Geographic Database
What is GeoNames?
GeoNames (https://www.geonames.org) is a comprehensive geographical database covering:
- 11+ million place names worldwide
- Hierarchical data: City → Province/State → Country
- Multilingual names: Local + alternative names in multiple languages
- Geographic coordinates: Latitude/longitude for all places
- Population data: Size/importance ranking
- Administrative divisions: ISO 3166-2 integration
- Free & open: Creative Commons license
Why GeoNames for GHCID?
| Criterion | UN/LOCODE | GeoNames | Winner |
|---|---|---|---|
| Dutch city coverage | 50 cities (10.5%) | 475+ cities (100%) | GeoNames ✅ |
| Global coverage | ~100,000 locations | 11+ million | GeoNames ✅ |
| Heritage relevance | Trade/transport focus | All populated places | GeoNames ✅ |
| Coordinates included | Some | All | GeoNames ✅ |
| Hierarchical data | No | Yes (city→province→country) | GeoNames ✅ |
| Update frequency | Twice a year | Daily | GeoNames ✅ |
| API availability | No official API | Yes (free tier) | GeoNames ✅ |
| Offline usage | CSV dumps | Tab-delimited text dumps (loadable into SQLite) | Tie ✅ |
| Cost | Free | Free | Tie ✅ |
Conclusion: GeoNames is superior for heritage institution geolocation.
Implementation Options Considered
Option A: GeoNames API Integration
Approach: Query GeoNames web API in real-time for city lookups.
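# Sketch (uses the third-party requests package)
import requests

class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> dict:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor"
            },
            timeout=10,
        )
        response.raise_for_status()
        # The search response lists matches under the "geonames" key
        matches = response.json().get("geonames", [])
        geonames_id = matches[0]["geonameId"] if matches else None
        # Generate 3-letter abbreviation from city name
        return {"abbreviation": city_name[:3].upper(), "geonames_id": geonames_id}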
Pros:
- ✅ No local storage required
- ✅ Always up-to-date data
- ✅ Fast to implement (~2 hours)
- ✅ Simple integration
Cons:
- ❌ Rate limits: 1,000 credits/hour and 20,000 credits/day per username (free tier)
- ❌ Requires network connectivity
- ❌ Latency on each lookup (~200-500ms)
- ❌ Single point of failure (API downtime)
- ❌ Processing 1,351 Dutch orgs in one run could hit the rate limit
Verdict: Good for prototyping, not suitable for production.
Option B: GeoNames Local Database ✅ SELECTED
Approach: Download GeoNames data dump once, store in SQLite database for offline lookups.
# Sketch
import sqlite3
from typing import Optional

class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        # Needed so rows can be read by column name (result['name'])
        self.conn.row_factory = sqlite3.Row

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ?",
            (city_name, country)
        )
        result = cursor.fetchone()
        if result is None:
            return None  # City not in the database
        return {
            "abbreviation": self._generate_abbreviation(result['name']),
            "geonames_id": result['geonames_id'],
            "admin1_code": result['admin1_code'],
        }

    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name, uppercase
        return city_name[:3].upper()
Pros:
- ✅ No rate limits - unlimited local queries
- ✅ Fast lookups - <1ms with indexes
- ✅ Offline operation - no network dependency
- ✅ Predictable performance - no API downtime risk
- ✅ Cost-effective - free download, ~200MB storage
- ✅ Reproducible - same data snapshot across environments
- ✅ Cacheable - SQLite file easy to distribute
Cons:
- ⚠️ Initial setup - requires download + database creation (~30 mins)
- ⚠️ Storage - ~200MB for worldwide data, ~5MB for NL-only
- ⚠️ Staleness - need periodic updates (monthly/quarterly)
- ⚠️ Implementation time - ~4 hours total
Verdict: BEST for production - predictable, fast, no dependencies.
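The "<1ms with indexes" claim depends on SQLite actually using the composite index for the lookup. A minimal sketch of how that could be checked, using an in-memory database with a reduced `cities` table; the two rows and their IDs are illustrative, not GeoNames data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "country_code TEXT NOT NULL, admin1_code TEXT)"
)
conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [(1, "Amsterdam", "NL", "07"), (2, "Aalten", "NL", "03")],  # illustrative rows
)

query = "SELECT geonames_id, name FROM cities WHERE name = ? AND country_code = ?"
row = conn.execute(query, ("Amsterdam", "NL")).fetchone()

# EXPLAIN QUERY PLAN reports which index (if any) the lookup uses;
# the human-readable detail is the last column of each plan row.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Amsterdam", "NL")).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
uses_index = "idx_city_country" in plan_text
```

If `uses_index` were false, lookups would fall back to a full table scan, which matters once the table holds millions of rows.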
Option C: Hybrid Approach (API + Local Cache)
Approach: Use local database, fallback to API for missing cities, cache results.
class GeoNamesHybrid:
    def get_city_abbreviation(self, city_name: str, country: str) -> dict:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result
        # Fallback to API
        result = self.api_client.lookup(city_name, country)
        # Cache to local DB (only when the API returned a match)
        if result:
            self.local_db.insert(result)
        return result
Pros:
- ✅ Best of both worlds
- ✅ Auto-updates for missing cities
- ✅ Resilient to data gaps
Cons:
- ❌ Most complex implementation
- ❌ Still subject to rate limits (for misses)
- ❌ Two failure modes (DB + API)
- ❌ Longer implementation time (~5-6 hours)
Verdict: Over-engineered for current needs, defer to future.
Selected Solution: Option B Implementation Plan
Architecture
┌─────────────────────────────────────────────────────┐
│ GHCID Generator                                     │
│ (src/glam_extractor/identifiers/ghcid.py)           │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│ GeoNames Lookup Service                             │
│ (src/glam_extractor/geocoding/geonames_lookup.py)   │
│                                                     │
│ - get_city_abbreviation(city, country)              │
│ - get_city_details(city, country)                   │
│ - get_geonames_id(city, country)                    │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│ SQLite Database                                     │
│ (data/reference/geonames.db)                        │
│                                                     │
│ Tables:                                             │
│ - cities (geonames_id, name, country_code, ...)     │
│ - admin1_codes (province mappings)                  │
│ - alternate_names (multilingual)                    │
└─────────────────────────────────────────────────────┘
Database Schema
-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT, -- GeoNames admin1 code (e.g., "07"); must be mapped to ISO 3166-2 for the GHCID
    admin2_code TEXT, -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT, -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);
-- Index for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);
-- Province/state codes (GeoNames admin1 codes, keyed as "<country>.<admin1>")
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY, -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL, -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);
-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT, -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);
CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
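Because `admin1_codes.code` has the form "NL.07" while `cities` stores the country and admin1 parts separately, resolving a city's province requires joining on the concatenated key. The sketch below illustrates that join on a reduced schema. `ADMIN1_TO_ISO` and `province_for` are hypothetical, and the mapping contains only the one pair this document implies (NL.07 is North Holland, written NH in the GHCID example); a real table would need an entry per province.

```python
import sqlite3

# Assumption: only this pair is taken from the document; others are not filled in.
ADMIN1_TO_ISO = {"NL.07": "NH"}

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY, name TEXT, country_code TEXT, admin1_code TEXT
    );
    CREATE TABLE admin1_codes (code TEXT PRIMARY KEY, name TEXT);
    INSERT INTO cities VALUES (1, 'Amsterdam', 'NL', '07');
    INSERT INTO admin1_codes VALUES ('NL.07', 'North Holland');
    """
)


def province_for(city: str, country: str):
    """Return the admin1 key, province name, and ISO 3166-2 suffix (if mapped)."""
    row = conn.execute(
        "SELECT a.code, a.name FROM cities c "
        "JOIN admin1_codes a ON a.code = c.country_code || '.' || c.admin1_code "
        "WHERE c.name = ? AND c.country_code = ?",
        (city, country),
    ).fetchone()
    if row is None:
        return None
    code, name = row
    return {"admin1": code, "name": name, "iso_3166_2": ADMIN1_TO_ISO.get(code)}


amsterdam_province = province_for("Amsterdam", "NL")
```

Returning `None` for the ISO part when no mapping exists makes unmapped provinces visible instead of silently emitting a GeoNames code in the GHCID.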
Data Source
GeoNames Download URL:
http://download.geonames.org/export/dump/
Files needed:
- Cities: cities15000.zip (cities with >15,000 population) - ~25MB
  - Alternative: allCountries.zip (all places) - ~350MB
- Admin codes: admin1CodesASCII.txt - province/state mappings
- Alternate names: alternateNamesV2.zip - multilingual names
Netherlands-specific:
- NL.zip - All Dutch places (~200KB uncompressed)
  - Includes all 475+ cities in Dutch heritage datasets
Database Build Process
# scripts/build_geonames_db.py
import csv
import sqlite3

def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db"
):
    """
    Build SQLite database from GeoNames text files.
    Steps:
    1. Download GeoNames data (if not exists)
    2. Create SQLite database
    3. Parse GeoNames TSV files
    4. Insert data with indexes
    5. Validate completeness
    """
    # create_tables, count_cities and count_countries are helpers not shown here;
    # the download step is likewise left out of this listing.
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()
    # Create tables (see schema above)
    create_tables(cursor)
    # Parse cities file (tab-separated; names may contain quote characters)
    with open(geonames_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row[0],  # geonames_id
                    row[1],  # name
                    row[2],  # ascii_name
                    row[8],  # country_code
                    row[10],  # admin1_code
                    row[11],  # admin2_code
                    row[4],  # latitude
                    row[5],  # longitude
                    row[14],  # population
                    row[15] or None,  # elevation (row[16] is the DEM value, not elevation)
                    row[7],  # feature_code
                    row[17],  # timezone
                    row[18],  # modification_date
                )
            )
    # Parse admin1 codes (province mappings)
    with open(admin1_file, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], row[3])
            )
    # Create indexes (IF NOT EXISTS, since the schema above already defines them)
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")
    conn.commit()
    conn.close()
    print(f"✅ GeoNames database created: {output_db}")
    print(f" Cities: {count_cities(output_db)}")
    print(f" Countries: {count_countries(output_db)}")
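The positional indexes in the build script are easy to get wrong, so a name-based parser can help. The sketch below maps each row onto the 19 columns of the GeoNames country dump format; `GEONAME_COLUMNS` and `parse_geonames` are hypothetical names, and the sample line is hand-written for illustration rather than copied from NL.txt.

```python
import csv
import io

# Column order of the GeoNames country dump files (19 tab-separated fields).
GEONAME_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def parse_geonames(stream):
    """Yield one dict per well-formed row, with numeric fields converted."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(GEONAME_COLUMNS):
            continue  # skip malformed lines instead of guessing
        record = dict(zip(GEONAME_COLUMNS, row))
        record["population"] = int(record["population"] or 0)
        record["elevation"] = int(record["elevation"]) if record["elevation"] else None
        yield record


# Hand-written sample line (values are illustrative only).
sample_line = "\t".join([
    "1", "Amsterdam", "Amsterdam", "", "52.37", "4.89", "P", "PPLC", "NL", "",
    "07", "0363", "", "", "741636", "", "13", "Europe/Amsterdam", "2024-01-01",
]) + "\n"
records = list(parse_geonames(io.StringIO(sample_line)))
```

Note that `elevation` (column 16 of 19) is often empty while `dem` (column 17) is filled, which is why the two must not be confused when loading the `cities` table.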
City Abbreviation Strategy
Decision: Use first 3 letters of city name (uppercase).
Rationale:
- ✅ Simplicity: Easy to understand and implement
- ✅ Compatibility: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
- ✅ Predictability: No complex rules, consistent output
- ✅ Readability: Often recognizable (Rotterdam → ROT, Utrecht → UTR)
Algorithm:
def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.
    Rules:
    1. Remove apostrophes, hyphens, and spaces
    2. Take first 3 characters
    3. Convert to uppercase
    Examples:
    Amsterdam → AMS
    Rotterdam → ROT
    Den Haag → DEN
    's-Hertogenbosch → SHE
    Aalst → AAL
    """
    # Normalize: remove special chars, spaces
    normalized = city_name.replace("'", "").replace("-", "").replace(" ", "")
    # Take first 3 letters, uppercase
    abbreviation = normalized[:3].upper()
    return abbreviation
Alternative considered: First 3 consonants
- Amsterdam → MST (M-S-T)
- Rotterdam → RTT (R-T-T)
- Rejected: Less intuitive, harder to reverse-lookup
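Both strategies can be compared side by side. The sketch below is one possible variant, not the project's implementation: it additionally strips accents via Unicode decomposition (an assumption; the algorithm above only removes apostrophes, hyphens, and spaces), and it treats every letter other than A, E, I, O, U as a consonant. All function names are hypothetical.

```python
import unicodedata


def normalize_city_name(city_name: str) -> str:
    """Drop accents, punctuation, and spaces, keeping ASCII letters only."""
    decomposed = unicodedata.normalize("NFKD", city_name)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())


def first_three_letters(city_name: str) -> str:
    """Selected strategy: first three letters, uppercase."""
    return normalize_city_name(city_name)[:3].upper()


def first_three_consonants(city_name: str) -> str:
    """Rejected alternative: first three non-vowel letters, uppercase."""
    letters = normalize_city_name(city_name).upper()
    return "".join(ch for ch in letters if ch not in "AEIOU")[:3]


examples = {
    name: (first_three_letters(name), first_three_consonants(name))
    for name in ["Amsterdam", "Rotterdam", "Den Haag", "'s-Hertogenbosch", "Zürich"]
}
```

Names shorter than three letters would yield a shorter code under either strategy, so a real implementation would need a padding rule.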
Integration Points
Files to modify:
- src/glam_extractor/geocoding/geonames_lookup.py (NEW)
  - SQLite client for city lookups
  - Abbreviation generation logic
  - Province code mapping (ISO 3166-2)
- src/glam_extractor/identifiers/lookups.py (UPDATE)
  - Replace get_city_locode() with get_city_abbreviation()
  - Update get_ghcid_components_for_dutch_city() to use GeoNames
- src/glam_extractor/identifiers/ghcid.py (UPDATE)
  - Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
  - Update validation regex (same format, 3 uppercase letters)
- data/reference/geonames.db (NEW)
  - SQLite database with Dutch cities
  - Initial build: ~5MB for Netherlands
  - Global expansion: ~200MB for all countries
- scripts/build_geonames_db.py (NEW)
  - Downloads GeoNames data
  - Builds SQLite database
  - Validates completeness
- data/reference/nl_city_locodes.json (DEPRECATE)
  - Mark as deprecated in comments
  - Keep for backward compatibility (short-term)
  - Remove after GHCID migration complete
GHCID Format Changes
Before (UN/LOCODE):
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City LOCODE (Amsterdam)
│  └────────── Province (North Holland)
└───────────── Country (Netherlands)
After (GeoNames):
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City abbreviation (Amsterdam, from GeoNames)
│  └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)
Note: The format is identical; only the source of the city abbreviation and province code changes.
Migration Impact
Breaking Changes:
- ⚠️ Some city codes may change (if GeoNames differs from UN/LOCODE)
- ⚠️ Existing 152 GHCIDs may need regeneration
- ⚠️ Numeric hashes will change (computed from GHCID string)
Migration strategy:
- Document changes in GHCID history
- Preserve old GHCIDs in ghcid_original field
- Update ghcid_current with new abbreviation
- Regenerate ghcid_numeric from new GHCID
- Add history entry:
  GHCIDHistoryEntry(
      ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
      valid_from="2025-10-01",
      valid_to="2025-11-05",
      reason="Migrated from UN/LOCODE to GeoNames abbreviation"
  )
Backward compatibility:
- Keep mapping table: old_ghcid → new_ghcid
- Support lookups by both old and new identifiers
- Phase out old format over 6-12 months
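The migration and backward-compatibility steps could be sketched roughly as follows. The `GHCIDHistoryEntry` and `Record` dataclasses here are hypothetical stand-ins for the project's models (only the field names come from this document), the GHCID values are made-up placeholders, and regeneration of `ghcid_numeric` is left out because the hash function is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class GHCIDHistoryEntry:  # hypothetical stand-in for the project's model
    ghcid: str
    valid_from: str
    valid_to: str
    reason: str


@dataclass
class Record:  # hypothetical record carrying the fields named above
    ghcid_original: str
    ghcid_current: str
    history: list = field(default_factory=list)


def migrate(record: Record, new_ghcid: str, valid_from: str, migration_date: str) -> Record:
    """Switch a record to its GeoNames-based GHCID and log the old one."""
    if new_ghcid == record.ghcid_current:
        return record  # abbreviation unchanged, nothing to record
    record.history.append(
        GHCIDHistoryEntry(
            ghcid=record.ghcid_current,
            valid_from=valid_from,
            valid_to=migration_date,
            reason="Migrated from UN/LOCODE to GeoNames abbreviation",
        )
    )
    record.ghcid_current = new_ghcid
    return record


def build_alias_index(records):
    """Map every old and new GHCID to its record so both keep resolving."""
    index = {}
    for record in records:
        index[record.ghcid_original] = record
        index[record.ghcid_current] = record
        for entry in record.history:
            index[entry.ghcid] = record
    return index


# Placeholder identifiers, not real GHCIDs.
record = Record(ghcid_original="NL-XX-OLD-M-AB", ghcid_current="NL-XX-OLD-M-AB")
migrate(record, "NL-XX-NEW-M-AB", valid_from="2025-10-01", migration_date="2025-11-05")
aliases = build_alias_index([record])
```

Keeping the alias index separate from the records makes it straightforward to drop old identifiers once the phase-out period ends.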
Implementation Checklist
Phase 1: Database Setup (Est. 1-2 hours)
- Download GeoNames NL.zip dataset
- Download admin1CodesASCII.txt for province codes
- Create scripts/build_geonames_db.py script
- Generate data/reference/geonames.db SQLite database
- Validate: All 475 Dutch cities present
- Create indexes for fast lookups
- Document database schema in code comments
Phase 2: Lookup Module (Est. 1-2 hours)
- Create src/glam_extractor/geocoding/geonames_lookup.py
- Implement GeoNamesDB class with SQLite connection
- Implement get_city_abbreviation(city, country) method
- Implement get_city_details(city, country) method
- Implement get_geonames_id(city, country) method
- Add support for alternate names (multilingual)
- Handle city name normalization (case, accents, hyphens)
Phase 3: Integration (Est. 30 mins)
- Update src/glam_extractor/identifiers/lookups.py
  - Replace _load_json("nl_city_locodes.json") with GeoNames DB
  - Update get_city_locode() → get_city_abbreviation()
  - Update get_province_code() to use GeoNames admin1_code
- Update src/glam_extractor/identifiers/ghcid.py
  - Update docstrings (UN/LOCODE → GeoNames)
  - Keep validation regex unchanged (format identical)
Phase 4: Testing (Est. 1 hour)
- Unit tests for GeoNamesDB class
  - Test city lookup (Amsterdam → AMS)
  - Test missing city handling
  - Test province code mapping
  - Test GeoNames ID retrieval
- Integration tests with ISIL parser
  - Re-run ISIL registry parsing
  - Verify GHCID generation rate >95% (vs 41.8% before)
  - Compare old vs new GHCIDs for changes
- Edge case tests
  - Cities with special characters ('s-Hertogenbosch)
  - Cities with spaces (Den Haag)
  - Multilingual city names
Phase 5: Documentation (Est. 30 mins)
- Update docs/plan/global_glam/06-global-identifier-system.md
  - Replace UN/LOCODE references with GeoNames
  - Document city abbreviation algorithm
- Update AGENTS.md
  - Add GeoNames database instructions
  - Explain migration from UN/LOCODE
- Add migration guide: docs/migration/ghcid_locode_to_geonames.md
Phase 6: Validation (Est. 30 mins)
- Run full test suite (target: 150+ tests passing)
- Generate GHCIDs for all 364 ISIL records
- Generate GHCIDs for all 1,351 Dutch orgs
- Identify any remaining cities without GeoNames match
- Document edge cases and resolution strategy
Total Estimated Time: ~4.5-6.5 hours (sum of the phase estimates above)
Performance Benchmarks
Expected Performance (SQLite Local DB)
| Operation | Time | Notes |
|---|---|---|
| Database load | 10ms | One-time per process |
| City lookup | <1ms | With index on (name, country_code) |
| Batch lookup (1,000 cities) | ~100ms | 0.1ms per city |
| Batch lookup (10,000 cities) | ~1s | 0.1ms per city |
Comparison to API:
- API request latency: 200-500ms per city
- Batch 1,000 cities via API: ~200-500 seconds (with rate limits)
- Batch 1,000 cities via SQLite: ~0.1 seconds
Winner: SQLite is 2,000-5,000x faster than API.
Storage Requirements
| Scope | Database Size | Records |
|---|---|---|
| Netherlands only | ~5 MB | ~5,000 places |
| Western Europe | ~50 MB | ~50,000 places |
| Global (cities >15k pop) | ~200 MB | ~200,000 cities |
| Global (all places) | ~1.5 GB | ~11 million places |
Decision: Start with Netherlands only (5MB), expand as needed.
Data Freshness & Maintenance
Update Frequency
GeoNames: Updated daily (new cities, name changes, population updates)
Our database: Update quarterly (every 3 months)
Rationale:
- Heritage institutions rarely relocate
- City names stable over years
- Population changes irrelevant for GHCID
- Quarterly updates sufficient for accuracy
Update Process
#!/bin/bash
# scripts/update_geonames_db.sh
# Stop on the first failing command
set -euo pipefail
# Download latest GeoNames data (overwrite any previous download)
wget -O NL.zip http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip
# Rebuild database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db
# Validate
python scripts/validate_geonames_db.py
echo "✅ GeoNames database updated"
Automation: Run via cron job or GitHub Actions on the quarterly schedule above.
Comparison: Before & After
Before (UN/LOCODE)
Coverage:
- 50 Dutch cities (10.5%)
- 152/364 ISIL records with GHCIDs (41.8%)
- 212 records cannot generate GHCID (58.2%)
Data source:
- JSON file: data/reference/nl_city_locodes.json
- Manual curation required
- No automation for updates
Limitations:
- Cannot expand to global institutions
- Missing most small/medium cities
- No multilingual support
After (GeoNames)
Coverage:
- 475+ Dutch cities (100%)
- Expected: 346+/364 ISIL records with GHCIDs (>95%)
- <20 records without GHCID (edge cases)
Data source:
- SQLite database: data/reference/geonames.db
- Automated build from GeoNames dumps
- Quarterly updates via script
Benefits:
- ✅ Ready for global expansion (139 conversation files)
- ✅ Covers all heritage institution locations
- ✅ Multilingual name support
- ✅ Coordinates for mapping
- ✅ Province/state codes included
Alternatives Rejected
1. OpenStreetMap Nominatim
Pros: Free geocoding API, comprehensive
Cons: Rate limits (1 req/sec), intended for geocoding not lookup, offline use requires self-hosting a full OSM import
Why rejected: Same API limitations as the GeoNames API, less structured data
2. Google Maps Geocoding API
Pros: High quality, fast, reliable
Cons: Costs money ($5 per 1,000 requests), vendor lock-in, requires API key
Why rejected: Not free/open, unsustainable for large-scale extraction
3. Custom City Abbreviation Registry
Pros: Full control, optimized for heritage sector
Cons: Months of manual curation, ongoing maintenance burden, duplication of effort
Why rejected: Reinventing the wheel, GeoNames already exists and is maintained
4. Keep UN/LOCODE + Manual Additions
Pros: Minimal code changes
Cons: Doesn't solve scalability, still manual curation, not global-ready
Why rejected: Doesn't address root problem, technical debt
Risk Assessment
Risk 1: GeoNames Service Discontinuation
Likelihood: Low (operated by GeoNames since 2005, widely used)
Impact: Medium (need alternative source)
Mitigation:
- We use offline database, not dependent on API
- GeoNames data dumps archived by multiple organizations
- Could switch to OpenStreetMap/WikiData if needed
Risk 2: City Name Ambiguity
Scenario: Multiple cities with same name (e.g., "Portland" in USA, UK, Australia)
Likelihood: Medium (common in global dataset)
Impact: Medium (wrong GHCID if not disambiguated)
Mitigation:
- Always provide country code in lookups
- Use population ranking (larger city preferred)
- Validate with province/state code match
- Log warnings for ambiguous matches
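These mitigations could be combined in a single lookup, sketched below on a reduced `cities` table. `resolve_city` is a hypothetical helper, and the rows (IDs, region codes, populations) are invented for illustration.

```python
import logging
import sqlite3

logger = logging.getLogger("geonames_lookup")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, admin1_code TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?)",
    [  # invented rows for illustration
        (1, "Portland", "US", "OR", 650000),
        (2, "Portland", "US", "ME", 68000),
        (3, "Portland", "AU", "07", 10000),
    ],
)


def resolve_city(name: str, country: str, admin1: str = None):
    """Pick the most populous match, optionally restricted to one province."""
    sql = (
        "SELECT geonames_id, admin1_code FROM cities "
        "WHERE name = ? AND country_code = ?"
    )
    params = [name, country]
    if admin1 is not None:
        sql += " AND admin1_code = ?"
        params.append(admin1)
    sql += " ORDER BY population DESC"
    rows = conn.execute(sql, params).fetchall()
    if not rows:
        return None
    if len(rows) > 1:
        logger.warning("Ambiguous city %r in %s: %d candidates", name, country, len(rows))
    return {"geonames_id": rows[0][0], "admin1_code": rows[0][1], "ambiguous": len(rows) > 1}


largest = resolve_city("Portland", "US")
exact = resolve_city("Portland", "US", admin1="ME")
```

Preferring the larger city is only a heuristic; passing the province whenever the source record has one avoids relying on it.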
Risk 3: Database Corruption
Likelihood: Low (SQLite very stable)
Impact: High (all GHCIDs incorrect)
Mitigation:
- Checksum validation after build
- Version control: geonames-v2025.11.db
- Keep backup copies
- Automated tests verify data integrity
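Checksum validation could be as simple as storing a SHA-256 digest next to the database file and comparing it before use. A minimal sketch; the `.sha256` sidecar naming and the function names are assumptions, not an existing project convention.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(db_path) -> Path:
    """Store the digest in a sidecar file, e.g. geonames.db.sha256."""
    checksum_path = Path(str(db_path) + ".sha256")
    checksum_path.write_text(sha256_of(db_path))
    return checksum_path


def verify_checksum(db_path) -> bool:
    """Return True only if the sidecar exists and matches the file contents."""
    checksum_path = Path(str(db_path) + ".sha256")
    if not checksum_path.exists():
        return False
    return checksum_path.read_text().strip() == sha256_of(db_path)
```

The build script would call `write_checksum` after a successful build, and the lookup service would call `verify_checksum` once at startup.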
Risk 4: Abbreviation Collisions
Scenario: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" → "AMS"; both are in North Holland, so the province code alone does not separate them)
Likelihood: Medium (common with 3-letter codes)
Impact: Medium (same city code, different GHCIDs differentiated by province)
Mitigation:
- Province code in GHCID prevents collision: NL-NH-AMS vs NL-XX-AMS
- Most collisions will be in different provinces
- If same province: Wikidata Q-number resolves (collision resolution)
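To see how often the same-province case actually occurs, cities can be grouped by province and abbreviation. A sketch under stated assumptions: `find_collisions` is a hypothetical helper, and the admin1 codes in the sample are illustrative apart from "07" for North Holland, which this document gives.

```python
from collections import defaultdict


def abbreviate(city_name: str) -> str:
    """Same rule as the selected strategy: strip ' - and spaces, first 3 letters."""
    normalized = city_name.replace("'", "").replace("-", "").replace(" ", "")
    return normalized[:3].upper()


def find_collisions(cities):
    """Return {(admin1_code, abbreviation): [names]} for groups with >1 city."""
    groups = defaultdict(set)
    for name, admin1_code in cities:
        groups[(admin1_code, abbreviate(name))].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# (name, admin1_code) pairs; codes other than "07" are illustrative.
sample = [
    ("Amsterdam", "07"),
    ("Amstelveen", "07"),
    ("Aalten", "03"),
    ("Rotterdam", "11"),
]
collisions = find_collisions(sample)
```

Running this over the full `cities` table after the build would give the list of city pairs that need the collision-resolution step.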
Future Enhancements
1. Global Expansion (Priority: High)
Goal: Support 139 conversation files covering 60+ countries
Tasks:
- Download GeoNames global dataset (allCountries.zip)
- Build SQLite database for all countries (~200MB)
- Test with conversation JSON extraction
- Validate GHCID generation for international institutions
Timeline: After Phase 1 complete (Dutch institutions)
2. Multilingual City Name Matching (Priority: Medium)
Goal: Match cities by local names (e.g., "Den Haag" = "The Hague")
Tasks:
- Load GeoNames alternateNamesV2.txt into database
- Support lookup by any alternate name
- Return canonical name + GeoNames ID
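A lookup that falls back to alternate names might look like the sketch below, which uses reduced versions of the `cities` and `alternate_names` tables. `find_city` is a hypothetical helper, and the rows are illustrative; which spelling GeoNames treats as canonical should be checked against the actual dump.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL, country_code TEXT NOT NULL
    );
    CREATE TABLE alternate_names (
        alternate_name_id INTEGER PRIMARY KEY, geonames_id INTEGER NOT NULL,
        isolanguage TEXT, alternate_name TEXT NOT NULL
    );
    CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
    """
)
# Illustrative rows only.
conn.execute("INSERT INTO cities VALUES (?, ?, ?)", (1, "The Hague", "NL"))
conn.executemany(
    "INSERT INTO alternate_names VALUES (?, ?, ?, ?)",
    [(10, 1, "nl", "Den Haag"), (11, 1, "nl", "'s-Gravenhage")],
)


def find_city(name: str, country: str):
    """Try the canonical name first, then any alternate name."""
    row = conn.execute(
        "SELECT geonames_id, name FROM cities WHERE name = ? AND country_code = ?",
        (name, country),
    ).fetchone()
    if row is None:
        row = conn.execute(
            "SELECT c.geonames_id, c.name FROM alternate_names a "
            "JOIN cities c ON c.geonames_id = a.geonames_id "
            "WHERE a.alternate_name = ? AND c.country_code = ?",
            (name, country),
        ).fetchone()
    if row is None:
        return None
    return {"geonames_id": row[0], "canonical_name": row[1]}


by_local_name = find_city("Den Haag", "NL")
```

One consequence worth deciding explicitly is whether the city abbreviation is generated from the name as given or from the canonical name, since the two can differ.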
Use case: Conversation JSONs mention cities in local language
3. Geocoding Integration (Priority: Medium)
Goal: Populate location.latitude and location.longitude fields
Tasks:
- Add get_coordinates(city, country) method
- Integrate with HeritageCustodian.locations list
- Support reverse geocoding (lat/lon → city name)
Benefit: Enable map visualizations, geographic analysis
4. Province/State Inference (Priority: Low)
Goal: Auto-detect province from city name (if not provided)
Tasks:
- Use GeoNames admin1_code field
- Map to ISO 3166-2 province codes
- Handle edge cases (disputed territories)
Use case: ISIL registry doesn't always specify province
5. City Similarity Matching (Priority: Low)
Goal: Handle typos, spelling variations
Tasks:
- Implement fuzzy matching (Levenshtein distance)
- Suggest corrections for unmatched cities
- Confidence scoring for matches
Example: "Amsterdm" → "Amsterdam" (confidence: 0.95)
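As a starting point, the standard library's `difflib` can suggest corrections without extra dependencies. Note that this sketch uses difflib's similarity ratio rather than Levenshtein distance, so its scores will not match a Levenshtein-based confidence exactly. `suggest_city` and the 0.8 cutoff are assumptions.

```python
import difflib


def suggest_city(query: str, known_cities, cutoff: float = 0.8):
    """Return (best_match, score) for a misspelled city name, or None."""
    # Compare case-insensitively but return the original spelling.
    lowered = {city.lower(): city for city in known_cities}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    score = difflib.SequenceMatcher(None, query.lower(), best).ratio()
    return lowered[best], round(score, 2)


known = ["Amsterdam", "Amstelveen", "Rotterdam", "Utrecht"]
suggestion = suggest_city("Amsterdm", known)
```

Suggestions below the cutoff are dropped rather than guessed, which keeps unmatched cities visible for manual review.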
Success Metrics
Coverage Targets
| Metric | Before (UN/LOCODE) | Target (GeoNames) | Status |
|---|---|---|---|
| Dutch city coverage | 50 (10.5%) | 475 (100%) | Pending ✅ |
| GHCID generation rate (ISIL) | 152/364 (41.8%) | >345/364 (>95%) | Pending ✅ |
| GHCID generation rate (Dutch orgs) | Unknown | >1,283/1,351 (>95%) | Pending ✅ |
| Global city coverage | N/A | >200,000 cities | Future 🔮 |
Performance Targets
| Metric | Target | Status |
|---|---|---|
| City lookup latency | <1ms | Pending ✅ |
| Database load time | <10ms | Pending ✅ |
| Database size (NL-only) | <10MB | Pending ✅ |
| Test coverage | >90% | Pending ✅ |
Quality Targets
| Metric | Target | Status |
|---|---|---|
| City name accuracy | >99% | Pending ✅ |
| Province code accuracy | >95% | Pending ✅ |
| GeoNames ID linkage | >90% | Pending ✅ |
| Zero database corruption | 100% | Pending ✅ |
References
External Resources
- GeoNames Official: https://www.geonames.org
- GeoNames Downloads: http://download.geonames.org/export/dump/
- GeoNames Documentation: https://www.geonames.org/export/
- GeoNames Web Services: https://www.geonames.org/export/web-services.html
- ISO 3166-2 (Provinces): https://en.wikipedia.org/wiki/ISO_3166-2
Internal Documentation
- GHCID Specification: docs/plan/global_glam/06-global-identifier-system.md
- Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
- Architecture: docs/plan/global_glam/02-architecture.md
- Schema: schemas/heritage_custodian.yaml
Implementation Files
- GHCID Generator: src/glam_extractor/identifiers/ghcid.py
- Lookups: src/glam_extractor/identifiers/lookups.py
- GeoNames Client (to be created): src/glam_extractor/geocoding/geonames_lookup.py
- Database (to be created): data/reference/geonames.db
- Build Script (to be created): scripts/build_geonames_db.py
Approval & Sign-off
Decision Made: 2025-11-05
Approved By: GLAM Data Extraction Project Team
Implementation Owner: GeoNames Integration Team
Review Date: 2025-12-05 (1 month post-implementation)
Status: ✅ Design Approved - Ready for Implementation
Last Updated: 2025-11-05
Version: 1.0
Next Review: After Phase 1 implementation complete