glam/docs/plan/global_glam/08-geonames-integration.md
2025-11-19 23:25:22 +01:00

# GeoNames Integration for GHCID City Abbreviations
**Version**: 1.0
**Date**: 2025-11-05
**Status**: Design Complete - Implementation Pending
**Decision**: **Option B - GeoNames Local Database**
---
## Executive Summary
To generate Global Heritage Custodian Identifiers (GHCID) for heritage institutions worldwide, we need a comprehensive source of city names from which 3-letter city codes can be derived. This document explains why we are replacing UN/LOCODE with a **local GeoNames database** and provides implementation guidelines.
**Key Decision**: Use an offline GeoNames database (SQLite) rather than the GeoNames API or UN/LOCODE.
---
## Problem Statement
### Current Limitation: UN/LOCODE Coverage
The GHCID format requires 3-letter city codes:
```
{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
```
**Current approach** uses UN/LOCODE (United Nations Code for Trade and Transport Locations):
- **Coverage**: Only 50 Dutch cities in `data/reference/nl_city_locodes.json`
- **Required**: 475 Dutch cities (based on Dutch heritage datasets)
- **Gap**: 89.5% of cities missing from lookup table
- **Impact**: Only 41.8% GHCID generation rate (152/364 ISIL records)
### Why UN/LOCODE Falls Short
1. **Trade-focused**: UN/LOCODE prioritizes ports, airports, rail terminals
2. **Major cities only**: Small/medium cities often omitted
3. **Incomplete coverage**: Many heritage-rich towns excluded
4. **Not heritage-centric**: No correlation with museum/archive density
5. **Limited maintenance**: Updates infrequent, focused on logistics
### Example Gap
Dutch ISIL registry includes institutions in:
- **Aalten** ✅ (in UN/LOCODE as AAT)
- **Achtkarspelen** ❌ (not in UN/LOCODE)
- **Almkerk** ❌ (not in UN/LOCODE)
- **Ameland** ❌ (not in UN/LOCODE)
Result: **212 out of 364 records** (58.2%) could not generate GHCIDs.
---
## Solution: GeoNames Geographic Database
### What is GeoNames?
**GeoNames** (https://www.geonames.org) is a comprehensive geographical database covering:
- **11+ million place names** worldwide
- **Hierarchical data**: City → Province/State → Country
- **Multilingual names**: Local + alternative names in multiple languages
- **Geographic coordinates**: Latitude/longitude for all places
- **Population data**: Size/importance ranking
- **Administrative divisions**: First-level (admin1) codes that can be mapped to ISO 3166-2
- **Free & open**: Creative Commons Attribution license (attribution required)
### Why GeoNames for GHCID?
| Criterion | UN/LOCODE | GeoNames | Winner |
|-----------|-----------|----------|--------|
| **Dutch city coverage** | 50 cities (10.5%) | 475+ cities (100%) | GeoNames ✅ |
| **Global coverage** | ~100,000 locations | 11+ million | GeoNames ✅ |
| **Heritage relevance** | Trade/transport focus | All populated places | GeoNames ✅ |
| **Coordinates included** | Some | All | GeoNames ✅ |
| **Hierarchical data** | No | Yes (city→province→country) | GeoNames ✅ |
| **Update frequency** | Quarterly | Daily | GeoNames ✅ |
| **API availability** | No official API | Yes (free tier) | GeoNames ✅ |
| **Offline usage** | CSV dumps | Tab-separated text dumps (loadable into SQLite) | Tie ✅ |
| **Cost** | Free | Free | Tie ✅ |
**Conclusion**: GeoNames is superior for heritage institution geolocation.
---
## Implementation Options Considered
### Option A: GeoNames API Integration
**Approach**: Query GeoNames web API in real-time for city lookups.
```python
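# Sketch of the API-based option
import requests


class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> dict:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor",
            },
            timeout=10,
        )
        response.raise_for_status()
        # The search response lists matches under the "geonames" key
        matches = response.json().get("geonames", [])
        geonames_id = matches[0]["geonameId"] if matches else None
        # Generate 3-letter abbreviation from city name
        return {"abbreviation": city_name[:3].upper(), "geonames_id": geonames_id}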
```
**Pros**:
- ✅ No local storage required
- ✅ Always up-to-date data
- ✅ Fast to implement (~2 hours)
- ✅ Simple integration
**Cons**:
- ❌ **Rate limits**: 2,000 requests/hour (free tier), 30,000/day
- ❌ Requires network connectivity
- ❌ Latency on each lookup (~200-500ms)
- ❌ Single point of failure (API downtime)
- ❌ Batch runs are quota-bound: processing 1,351 Dutch orgs uses most of an hourly quota, and any retries could exceed it
**Verdict**: Good for prototyping, **not suitable for production**.
---
### Option B: GeoNames Local Database ✅ **SELECTED**
**Approach**: Download GeoNames data dump once, store in SQLite database for offline lookups.
```python
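# Sketch of the local-database option
import sqlite3
from typing import Optional


class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row  # allow access by column name

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ?",
            (city_name, country),
        )
        result = cursor.fetchone()
        if result is None:
            return None  # city not present in the local database
        return {
            "abbreviation": self._generate_abbreviation(result["name"]),
            "geonames_id": result["geonames_id"],
            "admin1_code": result["admin1_code"],
        }

    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name, uppercase
        return city_name[:3].upper()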
```
**Pros**:
- ✅ **No rate limits** - unlimited local queries
- ✅ **Fast lookups** - <1ms with indexes
- ✅ **Offline operation** - no network dependency
- ✅ **Predictable performance** - no API downtime risk
- ✅ **Cost-effective** - free download, ~200MB storage
- ✅ **Reproducible** - same data snapshot across environments
- ✅ **Cacheable** - SQLite file easy to distribute
**Cons**:
- ❌ **Initial setup** - requires download + database creation (~30 mins)
- ❌ **Storage** - ~200MB for worldwide data, ~5MB for NL-only
- ❌ **Staleness** - need periodic updates (monthly/quarterly)
- ❌ **Implementation time** - ~4.5-6.5 hours total (sum of the phase estimates in the checklist)
**Verdict**: **BEST for production** - predictable, fast, no dependencies.
---
### Option C: Hybrid Approach (API + Local Cache)
**Approach**: Use local database, fallback to API for missing cities, cache results.
```python
class GeoNamesHybrid:
    def __init__(self, local_db, api_client):
        self.local_db = local_db
        self.api_client = api_client

    def get_city_abbreviation(self, city_name: str, country: str) -> dict:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result
        # Fallback to API
        result = self.api_client.lookup(city_name, country)
        # Cache to local DB (only when the API found something)
        if result:
            self.local_db.insert(result)
        return result
```
**Pros**:
- ✅ Best of both worlds
- ✅ Auto-updates for missing cities
- ✅ Resilient to data gaps
**Cons**:
- ❌ **Most complex** implementation
- ❌ Still subject to rate limits (for misses)
- ❌ Two failure modes (DB + API)
- ❌ Longer implementation time (~5-6 hours)
**Verdict**: Over-engineered for current needs, **defer to future**.
---
## Selected Solution: Option B Implementation Plan
### Architecture
```
┌─────────────────────────────────────────────────────┐
│  GHCID Generator                                    │
│  (src/glam_extractor/identifiers/ghcid.py)          │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  GeoNames Lookup Service                            │
│  (src/glam_extractor/geocoding/geonames_lookup.py)  │
│                                                     │
│  - get_city_abbreviation(city, country)             │
│  - get_city_details(city, country)                  │
│  - get_geonames_id(city, country)                   │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  SQLite Database                                    │
│  (data/reference/geonames.db)                       │
│                                                     │
│  Tables:                                            │
│  - cities (geonames_id, name, country_code, ...)    │
│  - admin1_codes (province mappings)                 │
│  - alternate_names (multilingual)                   │
└─────────────────────────────────────────────────────┘
```
### Database Schema
```sql
-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT,       -- GeoNames admin1 code (e.g. "07"); mapped to ISO 3166-2 separately
    admin2_code TEXT,       -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT,      -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);

-- Indexes for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);

-- Province/state names keyed by GeoNames admin1 code (basis for the ISO 3166-2 mapping)
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY,      -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL,         -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);

-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT,           -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);
CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
```
### Data Source
**GeoNames Download URL**:
```
http://download.geonames.org/export/dump/
```
**Files needed**:
1. **Cities**: `cities15000.zip` (cities with >15,000 population) - ~25MB
   - Alternative: `allCountries.zip` (all places) - ~350MB
2. **Admin codes**: `admin1CodesASCII.txt` - province/state mappings
3. **Alternate names**: `alternateNamesV2.zip` - multilingual names
**Netherlands-specific**:
- `NL.zip` - All Dutch places (~200KB uncompressed)
- Includes all 475+ cities in Dutch heritage datasets
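
The per-country dump files are tab-separated text with 19 columns and no header row. As a rough sketch of how one line maps to the fields used later in the `cities` table, the helper below splits a line and converts the numeric columns. `GeoNamesRow` and `parse_geonames_line` are hypothetical names chosen for this illustration, not existing project code.

```python
from typing import NamedTuple, Optional

# Column order of the GeoNames "geoname" dump format (19 tab-separated fields).
GEONAMES_COLUMNS = (
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
)


class GeoNamesRow(NamedTuple):
    """Hypothetical container for the subset of fields the lookup needs."""
    geonames_id: int
    name: str
    ascii_name: str
    country_code: str
    admin1_code: str
    feature_code: str
    latitude: float
    longitude: float
    population: Optional[int]


def parse_geonames_line(line: str) -> GeoNamesRow:
    """Split one dump line on tabs and convert the numeric fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GEONAMES_COLUMNS):
        raise ValueError(
            f"expected {len(GEONAMES_COLUMNS)} fields, got {len(fields)}"
        )
    record = dict(zip(GEONAMES_COLUMNS, fields))
    return GeoNamesRow(
        geonames_id=int(record["geonameid"]),
        name=record["name"],
        ascii_name=record["asciiname"],
        country_code=record["country_code"],
        admin1_code=record["admin1_code"],
        feature_code=record["feature_code"],
        latitude=float(record["latitude"]),
        longitude=float(record["longitude"]),
        # Population may be blank in the dump; store that as None.
        population=int(record["population"]) if record["population"] else None,
    )
```

Naming the columns once keeps the numeric indexes out of the rest of the code, which makes off-by-one mistakes between neighbouring columns (for example `elevation` and `dem`) easier to spot.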
### Database Build Process
```python
# scripts/build_geonames_db.py
import csv
import sqlite3
from pathlib import Path


def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db",
) -> None:
    """
    Build SQLite database from GeoNames text files.

    Steps:
    1. Download GeoNames data (if not exists)
    2. Create SQLite database
    3. Parse GeoNames TSV files
    4. Insert data with indexes
    5. Validate completeness
    """
    Path(output_db).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()

    # Create tables (see schema above); create_tables() is a helper not shown here
    create_tables(cursor)

    # Parse cities file. The dumps are unquoted TSV, so quote handling is disabled.
    with open(geonames_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row[0],           # geonames_id
                    row[1],           # name
                    row[2],           # ascii_name
                    row[8],           # country_code
                    row[10],          # admin1_code
                    row[11],          # admin2_code
                    row[4],           # latitude
                    row[5],           # longitude
                    row[14] or None,  # population
                    row[15] or None,  # elevation (often blank; column 16 is the DEM value)
                    row[7],           # feature_code
                    row[17],          # timezone
                    row[18],          # modification_date
                ),
            )

    # Parse admin1 codes (province mappings)
    with open(admin1_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], row[3]),
            )

    # Create indexes (IF NOT EXISTS, in case create_tables() already ran the full schema)
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")

    city_count = cursor.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    country_count = cursor.execute("SELECT COUNT(DISTINCT country_code) FROM cities").fetchone()[0]
    conn.commit()
    conn.close()

    print(f"✅ GeoNames database created: {output_db}")
    print(f"   Cities: {city_count}")
    print(f"   Countries: {country_count}")
```
### City Abbreviation Strategy
**Decision**: Use **first 3 letters** of city name (uppercase).
**Rationale**:
1. **Simplicity**: Easy to understand and implement
2. **Compatibility**: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
3. **Predictability**: No complex rules, consistent output
4. **Readability**: Often recognizable (Rotterdam → ROT, Utrecht → UTR)
**Algorithm**:
```python
def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.

    Rules:
    1. Remove apostrophes, hyphens and spaces
    2. Take first 3 characters
    3. Convert to uppercase

    Examples:
        Amsterdam → AMS
        Rotterdam → ROT
        Den Haag → DEN
        's-Hertogenbosch → SHE
        Aalst → AAL
    """
    # Normalize: remove special chars, spaces
    normalized = city_name.replace("'", "").replace("-", "").replace(" ", "")
    # Take first 3 letters, uppercase
    abbreviation = normalized[:3].upper()
    return abbreviation
```
**Alternative considered**: First 3 consonants
- Amsterdam → MST (M-S-T)
- Rotterdam → RTT (R-T-T)
- **Rejected**: Less intuitive, harder to reverse-lookup
### Integration Points
**Files to modify**:
1. **`src/glam_extractor/geocoding/geonames_lookup.py`** (NEW)
   - SQLite client for city lookups
   - Abbreviation generation logic
   - Province code mapping (ISO 3166-2)
2. **`src/glam_extractor/identifiers/lookups.py`** (UPDATE)
   - Replace `get_city_locode()` with `get_city_abbreviation()`
   - Update `get_ghcid_components_for_dutch_city()` to use GeoNames
3. **`src/glam_extractor/identifiers/ghcid.py`** (UPDATE)
   - Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
   - Update validation regex (same format, 3 uppercase letters)
4. **`data/reference/geonames.db`** (NEW)
   - SQLite database with Dutch cities
   - Initial build: ~5MB for Netherlands
   - Global expansion: ~200MB for all countries
5. **`scripts/build_geonames_db.py`** (NEW)
   - Downloads GeoNames data
   - Builds SQLite database
   - Validates completeness
6. **`data/reference/nl_city_locodes.json`** (DEPRECATE)
   - Mark as deprecated in comments
   - Keep for backward compatibility (short-term)
   - Remove after GHCID migration complete
### GHCID Format Changes
**Before (UN/LOCODE)**:
```
NL-NH-AMS-M-RM
│ │ │ │ └─ Abbreviation (Rijksmuseum)
│ │ │ └─── Type (Museum)
│ │ └─────── City LOCODE (Amsterdam)
│ └────────── Province (North Holland)
└───────────── Country (Netherlands)
```
**After (GeoNames)**:
```
NL-NH-AMS-M-RM
│ │ │ │ └─ Abbreviation (Rijksmuseum)
│ │ │ └─── Type (Museum)
│ │ └─────── City abbreviation (Amsterdam, from GeoNames)
│ └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)
```
**Note**: Format is **identical**, only the source of city abbreviation changes.
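
Because the format is unchanged, one validation pattern can serve both sources. The sketch below splits a GHCID into its five components. Only the three-uppercase-letter city code is stated in this document; the shapes assumed for the other components (two-letter country, one to three characters for the region, a single type letter, an alphanumeric abbreviation) are guesses made for illustration, and `GHCID_PATTERN` and `parse_ghcid` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Assumed component shapes; only the 3-letter city code is taken from this document.
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})"         # ISO 3166-1 alpha-2 country code
    r"-(?P<region>[A-Z0-9]{1,3})"     # ISO 3166-2 subdivision suffix (assumed shape)
    r"-(?P<city>[A-Z]{3})"            # 3-letter city code
    r"-(?P<type>[A-Z])"               # institution type (assumed single letter)
    r"-(?P<abbreviation>[A-Z0-9]+)$"  # institution abbreviation (assumed shape)
)


def parse_ghcid(ghcid: str) -> Optional[Dict[str, str]]:
    """Return the named components of a GHCID, or None if it does not match."""
    match = GHCID_PATTERN.match(ghcid)
    return match.groupdict() if match else None
```

For the example above, `parse_ghcid("NL-NH-AMS-M-RM")` yields a dictionary whose `city` entry is `"AMS"`, regardless of whether that code came from UN/LOCODE or from GeoNames.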
### Migration Impact
**Breaking Changes**:
- ⚠️ Some city codes may change (if GeoNames differs from UN/LOCODE)
- ⚠️ Existing 152 GHCIDs may need regeneration
- ⚠️ Numeric hashes will change (computed from GHCID string)
**Migration strategy**:
1. **Document changes** in GHCID history
2. **Preserve old GHCIDs** in `ghcid_original` field
3. **Update `ghcid_current`** with new abbreviation
4. **Regenerate `ghcid_numeric`** from new GHCID
5. **Add history entry**:
   ```python
   GHCIDHistoryEntry(
       ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
       valid_from="2025-10-01",
       valid_to="2025-11-05",
       reason="Migrated from UN/LOCODE to GeoNames abbreviation",
   )
   ```
**Backward compatibility**:
- Keep mapping table: `old_ghcid → new_ghcid`
- Support lookups by both old and new identifiers
- Phase out old format over 6-12 months
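
The `old_ghcid → new_ghcid` mapping table could be as small as a dictionary wrapped in a resolver. The following is a minimal sketch under that assumption; `GHCIDResolver` and its methods are hypothetical and do not exist in the codebase yet.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GHCIDResolver:
    """Resolve both old (UN/LOCODE-based) and new (GeoNames-based) identifiers."""

    old_to_new: Dict[str, str] = field(default_factory=dict)

    def register_migration(self, old_ghcid: str, new_ghcid: str) -> None:
        # Identifiers that did not change need no mapping entry.
        if old_ghcid != new_ghcid:
            self.old_to_new[old_ghcid] = new_ghcid

    def resolve(self, ghcid: str) -> str:
        """Return the current identifier for either an old or a new GHCID."""
        return self.old_to_new.get(ghcid, ghcid)

    def is_deprecated(self, ghcid: str) -> bool:
        """True when the identifier is an old form that has been replaced."""
        return ghcid in self.old_to_new
```

Lookups by a new identifier fall through unchanged, so callers do not need to know which generation of GHCID they hold. Once the phase-out period ends, `is_deprecated()` can drive warnings before the table is dropped.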
---
## Implementation Checklist
### Phase 1: Database Setup (Est. 1-2 hours)
- [ ] Download GeoNames `NL.zip` dataset
- [ ] Download `admin1CodesASCII.txt` for province codes
- [ ] Create `scripts/build_geonames_db.py` script
- [ ] Generate `data/reference/geonames.db` SQLite database
- [ ] Validate: All 475 Dutch cities present
- [ ] Create indexes for fast lookups
- [ ] Document database schema in code comments
### Phase 2: Lookup Module (Est. 1-2 hours)
- [ ] Create `src/glam_extractor/geocoding/geonames_lookup.py`
- [ ] Implement `GeoNamesDB` class with SQLite connection
- [ ] Implement `get_city_abbreviation(city, country)` method
- [ ] Implement `get_city_details(city, country)` method
- [ ] Implement `get_geonames_id(city, country)` method
- [ ] Add support for alternate names (multilingual)
- [ ] Handle city name normalization (case, accents, hyphens)
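
For the normalization item above, one possible approach is to compare cities on a derived key rather than on the raw name. The sketch below folds case, strips accents via Unicode decomposition, and drops punctuation and spaces. `normalize_city_key` is a hypothetical helper; letters without a decomposition (for example "ø" or "ł") pass through unchanged and would need an extra mapping.

```python
import unicodedata


def normalize_city_key(city_name: str) -> str:
    """Build a case-, accent- and punctuation-insensitive lookup key."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", city_name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Keep letters and digits only, so apostrophes, hyphens and spaces vanish.
    return "".join(ch for ch in without_marks.casefold() if ch.isalnum())
```

Storing this key in an indexed column next to `name` would let "'s-Hertogenbosch", "s Hertogenbosch" and "S-HERTOGENBOSCH" resolve to the same row.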
### Phase 3: Integration (Est. 30 mins)
- [ ] Update `src/glam_extractor/identifiers/lookups.py`
  - [ ] Replace `_load_json("nl_city_locodes.json")` with GeoNames DB
  - [ ] Update `get_city_locode()` → `get_city_abbreviation()`
  - [ ] Update `get_province_code()` to use GeoNames admin1_code
- [ ] Update `src/glam_extractor/identifiers/ghcid.py`
  - [ ] Update docstrings (UN/LOCODE → GeoNames)
  - [ ] Keep validation regex unchanged (format identical)
### Phase 4: Testing (Est. 1 hour)
- [ ] Unit tests for `GeoNamesDB` class
  - [ ] Test city lookup (Amsterdam → AMS)
  - [ ] Test missing city handling
  - [ ] Test province code mapping
  - [ ] Test GeoNames ID retrieval
- [ ] Integration tests with ISIL parser
  - [ ] Re-run ISIL registry parsing
  - [ ] Verify GHCID generation rate >95% (vs 41.8% before)
  - [ ] Compare old vs new GHCIDs for changes
- [ ] Edge case tests
  - [ ] Cities with special characters ('s-Hertogenbosch)
  - [ ] Cities with spaces (Den Haag)
  - [ ] Multilingual city names
### Phase 5: Documentation (Est. 30 mins)
- [ ] Update `docs/plan/global_glam/06-global-identifier-system.md`
  - [ ] Replace UN/LOCODE references with GeoNames
  - [ ] Document city abbreviation algorithm
- [ ] Update `AGENTS.md`
  - [ ] Add GeoNames database instructions
  - [ ] Explain migration from UN/LOCODE
- [ ] Add migration guide: `docs/migration/ghcid_locode_to_geonames.md`
### Phase 6: Validation (Est. 30 mins)
- [ ] Run full test suite (target: 150+ tests passing)
- [ ] Generate GHCIDs for all 364 ISIL records
- [ ] Generate GHCIDs for all 1,351 Dutch orgs
- [ ] Identify any remaining cities without GeoNames match
- [ ] Document edge cases and resolution strategy
**Total Estimated Time**: ~4.5-6.5 hours (sum of the phase estimates above)
---
## Performance Benchmarks
### Expected Performance (SQLite Local DB)
| Operation | Time | Notes |
|-----------|------|-------|
| Database load | 10ms | One-time per process |
| City lookup | <1ms | With index on (name, country_code) |
| Batch lookup (1,000 cities) | ~100ms | 0.1ms per city |
| Batch lookup (10,000 cities) | ~1s | 0.1ms per city |
**Comparison to API**:
- API request latency: 200-500ms per city
- Batch 1,000 cities via API: ~200-500 seconds (with rate limits)
- Batch 1,000 cities via SQLite: ~0.1 seconds
**Winner**: SQLite is **2,000-5,000x faster** than API.
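
The SQLite figures above are estimates, so they are worth measuring on the target machine. A minimal timing sketch follows, using an in-memory table filled with synthetic city names; `time_lookups` is a hypothetical helper, and real numbers against `data/reference/geonames.db` on disk will differ.

```python
import sqlite3
import time


def time_lookups(n_cities: int = 1000) -> float:
    """Return the seconds spent on n_cities indexed lookups (in-memory table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cities (name TEXT, country_code TEXT)")
    names = [f"City{i:05d}" for i in range(n_cities)]  # synthetic data
    conn.executemany(
        "INSERT INTO cities VALUES (?, 'NL')", [(name,) for name in names]
    )
    conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")

    start = time.perf_counter()
    for name in names:
        row = conn.execute(
            "SELECT name FROM cities WHERE name = ? AND country_code = ?",
            (name, "NL"),
        ).fetchone()
        if row is None:
            raise LookupError(f"missing synthetic city {name}")
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed


if __name__ == "__main__":
    seconds = time_lookups(1000)
    print(f"1,000 lookups took {seconds * 1000:.1f} ms")
```

Dividing the result by the number of lookups gives a per-city latency that can be compared with the <1ms target.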
### Storage Requirements
| Scope | Database Size | Records |
|-------|---------------|---------|
| Netherlands only | ~5 MB | ~5,000 places |
| Western Europe | ~50 MB | ~50,000 places |
| Global (cities >15k pop) | ~200 MB | ~200,000 cities |
| Global (all places) | ~1.5 GB | ~11 million places |
**Decision**: Start with **Netherlands only** (5MB), expand as needed.
---
## Data Freshness & Maintenance
### Update Frequency
**GeoNames**: Updated daily (new cities, name changes, population updates)
**Our database**: Update **quarterly** (every 3 months)
**Rationale**:
- Heritage institutions rarely relocate
- City names stable over years
- Population changes irrelevant for GHCID
- Quarterly updates sufficient for accuracy
### Update Process
```bash
#!/bin/bash
# scripts/update_geonames_db.sh
set -euo pipefail

# Download latest GeoNames data
wget -O NL.zip http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip

# Rebuild database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate
python scripts/validate_geonames_db.py

echo "✅ GeoNames database updated"
```
**Automation**: Run via cron job or GitHub Actions on the quarterly schedule above.
---
## Comparison: Before & After
### Before (UN/LOCODE)
**Coverage**:
- 50 Dutch cities (10.5%)
- 152/364 ISIL records with GHCIDs (41.8%)
- 212 records **cannot generate GHCID** (58.2%)
**Data source**:
- JSON file: `data/reference/nl_city_locodes.json`
- Manual curation required
- No automation for updates
**Limitations**:
- Cannot expand to global institutions
- Missing most small/medium cities
- No multilingual support
### After (GeoNames)
**Coverage**:
- 475+ Dutch cities (100%)
- Expected: 346+/364 ISIL records with GHCIDs (>95%)
- <20 records without GHCID (edge cases)
**Data source**:
- SQLite database: `data/reference/geonames.db`
- Automated build from GeoNames dumps
- Quarterly updates via script
**Benefits**:
- Ready for global expansion (139 conversation files)
- Covers all heritage institution locations
- Multilingual name support
- Coordinates for mapping
- Province/state codes included
---
## Alternatives Rejected
### 1. OpenStreetMap Nominatim
**Pros**: Free geocoding API, comprehensive
**Cons**: Rate limits (1 req/sec on the public instance), intended for geocoding not lookup, offline use means self-hosting a large OSM import
**Why rejected**: Same API limitations as the GeoNames API, less structured data
### 2. Google Maps Geocoding API
**Pros**: High quality, fast, reliable
**Cons**: **Costs money** ($5 per 1,000 requests), vendor lock-in, requires API key
**Why rejected**: Not free/open, unsustainable for large-scale extraction
### 3. Custom City Abbreviation Registry
**Pros**: Full control, optimized for heritage sector
**Cons**: **Months of manual curation**, ongoing maintenance burden, duplication of effort
**Why rejected**: Reinventing the wheel, GeoNames already exists and is maintained
### 4. Keep UN/LOCODE + Manual Additions
**Pros**: Minimal code changes
**Cons**: Doesn't solve scalability, still manual curation, not global-ready
**Why rejected**: Doesn't address root problem, technical debt
---
## Risk Assessment
### Risk 1: GeoNames Service Discontinuation
**Likelihood**: Low (operated by GeoNames since 2005, widely used)
**Impact**: Medium (need alternative source)
**Mitigation**:
- We use **offline database**, not dependent on API
- GeoNames data dumps archived by multiple organizations
- Could switch to OpenStreetMap/WikiData if needed
### Risk 2: City Name Ambiguity
**Scenario**: Multiple cities with same name (e.g., "Portland" in USA, UK, Australia)
**Likelihood**: Medium (common in global dataset)
**Impact**: Medium (wrong GHCID if not disambiguated)
**Mitigation**:
- Always provide **country code** in lookups
- Use **population ranking** (larger city preferred)
- Validate with **province/state code** match
- Log warnings for ambiguous matches
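
The mitigation steps above can be sketched as a single query: filter by country, optionally by admin1 code, order by population, and warn when more than one candidate remains. `pick_city` is a hypothetical function, and it assumes the `cities` table from the schema in this document.

```python
import logging
import sqlite3
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def pick_city(
    conn: sqlite3.Connection,
    name: str,
    country_code: str,
    admin1_code: Optional[str] = None,
) -> Optional[Tuple[int, str, int]]:
    """Return (geonames_id, admin1_code, population) of the best candidate."""
    sql = (
        "SELECT geonames_id, admin1_code, population FROM cities "
        "WHERE name = ? AND country_code = ?"
    )
    params = [name, country_code]
    if admin1_code is not None:
        # Narrow by province/state when the caller knows it.
        sql += " AND admin1_code = ?"
        params.append(admin1_code)
    # Largest population first, so the first row is the preferred match.
    sql += " ORDER BY population DESC"
    rows = conn.execute(sql, params).fetchall()
    if len(rows) > 1:
        logger.warning(
            "Ambiguous city %r in %s: %d candidates", name, country_code, len(rows)
        )
    return rows[0] if rows else None
```

Passing the admin1 code whenever the source record has one avoids relying on the population heuristic, which can pick the wrong place when a heritage institution sits in the smaller namesake.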
### Risk 3: Database Corruption
**Likelihood**: Low (SQLite very stable)
**Impact**: High (all GHCIDs incorrect)
**Mitigation**:
- **Checksum validation** after build
- Version control: `geonames-v2025.11.db`
- Keep backup copies
- Automated tests verify data integrity
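
One way to implement the checksum step is to record a SHA-256 digest at build time and compare it before use, combined with SQLite's own `PRAGMA integrity_check`. The sketch below assumes the expected digest is stored somewhere alongside the database; `sha256_of_file` and `database_is_healthy` are hypothetical names.

```python
import hashlib
import sqlite3


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def database_is_healthy(path: str, expected_sha256: str) -> bool:
    """Check the file digest first, then ask SQLite to verify its structures."""
    if sha256_of_file(path) != expected_sha256:
        return False
    conn = sqlite3.connect(path)
    try:
        result = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    return result is not None and result[0] == "ok"
```

The digest catches accidental replacement or truncation of the file, while the pragma catches internal page-level damage that a matching digest alone would not reveal if the file was already corrupt when it was hashed.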
### Risk 4: Abbreviation Collisions
**Scenario**: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" both → "AMS")
**Likelihood**: Medium (common with 3-letter codes)
**Impact**: Medium (the city code alone is ambiguous; the rest of the GHCID has to tell the institutions apart)
**Mitigation**:
- **Province code** in GHCID separates cities in different provinces: `NL-NH-AMS` vs `NL-XX-AMS`
- Many collisions will fall in different provinces, but not all: Amsterdam and Amstelveen are both in North Holland, so both yield `NL-NH-AMS`
- If same province: Wikidata Q-number resolves (collision resolution)
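
To see how often the same-province case occurs, the build step could group cities by admin1 code and generated abbreviation. The sketch below reuses the first-3-letters rule from this document; `find_abbreviation_collisions` is a hypothetical helper, and the admin1 values in the usage comment are placeholders for whatever the database holds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def abbreviate(city_name: str) -> str:
    """First 3 characters after removing apostrophes, hyphens and spaces."""
    normalized = city_name.replace("'", "").replace("-", "").replace(" ", "")
    return normalized[:3].upper()


def find_abbreviation_collisions(
    cities: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Map (admin1_code, abbreviation) to the city names that share it.

    `cities` yields (city_name, admin1_code) pairs; only groups with more
    than one city are returned.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, admin1_code in cities:
        groups[(admin1_code, abbreviate(name))].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Usage: find_abbreviation_collisions([("Amsterdam", "07"), ("Amstelveen", "07")])
# reports one group keyed by ("07", "AMS") containing both cities.
```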
---
## Future Enhancements
### 1. Global Expansion (Priority: High)
**Goal**: Support 139 conversation files covering 60+ countries
**Tasks**:
- Download GeoNames global dataset (`allCountries.zip`)
- Build SQLite database for all countries (~200MB if limited to cities; ~1.5GB if every place is loaded, per Storage Requirements)
- Test with conversation JSON extraction
- Validate GHCID generation for international institutions
**Timeline**: After Phase 1 complete (Dutch institutions)
### 2. Multilingual City Name Matching (Priority: Medium)
**Goal**: Match cities by local names (e.g., "Den Haag" = "The Hague")
**Tasks**:
- Load GeoNames `alternateNamesV2.txt` into database
- Support lookup by any alternate name
- Return canonical name + GeoNames ID
**Use case**: Conversation JSONs mention cities in local language
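
A lookup by any alternate name could try the primary name first and fall back to a join on `alternate_names`. The sketch below assumes the `cities` and `alternate_names` tables from the schema in this document; `find_city_by_any_name` is a hypothetical function.

```python
import sqlite3
from typing import Optional, Tuple


def find_city_by_any_name(
    conn: sqlite3.Connection, name: str, country_code: str
) -> Optional[Tuple[int, str]]:
    """Return (geonames_id, canonical name) for a primary or alternate name."""
    # 1. Exact match on the primary name.
    row = conn.execute(
        "SELECT geonames_id, name FROM cities "
        "WHERE name = ? AND country_code = ? "
        "ORDER BY population DESC LIMIT 1",
        (name, country_code),
    ).fetchone()
    if row is not None:
        return row
    # 2. Fall back to any alternate name that points at a city in that country.
    return conn.execute(
        "SELECT c.geonames_id, c.name "
        "FROM alternate_names AS a "
        "JOIN cities AS c ON c.geonames_id = a.geonames_id "
        "WHERE a.alternate_name = ? AND c.country_code = ? "
        "ORDER BY c.population DESC LIMIT 1",
        (name, country_code),
    ).fetchone()
```

Returning the canonical name together with the GeoNames ID means the abbreviation is always derived from one spelling, whichever variant appeared in the source text.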
### 3. Geocoding Integration (Priority: Medium)
**Goal**: Populate `location.latitude` and `location.longitude` fields
**Tasks**:
- Add `get_coordinates(city, country)` method
- Integrate with `HeritageCustodian.locations` list
- Support reverse geocoding (lat/lon → city name)
**Benefit**: Enable map visualizations, geographic analysis
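
Reverse geocoding against a small city table can be approximated with a great-circle distance and a linear scan. The sketch below uses the haversine formula; `haversine_km` and `nearest_city` are hypothetical helpers, and a global dataset would likely need a spatial index instead of scanning every row.

```python
import math
from typing import Iterable, Optional, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_city(
    lat: float, lon: float, cities: Iterable[Tuple[str, float, float]]
) -> Optional[str]:
    """Return the name of the closest (name, latitude, longitude) entry."""
    best = min(
        cities,
        key=lambda city: haversine_km(lat, lon, city[1], city[2]),
        default=None,
    )
    return best[0] if best is not None else None
```

For the Netherlands-only database a linear scan over a few thousand rows is probably fast enough for batch use.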
### 4. Province/State Inference (Priority: Low)
**Goal**: Auto-detect province from city name (if not provided)
**Tasks**:
- Use GeoNames `admin1_code` field
- Map to ISO 3166-2 province codes
- Handle edge cases (disputed territories)
**Use case**: ISIL registry doesn't always specify province
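
GeoNames admin1 codes are not ISO 3166-2 codes themselves (the schema example is `NL.07` for North Holland, while the GHCID uses `NH`), so the mapping step needs an explicit table. The sketch below shows the shape of such a table with an illustrative subset only: the `07` entry follows the schema example in this document, and the other two entries are assumptions that should be verified against `admin1CodesASCII.txt` and ISO 3166-2:NL before use. `NL_ADMIN1_TO_ISO` and `admin1_to_iso_3166_2` are hypothetical names.

```python
from typing import Optional

# Illustrative subset, NOT a complete or verified table.
# "07" = North Holland follows the schema comment in this document;
# "09" and "11" are assumptions to be checked against admin1CodesASCII.txt.
NL_ADMIN1_TO_ISO = {
    "07": "NH",  # North Holland
    "09": "UT",  # Utrecht (assumed)
    "11": "ZH",  # South Holland (assumed)
}


def admin1_to_iso_3166_2(country_code: str, admin1_code: str) -> Optional[str]:
    """Map a GeoNames admin1 code to an ISO 3166-2 code, or None if unknown."""
    if country_code != "NL":
        return None  # only the Dutch table is sketched here
    suffix = NL_ADMIN1_TO_ISO.get(admin1_code)
    return f"NL-{suffix}" if suffix else None
```

Returning `None` for unmapped codes, rather than guessing, keeps disputed or missing provinces visible as gaps that can be logged and resolved by hand.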
### 5. City Similarity Matching (Priority: Low)
**Goal**: Handle typos, spelling variations
**Tasks**:
- Implement fuzzy matching (Levenshtein distance)
- Suggest corrections for unmatched cities
- Confidence scoring for matches
**Example**: "Amsterdm" → "Amsterdam" (confidence: 0.95)
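
A dependency-free version of the Levenshtein approach is sketched below. The confidence score used here (1 minus distance divided by the longer length) is only one possible definition, chosen for illustration, and will not necessarily reproduce the 0.95 figure in the example above. `levenshtein` and `best_match` are hypothetical helpers.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def best_match(
    query: str, candidates: Iterable[str]
) -> Optional[Tuple[str, float]]:
    """Return (candidate, confidence) for the closest candidate, or None."""
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        longest = max(len(query), len(candidate)) or 1
        distance = levenshtein(query.casefold(), candidate.casefold())
        score = 1.0 - distance / longest
        if best is None or score > best[1]:
            best = (candidate, score)
    return best
```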
---
## Success Metrics
### Coverage Targets
| Metric | Before (UN/LOCODE) | Target (GeoNames) | Status |
|--------|-------------------|-------------------|--------|
| Dutch city coverage | 50 (10.5%) | 475 (100%) | Pending |
| GHCID generation rate (ISIL) | 152/364 (41.8%) | ≥346/364 (>95%) | Pending |
| GHCID generation rate (Dutch orgs) | Unknown | ≥1,284/1,351 (>95%) | Pending |
| Global city coverage | N/A | >200,000 cities | Future 🔮 |
### Performance Targets
| Metric | Target | Status |
|--------|--------|--------|
| City lookup latency | <1ms | Pending |
| Database load time | <10ms | Pending |
| Database size (NL-only) | <10MB | Pending |
| Test coverage | >90% | Pending |
### Quality Targets
| Metric | Target | Status |
|--------|--------|--------|
| City name accuracy | >99% | Pending |
| Province code accuracy | >95% | Pending |
| GeoNames ID linkage | >90% | Pending |
| Database integrity checks passing | 100% | Pending |
---
## References
### External Resources
- **GeoNames Official**: https://www.geonames.org
- **GeoNames Downloads**: http://download.geonames.org/export/dump/
- **GeoNames Documentation**: https://www.geonames.org/export/
- **GeoNames Web Services**: https://www.geonames.org/export/web-services.html
- **ISO 3166-2 (Provinces)**: https://en.wikipedia.org/wiki/ISO_3166-2
### Internal Documentation
- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **Architecture**: `docs/plan/global_glam/02-architecture.md`
- **Schema**: `schemas/heritage_custodian.yaml`
### Implementation Files
- **GHCID Generator**: `src/glam_extractor/identifiers/ghcid.py`
- **Lookups**: `src/glam_extractor/identifiers/lookups.py`
- **GeoNames Client** (to be created): `src/glam_extractor/geocoding/geonames_lookup.py`
- **Database** (to be created): `data/reference/geonames.db`
- **Build Script** (to be created): `scripts/build_geonames_db.py`
---
## Approval & Sign-off
**Decision Made**: 2025-11-05
**Approved By**: GLAM Data Extraction Project Team
**Implementation Owner**: GeoNames Integration Team
**Review Date**: 2025-12-05 (1 month post-implementation)
**Status**: ✅ **Design Approved - Ready for Implementation**
---
**Last Updated**: 2025-11-05
**Version**: 1.0
**Next Review**: After Phase 1 implementation complete