# GeoNames Integration for GHCID City Abbreviations

**Version**: 1.0
**Date**: 2025-11-05
**Status**: Design Complete - Implementation Pending
**Decision**: **Option B - GeoNames Local Database**

---
## Executive Summary

To generate Global Heritage Custodian Identifiers (GHCIDs) for heritage institutions worldwide, we need a comprehensive source of city abbreviations. This document explains why we are replacing UN/LOCODE with a **local GeoNames database** and provides implementation guidelines.

**Key Decision**: Use an offline GeoNames database (SQLite) rather than the GeoNames API or UN/LOCODE.

---
## Problem Statement

### Current Limitation: UN/LOCODE Coverage

The GHCID format requires 3-letter city codes:
```
{ISO-3166-1}-{ISO-3166-2}-{CITY-CODE}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM
```

**Current approach** uses UN/LOCODE (United Nations Code for Trade and Transport Locations):
- **Coverage**: Only 50 Dutch cities in `data/reference/nl_city_locodes.json`
- **Required**: 475 Dutch cities (based on Dutch heritage datasets)
- **Gap**: 89.5% of cities missing from lookup table
- **Impact**: Only 41.8% GHCID generation rate (152/364 ISIL records)
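The percentages above follow directly from the quoted counts. A minimal sketch that recomputes them (the variable names are illustrative, not taken from the codebase):

```python
# Figures quoted in this section
cities_in_lookup = 50
cities_required = 475
isil_with_ghcid = 152
isil_total = 364

coverage = cities_in_lookup / cities_required
gap = 1 - coverage
generation_rate = isil_with_ghcid / isil_total
records_without_ghcid = isil_total - isil_with_ghcid

print(f"City coverage:   {coverage:.1%}")
print(f"Coverage gap:    {gap:.1%}")
print(f"Generation rate: {generation_rate:.1%}")
print(f"Records without GHCID: {records_without_ghcid}")
```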
### Why UN/LOCODE Falls Short

1. **Trade-focused**: UN/LOCODE prioritizes ports, airports, and rail terminals
2. **Major cities only**: Small and medium-sized cities are often omitted
3. **Incomplete coverage**: Many heritage-rich towns are excluded
4. **Not heritage-centric**: No correlation with museum/archive density
5. **Limited maintenance**: Updates are infrequent and focused on logistics

### Example Gap

The Dutch ISIL registry includes institutions in:
- **Aalten** ✅ (in UN/LOCODE as AAT)
- **Achtkarspelen** ❌ (not in UN/LOCODE)
- **Almkerk** ❌ (not in UN/LOCODE)
- **Ameland** ❌ (not in UN/LOCODE)

Result: **212 out of 364 records** (58.2%) could not generate GHCIDs.

---
## Solution: GeoNames Geographic Database

### What is GeoNames?

**GeoNames** (https://www.geonames.org) is a comprehensive geographical database covering:
- **11+ million place names** worldwide
- **Hierarchical data**: City → Province/State → Country
- **Multilingual names**: Local + alternative names in multiple languages
- **Geographic coordinates**: Latitude/longitude for all places
- **Population data**: Size/importance ranking
- **Administrative divisions**: admin1 (province/state) codes that can be mapped to ISO 3166-2
- **Free & open**: Creative Commons Attribution license
### Why GeoNames for GHCID?

| Criterion | UN/LOCODE | GeoNames | Winner |
|-----------|-----------|----------|--------|
| **Dutch city coverage** | 50 cities (10.5%) | 475+ cities (100%) | GeoNames ✅ |
| **Global coverage** | ~100,000 locations | 11+ million | GeoNames ✅ |
| **Heritage relevance** | Trade/transport focus | All populated places | GeoNames ✅ |
| **Coordinates included** | Some | All | GeoNames ✅ |
| **Hierarchical data** | No | Yes (city→province→country) | GeoNames ✅ |
| **Update frequency** | Quarterly | Daily | GeoNames ✅ |
| **API availability** | No official API | Yes (free tier) | GeoNames ✅ |
| **Offline usage** | CSV dumps | Tab-delimited text dumps (loaded into SQLite) | Tie ✅ |
| **Cost** | Free | Free | Tie ✅ |

**Conclusion**: GeoNames is the better fit for heritage institution geolocation.

---
## Implementation Options Considered

### Option A: GeoNames API Integration

**Approach**: Query the GeoNames web API in real time for city lookups.

```python
# Sketch (requires the third-party `requests` package)
import requests


class GeoNamesAPIClient:
    def get_city_abbreviation(self, city_name: str, country: str) -> dict:
        response = requests.get(
            "http://api.geonames.org/searchJSON",
            params={
                "name": city_name,
                "country": country,
                "maxRows": 1,
                "username": "glam_extractor",
            },
            timeout=10,
        )
        response.raise_for_status()
        matches = response.json().get("geonames", [])
        if not matches:
            raise LookupError(f"No GeoNames match for {city_name!r} in {country}")
        # Generate 3-letter abbreviation from city name
        return {
            "abbreviation": city_name[:3].upper(),
            "geonames_id": matches[0]["geonameId"],
        }
```

**Pros**:
- ✅ No local storage required
- ✅ Always up-to-date data
- ✅ Fast to implement (~2 hours)
- ✅ Simple integration

**Cons**:
- ❌ **Rate limits**: 2,000 requests/hour (free tier), 30,000/day
- ❌ Requires network connectivity
- ❌ Latency on each lookup (~200-500ms)
- ❌ Single point of failure (API downtime)
- ❌ Processing 1,351 Dutch organizations consumes most of the hourly quota in one run; re-runs and retries could hit the rate limit

**Verdict**: Good for prototyping, **not suitable for production**.

---
### Option B: GeoNames Local Database ✅ **SELECTED**

**Approach**: Download the GeoNames data dump once and store it in a SQLite database for offline lookups.

```python
# Sketch
import sqlite3
from typing import Optional


class GeoNamesLocalDB:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row  # allow access by column name

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT geonames_id, name, admin1_code, latitude, longitude "
            "FROM cities WHERE name = ? AND country_code = ?",
            (city_name, country),
        )
        result = cursor.fetchone()
        if result is None:
            return None  # city not in the database
        return {"abbreviation": self._generate_abbreviation(result["name"]), **dict(result)}

    def _generate_abbreviation(self, city_name: str) -> str:
        # First 3 letters of city name, uppercase
        return city_name[:3].upper()
```

**Pros**:
- ✅ **No rate limits** - unlimited local queries
- ✅ **Fast lookups** - <1ms with indexes
- ✅ **Offline operation** - no network dependency
- ✅ **Predictable performance** - no API downtime risk
- ✅ **Cost-effective** - free download, ~200MB storage
- ✅ **Reproducible** - same data snapshot across environments
- ✅ **Cacheable** - SQLite file easy to distribute

**Cons**:
- ⚠️ **Initial setup** - requires download + database creation (~30 mins)
- ⚠️ **Storage** - ~200MB for worldwide data, ~5MB for NL-only
- ⚠️ **Staleness** - needs periodic updates (monthly/quarterly)
- ⚠️ **Implementation time** - ~4 hours total

**Verdict**: **BEST for production** - predictable, fast, no runtime network dependencies.

---
### Option C: Hybrid Approach (API + Local Cache)

**Approach**: Use the local database, fall back to the API for missing cities, and cache the results.

```python
from typing import Optional


class GeoNamesHybrid:
    def __init__(self, local_db, api_client):
        self.local_db = local_db
        self.api_client = api_client

    def get_city_abbreviation(self, city_name: str, country: str) -> Optional[dict]:
        # Try local DB first
        result = self.local_db.lookup(city_name, country)
        if result:
            return result

        # Fallback to API
        result = self.api_client.lookup(city_name, country)

        # Cache to local DB (only when the API found a match)
        if result:
            self.local_db.insert(result)
        return result
```

**Pros**:
- ✅ Best of both worlds
- ✅ Auto-updates for missing cities
- ✅ Resilient to data gaps

**Cons**:
- ❌ **Most complex** implementation
- ❌ Still subject to rate limits (for misses)
- ❌ Two failure modes (DB + API)
- ❌ Longer implementation time (~5-6 hours)

**Verdict**: Over-engineered for current needs, **defer to future**.

---
## Selected Solution: Option B Implementation Plan

### Architecture

```
┌─────────────────────────────────────────────────────┐
│  GHCID Generator                                    │
│  (src/glam_extractor/identifiers/ghcid.py)          │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  GeoNames Lookup Service                            │
│  (src/glam_extractor/geocoding/geonames_lookup.py)  │
│                                                     │
│  - get_city_abbreviation(city, country)             │
│  - get_city_details(city, country)                  │
│  - get_geonames_id(city, country)                   │
└────────────────┬────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────┐
│  SQLite Database                                    │
│  (data/reference/geonames.db)                       │
│                                                     │
│  Tables:                                            │
│  - cities (geonames_id, name, country_code, ...)    │
│  - admin1_codes (province mappings)                 │
│  - alternate_names (multilingual)                   │
└─────────────────────────────────────────────────────┘
```
### Database Schema

```sql
-- Main cities table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    ascii_name TEXT NOT NULL,
    country_code TEXT NOT NULL,
    admin1_code TEXT,           -- GeoNames province/state code (e.g. "07"); mapped to ISO 3166-2 separately
    admin2_code TEXT,           -- County/municipality
    latitude REAL NOT NULL,
    longitude REAL NOT NULL,
    population INTEGER,
    elevation INTEGER,
    feature_code TEXT,          -- PPL, PPLA, PPLC (city types)
    timezone TEXT,
    modification_date TEXT
);

-- Index for fast city lookups
CREATE INDEX idx_city_country ON cities(name, country_code);
CREATE INDEX idx_country_pop ON cities(country_code, population DESC);

-- Province/state codes (GeoNames admin1, keyed as "<country>.<admin1_code>")
CREATE TABLE admin1_codes (
    code TEXT PRIMARY KEY,      -- e.g., "NL.07" for North Holland
    name TEXT NOT NULL,         -- "North Holland"
    ascii_name TEXT NOT NULL,
    geonames_id INTEGER
);

-- Alternative names (multilingual support)
CREATE TABLE alternate_names (
    alternate_name_id INTEGER PRIMARY KEY,
    geonames_id INTEGER NOT NULL,
    isolanguage TEXT,           -- en, nl, fr, etc.
    alternate_name TEXT NOT NULL,
    is_preferred_name INTEGER,
    is_short_name INTEGER,
    FOREIGN KEY (geonames_id) REFERENCES cities(geonames_id)
);

CREATE INDEX idx_altname_geonames ON alternate_names(geonames_id);
CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
```
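To illustrate how the `cities` table and its `(name, country_code)` index might be queried, here is a minimal sketch against an in-memory database. The `find_city` helper, the reduced column set, and the sample rows (placeholder IDs and populations, not GeoNames values) are assumptions for illustration only. The case-insensitive match and the `ascii_name` fallback are one possible way to handle the name normalization mentioned in the checklist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        ascii_name TEXT NOT NULL,
        country_code TEXT NOT NULL,
        admin1_code TEXT,
        population INTEGER
    )
    """
)
conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")

# Illustrative rows; IDs and populations are placeholders, not GeoNames values.
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Amsterdam", "Amsterdam", "NL", "07", 700000),
        (2, "'s-Hertogenbosch", "'s-Hertogenbosch", "NL", "06", 100000),
    ],
)


def find_city(conn, city_name, country_code):
    """Return the best matching row (largest population first) or None."""
    return conn.execute(
        "SELECT geonames_id, name, admin1_code FROM cities "
        "WHERE (name = ? COLLATE NOCASE OR ascii_name = ? COLLATE NOCASE) "
        "AND country_code = ? "
        "ORDER BY population DESC LIMIT 1",
        (city_name, city_name, country_code),
    ).fetchone()


row = find_city(conn, "amsterdam", "NL")
print(dict(row))
print(find_city(conn, "Atlantis", "NL"))
```

A miss returns `None`, so callers can decide whether to log the city as unmatched or fall back to alternate names.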
### Data Source

**GeoNames Download URL**:
```
http://download.geonames.org/export/dump/
```

**Files needed**:
1. **Cities**: `cities15000.zip` (cities with >15,000 population) - ~25MB
   - Alternative: `allCountries.zip` (all places) - ~350MB
2. **Admin codes**: `admin1CodesASCII.txt` - province/state mappings
3. **Alternate names**: `alternateNamesV2.zip` - multilingual names

**Netherlands-specific**:
- `NL.zip` - All Dutch places (~200KB uncompressed)
- Includes all 475+ cities in Dutch heritage datasets
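### Database Build Process

```python
# scripts/build_geonames_db.py

import argparse
import csv
import sqlite3
from pathlib import Path


def create_tables(cursor: sqlite3.Cursor) -> None:
    """Create the cities and admin1_codes tables (columns as in the schema above)."""
    cursor.executescript(
        """
        CREATE TABLE IF NOT EXISTS cities (
            geonames_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
            ascii_name TEXT NOT NULL, country_code TEXT NOT NULL,
            admin1_code TEXT, admin2_code TEXT,
            latitude REAL NOT NULL, longitude REAL NOT NULL,
            population INTEGER, elevation INTEGER, feature_code TEXT,
            timezone TEXT, modification_date TEXT
        );
        CREATE TABLE IF NOT EXISTS admin1_codes (
            code TEXT PRIMARY KEY, name TEXT NOT NULL,
            ascii_name TEXT NOT NULL, geonames_id INTEGER
        );
        """
    )


def build_geonames_database(
    geonames_file: str = "NL.txt",
    admin1_file: str = "admin1CodesASCII.txt",
    output_db: str = "data/reference/geonames.db",
) -> None:
    """
    Build SQLite database from GeoNames text files.

    Steps:
    1. Download GeoNames data (if not exists) - not shown in this sketch
    2. Create SQLite database
    3. Parse GeoNames TSV files
    4. Insert data with indexes
    5. Validate completeness (row counts are reported at the end)
    """
    Path(output_db).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(output_db)
    cursor = conn.cursor()

    # Create tables (see schema above)
    create_tables(cursor)

    # Parse cities file. QUOTE_NONE: the dumps are plain TSV, and names may
    # contain quote characters that csv would otherwise treat as quoting.
    with open(geonames_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT OR REPLACE INTO cities VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row[0],           # geonames_id
                    row[1],           # name
                    row[2],           # ascii_name
                    row[8],           # country_code
                    row[10],          # admin1_code
                    row[11],          # admin2_code
                    row[4],           # latitude
                    row[5],           # longitude
                    row[14] or None,  # population (empty string -> NULL)
                    row[15] or None,  # elevation (column 16 is the DEM value)
                    row[7],           # feature_code
                    row[17],          # timezone
                    row[18],          # modification_date
                ),
            )

    # Parse admin1 codes (province mappings)
    with open(admin1_file, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cursor.execute(
                "INSERT OR REPLACE INTO admin1_codes VALUES (?, ?, ?, ?)",
                (row[0], row[1], row[2], row[3]),
            )

    # Create indexes
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_city_country ON cities(name, country_code)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_country_pop ON cities(country_code, population DESC)")

    city_count = cursor.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    country_count = cursor.execute("SELECT COUNT(DISTINCT country_code) FROM cities").fetchone()[0]

    conn.commit()
    conn.close()

    print(f"✅ GeoNames database created: {output_db}")
    print(f"   Cities: {city_count}")
    print(f"   Countries: {country_count}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Build the GeoNames SQLite database")
    parser.add_argument("--input", default="NL.txt")
    parser.add_argument("--admin1", default="admin1CodesASCII.txt")
    parser.add_argument("--output", default="data/reference/geonames.db")
    args = parser.parse_args()
    build_geonames_database(args.input, args.admin1, args.output)
```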
### City Abbreviation Strategy

**Decision**: Use the **first 3 letters** of the city name (uppercase).

**Rationale**:
1. ✅ **Simplicity**: Easy to understand and implement
2. ✅ **Compatibility**: Many UN/LOCODE codes use this pattern (Amsterdam → AMS)
3. ✅ **Predictability**: No complex rules, consistent output
4. ✅ **Readability**: Often recognizable (Rotterdam → ROT, Utrecht → UTR)

**Algorithm**:
```python
import unicodedata


def generate_city_abbreviation(city_name: str) -> str:
    """
    Generate 3-letter city abbreviation from name.

    Rules:
    1. Strip accents, then remove spaces/punctuation
    2. Take first 3 letters
    3. Convert to uppercase

    Examples:
        Amsterdam → AMS
        Rotterdam → ROT
        Den Haag → DEN
        's-Hertogenbosch → SHE
        Aalst → AAL
    """
    # Normalize: decompose accented letters, then keep ASCII letters only
    decomposed = unicodedata.normalize("NFKD", city_name)
    normalized = "".join(c for c in decomposed if c.isascii() and c.isalpha())

    # Take first 3 letters, uppercase
    abbreviation = normalized[:3].upper()

    return abbreviation
```

**Alternative considered**: First 3 consonants
- Amsterdam → MST (M-S-T)
- Rotterdam → RTT (R-T-T)
- **Rejected**: Less intuitive, harder to reverse-lookup
### Integration Points

**Files to modify**:

1. **`src/glam_extractor/geocoding/geonames_lookup.py`** (NEW)
   - SQLite client for city lookups
   - Abbreviation generation logic
   - Province code mapping (ISO 3166-2)

2. **`src/glam_extractor/identifiers/lookups.py`** (UPDATE)
   - Replace `get_city_locode()` with `get_city_abbreviation()`
   - Update `get_ghcid_components_for_dutch_city()` to use GeoNames

3. **`src/glam_extractor/identifiers/ghcid.py`** (UPDATE)
   - Update docstrings: "UN/LOCODE" → "GeoNames-based abbreviation"
   - Update validation regex (same format, 3 uppercase letters)

4. **`data/reference/geonames.db`** (NEW)
   - SQLite database with Dutch cities
   - Initial build: ~5MB for Netherlands
   - Global expansion: ~200MB for all countries

5. **`scripts/build_geonames_db.py`** (NEW)
   - Downloads GeoNames data
   - Builds SQLite database
   - Validates completeness

6. **`data/reference/nl_city_locodes.json`** (DEPRECATE)
   - Mark as deprecated in comments
   - Keep for backward compatibility (short-term)
   - Remove after GHCID migration complete
### GHCID Format Changes

**Before (UN/LOCODE)**:
```
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City LOCODE (Amsterdam)
│  └────────── Province (North Holland)
└───────────── Country (Netherlands)
```

**After (GeoNames)**:
```
NL-NH-AMS-M-RM
│  │  │   │ └─ Abbreviation (Rijksmuseum)
│  │  │   └─── Type (Museum)
│  │  └─────── City abbreviation (Amsterdam, from GeoNames)
│  └────────── Province (from GeoNames admin1_code)
└───────────── Country (Netherlands)
```

**Note**: Format is **identical**, only the source of city abbreviation changes.
### Migration Impact

**Breaking Changes**:
- ⚠️ Some city codes may change (where the GeoNames-based abbreviation differs from the UN/LOCODE code)
- ⚠️ The existing 152 GHCIDs may need regeneration
- ⚠️ Numeric hashes change for every GHCID whose string changes (they are computed from the GHCID string)

**Migration strategy**:
1. **Document changes** in GHCID history
2. **Preserve old GHCIDs** in `ghcid_original` field
3. **Update `ghcid_current`** with new abbreviation
4. **Regenerate `ghcid_numeric`** from new GHCID
5. **Add history entry**:
   ```python
   GHCIDHistoryEntry(
       ghcid="NL-NH-XXX-M-RM",  # Old LOCODE-based
       valid_from="2025-10-01",
       valid_to="2025-11-05",
       reason="Migrated from UN/LOCODE to GeoNames abbreviation"
   )
   ```

**Backward compatibility**:
- Keep mapping table: `old_ghcid → new_ghcid`
- Support lookups by both old and new identifiers
- Phase out old format over 6-12 months
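The backward-compatibility mapping could be as simple as a dictionary consulted before the normal lookup. The sketch below is an assumption about how such a resolver might look: `OLD_TO_NEW`, `resolve_ghcid`, and the example identifiers are hypothetical placeholders, not real migration output. Calling `resolve_ghcid` with either an old or a current identifier returns the current one, and an unknown identifier yields `None`.

```python
from typing import Optional

# Hypothetical mapping of superseded GHCIDs to their replacements.
# The identifiers are placeholders, not real migration output.
OLD_TO_NEW = {
    "NL-NH-XXX-M-RM": "NL-NH-AMS-M-RM",
}
CURRENT_GHCIDS = set(OLD_TO_NEW.values())


def resolve_ghcid(ghcid: str) -> Optional[str]:
    """Return the current GHCID for either an old or a current identifier."""
    if ghcid in OLD_TO_NEW:
        return OLD_TO_NEW[ghcid]
    return ghcid if ghcid in CURRENT_GHCIDS else None


print(resolve_ghcid("NL-NH-XXX-M-RM"))
print(resolve_ghcid("NL-NH-AMS-M-RM"))
```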
---

## Implementation Checklist

### Phase 1: Database Setup (Est. 1-2 hours)

- [ ] Download GeoNames `NL.zip` dataset
- [ ] Download `admin1CodesASCII.txt` for province codes
- [ ] Create `scripts/build_geonames_db.py` script
- [ ] Generate `data/reference/geonames.db` SQLite database
- [ ] Validate: All 475 Dutch cities present
- [ ] Create indexes for fast lookups
- [ ] Document database schema in code comments

### Phase 2: Lookup Module (Est. 1-2 hours)

- [ ] Create `src/glam_extractor/geocoding/geonames_lookup.py`
- [ ] Implement `GeoNamesDB` class with SQLite connection
- [ ] Implement `get_city_abbreviation(city, country)` method
- [ ] Implement `get_city_details(city, country)` method
- [ ] Implement `get_geonames_id(city, country)` method
- [ ] Add support for alternate names (multilingual)
- [ ] Handle city name normalization (case, accents, hyphens)

### Phase 3: Integration (Est. 30 mins)

- [ ] Update `src/glam_extractor/identifiers/lookups.py`
  - [ ] Replace `_load_json("nl_city_locodes.json")` with GeoNames DB
  - [ ] Update `get_city_locode()` → `get_city_abbreviation()`
  - [ ] Update `get_province_code()` to use GeoNames admin1_code
- [ ] Update `src/glam_extractor/identifiers/ghcid.py`
  - [ ] Update docstrings (UN/LOCODE → GeoNames)
  - [ ] Keep validation regex unchanged (format identical)

### Phase 4: Testing (Est. 1 hour)

- [ ] Unit tests for `GeoNamesDB` class
  - [ ] Test city lookup (Amsterdam → AMS)
  - [ ] Test missing city handling
  - [ ] Test province code mapping
  - [ ] Test GeoNames ID retrieval
- [ ] Integration tests with ISIL parser
  - [ ] Re-run ISIL registry parsing
  - [ ] Verify GHCID generation rate >95% (vs 41.8% before)
  - [ ] Compare old vs new GHCIDs for changes
- [ ] Edge case tests
  - [ ] Cities with special characters ('s-Hertogenbosch)
  - [ ] Cities with spaces (Den Haag)
  - [ ] Multilingual city names

### Phase 5: Documentation (Est. 30 mins)

- [ ] Update `docs/plan/global_glam/06-global-identifier-system.md`
  - [ ] Replace UN/LOCODE references with GeoNames
  - [ ] Document city abbreviation algorithm
- [ ] Update `AGENTS.md`
  - [ ] Add GeoNames database instructions
  - [ ] Explain migration from UN/LOCODE
- [ ] Add migration guide: `docs/migration/ghcid_locode_to_geonames.md`

### Phase 6: Validation (Est. 30 mins)

- [ ] Run full test suite (target: 150+ tests passing)
- [ ] Generate GHCIDs for all 364 ISIL records
- [ ] Generate GHCIDs for all 1,351 Dutch orgs
- [ ] Identify any remaining cities without GeoNames match
- [ ] Document edge cases and resolution strategy

**Total Estimated Time**: ~4.5-6.5 hours (sum of the phase estimates above)

---
## Performance Benchmarks

### Expected Performance (SQLite Local DB)

| Operation | Time | Notes |
|-----------|------|-------|
| Database load | ~10ms | One-time per process |
| City lookup | <1ms | With index on (name, country_code) |
| Batch lookup (1,000 cities) | ~100ms | 0.1ms per city |
| Batch lookup (10,000 cities) | ~1s | 0.1ms per city |

**Comparison to API**:
- API request latency: 200-500ms per city
- Batch 1,000 cities via API: ~200-500 seconds (sequential requests, before any rate-limit waits)
- Batch 1,000 cities via SQLite: ~0.1 seconds

**Winner**: SQLite is expected to be **2,000-5,000x faster** than the API (200-500 s vs. 0.1 s per 1,000 lookups).
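These figures are estimates; a small timing harness can check them on a given machine. The sketch below uses synthetic rows (generated names, not GeoNames data) in an in-memory database, so it measures indexed lookup cost only and says nothing about disk I/O. Actual timings will vary, which is why no expected numbers are shown.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, country_code TEXT)"
)
# Synthetic rows purely for timing; names are generated, not GeoNames data.
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    ((i, f"City{i}", "NL") for i in range(5000)),
)
conn.execute("CREATE INDEX idx_city_country ON cities(name, country_code)")

names = [f"City{i}" for i in range(1000)]
start = time.perf_counter()
hits = 0
for name in names:
    found = conn.execute(
        "SELECT geonames_id FROM cities WHERE name = ? AND country_code = ?",
        (name, "NL"),
    ).fetchone()
    hits += found is not None
elapsed = time.perf_counter() - start

print(f"{hits} lookups in {elapsed * 1000:.1f} ms "
      f"({elapsed / len(names) * 1000:.3f} ms each)")
```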
### Storage Requirements

| Scope | Database Size | Records |
|-------|---------------|---------|
| Netherlands only | ~5 MB | ~5,000 places |
| Western Europe | ~50 MB | ~50,000 places |
| Global (cities >15k pop) | ~200 MB | ~200,000 cities |
| Global (all places) | ~1.5 GB | ~11 million places |

**Decision**: Start with **Netherlands only** (5MB), expand as needed.

---
## Data Freshness & Maintenance

### Update Frequency

**GeoNames**: Updated daily (new cities, name changes, population updates)

**Our database**: Update **quarterly** (every 3 months)

**Rationale**:
- Heritage institutions rarely relocate
- City names are stable over years
- Population changes are irrelevant for GHCID
- Quarterly updates are sufficient for accuracy

### Update Process

```bash
#!/bin/bash
# scripts/update_geonames_db.sh
set -euo pipefail

# Download latest GeoNames data
wget -N http://download.geonames.org/export/dump/NL.zip
unzip -o NL.zip

# Rebuild database
python scripts/build_geonames_db.py \
    --input NL.txt \
    --output data/reference/geonames.db

# Validate
python scripts/validate_geonames_db.py

echo "✅ GeoNames database updated"
```

**Automation**: Run via cron job or GitHub Actions on the quarterly schedule above.

---
## Comparison: Before & After

### Before (UN/LOCODE)

**Coverage**:
- 50 Dutch cities (10.5%)
- 152/364 ISIL records with GHCIDs (41.8%)
- 212 records **cannot generate GHCID** (58.2%)

**Data source**:
- JSON file: `data/reference/nl_city_locodes.json`
- Manual curation required
- No automation for updates

**Limitations**:
- Cannot expand to global institutions
- Missing most small/medium cities
- No multilingual support

### After (GeoNames)

**Coverage**:
- 475+ Dutch cities (100%)
- Expected: 346+/364 ISIL records with GHCIDs (>95%)
- <20 records without GHCID (edge cases)

**Data source**:
- SQLite database: `data/reference/geonames.db`
- Automated build from GeoNames dumps
- Quarterly updates via script

**Benefits**:
- ✅ Ready for global expansion (139 conversation files)
- ✅ Covers all heritage institution locations
- ✅ Multilingual name support
- ✅ Coordinates for mapping
- ✅ Province/state codes included

---
## Alternatives Rejected

### 1. OpenStreetMap Nominatim

**Pros**: Free geocoding API, comprehensive
**Cons**: Rate limits on the public API (1 req/sec), intended for geocoding rather than lookup, offline use requires self-hosting a full OSM import

**Why rejected**: Same API limitations as the GeoNames API, less structured data

### 2. Google Maps Geocoding API

**Pros**: High quality, fast, reliable
**Cons**: **Costs money** ($5 per 1,000 requests), vendor lock-in, requires API key

**Why rejected**: Not free/open, unsustainable for large-scale extraction

### 3. Custom City Abbreviation Registry

**Pros**: Full control, optimized for heritage sector
**Cons**: **Months of manual curation**, ongoing maintenance burden, duplication of effort

**Why rejected**: Reinventing the wheel; GeoNames already exists and is maintained

### 4. Keep UN/LOCODE + Manual Additions

**Pros**: Minimal code changes
**Cons**: Doesn't solve scalability, still manual curation, not global-ready

**Why rejected**: Doesn't address the root problem and adds technical debt

---
## Risk Assessment

### Risk 1: GeoNames Service Discontinuation

**Likelihood**: Low (in operation since 2005, widely used)
**Impact**: Medium (need alternative source)
**Mitigation**:
- We use an **offline database**, so lookups do not depend on the API
- GeoNames data dumps are archived by multiple organizations
- Could switch to OpenStreetMap/Wikidata if needed

### Risk 2: City Name Ambiguity

**Scenario**: Multiple cities with the same name (e.g., "Portland" in USA, UK, Australia)

**Likelihood**: Medium (common in global dataset)
**Impact**: Medium (wrong GHCID if not disambiguated)
**Mitigation**:
- Always provide **country code** in lookups
- Use **population ranking** (larger city preferred)
- Validate with **province/state code** match
- Log warnings for ambiguous matches
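The mitigation steps for same-name cities within one country can be combined in a small selection function. This is a sketch under assumptions: `pick_city`, the logger name, and the candidate rows (placeholder populations, not GeoNames data) are illustrative. It prefers rows matching a province hint when one is given, then picks the largest population, and logs a warning whenever more than one candidate remained.

```python
import logging

logger = logging.getLogger("geonames_lookup")


def pick_city(candidates, admin1_code=None):
    """Choose one row among same-name matches within a single country.

    candidates: list of dicts with 'name', 'admin1_code', 'population'.
    """
    if not candidates:
        return None
    if admin1_code is not None:
        in_province = [c for c in candidates if c["admin1_code"] == admin1_code]
        if in_province:
            candidates = in_province
    best = max(candidates, key=lambda c: c["population"] or 0)
    if len(candidates) > 1:
        logger.warning(
            "Ambiguous match for %r: %d candidates, chose admin1=%s",
            best["name"], len(candidates), best["admin1_code"],
        )
    return best


# Placeholder rows for illustration only (not GeoNames data).
rows = [
    {"name": "Portland", "admin1_code": "OR", "population": 600000},
    {"name": "Portland", "admin1_code": "ME", "population": 60000},
]
print(pick_city(rows)["admin1_code"])
print(pick_city(rows, admin1_code="ME")["admin1_code"])
```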
### Risk 3: Database Corruption

**Likelihood**: Low (SQLite is very stable)
**Impact**: High (GHCIDs generated from a corrupt database may be wrong)
**Mitigation**:
- **Checksum validation** after build
- Versioned database files, e.g. `geonames-v2025.11.db`
- Keep backup copies
- Automated tests verify data integrity
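Checksum validation can be done with the standard library alone. The following sketch assumes a SHA-256 digest recorded at build time; `sha256_of` and `verify_db` are hypothetical helper names, and the demo runs on a throwaway database rather than the real `geonames.db`. It also runs SQLite's built-in `PRAGMA integrity_check` as a second line of defence.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large databases need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_db(path, expected_sha256):
    """True if the file hash matches and SQLite reports no corruption."""
    if sha256_of(path) != expected_sha256:
        return False
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()


# Demo on a throwaway database
tmp_dir = tempfile.mkdtemp()
db_path = str(Path(tmp_dir) / "geonames-demo.db")
demo = sqlite3.connect(db_path)
demo.execute("CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT)")
demo.commit()
demo.close()

recorded = sha256_of(db_path)  # would be stored alongside the build
print(verify_db(db_path, recorded))
```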
### Risk 4: Abbreviation Collisions

**Scenario**: Two different cities generate the same abbreviation (e.g., "Amsterdam" and "Amstelveen" → "AMS")

**Likelihood**: Medium (common with 3-letter codes)
**Impact**: Medium (institutions in different cities share a city code)
**Mitigation**:
- The **province code** in the GHCID separates collisions across provinces: `NL-NH-AMS` vs `NL-XX-AMS`
- Most collisions are expected to be in different provinces
- If same province (as with Amsterdam and Amstelveen, both in North Holland): the Wikidata Q-number resolves it (see `docs/plan/global_glam/07-ghcid-collision-resolution.md`)
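Same-province collisions can be listed ahead of time by grouping cities on (country, province, abbreviation). The sketch below is illustrative: `find_collisions` is a hypothetical helper, it uses a simplified letters-only abbreviation rule, and the three sample cities are hand-picked for the example. On this sample it reports one colliding group, Amstelveen and Amsterdam under `("NL", "NH", "AMS")`.

```python
from collections import defaultdict


def find_collisions(cities):
    """Group city names by (country, province, abbreviation); keep groups > 1.

    cities: iterable of (name, country_code, province_code) tuples.
    """
    groups = defaultdict(set)
    for name, country, province in cities:
        letters = "".join(c for c in name if c.isalpha())
        groups[(country, province, letters[:3].upper())].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


sample = [
    ("Amsterdam", "NL", "NH"),
    ("Amstelveen", "NL", "NH"),
    ("Rotterdam", "NL", "ZH"),
]
collisions = find_collisions(sample)
print(collisions)
```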
---

## Future Enhancements

### 1. Global Expansion (Priority: High)

**Goal**: Support 139 conversation files covering 60+ countries

**Tasks**:
- Download GeoNames global dataset (`allCountries.zip`)
- Build SQLite database for all countries (~200 MB if restricted to cities >15k population, ~1.5 GB for all places, per the storage estimates)
- Test with conversation JSON extraction
- Validate GHCID generation for international institutions

**Timeline**: After Phase 1 complete (Dutch institutions)

### 2. Multilingual City Name Matching (Priority: Medium)

**Goal**: Match cities by local names (e.g., "Den Haag" = "The Hague")

**Tasks**:
- Load GeoNames `alternateNamesV2.txt` into database
- Support lookup by any alternate name
- Return canonical name + GeoNames ID

**Use case**: Conversation JSONs mention cities in the local language
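A lookup by any alternate name could be a single query over `cities` and `alternate_names`. The sketch below uses reduced versions of those tables in memory; `find_by_any_name` is a hypothetical helper and the rows carry placeholder IDs, not real GeoNames identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (
        geonames_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country_code TEXT NOT NULL
    );
    CREATE TABLE alternate_names (
        alternate_name_id INTEGER PRIMARY KEY,
        geonames_id INTEGER NOT NULL REFERENCES cities(geonames_id),
        isolanguage TEXT,
        alternate_name TEXT NOT NULL
    );
    CREATE INDEX idx_altname_name ON alternate_names(alternate_name);
    -- Placeholder IDs for illustration; not real GeoNames identifiers.
    INSERT INTO cities VALUES (1, 'The Hague', 'NL');
    INSERT INTO alternate_names VALUES (10, 1, 'nl', 'Den Haag');
    INSERT INTO alternate_names VALUES (11, 1, 'nl', '''s-Gravenhage');
    """
)


def find_by_any_name(conn, name, country_code):
    """Return (geonames_id, canonical_name) for a primary or alternate name."""
    return conn.execute(
        """
        SELECT c.geonames_id, c.name FROM cities c
        WHERE c.country_code = ? AND (
            c.name = ?
            OR c.geonames_id IN (
                SELECT geonames_id FROM alternate_names WHERE alternate_name = ?
            )
        )
        LIMIT 1
        """,
        (country_code, name, name),
    ).fetchone()


print(find_by_any_name(conn, "Den Haag", "NL"))
```

Returning the canonical name alongside the ID keeps the abbreviation stable regardless of which language variant appeared in the source.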
### 3. Geocoding Integration (Priority: Medium)

**Goal**: Populate `location.latitude` and `location.longitude` fields

**Tasks**:
- Add `get_coordinates(city, country)` method
- Integrate with `HeritageCustodian.locations` list
- Support reverse geocoding (lat/lon → city name)

**Benefit**: Enable map visualizations, geographic analysis

### 4. Province/State Inference (Priority: Low)

**Goal**: Auto-detect province from city name (if not provided)

**Tasks**:
- Use GeoNames `admin1_code` field
- Map to ISO 3166-2 province codes
- Handle edge cases (disputed territories)

**Use case**: ISIL registry doesn't always specify province
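Because GeoNames admin1 keys such as `NL.07` are not themselves ISO 3166-2 codes, the inference step needs an explicit mapping. The sketch below is deliberately incomplete: `ADMIN1_TO_ISO` and `infer_province` are hypothetical names, and only the `NL.07` entry comes from this document. The remaining provinces would have to be filled in from `admin1CodesASCII.txt`.

```python
from typing import Optional

# GeoNames admin1 key -> ISO 3166-2 subdivision suffix.
# Only "NL.07" is taken from this document; other entries must be
# added after checking admin1CodesASCII.txt.
ADMIN1_TO_ISO = {
    "NL.07": "NH",  # North Holland
}


def infer_province(country_code: str, admin1_code: Optional[str]) -> Optional[str]:
    """Return the ISO 3166-2 suffix for a GeoNames admin1 code, or None."""
    if not admin1_code:
        return None
    return ADMIN1_TO_ISO.get(f"{country_code}.{admin1_code}")


print(infer_province("NL", "07"))
print(infer_province("NL", "99"))
```

Returning `None` for unmapped codes lets the caller log the gap instead of emitting a GHCID with a wrong province.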
### 5. City Similarity Matching (Priority: Low)

**Goal**: Handle typos, spelling variations

**Example**: "Amsterdm" → "Amsterdam" (confidence: 0.95)

**Tasks**:
- Implement fuzzy matching (Levenshtein distance)
- Suggest corrections for unmatched cities
- Confidence scoring for matches
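Fuzzy matching with Levenshtein distance needs no external dependency. The sketch below is one possible approach, not a settled design: `levenshtein` and `suggest` are hypothetical helpers, and the confidence formula (1 minus distance divided by the longer length) and the 0.8 threshold are assumptions. With this formula a one-letter typo in a nine-letter name scores lower than the illustrative 0.95 quoted in the example, so the scoring would need tuning.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def suggest(name, known, min_confidence=0.8):
    """Return (best_match, confidence) or None if nothing is close enough."""
    best = None
    for candidate in known:
        dist = levenshtein(name.lower(), candidate.lower())
        conf = 1 - dist / max(len(name), len(candidate), 1)
        if best is None or conf > best[1]:
            best = (candidate, conf)
    return best if best and best[1] >= min_confidence else None


print(suggest("Amsterdm", ["Amsterdam", "Rotterdam", "Utrecht"]))
```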
---

## Success Metrics

### Coverage Targets

| Metric | Before (UN/LOCODE) | Target (GeoNames) | Status |
|--------|-------------------|-------------------|--------|
| Dutch city coverage | 50 (10.5%) | 475 (100%) | Pending |
| GHCID generation rate (ISIL) | 152/364 (41.8%) | ≥346/364 (>95%) | Pending |
| GHCID generation rate (Dutch orgs) | Unknown | ≥1,284/1,351 (>95%) | Pending |
| Global city coverage | N/A | >200,000 cities | Future 🔮 |

### Performance Targets

| Metric | Target | Status |
|--------|--------|--------|
| City lookup latency | <1ms | Pending |
| Database load time | <10ms | Pending |
| Database size (NL-only) | <10MB | Pending |
| Test coverage | >90% | Pending |

### Quality Targets

| Metric | Target | Status |
|--------|--------|--------|
| City name accuracy | >99% | Pending |
| Province code accuracy | >95% | Pending |
| GeoNames ID linkage | >90% | Pending |
| Zero database corruption | 100% | Pending |

---
## References

### External Resources

- **GeoNames Official**: https://www.geonames.org
- **GeoNames Downloads**: http://download.geonames.org/export/dump/
- **GeoNames Documentation**: https://www.geonames.org/export/
- **GeoNames Web Services**: https://www.geonames.org/export/web-services.html
- **ISO 3166-2 (Provinces)**: https://en.wikipedia.org/wiki/ISO_3166-2

### Internal Documentation

- **GHCID Specification**: `docs/plan/global_glam/06-global-identifier-system.md`
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **Architecture**: `docs/plan/global_glam/02-architecture.md`
- **Schema**: `schemas/heritage_custodian.yaml`

### Implementation Files

- **GHCID Generator**: `src/glam_extractor/identifiers/ghcid.py`
- **Lookups**: `src/glam_extractor/identifiers/lookups.py`
- **GeoNames Client** (to be created): `src/glam_extractor/geocoding/geonames_lookup.py`
- **Database** (to be created): `data/reference/geonames.db`
- **Build Script** (to be created): `scripts/build_geonames_db.py`

---
## Approval & Sign-off

**Decision Made**: 2025-11-05
**Approved By**: GLAM Data Extraction Project Team
**Implementation Owner**: GeoNames Integration Team
**Review Date**: 2025-12-05 (1 month after the decision)

**Status**: ✅ **Design Approved - Ready for Implementation**

---

**Last Updated**: 2025-11-05
**Version**: 1.0
**Next Review**: After Phase 1 implementation complete