GeoNames Settlement Standardization Rules
Version: 1.1.0
Effective Date: 2025-12-01
Last Updated: 2025-12-01
Purpose
This document defines the rules for standardizing settlement names in GHCID (Global Heritage Custodian Identifier) generation using the GeoNames geographical database.
Core Principle
ALL settlement names in GHCID must be derived from GeoNames standardized names, not from source data.
The GeoNames database serves as the single source of truth for:
- Settlement names (cities, towns, villages)
- Settlement abbreviations/codes
- Administrative region codes (admin1)
- Geographic coordinate validation
Why GeoNames Standardization?
- Consistency: Same settlement = same GHCID component, regardless of source data variations
- Disambiguation: Handles duplicate city names across regions
- Internationalization: Provides ASCII-safe names for identifiers
- Authority: GeoNames is a well-maintained geographic database released under a Creative Commons Attribution (CC BY 4.0) license
- Persistence: Settlement names don't change frequently, ensuring GHCID stability
🚨 CRITICAL: Feature Code Filtering
NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).
GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.
ALLOWED Feature Codes
| Code | Description | Example |
|---|---|---|
| PPL | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| PPLA | Seat of first-order admin division | Provincial capitals |
| PPLA2 | Seat of second-order admin division | Municipal seats |
| PPLA3 | Seat of third-order admin division | District seats |
| PPLA4 | Seat of fourth-order admin division | Sub-district seats |
| PPLC | Capital of a political entity | Amsterdam, Brussels |
| PPLS | Populated places (multiple) | Settlement clusters |
| PPLG | Seat of government | The Hague |
EXCLUDED Feature Codes
| Code | Description | Why Excluded |
|---|---|---|
| PPLX | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |
Implementation
VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
query = """
SELECT name, feature_code, geonames_id, ...
FROM cities
WHERE country_code = ?
AND feature_code IN (?, ?, ?, ?, ?, ?, ?, ?)
ORDER BY distance_sq
LIMIT 1
"""
cursor.execute(query, (country_code, *VALID_FEATURE_CODES))
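The query above elides how distance_sq is computed. The following is a minimal, self-contained sketch of one way to run the filtered lookup end to end, not necessarily what the project script does. It assumes a reduced version of the cities table and uses squared Euclidean distance in degrees as a rough nearest-neighbour metric (this distorts east-west distances at Dutch latitudes). The function name nearest_settlement, the sample rows, and their IDs and coordinates are hypothetical.

```python
import sqlite3

VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')


def nearest_settlement(conn, latitude, longitude, country_code):
    """Return the nearest allowed settlement as a dict, or None if there is none."""
    placeholders = ', '.join('?' for _ in VALID_FEATURE_CODES)
    query = f"""
        SELECT name, feature_code, geonames_id,
               (latitude - ?) * (latitude - ?)
             + (longitude - ?) * (longitude - ?) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (latitude, latitude, longitude, longitude,
              country_code, *VALID_FEATURE_CODES)
    row = conn.execute(query, params).fetchone()
    if row is None:
        return None
    return {'name': row[0], 'feature_code': row[1], 'geonames_id': row[2]}


# Hypothetical sample data: a settlement and a neighbourhood that is
# closer to the query point but must be ignored because it is PPLX.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, latitude REAL, longitude REAL, feature_code TEXT)"
)
conn.executemany("INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 'Apeldoorn', 'NL', 52.21, 5.97, 'PPL'),
    (2, 'Binnenstad', 'NL', 52.2101, 5.9701, 'PPLX'),
])
result = nearest_settlement(conn, 52.2101, 5.9701, 'NL')
```

Building the placeholder list from VALID_FEATURE_CODES keeps the query in sync if the allowed set ever changes.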
Verification
Always check feature_code in location_resolution metadata:
location_resolution:
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
If you see feature_code: PPLX, the GHCID is WRONG and must be regenerated.
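This check can be automated. A minimal sketch, assuming each entry is a dict that carries the location_resolution metadata shown above; the function name has_valid_feature_code is hypothetical.

```python
ALLOWED_FEATURE_CODES = frozenset(
    ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
)


def has_valid_feature_code(entry: dict) -> bool:
    """True if the entry's resolved settlement has an allowed feature code."""
    resolution = entry.get('location_resolution') or {}
    return resolution.get('feature_code') in ALLOWED_FEATURE_CODES


good = {'location_resolution': {'geonames_name': 'Apeldoorn', 'feature_code': 'PPL'}}
bad = {'location_resolution': {'geonames_name': 'Binnenstad', 'feature_code': 'PPLX'}}
```

Entries for which this returns False (including entries with no feature_code at all) are candidates for GHCID regeneration.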
🚨 CRITICAL: Country Code Detection
Determine country code from entry data BEFORE calling GeoNames reverse geocoding.
GeoNames queries are country-specific. Using the wrong country code will return incorrect results.
Country Code Resolution Priority
1. zcbs_enrichment.country - Most explicit source
2. location.country - Direct location field
3. locations[].country - Array location field
4. original_entry.country - CSV source field
5. google_maps_enrichment.address - Parse from address string
6. wikidata_enrichment.located_in.label - Infer from Wikidata
7. Default: "NL" (Netherlands) - Only if no other source
Example
# Determine country code FIRST
country_code = "NL"  # Default
if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address:
        country_code = "BE"
    elif ', Germany' in address:
        country_code = "DE"

# THEN call reverse geocoding
result = reverse_geocode_to_city(latitude, longitude, country_code)
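The example above covers only two of the sources in the priority list. A fuller sketch that walks all of them in order might look like the following. It assumes the country fields already hold ISO 3166-1 alpha-2 codes, that the Wikidata label is a plain English country name, and that a small name-to-code table is sufficient; the function detect_country_code, the helper _dig, and the table COUNTRY_NAME_TO_CODE are hypothetical.

```python
# Hypothetical, deliberately incomplete lookup table.
COUNTRY_NAME_TO_CODE = {'Netherlands': 'NL', 'Belgium': 'BE', 'Germany': 'DE'}


def _dig(node, *path):
    """Follow nested dict keys, returning None as soon as a level is missing."""
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def detect_country_code(entry: dict, default: str = "NL") -> str:
    # Sources 1-4: explicit country fields, in priority order.
    candidates = [
        _dig(entry, 'zcbs_enrichment', 'country'),
        _dig(entry, 'location', 'country'),
    ]
    for location in entry.get('locations') or []:
        if isinstance(location, dict):
            candidates.append(location.get('country'))
    candidates.append(_dig(entry, 'original_entry', 'country'))
    for value in candidates:
        if value:
            return value

    # Source 5: parse the country name out of the address string.
    address = _dig(entry, 'google_maps_enrichment', 'address') or ''
    for name, code in COUNTRY_NAME_TO_CODE.items():
        if f', {name}' in address:
            return code

    # Source 6: infer from the Wikidata label (assumed to be a country name).
    label = _dig(entry, 'wikidata_enrichment', 'located_in', 'label')
    if label in COUNTRY_NAME_TO_CODE:
        return COUNTRY_NAME_TO_CODE[label]

    # Source 7: fall back to the default only when nothing else matched.
    return default
```

The _dig helper also tolerates enrichment keys that are present but set to None, which a chained .get() with a {} default does not.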
Settlement Resolution Process
Step 1: Coordinate-Based Resolution (Preferred)
When coordinates are available, use reverse geocoding to find the nearest GeoNames settlement:
def resolve_settlement_from_coordinates(latitude: float, longitude: float, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement nearest to given coordinates.

    Returns:
        {
            'settlement_name': 'Lelystad',  # GeoNames standardized name
            'settlement_code': 'LEL',       # 3-letter abbreviation
            'admin1_code': '16',            # GeoNames admin1 code
            'region_code': 'FL',            # ISO 3166-2 region code
            'geonames_id': 2751792,         # GeoNames ID for provenance
            'distance_km': 0.5              # Distance from coords to settlement center
        }
    """
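The distance_km field in the return value needs a great-circle distance between the source coordinates and the settlement center. One common way to compute it is the haversine formula, sketched below; the function name haversine_km is hypothetical, and whether the project uses this formula or a different one is not stated in this document.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

A large distance_km is a useful signal that the nearest allowed settlement is far from the source coordinates and the entry may deserve review.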
Step 2: Name-Based Resolution (Fallback)
When only a settlement name is available (no coordinates), look up in GeoNames:
def resolve_settlement_from_name(name: str, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement matching the given name.
    Uses fuzzy matching and disambiguation when multiple matches exist.
    """
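The fuzzy-matching step is not specified further here. As a minimal sketch, the standard library's difflib can match a source name against the ascii_name values of candidate rows after stripping diacritics and case. The function names normalize and match_settlement_name, the 0.8 cutoff, and the sample rows are assumptions for illustration.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Strip diacritics, case, and surrounding whitespace for comparison."""
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


def match_settlement_name(name: str, candidates: list, cutoff: float = 0.8) -> list:
    """Return the candidate rows whose ascii_name is close to the given name."""
    by_key = {}
    for row in candidates:
        by_key.setdefault(normalize(row['ascii_name']), []).append(row)
    keys = difflib.get_close_matches(normalize(name), list(by_key), n=5, cutoff=cutoff)
    return [row for key in keys for row in by_key[key]]


# Hypothetical candidate rows, as they might be loaded for one country.
rows = [{'ascii_name': 'Lelystad'}, {'ascii_name': 'Hamont'}, {'ascii_name': 'Leiden'}]
typo_matches = match_settlement_name('Lelystadt', rows)
accent_matches = match_settlement_name('Hamónt', rows)
```

If more than one row comes back, the result still has to go through disambiguation; if none comes back, the entry falls through to manual resolution.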
Step 3: Manual Resolution (Last Resort)
If GeoNames lookup fails, flag the entry for manual review with:
- settlement_source: MANUAL
- settlement_needs_review: true
GHCID Settlement Component Rules
Format
The settlement component in GHCID uses a 3-letter uppercase code:
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames
Code Generation Rules
- Single-word settlements: First 3 letters uppercase
  - Amsterdam → AMS
  - Rotterdam → ROT
  - Lelystad → LEL
- Settlements with Dutch articles (de, het, den, 's): First letter of article + first 2 letters of main word
  - Den Haag → DHA
  - 's-Hertogenbosch → SHE
  - De Bilt → DBI
- Multi-word settlements (no article): First letter of each word (up to 3). As the two-word examples show, a name with fewer than three words is padded to 3 letters with the next letters of its last word
  - Nieuw Amsterdam → NAM
  - Oud Beijerland → OBE
- GeoNames Disambiguation Database:
  - For known problematic settlements, use pre-defined codes from disambiguation table
  - Example: Both Zwolle (OV) and Zwolle (LI) exist - use ZWO with region for uniqueness
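The first three rules can be sketched as a single function. This is an illustration under stated assumptions, not the project's implementation: the name generate_settlement_code is hypothetical, words are split on spaces and hyphens, two-word names without an article are padded from the last word (which is what the NAM and OBE examples imply), and the disambiguation table lookup is left out. Names shorter than three letters yield a shorter code, since the rules do not say how to pad them.

```python
import re
import unicodedata

DUTCH_ARTICLES = {"de", "het", "den", "'s"}


def to_ascii(name: str) -> str:
    """Drop diacritics and any other non-ASCII characters."""
    decomposed = unicodedata.normalize('NFKD', name)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def generate_settlement_code(name: str) -> str:
    raw_words = [w for w in re.split(r"[\s-]+", to_ascii(name).strip()) if w]
    # Pair each lowercased raw word with its letters-only form.
    pairs = [(raw.lower(), re.sub(r"[^A-Za-z]", "", raw)) for raw in raw_words]
    pairs = [(raw, letters) for raw, letters in pairs if letters]
    if not pairs:
        return "XXX"  # placeholder for unknown settlements
    if len(pairs) == 1:
        # Rule 1: first three letters of a single-word name.
        code = pairs[0][1][:3]
    elif pairs[0][0] in DUTCH_ARTICLES:
        # Rule 2: first letter of the article + first two of the main word.
        code = pairs[0][1][0] + pairs[1][1][:2]
    else:
        # Rule 3: initials of up to three words, padded from the last word used.
        used = pairs[:3]
        initials = ''.join(letters[0] for _, letters in used)
        code = (initials + used[-1][1][1:])[:3]
    return code.upper()


examples = {n: generate_settlement_code(n) for n in
            ["Amsterdam", "Den Haag", "'s-Hertogenbosch", "Nieuw Amsterdam"]}
```

In a real pipeline the disambiguation table would be consulted first, and this function would only run when no pre-defined code exists for the geonames_id.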
Measurement Point for Historical Custodians
Rule: For heritage custodians that no longer exist or have historical coordinates, the modern-day settlement (as of 2025-12-01) is used.
Rationale:
- GHCIDs should be stable over time
- Historical place names may have changed
- Modern settlements are easier to verify and look up
- GeoNames reflects current geographic reality
Example:
- A museum that operated 1900-1950 in what was then "Nieuw Land" (before Flevoland province existed)
- Modern coordinates fall within Lelystad municipality
- GHCID uses LEL (Lelystad) as the settlement code, not the historical name
GeoNames Database Integration
Database Location
/data/reference/geonames.db
Required Tables
-- Cities/settlements table
CREATE TABLE cities (
geonames_id INTEGER PRIMARY KEY,
name TEXT, -- Local name (may have diacritics)
ascii_name TEXT, -- ASCII-safe name for identifiers
country_code TEXT, -- ISO 3166-1 alpha-2
admin1_code TEXT, -- First-level administrative division
admin1_name TEXT, -- Region/province name
latitude REAL,
longitude REAL,
population INTEGER,
feature_code TEXT -- PPL, PPLA, PPLC, etc.
);
-- Disambiguation table for problematic settlements
CREATE TABLE settlement_codes (
geonames_id INTEGER PRIMARY KEY,
country_code TEXT,
settlement_code TEXT, -- 3-letter code
is_primary BOOLEAN, -- Primary code for this settlement
notes TEXT
);
Admin1 Code Mapping (Netherlands)
IMPORTANT: GeoNames admin1 codes differ from historical numbering. Use this mapping:
| GeoNames admin1 | Province | ISO 3166-2 |
|---|---|---|
| 01 | Drenthe | NL-DR |
| 02 | Friesland | NL-FR |
| 03 | Gelderland | NL-GE |
| 04 | Groningen | NL-GR |
| 05 | Limburg | NL-LI |
| 06 | Noord-Brabant | NL-NB |
| 07 | Noord-Holland | NL-NH |
| 09 | Utrecht | NL-UT |
| 10 | Zeeland | NL-ZE |
| 11 | Zuid-Holland | NL-ZH |
| 15 | Overijssel | NL-OV |
| 16 | Flevoland | NL-FL |
Note: Code 08 is not used in the Netherlands (it was assigned to a former region).
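In code, the table above can be expressed as a lookup from GeoNames admin1 code to ISO 3166-2 code. The sketch below is one possible shape; the names NL_ADMIN1_TO_ISO and region_code_for_admin1 are hypothetical. Keys are kept as two-character strings because the leading zero is significant, and integer input is zero-padded defensively.

```python
NL_ADMIN1_TO_ISO = {
    "01": "NL-DR", "02": "NL-FR", "03": "NL-GE", "04": "NL-GR",
    "05": "NL-LI", "06": "NL-NB", "07": "NL-NH", "09": "NL-UT",
    "10": "NL-ZE", "11": "NL-ZH", "15": "NL-OV", "16": "NL-FL",
}


def region_code_for_admin1(admin1_code):
    """Return the 2-letter region code (e.g. 'FL'), or None if unmapped."""
    key = str(admin1_code).zfill(2)
    iso = NL_ADMIN1_TO_ISO.get(key)
    return iso.split("-", 1)[1] if iso else None


flevoland = region_code_for_admin1("16")
```

Returning None for an unmapped code such as 08 lets the caller flag the entry for review instead of producing a wrong region.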
Validation Requirements
Before GHCID Generation
Every entry MUST have:
- Settlement name resolved via GeoNames
- geonames_id recorded in entry metadata
- Settlement code (3-letter) generated consistently
- Admin1/region code mapped correctly
Provenance Tracking
Record GeoNames resolution in entry metadata:
location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
  source_coordinates:
    latitude: 52.52111
    longitude: 5.43722
  distance_to_settlement_km: 0.5
Error Handling
No GeoNames Match
If a settlement cannot be resolved:
- Log warning with entry details
- Set settlement_code: XXX (unknown)
- Set settlement_needs_review: true
- Do NOT skip the entry - generate GHCID with XXX placeholder
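A minimal sketch of this fallback follows. Where the two fields live in the entry is an assumption: here both are written into the location_resolution block, and the function name mark_unresolved and the entry's name key are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def mark_unresolved(entry: dict) -> dict:
    """Flag an entry whose settlement could not be resolved, without dropping it."""
    logger.warning("No GeoNames match for entry: %r", entry.get('name'))
    resolution = entry.setdefault('location_resolution', {})
    resolution['settlement_code'] = 'XXX'          # unknown-settlement placeholder
    resolution['settlement_needs_review'] = True   # surfaces the entry for manual review
    return entry


unresolved = mark_unresolved({'name': 'Example Museum'})
```

The entry is returned rather than discarded, so GHCID generation can continue with the XXX placeholder.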
Multiple GeoNames Matches
When multiple settlements match a name:
- Use coordinates to disambiguate (if available)
- Use admin1/region context (if available)
- Use population as tiebreaker (prefer larger settlement)
- Flag for manual review if still ambiguous
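The tiebreak order above can be sketched as follows, assuming candidate rows are dicts with the latitude, longitude, admin1_code, and population columns of the cities table. The function name disambiguate and the sample rows (including the place name, populations, and coordinates) are hypothetical placeholders.

```python
def disambiguate(candidates, coords=None, admin1_code=None):
    """Pick one settlement from same-named candidates.

    Returns (row_or_None, needs_review).
    """
    if not candidates:
        return None, True
    # 1. Coordinates: take the candidate closest to the source point.
    if coords is not None:
        lat, lon = coords
        best = min(candidates,
                   key=lambda r: (r['latitude'] - lat) ** 2 + (r['longitude'] - lon) ** 2)
        return best, False
    # 2. Region context: narrow the pool to the expected admin1 region.
    pool = candidates
    if admin1_code is not None:
        in_region = [r for r in candidates if r['admin1_code'] == admin1_code]
        if len(in_region) == 1:
            return in_region[0], False
        if in_region:
            pool = in_region
    # 3. Population: prefer the larger settlement.
    ranked = sorted(pool, key=lambda r: r.get('population') or 0, reverse=True)
    if len(ranked) == 1:
        return ranked[0], False
    if (ranked[0].get('population') or 0) > (ranked[1].get('population') or 0):
        return ranked[0], False
    # 4. Still ambiguous: return a guess but flag it for manual review.
    return ranked[0], True


# Hypothetical same-named settlements in two provinces.
rows = [
    {'name': 'Voorbeelddorp', 'admin1_code': '15', 'population': 5000,
     'latitude': 52.5, 'longitude': 6.1},
    {'name': 'Voorbeelddorp', 'admin1_code': '05', 'population': 300,
     'latitude': 51.3, 'longitude': 6.0},
]
by_population, flagged = disambiguate(rows)
```

Returning the needs_review flag alongside the row keeps the "flag for manual review" step explicit instead of silently picking a winner.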
Coordinates Outside Country
If coordinates fall outside the expected country:
- Log warning
- Use nearest settlement within country
- Flag for manual review
Related Documentation
- AGENTS.md - Section on GHCID generation
- docs/PERSISTENT_IDENTIFIERS.md - Complete GHCID specification
- docs/GHCID_PID_SCHEME.md - PID scheme details
- scripts/enrich_nde_entries_ghcid.py - Implementation
Changelog
v1.1.0 (2025-12-01)
- CRITICAL: Added feature code filtering rules
- MUST filter for PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, PPLG
- MUST exclude PPLX (neighborhoods/districts)
- Example: Apeldoorn (PPL) not "Binnenstad" (PPLX)
- CRITICAL: Added country code detection rules
- Must determine country from entry data BEFORE reverse geocoding
- Priority: zcbs_enrichment.country > location.country > address parsing
- Example: Belgian institutions use BE, not NL
- Added Belgium admin1 code mapping (BRU, VLG, WAL)
v1.0.0 (2025-12-01)
- Initial version
- Established GeoNames as authoritative source for settlement standardization
- Defined measurement point rule for historical custodians
- Documented admin1 code mapping for Netherlands