# GeoNames Settlement Standardization Rules

**Version**: 1.1.0
**Effective Date**: 2025-12-01
**Last Updated**: 2025-12-01

## Purpose

This document defines the rules for standardizing settlement names in GHCID (Global Heritage Custodian Identifier) generation using the GeoNames geographical database.

## Core Principle

**ALL settlement names in GHCID must be derived from GeoNames standardized names, not from source data.**

The GeoNames database serves as the **single source of truth** for:

- Settlement names (cities, towns, villages)
- Settlement abbreviations/codes
- Administrative region codes (admin1)
- Geographic coordinates validation

## Why GeoNames Standardization?

1. **Consistency**: Same settlement = same GHCID component, regardless of source data variations
2. **Disambiguation**: Handles duplicate city names across regions
3. **Internationalization**: Provides ASCII-safe names for identifiers
4. **Authority**: GeoNames is a well-maintained, CC-licensed geographic database
5. **Persistence**: Settlement names don't change frequently, ensuring GHCID stability

## 🚨 CRITICAL: Feature Code Filtering

**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.
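A minimal, self-contained sketch of why this filter matters. It assumes a SQLite `cities` table with the columns used below and uses a squared-degree distance as a rough stand-in for real distance. The function name `nearest_settlement` and the sample rows (including their IDs and coordinates) are hypothetical and serve only as illustration:

```python
import sqlite3

# Allowed feature codes per this document; PPLX is deliberately absent.
VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')


def nearest_settlement(conn, latitude, longitude, country_code):
    """Return (name, feature_code, geonames_id) of the nearest allowed settlement, or None."""
    placeholders = ", ".join("?" for _ in VALID_FEATURE_CODES)
    # Assumption: squared difference in degrees is good enough to rank nearby rows.
    query = f"""
        SELECT name, feature_code, geonames_id,
               (latitude - ?) * (latitude - ?)
               + (longitude - ?) * (longitude - ?) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (latitude, latitude, longitude, longitude, country_code, *VALID_FEATURE_CODES)
    row = conn.execute(query, params).fetchone()
    return row[:3] if row else None


# Toy in-memory data: hypothetical IDs and approximate coordinates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, latitude REAL, longitude REAL, feature_code TEXT)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Apeldoorn", "NL", 52.21, 5.97, "PPL"),
        (2, "Binnenstad", "NL", 52.215, 5.965, "PPLX"),  # neighborhood, must be skipped
    ],
)

# The query point sits exactly on the PPLX row, yet the PPL row is returned.
result = nearest_settlement(conn, 52.215, 5.965, "NL")
```

Without the `feature_code IN (...)` clause, the same query would pick the neighborhood row, which is the failure mode this rule guards against.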
### ALLOWED Feature Codes

| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague |

### EXCLUDED Feature Codes

| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |

### Implementation

```python
VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')

query = """
    SELECT name, feature_code, geonames_id, ...
    FROM cities
    WHERE country_code = ?
      AND feature_code IN (?, ?, ?, ?, ?, ?, ?, ?)
    ORDER BY distance_sq
    LIMIT 1
"""
cursor.execute(query, (country_code, *VALID_FEATURE_CODES))
```

### Verification

Always check `feature_code` in location_resolution metadata:

```yaml
location_resolution:
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
```

**If you see `feature_code: PPLX`**, the GHCID is WRONG and must be regenerated.

## 🚨 CRITICAL: Country Code Detection

**Determine country code from entry data BEFORE calling GeoNames reverse geocoding.**

GeoNames queries are country-specific. Using the wrong country code will return incorrect results.

### Country Code Resolution Priority

1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse from address string
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata
7. Default: `"NL"` (Netherlands) - Only if no other source

### Example

```python
# Determine country code FIRST
country_code = "NL"  # Default

if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address:
        country_code = "BE"
    elif ', Germany' in address:
        country_code = "DE"

# THEN call reverse geocoding
result = reverse_geocode_to_city(latitude, longitude, country_code)
```

## Settlement Resolution Process

### Step 1: Coordinate-Based Resolution (Preferred)

When coordinates are available, use reverse geocoding to find the nearest GeoNames settlement:

```python
def resolve_settlement_from_coordinates(latitude: float, longitude: float, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement nearest to given coordinates.

    Returns:
        {
            'settlement_name': 'Lelystad',  # GeoNames standardized name
            'settlement_code': 'LEL',       # 3-letter abbreviation
            'admin1_code': '16',            # GeoNames admin1 code
            'region_code': 'FL',            # ISO 3166-2 region code
            'geonames_id': 2751792,         # GeoNames ID for provenance
            'distance_km': 0.5              # Distance from coords to settlement center
        }
    """
```

### Step 2: Name-Based Resolution (Fallback)

When only a settlement name is available (no coordinates), look up in GeoNames:

```python
def resolve_settlement_from_name(name: str, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement matching the given name.

    Uses fuzzy matching and disambiguation when multiple matches exist.
    """
```

### Step 3: Manual Resolution (Last Resort)

If GeoNames lookup fails, flag the entry for manual review with:

- `settlement_source: MANUAL`
- `settlement_needs_review: true`

## GHCID Settlement Component Rules

### Format

The settlement component in GHCID uses a **3-letter uppercase code**:

```
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames
```

### Code Generation Rules

1. **Single-word settlements**: First 3 letters uppercase
   - `Amsterdam` → `AMS`
   - `Rotterdam` → `ROT`
   - `Lelystad` → `LEL`

2. **Settlements with Dutch articles** (`de`, `het`, `den`, `'s`):
   - First letter of article + first 2 letters of main word
   - `Den Haag` → `DHA`
   - `'s-Hertogenbosch` → `SHE`
   - `De Bilt` → `DBI`

3. **Multi-word settlements** (no article):
   - First letter of each word (up to 3); for two-word names, as the examples show, the code is completed with the second letter of the last word
   - `Nieuw Amsterdam` → `NAM`
   - `Oud Beijerland` → `OBE`

4. **GeoNames Disambiguation Database**:
   - For known problematic settlements, use pre-defined codes from disambiguation table
   - Example: Both `Zwolle` (OV) and `Zwolle` (LI) exist - use `ZWO` with region for uniqueness

### Measurement Point for Historical Custodians

**Rule**: For heritage custodians that no longer exist or have historical coordinates, the **modern-day settlement** (as of 2025-12-01) is used.

Rationale:

- GHCIDs should be stable over time
- Historical place names may have changed
- Modern settlements are easier to verify and look up
- GeoNames reflects current geographic reality

Example:

- A museum that operated 1900-1950 in what was then "Nieuw Land" (before Flevoland province existed)
- Modern coordinates fall within Lelystad municipality
- GHCID uses `LEL` (Lelystad) as settlement code, not historical name

## GeoNames Database Integration

### Database Location

```
/data/reference/geonames.db
```

### Required Tables

```sql
-- Cities/settlements table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT,          -- Local name (may have diacritics)
    ascii_name TEXT,    -- ASCII-safe name for identifiers
    country_code TEXT,  -- ISO 3166-1 alpha-2
    admin1_code TEXT,   -- First-level administrative division
    admin1_name TEXT,   -- Region/province name
    latitude REAL,
    longitude REAL,
    population INTEGER,
    feature_code TEXT   -- PPL, PPLA, PPLC, etc.
);

-- Disambiguation table for problematic settlements
CREATE TABLE settlement_codes (
    geonames_id INTEGER PRIMARY KEY,
    country_code TEXT,
    settlement_code TEXT,  -- 3-letter code
    is_primary BOOLEAN,    -- Primary code for this settlement
    notes TEXT
);
```

### Admin1 Code Mapping (Netherlands)

**IMPORTANT**: GeoNames admin1 codes differ from historical numbering. Use this mapping:

| GeoNames admin1 | Province | ISO 3166-2 |
|-----------------|----------|------------|
| 01 | Drenthe | NL-DR |
| 02 | Friesland | NL-FR |
| 03 | Gelderland | NL-GE |
| 04 | Groningen | NL-GR |
| 05 | Limburg | NL-LI |
| 06 | Noord-Brabant | NL-NB |
| 07 | Noord-Holland | NL-NH |
| 09 | Utrecht | NL-UT |
| 10 | Zeeland | NL-ZE |
| 11 | Zuid-Holland | NL-ZH |
| 15 | Overijssel | NL-OV |
| 16 | Flevoland | NL-FL |

**Note**: Code 08 is not used in Netherlands (was assigned to former region).
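
The mapping above could be encoded as a simple lookup. This is a sketch only: the names `NL_ADMIN1_TO_REGION` and `admin1_to_region_code` are hypothetical, and falling back to the `XX` unknown-region placeholder and zero-padding unpadded input are assumptions rather than documented behavior.

```python
# GeoNames admin1 code -> ISO 3166-2 subdivision suffix for the Netherlands,
# copied from the mapping table. Code 08 is intentionally missing.
NL_ADMIN1_TO_REGION = {
    "01": "DR",  # Drenthe
    "02": "FR",  # Friesland
    "03": "GE",  # Gelderland
    "04": "GR",  # Groningen
    "05": "LI",  # Limburg
    "06": "NB",  # Noord-Brabant
    "07": "NH",  # Noord-Holland
    "09": "UT",  # Utrecht
    "10": "ZE",  # Zeeland
    "11": "ZH",  # Zuid-Holland
    "15": "OV",  # Overijssel
    "16": "FL",  # Flevoland
}


def admin1_to_region_code(admin1_code: str, country_code: str = "NL") -> str:
    """Return the two-letter region code, or 'XX' when no mapping is known."""
    if country_code != "NL":
        # Only the Netherlands table is covered in this sketch.
        return "XX"
    # Assumption: tolerate unpadded input such as "1" by left-padding to two digits.
    return NL_ADMIN1_TO_REGION.get(str(admin1_code).zfill(2), "XX")
```

For example, `admin1_to_region_code("16")` yields `FL`, matching the Lelystad examples in this document, while the unused code `08` falls through to `XX` and should be flagged for review.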

## Validation Requirements

### Before GHCID Generation

Every entry MUST have:

- [ ] Settlement name resolved via GeoNames
- [ ] `geonames_id` recorded in entry metadata
- [ ] Settlement code (3-letter) generated consistently
- [ ] Admin1/region code mapped correctly

### Provenance Tracking

Record GeoNames resolution in entry metadata:

```yaml
location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
  source_coordinates:
    latitude: 52.52111
    longitude: 5.43722
  distance_to_settlement_km: 0.5
```

## 🚨 CRITICAL: XXX Placeholders Are TEMPORARY - Research Required 🚨

**XXX placeholders for region/settlement codes are NEVER acceptable as a final state.**

When an entry has `XX` (unknown region) or `XXX` (unknown settlement), the agent MUST conduct research to resolve the location.

### Resolution Strategy by Institution Type

| Institution Type | Location Resolution Method |
|------------------|---------------------------|
| **Destroyed institution** | Use last known physical location before destruction |
| **Historical (closed)** | Use last operating location |
| **Refugee/diaspora org** | Use current headquarters OR original founding location |
| **Digital-only platform** | Use parent/founding organization's headquarters |
| **Decentralized initiative** | Use founding location or primary organizer location |
| **Unknown city, known country** | Research via Wikidata, Google Maps, official website |

### Research Sources (Priority Order)

1. **Wikidata** - P131 (located in), P159 (headquarters location), P625 (coordinates)
2. **Google Maps** - Search institution name
3. **Official Website** - Contact page, about page
4. **Web Archive** - archive.org for destroyed/closed institutions
5. **Academic Sources** - Papers, reports
6. **News Articles** - Particularly for destroyed heritage sites

### Location Resolution Metadata

When resolving XXX placeholders, update `location_resolution`:

```yaml
location_resolution:
  method: MANUAL_RESEARCH  # Previously was NAME_LOOKUP with XXX
  country_code: PS
  region_code: GZ
  region_name: Gaza Strip
  city_code: GAZ
  city_name: Gaza City
  geonames_id: 281133
  research_date: "2025-12-06T00:00:00Z"
  research_sources:
    - type: wikidata
      id: Q123456
      claim: P131
    - type: web_archive
      url: https://web.archive.org/web/20231001/https://institution-website.org/contact
  notes: "Located in Gaza City prior to destruction in 2024"
```

### File Renaming After Resolution

When GHCID changes due to XXX resolution, the file MUST be renamed:

```bash
# Before
data/custodian/PS-XX-XXX-A-NAPR.yaml

# After
data/custodian/PS-GZ-GAZ-A-NAPR.yaml
```

### Prohibited Practices

- ❌ Leaving XXX placeholders in production data
- ❌ Using "Online" or country name as location
- ❌ Skipping research because it's difficult
- ❌ Using XX/XXX for diaspora organizations

## Error Handling

### No GeoNames Match

If a settlement cannot be resolved via automated lookup:

1. Log warning with entry details
2. Set `settlement_code: XXX` (temporary placeholder)
3. Set `settlement_needs_review: true`
4. Do NOT skip the entry - generate GHCID with XXX placeholder
5. **IMMEDIATELY** begin manual research to resolve

### Multiple GeoNames Matches

When multiple settlements match a name:

1. Use coordinates to disambiguate (if available)
2. Use admin1/region context (if available)
3. Use population as tiebreaker (prefer larger settlement)
4. Flag for manual review if still ambiguous

### Coordinates Outside Country

If coordinates fall outside the expected country:

1. Log warning
2. Use nearest settlement within country
3. Flag for manual review

## Related Documentation

- `AGENTS.md` - Section on GHCID generation
- `docs/PERSISTENT_IDENTIFIERS.md` - Complete GHCID specification
- `docs/GHCID_PID_SCHEME.md` - PID scheme details
- `scripts/enrich_nde_entries_ghcid.py` - Implementation

## Changelog

### v1.1.0 (2025-12-01)

- **CRITICAL**: Added feature code filtering rules
  - MUST filter for PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, PPLG
  - MUST exclude PPLX (neighborhoods/districts)
  - Example: Apeldoorn (PPL) not "Binnenstad" (PPLX)
- **CRITICAL**: Added country code detection rules
  - Must determine country from entry data BEFORE reverse geocoding
  - Priority: zcbs_enrichment.country > location.country > address parsing
  - Example: Belgian institutions use BE, not NL
- Added Belgium admin1 code mapping (BRU, VLG, WAL)

### v1.0.0 (2025-12-01)

- Initial version
- Established GeoNames as authoritative source for settlement standardization
- Defined measurement point rule for historical custodians
- Documented admin1 code mapping for Netherlands