glam/.opencode/GEONAMES_SETTLEMENT_RULES.md
2025-12-01 16:06:34 +01:00

GeoNames Settlement Standardization Rules

Version: 1.1.0
Effective Date: 2025-12-01
Last Updated: 2025-12-01

Purpose

This document defines the rules for standardizing settlement names in GHCID (Global Heritage Custodian Identifier) generation using the GeoNames geographical database.

Core Principle

ALL settlement names in GHCID must be derived from GeoNames standardized names, not from source data.

The GeoNames database serves as the single source of truth for:

  • Settlement names (cities, towns, villages)
  • Settlement abbreviations/codes
  • Administrative region codes (admin1)
  • Geographic coordinates validation

Why GeoNames Standardization?

  1. Consistency: Same settlement = same GHCID component, regardless of source data variations
  2. Disambiguation: Handles duplicate city names across regions
  3. Internationalization: Provides ASCII-safe names for identifiers
  4. Authority: GeoNames is a well-maintained geographic database published under a Creative Commons Attribution (CC BY 4.0) license
  5. Persistence: Settlement names don't change frequently, ensuring GHCID stability

🚨 CRITICAL: Feature Code Filtering

NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.

ALLOWED Feature Codes

| Code | Description | Example |
|------|-------------|---------|
| PPL | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| PPLA | Seat of first-order admin division | Provincial capitals |
| PPLA2 | Seat of second-order admin division | Municipal seats |
| PPLA3 | Seat of third-order admin division | District seats |
| PPLA4 | Seat of fourth-order admin division | Sub-district seats |
| PPLC | Capital of a political entity | Amsterdam, Brussels |
| PPLS | Populated places (multiple) | Settlement clusters |
| PPLG | Seat of government | The Hague |

EXCLUDED Feature Codes

| Code | Description | Why Excluded |
|------|-------------|--------------|
| PPLX | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |

Implementation

VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')

query = """
    SELECT name, feature_code, geonames_id, ...
    FROM cities
    WHERE country_code = ?
      AND feature_code IN (?, ?, ?, ?, ?, ?, ?, ?)
    ORDER BY distance_sq
    LIMIT 1
"""
cursor.execute(query, (country_code, *VALID_FEATURE_CODES))
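
The filter can be tried end to end against an in-memory SQLite database. The following is a minimal sketch, not the project's implementation: the function name `nearest_settlement`, the squared-degree definition of `distance_sq`, the placeholder IDs 1 and 2, and the approximate coordinates are all assumptions made for illustration.

```python
import sqlite3

VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')


def nearest_settlement(conn, latitude, longitude, country_code):
    """Return (name, feature_code, geonames_id, distance_sq) or None."""
    # One "?" per allowed feature code, so the tuple can change freely.
    placeholders = ", ".join("?" for _ in VALID_FEATURE_CODES)
    # ASSUMPTION: distance_sq is a squared difference in degrees, which is
    # only a rough ranking key (it ignores longitude shrinkage).
    query = f"""
        SELECT name, feature_code, geonames_id,
               (latitude - ?) * (latitude - ?)
               + (longitude - ?) * (longitude - ?) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (latitude, latitude, longitude, longitude,
              country_code, *VALID_FEATURE_CODES)
    return conn.execute(query, params).fetchone()


# Demo data: placeholder IDs and approximate coordinates, not real GeoNames rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, latitude REAL, longitude REAL, feature_code TEXT)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Apeldoorn", "NL", 52.21, 5.97, "PPL"),
        (2, "Binnenstad", "NL", 52.2101, 5.9701, "PPLX"),
    ],
)

# The query point coincides with the PPLX row, yet the PPL row is returned.
row = nearest_settlement(conn, 52.2101, 5.9701, "NL")
print(row[0], row[1])  # Apeldoorn PPL
```

Building the placeholder list from the tuple keeps the SQL and `VALID_FEATURE_CODES` from drifting apart if a code is added or removed.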

Verification

Always check feature_code in location_resolution metadata:

location_resolution:
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA*, PPLC, PPLS, or PPLG

If you see feature_code: PPLX, the GHCID is WRONG and must be regenerated.
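
This check can be automated. The sketch below assumes each entry is a dict carrying a `location_resolution` mapping as shown above; the function name `has_valid_feature_code` is hypothetical.

```python
VALID_FEATURE_CODES = frozenset(
    {'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG'}
)


def has_valid_feature_code(entry: dict) -> bool:
    """True only if the recorded feature_code is an allowed settlement code."""
    resolution = entry.get("location_resolution") or {}
    return resolution.get("feature_code") in VALID_FEATURE_CODES


good = {"location_resolution": {"geonames_name": "Apeldoorn", "feature_code": "PPL"}}
bad = {"location_resolution": {"geonames_name": "Binnenstad", "feature_code": "PPLX"}}

print(has_valid_feature_code(good))  # True
print(has_valid_feature_code(bad))   # False
```

An entry with no `feature_code` at all also fails the check, which errs on the side of regenerating the GHCID.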

🚨 CRITICAL: Country Code Detection

Determine country code from entry data BEFORE calling GeoNames reverse geocoding.

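GeoNames queries are country-specific. Because the lookup filters on country_code, using the wrong code returns the nearest settlement in the wrong country (for example, a Belgian institution near the border resolved to a Dutch town).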

Country Code Resolution Priority

  1. zcbs_enrichment.country - Most explicit source
  2. location.country - Direct location field
  3. locations[].country - Array location field
  4. original_entry.country - CSV source field
  5. google_maps_enrichment.address - Parse from address string
  6. wikidata_enrichment.located_in.label - Infer from Wikidata
  7. Default: "NL" (Netherlands) - Only if no other source

Example

# Determine country code FIRST
country_code = "NL"  # Default

if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address:
        country_code = "BE"
    elif ', Germany' in address:
        country_code = "DE"

# THEN call reverse geocoding
result = reverse_geocode_to_city(latitude, longitude, country_code)
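
The example above covers only two of the sources in the priority list. A fuller sketch that walks the list in order might look like the following. It assumes the `country` fields already hold ISO 3166-1 alpha-2 codes (as the example above does), the function name `detect_country_code` and the `ADDRESS_SUFFIX_TO_COUNTRY` table are hypothetical, and step 6 (Wikidata) is left out because this document does not specify how the label is mapped to a country.

```python
# ASSUMPTION: address strings end in an English country name, as in the
# example above; extend this table for other countries.
ADDRESS_SUFFIX_TO_COUNTRY = {
    ", Netherlands": "NL",
    ", Belgium": "BE",
    ", Germany": "DE",
}


def detect_country_code(entry: dict, default: str = "NL") -> str:
    """Resolve the country code following the documented priority order."""
    # Priority 1 and 2: explicit country fields.
    for key in ("zcbs_enrichment", "location"):
        country = (entry.get(key) or {}).get("country")
        if country:
            return country

    # Priority 3: first location in the array that names a country.
    for location in entry.get("locations") or []:
        if isinstance(location, dict) and location.get("country"):
            return location["country"]

    # Priority 4: CSV source field.
    country = (entry.get("original_entry") or {}).get("country")
    if country:
        return country

    # Priority 5: parse the Google Maps address string.
    address = (entry.get("google_maps_enrichment") or {}).get("address") or ""
    for suffix, code in ADDRESS_SUFFIX_TO_COUNTRY.items():
        if suffix in address:
            return code

    # Priority 6 (Wikidata) is intentionally omitted in this sketch.
    # Priority 7: default, only when nothing else matched.
    return default


belgian = {"google_maps_enrichment": {"address": "Markt 1, 3930 Hamont, Belgium"}}
print(detect_country_code(belgian))  # BE
print(detect_country_code({}))       # NL
```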

Settlement Resolution Process

Step 1: Coordinate-Based Resolution (Preferred)

When coordinates are available, use reverse geocoding to find the nearest GeoNames settlement:

def resolve_settlement_from_coordinates(latitude: float, longitude: float, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement nearest to given coordinates.
    
    Returns:
        {
            'settlement_name': 'Lelystad',      # GeoNames standardized name
            'settlement_code': 'LEL',            # 3-letter abbreviation
            'admin1_code': '16',                 # GeoNames admin1 code
            'region_code': 'FL',                 # ISO 3166-2 region code
            'geonames_id': 2751792,              # GeoNames ID for provenance
            'distance_km': 0.5                   # Distance from coords to settlement center
        }
    """
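
The `distance_km` value in the returned dict needs a real distance rather than a difference in degrees. One common way to compute it is the haversine great-circle formula, sketched below. This is an assumption about how the distance could be derived, not a statement of what the implementation does, and the helper name `haversine_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 1))  # 111.2
```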

Step 2: Name-Based Resolution (Fallback)

When only a settlement name is available (no coordinates), look up in GeoNames:

def resolve_settlement_from_name(name: str, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement matching the given name.
    
    Uses fuzzy matching and disambiguation when multiple matches exist.
    """
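
The fuzzy-matching step could be approximated with the standard library's `difflib`. The sketch below is one possible approach under stated assumptions: the candidate names would come from the `ascii_name` column for the given country, and the function name `match_settlement_name` and the 0.8 cutoff are illustrative choices rather than documented values.

```python
import difflib


def match_settlement_name(name: str, candidates: list, cutoff: float = 0.8) -> list:
    """Return candidate names that closely match `name`, best match first."""
    # Compare case-insensitively, but return the original spelling.
    by_lowercase = {candidate.lower(): candidate for candidate in candidates}
    hits = difflib.get_close_matches(
        name.lower(), list(by_lowercase), n=3, cutoff=cutoff
    )
    return [by_lowercase[hit] for hit in hits]


matches = match_settlement_name("Lelystadt", ["Lelystad", "Leiden", "Zwolle"])
print(matches)  # ['Lelystad']
```

An empty result means the name lookup failed and the entry should fall through to manual resolution.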

Step 3: Manual Resolution (Last Resort)

If GeoNames lookup fails, flag the entry for manual review with:

  • settlement_source: MANUAL
  • settlement_needs_review: true

GHCID Settlement Component Rules

Format

The settlement component in GHCID uses a 3-letter uppercase code:

NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames

Code Generation Rules

  1. Single-word settlements: First 3 letters uppercase

    • Amsterdam → AMS
    • Rotterdam → ROT
    • Lelystad → LEL
  2. Settlements with Dutch articles (de, het, den, 's):

    • First letter of article + first 2 letters of main word
    • Den Haag → DHA
    • 's-Hertogenbosch → SHE
    • De Bilt → DBI
  3. Multi-word settlements (no article):

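    • First letter of each word (up to 3); in the two-word examples below, the last word contributes its first 2 letters so that the code still has 3 letters
    • Nieuw Amsterdam → NAM
    • Oud Beijerland → OBE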
  4. GeoNames Disambiguation Database:

    • For known problematic settlements, use pre-defined codes from disambiguation table
    • Example: Both Zwolle (OV) and Zwolle (LI) exist - use ZWO with region for uniqueness
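
Rules 1 to 3 can be sketched as a small function. This is an illustrative reading of the rules, checked only against the examples listed above. The function name `generate_settlement_code`, the treatment of hyphens as word separators, and the way two-word names are filled to 3 letters are assumptions. Rule 4 (the disambiguation table) is not covered here.

```python
import re
import unicodedata

# "'s" is stored without its apostrophe because apostrophes are stripped below.
DUTCH_ARTICLES = {"de", "het", "den", "s"}


def generate_settlement_code(name: str) -> str:
    """Derive a 3-letter settlement code from a GeoNames settlement name."""
    # Fold diacritics to ASCII so the code is identifier-safe.
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    # ASSUMPTION: spaces and hyphens both separate words.
    words = [w.replace("'", "") for w in re.split(r"[\s\-]+", ascii_name)]
    words = [w for w in words if w]
    if not words:
        return "XXX"  # same placeholder the document uses for unknown settlements

    if len(words) == 1:
        # Rule 1: first 3 letters.
        code = words[0][:3]
    elif words[0].lower() in DUTCH_ARTICLES:
        # Rule 2: first letter of the article + first 2 letters of the main word.
        code = words[0][0] + words[1][:2]
    else:
        # Rule 3: initials of up to 3 words, topped up from the last word used.
        used = words[:3]
        code = "".join(w[0] for w in used)
        if len(code) < 3:
            code += used[-1][1:1 + (3 - len(code))]
    return code.upper()


for example in ["Amsterdam", "Den Haag", "'s-Hertogenbosch", "De Bilt",
                "Nieuw Amsterdam", "Oud Beijerland"]:
    print(example, "->", generate_settlement_code(example))
```

Names shorter than 3 letters are not addressed by the rules above, so this sketch returns them unpadded.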

Measurement Point for Historical Custodians

Rule: For heritage custodians that no longer exist or have historical coordinates, the modern-day settlement (as of 2025-12-01) is used.

Rationale:

  • GHCIDs should be stable over time
  • Historical place names may have changed
  • Modern settlements are easier to verify and look up
  • GeoNames reflects current geographic reality

Example:

  • A museum that operated 1900-1950 in what was then "Nieuw Land" (before Flevoland province existed)
  • Modern coordinates fall within Lelystad municipality
  • GHCID uses LEL (Lelystad) as settlement code, not historical name

GeoNames Database Integration

Database Location

/data/reference/geonames.db

Required Tables

-- Cities/settlements table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT,              -- Local name (may have diacritics)
    ascii_name TEXT,        -- ASCII-safe name for identifiers
    country_code TEXT,      -- ISO 3166-1 alpha-2
    admin1_code TEXT,       -- First-level administrative division
    admin1_name TEXT,       -- Region/province name
    latitude REAL,
    longitude REAL,
    population INTEGER,
    feature_code TEXT       -- PPL, PPLA, PPLC, etc.
);

-- Disambiguation table for problematic settlements
CREATE TABLE settlement_codes (
    geonames_id INTEGER PRIMARY KEY,
    country_code TEXT,
    settlement_code TEXT,   -- 3-letter code
    is_primary BOOLEAN,     -- Primary code for this settlement
    notes TEXT
);
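
A lookup against the disambiguation table might be done before falling back to generated codes. The sketch below uses an in-memory database with the schema above. The function name `lookup_predefined_code` and the placeholder row (ID 1) are hypothetical, and how `is_primary` should influence the lookup is not specified here, so it is ignored.

```python
import sqlite3
from typing import Optional


def lookup_predefined_code(conn: sqlite3.Connection, geonames_id: int) -> Optional[str]:
    """Return the pre-defined settlement code for a GeoNames ID, if any."""
    row = conn.execute(
        "SELECT settlement_code FROM settlement_codes WHERE geonames_id = ?",
        (geonames_id,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settlement_codes (geonames_id INTEGER PRIMARY KEY, "
    "country_code TEXT, settlement_code TEXT, is_primary BOOLEAN, notes TEXT)"
)
# Placeholder row for illustration only; 1 is not a real GeoNames ID.
conn.execute(
    "INSERT INTO settlement_codes VALUES (?, ?, ?, ?, ?)",
    (1, "NL", "ZWO", 1, "placeholder row"),
)

print(lookup_predefined_code(conn, 1))  # ZWO
print(lookup_predefined_code(conn, 2))  # None
```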

Admin1 Code Mapping (Netherlands)

IMPORTANT: GeoNames admin1 codes differ from historical numbering. Use this mapping:

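| GeoNames admin1 | Province | ISO 3166-2 |
|-----------------|----------|------------|
| 01 | Drenthe | NL-DR |
| 02 | Friesland | NL-FR |
| 03 | Gelderland | NL-GE |
| 04 | Groningen | NL-GR |
| 05 | Limburg | NL-LI |
| 06 | Noord-Brabant | NL-NB |
| 07 | Noord-Holland | NL-NH |
| 09 | Utrecht | NL-UT |
| 10 | Zeeland | NL-ZE |
| 11 | Zuid-Holland | NL-ZH |
| 15 | Overijssel | NL-OV |
| 16 | Flevoland | NL-FL |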

Note: Code 08 is not used for the Netherlands (it was assigned to a former region).
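
In code, the mapping is a plain lookup table. The sketch below copies the values from the mapping above. The names `NL_ADMIN1_TO_ISO` and `region_code_from_admin1` are hypothetical, and the short region code is assumed to be the part after the `NL-` prefix, matching `region_code: FL` elsewhere in this document.

```python
NL_ADMIN1_TO_ISO = {
    "01": "NL-DR", "02": "NL-FR", "03": "NL-GE", "04": "NL-GR",
    "05": "NL-LI", "06": "NL-NB", "07": "NL-NH", "09": "NL-UT",
    "10": "NL-ZE", "11": "NL-ZH", "15": "NL-OV", "16": "NL-FL",
}


def region_code_from_admin1(admin1_code) -> str:
    """Map a GeoNames admin1 code to the short region code used in GHCIDs."""
    # Codes are zero-padded strings; tolerate "5" or 5 as input.
    iso_code = NL_ADMIN1_TO_ISO[str(admin1_code).zfill(2)]
    return iso_code.split("-", 1)[1]


print(region_code_from_admin1("16"))  # FL
print(region_code_from_admin1("5"))   # LI
```

An unmapped code such as 08 raises `KeyError`, which surfaces bad input instead of silently producing a wrong region.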

Validation Requirements

Before GHCID Generation

Every entry MUST have:

  • Settlement name resolved via GeoNames
  • geonames_id recorded in entry metadata
  • Settlement code (3-letter) generated consistently
  • Admin1/region code mapped correctly
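
A pre-generation check along these lines could enforce the list above. It is a sketch that assumes the field names used in the `location_resolution` provenance block of this document; the function name `missing_resolution_fields` is hypothetical.

```python
REQUIRED_RESOLUTION_FIELDS = (
    "geonames_name",
    "geonames_id",
    "settlement_code",
    "admin1_code",
    "region_code",
)


def missing_resolution_fields(entry: dict) -> list:
    """List required location_resolution fields that are absent or empty."""
    resolution = entry.get("location_resolution") or {}
    return [
        field for field in REQUIRED_RESOLUTION_FIELDS
        if resolution.get(field) in (None, "")
    ]


complete = {
    "location_resolution": {
        "geonames_id": 2751792,
        "geonames_name": "Lelystad",
        "settlement_code": "LEL",
        "admin1_code": "16",
        "region_code": "FL",
    }
}
print(missing_resolution_fields(complete))  # []
print(missing_resolution_fields({"location_resolution": {"geonames_name": "Lelystad"}}))
```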

Provenance Tracking

Record GeoNames resolution in entry metadata:

location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
  source_coordinates:
    latitude: 52.52111
    longitude: 5.43722
  distance_to_settlement_km: 0.5

Error Handling

No GeoNames Match

If a settlement cannot be resolved:

  1. Log warning with entry details
  2. Set settlement_code: XXX (unknown)
  3. Set settlement_needs_review: true
  4. Do NOT skip the entry - generate GHCID with XXX placeholder
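
The four steps above could be wrapped in one helper. In this sketch the flags are stored under `location_resolution`, which is an assumption, since this document does not say where `settlement_needs_review` lives; the logger name and the function name `mark_unresolved` are likewise hypothetical.

```python
import logging

logger = logging.getLogger("ghcid.settlement")  # hypothetical logger name


def mark_unresolved(entry: dict) -> dict:
    """Flag an entry whose settlement could not be resolved via GeoNames."""
    # Step 1: log a warning with the entry details.
    logger.warning("No GeoNames match; entry=%r", entry)
    resolution = entry.setdefault("location_resolution", {})
    # Step 2: placeholder code so a GHCID can still be generated (step 4).
    resolution["settlement_code"] = "XXX"
    # Step 3: queue the entry for manual review.
    resolution["settlement_needs_review"] = True
    return entry


flagged = mark_unresolved({"original_entry": {"plaats": "Onbekend"}})
print(flagged["location_resolution"])
```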

Multiple GeoNames Matches

When multiple settlements match a name:

  1. Use coordinates to disambiguate (if available)
  2. Use admin1/region context (if available)
  3. Use population as tiebreaker (prefer larger settlement)
  4. Flag for manual review if still ambiguous
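
The tie-breaking order above can be sketched as follows. Candidates are assumed to be dicts with the column names of the `cities` table; the function name `disambiguate`, the placeholder IDs, coordinates and populations in the demo are illustrative, not real GeoNames data. A return value of `None` means the entry should be flagged for manual review.

```python
from typing import Optional, Tuple


def disambiguate(candidates: list,
                 coords: Optional[Tuple[float, float]] = None,
                 admin1_code: Optional[str] = None) -> Optional[dict]:
    """Pick one candidate settlement, or None if still ambiguous."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # 1. Coordinates: nearest candidate wins.
    if coords is not None:
        lat, lon = coords
        return min(
            candidates,
            key=lambda c: (c["latitude"] - lat) ** 2 + (c["longitude"] - lon) ** 2,
        )

    # 2. Region context: keep only candidates in the expected admin1 region.
    if admin1_code is not None:
        in_region = [c for c in candidates if c["admin1_code"] == admin1_code]
        if len(in_region) == 1:
            return in_region[0]
        if in_region:
            candidates = in_region

    # 3. Population tiebreaker: prefer the larger settlement.
    ranked = sorted(candidates, key=lambda c: c["population"] or 0, reverse=True)
    if (ranked[0]["population"] or 0) == (ranked[1]["population"] or 0):
        return None  # 4. Still ambiguous: flag for manual review.
    return ranked[0]


# Placeholder rows for illustration only.
zwolle_ov = {"geonames_id": 1, "name": "Zwolle", "admin1_code": "15",
             "latitude": 52.5, "longitude": 6.1, "population": 100000}
zwolle_li = {"geonames_id": 2, "name": "Zwolle", "admin1_code": "05",
             "latitude": 51.3, "longitude": 6.0, "population": 100}

print(disambiguate([zwolle_ov, zwolle_li], admin1_code="05")["geonames_id"])  # 2
print(disambiguate([zwolle_ov, zwolle_li])["geonames_id"])                    # 1
```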

Coordinates Outside Country

If coordinates fall outside the expected country:

  1. Log warning
  2. Use nearest settlement within country
  3. Flag for manual review

See Also

  • AGENTS.md - Section on GHCID generation
  • docs/PERSISTENT_IDENTIFIERS.md - Complete GHCID specification
  • docs/GHCID_PID_SCHEME.md - PID scheme details
  • scripts/enrich_nde_entries_ghcid.py - Implementation

Changelog

v1.1.0 (2025-12-01)

  • CRITICAL: Added feature code filtering rules
    • MUST filter for PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, PPLG
    • MUST exclude PPLX (neighborhoods/districts)
    • Example: Apeldoorn (PPL) not "Binnenstad" (PPLX)
  • CRITICAL: Added country code detection rules
    • Must determine country from entry data BEFORE reverse geocoding
    • Priority: zcbs_enrichment.country > location.country > address parsing
    • Example: Belgian institutions use BE, not NL
  • Added Belgium admin1 code mapping (BRU, VLG, WAL)

v1.0.0 (2025-12-01)

  • Initial version
  • Established GeoNames as authoritative source for settlement standardization
  • Defined measurement point rule for historical custodians
  • Documented admin1 code mapping for Netherlands