glam/.opencode/GEONAMES_SETTLEMENT_RULES.md
2025-12-07 00:26:01 +01:00


# GeoNames Settlement Standardization Rules
**Version**: 1.1.0
**Effective Date**: 2025-12-01
**Last Updated**: 2025-12-01
## Purpose
This document defines the rules for standardizing settlement names in GHCID (Global Heritage Custodian Identifier) generation using the GeoNames geographical database.
## Core Principle
**ALL settlement names in GHCID must be derived from GeoNames standardized names, not from source data.**
The GeoNames database serves as the **single source of truth** for:
- Settlement names (cities, towns, villages)
- Settlement abbreviations/codes
- Administrative region codes (admin1)
- Geographic coordinates validation
## Why GeoNames Standardization?
1. **Consistency**: Same settlement = same GHCID component, regardless of source data variations
2. **Disambiguation**: Handles duplicate city names across regions
3. **Internationalization**: Provides ASCII-safe names for identifiers
4. **Authority**: GeoNames is a well-maintained, CC-licensed geographic database
5. **Persistence**: Settlement names don't change frequently, ensuring GHCID stability
## 🚨 CRITICAL: Feature Code Filtering
**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**
GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.
### ALLOWED Feature Codes
| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague |
### EXCLUDED Feature Codes
| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |
### Implementation
```python
VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
query = """
SELECT name, feature_code, geonames_id, ...
FROM cities
WHERE country_code = ?
AND feature_code IN (?, ?, ?, ?, ?, ?, ?, ?)
ORDER BY distance_sq
LIMIT 1
"""
cursor.execute(query, (country_code, *VALID_FEATURE_CODES))
```
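The query above leaves its column list elided and does not show how `distance_sq` is computed. As a minimal, self-contained sketch (not the project's actual implementation), the following runs the same filter against an in-memory SQLite table. The function name `nearest_settlement`, the squared-degree distance expression, and the two toy rows (including their IDs and coordinates) are assumptions made for illustration only.

```python
import sqlite3

VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')


def nearest_settlement(conn, latitude, longitude, country_code):
    """Return (name, feature_code, geonames_id, distance_sq) for the nearest allowed row, or None."""
    placeholders = ", ".join("?" for _ in VALID_FEATURE_CODES)
    # Assumption: distance_sq is a plain squared difference in degrees.
    query = f"""
        SELECT name, feature_code, geonames_id,
               (latitude - ?) * (latitude - ?)
               + (longitude - ?) * (longitude - ?) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (latitude, latitude, longitude, longitude, country_code, *VALID_FEATURE_CODES)
    return conn.execute(query, params).fetchone()


# Toy rows for illustration only; IDs and coordinates are NOT real GeoNames records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "country_code TEXT, latitude REAL, longitude REAL, feature_code TEXT)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Binnenstad", "NL", 52.2110, 5.9690, "PPLX"),  # closest, but excluded
        (2, "Apeldoorn", "NL", 52.2100, 5.9700, "PPL"),
    ],
)
row = nearest_settlement(conn, 52.2110, 5.9690, "NL")
print(row[0], row[1])
```

Even though the PPLX row sits exactly on the query point, the filter skips it and the PPL row is returned. Squared degrees ignore how longitude narrows with latitude, so a production query may prefer a proper distance measure.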
### Verification
Always check `feature_code` in location_resolution metadata:
```yaml
location_resolution:
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
```
**If you see `feature_code: PPLX`**, the GHCID is WRONG and must be regenerated.
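This check is easy to automate. The sketch below assumes the `location_resolution` block has already been loaded into a Python dict; the helper name `needs_regeneration` is hypothetical.

```python
ALLOWED_FEATURE_CODES = frozenset(
    {"PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG"}
)


def needs_regeneration(location_resolution: dict) -> bool:
    """True when feature_code is missing or not an allowed settlement code (e.g. PPLX)."""
    return location_resolution.get("feature_code") not in ALLOWED_FEATURE_CODES
```

Treating a missing `feature_code` as a failure is a deliberate choice in this sketch: an entry that cannot prove it resolved to a proper settlement is handled like one that resolved to a neighborhood.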
## 🚨 CRITICAL: Country Code Detection
**Determine country code from entry data BEFORE calling GeoNames reverse geocoding.**
GeoNames queries are country-specific. Using the wrong country code will return incorrect results.
### Country Code Resolution Priority
1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse from address string
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata
7. Default: `"NL"` (Netherlands) - Only if no other source
### Example
```python
# Determine country code FIRST
country_code = "NL"  # Default
if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address:
        country_code = "BE"
    elif ', Germany' in address:
        country_code = "DE"

# THEN call reverse geocoding
result = reverse_geocode_to_city(latitude, longitude, country_code)
```
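The example above covers only two of the seven priority levels. A fuller sketch that walks the whole list in order might look like the following. The helper names (`_dig`, `_code_from_text`, `detect_country_code`) and the partial `COUNTRY_NAME_TO_CODE` table are assumptions for illustration, and the sketch further assumes that the explicit `country` fields already hold ISO 3166-1 alpha-2 codes. Inferring a country from a Wikidata label will usually need more than a substring match.

```python
# Hypothetical, deliberately partial lookup used for address and label parsing.
COUNTRY_NAME_TO_CODE = {"Netherlands": "NL", "Belgium": "BE", "Germany": "DE"}


def _dig(data, *keys):
    """Walk nested dicts safely; return None as soon as a level is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def _code_from_text(text):
    """Return the code of the first known country name found in text, else None."""
    for name, code in COUNTRY_NAME_TO_CODE.items():
        if name in (text or ""):
            return code
    return None


def detect_country_code(entry: dict, default: str = "NL") -> str:
    """Apply the resolution priority list top to bottom; fall back to the default."""
    locations = entry.get("locations") or []
    candidates = [
        _dig(entry, "zcbs_enrichment", "country"),                             # 1
        _dig(entry, "location", "country"),                                    # 2
        next((loc.get("country") for loc in locations
              if isinstance(loc, dict) and loc.get("country")), None),         # 3
        _dig(entry, "original_entry", "country"),                              # 4
        _code_from_text(_dig(entry, "google_maps_enrichment", "address")),     # 5
        _code_from_text(_dig(entry, "wikidata_enrichment", "located_in", "label")),  # 6
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return default                                                             # 7
```

Building the candidate list first and returning the first non-empty value keeps the priority order visible in one place, which makes it easier to compare against the numbered list.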
## Settlement Resolution Process
### Step 1: Coordinate-Based Resolution (Preferred)
When coordinates are available, use reverse geocoding to find the nearest GeoNames settlement:
```python
def resolve_settlement_from_coordinates(latitude: float, longitude: float, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement nearest to given coordinates.

    Returns:
        {
            'settlement_name': 'Lelystad',  # GeoNames standardized name
            'settlement_code': 'LEL',       # 3-letter abbreviation
            'admin1_code': '16',            # GeoNames admin1 code
            'region_code': 'FL',            # ISO 3166-2 region code
            'geonames_id': 2751792,         # GeoNames ID for provenance
            'distance_km': 0.5              # Distance from coords to settlement center
        }
    """
```
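The `distance_km` field in the return value implies a great-circle distance between the source coordinates and the settlement center. The document does not say how it is computed; one common option is the haversine formula, sketched here with a hypothetical helper name and a mean Earth radius of 6371.0088 km.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A large `distance_km` is a useful review signal: it can indicate wrong source coordinates or a wrong country code rather than a genuine match.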
### Step 2: Name-Based Resolution (Fallback)
When only a settlement name is available (no coordinates), look up in GeoNames:
```python
def resolve_settlement_from_name(name: str, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement matching the given name.
    Uses fuzzy matching and disambiguation when multiple matches exist.
    """
```
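The fuzzy-matching step is not specified further. As one possible sketch, the standard library's `difflib.get_close_matches` can shortlist candidate names before disambiguation. The function name `match_settlement_name` and the 0.8 cutoff are assumptions, not values taken from the project.

```python
import difflib


def match_settlement_name(name: str, known_names: list[str], cutoff: float = 0.8) -> list[str]:
    """Return up to three known names that closely match `name`, best match first."""
    # Compare case-insensitively, but return the original GeoNames spelling.
    lookup = {known.casefold(): known for known in known_names}
    hits = difflib.get_close_matches(name.casefold(), list(lookup), n=3, cutoff=cutoff)
    return [lookup[hit] for hit in hits]
```

For example, `match_settlement_name("Lelystadt", ["Lelystad", "Apeldoorn", "Amsterdam"])` shortlists only `Lelystad`. An empty result would send the entry on to Step 3.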
### Step 3: Manual Resolution (Last Resort)
If GeoNames lookup fails, flag the entry for manual review with:
- `settlement_source: MANUAL`
- `settlement_needs_review: true`
## GHCID Settlement Component Rules
### Format
The settlement component in GHCID uses a **3-letter uppercase code**:
```
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames
```
### Code Generation Rules
1. **Single-word settlements**: First 3 letters uppercase
   - `Amsterdam` → `AMS`
   - `Rotterdam` → `ROT`
   - `Lelystad` → `LEL`
2. **Settlements with Dutch articles** (`de`, `het`, `den`, `'s`):
   - First letter of article + first 2 letters of main word
   - `Den Haag` → `DHA`
   - `'s-Hertogenbosch` → `SHE`
   - `De Bilt` → `DBI`
3. **Multi-word settlements** (no article):
   - First letter of each word (up to 3)
   - `Nieuw Amsterdam` → `NAM`
   - `Oud Beijerland` → `OBE`
4. **GeoNames Disambiguation Database**:
   - For known problematic settlements, use pre-defined codes from disambiguation table
   - Example: Both `Zwolle` (OV) and `Zwolle` (LI) exist - use `ZWO` with region for uniqueness
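Rules 1 to 3 can be sketched as a single function. Note that the two-word examples (`NAM`, `OBE`) have three letters although the names have only two words; this sketch assumes the code is padded with the next letters of the last word, which reproduces both examples but is not stated in the rules. The function name `settlement_code` is hypothetical, and rule 4 (the disambiguation table) is not covered here.

```python
import re

DUTCH_ARTICLES = {"de", "het", "den", "'s"}


def settlement_code(ascii_name: str) -> str:
    """Derive a 3-letter settlement code from a GeoNames ASCII name (rules 1-3 only)."""
    # Split on whitespace and hyphens so "'s-Hertogenbosch" yields an article and a main word.
    words = [w for w in re.split(r"[\s-]+", ascii_name.strip()) if w]
    if not words:
        raise ValueError("empty settlement name")
    letters = [re.sub(r"[^A-Za-z]", "", w) for w in words]

    if len(words) == 1:                        # Rule 1: first 3 letters
        return letters[0][:3].upper()
    if words[0].lower() in DUTCH_ARTICLES:     # Rule 2: article initial + 2 letters
        return (letters[0][:1] + letters[1][:2]).upper()

    code = "".join(part[:1] for part in letters[:3])  # Rule 3: initials, up to 3
    if len(code) < 3:
        # Assumption: pad from the last word, as the NAM and OBE examples suggest.
        code += letters[-1][1:1 + (3 - len(code))]
    return code.upper()
```

All seven examples listed in the rules come out as documented with this sketch, for instance `settlement_code("Den Haag")` gives `DHA` and `settlement_code("Nieuw Amsterdam")` gives `NAM`.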
### Measurement Point for Historical Custodians
**Rule**: For heritage custodians that no longer exist or have historical coordinates, the **modern-day settlement** (as of 2025-12-01) is used.
Rationale:
- GHCIDs should be stable over time
- Historical place names may have changed
- Modern settlements are easier to verify and look up
- GeoNames reflects current geographic reality
Example:
- A museum that operated 1900-1950 in what was then "Nieuw Land" (before Flevoland province existed)
- Modern coordinates fall within Lelystad municipality
- GHCID uses `LEL` (Lelystad) as settlement code, not historical name
## GeoNames Database Integration
### Database Location
```
/data/reference/geonames.db
```
### Required Tables
```sql
-- Cities/settlements table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT,           -- Local name (may have diacritics)
    ascii_name TEXT,     -- ASCII-safe name for identifiers
    country_code TEXT,   -- ISO 3166-1 alpha-2
    admin1_code TEXT,    -- First-level administrative division
    admin1_name TEXT,    -- Region/province name
    latitude REAL,
    longitude REAL,
    population INTEGER,
    feature_code TEXT    -- PPL, PPLA, PPLC, etc.
);

-- Disambiguation table for problematic settlements
CREATE TABLE settlement_codes (
    geonames_id INTEGER PRIMARY KEY,
    country_code TEXT,
    settlement_code TEXT,  -- 3-letter code
    is_primary BOOLEAN,    -- Primary code for this settlement
    notes TEXT
);
```
### Admin1 Code Mapping (Netherlands)
**IMPORTANT**: GeoNames admin1 codes differ from historical numbering. Use this mapping:
| GeoNames admin1 | Province | ISO 3166-2 |
|-----------------|----------|------------|
| 01 | Drenthe | NL-DR |
| 02 | Friesland | NL-FR |
| 03 | Gelderland | NL-GE |
| 04 | Groningen | NL-GR |
| 05 | Limburg | NL-LI |
| 06 | Noord-Brabant | NL-NB |
| 07 | Noord-Holland | NL-NH |
| 09 | Utrecht | NL-UT |
| 10 | Zeeland | NL-ZE |
| 11 | Zuid-Holland | NL-ZH |
| 15 | Overijssel | NL-OV |
| 16 | Flevoland | NL-FL |
**Note**: Code 08 is not used in the Netherlands (it was assigned to a former region).
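In code, the table above reduces to a small lookup. This is a sketch only; the names `NL_ADMIN1_TO_REGION` and `region_code_for_admin1` are hypothetical, and the values are the ISO 3166-2 suffixes from the table.

```python
from typing import Optional

# GeoNames admin1 code -> ISO 3166-2 suffix, copied from the mapping table.
NL_ADMIN1_TO_REGION = {
    "01": "DR", "02": "FR", "03": "GE", "04": "GR",
    "05": "LI", "06": "NB", "07": "NH", "09": "UT",
    "10": "ZE", "11": "ZH", "15": "OV", "16": "FL",
}


def region_code_for_admin1(admin1_code) -> Optional[str]:
    """Map a GeoNames admin1 code (str or int) to a region code; None if unmapped."""
    # Zero-pad so that 1, "1" and "01" are treated alike.
    return NL_ADMIN1_TO_REGION.get(str(admin1_code).zfill(2))
```

Returning `None` for an unmapped code such as `08` lets the caller decide whether to flag the entry for review instead of silently producing a wrong region.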
## Validation Requirements
### Before GHCID Generation
Every entry MUST have:
- [ ] Settlement name resolved via GeoNames
- [ ] `geonames_id` recorded in entry metadata
- [ ] Settlement code (3-letter) generated consistently
- [ ] Admin1/region code mapped correctly
### Provenance Tracking
Record GeoNames resolution in entry metadata:
```yaml
location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
  source_coordinates:
    latitude: 52.52111
    longitude: 5.43722
  distance_to_settlement_km: 0.5
```
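A record like this can be checked against the validation requirements before a GHCID is generated. The sketch below assumes the YAML has been loaded into a dict; the function name `resolution_problems` and the choice of required fields are assumptions drawn from the checklist above, and entries resolved with `method: MANUAL` may legitimately need a looser check.

```python
import re

REQUIRED_FIELDS = ("method", "geonames_id", "geonames_name",
                   "settlement_code", "admin1_code", "region_code")


def resolution_problems(record: dict) -> list[str]:
    """Return human-readable problems found in a location_resolution record."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if record.get(field) in (None, "")]
    code = record.get("settlement_code")
    if code and not re.fullmatch(r"[A-Z]{3}", str(code)):
        problems.append("settlement_code is not 3 uppercase letters")
    if code == "XXX":
        problems.append("settlement_code is still the XXX placeholder")
    return problems
```

An empty list means the record passes these checks; anything else can be logged alongside the entry for follow-up.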
## 🚨 CRITICAL: XXX Placeholders Are TEMPORARY - Research Required 🚨
**XXX placeholders for region/settlement codes are NEVER acceptable as a final state.**
When an entry has `XX` (unknown region) or `XXX` (unknown settlement), the agent MUST conduct research to resolve the location.
### Resolution Strategy by Institution Type
| Institution Type | Location Resolution Method |
|------------------|---------------------------|
| **Destroyed institution** | Use last known physical location before destruction |
| **Historical (closed)** | Use last operating location |
| **Refugee/diaspora org** | Use current headquarters OR original founding location |
| **Digital-only platform** | Use parent/founding organization's headquarters |
| **Decentralized initiative** | Use founding location or primary organizer location |
| **Unknown city, known country** | Research via Wikidata, Google Maps, official website |
### Research Sources (Priority Order)
1. **Wikidata** - P131 (located in), P159 (headquarters location), P625 (coordinates)
2. **Google Maps** - Search institution name
3. **Official Website** - Contact page, about page
4. **Web Archive** - archive.org for destroyed/closed institutions
5. **Academic Sources** - Papers, reports
6. **News Articles** - Particularly for destroyed heritage sites
### Location Resolution Metadata
When resolving XXX placeholders, update `location_resolution`:
```yaml
location_resolution:
  method: MANUAL_RESEARCH  # Previously was NAME_LOOKUP with XXX
  country_code: PS
  region_code: GZ
  region_name: Gaza Strip
  city_code: GAZ
  city_name: Gaza City
  geonames_id: 281133
  research_date: "2025-12-06T00:00:00Z"
  research_sources:
    - type: wikidata
      id: Q123456
      claim: P131
    - type: web_archive
      url: https://web.archive.org/web/20231001/https://institution-website.org/contact
  notes: "Located in Gaza City prior to destruction in 2024"
```
### File Renaming After Resolution
When GHCID changes due to XXX resolution, the file MUST be renamed:
```bash
# Before
data/custodian/PS-XX-XXX-A-NAPR.yaml
# After
data/custodian/PS-GZ-GAZ-A-NAPR.yaml
```
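When the rename is done from a script rather than by hand, a guard against overwriting an existing file is worth having. A minimal sketch, assuming files are named `<GHCID>.yaml` as in the example above; the function name `rename_custodian_file` is hypothetical, and a repository under version control may prefer `git mv` so history is preserved.

```python
from pathlib import Path


def rename_custodian_file(directory: Path, old_ghcid: str, new_ghcid: str) -> Path:
    """Rename <old_ghcid>.yaml to <new_ghcid>.yaml, refusing to overwrite a file."""
    source = directory / f"{old_ghcid}.yaml"
    target = directory / f"{new_ghcid}.yaml"
    if not source.is_file():
        raise FileNotFoundError(f"no such custodian file: {source}")
    if target.exists():
        # A collision means two entries resolved to the same GHCID; do not clobber.
        raise FileExistsError(f"target already exists: {target}")
    return source.rename(target)
```

A call such as `rename_custodian_file(Path("data/custodian"), "PS-XX-XXX-A-NAPR", "PS-GZ-GAZ-A-NAPR")` mirrors the before/after pair shown above.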
### Prohibited Practices
- ❌ Leaving XXX placeholders in production data
- ❌ Using "Online" or country name as location
- ❌ Skipping research because it's difficult
- ❌ Using XX/XXX for diaspora organizations
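To find entries that still violate the first rule, the GHCID itself can be scanned for placeholders. This sketch assumes the five-part layout `{COUNTRY}-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}` shown earlier; the function name `unresolved_components` is hypothetical.

```python
def unresolved_components(ghcid: str) -> list[str]:
    """List which location components of a GHCID are still placeholders."""
    parts = ghcid.split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected GHCID layout: {ghcid!r}")
    region, settlement = parts[1], parts[2]
    unresolved = []
    if region == "XX":
        unresolved.append("region")
    if settlement == "XXX":
        unresolved.append("settlement")
    return unresolved
```

Run over the file names in `data/custodian/`, this yields a worklist of entries that still need research.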
## Error Handling
### No GeoNames Match
If a settlement cannot be resolved via automated lookup:
1. Log warning with entry details
2. Set `settlement_code: XXX` (temporary placeholder)
3. Set `settlement_needs_review: true`
4. Do NOT skip the entry - generate GHCID with XXX placeholder
5. **IMMEDIATELY** begin manual research to resolve
### Multiple GeoNames Matches
When multiple settlements match a name:
1. Use coordinates to disambiguate (if available)
2. Use admin1/region context (if available)
3. Use population as tiebreaker (prefer larger settlement)
4. Flag for manual review if still ambiguous
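The four steps above can be sketched as one selection function. It assumes each candidate is a dict with the `cities` table columns (`latitude`, `longitude`, `admin1_code`, `population`); the function name `pick_candidate` and the `(candidate, needs_review)` return shape are assumptions of this sketch.

```python
def pick_candidate(candidates, coords=None, admin1_code=None):
    """Return (best_candidate, needs_review) following the disambiguation order."""
    if not candidates:
        return None, True

    # 1. Coordinates: take the geometrically nearest candidate.
    if coords is not None:
        lat, lon = coords
        nearest = min(
            candidates,
            key=lambda c: (c["latitude"] - lat) ** 2 + (c["longitude"] - lon) ** 2,
        )
        return nearest, False

    # 2. Region context: narrow the pool when it matches at least one candidate.
    pool = candidates
    if admin1_code is not None:
        narrowed = [c for c in candidates if c.get("admin1_code") == admin1_code]
        if narrowed:
            pool = narrowed
    if len(pool) == 1:
        return pool[0], False

    # 3. Population tiebreaker: prefer the larger settlement.
    ranked = sorted(pool, key=lambda c: c.get("population") or 0, reverse=True)
    top, runner_up = (c.get("population") or 0 for c in ranked[:2])

    # 4. Still ambiguous (equal populations): flag for manual review.
    return ranked[0], top == runner_up
```

The second element of the result can feed `settlement_needs_review` directly, so an unresolved tie is never silently accepted.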
### Coordinates Outside Country
If coordinates fall outside the expected country:
1. Log warning
2. Use nearest settlement within country
3. Flag for manual review
## Related Documentation
- `AGENTS.md` - Section on GHCID generation
- `docs/PERSISTENT_IDENTIFIERS.md` - Complete GHCID specification
- `docs/GHCID_PID_SCHEME.md` - PID scheme details
- `scripts/enrich_nde_entries_ghcid.py` - Implementation
## Changelog
### v1.1.0 (2025-12-01)
- **CRITICAL**: Added feature code filtering rules
- MUST filter for PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, PPLG
- MUST exclude PPLX (neighborhoods/districts)
- Example: Apeldoorn (PPL) not "Binnenstad" (PPLX)
- **CRITICAL**: Added country code detection rules
- Must determine country from entry data BEFORE reverse geocoding
- Priority: zcbs_enrichment.country > location.country > address parsing
- Example: Belgian institutions use BE, not NL
- Added Belgium admin1 code mapping (BRU, VLG, WAL)
### v1.0.0 (2025-12-01)
- Initial version
- Established GeoNames as authoritative source for settlement standardization
- Defined measurement point rule for historical custodians
- Documented admin1 code mapping for Netherlands