# GeoNames Settlement Standardization Rules

**Version**: 1.1.0
**Effective Date**: 2025-12-01
**Last Updated**: 2025-12-01

## Purpose

This document defines the rules for standardizing settlement names in GHCID (Global Heritage Custodian Identifier) generation using the GeoNames geographical database.

## Core Principle

**ALL settlement names in GHCID must be derived from GeoNames standardized names, not from source data.**

The GeoNames database serves as the **single source of truth** for:
- Settlement names (cities, towns, villages)
- Settlement abbreviations/codes
- Administrative region codes (admin1)
- Geographic coordinates validation

## Why GeoNames Standardization?

1. **Consistency**: Same settlement = same GHCID component, regardless of source data variations
2. **Disambiguation**: Handles duplicate city names across regions
3. **Internationalization**: Provides ASCII-safe names for identifiers
4. **Authority**: GeoNames is a well-maintained, CC-licensed geographic database
5. **Persistence**: Settlement names don't change frequently, ensuring GHCID stability

## 🚨 CRITICAL: Feature Code Filtering

**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.

### ALLOWED Feature Codes

| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague |

### EXCLUDED Feature Codes

| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters (e.g., "Binnenstad", "Amsterdam Binnenstad") |

### Implementation

```python
# Feature codes that identify proper settlements (PPLX is deliberately absent)
VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')

# One "?" placeholder for the country code, then one per allowed feature code
query = """
    SELECT name, feature_code, geonames_id, ...
    FROM cities
    WHERE country_code = ?
      AND feature_code IN (?, ?, ?, ?, ?, ?, ?, ?)
    ORDER BY distance_sq
    LIMIT 1
"""
cursor.execute(query, (country_code, *VALID_FEATURE_CODES))
```
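
For a self-contained illustration of the filter, the sketch below builds a tiny in-memory SQLite table and picks the nearest allowed settlement. The sample rows, their approximate coordinates, the helper name `nearest_settlement`, and the squared-degree distance expression are assumptions made for this example; the real `cities` table and distance calculation may differ.

```python
import sqlite3

VALID_FEATURE_CODES = ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')


def nearest_settlement(conn, latitude, longitude, country_code):
    """Return (name, feature_code) of the nearest allowed settlement, or None."""
    # Build one "?" per allowed feature code so the tuple can change freely
    placeholders = ", ".join("?" for _ in VALID_FEATURE_CODES)
    query = f"""
        SELECT name, feature_code,
               (latitude - ?) * (latitude - ?)
             + (longitude - ?) * (longitude - ?) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (latitude, latitude, longitude, longitude,
              country_code, *VALID_FEATURE_CODES)
    row = conn.execute(query, params).fetchone()
    return None if row is None else (row[0], row[1])


# Hypothetical sample data: a PPLX row that is closer than the PPL row
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (name TEXT, country_code TEXT, "
    "latitude REAL, longitude REAL, feature_code TEXT)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?)",
    [
        ("Apeldoorn", "NL", 52.21, 5.97, "PPL"),
        ("Binnenstad", "NL", 52.2101, 5.9701, "PPLX"),  # closer, but excluded
    ],
)
result = nearest_settlement(conn, 52.2102, 5.9702, "NL")
print(result)
```

Because the `PPLX` row never passes the `IN (...)` filter, the query falls through to the next-nearest proper settlement instead of returning a neighborhood.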

### Verification

Always check `feature_code` in location_resolution metadata:

```yaml
location_resolution:
  geonames_name: Apeldoorn
  feature_code: PPL  # ← MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
```

**If you see `feature_code: PPLX`**, the GHCID is WRONG and must be regenerated.
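
This check can be automated. A minimal sketch, assuming entries are already loaded as Python dicts with the `location_resolution` layout shown above (the function name `has_valid_feature_code` is hypothetical):

```python
VALID_FEATURE_CODES = {'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG'}


def has_valid_feature_code(entry: dict) -> bool:
    """True only if the entry was resolved to an allowed settlement feature code."""
    resolution = entry.get("location_resolution") or {}
    return resolution.get("feature_code") in VALID_FEATURE_CODES


ok_entry = {"location_resolution": {"geonames_name": "Apeldoorn", "feature_code": "PPL"}}
bad_entry = {"location_resolution": {"geonames_name": "Binnenstad", "feature_code": "PPLX"}}

print(has_valid_feature_code(ok_entry))
print(has_valid_feature_code(bad_entry))
```

An entry with no `feature_code` at all also fails the check, which errs on the side of flagging it for regeneration.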

## 🚨 CRITICAL: Country Code Detection

**Determine country code from entry data BEFORE calling GeoNames reverse geocoding.**

GeoNames queries are country-specific. Using the wrong country code will return incorrect results.

### Country Code Resolution Priority

1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse from address string
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata
7. Default: `"NL"` (Netherlands) - Only if no other source

### Example

```python
# Determine country code FIRST
country_code = "NL"  # Default

if entry.get('zcbs_enrichment', {}).get('country'):
    country_code = entry['zcbs_enrichment']['country']
elif entry.get('google_maps_enrichment', {}).get('address', ''):
    address = entry['google_maps_enrichment']['address']
    if ', Belgium' in address:
        country_code = "BE"
    elif ', Germany' in address:
        country_code = "DE"

# THEN call reverse geocoding
result = reverse_geocode_to_city(latitude, longitude, country_code)
```
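
The example above covers only two of the sources. A fuller sketch of the priority order follows, under several assumptions: the four `country` fields are taken to already hold ISO 3166-1 alpha-2 codes, the address suffix table is illustrative rather than exhaustive, and the Wikidata step is left out because mapping a label to a country code needs a lookup table this document does not define. The name `detect_country_code` is hypothetical.

```python
# Illustrative suffixes only; extend as needed for other countries
ADDRESS_COUNTRY_SUFFIXES = {
    ", Belgium": "BE",
    ", Germany": "DE",
    ", Netherlands": "NL",
}


def detect_country_code(entry: dict, default: str = "NL") -> str:
    """Walk the documented sources in priority order and return the first hit."""
    locations = entry.get("locations") or []
    candidates = [
        (entry.get("zcbs_enrichment") or {}).get("country"),    # priority 1
        (entry.get("location") or {}).get("country"),           # priority 2
        next((loc.get("country") for loc in locations
              if loc.get("country")), None),                    # priority 3
        (entry.get("original_entry") or {}).get("country"),     # priority 4
    ]
    for value in candidates:
        if value:
            return value

    # Priority 5: parse the country from the Google Maps address string
    address = (entry.get("google_maps_enrichment") or {}).get("address") or ""
    for suffix, code in ADDRESS_COUNTRY_SUFFIXES.items():
        if suffix in address:
            return code

    # Priority 6 (Wikidata label) is omitted in this sketch
    return default  # priority 7


print(detect_country_code({"google_maps_enrichment": {"address": "Markt 1, Hamont, Belgium"}}))
```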

## Settlement Resolution Process

### Step 1: Coordinate-Based Resolution (Preferred)

When coordinates are available, use reverse geocoding to find the nearest GeoNames settlement:

```python
def resolve_settlement_from_coordinates(latitude: float, longitude: float, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement nearest to given coordinates.

    Returns:
        {
            'settlement_name': 'Lelystad',  # GeoNames standardized name
            'settlement_code': 'LEL',       # 3-letter abbreviation
            'admin1_code': '16',            # GeoNames admin1 code
            'region_code': 'FL',            # ISO 3166-2 region code
            'geonames_id': 2751792,         # GeoNames ID for provenance
            'distance_km': 0.5              # Distance from coords to settlement center
        }
    """
```
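
The document does not say how `distance_km` is computed. One common choice is the haversine great-circle distance; the sketch below assumes that formula and a mean Earth radius of 6371 km, and the function name `haversine_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 1))
```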

### Step 2: Name-Based Resolution (Fallback)

When only a settlement name is available (no coordinates), look up in GeoNames:

```python
def resolve_settlement_from_name(name: str, country_code: str = "NL") -> dict:
    """
    Find the GeoNames settlement matching the given name.

    Uses fuzzy matching and disambiguation when multiple matches exist.
    """
```
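
The fuzzy-matching method is not specified here. As one possible approach, the sketch below uses `difflib.get_close_matches` from the Python standard library against a list of candidate names; the 0.8 cutoff, the case-insensitive comparison, and the name `match_settlement_name` are assumptions for illustration.

```python
import difflib


def match_settlement_name(name, candidates, cutoff=0.8):
    """Return the candidate closest to `name`, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the candidate's original spelling
    lowered = {candidate.lower(): candidate for candidate in candidates}
    matches = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


names = ["Lelystad", "Apeldoorn", "Amsterdam"]
print(match_settlement_name("Lelystadt", names))
print(match_settlement_name("Xyz", names))
```

A `None` result corresponds to a failed lookup, which the next step handles by flagging the entry for manual review.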

### Step 3: Manual Resolution (Last Resort)

If GeoNames lookup fails, flag the entry for manual review with:
- `settlement_source: MANUAL`
- `settlement_needs_review: true`

## GHCID Settlement Component Rules

### Format

The settlement component in GHCID uses a **3-letter uppercase code**:

```
NL-{REGION}-{SETTLEMENT}-{TYPE}-{ABBREV}
            ^^^^^^^^^^^^
            3-letter code from GeoNames
```

### Code Generation Rules

1. **Single-word settlements**: First 3 letters uppercase
   - `Amsterdam` → `AMS`
   - `Rotterdam` → `ROT`
   - `Lelystad` → `LEL`

2. **Settlements with Dutch articles** (`de`, `het`, `den`, `'s`):
   - First letter of article + first 2 letters of main word
   - `Den Haag` → `DHA`
   - `'s-Hertogenbosch` → `SHE`
   - `De Bilt` → `DBI`

3. **Multi-word settlements** (no article):
   - First letter of each word (up to 3); for two-word names, first letter of the first word + first 2 letters of the second, as in the examples
   - `Nieuw Amsterdam` → `NAM`
   - `Oud Beijerland` → `OBE`

4. **GeoNames Disambiguation Database**:
   - For known problematic settlements, use pre-defined codes from the disambiguation table
   - Example: Both `Zwolle` (OV) and `Zwolle` (LI) exist; use `ZWO` with the region for uniqueness
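
Rules 1 to 3 can be sketched as a small function. This is an illustration under assumptions, not the project's implementation: it ignores the disambiguation table from rule 4, treats `'s` as the article token `s`, infers the two-word behavior from the examples above, and does not pad names that yield fewer than three letters. The name `settlement_code` is hypothetical.

```python
import re
import unicodedata

DUTCH_ARTICLES = {"de", "het", "den", "s"}  # "'s" is reduced to "s" after tokenizing


def settlement_code(name: str) -> str:
    """Derive a 3-letter uppercase code from a settlement name (rules 1-3 only)."""
    # Strip diacritics so the code stays ASCII-safe
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z]+", ascii_name)
    if not words:
        return "XXX"  # unknown-settlement placeholder
    if len(words) == 1:
        code = words[0][:3]                    # rule 1
    elif words[0].lower() in DUTCH_ARTICLES:
        code = words[0][0] + words[1][:2]      # rule 2
    elif len(words) == 2:
        code = words[0][0] + words[1][:2]      # rule 3, two-word case
    else:
        code = "".join(word[0] for word in words[:3])  # rule 3, three or more words
    return code.upper()


for example in ["Amsterdam", "Den Haag", "'s-Hertogenbosch", "De Bilt", "Nieuw Amsterdam"]:
    print(example, settlement_code(example))
```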

### Measurement Point for Historical Custodians

**Rule**: For heritage custodians that no longer exist or have historical coordinates, the **modern-day settlement** (as of 2025-12-01) is used.

Rationale:
- GHCIDs should be stable over time
- Historical place names may have changed
- Modern settlements are easier to verify and look up
- GeoNames reflects current geographic reality

Example:
- A museum that operated 1900-1950 in what was then "Nieuw Land" (before Flevoland province existed)
- Modern coordinates fall within Lelystad municipality
- GHCID uses `LEL` (Lelystad) as settlement code, not historical name

## GeoNames Database Integration

### Database Location

```
/data/reference/geonames.db
```

### Required Tables

```sql
-- Cities/settlements table
CREATE TABLE cities (
    geonames_id INTEGER PRIMARY KEY,
    name TEXT,              -- Local name (may have diacritics)
    ascii_name TEXT,        -- ASCII-safe name for identifiers
    country_code TEXT,      -- ISO 3166-1 alpha-2
    admin1_code TEXT,       -- First-level administrative division
    admin1_name TEXT,       -- Region/province name
    latitude REAL,
    longitude REAL,
    population INTEGER,
    feature_code TEXT       -- PPL, PPLA, PPLC, etc.
);

-- Disambiguation table for problematic settlements
CREATE TABLE settlement_codes (
    geonames_id INTEGER PRIMARY KEY,
    country_code TEXT,
    settlement_code TEXT,   -- 3-letter code
    is_primary BOOLEAN,     -- Primary code for this settlement
    notes TEXT
);
```

### Admin1 Code Mapping (Netherlands)

**IMPORTANT**: GeoNames admin1 codes differ from historical numbering. Use this mapping:

| GeoNames admin1 | Province | ISO 3166-2 |
|-----------------|----------|------------|
| 01 | Drenthe | NL-DR |
| 02 | Friesland | NL-FR |
| 03 | Gelderland | NL-GE |
| 04 | Groningen | NL-GR |
| 05 | Limburg | NL-LI |
| 06 | Noord-Brabant | NL-NB |
| 07 | Noord-Holland | NL-NH |
| 09 | Utrecht | NL-UT |
| 10 | Zeeland | NL-ZE |
| 11 | Zuid-Holland | NL-ZH |
| 15 | Overijssel | NL-OV |
| 16 | Flevoland | NL-FL |

**Note**: Code 08 is not used in the Netherlands (it was assigned to a former region). Codes 12 to 14 are likewise absent from this mapping.
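
In code, the table reduces to a lookup. A minimal sketch, assuming admin1 codes arrive as strings and that unmapped codes should fall back to the `XX` unknown-region placeholder (the names `NL_ADMIN1_TO_REGION` and `region_code_for` are hypothetical):

```python
# GeoNames admin1 code -> region part of the ISO 3166-2 code (Netherlands)
NL_ADMIN1_TO_REGION = {
    "01": "DR", "02": "FR", "03": "GE", "04": "GR",
    "05": "LI", "06": "NB", "07": "NH", "09": "UT",
    "10": "ZE", "11": "ZH", "15": "OV", "16": "FL",
}


def region_code_for(admin1_code: str) -> str:
    """Map a GeoNames admin1 code to its region code, or "XX" if unmapped."""
    # zfill tolerates codes that lost their leading zero, e.g. "1" -> "01"
    return NL_ADMIN1_TO_REGION.get(admin1_code.zfill(2), "XX")


print(region_code_for("16"))
print(region_code_for("08"))
```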

## Validation Requirements

### Before GHCID Generation

Every entry MUST have:
- [ ] Settlement name resolved via GeoNames
- [ ] `geonames_id` recorded in entry metadata
- [ ] Settlement code (3-letter) generated consistently
- [ ] Admin1/region code mapped correctly

### Provenance Tracking

Record GeoNames resolution in entry metadata:

```yaml
location_resolution:
  method: REVERSE_GEOCODE  # or NAME_LOOKUP or MANUAL
  geonames_id: 2751792
  geonames_name: Lelystad
  settlement_code: LEL
  admin1_code: "16"
  region_code: FL
  resolution_date: "2025-12-01T00:00:00Z"
  source_coordinates:
    latitude: 52.52111
    longitude: 5.43722
  distance_to_settlement_km: 0.5
```
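
The checklist under "Before GHCID Generation" can be enforced against this metadata. A minimal sketch, assuming the entry is a Python dict and that the five fields below are the ones the checklist implies (the field selection and the name `missing_resolution_fields` are assumptions):

```python
REQUIRED_FIELDS = (
    "geonames_id",
    "geonames_name",
    "settlement_code",
    "admin1_code",
    "region_code",
)


def missing_resolution_fields(entry: dict) -> list:
    """List the required location_resolution fields that are absent or empty."""
    resolution = entry.get("location_resolution") or {}
    return [field for field in REQUIRED_FIELDS if resolution.get(field) in (None, "")]


complete = {
    "location_resolution": {
        "geonames_id": 2751792,
        "geonames_name": "Lelystad",
        "settlement_code": "LEL",
        "admin1_code": "16",
        "region_code": "FL",
    }
}
partial = {"location_resolution": {"geonames_name": "Lelystad"}}

print(missing_resolution_fields(complete))
print(missing_resolution_fields(partial))
```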

## 🚨 CRITICAL: XXX Placeholders Are TEMPORARY - Research Required 🚨

**XXX placeholders for region/settlement codes are NEVER acceptable as a final state.**

When an entry has `XX` (unknown region) or `XXX` (unknown settlement), the agent MUST conduct research to resolve the location.

### Resolution Strategy by Institution Type

| Institution Type | Location Resolution Method |
|------------------|---------------------------|
| **Destroyed institution** | Use last known physical location before destruction |
| **Historical (closed)** | Use last operating location |
| **Refugee/diaspora org** | Use current headquarters OR original founding location |
| **Digital-only platform** | Use parent/founding organization's headquarters |
| **Decentralized initiative** | Use founding location or primary organizer location |
| **Unknown city, known country** | Research via Wikidata, Google Maps, official website |

### Research Sources (Priority Order)

1. **Wikidata** - P131 (located in), P159 (headquarters location), P625 (coordinates)
2. **Google Maps** - Search institution name
3. **Official Website** - Contact page, about page
4. **Web Archive** - archive.org for destroyed/closed institutions
5. **Academic Sources** - Papers, reports
6. **News Articles** - Particularly for destroyed heritage sites

### Location Resolution Metadata

When resolving XXX placeholders, update `location_resolution`:

```yaml
location_resolution:
  method: MANUAL_RESEARCH  # Previously was NAME_LOOKUP with XXX
  country_code: PS
  region_code: GZ
  region_name: Gaza Strip
  city_code: GAZ
  city_name: Gaza City
  geonames_id: 281133
  research_date: "2025-12-06T00:00:00Z"
  research_sources:
    - type: wikidata
      id: Q123456
      claim: P131
    - type: web_archive
      url: https://web.archive.org/web/20231001/https://institution-website.org/contact
  notes: "Located in Gaza City prior to destruction in 2024"
```

### File Renaming After Resolution

When GHCID changes due to XXX resolution, the file MUST be renamed:

```bash
# Before
data/custodian/PS-XX-XXX-A-NAPR.yaml

# After
data/custodian/PS-GZ-GAZ-A-NAPR.yaml
```
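
A rename like this can be scripted. The sketch below is one way to do it with `pathlib`, assuming one `<GHCID>.yaml` file per custodian as in the paths above; the function name `rename_custodian_file` and the refusal to overwrite an existing target are choices made for this example. It runs against a temporary directory so it does not touch real data.

```python
import tempfile
from pathlib import Path


def rename_custodian_file(directory: Path, old_ghcid: str, new_ghcid: str) -> Path:
    """Rename <old_ghcid>.yaml to <new_ghcid>.yaml, refusing to overwrite."""
    source = directory / f"{old_ghcid}.yaml"
    target = directory / f"{new_ghcid}.yaml"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "PS-XX-XXX-A-NAPR.yaml").write_text("name: example\n")
    renamed = rename_custodian_file(folder, "PS-XX-XXX-A-NAPR", "PS-GZ-GAZ-A-NAPR")
    renamed_name = renamed.name
    old_still_exists = (folder / "PS-XX-XXX-A-NAPR.yaml").exists()

print(renamed_name)
```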

### Prohibited Practices

- ❌ Leaving XXX placeholders in production data
- ❌ Using "Online" or country name as location
- ❌ Skipping research because it's difficult
- ❌ Using XX/XXX for diaspora organizations

## Error Handling

### No GeoNames Match

If a settlement cannot be resolved via automated lookup:
1. Log warning with entry details
2. Set `settlement_code: XXX` (temporary placeholder)
3. Set `settlement_needs_review: true`
4. Do NOT skip the entry - generate GHCID with XXX placeholder
5. **IMMEDIATELY** begin manual research to resolve

### Multiple GeoNames Matches

When multiple settlements match a name:
1. Use coordinates to disambiguate (if available)
2. Use admin1/region context (if available)
3. Use population as tiebreaker (prefer larger settlement)
4. Flag for manual review if still ambiguous
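
The four steps above can be sketched as a single selection function. This is an illustration under assumptions: candidates are dicts carrying the `cities` columns, squared coordinate difference stands in for a proper distance, and a `None` return means "flag for manual review". The name `pick_match` and the sample rows are hypothetical.

```python
def pick_match(candidates, coordinates=None, admin1_code=None):
    """Choose one settlement among same-named candidates, or None if still ambiguous."""
    if not candidates:
        return None

    # Step 1: coordinates win outright when available
    if coordinates is not None:
        lat, lon = coordinates
        return min(
            candidates,
            key=lambda c: (c["latitude"] - lat) ** 2 + (c["longitude"] - lon) ** 2,
        )

    # Step 2: narrow by admin1/region context
    pool = candidates
    if admin1_code is not None:
        in_region = [c for c in pool if c["admin1_code"] == admin1_code]
        if len(in_region) == 1:
            return in_region[0]
        if in_region:
            pool = in_region

    # Step 3: prefer the larger settlement
    ranked = sorted(pool, key=lambda c: c["population"], reverse=True)
    if len(ranked) > 1 and ranked[0]["population"] == ranked[1]["population"]:
        return None  # Step 4: still ambiguous, flag for manual review
    return ranked[0]


# Hypothetical rows for two settlements sharing one name
rows = [
    {"name": "Dorp", "admin1_code": "15", "latitude": 52.5, "longitude": 6.1, "population": 1000},
    {"name": "Dorp", "admin1_code": "05", "latitude": 51.0, "longitude": 5.9, "population": 200},
]
print(pick_match(rows, coordinates=(51.1, 5.8))["admin1_code"])
```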

### Coordinates Outside Country

If coordinates fall outside the expected country:
1. Log warning
2. Use nearest settlement within country
3. Flag for manual review

## Related Documentation

- `AGENTS.md` - Section on GHCID generation
- `docs/PERSISTENT_IDENTIFIERS.md` - Complete GHCID specification
- `docs/GHCID_PID_SCHEME.md` - PID scheme details
- `scripts/enrich_nde_entries_ghcid.py` - Implementation

## Changelog

### v1.1.0 (2025-12-01)
- **CRITICAL**: Added feature code filtering rules
  - MUST filter for PPL, PPLA, PPLA2, PPLA3, PPLA4, PPLC, PPLS, PPLG
  - MUST exclude PPLX (neighborhoods/districts)
  - Example: Apeldoorn (PPL) not "Binnenstad" (PPLX)
- **CRITICAL**: Added country code detection rules
  - Must determine country from entry data BEFORE reverse geocoding
  - Priority: zcbs_enrichment.country > location.country > address parsing
  - Example: Belgian institutions use BE, not NL
- Added Belgium admin1 code mapping (BRU, VLG, WAL)

### v1.0.0 (2025-12-01)
- Initial version
- Established GeoNames as authoritative source for settlement standardization
- Defined measurement point rule for historical custodians
- Documented admin1 code mapping for Netherlands