glam/scripts/regenerate_historical_ghcids.py
2025-11-30 23:30:29 +01:00

722 lines
28 KiB
Python
Executable file

#!/usr/bin/env python3
"""
Regenerate GHCIDs for Historical Institutions using GeoNames-based lookup.
This script:
1. Loads historical institutions validation YAML
2. Uses coordinates to lookup modern location (country, region, city) from GeoNames
3. Generates proper GHCIDs using COUNTRY-REGION-CITY-TYPE-NAME format
4. Updates YAML with corrected GHCIDs matching production implementation
Key principle: Use coordinates to find MODERN (2025) location identifiers,
not coordinate hashes.
## GHCID Collision Resolution and Temporal Priority
This script handles GHCID collisions according to temporal priority rules:
**First Batch Collision** (multiple institutions discovered simultaneously):
- ALL colliding institutions receive native language name suffixes in snake_case
- Example: If two museums in Amsterdam both generate NL-NH-AMS-M-SM,
both receive name suffixes: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam,
NL-NH-AMS-M-SM-science_museum_amsterdam
- Rationale: Fair treatment when discovered at the same time
**Historical Addition** (new institution collides with published GHCID):
- ONLY the newly discovered institution receives a name suffix
- Previously published GHCID remains unchanged (PID stability)
- Example: If NL-NH-AMS-M-HM was published 2025-11-01, and a new
institution with the same GHCID is found 2025-11-15, only the new
institution gets: NL-NH-AMS-M-HM-historical_museum_amsterdam
- Rationale: "Cool URIs don't change" - preserve existing citations
**Implementation Notes**:
- Track publication_date in provenance metadata
- Detect collisions by comparing GHCID strings before name suffix addition
- Decision logic: if existing GHCID has publication_date < current extraction,
preserve it; only add name suffix to new institution
- Log collision events in collision registry (future enhancement)
**References**:
- docs/PERSISTENT_IDENTIFIERS.md - Section "Historical Collision Resolution"
- docs/plan/global_glam/07-ghcid-collision-resolution.md - Temporal dimension
- docs/GHCID_PID_SCHEME.md - Timeline examples
Usage:
python scripts/regenerate_historical_ghcids.py
"""
import json
import sys
import yaml
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional
# Add src to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from glam_extractor.identifiers.ghcid import (
GHCIDGenerator,
GHCIDComponents,
InstitutionType,
extract_abbreviation_from_name,
)
from glam_extractor.geocoding.geonames_lookup import GeoNamesDB
class HistoricalGHCIDGenerator:
"""
Generate GHCIDs for historical institutions using modern geographic references.
This generator creates BASE GHCIDs (without name suffixes) by:
1. Reverse geocoding coordinates to modern location (city, region)
2. Looking up city abbreviations from GeoNames database
3. Mapping region names to ISO 3166-2 codes
4. Generating institution abbreviations from names
COLLISION HANDLING:
-------------------
This class generates deterministic base GHCIDs. Collision detection and
name suffix assignment happen in a separate collision resolution stage.
Collision scenarios:
1. **First Batch Collision** (multiple institutions processed together):
- Two institutions generate the same base GHCID
- Both have the same publication_date (extraction timestamp)
- Resolution: BOTH receive native language name suffixes
- Example: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam and
NL-NH-AMS-M-SM-science_museum_amsterdam
2. **Historical Addition** (new institution added later):
- New institution generates GHCID matching already-published identifier
- publication_date is AFTER existing record's timestamp
- Resolution: ONLY new institution receives name suffix
- Existing GHCID preserved (maintains citation stability)
- Example: Existing NL-NH-AMS-M-HM unchanged,
new becomes NL-NH-AMS-M-HM-historical_museum_amsterdam
PID Stability Principle:
------------------------
Once a GHCID is published (exported to RDF, JSON-LD, CSV), it becomes a
persistent identifier that may be:
- Cited in academic papers
- Used in external databases and APIs
- Embedded in linked data triples
- Referenced in archival finding aids
Per W3C "Cool URIs Don't Change" principle, published GHCIDs MUST NOT be
modified retroactively. Historical additions get name suffixes; existing PIDs
remain stable.
References:
-----------
- docs/PERSISTENT_IDENTIFIERS.md (Historical Collision Resolution section)
- docs/plan/global_glam/07-ghcid-collision-resolution.md (Temporal dimension)
- docs/GHCID_PID_SCHEME.md (Timeline examples)
"""
def __init__(self, geonames_db_path: Path):
"""
Initialize with GeoNames database.
Args:
geonames_db_path: Path object to geonames.db SQLite database
"""
self.geonames_db = GeoNamesDB(geonames_db_path)
self.ghcid_gen = GHCIDGenerator()
# Load ISO 3166-2 mappings for region codes
self.reference_dir = Path(__file__).parent.parent / "data" / "reference"
self.nl_mapping = self._load_region_mapping("iso_3166_2_nl.json")
self.it_mapping = self._load_region_mapping("iso_3166_2_it.json")
self.be_mapping = self._load_region_mapping("iso_3166_2_be.json")
self.ru_mapping = self._load_region_mapping("iso_3166_2_ru.json")
self.dk_mapping = self._load_region_mapping("iso_3166_2_dk.json")
self.ar_mapping = self._load_region_mapping("iso_3166_2_ar.json")
self.ru_mapping = self._load_region_mapping("iso_3166_2_ru.json")
self.dk_mapping = self._load_region_mapping("iso_3166_2_dk.json")
self.ar_mapping = self._load_region_mapping("iso_3166_2_ar.json")
# Statistics
self.stats = {
'total': 0,
'success': 0,
'city_found': 0,
'city_fallback': 0,
'region_found': 0,
'region_fallback': 0,
}
def _load_region_mapping(self, filename: str) -> Dict[str, str]:
"""Load ISO 3166-2 region code mapping."""
filepath = self.reference_dir / filename
if not filepath.exists():
print(f"Warning: {filename} not found, region lookups may fail")
return {}
with open(filepath, 'r', encoding='utf-8') as f:
data = json.load(f)
# Extract provinces mapping (different structure per file)
import unicodedata
mapping = {}
# Handle different JSON structures
if 'provinces' in data:
# NL/AR format: {"provinces": {"Drenthe": "DR", ...}}
province_data = data['provinces']
elif 'regions' in data:
# IT/DK format: {"regions": {"Lombardy": "25", ...}}
province_data = data['regions']
elif 'federal_subjects' in data:
# RU format: {"federal_subjects": {"Kaliningrad Oblast": "39", ...}}
province_data = data['federal_subjects']
else:
# Simple format: {"DR": "Drenthe", ...} - need to reverse
province_data = {v: k for k, v in data.items() if isinstance(v, str) and len(k) == 2}
# Create normalized name -> code mapping
for name, code in province_data.items():
# Normalize name for lookup
normalized = unicodedata.normalize('NFD', name.upper())
normalized = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
mapping[normalized.strip()] = code
return mapping
def _get_region_code(self, country: str, region_name: Optional[str]) -> str:
"""
Get ISO 3166-2 region code for a country/region.
Args:
country: ISO 3166-1 alpha-2 country code
region_name: Region/province name (optional)
Returns:
2-letter region code or "00" if not found
"""
if not region_name:
self.stats['region_fallback'] += 1
return "00"
# Normalize region name
import unicodedata
normalized = unicodedata.normalize('NFD', region_name.upper())
normalized = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
normalized = normalized.strip()
# Lookup by country
mapping = None
if country == "NL":
mapping = self.nl_mapping
elif country == "IT":
mapping = self.it_mapping
elif country == "BE":
mapping = self.be_mapping
elif country == "RU":
mapping = self.ru_mapping
elif country == "DK":
mapping = self.dk_mapping
elif country == "AR":
mapping = self.ar_mapping
if mapping and normalized in mapping:
self.stats['region_found'] += 1
return mapping[normalized]
self.stats['region_fallback'] += 1
return "00"
def _get_city_code_fallback(self, city_name: str) -> str:
"""
Generate 3-letter city code from city name.
Same logic as Latin America script.
Args:
city_name: City name
Returns:
3-letter uppercase code
"""
import unicodedata
# Remove accents
normalized = unicodedata.normalize('NFD', city_name)
normalized = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
# Split into words
words = normalized.split()
if len(words) == 1:
# Single word: take first 3 letters
code = words[0][:3].upper()
elif words[0].lower() in ['la', 'el', 'los', 'las', 'o', 'a', 'de', 'den', 'het']:
# City with article: first letter of article + first 2 of next word
if len(words) > 1:
code = (words[0][0] + words[1][:2]).upper()
else:
code = words[0][:3].upper()
else:
# Multi-word: take first letter of each word (up to 3)
code = ''.join(w[0] for w in words[:3]).upper()
# Ensure exactly 3 letters
if len(code) < 3:
code = code.ljust(3, 'X')
elif len(code) > 3:
code = code[:3]
return code
def _reverse_geocode_with_geonames(
self,
latitude: float,
longitude: float,
country: str
) -> Optional[Dict[str, str]]:
"""
Reverse geocode coordinates to find modern location info from GeoNames.
Args:
latitude: Latitude coordinate
longitude: Longitude coordinate
country: ISO 3166-1 alpha-2 country code
Returns:
Dict with 'city', 'region', 'city_code' or None
"""
# Query GeoNames database for nearby cities
# Using a SQL query to find the closest city
query = """
SELECT
name,
admin1_code,
admin1_name,
latitude,
longitude,
geonames_id,
((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) as distance
FROM cities
WHERE country_code = ?
ORDER BY distance
LIMIT 1
"""
import sqlite3
conn = sqlite3.connect(self.geonames_db.db_path)
cursor = conn.cursor()
try:
cursor.execute(query, (latitude, latitude, longitude, longitude, country))
row = cursor.fetchone()
if row:
city_name, admin1, admin1_name, lat, lon, geonameid, distance = row
# Lookup city abbreviation from GeoNames
city_info = self.geonames_db.lookup_city(city_name, country)
if city_info:
city_code = city_info.get_abbreviation()
else:
# Fallback to 3-letter abbreviation from name
city_code = self._get_city_code_fallback(city_name)
return {
'city': city_name,
'region': admin1_name, # Use admin1_name for region lookup
'region_code': admin1, # Also return the code
'city_code': city_code,
'geonames_id': geonameid,
'distance_deg': distance ** 0.5 # Approximate distance
}
finally:
conn.close()
return None
def generate_ghcid_for_institution(self, record: dict) -> Optional[GHCIDComponents]:
"""
Generate GHCID for a historical institution using modern geographic lookup.
**GHCID Collision Handling**:
This method generates a base GHCID (without name suffix). Collision detection
and name suffix assignment happen at a later stage when comparing against
previously published records.
Decision logic (implemented in future collision detector):
1. **Check if base GHCID already exists in published dataset**
- If NO → Use base GHCID without name suffix
- If YES → Check publication timestamp
2. **Compare publication dates**:
- Same batch (same publication_date) → ALL instances get name suffixes
- Historical addition (current date > existing publication_date):
→ ONLY new institution gets name suffix
→ Existing GHCID preserved (PID stability)
3. **Name suffix format**:
- Native language institution name converted to snake_case
- Example: "Stedelijk Museum Amsterdam""stedelijk_museum_amsterdam"
Example scenarios:
**Scenario 1: First Batch Collision**
```
# 2025-11-01 batch import discovers two institutions:
Institution A: "Stedelijk Museum Amsterdam"
→ NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
Institution B: "Science Museum Amsterdam"
→ NL-NH-AMS-M-SM-science_museum_amsterdam
# Both get name suffixes because discovered simultaneously
```
**Scenario 2: Historical Addition**
```
# 2025-11-01: "Hermitage Amsterdam" published
Institution A: NL-NH-AMS-M-HM (publication_date: 2025-11-01T10:00:00Z)
# 2025-11-15: "Historical Museum Amsterdam" discovered
Institution B: Base GHCID = NL-NH-AMS-M-HM (collision!)
→ Check A.publication_date (2025-11-01) < current (2025-11-15)
→ A unchanged, B gets name suffix: NL-NH-AMS-M-HM-historical_museum_amsterdam
```
Args:
record: Institution record from YAML
Returns:
GHCIDComponents or None if generation fails
"""
self.stats['total'] += 1
# Extract required fields
name = record.get('name', '')
institution_type_str = record.get('institution_type', 'U')
# Get location data
locations = record.get('locations', [])
if not locations:
print(f"Warning: No location data for {name}")
return None
location = locations[0] # Use primary location
city = location.get('city', '')
country = location.get('country', 'XX')
latitude = location.get('latitude')
longitude = location.get('longitude')
if not latitude or not longitude:
print(f"Warning: No coordinates for {name}")
return None
# Reverse geocode to get modern location info
geo_info = self._reverse_geocode_with_geonames(latitude, longitude, country)
if geo_info:
city_code = geo_info['city_code']
region = geo_info['region']
print(f"{name}: Found {geo_info['city']} (code: {city_code}, region: {region})")
self.stats['city_found'] += 1
else:
# Fallback: use city name from record
city_info = self.geonames_db.lookup_city(city, country)
if city_info:
city_code = city_info.get_abbreviation()
region = None # Will use fallback "00"
print(f"{name}: Using city name lookup ({city}{city_code})")
self.stats['city_found'] += 1
else:
city_code = self._get_city_code_fallback(city)
region = None
print(f"{name}: Using fallback ({city}{city_code})")
self.stats['city_fallback'] += 1
# Get region code
region_code = self._get_region_code(country, region)
# Map institution type
if len(institution_type_str) == 1:
# Single letter - map to InstitutionType
type_map = {
'G': InstitutionType.GALLERY,
'L': InstitutionType.LIBRARY,
'A': InstitutionType.ARCHIVE,
'M': InstitutionType.MUSEUM,
'O': InstitutionType.OFFICIAL_INSTITUTION,
'R': InstitutionType.RESEARCH_CENTER,
'C': InstitutionType.CORPORATION,
'U': InstitutionType.UNKNOWN,
'B': InstitutionType.BOTANICAL_ZOO,
'E': InstitutionType.EDUCATION_PROVIDER,
'P': InstitutionType.PERSONAL_COLLECTION,
'S': InstitutionType.COLLECTING_SOCIETY,
}
inst_type = type_map.get(institution_type_str, InstitutionType.UNKNOWN)
else:
try:
inst_type = InstitutionType[institution_type_str]
except KeyError:
inst_type = InstitutionType.UNKNOWN
# Generate abbreviation from name
abbreviation = extract_abbreviation_from_name(name)
# Generate GHCID
components = GHCIDComponents(
country_code=country,
region_code=region_code,
city_locode=city_code,
institution_type=inst_type,
abbreviation=abbreviation
)
self.stats['success'] += 1
return components
def print_statistics(self):
"""Print generation statistics."""
print("\n" + "="*70)
print("HISTORICAL GHCID REGENERATION STATISTICS")
print("="*70)
print(f"Total institutions processed: {self.stats['total']}")
print(f"GHCIDs successfully generated: {self.stats['success']}")
print(f"City codes from GeoNames: {self.stats['city_found']}")
print(f"City codes from fallback: {self.stats['city_fallback']}")
print(f"Region codes found: {self.stats['region_found']}")
print(f"Region codes fallback (00): {self.stats['region_fallback']}")
print()
def main():
"""Main execution."""
# Paths
project_root = Path(__file__).parent.parent
yaml_path = project_root / "data" / "instances" / "historical_institutions_validation.yaml"
geonames_db_path = project_root / "data" / "reference" / "geonames.db"
print("="*70)
print("REGENERATE HISTORICAL INSTITUTION GHCIDs")
print("="*70)
print(f"Input: {yaml_path}")
print(f"GeoNames DB: {geonames_db_path}")
print()
# Create backup
backup_dir = project_root / "data" / "instances" / "archive"
backup_dir.mkdir(exist_ok=True)
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
backup_path = backup_dir / f"historical_institutions_pre_regenerate_{timestamp}.yaml"
import shutil
shutil.copy(yaml_path, backup_path)
print(f"✓ Backup created: {backup_path}")
print()
# Load YAML
with open(yaml_path, 'r', encoding='utf-8') as f:
institutions = yaml.safe_load(f)
print(f"Loaded {len(institutions)} historical institutions")
print()
# Initialize generator
generator = HistoricalGHCIDGenerator(geonames_db_path)
# Process each institution
print("Generating GHCIDs using modern geographic lookup...")
print()
# COLLISION DETECTION NOTE:
# This script regenerates GHCIDs for historical validation data.
# In production, collision detection occurs when comparing against
# previously published GHCIDs with temporal priority:
#
# 1. Build collision registry: group institutions by base GHCID (no name suffix)
# 2. For each collision group:
# a) Check publication_date metadata (from provenance.extraction_date)
# b) If all have same publication_date → First Batch (all get name suffixes)
# c) If dates differ → Historical Addition (only new ones get name suffixes)
# 3. Preserve existing PIDs: Never modify GHCIDs with earlier publication_date
#
# See docs/plan/global_glam/07-ghcid-collision-resolution.md for algorithm
for record in institutions:
name = record.get('name', 'Unknown')
# Generate GHCID
components = generator.generate_ghcid_for_institution(record)
if not components:
continue
# Generate all identifier formats
#
# NAME SUFFIX COLLISION HANDLING:
# Currently generating base GHCID without name suffix.
# In production collision detector:
# 1. Check if this base GHCID exists in published dataset
# 2. If collision detected:
# - First batch (same publication_date): Add name suffix to ALL
# - Historical addition (later date): Add name suffix to NEW only
# 3. Name suffix generation:
# - Convert native language institution name to snake_case
# - Example: "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
#
ghcid_str = components.to_string()
ghcid_uuid = components.to_uuid()
ghcid_uuid_sha256 = components.to_uuid_sha256()
ghcid_numeric = components.to_numeric()
record_id = GHCIDComponents.generate_uuid_v7()
# Update record
record['ghcid_current'] = ghcid_str
record['ghcid_original'] = ghcid_str # Same as current for regeneration
record['ghcid_uuid'] = str(ghcid_uuid)
record['ghcid_uuid_sha256'] = str(ghcid_uuid_sha256)
record['ghcid_numeric'] = ghcid_numeric
record['record_id'] = str(record_id)
# Update GHCID history (first entry)
#
# GHCID HISTORY AND COLLISIONS:
# ghcid_history tracks GHCID changes over time (see GHCIDHistoryEntry schema).
# When collisions are resolved by adding name suffixes:
#
# 1. Create new history entry with valid_from timestamp
# 2. Update previous entry with valid_to timestamp (if name suffix added)
# 3. Record reason for change (e.g., "Name suffix added to resolve collision")
#
# Example history for collision resolution:
# ghcid_history:
# - ghcid: NL-NH-AMS-M-HM-historical_museum_amsterdam # Current (with name suffix)
# ghcid_numeric: 789012345678
# valid_from: "2025-11-15T10:00:00Z"
# valid_to: null
# reason: "Name suffix added to resolve collision with existing NL-NH-AMS-M-HM"
# - ghcid: NL-NH-AMS-M-HM # Original (without name suffix)
# ghcid_numeric: 123456789012
# valid_from: "2025-11-15T09:00:00Z"
# valid_to: "2025-11-15T10:00:00Z"
# reason: "Historical identifier based on modern geographic location"
#
if 'ghcid_history' in record and record['ghcid_history']:
record['ghcid_history'][0]['ghcid'] = ghcid_str
record['ghcid_history'][0]['ghcid_numeric'] = ghcid_numeric
record['ghcid_history'][0]['reason'] = (
"Historical identifier based on modern geographic location "
"from GeoNames reverse geocoding"
)
# Update identifiers list
if 'identifiers' not in record:
record['identifiers'] = []
# Remove old GHCID identifiers
record['identifiers'] = [
i for i in record['identifiers']
if i.get('identifier_scheme') not in ['GHCID', 'GHCID_NUMERIC', 'GHCID_UUID', 'GHCID_UUID_SHA256', 'RECORD_ID']
]
# Add new GHCID identifiers
record['identifiers'].extend([
{
'identifier_scheme': 'GHCID',
'identifier_value': ghcid_str
},
{
'identifier_scheme': 'GHCID_NUMERIC',
'identifier_value': str(ghcid_numeric)
},
{
'identifier_scheme': 'GHCID_UUID',
'identifier_value': str(ghcid_uuid),
'identifier_url': f'urn:uuid:{ghcid_uuid}'
},
{
'identifier_scheme': 'GHCID_UUID_SHA256',
'identifier_value': str(ghcid_uuid_sha256),
'identifier_url': f'urn:uuid:{ghcid_uuid_sha256}'
},
{
'identifier_scheme': 'RECORD_ID',
'identifier_value': str(record_id),
'identifier_url': f'urn:uuid:{record_id}'
}
])
# Print statistics
generator.print_statistics()
# NEXT STEPS: COLLISION DETECTION AND RESOLUTION
# ================================================
# This script generates base GHCIDs. For production deployment:
#
# 1. Build collision detector module:
# - Group institutions by base GHCID (without name suffix)
# - Identify collision sets (2+ institutions with same base GHCID)
#
# 2. Implement temporal priority resolver:
# - Extract publication_date from provenance.extraction_date
# - Compare timestamps within each collision set
# - Apply decision logic:
# * Same timestamp → First Batch → ALL get name suffixes
# * Different timestamps → Historical Addition → NEW get name suffixes
#
# 3. Name suffix generation:
# - Convert native language institution name to snake_case
# - Example: "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
# - Append to base GHCID: NL-NH-AMS-M-HM → NL-NH-AMS-M-HM-historical_museum_amsterdam
#
# 4. Update GHCID history:
# - Create new GHCIDHistoryEntry when name suffix added
# - Set valid_from to current timestamp
# - Set valid_to on previous entry (if exists)
# - Document reason: "Name suffix added to resolve collision with [other institution]"
#
# 5. Generate collision registry report:
# - CSV with columns: base_ghcid, collision_type, resolution_strategy,
# affected_institutions, publication_dates, timestamp
# - Store in data/reports/ghcid_collision_registry.csv
#
# 6. Validation:
# - Verify all GHCIDs are unique (base or with name suffix)
# - Check no published GHCIDs were modified (PID stability test)
# - Validate GHCID history entries have correct temporal ordering
#
# See docs/plan/global_glam/07-ghcid-collision-resolution.md for algorithm
# See docs/GHCID_PID_SCHEME.md for timeline examples
# Write updated YAML
print(f"Writing updated YAML to: {yaml_path}")
with open(yaml_path, 'w', encoding='utf-8') as f:
# Write header
f.write("---\n")
f.write("# Historical Heritage Institutions - GHCID Validation Examples\n")
f.write("# Generated from Wikidata SPARQL query\n")
f.write("# GHCIDs regenerated using modern GeoNames-based location lookup\n")
f.write(f"# Last updated: {datetime.now(timezone.utc).isoformat()}\n")
f.write("#\n")
f.write("# GHCID Format: COUNTRY-REGION-CITY-TYPE-NAME\n")
f.write("# - City codes from GeoNames reverse geocoding (coordinates → city)\n")
f.write("# - Region codes from ISO 3166-2 mappings\n")
f.write("# - Same format as production implementation\n")
f.write("\n")
# Write institutions
yaml.dump(institutions, f, default_flow_style=False, allow_unicode=True, sort_keys=False)
print("✓ Done!")
print()
print("="*70)
print("VALIDATION COMPLETE")
print("="*70)
if __name__ == "__main__":
main()