ISO 3166-2 Integration for GHCID Region Codes

Date: 2025-11-06
Status: COMPLETE
Impact: Production-ready GHCID format with proper region codes for all historical institutions


Overview

Successfully integrated ISO 3166-2 subdivision codes into the GHCID generation system, eliminating "00" fallback codes and achieving 100% region code coverage for historical institutions.

Problem Statement

Initial Issue

Historical institutions were using fallback region code "00" instead of proper ISO 3166-2 subdivision codes:

  • Actual: NL-00-ALK-L-LA (fallback)
  • Target: NL-NH-ALK-L-LA (Noord-Holland region code)

Root Cause

The GHCID regeneration script had incomplete ISO 3166-2 mapping files:

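  • Netherlands (NL) mapping existed, but lacked the English aliases that GeoNames returns
  • Italy (IT) mapping missing
  • Russia (RU) mapping missing
  • Denmark (DK) mapping missing
  • Argentina (AR) mapping not loaded by the script (the file needed a format update and accent normalization)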

Impact

  • 4 out of 5 historical institutions had "00" fallback codes
  • GHCIDs were inconsistent with target specification
  • Production readiness blocked

Solution Architecture

Data Flow: GeoNames → ISO 3166-2

Coordinates (54.71°N, 20.51°E)
    ↓
GeoNames Reverse Geocoding
    ↓
admin1_name: "Kaliningrad Oblast" (English)
    ↓
Normalize: "KALININGRAD OBLAST"
    ↓
ISO 3166-2 Mapping Lookup
    ↓
Region Code: "KGD" (Kaliningradskaya oblast')
    ↓
GHCID: RU-KGD-KAL-L-KPL

Key Challenge: Language Mismatch

Problem: The GeoNames API returns English region names, but the ISO 3166-2 standard uses official local-language names.

Example:

  • GeoNames returns: "North Holland" (English)
  • ISO 3166-2 official name: "Noord-Holland" (Dutch)

Solution: Dual-name mapping strategy

{
  "provinces": {
    "Noord-Holland": "NH",   // Official ISO 3166-2 name
    "North Holland": "NH"     // GeoNames English alias
  }
}
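
The dual-name strategy can be sketched as a normalized lookup. This is a minimal illustration, not the code in regenerate_historical_ghcids.py: the names normalize, build_index, and lookup_region are hypothetical, and only the uppercase normalization and the "00" fallback are taken from this document.

```python
import json

# Mapping in the dual-name layout shown above (valid JSON, so no comments).
MAPPING_JSON = """
{
  "provinces": {
    "Noord-Holland": "NH",
    "North Holland": "NH"
  }
}
"""

FALLBACK_REGION = "00"  # fallback code used when no mapping entry matches


def normalize(name: str) -> str:
    """Uppercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(name.upper().split())


def build_index(mapping: dict) -> dict:
    """Re-key the mapping by normalized name (hypothetical helper)."""
    return {normalize(name): code for name, code in mapping.items()}


def lookup_region(admin1_name: str, index: dict) -> str:
    """Return the region code for a GeoNames admin1 name, or the fallback."""
    return index.get(normalize(admin1_name), FALLBACK_REGION)


index = build_index(json.loads(MAPPING_JSON)["provinces"])
print(lookup_region("North Holland", index))  # GeoNames English name
print(lookup_region("noord-holland", index))  # official Dutch name, other case
print(lookup_region("Atlantis", index))       # unmapped name falls back to "00"
```

Because both spellings point at the same code, the lookup does not need to know which language the upstream source used.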

Reference Files Created

1. Netherlands (NL) - Updated

File: data/reference/iso_3166_2_nl.json
Coverage: 22 provinces
Changes: Added English aliases for all provinces

Example:

{
  "provinces": {
    "Noord-Holland": "NH",
    "North Holland": "NH",
    "Zuid-Holland": "ZH",
    "South Holland": "ZH",
    "Drenthe": "DR"
  }
}

2. Italy (IT) - Created

File: data/reference/iso_3166_2_it.json
Coverage: 133 subdivisions (20 regions + 107 provinces + 6 autonomous areas)
Source: Debian iso-codes project

Example:

{
  "provinces": {
    "Lombardia": "25",
    "Lombardy": "25",
    "Como": "CO",
    "Province of Como": "CO",
    "Milano": "MI",
    "Milan": "MI"
  }
}

Note: Italy uses 2 digits for regions (25, 52, 62) and 2 letters for provinces (CO, MI, RM).

3. Russia (RU) - Created

File: data/reference/iso_3166_2_ru.json
Coverage: 86 federal subjects
Source: Debian iso-codes project

Example:

{
  "federal_subjects": {
    "Kaliningradskaya oblast'": "KGD",
    "Kaliningrad Oblast": "KGD",
    "Moskva": "MOW",
    "Moscow": "MOW",
    "Sankt-Peterburg": "SPE",
    "Saint Petersburg": "SPE"
  }
}

Note: Russia uses 2-3 letter codes for federal subjects (KGD, MOW, SPE).

4. Denmark (DK) - Created

File: data/reference/iso_3166_2_dk.json
Coverage: 10 subdivisions (5 regions + Greenland + Faroe Islands)
Source: Debian iso-codes project

Example:

{
  "provinces": {
    "Hovedstaden": "84",
    "Capital Region": "84",
    "Grønland": "GL",
    "Greenland": "GL",
    "Færøerne": "FO",
    "Faroe Islands": "FO"
  }
}

Note: Denmark uses 2-digit codes (81-85) for mainland regions and 2-letter codes (GL, FO) for territories.

5. Argentina (AR) - Updated

File: data/reference/iso_3166_2_ar.json
Coverage: 25 provinces
Changes: Updated format + accent normalization

Example:

{
  "provinces": {
    "Tucumán": "T",
    "Tucuman": "T",
    "Buenos Aires": "B",
    "Ciudad Autónoma de Buenos Aires": "C",
    "Ciudad Autonoma de Buenos Aires": "C"
  }
}

Note: Argentina uses single-letter codes (T, B, C) for provinces.
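
Accent normalization of the kind used for the Argentina file can be sketched with the standard library. The helper names strip_accents and normalize_key are hypothetical, and this is one possible approach rather than the script's actual code.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_key(name: str) -> str:
    """Accent-free, uppercase key used for matching (hypothetical helper)."""
    return strip_accents(name).upper().strip()


# Only the accented official names are stored; unaccented input still matches.
provinces = {
    "Tucumán": "T",
    "Buenos Aires": "B",
    "Ciudad Autónoma de Buenos Aires": "C",
}
index = {normalize_key(name): code for name, code in provinces.items()}

print(index[normalize_key("Tucuman")])
print(index[normalize_key("ciudad autonoma de buenos aires")])

# Letters such as "ø" and "æ" have no Unicode decomposition, so they survive.
print(normalize_key("Grønland"))
```

Since letters like ø and æ pass through unchanged, names such as Grønland and Færøerne still need explicit English aliases in the mapping file.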


Data Source: Debian iso-codes

Why Debian iso-codes?

  • Free/Libre: No cost (official ISO data costs 300 CHF)
  • Well-maintained: Active project tracking official ISO updates
  • JSON format: Easy to parse and integrate
  • Comprehensive: 5,000+ subdivisions worldwide
  • Trusted: Used by Debian/Ubuntu for locale data

Repository: https://salsa.debian.org/iso-codes-team/iso-codes

Alternatives Considered:

  • Official ISO data: 300 CHF per country, restrictive licensing
  • Wikipedia: Unofficial, inconsistent formatting, scraping required
  • Custom scraping: High maintenance burden, data quality issues

Script Modifications

File: scripts/regenerate_historical_ghcids.py

Changes Made:

  1. Added 3 new mapping loads (lines 51-57):

    self.ru_mapping = self._load_region_mapping("ru", "federal_subjects")
    self.dk_mapping = self._load_region_mapping("dk", "provinces")
    self.ar_mapping = self._load_region_mapping("ar", "provinces")
    
  2. Extended _get_region_code() method (lines 118-124):

    elif country == "RU" and self.ru_mapping:
        return self._lookup_region(admin1_name, self.ru_mapping)
    elif country == "DK" and self.dk_mapping:
        return self._lookup_region(admin1_name, self.dk_mapping)
    elif country == "AR" and self.ar_mapping:
        return self._lookup_region(admin1_name, self.ar_mapping)
    
  3. Enhanced _load_region_mapping() to handle federal_subjects key (lines 86-93):

    if "provinces" in data:
        raw_mapping = data["provinces"]
    elif "federal_subjects" in data:
        raw_mapping = data["federal_subjects"]
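
One way to keep the key handling from growing another elif per country is a small table of known keys. The sketch below is an assumption about how that could look, not the script's code: extract_raw_mapping and KNOWN_KEYS are hypothetical names, and raising on an unknown layout is a design choice added here.

```python
import json

# Top-level keys used by the mapping files described in this document.
KNOWN_KEYS = ("provinces", "federal_subjects")


def extract_raw_mapping(data: dict) -> dict:
    """Return the name-to-code mapping stored under the first known key."""
    for key in KNOWN_KEYS:
        if key in data:
            return data[key]
    # Fail loudly instead of leaving the mapping undefined.
    raise KeyError(f"no known subdivision key found; expected one of {KNOWN_KEYS}")


ru_data = json.loads('{"federal_subjects": {"Kaliningrad Oblast": "KGD"}}')
dk_data = json.loads('{"provinces": {"Capital Region": "84"}}')

print(extract_raw_mapping(ru_data))
print(extract_raw_mapping(dk_data))
```

Supporting a new top-level key then means adding one entry to KNOWN_KEYS.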
    

Validation Results

Execution Statistics

  • Total institutions processed: 5
  • GHCIDs successfully generated: 5 (100%)
  • City codes from GeoNames: 5 (100%)
  • Region codes from ISO 3166-2: 5 (100%)
  • Fallback "00" codes: 0

GHCID Transformation Results

| Institution | Country | OLD (Hash) | INTERIM (Fallback) | FINAL (ISO 3166-2) |
|---|---|---|---|---|
| Librije (Alkmaar) | NL | NL-77907473-L-LA | NL-00-ALK-L-LA | NL-NH-ALK-L-LA |
| Giovio Musaeum | IT | IT-93949449-M-GM | IT-00-COM-M-GM | IT-25-COM-M-GM |
| Königsberg Library | RU | RU-54735954-L-KPL | RU-00-KAL-L-KPL | RU-KGD-KAL-L-KPL |
| Kunstkammeret | DK | DK-55682007-M-K | DK-00-NYH-M-K | DK-84-NYH-M-K |
| House of Tucumán | AR | AR-65168335-M-HHT | AR-00-SAN-M-HHT | AR-T-SAN-M-HHT |
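
The fallback check behind these results can be sketched by splitting a GHCID into its five hyphen-separated parts. The field names below are inferred from the examples in this document (country, region, city, type letter, abbreviation) and are assumptions, not names from the GHCID specification.

```python
from typing import NamedTuple

FALLBACK_REGION = "00"


class Ghcid(NamedTuple):
    # Field names are assumptions inferred from examples like NL-NH-ALK-L-LA.
    country: str
    region: str
    city: str
    inst_type: str
    abbreviation: str


def parse_ghcid(ghcid: str) -> Ghcid:
    """Split a five-part GHCID; reject anything with another shape."""
    parts = ghcid.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated parts, got {len(parts)}: {ghcid!r}")
    return Ghcid(*parts)


def uses_fallback(ghcid: str) -> bool:
    """True when the region component is the '00' fallback code."""
    return parse_ghcid(ghcid).region == FALLBACK_REGION


FINAL_GHCIDS = [
    "NL-NH-ALK-L-LA",
    "IT-25-COM-M-GM",
    "RU-KGD-KAL-L-KPL",
    "DK-84-NYH-M-K",
    "AR-T-SAN-M-HHT",
]

remaining = [g for g in FINAL_GHCIDS if uses_fallback(g)]
print(f"fallback codes remaining: {len(remaining)}")  # 0 for the final GHCIDs
print(uses_fallback("NL-00-ALK-L-LA"))                # True for the interim form
```

The old hash-based identifiers have only four parts, so parse_ghcid rejects them rather than misreading the hash as a region.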

Region Code Details

| Institution | City | GeoNames admin1_name | ISO 3166-2 Code | Official Name |
|---|---|---|---|---|
| Librije | Alkmaar | "North Holland" | NH | Noord-Holland |
| Giovio Musaeum | Como | "Lombardy" | 25 | Lombardia |
| Königsberg Library | Kaliningrad | "Kaliningrad Oblast" | KGD | Kaliningradskaya oblast' |
| Kunstkammeret | Nyhavn | "Capital Region" | 84 | Hovedstaden |
| House of Tucumán | San Miguel de Tucumán | "Tucumán" | T | Tucumán |

Critical Test Case: Königsberg Library

Historical Context

  • Institution: Königsberg Public Library
  • Founded: 1541
  • Closed: 1944 (destroyed in World War II)
  • Historical Location: Königsberg, East Prussia, German Empire
  • Modern Location: Kaliningrad, Russia
  • Border Change: Königsberg was annexed by the Soviet Union in 1945; Prussia was formally dissolved in 1947

GHCID Generation Challenge

Question: Should GHCID use historical country (Prussia/Germany) or modern country (Russia)?

Decision: Use modern political boundaries (2025 world map)

  • Consistent with GHCID specification
  • Coordinates-based approach naturally resolves to modern location
  • Historical context preserved in metadata, not in identifier

Validation

  • Coordinates: 54.71°N, 20.51°E
  • GeoNames Lookup: Kaliningrad, Russia
  • ISO 3166-2 Region: Kaliningradskaya oblast' (KGD)
  • Final GHCID: RU-KGD-KAL-L-KPL
  • Status: PASSED - Modern boundaries correctly applied

Metadata Preservation

Historical context preserved in LinkML record:

- id: https://w3id.org/heritage/custodian/ru-kgd-kal-l-kpl
  name: Königsberg Public Library
  description: >-
    Major academic library in Königsberg, East Prussia (1541-1944).
    Located in modern-day Kaliningrad, Russia. Historical Prussian
    institution destroyed during World War II.    
  locations:
    - city: Kaliningrad
      country: RU
      historical_context: "Formerly Königsberg, East Prussia (1255-1945)"

Files Modified

Primary Outputs

  1. data/instances/historical_institutions_validation.yaml
    • All 5 institutions regenerated with ISO 3166-2 region codes
    • 100% region code coverage achieved

Backups Created

  1. data/instances/archive/historical_institutions_pre_regenerate_20251106_130123.yaml
  2. data/instances/archive/historical_institutions_pre_regenerate_20251106_130237.yaml
  3. data/instances/archive/historical_institutions_pre_regenerate_20251106_130406.yaml
  4. data/instances/archive/historical_institutions_pre_regenerate_20251106_132311.yaml (final)

Reference Data

  1. data/reference/iso_3166_2_nl.json - Updated with English aliases
  2. data/reference/iso_3166_2_it.json - Created (133 subdivisions)
  3. data/reference/iso_3166_2_ru.json - Created (86 federal subjects)
  4. data/reference/iso_3166_2_dk.json - Created (10 subdivisions)
  5. data/reference/iso_3166_2_ar.json - Updated format + normalization

Scripts

  1. scripts/regenerate_historical_ghcids.py - Added RU/DK/AR mapping support (495 lines)

Lessons Learned

1. Language Normalization is Critical

  • GeoNames returns English names, ISO 3166-2 uses local language names
  • Always include both official and English names in mapping files
  • Normalization strategy: Remove accents, uppercase for matching

2. ISO 3166-2 Format Varies by Country

Different countries use different code formats:

  • 2 letters: Netherlands (NH, ZH), Italian provinces (CO, MI)
  • 2 digits: Italy regions (25, 52), Denmark regions (81-85)
  • 2-3 letters: Russia federal subjects (KGD, MOW, SPE)
  • 1 letter: Argentina provinces (T, B, C)

No universal pattern - must handle each country individually.
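
Per-country handling can be made explicit with one pattern per country. The patterns below are a rough sketch derived only from the formats listed above; REGION_PATTERNS and is_plausible_region are hypothetical names, and the check tests only the shape of a code, not whether it is actually assigned.

```python
import re

FALLBACK_REGION = "00"

# Shape of the region component per country, as described in this document.
REGION_PATTERNS = {
    "NL": re.compile(r"[A-Z]{2}"),         # NH, ZH
    "IT": re.compile(r"\d{2}|[A-Z]{2}"),   # regions (25) or provinces (CO)
    "RU": re.compile(r"[A-Z]{2,3}"),       # KGD, MOW, SPE
    "DK": re.compile(r"\d{2}|[A-Z]{2}"),   # regions (84) or territories (GL)
    "AR": re.compile(r"[A-Z]"),            # T, B, C
}


def is_plausible_region(country: str, code: str) -> bool:
    """Check that a region code has the expected shape for its country."""
    if code == FALLBACK_REGION:
        return False  # "00" would otherwise pass the two-digit patterns
    pattern = REGION_PATTERNS.get(country)
    return bool(pattern and pattern.fullmatch(code))


for country, code in [("RU", "KGD"), ("AR", "T"), ("IT", "25"), ("NL", "00")]:
    print(country, code, is_plausible_region(country, code))
```

Countries without an entry return False, which makes a missing pattern visible instead of silently accepting any code.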

3. Special Keys for Special Countries

  • Most countries: "provinces" key
  • Russia: "federal_subjects" key (different political structure)
  • Must check data structure before parsing

4. Debian iso-codes is an Excellent Choice

  • Free, well-maintained, comprehensive
  • JSON format perfect for integration
  • Active project tracking official ISO updates
  • Free/libre license (LGPL 2.1 or later), unlike the restrictively licensed official ISO data

5. Testing with Real Historical Data is Essential

  • The Prussia → Russia border change validated the approach
  • Confirms that resolving coordinates against modern boundaries works correctly
  • Historical context is preserved in metadata, not in identifiers

Next Steps

Immediate Priorities

  1. COMPLETE: Historical institutions ISO 3166-2 integration
  2. Generate GHCIDs for Dutch datasets (369 institutions) using same approach
  3. Expand ISO 3166-2 coverage for Phase 2 global extraction:
    • Brazil (27 states) - For Brazilian institutions
    • Mexico (32 states) - For Mexican institutions
    • Chile (16 regions) - For Chilean institutions

Future Expansion

  • Target: 60+ countries from conversation files
  • Approach: Generate ISO 3166-2 mappings on-demand using Debian iso-codes
  • Automation: Script to fetch and format ISO 3166-2 data from Debian repository
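
The automation step could look roughly like the sketch below, which converts iso-codes-style entries into the mapping layout used here. It assumes the Debian iso_3166-2.json file has a top-level "3166-2" list whose entries carry "code" and "name" fields; check this against the repository before relying on it. The function build_mapping and the alias table are hypothetical, and the sample data is inlined so no download is needed.

```python
# Inline sample in the assumed iso-codes layout (not fetched from the repository).
SAMPLE = {
    "3166-2": [
        {"code": "NL-NH", "name": "Noord-Holland", "type": "Province"},
        {"code": "NL-ZH", "name": "Zuid-Holland", "type": "Province"},
        {"code": "RU-KGD", "name": "Kaliningradskaya oblast'", "type": "Region"},
    ]
}


def build_mapping(data, country, key="provinces", aliases=None):
    """Build {key: {name: code}} for one country, plus manual English aliases."""
    prefix = f"{country.upper()}-"
    mapping = {}
    for entry in data["3166-2"]:
        if entry["code"].startswith(prefix):
            # Drop the "NL-" style prefix; the GHCID carries the country itself.
            mapping[entry["name"]] = entry["code"][len(prefix):]
    # iso-codes provides local names, so English aliases are supplied by hand.
    for alias, official_name in (aliases or {}).items():
        mapping[alias] = mapping[official_name]
    return {key: mapping}


nl = build_mapping(SAMPLE, "NL", aliases={"North Holland": "Noord-Holland"})
ru = build_mapping(
    SAMPLE,
    "RU",
    key="federal_subjects",
    aliases={"Kaliningrad Oblast": "Kaliningradskaya oblast'"},
)
print(nl)
print(ru)
```

The English aliases remain a manual input in this sketch, because the language mismatch with GeoNames is not resolved by the source data alone.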

Production Rollout

  • Historical institutions: READY (5 institutions with ISO 3166-2 codes)
  • Dutch TIER_1 data: Next priority (369 institutions)
  • Latin American TIER_4: Following (304 institutions)
  • Global GLAM extraction: Future (2,000+ institutions expected)

Impact Summary

Quantitative Achievements

  • 5 reference files created/updated
  • 276 subdivisions mapped across 5 countries
  • 100% region code coverage for historical institutions (5/5)
  • 0 fallback "00" codes remaining
  • 5 historical GHCIDs regenerated successfully

Qualitative Achievements

  • Solved GeoNames/ISO language mismatch with dual-name strategy
  • Validated border change handling (Prussia → Russia)
  • Established repeatable pattern for adding new countries
  • Demonstrated production-ready GHCID format
  • Identified and documented free/libre data source (Debian iso-codes)

Production Readiness

Historical Institutions: PRODUCTION READY

  • All institutions have proper ISO 3166-2 region codes
  • GHCID format matches target specification
  • Schema validation passing
  • Geographic border changes handled correctly
  • Ready for Phase 2 global GLAM extraction

References

Code

  • Script: scripts/regenerate_historical_ghcids.py
  • Mapping files: data/reference/iso_3166_2_*.json
  • Output: data/instances/historical_institutions_validation.yaml

Documentation

  • GHCID Specification: docs/PERSISTENT_IDENTIFIERS.md
  • GeoNames Integration: docs/plan/global_glam/08-geonames-integration.md
  • Historical Validation: docs/HISTORICAL_INSTITUTIONS_VALIDATION.md

External Resources

  • Debian iso-codes repository: https://salsa.debian.org/iso-codes-team/iso-codes

Session Date: 2025-11-06
Session Duration: ~3 hours
Status: COMPLETE - Production ready
Next Session: Generate GHCIDs for Dutch datasets (369 institutions)