Global Heritage Custodian Identifier System (GHCID)

Version: 0.1.0
Status: Design Proposal
Date: 2025-11-05


Overview

A globally scalable, persistent identifier system for heritage custodians that:

  1. Unifies existing ISIL codes into a consistent format
  2. Includes non-ISIL institutions worldwide
  3. Provides human-readable and machine-readable formats
  4. Supports hierarchical geographic organization
  5. Enables hash-based numeric identifiers for systems requiring numeric IDs

Identifier Format

Human-Readable Format (Primary)

{ISO 3166-1 alpha-2}-{ISO 3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}

Components:

  1. ISO 3166-1 alpha-2 (2 chars): Country code

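  2. ISO 3166-2 (1-3 chars): Subdivision code (province, state, region) without the country prefix, e.g. NH from NL-NH; 00 when no subdivision is given

  3. UN/LOCODE (3 chars): City/location code without the country prefix, e.g. AMS from NL AMS; XXX when no location is given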

  4. Type (1 char): Institution type

    • G = Gallery
    • L = Library
    • A = Archive
    • M = Museum
    • C = Cultural Center
    • R = Research Institute
    • N = Consortium/Network
    • V = Government Agency
    • X = Mixed/Other
  5. Abbreviation (2-8 chars): Official name abbreviation

    • Use first letters of official international name
    • Maximum 8 characters
    • Uppercase, alphanumeric only
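As a rough illustration of how the five components fit together, the sketch below splits an identifier on hyphens and applies the length rules listed above. The names `ParsedGHCID`, `parse_ghcid`, and `VALID_TYPE_CODES` are assumptions made for this example and are not part of the proposal. The checks are deliberately loose (for instance, `str.isalnum` does not enforce uppercase ASCII).

```python
from typing import NamedTuple

# Type codes from the component list above
VALID_TYPE_CODES = set("GLAMCRNVX")


class ParsedGHCID(NamedTuple):
    country: str
    subdivision: str
    locode: str
    type_code: str
    abbreviation: str


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split a GHCID into its five components and apply basic checks."""
    parts = ghcid.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated parts: {ghcid!r}")
    parsed = ParsedGHCID(*parts)
    checks = [
        (len(parsed.country) == 2 and parsed.country.isalpha(), "country"),
        (1 <= len(parsed.subdivision) <= 3, "subdivision"),
        (len(parsed.locode) == 3, "locode"),
        (parsed.type_code in VALID_TYPE_CODES, "type_code"),
        (2 <= len(parsed.abbreviation) <= 8
         and parsed.abbreviation.isalnum(), "abbreviation"),
    ]
    for ok, field in checks:
        if not ok:
            raise ValueError(f"invalid {field} in {ghcid!r}")
    return parsed


print(parse_ghcid("NL-NH-AMS-M-RM"))
```

Because hyphens are the only separator, none of the components may itself contain a hyphen; the sketch relies on that.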

Examples

# National Archives of the Netherlands
NL-00-XXX-A-NAN

# Rijksmuseum Amsterdam
NL-NH-AMS-M-RM

# Biblioteca Nacional do Brasil (Rio de Janeiro)
BR-RJ-RIO-L-BNB

# Museum of Modern Art (New York)
US-NY-NYC-M-MOMA

# British Library (London)
GB-ENG-LON-L-BL

# Louvre Museum (Paris)
FR-IDF-PAR-M-LM

# Mixed institution (Library + Archive + Museum in Utrecht)
NL-UT-UTC-X-RHCU

Hash-Based Numeric Format (Secondary)

For systems requiring numeric-only identifiers:

import hashlib

def ghcid_to_numeric(ghcid: str) -> int:
    """
    Convert GHCID to deterministic numeric identifier.
    
    Uses SHA256 hash truncated to 64 bits (unsigned integer).
    Range: 0 to 18,446,744,073,709,551,615
    """
    hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
    return int.from_bytes(hash_bytes[:8], byteorder='big')

# Example
ghcid = "NL-NH-AMS-M-RM"  # Rijksmuseum
numeric_id = ghcid_to_numeric(ghcid)
# => an unsigned 64-bit integer (deterministic: always the same for this GHCID)

Properties:

  • Deterministic (same GHCID always produces same numeric ID)
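  • Collision-resistant in practice: SHA-256 truncated to 64 bits gives a birthday-bound collision probability of roughly n²/2⁶⁵ for n identifiers, so collisions are unlikely at registry scale but should still be checked when inserting
  • Reversible via lookup table (store GHCID ↔ numeric mapping)
  • Fits in an unsigned 64-bit integer; signed 64-bit columns (e.g. SQL BIGINT) and JavaScript numbers (exact only up to 2⁵³) cannot hold the full range, so such systems need a shorter truncation or a string representation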
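To put numbers on the collision and integer-width properties, here is a minimal sketch. `collision_probability` uses the standard birthday approximation. `ghcid_to_numeric_63` is a hypothetical variant (not part of the proposal) that drops one bit so the result fits a signed 64-bit column.

```python
import hashlib
import math


def collision_probability(n_ids: int, bits: int = 64) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_ids * (n_ids - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)  # 1 - e^(-x), accurate for tiny x


def ghcid_to_numeric_63(ghcid: str) -> int:
    """Hypothetical variant: shift out one bit to fit a signed int64."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big") >> 1


for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} identifiers: p = {collision_probability(n):.2e}")

# The 63-bit variant always lands in the non-negative signed int64 range
assert 0 <= ghcid_to_numeric_63("NL-NH-AMS-M-RM") < 2**63
```

For one million identifiers the estimate is on the order of 10⁻⁸, which is small but not zero. A uniqueness constraint on the numeric column is a cheap safeguard.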

ISIL Code Migration

Mapping ISIL to GHCID

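Existing ISIL codes can be converted once the institution type and city are known; the ISIL code itself supplies only the country and, heuristically, the abbreviation: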

def isil_to_ghcid(isil_code: str, institution_type: str, city_locode: str) -> str:
    """
    Convert ISIL code to GHCID format.
    
    ISIL format: {Country}-{Local Code}
    Example: NL-AsdRM (Rijksmuseum Amsterdam)
    """
    country = isil_code[:2]  # NL
    
    # The ISIL local code often embeds a city hint (NL-AsdRM: Asd = Amsterdam),
    # but the scheme varies by country, so a lookup table or heuristics are needed.
    # extract_subdivision, institution_type_to_code and
    # extract_abbreviation_from_isil are placeholders for those lookups.
    subdivision = extract_subdivision(isil_code)  # NH (Noord-Holland)
    locode = city_locode  # AMS (Amsterdam)
    type_code = institution_type_to_code(institution_type)  # M
    abbrev = extract_abbreviation_from_isil(isil_code)  # RM
    
    return f"{country}-{subdivision}-{locode}-{type_code}-{abbrev}"

# Example conversions
ISIL: NL-AsdRM         GHCID: NL-NH-AMS-M-RM
ISIL: US-DLC           GHCID: US-DC-WAS-L-LC (Library of Congress)
ISIL: FR-751131015     GHCID: FR-IDF-PAR-L-BNF (Bibliothèque nationale de France)
ISIL: GB-UkOxU         GHCID: GB-ENG-OXF-L-BL (Bodleian Library)

ISIL Preservation

Important: Store original ISIL codes as identifiers:

custodian = HeritageCustodian(
    id="NL-NH-AMS-M-RM",  # Primary GHCID
    name="Rijksmuseum",
    identifiers=[
        Identifier(
            identifier_scheme="GHCID",
            identifier_value="NL-NH-AMS-M-RM"
        ),
        Identifier(
            identifier_scheme="ISIL",
            identifier_value="NL-AsdRM"  # Preserved!
        ),
        Identifier(
            identifier_scheme="GHCID_NUMERIC",
            identifier_value="12345678901234567"  # Placeholder; use str(ghcid_to_numeric(...))
        )
    ]
)

Benefits:

  • Maintains backward compatibility with ISIL systems
  • Enables cross-referencing with existing ISIL registries
  • Provides migration path for ISIL-dependent workflows

Benefits Over ISIL-Only Approach

1. Global Coverage

ISIL Limitations:

  • Not all countries have ISIL registries
  • Registration requires bureaucratic process
  • Many small/local institutions lack ISIL codes

GHCID Advantages:

  • Assign identifiers to any heritage institution worldwide
  • No registration required (generated deterministically)
  • Covers 1,004 Dutch organizations currently without ISIL codes
  • Enables grassroots/community heritage organizations

2. Geographic Hierarchy

ISIL: Flat structure, geographic info encoded inconsistently

GHCID: Structured hierarchy (Country → Region → City)

Use Cases:

  • Query all museums in Amsterdam: NL-NH-AMS-M-*
  • Query all heritage orgs in São Paulo state: BR-SP-*-*-*
  • Aggregate statistics by region
  • Geocoding lookups
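The wildcard queries above can be sketched with shell-style pattern matching from the standard library. The sample identifiers `NL-NH-AMS-L-UBA` and `BR-SP-SAO-M-MASP` and the helper name `query` are invented for this illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Illustrative identifiers; two of them are made up for the example
SAMPLE_GHCIDS = [
    "NL-NH-AMS-M-RM",
    "NL-NH-AMS-L-UBA",
    "NL-UT-UTC-X-RHCU",
    "BR-SP-SAO-M-MASP",
    "BR-RJ-RIO-L-BNB",
]


def query(pattern: str, ghcids: Iterable[str]) -> List[str]:
    """Return the GHCIDs that match a case-sensitive wildcard pattern."""
    return [g for g in ghcids if fnmatchcase(g, pattern)]


print(query("NL-NH-AMS-M-*", SAMPLE_GHCIDS))  # museums in Amsterdam
print(query("BR-SP-*-*-*", SAMPLE_GHCIDS))    # everything in São Paulo state
```

In a database the same idea maps to a prefix search on an indexed column, which is one reason for ordering the components from broadest to narrowest.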

3. Institution Type in ID

ISIL: No type indicator (requires separate lookup)

GHCID: Type encoded in ID (-M-, -L-, -A-)

Use Cases:

  • Filter by type without database lookup
  • Validate type consistency
  • Build type-specific indexes
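Because the type code is always the fourth hyphen-separated component, a type-specific index can be built from the identifiers alone. The following is a minimal sketch; `build_type_index` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_type_index(ghcids: Iterable[str]) -> Dict[str, List[str]]:
    """Group GHCIDs by their one-character institution type code."""
    index: Dict[str, List[str]] = defaultdict(list)
    for ghcid in ghcids:
        type_code = ghcid.split("-")[3]  # fourth component is the type
        index[type_code].append(ghcid)
    return dict(index)


index = build_type_index(["NL-NH-AMS-M-RM", "US-NY-NYC-M-MOMA", "BR-RJ-RIO-L-BNB"])
print(index["M"])  # all museums, no database lookup needed
```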

4. Human Readability

ISIL Examples:

  • NL-AsdRM → Readable (Amsterdam, RM)
  • FR-751131015 → Opaque numeric code
  • DE-MUS-815314 → Mixed format

GHCID Examples (all readable):

  • NL-NH-AMS-M-RM → Netherlands, Noord-Holland, Amsterdam, Museum, Rijksmuseum
  • FR-IDF-PAR-L-BNF → France, Île-de-France, Paris, Library, BNF
  • BR-RJ-RIO-L-BNB → Brazil, Rio de Janeiro, Rio, Library, BNB

5. Compatibility with Existing Systems

Numeric Systems (e.g., VIAF, databases requiring int64):

  • GHCID → Hash → Numeric ID (deterministic)
  • Store mapping table for reverse lookup

ISIL Systems:

  • Store ISIL as secondary identifier
  • Maintain bidirectional mapping
  • Enable gradual migration
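One possible shape for the mapping tables mentioned above is an in-memory registry that records the numeric and ISIL mappings together and rejects hash collisions on insert. The class `IdentifierRegistry` and its method names are assumptions for illustration only. A production system would more likely use database tables with unique constraints.

```python
import hashlib
from typing import Dict, Optional


def ghcid_to_numeric(ghcid: str) -> int:
    """SHA-256 truncated to 64 bits, as defined earlier in this document."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


class IdentifierRegistry:
    """Hypothetical bidirectional store: GHCID <-> numeric ID <-> ISIL."""

    def __init__(self) -> None:
        self._ghcid_by_numeric: Dict[int, str] = {}
        self._ghcid_by_isil: Dict[str, str] = {}
        self._isil_by_ghcid: Dict[str, str] = {}

    def register(self, ghcid: str, isil: Optional[str] = None) -> int:
        numeric = ghcid_to_numeric(ghcid)
        existing = self._ghcid_by_numeric.setdefault(numeric, ghcid)
        if existing != ghcid:
            # Two different GHCIDs hashed to the same 64-bit value
            raise ValueError(f"hash collision: {ghcid!r} vs {existing!r}")
        if isil is not None:
            self._ghcid_by_isil[isil] = ghcid
            self._isil_by_ghcid[ghcid] = isil
        return numeric

    def ghcid_from_numeric(self, numeric: int) -> Optional[str]:
        return self._ghcid_by_numeric.get(numeric)

    def ghcid_from_isil(self, isil: str) -> Optional[str]:
        return self._ghcid_by_isil.get(isil)

    def isil_from_ghcid(self, ghcid: str) -> Optional[str]:
        return self._isil_by_ghcid.get(ghcid)


registry = IdentifierRegistry()
numeric_id = registry.register("NL-NH-AMS-M-RM", isil="NL-AsdRM")
print(registry.ghcid_from_numeric(numeric_id))
print(registry.ghcid_from_isil("NL-AsdRM"))
```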

Implementation Strategy

Phase 1: Schema Update

Update heritage_custodian.yaml:

classes:
  HeritageCustodian:
    slots:
      - id  # Now uses GHCID format
      - ghcid  # Explicit GHCID field
      - ghcid_numeric  # Hash-based numeric version
      - identifiers  # List includes ISIL, Wikidata, etc.
    slot_usage:
      id:
        description: Primary identifier in GHCID format
        pattern: '^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z2-9]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'  # subdivision is 1-3 chars; UN/LOCODE may contain digits 2-9
      ghcid:
        description: Explicit GHCID in human-readable format
        required: true
      ghcid_numeric:
        description: Deterministic numeric hash of GHCID
        range: integer

Phase 2: GHCID Generator Module

Create src/glam_extractor/identifiers/ghcid.py:

from dataclasses import dataclass
from typing import Optional
import hashlib
import re

@dataclass
class GHCIDComponents:
    country: str          # ISO 3166-1 alpha-2 (2 chars)
    subdivision: str      # ISO 3166-2 (1-3 chars, or '00')
    locode: str           # UN/LOCODE (3 chars, or 'XXX')
    type_code: str        # Institution type (1 char)
    abbreviation: str     # Name abbreviation (2-8 chars)
    
    def to_ghcid(self) -> str:
        return f"{self.country}-{self.subdivision}-{self.locode}-{self.type_code}-{self.abbreviation}"
    
    def to_numeric(self) -> int:
        ghcid = self.to_ghcid()
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')

class GHCIDGenerator:
    """Generate Global Heritage Custodian Identifiers"""
    
    TYPE_CODES = {
        'GALLERY': 'G',
        'LIBRARY': 'L',
        'ARCHIVE': 'A',
        'MUSEUM': 'M',
        'CULTURAL_CENTER': 'C',
        'RESEARCH_INSTITUTE': 'R',
        'CONSORTIUM': 'N',
        'GOVERNMENT_AGENCY': 'V',
        'MIXED': 'X'
    }
    
    def generate(
        self,
        name: str,
        institution_type: str,
        country: str,
        subdivision: Optional[str] = None,
        city_locode: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Generate GHCID from institution metadata.
        
        Args:
            name: Official institution name
            institution_type: Institution type enum value
            country: ISO 3166-1 alpha-2 country code
            subdivision: ISO 3166-2 subdivision code (optional)
            city_locode: UN/LOCODE for city (optional)
        
        Returns:
            GHCIDComponents with all fields populated
        """
        # Normalize inputs
        country = country.upper()
        subdivision = (subdivision or '00').upper()
        locode = (city_locode or 'XXX').upper()
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        
        # Generate abbreviation from name
        abbreviation = self._generate_abbreviation(name)
        
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=locode,
            type_code=type_code,
            abbreviation=abbreviation
        )
    
    def _generate_abbreviation(self, name: str) -> str:
        """
        Generate 2-8 character abbreviation from institution name.
        
        Strategy:
        1. Extract words (split on spaces/punctuation)
        2. Take first letter of each significant word
        3. Skip stopwords (the, of, for, and, etc.)
        4. Maximum 8 characters
        5. Minimum 2 characters
        """
        stopwords = {'the', 'of', 'for', 'and', 'in', 'at', 'to', 'a', 'an'}
        
        # Split name into words, filter stopwords
        words = re.findall(r'\b\w+\b', name.lower())
        significant_words = [w for w in words if w not in stopwords]
        
        # Take first letter of each word
        abbrev = ''.join(w[0].upper() for w in significant_words[:8])
        
        # Ensure minimum 2 characters
        if len(abbrev) < 2:
            # Fallback: first 4 alphanumeric characters of the name
            abbrev = re.sub(r'[^A-Z0-9]', '', name.upper())[:4]
        
        return abbrev[:8]  # Max 8 chars
    
    def from_isil(
        self,
        isil_code: str,
        institution_type: str,
        city_locode: str,
        subdivision: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Convert ISIL code to GHCID.
        
        Args:
            isil_code: ISIL code (e.g., 'NL-AsdRM')
            institution_type: Institution type enum value
            city_locode: UN/LOCODE for city
            subdivision: ISO 3166-2 code (if not in ISIL)
        
        Returns:
            GHCIDComponents
        """
        # Parse ISIL
        match = re.match(r'^([A-Z]{2})-(.+)$', isil_code)
        if not match:
            raise ValueError(f"Invalid ISIL code format: {isil_code}")
        
        country = match.group(1)
        local_code = match.group(2)
        
        # Extract abbreviation from ISIL local code
        # Many ISIL codes have pattern: {CityCode}{Abbreviation}
        # e.g., NL-AsdRM → Asd (Amsterdam) + RM (Rijksmuseum)
        abbrev = self._extract_isil_abbreviation(local_code)
        
        subdivision = subdivision or '00'
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=city_locode,
            type_code=type_code,
            abbreviation=abbrev
        )
    
    def _extract_isil_abbreviation(self, local_code: str) -> str:
        """
        Extract abbreviation from ISIL local code.
        
        Heuristics:
        - If it starts with three letters followed by more characters,
          treat the letters as a city code and take the rest
        - If purely numeric, use the first 8 digits
        - Otherwise, use the full local code (max 8 chars)
        """
        # Three leading letters (any case) are assumed to be a city code
        match = re.match(r'^[A-Za-z]{3}(.+)$', local_code)
        if match:
            return match.group(1)[:8].upper()
        
        # If numeric, use the first 8 digits
        if local_code.isdigit():
            return local_code[:8]
        
        # Otherwise use full code
        return local_code[:8].upper()

Phase 3: Update Parsers

Update all parsers to generate GHCID:

# In isil_registry.py
from glam_extractor.identifiers.ghcid import GHCIDGenerator

generator = GHCIDGenerator()

def to_heritage_custodian(record: ISILRegistryRecord) -> HeritageCustodian:
    # Generate GHCID from ISIL
    ghcid_components = generator.from_isil(
        isil_code=record.isil_code,
        institution_type='MIXED',  # ISIL registry doesn't specify type
        city_locode=lookup_locode(record.plaats),  # Lookup UN/LOCODE
        subdivision=lookup_subdivision(record.plaats)  # Lookup ISO 3166-2
    )
    
    ghcid = ghcid_components.to_ghcid()
    ghcid_numeric = ghcid_components.to_numeric()
    
    return HeritageCustodian(
        id=ghcid,  # Primary key is now GHCID
        ghcid=ghcid,
        ghcid_numeric=ghcid_numeric,
        name=record.instelling,
        institution_type='MIXED',
        identifiers=[
            Identifier(
                identifier_scheme='GHCID',
                identifier_value=ghcid
            ),
            Identifier(
                identifier_scheme='ISIL',
                identifier_value=record.isil_code  # Preserved
            ),
            Identifier(
                identifier_scheme='GHCID_NUMERIC',
                identifier_value=str(ghcid_numeric)
            )
        ],
        # ... rest of fields
    )

Phase 4: Lookup Tables

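Create reference data for geocoding. The # annotations below are explanatory only and must be omitted from the actual files, since JSON does not allow comments: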

# data/reference/nl_city_locodes.json
{
    "Amsterdam": {
        "locode": "AMS",
        "subdivision": "NH",  # Noord-Holland
        "geonames_id": "2759794"
    },
    "Rotterdam": {
        "locode": "RTM",
        "subdivision": "ZH",  # Zuid-Holland
        "geonames_id": "2747891"
    },
    # ... all Dutch cities
}

# data/reference/iso_3166_2_nl.json
{
    "NH": "Noord-Holland",
    "ZH": "Zuid-Holland",
    "UT": "Utrecht",
    # ... all provinces
}
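The parser in Phase 3 calls `lookup_locode` and `lookup_subdivision`, which this document does not define. A minimal sketch of how they might read the city table follows. It assumes exact-match city names and falls back to the `XXX` and `00` placeholders. The data is inlined here instead of being loaded from the file so that the example is self-contained.

```python
import json

# Inline stand-in for data/reference/nl_city_locodes.json
# (comments removed, since JSON does not allow them)
NL_CITY_LOCODES = json.loads("""
{
    "Amsterdam": {"locode": "AMS", "subdivision": "NH", "geonames_id": "2759794"},
    "Rotterdam": {"locode": "RTM", "subdivision": "ZH", "geonames_id": "2747891"}
}
""")


def lookup_locode(city: str) -> str:
    """Return the UN/LOCODE for a city, or 'XXX' if the city is unknown."""
    return NL_CITY_LOCODES.get(city.strip(), {}).get("locode", "XXX")


def lookup_subdivision(city: str) -> str:
    """Return the ISO 3166-2 subdivision for a city, or '00' if unknown."""
    return NL_CITY_LOCODES.get(city.strip(), {}).get("subdivision", "00")


print(lookup_locode("Amsterdam"), lookup_subdivision("Amsterdam"))
print(lookup_locode("Atlantis"), lookup_subdivision("Atlantis"))
```

Real registry data will likely need name normalization (spelling variants, case, historical names) before the lookup; that step is left out of this sketch.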

Phase 5: Cross-Linking with GHCID

Update cross-linking scripts to use GHCID:

# crosslink_dutch_datasets.py (updated)

# Build lookup by GHCID (not ISIL)
isil_by_ghcid = {}
orgs_by_ghcid = {}
merged_records = []

for custodian in isil_custodians:
    isil_by_ghcid[custodian.ghcid] = custodian

for custodian in dutch_custodians:
    orgs_by_ghcid[custodian.ghcid] = custodian

# Merge by GHCID
all_ghcids = set(isil_by_ghcid.keys()) | set(orgs_by_ghcid.keys())

for ghcid in sorted(all_ghcids):
    isil_record = isil_by_ghcid.get(ghcid)
    orgs_record = orgs_by_ghcid.get(ghcid)
    merged = merge_custodians(isil_record, orgs_record, ghcid)
    merged_records.append(merged)
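The script above relies on `merge_custodians`, whose behavior is not specified in this document. One plausible policy, shown here purely as an assumption, is to let the ISIL registry record take precedence and fill any remaining gaps from the Dutch organizations record. Records are plain dictionaries in this sketch, and the source labels `isil_registry` and `dutch_orgs` are invented.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def merge_custodians(
    isil_record: Optional[Record],
    orgs_record: Optional[Record],
    ghcid: str,
) -> Record:
    """Merge two optional records; earlier sources win on conflicts."""
    merged: Record = {"ghcid": ghcid, "sources": []}
    for source_name, record in (
        ("isil_registry", isil_record),
        ("dutch_orgs", orgs_record),
    ):
        if record is None:
            continue
        merged["sources"].append(source_name)
        for key, value in record.items():
            # Only fill fields that are still missing or empty
            if value not in (None, "") and merged.get(key) in (None, ""):
                merged[key] = value
    return merged


merged = merge_custodians(
    {"name": "Rijksmuseum", "isil": "NL-AsdRM"},
    {"name": "Rijksmuseum Amsterdam", "website": "https://example.org"},
    "NL-NH-AMS-M-RM",
)
print(merged)
```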

Validation and Testing

Unit Tests

import re

from glam_extractor.identifiers.ghcid import GHCIDComponents, GHCIDGenerator

def test_ghcid_generation():
    generator = GHCIDGenerator()
    
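    # Test Rijksmuseum: a single-word name yields a one-letter abbreviation,
    # so the generator falls back to the first four characters of the name.
    # The curated abbreviation "RM" cannot be derived from the name alone.
    ghcid = generator.generate(
        name="Rijksmuseum",
        institution_type="MUSEUM",
        country="NL",
        subdivision="NH",
        city_locode="AMS"
    )
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RIJK"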

def test_isil_to_ghcid_conversion():
    generator = GHCIDGenerator()
    
    ghcid = generator.from_isil(
        isil_code="NL-AsdRM",
        institution_type="MUSEUM",
        city_locode="AMS",
        subdivision="NH"
    )
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RM"

def test_ghcid_numeric_deterministic():
    comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    
    assert comp1.to_numeric() == comp2.to_numeric()

def test_ghcid_pattern_validation():
    valid_ghcids = [
        "NL-NH-AMS-M-RM",
        "US-NY-NYC-M-MOMA",
        "BR-RJ-RIO-L-BNB",
        "GB-ENG-LON-L-BL"
    ]
    # Subdivision is 1-3 chars; UN/LOCODE may contain digits 2-9
    pattern = r'^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z2-9]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
    
    for ghcid in valid_ghcids:
        assert re.match(pattern, ghcid)

Migration Path

For Existing ISIL Data (364 Dutch ISIL records)

  1. Parse ISIL codes (already done)
  2. 🔄 Lookup UN/LOCODE for each city (create lookup table)
  3. 🔄 Lookup ISO 3166-2 subdivision codes (create lookup table)
  4. 🔄 Generate GHCID from ISIL + lookups
  5. Store ISIL as secondary identifier

For Non-ISIL Data (1,004 Dutch orgs without ISIL)

  1. Parse organization data (already done)
  2. 🔄 Extract city from address
  3. 🔄 Lookup UN/LOCODE for city
  4. 🔄 Determine institution type (already done)
  5. 🔄 Generate abbreviation from name
  6. 🔄 Create GHCID (no ISIL to preserve)

For Conversation Data (139 files, 2,000-5,000 institutions)

  1. Extract institution name, type, location (NLP)
  2. 🔄 Geocode location → UN/LOCODE
  3. 🔄 Generate GHCID
  4. 🔄 Check if ISIL exists (cross-reference)
  5. 🔄 Store ISIL if found, otherwise GHCID-only

Benefits Summary

| Feature | ISIL-Only | GHCID |
|---|---|---|
| Global coverage | Limited (requires registration) | Universal (any institution) |
| Geographic structure | Inconsistent | Standardized hierarchy |
| Type indication | No | Yes (1-char code) |
| Human-readable | Sometimes | Always |
| Numeric format | Not standard | Deterministic hash |
| Non-ISIL orgs | Excluded | Included |
| Backward compatible | N/A | Yes (ISIL preserved) |
| Validation | Registry lookup | Pattern match (regex) |
| Collision risk | Low (registry) | Very low (SHA-256 truncated to 64 bits) |

Next Steps

Immediate (Week 1)

  1. Create src/glam_extractor/identifiers/ghcid.py module
  2. Build UN/LOCODE lookup tables for Dutch cities
  3. Build ISO 3166-2 lookup for Dutch provinces
  4. Write tests for GHCID generation and conversion

Short-term (Week 2)

  1. Update schema to use GHCID as primary key
  2. Update ISIL parser to generate GHCID
  3. Update Dutch orgs parser to generate GHCID
  4. Re-run cross-linking with GHCID

Medium-term (Week 3-4)

  1. Expand lookup tables to cover all conversation countries (60+)
  2. Implement GHCID generation in conversation parser
  3. Create GHCID validation and normalization tools
  4. Build GHCID → ISIL reverse lookup service

Recommendation: Adopt GHCID as the primary identifier system and preserve ISIL codes as secondary identifiers for backward compatibility.