Global Heritage Custodian Identifier System (GHCID)
Version: 0.1.0
Status: Design Proposal
Date: 2025-11-05
Overview
A globally scalable, persistent identifier system for heritage custodians that:
- Unifies existing ISIL codes into a consistent format
- Includes non-ISIL institutions worldwide
- Provides human-readable and machine-readable formats
- Supports hierarchical geographic organization
- Enables hash-based numeric identifiers for systems requiring numeric IDs
Identifier Format
Human-Readable Format (Primary)
{ISO 3166-1 alpha-2}-{ISO 3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}
Components:
- ISO 3166-1 alpha-2 (2 chars): Country code
  - Examples: NL, BR, US, JP, FR
  - Standard: https://www.iso.org/iso-3166-country-codes.html
- ISO 3166-2 (1-3 chars): Subdivision code (province, state, region)
  - Examples: NH (Noord-Holland), CA (California), SP (São Paulo)
  - Use 00 for national-level institutions
  - Standard: https://www.iso.org/standard/72483.html
- UN/LOCODE (3 chars): City/location code
  - Examples: AMS (Amsterdam), NYC (New York), RIO (Rio de Janeiro)
  - Use XXX for region-level or unknown locations
  - Standard: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
- Type (1 char): Institution type
  - G = Gallery
  - L = Library
  - A = Archive
  - M = Museum
  - C = Cultural Center
  - R = Research Institute
  - N = Consortium/Network
  - V = Government Agency
  - X = Mixed/Other
- Abbreviation (2-8 chars): Official name abbreviation
  - Use first letters of official international name
  - Maximum 8 characters
  - Uppercase, alphanumeric only
Examples
# National Archives of the Netherlands
NL-00-XXX-A-NAN
# Rijksmuseum Amsterdam
NL-NH-AMS-M-RM
# Biblioteca Nacional do Brasil (Rio de Janeiro)
BR-RJ-RIO-L-BNB
# Museum of Modern Art (New York)
US-NY-NYC-M-MOMA
# British Library (London); ISO 3166-2 code for England is ENG
GB-ENG-LON-L-BL
# Louvre Museum (Paris); ISO 3166-2 code for Île-de-France is IDF
FR-IDF-PAR-M-LM
# Mixed institution (Library + Archive + Museum in Utrecht)
NL-UT-UTC-X-RHCU
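Identifiers in this format can be split back into their five components with a single regular expression. The following is a minimal sketch, not part of the proposal itself: the names `GHCID_RE`, `ParsedGHCID` and `parse_ghcid` are hypothetical, and the pattern assumes the component lengths listed above (subdivision 1-3 characters, location code 3 letters, abbreviation 2-8 characters).

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, derived from the component lengths in this document.
GHCID_RE = re.compile(
    r"(?P<country>[A-Z]{2})-"
    r"(?P<subdivision>[A-Z0-9]{1,3})-"
    r"(?P<locode>[A-Z]{3})-"
    r"(?P<type_code>[GLAMCRNVX])-"
    r"(?P<abbreviation>[A-Z0-9]{2,8})"
)


class ParsedGHCID(NamedTuple):
    country: str
    subdivision: str
    locode: str
    type_code: str
    abbreviation: str


def parse_ghcid(ghcid: str) -> Optional[ParsedGHCID]:
    """Return the five components, or None if the string is not well-formed."""
    # fullmatch rejects trailing characters, including a trailing newline.
    match = GHCID_RE.fullmatch(ghcid)
    return ParsedGHCID(**match.groupdict()) if match else None
```

Returning `None` instead of raising keeps the helper usable as a cheap filter when scanning mixed input; a stricter caller can raise on `None`.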
Hash-Based Numeric Format (Secondary)
For systems requiring numeric-only identifiers:
import hashlib

def ghcid_to_numeric(ghcid: str) -> int:
    """
    Convert GHCID to deterministic numeric identifier.
    Uses SHA256 hash truncated to 64 bits (unsigned integer).
    Range: 0 to 18,446,744,073,709,551,615
    """
    hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
    # The first 8 bytes of the digest, read as a big-endian unsigned integer
    return int.from_bytes(hash_bytes[:8], byteorder='big')

# Example
ghcid = "NL-NH-AMS-M-RM"  # Rijksmuseum
numeric_id = ghcid_to_numeric(ghcid)
# numeric_id is an unsigned 64-bit integer (deterministic, always the same for this GHCID)
Properties:
- Deterministic (same GHCID always produces same numeric ID)
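- Collision-resistant in practice (SHA256 truncated to 64 bits: accidental collisions are improbable at registry scale, but a 64-bit value gives no protection against deliberately crafted collisions)
- Reversible via lookup table (store GHCID ↔ numeric mapping)
- Fits in an unsigned 64-bit integer (signed 64-bit database columns and JavaScript numbers cannot hold the full range without conversion)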
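How likely an accidental collision is depends on how many identifiers are issued. The sketch below uses the standard birthday-bound approximation, P ≈ 1 − exp(−n(n−1) / 2^(b+1)), for n identifiers hashed to b bits. The function name `collision_probability` is hypothetical, and the identifier counts in the loop are illustrative, not figures from this proposal.

```python
import math


def collision_probability(n_ids: int, bits: int = 64) -> float:
    """Birthday-bound estimate that at least two of n_ids hashes coincide."""
    exponent = n_ids * (n_ids - 1) / 2 ** (bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


# Illustrative registry sizes (assumptions, not project data).
for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} identifiers -> P(collision) ~ {collision_probability(n):.2e}")
```

The estimate covers accidental collisions only; it says nothing about an adversary searching for two GHCID strings with the same 64-bit prefix.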
ISIL Code Migration
Mapping ISIL to GHCID
Existing ISIL codes can be automatically converted:
def isil_to_ghcid(isil_code: str, institution_type: str, city_locode: str) -> str:
    """
    Convert ISIL code to GHCID format.
    ISIL format: {Country}-{Local Code}
    Example: NL-AsdRM (Rijksmuseum Amsterdam)

    Note: extract_subdivision, institution_type_to_code and
    extract_abbreviation_from_isil are placeholders for helpers that
    still have to be written (lookup tables or heuristics).
    """
    country = isil_code[:2]  # NL
    # Parse local code for geographic hints
    # Many ISIL codes encode city: NL-AsdRM (Asd = Amsterdam)
    # This requires lookup table or heuristics
    # For now, extract from known patterns or use lookup
    subdivision = extract_subdivision(isil_code)  # NH (Noord-Holland)
    locode = city_locode  # AMS (Amsterdam)
    type_code = institution_type_to_code(institution_type)  # M
    abbrev = extract_abbreviation_from_isil(isil_code)  # RM
    return f"{country}-{subdivision}-{locode}-{type_code}-{abbrev}"
# Example conversions
ISIL: NL-AsdRM → GHCID: NL-NH-AMS-M-RM
ISIL: US-DLC → GHCID: US-DC-WAS-L-LC (Library of Congress)
ISIL: FR-751131015 → GHCID: FR-IDF-PAR-L-BNF (Bibliothèque nationale de France)
ISIL: GB-UkOxU → GHCID: GB-ENG-OXF-L-BL (Bodleian Library)
ISIL Preservation
Important: Store original ISIL codes as identifiers:
custodian = HeritageCustodian(
    id="NL-NH-AMS-M-RM",  # Primary GHCID
    name="Rijksmuseum",
    identifiers=[
        Identifier(
            identifier_scheme="GHCID",
            identifier_value="NL-NH-AMS-M-RM"
        ),
        Identifier(
            identifier_scheme="ISIL",
            identifier_value="NL-AsdRM"  # Preserved!
        ),
        Identifier(
            identifier_scheme="GHCID_NUMERIC",
            # Derived from the GHCID, stored as a string
            identifier_value=str(ghcid_to_numeric("NL-NH-AMS-M-RM"))
        )
    ]
)
Benefits:
- Maintains backward compatibility with ISIL systems
- Enables cross-referencing with existing ISIL registries
- Provides migration path for ISIL-dependent workflows
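Once the ISIL code is stored alongside the GHCID, an ISIL-dependent workflow only needs a way to read it back. The sketch below is one way to do that. The `Identifier` and `HeritageCustodian` dataclasses here are simplified stand-ins (assumptions) for the project's real schema classes, and `find_identifier` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Identifier:
    identifier_scheme: str
    identifier_value: str


@dataclass
class HeritageCustodian:
    # Simplified stand-in: the real class carries many more fields.
    id: str
    name: str
    identifiers: List[Identifier] = field(default_factory=list)


def find_identifier(custodian: HeritageCustodian, scheme: str) -> Optional[str]:
    """Return the first identifier value for the given scheme, or None."""
    for identifier in custodian.identifiers:
        if identifier.identifier_scheme == scheme:
            return identifier.identifier_value
    return None


rijksmuseum = HeritageCustodian(
    id="NL-NH-AMS-M-RM",
    name="Rijksmuseum",
    identifiers=[
        Identifier("GHCID", "NL-NH-AMS-M-RM"),
        Identifier("ISIL", "NL-AsdRM"),
    ],
)
print(find_identifier(rijksmuseum, "ISIL"))
```

Institutions without an ISIL code simply return `None`, so callers can treat "GHCID-only" records uniformly.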
Benefits Over ISIL-Only Approach
1. Global Coverage
ISIL Limitations:
- Not all countries have ISIL registries
- Registration requires bureaucratic process
- Many small/local institutions lack ISIL codes
GHCID Advantages:
- Assign identifiers to any heritage institution worldwide
- No registration required (generated deterministically)
- Covers 1,004 Dutch organizations currently without ISIL codes
- Enables grassroots/community heritage organizations
2. Geographic Hierarchy
ISIL: Flat structure, geographic info encoded inconsistently
GHCID: Structured hierarchy (Country → Region → City)
Use Cases:
- Query all museums in Amsterdam: NL-NH-AMS-M-*
- Query all heritage orgs in São Paulo state: BR-SP-*-*-*
- Aggregate statistics by region
- Geocoding lookups
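Because every GHCID has the same five hyphen-separated fields, wildcard queries like those above can be evaluated with plain shell-style pattern matching. The following sketch uses Python's `fnmatch.fnmatchcase`; the helper name `select` is hypothetical and the sample list reuses identifiers from the Examples section.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Sample identifiers taken from the examples in this document.
GHCIDS = [
    "NL-00-XXX-A-NAN",
    "NL-NH-AMS-M-RM",
    "NL-UT-UTC-X-RHCU",
    "BR-RJ-RIO-L-BNB",
    "US-NY-NYC-M-MOMA",
]


def select(ghcids: Iterable[str], pattern: str) -> List[str]:
    """Return the identifiers matching a wildcard pattern (case-sensitive)."""
    return [g for g in ghcids if fnmatchcase(g, pattern)]


print(select(GHCIDS, "NL-NH-AMS-M-*"))  # museums in Amsterdam
print(select(GHCIDS, "NL-*-*-*-*"))     # everything in the Netherlands
print(select(GHCIDS, "*-*-*-M-*"))      # museums anywhere
```

In a database the same queries would more likely be expressed as prefix or `LIKE` conditions; the in-memory version is only meant to show that no extra metadata is needed.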
3. Institution Type in ID
ISIL: No type indicator (requires separate lookup)
GHCID: Type encoded in ID (-M-, -L-, -A-)
Use Cases:
- Filter by type without database lookup
- Validate type consistency
- Build type-specific indexes
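Reading the type straight from the identifier can be sketched as follows. The mapping mirrors the type codes defined under Identifier Format; `TYPE_NAMES`, `type_of` and `count_by_type` are hypothetical names chosen for this illustration.

```python
from collections import Counter
from typing import Iterable

# Type codes as defined in the Identifier Format section.
TYPE_NAMES = {
    "G": "Gallery",
    "L": "Library",
    "A": "Archive",
    "M": "Museum",
    "C": "Cultural Center",
    "R": "Research Institute",
    "N": "Consortium/Network",
    "V": "Government Agency",
    "X": "Mixed/Other",
}


def type_of(ghcid: str) -> str:
    """Return the institution type name encoded in the fourth field."""
    parts = ghcid.split("-")
    if len(parts) != 5 or parts[3] not in TYPE_NAMES:
        raise ValueError(f"Not a well-formed GHCID: {ghcid!r}")
    return TYPE_NAMES[parts[3]]


def count_by_type(ghcids: Iterable[str]) -> Counter:
    """Build a simple type index: how many identifiers per institution type."""
    return Counter(type_of(g) for g in ghcids)
```

A type-consistency check then reduces to comparing `type_of(record_id)` with the type stored on the record.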
4. Human Readability
ISIL Examples:
- NL-AsdRM → Readable (Amsterdam, RM)
- FR-751131015 → Opaque numeric code
- DE-MUS-815314 → Mixed format
GHCID Examples (all readable):
- NL-NH-AMS-M-RM → Netherlands, Noord-Holland, Amsterdam, Museum, Rijksmuseum
- FR-IDF-PAR-L-BNF → France, Île-de-France, Paris, Library, BNF
- BR-RJ-RIO-L-BNB → Brazil, Rio de Janeiro, Rio, Library, BNB
5. Compatibility with Existing Systems
Numeric Systems (e.g., VIAF, databases requiring int64):
- GHCID → Hash → Numeric ID (deterministic)
- Store mapping table for reverse lookup
ISIL Systems:
- Store ISIL as secondary identifier
- Maintain bidirectional mapping
- Enable gradual migration
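Two practical details follow from the numeric-system case: many databases only offer a signed 64-bit integer type, and the hash cannot be reversed without a stored mapping. The sketch below shows one possible way to handle both. `to_signed64`, `to_unsigned64` and `NumericRegistry` are hypothetical names, and the in-memory dictionary stands in for whatever mapping table the project ends up using.

```python
from typing import Dict, Optional

_U64 = 1 << 64
_I64_MAX = (1 << 63) - 1


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    if not 0 <= value < _U64:
        raise ValueError("value outside unsigned 64-bit range")
    return value - _U64 if value > _I64_MAX else value


def to_unsigned64(value: int) -> int:
    """Inverse of to_signed64."""
    if not -(1 << 63) <= value <= _I64_MAX:
        raise ValueError("value outside signed 64-bit range")
    return value + _U64 if value < 0 else value


class NumericRegistry:
    """In-memory stand-in for the numeric ID -> GHCID mapping table."""

    def __init__(self) -> None:
        self._by_numeric: Dict[int, str] = {}

    def register(self, ghcid: str, numeric_id: int) -> None:
        # setdefault keeps the first GHCID seen for this numeric ID.
        existing = self._by_numeric.setdefault(numeric_id, ghcid)
        if existing != ghcid:
            raise ValueError(
                f"numeric ID {numeric_id} already maps to {existing}, not {ghcid}"
            )

    def resolve(self, numeric_id: int) -> Optional[str]:
        """Reverse lookup: numeric ID back to GHCID, or None if unknown."""
        return self._by_numeric.get(numeric_id)
```

Raising on a conflicting registration makes a hash collision visible at write time instead of silently overwriting an existing mapping.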
Implementation Strategy
Phase 1: Schema Update
Update heritage_custodian.yaml:
classes:
  HeritageCustodian:
    slots:
      - id              # Now uses GHCID format
      - ghcid           # Explicit GHCID field
      - ghcid_numeric   # Hash-based numeric version
      - identifiers     # List includes ISIL, Wikidata, etc.
    slot_usage:
      id:
        description: Primary identifier in GHCID format
        # Subdivision is 1-3 characters, matching the component definition
        pattern: '^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
      ghcid:
        description: Explicit GHCID in human-readable format
        required: true
      ghcid_numeric:
        description: Deterministic numeric hash of GHCID
        range: integer
Phase 2: GHCID Generator Module
Create src/glam_extractor/identifiers/ghcid.py:
from dataclasses import dataclass
from typing import Optional
import hashlib
import re


@dataclass
class GHCIDComponents:
    country: str       # ISO 3166-1 alpha-2 (2 chars)
    subdivision: str   # ISO 3166-2 (1-3 chars, or '00')
    locode: str        # UN/LOCODE (3 chars, or 'XXX')
    type_code: str     # Institution type (1 char)
    abbreviation: str  # Name abbreviation (2-8 chars)

    def to_ghcid(self) -> str:
        return f"{self.country}-{self.subdivision}-{self.locode}-{self.type_code}-{self.abbreviation}"

    def to_numeric(self) -> int:
        ghcid = self.to_ghcid()
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')
class GHCIDGenerator:
    """Generate Global Heritage Custodian Identifiers"""

    TYPE_CODES = {
        'GALLERY': 'G',
        'LIBRARY': 'L',
        'ARCHIVE': 'A',
        'MUSEUM': 'M',
        'CULTURAL_CENTER': 'C',
        'RESEARCH_INSTITUTE': 'R',
        'CONSORTIUM': 'N',
        'GOVERNMENT_AGENCY': 'V',
        'MIXED': 'X'
    }

    def generate(
        self,
        name: str,
        institution_type: str,
        country: str,
        subdivision: Optional[str] = None,
        city_locode: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Generate GHCID from institution metadata.

        Args:
            name: Official institution name
            institution_type: Institution type enum value
            country: ISO 3166-1 alpha-2 country code
            subdivision: ISO 3166-2 subdivision code (optional)
            city_locode: UN/LOCODE for city (optional)

        Returns:
            GHCIDComponents with all fields populated
        """
        # Normalize inputs
        country = country.upper()
        subdivision = (subdivision or '00').upper()
        locode = (city_locode or 'XXX').upper()
        # Unknown institution types fall back to 'X' (Mixed/Other)
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        # Generate abbreviation from name
        abbreviation = self._generate_abbreviation(name)
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=locode,
            type_code=type_code,
            abbreviation=abbreviation
        )

    def _generate_abbreviation(self, name: str) -> str:
        """
        Generate 2-8 character abbreviation from institution name.

        Strategy:
        1. Extract words (split on spaces/punctuation)
        2. Take first letter of each significant word
        3. Skip stopwords (the, of, for, and, etc.)
        4. Maximum 8 characters
        5. Minimum 2 characters (single-word names fall back to the
           first four characters of the name)
        """
        stopwords = {'the', 'of', 'for', 'and', 'in', 'at', 'to', 'a', 'an'}
        # Split name into words, filter stopwords
        words = re.findall(r'\b\w+\b', name.lower())
        significant_words = [w for w in words if w not in stopwords]
        # Take first letter of each word
        abbrev = ''.join(w[0].upper() for w in significant_words[:8])
        # Ensure minimum 2 characters
        if len(abbrev) < 2:
            # Fallback: take the first 4 chars of the name, minus spaces
            abbrev = name[:4].upper().replace(' ', '')
        return abbrev[:8]  # Max 8 chars

    def from_isil(
        self,
        isil_code: str,
        institution_type: str,
        city_locode: str,
        subdivision: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Convert ISIL code to GHCID.

        Args:
            isil_code: ISIL code (e.g., 'NL-AsdRM')
            institution_type: Institution type enum value
            city_locode: UN/LOCODE for city
            subdivision: ISO 3166-2 code (if not in ISIL)

        Returns:
            GHCIDComponents
        """
        # Parse ISIL
        match = re.match(r'^([A-Z]{2})-(.+)$', isil_code)
        if not match:
            raise ValueError(f"Invalid ISIL code format: {isil_code}")
        country = match.group(1)
        local_code = match.group(2)
        # Extract abbreviation from ISIL local code
        # Many ISIL codes have pattern: {CityCode}{Abbreviation}
        # e.g., NL-AsdRM → Asd (Amsterdam) + RM (Rijksmuseum)
        abbrev = self._extract_isil_abbreviation(local_code)
        subdivision = subdivision or '00'
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=city_locode,
            type_code=type_code,
            abbreviation=abbrev
        )

    def _extract_isil_abbreviation(self, local_code: str) -> str:
        """
        Extract abbreviation from ISIL local code.

        Heuristics:
        - If it starts with three letters followed by more characters,
          treat the three letters as a city code and take the rest
        - If purely numeric, use the first 8 digits
        - Otherwise, use full local code (max 8 chars)
        """
        # Check if starts with likely city code (3 letters + rest)
        match = re.match(r'^[A-Za-z]{3}(.+)$', local_code)
        if match:
            return match.group(1)[:8].upper()
        # If numeric, use up to the first 8 digits
        if local_code.isdigit():
            return local_code[:8]
        # Otherwise use full code
        return local_code[:8].upper()
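One gap worth noting: names with diacritics (São Paulo, Île-de-France, Bibliothèque) can yield characters outside the "uppercase, alphanumeric only" rule. A possible pre-processing step, sketched below under the assumption that dropping accents is acceptable, folds text to ASCII with `unicodedata` before abbreviating. `ascii_fold` and `to_abbreviation_chars` are hypothetical helpers, not part of the module above.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII parts.

    Letters without a Unicode decomposition (for example 'ø' or 'ß')
    are removed entirely; a real implementation may want a
    transliteration table for those.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_abbreviation_chars(text: str) -> str:
    """Keep only characters allowed in the abbreviation field: A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", ascii_fold(text).upper())


print(ascii_fold("Bibliothèque nationale de France"))
print(to_abbreviation_chars("Île-de-France"))
```

Applying `ascii_fold` to the name before word splitting would keep generated abbreviations within the schema pattern.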
Phase 3: Update Parsers
Update all parsers to generate GHCID:
# In isil_registry.py
from glam_extractor.identifiers.ghcid import GHCIDGenerator

generator = GHCIDGenerator()

def to_heritage_custodian(record: ISILRegistryRecord) -> HeritageCustodian:
    # Generate GHCID from ISIL
    ghcid_components = generator.from_isil(
        isil_code=record.isil_code,
        institution_type='MIXED',  # ISIL registry doesn't specify type
        city_locode=lookup_locode(record.plaats),  # Lookup UN/LOCODE
        subdivision=lookup_subdivision(record.plaats)  # Lookup ISO 3166-2
    )
    ghcid = ghcid_components.to_ghcid()
    ghcid_numeric = ghcid_components.to_numeric()
    return HeritageCustodian(
        id=ghcid,  # Primary key is now GHCID
        ghcid=ghcid,
        ghcid_numeric=ghcid_numeric,
        name=record.instelling,
        institution_type='MIXED',
        identifiers=[
            Identifier(
                identifier_scheme='GHCID',
                identifier_value=ghcid
            ),
            Identifier(
                identifier_scheme='ISIL',
                identifier_value=record.isil_code  # Preserved
            ),
            Identifier(
                identifier_scheme='GHCID_NUMERIC',
                identifier_value=str(ghcid_numeric)
            )
        ],
        # ... rest of fields
    )
Phase 4: Lookup Tables
Create reference data for geocoding:
data/reference/nl_city_locodes.json (excerpt; subdivision values are ISO 3166-2 codes, and entries for the remaining Dutch cities follow the same shape):
{
  "Amsterdam": {
    "locode": "AMS",
    "subdivision": "NH",
    "geonames_id": "2759794"
  },
  "Rotterdam": {
    "locode": "RTM",
    "subdivision": "ZH",
    "geonames_id": "2747891"
  }
}
data/reference/iso_3166_2_nl.json (excerpt; the remaining provinces follow the same shape):
{
  "NH": "Noord-Holland",
  "ZH": "Zuid-Holland",
  "UT": "Utrecht"
}
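A lookup over such reference data might look like the sketch below. It inlines two entries rather than reading the JSON file, and falls back to the XXX and 00 placeholders defined under Identifier Format when a city is unknown. `CITY_TABLE` and `lookup_city` are hypothetical names; case-insensitive matching is an assumption about how city names arrive from the parsers.

```python
from typing import Dict, Tuple

# Inline stand-in for data/reference/nl_city_locodes.json.
CITY_TABLE: Dict[str, Dict[str, str]] = {
    "Amsterdam": {"locode": "AMS", "subdivision": "NH"},
    "Rotterdam": {"locode": "RTM", "subdivision": "ZH"},
}

# Case-insensitive index built once at import time.
_INDEX = {name.casefold(): entry for name, entry in CITY_TABLE.items()}


def lookup_city(city: str) -> Tuple[str, str]:
    """Return (locode, subdivision); unknown cities map to ('XXX', '00')."""
    entry = _INDEX.get(city.strip().casefold())
    if entry is None:
        return ("XXX", "00")
    return (entry["locode"], entry["subdivision"])
```

Returning the placeholders instead of raising lets the migration proceed for unmatched cities while leaving them easy to find later (for example by querying for `*-00-XXX-*-*`).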
Phase 5: Cross-Linking with GHCID
Update cross-linking scripts to use GHCID:
# crosslink_dutch_datasets.py (updated)
# Build lookup by GHCID (not ISIL)
isil_by_ghcid = {}
orgs_by_ghcid = {}
for custodian in isil_custodians:
    isil_by_ghcid[custodian.ghcid] = custodian
for custodian in dutch_custodians:
    orgs_by_ghcid[custodian.ghcid] = custodian

# Merge by GHCID
merged_records = []
all_ghcids = set(isil_by_ghcid.keys()) | set(orgs_by_ghcid.keys())
for ghcid in sorted(all_ghcids):
    # Either side may be None when a GHCID occurs in only one dataset
    isil_record = isil_by_ghcid.get(ghcid)
    orgs_record = orgs_by_ghcid.get(ghcid)
    merged = merge_custodians(isil_record, orgs_record, ghcid)
    merged_records.append(merged)
Validation and Testing
Unit Tests
import re

from glam_extractor.identifiers.ghcid import GHCIDComponents, GHCIDGenerator


def test_ghcid_generation():
    generator = GHCIDGenerator()
    # Test Rijksmuseum
    components = generator.generate(
        name="Rijksmuseum",
        institution_type="MUSEUM",
        country="NL",
        subdivision="NH",
        city_locode="AMS"
    )
    # A single-word name yields only one initial, so the generator
    # falls back to the first four characters of the name.
    assert components.to_ghcid() == "NL-NH-AMS-M-RIJK"


def test_isil_to_ghcid_conversion():
    generator = GHCIDGenerator()
    components = generator.from_isil(
        isil_code="NL-AsdRM",
        institution_type="MUSEUM",
        city_locode="AMS",
        subdivision="NH"
    )
    assert components.to_ghcid() == "NL-NH-AMS-M-RM"


def test_ghcid_numeric_deterministic():
    comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    assert comp1.to_numeric() == comp2.to_numeric()


def test_ghcid_pattern_validation():
    valid_ghcids = [
        "NL-NH-AMS-M-RM",
        "US-NY-NYC-M-MOMA",
        "BR-RJ-RIO-L-BNB",
        "GB-ENG-LON-L-BL"
    ]
    # Subdivision is 1-3 characters, matching the component definition
    pattern = r'^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
    for ghcid in valid_ghcids:
        assert re.match(pattern, ghcid)
Migration Path
For Existing ISIL Data (364 Dutch ISIL records)
- ✅ Parse ISIL codes (already done)
- 🔄 Lookup UN/LOCODE for each city (create lookup table)
- 🔄 Lookup ISO 3166-2 subdivision codes (create lookup table)
- 🔄 Generate GHCID from ISIL + lookups
- ✅ Store ISIL as secondary identifier
For Non-ISIL Data (1,004 Dutch orgs without ISIL)
- ✅ Parse organization data (already done)
- 🔄 Extract city from address
- 🔄 Lookup UN/LOCODE for city
- ✅ Determine institution type (already done)
- 🔄 Generate abbreviation from name
- 🔄 Create GHCID (no ISIL to preserve)
For Conversation Data (139 files, 2,000-5,000 institutions)
- ⏳ Extract institution name, type, location (NLP)
- 🔄 Geocode location → UN/LOCODE
- 🔄 Generate GHCID
- 🔄 Check if ISIL exists (cross-reference)
- 🔄 Store ISIL if found, otherwise GHCID-only
Benefits Summary
| Feature | ISIL-Only | GHCID |
|---|---|---|
| Global coverage | Limited (requires registration) | Universal (any institution) |
| Geographic structure | Inconsistent | Standardized hierarchy |
| Type indication | No | Yes (1-char code) |
| Human-readable | Sometimes | Always |
| Numeric format | Not standard | Deterministic hash |
| Non-ISIL orgs | Excluded | Included |
| Backward compatible | N/A | Yes (ISIL preserved) |
| Validation | Registry lookup | Pattern (regex) check |
| Collision risk | Low (registry) | Very low for the numeric form (64-bit truncated SHA256) |
Next Steps
Immediate (Week 1)
- Create src/glam_extractor/identifiers/ghcid.py module
- Build UN/LOCODE lookup tables for Dutch cities
- Build ISO 3166-2 lookup for Dutch provinces
- Write tests for GHCID generation and conversion
Short-term (Week 2)
- Update schema to use GHCID as primary key
- Update ISIL parser to generate GHCID
- Update Dutch orgs parser to generate GHCID
- Re-run cross-linking with GHCID
Medium-term (Week 3-4)
- Expand lookup tables to cover all conversation countries (60+)
- Implement GHCID generation in conversation parser
- Create GHCID validation and normalization tools
- Build GHCID → ISIL reverse lookup service
References
- ISO 3166-1: https://www.iso.org/iso-3166-country-codes.html
- ISO 3166-2: https://www.iso.org/standard/72483.html
- UN/LOCODE: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
- ISIL Standard: https://www.iso.org/standard/77849.html
- GeoNames: https://www.geonames.org/
Recommendation: Adopt GHCID as primary identifier system, preserve ISIL codes as secondary identifiers for backward compatibility.