# Global Heritage Custodian Identifier System (GHCID)

**Version**: 0.1.0
**Status**: Design Proposal
**Date**: 2025-11-05

---

## Overview

A globally scalable, persistent identifier system for heritage custodians that:

1. **Unifies existing ISIL codes** into a consistent format
2. **Includes non-ISIL institutions** worldwide
3. **Provides human-readable and machine-readable formats**
4. **Supports hierarchical geographic organization**
5. **Enables hash-based numeric identifiers** for systems requiring numeric IDs

---

## Identifier Format

### Human-Readable Format (Primary)

```
{ISO 3166-1 alpha-2}-{ISO 3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}
```

**Components**:

1. **ISO 3166-1 alpha-2** (2 chars): Country code
   - Examples: `NL`, `BR`, `US`, `JP`, `FR`
   - Standard: https://www.iso.org/iso-3166-country-codes.html

2. **ISO 3166-2** (1-3 chars): Subdivision code (province, state, region)
   - Examples: `NH` (Noord-Holland), `CA` (California), `SP` (São Paulo)
   - Use `00` for national-level institutions
   - Standard: https://www.iso.org/standard/72483.html

3. **UN/LOCODE** (3 chars): City/location code
   - Examples: `AMS` (Amsterdam), `NYC` (New York), `RIO` (Rio de Janeiro)
   - Use `XXX` for region-level or unknown locations
   - Standard: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory

4. **Type** (1 char): Institution type
   - `G` = Gallery
   - `L` = Library
   - `A` = Archive
   - `M` = Museum
   - `C` = Cultural Center
   - `R` = Research Institute
   - `N` = Consortium/Network
   - `V` = Government Agency
   - `X` = Mixed/Other

5. **Abbreviation** (2-8 chars): Official name abbreviation
   - Use the first letters of the official international name
   - Maximum 8 characters
   - Uppercase, alphanumeric only

### Examples

```
# National Archives of the Netherlands
NL-00-XXX-A-NAN

# Rijksmuseum Amsterdam
NL-NH-AMS-M-RM

# Biblioteca Nacional do Brasil (Rio de Janeiro)
BR-RJ-RIO-L-BNB

# Museum of Modern Art (New York)
US-NY-NYC-M-MOMA

# British Library (London)
GB-ENG-LON-L-BL

# Louvre Museum (Paris)
FR-IDF-PAR-M-LM

# Mixed institution (Library + Archive + Museum in Utrecht)
NL-UT-UTC-X-RHCU
```

### Hash-Based Numeric Format (Secondary)

For systems requiring numeric-only identifiers:

```python
import hashlib

def ghcid_to_numeric(ghcid: str) -> int:
    """
    Convert GHCID to deterministic numeric identifier.

    Uses SHA256 hash truncated to 64 bits (unsigned integer).
    Range: 0 to 18,446,744,073,709,551,615
    """
    hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
    return int.from_bytes(hash_bytes[:8], byteorder='big')

# Example
ghcid = "NL-NH-AMS-M-RM"  # Rijksmuseum
numeric_id = ghcid_to_numeric(ghcid)
# => an unsigned 64-bit integer (deterministic, always the same for this GHCID)
```

**Properties**:

- Deterministic (same GHCID always produces same numeric ID)
- Collision-resistant at registry scale (SHA-256 truncated to 64 bits; by the birthday bound, collisions become likely only on the order of 2^32, i.e. several billion, identifiers)
- Reversible via lookup table (store GHCID ↔ numeric mapping)
- Fits in an unsigned 64-bit integer; systems limited to signed 64-bit values (such as many SQL `BIGINT` columns) cannot store the upper half of the range as-is

---

## ISIL Code Migration

### Mapping ISIL to GHCID

Existing ISIL codes can be converted automatically, given lookup tables for the location and subdivision:

```python
def isil_to_ghcid(isil_code: str, institution_type: str, city_locode: str) -> str:
    """
    Convert ISIL code to GHCID format.

    ISIL format: {Country}-{Local Code}
    Example: NL-AsdRM (Rijksmuseum Amsterdam)
    """
    country = isil_code[:2]  # NL

    # Parse local code for geographic hints
    # Many ISIL codes encode city: NL-AsdRM (Asd = Amsterdam)
    # This requires lookup table or heuristics

    # For now, extract from known patterns or use lookup.
    # extract_subdivision, institution_type_to_code and
    # extract_abbreviation_from_isil are placeholder helpers
    # that are not defined in this document.
    subdivision = extract_subdivision(isil_code)  # NH (Noord-Holland)
    locode = city_locode  # AMS (Amsterdam)
    type_code = institution_type_to_code(institution_type)  # M
    abbrev = extract_abbreviation_from_isil(isil_code)  # RM

    return f"{country}-{subdivision}-{locode}-{type_code}-{abbrev}"

# Example conversions
# ISIL: NL-AsdRM     → GHCID: NL-NH-AMS-M-RM
# ISIL: US-DLC       → GHCID: US-DC-WAS-L-LC   (Library of Congress)
# ISIL: FR-751131015 → GHCID: FR-IDF-PAR-L-BNF (Bibliothèque nationale de France)
# ISIL: GB-UkOxU     → GHCID: GB-ENG-OXF-L-BL  (Bodleian Library)
```

### ISIL Preservation

**Important**: Store original ISIL codes as identifiers:

```python
custodian = HeritageCustodian(
    id="NL-NH-AMS-M-RM",  # Primary GHCID
    name="Rijksmuseum",
    identifiers=[
        Identifier(
            identifier_scheme="GHCID",
            identifier_value="NL-NH-AMS-M-RM"
        ),
        Identifier(
            identifier_scheme="ISIL",
            identifier_value="NL-AsdRM"  # Preserved!
        ),
        Identifier(
            identifier_scheme="GHCID_NUMERIC",
            identifier_value="12345678901234567"  # Illustrative placeholder, not the real hash
        )
    ]
)
```

**Benefits**:

- Maintains backward compatibility with ISIL systems
- Enables cross-referencing with existing ISIL registries
- Provides migration path for ISIL-dependent workflows

---

## Benefits Over ISIL-Only Approach

### 1. Global Coverage

**ISIL Limitations**:

- Not all countries have ISIL registries
- Registration requires bureaucratic process
- Many small/local institutions lack ISIL codes

**GHCID Advantages**:

- Assign identifiers to **any** heritage institution worldwide
- No registration required (generated deterministically)
- Covers 1,004 Dutch organizations currently without ISIL codes
- Enables grassroots/community heritage organizations

### 2. Geographic Hierarchy

**ISIL**: Flat structure, geographic info encoded inconsistently

**GHCID**: Structured hierarchy (Country → Region → City)

**Use Cases**:

- Query all museums in Amsterdam: `NL-NH-AMS-M-*`
- Query all heritage orgs in São Paulo state: `BR-SP-*-*-*`
- Aggregate statistics by region
- Geocoding lookups

### 3. Institution Type in ID

**ISIL**: No type indicator (requires separate lookup)

**GHCID**: Type encoded in ID (`-M-`, `-L-`, `-A-`)

**Use Cases**:

- Filter by type without database lookup
- Validate type consistency
- Build type-specific indexes

### 4. Human Readability

**ISIL Examples**:

- `NL-AsdRM` → Readable (Amsterdam, RM)
- `FR-751131015` → Opaque numeric code
- `DE-MUS-815314` → Mixed format

**GHCID Examples** (all readable):

- `NL-NH-AMS-M-RM` → Netherlands, Noord-Holland, Amsterdam, Museum, Rijksmuseum
- `FR-IDF-PAR-L-BNF` → France, Île-de-France, Paris, Library, BNF
- `BR-RJ-RIO-L-BNB` → Brazil, Rio de Janeiro, Rio, Library, BNB

### 5. Compatibility with Existing Systems

**Numeric Systems** (e.g., VIAF, databases requiring int64):

- GHCID → Hash → Numeric ID (deterministic)
- Store mapping table for reverse lookup

**ISIL Systems**:

- Store ISIL as secondary identifier
- Maintain bidirectional mapping
- Enable gradual migration

---

## Implementation Strategy

### Phase 1: Schema Update

Update `heritage_custodian.yaml`:

```yaml
classes:
  HeritageCustodian:
    slots:
      - id             # Now uses GHCID format
      - ghcid          # Explicit GHCID field
      - ghcid_numeric  # Hash-based numeric version
      - identifiers    # List includes ISIL, Wikidata, etc.

    slot_usage:
      id:
        description: Primary identifier in GHCID format
        pattern: '^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
      ghcid:
        description: Explicit GHCID in human-readable format
        required: true
      ghcid_numeric:
        description: Deterministic numeric hash of GHCID
        range: integer
```

### Phase 2: GHCID Generator Module

Create `src/glam_extractor/identifiers/ghcid.py`:

```python
from dataclasses import dataclass
from typing import Optional
import hashlib
import re

@dataclass
class GHCIDComponents:
    country: str       # ISO 3166-1 alpha-2 (2 chars)
    subdivision: str   # ISO 3166-2 (1-3 chars, or '00')
    locode: str        # UN/LOCODE (3 chars, or 'XXX')
    type_code: str     # Institution type (1 char)
    abbreviation: str  # Name abbreviation (2-8 chars)

    def to_ghcid(self) -> str:
        return f"{self.country}-{self.subdivision}-{self.locode}-{self.type_code}-{self.abbreviation}"

    def to_numeric(self) -> int:
        ghcid = self.to_ghcid()
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')


class GHCIDGenerator:
    """Generate Global Heritage Custodian Identifiers"""

    TYPE_CODES = {
        'GALLERY': 'G',
        'LIBRARY': 'L',
        'ARCHIVE': 'A',
        'MUSEUM': 'M',
        'CULTURAL_CENTER': 'C',
        'RESEARCH_INSTITUTE': 'R',
        'CONSORTIUM': 'N',
        'GOVERNMENT_AGENCY': 'V',
        'MIXED': 'X'
    }

    def generate(
        self,
        name: str,
        institution_type: str,
        country: str,
        subdivision: Optional[str] = None,
        city_locode: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Generate GHCID from institution metadata.

        Args:
            name: Official institution name
            institution_type: Institution type enum value
            country: ISO 3166-1 alpha-2 country code
            subdivision: ISO 3166-2 subdivision code (optional)
            city_locode: UN/LOCODE for city (optional)

        Returns:
            GHCIDComponents with all fields populated
        """
        # Normalize inputs
        country = country.upper()
        subdivision = (subdivision or '00').upper()
        locode = (city_locode or 'XXX').upper()
        type_code = self.TYPE_CODES.get(institution_type, 'X')

        # Generate abbreviation from name
        abbreviation = self._generate_abbreviation(name)

        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=locode,
            type_code=type_code,
            abbreviation=abbreviation
        )

    def _generate_abbreviation(self, name: str) -> str:
        """
        Generate 2-8 character abbreviation from institution name.

        Strategy:
        1. Extract words (split on spaces/punctuation)
        2. Take first letter of each significant word
        3. Skip stopwords (the, of, for, and, etc.)
        4. Maximum 8 characters
        5. Minimum 2 characters
        """
        stopwords = {'the', 'of', 'for', 'and', 'in', 'at', 'to', 'a', 'an'}

        # Split name into words, filter stopwords
        words = re.findall(r'\b\w+\b', name.lower())
        significant_words = [w for w in words if w not in stopwords]

        # Take first letter of each word
        abbrev = ''.join(w[0].upper() for w in significant_words[:8])

        # Ensure minimum 2 characters
        if len(abbrev) < 2:
            # Fallback: first 2-4 alphanumeric chars of the first significant
            # word (or of the raw name if every word was a stopword)
            source = significant_words[0] if significant_words else name
            abbrev = re.sub(r'[^A-Za-z0-9]', '', source)[:4].upper()

        return abbrev[:8]  # Max 8 chars

    def from_isil(
        self,
        isil_code: str,
        institution_type: str,
        city_locode: str,
        subdivision: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Convert ISIL code to GHCID.

        Args:
            isil_code: ISIL code (e.g., 'NL-AsdRM')
            institution_type: Institution type enum value
            city_locode: UN/LOCODE for city
            subdivision: ISO 3166-2 code (if not in ISIL)

        Returns:
            GHCIDComponents
        """
        # Parse ISIL
        match = re.match(r'^([A-Z]{2})-(.+)$', isil_code)
        if not match:
            raise ValueError(f"Invalid ISIL code format: {isil_code}")

        country = match.group(1)
        local_code = match.group(2)

        # Extract abbreviation from ISIL local code
        # Many ISIL codes have pattern: {CityCode}{Abbreviation}
        # e.g., NL-AsdRM → Asd (Amsterdam) + RM (Rijksmuseum)
        abbrev = self._extract_isil_abbreviation(local_code)

        subdivision = subdivision or '00'
        type_code = self.TYPE_CODES.get(institution_type, 'X')

        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=city_locode,
            type_code=type_code,
            abbreviation=abbrev
        )

    def _extract_isil_abbreviation(self, local_code: str) -> str:
        """
        Extract abbreviation from ISIL local code.

        Heuristics:
        - If starts with 3-letter city code, take rest
        - If purely numeric, use first 8 digits
        - Otherwise, use full local code (max 8 chars)
        """
        # Check if starts with likely city code (3 letters + rest)
        match = re.match(r'^[A-Za-z]{3}(.+)$', local_code)
        if match:
            return match.group(1)[:8].upper()

        # If numeric, use up to the first 8 digits
        if local_code.isdigit():
            return local_code[:8]

        # Otherwise use full code
        return local_code[:8].upper()
```

### Phase 3: Update Parsers

Update all parsers to generate GHCID:

```python
# In isil_registry.py
from glam_extractor.identifiers.ghcid import GHCIDGenerator

generator = GHCIDGenerator()

def to_heritage_custodian(record: ISILRegistryRecord) -> HeritageCustodian:
    # Generate GHCID from ISIL
    ghcid_components = generator.from_isil(
        isil_code=record.isil_code,
        institution_type='MIXED',  # ISIL registry doesn't specify type
        city_locode=lookup_locode(record.plaats),  # Lookup UN/LOCODE
        subdivision=lookup_subdivision(record.plaats)  # Lookup ISO 3166-2
    )

    ghcid = ghcid_components.to_ghcid()
    ghcid_numeric = ghcid_components.to_numeric()

    return HeritageCustodian(
        id=ghcid,  # Primary key is now GHCID
        ghcid=ghcid,
        ghcid_numeric=ghcid_numeric,
        name=record.instelling,
        institution_type='MIXED',
        identifiers=[
            Identifier(
                identifier_scheme='GHCID',
                identifier_value=ghcid
            ),
            Identifier(
                identifier_scheme='ISIL',
                identifier_value=record.isil_code  # Preserved
            ),
            Identifier(
                identifier_scheme='GHCID_NUMERIC',
                identifier_value=str(ghcid_numeric)
            )
        ],
        # ... rest of fields
    )
```

### Phase 4: Lookup Tables

Create reference data for geocoding:

```python
# Annotated for readability; the actual .json files cannot contain comments.

# data/reference/nl_city_locodes.json
{
    "Amsterdam": {
        "locode": "AMS",
        "subdivision": "NH",  # Noord-Holland
        "geonames_id": "2759794"
    },
    "Rotterdam": {
        "locode": "RTM",
        "subdivision": "ZH",  # Zuid-Holland
        "geonames_id": "2747891"
    },
    # ... all Dutch cities
}

# data/reference/iso_3166_2_nl.json
{
    "NH": "Noord-Holland",
    "ZH": "Zuid-Holland",
    "UT": "Utrecht",
    # ... all provinces
}
```

### Phase 5: Cross-Linking with GHCID

Update cross-linking scripts to use GHCID:

```python
# crosslink_dutch_datasets.py (updated)

# Build lookup by GHCID (not ISIL)
isil_by_ghcid = {}
orgs_by_ghcid = {}

for custodian in isil_custodians:
    isil_by_ghcid[custodian.ghcid] = custodian

for custodian in dutch_custodians:
    orgs_by_ghcid[custodian.ghcid] = custodian

# Merge by GHCID
merged_records = []
all_ghcids = set(isil_by_ghcid.keys()) | set(orgs_by_ghcid.keys())

for ghcid in sorted(all_ghcids):
    isil_record = isil_by_ghcid.get(ghcid)
    orgs_record = orgs_by_ghcid.get(ghcid)

    merged = merge_custodians(isil_record, orgs_record, ghcid)
    merged_records.append(merged)
```

---

## Validation and Testing

### Unit Tests

```python
import re

from glam_extractor.identifiers.ghcid import GHCIDComponents, GHCIDGenerator

def test_ghcid_generation():
    generator = GHCIDGenerator()

    # Test Rijksmuseum
    ghcid = generator.generate(
        name="Rijksmuseum",
        institution_type="MUSEUM",
        country="NL",
        subdivision="NH",
        city_locode="AMS"
    )

    # A single-word name yields a 1-letter abbreviation, so the generator
    # falls back to the first four letters. Note that this differs from
    # the ISIL-derived NL-NH-AMS-M-RM.
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RIJK"

def test_isil_to_ghcid_conversion():
    generator = GHCIDGenerator()

    ghcid = generator.from_isil(
        isil_code="NL-AsdRM",
        institution_type="MUSEUM",
        city_locode="AMS",
        subdivision="NH"
    )

    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RM"

def test_ghcid_numeric_deterministic():
    comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

    assert comp1.to_numeric() == comp2.to_numeric()

def test_ghcid_pattern_validation():
    valid_ghcids = [
        "NL-NH-AMS-M-RM",
        "US-NY-NYC-M-MOMA",
        "BR-RJ-RIO-L-BNB",
        "GB-ENG-LON-L-BL"
    ]

    pattern = r'^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'

    for ghcid in valid_ghcids:
        assert re.match(pattern, ghcid)
```

---

## Migration Path

### For Existing ISIL Data (364 Dutch ISIL records)

1. ✅ Parse ISIL codes (already done)
2. 🔄 Lookup UN/LOCODE for each city (create lookup table)
3. 🔄 Lookup ISO 3166-2 subdivision codes (create lookup table)
4. 🔄 Generate GHCID from ISIL + lookups
5. ✅ Store ISIL as secondary identifier

### For Non-ISIL Data (1,004 Dutch orgs without ISIL)

1. ✅ Parse organization data (already done)
2. 🔄 Extract city from address
3. 🔄 Lookup UN/LOCODE for city
4. ✅ Determine institution type (already done)
5. 🔄 Generate abbreviation from name
6. 🔄 Create GHCID (no ISIL to preserve)

### For Conversation Data (139 files, 2,000-5,000 institutions)

1. ⏳ Extract institution name, type, location (NLP)
2. 🔄 Geocode location → UN/LOCODE
3. 🔄 Generate GHCID
4. 🔄 Check if ISIL exists (cross-reference)
5. 🔄 Store ISIL if found, otherwise GHCID-only

---

## Benefits Summary

| Feature | ISIL-Only | GHCID |
|---------|-----------|-------|
| **Global coverage** | Limited (requires registration) | Universal (any institution) |
| **Geographic structure** | Inconsistent | Standardized hierarchy |
| **Type indication** | No | Yes (1-char code) |
| **Human-readable** | Sometimes | Always |
| **Numeric format** | Not standard | Deterministic hash |
| **Non-ISIL orgs** | Excluded | Included |
| **Backward compatible** | N/A | Yes (ISIL preserved) |
| **Validation** | Registry lookup | Pattern match (regex) |
| **Collision risk** | Low (registry) | Very low (64-bit SHA-256 hash) |

---

## Next Steps

### Immediate (Week 1)

1. Create `src/glam_extractor/identifiers/ghcid.py` module
2. Build UN/LOCODE lookup tables for Dutch cities
3. Build ISO 3166-2 lookup for Dutch provinces
4. Write tests for GHCID generation and conversion

### Short-term (Week 2)

5. Update schema to use GHCID as primary key
6. Update ISIL parser to generate GHCID
7. Update Dutch orgs parser to generate GHCID
8. Re-run cross-linking with GHCID

### Medium-term (Week 3-4)

9. Expand lookup tables to cover all conversation countries (60+)
10. Implement GHCID generation in conversation parser
11. Create GHCID validation and normalization tools
12. Build GHCID → ISIL reverse lookup service

---

## References

- **ISO 3166-1**: https://www.iso.org/iso-3166-country-codes.html
- **ISO 3166-2**: https://www.iso.org/standard/72483.html
- **UN/LOCODE**: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
- **ISIL Standard**: https://www.iso.org/standard/77849.html
- **GeoNames**: https://www.geonames.org/

---

**Recommendation**: Adopt GHCID as the primary identifier system and preserve ISIL codes as secondary identifiers for backward compatibility.
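
### Appendix: Validation and Wildcard Query Sketch

The validation tooling listed under Next Steps and the wildcard queries from the Geographic Hierarchy use cases (`NL-NH-AMS-M-*`) could start from something like the sketch below. It is a minimal illustration under stated assumptions, not part of the proposed module: `GHCID_PATTERN`, `parse_ghcid` and `filter_ghcids` are hypothetical names, and the pattern assumes 1-3 character subdivision codes as given in the component list (some ISO 3166-2 codes, such as Austria's `1`-`9`, are a single character). Wildcard matching uses the standard-library `fnmatch` module.

```python
import re
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Optional

# Assumed pattern: 2-letter country, 1-3 char subdivision, 3-letter location,
# 1-letter type code, 2-8 char abbreviation (names are hypothetical).
GHCID_PATTERN = re.compile(
    r'(?P<country>[A-Z]{2})-(?P<subdivision>[A-Z0-9]{1,3})-'
    r'(?P<locode>[A-Z]{3})-(?P<type_code>[GLAMCRNVX])-'
    r'(?P<abbreviation>[A-Z0-9]{2,8})'
)


def parse_ghcid(ghcid: str) -> Optional[Dict[str, str]]:
    """Return the five components as a dict, or None if the string is invalid."""
    # fullmatch rejects trailing characters, including a trailing newline
    match = GHCID_PATTERN.fullmatch(ghcid)
    return match.groupdict() if match else None


def filter_ghcids(ghcids: Iterable[str], query: str) -> List[str]:
    """Select GHCIDs matching a case-sensitive wildcard query such as 'NL-NH-AMS-M-*'."""
    return [g for g in ghcids if fnmatchcase(g, query)]


sample = ["NL-NH-AMS-M-RM", "NL-00-XXX-A-NAN", "BR-RJ-RIO-L-BNB", "nl-nh-ams-m-rm"]

# Keep only well-formed identifiers (the lowercase entry is rejected)
valid = [g for g in sample if parse_ghcid(g) is not None]

# All museums in Amsterdam among the valid identifiers
amsterdam_museums = filter_ghcids(valid, "NL-NH-AMS-M-*")

print(parse_ghcid("NL-NH-AMS-M-RM"))
print(amsterdam_museums)
```

Because every component sits at a fixed position, a prefix query such as `NL-*` also works without parsing; in a database, the same effect could be achieved with a prefix index on the GHCID column.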