glam/docs/plan/global_glam/06-global-identifier-system.md
2025-11-19 23:25:22 +01:00

# Global Heritage Custodian Identifier System (GHCID)
**Version**: 0.1.0
**Status**: Design Proposal
**Date**: 2025-11-05

---
## Overview
A globally scalable, persistent identifier system for heritage custodians that:
1. **Unifies existing ISIL codes** into a consistent format
2. **Includes non-ISIL institutions** worldwide
3. **Provides human-readable and machine-readable formats**
4. **Supports hierarchical geographic organization**
5. **Enables hash-based numeric identifiers** for systems requiring numeric IDs
---
## Identifier Format
### Human-Readable Format (Primary)
```
{ISO 3166-1 alpha-2}-{ISO 3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}
```
**Components**:
1. **ISO 3166-1 alpha-2** (2 chars): Country code
   - Examples: `NL`, `BR`, `US`, `JP`, `FR`
   - Standard: https://www.iso.org/iso-3166-country-codes.html
2. **ISO 3166-2** (1-3 chars): Subdivision code (province, state, region), written without the country prefix
   - Examples: `NH` (Noord-Holland), `CA` (California), `SP` (São Paulo)
   - Use `00` for national-level institutions
   - Standard: https://www.iso.org/standard/72483.html
3. **UN/LOCODE** (3 chars): City/location code (the three-character location part only; the country is already the first component)
   - Examples: `AMS` (Amsterdam), `NYC` (New York), `RIO` (Rio de Janeiro)
   - Use `XXX` for region-level or unknown locations
   - Standard: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
4. **Type** (1 char): Institution type
   - `G` = Gallery
   - `L` = Library
   - `A` = Archive
   - `M` = Museum
   - `C` = Cultural Center
   - `R` = Research Institute
   - `N` = Consortium/Network
   - `V` = Government Agency
   - `X` = Mixed/Other
5. **Abbreviation** (2-8 chars): Official name abbreviation
   - Use the first letters of the official international name
   - Maximum 8 characters
   - Uppercase, alphanumeric only
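
The "uppercase, alphanumeric only" rule implies that accented names need folding before initials are taken. The following is a minimal sketch of one way to do that, assuming NFKD decomposition is an acceptable way to drop diacritics; the function name `ascii_abbreviation` and the stopword list are illustrative, not part of the proposal. It does not enforce the 2-character minimum, so single-word names still need a fallback.

```python
import re
import unicodedata

# Illustrative stopword list; extend per language as needed.
STOPWORDS = {"the", "of", "for", "and", "in", "at", "to", "a", "an"}


def ascii_abbreviation(name: str, max_len: int = 8) -> str:
    """Build an uppercase A-Z/0-9 abbreviation from the initials of a name."""
    # Decompose accented letters (e.g. "Ö" -> "O" + combining mark),
    # then drop everything that is not plain ASCII.
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z0-9]+", folded)
    significant = [w for w in words if w.lower() not in STOPWORDS]
    return "".join(w[0] for w in significant).upper()[:max_len]


print(ascii_abbreviation("National Archives of the Netherlands"))
print(ascii_abbreviation("Österreichische Nationalbibliothek"))
```

Where an institution has an established official abbreviation, that abbreviation should still take precedence over computed initials.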
### Examples
```
# National Archives of the Netherlands
NL-00-XXX-A-NAN
# Rijksmuseum Amsterdam
NL-NH-AMS-M-RM
# Biblioteca Nacional do Brasil (Rio de Janeiro)
BR-RJ-RIO-L-BNB
# Museum of Modern Art (New York)
US-NY-NYC-M-MOMA
# British Library (London)
GB-ENG-LON-L-BL
# Louvre Museum (Paris)
FR-IDF-PAR-M-LM
# Mixed institution (Library + Archive + Museum in Utrecht)
NL-UT-UTC-X-RHCU
```
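Because every component has a fixed position and character set, an identifier can be split back into its parts with a single regular expression. The sketch below is one possible parser; `GHCID_RE`, `ParsedGHCID`, and `parse_ghcid` are hypothetical names, and the subdivision group accepts 1-3 characters as stated in the component list.

```python
import re
from typing import NamedTuple, Optional

# One named group per GHCID component.
GHCID_RE = re.compile(
    r"^(?P<country>[A-Z]{2})"
    r"-(?P<subdivision>[A-Z0-9]{1,3})"
    r"-(?P<locode>[A-Z]{3})"
    r"-(?P<type_code>[GLAMCRNVX])"
    r"-(?P<abbreviation>[A-Z0-9]{2,8})$"
)


class ParsedGHCID(NamedTuple):
    country: str
    subdivision: str
    locode: str
    type_code: str
    abbreviation: str


def parse_ghcid(value: str) -> Optional[ParsedGHCID]:
    """Return the components of a well-formed GHCID, or None if malformed."""
    match = GHCID_RE.match(value)
    return ParsedGHCID(**match.groupdict()) if match else None


print(parse_ghcid("NL-NH-AMS-M-RM"))
print(parse_ghcid("nl-nh-ams-m-rm"))  # lowercase input is rejected
```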
### Hash-Based Numeric Format (Secondary)
For systems requiring numeric-only identifiers:
```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """
    Convert GHCID to deterministic numeric identifier.

    Uses SHA256 hash truncated to 64 bits (unsigned integer).
    Range: 0 to 18,446,744,073,709,551,615
    """
    hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
    return int.from_bytes(hash_bytes[:8], byteorder='big')


# Example
ghcid = "NL-NH-AMS-M-RM"  # Rijksmuseum
numeric_id = ghcid_to_numeric(ghcid)
# numeric_id is an integer in [0, 2**64 - 1], always the same for this GHCID
```
**Properties**:
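- Deterministic (same GHCID always produces same numeric ID)
- Low collision risk, but weaker than full SHA256: truncation to 64 bits gives a birthday bound of roughly n²/2⁶⁵, about 3 × 10⁻⁸ for one million identifiers
- Reversible only via lookup table (store GHCID ↔ numeric mapping); the hash itself is one-way
- Fits in an unsigned 64-bit integer; values of 2⁶³ or above overflow signed 64-bit columns (for example PostgreSQL `BIGINT`), so such systems need a signed reinterpretation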
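
Two practical questions follow from the 64-bit truncation: how likely a collision is for a given number of identifiers, and how to store the value in a signed 64-bit column. The sketch below estimates the first with the standard birthday-bound approximation and handles the second with a two's-complement reinterpretation. The helper names `to_signed64`, `from_signed64`, and `collision_probability` are assumptions for illustration, not part of the proposal.

```python
import hashlib
import math


def ghcid_to_numeric(ghcid: str) -> int:
    """Unsigned 64-bit value from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def from_signed64(value: int) -> int:
    """Inverse of to_signed64."""
    return value + (1 << 64) if value < 0 else value


def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


numeric = ghcid_to_numeric("NL-NH-AMS-M-RM")
stored = to_signed64(numeric)            # safe for a signed 64-bit column
assert from_signed64(stored) == numeric  # round-trips without loss
print(f"{collision_probability(1_000_000):.2e}")
```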
---
## ISIL Code Migration
### Mapping ISIL to GHCID
Existing ISIL codes can be converted automatically once the institution type and city code are known:
```python
def isil_to_ghcid(isil_code: str, institution_type: str, city_locode: str) -> str:
    """
    Convert ISIL code to GHCID format.

    ISIL format: {Country}-{Local Code}
    Example: NL-AsdRM (Rijksmuseum Amsterdam)
    """
    country = isil_code[:2]  # NL
    # Parse local code for geographic hints.
    # Many ISIL codes encode the city: NL-AsdRM (Asd = Amsterdam).
    # This requires a lookup table or heuristics. The helpers
    # extract_subdivision, institution_type_to_code and
    # extract_abbreviation_from_isil are placeholders not defined here.
    subdivision = extract_subdivision(isil_code)  # NH (Noord-Holland)
    locode = city_locode  # AMS (Amsterdam)
    type_code = institution_type_to_code(institution_type)  # M
    abbrev = extract_abbreviation_from_isil(isil_code)  # RM
    return f"{country}-{subdivision}-{locode}-{type_code}-{abbrev}"


# Example conversions
# ISIL: NL-AsdRM      GHCID: NL-NH-AMS-M-RM
# ISIL: US-DLC        GHCID: US-DC-WAS-L-LC   (Library of Congress)
# ISIL: FR-751131015  GHCID: FR-IDF-PAR-L-BNF (Bibliothèque nationale de France)
# ISIL: GB-UkOxU      GHCID: GB-ENG-OXF-L-BL  (Bodleian Library)
```
### ISIL Preservation
**Important**: Store original ISIL codes as identifiers:
```python
custodian = HeritageCustodian(
    id="NL-NH-AMS-M-RM",  # Primary GHCID
    name="Rijksmuseum",
    identifiers=[
        Identifier(
            identifier_scheme="GHCID",
            identifier_value="NL-NH-AMS-M-RM"
        ),
        Identifier(
            identifier_scheme="ISIL",
            identifier_value="NL-AsdRM"  # Preserved!
        ),
        Identifier(
            identifier_scheme="GHCID_NUMERIC",
            identifier_value="12345678901234567"  # Placeholder value
        )
    ]
)
```
**Benefits**:
- Maintains backward compatibility with ISIL systems
- Enables cross-referencing with existing ISIL registries
- Provides migration path for ISIL-dependent workflows
---
## Benefits Over ISIL-Only Approach
### 1. Global Coverage
**ISIL Limitations**:
- Not all countries have ISIL registries
- Registration requires bureaucratic process
- Many small/local institutions lack ISIL codes
**GHCID Advantages**:
- Assign identifiers to **any** heritage institution worldwide
- No registration required (generated deterministically)
- Covers 1,004 Dutch organizations currently without ISIL codes
- Enables grassroots/community heritage organizations
### 2. Geographic Hierarchy
**ISIL**: Flat structure, geographic info encoded inconsistently
**GHCID**: Structured hierarchy (Country → Region → City)
**Use Cases**:
- Query all museums in Amsterdam: `NL-NH-AMS-M-*`
- Query all heritage orgs in São Paulo state: `BR-SP-*-*-*`
- Aggregate statistics by region
- Geocoding lookups
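
The wildcard queries above can be evaluated directly on the identifier strings. A minimal sketch using shell-style patterns from the standard library follows; the `query` helper and the sample identifiers other than the Rijksmuseum one are made up for illustration. In a database, the same idea would typically become a prefix match on the identifier column.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def query(ghcids: Iterable[str], pattern: str) -> List[str]:
    """Return the identifiers matching a shell-style wildcard pattern."""
    return [g for g in ghcids if fnmatchcase(g, pattern)]


# Hypothetical sample identifiers, for illustration only.
ids = [
    "NL-NH-AMS-M-RM",
    "NL-NH-AMS-L-OBA",
    "NL-ZH-RTM-M-BVB",
    "BR-SP-SAO-M-MASP",
]

print(query(ids, "NL-NH-AMS-M-*"))  # museums in Amsterdam
print(query(ids, "BR-SP-*-*-*"))    # everything in São Paulo state
```
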
### 3. Institution Type in ID
**ISIL**: No type indicator (requires separate lookup)
**GHCID**: Type encoded in ID (`-M-`, `-L-`, `-A-`)
**Use Cases**:
- Filter by type without database lookup
- Validate type consistency
- Build type-specific indexes
### 4. Human Readability
**ISIL Examples**:
- `NL-AsdRM` → Readable (Amsterdam, RM)
- `FR-751131015` → Opaque numeric code
- `DE-MUS-815314` → Mixed format
**GHCID Examples** (all readable):
- `NL-NH-AMS-M-RM` → Netherlands, Noord-Holland, Amsterdam, Museum, Rijksmuseum
- `FR-IDF-PAR-L-BNF` → France, Île-de-France, Paris, Library, BNF
- `BR-RJ-RIO-L-BNB` → Brazil, Rio de Janeiro, Rio, Library, BNB
### 5. Compatibility with Existing Systems
**Numeric Systems** (e.g., VIAF, databases requiring int64):
- GHCID → Hash → Numeric ID (deterministic)
- Store mapping table for reverse lookup
**ISIL Systems**:
- Store ISIL as secondary identifier
- Maintain bidirectional mapping
- Enable gradual migration
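
The mapping tables mentioned above could be kept in a small registry that also detects the rare case of two GHCIDs hashing to the same numeric value. This is an in-memory sketch under that assumption; the class `IdentifierRegistry` and its method names are hypothetical, and a production system would back the dictionaries with database tables.

```python
import hashlib
from typing import Dict, Optional


class IdentifierRegistry:
    """In-memory GHCID <-> numeric and GHCID <-> ISIL mapping."""

    def __init__(self) -> None:
        self._numeric_to_ghcid: Dict[int, str] = {}
        self._isil_to_ghcid: Dict[str, str] = {}
        self._ghcid_to_isil: Dict[str, str] = {}

    @staticmethod
    def numeric_id(ghcid: str) -> int:
        digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], byteorder="big")

    def register(self, ghcid: str, isil: Optional[str] = None) -> int:
        numeric = self.numeric_id(ghcid)
        # setdefault returns the GHCID already stored under this number, if any.
        existing = self._numeric_to_ghcid.setdefault(numeric, ghcid)
        if existing != ghcid:
            raise ValueError(f"Numeric collision: {ghcid} vs {existing}")
        if isil is not None:
            self._isil_to_ghcid[isil] = ghcid
            self._ghcid_to_isil[ghcid] = isil
        return numeric

    def ghcid_for_numeric(self, numeric: int) -> Optional[str]:
        return self._numeric_to_ghcid.get(numeric)

    def ghcid_for_isil(self, isil: str) -> Optional[str]:
        return self._isil_to_ghcid.get(isil)

    def isil_for_ghcid(self, ghcid: str) -> Optional[str]:
        return self._ghcid_to_isil.get(ghcid)


registry = IdentifierRegistry()
numeric = registry.register("NL-NH-AMS-M-RM", isil="NL-AsdRM")
print(registry.ghcid_for_numeric(numeric))
print(registry.isil_for_ghcid("NL-NH-AMS-M-RM"))
```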
---
## Implementation Strategy
### Phase 1: Schema Update
Update `heritage_custodian.yaml`:
```yaml
classes:
  HeritageCustodian:
    slots:
      - id              # Now uses GHCID format
      - ghcid           # Explicit GHCID field
      - ghcid_numeric   # Hash-based numeric version
      - identifiers     # List includes ISIL, Wikidata, etc.
    slot_usage:
      id:
        description: Primary identifier in GHCID format
        pattern: '^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
      ghcid:
        description: Explicit GHCID in human-readable format
        required: true
      ghcid_numeric:
        description: Deterministic numeric hash of GHCID
        range: integer
```
### Phase 2: GHCID Generator Module
Create `src/glam_extractor/identifiers/ghcid.py`:
```python
from dataclasses import dataclass
from typing import Optional
import hashlib
import re


@dataclass
class GHCIDComponents:
    country: str       # ISO 3166-1 alpha-2 (2 chars)
    subdivision: str   # ISO 3166-2 (1-3 chars, or '00')
    locode: str        # UN/LOCODE (3 chars, or 'XXX')
    type_code: str     # Institution type (1 char)
    abbreviation: str  # Name abbreviation (2-8 chars)

    def to_ghcid(self) -> str:
        return f"{self.country}-{self.subdivision}-{self.locode}-{self.type_code}-{self.abbreviation}"

    def to_numeric(self) -> int:
        ghcid = self.to_ghcid()
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')


class GHCIDGenerator:
    """Generate Global Heritage Custodian Identifiers"""

    TYPE_CODES = {
        'GALLERY': 'G',
        'LIBRARY': 'L',
        'ARCHIVE': 'A',
        'MUSEUM': 'M',
        'CULTURAL_CENTER': 'C',
        'RESEARCH_INSTITUTE': 'R',
        'CONSORTIUM': 'N',
        'GOVERNMENT_AGENCY': 'V',
        'MIXED': 'X'
    }

    def generate(
        self,
        name: str,
        institution_type: str,
        country: str,
        subdivision: Optional[str] = None,
        city_locode: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Generate GHCID from institution metadata.

        Args:
            name: Official institution name
            institution_type: Institution type enum value
            country: ISO 3166-1 alpha-2 country code
            subdivision: ISO 3166-2 subdivision code (optional)
            city_locode: UN/LOCODE for city (optional)

        Returns:
            GHCIDComponents with all fields populated
        """
        # Normalize inputs
        country = country.upper()
        subdivision = (subdivision or '00').upper()
        locode = (city_locode or 'XXX').upper()
        # Unknown institution types fall back to 'X' (Mixed/Other)
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        # Generate abbreviation from name
        abbreviation = self._generate_abbreviation(name)
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=locode,
            type_code=type_code,
            abbreviation=abbreviation
        )

    def _generate_abbreviation(self, name: str) -> str:
        """
        Generate 2-8 character abbreviation from institution name.

        Strategy:
        1. Extract words (split on spaces/punctuation)
        2. Take first letter of each significant word
        3. Skip stopwords (the, of, for, and, etc.)
        4. Maximum 8 characters
        5. Minimum 2 characters
        """
        stopwords = {'the', 'of', 'for', 'and', 'in', 'at', 'to', 'a', 'an'}
        # Split name into words, filter stopwords
        words = re.findall(r'\b\w+\b', name.lower())
        significant_words = [w for w in words if w not in stopwords]
        # Take first letter of each word
        abbrev = ''.join(w[0].upper() for w in significant_words[:8])
        # Ensure minimum 2 characters
        if len(abbrev) < 2:
            # Fallback: take up to the first 4 chars of the name
            abbrev = name[:4].upper().replace(' ', '')
        return abbrev[:8]  # Max 8 chars

    def from_isil(
        self,
        isil_code: str,
        institution_type: str,
        city_locode: str,
        subdivision: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Convert ISIL code to GHCID.

        Args:
            isil_code: ISIL code (e.g., 'NL-AsdRM')
            institution_type: Institution type enum value
            city_locode: UN/LOCODE for city
            subdivision: ISO 3166-2 code (if not in ISIL)

        Returns:
            GHCIDComponents
        """
        # Parse ISIL
        match = re.match(r'^([A-Z]{2})-(.+)$', isil_code)
        if not match:
            raise ValueError(f"Invalid ISIL code format: {isil_code}")
        country = match.group(1)
        local_code = match.group(2)
        # Extract abbreviation from ISIL local code
        # Many ISIL codes have pattern: {CityCode}{Abbreviation}
        # e.g., NL-AsdRM → Asd (Amsterdam) + RM (Rijksmuseum)
        abbrev = self._extract_isil_abbreviation(local_code)
        subdivision = subdivision or '00'
        type_code = self.TYPE_CODES.get(institution_type, 'X')
        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=city_locode,
            type_code=type_code,
            abbreviation=abbrev
        )

    def _extract_isil_abbreviation(self, local_code: str) -> str:
        """
        Extract abbreviation from ISIL local code.

        Heuristics:
        - If starts with 3-letter city code, take rest
        - If purely numeric, use first 8 digits
        - Otherwise, use full local code (max 8 chars)
        """
        # Drop characters a GHCID abbreviation cannot contain
        # (e.g. the inner hyphen in 'MUS-815314')
        cleaned = re.sub(r'[^A-Za-z0-9]', '', local_code)
        # Check if starts with likely city code (3 letters + rest)
        match = re.match(r'^[A-Za-z]{3}(.+)$', cleaned)
        if match:
            return match.group(1)[:8].upper()
        # If numeric, use first 8 digits
        if cleaned.isdigit():
            return cleaned[:8]
        # Otherwise use full code
        return cleaned[:8].upper()
```
### Phase 3: Update Parsers
Update all parsers to generate GHCID:
```python
# In isil_registry.py
from glam_extractor.identifiers.ghcid import GHCIDGenerator

generator = GHCIDGenerator()


def to_heritage_custodian(record: ISILRegistryRecord) -> HeritageCustodian:
    # Generate GHCID from ISIL
    ghcid_components = generator.from_isil(
        isil_code=record.isil_code,
        institution_type='MIXED',  # ISIL registry doesn't specify type
        city_locode=lookup_locode(record.plaats),  # Lookup UN/LOCODE
        subdivision=lookup_subdivision(record.plaats)  # Lookup ISO 3166-2
    )
    ghcid = ghcid_components.to_ghcid()
    ghcid_numeric = ghcid_components.to_numeric()
    return HeritageCustodian(
        id=ghcid,  # Primary key is now GHCID
        ghcid=ghcid,
        ghcid_numeric=ghcid_numeric,
        name=record.instelling,
        institution_type='MIXED',
        identifiers=[
            Identifier(
                identifier_scheme='GHCID',
                identifier_value=ghcid
            ),
            Identifier(
                identifier_scheme='ISIL',
                identifier_value=record.isil_code  # Preserved
            ),
            Identifier(
                identifier_scheme='GHCID_NUMERIC',
                identifier_value=str(ghcid_numeric)
            )
        ],
        # ... rest of fields
    )
```
### Phase 4: Lookup Tables
Create reference data for geocoding:
```python
# data/reference/nl_city_locodes.json
{
    "Amsterdam": {
        "locode": "AMS",
        "subdivision": "NH",  # Noord-Holland
        "geonames_id": "2759794"
    },
    "Rotterdam": {
        "locode": "RTM",
        "subdivision": "ZH",  # Zuid-Holland
        "geonames_id": "2747891"
    },
    # ... all Dutch cities
}

# data/reference/iso_3166_2_nl.json
{
    "NH": "Noord-Holland",
    "ZH": "Zuid-Holland",
    "UT": "Utrecht",
    # ... all provinces
}
```
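The Phase 3 parser calls `lookup_locode` and `lookup_subdivision` without defining them. A minimal sketch of how they might read the city table follows, assuming case-insensitive matching on the city name and the `XXX` / `00` fallbacks from the format specification. The table is inlined here so the example is self-contained; in practice it would be loaded from `data/reference/nl_city_locodes.json`.

```python
import json
from typing import Optional

# Inline stand-in for data/reference/nl_city_locodes.json.
CITY_TABLE = json.loads("""
{
  "Amsterdam": {"locode": "AMS", "subdivision": "NH", "geonames_id": "2759794"},
  "Rotterdam": {"locode": "RTM", "subdivision": "ZH", "geonames_id": "2747891"}
}
""")

# Case-insensitive index over city names.
_INDEX = {city.casefold(): entry for city, entry in CITY_TABLE.items()}


def _find(city: Optional[str]) -> Optional[dict]:
    return _INDEX.get((city or "").strip().casefold())


def lookup_locode(city: Optional[str]) -> str:
    """UN/LOCODE location part for a city, or 'XXX' if unknown."""
    entry = _find(city)
    return entry["locode"] if entry else "XXX"


def lookup_subdivision(city: Optional[str]) -> str:
    """ISO 3166-2 subdivision for a city, or '00' if unknown."""
    entry = _find(city)
    return entry["subdivision"] if entry else "00"


print(lookup_locode(" amsterdam "), lookup_subdivision("Rotterdam"))
print(lookup_locode("Atlantis"), lookup_subdivision(None))
```

Exact-name matching will miss spelling variants and historical names, so a real table would likely need aliases or a GeoNames-based fallback.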
### Phase 5: Cross-Linking with GHCID
Update cross-linking scripts to use GHCID:
```python
# crosslink_dutch_datasets.py (updated)
# Build lookup by GHCID (not ISIL)
isil_by_ghcid = {}
orgs_by_ghcid = {}
for custodian in isil_custodians:
    isil_by_ghcid[custodian.ghcid] = custodian
for custodian in dutch_custodians:
    orgs_by_ghcid[custodian.ghcid] = custodian

# Merge by GHCID; either record may be missing for a given GHCID
merged_records = []
all_ghcids = set(isil_by_ghcid.keys()) | set(orgs_by_ghcid.keys())
for ghcid in sorted(all_ghcids):
    isil_record = isil_by_ghcid.get(ghcid)
    orgs_record = orgs_by_ghcid.get(ghcid)
    merged = merge_custodians(isil_record, orgs_record, ghcid)
    merged_records.append(merged)
```
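The merge step itself is not specified in this document. As an illustration only, the sketch below shows one plausible policy for `merge_custodians`, using plain dictionaries instead of `HeritageCustodian` objects: the ISIL record wins on conflicting fields, empty values are filled from the other record, and identifier lists are combined without duplicates. The record layout and the example website are assumptions.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def merge_custodians(
    isil_record: Optional[Record],
    orgs_record: Optional[Record],
    ghcid: str,
) -> Record:
    """Merge two optional records that share a GHCID; first record wins."""
    merged: Record = {"ghcid": ghcid, "identifiers": []}
    for record in (isil_record, orgs_record):
        if record is None:
            continue
        for key, value in record.items():
            if key == "identifiers":
                # Union of identifiers, preserving order
                for ident in value:
                    if ident not in merged["identifiers"]:
                        merged["identifiers"].append(ident)
            elif value not in (None, "") and key not in merged:
                merged[key] = value
    return merged


isil = {"name": "Rijksmuseum", "identifiers": [("ISIL", "NL-AsdRM")]}
orgs = {
    "name": "Rijksmuseum Amsterdam",
    "website": "https://example.org",  # placeholder URL
    "identifiers": [("ISIL", "NL-AsdRM")],
}
print(merge_custodians(isil, orgs, "NL-NH-AMS-M-RM"))
```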
---
## Validation and Testing
### Unit Tests
```python
import re

from glam_extractor.identifiers.ghcid import GHCIDComponents, GHCIDGenerator


def test_ghcid_generation():
    generator = GHCIDGenerator()
    # Test Rijksmuseum: a single-word name yields one initial, so the
    # generator falls back to the first four characters of the name
    ghcid = generator.generate(
        name="Rijksmuseum",
        institution_type="MUSEUM",
        country="NL",
        subdivision="NH",
        city_locode="AMS"
    )
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RIJK"


def test_isil_to_ghcid_conversion():
    generator = GHCIDGenerator()
    ghcid = generator.from_isil(
        isil_code="NL-AsdRM",
        institution_type="MUSEUM",
        city_locode="AMS",
        subdivision="NH"
    )
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RM"


def test_ghcid_numeric_deterministic():
    comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    assert comp1.to_numeric() == comp2.to_numeric()


def test_ghcid_pattern_validation():
    valid_ghcids = [
        "NL-NH-AMS-M-RM",
        "US-NY-NYC-M-MOMA",
        "BR-RJ-RIO-L-BNB",
        "GB-ENG-LON-L-BL"
    ]
    pattern = r'^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
    for ghcid in valid_ghcids:
        assert re.match(pattern, ghcid)
```
---
## Migration Path
### For Existing ISIL Data (364 Dutch ISIL records)
1. ✅ Parse ISIL codes (already done)
2. 🔄 Lookup UN/LOCODE for each city (create lookup table)
3. 🔄 Lookup ISO 3166-2 subdivision codes (create lookup table)
4. 🔄 Generate GHCID from ISIL + lookups
5. ✅ Store ISIL as secondary identifier
### For Non-ISIL Data (1,004 Dutch orgs without ISIL)
1. ✅ Parse organization data (already done)
2. 🔄 Extract city from address
3. 🔄 Lookup UN/LOCODE for city
4. ✅ Determine institution type (already done)
5. 🔄 Generate abbreviation from name
6. 🔄 Create GHCID (no ISIL to preserve)
### For Conversation Data (139 files, 2,000-5,000 institutions)
1. ⏳ Extract institution name, type, location (NLP)
2. 🔄 Geocode location → UN/LOCODE
3. 🔄 Generate GHCID
4. 🔄 Check if ISIL exists (cross-reference)
5. 🔄 Store ISIL if found, otherwise GHCID-only
---
## Benefits Summary
| Feature | ISIL-Only | GHCID |
|---------|-----------|-------|
| **Global coverage** | Limited (requires registration) | Universal (any institution) |
| **Geographic structure** | Inconsistent | Standardized hierarchy |
| **Type indication** | No | Yes (1-char code) |
| **Human-readable** | Sometimes | Always |
| **Numeric format** | Not standard | Deterministic hash |
| **Non-ISIL orgs** | Excluded | Included |
| **Backward compatible** | N/A | Yes (ISIL preserved) |
| **Validation** | Registry lookup | Pattern (regex) check |
| **Collision risk** | Low (registry) | Low for numeric IDs (64-bit truncated SHA256) |
---
## Next Steps
### Immediate (Week 1)
1. Create `src/glam_extractor/identifiers/ghcid.py` module
2. Build UN/LOCODE lookup tables for Dutch cities
3. Build ISO 3166-2 lookup for Dutch provinces
4. Write tests for GHCID generation and conversion
### Short-term (Week 2)
5. Update schema to use GHCID as primary key
6. Update ISIL parser to generate GHCID
7. Update Dutch orgs parser to generate GHCID
8. Re-run cross-linking with GHCID
### Medium-term (Week 3-4)
9. Expand lookup tables to cover all conversation countries (60+)
10. Implement GHCID generation in conversation parser
11. Create GHCID validation and normalization tools
12. Build GHCID → ISIL reverse lookup service
---
## References
- **ISO 3166-1**: https://www.iso.org/iso-3166-country-codes.html
- **ISO 3166-2**: https://www.iso.org/standard/72483.html
- **UN/LOCODE**: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
- **ISIL Standard**: https://www.iso.org/standard/77849.html
- **GeoNames**: https://www.geonames.org/
---
**Recommendation**: Adopt GHCID as the primary identifier system and preserve ISIL codes as secondary identifiers for backward compatibility.