# Global Heritage Custodian Identifier System (GHCID)

**Version**: 0.1.0
**Status**: Design Proposal
**Date**: 2025-11-05

---

## Overview

A globally scalable, persistent identifier system for heritage custodians that:

1. **Unifies existing ISIL codes** into a consistent format
2. **Includes non-ISIL institutions** worldwide
3. **Provides human-readable and machine-readable formats**
4. **Supports hierarchical geographic organization**
5. **Enables hash-based numeric identifiers** for systems requiring numeric IDs

---

## Identifier Format

### Human-Readable Format (Primary)

```
{ISO 3166-1 alpha-2}-{ISO 3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}
```

**Components**:

1. **ISO 3166-1 alpha-2** (2 chars): Country code
   - Examples: `NL`, `BR`, `US`, `JP`, `FR`
   - Standard: https://www.iso.org/iso-3166-country-codes.html

2. **ISO 3166-2** (1-3 chars): Subdivision code (province, state, region), i.e. the part after the hyphen in the full ISO 3166-2 code (`NH` from `NL-NH`)
   - Examples: `NH` (Noord-Holland), `CA` (California), `SP` (São Paulo)
   - Use `00` for national-level institutions
   - Standard: https://www.iso.org/standard/72483.html

3. **UN/LOCODE** (3 chars): City/location code, i.e. the location part of the UN/LOCODE without the country prefix
   - Examples: `AMS` (Amsterdam), `NYC` (New York), `RIO` (Rio de Janeiro)
   - Use `XXX` for region-level or unknown locations
   - Standard: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory

4. **Type** (1 char): Institution type
   - `G` = Gallery
   - `L` = Library
   - `A` = Archive
   - `M` = Museum
   - `C` = Cultural Center
   - `R` = Research Institute
   - `N` = Consortium/Network
   - `V` = Government Agency
   - `X` = Mixed/Other

5. **Abbreviation** (2-8 chars): Official name abbreviation
   - Use first letters of official international name
   - Maximum 8 characters
   - Uppercase, alphanumeric only

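
The component rules above can be checked mechanically. The following is a minimal sketch, not part of the proposal itself: `GHCID_PATTERN`, `ParsedGHCID`, and `parse_ghcid` are hypothetical names, and the regular expression is an assumption derived from the component lengths listed above (including 1-3 characters for the subdivision).

```python
import re
from typing import NamedTuple

# Assumed pattern: one named group per component, lengths taken from the list above.
GHCID_PATTERN = re.compile(
    r"(?P<country>[A-Z]{2})"
    r"-(?P<subdivision>[A-Z0-9]{1,3})"
    r"-(?P<locode>[A-Z]{3})"
    r"-(?P<type_code>[GLAMCRNVX])"
    r"-(?P<abbreviation>[A-Z0-9]{2,8})"
)


class ParsedGHCID(NamedTuple):
    country: str
    subdivision: str
    locode: str
    type_code: str
    abbreviation: str


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split a GHCID into its five components, or raise ValueError."""
    match = GHCID_PATTERN.fullmatch(ghcid)
    if match is None:
        raise ValueError(f"Not a valid GHCID: {ghcid!r}")
    return ParsedGHCID(**match.groupdict())


parsed = parse_ghcid("NL-NH-AMS-M-RM")
print(parsed.locode, parsed.type_code)  # AMS M
```

Because the abbreviation is alphanumeric only, the hyphen is an unambiguous separator and the five components can always be recovered from the string.
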
### Examples

```
# National Archives of the Netherlands
NL-00-XXX-A-NAN

# Rijksmuseum Amsterdam
NL-NH-AMS-M-RM

# Biblioteca Nacional do Brasil (Rio de Janeiro)
BR-RJ-RIO-L-BNB

# Museum of Modern Art (New York)
US-NY-NYC-M-MOMA

# British Library (London); ISO 3166-2 code for England is ENG
GB-ENG-LON-L-BL

# Louvre Museum (Paris); ISO 3166-2 code for Île-de-France is IDF
FR-IDF-PAR-M-LM

# Mixed institution (Library + Archive + Museum in Utrecht)
NL-UT-UTC-X-RHCU
```

### Hash-Based Numeric Format (Secondary)

For systems requiring numeric-only identifiers:

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """
    Convert GHCID to deterministic numeric identifier.

    Uses SHA256 hash truncated to 64 bits (unsigned integer).
    Range: 0 to 18,446,744,073,709,551,615
    """
    hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
    return int.from_bytes(hash_bytes[:8], byteorder='big')


# Example
ghcid = "NL-NH-AMS-M-RM"  # Rijksmuseum
numeric_id = ghcid_to_numeric(ghcid)
# numeric_id is an integer in [0, 2**64 - 1]; the same GHCID always yields the same value
```

**Properties**:
- Deterministic (same GHCID always produces same numeric ID)
- Collision-resistant at realistic scale: the SHA256 digest is truncated to 64 bits, so by the birthday bound accidental collisions only become likely as the number of identifiers approaches 2^32 (about 4 billion)
- Reversible only via lookup table (store GHCID ↔ numeric mapping); the hash itself cannot be inverted
- Fits in an unsigned 64-bit integer; systems limited to signed 64-bit integers or to IEEE 754 doubles (for example JSON consumers in JavaScript) need a signed reinterpretation or a string encoding

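
To put a number on the collision property, the birthday approximation can be evaluated directly. This is a rough sketch that assumes the truncated hash values behave like uniformly random 64-bit numbers; `collision_probability` is a hypothetical helper, not part of the proposed module.

```python
import math


def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate probability that at least two of n random IDs collide.

    Birthday approximation: p ≈ 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n * (n - 1) / (2 * space))


# One million institutions: on the order of 1e-8
print(collision_probability(1_000_000))
# 2**32 identifiers: collisions become likely
print(collision_probability(2 ** 32))
```
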
---

## ISIL Code Migration

### Mapping ISIL to GHCID

Existing ISIL codes can be converted programmatically, given the institution type and lookup data for the subdivision and city:

```python
def isil_to_ghcid(isil_code: str, institution_type: str, city_locode: str) -> str:
    """
    Convert ISIL code to GHCID format.

    ISIL format: {Country}-{Local Code}
    Example: NL-AsdRM (Rijksmuseum Amsterdam)
    """
    country = isil_code[:2]  # NL

    # Parse local code for geographic hints
    # Many ISIL codes encode city: NL-AsdRM (Asd = Amsterdam)
    # This requires lookup table or heuristics

    # For now, extract from known patterns or use lookup.
    # extract_subdivision, institution_type_to_code and
    # extract_abbreviation_from_isil are placeholder helpers not defined here.
    subdivision = extract_subdivision(isil_code)  # NH (Noord-Holland)
    locode = city_locode  # AMS (Amsterdam)
    type_code = institution_type_to_code(institution_type)  # M
    abbrev = extract_abbreviation_from_isil(isil_code)  # RM

    return f"{country}-{subdivision}-{locode}-{type_code}-{abbrev}"


# Example conversions
# ISIL: NL-AsdRM     → GHCID: NL-NH-AMS-M-RM
# ISIL: US-DLC       → GHCID: US-DC-WAS-L-LC   (Library of Congress)
# ISIL: FR-751131015 → GHCID: FR-IDF-PAR-L-BNF (Bibliothèque nationale de France)
# ISIL: GB-UkOxU     → GHCID: GB-ENG-OXF-L-BL  (Bodleian Library)
```

### ISIL Preservation

**Important**: Store the original ISIL code as an additional identifier on each record:

```python
custodian = HeritageCustodian(
    id="NL-NH-AMS-M-RM",  # Primary GHCID
    name="Rijksmuseum",
    identifiers=[
        Identifier(
            identifier_scheme="GHCID",
            identifier_value="NL-NH-AMS-M-RM"
        ),
        Identifier(
            identifier_scheme="ISIL",
            identifier_value="NL-AsdRM"  # Preserved!
        ),
        Identifier(
            identifier_scheme="GHCID_NUMERIC",
            # Placeholder value; in practice str(ghcid_to_numeric("NL-NH-AMS-M-RM"))
            identifier_value="12345678901234567"
        )
    ]
)
```

**Benefits**:
- Maintains backward compatibility with ISIL systems
- Enables cross-referencing with existing ISIL registries
- Provides migration path for ISIL-dependent workflows

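
Cross-referencing then reduces to reading the preserved identifier back by scheme. A minimal sketch follows; the `Identifier` dataclass here is a hypothetical stand-in for the project's own class, and `find_identifier` is an assumed helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Identifier:
    """Hypothetical stand-in with the two fields used in this document."""
    identifier_scheme: str
    identifier_value: str


def find_identifier(identifiers: Iterable[Identifier], scheme: str) -> Optional[str]:
    """Return the first value registered under `scheme`, or None if absent."""
    for identifier in identifiers:
        if identifier.identifier_scheme == scheme:
            return identifier.identifier_value
    return None


rijksmuseum_ids = [
    Identifier("GHCID", "NL-NH-AMS-M-RM"),
    Identifier("ISIL", "NL-AsdRM"),
]
print(find_identifier(rijksmuseum_ids, "ISIL"))  # NL-AsdRM
```
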
---

## Benefits Over ISIL-Only Approach

### 1. Global Coverage

**ISIL Limitations**:
- Not all countries have ISIL registries
- Registration requires bureaucratic process
- Many small/local institutions lack ISIL codes

**GHCID Advantages**:
- Assign identifiers to **any** heritage institution worldwide
- No registration required (generated deterministically)
- Covers 1,004 Dutch organizations currently without ISIL codes
- Enables grassroots/community heritage organizations

### 2. Geographic Hierarchy

**ISIL**: Flat structure, geographic info encoded inconsistently
**GHCID**: Structured hierarchy (Country → Region → City)

**Use Cases**:
- Query all museums in Amsterdam: `NL-NH-AMS-M-*`
- Query all heritage orgs in São Paulo state: `BR-SP-*-*-*`
- Aggregate statistics by region
- Geocoding lookups

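
The wildcard queries above can be tried on an in-memory list with the standard library's `fnmatch` module. This is only a sketch: `query_ghcids` is a hypothetical helper, and the São Paulo entry is an illustrative identifier rather than one taken from a registry.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def query_ghcids(pattern: str, ghcids: Iterable[str]) -> List[str]:
    """Return the GHCIDs that match a shell-style wildcard pattern."""
    return [g for g in ghcids if fnmatchcase(g, pattern)]


sample = [
    "NL-NH-AMS-M-RM",
    "NL-UT-UTC-X-RHCU",
    "BR-RJ-RIO-L-BNB",
    "BR-SP-SAO-M-MASP",  # illustrative entry, not from a registry
]

print(query_ghcids("NL-NH-AMS-M-*", sample))  # ['NL-NH-AMS-M-RM']
print(query_ghcids("BR-SP-*-*-*", sample))    # ['BR-SP-SAO-M-MASP']
```

In a database the same queries would more likely be expressed as prefix matches on the identifier column, which can use an ordinary index.
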
### 3. Institution Type in ID

**ISIL**: No type indicator (requires separate lookup)
**GHCID**: Type encoded in ID (`-M-`, `-L-`, `-A-`)

**Use Cases**:
- Filter by type without database lookup
- Validate type consistency
- Build type-specific indexes

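
Reading the type straight from the identifier might look like the sketch below. `ghcid_type` and `count_by_type` are hypothetical helpers; the code-to-name table repeats the type codes defined in this document.

```python
from collections import Counter
from typing import Iterable

# Type codes as defined in the Identifier Format section.
TYPE_NAMES = {
    "G": "Gallery", "L": "Library", "A": "Archive", "M": "Museum",
    "C": "Cultural Center", "R": "Research Institute",
    "N": "Consortium/Network", "V": "Government Agency", "X": "Mixed/Other",
}


def ghcid_type(ghcid: str) -> str:
    """Return the one-character type code (the fourth component)."""
    return ghcid.split("-")[3]


def count_by_type(ghcids: Iterable[str]) -> Counter:
    """Count institutions per type name without any database lookup."""
    return Counter(TYPE_NAMES.get(ghcid_type(g), "Unknown") for g in ghcids)


print(ghcid_type("NL-NH-AMS-M-RM"))  # M
```
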
### 4. Human Readability

**ISIL Examples**:
- `NL-AsdRM` → Readable (Amsterdam, RM)
- `FR-751131015` → Opaque numeric code
- `DE-MUS-815314` → Mixed format

**GHCID Examples** (all readable):
- `NL-NH-AMS-M-RM` → Netherlands, Noord-Holland, Amsterdam, Museum, Rijksmuseum
- `FR-IDF-PAR-L-BNF` → France, Île-de-France, Paris, Library, BNF
- `BR-RJ-RIO-L-BNB` → Brazil, Rio de Janeiro, Rio, Library, BNB

### 5. Compatibility with Existing Systems

**Numeric Systems** (e.g., VIAF, databases requiring int64):
- GHCID → Hash → Numeric ID (deterministic)
- Store mapping table for reverse lookup

**ISIL Systems**:
- Store ISIL as secondary identifier
- Maintain bidirectional mapping
- Enable gradual migration

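
For stores that only offer a signed 64-bit integer type, the unsigned hash value can be reinterpreted as two's complement without losing information, and the reverse lookup is a plain mapping. The sketch below is an assumption about how this could be done: `to_signed_int64`, `to_unsigned_int64`, and `reverse_lookup` are hypothetical names, while `ghcid_to_numeric` repeats the function defined earlier in this document.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """First 8 bytes of the SHA-256 digest as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


def to_signed_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def to_unsigned_int64(value: int) -> int:
    """Inverse of to_signed_int64."""
    return value & ((1 << 64) - 1)


# Reverse lookup table: numeric ID -> GHCID
ghcids = ["NL-NH-AMS-M-RM", "BR-RJ-RIO-L-BNB"]
reverse_lookup = {ghcid_to_numeric(g): g for g in ghcids}

stored = to_signed_int64(ghcid_to_numeric("NL-NH-AMS-M-RM"))  # value for a signed column
print(reverse_lookup[to_unsigned_int64(stored)])  # NL-NH-AMS-M-RM
```
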
---

## Implementation Strategy

### Phase 1: Schema Update

Update `heritage_custodian.yaml`:

```yaml
classes:
  HeritageCustodian:
    slots:
      - id             # Now uses GHCID format
      - ghcid          # Explicit GHCID field
      - ghcid_numeric  # Hash-based numeric version
      - identifiers    # List includes ISIL, Wikidata, etc.
    slot_usage:
      id:
        description: Primary identifier in GHCID format
        # Subdivision allows 1-3 characters, matching the component definition
        pattern: '^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'
      ghcid:
        description: Explicit GHCID in human-readable format
        required: true
      ghcid_numeric:
        description: Deterministic numeric hash of GHCID
        range: integer
```

### Phase 2: GHCID Generator Module

Create `src/glam_extractor/identifiers/ghcid.py`:

```python
from dataclasses import dataclass
from typing import Optional
import hashlib
import re


@dataclass
class GHCIDComponents:
    country: str        # ISO 3166-1 alpha-2 (2 chars)
    subdivision: str    # ISO 3166-2 (1-3 chars, or '00')
    locode: str         # UN/LOCODE (3 chars, or 'XXX')
    type_code: str      # Institution type (1 char)
    abbreviation: str   # Name abbreviation (2-8 chars)

    def to_ghcid(self) -> str:
        return f"{self.country}-{self.subdivision}-{self.locode}-{self.type_code}-{self.abbreviation}"

    def to_numeric(self) -> int:
        ghcid = self.to_ghcid()
        hash_bytes = hashlib.sha256(ghcid.encode('utf-8')).digest()
        return int.from_bytes(hash_bytes[:8], byteorder='big')


class GHCIDGenerator:
    """Generate Global Heritage Custodian Identifiers"""

    TYPE_CODES = {
        'GALLERY': 'G',
        'LIBRARY': 'L',
        'ARCHIVE': 'A',
        'MUSEUM': 'M',
        'CULTURAL_CENTER': 'C',
        'RESEARCH_INSTITUTE': 'R',
        'CONSORTIUM': 'N',
        'GOVERNMENT_AGENCY': 'V',
        'MIXED': 'X'
    }

    def generate(
        self,
        name: str,
        institution_type: str,
        country: str,
        subdivision: Optional[str] = None,
        city_locode: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Generate GHCID from institution metadata.

        Args:
            name: Official institution name
            institution_type: Institution type enum value
            country: ISO 3166-1 alpha-2 country code
            subdivision: ISO 3166-2 subdivision code (optional)
            city_locode: UN/LOCODE for city (optional)

        Returns:
            GHCIDComponents with all fields populated
        """
        # Normalize inputs
        country = country.upper()
        subdivision = (subdivision or '00').upper()
        locode = (city_locode or 'XXX').upper()
        type_code = self.TYPE_CODES.get(institution_type, 'X')

        # Generate abbreviation from name
        abbreviation = self._generate_abbreviation(name)

        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=locode,
            type_code=type_code,
            abbreviation=abbreviation
        )

    def _generate_abbreviation(self, name: str) -> str:
        """
        Generate 2-8 character abbreviation from institution name.

        Strategy:
        1. Extract words (split on spaces/punctuation)
        2. Take first letter of each significant word
        3. Skip stopwords (the, of, for, and, etc.)
        4. Maximum 8 characters
        5. Minimum 2 characters (otherwise fall back to the start of the name)
        """
        stopwords = {'the', 'of', 'for', 'and', 'in', 'at', 'to', 'a', 'an'}

        # Split name into words, filter stopwords
        words = re.findall(r'\b\w+\b', name.lower())
        significant_words = [w for w in words if w not in stopwords]

        # Take first letter of each word
        abbrev = ''.join(w[0].upper() for w in significant_words[:8])

        # Ensure minimum 2 characters
        if len(abbrev) < 2:
            # Fallback: first 4 characters of the name, spaces removed
            # (e.g. "Rijksmuseum" -> "RIJK")
            abbrev = name[:4].upper().replace(' ', '')

        return abbrev[:8]  # Max 8 chars

    def from_isil(
        self,
        isil_code: str,
        institution_type: str,
        city_locode: str,
        subdivision: Optional[str] = None
    ) -> GHCIDComponents:
        """
        Convert ISIL code to GHCID.

        Args:
            isil_code: ISIL code (e.g., 'NL-AsdRM')
            institution_type: Institution type enum value
            city_locode: UN/LOCODE for city
            subdivision: ISO 3166-2 code (if not in ISIL)

        Returns:
            GHCIDComponents
        """
        # Parse ISIL
        match = re.match(r'^([A-Z]{2})-(.+)$', isil_code)
        if not match:
            raise ValueError(f"Invalid ISIL code format: {isil_code}")

        country = match.group(1)
        local_code = match.group(2)

        # Extract abbreviation from ISIL local code
        # Many ISIL codes have pattern: {CityCode}{Abbreviation}
        # e.g., NL-AsdRM → Asd (Amsterdam) + RM (Rijksmuseum)
        abbrev = self._extract_isil_abbreviation(local_code)

        subdivision = subdivision or '00'
        type_code = self.TYPE_CODES.get(institution_type, 'X')

        return GHCIDComponents(
            country=country,
            subdivision=subdivision,
            locode=city_locode,
            type_code=type_code,
            abbreviation=abbrev
        )

    def _extract_isil_abbreviation(self, local_code: str) -> str:
        """
        Extract abbreviation from ISIL local code.

        Heuristics:
        - If starts with 3-letter city code, take rest
        - If purely numeric, use first 8 digits
        - Otherwise, use full local code (max 8 chars)
        """
        # ISIL local codes may contain '-', '/' or ':' (e.g. DE-MUS-815314);
        # GHCID abbreviations are alphanumeric only, so drop other characters.
        local_code = re.sub(r'[^A-Za-z0-9]', '', local_code)

        # Check if starts with likely city code (3 letters, any case, + rest)
        match = re.match(r'^[A-Za-z]{3}(.+)$', local_code)
        if match:
            return match.group(1)[:8].upper()

        # If numeric, use first 8 digits
        if local_code.isdigit():
            return local_code[:8]

        # Otherwise use full code
        return local_code[:8].upper()
```

### Phase 3: Update Parsers

Update all parsers to generate GHCID:

```python
# In isil_registry.py
from glam_extractor.identifiers.ghcid import GHCIDGenerator

generator = GHCIDGenerator()


def to_heritage_custodian(record: ISILRegistryRecord) -> HeritageCustodian:
    # Generate GHCID from ISIL
    ghcid_components = generator.from_isil(
        isil_code=record.isil_code,
        institution_type='MIXED',  # ISIL registry doesn't specify type
        city_locode=lookup_locode(record.plaats),  # Lookup UN/LOCODE
        subdivision=lookup_subdivision(record.plaats)  # Lookup ISO 3166-2
    )

    ghcid = ghcid_components.to_ghcid()
    ghcid_numeric = ghcid_components.to_numeric()

    return HeritageCustodian(
        id=ghcid,  # Primary key is now GHCID
        ghcid=ghcid,
        ghcid_numeric=ghcid_numeric,
        name=record.instelling,
        institution_type='MIXED',
        identifiers=[
            Identifier(
                identifier_scheme='GHCID',
                identifier_value=ghcid
            ),
            Identifier(
                identifier_scheme='ISIL',
                identifier_value=record.isil_code  # Preserved
            ),
            Identifier(
                identifier_scheme='GHCID_NUMERIC',
                identifier_value=str(ghcid_numeric)
            )
        ],
        # ... rest of fields
    )
```

### Phase 4: Lookup Tables

Create reference data for geocoding. The listings below are annotated for readability; the JSON files themselves cannot contain `#` comments:

```python
# data/reference/nl_city_locodes.json
{
    "Amsterdam": {
        "locode": "AMS",
        "subdivision": "NH",  # Noord-Holland
        "geonames_id": "2759794"
    },
    "Rotterdam": {
        "locode": "RTM",
        "subdivision": "ZH",  # Zuid-Holland
        "geonames_id": "2747891"
    },
    # ... all Dutch cities
}

# data/reference/iso_3166_2_nl.json
{
    "NH": "Noord-Holland",
    "ZH": "Zuid-Holland",
    "UT": "Utrecht",
    # ... all provinces
}
```

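
The `lookup_locode` and `lookup_subdivision` helpers used by the parsers are not defined in this document. One possible shape is sketched below under assumptions: the table is inlined instead of being loaded from `data/reference/nl_city_locodes.json`, and matching is a simple case-insensitive comparison on the city name.

```python
from typing import Dict, Optional

# Illustrative subset of the city table shown above.
NL_CITY_LOCODES: Dict[str, Dict[str, str]] = {
    "Amsterdam": {"locode": "AMS", "subdivision": "NH", "geonames_id": "2759794"},
    "Rotterdam": {"locode": "RTM", "subdivision": "ZH", "geonames_id": "2747891"},
}


def _normalize(city: str) -> str:
    """Normalize a city name for case-insensitive matching."""
    return city.strip().casefold()


# Index by normalized name so lookups tolerate case and stray whitespace.
_CITY_INDEX = {_normalize(name): entry for name, entry in NL_CITY_LOCODES.items()}


def lookup_locode(city: str) -> Optional[str]:
    """Return the UN/LOCODE location part for a city, or None if unknown."""
    entry = _CITY_INDEX.get(_normalize(city))
    return entry["locode"] if entry else None


def lookup_subdivision(city: str) -> Optional[str]:
    """Return the ISO 3166-2 subdivision part for a city, or None if unknown."""
    entry = _CITY_INDEX.get(_normalize(city))
    return entry["subdivision"] if entry else None


print(lookup_locode(" amsterdam "), lookup_subdivision("Rotterdam"))  # AMS ZH
```

Returning `None` for unknown cities fits `GHCIDGenerator.generate()`, which substitutes `XXX` and `00` for missing values.
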
### Phase 5: Cross-Linking with GHCID

Update cross-linking scripts to use GHCID:

```python
# crosslink_dutch_datasets.py (updated)

# Build lookup by GHCID (not ISIL)
isil_by_ghcid = {}
orgs_by_ghcid = {}

for custodian in isil_custodians:
    isil_by_ghcid[custodian.ghcid] = custodian

for custodian in dutch_custodians:
    orgs_by_ghcid[custodian.ghcid] = custodian

# Merge by GHCID
# Note: this only links records if both sources yield the same GHCID
# for the same institution.
all_ghcids = set(isil_by_ghcid.keys()) | set(orgs_by_ghcid.keys())

merged_records = []
for ghcid in sorted(all_ghcids):
    isil_record = isil_by_ghcid.get(ghcid)
    orgs_record = orgs_by_ghcid.get(ghcid)
    merged = merge_custodians(isil_record, orgs_record, ghcid)
    merged_records.append(merged)
```

---

## Validation and Testing

### Unit Tests

```python
import re

from glam_extractor.identifiers.ghcid import GHCIDComponents, GHCIDGenerator


def test_ghcid_generation():
    generator = GHCIDGenerator()

    # Test Rijksmuseum
    ghcid = generator.generate(
        name="Rijksmuseum",
        institution_type="MUSEUM",
        country="NL",
        subdivision="NH",
        city_locode="AMS"
    )
    # A single-word name yields a 1-letter abbreviation, so the generator
    # falls back to the first 4 characters of the name.
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RIJK"


def test_isil_to_ghcid_conversion():
    generator = GHCIDGenerator()

    ghcid = generator.from_isil(
        isil_code="NL-AsdRM",
        institution_type="MUSEUM",
        city_locode="AMS",
        subdivision="NH"
    )
    assert ghcid.to_ghcid() == "NL-NH-AMS-M-RM"


def test_ghcid_numeric_deterministic():
    comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
    comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

    assert comp1.to_numeric() == comp2.to_numeric()


def test_ghcid_pattern_validation():
    valid_ghcids = [
        "NL-NH-AMS-M-RM",
        "US-NY-NYC-M-MOMA",
        "BR-RJ-RIO-L-BNB",
        "GB-ENG-LON-L-BL"
    ]
    pattern = r'^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[GLAMNRCVX]-[A-Z0-9]{2,8}$'

    for ghcid in valid_ghcids:
        assert re.match(pattern, ghcid)
```

---

## Migration Path

### For Existing ISIL Data (364 Dutch ISIL records)

1. ✅ Parse ISIL codes (already done)
2. 🔄 Lookup UN/LOCODE for each city (create lookup table)
3. 🔄 Lookup ISO 3166-2 subdivision codes (create lookup table)
4. 🔄 Generate GHCID from ISIL + lookups
5. ✅ Store ISIL as secondary identifier

### For Non-ISIL Data (1,004 Dutch orgs without ISIL)

1. ✅ Parse organization data (already done)
2. 🔄 Extract city from address
3. 🔄 Lookup UN/LOCODE for city
4. ✅ Determine institution type (already done)
5. 🔄 Generate abbreviation from name
6. 🔄 Create GHCID (no ISIL to preserve)

### For Conversation Data (139 files, 2,000-5,000 institutions)

1. ⏳ Extract institution name, type, location (NLP)
2. 🔄 Geocode location → UN/LOCODE
3. 🔄 Generate GHCID
4. 🔄 Check if ISIL exists (cross-reference)
5. 🔄 Store ISIL if found, otherwise GHCID-only

---

## Benefits Summary

| Feature | ISIL-Only | GHCID |
|---------|-----------|-------|
| **Global coverage** | Limited (requires registration) | Universal (any institution) |
| **Geographic structure** | Inconsistent | Standardized hierarchy |
| **Type indication** | No | Yes (1-char code) |
| **Human-readable** | Sometimes | Always |
| **Numeric format** | Not standard | Deterministic hash |
| **Non-ISIL orgs** | Excluded | Included |
| **Backward compatible** | N/A | Yes (ISIL preserved) |
| **Validation** | Registry lookup | Pattern match (regex) |
| **Collision risk** | Low (registry) | Very low for numeric IDs (SHA256 truncated to 64 bits) |

---

## Next Steps

### Immediate (Week 1)
1. Create `src/glam_extractor/identifiers/ghcid.py` module
2. Build UN/LOCODE lookup tables for Dutch cities
3. Build ISO 3166-2 lookup for Dutch provinces
4. Write tests for GHCID generation and conversion

### Short-term (Week 2)
5. Update schema to use GHCID as primary key
6. Update ISIL parser to generate GHCID
7. Update Dutch orgs parser to generate GHCID
8. Re-run cross-linking with GHCID

### Medium-term (Week 3-4)
9. Expand lookup tables to cover all conversation countries (60+)
10. Implement GHCID generation in conversation parser
11. Create GHCID validation and normalization tools
12. Build GHCID → ISIL reverse lookup service

---

## References

- **ISO 3166-1**: https://www.iso.org/iso-3166-country-codes.html
- **ISO 3166-2**: https://www.iso.org/standard/72483.html
- **UN/LOCODE**: https://unece.org/trade/cefact/unlocode-code-list-country-and-territory
- **ISIL Standard**: https://www.iso.org/standard/77849.html
- **GeoNames**: https://www.geonames.org/

---

**Recommendation**: Adopt GHCID as the primary identifier system and preserve ISIL codes as secondary identifiers for backward compatibility.