# Persistent Identifiers for Heritage Institutions

## Overview

The GLAM Data Extraction project uses **multiple identifier formats**, each optimized for a different purpose:

### Persistent Identifiers (Deterministic)

These can be regenerated from the GHCID string and are stable across systems:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v5** | 128 | SHA-1 | **PRIMARY** - Europeana, DPLA, IIIF, Wikidata | RFC 4122 Standard |
| **UUID SHA-256** | 128 | SHA-256 | **SOTA** - Security compliance, future-proofing | RFC 9562 (UUID v8) |
| **Numeric** | 64 | SHA-256 | CSV exports, numeric analysis | Internal |
| **Human-readable** | Variable | ISO format | Citations, documentation | ISO-based |

### Database Record Identifiers (Non-Deterministic)

These are generated once per record and optimize database performance:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v7** | 128 | Timestamp + Random | Database PKs, time-ordered queries | RFC 9562 Standard |
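
The "Timestamp + Random" layout of UUID v7 can be illustrated with a short sketch. This is a minimal, hand-written example that follows the RFC 9562 field layout (48-bit Unix time in milliseconds, version nibble, variant bits, random fill); the function name `uuid7_sketch` is hypothetical and is not part of the project's code.

```python
import os
import time
import uuid


def uuid7_sketch() -> uuid.UUID:
    """Build a UUID v7-style value: 48-bit Unix ms timestamp + random bits."""
    unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    # Timestamp occupies the top 48 bits; random data fills the remaining 80.
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Overwrite the version nibble (bits 76-79) with 7.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Overwrite the variant (bits 62-63) with 0b10.
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


record_id = uuid7_sketch()
print(record_id, record_id.version)
```

Because the timestamp sits in the most significant bits, values generated later sort after earlier ones at millisecond granularity, which is what makes this format friendly to B-tree primary-key indexes.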

## Why Four Formats?

### 1. **UUID v5 (SHA-1)** - Interoperability Standard ⭐ PRIMARY
```
Format: 550e8400-e29b-41d4-a716-446655440000 (layout placeholder; a real v5 value has 5 as the first digit of the third group)
Version: 5 (name-based, SHA-1)
Standard: RFC 4122 (2005)
```

**✅ Strengths:**
- **RFC 4122 compliant** - Universal library support
- **Deterministic** - Same GHCID and namespace → same UUID, always (content-addressed)
- **Transparent** - Publicly documented algorithm, anyone can verify
- **Interoperable** - Works with Europeana, DPLA, IIIF, Wikidata
- **128-bit identifier space** - P(collision) ≈ 1.5×10^-27 for 1M institutions (≈ 9.4×10^-26 if only the 122 hash-derived bits are counted)

**⚠️ SHA-1 Nuance:**
- Uses SHA-1 internally (RFC 4122 specification)
- SHA-1 is deprecated for **cryptographic security** (digital signatures, TLS, passwords)
- SHA-1 remains **appropriate for identifier generation** in a non-adversarial setting, where only accidental collisions matter
- See [Why GHCID Uses UUID v5 and SHA-1](WHY_UUID_V5_SHA1.md) for detailed rationale

**Why SHA-1 is Safe for GHCID:**
```
Cryptographic Use (Vulnerable):
- Adversarial context (attacker forges signatures)
- Two-message collision attack
- Security-critical (financial, authentication)

Identifier Use (Safe):
- Non-adversarial context (no one forges museum IDs)
- Single-source generation (we control inputs)
- Uniqueness requirement (birthday paradox protection sufficient)
```

**Use When:**
- **Primary identifier** for all GHCID records
- Integrating with existing UUID v5 systems
- Exporting to Europeana, DPLA, IIIF
- Storing in Wikidata as external identifier
- RFC 4122 strict compliance required
- **Maximum transparency** required (anyone can verify)
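
As an illustration of the determinism and transparency claims, the sketch below derives a UUID v5 from a GHCID string with Python's standard `uuid.uuid5`. The namespace is an assumption: this document does not state the project's namespace UUID, so a hypothetical one is derived from an example URL. The helper name `ghcid_to_uuid_v5` is likewise hypothetical. Anyone who knows the namespace and the GHCID string can recompute the same value.

```python
import uuid

# Hypothetical namespace, derived from an example URL for illustration only.
# The project's actual namespace UUID is not given in this document.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Name-based UUID (SHA-1) for a GHCID string within the namespace."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


first = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
second = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
print(first == second)  # True: same input, same UUID
print(first.version)    # 5
```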

---

### 2. **UUID SHA-256 (Custom)** - SOTA Cryptographic Strength
```
Format: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version: 8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)
```

**✅ Strengths:**
- **SHA-256** - NIST-approved, SOTA cryptographic hash (2024)
- **Superior collision resistance** vs SHA-1 against deliberate attacks (after truncation to 128 bits, the odds of an accidental collision match UUID v5)
- **Future-proof** - No known practical attacks
- **UUID-compatible** - Valid UUID format, works with UUID parsers

**⚠️ Nuances:**
- **Not defined by RFC 4122** - Custom implementation
- UUID v8 is the RFC 9562 version reserved for experimental or vendor-specific layouts, so the hashing scheme itself is a project convention
- May not be recognized by strict UUID v5-only systems

**Use When:**
- Security policy mandates SHA-256
- Maximum collision resistance required
- Future-proofing against SHA-1 deprecation
- Custom identifier resolution service

**Algorithm:**
1. Hash GHCID string with SHA-256 → 256 bits
2. Truncate to first 128 bits (16 bytes)
3. Set version bits to 8 (custom)
4. Set variant bits to RFC 4122 (0b10xxxxxx)
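
The four steps above can be sketched as follows. This is an illustrative reading of the algorithm, not the project's `to_uuid_sha256()` implementation: it assumes the bare GHCID string is hashed as UTF-8 with no namespace prefix, and the function name `ghcid_to_uuid_sha256` is hypothetical.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """UUID v8-style identifier from a truncated SHA-256 digest."""
    # Step 1: hash the GHCID string (assumed: UTF-8, no namespace prefix).
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # Step 2: keep the first 128 bits (16 bytes).
    raw = bytearray(digest[:16])
    # Step 3: set the version nibble (high 4 bits of byte 6) to 8.
    raw[6] = (raw[6] & 0x0F) | 0x80
    # Step 4: set the variant (high 2 bits of byte 8) to 0b10.
    raw[8] = (raw[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(raw))


example = ghcid_to_uuid_sha256("US-CA-SAN-A-IA")
print(example, example.version)
```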

---

### 3. **Numeric (64-bit)** - Database Optimization
```
Format: 213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range: 0 to 18,446,744,073,709,551,615
```

**✅ Strengths:**
- **Compact** - Fits in 8 bytes (SQL BIGINT is usually signed, so values ≥ 2^63 need an unsigned column or a signed reinterpretation)
- **Fast indexing** - Integer comparisons are typically faster than UUID comparisons
- **CSV-friendly** - No special characters
- **Deterministic** - Same GHCID → Same number

**⚠️ Nuances:**
- **64-bit truncation** reduces collision resistance vs full 256-bit
- P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
- Still negligible for heritage domain (<10M institutions expected)
- Spreadsheet tools that store numbers as double-precision floats cannot represent every 64-bit integer exactly; import the column as text

**Use When:**
- Database primary key optimization
- CSV exports for spreadsheet analysis
- Numeric sorting required
- Systems without UUID support
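
A minimal sketch of the numeric derivation, plus the signed reinterpretation needed for databases whose BIGINT type is signed. The byte order (big-endian) and the helper names are assumptions; the document only specifies "SHA-256 → first 8 bytes → uint64".

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256, read as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # byte order is an assumption


def to_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (for signed BIGINT)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def from_signed_64(value: int) -> int:
    """Inverse of to_signed_64: recover the unsigned value."""
    return value + (1 << 64) if value < 0 else value


numeric = ghcid_to_numeric("US-CA-SAN-A-IA")
stored = to_signed_64(numeric)          # safe to store in a signed BIGINT
print(from_signed_64(stored) == numeric)  # True
```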

---

### 4. **Human-Readable (ISO-based)** - Citations & References
```
Format: US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)
```

**✅ Strengths:**
- **Human-readable** - Understandable without lookup
- **Geographic context** - Location embedded in ID
- **Type indicator** - Institution type visible
- **Citable** - Use in academic papers, documentation

**⚠️ Nuances:**
- **Not persistent** if institution relocates or changes name
- Use `ghcid_original` field (frozen) for true persistence
- `ghcid` field (current) may change over time
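
The split between a frozen `ghcid_original` and a changeable `ghcid` can be sketched with a small record type. The class `GHCIDRecordSketch`, its `history` list, and the relocated example value are hypothetical and only illustrate the idea; the project's actual schema (for example `GHCIDHistoryEntry`) is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GHCIDRecordSketch:
    ghcid_original: str                 # set once at publication, never updated
    ghcid: str = ""                     # current value, may change over time
    history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A new record starts with its current GHCID equal to the original.
        if not self.ghcid:
            self.ghcid = self.ghcid_original

    def update_ghcid(self, new_ghcid: str) -> None:
        """Record a relocation or rename without touching ghcid_original."""
        if new_ghcid != self.ghcid:
            self.history.append(self.ghcid)
            self.ghcid = new_ghcid


record = GHCIDRecordSketch("NL-NH-AMS-M-RM")
record.update_ghcid("NL-ZH-RTM-M-RM")  # made-up relocation for illustration
print(record.ghcid_original, record.ghcid, record.history)
```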

**Use When:**
- Academic citations
- Documentation and reports
- Human-readable data exchange
- Debugging and logging

---

## Collision Resistance Comparison

### Mathematical Analysis

```python
# Collision probability (birthday-paradox approximation):
#   P(collision) ≈ n² / (2 × 2^bits)

def collision_probability(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n identifiers."""
    return n**2 / (2 * 2**bits)

# For 1,000,000 institutions:

# UUID v5 / UUID SHA-256 (128-bit):
p_128 = collision_probability(10**6, 128)    # ≈ 1.5 × 10^-27
# Effectively zero

# Numeric (64-bit):
p_64 = collision_probability(10**6, 64)      # ≈ 2.7 × 10^-8 (0.0000027%)
# Negligible for heritage domain

# Even at 10 million institutions:
p_64_10m = collision_probability(10**7, 64)  # ≈ 2.7 × 10^-6 (0.00027%)
# Still acceptable
```

### Real-World Context

| Institution Count | UUID v5/SHA-256 | Numeric (64-bit) | Assessment |
|-------------------|-----------------|------------------|------------|
| **100,000** | ~0% | 2.7×10^-10 (0.000000027%) | ✅ All safe |
| **1,000,000** | ~0% | 2.7×10^-8 (0.0000027%) | ✅ All safe |
| **10,000,000** | ~0% | 2.7×10^-6 (0.00027%) | ✅ UUID safe, numeric acceptable |
| **100,000,000** | ~0% | 2.7×10^-4 (0.027%) | ⚠️ Use UUID, numeric risky |

**Conclusion:** For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.

---

## Historical Collision Resolution

### The Rule: Temporal Priority Determines Disambiguation

When creating GHCIDs, collisions can occur in two temporal contexts:

1. **First Batch Creation** (initial PID assignment): Multiple institutions discovered simultaneously
2. **Historical Addition** (post-publication): New historical institution added after existing GHCID published

**Critical Design Decision**: The collision resolution strategy differs based on temporal context to preserve PID stability.

### Collision Resolution: Native Language Name Suffix

**Key Change**: Collisions are resolved by appending the **full legal name in native language in snake_case format**, NOT Wikidata Q-numbers.

**Name Suffix Rules**:
- Use the institution's full official name in its native language
- Convert to snake_case (lowercase, underscores for spaces)
- Remove apostrophes, accents, commas, and other punctuation/diacritics
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)

**Name Normalization Examples**:
```
"Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
```

### First Batch Behavior (Initial PID Creation)

**Scenario**: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.

**Resolution**: **ALL** colliding institutions get name suffixes appended.

**Example**:

```yaml
# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM

# Stedelijk Museum (founded 1874)
ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Science Museum Amsterdam (founded 2010)
ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
```

**Rationale**: No existing PIDs to preserve; both institutions are "new" to the system.

### Historical Addition Behavior (Post-Publication)

**Scenario**: After initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.

**Resolution**: **ONLY** the newly added historical institution gets a name suffix. The existing PID remains unchanged.

**Example**:

```yaml
# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM  # Hermitage Museum Amsterdam (2009-2023)

# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975)
# Would also generate: NL-NH-AMS-M-HM
#
# COLLISION DETECTED → Add name suffix to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
```

**Outcome**:
- `NL-NH-AMS-M-HM` (Hermitage Museum Amsterdam) → **UNCHANGED**
- `NL-NH-AMS-M-HM-amsterdam_historical_museum` (Amsterdam Historical Museum) → **Name suffix added**

**Rationale**: Preserve stability of already-published PIDs.

### Why This Matters: PID Stability Principle

**Problem**: Changing existing GHCIDs breaks external references.

PIDs may already be:
- Cited in academic publications
- Referenced in datasets and APIs
- Stored in institutional databases
- Embedded in IIIF manifests
- Linked from Wikidata

**Principle**: **"Cool URIs don't change"** (Tim Berners-Lee, W3C)

Once a GHCID is published (in first batch or as standalone record), it should **NEVER** change, even if new historical institutions create collisions.

### Decision Table: Who Gets Name Suffix?

| Scenario | When | Existing GHCID | New GHCID | Who Gets Name Suffix | Rationale |
|----------|------|----------------|-----------|---------------------|-----------|
| **First Batch** | Initial PID creation (2025-11-01) | None (first time) | `NL-NH-AMS-M-SM` (2 institutions) | **ALL** colliding institutions | No existing PIDs to preserve |
| **Historical Addition** | Post-publication (2025-11-15) | `NL-NH-AMS-M-HM` (published) | `NL-NH-AMS-M-HM` (historical) | **ONLY** newly added institution | Preserve published PID stability |
| **Standalone Addition** | New institution (2026-01-01) | `NL-NH-AMS-M-XY` (published) | `NL-NH-AMS-M-XY` (new contemporary) | **ONLY** newly added institution | Preserve existing PID |

### Implementation Guidance

**Name Suffix Generation**:

```python
import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Non-Latin scripts must be transliterated before calling this function:
    characters that do not reduce to ASCII letters or digits are dropped.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

    # Remove apostrophes (straight and typographic), commas, and other punctuation
    no_punct = re.sub(r"['’`\",.:;!?()\[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)

    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')

    return final
```

**Collision Detection Logic**:

```python
from typing import Set

def resolve_collision(new_ghcid: str, new_name: str, existing_ghcids: Set[str]) -> str:
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_ghcid: Base GHCID for new institution
        new_name: Native language name of the institution
        existing_ghcids: Set of already-published GHCIDs

    Returns:
        Final GHCID (with name suffix if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add name suffix to NEW institution ONLY
        name_suffix = generate_name_suffix(new_name)
        return f"{new_ghcid}-{name_suffix}"
    else:
        # No collision: Use base GHCID
        return new_ghcid
```

**First Batch Processing** (different logic):

```python
from collections import defaultdict
from typing import List

# Institution, GHCIDRecord, generate_base_ghcid, and create_record are
# project helpers that are not shown in this document.

def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]:
    """
    Process initial batch of institutions.

    For first batch, ALL collisions get name suffixes appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)

    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get name suffixes
            for inst in group:
                name_suffix = generate_name_suffix(inst.name)
                ghcid = f"{base_ghcid}-{name_suffix}"
                records.append(create_record(inst, ghcid))

    return records
```

### Edge Cases

**Case 1: Multiple historical institutions added simultaneously**

If multiple historical institutions are added together (same date) and collide with existing GHCID:

```yaml
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY

# Both added 2025-11-15
# Historical Institution A: "Amsterdam Art Archive"
ghcid: NL-NH-AMS-M-XY-amsterdam_art_archive

# Historical Institution B: "Amsterdam Archaeology Museum"
ghcid: NL-NH-AMS-M-XY-amsterdam_archaeology_museum
```

**Resolution**: ALL newly added institutions get name suffixes (treat as mini-batch).
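
One way to read the mini-batch rule is sketched below: an addition receives its suffix if its base GHCID is already published, or if another addition in the same batch shares that base. The function `resolve_batch_addition` is hypothetical, and it takes already-normalized name suffixes as input so that the example stays self-contained.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def resolve_batch_addition(
    additions: Iterable[Tuple[str, str]],  # (base_ghcid, normalized_name_suffix)
    existing_ghcids: Set[str],
) -> List[str]:
    """Assign final GHCIDs to a batch added after publication."""
    additions = list(additions)
    # Count how many additions in this batch share each base GHCID.
    base_counts = Counter(base for base, _ in additions)

    resolved = []
    for base, suffix in additions:
        # Published GHCIDs are never modified; only new entries get suffixes.
        collides = base in existing_ghcids or base_counts[base] > 1
        resolved.append(f"{base}-{suffix}" if collides else base)
    return resolved


published = {"NL-NH-AMS-M-XY"}
batch = [
    ("NL-NH-AMS-M-XY", "amsterdam_art_archive"),
    ("NL-NH-AMS-M-XY", "amsterdam_archaeology_museum"),
]
print(resolve_batch_addition(batch, published))
```

The published set is only read, never written, which mirrors the stability rule: existing identifiers stay exactly as they were.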

**Case 2: Existing GHCID already has name suffix**

If existing GHCID already has name suffix (from first batch collision), new historical addition gets different name suffix:

```yaml
# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Historical addition (2025-11-15)
ghcid: NL-NH-AMS-M-SM-stadsmuseum_amsterdam  # Different name suffix
```

**No ambiguity**: Each institution has unique name suffix derived from its native language name.

**Case 3: Non-Latin script names**

For institutions with non-Latin script names, transliterate to ASCII:

```yaml
# Chinese institution: 北京故宫博物院 (Palace Museum Beijing)
ghcid: CN-BJ-BEI-M-PM-beijing_gugong_bowuyuan

# Japanese institution: 東京国立博物館 (Tokyo National Museum)
ghcid: JP-TK-TOK-M-TN-tokyo_kokuritsu_hakubutsukan

# Arabic institution: المتحف المصري (Egyptian Museum)
ghcid: EG-CA-CAI-M-EM-al_mathaf_al_masri
```
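
Diacritic stripping alone does not transliterate: letters from non-Latin scripts (and a few Latin ones such as `ß`) survive Unicode decomposition unchanged and would simply be dropped by an ASCII-only filter. A small guard like the one below can flag names that need a transliteration step first. The function name `needs_transliteration` is hypothetical, and the choice of transliteration scheme (Pinyin, Hepburn, and so on) is left to the caller.

```python
import unicodedata


def needs_transliteration(native_name: str) -> bool:
    """True if letters remain non-ASCII after removing combining marks."""
    decomposed = unicodedata.normalize("NFD", native_name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return any(c.isalpha() and not c.isascii() for c in stripped)


print(needs_transliteration("Musée d'Orsay"))   # False: accents decompose to ASCII
print(needs_transliteration("北京故宫博物院"))    # True: transliterate first
```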

### Testing Strategy

**Test 1: First Batch Collision**

```python
def test_first_batch_collision():
    """Verify ALL institutions in first batch get name suffixes"""
    institutions = [
        Institution("Stedelijk Museum Amsterdam", type="M", city="AMS"),
        Institution("Science Museum Amsterdam", type="M", city="AMS")
    ]

    records = process_first_batch(institutions)

    # Both should have name suffixes
    assert records[0].ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"
```

**Test 2: Historical Addition Collision**

```python
def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}

    # Add historical institution
    historical = Institution(
        name="Amsterdam Historical Museum",
        type="M",
        city="AMS",
        temporal_extent={"start": "1926", "end": "1975"}
    )

    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        historical.name,
        existing_ghcids
    )

    # New historical gets name suffix
    assert new_ghcid == "NL-NH-AMS-M-HM-amsterdam_historical_museum"

    # Existing GHCID is still present and unchanged
    assert "NL-NH-AMS-M-HM" in existing_ghcids
```

**Test 3: Name Suffix Generation**

```python
def test_name_suffix_generation():
    """Verify name suffix normalization"""
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    assert generate_name_suffix("Biblioteca Nacional do Brasil") == "biblioteca_nacional_do_brasil"
    assert generate_name_suffix("Royal Museum, London") == "royal_museum_london"
```

### Documentation References

- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
- **Implementation**: `src/glam_extractor/identifiers/ghcid.py`
- **Schema**: `schemas/provenance.yaml` (GHCIDHistoryEntry)
- **Abbreviation Special Characters**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` (characters to exclude from abbreviations)

---

## SHA-1 vs SHA-256: The Nuance

### Why UUID v5 Uses SHA-1

**RFC 4122 (2005)** standardized UUID v5 with SHA-1 because:
- SHA-1 was considered secure in 2005
- 128-bit UUID space provides collision resistance even with SHA-1
- Purpose is **identifier generation**, not **security/authentication**

### SHA-1 Cryptographic Weakness

**SHA-1 collision attacks (2017):**
- Google/CWI demonstrated a practical SHA-1 collision (SHAttered)
- Two different inputs producing the same hash
- **Critical for digital signatures** (authentication, certificates)
- **Less critical for identifiers** (birthday paradox protection sufficient)

### When SHA-1 Is Problematic

- ❌ **Digital signatures** - Attacker can forge documents
- ❌ **Certificate authorities** - SSL/TLS security compromised
- ❌ **Password hashing** - A fast, unsalted hash offers little brute-force resistance
- ❌ **Blockchain** - Consensus security at risk

### When SHA-1 Is Acceptable

- ✅ **UUID generation** - Collision resistance adequate for identifier space
- ✅ **Git commits** - Git still uses SHA-1 object IDs by default (with collision detection) while SHA-256 support is being introduced
- ✅ **Non-adversarial contexts** - No attacker trying to cause collisions

---

## Recommended Usage Strategy

### Default: Dual UUID Approach

Store **both UUID formats** for maximum flexibility:

```yaml
# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739  # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739
```

### Use Case Decision Tree

```
Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│        - Europeana, DPLA, IIIF, Wikidata
│        - RFC 4122 compliance required
│
└─ NO → Building custom system?
    ├─ Security policy mandates SHA-256?
    │   ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
    │   └─ NO → Use UUID v5 for standard compliance
    │
    └─ Database optimization critical?
        ├─ YES → Use Numeric (to_numeric()) as PK
        │        - Store UUID v5 as alternate key
        └─ NO → Use UUID v5 as primary identifier
```

---

## Code Examples

### Generate All Four Formats

```python
from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA"
)

# Generate all formats
uuid_v5 = components.to_uuid()                # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()     # UUID SHA-256
numeric = components.to_numeric()             # Numeric (64-bit)
human = components.to_string()                # Human-readable

print(f"UUID v5: {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric: {numeric}")
print(f"Human: {human}")

# Output:
# UUID v5: 550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric: 213324328442227739
# Human: US-CA-SAN-A-IA
```

### Verify Determinism

```python
# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()
```

### Export to Different Formats

```python
# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"

# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"

# Database (use numeric PK). Pass values as query parameters rather than
# formatting them into the SQL string; the placeholder style depends on the driver.
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"
params = (components.to_numeric(), "Internet Archive")

# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."
```

---

## Future-Proofing Strategy

### Timeline Projections

| Year | SHA-1 Status | UUID v5 Status | Recommendation |
|------|--------------|----------------|----------------|
| **2024** | Weak for security, OK for IDs | Standard, widely supported | ✅ Use UUID v5 as primary |
| **2030** | Likely deprecated for security | Still standard for IDs | ✅ Dual UUID (v5 + SHA-256) |
| **2040** | Possibly deprecated entirely | May be superseded | ⚠️ Migrate to UUID SHA-256 |

### Migration Path

If SHA-1 is fully deprecated:

1. **Phase 1 (Now):** Store both UUID v5 and UUID SHA-256
2. **Phase 2 (2030):** Make UUID SHA-256 primary, keep v5 as alias
3. **Phase 3 (2040):** Deprecate UUID v5, use SHA-256 exclusively

**Critical:** Because both are **deterministic**, you can always regenerate from GHCID string without breaking references.

---

## Governance & Resolution

### Identifier Persistence Requirements

Technical generation is only half the solution. True persistence requires:

#### 1. **Resolution Service**
```
https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}

All three should resolve to the same institutional record.
```
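
The requirement that all three URL forms resolve to one record can be sketched with a simple lookup keyed by scheme and value. Everything here is hypothetical: the in-memory `INDEX`, the `resolve` function, and the record contents stand in for whatever registry and web framework a real resolver would use. The identifier values are the placeholder examples used elsewhere in this document.

```python
from typing import Dict, Optional, Tuple

# One institutional record, reachable under three different keys.
RECORD = {"ghcid": "US-CA-SAN-A-IA", "name": "Internet Archive"}

INDEX: Dict[Tuple[str, str], dict] = {
    ("uuid", "550e8400-e29b-41d4-a716-446655440000"): RECORD,
    ("numeric", "213324328442227739"): RECORD,
    ("ghcid", "US-CA-SAN-A-IA"): RECORD,
}


def resolve(path: str) -> Optional[dict]:
    """Resolve '/uuid/{uuid}', '/numeric/{numeric}', or '/ghcid/{ghcid}'."""
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2:
        return None
    scheme, value = parts
    if scheme == "uuid":
        value = value.lower()  # UUID hex digits are case-insensitive
    return INDEX.get((scheme, value))


print(resolve("/ghcid/US-CA-SAN-A-IA") is resolve("/numeric/213324328442227739"))
```

Keeping a single record object behind every key avoids the three views drifting apart; a production service would back the same idea with the mapping table rather than a dictionary.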

#### 2. **Mapping Database**
```sql
CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,  -- BIGINT is signed: values >= 2^63 must be stored reinterpreted as signed
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL,  -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);
```

#### 3. **Organizational Commitment**
- Maintain resolution service for decades
- Fund infrastructure for long-term operation
- Establish governance policies for ID assignment
- Handle institution mergers/closures/relocations

#### 4. **Community Standards**
- Coordinate with ISIL, Wikidata, GeoNames
- Publish GHCID specification as RFC or W3C note
- Engage with Europeana, DPLA, IIIF communities
- Establish dispute resolution process

---

## Comparison with Existing PID Systems

| System | Format | Governance | Resolution | Adoption |
|--------|--------|------------|------------|----------|
| **DOI** | 10.xxxx/yyyy | IDF (non-profit) | doi.org | High (scholarly) |
| **ARK** | ark:/nnnnn/xxx | CDL (California) | n2t.net | Medium (archives) |
| **Handle** | hdl:xxxx/yyyy | CNRI (non-profit) | handle.net | Medium (repositories) |
| **GHCID** | UUID v5 | **TBD** | **TBD** | None (new) |

**Lesson:** Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.

---

## Recommendations

### For This Project (2024-2025)

1. ✅ **Implement dual UUID generation** (v5 + SHA-256)
2. ✅ **Store all four identifier formats** in data model
3. ✅ **Use UUID v5 as primary ID** for current interoperability
4. ✅ **Document SHA-1 nuance** clearly
5. ⏳ **Build resolution service prototype**
6. ⏳ **Engage with Europeana/DPLA** for feedback
7. ⏳ **Draft GHCID specification** for community review

### For Production Deployment

1. ⏳ **Establish governance body** (non-profit foundation?)
2. ⏳ **Secure long-term funding** for resolution service
3. ⏳ **Coordinate with existing PID systems** (ISIL, VIAF, Wikidata)
4. ⏳ **Publish specification** (W3C note or IETF RFC)
5. ⏳ **Deploy resolution infrastructure** (multi-region, high availability)
6. ⏳ **Engage heritage community** for adoption

---

## References

- **RFC 4122:** UUID Standard (https://tools.ietf.org/html/rfc4122)
- **SHA-1 Collision:** Google/CWI (2017) - https://shattered.io
- **RFC 9562:** New UUID formats, including v7 and v8; published from draft-ietf-uuidrev-rfc4122bis (https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/)
- **NIST SHA-256:** FIPS 180-4 - https://csrc.nist.gov/publications/fips
- **Identifiers.org:** Life sciences identifiers - https://identifiers.org
- **N2T:** Name-to-Thing resolver - https://n2t.net

---

**Version:** 1.0
**Date:** 2024-11-06
**Status:** Draft for Community Review