glam/docs/PERSISTENT_IDENTIFIERS.md
2025-12-07 00:26:01 +01:00

# Persistent Identifiers for Heritage Institutions
## Overview
The GLAM Data Extraction project uses **multiple identifier formats** optimized for different purposes:
### Persistent Identifiers (Deterministic)
These can be regenerated from the GHCID string and are stable across systems:
| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v5** | 128 | SHA-1 | **PRIMARY** - Europeana, DPLA, IIIF, Wikidata | RFC 4122 Standard |
| **UUID SHA-256** | 128 | SHA-256 | **SOTA** - Security compliance, future-proofing | RFC 9562 (UUID v8) |
| **Numeric** | 64 | SHA-256 | CSV exports, numeric analysis | Internal |
| **Human-readable** | Variable | ISO format | Citations, documentation | ISO-based |
### Database Record Identifiers (Non-Deterministic)
These are generated once per record and optimize database performance:
| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v7** | 128 | Timestamp + Random | Database PKs, time-ordered queries | RFC 9562 Standard |
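
As an illustration of the "Timestamp + Random" layout, the sketch below builds a UUID v7-style value by hand following the RFC 9562 bit layout (48-bit Unix millisecond timestamp, then version, variant, and random bits). The function name `uuid7_sketch` and its optional `ts_ms` parameter are hypothetical; where the runtime or a maintained library already offers UUID v7, that should be preferred.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(ts_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID v7-style value: 48-bit ms timestamp followed by random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000           # Unix time in milliseconds
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = ((ts_ms & ((1 << 48) - 1)) << 80) | rand  # timestamp in the top 48 bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 10
    return uuid.UUID(int=value)


record_id = uuid7_sketch()
print(record_id, record_id.version)
```

Because the timestamp occupies the most significant bits, values created in later milliseconds sort after earlier ones, which is what makes this format friendly to B-tree primary keys.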
## Why Four Formats?
### 1. **UUID v5 (SHA-1)** - Interoperability Standard ⭐ PRIMARY
```
Format: 550e8400-e29b-41d4-a716-446655440000 (placeholder; a real v5 UUID has 5 as the first digit of the third group)
Version: 5 (name-based, SHA-1)
Standard: RFC 4122 (2005)
```
**✅ Strengths:**
- **RFC 4122 compliant** - Universal library support
- **Deterministic** - Same GHCID → Same UUID always (content-addressed)
- **Transparent** - Publicly documented algorithm, anyone can verify
- **Interoperable** - Works with Europeana, DPLA, IIIF, Wikidata
- **128-bit identifier space** - P(collision) ≈ 1.5×10^-27 for 1M institutions (birthday bound)
**⚠️ SHA-1 Nuance:**
- Uses SHA-1 internally (RFC 4122 specification)
- SHA-1 deprecated for **cryptographic security** (digital signatures, TLS, passwords)
- SHA-1 **appropriate for identifier generation** (non-adversarial; accidental collisions remain astronomically unlikely)
- See [Why GHCID Uses UUID v5 and SHA-1](WHY_UUID_V5_SHA1.md) for detailed rationale
**Why SHA-1 is Safe for GHCID:**
```
Cryptographic Use (Vulnerable):
- Adversarial context (attacker forges signatures)
- Two-message collision attack
- Security-critical (financial, authentication)
Identifier Use (Safe):
- Non-adversarial context (no one forges museum IDs)
- Single-source generation (we control inputs)
- Uniqueness requirement (birthday paradox protection sufficient)
```
**Use When:**
- **Primary identifier** for all GHCID records
- Integrating with existing UUID v5 systems
- Exporting to Europeana, DPLA, IIIF
- Storing in Wikidata as external identifier
- RFC 4122 strict compliance required
- **Maximum transparency** required (anyone can verify)
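
A minimal sketch of name-based UUID v5 generation with Python's standard `uuid` module follows. The namespace UUID is not given in this document, so `GHCID_NAMESPACE` below is a hypothetical stand-in derived from an example domain, and `ghcid_to_uuid_v5` is an illustrative name rather than the project's API.

```python
import uuid

# Hypothetical namespace for illustration only; the project's real namespace
# UUID is not stated in this document.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ghcid.example.org")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a name-based (SHA-1) UUID v5 from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


first = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
second = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
print(first == second)  # True: same namespace and name give the same UUID
print(first.version)    # 5
```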
---
### 2. **UUID SHA-256 (Custom)** - SOTA Cryptographic Strength
```
Format: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version: 8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)
```
**✅ Strengths:**
- **SHA-256** - NIST-approved, SOTA cryptographic hash (2024)
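- **No known collision attacks** on the underlying hash (unlike SHA-1); after truncation to 128 bits, the chance of an accidental collision is the same as for UUID v5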
- **Future-proof** - No known practical attacks
- **UUID-compatible** - Valid UUID format, works with UUID parsers
**⚠️ Nuances:**
- **Not an RFC 4122 version** - Custom hashing scheme carried in a UUID v8
- UUID v8 is defined by RFC 9562 for experimental or vendor-specific use
- May not be recognized by strict UUID v5-only systems
**Use When:**
- Security policy mandates SHA-256
- Maximum collision resistance required
- Future-proofing against SHA-1 deprecation
- Custom identifier resolution service
**Algorithm:**
1. Hash GHCID string with SHA-256 → 256 bits
2. Truncate to first 128 bits (16 bytes)
3. Set version bits to 8 (custom)
4. Set variant bits to RFC 4122 (0b10xxxxxx)
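
The four steps above can be sketched with the standard library as follows. This assumes the GHCID string is hashed directly as UTF-8 with no namespace prefix, which the document does not specify; `ghcid_to_uuid_sha256` is an illustrative name.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """SHA-256 the GHCID, keep 16 bytes, then stamp version 8 and the RFC variant."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()  # step 1: 256-bit hash
    raw = bytearray(digest[:16])                             # step 2: first 128 bits
    raw[6] = (raw[6] & 0x0F) | 0x80                          # step 3: version nibble = 8
    raw[8] = (raw[8] & 0x3F) | 0x80                          # step 4: variant bits = 10
    return uuid.UUID(bytes=bytes(raw))


custom = ghcid_to_uuid_sha256("NL-NH-AMS-M-RM")
print(custom.version)                     # 8
print(custom.variant == uuid.RFC_4122)    # True
```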
---
### 3. **Numeric (64-bit)** - Database Optimization
```
Format: 213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range: 0 to 18,446,744,073,709,551,615
```
**✅ Strengths:**
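- **Compact** - 8 bytes; fits an unsigned 64-bit column (a signed SQL BIGINT only holds values below 2^63, so values above that need an unsigned type or a signed reinterpretation)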
- **Fast indexing** - Integer comparisons faster than UUID
- **CSV-friendly** - No special characters
- **Deterministic** - Same GHCID → Same number
**⚠️ Nuances:**
- **64-bit truncation** reduces collision resistance vs full 256-bit
- P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
- Still negligible for heritage domain (<10M institutions expected)
**Use When:**
- Database primary key optimization
- CSV exports for spreadsheet analysis
- Numeric sorting required
- Systems without UUID support
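
A sketch of the numeric derivation, assuming big-endian byte order (the document only says "first 8 bytes → uint64"). The two helper functions are hypothetical additions showing one way to store the value in a signed 64-bit column without losing information.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """SHA-256 the GHCID and read the first 8 bytes as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")  # byte order is an assumption


def to_signed_bigint(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed, for signed BIGINT columns."""
    return value - (1 << 64) if value >= (1 << 63) else value


def from_signed_bigint(value: int) -> int:
    """Inverse of to_signed_bigint."""
    return value + (1 << 64) if value < 0 else value


numeric = ghcid_to_numeric("NL-NH-AMS-M-RM")
print(0 <= numeric < 2**64)                                       # True
print(from_signed_bigint(to_signed_bigint(numeric)) == numeric)   # True
```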
---
### 4. **Human-Readable (ISO-based)** - Citations & References
```
Format: US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)
```
**✅ Strengths:**
- **Human-readable** - Understandable without lookup
- **Geographic context** - Location embedded in ID
- **Type indicator** - Institution type visible
- **Citable** - Use in academic papers, documentation
**⚠️ Nuances:**
- **Not persistent** if institution relocates or changes name
- Use `ghcid_original` field (frozen) for true persistence
- `ghcid` field (current) may change over time
**Use When:**
- Academic citations
- Documentation and reports
- Human-readable data exchange
- Debugging and logging
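
For debugging and logging it can help to split a GHCID back into its components. The sketch below is hypothetical: the regular expression encodes assumed character rules (uppercase codes, a single-letter type, and an optional lowercase snake_case name suffix as used for collision resolution later in this document), not the authoritative GHCID grammar.

```python
import re
from typing import NamedTuple, Optional


class ParsedGHCID(NamedTuple):
    country: str
    region: str
    city: str
    institution_type: str
    abbreviation: str
    name_suffix: Optional[str]


# Assumed pattern: {Country}-{Region}-{City}-{Type}-{Abbreviation}[-name_suffix]
_GHCID_PATTERN = re.compile(
    r"^([A-Z]{2})-([A-Z0-9]+)-([A-Z0-9]+)-([A-Z])-([A-Z0-9]+)(?:-([a-z0-9_]+))?$"
)


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split a GHCID string into components, raising ValueError if it does not match."""
    match = _GHCID_PATTERN.match(ghcid)
    if match is None:
        raise ValueError(f"Not a recognisable GHCID: {ghcid!r}")
    return ParsedGHCID(*match.groups())


parsed = parse_ghcid("NL-NH-AMS-M-RM")
print(parsed.country, parsed.city, parsed.abbreviation)  # NL AMS RM
```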
---
## Collision Resistance Comparison
### Mathematical Analysis
```python
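# Collision probability (birthday-bound approximation):
#   P(collision) ≈ n² / (2 × 2^bits)

def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n uniformly random IDs."""
    return n**2 / (2 * 2**bits)

# For 1,000,000 institutions:
# UUID v5 / UUID SHA-256 (128-bit):
print(collision_probability(10**6, 128))  # ≈ 1.5 × 10^-27, effectively zero
# (6 of the 128 bits are fixed version/variant bits, so the effective
#  figure is 122 bits: ≈ 9.4 × 10^-26, still effectively zero)

# Numeric (64-bit):
print(collision_probability(10**6, 64))   # ≈ 2.7 × 10^-8 (0.0000027%)
# Negligible for heritage domain

# Even at 10 million institutions:
print(collision_probability(10**7, 64))   # ≈ 2.7 × 10^-6 (0.00027%)
# Still acceptable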
```
### Real-World Context
| Institution Count | UUID v5/SHA-256 | Numeric (64-bit) | Assessment |
|-------------------|-----------------|------------------|------------|
| **100,000** | ~0% | 2.7×10^-10 (0.000000027%) | All safe |
| **1,000,000** | ~0% | 2.7×10^-8 (0.0000027%) | All safe |
| **10,000,000** | ~0% | 2.7×10^-6 (0.00027%) | UUID safe, numeric acceptable |
| **100,000,000** | ~0% | 2.7×10^-4 (0.027%) | Use UUID, numeric risky |
**Conclusion:** For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.
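
The same birthday-bound approximation can be inverted to ask how many institutions a format supports before the collision probability exceeds a chosen threshold. This is a sketch; the function names and the one-in-a-million threshold are illustrative assumptions, not project policy.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: n² / (2 × 2^bits)."""
    return n * n / (2 * 2**bits)


def max_institutions(bits: int, max_probability: float) -> int:
    """Largest n for which n² / (2 × 2^bits) stays at or below max_probability."""
    return math.isqrt(int(2 * 2**bits * max_probability))


# With 64-bit numeric IDs and a one-in-a-million collision budget,
# the limit works out to roughly 6 million institutions.
print(max_institutions(64, 1e-6))
print(collision_probability(10**6, 64))
```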
---
## Historical Collision Resolution
### The Rule: Temporal Priority Determines Disambiguation
When creating GHCIDs, collisions can occur in two temporal contexts:
1. **First Batch Creation** (initial PID assignment): Multiple institutions discovered simultaneously
2. **Historical Addition** (post-publication): New historical institution added after existing GHCID published
**Critical Design Decision**: The collision resolution strategy differs based on temporal context to preserve PID stability.
### Collision Resolution: Native Language Name Suffix
**Key Change**: Collisions are resolved by appending the **full legal name in native language in snake_case format**, NOT Wikidata Q-numbers.
**Name Suffix Rules**:
- Use the institution's full official name in its native language
- Convert to snake_case (lowercase, underscores for spaces)
- Remove apostrophes, accents, commas, and other punctuation/diacritics
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)
**Name Normalization Examples**:
```
"Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
```
### First Batch Behavior (Initial PID Creation)
**Scenario**: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.
**Resolution**: **ALL** colliding institutions get name suffixes appended.
**Example**:
```yaml
# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM
# Stedelijk Museum (founded 1874)
ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
# Science Museum Amsterdam (founded 2010)
ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
```
**Rationale**: No existing PIDs to preserve; both institutions are "new" to the system.
### Historical Addition Behavior (Post-Publication)
**Scenario**: After initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.
**Resolution**: **ONLY** the newly added historical institution gets a name suffix. The existing PID remains unchanged.
**Example**:
```yaml
# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM # Hermitage Museum Amsterdam (2009-2023)
# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975)
# Would also generate: NL-NH-AMS-M-HM
#
# COLLISION DETECTED → Add name suffix to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
```
**Outcome**:
- `NL-NH-AMS-M-HM` (Hermitage Museum Amsterdam) **UNCHANGED**
- `NL-NH-AMS-M-HM-amsterdam_historical_museum` (Amsterdam Historical Museum) **Name suffix added**
**Rationale**: Preserve stability of already-published PIDs.
### Why This Matters: PID Stability Principle
**Problem**: Changing existing GHCIDs breaks external references.
PIDs may already be:
- Cited in academic publications
- Referenced in datasets and APIs
- Stored in institutional databases
- Embedded in IIIF manifests
- Linked from Wikidata
**Principle**: **"Cool URIs don't change"** (Tim Berners-Lee, W3C)
Once a GHCID is published (in first batch or as standalone record), it should **NEVER** change, even if new historical institutions create collisions.
### Decision Table: Who Gets Name Suffix?
| Scenario | When | Existing GHCID | New GHCID | Who Gets Name Suffix | Rationale |
|----------|------|----------------|-----------|---------------------|-----------|
| **First Batch** | Initial PID creation (2025-11-01) | None (first time) | `NL-NH-AMS-M-SM` (2 institutions) | **ALL** colliding institutions | No existing PIDs to preserve |
| **Historical Addition** | Post-publication (2025-11-15) | `NL-NH-AMS-M-HM` (published) | `NL-NH-AMS-M-HM` (historical) | **ONLY** newly added institution | Preserve published PID stability |
| **Standalone Addition** | New institution (2026-01-01) | `NL-NH-AMS-M-XY` (published) | `NL-NH-AMS-M-XY` (new contemporary) | **ONLY** newly added institution | Preserve existing PID |
### Implementation Guidance
**Name Suffix Generation**:
```python
import re
import unicodedata


def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"

    Note: non-Latin scripts must be transliterated to ASCII before calling;
    any character that cannot be reduced to [a-z0-9_] is dropped.
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Convert to lowercase
    lowercase = ascii_name.lower()
    # Remove apostrophes (straight and typographic), commas, and other punctuation
    no_punct = re.sub(r"['\u2018\u2019`\",.:;!?()\[\]{}]", '', lowercase)
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    return final
```
**Collision Detection Logic**:
```python
from typing import Set


def resolve_collision(new_ghcid: str, new_name: str, existing_ghcids: Set[str]) -> str:
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_ghcid: Base GHCID for new institution
        new_name: Native language name of the institution
        existing_ghcids: Set of already-published GHCIDs

    Returns:
        Final GHCID (with name suffix if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add name suffix to NEW institution ONLY
        name_suffix = generate_name_suffix(new_name)
        return f"{new_ghcid}-{name_suffix}"
    else:
        # No collision: Use base GHCID
        return new_ghcid
```
**First Batch Processing** (different logic):
```python
from collections import defaultdict
from typing import List

# Institution, GHCIDRecord, generate_base_ghcid and create_record are
# assumed to be provided by the project's own modules.


def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]:
    """
    Process initial batch of institutions.
    For first batch, ALL collisions get name suffixes appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)

    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get name suffixes
            for inst in group:
                name_suffix = generate_name_suffix(inst.name)
                ghcid = f"{base_ghcid}-{name_suffix}"
                records.append(create_record(inst, ghcid))
    return records
```
### Edge Cases
**Case 1: Multiple historical institutions added simultaneously**
If multiple historical institutions are added together (same date) and collide with existing GHCID:
```yaml
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY
# Both added 2025-11-15
# Historical Institution A: "Amsterdam Art Archive"
ghcid: NL-NH-AMS-M-XY-amsterdam_art_archive
# Historical Institution B: "Amsterdam Archaeology Museum"
ghcid: NL-NH-AMS-M-XY-amsterdam_archaeology_museum
```
**Resolution**: ALL newly added institutions get name suffixes (treat as mini-batch).
**Case 2: Existing GHCID already has name suffix**
If existing GHCID already has name suffix (from first batch collision), new historical addition gets different name suffix:
```yaml
# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
# Historical addition (2025-11-15)
ghcid: NL-NH-AMS-M-SM-stadsmuseum_amsterdam # Different name suffix
```
**No ambiguity**: Each institution has unique name suffix derived from its native language name.
**Case 3: Non-Latin script names**
For institutions with non-Latin script names, transliterate to ASCII:
```yaml
# Chinese institution: 北京故宫博物院 (Palace Museum Beijing)
ghcid: CN-BJ-BEI-M-PM-beijing_gugong_bowuyuan
# Japanese institution: 東京国立博物館 (Tokyo National Museum)
ghcid: JP-TK-TOK-M-TN-tokyo_kokuritsu_hakubutsukan
# Arabic institution: المتحف المصري (Egyptian Museum)
ghcid: EG-CA-CAI-M-EM-al_mathaf_al_masri
```
### Testing Strategy
**Test 1: First Batch Collision**
```python
def test_first_batch_collision():
    """Verify ALL institutions in first batch get name suffixes"""
    institutions = [
        Institution("Stedelijk Museum Amsterdam", type="M", city="AMS"),
        Institution("Science Museum Amsterdam", type="M", city="AMS"),
    ]
    records = process_first_batch(institutions)
    # Both should have name suffixes
    assert records[0].ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"
```
**Test 2: Historical Addition Collision**
```python
def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}
    # Add historical institution
    historical = Institution(
        name="Amsterdam Historical Museum",
        type="M",
        city="AMS",
        temporal_extent={"start": "1926", "end": "1975"},
    )
    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        historical.name,
        existing_ghcids,
    )
    # New historical gets name suffix
    assert new_ghcid == "NL-NH-AMS-M-HM-amsterdam_historical_museum"
    # The published GHCID must not be modified by the addition
    assert existing_ghcids == {"NL-NH-AMS-M-HM"}
```
**Test 3: Name Suffix Generation**
```python
def test_name_suffix_generation():
    """Verify name suffix normalization"""
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    assert generate_name_suffix("Biblioteca Nacional do Brasil") == "biblioteca_nacional_do_brasil"
    assert generate_name_suffix("Royal Museum, London") == "royal_museum_london"
```
### Documentation References
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
- **Implementation**: `src/glam_extractor/identifiers/ghcid.py`
- **Schema**: `schemas/provenance.yaml` (GHCIDHistoryEntry)
- **Abbreviation Special Characters**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` (characters to exclude from abbreviations)
---
## SHA-1 vs SHA-256: The Nuance
### Why UUID v5 Uses SHA-1
**RFC 4122 (2005)** standardized UUID v5 with SHA-1 because:
- SHA-1 was still the mainstream hash in 2005, and no practical collision had been demonstrated
- 128-bit UUID space provides collision resistance even with SHA-1
- Purpose is **identifier generation**, not **security/authentication**
### SHA-1 Cryptographic Weakness
**SHA-1 collision attacks (2017):**
- Google/CWI demonstrated practical SHA-1 collision
- Two different inputs producing same hash
- **Critical for digital signatures** (authentication, certificates)
- **Less critical for identifiers** (birthday paradox protection sufficient)
### When SHA-1 Is Problematic
- **Digital signatures** - Attacker can forge documents
- **Certificate authorities** - SSL/TLS security compromised
- **Password hashing** - A fast general-purpose hash is easy to brute-force (a weakness independent of the collision attacks)
- **Blockchain** - Consensus security at risk
### When SHA-1 Is Acceptable
- **UUID generation** - Collision resistance adequate for identifier space
- **Git commits** - Git has long used SHA-1 for content-addressed object IDs, a use Linus Torvalds has defended as adequate
- **Non-adversarial contexts** - No attacker trying to cause collisions
---
## Recommended Usage Strategy
### Default: Dual UUID Approach
Store **both UUID formats** for maximum flexibility:
```yaml
# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739  # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739
```
### Use Case Decision Tree
```
Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│        - Europeana, DPLA, IIIF, Wikidata
│        - RFC 4122 compliance required
└─ NO → Building custom system?
    ├─ Security policy mandates SHA-256?
    │   ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
    │   └─ NO → Use UUID v5 for standard compliance
    └─ Database optimization critical?
        ├─ YES → Use Numeric (to_numeric()) as PK
        │        - Store UUID v5 as alternate key
        └─ NO → Use UUID v5 as primary identifier
```
---
## Code Examples
### Generate All Four Formats
```python
from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA",
)

# Generate all formats
uuid_v5 = components.to_uuid()                # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()     # UUID SHA-256
numeric = components.to_numeric()             # Numeric (64-bit)
human = components.to_string()                # Human-readable

print(f"UUID v5: {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric: {numeric}")
print(f"Human: {human}")

# Example output (placeholder values for illustration):
# UUID v5: 550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric: 213324328442227739
# Human: US-CA-SAN-A-IA
```
### Verify Determinism
```python
# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()
```
### Export to Different Formats
```python
# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"
# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"
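# Database (use numeric PK); pass values as bound parameters instead of
# interpolating them into the SQL string (placeholder style depends on the driver)
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"
params = (components.to_numeric(), "Internet Archive")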
# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."
```
---
## Future-Proofing Strategy
### Timeline Projections
| Year | SHA-1 Status | UUID v5 Status | Recommendation |
|------|--------------|----------------|----------------|
| **2024** | Weak for security, OK for IDs | Standard, widely supported | Use UUID v5 as primary |
| **2030** | Likely deprecated for security | Still standard for IDs | Dual UUID (v5 + SHA-256) |
| **2040** | Possibly deprecated entirely | May be superseded | Migrate to UUID SHA-256 |
### Migration Path
If SHA-1 is fully deprecated:
1. **Phase 1 (Now):** Store both UUID v5 and UUID SHA-256
2. **Phase 2 (2030):** Make UUID SHA-256 primary, keep v5 as alias
3. **Phase 3 (2040):** Deprecate UUID v5, use SHA-256 exclusively
**Critical:** Because both are **deterministic**, you can always regenerate from GHCID string without breaking references.
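
A sketch of how Phase 1 could be backfilled and verified: both UUIDs are recomputed from the frozen GHCID, missing values are filled in, and any stored value that disagrees is reported. The namespace, the choice of `ghcid_original` as the hash input, and the function names are assumptions made for illustration.

```python
import hashlib
import uuid
from typing import Dict

# Hypothetical namespace; substitute the project's real namespace UUID.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ghcid.example.org")


def derive_uuids(ghcid: str) -> Dict[str, str]:
    """Recompute both deterministic UUIDs from a GHCID string."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # version nibble = 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits = 10
    return {
        "ghcid_uuid": str(uuid.uuid5(GHCID_NAMESPACE, ghcid)),
        "ghcid_uuid_sha256": str(uuid.UUID(bytes=bytes(raw))),
    }


def backfill(record: Dict[str, str]) -> Dict[str, str]:
    """Fill in missing UUIDs; raise if a stored value disagrees with the derived one."""
    updated = dict(record)
    for field, value in derive_uuids(record["ghcid_original"]).items():
        stored = updated.setdefault(field, value)
        if stored != value:
            raise ValueError(f"{field} mismatch for {record['ghcid_original']}")
    return updated


# A legacy record that so far only stores the UUID v5
legacy = {"ghcid_original": "NL-NH-AMS-M-RM"}
legacy["ghcid_uuid"] = derive_uuids("NL-NH-AMS-M-RM")["ghcid_uuid"]
migrated = backfill(legacy)
print(sorted(migrated))
```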
---
## Governance & Resolution
### Identifier Persistence Requirements
Technical generation is only half the solution. True persistence requires:
#### 1. **Resolution Service**
```
https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}
All three should resolve to the same institutional record.
```
#### 2. **Mapping Database**
```sql
CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,  -- signed in most databases; values >= 2^63 need an unsigned or NUMERIC(20,0) column
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL,  -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);
```
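
To illustrate how a resolver could map any of the three identifier forms onto one record, here is a minimal in-memory sketch using Python's built-in `sqlite3`. The table is a simplified variant of the schema above (text UUIDs, a renamed `numeric_id` column), and the `resolve` function and its `kind` values are hypothetical; the identifier values are the placeholder examples used throughout this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ghcid_registry (
        uuid_v5 TEXT PRIMARY KEY,
        uuid_sha256 TEXT NOT NULL UNIQUE,
        numeric_id INTEGER NOT NULL UNIQUE,
        ghcid_original TEXT NOT NULL UNIQUE,
        institution_name TEXT NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO ghcid_registry VALUES (?, ?, ?, ?, ?)",
    (
        "550e8400-e29b-41d4-a716-446655440000",
        "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
        213324328442227739,
        "US-CA-SAN-A-IA",
        "Internet Archive",
    ),
)

# Fixed whitelist of lookup columns, so no user input reaches the SQL text
_LOOKUP_COLUMNS = {"uuid": "uuid_v5", "numeric": "numeric_id", "ghcid": "ghcid_original"}


def resolve(kind: str, value):
    """Return the institution name for an identifier of the given kind, or None."""
    column = _LOOKUP_COLUMNS[kind]
    row = conn.execute(
        f"SELECT institution_name FROM ghcid_registry WHERE {column} = ?", (value,)
    ).fetchone()
    return row[0] if row else None


print(resolve("ghcid", "US-CA-SAN-A-IA"))     # Internet Archive
print(resolve("numeric", 213324328442227739))  # Internet Archive
```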
#### 3. **Organizational Commitment**
- Maintain resolution service for decades
- Fund infrastructure for long-term operation
- Establish governance policies for ID assignment
- Handle institution mergers/closures/relocations
#### 4. **Community Standards**
- Coordinate with ISIL, Wikidata, GeoNames
- Publish GHCID specification as RFC or W3C note
- Engage with Europeana, DPLA, IIIF communities
- Establish dispute resolution process
---
## Comparison with Existing PID Systems
| System | Format | Governance | Resolution | Adoption |
|--------|--------|------------|------------|----------|
| **DOI** | 10.xxxx/yyyy | IDF (non-profit) | doi.org | High (scholarly) |
| **ARK** | ark:/nnnnn/xxx | CDL (California) | n2t.net | Medium (archives) |
| **Handle** | hdl:xxxx/yyyy | CNRI (non-profit) | handle.net | Medium (repositories) |
| **GHCID** | UUID v5 | **TBD** | **TBD** | None (new) |
**Lesson:** Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.
---
## Recommendations
### For This Project (2024-2025)
1. **Implement dual UUID generation** (v5 + SHA-256)
2. **Store all four identifier formats** in data model
3. **Use UUID v5 as primary ID** for current interoperability
4. **Document SHA-1 nuance** clearly
5. **Build resolution service prototype**
6. **Engage with Europeana/DPLA** for feedback
7. **Draft GHCID specification** for community review
### For Production Deployment
1. **Establish governance body** (non-profit foundation?)
2. **Secure long-term funding** for resolution service
3. **Coordinate with existing PID systems** (ISIL, VIAF, Wikidata)
4. **Publish specification** (W3C note or IETF RFC)
5. **Deploy resolution infrastructure** (multi-region, high availability)
6. **Engage heritage community** for adoption
---
## References
- **RFC 4122:** UUID Standard (https://tools.ietf.org/html/rfc4122)
- **SHA-1 Collision:** Google/CWI (2017) - https://shattered.io
- **RFC 9562:** UUID v7 and v8 formats, published from draft-ietf-uuidrev-rfc4122bis (https://www.rfc-editor.org/rfc/rfc9562)
- **NIST SHA-256:** FIPS 180-4 - https://csrc.nist.gov/publications/fips
- **Identifiers.org:** Life sciences identifiers - https://identifiers.org
- **N2T:** Name-to-Thing resolver - https://n2t.net
---
**Version:** 1.0
**Date:** 2024-11-06
**Status:** Draft for Community Review