glam/docs/PERSISTENT_IDENTIFIERS.md
2025-11-19 23:25:22 +01:00

# Persistent Identifiers for Heritage Institutions
## Overview
The GLAM Data Extraction project uses **multiple identifier formats** optimized for different purposes:
### Persistent Identifiers (Deterministic)
These can be regenerated from the GHCID string and are stable across systems:
| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v5** | 128 | SHA-1 | **PRIMARY** - Europeana, DPLA, IIIF, Wikidata | RFC 4122 Standard |
| **UUID SHA-256** | 128 | SHA-256 | **SOTA** - Security compliance, future-proofing | RFC 9562 (UUID v8) |
| **Numeric** | 64 | SHA-256 | CSV exports, numeric analysis | Internal |
| **Human-readable** | Variable | ISO format | Citations, documentation | ISO-based |
### Database Record Identifiers (Non-Deterministic)
These are generated once per record and optimize database performance:
| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v7** | 128 | Timestamp + Random | Database PKs, time-ordered queries | RFC 9562 Standard |
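
To make the "Timestamp + Random" layout concrete, here is a minimal sketch of a UUID v7-style generator. The function name `uuid7_sketch` is hypothetical and this is not the project's implementation; it only follows the RFC 9562 layout of a 48-bit Unix millisecond timestamp followed by version, variant, and random bits.

```python
import os
import time
import uuid


def uuid7_sketch() -> uuid.UUID:
    """Build a UUID v7-style value: 48-bit Unix ms timestamp + random bits."""
    ts_ms = time.time_ns() // 1_000_000           # milliseconds since the epoch
    raw = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70               # version nibble = 7
    raw[8] = (raw[8] & 0x3F) | 0x80               # variant bits = 10xxxxxx
    return uuid.UUID(bytes=bytes(raw))


record_id = uuid7_sketch()
print(record_id, record_id.version)
```

Because the timestamp occupies the most significant bits, values generated later sort after earlier ones, which is what makes this format friendly to B-tree primary keys. Unlike the deterministic formats, it cannot be regenerated from the GHCID string.
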
## Why Four Formats?
### 1. **UUID v5 (SHA-1)** - Interoperability Standard ⭐ PRIMARY
```
Format: 550e8400-e29b-41d4-a716-446655440000 (placeholder; a real v5 UUID has 5 as the version digit: xxxxxxxx-xxxx-5xxx-...)
Version: 5 (name-based, SHA-1)
Standard: RFC 4122 (2005)
```
**✅ Strengths:**
- **RFC 4122 compliant** - Universal library support
- **Deterministic** - Same GHCID → Same UUID always (content-addressed)
- **Transparent** - Publicly documented algorithm, anyone can verify
- **Interoperable** - Works with Europeana, DPLA, IIIF, Wikidata
- **128-bit identifier space** - Birthday-bound P(collision) ≈ 1.5×10^-27 for 1M institutions
**⚠️ SHA-1 Nuance:**
- Uses SHA-1 internally (RFC 4122 specification)
- SHA-1 deprecated for **cryptographic security** (digital signatures, TLS, passwords)
- SHA-1 **appropriate for identifier generation** (non-adversarial, collision-resistant)
- See [Why GHCID Uses UUID v5 and SHA-1](WHY_UUID_V5_SHA1.md) for detailed rationale
**Why SHA-1 is Safe for GHCID:**
```
Cryptographic Use (Vulnerable):
- Adversarial context (attacker forges signatures)
- Two-message collision attack
- Security-critical (financial, authentication)
Identifier Use (Safe):
- Non-adversarial context (no one forges museum IDs)
- Single-source generation (we control inputs)
- Uniqueness requirement (birthday paradox protection sufficient)
```
**Use When:**
- **Primary identifier** for all GHCID records
- Integrating with existing UUID v5 systems
- Exporting to Europeana, DPLA, IIIF
- Storing in Wikidata as external identifier
- RFC 4122 strict compliance required
- **Maximum transparency** required (anyone can verify)
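
As a rough illustration of the determinism described above, the standard library's `uuid.uuid5` can derive a name-based UUID from a GHCID string. The namespace below is an assumption made for this sketch (derived from an example URL); the project's actual namespace UUID is not stated in this document.

```python
import uuid

# ASSUMPTION: placeholder namespace, not the project's real GHCID namespace.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a name-based (SHA-1) UUID from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


first = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
second = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
print(first.version, first == second)  # same input, same UUID
```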
---
### 2. **UUID SHA-256 (Custom)** - SOTA Cryptographic Strength
```
Format: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version: 8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)
```
**✅ Strengths:**
- **SHA-256** - NIST-approved, SOTA cryptographic hash (2024)
- **Superior collision resistance** vs SHA-1
- **Future-proof** - No known practical attacks
- **UUID-compatible** - Valid UUID format, works with UUID parsers
**⚠️ Nuances:**
- **Not an RFC 4122 version** - The SHA-256 hashing scheme is a custom implementation
- UUID v8 is the RFC 9562 designation for experimental/vendor-specific layouts
- May not be recognized by strict UUID v5-only systems
**Use When:**
- Security policy mandates SHA-256
- Maximum collision resistance required
- Future-proofing against SHA-1 deprecation
- Custom identifier resolution service
**Algorithm:**
1. Hash GHCID string with SHA-256 → 256 bits
2. Truncate to first 128 bits (16 bytes)
3. Set version bits to 8 (custom)
4. Set variant bits to RFC 4122 (0b10xxxxxx)
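
The four steps above can be sketched with the standard library as follows. The function name `ghcid_to_uuid_sha256` and the UTF-8 encoding of the input are assumptions made for illustration; the project's own `to_uuid_sha256()` may differ in detail.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """SHA-256 the GHCID, keep 128 bits, then stamp version 8 and the variant."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()   # step 1: 256 bits
    raw = bytearray(digest[:16])                              # step 2: first 16 bytes
    raw[6] = (raw[6] & 0x0F) | 0x80                           # step 3: version = 8
    raw[8] = (raw[8] & 0x3F) | 0x80                           # step 4: variant = 10xxxxxx
    return uuid.UUID(bytes=bytes(raw))


custom = ghcid_to_uuid_sha256("US-CA-SAN-A-IA")
print(custom, custom.version)
```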
---
### 3. **Numeric (64-bit)** - Database Optimization
```
Format: 213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range: 0 to 18,446,744,073,709,551,615
```
**✅ Strengths:**
- **Compact** - Fits in 8 bytes (SQL BIGINT; where BIGINT is signed, values ≥ 2^63 need an unsigned column or a signed reinterpretation)
- **Fast indexing** - Integer comparisons faster than UUID
- **CSV-friendly** - No special characters
- **Deterministic** - Same GHCID → Same number
**⚠️ Nuances:**
- **64-bit truncation** reduces collision resistance vs full 256-bit
- P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
- Still negligible for heritage domain (<10M institutions expected)
**Use When:**
- Database primary key optimization
- CSV exports for spreadsheet analysis
- Numeric sorting required
- Systems without UUID support
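
A minimal sketch of the "SHA-256 → first 8 bytes → uint64" derivation is shown below. The function name, the UTF-8 encoding, and the big-endian byte order are assumptions for illustration; the document does not specify which byte order `to_numeric()` uses.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """Interpret the first 8 bytes of SHA-256(ghcid) as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    # ASSUMPTION: big-endian interpretation of the truncated digest.
    return int.from_bytes(digest[:8], byteorder="big")


numeric_id = ghcid_to_numeric("US-CA-SAN-A-IA")
print(numeric_id, 0 <= numeric_id < 2**64)
```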
---
### 4. **Human-Readable (ISO-based)** - Citations & References
```
Format: US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)
```
**✅ Strengths:**
- **Human-readable** - Understandable without lookup
- **Geographic context** - Location embedded in ID
- **Type indicator** - Institution type visible
- **Citable** - Use in academic papers, documentation
**⚠️ Nuances:**
- **Not persistent** if institution relocates or changes name
- Use `ghcid_original` field (frozen) for true persistence
- `ghcid` field (current) may change over time
**Use When:**
- Academic citations
- Documentation and reports
- Human-readable data exchange
- Debugging and logging
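
Since the human-readable form is a fixed sequence of hyphen-separated components, it can be split back into its parts for logging or validation. The sketch below is hypothetical (`ParsedGHCID` and `parse_ghcid` are not part of the project API) and treats anything after the fifth component as an optional suffix, such as a Wikidata Q-number.

```python
from typing import NamedTuple, Optional


class ParsedGHCID(NamedTuple):
    country: str
    region: str
    city: str
    institution_type: str
    abbreviation: str
    suffix: Optional[str] = None   # e.g. a disambiguating "Q17339437"


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split a GHCID into its five components plus an optional suffix."""
    parts = ghcid.split("-", 5)            # at most 6 pieces
    if len(parts) < 5 or not all(parts):
        raise ValueError(f"not a well-formed GHCID: {ghcid!r}")
    return ParsedGHCID(*parts)


print(parse_ghcid("NL-NH-AMS-M-RM"))
print(parse_ghcid("NL-NH-AMS-M-HM-Q17339437").suffix)
```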
---
## Collision Resistance Comparison
### Mathematical Analysis
```python
# Collision probability (birthday-paradox approximation):
#   P(collision) ≈ n² / (2 × 2^bits)

def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n random IDs."""
    return n**2 / (2 * 2**bits)

# For 1,000,000 institutions:
# UUID v5 / UUID SHA-256 (128-bit):
print(collision_probability(10**6, 128))  # ≈ 1.5 × 10^-27
# Effectively zero

# Numeric (64-bit):
print(collision_probability(10**6, 64))   # ≈ 2.7 × 10^-8 (0.0000027%)
# Negligible for heritage domain

# Even at 10 million institutions:
print(collision_probability(10**7, 64))   # ≈ 2.7 × 10^-6 (0.00027%)
# Still acceptable
```
### Real-World Context
| Institution Count | UUID v5/SHA-256 | Numeric (64-bit) | Assessment |
|-------------------|-----------------|------------------|------------|
| **100,000** | ~0% | 2.7×10^-10 (0.000000027%) | All safe |
| **1,000,000** | ~0% | 2.7×10^-8 (0.0000027%) | All safe |
| **10,000,000** | ~0% | 2.7×10^-6 (0.00027%) | UUID safe, numeric acceptable |
| **100,000,000** | ~0% | 2.7×10^-4 (0.027%) | Use UUID, numeric risky |
**Conclusion:** For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.
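
The same birthday approximation, P ≈ n² / (2 × 2^bits), can be inverted to ask how many identifiers a format supports before the collision risk passes a chosen threshold. The helper below is a hypothetical sketch, not project code. For a 64-bit identifier and a one-in-a-million risk budget it yields roughly 6 million identifiers.

```python
import math


def max_ids_for_risk(bits: int, max_probability: float) -> int:
    """Approximate largest n with n**2 / (2 * 2**bits) <= max_probability."""
    # Solve n² = p × 2 × 2^bits for n, rounding down.
    return math.isqrt(int(max_probability * 2 * 2**bits))


print(max_ids_for_risk(64, 1e-6))    # numeric (64-bit) at a 1-in-a-million budget
print(max_ids_for_risk(128, 1e-6))   # full 128-bit space at the same budget
# Note: a UUID has 6 fixed version/variant bits, so 122 bits is the tighter figure.
print(max_ids_for_risk(122, 1e-6))
```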
---
## Historical Collision Resolution
### The Rule: Temporal Priority Determines Disambiguation
When creating GHCIDs, collisions can occur in two temporal contexts:
1. **First Batch Creation** (initial PID assignment): Multiple institutions discovered simultaneously
2. **Historical Addition** (post-publication): New historical institution added after existing GHCID published
**Critical Design Decision**: The collision resolution strategy differs based on temporal context to preserve PID stability.
### First Batch Behavior (Initial PID Creation)
**Scenario**: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.
**Resolution**: **ALL** colliding institutions get Wikidata Q-numbers appended.
**Example**:
```yaml
# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM
# Science Museum (founded 1895, Wikidata Q842858)
ghcid_original: NL-NH-AMS-M-SM-Q842858
# Stedelijk Museum (founded 1874, Wikidata Q924335)
ghcid_original: NL-NH-AMS-M-SM-Q924335
```
**Rationale**: No existing PIDs to preserve; both institutions are "new" to the system.
### Historical Addition Behavior (Post-Publication)
**Scenario**: After initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.
**Resolution**: **ONLY** the newly added historical institution gets a Q-number suffix. The existing PID remains unchanged.
**Example**:
```yaml
# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM # Hermitage Museum Amsterdam (2009-2023)
# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975, Wikidata Q17339437)
# Would also generate: NL-NH-AMS-M-HM
#
# COLLISION DETECTED → Add Q-number to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-Q17339437
```
**Outcome**:
- `NL-NH-AMS-M-HM` (Hermitage Museum Amsterdam) **UNCHANGED**
- `NL-NH-AMS-M-HM-Q17339437` (Amsterdam Historical Museum) **Q-number added**
**Rationale**: Preserve stability of already-published PIDs.
### Why This Matters: PID Stability Principle
**Problem**: Changing existing GHCIDs breaks external references.
PIDs may already be:
- Cited in academic publications
- Referenced in datasets and APIs
- Stored in institutional databases
- Embedded in IIIF manifests
- Linked from Wikidata
**Principle**: **"Cool URIs don't change"** (Tim Berners-Lee, W3C)
Once a GHCID is published (in first batch or as standalone record), it should **NEVER** change, even if new historical institutions create collisions.
### Decision Table: Who Gets Q-Number?
| Scenario | When | Existing GHCID | New GHCID | Who Gets Q-Number | Rationale |
|----------|------|----------------|-----------|-------------------|-----------|
| **First Batch** | Initial PID creation (2025-11-01) | None (first time) | `NL-NH-AMS-M-SM` (2 institutions) | **ALL** colliding institutions | No existing PIDs to preserve |
| **Historical Addition** | Post-publication (2025-11-15) | `NL-NH-AMS-M-HM` (published) | `NL-NH-AMS-M-HM` (historical) | **ONLY** newly added institution | Preserve published PID stability |
| **Standalone Addition** | New institution (2026-01-01) | `NL-NH-AMS-M-XY` (published) | `NL-NH-AMS-M-XY` (new contemporary) | **ONLY** newly added institution | Preserve existing PID |
### Implementation Guidance
**Collision Detection Logic**:
```python
from typing import Set

# Institution and fetch_wikidata_qid() are provided by the project code.
def resolve_collision(
    new_ghcid: str, existing_ghcids: Set[str], new_institution: "Institution"
) -> str:
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_ghcid: Base GHCID for new institution
        existing_ghcids: Set of already-published GHCIDs
        new_institution: Institution being added (needed for the Wikidata lookup)

    Returns:
        Final GHCID (with Q-number if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add Q-number to NEW institution ONLY
        # Assumes fetch_wikidata_qid() returns the full ID, e.g. "Q17339437"
        wikidata_qid = fetch_wikidata_qid(new_institution)
        return f"{new_ghcid}-{wikidata_qid}"
    # No collision: Use base GHCID
    return new_ghcid
```
**First Batch Processing** (different logic):
```python
from collections import defaultdict
from typing import List

def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]:
    """
    Process initial batch of institutions.

    For first batch, ALL collisions get Q-numbers appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)

    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get Q-numbers
            # Assumes fetch_wikidata_qid() returns the full ID, e.g. "Q924335"
            for inst in group:
                qid = fetch_wikidata_qid(inst)
                records.append(create_record(inst, f"{base_ghcid}-{qid}"))
    return records
```
### Edge Cases
**Case 1: Multiple historical institutions added simultaneously**
If multiple historical institutions are added together (same date) and collide with an existing GHCID:
```yaml
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY
# Both added 2025-11-15
# Historical Institution A (Wikidata Q111111)
ghcid: NL-NH-AMS-M-XY-Q111111
# Historical Institution B (Wikidata Q222222)
ghcid: NL-NH-AMS-M-XY-Q222222
```
**Resolution**: ALL newly added institutions get Q-numbers (treat as mini-batch).
**Case 2: Historical institution without Wikidata ID**
If a historical institution lacks a Wikidata Q-number, use fallback identifiers in this order:
1. **VIAF ID**: `NL-NH-AMS-M-XY-V12345678`
2. **Founding Year**: `NL-NH-AMS-M-XY-Y1654` (e.g., Ole Worm's Wunderkammer)
3. **Sequential**: `NL-NH-AMS-M-XY-001` (last resort)
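
The fallback order can be expressed as a small helper. This is a hypothetical sketch (`disambiguation_suffix` is not part of the project API); the `V`, `Y`, and zero-padded sequence formats are taken from the examples above, and the Wikidata Q-number is tried first as the preferred option.

```python
from typing import Optional


def disambiguation_suffix(
    qid: Optional[str] = None,
    viaf_id: Optional[str] = None,
    founding_year: Optional[int] = None,
    sequence: int = 1,
) -> str:
    """Pick the first available suffix: Q-number, VIAF, founding year, sequence."""
    if qid:
        return qid if qid.startswith("Q") else f"Q{qid}"
    if viaf_id:
        return f"V{viaf_id}"
    if founding_year is not None:
        return f"Y{founding_year}"
    return f"{sequence:03d}"   # last resort


base = "NL-NH-AMS-M-XY"
print(f"{base}-{disambiguation_suffix(viaf_id='12345678')}")
print(f"{base}-{disambiguation_suffix(founding_year=1654)}")
print(f"{base}-{disambiguation_suffix()}")
```
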
**Case 3: Existing GHCID already has Q-number**
If the existing GHCID already has a Q-number (from a first-batch collision), the new historical addition gets its own, different Q-number:
```yaml
# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-Q924335 # Stedelijk Museum
# Historical addition (2025-11-15, Wikidata Q888888)
ghcid: NL-NH-AMS-M-SM-Q888888 # Different Q-number
```
**No ambiguity**: Each institution has unique Q-number.
### Testing Strategy
**Test 1: First Batch Collision**
```python
def test_first_batch_collision():
    """Verify ALL institutions in first batch get Q-numbers"""
    institutions = [
        Institution("Stedelijk Museum", type="M", city="AMS", qid="Q924335"),
        Institution("Science Museum", type="M", city="AMS", qid="Q842858"),
    ]
    records = process_first_batch(institutions)

    # Both should have Q-numbers
    assert records[0].ghcid == "NL-NH-AMS-M-SM-Q924335"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-Q842858"
```
**Test 2: Historical Addition Collision**
```python
def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}

    # Add historical institution
    historical = Institution(
        "Amsterdam Historical Museum",
        type="M",
        city="AMS",
        qid="Q17339437",
        temporal_extent={"start": "1926", "end": "1975"},
    )
    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        existing_ghcids,
        historical,  # institution is needed for the Wikidata lookup
    )

    # New historical gets Q-number
    assert new_ghcid == "NL-NH-AMS-M-HM-Q17339437"
    # Existing GHCID must remain untouched
    assert existing_ghcids == {"NL-NH-AMS-M-HM"}
```
### Documentation References
- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
- **Implementation**: `src/glam_extractor/identifiers/ghcid.py`
- **Schema**: `schemas/provenance.yaml` (GHCIDHistoryEntry)
---
## SHA-1 vs SHA-256: The Nuance
### Why UUID v5 Uses SHA-1
**RFC 4122 (2005)** standardized UUID v5 with SHA-1 because:
- SHA-1 was considered secure in 2005
- 128-bit UUID space provides collision resistance even with SHA-1
- Purpose is **identifier generation**, not **security/authentication**
### SHA-1 Cryptographic Weakness
**SHA-1 collision attacks (2017):**
- Google/CWI demonstrated practical SHA-1 collision
- Two different inputs producing same hash
- **Critical for digital signatures** (authentication, certificates)
- **Less critical for identifiers** (birthday paradox protection sufficient)
### When SHA-1 Is Problematic
- **Digital signatures** - Attacker can forge documents
- **Certificate authorities** - SSL/TLS security compromised
- **Password hashing** - Weakens brute-force resistance
- **Blockchain** - Consensus security at risk
### When SHA-1 Is Acceptable
- **UUID generation** - Collision resistance adequate for identifier space
- **Git commits** - Linus Torvalds has argued that SHA-1 remains adequate for Git's use case
- **Non-adversarial contexts** - No attacker trying to cause collisions
---
## Recommended Usage Strategy
### Default: Dual UUID Approach
Store **both UUID formats** for maximum flexibility:
```yaml
# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000         # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739                        # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739
```
### Use Case Decision Tree
```
Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│        - Europeana, DPLA, IIIF, Wikidata
│        - RFC 4122 compliance required
└─ NO → Building custom system?
        ├─ Security policy mandates SHA-256?
        │   ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
        │   └─ NO → Use UUID v5 for standard compliance
        └─ Database optimization critical?
            ├─ YES → Use Numeric (to_numeric()) as PK
            │        - Store UUID v5 as alternate key
            └─ NO → Use UUID v5 as primary identifier
```
---
## Code Examples
### Generate All Four Formats
```python
from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA",
)

# Generate all formats
uuid_v5 = components.to_uuid()              # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()   # UUID SHA-256
numeric = components.to_numeric()           # Numeric (64-bit)
human = components.to_string()              # Human-readable

print(f"UUID v5: {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric: {numeric}")
print(f"Human: {human}")

# Example output (placeholder values for illustration):
# UUID v5: 550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric: 213324328442227739
# Human: US-CA-SAN-A-IA
```
### Verify Determinism
```python
# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()
```
### Export to Different Formats
```python
# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"

# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"

# Database (use numeric PK). Bind values as parameters rather than formatting
# them into the SQL string; the placeholder syntax (%s, ?) depends on the driver.
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"
params = (components.to_numeric(), "Internet Archive")

# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."
```
---
## Future-Proofing Strategy
### Timeline Projections
| Year | SHA-1 Status | UUID v5 Status | Recommendation |
|------|--------------|----------------|----------------|
| **2024** | Weak for security, OK for IDs | Standard, widely supported | Use UUID v5 as primary |
| **2030** | Likely deprecated for security | Still standard for IDs | Dual UUID (v5 + SHA-256) |
| **2040** | Possibly deprecated entirely | May be superseded | Migrate to UUID SHA-256 |
### Migration Path
If SHA-1 is fully deprecated:
1. **Phase 1 (Now):** Store both UUID v5 and UUID SHA-256
2. **Phase 2 (2030):** Make UUID SHA-256 primary, keep v5 as alias
3. **Phase 3 (2040):** Deprecate UUID v5, use SHA-256 exclusively
**Critical:** Because both are **deterministic**, you can always regenerate from GHCID string without breaking references.
---
## Governance & Resolution
### Identifier Persistence Requirements
Technical generation is only half the solution. True persistence requires:
#### 1. **Resolution Service**
```
https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}
All three should resolve to the same institutional record.
```
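
As a rough sketch of what "all three resolve to the same record" means in practice, the lookup below keeps one index per identifier kind and points every entry at the same record object. The in-memory dictionaries and the `resolve` function are hypothetical stand-ins for a real service, and the identifier values are the illustrative ones used elsewhere in this document.

```python
from typing import Dict, Optional

RECORD = {"name": "Internet Archive", "ghcid": "US-CA-SAN-A-IA"}

# One index per identifier kind; every entry references the same record.
INDEX: Dict[str, Dict[str, dict]] = {
    "uuid": {"550e8400-e29b-41d4-a716-446655440000": RECORD},
    "numeric": {"213324328442227739": RECORD},
    "ghcid": {"US-CA-SAN-A-IA": RECORD},
}


def resolve(path: str) -> Optional[dict]:
    """Resolve '/{kind}/{value}' to a record, or None if it is unknown."""
    kind, _, value = path.strip("/").partition("/")
    if kind == "uuid":
        value = value.lower()   # UUIDs compare case-insensitively
    return INDEX.get(kind, {}).get(value)


print(resolve("/ghcid/US-CA-SAN-A-IA") is resolve("/numeric/213324328442227739"))
```
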
#### 2. **Mapping Database**
```sql
CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL, -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);
```
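
One practical wrinkle with the `numeric BIGINT` column: the numeric identifier is an unsigned 64-bit value, while `BIGINT` is signed in databases such as PostgreSQL, so values at or above 2^63 would not fit. A possible workaround, sketched here with hypothetical helper names, is to reinterpret the value as a signed 64-bit integer on write and convert it back on read.

```python
def to_signed64(value: int) -> int:
    """Map an unsigned 64-bit integer onto the signed 64-bit range."""
    if not 0 <= value < 2**64:
        raise ValueError("value is outside the unsigned 64-bit range")
    return value - 2**64 if value >= 2**63 else value


def to_unsigned64(value: int) -> int:
    """Inverse of to_signed64: recover the original unsigned identifier."""
    if not -(2**63) <= value < 2**63:
        raise ValueError("value is outside the signed 64-bit range")
    return value + 2**64 if value < 0 else value


stored = to_signed64(2**64 - 1)
print(stored, to_unsigned64(stored) == 2**64 - 1)
```

The mapping is lossless, but it changes numeric sort order for the upper half of the range, so it suits key lookups better than range queries.
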
#### 3. **Organizational Commitment**
- Maintain resolution service for decades
- Fund infrastructure for long-term operation
- Establish governance policies for ID assignment
- Handle institution mergers/closures/relocations
#### 4. **Community Standards**
- Coordinate with ISIL, Wikidata, GeoNames
- Publish GHCID specification as RFC or W3C note
- Engage with Europeana, DPLA, IIIF communities
- Establish dispute resolution process
---
## Comparison with Existing PID Systems
| System | Format | Governance | Resolution | Adoption |
|--------|--------|------------|------------|----------|
| **DOI** | 10.xxxx/yyyy | IDF (non-profit) | doi.org | High (scholarly) |
| **ARK** | ark:/nnnnn/xxx | CDL (California) | n2t.net | Medium (archives) |
| **Handle** | hdl:xxxx/yyyy | CNRI (non-profit) | handle.net | Medium (repositories) |
| **GHCID** | UUID v5 | **TBD** | **TBD** | None (new) |
**Lesson:** Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.
---
## Recommendations
### For This Project (2024-2025)
1. **Implement dual UUID generation** (v5 + SHA-256)
2. **Store all four identifier formats** in data model
3. **Use UUID v5 as primary ID** for current interoperability
4. **Document SHA-1 nuance** clearly
5. **Build resolution service prototype**
6. **Engage with Europeana/DPLA** for feedback
7. **Draft GHCID specification** for community review
### For Production Deployment
1. **Establish governance body** (non-profit foundation?)
2. **Secure long-term funding** for resolution service
3. **Coordinate with existing PID systems** (ISIL, VIAF, Wikidata)
4. **Publish specification** (W3C note or IETF RFC)
5. **Deploy resolution infrastructure** (multi-region, high availability)
6. **Engage heritage community** for adoption
---
## References
- **RFC 4122:** UUID Standard - https://tools.ietf.org/html/rfc4122
- **SHA-1 Collision:** Google/CWI (2017) - https://shattered.io
- **RFC 9562 (UUID v7/v8):** New UUID formats, developed as draft-ietf-uuidrev-rfc4122bis - https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/
- **NIST SHA-256:** FIPS 180-4 - https://csrc.nist.gov/publications/fips
- **Identifiers.org:** Life sciences identifiers - https://identifiers.org
- **N2T:** Name-to-Thing resolver - https://n2t.net
---
**Version:** 1.0
**Date:** 2024-11-06
**Status:** Draft for Community Review