UUID Strategy for Heritage Institutions
The Problem with UUID v7 for Persistent Identifiers
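UUID v7 combines a millisecond timestamp with random bits. That makes it a good database key, but it is not deterministic: generating an ID for the same institution twice yields two different values.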
Why Determinism Matters for PIDs
| Scenario | UUID v5 (Deterministic) | UUID v7 (Time-based) |
|---|---|---|
| Regenerate from GHCID | ✅ Always same UUID | ❌ Different UUID each time |
| Independent systems agree | ✅ Same UUID generated | ❌ Different UUIDs |
| Lost database recovery | ✅ Rebuild from GHCID strings | ❌ Must restore from backup |
| Content-addressed | ✅ Hash of content | ❌ Based on timestamp |
Example:
import uuid
from uuid_utils import uuid7  # third-party uuid-utils library

# NAMESPACE is the project's fixed namespace UUID (defined elsewhere)
# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
pid1 = uuid.uuid5(NAMESPACE, ghcid)
pid2 = uuid.uuid5(NAMESPACE, ghcid)
pid1 == pid2  # ✅ True: same namespace and name always give the same UUID
# UUID v7 - Time-based (timestamp-addressed)
id1 = uuid7()  # → 018e1234-5678-7abc-... (illustrative shape)
id2 = uuid7()  # → 018e1234-9999-7fff-... (illustrative shape)
id1 == id2  # ❌ False: every call produces a new UUID
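The "lost database recovery" row can be made concrete with a minimal sketch. It assumes a hypothetical namespace (`EXAMPLE_NAMESPACE`, derived from a placeholder URL, since the project's real namespace UUID is not given here) and a helper name, `rebuild_pids`, invented for illustration. It shows that the PIDs can be recomputed from GHCID strings alone.

```python
import uuid

# Hypothetical namespace for illustration only; a real deployment would
# pin one fixed namespace UUID and never change it.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")


def rebuild_pids(ghcids):
    """Recompute the UUID v5 PID for each GHCID string."""
    return {ghcid: uuid.uuid5(EXAMPLE_NAMESPACE, ghcid) for ghcid in ghcids}


# Two independent "systems" given the same strings derive the same PIDs,
# without sharing a database or a backup.
system_a = rebuild_pids(["US-CA-SAN-A-IA"])
system_b = rebuild_pids(["US-CA-SAN-A-IA"])
assert system_a == system_b
```

The guarantee holds only while both the namespace and the exact GHCID string (including case) stay unchanged.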
✅ Hybrid Strategy: Best of Both Worlds
Use both UUID types for different purposes:
1. UUID v5 (SHA-1) or UUID v8 (SHA-256) → Persistent Identifier
- Purpose: Long-term, deterministic, content-addressed PID
- Use for: Cross-system references, citations, Wikidata, IIIF
- Benefit: Can always regenerate from GHCID string
2. UUID v7 → Database Record ID
- Purpose: Time-ordered, sortable, high-performance database key
- Use for: Internal database primary keys, indexing, queries
- Benefit: Faster inserts, better B-tree performance
Example Data Model
# Heritage institution record
- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB primary key
  pid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 - Persistent ID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  name: Internet Archive
  ghcid: US-CA-SAN-A-IA
  created_at: 2024-11-06T12:34:56Z  # Embedded in UUID v7
  identifiers:
    - identifier_scheme: RECORD_ID_V7
      identifier_value: 018e1234-5678-7abc-def0-123456789abc
    - identifier_scheme: PID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: PID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
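The "Embedded in UUID v7" comment relies on the first 48 bits of a UUID v7 holding a Unix timestamp in milliseconds. A minimal sketch of reading it back with only the standard library follows; the helper name `uuid7_timestamp` is an illustration, not part of any library.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid7_timestamp(value: uuid.UUID) -> datetime:
    """Return the creation time stored in the top 48 bits of a UUID v7."""
    unix_ms = value.int >> 80  # drop the 80 low bits (version, variant, random)
    return EPOCH + timedelta(milliseconds=unix_ms)


record_id = uuid.UUID("018e1234-5678-7abc-def0-123456789abc")
print(uuid7_timestamp(record_id).isoformat())
```

The sample identifiers in this document appear to be placeholders, so the decoded time need not match the `created_at` value shown. In a real record the two should agree to within the insert latency.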
Database Schema
-- PostgreSQL syntax. uuid_v7() is assumed to come from an extension or a
-- user-defined function (PostgreSQL 18 adds a built-in uuidv7()).
CREATE TABLE heritage_institutions (
    -- UUID v7 - Primary key (sortable, fast)
    record_id UUID PRIMARY KEY DEFAULT uuid_v7(),
    -- UUID v5 - Persistent identifier (deterministic)
    pid_uuid_v5 UUID NOT NULL UNIQUE,
    -- UUID v8 - SHA-256 PID (SOTA cryptographic strength)
    pid_uuid_sha256 UUID NOT NULL UNIQUE,
    -- Human-readable GHCID
    ghcid VARCHAR(100) NOT NULL,
    -- Institution data
    name TEXT NOT NULL,
    institution_type VARCHAR(50),
    -- Timestamps (creation time is also encoded in the UUID v7)
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
-- Indexes (the UNIQUE constraints above already index both PID columns)
CREATE INDEX idx_ghcid ON heritage_institutions (ghcid);
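How the internal key and the public PID work together can be sketched with Python's built-in `sqlite3` module. This is an illustration only: SQLite has no native UUID type, so values are stored as text, the table is cut down to four columns, and `EXAMPLE_NAMESPACE` is a hypothetical namespace rather than the project's real one.

```python
import sqlite3
import uuid

# Hypothetical namespace; the project's real namespace UUID is not shown here.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ghcid")

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE heritage_institutions (
        record_id   TEXT PRIMARY KEY,      -- UUID v7, internal key
        pid_uuid_v5 TEXT NOT NULL UNIQUE,  -- UUID v5, public PID
        ghcid       TEXT NOT NULL,
        name        TEXT NOT NULL
    )
    """
)

ghcid = "US-CA-SAN-A-IA"
pid = str(uuid.uuid5(EXAMPLE_NAMESPACE, ghcid))
conn.execute(
    "INSERT INTO heritage_institutions VALUES (?, ?, ?, ?)",
    ("018e1234-5678-7abc-def0-123456789abc", pid, ghcid, "Internet Archive"),
)

# An external system that only knows the GHCID can recompute the PID
# and find the row without ever seeing the internal record_id.
lookup_pid = str(uuid.uuid5(EXAMPLE_NAMESPACE, ghcid))
row = conn.execute(
    "SELECT name FROM heritage_institutions WHERE pid_uuid_v5 = ?", (lookup_pid,)
).fetchone()
print(row[0])  # Internet Archive
```

Keeping `record_id` out of public URLs and citations means the database can be rebuilt with fresh primary keys without breaking external references.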
Performance Comparison
UUID v7 vs UUID v5 for Database Primary Keys
From the article's insights:
| Metric | UUID v4/v5 (Random) | UUID v7 (Time-ordered) |
|---|---|---|
| Insert Performance | ⚠️ Slow (random B-tree splits) | ✅ Fast (sequential inserts) |
| Index Size | ⚠️ Larger (fragmented) | ✅ Smaller (compact) |
| Range Queries | ❌ Inefficient | ✅ Efficient (time-based) |
| Sortability | ❌ Random order | ✅ Time-ordered |
| Determinism | ✅ Yes (v5 only) | ❌ No |
| Content-addressed | ✅ Yes (v5 only) | ❌ No |
Conclusion:
- Use UUID v7 as database PK for performance
- Use UUID v5/v8 as PID for persistence and interoperability
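The sortability claim can be checked with a small standard-library sketch. `make_uuid7` below is a hypothetical stand-in for a library generator such as uuid-utils' `uuid7`, written only so the example runs without third-party packages. It follows the UUID v7 bit layout from RFC 9562 (48-bit millisecond timestamp, version, 12 random bits, variant, 62 random bits).

```python
import os
import uuid


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Build a UUID v7 from a millisecond timestamp (illustrative only)."""
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80     # timestamp in bits 127..80
    value |= 0x7 << 76                            # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                           # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)


# IDs created in successive milliseconds already arrive in index order.
timestamps = [1_700_000_000_000 + i for i in range(5)]
v7_ids = [make_uuid7(ms) for ms in timestamps]
assert v7_ids == sorted(v7_ids)

# Hash-based IDs carry no time information, so their order is arbitrary.
v5_ids = [uuid.uuid5(uuid.NAMESPACE_URL, f"item-{i}") for i in range(5)]
print(v5_ids == sorted(v5_ids))
```

Within a single millisecond this sketch orders IDs by their random bits only; libraries may add a counter to keep them monotonic.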
Implementation: Dual UUID Generation
import uuid
import hashlib
from datetime import datetime, timezone
from uuid_utils import uuid7  # Python uuid-utils library

# Placeholder namespace so the example runs; this value is an assumption.
# A real deployment must pin one fixed namespace UUID and never change it.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://id.heritage.org/ghcid/")


class GHCIDComponents:
    def __init__(self, country_code, region_code, city_locode,
                 institution_type, abbreviation):
        self.country_code = country_code.upper()
        self.region_code = region_code.upper()
        self.city_locode = city_locode.upper()
        self.institution_type = institution_type.upper()
        self.abbreviation = abbreviation.upper()

    def to_string(self) -> str:
        """Human-readable GHCID string."""
        return f"{self.country_code}-{self.region_code}-{self.city_locode}-{self.institution_type}-{self.abbreviation}"

    # === PERSISTENT IDENTIFIERS (deterministic) ===
    def to_uuid_v5(self) -> uuid.UUID:
        """UUID v5 - Persistent ID (SHA-1 based, RFC 4122)."""
        ghcid_str = self.to_string()
        return uuid.uuid5(GHCID_NAMESPACE, ghcid_str)

    def to_uuid_sha256(self) -> uuid.UUID:
        """UUID v8 - Persistent ID (SHA-256 based, SOTA)."""
        ghcid_str = self.to_string()
        hash_bytes = hashlib.sha256(ghcid_str.encode('utf-8')).digest()
        uuid_bytes = bytearray(hash_bytes[:16])  # keep the first 128 bits
        uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80  # Version 8
        uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80  # Variant RFC 4122
        return uuid.UUID(bytes=bytes(uuid_bytes))

    # === DATABASE RECORD ID (time-based) ===
    @staticmethod
    def generate_record_id() -> uuid.UUID:
        """UUID v7 - Database primary key (time-ordered, high performance)."""
        # uuid-utils returns its own UUID class; convert to the stdlib type
        return uuid.UUID(str(uuid7()))

    # === EXAMPLE USAGE ===
    def create_database_record(self):
        """Create a complete database record with all UUID types."""
        return {
            'record_id': self.generate_record_id(),    # UUID v7 - DB PK
            'pid_uuid_v5': self.to_uuid_v5(),          # UUID v5 - Persistent ID
            'pid_uuid_sha256': self.to_uuid_sha256(),  # UUID v8 - SOTA PID
            'ghcid': self.to_string(),                 # Human-readable
            'created_at': datetime.now(timezone.utc),  # utcnow() is deprecated
        }


# Example
components = GHCIDComponents("US", "CA", "SAN", "A", "IA")
record = components.create_database_record()
print(f"Record ID (v7): {record['record_id']}")       # UUID v7, new on every run
print(f"PID v5: {record['pid_uuid_v5']}")             # UUID v5, stable for a fixed namespace
print(f"PID SHA-256: {record['pid_uuid_sha256']}")    # UUID v8, stable for a given GHCID
print(f"GHCID: {record['ghcid']}")                    # US-CA-SAN-A-IA

# Verify determinism
assert components.to_uuid_v5() == components.to_uuid_v5()  # ✅ Same every time
assert components.to_uuid_sha256() == components.to_uuid_sha256()  # ✅ Same every time
assert components.generate_record_id() != components.generate_record_id()  # ✅ Different (time-based)
Resolution Service Strategy
Multi-UUID Resolution
# All UUIDs resolve to the same institution record
# UUID v7 (record ID)
https://id.heritage.org/record/018e1234-5678-7abc-def0-123456789abc
→ Redirects to institutional page
# UUID v5 (persistent ID)
https://id.heritage.org/pid/550e8400-e29b-41d4-a716-446655440000
→ Redirects to institutional page
# UUID v8 (SHA-256 PID)
https://id.heritage.org/pid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
→ Redirects to institutional page
# Human-readable GHCID
https://id.heritage.org/ghcid/US-CA-SAN-A-IA
→ Redirects to institutional page
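The routing idea above can be sketched as a small lookup function. Everything here is hypothetical: the in-memory `INDEXES` table, the `resolve` helper, and the single record (filled with the placeholder identifiers used in this document) stand in for whatever storage and web framework a real resolution service would use.

```python
import uuid
from typing import Optional

# One in-memory record, using the placeholder identifiers from this document.
RECORD = {"name": "Internet Archive", "ghcid": "US-CA-SAN-A-IA"}

# Hypothetical lookup tables: one per URL prefix, all pointing at the same record.
INDEXES = {
    "record": {"018e1234-5678-7abc-def0-123456789abc": RECORD},
    "pid": {"550e8400-e29b-41d4-a716-446655440000": RECORD},
    "pid-sha256": {"a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d": RECORD},
    "ghcid": {"US-CA-SAN-A-IA": RECORD},
}


def resolve(path: str) -> Optional[dict]:
    """Map a path such as '/pid/<uuid>' to its record, or None if unknown."""
    parts = path.strip("/").split("/")
    if len(parts) != 2:
        return None
    scheme, value = parts
    index = INDEXES.get(scheme)
    if index is None:
        return None
    # UUID text is case-insensitive; GHCIDs are stored upper-case.
    key = value.upper() if scheme == "ghcid" else value.lower()
    return index.get(key)


for path in ("/record/018e1234-5678-7abc-def0-123456789abc",
             "/pid/550e8400-e29b-41d4-a716-446655440000",
             "/ghcid/US-CA-SAN-A-IA"):
    assert resolve(path) is RECORD
```

A real service would answer with an HTTP redirect to the institution page instead of returning the record directly.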
When to Use Each UUID Type
Decision Matrix
| Use Case | UUID v7 | UUID v5 | UUID v8 (SHA-256) |
|---|---|---|---|
| Database primary key | ✅ Best choice | ⚠️ Works but slower | ⚠️ Works but slower |
| Time-ordered queries | ✅ Native support | ❌ Random order | ❌ Random order |
| Persistent identifier (PID) | ❌ Not deterministic | ✅ Standard choice | ✅ SOTA choice |
| Cross-system references | ⚠️ Internal only | ✅ Yes | ✅ Yes |
| Citations/Wikidata | ❌ Not persistent | ✅ Yes | ✅ Yes (if accepted) |
| Security compliance (SHA-256 required) | ❌ No | ❌ Uses SHA-1 | ✅ Yes |
| Europeana/DPLA integration | ❌ No | ✅ Standard | ⚠️ Custom |
Summary: Three-UUID Strategy
1. UUID v7 → Internal Database Record ID
- ✅ Fast inserts (sequential)
- ✅ Time-ordered (sortable by creation time)
- ✅ Better B-tree performance
- ❌ NOT reproducible (time-based, random component), so it cannot be regenerated from the GHCID
- Use for: Database primary keys, internal references
2. UUID v5 → Public Persistent Identifier (Standard)
- ✅ Deterministic (content-addressed)
- ✅ RFC 4122 compliant
- ✅ Interoperable (Europeana, DPLA, IIIF)
- ⚠️ SHA-1 based (weaker cryptographically)
- Use for: Public PIDs, cross-system references, citations
3. UUID v8 (SHA-256) → Future-Proof Persistent Identifier
- ✅ Deterministic (content-addressed)
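- ✅ SHA-256 based (avoids SHA-1; note that only 122 bits of the digest survive truncation into a UUID)
- ✅ Future-proof against SHA-1 deprecation
- ⚠️ Custom derivation (RFC 9562 defines the v8 layout but leaves the content to the implementer, so other systems must copy this exact scheme)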
- Use for: Security-compliant PIDs, future-proofing
Recommendation for GLAM Project
Store all three UUIDs:
- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB PK
  pid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 - Public PID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  ghcid: US-CA-SAN-A-IA
Benefits:
- ✅ Fast database performance (UUID v7 PK)
- ✅ Standard interoperability (UUID v5 PID)
- ✅ Future-proof (UUID v8 SHA-256 PID)
- ✅ Human-readable (GHCID string)
Trade-off: Slightly more storage (48 bytes vs 16 bytes per record), but worth it for flexibility.
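The storage figure follows from keeping three 16-byte UUIDs per row instead of one. A rough back-of-the-envelope helper is shown below. It counts raw column bytes only, and the function name and the one-million-record figure are illustrative assumptions.

```python
UUID_BYTES = 16  # a UUID is 128 bits


def extra_uuid_bytes(records: int, uuids_per_record: int = 3) -> int:
    """Raw bytes added by storing extra UUID columns beyond the first."""
    return records * (uuids_per_record - 1) * UUID_BYTES


print(extra_uuid_bytes(1_000_000))  # 32000000 bytes, about 32 MB
```

Unique indexes on the two PID columns add further overhead that depends on the database engine, so the real cost will be higher than this raw figure.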
Version: 2.0
Date: 2024-11-06
Status: Hybrid UUID Strategy