# UUID Strategy for Heritage Institutions

## The Problem with UUID v7 for Persistent Identifiers

**UUID v7 combines a millisecond timestamp with random bits** - perfect for databases, but **not deterministic**.

### Why Determinism Matters for PIDs

| Scenario | UUID v5 (Deterministic) | UUID v7 (Time-based) |
|----------|-------------------------|----------------------|
| **Regenerate from GHCID** | ✅ Always same UUID | ❌ Different UUID each time |
| **Independent systems agree** | ✅ Same UUID generated | ❌ Different UUIDs |
| **Lost database recovery** | ✅ Rebuild from GHCID strings | ❌ Must restore from backup |
| **Content-addressed** | ✅ Hash of content | ❌ Based on timestamp |

**Example:**

```python
import uuid
from uuid_utils import uuid7  # third-party uuid-utils library

# Placeholder namespace for illustration; the project must fix its own namespace UUID
NAMESPACE = uuid.NAMESPACE_URL

# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
uuid1 = uuid.uuid5(NAMESPACE, ghcid)  # → 550e8400-e29b-41d4-a716-... (illustrative)
uuid2 = uuid.uuid5(NAMESPACE, ghcid)  # → 550e8400-e29b-41d4-a716-... (illustrative)
uuid1 == uuid2  # ✅ SAME!

# UUID v7 - Time-based (timestamp-addressed)
uuid1 = uuid7()  # → 018e1234-5678-7abc-... (illustrative)
uuid2 = uuid7()  # → 018e1234-9999-7fff-... (illustrative)
uuid1 == uuid2  # ❌ DIFFERENT!
```

---

## ✅ Hybrid Strategy: Best of Both Worlds

Use **both UUID types** for different purposes:

### 1. **UUID v5 (SHA-1)** or **UUID v8 (SHA-256)** → Persistent Identifier

- **Purpose:** Long-term, deterministic, content-addressed PID
- **Use for:** Cross-system references, citations, Wikidata, IIIF
- **Benefit:** Can always be regenerated from the GHCID string

### 2. **UUID v7** → Database Record ID

- **Purpose:** Time-ordered, sortable, high-performance database key
- **Use for:** Internal database primary keys, indexing, queries
- **Benefit:** Faster inserts, better B-tree performance

### Example Data Model

```yaml
# Heritage institution record (all UUID values are illustrative placeholders)
- record_id: 018e1234-5678-7abc-def0-123456789abc   # UUID v7 - DB primary key
  pid: 550e8400-e29b-41d4-a716-446655440000         # UUID v5 - Persistent ID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  name: Internet Archive
  ghcid: US-CA-SAN-A-IA
  created_at: 2024-11-06T12:34:56Z  # Also embedded in the UUID v7 (millisecond precision)
  identifiers:
    - identifier_scheme: RECORD_ID_V7
      identifier_value: 018e1234-5678-7abc-def0-123456789abc
    - identifier_scheme: PID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: PID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
```

### Database Schema

```sql
-- PostgreSQL syntax; uuid_v7() is assumed to come from an extension or a
-- user-defined function (it is not built in under that name)
CREATE TABLE heritage_institutions (
    -- UUID v7 - Primary key (sortable, fast)
    record_id UUID PRIMARY KEY DEFAULT uuid_v7(),

    -- UUID v5 - Persistent identifier (deterministic)
    pid_uuid_v5 UUID NOT NULL UNIQUE,

    -- UUID v8 - SHA-256 PID (SOTA cryptographic strength)
    pid_uuid_sha256 UUID NOT NULL UNIQUE,

    -- Human-readable GHCID
    ghcid VARCHAR(100) NOT NULL,

    -- Institution data
    name TEXT NOT NULL,
    institution_type VARCHAR(50),

    -- Timestamps (creation time is also embedded in the UUID v7 record_id)
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes (the UNIQUE constraints above already index both PID columns)
CREATE INDEX idx_ghcid ON heritage_institutions (ghcid);
```

---

## Performance Comparison

### UUID v7 vs UUID v5 for Database Primary Keys

**From the article's insights:**

| Metric | UUID v4/v5 (Unordered) | UUID v7 (Time-ordered) |
|--------|------------------------|------------------------|
| **Insert Performance** | ⚠️ Slow (random B-tree splits) | ✅ Fast (sequential inserts) |
| **Index Size** | ⚠️ Larger (fragmented) | ✅ Smaller (compact) |
| **Range Queries** | ❌ Inefficient | ✅ Efficient (time-based) |
| **Sortability** | ❌ Arbitrary order | ✅ Time-ordered |
| **Determinism** | ✅ Yes (v5 only) | ❌ No |
| **Content-addressed** | ✅ Yes (v5 only) | ❌ No |

**Conclusion:**

- Use **UUID v7 as database PK** for performance
- Use **UUID v5/v8 as PID** for persistence and interoperability

---

## Implementation: Dual UUID Generation

```python
import uuid
import hashlib
from datetime import datetime, timezone
from uuid_utils import uuid7  # Python uuid-utils library

# Placeholder namespace for illustration; the project must fix its own namespace UUID
GHCID_NAMESPACE = uuid.NAMESPACE_URL


class GHCIDComponents:
    def __init__(self, country_code, region_code, city_locode, institution_type, abbreviation):
        self.country_code = country_code.upper()
        self.region_code = region_code.upper()
        self.city_locode = city_locode.upper()
        self.institution_type = institution_type.upper()
        self.abbreviation = abbreviation.upper()

    def to_string(self) -> str:
        """Human-readable GHCID string."""
        return f"{self.country_code}-{self.region_code}-{self.city_locode}-{self.institution_type}-{self.abbreviation}"

    # === PERSISTENT IDENTIFIERS (deterministic) ===

    def to_uuid_v5(self) -> uuid.UUID:
        """UUID v5 - Persistent ID (SHA-1 based, RFC 4122)."""
        ghcid_str = self.to_string()
        return uuid.uuid5(GHCID_NAMESPACE, ghcid_str)

    def to_uuid_sha256(self) -> uuid.UUID:
        """UUID v8 - Persistent ID (SHA-256 based, SOTA)."""
        ghcid_str = self.to_string()
        hash_bytes = hashlib.sha256(ghcid_str.encode('utf-8')).digest()
        uuid_bytes = bytearray(hash_bytes[:16])  # Truncate the digest to 128 bits
        uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80  # Version 8
        uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80  # Variant RFC 4122
        return uuid.UUID(bytes=bytes(uuid_bytes))

    # === DATABASE RECORD ID (time-based) ===

    @staticmethod
    def generate_record_id() -> uuid.UUID:
        """UUID v7 - Database primary key (time-ordered, high performance)."""
        # uuid-utils returns its own UUID type; convert to the stdlib type
        return uuid.UUID(str(uuid7()))

    # === EXAMPLE USAGE ===

    def create_database_record(self):
        """Create a complete database record with all UUID types."""
        return {
            'record_id': self.generate_record_id(),    # UUID v7 - DB PK
            'pid_uuid_v5': self.to_uuid_v5(),          # UUID v5 - Persistent ID
            'pid_uuid_sha256': self.to_uuid_sha256(),  # UUID v8 - SOTA PID
            'ghcid': self.to_string(),                 # Human-readable
            'created_at': datetime.now(timezone.utc),  # utcnow() is deprecated
        }


# Example (UUID values in the comments are illustrative placeholders)
components = GHCIDComponents("US", "CA", "SAN", "A", "IA")
record = components.create_database_record()

print(f"Record ID (v7): {record['record_id']}")      # 018e1234-5678-7abc-def0-123456789abc
print(f"PID v5: {record['pid_uuid_v5']}")            # 550e8400-e29b-41d4-a716-446655440000
print(f"PID SHA-256: {record['pid_uuid_sha256']}")   # a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
print(f"GHCID: {record['ghcid']}")                   # US-CA-SAN-A-IA

# Verify determinism
assert components.to_uuid_v5() == components.to_uuid_v5()  # ✅ Same every time
assert components.to_uuid_sha256() == components.to_uuid_sha256()  # ✅ Same every time
assert components.generate_record_id() != components.generate_record_id()  # ✅ Different (time-based)
```

---

## Resolution Service Strategy

### Multi-UUID Resolution

```
# All UUIDs resolve to the same institution record

# UUID v7 (record ID)
https://id.heritage.org/record/018e1234-5678-7abc-def0-123456789abc
→ Redirects to institutional page

# UUID v5 (persistent ID)
https://id.heritage.org/pid/550e8400-e29b-41d4-a716-446655440000
→ Redirects to institutional page

# UUID v8 (SHA-256 PID)
https://id.heritage.org/pid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
→ Redirects to institutional page

# Human-readable GHCID
https://id.heritage.org/ghcid/US-CA-SAN-A-IA
→ Redirects to institutional page
```

---

## When to Use Each UUID Type

### Decision Matrix

| Use Case | UUID v7 | UUID v5 | UUID v8 (SHA-256) |
|----------|---------|---------|-------------------|
| **Database primary key** | ✅ Best choice | ⚠️ Works but slower | ⚠️ Works but slower |
| **Time-ordered queries** | ✅ Native support | ❌ Arbitrary order | ❌ Arbitrary order |
| **Persistent identifier (PID)** | ❌ Not deterministic | ✅ Standard choice | ✅ SOTA choice |
| **Cross-system references** | ⚠️ Internal only | ✅ Yes | ✅ Yes |
| **Citations/Wikidata** | ❌ Not persistent | ✅ Yes | ✅ Yes (if accepted) |
| **Security compliance (SHA-256 required)** | ❌ No | ❌ Uses SHA-1 | ✅ Yes |
| **Europeana/DPLA integration** | ❌ No | ✅ Standard | ⚠️ Custom |

---

## Summary: Three-UUID Strategy

### 1. **UUID v7** → Internal Database Record ID

- ✅ Fast inserts (sequential)
- ✅ Time-ordered (sortable by creation time)
- ✅ Better B-tree performance
- ❌ NOT deterministic (timestamp plus random bits), so it cannot be regenerated and is unsuitable as a PID
- **Use for:** Database primary keys, internal references

### 2. **UUID v5** → Public Persistent Identifier (Standard)

- ✅ Deterministic (content-addressed)
- ✅ RFC 4122 compliant
- ✅ Interoperable (Europeana, DPLA, IIIF)
- ⚠️ SHA-1 based (weaker cryptographically)
- **Use for:** Public PIDs, cross-system references, citations

### 3. **UUID v8 (SHA-256)** → Future-Proof Persistent Identifier

- ✅ Deterministic (content-addressed)
- ✅ SHA-256 (SOTA cryptographic strength)
- ✅ Future-proof against SHA-1 deprecation
- ⚠️ Custom derivation (the v8 layout is standardized in RFC 9562, but its contents are application-defined, so other systems must reimplement the same scheme)
- **Use for:** Security-compliant PIDs, future-proofing

---

## Recommendation for GLAM Project

**Store all three UUIDs:**

```yaml
- record_id: 018e1234-5678-7abc-def0-123456789abc   # UUID v7 - DB PK
  pid: 550e8400-e29b-41d4-a716-446655440000         # UUID v5 - Public PID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  ghcid: US-CA-SAN-A-IA
```

**Benefits:**

- ✅ Fast database performance (UUID v7 PK)
- ✅ Standard interoperability (UUID v5 PID)
- ✅ Future-proof (UUID v8 SHA-256 PID)
- ✅ Human-readable (GHCID string)

**Trade-off:** Slightly more storage (48 bytes of UUIDs per record instead of 16), but worth it for flexibility.

---

**Version:** 2.0
**Date:** 2024-11-06
**Status:** Hybrid UUID Strategy
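
The schema and data model note that a record's creation time is embedded in its UUID v7. A minimal sketch of how that timestamp could be read back, using only the standard library, is given here. It assumes the RFC 9562 layout (the top 48 bits hold Unix time in milliseconds); `make_uuid7` and `uuid7_created_at` are hypothetical helper names, not part of this project or of any library.

```python
import os
import uuid
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Hypothetical helper: build a UUID v7 for a given millisecond timestamp."""
    random_bits = int.from_bytes(os.urandom(10), "big")       # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | random_bits  # timestamp in top 48 bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)               # set version nibble to 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)               # set variant bits to 10
    return uuid.UUID(int=value)


def uuid7_created_at(record_id: uuid.UUID) -> datetime:
    """Hypothetical helper: recover the embedded creation time from a UUID v7."""
    if record_id.version != 7:
        raise ValueError(f"expected a UUID v7, got version {record_id.version}")
    unix_ms = record_id.int >> 80  # top 48 bits are milliseconds since the Unix epoch
    return UNIX_EPOCH + timedelta(milliseconds=unix_ms)


record_id = make_uuid7(1_700_000_000_000)
print(record_id, uuid7_created_at(record_id).isoformat())
```

The recovered time has millisecond precision only, so a separate `created_at` column is still useful when the database clock and the ID generator can disagree.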