glam/docs/UUID_STRATEGY.md
2025-11-19 23:25:22 +01:00

UUID Strategy for Heritage Institutions

The Problem with UUID v7 for Persistent Identifiers

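UUID v7 combines a millisecond timestamp with random bits. That makes it a good database key, but it is not deterministic: generating one twice for the same institution yields two different values.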

Why Determinism Matters for PIDs

| Scenario | UUID v5 (Deterministic) | UUID v7 (Time-based) |
|---|---|---|
| Regenerate from GHCID | Always same UUID | Different UUID each time |
| Independent systems agree | Same UUID generated | Different UUIDs |
| Lost database recovery | Rebuild from GHCID strings | Must restore from backup |
| Content-addressed | Hash of content | Based on timestamp |

Example:

import uuid
from uuid_utils import uuid7  # third-party uuid-utils package

# Placeholder namespace for illustration; the project must fix one value and never change it
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "id.heritage.org")

# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
uuid1 = uuid.uuid5(NAMESPACE, ghcid)
uuid2 = uuid.uuid5(NAMESPACE, ghcid)
assert uuid1 == uuid2  # ✅ SAME!

# UUID v7 - Time-based (timestamp-addressed)
uuid3 = uuid7()
uuid4 = uuid7()
assert uuid3 != uuid4  # ❌ DIFFERENT!
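
The "lost database recovery" and "independent systems agree" rows can be made concrete with a short sketch. It assumes only the GHCID strings survive (for example in an export file) and recomputes the PIDs from them. `NAMESPACE` is a placeholder value, `rebuild_pids` is a hypothetical helper, and the second GHCID is a made-up example.

```python
import uuid

# Assumption: placeholder namespace; a real deployment must pin one namespace forever.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "id.heritage.org")


def rebuild_pids(ghcids):
    """Recompute the UUID v5 PID for every surviving GHCID string."""
    return {ghcid: uuid.uuid5(NAMESPACE, ghcid) for ghcid in ghcids}


# Two independent "systems" that share only the namespace and the GHCID strings
surviving = ["US-CA-SAN-A-IA", "NL-NH-AMS-M-RM"]  # second entry is illustrative
system_a = rebuild_pids(surviving)
system_b = rebuild_pids(reversed(surviving))

print(system_a == system_b)  # True
```

Nothing comparable exists for UUID v7: once the stored value is lost, it cannot be recomputed.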

Hybrid Strategy: Best of Both Worlds

Use both UUID types for different purposes:

1. UUID v5 (SHA-1) or UUID v8 (SHA-256) → Persistent Identifier

  • Purpose: Long-term, deterministic, content-addressed PID
  • Use for: Cross-system references, citations, Wikidata, IIIF
  • Benefit: Can always regenerate from GHCID string

2. UUID v7 → Database Record ID

  • Purpose: Time-ordered, sortable, high-performance database key
  • Use for: Internal database primary keys, indexing, queries
  • Benefit: Faster inserts, better B-tree performance

Example Data Model

# Heritage institution record
- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB primary key
  pid: 550e8400-e29b-41d4-a716-446655440000        # UUID v5 - Persistent ID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d # UUID v8 - SOTA PID
  name: Internet Archive
  ghcid: US-CA-SAN-A-IA
  created_at: 2024-11-06T12:34:56Z  # Creation time; a UUID v7 also carries it in its first 48 bits
  identifiers:
    - identifier_scheme: RECORD_ID_V7
      identifier_value: 018e1234-5678-7abc-def0-123456789abc
    - identifier_scheme: PID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: PID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
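
Because a UUID v7 starts with a 48-bit Unix timestamp in milliseconds, the creation time can be read back from `record_id` alone. The sketch below assumes the standard v7 layout; `uuid7_timestamp` is a hypothetical helper name. The sample UUIDs in this document are placeholders, so the decoded instant need not match the `created_at` value shown above.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid7_timestamp(value: uuid.UUID) -> datetime:
    """Decode the 48-bit Unix-millisecond timestamp at the front of a UUID v7."""
    millis = value.int >> 80  # keep the top 48 of the 128 bits
    return EPOCH + timedelta(milliseconds=millis)


record_id = uuid.UUID("018e1234-5678-7abc-def0-123456789abc")
created = uuid7_timestamp(record_id)
print(created.isoformat())
```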

Database Schema

CREATE TABLE heritage_institutions (
    -- UUID v7 - Primary key (sortable, fast)
    -- uuidv7() is built in from PostgreSQL 18; on older versions, generate the value in the application
    record_id UUID PRIMARY KEY DEFAULT uuidv7(),
    
    -- UUID v5 - Persistent identifier (deterministic); UNIQUE already creates an index
    pid_uuid_v5 UUID NOT NULL UNIQUE,
    
    -- UUID v8 - SHA-256 PID (SOTA cryptographic strength)
    pid_uuid_sha256 UUID NOT NULL UNIQUE,
    
    -- Human-readable GHCID
    ghcid VARCHAR(100) NOT NULL,
    
    -- Institution data
    name TEXT NOT NULL,
    institution_type VARCHAR(50),
    
    -- Timestamps (creation time is also recoverable from the UUID v7 record_id)
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for GHCID lookups (PostgreSQL does not accept inline INDEX clauses)
CREATE INDEX idx_ghcid ON heritage_institutions (ghcid);

Performance Comparison

UUID v7 vs UUID v5 for Database Primary Keys

From the article's insights:

| Metric | UUID v4/v5 (Random) | UUID v7 (Time-ordered) |
|---|---|---|
| Insert Performance | ⚠️ Slow (random B-tree splits) | Fast (sequential inserts) |
| Index Size | ⚠️ Larger (fragmented) | Smaller (compact) |
| Range Queries | Inefficient | Efficient (time-based) |
| Sortability | Random order | Time-ordered |
| Determinism | Yes (v5 only) | No |
| Content-addressed | Yes (v5 only) | No |

Conclusion:

  • Use UUID v7 as database PK for performance
  • Use UUID v5/v8 as PID for persistence and interoperability
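
The insert-performance claim can be illustrated without a database. In the sketch below a sorted Python list stands in for a B-tree index, and `make_uuid7` and `appended_fraction` are hypothetical helpers written for this example. It assumes one key per millisecond; real v7 generators may need a counter to stay monotonic within the same millisecond.

```python
import bisect
import os
import uuid


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Build a UUID v7: 48-bit ms timestamp, version 7, RFC variant, random rest."""
    raw = bytearray(unix_ms.to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70  # version 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC variant
    return uuid.UUID(bytes=bytes(raw))


def appended_fraction(keys):
    """Fraction of keys that land at the end of a sorted index when inserted in order."""
    index, appended = [], 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        appended += pos == len(index)  # True counts as 1
        index.insert(pos, key)
    return appended / len(index)


start_ms = 1_700_000_000_000  # arbitrary starting instant
v7_keys = [make_uuid7(start_ms + i) for i in range(500)]  # one key per millisecond
v5_keys = [uuid.uuid5(uuid.NAMESPACE_DNS, f"ghcid-{i}") for i in range(500)]

print(f"v7 appended at end: {appended_fraction(v7_keys):.0%}")
print(f"v5 appended at end: {appended_fraction(v5_keys):.0%}")
```

Every v7 key is appended at the end of the index, while almost every v5 key lands somewhere in the middle. Those middle inserts are what cause page splits and fragmentation in a real B-tree.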

Implementation: Dual UUID Generation

import uuid
import hashlib
from datetime import datetime, timezone
from uuid_utils import uuid7  # Python uuid-utils library

# Placeholder namespace for illustration only: the project must pick one
# namespace UUID and never change it, or every v5 PID changes with it.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "id.heritage.org")

class GHCIDComponents:
    def __init__(self, country_code, region_code, city_locode, 
                 institution_type, abbreviation):
        # Upper-case every part so the same institution always hashes identically
        self.country_code = country_code.upper()
        self.region_code = region_code.upper()
        self.city_locode = city_locode.upper()
        self.institution_type = institution_type.upper()
        self.abbreviation = abbreviation.upper()
    
    def to_string(self) -> str:
        """Human-readable GHCID string."""
        return f"{self.country_code}-{self.region_code}-{self.city_locode}-{self.institution_type}-{self.abbreviation}"
    
    # === PERSISTENT IDENTIFIERS (deterministic) ===
    
    def to_uuid_v5(self) -> uuid.UUID:
        """UUID v5 - Persistent ID (SHA-1 based, RFC 4122)."""
        ghcid_str = self.to_string()
        return uuid.uuid5(GHCID_NAMESPACE, ghcid_str)
    
    def to_uuid_sha256(self) -> uuid.UUID:
        """UUID v8 - Persistent ID (SHA-256 based, SOTA)."""
        ghcid_str = self.to_string()
        hash_bytes = hashlib.sha256(ghcid_str.encode('utf-8')).digest()
        uuid_bytes = bytearray(hash_bytes[:16])  # keep the first 128 bits of the digest
        uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80  # Version 8
        uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80  # Variant RFC 4122
        return uuid.UUID(bytes=bytes(uuid_bytes))
    
    # === DATABASE RECORD ID (time-based) ===
    
    @staticmethod
    def generate_record_id() -> uuid.UUID:
        """UUID v7 - Database primary key (time-ordered, high performance)."""
        # uuid7() comes from uuid-utils; convert to a standard-library UUID
        return uuid.UUID(str(uuid7()))
    
    # === EXAMPLE USAGE ===
    
    def create_database_record(self):
        """Create a complete database record with all UUID types."""
        return {
            'record_id': self.generate_record_id(),      # UUID v7 - DB PK
            'pid_uuid_v5': self.to_uuid_v5(),            # UUID v5 - Persistent ID
            'pid_uuid_sha256': self.to_uuid_sha256(),    # UUID v8 - SOTA PID
            'ghcid': self.to_string(),                   # Human-readable
            'created_at': datetime.now(timezone.utc),    # timezone-aware; utcnow() is deprecated
        }

# Example
components = GHCIDComponents("US", "CA", "SAN", "A", "IA")

record = components.create_database_record()
print(f"Record ID (v7):       {record['record_id']}")         # 018e1234-5678-7abc-def0-123456789abc
print(f"PID v5:               {record['pid_uuid_v5']}")       # 550e8400-e29b-41d4-a716-446655440000
print(f"PID SHA-256:          {record['pid_uuid_sha256']}")   # a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
print(f"GHCID:                {record['ghcid']}")             # US-CA-SAN-A-IA

# Verify determinism
assert components.to_uuid_v5() == components.to_uuid_v5()  # ✅ Same every time
assert components.to_uuid_sha256() == components.to_uuid_sha256()  # ✅ Same every time
assert components.generate_record_id() != components.generate_record_id()  # ✅ Different (time-based)
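
Determinism only holds if every system feeds exactly the same string into the hash. The sketch below mirrors the SHA-256 derivation for a plain string and shows how a stray space or lower-case input produces a different PID unless it is normalized first. `ghcid_to_uuid_v8` and `normalize_ghcid` are hypothetical helpers; upper-casing follows the class above, while trimming whitespace is an extra assumption.

```python
import hashlib
import uuid


def ghcid_to_uuid_v8(ghcid: str) -> uuid.UUID:
    """SHA-256 based UUID v8 for a GHCID string (first 128 bits of the digest)."""
    digest = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    digest[6] = (digest[6] & 0x0F) | 0x80  # version 8
    digest[8] = (digest[8] & 0x3F) | 0x80  # RFC variant
    return uuid.UUID(bytes=bytes(digest))


def normalize_ghcid(raw: str) -> str:
    """Hypothetical normalizer: trim whitespace and upper-case the GHCID."""
    return raw.strip().upper()


canonical = ghcid_to_uuid_v8("US-CA-SAN-A-IA")
sloppy = ghcid_to_uuid_v8(" us-ca-san-a-ia ")
cleaned = ghcid_to_uuid_v8(normalize_ghcid(" us-ca-san-a-ia "))

print(canonical.version, canonical.variant == uuid.RFC_4122)  # 8 True
print(sloppy == canonical)   # False
print(cleaned == canonical)  # True
```

Whatever normalization rule is chosen becomes part of the identifier scheme, so it has to be documented and frozen together with the namespace.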

Resolution Service Strategy

Multi-UUID Resolution

# All UUIDs resolve to the same institution record

# UUID v7 (record ID)
https://id.heritage.org/record/018e1234-5678-7abc-def0-123456789abc
→ Redirects to institutional page

# UUID v5 (persistent ID)
https://id.heritage.org/pid/550e8400-e29b-41d4-a716-446655440000
→ Redirects to institutional page

# UUID v8 (SHA-256 PID)
https://id.heritage.org/pid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
→ Redirects to institutional page

# Human-readable GHCID
https://id.heritage.org/ghcid/US-CA-SAN-A-IA
→ Redirects to institutional page
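
A minimal sketch of the lookup behind these URLs, assuming the four path prefixes shown above. `RECORD`, `ROUTES`, and `resolve` are hypothetical names, the record reuses this document's placeholder values, and the linear scan stands in for the indexed database lookup a real service would use.

```python
import uuid

# Hypothetical in-memory record built from the placeholder values in this document
RECORD = {
    "record_id": uuid.UUID("018e1234-5678-7abc-def0-123456789abc"),
    "pid_uuid_v5": uuid.UUID("550e8400-e29b-41d4-a716-446655440000"),
    "pid_uuid_sha256": uuid.UUID("a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"),
    "ghcid": "US-CA-SAN-A-IA",
    "name": "Internet Archive",
}

# URL path prefix -> record field to match on
ROUTES = {
    "record": "record_id",
    "pid": "pid_uuid_v5",
    "pid-sha256": "pid_uuid_sha256",
    "ghcid": "ghcid",
}


def resolve(path, records):
    """Return the record addressed by a path such as '/pid/<uuid>', or None."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ROUTES:
        return None
    scheme, value = parts
    field = ROUTES[scheme]
    try:
        key = value.upper() if field == "ghcid" else uuid.UUID(value)
    except ValueError:
        return None  # malformed UUID string
    return next((r for r in records if r[field] == key), None)


paths = [
    "/record/018e1234-5678-7abc-def0-123456789abc",
    "/pid/550e8400-e29b-41d4-a716-446655440000",
    "/pid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
    "/ghcid/US-CA-SAN-A-IA",
]
print(all(resolve(p, [RECORD]) is RECORD for p in paths))  # True
```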

When to Use Each UUID Type

Decision Matrix

| Use Case | UUID v7 | UUID v5 | UUID v8 (SHA-256) |
|---|---|---|---|
| Database primary key | Best choice | ⚠️ Works but slower | ⚠️ Works but slower |
| Time-ordered queries | Native support | Random order | Random order |
| Persistent identifier (PID) | Not deterministic | Standard choice | SOTA choice |
| Cross-system references | ⚠️ Internal only | Yes | Yes |
| Citations/Wikidata | Not persistent | Yes | Yes (if accepted) |
| Security compliance (SHA-256 required) | No | Uses SHA-1 | Yes |
| Europeana/DPLA integration | No | Standard | ⚠️ Custom |
Summary: Three-UUID Strategy

1. UUID v7 → Internal Database Record ID

  • Fast inserts (sequential)
  • Time-ordered (sortable by creation time)
  • Better B-tree performance
  • NOT reproducible (time-based with a random component, so it cannot be regenerated from the GHCID)
  • Use for: Database primary keys, internal references

2. UUID v5 → Public Persistent Identifier (Standard)

  • Deterministic (content-addressed)
  • RFC 4122 compliant (v5 is carried over unchanged in RFC 9562, which supersedes it)
  • Interoperable (Europeana, DPLA, IIIF)
  • ⚠️ SHA-1 based (weaker cryptographically)
  • Use for: Public PIDs, cross-system references, citations

3. UUID v8 (SHA-256) → Future-Proof Persistent Identifier

  • Deterministic (content-addressed)
  • SHA-256 (SOTA cryptographic strength)
  • Future-proof against SHA-1 deprecation
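  • ⚠️ Custom derivation (RFC 9562 defines the v8 layout but leaves the payload to the implementer, so other systems must know the SHA-256 scheme to regenerate it)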
  • Use for: Security-compliant PIDs, future-proofing

Recommendation for GLAM Project

Store all three UUIDs:

- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB PK
  pid: 550e8400-e29b-41d4-a716-446655440000        # UUID v5 - Public PID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d # UUID v8 - SOTA PID
  ghcid: US-CA-SAN-A-IA

Benefits:

  • Fast database performance (UUID v7 PK)
  • Standard interoperability (UUID v5 PID)
  • Future-proof (UUID v8 SHA-256 PID)
  • Human-readable (GHCID string)

Trade-off: The three UUIDs take 48 bytes per record instead of 16, plus two extra unique indexes. That is a modest cost for the flexibility.
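
To put the 48-versus-16-byte figure in perspective, the arithmetic can be sketched as follows. The dataset size is an assumption chosen for illustration, `uuid_storage_bytes` is a hypothetical helper, and index and row overhead are ignored.

```python
UUID_BYTES = 16  # a UUID is 128 bits


def uuid_storage_bytes(records: int, uuids_per_record: int) -> int:
    """Raw column storage for the UUIDs alone (no indexes, no row overhead)."""
    return records * uuids_per_record * UUID_BYTES


records = 1_000_000  # assumed dataset size, for illustration only
single = uuid_storage_bytes(records, 1)
triple = uuid_storage_bytes(records, 3)

print(f"extra: {(triple - single) / 1e6:.0f} MB")  # extra: 32 MB
```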


Version: 2.0
Date: 2024-11-06
Status: Hybrid UUID Strategy