# UUID Strategy for Heritage Institutions

## The Problem with UUID v7 for Persistent Identifiers

**UUID v7 is time-based and random** - perfect for databases, but **not deterministic**.

### Why Determinism Matters for PIDs
| Scenario | UUID v5 (Deterministic) | UUID v7 (Time-based) |
|----------|-------------------------|----------------------|
| **Regenerate from GHCID** | ✅ Always same UUID | ❌ Different UUID each time |
| **Independent systems agree** | ✅ Same UUID generated (given a shared namespace) | ❌ Different UUIDs |
| **Lost database recovery** | ✅ Rebuild from GHCID strings | ❌ Must restore from backup |
| **Content-addressed** | ✅ Hash of content | ❌ Based on timestamp |

**Example:**
```python
import uuid
from uuid_utils import uuid7  # third-party uuid-utils package

# Placeholder namespace for illustration; substitute the project's fixed namespace UUID
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")

# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
uuid1 = uuid.uuid5(NAMESPACE, ghcid)
uuid2 = uuid.uuid5(NAMESPACE, ghcid)
uuid1 == uuid2  # ✅ SAME! (same namespace and name, same UUID)

# UUID v7 - Time-based (timestamp-addressed)
uuid1 = uuid7()
uuid2 = uuid7()
uuid1 == uuid2  # ❌ DIFFERENT! (new timestamp and random bits on every call)
```
---

## ✅ Hybrid Strategy: Best of Both Worlds

Use **both UUID types** for different purposes:
### 1. **UUID v5 (SHA-1)** or **UUID v8 (SHA-256)** → Persistent Identifier
- **Purpose:** Long-term, deterministic, content-addressed PID
- **Use for:** Cross-system references, citations, Wikidata, IIIF
- **Benefit:** Can always regenerate from GHCID string

### 2. **UUID v7** → Database Record ID
- **Purpose:** Time-ordered, sortable, high-performance database key
- **Use for:** Internal database primary keys, indexing, queries
- **Benefit:** Faster inserts, better B-tree performance
### Example Data Model

```yaml
# Heritage institution record
- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB primary key
  pid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 - Persistent ID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  name: Internet Archive
  ghcid: US-CA-SAN-A-IA
  created_at: 2024-11-06T12:34:56Z  # Also encoded in the UUID v7 timestamp bits
  identifiers:
    - identifier_scheme: RECORD_ID_V7
      identifier_value: 018e1234-5678-7abc-def0-123456789abc
    - identifier_scheme: PID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: PID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
```
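The `created_at` comment in the data model relies on UUID v7 carrying its creation time in its first 48 bits (Unix time in milliseconds). A minimal sketch of reading that timestamp back with only the standard library might look like the following. `make_uuid7` and `uuid7_created_at` are hypothetical helper names, and the hand-assembled UUID is a stand-in for a library-generated one such as `uuid_utils.uuid7()`.

```python
import secrets
import uuid
from datetime import datetime, timedelta, timezone


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Hand-assemble a UUID v7 for illustration (hypothetical helper)."""
    value = (unix_ms & 0xFFFF_FFFF_FFFF) << 80   # bits 127-80: Unix time in ms
    value |= 0x7 << 76                           # bits 79-76: version 7
    value |= secrets.randbits(12) << 64          # bits 75-64: random
    value |= 0b10 << 62                          # bits 63-62: RFC variant
    value |= secrets.randbits(62)                # bits 61-0: random
    return uuid.UUID(int=value)


def uuid7_created_at(record_id: uuid.UUID) -> datetime:
    """Recover the creation time stored in a UUID v7's first 48 bits."""
    if record_id.version != 7:
        raise ValueError(f"not a UUID v7: {record_id}")
    unix_ms = record_id.int >> 80
    seconds, millis = divmod(unix_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


created = datetime(2024, 11, 6, 12, 34, 56, tzinfo=timezone.utc)
record_id = make_uuid7(int(created.timestamp() * 1000))
print(record_id, "->", uuid7_created_at(record_id).isoformat())
```

Since the timestamp is recoverable, `created_at` could be cross-checked against `record_id`, although keeping the explicit column is likely simpler for queries.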
### Database Schema

```sql
-- PostgreSQL dialect assumed
CREATE TABLE heritage_institutions (
    -- UUID v7 - Primary key (sortable, fast)
    -- uuidv7() is built in from PostgreSQL 18; on older versions, generate
    -- the value in the application instead
    record_id UUID PRIMARY KEY DEFAULT uuidv7(),

    -- UUID v5 - Persistent identifier (deterministic)
    pid_uuid_v5 UUID NOT NULL UNIQUE,

    -- UUID v8 - SHA-256 PID (SOTA cryptographic strength)
    pid_uuid_sha256 UUID NOT NULL UNIQUE,

    -- Human-readable GHCID
    ghcid VARCHAR(100) NOT NULL,

    -- Institution data
    name TEXT NOT NULL,
    institution_type VARCHAR(50),

    -- Timestamps (creation time is also encoded in the UUID v7 record_id)
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Indexes (the UNIQUE constraints above already index both PID columns)
CREATE INDEX idx_ghcid ON heritage_institutions (ghcid);
```
---

## Performance Comparison

### UUID v7 vs UUID v5 for Database Primary Keys

**From the article's insights:**

| Metric | UUID v4/v5 (Randomly distributed) | UUID v7 (Time-ordered) |
|--------|-----------------------------------|------------------------|
| **Insert Performance** | ⚠️ Slow (random B-tree splits) | ✅ Fast (sequential inserts) |
| **Index Size** | ⚠️ Larger (fragmented) | ✅ Smaller (compact) |
| **Range Queries** | ❌ Inefficient | ✅ Efficient (time-based) |
| **Sortability** | ❌ Random order | ✅ Time-ordered |
| **Determinism** | ✅ Yes (v5 only) | ❌ No |
| **Content-addressed** | ✅ Yes (v5 only) | ❌ No |

**Conclusion:**
- Use **UUID v7 as database PK** for performance
- Use **UUID v5/v8 as PID** for persistence and interoperability
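
The ordering difference behind the table can be illustrated with a small sketch. It assumes a hypothetical namespace, made-up GHCID strings, and a hand-assembled UUID v7 helper (`make_uuid7`) standing in for a real generator; records are spaced one millisecond apart because ordering within a single millisecond depends on the generator. Keys that are already sorted in creation order are appended at the right edge of a B-tree index, whereas hash-derived keys land at arbitrary positions.

```python
import secrets
import uuid

# Hypothetical namespace, for illustration only.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Hand-assemble a UUID v7: 48-bit ms timestamp, version, variant, random bits."""
    value = (unix_ms & 0xFFFF_FFFF_FFFF) << 80
    value |= (0x7 << 76) | (secrets.randbits(12) << 64)
    value |= (0b10 << 62) | secrets.randbits(62)
    return uuid.UUID(int=value)


# Ten records created one millisecond apart, listed in creation order.
# The GHCID strings are made-up placeholders.
ghcids = [f"XX-YY-ZZZ-M-INST{i}" for i in range(10)]
start_ms = 1_730_896_496_000
record_ids = [make_uuid7(start_ms + i) for i in range(len(ghcids))]
pids = [uuid.uuid5(NAMESPACE, ghcid) for ghcid in ghcids]

# v7 keys compare by timestamp first, so creation order is index order.
print("v7 keys already sorted:", record_ids == sorted(record_ids))
# v5 keys follow the SHA-1 hash, so their order is unrelated to creation order.
print("v5 keys already sorted:", pids == sorted(pids))
```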
---

## Implementation: Dual UUID Generation

```python
import hashlib
import uuid
from datetime import datetime, timezone

from uuid_utils import uuid7  # Python uuid-utils library (third-party)

# Placeholder namespace for illustration only; substitute the project's fixed
# namespace UUID. Changing it changes every UUID v5 PID.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


class GHCIDComponents:
    def __init__(self, country_code, region_code, city_locode,
                 institution_type, abbreviation):
        # Normalize case so equivalent inputs always hash to the same PID
        self.country_code = country_code.upper()
        self.region_code = region_code.upper()
        self.city_locode = city_locode.upper()
        self.institution_type = institution_type.upper()
        self.abbreviation = abbreviation.upper()

    def to_string(self) -> str:
        """Human-readable GHCID string."""
        return f"{self.country_code}-{self.region_code}-{self.city_locode}-{self.institution_type}-{self.abbreviation}"

    # === PERSISTENT IDENTIFIERS (deterministic) ===

    def to_uuid_v5(self) -> uuid.UUID:
        """UUID v5 - Persistent ID (SHA-1 based, RFC 4122)."""
        ghcid_str = self.to_string()
        return uuid.uuid5(GHCID_NAMESPACE, ghcid_str)

    def to_uuid_sha256(self) -> uuid.UUID:
        """UUID v8 - Persistent ID (SHA-256 based, SOTA)."""
        ghcid_str = self.to_string()
        hash_bytes = hashlib.sha256(ghcid_str.encode('utf-8')).digest()
        uuid_bytes = bytearray(hash_bytes[:16])  # Keep the first 128 bits
        uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80  # Version 8
        uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80  # Variant RFC 4122
        return uuid.UUID(bytes=bytes(uuid_bytes))

    # === DATABASE RECORD ID (time-based) ===

    @staticmethod
    def generate_record_id() -> uuid.UUID:
        """UUID v7 - Database primary key (time-ordered, high performance)."""
        # uuid7() returns a uuid_utils.UUID; convert it to the stdlib type
        return uuid.UUID(str(uuid7()))

    # === EXAMPLE USAGE ===

    def create_database_record(self):
        """Create a complete database record with all UUID types."""
        return {
            'record_id': self.generate_record_id(),    # UUID v7 - DB PK
            'pid_uuid_v5': self.to_uuid_v5(),          # UUID v5 - Persistent ID
            'pid_uuid_sha256': self.to_uuid_sha256(),  # UUID v8 - SOTA PID
            'ghcid': self.to_string(),                 # Human-readable
            'created_at': datetime.now(timezone.utc),  # utcnow() is deprecated
        }


# Example
components = GHCIDComponents("US", "CA", "SAN", "A", "IA")

record = components.create_database_record()
print(f"Record ID (v7): {record['record_id']}")     # New value on every call
print(f"PID v5: {record['pid_uuid_v5']}")           # Stable for this GHCID and namespace
print(f"PID SHA-256: {record['pid_uuid_sha256']}")  # Stable for this GHCID
print(f"GHCID: {record['ghcid']}")                  # US-CA-SAN-A-IA

# Verify determinism
assert components.to_uuid_v5() == components.to_uuid_v5()  # ✅ Same every time
assert components.to_uuid_sha256() == components.to_uuid_sha256()  # ✅ Same every time
assert components.generate_record_id() != components.generate_record_id()  # ✅ Different (time-based)
```
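The "lost database recovery" scenario from the first table can be sketched with the same two derivations as `to_uuid_v5` and `to_uuid_sha256`. This is a minimal illustration under stated assumptions: the namespace is a placeholder, the second GHCID string is made up, and `pid_v5`, `pid_sha256`, and `rebuild_pid_index` are hypothetical names. Given only the surviving GHCID strings, any system can rebuild the PID-to-GHCID mapping and arrive at the same result.

```python
import hashlib
import uuid

# Placeholder namespace; a real deployment would use one fixed, published UUID.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def pid_v5(ghcid: str) -> uuid.UUID:
    """UUID v5 PID: SHA-1 over namespace + GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


def pid_sha256(ghcid: str) -> uuid.UUID:
    """UUID v8 PID: first 128 bits of SHA-256, with version and variant bits set."""
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC variant
    return uuid.UUID(bytes=bytes(raw))


def rebuild_pid_index(ghcids):
    """Map every derivable PID back to its GHCID string."""
    index = {}
    for ghcid in ghcids:
        index[pid_v5(ghcid)] = ghcid
        index[pid_sha256(ghcid)] = ghcid
    return index


# Two "independent systems" rebuilding from the same strings, in different order.
# "NL-NH-AMS-M-RM" is a made-up GHCID used only for this example.
system_a = rebuild_pid_index(["US-CA-SAN-A-IA", "NL-NH-AMS-M-RM"])
system_b = rebuild_pid_index(["NL-NH-AMS-M-RM", "US-CA-SAN-A-IA"])
print("systems agree:", system_a == system_b)
```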
---

## Resolution Service Strategy

### Multi-UUID Resolution

```
# All UUIDs resolve to the same institution record

# UUID v7 (record ID)
https://id.heritage.org/record/018e1234-5678-7abc-def0-123456789abc
→ Redirects to institutional page

# UUID v5 (persistent ID)
https://id.heritage.org/pid/550e8400-e29b-41d4-a716-446655440000
→ Redirects to institutional page

# UUID v8 (SHA-256 PID)
https://id.heritage.org/pid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
→ Redirects to institutional page

# Human-readable GHCID
https://id.heritage.org/ghcid/US-CA-SAN-A-IA
→ Redirects to institutional page
```
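One way the routing above might be implemented is to map each path prefix to the database column it should be looked up in. The sketch below is an assumption about how such a resolver could work, not a description of an existing service: `ROUTES` and `resolve` are hypothetical names, the column names are taken from the schema, and the actual database lookup and redirect are left out.

```python
import uuid
from urllib.parse import urlparse

# Path prefix -> (database column, whether the value must parse as a UUID)
ROUTES = {
    "record": ("record_id", True),
    "pid": ("pid_uuid_v5", True),
    "pid-sha256": ("pid_uuid_sha256", True),
    "ghcid": ("ghcid", False),
}


def resolve(url: str) -> tuple[str, str]:
    """Return the (column, value) pair to look up for a resolver URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in ROUTES:
        raise ValueError(f"unsupported resolver path: {url}")
    column, is_uuid = ROUTES[parts[0]]
    value = parts[1]
    if is_uuid:
        # uuid.UUID raises ValueError on malformed input; str() normalizes case
        value = str(uuid.UUID(value))
    return column, value


print(resolve("https://id.heritage.org/pid/550e8400-e29b-41d4-a716-446655440000"))
print(resolve("https://id.heritage.org/ghcid/US-CA-SAN-A-IA"))
```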
---

## When to Use Each UUID Type

### Decision Matrix

| Use Case | UUID v7 | UUID v5 | UUID v8 (SHA-256) |
|----------|---------|---------|-------------------|
| **Database primary key** | ✅ Best choice | ⚠️ Works but slower | ⚠️ Works but slower |
| **Time-ordered queries** | ✅ Native support | ❌ Random order | ❌ Random order |
| **Persistent identifier (PID)** | ❌ Not deterministic | ✅ Standard choice | ✅ SOTA choice |
| **Cross-system references** | ⚠️ Internal only | ✅ Yes | ✅ Yes |
| **Citations/Wikidata** | ❌ Not persistent | ✅ Yes | ✅ Yes (if accepted) |
| **Security compliance (SHA-256 required)** | ❌ No | ❌ Uses SHA-1 | ✅ Yes |
| **Europeana/DPLA integration** | ❌ No | ✅ Standard | ⚠️ Custom |

---

## Summary: Three-UUID Strategy
### 1. **UUID v7** → Internal Database Record ID
- ✅ Fast inserts (sequential)
- ✅ Time-ordered (sortable by creation time)
- ✅ Better B-tree performance
- ❌ NOT reproducible (timestamp plus random bits, so it cannot be regenerated from the GHCID)
- **Use for:** Database primary keys, internal references

### 2. **UUID v5** → Public Persistent Identifier (Standard)
- ✅ Deterministic (content-addressed)
- ✅ RFC 4122 compliant
- ✅ Interoperable (Europeana, DPLA, IIIF)
- ⚠️ SHA-1 based (weaker cryptographically)
- **Use for:** Public PIDs, cross-system references, citations

### 3. **UUID v8 (SHA-256)** → Future-Proof Persistent Identifier
- ✅ Deterministic (content-addressed)
- ✅ SHA-256 (SOTA cryptographic strength)
- ✅ Future-proof against SHA-1 deprecation
- ⚠️ Custom scheme (RFC 9562 defines the v8 layout but not its contents, so other systems need this exact recipe)
- **Use for:** Security-compliant PIDs, future-proofing
---

## Recommendation for GLAM Project

**Store all three UUIDs:**

```yaml
- record_id: 018e1234-5678-7abc-def0-123456789abc  # UUID v7 - DB PK
  pid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 - Public PID
  pid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v8 - SOTA PID
  ghcid: US-CA-SAN-A-IA
```

**Benefits:**
- ✅ Fast database performance (UUID v7 PK)
- ✅ Standard interoperability (UUID v5 PID)
- ✅ Future-proof (UUID v8 SHA-256 PID)
- ✅ Human-readable (GHCID string)

**Trade-off:** Slightly more storage (48 bytes of UUIDs per record instead of 16, plus the two extra unique indexes), but worth it for flexibility.

---

**Version:** 2.0
**Date:** 2024-11-06
**Status:** Hybrid UUID Strategy