# Why GHCID Uses UUID v5 and SHA-1

**Date:** 2024-11-06
**Decision:** Use UUID v5 (SHA-1) as the primary persistent identifier format for GHCID
**Status:** Adopted and implemented

---

## Executive Summary

**GHCID uses UUID v5 (RFC 4122) with SHA-1 hashing** to generate persistent identifiers from GHCID strings. SHA-1 is deprecated for **cryptographic security applications** (digital signatures, TLS certificates), but it remains **appropriate and safe for deterministic identifier generation** in non-adversarial contexts.

**Key Principle:** For heritage institution identifiers, transparency and reproducibility matter more than cutting-edge cryptographic strength.

---

## The Decision

### What We Chose

```yaml
Primary Identifier: UUID v5 (SHA-1)
Secondary Identifier: UUID v8 (SHA-256)
Database Record ID: UUID v7 (time-ordered)
```

### Why UUID v5 is Primary

1. **RFC 4122 Standardized** - Universal recognition and support; RFC 9562 (2024) carries UUID v5 forward unchanged
2. **Deterministic** - Same input always produces same output
3. **Transparent** - Algorithm is publicly documented and verifiable
4. **Widely Supported** - Available in the standard library or a well-established package for most major programming languages
5. **Sufficient Collision Resistance** - A 128-bit identifier with 122 hash-derived bits makes accidental collisions vanishingly unlikely

---

## Understanding SHA-1 in Context

### SHA-1 Cryptographic Weaknesses (Real)

**Vulnerability:** Collision attacks (SHAttered attack, 2017)
- Attackers can find two different inputs that produce the same SHA-1 hash
- **Critical for:** Digital signatures and TLS certificates (SHA-1 is also unsuitable for password hashing, but because it is fast to brute-force rather than because of collisions)
- **Example threat:** Forge a malicious PDF with the same signature as a legitimate document

### SHA-1 for Identifiers (Safe)

**Reality:** Collision attacks do NOT apply to GHCID identifier generation

**Why it's safe for UUIDs:**

| Cryptographic Use (Vulnerable) | Identifier Use (Safe) |
|--------------------------------|----------------------|
| **Adversarial context** - Attacker actively tries to forge signatures | **Non-adversarial context** - No one is trying to forge institution IDs |
| **Two-message attack** - Attacker controls BOTH inputs | **Single-source generation** - We control the input (GHCID strings) |
| **Security requirement** - Must resist preimage and collision attacks | **Uniqueness requirement** - Only need collision resistance in birthday paradox sense |
| **High stakes** - Financial fraud, impersonation, data tampering | **Low stakes** - Identifier collision would be inconvenient, not catastrophic |

---

## Mathematical Collision Resistance

### Birthday Paradox Analysis

**Question:** How many institutions before we expect a UUID collision?

**For UUID v5 (128 bits, 122 of them hash-derived):**

```
P(collision) ≈ n² / (2 × 2^122)

Where n = number of institutions
(6 of the 128 bits are fixed version and variant bits, leaving 122 bits from the SHA-1 hash)

n = 1,000,000 institutions:
P = (10^6)² / (2 × 2^122) ≈ 9.4 × 10^-26

In plain English:
- About a 9.4 × 10^-24 percent chance
- Roughly one chance in 10^25
```

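The estimate above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual birthday-bound approximation and 122 hash-derived bits; the function name `collision_probability` is illustrative and not part of the GHCID codebase.

```python
from fractions import Fraction

HASH_BITS = 122  # 128-bit UUID minus 4 version bits and 2 variant bits


def collision_probability(n: int, bits: int = HASH_BITS) -> float:
    """Birthday-bound approximation: n^2 / (2 * 2^bits)."""
    # Fraction keeps the huge integers exact until the final conversion.
    return float(Fraction(n * n, 2 * 2**bits))


# Dataset sizes mentioned in this document: today, and the expected growth range.
for n in (10_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} institutions: P ≈ {collision_probability(n):.2e}")
```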
**Real-world scale:**
- **Current:** ~10,000 heritage institutions in dataset
- **Expected growth:** 1-10 million worldwide
- **Collision probability:** Effectively zero

**Even if SHA-1 collision resistance is weakened:**
- Deliberate collisions require massive computational effort (the 2017 SHAttered attack needed on the order of 2^63 SHA-1 computations)
- Not economically feasible for heritage identifiers
- Attack has no benefit (no financial gain from forged museum IDs)

---

## Why Transparency Matters More Than Cryptographic Strength

### Core Principle: Verifiable Identifier Generation

**GHCID's mission is to create a transparent, persistent identifier scheme.** This requires:

1. **Anyone can verify** our UUIDs are correctly generated
2. **Anyone can regenerate** UUIDs from GHCID strings
3. **No secret algorithms** or proprietary implementations
4. **No trust required** - verify, don't trust

### UUID v5 Achieves This

```python
# ANYONE can verify this algorithm
import uuid

# Public, standardized namespace: the DNS namespace UUID defined in RFC 4122
# (the same value as Python's uuid.NAMESPACE_DNS)
GHCID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Standard UUID v5 generation (built into Python)
ghcid_string = "US-CA-SAN-A-IA"
ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid_string)

print(ghcid_uuid)
# → prints a version-5 UUID (third group starts with "5").
#   The value 550e8400-e29b-41d4-a716-446655440000 used in this document's
#   examples is an illustrative placeholder, not the real output.

# Result: ANYONE can reproduce this UUID and verify it's correct
```

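Verification by a third party could look like the sketch below. It assumes only the namespace and GHCID string format shown above; `verify_ghcid_uuid` is a hypothetical helper name, not a function from the GHCID codebase.

```python
import uuid

GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def verify_ghcid_uuid(ghcid_string: str, published_uuid: str) -> bool:
    """Return True if published_uuid is the UUID v5 derived from ghcid_string."""
    try:
        claimed = uuid.UUID(published_uuid)
    except ValueError:
        # Not a well-formed UUID string at all.
        return False
    return claimed == uuid.uuid5(GHCID_NAMESPACE, ghcid_string)


# Simulate a published identifier, then check it against two GHCID strings.
published = str(uuid.uuid5(GHCID_NAMESPACE, "US-CA-SAN-A-IA"))
print(verify_ghcid_uuid("US-CA-SAN-A-IA", published))  # True: regenerates from its own GHCID string
print(verify_ghcid_uuid("US-CA-SAN-A-XX", published))  # False: a different GHCID string gives a different UUID
```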
**Transparency benefits:**
- ✅ Researchers can independently verify our data
- ✅ Future systems can regenerate UUIDs without our codebase
- ✅ No "black box" algorithms
- ✅ Builds community trust

---

## Comparison: UUID v5 vs UUID v8 (SHA-256)

### UUID v8 (SHA-256) Alternative

We **also generate UUID v8 (SHA-256)** for future-proofing, but it's secondary because:

| Aspect | UUID v5 (SHA-1) | UUID v8 (SHA-256) |
|--------|-----------------|-------------------|
| **Standardization** | ✅ RFC 4122 (2005), retained in RFC 9562 | ⚠️ RFC 9562 (2024) - format reserved for experimental or vendor-specific use |
| **Algorithm** | ✅ Defined by RFC | ❌ Custom implementation (we define it) |
| **Transparency** | ✅ `uuid.uuid5()` built-in | ⚠️ Requires custom code to verify |
| **Interoperability** | ✅ Most languages provide a v5 function (standard library or common package) | ❌ Requires sharing our implementation |
| **Cryptographic strength** | ⚠️ SHA-1 (deprecated for security) | ✅ SHA-256 (NIST-approved beyond 2030) |
| **Collision resistance** | ✅ 122 hash-derived bits (sufficient) | ✅ 122 hash-derived bits (truncated from 256) |
| **Verification** | ✅ One line of code | ⚠️ Must replicate our algorithm |

**Example of reduced transparency with UUID v8:**

```python
# UUID v8 - CUSTOM algorithm (less transparent)
import hashlib
import uuid


def ghcid_to_uuid_v8(ghcid_string: str) -> uuid.UUID:
    """
    Custom UUID v8 using SHA-256.

    NOTE: Others must know OUR specific algorithm to verify this.
    """
    # Hash the GHCID string
    hash_bytes = hashlib.sha256(ghcid_string.encode('utf-8')).digest()

    # Truncate to 128 bits (custom choice - we could have done this differently)
    uuid_bytes = bytearray(hash_bytes[:16])

    # Set version bits (standard): high nibble of byte 6 becomes 0x8
    uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80

    # Set variant bits (standard): top two bits of byte 8 become 0b10
    uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80

    return uuid.UUID(bytes=bytes(uuid_bytes))

# Problem: Others must know we take first 16 bytes, not last 16 bytes
# Problem: Others must know we use UTF-8 encoding
# Problem: Others must have our exact implementation
```

**Conclusion:** UUID v8 is more secure cryptographically, but **less transparent** for verification.

---

## Why NOT Use UUID v7?

### UUID v7 is Time-Based, NOT Deterministic

**UUID v7 is perfect for database primary keys** (fast, time-ordered), but **unsuitable for persistent identifiers**:

```python
import uuid

NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')  # GHCID namespace

# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
uuid1 = uuid.uuid5(NAMESPACE, ghcid)
uuid2 = uuid.uuid5(NAMESPACE, ghcid)
uuid1 == uuid2  # ✅ SAME! (Always)

# UUID v7 - Time-based (timestamp-addressed)
# generate_uuid_v7() stands in for any UUID v7 generator; values are illustrative
uuid1 = generate_uuid_v7()  # → 019a58fd-3226-7504-8c55-...
uuid2 = generate_uuid_v7()  # → 019a58fd-9999-7abc-def0-...
uuid1 == uuid2  # ❌ DIFFERENT! (Every time)
```

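To make the contrast concrete, here is a minimal, runnable sketch of a UUID v7 generator following the RFC 9562 layout (48-bit Unix millisecond timestamp, then version, variant, and random bits). The name `uuid7_sketch` is illustrative; it is not the generator GHCID uses for database record IDs, and it omits the monotonicity guarantees a production generator would normally add.

```python
import os
import time
import uuid

GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def uuid7_sketch() -> uuid.UUID:
    """Illustrative UUID v7: timestamp in the top 48 bits, random bits below."""
    timestamp_ms = time.time_ns() // 1_000_000
    random_bits = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ((timestamp_ms & 0xFFFFFFFFFFFF) << 80) | random_bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version field = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant field = 0b10
    return uuid.UUID(int=value)


# Two calls differ (with overwhelming probability) because of the random bits.
first, second = uuid7_sketch(), uuid7_sketch()
print(first != second)

# UUID v5 depends only on namespace and name, so repeated calls always agree.
print(uuid.uuid5(GHCID_NAMESPACE, "US-CA-SAN-A-IA") == uuid.uuid5(GHCID_NAMESPACE, "US-CA-SAN-A-IA"))
```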
**Problem for citations:**

```markdown
# Paper published in 2024
"See Internet Archive (urn:uuid:550e8400-e29b-41d4-a716-446655440000) for details."

# Database crashes in 2025, you rebuild from GHCID strings
# With UUID v5: Regenerate → 550e8400-... ✅ SAME! Citation still works!
# With UUID v7: Generate new → 019b1234-... ❌ BROKEN! Citation is dead!
```

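The rebuild scenario can be simulated in a few lines. This is a sketch under stated assumptions: `build_index` is a hypothetical helper, and `NL-NH-AMS-M-RM` is a made-up GHCID string used only to give the index a second entry.

```python
import uuid

GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def build_index(ghcid_strings):
    """Map each 'urn:uuid:...' identifier to the GHCID string it was derived from."""
    return {uuid.uuid5(GHCID_NAMESPACE, g).urn: g for g in ghcid_strings}


# GHCID strings that survive the crash (the second one is invented for illustration).
surviving_ghcids = ["US-CA-SAN-A-IA", "NL-NH-AMS-M-RM"]

index_2024 = build_index(surviving_ghcids)  # state when the paper was published
cited_urn = next(iter(index_2024))          # the URN quoted in the paper

index_2025 = build_index(surviving_ghcids)  # rebuilt from scratch after the crash
print(index_2025[cited_urn])                # the old citation still resolves
```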
**Use case separation:**
- **UUID v7:** Database record IDs (internal, performance)
- **UUID v5:** Persistent identifiers (public, citations, cross-system references)

---

## Addressing Security Auditor Concerns

### Common Objection: "SHA-1 is Broken!"

**Response:**

SHA-1 is **deprecated for cryptographic security applications** (digital signatures, TLS, password hashing), but **appropriate for non-cryptographic use cases** like:

- ✅ **Git commit hashes** - Linus Torvalds has argued that SHA-1 collision attacks do not undermine Git's use of the hash
- ✅ **UUID generation** - UUID v5 remains part of the current UUID standard (RFC 9562, which obsoleted RFC 4122 in 2024 without deprecating v5)
- ✅ **Checksums** - File integrity in non-adversarial contexts
- ✅ **Content addressing** - Hash-based deduplication

**Key distinction:**

| Use Case | SHA-1 Status | Reasoning |
|----------|--------------|-----------|
| **Digital signatures** | ❌ UNSAFE | Attacker can forge signatures |
| **TLS certificates** | ❌ UNSAFE | Attacker can impersonate websites |
| **Password hashing** | ❌ UNSAFE | SHA-1 is fast, so attackers can brute-force passwords cheaply |
| **GHCID identifiers** | ✅ SAFE | No attacker, no forgery incentive, sufficient collision resistance |

### NIST Guidance

**NIST SP 800-107 Rev. 1 (2012):** In summary, SHA-1 should not be used for generating digital signatures, but it may still be used for hash-based message authentication codes (HMACs), key derivation functions (KDFs), and random number generation.

**NIST retirement date (Dec 31, 2030):** Applies to **security applications** (signatures, authentication), NOT to identifier generation.

---

## Future-Proofing Strategy

### Dual UUID Approach

**We generate BOTH UUID v5 and UUID v8 (SHA-256):**

```yaml
# Every institution record includes (identifier values are illustrative placeholders)
identifiers:
  - identifier_scheme: UUID_V5
    identifier_value: 550e8400-e29b-41d4-a716-446655440000
    primary: true  # Current standard

  - identifier_scheme: UUID_V8_SHA256
    identifier_value: 018e6897-dca5-8eb7-931f-30301fbde4ec
    primary: false  # Future-proofing
```

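A record's `identifiers` list could be derived from the GHCID string alone, along the lines of the sketch below. It assumes the SHA-256 truncation scheme described in the UUID v8 comparison section; `build_identifiers` and `sha256_uuid_v8` are hypothetical names, not functions from the GHCID codebase.

```python
import hashlib
import uuid

GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def sha256_uuid_v8(ghcid_string: str) -> uuid.UUID:
    """Assumed v8 scheme: first 16 bytes of SHA-256, then version and variant bits."""
    raw = bytearray(hashlib.sha256(ghcid_string.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC variant
    return uuid.UUID(bytes=bytes(raw))


def build_identifiers(ghcid_string: str) -> list:
    """Return both deterministic identifiers in the record layout shown above."""
    return [
        {
            "identifier_scheme": "UUID_V5",
            "identifier_value": str(uuid.uuid5(GHCID_NAMESPACE, ghcid_string)),
            "primary": True,
        },
        {
            "identifier_scheme": "UUID_V8_SHA256",
            "identifier_value": str(sha256_uuid_v8(ghcid_string)),
            "primary": False,
        },
    ]


# Both identifiers can be regenerated at any time from the GHCID string.
for entry in build_identifiers("US-CA-SAN-A-IA"):
    print(entry)
```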
**Migration path:**

| Timeline | Primary Identifier | Secondary Identifier | Rationale |
|----------|-------------------|----------------------|-----------|
| **2024-2030** | UUID v5 (SHA-1) | UUID v8 (SHA-256) | Standard compliance, wide support |
| **2030-2040** | UUID v8 (SHA-256) | UUID v5 (legacy) | If SHA-1 fully deprecated |
| **2040+** | UUID v8 (SHA-256) | None | SHA-1 sunset complete |

**Critical advantage:** Both are **deterministic** - can regenerate from GHCID string anytime, no data loss.

---

## Real-World Precedents

### Other Systems Using SHA-1 for Identifiers

1. **Git Version Control**
   - Uses SHA-1 for commit hashes
   - Linus Torvalds has argued that collision attacks do not matter for Git's use case
   - GitHub still uses SHA-1 (with plans to migrate to SHA-256 over many years)

2. **UUID v5 (RFC 4122)**
   - Standard since 2005
   - Still widely used in 2024
   - Retained unchanged by RFC 9562 (2024), which obsoleted RFC 4122 without deprecating v5

3. **Content-Addressed Storage**
   - BitTorrent v1 uses SHA-1 (v2 uses SHA-256); IPFS defaults to SHA-256
   - Hash function chosen based on use case, not blanket "SHA-1 is bad"

---

## Documentation Transparency

### Public Algorithm Documentation

**GHCID UUID v5 generation is fully documented:**

```python
# File: src/glam_extractor/identifiers/ghcid.py
import uuid

# GHCID UUID v5 Namespace (publicly documented)
GHCID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Method of GHCIDComponents (rest of the class omitted here)
def to_uuid(self) -> str:
    """
    Generate UUID v5 from GHCID string.

    Algorithm:
    1. Construct GHCID string: f"{country}-{region}-{city}-{type}-{abbr}"
    2. Apply RFC 4122 UUID v5 algorithm:
       - Namespace: 6ba7b810-9dad-11d1-80b4-00c04fd430c8
       - Name: GHCID string (UTF-8 encoded)
       - Hash: SHA-1 (per RFC 4122)
    3. Format as UUID: 8-4-4-4-12 hex format

    Example:
        >>> components = GHCIDComponents("US", "CA", "SAN", "A", "IA")
        >>> components.to_uuid()  # version-5 UUID string for "US-CA-SAN-A-IA"

    Returns:
        UUID v5 string (36 characters with hyphens)
    """
    ghcid_string = self.to_string()
    return str(uuid.uuid5(GHCID_NAMESPACE, ghcid_string))
```

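The algorithm steps listed in the docstring can also be carried out by hand, which is what makes the scheme verifiable without any GHCID code. The sketch below follows the RFC 4122 name-based procedure (SHA-1 over namespace bytes plus name, truncate to 16 bytes, set version and variant) and compares the result with Python's built-in `uuid.uuid5`; `uuid5_by_hand` is an illustrative name.

```python
import hashlib
import uuid

GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Name-based UUID (version 5) built directly from SHA-1."""
    # Hash the 16 namespace bytes followed by the UTF-8 encoded name.
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])      # keep the first 128 of SHA-1's 160 bits
    raw[6] = (raw[6] & 0x0F) | 0x50   # version field = 5
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant field = 0b10
    return uuid.UUID(bytes=bytes(raw))


by_hand = uuid5_by_hand(GHCID_NAMESPACE, "US-CA-SAN-A-IA")
built_in = uuid.uuid5(GHCID_NAMESPACE, "US-CA-SAN-A-IA")
print(by_hand == built_in)  # True: both follow the same published algorithm
```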
**Anyone can verify:**

```bash
# Python
python3 -c "import uuid; print(uuid.uuid5(uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8'), 'US-CA-SAN-A-IA'))"

# JavaScript (using the uuid package), run with Node.js
node -e "const { v5 } = require('uuid'); console.log(v5('US-CA-SAN-A-IA', '6ba7b810-9dad-11d1-80b4-00c04fd430c8'));"

# Java: the standard library has no UUID v5 generator.
# UUID.nameUUIDFromBytes() produces a version-3 (MD5) UUID and takes no namespace,
# so a third-party v5 implementation is needed.

# Python and JavaScript print the same version-5 UUID
```

---

## Governance Implications

### Why This Choice Supports GHCID as a PID Scheme

**For GHCID to become an established persistent identifier scheme**, it must:

1. **Be Transparent** ✅ - UUID v5 is fully documented and verifiable
2. **Be Reproducible** ✅ - Anyone can regenerate UUIDs from GHCID strings
3. **Be Standard-Compliant** ✅ - UUID v5 is defined by an IETF standard (RFC 4122, carried forward in RFC 9562)
4. **Be Future-Proof** ✅ - We also generate UUID v8 (SHA-256) for migration
5. **Build Trust** ✅ - No proprietary "black box" algorithms

**Transparency builds community trust**, which is essential for:
- Adoption by heritage institutions
- Integration with Europeana, DPLA, Wikidata
- Recognition as a legitimate PID scheme
- Long-term persistence (decades of operation)

---

## Summary: Why UUID v5 (SHA-1) is the Right Choice

### ✅ Advantages

1. **Standardized** - RFC 4122 compliant, universal support
2. **Transparent** - Publicly documented, anyone can verify
3. **Deterministic** - Same GHCID always produces same UUID
4. **Sufficient collision resistance** - 122 hash-derived bits, virtually zero probability of accidental collision
5. **Widely supported** - Available in most programming languages via the standard library or a common package
6. **Non-adversarial context** - No security threats from collision attacks
7. **Interoperable** - Works with existing UUID v5 systems

### ⚠️ Limitations

1. **Perception** - "SHA-1 is broken" stigma (requires education)
2. **Security auditors** - May flag SHA-1 use (need to explain context)
3. **Future deprecation** - If a future revision of the UUID standard deprecates v5 (mitigated by dual UUIDs)

### 🔮 Future-Proofing

- ✅ We generate **both UUID v5 (SHA-1) and UUID v8 (SHA-256)**
- ✅ Can migrate to SHA-256 primary if needed
- ✅ Both are deterministic - no data loss in migration
- ✅ Transparent algorithm documentation ensures verifiability

---

## Decision Log

**Date:** 2024-11-06
**Decision Maker:** GLAM Data Extraction Project
**Decision:** Adopt UUID v5 (SHA-1) as primary persistent identifier format

**Rationale:**
- Transparency and verifiability outweigh cryptographic strength concerns
- SHA-1 collision attacks are irrelevant in non-adversarial identifier generation
- RFC 4122 standard compliance ensures wide interoperability
- Dual UUID strategy (v5 + v8) provides future-proofing
- The 122 hash-derived bits of a 128-bit UUID give more than sufficient collision resistance for the heritage domain

**Status:** ✅ Adopted and implemented

---

## References

- **RFC 4122:** A Universally Unique IDentifier (UUID) URN Namespace (obsoleted by RFC 9562) - https://tools.ietf.org/html/rfc4122
- **RFC 9562:** Universally Unique IDentifiers (UUIDs) - https://datatracker.ietf.org/doc/rfc9562/
- **SHAttered Attack:** Google/CWI (2017) - https://shattered.io
- **NIST SP 800-107 Rev. 1:** Recommendation for Applications Using Approved Hash Algorithms - https://csrc.nist.gov/publications/detail/sp/800-107/rev-1/final
- **Git and SHA-1:** Linus Torvalds mailing list discussions
- **UUID v5 Usage:** Wikidata, IIIF, Europeana identifier practices

---

**Version:** 1.0
**Last Updated:** 2024-11-06
**Review Date:** 2027-01-01 (or when the UUID standard is next revised)