# Why GHCID Uses UUID v5 and SHA-1

**Date:** 2024-11-06

**Decision:** Use UUID v5 (SHA-1) as the primary persistent identifier format for GHCID

**Status:** Adopted and implemented

---

## Executive Summary

**GHCID uses UUID v5 (RFC 4122, carried forward unchanged in RFC 9562) with SHA-1 hashing** to generate persistent identifiers from GHCID strings. While SHA-1 is deprecated for **cryptographic security applications** (digital signatures, TLS certificates), it remains **appropriate and safe for deterministic identifier generation** in non-adversarial contexts.

**Key Principle:** For heritage institution identifiers, transparency and reproducibility matter more than cutting-edge cryptographic strength.

---

## The Decision

### What We Chose

```yaml
Primary Identifier: UUID v5 (SHA-1)
Secondary Identifier: UUID v8 (SHA-256)
Database Record ID: UUID v7 (time-ordered)
```

### Why UUID v5 is Primary

1. **RFC 4122 Standardized** - Universal recognition and support
2. **Deterministic** - Same input always produces same output
3. **Transparent** - Algorithm is publicly documented and verifiable
4. **Widely Supported** - Available in the standard library or a common package in most major programming languages
5. **Sufficient Collision Resistance** - 122 hash-derived bits (128 minus the fixed version and variant bits) make an accidental collision vanishingly unlikely at GHCID's scale

---

## Understanding SHA-1 in Context

### SHA-1 Cryptographic Weaknesses (Real)

**Vulnerability:** Collision attacks (SHAttered attack, 2017)

- Attackers can find two different inputs that produce the same SHA-1 hash
- **Critical for:** Digital signatures and TLS certificates, where an attacker profits from two documents sharing one hash
- **Example threat:** Forge a malicious PDF with the same signature as a legitimate document

### SHA-1 for Identifiers (Safe)

**Reality:** Collision attacks do NOT apply to GHCID identifier generation

**Why it's safe for UUIDs:**

| Cryptographic Use (Vulnerable) | Identifier Use (Safe) |
|--------------------------------|----------------------|
| **Adversarial context** - Attacker actively tries to forge signatures | **Non-adversarial context** - No one is trying to forge institution IDs |
| **Two-message attack** - Attacker controls BOTH inputs | **Single-source generation** - We control the input (GHCID strings) |
| **Security requirement** - Must resist preimage and collision attacks | **Uniqueness requirement** - Only accidental (birthday-bound) collisions need to be improbable |
| **High stakes** - Financial fraud, impersonation, data tampering | **Low stakes** - Identifier collision would be inconvenient, not catastrophic |

---

## Mathematical Collision Resistance

### Birthday Paradox Analysis

**Question:** How many institutions before we expect a UUID collision?

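
One way to put numbers on this question is the standard birthday approximation, P ≈ 1 − exp(−n(n−1) / (2 × 2^b)), where b is the number of bits that actually vary. The sketch below is illustrative only: the helper name `collision_probability` is hypothetical, and b = 122 is an assumption based on a version-5 UUID fixing 6 of its 128 bits for the version and variant fields.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound estimate of at least one collision among n random IDs."""
    # Expected number of colliding pairs: n(n-1)/2 pairs, each colliding
    # with probability 2**-bits.
    exponent = n * (n - 1) / (2 * 2**bits)
    # expm1 keeps precision when the exponent is far below float epsilon.
    return -math.expm1(-exponent)


# Assumption: 122 variable bits (128 minus 4 version bits and 2 variant bits).
for n in (10_000, 1_000_000, 10_000_000):
    print(f"n = {n:>10,}: P ≈ {collision_probability(n, bits=122):.2e}")
```

For one million institutions this comes out at roughly 9 × 10^-26, and the estimate grows only quadratically with the number of institutions.
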

**For UUID v5 (128 bits, 122 of them hash-derived):**

```
P(collision) ≈ n² / (2 × 2^122)

Where n = number of institutions
(122 = 128 bits minus the 6 fixed version and variant bits)

n = 1,000,000 institutions:
P ≈ (10^6)² / (2 × 2^122) ≈ 9.4 × 10^-26

In plain English:
- About a 9.4 × 10^-24 % chance
- Roughly 1 chance in 10^25
```

**Real-world scale:**

- **Current:** ~10,000 heritage institutions in dataset
- **Expected growth:** 1-10 million worldwide
- **Collision probability:** Effectively zero

**Even if SHA-1 collision resistance is weakened:**

- Collision attacks require massive computational effort (the 2017 SHAttered collision took roughly 6,500 CPU-years plus 110 GPU-years of computation)
- Not economically feasible for heritage identifiers
- Attack has no benefit (no financial gain from forged museum IDs)

---

## Why Transparency Matters More Than Cryptographic Strength

### Core Principle: Verifiable Identifier Generation

**GHCID's mission is to create a transparent, persistent identifier scheme.** This requires:

1. **Anyone can verify** our UUIDs are correctly generated
2. **Anyone can regenerate** UUIDs from GHCID strings
3. **No secret algorithms** or proprietary implementations
4. **No trust required** - verify, don't trust

### UUID v5 Achieves This

```python
# ANYONE can verify this algorithm
import uuid

# Public namespace. This value is the predefined DNS namespace from RFC 4122
# (available in Python as uuid.NAMESPACE_DNS).
GHCID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Standard UUID v5 generation (built into Python)
ghcid_string = "US-CA-SAN-A-IA"
ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid_string)

print(ghcid_uuid)          # prints the same version-5 UUID on every machine
print(ghcid_uuid.version)  # → 5

# Result: ANYONE can reproduce this UUID and verify it's correct
```

**Transparency benefits:**

- ✅ Researchers can independently verify our data
- ✅ Future systems can regenerate UUIDs without our codebase
- ✅ No "black box" algorithms
- ✅ Builds community trust

---

## Comparison: UUID v5 vs UUID v8 (SHA-256)

### UUID v8 (SHA-256) Alternative

We **also generate UUID v8 (SHA-256)** for future-proofing, but it's secondary because:

| Aspect | UUID v5 (SHA-1) | UUID v8 (SHA-256) |
|--------|-----------------|-------------------|
| **Standardization** | ✅ RFC 4122 (2005), retained in RFC 9562 | ⚠️ RFC 9562 (2024) - reserved for experimental or vendor-specific layouts |
| **Algorithm** | ✅ Defined by RFC | ❌ Custom implementation (we define it) |
| **Transparency** | ✅ `uuid.uuid5()` built-in | ⚠️ Requires custom code to verify |
| **Interoperability** | ✅ Most languages offer a v5 function in the standard library or a common package | ❌ Requires sharing our implementation |
| **Cryptographic strength** | ⚠️ SHA-1 (deprecated for security) | ✅ SHA-256 (NIST-approved beyond 2030) |
| **Collision resistance** | ✅ 122 hash-derived bits (sufficient) | ✅ 122 hash-derived bits (truncated from 256) |
| **Verification** | ✅ One line of code | ⚠️ Must replicate our algorithm |

**Example of reduced transparency with UUID v8:**

```python
import hashlib
import uuid

# UUID v8 - CUSTOM algorithm (less transparent)
def ghcid_to_uuid_v8(ghcid_string: str) -> uuid.UUID:
    """
    Custom UUID v8 using SHA-256.

    NOTE: Others must know OUR specific algorithm to verify this.
    """
    # Hash the GHCID string
    hash_bytes = hashlib.sha256(ghcid_string.encode('utf-8')).digest()

    # Truncate to 128 bits (custom choice - we could have done this differently)
    uuid_bytes = bytearray(hash_bytes[:16])

    # Set version bits to 8 (standard)
    uuid_bytes[6] = (uuid_bytes[6] & 0x0F) | 0x80

    # Set variant bits to the RFC 4122 variant (standard)
    uuid_bytes[8] = (uuid_bytes[8] & 0x3F) | 0x80

    return uuid.UUID(bytes=bytes(uuid_bytes))

# Problem: Others must know we take first 16 bytes, not last 16 bytes
# Problem: Others must know we use UTF-8 encoding
# Problem: Others must have our exact implementation
```

**Conclusion:** UUID v8 is more secure cryptographically, but **less transparent** for verification.

---

## Why NOT Use UUID v7?

### UUID v7 is Time-Based, NOT Deterministic

**UUID v7 is perfect for database primary keys** (fast, time-ordered), but **unsuitable for persistent identifiers**:

```python
# UUID v5 - Deterministic (content-addressed)
ghcid = "US-CA-SAN-A-IA"
first = uuid.uuid5(GHCID_NAMESPACE, ghcid)
second = uuid.uuid5(GHCID_NAMESPACE, ghcid)
first == second  # ✅ SAME! (Always)

# UUID v7 - Time-based (timestamp-addressed)
# generate_uuid_v7() stands in for whatever UUID v7 generator the database layer uses
first = generate_uuid_v7()   # e.g. 019a58fd-3226-7504-8c55-...
second = generate_uuid_v7()  # e.g. 019a58fd-9999-7abc-9ef0-...
first == second  # ❌ DIFFERENT! (Every time)
```

**Problem for citations:**

```markdown
# Paper published in 2024
# (the UUID below is a placeholder value for illustration)
"See Internet Archive (urn:uuid:550e8400-e29b-41d4-a716-446655440000) for details."

# Database crashes in 2025, you rebuild from GHCID strings
# With UUID v5: Regenerate → 550e8400-... ✅ SAME! Citation still works!
# With UUID v7: Generate new → 019b1234-... ❌ BROKEN! Citation is dead!
```

**Use case separation:**

- **UUID v7:** Database record IDs (internal, performance)
- **UUID v5:** Persistent identifiers (public, citations, cross-system references)

---

## Addressing Security Auditor Concerns

### Common Objection: "SHA-1 is Broken!"


**Response:** SHA-1 is **deprecated for cryptographic security applications** (digital signatures, TLS, password hashing), but **appropriate for non-cryptographic use cases** like:

- ✅ **Git commit hashes** - Linus Torvalds has argued that published collision attacks do not break Git's use of SHA-1
- ✅ **UUID generation** - RFC 9562 (2024), which obsoletes RFC 4122, still defines UUID v5 with SHA-1
- ✅ **Checksums** - File integrity in non-adversarial contexts
- ✅ **Content addressing** - Hash-based deduplication

**Key distinction:**

| Use Case | SHA-1 Status | Reasoning |
|----------|--------------|-----------|
| **Digital signatures** | ❌ UNSAFE | Attacker can forge signatures |
| **TLS certificates** | ❌ UNSAFE | Attacker can impersonate websites |
| **Password hashing** | ❌ UNSAFE | Fast, unsalted hashing lets an attacker brute-force passwords quickly |
| **GHCID identifiers** | ✅ SAFE | No attacker, no forgery incentive, sufficient collision resistance |

### NIST Guidance

**NIST SP 800-107 Rev. 1 (2012), paraphrased:** SHA-1 should not be used to generate digital signatures, but it remains acceptable for hash-based message authentication codes (HMACs), key derivation functions (KDFs), and random number generation.

**NIST retirement date (Dec 31, 2030):** Targets SHA-1 in **security applications** (signatures, authentication); it does not address name-based identifier generation.

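
For an auditor who wants to see exactly where SHA-1 enters the picture, the name-based construction can be rebuilt by hand and compared against the standard library. This is a minimal sketch rather than the project's code: the function name `uuid5_by_hand` is hypothetical, and the namespace and sample GHCID string are the ones used elsewhere in this document.

```python
import hashlib
import uuid

# Namespace used throughout this document (the RFC 4122 DNS namespace value).
GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Rebuild a version-5 UUID from its parts: SHA-1, truncate, stamp bits."""
    # 1. Hash the 16 namespace bytes followed by the UTF-8 encoded name.
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    # 2. Keep the first 16 of the 20 digest bytes.
    raw = bytearray(digest[:16])
    # 3. Overwrite the version nibble with 5 and the variant bits with 10.
    raw[6] = (raw[6] & 0x0F) | 0x50
    raw[8] = (raw[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(raw))


manual = uuid5_by_hand(GHCID_NAMESPACE, "US-CA-SAN-A-IA")
builtin = uuid.uuid5(GHCID_NAMESPACE, "US-CA-SAN-A-IA")
print(manual == builtin)  # the hand-built value matches the standard library
```

SHA-1 is used here only as a deterministic mixing function: nothing is signed or authenticated, and the output carries no security guarantee that a collision attack could undermine.
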

---

## Future-Proofing Strategy

### Dual UUID Approach

**We generate BOTH UUID v5 and UUID v8 (SHA-256):**

```yaml
# Every institution record includes
# (identifier values below are placeholders for illustration)
identifiers:
  - identifier_scheme: UUID_V5
    identifier_value: 550e8400-e29b-41d4-a716-446655440000
    primary: true  # Current standard
  - identifier_scheme: UUID_V8_SHA256
    identifier_value: 018e6897-dca5-8eb7-931f-30301fbde4ec
    primary: false  # Future-proofing
```

**Migration path:**

| Timeline | Primary Identifier | Secondary Identifier | Rationale |
|----------|-------------------|----------------------|-----------|
| **2024-2030** | UUID v5 (SHA-1) | UUID v8 (SHA-256) | Standard compliance, wide support |
| **2030-2040** | UUID v8 (SHA-256) | UUID v5 (legacy) | If SHA-1 fully deprecated |
| **2040+** | UUID v8 (SHA-256) | None | SHA-1 sunset complete |

**Critical advantage:** Both are **deterministic** - can regenerate from GHCID string anytime, no data loss.

---

## Real-World Precedents

### Other Systems Using SHA-1 for Identifiers

1. **Git Version Control**
   - Uses SHA-1 for commit hashes
   - Linus Torvalds has argued that collision attacks do not matter for Git's use case
   - Git still defaults to SHA-1 object names (SHA-256 repositories exist as an option, with migration expected to take many years)
2. **UUID v5 (RFC 4122)**
   - Standard since 2005
   - Still widely used in 2024
   - RFC 9562 (2024) obsoletes RFC 4122 but keeps UUID v5 unchanged
3. **Content-Addressed Storage**
   - BitTorrent v1 uses SHA-1 and v2 uses SHA-256; IPFS defaults to SHA-256
   - Hash function chosen based on use case, not blanket "SHA-1 is bad"

---

## Documentation Transparency

### Public Algorithm Documentation

**GHCID UUID v5 generation is fully documented:**

```python
# File: src/glam_extractor/identifiers/ghcid.py
import uuid

# GHCID UUID v5 Namespace (publicly documented)
GHCID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Method of GHCIDComponents (other members omitted)
def to_uuid(self) -> str:
    """
    Generate UUID v5 from GHCID string.

    Algorithm:
    1. Construct GHCID string: f"{country}-{region}-{city}-{type}-{abbr}"
    2. Apply RFC 4122 UUID v5 algorithm:
       - Namespace: 6ba7b810-9dad-11d1-80b4-00c04fd430c8
       - Name: GHCID string (UTF-8 encoded)
       - Hash: SHA-1 (per RFC 4122)
    3. Format as UUID: 8-4-4-4-12 hex format

    Example:
        >>> components = GHCIDComponents("US", "CA", "SAN", "A", "IA")
        >>> uuid.UUID(components.to_uuid()).version
        5

    Returns:
        UUID v5 string (36 characters with hyphens)
    """
    ghcid_string = self.to_string()
    return str(uuid.uuid5(GHCID_NAMESPACE, ghcid_string))
```

**Anyone can verify:**

```bash
# Python
python3 -c "import uuid; print(uuid.uuid5(uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8'), 'US-CA-SAN-A-IA'))"

# JavaScript (using the uuid package)
node -e "const { v5 } = require('uuid'); console.log(v5('US-CA-SAN-A-IA', '6ba7b810-9dad-11d1-80b4-00c04fd430c8'));"

# Java: java.util.UUID.nameUUIDFromBytes() produces version 3 (MD5) UUIDs and
# will NOT match. Use a library that implements version 5, or build the value
# from MessageDigest.getInstance("SHA-1") following the algorithm above.

# Both commands print the same version-5 UUID
```

---

## Governance Implications

### Why This Choice Supports GHCID as a PID Scheme

**For GHCID to become an established persistent identifier scheme**, it must:

1. **Be Transparent** ✅ - UUID v5 is fully documented and verifiable
2. **Be Reproducible** ✅ - Anyone can regenerate UUIDs from GHCID strings
3. **Be Standard-Compliant** ✅ - UUID v5 is defined by IETF standards (RFC 4122, now RFC 9562)
4. **Be Future-Proof** ✅ - We also generate UUID v8 (SHA-256) for migration
5. **Build Trust** ✅ - No proprietary "black box" algorithms

**Transparency builds community trust**, which is essential for:

- Adoption by heritage institutions
- Integration with Europeana, DPLA, Wikidata
- Recognition as a legitimate PID scheme
- Long-term persistence (decades of operation)

---

## Summary: Why UUID v5 (SHA-1) is the Right Choice

### ✅ Advantages

1. **Standardized** - RFC 4122 compliant, universal support
2. **Transparent** - Publicly documented, anyone can verify
3. **Deterministic** - Same GHCID always produces same UUID
4. **Sufficient collision resistance** - 122 hash-derived bits, virtually zero probability
5. **Widely supported** - Available in the standard library or a common package in most programming languages
6. **Non-adversarial context** - No security threats from collision attacks
7. **Interoperable** - Works with existing UUID v5 systems

### ⚠️ Limitations

1. **Perception** - "SHA-1 is broken" stigma (requires education)
2. **Security auditors** - May flag SHA-1 use (need to explain context)
3. **Future deprecation** - If a later RFC deprecates UUID v5 (mitigated by dual UUIDs)

### 🔮 Future-Proofing

- ✅ We generate **both UUID v5 (SHA-1) and UUID v8 (SHA-256)**
- ✅ Can migrate to SHA-256 primary if needed
- ✅ Both are deterministic - no data loss in migration
- ✅ Transparent algorithm documentation ensures verifiability

---

## Decision Log

**Date:** 2024-11-06

**Decision Maker:** GLAM Data Extraction Project

**Decision:** Adopt UUID v5 (SHA-1) as primary persistent identifier format

**Rationale:**

- Transparency and verifiability outweigh cryptographic strength concerns
- SHA-1 collision attacks are irrelevant in non-adversarial identifier generation
- RFC 4122 standard compliance ensures wide interoperability
- Dual UUID strategy (v5 + v8) provides future-proofing
- 122 hash-derived bits give more than sufficient collision resistance for the heritage domain

**Status:** ✅ Adopted and implemented

---

## References

- **RFC 4122:** A Universally Unique IDentifier (UUID) URN Namespace - https://tools.ietf.org/html/rfc4122
- **RFC 9562:** Universally Unique IDentifiers (UUIDs) - https://datatracker.ietf.org/doc/rfc9562/
- **SHAttered Attack:** Google/CWI (2017) - https://shattered.io
- **NIST SP 800-107 Rev. 1:** Recommendation for Applications Using Approved Hash Algorithms - https://csrc.nist.gov/publications/detail/sp/800-107/rev-1/final
- **Git and SHA-1:** Linus Torvalds mailing list discussions
- **UUID v5 Usage:** Wikidata, IIIF, Europeana identifier practices

---

**Version:** 1.0

**Last Updated:** 2024-11-06

**Review Date:** 2027-01-01 (or earlier if the UUID v5 specification changes)