# Persistent Identifiers for Heritage Institutions

## Overview

The GLAM Data Extraction project uses **multiple identifier formats** optimized for different purposes:

### Persistent Identifiers (Deterministic)

These can be regenerated from the GHCID string and are stable across systems:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v5** | 128 | SHA-1 | **PRIMARY** - Europeana, DPLA, IIIF, Wikidata | RFC 4122 Standard |
| **UUID SHA-256** | 128 | SHA-256 | **SOTA** - Security compliance, future-proofing | RFC 9562 (UUID v8) |
| **Numeric** | 64 | SHA-256 | CSV exports, numeric analysis | Internal |
| **Human-readable** | Variable | ISO format | Citations, documentation | ISO-based |

### Database Record Identifiers (Non-Deterministic)

These are generated once per record and optimize database performance:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v7** | 128 | Timestamp + Random | Database PKs, time-ordered queries | RFC 9562 Standard |

## Why Four Formats?

The four deterministic formats are described below. UUID v7 is a per-record database key and is not derived from the GHCID string.

### 1. **UUID v5 (SHA-1)** - Interoperability Standard ⭐ PRIMARY

```
Format:   550e8400-e29b-41d4-a716-446655440000
Version:  5 (name-based, SHA-1)
Standard: RFC 4122 (2005)
```

The UUID shown here and in later examples is an illustrative placeholder; a real UUID v5 has `5` as the first digit of its third group.

**✅ Strengths:**
- **RFC 4122 compliant** - Universal library support
- **Deterministic** - Same GHCID → Same UUID always (content-addressed)
- **Transparent** - Publicly documented algorithm, anyone can verify
- **Interoperable** - Works with Europeana, DPLA, IIIF, Wikidata
- **128-bit collision resistance** - P(collision) ≈ 1.5×10^-27 for 1M institutions

**⚠️ SHA-1 Nuance:**
- Uses SHA-1 internally (RFC 4122 specification)
- SHA-1 deprecated for **cryptographic security** (digital signatures, TLS, passwords)
- SHA-1 **appropriate for identifier generation** (non-adversarial, collision-resistant)
- See [Why GHCID Uses UUID v5 and SHA-1](WHY_UUID_V5_SHA1.md) for detailed rationale

**Why SHA-1 is Safe for GHCID:**

```
Cryptographic Use (Vulnerable):
- Adversarial context (attacker forges signatures)
- Two-message collision attack
- Security-critical (financial, authentication)

Identifier Use (Safe):
- Non-adversarial context (no one forges museum IDs)
- Single-source generation (we control inputs)
- Uniqueness requirement (birthday paradox protection sufficient)
```

**Use When:**
- **Primary identifier** for all GHCID records
- Integrating with existing UUID v5 systems
- Exporting to Europeana, DPLA, IIIF
- Storing in Wikidata as external identifier
- RFC 4122 strict compliance required
- **Maximum transparency** required (anyone can verify)

---

### 2. **UUID SHA-256 (Custom)** - SOTA Cryptographic Strength

```
Format:    a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version:   8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)
```

**✅ Strengths:**
- **SHA-256** - NIST-approved, SOTA cryptographic hash (2024)
- **Superior collision resistance** vs SHA-1
- **Future-proof** - No known practical attacks
- **UUID-compatible** - Valid UUID format, works with UUID parsers

**⚠️ Nuances:**
- **Not an RFC 4122 version** - Custom hashing scheme carried in the UUID v8 layout
- UUID v8 is the version RFC 9562 reserves for experimental or vendor-specific use
- May not be recognized by strict UUID v5-only systems
- Truncation to 128 bits means the odds of an accidental collision are the same as for UUID v5

**Use When:**
- Security policy mandates SHA-256
- Maximum collision resistance required
- Future-proofing against SHA-1 deprecation
- Custom identifier resolution service

**Algorithm:**
1. Hash GHCID string with SHA-256 → 256 bits
2. Truncate to first 128 bits (16 bytes)
3. Set version bits to 8 (custom)
4. Set variant bits to RFC 4122 (0b10xxxxxx)

---

### 3. **Numeric (64-bit)** - Database Optimization

```
Format:    213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range:     0 to 18,446,744,073,709,551,615
```

**✅ Strengths:**
- **Compact** - Fits in 8 bytes (SQL BIGINT)
- **Fast indexing** - Integer comparisons faster than UUID
- **CSV-friendly** - No special characters
- **Deterministic** - Same GHCID → Same number

**⚠️ Nuances:**
- **64-bit truncation** reduces collision resistance vs full 256-bit
- P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
- Still negligible for heritage domain (<10M institutions expected)
- Standard SQL `BIGINT` is signed (maximum 2^63 - 1), so values in the upper half of the uint64 range need an unsigned column type or a remapping into the signed range

**Use When:**
- Database primary key optimization
- CSV exports for spreadsheet analysis
- Numeric sorting required
- Systems without UUID support

---

### 4. **Human-Readable (ISO-based)** - Citations & References

```
Format:     US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example:    NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)
```

**✅ Strengths:**
- **Human-readable** - Understandable without lookup
- **Geographic context** - Location embedded in ID
- **Type indicator** - Institution type visible
- **Citable** - Use in academic papers, documentation

**⚠️ Nuances:**
- **Not persistent** if institution relocates or changes name
- Use `ghcid_original` field (frozen) for true persistence
- `ghcid` field (current) may change over time

**Use When:**
- Academic citations
- Documentation and reports
- Human-readable data exchange
- Debugging and logging

---

## Collision Resistance Comparison

### Mathematical Analysis

```python
# Collision probability (birthday-paradox approximation):
#   P(collision) ≈ n² / (2 × 2^bits)

def collision_probability(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n random IDs."""
    return n**2 / (2 * 2**bits)

# For 1,000,000 institutions:
collision_probability(10**6, 128)  # ≈ 1.5e-27 (UUID v5 / UUID SHA-256): effectively zero
collision_probability(10**6, 64)   # ≈ 2.7e-8  (numeric, 0.0000027%): negligible for heritage domain

# Even at 10 million institutions:
collision_probability(10**7, 64)   # ≈ 2.7e-6  (0.00027%): still acceptable

# Note: 6 of a UUID's 128 bits are fixed (version and variant), so 122 bits carry hash output.
collision_probability(10**6, 122)  # ≈ 9.4e-26: still effectively zero
```

### Real-World Context

| Institution Count | UUID v5/SHA-256 | Numeric (64-bit) | Assessment |
|-------------------|-----------------|------------------|------------|
| **100,000** | ~0% | 2.7×10^-10 (0.000000027%) | ✅ All safe |
| **1,000,000** | ~0% | 2.7×10^-8 (0.0000027%) | ✅ All safe |
| **10,000,000** | ~0% | 2.7×10^-6 (0.00027%) | ✅ UUID safe, numeric acceptable |
| **100,000,000** | ~0% | 2.7×10^-4 (0.027%) | ⚠️ Use UUID, numeric risky |

**Conclusion:** For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.

---

## Historical Collision Resolution

### The Rule: Temporal Priority Determines Disambiguation

When creating GHCIDs, collisions can occur in two temporal contexts:

1. **First Batch Creation** (initial PID assignment): Multiple institutions discovered simultaneously
2. **Historical Addition** (post-publication): New historical institution added after existing GHCID published

**Critical Design Decision**: The collision resolution strategy differs based on temporal context to preserve PID stability.

### Collision Resolution: Native Language Name Suffix

**Key Change**: Collisions are resolved by appending the **full legal name in native language in snake_case format**, NOT Wikidata Q-numbers.

**Name Suffix Rules**:
- Use the institution's full official name in its native language
- Convert to snake_case (lowercase, underscores for spaces)
- Remove apostrophes, accents, commas, and other punctuation/diacritics
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)

**Name Normalization Examples**:

```
"Stedelijk Museum Amsterdam"          → "stedelijk_museum_amsterdam"
"Musée d'Orsay"                       → "musee_dorsay"
"Biblioteca Nacional do Brasil"       → "biblioteca_nacional_do_brasil"
"北京故宫博物院"                        → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek"  → "osterreichische_nationalbibliothek"
```

### First Batch Behavior (Initial PID Creation)

**Scenario**: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.

**Resolution**: **ALL** colliding institutions get name suffixes appended.

**Example**:

```yaml
# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM

# Stedelijk Museum (founded 1874)
ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Science Museum Amsterdam (founded 2010)
ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
```

**Rationale**: No existing PIDs to preserve; both institutions are "new" to the system.

### Historical Addition Behavior (Post-Publication)

**Scenario**: After the initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.
**Resolution**: **ONLY** the newly added historical institution gets a name suffix. The existing PID remains unchanged.

**Example**:

```yaml
# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM  # Hermitage Museum Amsterdam (2009-2023)

# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975)
# Would also generate: NL-NH-AMS-M-HM
#
# COLLISION DETECTED → Add name suffix to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
```

**Outcome**:
- `NL-NH-AMS-M-HM` (Hermitage Museum Amsterdam) → **UNCHANGED**
- `NL-NH-AMS-M-HM-amsterdam_historical_museum` (Amsterdam Historical Museum) → **Name suffix added**

**Rationale**: Preserve stability of already-published PIDs.

### Why This Matters: PID Stability Principle

**Problem**: Changing existing GHCIDs breaks external references. PIDs may already be:
- Cited in academic publications
- Referenced in datasets and APIs
- Stored in institutional databases
- Embedded in IIIF manifests
- Linked from Wikidata

**Principle**: **"Cool URIs don't change"** (Tim Berners-Lee, W3C)

Once a GHCID is published (in first batch or as standalone record), it should **NEVER** change, even if new historical institutions create collisions.

### Decision Table: Who Gets Name Suffix?
| Scenario | When | Existing GHCID | New GHCID | Who Gets Name Suffix | Rationale |
|----------|------|----------------|-----------|---------------------|-----------|
| **First Batch** | Initial PID creation (2025-11-01) | None (first time) | `NL-NH-AMS-M-SM` (2 institutions) | **ALL** colliding institutions | No existing PIDs to preserve |
| **Historical Addition** | Post-publication (2025-11-15) | `NL-NH-AMS-M-HM` (published) | `NL-NH-AMS-M-HM` (historical) | **ONLY** newly added institution | Preserve published PID stability |
| **Standalone Addition** | New institution (2026-01-01) | `NL-NH-AMS-M-XY` (published) | `NL-NH-AMS-M-XY` (new contemporary) | **ONLY** newly added institution | Preserve existing PID |

### Implementation Guidance

**Name Suffix Generation**:

```python
import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Non-Latin scripts must be transliterated to ASCII before calling this
    function; any character still outside a-z, 0-9 and "_" is dropped.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

    # Remove apostrophes (straight and curly), commas, and other punctuation
    no_punct = re.sub(r"['’‘`\",.:;!?()\[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)

    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')

    return final
```

**Collision Detection Logic**:

```python
from typing import Set

def resolve_collision(new_ghcid: str, new_name: str, existing_ghcids: Set[str]) -> str:
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_ghcid: Base GHCID for new institution
        new_name: Native language name of the institution
        existing_ghcids: Set of already-published GHCIDs

    Returns:
        Final GHCID (with name suffix if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add name suffix to NEW institution ONLY
        name_suffix = generate_name_suffix(new_name)
        return f"{new_ghcid}-{name_suffix}"
    else:
        # No collision: Use base GHCID
        return new_ghcid
```

**First Batch Processing** (different logic):

```python
from collections import defaultdict
from typing import List

# Institution, GHCIDRecord, generate_base_ghcid and create_record are
# project types/helpers defined elsewhere.

def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]:
    """
    Process initial batch of institutions.

    For first batch, ALL collisions get name suffixes appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)

    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get name suffixes
            for inst in group:
                name_suffix = generate_name_suffix(inst.name)
                ghcid = f"{base_ghcid}-{name_suffix}"
                records.append(create_record(inst, ghcid))

    return records
```

### Edge Cases

**Case 1: Multiple historical institutions added simultaneously**

If multiple historical institutions are added together (same date) and collide with an existing GHCID:

```yaml
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY

# Both added 2025-11-15
# Historical Institution A: "Amsterdam Art Archive"
ghcid: NL-NH-AMS-M-XY-amsterdam_art_archive

# Historical Institution B: "Amsterdam Archaeology Museum"
ghcid: NL-NH-AMS-M-XY-amsterdam_archaeology_museum
```

**Resolution**: ALL newly added institutions get name suffixes (treat as mini-batch).
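The mini-batch rule in Case 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `resolve_mini_batch` and `name_suffix` are hypothetical names, `name_suffix` is a simplified stand-in for `generate_name_suffix` that only handles Latin-script names, and the choice to also suffix additions that share an unpublished base GHCID with each other (mirroring the first-batch rule) is an assumption.

```python
import re
import unicodedata
from collections import Counter
from typing import Iterable, List, Set, Tuple


def name_suffix(native_name: str) -> str:
    """Simplified stand-in for generate_name_suffix (Latin-script names only)."""
    decomposed = unicodedata.normalize("NFD", native_name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    stripped = re.sub(r"['’‘`]", "", stripped.lower())
    return re.sub(r"[^a-z0-9]+", "_", stripped).strip("_")


def resolve_mini_batch(
    additions: Iterable[Tuple[str, str]],
    published: Set[str],
) -> List[str]:
    """Assign GHCIDs to (base_ghcid, native_name) pairs that are added together.

    Published GHCIDs are never modified. An addition receives a name suffix if
    its base GHCID is already published, or (assumption) if another addition in
    the same mini-batch shares that base GHCID.
    """
    additions = list(additions)
    base_counts = Counter(base for base, _ in additions)

    assigned: List[str] = []
    taken = set(published)  # copy, so the caller's set is left untouched
    for base, name in additions:
        collides = base in published or base_counts[base] > 1
        ghcid = f"{base}-{name_suffix(name)}" if collides else base
        if ghcid in taken:
            # Same base GHCID and same normalized name: needs manual review.
            raise ValueError(f"Unresolvable collision for {ghcid!r}")
        taken.add(ghcid)
        assigned.append(ghcid)
    return assigned


published = {"NL-NH-AMS-M-XY"}
result = resolve_mini_batch(
    [
        ("NL-NH-AMS-M-XY", "Amsterdam Art Archive"),
        ("NL-NH-AMS-M-XY", "Amsterdam Archaeology Museum"),
    ],
    published,
)
print(result)
```

Raising an error when two additions normalize to the same suffixed GHCID is one possible design choice; the source does not specify how that situation should be handled.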
**Case 2: Existing GHCID already has name suffix**

If the existing GHCID already has a name suffix (from a first batch collision), the new historical addition gets a different name suffix:

```yaml
# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Historical addition (2025-11-15)
ghcid: NL-NH-AMS-M-SM-stadsmuseum_amsterdam  # Different name suffix
```

**No ambiguity**: Each institution has a unique name suffix derived from its native language name.

**Case 3: Non-Latin script names**

For institutions with non-Latin script names, transliterate to ASCII:

```yaml
# Chinese institution: 北京故宫博物院 (Palace Museum Beijing)
ghcid: CN-BJ-BEI-M-PM-beijing_gugong_bowuyuan

# Japanese institution: 東京国立博物館 (Tokyo National Museum)
ghcid: JP-TK-TOK-M-TN-tokyo_kokuritsu_hakubutsukan

# Arabic institution: المتحف المصري (Egyptian Museum)
ghcid: EG-CA-CAI-M-EM-al_mathaf_al_masri
```

### Testing Strategy

**Test 1: First Batch Collision**

```python
def test_first_batch_collision():
    """Verify ALL institutions in first batch get name suffixes"""
    institutions = [
        Institution("Stedelijk Museum Amsterdam", type="M", city="AMS"),
        Institution("Science Museum Amsterdam", type="M", city="AMS")
    ]

    records = process_first_batch(institutions)

    # Both should have name suffixes
    assert records[0].ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"
```

**Test 2: Historical Addition Collision**

```python
def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}

    # Add historical institution
    historical = Institution(
        name="Amsterdam Historical Museum",
        type="M",
        city="AMS",
        temporal_extent={"start": "1926", "end": "1975"}
    )

    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        historical.name,
        existing_ghcids
    )

    # New historical gets name suffix
    assert new_ghcid == "NL-NH-AMS-M-HM-amsterdam_historical_museum"

    # The published GHCID must not be touched by the addition
    assert existing_ghcids == {"NL-NH-AMS-M-HM"}
```

**Test 3: Name Suffix Generation**

```python
def test_name_suffix_generation():
    """Verify name suffix normalization"""
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    assert generate_name_suffix("Biblioteca Nacional do Brasil") == "biblioteca_nacional_do_brasil"
    assert generate_name_suffix("Royal Museum, London") == "royal_museum_london"
```

### Documentation References

- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
- **Implementation**: `src/glam_extractor/identifiers/ghcid.py`
- **Schema**: `schemas/provenance.yaml` (GHCIDHistoryEntry)
- **Abbreviation Special Characters**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` (characters to exclude from abbreviations)

---

## SHA-1 vs SHA-256: The Nuance

### Why UUID v5 Uses SHA-1

**RFC 4122 (2005)** standardized UUID v5 with SHA-1 because:
- SHA-1 was considered secure in 2005
- 128-bit UUID space provides collision resistance even with SHA-1
- Purpose is **identifier generation**, not **security/authentication**

### SHA-1 Cryptographic Weakness

**SHA-1 collision attacks (2017):**
- Google/CWI demonstrated a practical SHA-1 collision
- Two different inputs producing the same hash
- **Critical for digital signatures** (authentication, certificates)
- **Less critical for identifiers** (birthday paradox protection sufficient)

### When SHA-1 Is Problematic

- ❌ **Digital signatures** - Attacker can forge documents
- ❌ **Certificate authorities** - SSL/TLS security compromised
- ❌ **Password hashing** - Weakens brute-force resistance
- ❌ **Blockchain** - Consensus security at risk

### When SHA-1 Is Acceptable

- ✅ **UUID generation** - Collision resistance adequate for identifier space
- ✅ **Git commits** - Linus Torvalds has argued that SHA-1 is adequate for Git's use case
- ✅ **Non-adversarial contexts** - No attacker trying to cause collisions

---

## Recommended Usage Strategy

### Default: Dual UUID Approach

Store **both UUID formats** for maximum flexibility:

```yaml
# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739  # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739
```

### Use Case Decision Tree

```
Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│   - Europeana, DPLA, IIIF, Wikidata
│   - RFC 4122 compliance required
│
└─ NO → Building custom system?
    ├─ Security policy mandates SHA-256?
    │   ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
    │   └─ NO → Use UUID v5 for standard compliance
    │
    └─ Database optimization critical?
        ├─ YES → Use Numeric (to_numeric()) as PK
        │   - Store UUID v5 as alternate key
        └─ NO → Use UUID v5 as primary identifier
```

---

## Code Examples

### Generate All Four Formats

```python
from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA"
)

# Generate all formats
uuid_v5 = components.to_uuid()             # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()  # UUID SHA-256
numeric = components.to_numeric()          # Numeric (64-bit)
human = components.to_string()             # Human-readable

print(f"UUID v5:      {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric:      {numeric}")
print(f"Human:        {human}")

# Output (UUID and numeric values are illustrative placeholders):
# UUID v5:      550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric:      213324328442227739
# Human:        US-CA-SAN-A-IA
```

### Verify Determinism

```python
# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()
```

### Export to Different Formats

```python
# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"

# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"

# Database (use numeric PK); bind values as parameters instead of formatting
# them into the SQL string (placeholder syntax depends on the database driver)
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"
params = (components.to_numeric(), "Internet Archive")

# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."
```

---

## Future-Proofing Strategy

### Timeline Projections

| Year | SHA-1 Status | UUID v5 Status | Recommendation |
|------|--------------|----------------|----------------|
| **2024** | Weak for security, OK for IDs | Standard, widely supported | ✅ Use UUID v5 as primary |
| **2030** | Likely deprecated for security | Still standard for IDs | ✅ Dual UUID (v5 + SHA-256) |
| **2040** | Possibly deprecated entirely | May be superseded | ⚠️ Migrate to UUID SHA-256 |

### Migration Path

If SHA-1 is fully deprecated:

1. **Phase 1 (Now):** Store both UUID v5 and UUID SHA-256
2. **Phase 2 (2030):** Make UUID SHA-256 primary, keep v5 as alias
3. **Phase 3 (2040):** Deprecate UUID v5, use SHA-256 exclusively

**Critical:** Because both are **deterministic**, you can always regenerate either UUID from the GHCID string without breaking references.

---

## Governance & Resolution

### Identifier Persistence Requirements

Technical generation is only half the solution. True persistence requires:

#### 1. **Resolution Service**

```
https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}

All three should resolve to the same institutional record.
```

#### 2. **Mapping Database**

```sql
CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,  -- BIGINT is signed in most databases; uint64 values >= 2^63 need remapping
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL,  -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);
```

#### 3. **Organizational Commitment**

- Maintain resolution service for decades
- Fund infrastructure for long-term operation
- Establish governance policies for ID assignment
- Handle institution mergers/closures/relocations

#### 4. **Community Standards**

- Coordinate with ISIL, Wikidata, GeoNames
- Publish GHCID specification as RFC or W3C note
- Engage with Europeana, DPLA, IIIF communities
- Establish dispute resolution process

---

## Comparison with Existing PID Systems

| System | Format | Governance | Resolution | Adoption |
|--------|--------|------------|------------|----------|
| **DOI** | 10.xxxx/yyyy | IDF (non-profit) | doi.org | High (scholarly) |
| **ARK** | ark:/nnnnn/xxx | CDL (California) | n2t.net | Medium (archives) |
| **Handle** | hdl:xxxx/yyyy | CNRI (non-profit) | handle.net | Medium (repositories) |
| **GHCID** | UUID v5 | **TBD** | **TBD** | None (new) |

**Lesson:** Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.

---

## Recommendations

### For This Project (2024-2025)

1. ✅ **Implement dual UUID generation** (v5 + SHA-256)
2. ✅ **Store all four identifier formats** in data model
3. ✅ **Use UUID v5 as primary ID** for current interoperability
4. ✅ **Document SHA-1 nuance** clearly
5. ⏳ **Build resolution service prototype**
6. ⏳ **Engage with Europeana/DPLA** for feedback
7. ⏳ **Draft GHCID specification** for community review

### For Production Deployment

1. ⏳ **Establish governance body** (non-profit foundation?)
2. ⏳ **Secure long-term funding** for resolution service
3. ⏳ **Coordinate with existing PID systems** (ISIL, VIAF, Wikidata)
4. ⏳ **Publish specification** (W3C note or IETF RFC)
5. ⏳ **Deploy resolution infrastructure** (multi-region, high availability)
6. ⏳ **Engage heritage community** for adoption

---

## References

- **RFC 4122:** UUID Standard (https://tools.ietf.org/html/rfc4122)
- **SHA-1 Collision:** Google/CWI (2017) - https://shattered.io
- **UUID v8:** New UUID Formats, published as RFC 9562 in May 2024 (https://www.rfc-editor.org/rfc/rfc9562); earlier draft at https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/
- **NIST SHA-256:** FIPS 180-4 - https://csrc.nist.gov/publications/fips
- **Identifiers.org:** Life sciences identifiers - https://identifiers.org
- **N2T:** Name-to-Thing resolver - https://n2t.net

---

**Version:** 1.0
**Date:** 2024-11-06
**Status:** Draft for Community Review