# Persistent Identifiers for Heritage Institutions

## Overview

The GLAM Data Extraction project uses **multiple identifier formats** optimized for different purposes:

### Persistent Identifiers (Deterministic)
These can be regenerated from the GHCID string and are stable across systems:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v5** | 128 | SHA-1 | **PRIMARY** - Europeana, DPLA, IIIF, Wikidata | RFC 4122 Standard |
| **UUID SHA-256** | 128 | SHA-256 | **SOTA** - Security compliance, future-proofing | RFC 9562 (UUID v8) |
| **Numeric** | 64 | SHA-256 | CSV exports, numeric analysis | Internal |
| **Human-readable** | Variable | ISO format | Citations, documentation | ISO-based |

### Database Record Identifiers (Non-Deterministic)
These are generated once per record and optimize database performance:

| Format | Bits | Algorithm | Use Case | Status |
|--------|------|-----------|----------|--------|
| **UUID v7** | 128 | Timestamp + Random | Database PKs, time-ordered queries | RFC 9562 Standard |
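
To make the "Timestamp + Random" layout concrete, here is a minimal sketch of a UUID v7 generator that follows the RFC 9562 field order (48-bit Unix millisecond timestamp, then version, variant, and random bits). The function name `uuid7_sketch` is hypothetical and not part of this project; production code would normally rely on a maintained library instead.

```python
import os
import time
import uuid


def uuid7_sketch() -> uuid.UUID:
    """Build a version 7 UUID: 48-bit ms timestamp followed by random bits."""
    ts_ms = time.time_ns() // 1_000_000
    raw = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70  # version nibble = 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits = 10
    return uuid.UUID(bytes=bytes(raw))


record_id = uuid7_sketch()
print(record_id, record_id.version)
```

Because the timestamp occupies the most significant bits, values generated in a later millisecond sort after earlier ones, which is what makes this format suited to time-ordered primary keys. Unlike the deterministic formats above, it cannot be regenerated from the GHCID string.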

## Why Four Formats?

### 1. **UUID v5 (SHA-1)** - Interoperability Standard ⭐ PRIMARY
```
Format: 550e8400-e29b-41d4-a716-446655440000  (placeholder value)
Version: 5 (name-based, SHA-1); in a real v5 UUID the third group starts with 5
Standard: RFC 4122 (2005)
```

**✅ Strengths:**
- **RFC 4122 compliant** - Universal library support
- **Deterministic** - Same GHCID → Same UUID always (content-addressed)
- **Transparent** - Publicly documented algorithm, anyone can verify
- **Interoperable** - Works with Europeana, DPLA, IIIF, Wikidata
- **128-bit identifier space** - P(collision) ≈ 1.5×10^-27 for 1M institutions (birthday bound)

**⚠️ SHA-1 Nuance:**
- Uses SHA-1 internally (RFC 4122 specification)
- SHA-1 deprecated for **cryptographic security** (digital signatures, TLS, passwords)
- SHA-1 **appropriate for identifier generation** (non-adversarial, collision-resistant)
- See [Why GHCID Uses UUID v5 and SHA-1](WHY_UUID_V5_SHA1.md) for detailed rationale

**Why SHA-1 is Safe for GHCID:**
```
Cryptographic Use (Vulnerable):
  - Adversarial context (attacker forges signatures)
  - Two-message collision attack
  - Security-critical (financial, authentication)

Identifier Use (Safe):
  - Non-adversarial context (no one forges museum IDs)
  - Single-source generation (we control inputs)
  - Uniqueness requirement (birthday paradox protection sufficient)
```

**Use When:**
- **Primary identifier** for all GHCID records
- Integrating with existing UUID v5 systems
- Exporting to Europeana, DPLA, IIIF
- Storing in Wikidata as external identifier
- RFC 4122 strict compliance required
- **Maximum transparency** required (anyone can verify)
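
As an illustration of the determinism described above, the following sketch derives a UUID v5 from a GHCID string with Python's standard `uuid` module. The namespace is an assumption: this document does not state the project's namespace UUID, so a hypothetical one is derived from a placeholder domain, and `ghcid_to_uuid_v5` is an illustrative name rather than the project's API.

```python
import uuid

# ASSUMPTION: hypothetical namespace derived from a placeholder domain.
# The project's real namespace UUID is not given in this document.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ghcid.example.org")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a name-based (SHA-1) UUID from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


rijksmuseum = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
print(rijksmuseum, rijksmuseum.version)
```

Anyone who knows the namespace and the GHCID string can recompute the same UUID, which is the transparency property listed above. Changing the namespace changes every derived UUID, so it has to be fixed and published once.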

---

### 2. **UUID SHA-256 (Custom)** - SOTA Cryptographic Strength
```
Format: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version: 8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)
```

**✅ Strengths:**
- **SHA-256** - NIST-approved, SOTA cryptographic hash (2024)
- **Resistant to known collision attacks**, unlike SHA-1; accidental-collision odds match UUID v5 because both are 128-bit values
- **Future-proof** - No known practical attacks
- **UUID-compatible** - Valid UUID format, works with UUID parsers

**⚠️ Nuances:**
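- **Not an RFC 4122 version** - Version 8 comes from RFC 9562, which fixes only the version and variant bits; the SHA-256 derivation is a project-specific choice
- UUID v8 is the "experimental/vendor-specific" designation, so other systems cannot regenerate these values without this algorithm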
- May not be recognized by strict UUID v5-only systems

**Use When:**
- Security policy mandates SHA-256
- Maximum collision resistance required
- Future-proofing against SHA-1 deprecation
- Custom identifier resolution service

**Algorithm:**
1. Hash GHCID string with SHA-256 → 256 bits
2. Truncate to first 128 bits (16 bytes)
3. Set version bits to 8 (custom)
4. Set variant bits to RFC 4122 (0b10xxxxxx)
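
The four steps above can be sketched as follows. This is a minimal illustration, assuming the GHCID string is hashed as UTF-8 with no namespace prefix (the document does not say either way); `ghcid_to_uuid_sha256` is a hypothetical name, not the project's `to_uuid_sha256()` method.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """Derive a version 8 UUID from the first 128 bits of SHA-256(ghcid)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()  # step 1: 256 bits
    raw = bytearray(digest[:16])                             # step 2: keep 16 bytes
    raw[6] = (raw[6] & 0x0F) | 0x80                          # step 3: version = 8
    raw[8] = (raw[8] & 0x3F) | 0x80                          # step 4: variant = 10xxxxxx
    return uuid.UUID(bytes=bytes(raw))


example = ghcid_to_uuid_sha256("US-CA-SAN-A-IA")
print(example, example.version)
```

Steps 3 and 4 overwrite 6 of the 128 truncated bits, so 122 bits of hash output survive in the final value.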

---

### 3. **Numeric (64-bit)** - Database Optimization
```
Format: 213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range: 0 to 18,446,744,073,709,551,615
```

**✅ Strengths:**
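- **Compact** - Fits in 8 bytes (SQL BIGINT); BIGINT is signed in most databases, so values above 2^63 − 1 need an unsigned column or a mapping into the signed range
- **Fast indexing** - Integer comparisons faster than UUID
- **CSV-friendly** - No special characters; tools that parse numbers as 64-bit floats lose precision above 2^53, so import the column as text there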
- **Deterministic** - Same GHCID → Same number

**⚠️ Nuances:**
- **64-bit truncation** reduces collision resistance vs full 256-bit
- P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
- Still negligible for heritage domain (<10M institutions expected)

**Use When:**
- Database primary key optimization
- CSV exports for spreadsheet analysis
- Numeric sorting required
- Systems without UUID support
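
A minimal sketch of the numeric derivation (SHA-256 → first 8 bytes → uint64). The byte order and the UTF-8 encoding are assumptions, since the document does not specify them, and the function names are illustrative. Because many SQL `BIGINT` types are signed, the sketch also shows a reversible mapping into the signed 64-bit range.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """Interpret the first 8 bytes of SHA-256(ghcid) as an unsigned integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")  # ASSUMPTION: big-endian


def to_signed64(value: int) -> int:
    """Map an unsigned 64-bit value onto the signed range (two's complement)."""
    return value - 2**64 if value >= 2**63 else value


def from_signed64(value: int) -> int:
    """Invert to_signed64."""
    return value + 2**64 if value < 0 else value


numeric = ghcid_to_numeric("US-CA-SAN-A-IA")
print(numeric, to_signed64(numeric))
```

The signed mapping preserves uniqueness but not numeric ordering, which only matters if the ordering of hash-derived values is relied on.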

---

### 4. **Human-Readable (ISO-based)** - Citations & References
```
Format: US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)
```

**✅ Strengths:**
- **Human-readable** - Understandable without lookup
- **Geographic context** - Location embedded in ID
- **Type indicator** - Institution type visible
- **Citable** - Use in academic papers, documentation

**⚠️ Nuances:**
- **Not persistent** if institution relocates or changes name
- Use `ghcid_original` field (frozen) for true persistence
- `ghcid` field (current) may change over time

**Use When:**
- Academic citations
- Documentation and reports
- Human-readable data exchange
- Debugging and logging
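
For debugging and logging it can help to split a GHCID back into its components. The sketch below assumes the five-part layout shown above plus an optional sixth part (the name suffix used for collision resolution later in this document). `ParsedGHCID` and `parse_ghcid` are hypothetical names, not part of the project's `GHCIDComponents` API.

```python
from typing import NamedTuple, Optional


class ParsedGHCID(NamedTuple):
    country: str
    region: str
    city: str
    institution_type: str
    abbreviation: str
    name_suffix: Optional[str] = None


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split a GHCID into its five components and optional name suffix."""
    parts = ghcid.split("-", 5)  # at most 6 parts; the suffix stays intact
    if len(parts) < 5 or not all(parts):
        raise ValueError(f"not a GHCID: {ghcid!r}")
    return ParsedGHCID(*parts)


print(parse_ghcid("NL-NH-AMS-M-RM"))
```

Parsing only recovers what the string encodes; it does not validate that the country, region, or city codes exist.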

---

## Collision Resistance Comparison

### Mathematical Analysis

```python
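# Collision probability (birthday bound):
# P(collision) ≈ n² / (2 × 2^bits)

def collision_probability(n: int, bits: int) -> float:
    """Approximate probability of at least one collision among n identifiers."""
    return n**2 / (2 * 2**bits)

# For 1,000,000 institutions:

# UUID v5 / UUID SHA-256 (128-bit):
p_uuid = collision_probability(10**6, 128)     # ≈ 1.5 × 10^-27
# Version and variant fix 6 bits; with the remaining 122 bits, P ≈ 9.4 × 10^-26
# Effectively zero either way

# Numeric (64-bit):
p_numeric = collision_probability(10**6, 64)   # ≈ 2.7 × 10^-8  (0.0000027%)
# Negligible for heritage domain

# Even at 10 million institutions:
p_64bit = collision_probability(10**7, 64)     # ≈ 2.7 × 10^-6  (0.00027%)
# Still acceptable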
```

### Real-World Context

| Institution Count | UUID v5/SHA-256 | Numeric (64-bit) | Assessment |
|-------------------|-----------------|------------------|------------|
| **100,000** | ~0% | 2.7×10^-10 (0.000000027%) | ✅ All safe |
| **1,000,000** | ~0% | 2.7×10^-8 (0.0000027%) | ✅ All safe |
| **10,000,000** | ~0% | 2.7×10^-6 (0.00027%) | ✅ UUID safe, numeric acceptable |
| **100,000,000** | ~0% | 2.7×10^-4 (0.027%) | ⚠️ Use UUID, numeric risky |

**Conclusion:** For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.
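
The same birthday bound can be inverted to ask how many records a format tolerates before the collision probability reaches a chosen threshold. The sketch below uses the slightly more exact form P = 1 − exp(−n(n−1) / (2·2^bits)); the function names and the 1% threshold are illustrative assumptions.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday estimate: 1 - exp(-n(n-1) / (2 * 2^bits))."""
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))


def max_records(bits: int, p_max: float) -> float:
    """Approximate record count at which the collision probability reaches p_max."""
    return math.sqrt(2 * 2**bits * -math.log1p(-p_max))


print(f"64-bit, 1% threshold:  {max_records(64, 0.01):.2e} records")
print(f"128-bit, 1% threshold: {max_records(128, 0.01):.2e} records")
```

For the 64-bit numeric format the 1% threshold lands on the order of 6 × 10^8 records, well above the <10M institutions expected for the heritage domain.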

---

## Historical Collision Resolution

### The Rule: Temporal Priority Determines Disambiguation

When creating GHCIDs, collisions can occur in two temporal contexts:

1. **First Batch Creation** (initial PID assignment): Multiple institutions discovered simultaneously
2. **Historical Addition** (post-publication): New historical institution added after existing GHCID published

**Critical Design Decision**: The collision resolution strategy differs based on temporal context to preserve PID stability.

### Collision Resolution: Native Language Name Suffix

**Key Change**: Collisions are resolved by appending the **full legal name in native language in snake_case format**, NOT Wikidata Q-numbers.

**Name Suffix Rules**:
- Use the institution's full official name in its native language
- Convert to snake_case (lowercase, underscores for spaces)
- Remove apostrophes, accents, commas, and other punctuation/diacritics
- Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)

**Name Normalization Examples**:
```
"Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
```

### First Batch Behavior (Initial PID Creation)

**Scenario**: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.

**Resolution**: **ALL** colliding institutions get name suffixes appended.

**Example**:

```yaml
# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM

# Stedelijk Museum (founded 1874)
ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Science Museum Amsterdam (founded 2010)
ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
```

**Rationale**: No existing PIDs to preserve; both institutions are "new" to the system.

### Historical Addition Behavior (Post-Publication)

**Scenario**: After the initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.

**Resolution**: **ONLY** the newly added historical institution gets a name suffix. The existing PID remains unchanged.

**Example**:

```yaml
# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM  # Hermitage Museum Amsterdam (2009-2023)

# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975)
# Would also generate: NL-NH-AMS-M-HM
#
# COLLISION DETECTED → Add name suffix to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
```

**Outcome**:
- `NL-NH-AMS-M-HM` (Hermitage Museum Amsterdam) → **UNCHANGED**
- `NL-NH-AMS-M-HM-amsterdam_historical_museum` (Amsterdam Historical Museum) → **Name suffix added**

**Rationale**: Preserve stability of already-published PIDs.

### Why This Matters: PID Stability Principle

**Problem**: Changing existing GHCIDs breaks external references.

PIDs may already be:
- Cited in academic publications
- Referenced in datasets and APIs
- Stored in institutional databases
- Embedded in IIIF manifests
- Linked from Wikidata

**Principle**: **"Cool URIs don't change"** (Tim Berners-Lee, W3C)

Once a GHCID is published (in first batch or as standalone record), it should **NEVER** change, even if new historical institutions create collisions.

### Decision Table: Who Gets Name Suffix?

| Scenario | When | Existing GHCID | New GHCID | Who Gets Name Suffix | Rationale |
|----------|------|----------------|-----------|---------------------|-----------|
| **First Batch** | Initial PID creation (2025-11-01) | None (first time) | `NL-NH-AMS-M-SM` (2 institutions) | **ALL** colliding institutions | No existing PIDs to preserve |
| **Historical Addition** | Post-publication (2025-11-15) | `NL-NH-AMS-M-HM` (published) | `NL-NH-AMS-M-HM` (historical) | **ONLY** newly added institution | Preserve published PID stability |
| **Standalone Addition** | New institution (2026-01-01) | `NL-NH-AMS-M-XY` (published) | `NL-NH-AMS-M-XY` (new contemporary) | **ONLY** newly added institution | Preserve existing PID |

### Implementation Guidance

**Name Suffix Generation**:

```python
import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

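    # Remove apostrophes (straight and typographic), commas, and other punctuation
    no_punct = re.sub(r"['‘’`\",.:;!?()\[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores).
    # This also drops letters NFD cannot decompose (e.g. "ø", "ß") and all
    # non-Latin scripts, so transliterate such names before calling this function.
    clean = re.sub(r'[^a-z0-9_]', '', underscored)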

    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')

    return final
```

**Collision Detection Logic**:

```python
from typing import Set

def resolve_collision(new_ghcid: str, new_name: str, existing_ghcids: Set[str]) -> str:
    """
    Resolve GHCID collision based on temporal context.

    Args:
        new_ghcid: Base GHCID for new institution
        new_name: Native language name of the institution
        existing_ghcids: Set of already-published GHCIDs

    Returns:
        Final GHCID (with name suffix if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add name suffix to NEW institution ONLY
        name_suffix = generate_name_suffix(new_name)
        return f"{new_ghcid}-{name_suffix}"
    else:
        # No collision: Use base GHCID
        return new_ghcid
```

**First Batch Processing** (different logic):

```python
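from collections import defaultdict
from typing import List

# Institution, GHCIDRecord, generate_base_ghcid() and create_record() are
# assumed to be defined elsewhere in the project.
def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]: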
    """
    Process initial batch of institutions.

    For first batch, ALL collisions get name suffixes appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)

    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get name suffixes
            for inst in group:
                name_suffix = generate_name_suffix(inst.name)
                ghcid = f"{base_ghcid}-{name_suffix}"
                records.append(create_record(inst, ghcid))

    return records
```

### Edge Cases

**Case 1: Multiple historical institutions added simultaneously**

If multiple historical institutions are added together (same date) and collide with an existing GHCID:

```yaml
# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY

# Both added 2025-11-15
# Historical Institution A: "Amsterdam Art Archive"
ghcid: NL-NH-AMS-M-XY-amsterdam_art_archive

# Historical Institution B: "Amsterdam Archaeology Museum"
ghcid: NL-NH-AMS-M-XY-amsterdam_archaeology_museum
```

**Resolution**: ALL newly added institutions get name suffixes (treat as mini-batch).
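
A minimal sketch of this mini-batch rule: a newly added institution gets a suffix if its base GHCID is already published or is shared with another institution in the same batch. The function name `resolve_mini_batch` and the input shape (pairs of base GHCID and an already-normalized name suffix) are assumptions for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple


def resolve_mini_batch(
    additions: List[Tuple[str, str]], published: Set[str]
) -> List[str]:
    """Return final GHCIDs for (base_ghcid, name_suffix) pairs, in input order.

    Published GHCIDs are never modified; only the additions can gain a suffix.
    """
    per_base = Counter(base for base, _ in additions)
    final = []
    for base, suffix in additions:
        collides = base in published or per_base[base] > 1
        final.append(f"{base}-{suffix}" if collides else base)
    return final


published = {"NL-NH-AMS-M-XY"}
additions = [
    ("NL-NH-AMS-M-XY", "amsterdam_art_archive"),
    ("NL-NH-AMS-M-XY", "amsterdam_archaeology_museum"),
]
print(resolve_mini_batch(additions, published))
```

The sketch does not handle the rarer situation where the suffixed GHCID itself already exists; that would need an additional check against `published`.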

**Case 2: Existing GHCID already has name suffix**

If the existing GHCID already has a name suffix (from a first-batch collision), the new historical addition gets its own, different name suffix:

```yaml
# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Historical addition (2025-11-15)
ghcid: NL-NH-AMS-M-SM-stadsmuseum_amsterdam  # Different name suffix
```

**No ambiguity**: Each institution has unique name suffix derived from its native language name.

**Case 3: Non-Latin script names**

For institutions with non-Latin script names, transliterate to ASCII:

```yaml
# Chinese institution: 北京故宫博物院 (Palace Museum Beijing)
ghcid: CN-BJ-BEI-M-PM-beijing_gugong_bowuyuan

# Japanese institution: 東京国立博物館 (Tokyo National Museum)
ghcid: JP-TK-TOK-M-TN-tokyo_kokuritsu_hakubutsukan

# Arabic institution: المتحف المصري (Egyptian Museum)
ghcid: EG-CA-CAI-M-EM-al_mathaf_al_masri
```
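
Diacritic stripping alone cannot produce the suffixes above, because non-Latin letters have no ASCII base form. One way to guard against silently empty suffixes is to detect such names and route them to a transliteration step first. The sketch below uses only the standard library; `needs_transliteration` is a hypothetical helper, and the transliteration itself (Pinyin, Hepburn, and so on) is left to a separate tool.

```python
import unicodedata


def needs_transliteration(name: str) -> bool:
    """True if letters remain non-ASCII after combining marks are stripped."""
    decomposed = unicodedata.normalize("NFD", name)
    return any(
        ch.isalpha() and not ch.isascii()
        for ch in decomposed
        if unicodedata.category(ch) != "Mn"  # ignore combining marks
    )


for name in ["Musée d'Orsay", "北京故宫博物院", "المتحف المصري"]:
    print(name, needs_transliteration(name))
```

Names that return `True` should be transliterated before the suffix is generated; names that return `False` can be normalized directly.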

### Testing Strategy

**Test 1: First Batch Collision**

```python
def test_first_batch_collision():
    """Verify ALL institutions in first batch get name suffixes"""
    institutions = [
        Institution("Stedelijk Museum Amsterdam", type="M", city="AMS"),
        Institution("Science Museum Amsterdam", type="M", city="AMS")
    ]

    records = process_first_batch(institutions)

    # Both should have name suffixes
    assert records[0].ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"
```

**Test 2: Historical Addition Collision**

```python
def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}

    # Add historical institution
    historical = Institution(
        name="Amsterdam Historical Museum",
        type="M",
        city="AMS",
        temporal_extent={"start": "1926", "end": "1975"}
    )

    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        historical.name,
        existing_ghcids
    )

    # New historical gets name suffix
    assert new_ghcid == "NL-NH-AMS-M-HM-amsterdam_historical_museum"

    # Existing GHCID NOT in database update
    # (verify existing record unchanged)
```

**Test 3: Name Suffix Generation**

```python
def test_name_suffix_generation():
    """Verify name suffix normalization"""
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    assert generate_name_suffix("Biblioteca Nacional do Brasil") == "biblioteca_nacional_do_brasil"
    assert generate_name_suffix("Royal Museum, London") == "royal_museum_london"
```

### Documentation References

- **Collision Resolution**: `docs/plan/global_glam/07-ghcid-collision-resolution.md`
- **GHCID Specification**: `docs/GHCID_PID_SCHEME.md`
- **Implementation**: `src/glam_extractor/identifiers/ghcid.py`
- **Schema**: `schemas/provenance.yaml` (GHCIDHistoryEntry)
- **Abbreviation Special Characters**: `.opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md` (characters to exclude from abbreviations)

---

## SHA-1 vs SHA-256: The Nuance

### Why UUID v5 Uses SHA-1

**RFC 4122 (2005)** standardized UUID v5 with SHA-1 because:
- SHA-1 was still the standard, widely deployed hash in 2005 (only theoretical attacks had been published)
- 128-bit UUID space provides collision resistance even with SHA-1
- Purpose is **identifier generation**, not **security/authentication**

### SHA-1 Cryptographic Weakness

**SHA-1 collision attacks (2017):**
- Google/CWI demonstrated practical SHA-1 collision
- Two different inputs producing same hash
- **Critical for digital signatures** (authentication, certificates)
- **Less critical for identifiers** (birthday paradox protection sufficient)

### When SHA-1 Is Problematic

- ❌ **Digital signatures** - Attacker can forge documents
- ❌ **Certificate authorities** - SSL/TLS security compromised
- ❌ **Password hashing** - Unsuitable because SHA-1 is fast to brute-force (a separate problem from collisions); use a dedicated password-hashing function
- ❌ **Blockchain** - Consensus security at risk

### When SHA-1 Is Acceptable

- ✅ **UUID generation** - Collision resistance adequate for identifier space
- ✅ **Git commits** - Git kept SHA-1 object IDs after the 2017 collision, adding collision detection and beginning a SHA-256 transition
- ✅ **Non-adversarial contexts** - No attacker trying to cause collisions

---

## Recommended Usage Strategy

### Default: Dual UUID Approach

Store **both UUID formats** for maximum flexibility:

```yaml
# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739  # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739
```

### Use Case Decision Tree

```
Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│         - Europeana, DPLA, IIIF, Wikidata
│         - RFC 4122 compliance required
│
└─ NO → Building custom system?
    ├─ Security policy mandates SHA-256?
    │  ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
    │  └─ NO → Use UUID v5 for standard compliance
    │
    └─ Database optimization critical?
        ├─ YES → Use Numeric (to_numeric()) as PK
        │         - Store UUID v5 as alternate key
        └─ NO → Use UUID v5 as primary identifier
```
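
The tree can also be read as a small function. The sketch below is one possible flattening, assuming the branches are evaluated top to bottom (integration first, then security policy, then database optimization); the function and parameter names are hypothetical.

```python
def choose_identifier(
    integrates_with_existing_systems: bool,
    sha256_mandated: bool,
    database_optimization_critical: bool,
) -> str:
    """Return the identifier format suggested by the decision tree above."""
    if integrates_with_existing_systems:
        return "uuid_v5"      # Europeana, DPLA, IIIF, Wikidata
    if sha256_mandated:
        return "uuid_sha256"  # security policy requires SHA-256
    if database_optimization_critical:
        return "numeric"      # keep UUID v5 as an alternate key
    return "uuid_v5"          # default for standard compliance


print(choose_identifier(False, True, False))
```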

---

## Code Examples

### Generate All Four Formats

```python
from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA"
)

# Generate all formats
uuid_v5 = components.to_uuid()           # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()  # UUID SHA-256
numeric = components.to_numeric()        # Numeric (64-bit)
human = components.to_string()           # Human-readable

print(f"UUID v5:      {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric:      {numeric}")
print(f"Human:        {human}")

# Output:
# UUID v5:      550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric:      213324328442227739
# Human:        US-CA-SAN-A-IA
```

### Verify Determinism

```python
# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()
```

### Export to Different Formats

```python
# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"

# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"

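# Database (use numeric PK); bind values as parameters instead of formatting them into the SQL
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"  # placeholder style depends on the driver
params = (components.to_numeric(), "Internet Archive")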

# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."
```

---

## Future-Proofing Strategy

### Timeline Projections

| Year | SHA-1 Status | UUID v5 Status | Recommendation |
|------|--------------|----------------|----------------|
| **2024** | Weak for security, OK for IDs | Standard, widely supported | ✅ Use UUID v5 as primary |
| **2030** | Likely deprecated for security | Still standard for IDs | ✅ Dual UUID (v5 + SHA-256) |
| **2040** | Possibly deprecated entirely | May be superseded | ⚠️ Migrate to UUID SHA-256 |

### Migration Path

If SHA-1 is fully deprecated:

1. **Phase 1 (Now):** Store both UUID v5 and UUID SHA-256
2. **Phase 2 (2030):** Make UUID SHA-256 primary, keep v5 as alias
3. **Phase 3 (2040):** Deprecate UUID v5, use SHA-256 exclusively

**Critical:** Because both are **deterministic**, you can always regenerate from GHCID string without breaking references.

---

## Governance & Resolution

### Identifier Persistence Requirements

Technical generation is only half the solution. True persistence requires:

#### 1. **Resolution Service**
```
https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}

All three should resolve to the same institutional record.
```
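
A resolver behind these routes has to recognize which kind of identifier it received. The following sketch classifies an incoming string as UUID, numeric, or GHCID. The regular expression is an assumption inferred from the examples in this document, not a normative grammar, and `classify_identifier` is a hypothetical name.

```python
import re
import uuid

# ASSUMPTION: pattern inferred from examples such as US-CA-SAN-A-IA and
# NL-NH-AMS-M-HM-amsterdam_historical_museum; not a normative grammar.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]+-[A-Z0-9]+-[A-Z]-[A-Z0-9]+(-[a-z0-9_]+)?$"
)


def classify_identifier(value: str) -> str:
    """Return 'numeric', 'uuid', 'ghcid', or 'unknown' for an identifier string."""
    if value.isascii() and value.isdigit() and int(value) < 2**64:
        return "numeric"
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    if GHCID_PATTERN.match(value):
        return "ghcid"
    return "unknown"


print(classify_identifier("US-CA-SAN-A-IA"))
```

Once classified, each kind maps to one lookup column in the registry, so all routes end at the same record.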

#### 2. **Mapping Database**
```sql
CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,  -- BIGINT is signed in most databases; map values above 2^63 - 1 into range
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL,  -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);
```
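
To show how the uniqueness constraints behave, here is a minimal sketch that mirrors the table in an in-memory SQLite database. It is an adaptation, not the schema above: SQLite has no `UUID` type, so UUIDs are stored as `TEXT`, the numeric column is renamed `numeric_id`, and `last_updated` is omitted. The `register` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ghcid_registry (
        uuid_v5 TEXT PRIMARY KEY,
        uuid_sha256 TEXT NOT NULL UNIQUE,
        numeric_id INTEGER NOT NULL UNIQUE,
        ghcid TEXT NOT NULL,
        ghcid_original TEXT NOT NULL UNIQUE,
        institution_name TEXT NOT NULL
    )
    """
)

INSERT_SQL = "INSERT INTO ghcid_registry VALUES (?, ?, ?, ?, ?, ?)"


def register(row: tuple) -> bool:
    """Insert one registry row; return False if a uniqueness rule is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_SQL, row)
        return True
    except sqlite3.IntegrityError:
        return False


row = (
    "550e8400-e29b-41d4-a716-446655440000",
    "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
    213324328442227739,
    "US-CA-SAN-A-IA",
    "US-CA-SAN-A-IA",
    "Internet Archive",
)
print(register(row))  # first insert succeeds
print(register(row))  # duplicate is rejected by the constraints
```

Values are bound as parameters rather than formatted into the SQL string, and a second registration of the same identifiers is rejected instead of silently overwriting the record.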

#### 3. **Organizational Commitment**
- Maintain resolution service for decades
- Fund infrastructure for long-term operation
- Establish governance policies for ID assignment
- Handle institution mergers/closures/relocations

#### 4. **Community Standards**
- Coordinate with ISIL, Wikidata, GeoNames
- Publish GHCID specification as RFC or W3C note
- Engage with Europeana, DPLA, IIIF communities
- Establish dispute resolution process

---

## Comparison with Existing PID Systems

| System | Format | Governance | Resolution | Adoption |
|--------|--------|------------|------------|----------|
| **DOI** | 10.xxxx/yyyy | IDF (non-profit) | doi.org | High (scholarly) |
| **ARK** | ark:/nnnnn/xxx | CDL (California) | n2t.net | Medium (archives) |
| **Handle** | hdl:xxxx/yyyy | CNRI (non-profit) | handle.net | Medium (repositories) |
| **GHCID** | UUID v5 | **TBD** | **TBD** | None (new) |

**Lesson:** Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.

---

## Recommendations

### For This Project (2024-2025)

1. ✅ **Implement dual UUID generation** (v5 + SHA-256)
2. ✅ **Store all four identifier formats** in data model
3. ✅ **Use UUID v5 as primary ID** for current interoperability
4. ✅ **Document SHA-1 nuance** clearly
5. ⏳ **Build resolution service prototype**
6. ⏳ **Engage with Europeana/DPLA** for feedback
7. ⏳ **Draft GHCID specification** for community review

### For Production Deployment

1. ⏳ **Establish governance body** (non-profit foundation?)
2. ⏳ **Secure long-term funding** for resolution service
3. ⏳ **Coordinate with existing PID systems** (ISIL, VIAF, Wikidata)
4. ⏳ **Publish specification** (W3C note or IETF RFC)
5. ⏳ **Deploy resolution infrastructure** (multi-region, high availability)
6. ⏳ **Engage heritage community** for adoption

---

## References

- **RFC 4122:** UUID Standard (https://tools.ietf.org/html/rfc4122)
- **SHA-1 Collision:** Google/CWI (2017) - https://shattered.io
- **RFC 9562:** Universally Unique IDentifiers (UUIDs), defines UUID v7 and v8 and obsoletes RFC 4122 (https://www.rfc-editor.org/rfc/rfc9562)
- **NIST SHA-256:** FIPS 180-4 - https://csrc.nist.gov/publications/fips
- **Identifiers.org:** Life sciences identifiers - https://identifiers.org
- **N2T:** Name-to-Thing resolver - https://n2t.net

---

**Version:** 1.0
**Date:** 2024-11-06
**Status:** Draft for Community Review