kempersc/glam


kempersc ee4e57bc75 add new entries

2025-12-07 00:26:01 +01:00


Persistent Identifiers for Heritage Institutions

Overview

The GLAM Data Extraction project uses multiple identifier formats optimized for different purposes:

Persistent Identifiers (Deterministic)

These can be regenerated from the GHCID string and are stable across systems:

Format	Bits	Algorithm	Use Case	Status
UUID v5	128	SHA-1	PRIMARY - Europeana, DPLA, IIIF, Wikidata	RFC 4122 Standard
UUID SHA-256	128	SHA-256	SOTA - Security compliance, future-proofing	RFC 9562 (UUID v8)
Numeric	64	SHA-256	CSV exports, numeric analysis	Internal
Human-readable	Variable	ISO format	Citations, documentation	ISO-based

Database Record Identifiers (Non-Deterministic)

These are generated once per record and optimize database performance:

Format	Bits	Algorithm	Use Case	Status
UUID v7	128	Timestamp + Random	Database PKs, time-ordered queries	RFC 9562 Standard
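
For reference, the UUID v7 layout in RFC 9562 (a 48-bit Unix millisecond timestamp followed by version, variant, and random bits) can be sketched in a few lines. This is only an illustration of the bit layout, not the project's implementation; the function name `new_uuid7` is hypothetical, and production code would more likely rely on a library or database function.

```python
import os
import time
import uuid


def new_uuid7() -> uuid.UUID:
    """Build a UUID v7: 48-bit Unix timestamp in ms, then random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = ((ts_ms & ((1 << 48) - 1)) << 80) | rand  # timestamp in the top 48 bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 0b10
    return uuid.UUID(int=value)


first = new_uuid7()
time.sleep(0.002)          # make sure the second ID falls in a later millisecond
second = new_uuid7()
print(first, second, first < second)
```

Because the timestamp occupies the most significant bits, IDs created in later milliseconds sort after earlier ones, which is what makes this format attractive for primary keys.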

Why Four Formats?

1. UUID v5 (SHA-1) - Interoperability Standard ⭐ PRIMARY

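Format: 550e8400-e29b-41d4-a716-446655440000 (placeholder for layout only; a real UUID v5 has 5 as the first digit of the third group)
Version: 5 (name-based, SHA-1)
Standard: RFC 4122 (2005), carried forward in RFC 9562 (2024)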

✅ Strengths:

RFC 4122 compliant - Universal library support
Deterministic - Same GHCID → Same UUID always (content-addressed)
Transparent - Publicly documented algorithm, anyone can verify
Interoperable - Works with Europeana, DPLA, IIIF, Wikidata
128-bit identifier space - P(accidental collision) ≈ 1.5×10^-27 for 1M institutions

⚠️ SHA-1 Nuance:

Uses SHA-1 internally (RFC 4122 specification)
SHA-1 deprecated for cryptographic security (digital signatures, TLS, passwords)
SHA-1 appropriate for identifier generation (non-adversarial; accidental collisions remain negligible)
See Why GHCID Uses UUID v5 and SHA-1 for detailed rationale

Why SHA-1 is Safe for GHCID:

Cryptographic Use (Vulnerable):
  - Adversarial context (attacker forges signatures)
  - Two-message collision attack
  - Security-critical (financial, authentication)

Identifier Use (Safe):
  - Non-adversarial context (no one forges museum IDs)
  - Single-source generation (we control inputs)
  - Uniqueness requirement (birthday paradox protection sufficient)

Use When:

Primary identifier for all GHCID records
Integrating with existing UUID v5 systems
Exporting to Europeana, DPLA, IIIF
Storing in Wikidata as external identifier
RFC 4122 strict compliance required
Maximum transparency required (anyone can verify)
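
A minimal sketch of the deterministic mapping, using Python's standard `uuid.uuid5`. The namespace below is a hypothetical stand-in derived from `ghcid.example.org`; the project's actual namespace UUID is not stated in this document, and `ghcid_to_uuid_v5` is an illustrative name rather than the library's API.

```python
import uuid

# Hypothetical namespace: the real GHCID namespace UUID is not given here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ghcid.example.org")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a name-based (SHA-1) UUID from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


a = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
b = ghcid_to_uuid_v5("NL-NH-AMS-M-RM")
print(a == b, a.version)   # same input, same UUID; version field is 5
```

Anyone who knows the namespace and the GHCID string can recompute the UUID, which is what the transparency claim above relies on.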

2. UUID SHA-256 (Custom) - SOTA Cryptographic Strength

Format: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
Version: 8 (custom/experimental)
Algorithm: SHA-256 (truncated to 128 bits)

✅ Strengths:

SHA-256 - NIST-approved, SOTA cryptographic hash (2024)
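Stronger resistance to deliberately constructed collisions than SHA-1 (accidental-collision odds match UUID v5, since both yield 128-bit values)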
Future-proof - No known practical attacks
UUID-compatible - Valid UUID format, works with UUID parsers

⚠️ Nuances:

Not RFC 4122 standard - Custom implementation
UUID v8 is "experimental/vendor-specific" designation
May not be recognized by strict UUID v5-only systems

Use When:

Security policy mandates SHA-256
Maximum collision resistance required
Future-proofing against SHA-1 deprecation
Custom identifier resolution service

Algorithm:

Hash GHCID string with SHA-256 → 256 bits
Truncate to first 128 bits (16 bytes)
Set version bits to 8 (custom)
Set variant bits to RFC 4122 (0b10xxxxxx)
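
The four steps above can be sketched as follows. This assumes the GHCID string is hashed as UTF-8 with no namespace prefix, which the document does not specify; `ghcid_to_uuid_sha256` is a hypothetical name, not necessarily what `to_uuid_sha256()` does internally.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """SHA-256 the GHCID, keep 16 bytes, stamp version 8 and the RFC variant."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()   # 32 bytes
    raw = bytearray(digest[:16])                              # first 128 bits
    raw[6] = (raw[6] & 0x0F) | 0x80                           # version nibble = 8
    raw[8] = (raw[8] & 0x3F) | 0x80                           # variant bits = 0b10
    return uuid.UUID(bytes=bytes(raw))


u = ghcid_to_uuid_sha256("NL-NH-AMS-M-RM")
print(u, u.version)
```

Six of the 128 bits are overwritten by the version and variant fields, so 122 bits of the hash survive in the final identifier.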

3. Numeric (64-bit) - Database Optimization

Format: 213324328442227739
Algorithm: SHA-256 → first 8 bytes → uint64
Range: 0 to 18,446,744,073,709,551,615

✅ Strengths:

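Compact - Fits in 8 bytes (SQL BIGINT); because BIGINT is signed, values of 2^63 or more need an unsigned column or a signed reinterpretation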
Fast indexing - Integer comparisons faster than UUID
CSV-friendly - No special characters
Deterministic - Same GHCID → Same number

⚠️ Nuances:

64-bit truncation reduces collision resistance vs full 256-bit
P(collision) ≈ 2.7×10^-8 for 1M institutions (0.0000027%)
Still negligible for heritage domain (<10M institutions expected)

Use When:

Database primary key optimization
CSV exports for spreadsheet analysis
Numeric sorting required
Systems without UUID support
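
A sketch of the numeric derivation, plus a reversible mapping into the signed 64-bit range for databases whose BIGINT is signed. The big-endian byte order and all three function names are assumptions for illustration; the document only states "first 8 bytes → uint64".

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256, read as an unsigned 64-bit integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")   # byte order assumed


def to_signed_bigint(value: int) -> int:
    """Reinterpret a uint64 as a signed 64-bit value (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def from_signed_bigint(value: int) -> int:
    """Inverse of to_signed_bigint."""
    return value + (1 << 64) if value < 0 else value


n = ghcid_to_numeric("NL-NH-AMS-M-RM")
print(n, to_signed_bigint(n))
```

Note that the signed reinterpretation round-trips exactly but does not preserve unsigned sort order, so it suits key lookups better than range queries.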

4. Human-Readable (ISO-based) - Citations & References

Format: US-CA-SAN-A-IA
Components: {Country}-{Region}-{City}-{Type}-{Abbreviation}
Example: NL-NH-AMS-M-RM (Rijksmuseum Amsterdam)

✅ Strengths:

Human-readable - Understandable without lookup
Geographic context - Location embedded in ID
Type indicator - Institution type visible
Citable - Use in academic papers, documentation

⚠️ Nuances:

Not persistent if institution relocates or changes name
Use ghcid_original field (frozen) for true persistence
ghcid field (current) may change over time
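
The split between the frozen and the current identifier might look like this in a record type. The class, its `ghcid_history` list, and the `update_ghcid` method are hypothetical illustrations; only the `ghcid_original` and `ghcid` field names come from this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstitutionRecord:
    name: str
    ghcid_original: str                 # frozen at first publication
    ghcid: str = ""                     # current value, may change
    ghcid_history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.ghcid:
            self.ghcid = self.ghcid_original

    def update_ghcid(self, new_ghcid: str) -> None:
        """Record a relocation or rename without touching ghcid_original."""
        if new_ghcid != self.ghcid:
            self.ghcid_history.append(self.ghcid)
            self.ghcid = new_ghcid


rec = InstitutionRecord("Example Museum", "NL-NH-AMS-M-EM")
rec.update_ghcid("NL-ZH-RTM-M-EM")     # hypothetical move to another city
print(rec.ghcid_original, rec.ghcid, rec.ghcid_history)
```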

Use When:

Academic citations
Documentation and reports
Human-readable data exchange
Debugging and logging
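
For debugging and logging, it can help to split a GHCID back into its components. The parser below is a sketch under the assumption that none of the five components contains a hyphen; `ParsedGHCID` and `parse_ghcid` are hypothetical names, and the optional sixth part is the name suffix described under collision resolution.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedGHCID:
    country: str
    region: str
    city: str
    institution_type: str
    abbreviation: str
    name_suffix: Optional[str] = None


def parse_ghcid(ghcid: str) -> ParsedGHCID:
    """Split {Country}-{Region}-{City}-{Type}-{Abbreviation}[-{suffix}]."""
    parts = ghcid.split("-", 5)          # at most 6 parts
    if len(parts) < 5 or not all(parts):
        raise ValueError(f"not a GHCID: {ghcid!r}")
    return ParsedGHCID(*parts)


print(parse_ghcid("NL-NH-AMS-M-RM"))
```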

Collision Resistance Comparison

Mathematical Analysis

# Collision probability (birthday bound):
# P(collision) ≈ n² / (2 × 2^bits)

n = 10**6  # 1,000,000 institutions

# UUID v5 / UUID SHA-256 (128-bit):
p_128 = n**2 / (2 * 2**128)          # ≈ 1.5e-27
# Effectively zero. Counting only the 122 hash-derived bits
# (6 bits are fixed by version and variant) gives ≈ 9.4e-26.

# Numeric (64-bit):
p_64 = n**2 / (2 * 2**64)            # ≈ 2.7e-8 (0.0000027%)
# Negligible for heritage domain

# Even at 10 million institutions:
p_64_10m = (10**7)**2 / (2 * 2**64)  # ≈ 2.7e-6 (0.00027%)
# Still acceptable

Real-World Context

Institution Count	UUID v5/SHA-256	Numeric (64-bit)	Assessment
100,000	~0%	2.7×10^-10 (0.000000027%)	✅ All safe
1,000,000	~0%	2.7×10^-8 (0.0000027%)	✅ All safe
10,000,000	~0%	2.7×10^-6 (0.00027%)	✅ UUID safe, numeric acceptable
100,000,000	~0%	2.7×10^-4 (0.027%)	⚠️ Use UUID, numeric risky

Conclusion: For the heritage domain (expected <10M institutions worldwide), all formats provide sufficient collision resistance.
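
To check figures like those in the table, the birthday bound can be evaluated directly. This sketch uses the slightly more exact form 1 − exp(−n(n−1)/2N) and treats UUIDs as having 122 hash-derived bits; the function name is illustrative.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Probability of at least one collision among n uniformly random IDs."""
    pairs = n * (n - 1) / 2
    return -math.expm1(-pairs / 2**bits)   # 1 - exp(-x), accurate for tiny x


for count in (10**5, 10**6, 10**7, 10**8):
    print(f"{count:>11,}  "
          f"64-bit: {collision_probability(count, 64):.1e}  "
          f"122-bit: {collision_probability(count, 122):.1e}")
```

The bound assumes hash outputs behave like uniformly random values, which is the usual working assumption for SHA-1 and SHA-256 on non-adversarial input.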

Historical Collision Resolution

The Rule: Temporal Priority Determines Disambiguation

When creating GHCIDs, collisions can occur in two temporal contexts:

First Batch Creation (initial PID assignment): Multiple institutions discovered simultaneously
Historical Addition (post-publication): New historical institution added after existing GHCID published

Critical Design Decision: The collision resolution strategy differs based on temporal context to preserve PID stability.

Collision Resolution: Native Language Name Suffix

Key Change: Collisions are resolved by appending the full legal name in native language in snake_case format, NOT Wikidata Q-numbers.

Name Suffix Rules:

Use the institution's full official name in its native language
Convert to snake_case (lowercase, underscores for spaces)
Remove apostrophes, accents, commas, and other punctuation/diacritics
Transliterate non-Latin scripts to ASCII (e.g., Pinyin for Chinese)

Name Normalization Examples:

"Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
"Musée d'Orsay" → "musee_dorsay"
"Biblioteca Nacional do Brasil" → "biblioteca_nacional_do_brasil"
"北京故宫博物院" → "beijing_gugong_bowuyuan" (pinyin transliteration)
"Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"

First Batch Behavior (Initial PID Creation)

Scenario: During initial GHCID generation, multiple institutions with identical base GHCIDs are discovered together.

Resolution: ALL colliding institutions get name suffixes appended.

Example:

# Discovery: Two museums in Amsterdam both generate NL-NH-AMS-M-SM

# Stedelijk Museum (founded 1874)
ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Science Museum Amsterdam (founded 2010)
ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam

Rationale: No existing PIDs to preserve; both institutions are "new" to the system.

Historical Addition Behavior (Post-Publication)

Scenario: After initial GHCID batch is published, a historical institution is added that collides with an existing GHCID.

Resolution: ONLY the newly added historical institution gets a name suffix. The existing PID remains unchanged.

Example:

# Existing GHCID (published 2025-11-01)
ghcid_original: NL-NH-AMS-M-HM  # Hermitage Museum Amsterdam (2009-2023)

# Historical institution added later (2025-11-15)
# Amsterdam Historical Museum (1926-1975)
# Would also generate: NL-NH-AMS-M-HM
# 
# COLLISION DETECTED → Add name suffix to NEW addition ONLY
ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum

Outcome:

NL-NH-AMS-M-HM (Hermitage Museum Amsterdam) → UNCHANGED
NL-NH-AMS-M-HM-amsterdam_historical_museum (Amsterdam Historical Museum) → Name suffix added

Rationale: Preserve stability of already-published PIDs.

Why This Matters: PID Stability Principle

Problem: Changing existing GHCIDs breaks external references.

PIDs may already be:

Cited in academic publications
Referenced in datasets and APIs
Stored in institutional databases
Embedded in IIIF manifests
Linked from Wikidata

Principle: "Cool URIs don't change" (Tim Berners-Lee, W3C)

Once a GHCID is published (in first batch or as standalone record), it should NEVER change, even if new historical institutions create collisions.

Decision Table: Who Gets Name Suffix?

Scenario	When	Existing GHCID	New GHCID	Who Gets Name Suffix	Rationale
First Batch	Initial PID creation (2025-11-01)	None (first time)	`NL-NH-AMS-M-SM` (2 institutions)	ALL colliding institutions	No existing PIDs to preserve
Historical Addition	Post-publication (2025-11-15)	`NL-NH-AMS-M-HM` (published)	`NL-NH-AMS-M-HM` (historical)	ONLY newly added institution	Preserve published PID stability
Standalone Addition	New institution (2026-01-01)	`NL-NH-AMS-M-XY` (published)	`NL-NH-AMS-M-XY` (new contemporary)	ONLY newly added institution	Preserve existing PID

Implementation Guidance

Name Suffix Generation:

import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.
    
    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    
    # Convert to lowercase
    lowercase = ascii_name.lower()
    
    # Remove apostrophes (straight and curly), commas, and other punctuation
    no_punct = re.sub(r"['’‘`\",.:;!?()\[\]{}]", '', lowercase)
    
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    
    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    
    return final

Collision Detection Logic:

from typing import Set

def resolve_collision(new_ghcid: str, new_name: str, existing_ghcids: Set[str]) -> str:
    """
    Resolve GHCID collision based on temporal context.
    
    Args:
        new_ghcid: Base GHCID for new institution
        new_name: Native language name of the institution
        existing_ghcids: Set of already-published GHCIDs
    
    Returns:
        Final GHCID (with name suffix if needed)
    """
    if new_ghcid in existing_ghcids:
        # COLLISION DETECTED: New institution collides with existing
        # Resolution: Add name suffix to NEW institution ONLY
        name_suffix = generate_name_suffix(new_name)
        return f"{new_ghcid}-{name_suffix}"
    else:
        # No collision: Use base GHCID
        return new_ghcid

First Batch Processing (different logic):

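from collections import defaultdict
from typing import List

# Institution, GHCIDRecord, generate_base_ghcid and create_record are assumed
# to be defined elsewhere in the project.
def process_first_batch(institutions: List[Institution]) -> List[GHCIDRecord]: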
    """
    Process initial batch of institutions.
    
    For first batch, ALL collisions get name suffixes appended.
    """
    # Group by base GHCID
    ghcid_groups = defaultdict(list)
    for inst in institutions:
        base_ghcid = generate_base_ghcid(inst)
        ghcid_groups[base_ghcid].append(inst)
    
    records = []
    for base_ghcid, group in ghcid_groups.items():
        if len(group) == 1:
            # No collision: Use base GHCID
            records.append(create_record(group[0], base_ghcid))
        else:
            # COLLISION: ALL institutions get name suffixes
            for inst in group:
                name_suffix = generate_name_suffix(inst.name)
                ghcid = f"{base_ghcid}-{name_suffix}"
                records.append(create_record(inst, ghcid))
    
    return records

Edge Cases

Case 1: Multiple historical institutions added simultaneously

If multiple historical institutions are added together (same date) and collide with existing GHCID:

# Existing (published 2025-11-01)
ghcid: NL-NH-AMS-M-XY

# Both added 2025-11-15
# Historical Institution A: "Amsterdam Art Archive"
ghcid: NL-NH-AMS-M-XY-amsterdam_art_archive

# Historical Institution B: "Amsterdam Archaeology Museum"
ghcid: NL-NH-AMS-M-XY-amsterdam_archaeology_museum

Resolution: ALL newly added institutions get name suffixes (treat as mini-batch).
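
One way to express the mini-batch rule: a newly added institution gets a suffix if its base GHCID is already published or if it is shared with another institution in the same addition. The function name and the `(base_ghcid, name_suffix)` tuple shape are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def assign_additions(
    additions: List[Tuple[str, str]],     # (base_ghcid, name_suffix) pairs
    published_bases: Set[str],
) -> Dict[str, str]:
    """Return {name_suffix: final_ghcid} for institutions added together."""
    counts = Counter(base for base, _ in additions)
    result = {}
    for base, suffix in additions:
        collides = base in published_bases or counts[base] > 1
        result[suffix] = f"{base}-{suffix}" if collides else base
    return result


final = assign_additions(
    [("NL-NH-AMS-M-XY", "amsterdam_art_archive"),
     ("NL-NH-AMS-M-XY", "amsterdam_archaeology_museum")],
    published_bases={"NL-NH-AMS-M-XY"},
)
print(final)
```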

Case 2: Existing GHCID already has name suffix

If existing GHCID already has name suffix (from first batch collision), new historical addition gets different name suffix:

# Existing (from first batch with collision)
ghcid: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

# Historical addition (2025-11-15)
ghcid: NL-NH-AMS-M-SM-stadsmuseum_amsterdam  # Different name suffix

No ambiguity: Each institution has unique name suffix derived from its native language name.
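
This case only works if collision detection compares base GHCIDs: when every published identifier for a base already carries a suffix, the bare base string is not in the published set, and a plain membership test would miss the collision. A hedged sketch of a check that handles this, assuming neither the five components nor the suffix contain hyphens (function names are hypothetical):

```python
from typing import Set


def base_of(ghcid: str) -> str:
    """Return the five-part base GHCID, dropping any name suffix."""
    return "-".join(ghcid.split("-")[:5])


def resolve_against_published(base_ghcid: str, name_suffix: str, published: Set[str]) -> str:
    """Suffix the new GHCID when its base is already in use, suffixed or not."""
    published_bases = {base_of(g) for g in published}
    if base_ghcid not in published_bases:
        return base_ghcid                          # no collision
    candidate = f"{base_ghcid}-{name_suffix}"
    if candidate in published:
        raise ValueError(f"{candidate} is already published; needs manual disambiguation")
    return candidate


published = {"NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"}
print(resolve_against_published("NL-NH-AMS-M-SM", "stadsmuseum_amsterdam", published))
```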

Case 3: Non-Latin script names

For institutions with non-Latin script names, transliterate to ASCII:

# Chinese institution: 北京故宫博物院 (Palace Museum Beijing)
ghcid: CN-BJ-BEI-M-PM-beijing_gugong_bowuyuan

# Japanese institution: 東京国立博物館 (Tokyo National Museum)  
ghcid: JP-TK-TOK-M-TN-tokyo_kokuritsu_hakubutsukan

# Arabic institution: المتحف المصري (Egyptian Museum)
ghcid: EG-CA-CAI-M-EM-al_mathaf_al_masri
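
Diacritic stripping alone does not transliterate: characters outside the Latin script would simply be dropped by an ASCII-only filter, leaving an empty or truncated suffix. A small guard, sketched here with hypothetical function names, can require an explicitly supplied romanized form (for example Pinyin) before normalization.

```python
import unicodedata
from typing import Optional


def has_non_latin_letters(text: str) -> bool:
    """True if any alphabetic character is outside the Latin script."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
    )


def suffix_source(native_name: str, romanized: Optional[str] = None) -> str:
    """Pick the string that should be normalized into a name suffix."""
    if not has_non_latin_letters(native_name):
        return native_name
    if romanized is None or has_non_latin_letters(romanized):
        raise ValueError("non-Latin name needs a romanized form (e.g. Pinyin)")
    return romanized


print(suffix_source("Musée d'Orsay"))
print(suffix_source("北京故宫博物院", "Beijing Gugong Bowuyuan"))
```

The chosen string would then go through the same snake_case normalization as any Latin-script name.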

Testing Strategy

Test 1: First Batch Collision

def test_first_batch_collision():
    """Verify ALL institutions in first batch get name suffixes"""
    institutions = [
        Institution("Stedelijk Museum Amsterdam", type="M", city="AMS"),
        Institution("Science Museum Amsterdam", type="M", city="AMS")
    ]
    
    records = process_first_batch(institutions)
    
    # Both should have name suffixes
    assert records[0].ghcid == "NL-NH-AMS-M-SM-stedelijk_museum_amsterdam"
    assert records[1].ghcid == "NL-NH-AMS-M-SM-science_museum_amsterdam"

Test 2: Historical Addition Collision

def test_historical_addition_preserves_existing():
    """Verify existing GHCID unchanged when historical added"""
    # Existing GHCID (published)
    existing_ghcids = {"NL-NH-AMS-M-HM"}
    
    # Add historical institution
    historical = Institution(
        name="Amsterdam Historical Museum",
        type="M",
        city="AMS",
        temporal_extent={"start": "1926", "end": "1975"}
    )
    
    new_ghcid = resolve_collision(
        generate_base_ghcid(historical),
        historical.name,
        existing_ghcids
    )
    
    # New historical gets name suffix
    assert new_ghcid == "NL-NH-AMS-M-HM-amsterdam_historical_museum"
    
    # The published GHCID must still be present and unchanged
    assert "NL-NH-AMS-M-HM" in existing_ghcids

Test 3: Name Suffix Generation

def test_name_suffix_generation():
    """Verify name suffix normalization"""
    assert generate_name_suffix("Musée d'Orsay") == "musee_dorsay"
    assert generate_name_suffix("Österreichische Nationalbibliothek") == "osterreichische_nationalbibliothek"
    assert generate_name_suffix("Biblioteca Nacional do Brasil") == "biblioteca_nacional_do_brasil"
    assert generate_name_suffix("Royal Museum, London") == "royal_museum_london"

Documentation References

Collision Resolution: docs/plan/global_glam/07-ghcid-collision-resolution.md
GHCID Specification: docs/GHCID_PID_SCHEME.md
Implementation: src/glam_extractor/identifiers/ghcid.py
Schema: schemas/provenance.yaml (GHCIDHistoryEntry)
Abbreviation Special Characters: .opencode/ABBREVIATION_SPECIAL_CHAR_RULE.md (characters to exclude from abbreviations)

SHA-1 vs SHA-256: The Nuance

Why UUID v5 Uses SHA-1

RFC 4122 (2005) standardized UUID v5 with SHA-1 because:

SHA-1 was considered secure in 2005
128-bit UUID space provides collision resistance even with SHA-1
Purpose is identifier generation, not security/authentication

SHA-1 Cryptographic Weakness

SHA-1 collision attacks (2017):

Google/CWI demonstrated practical SHA-1 collision
Two different inputs producing same hash
Critical for digital signatures (authentication, certificates)
Less critical for identifiers (birthday paradox protection sufficient)

When SHA-1 Is Problematic

❌ Digital signatures - Attacker can forge documents
❌ Certificate authorities - SSL/TLS security compromised
❌ Password hashing - Weakens brute-force resistance
❌ Blockchain - Consensus security at risk

When SHA-1 Is Acceptable

✅ UUID generation - Collision resistance adequate for identifier space
✅ Git commits - Linus Torvalds has argued that SHA-1 remains adequate for Git's use case
✅ Non-adversarial contexts - No attacker trying to cause collisions

Recommended Usage Strategy

Default: Dual UUID Approach

Store both UUID formats for maximum flexibility:

# Example YAML record
- id: 550e8400-e29b-41d4-a716-446655440000  # Use UUID v5 as primary ID
  name: Internet Archive
  institution_type: ARCHIVE
  ghcid: US-CA-SAN-A-IA
  ghcid_uuid: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (SHA-1)
  ghcid_uuid_sha256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID SHA-256
  ghcid_numeric: 213324328442227739  # Numeric (64-bit)
  identifiers:
    - identifier_scheme: GHCID
      identifier_value: US-CA-SAN-A-IA
    - identifier_scheme: GHCID_UUID_V5
      identifier_value: 550e8400-e29b-41d4-a716-446655440000
    - identifier_scheme: GHCID_UUID_SHA256
      identifier_value: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
    - identifier_scheme: GHCID_NUMERIC
      identifier_value: 213324328442227739

Use Case Decision Tree

Need to integrate with existing systems?
├─ YES → Use UUID v5 (to_uuid())
│         - Europeana, DPLA, IIIF, Wikidata
│         - RFC 4122 compliance required
│
└─ NO → Building custom system?
    ├─ Security policy mandates SHA-256?
    │  ├─ YES → Use UUID SHA-256 (to_uuid_sha256())
    │  └─ NO → Use UUID v5 for standard compliance
    │
    └─ Database optimization critical?
        ├─ YES → Use Numeric (to_numeric()) as PK
        │         - Store UUID v5 as alternate key
        └─ NO → Use UUID v5 as primary identifier

Code Examples

Generate All Four Formats

from glam_extractor.identifiers.ghcid import GHCIDComponents

# Create GHCID components
components = GHCIDComponents(
    country_code="US",
    region_code="CA",
    city_locode="SAN",
    institution_type="A",
    abbreviation="IA"
)

# Generate all formats
uuid_v5 = components.to_uuid()           # UUID v5 (SHA-1)
uuid_sha256 = components.to_uuid_sha256()  # UUID SHA-256
numeric = components.to_numeric()        # Numeric (64-bit)
human = components.to_string()           # Human-readable

print(f"UUID v5:      {uuid_v5}")
print(f"UUID SHA-256: {uuid_sha256}")
print(f"Numeric:      {numeric}")
print(f"Human:        {human}")

# Output:
# UUID v5:      550e8400-e29b-41d4-a716-446655440000
# UUID SHA-256: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
# Numeric:      213324328442227739
# Human:        US-CA-SAN-A-IA

Verify Determinism

# Same input always produces same output
comp1 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")
comp2 = GHCIDComponents("NL", "NH", "AMS", "M", "RM")

assert comp1.to_uuid() == comp2.to_uuid()
assert comp1.to_uuid_sha256() == comp2.to_uuid_sha256()
assert comp1.to_numeric() == comp2.to_numeric()
assert comp1.to_string() == comp2.to_string()

Export to Different Formats

# RDF/JSON-LD (use UUID v5)
rdf_id = f"urn:uuid:{components.to_uuid()}"
# → "urn:uuid:550e8400-e29b-41d4-a716-446655440000"

# IIIF Manifest (use UUID v5)
iiif_id = f"https://iiif.example.org/manifests/{components.to_uuid()}/manifest.json"

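# Database (use numeric PK); bind values as parameters rather than formatting them into the SQL
sql = "INSERT INTO institutions (id, name) VALUES (%s, %s)"  # placeholder style depends on the DB driver
params = (components.to_numeric(), "Internet Archive")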

# Citation (use human-readable)
citation = f"See Internet Archive ({components.to_string()}) for digital collections."

Future-Proofing Strategy

Timeline Projections

Year	SHA-1 Status	UUID v5 Status	Recommendation
2024	Weak for security, OK for IDs	Standard, widely supported	✅ Use UUID v5 as primary
2030	Likely deprecated for security	Still standard for IDs	✅ Dual UUID (v5 + SHA-256)
2040	Possibly deprecated entirely	May be superseded	⚠️ Migrate to UUID SHA-256

Migration Path

If SHA-1 is fully deprecated:

Phase 1 (Now): Store both UUID v5 and UUID SHA-256
Phase 2 (2030): Make UUID SHA-256 primary, keep v5 as alias
Phase 3 (2040): Deprecate UUID v5, use SHA-256 exclusively

Critical: Because both are deterministic, you can always regenerate from GHCID string without breaking references.

Governance & Resolution

Identifier Persistence Requirements

Technical generation is only half the solution. True persistence requires:

1. Resolution Service

https://id.heritage.example.org/uuid/{uuid}
https://id.heritage.example.org/numeric/{numeric}
https://id.heritage.example.org/ghcid/{ghcid}

All three should resolve to the same institutional record.
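
A resolver needs to recognise which kind of identifier it was handed and map all of them to one record. The sketch below uses an in-memory dictionary; the GHCID regular expression, the function names, and the `internet-archive` record key are hypothetical, and the identifier values are the placeholders used elsewhere in this document.

```python
import re
import uuid
from typing import Dict, Optional

# Hypothetical pattern: five-part GHCID plus optional snake_case name suffix.
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]+(-[a-z0-9_]+)?$"
)


def normalize(value: str) -> Optional[str]:
    """Return a canonical lookup key, or None if the format is unknown."""
    if GHCID_PATTERN.match(value):
        return f"ghcid:{value}"
    if value.isascii() and value.isdigit() and int(value) < 2**64:
        return f"numeric:{int(value)}"
    try:
        return f"uuid:{uuid.UUID(value)}"      # lowercases and adds hyphens
    except ValueError:
        return None


registry: Dict[str, str] = {}
for ident in ("550e8400-e29b-41d4-a716-446655440000",
              "213324328442227739",
              "US-CA-SAN-A-IA"):
    registry[normalize(ident)] = "internet-archive"


def resolve(value: str) -> Optional[str]:
    key = normalize(value)
    return registry.get(key) if key else None


print(resolve("US-CA-SAN-A-IA"), resolve("213324328442227739"))
```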

2. Mapping Database

CREATE TABLE ghcid_registry (
    uuid_v5 UUID PRIMARY KEY,
    uuid_sha256 UUID NOT NULL,
    numeric BIGINT NOT NULL,
    ghcid VARCHAR(100) NOT NULL,
    ghcid_original VARCHAR(100) NOT NULL,  -- Frozen
    institution_name TEXT NOT NULL,
    last_updated TIMESTAMP,
    UNIQUE(uuid_sha256),
    UNIQUE(numeric),
    UNIQUE(ghcid_original)
);

3. Organizational Commitment

Maintain resolution service for decades
Fund infrastructure for long-term operation
Establish governance policies for ID assignment
Handle institution mergers/closures/relocations

4. Community Standards

Coordinate with ISIL, Wikidata, GeoNames
Publish GHCID specification as RFC or W3C note
Engage with Europeana, DPLA, IIIF communities
Establish dispute resolution process

Comparison with Existing PID Systems

System	Format	Governance	Resolution	Adoption
DOI	10.xxxx/yyyy	IDF (non-profit)	doi.org	High (scholarly)
ARK	ark:/nnnnn/xxx	CDL (California)	n2t.net	Medium (archives)
Handle	hdl:xxxx/yyyy	CNRI (non-profit)	handle.net	Medium (repositories)
GHCID	UUID v5	TBD	TBD	None (new)

Lesson: Technical mechanism is necessary but not sufficient. Governance and organizational commitment are critical.

Recommendations

For This Project (2024-2025)

✅ Implement dual UUID generation (v5 + SHA-256)
✅ Store all four identifier formats in data model
✅ Use UUID v5 as primary ID for current interoperability
✅ Document SHA-1 nuance clearly
⏳ Build resolution service prototype
⏳ Engage with Europeana/DPLA for feedback
⏳ Draft GHCID specification for community review

For Production Deployment

⏳ Establish governance body (non-profit foundation?)
⏳ Secure long-term funding for resolution service
⏳ Coordinate with existing PID systems (ISIL, VIAF, Wikidata)
⏳ Publish specification (W3C note or IETF RFC)
⏳ Deploy resolution infrastructure (multi-region, high availability)
⏳ Engage heritage community for adoption

References

RFC 4122: UUID Standard (https://tools.ietf.org/html/rfc4122)
SHA-1 Collision: Google/CWI (2017) - https://shattered.io
UUID v8: New UUID Formats draft, since published as RFC 9562 (https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/)
NIST SHA-256: FIPS 180-4 - https://csrc.nist.gov/publications/fips
Identifiers.org: Life sciences identifiers - https://identifiers.org
N2T: Name-to-Thing resolver - https://n2t.net

Version: 1.0
Date: 2024-11-06
Status: Draft for Community Review
