glam/docs/GHCID_PID_SCHEME.md
2025-12-01 16:06:34 +01:00


Global Heritage Custodian Identifier (GHCID) Persistent Identifier Scheme

Version: 1.0
Date: 2025-11-06
Status: Formal Specification (Draft for Community Review)
Authors: GLAM Data Extraction Project


Table of Contents

  1. Introduction
  2. Persistent Identifier Requirements
  3. GHCID Identifier Formats
  4. Resolution Architecture
  5. Governance Model
  6. Comparison with Existing PID Systems
  7. Implementation Roadmap
  8. Technical Specifications
  9. Use Cases and Applications
  10. Migration and Transition Strategies

Introduction

Purpose

The Global Heritage Custodian Identifier (GHCID) is a persistent identifier (PID) scheme designed to uniquely and permanently identify heritage custodian organizations worldwide—including galleries, libraries, archives, museums (GLAM), research centers, botanical gardens, collecting societies, and other cultural heritage institutions.

GHCID addresses critical gaps in existing identifier systems:

  1. Global Coverage: Extends beyond ISIL's limited geographic coverage to include institutions in all countries
  2. Non-Registry Institutions: Provides identifiers for organizations without ISIL codes (estimated 70-80% of heritage institutions worldwide)
  3. Change Tracking: Models organizational evolution through mergers, splits, relocations, and name changes
  4. Multi-Format Support: Offers human-readable, UUID, and numeric formats for diverse system requirements
  5. Linked Data Integration: Aligns with W3C, ISO, and IETF standards for semantic web compatibility

Scope

GHCID is intended for:

  • Heritage custodian organizations: Museums, archives, libraries, galleries, research centers, botanical gardens, zoos, collecting societies, and heritage platforms
  • Cross-system references: Citations, metadata aggregation, Linked Open Data (LOD) graphs
  • Long-term persistence: Identifiers designed to remain stable for decades or centuries
  • Global interoperability: Compatible with Europeana, DPLA, IIIF, Wikidata, GeoNames, and other aggregators

GHCID is NOT intended for:

  • Individual collection items (use ARK, DOI, Handle for objects)
  • Digital files or surrogates (use IIIF, ARK for digital objects)
  • Person identifiers (use ORCID, ISNI, VIAF for people)
  • Geographic locations (use GeoNames, OSM for places)

Design Principles

  1. Transparency: Publicly documented algorithms, verifiable by anyone
  2. Determinism: Same input always produces same identifier
  3. Persistence: Identifiers remain valid even if organizations change names or relocate
  4. Interoperability: Compatible with existing PID systems (ISIL, VIAF, Wikidata)
  5. Open Standards: Based on IETF RFCs, ISO standards, W3C recommendations
  6. No Vendor Lock-In: Open-source implementation, no proprietary dependencies

Persistent Identifier Requirements

Core PID Properties

A persistent identifier system must satisfy:

  1. Uniqueness: No two entities share the same identifier
  2. Persistence: Identifiers remain valid indefinitely (decades to centuries)
  3. Resolvability: Identifiers can be resolved to authoritative metadata
  4. Transparency: Generation algorithm is publicly documented
  5. Governance: Clear authority and policies for identifier assignment
  6. Actionability: Identifiers can be used in URLs, APIs, citations

GHCID Compliance

| Requirement | GHCID Implementation | Status |
| --- | --- | --- |
| Uniqueness | UUID v5 (128-bit value, 122 hash-derived bits; P(collision) ≈ 9.4 × 10^-26 for 1M institutions), SHA-256 fallback | Implemented |
| Persistence | Deterministic generation from stable metadata, ghcid_original frozen on first assignment | Implemented |
| Resolvability | HTTP resolution service (planned), JSON-LD API | 🔄 In Design |
| Transparency | Open-source code, RFC 4122 standard, public algorithm docs | Implemented |
| Governance | Community governance model (proposed), ISIL coordination | Pending |
| Actionability | URN and HTTPS URI formats, embeddable in RDF/JSON-LD | Implemented |

GHCID Identifier Formats

GHCID provides four complementary identifier formats, each optimized for specific use cases. All formats are deterministic and derived from the same underlying GHCID components.

1. Human-Readable GHCID String

Format: {Country}-{Region}-{City}-{Type}-{Abbreviation}

Components:

| Component | Format | Standard | Example |
| --- | --- | --- | --- |
| Country | ISO 3166-1 alpha-2 (2 chars) | ISO 3166-1 | NL, US, BR |
| Region | ISO 3166-2 subdivision (2-3 chars) OR GeoNames admin1 code | ISO 3166-2, GeoNames | NH (Noord-Holland), CA (California) |
| City | GeoNames city code (3-4 chars, base36 encoding) | GeoNames | 2759794 |
| Type | Institution type (1 char) | GHCID taxonomy | M (Museum), L (Library), A (Archive) |
| Abbreviation | First letter of each word in emic name (2-10 chars) | Derived | RM (Rijksmuseum) |

Examples:

NL-NH-2759794-M-RM        # Rijksmuseum, Amsterdam, Netherlands
US-DC-4140963-L-LC        # Library of Congress, Washington DC, USA
BR-RJ-3451190-L-BNB       # Biblioteca Nacional do Brasil, Rio de Janeiro, Brazil
GB-ENG-2643743-M-BM       # British Museum, London, United Kingdom
FR-IDF-2988507-M-ML       # Musée du Louvre, Paris, France
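
The five-part format can be checked mechanically. The following is a minimal sketch, assuming component widths inferred from the examples in this section rather than from a normative grammar; `GHCID_PATTERN`, `GHCID`, `parse_ghcid`, and `format_ghcid` are hypothetical names, not part of the project's library.

```python
import re
from typing import NamedTuple

# Assumed grammar, inferred from the examples in this section:
# country (2 letters), region (1-3 letters/digits), city (numeric GeoNames ID),
# type (1 letter), abbreviation (2-10 letters).
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})"
    r"-(?P<region>[A-Z0-9]{1,3})"
    r"-(?P<city>[0-9]+)"
    r"-(?P<inst_type>[A-Z])"
    r"-(?P<abbreviation>[A-Z]{2,10})$"
)


class GHCID(NamedTuple):
    country: str
    region: str
    city: str
    inst_type: str
    abbreviation: str


def parse_ghcid(value: str) -> GHCID:
    """Split a human-readable GHCID string into its five components."""
    match = GHCID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a GHCID string: {value!r}")
    return GHCID(**match.groupdict())


def format_ghcid(parts: GHCID) -> str:
    """Join the five components back into the hyphenated string form."""
    return "-".join(parts)
```

Parsing and re-formatting should round-trip: `format_ghcid(parse_ghcid(s)) == s` for any string the pattern accepts.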

Use Cases:

  • Academic citations
  • Documentation and reports
  • Debugging and logging
  • Human-readable data exchange

Persistence Note:

  • Organizations may relocate or change names over time
  • ghcid_original: Frozen on first assignment, never changes (TRUE PID)
  • ghcid_current: Updated if organization changes (convenience field)
  • Both stored in record; ghcid_original used for citations and cross-system references

Historical Institutions Rule (Added 2025-11-06):

GHCID supports historical heritage institutions (e.g., 17th-century cabinet collections, defunct museums, closed archives) using modern geographic coordinates:

  • Geographic Components: Country, region, and city codes are based on where the institution's coordinates would fall on a modern world map (2025) using the last recorded date of the institution's existence
  • Temporal Projection: Historical locations are projected onto current ISO 3166-1/3166-2 and GeoNames geocoding standards
  • Abbreviation: Institution abbreviations use the first letter of each significant word in the official emic (native language) name, skipping prepositions, articles, and conjunctions in all languages
  • Institution Type: Uses the full GHCID taxonomy (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.) - historical context preserved in metadata, not identifier
  • Flexibility: The GHCID format is deliberately designed to accommodate institutions from any historical period
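
The abbreviation rule (first letter of each significant word, skipping articles, prepositions, and conjunctions) can be sketched as follows. This is an illustration only: the stop-word list is a hypothetical, deliberately incomplete sample, and splitting on hyphens is an assumption made so that "Noord-Hollands Archief" yields NHA. Single-word names such as Rijksmuseum (RM in the examples) evidently follow a rule not spelled out here, so the sketch does not cover them.

```python
import re

# Hypothetical sample of stop words; a real implementation would need
# per-language lists of articles, prepositions, and conjunctions.
STOP_WORDS = {
    "of", "the", "and", "do", "da", "de", "du", "des", "la", "le",
    "van", "het", "en", "in", "der", "di", "y", "e",
}


def abbreviate(emic_name: str) -> str:
    """Return the first letters of the significant words, upper-cased."""
    # Assumption: whitespace and hyphens both separate words.
    words = re.split(r"[\s\-]+", emic_name.strip())
    significant = [w for w in words if w and w.lower() not in STOP_WORDS]
    return "".join(w[0].upper() for w in significant)
```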

Examples of Historical Institutions:

# Wunderkammer of Ole Worm (Copenhagen, closed 1654)
DK-84-2618425-P-OW        # Denmark, Capital Region, Copenhagen (modern), Personal Collection, Ole Worm

# Bibliotheca Corviniana (Buda, 1490 - dispersed 1526)
HU-BU-3054643-L-BC        # Hungary, Budapest (modern coords), Library, Bibliotheca Corviniana

# Cabinet of Curiosities of Ferdinand II (Innsbruck, 1620s)
AT-7-2775216-P-FII        # Austria, Tyrol, Innsbruck (modern), Personal Collection, Ferdinand II

# Dutch East India Company Archives (Jakarta, 1602-1800)
ID-JK-1642911-C-VOC       # Indonesia (modern country), Jakarta (modern), Corporation, VOC

Rationale:

  • Historical institutions existed at specific geographic coordinates
  • Modern political boundaries and city identifiers provide stable reference points
  • Emic name abbreviations preserve original cultural/linguistic context while ensuring deterministic generation
  • Metadata fields capture full historical context (founding/closure dates, historical names, organizational changes)
  • GHCID history tracks temporal evolution via ghcid_history entries
  • Enables citation of historical collections in modern scholarship using persistent identifiers

2. UUID v5 (SHA-1) - Primary Persistent Identifier

Format: RFC 4122 UUID v5 (128-bit, hyphenated)

Algorithm:

  1. Construct GHCID string: {Country}-{Region}-{City}-{Type}-{Abbreviation}
  2. Apply RFC 4122 UUID v5 with:
    • Namespace: 6ba7b810-9dad-11d1-80b4-00c04fd430c8 (DNS namespace from RFC 4122)
    • Name: GHCID string (UTF-8 encoded)
    • Hash: SHA-1 (per RFC 4122 specification)
  3. Format as UUID: xxxxxxxx-xxxx-5xxx-yxxx-xxxxxxxxxxxx
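
In Python, the steps above map onto the standard library's `uuid.uuid5`, which applies SHA-1 to the namespace and the UTF-8 encoded name. A minimal sketch follows; the function name `ghcid_to_uuid_v5` is hypothetical and this is not necessarily how the project's own library is written.

```python
import uuid


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive the UUID v5 for a GHCID string under the RFC 4122 DNS namespace."""
    # uuid.NAMESPACE_DNS is 6ba7b810-9dad-11d1-80b4-00c04fd430c8.
    return uuid.uuid5(uuid.NAMESPACE_DNS, ghcid)


if __name__ == "__main__":
    print(ghcid_to_uuid_v5("NL-NH-2759794-M-RM"))
```

Because the function has no state and no randomness, any party running it on the same GHCID string obtains the same UUID.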

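Examples (placeholder values for illustration, not computed outputs; a real UUID v5 carries the version digit 5 at the start of the third group):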

NL-NH-2759794-M-RM → 550e8400-e29b-41d4-a716-446655440000
US-DC-4140963-L-LC → 8b3e6f12-a4d5-5c89-b123-456789abcdef

Properties:

  • Standard Compliance: RFC 4122 (2005), IETF standard
  • Collision Resistance: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (birthday bound over the 122 hash-derived bits)
  • Deterministic: Same GHCID always produces same UUID
  • Interoperable: Compatible with Europeana, DPLA, IIIF, Wikidata
  • Transparent: Available in the standard library or a common package of most major programming languages

SHA-1 Safety:

  • SHA-1 is deprecated for cryptographic security (digital signatures, TLS)
  • SHA-1 is appropriate for identifier generation (a non-adversarial setting in which accidental collisions are negligibly unlikely)
  • UUID v5 uniqueness relies on the size of the output space (122 hash-derived bits of the 128-bit value), not on SHA-1 preimage resistance
  • See WHY_UUID_V5_SHA1.md for detailed rationale

Use Cases:

  • Primary identifier for all GHCID records
  • RDF/JSON-LD @id field
  • IIIF manifest identifiers
  • Wikidata external ID
  • Database foreign keys
  • Cross-system references

URN Format:

urn:uuid:550e8400-e29b-41d4-a716-446655440000

HTTPS URI Format (with resolution service):

https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000

3. UUID SHA-256 (Future-Proof Alternative)

Format: RFC 9562 UUID v8 (128-bit, custom SHA-256)

Algorithm:

  1. Construct GHCID string
  2. Hash with SHA-256 → 256 bits
  3. Truncate to first 128 bits (16 bytes)
  4. Set version bits to 8 (custom/experimental)
  5. Set variant bits to RFC 4122 (0b10xxxxxx)
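
The five steps can be sketched in Python as below. The version and variant bits are set by hand on the truncated digest, so the code does not depend on UUID v8 support in any particular Python release; `ghcid_to_uuid_sha256` is a hypothetical name.

```python
import hashlib
import uuid


def ghcid_to_uuid_sha256(ghcid: str) -> uuid.UUID:
    """Build a version-8 UUID from the first 128 bits of SHA-256(ghcid)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    raw = bytearray(digest[:16])          # truncate 256 bits to 128 bits
    raw[6] = (raw[6] & 0x0F) | 0x80       # version 8 in the high nibble of byte 6
    raw[8] = (raw[8] & 0x3F) | 0x80       # variant 0b10xxxxxx in byte 8
    return uuid.UUID(bytes=bytes(raw))
```

Six of the 128 bits are fixed by the version and variant fields, so 122 bits come from the hash, the same as for UUID v5.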

Examples:

NL-NH-2759794-M-RM → a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d

Properties:

  • Cryptographic Strength: SHA-256 (NIST-approved through 2030+)
  • Collision Resistance: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (same as UUID v5; 122 hash-derived bits)
  • Future-Proof: No known practical attacks against SHA-256
  • Deterministic: Same GHCID always produces same UUID
  • Less Transparent: Custom algorithm requires sharing implementation code

Use Cases:

  • Security policies mandating SHA-256
  • Future migration path if SHA-1 fully deprecated
  • Custom identifier resolution services
  • Internal systems with strict cryptographic requirements

Status: Generated alongside UUID v5, stored as secondary identifier

4. Numeric (64-bit Integer)

Format: Unsigned 64-bit integer (0 to 18,446,744,073,709,551,615)

Algorithm:

  1. Hash GHCID string with SHA-256 → 256 bits
  2. Extract first 8 bytes (64 bits)
  3. Convert to unsigned integer (big-endian)
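
A minimal Python sketch of these three steps, with the hypothetical name `ghcid_to_numeric`:

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """Return the first 8 bytes of SHA-256(ghcid) as an unsigned big-endian integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=False)
```

Roughly half of all results are at or above 2^63, so a signed 64-bit column cannot store every value unchanged.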

Examples:

NL-NH-2759794-M-RM → 12345678901234567
US-DC-4140963-L-LC → 98765432109876543

Properties:

  • Compact: 8 bytes (vs. 36 bytes for UUID string)
  • Deterministic: Same GHCID always produces same numeric ID
  • Fast Indexing: Integer comparisons faster than string UUIDs
  • CSV-Friendly: No special characters
  • Reduced Collision Resistance: P(collision) ≈ 2.7 × 10^-8 for 1M institutions (still negligible)

Use Cases:

  • Database primary keys (e.g. SQL BIGINT, provided the column type can hold the full unsigned 64-bit range)
  • CSV exports for spreadsheet analysis
  • Numeric sorting requirements
  • Systems without UUID support
  • Legacy system integration

Limitations:

  • NOT recommended as primary PID (use UUID v5 instead)
  • Suitable for heritage domain (<10M institutions expected)
  • For >100M institutions, collision risk becomes non-negligible (≈ 0.027% at 100M, growing quadratically with the number of institutions)
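
Collision estimates of this kind follow from the birthday approximation p ≈ 1 − exp(−n(n−1) / 2N), where n is the number of identifiers and N the size of the identifier space. The sketch below lets readers check the figures themselves; `collision_probability` is a hypothetical helper, and it assumes the hash output is uniformly distributed.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation for n identifiers drawn uniformly from 2**bits values."""
    space = 2 ** bits
    exponent = n * (n - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for label, n, bits in [
        ("numeric, 1M", 10**6, 64),
        ("numeric, 100M", 10**8, 64),
        ("UUID (122 hash bits), 1M", 10**6, 122),
    ]:
        print(f"{label}: {collision_probability(n, bits):.2e}")
```

Under this approximation, one million 64-bit identifiers collide with probability of about 2.7 × 10^-8, and one hundred million with about 2.7 × 10^-4.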

Resolution Architecture

Resolution Service Design

A persistent identifier is only as persistent as its resolution infrastructure. GHCID requires a long-term, reliable resolution service to resolve identifiers to authoritative metadata.

Resolver Endpoints

Base URL: https://id.heritage.example.org/ (example domain)

| Endpoint Pattern | Format | Example |
| --- | --- | --- |
| /uuid/{uuid} | UUID v5 | /uuid/550e8400-e29b-41d4-a716-446655440000 |
| /uuid-sha256/{uuid} | UUID SHA-256 | /uuid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d |
| /numeric/{id} | Numeric | /numeric/12345678901234567 |
| /ghcid/{string} | Human-readable | /ghcid/NL-NH-2759794-M-RM |

All four endpoints resolve to the SAME institutional record.

Resolution Protocol

HTTP GET Request:

GET /uuid/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json

HTTP Response (JSON-LD):

{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": ["HeritageCustodian", "schema:Museum", "org:Organization"],
  "@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
  "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
  "ghcid_original": "NL-NH-2759794-M-RM",
  "ghcid_current": "NL-NH-2759794-M-RM",
  "name": "Rijksmuseum",
  "alternateName": ["Rijksmuseum Amsterdam", "Rijks"],
  "description": "The national museum of the Netherlands, dedicated to arts and history.",
  "institution_type": "MUSEUM",
  "url": "https://www.rijksmuseum.nl",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q190804",
    "https://viaf.org/viaf/131511535",
    "urn:isil:NL-AsdRM"
  ],
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Museumstraat 1",
    "addressLocality": "Amsterdam",
    "postalCode": "1071 XX",
    "addressCountry": "NL",
    "geonames": "https://sws.geonames.org/2759794/"
  },
  "foundingDate": "1800-01-01",
  "provenance": {
    "data_source": "CSV_REGISTRY",
    "data_tier": "TIER_1_AUTHORITATIVE",
    "extraction_date": "2025-11-06T10:30:00Z"
  }
}

Content Negotiation:

| Accept Header | Response Format | Use Case |
| --- | --- | --- |
| application/ld+json | JSON-LD | Linked Data applications |
| application/json | Plain JSON | APIs, JavaScript |
| text/turtle | RDF Turtle | SPARQL, semantic web |
| application/rdf+xml | RDF/XML | Legacy RDF systems |
| text/html | HTML landing page | Human browsing |
| text/plain | Plain text summary | Simple debugging |
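
A resolver could select the response format from the Accept header along the following lines. This is a simplified sketch, not a full HTTP content-negotiation implementation: it honours q-values and simple wildcards only, and both the JSON-LD default and the `None` return for unsupported types (which a caller might answer with 406 Not Acceptable) are assumptions not stated in this specification. `SUPPORTED` and `choose_media_type` are hypothetical names.

```python
from typing import Optional

SUPPORTED = [
    "application/ld+json", "application/json", "text/turtle",
    "application/rdf+xml", "text/html", "text/plain",
]


def choose_media_type(accept_header: str,
                      default: str = "application/ld+json") -> Optional[str]:
    """Pick the best supported media type, or None if nothing acceptable matches."""
    if not accept_header.strip():
        return default
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        candidates.append((-q, position, media_type))
    # Highest q first; ties keep the order given by the client.
    for neg_q, _, media_type in sorted(candidates):
        if neg_q == 0:
            continue  # q=0 means "not acceptable"
        if media_type == "*/*":
            return default
        if media_type in SUPPORTED:
            return media_type
        if media_type.endswith("/*"):
            prefix = media_type[:-1]
            for supported in SUPPORTED:
                if supported.startswith(prefix):
                    return supported
    return None
```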

HTTP Status Codes

| Code | Meaning | When Used |
| --- | --- | --- |
| 200 OK | Identifier resolved successfully | Record found |
| 303 See Other | Redirect to canonical URL | Multiple URLs for same resource |
| 404 Not Found | Identifier not in registry | Unknown GHCID |
| 410 Gone | Institution closed/merged | Record marked as inactive |
| 500 Internal Server Error | Resolver malfunction | Service downtime |

Persistence Commitment

Requirement: Resolution service must commit to:

  1. Minimum 50-year operation (heritage institutions have multi-century lifespans)
  2. High availability (99.9% uptime SLA)
  3. Multi-region redundancy (geographic distribution)
  4. Daily backups with disaster recovery plan
  5. Transparent governance (public policies, community oversight)
  6. Open-source resolver code (forkable by community if needed)

Funding Model (options):

  • Grant funding (national libraries, heritage foundations)
  • Membership fees (GLAM consortia, aggregators)
  • Government support (cultural heritage agencies)
  • Cloud provider donations (Google, AWS, Azure)

Governance Model

Organizational Structure

GHCID is proposed as a community-governed persistent identifier scheme, modeled on successful PID systems like DOI, ARK, and Handle.

Proposed Governance Body

GHCID Consortium (working name):

  1. Steering Committee (7-9 members)

    • Representatives from: National libraries, international archives, museum networks
    • Terms: 3 years, staggered rotation
    • Responsibilities: Policy decisions, budget oversight, strategic direction
  2. Technical Working Group

    • Developers, data scientists, Linked Data experts
    • Responsibilities: Specification updates, resolver development, community tools
  3. Community Advisory Board

    • Heritage institutions, researchers, aggregators (Europeana, DPLA)
    • Responsibilities: Use case feedback, adoption guidance
  4. Secretariat

    • Permanent staff (2-3 FTE)
    • Responsibilities: Day-to-day operations, resolver maintenance, documentation

Coordination with Existing Systems

GHCID does NOT replace existing identifier systems; it complements and coordinates with:

  1. ISIL (ISO 15511)

    • Store ISIL codes as secondary identifiers
    • Cross-reference GHCID ↔ ISIL mapping
    • Collaborate with national ISIL agencies
  2. Wikidata

    • Propose GHCID as new external identifier property
    • Link GHCID records to Wikidata Q-numbers
    • Enable bidirectional cross-referencing
  3. VIAF (Virtual International Authority File)

    • For institutions with VIAF records, store VIAF ID
    • Coordinate with OCLC on authority control
  4. GeoNames

    • Use GeoNames IDs for geographic components
    • Link GHCID locations to GeoNames URIs
  5. Europeana / DPLA

    • Integrate GHCID into aggregator metadata
    • Use UUID v5 format for interoperability

Identifier Assignment Policies

Who can assign a GHCID?

Option 1: Open Generation (preferred for transparency)

  • Anyone can generate a GHCID using the open-source algorithm
  • Deterministic generation ensures same institution → same ID
  • Conflicts resolved via community review

Option 2: Registry-Based (traditional PID model)

  • Institutions apply to GHCID Consortium for assignment
  • Manual review ensures accuracy
  • Slower, but higher quality control

Recommendation: Hybrid approach

  • Open generation for most institutions (self-service)
  • Optional manual review for complex cases (mergers, disputes)
  • Community validation via Wikidata, ISIL cross-checks

Dispute Resolution

Scenario: Two GHCID records claim to represent the same institution

Resolution Process:

  1. Automated detection (name similarity, ISIL code match)
  2. Community flagging (anyone can report suspected duplicates)
  3. Review by Technical Working Group
  4. Merge records, redirect old GHCID to canonical GHCID (HTTP 303)
  5. Update provenance metadata with merge event

Versioning and Deprecation

GHCID Specification Versioning:

  • Semantic versioning: MAJOR.MINOR.PATCH
  • Current version: 1.0.0 (this document)
  • Backward compatibility guaranteed within a MAJOR version (breaking changes require a new MAJOR version)

Identifier Deprecation:

  • GHCIDs are never deleted (persistence requirement)
  • Closed/merged institutions marked as organization_status: CLOSED
  • HTTP 410 Gone response with pointer to successor organization
  • Change history tracked in change_history field
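
The deprecation policy implies a small decision rule in the resolver. The sketch below is one possible reading of it, assuming each registry record is a dictionary with an `organization_status` field and an optional `successor` URL; the `successor` key and the function name `resolve_status` are hypothetical.

```python
from typing import Dict, Optional, Tuple


def resolve_status(record: Optional[dict]) -> Tuple[int, Dict[str, str]]:
    """Map a registry lookup result to an HTTP status code and a small body."""
    if record is None:
        return 404, {}                      # identifier not in the registry
    if record.get("organization_status") == "CLOSED":
        body = {"status": "CLOSED"}
        successor = record.get("successor")  # hypothetical field name
        if successor:
            body["successor"] = successor    # pointer to the successor organization
        return 410, body
    return 200, {}                           # active record, serve metadata
```

The record itself is never deleted: a closed institution still resolves, only with a different status code.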

Comparison with Existing PID Systems

Feature Comparison Table

| Feature | GHCID | ISIL | DOI | ARK | Handle | Wikidata |
| --- | --- | --- | --- | --- | --- | --- |
| Domain | Heritage institutions | Libraries/archives | Scholarly objects | Cultural heritage | Digital objects | Entities (all types) |
| Coverage | Global (any institution) | Limited (registry-based) | Scholarly publications | Libraries, museums | Repositories | Global (crowdsourced) |
| Registration | Open (deterministic) | Required | Required | Required | Required | Open (crowdsourced) |
| Format | UUID, GHCID string, numeric | Country-local code | 10.xxxx/yyyy | ark:/nnnnn/xxx | hdl:xxxx/yyyy | Q12345 |
| Resolution | HTTPS (planned) | No standard resolver | doi.org | n2t.net | handle.net | wikidata.org |
| Governance | Proposed consortium | National agencies | IDF (non-profit) | CDL (California) | CNRI (non-profit) | Wikimedia Foundation |
| Cost | Free (open) | Free | Paid (varies) | Free | Paid (varies) | Free |
| Standard | RFC 4122, ISO 3166 | ISO 15511 | ISO 26324 | IETF draft | IETF RFC 3650 | Community-driven |
| Change Tracking | Built-in | No | No | No | No | Edit history |
| Multi-Format | 4 formats | Single | Single | Single | Single | Single |
| Adoption | New (0) | High (libraries) | Very High | Medium | Medium | Very High |

Unique GHCID Advantages

  1. Change History Tracking: Built-in organizational evolution modeling (mergers, splits, relocations)
  2. Multi-Format Flexibility: Human-readable, UUID, numeric formats from same base ID
  3. Open Generation: Deterministic algorithm, no registration bureaucracy
  4. Global Coverage: Not limited to countries with ISIL registries
  5. Linked Data Native: Designed for RDF/JSON-LD from the start
  6. Data Quality Tiers: 4-tier provenance system (TIER_1 through TIER_4)

Why Not Just Use Wikidata?

Wikidata is excellent but has limitations for PIDs:

| Aspect | Wikidata | GHCID |
| --- | --- | --- |
| Identifier Format | Q12345 (sequential) | UUID (content-addressed) |
| Determinism | No (assigned sequentially) | Yes (hash-based) |
| Regeneration | Lost if database corrupted | Can regenerate from metadata |
| Governance | Wikimedia Foundation | Heritage community |
| Specialization | General knowledge base | Heritage institutions only |
| Provenance | Edit history | Structured provenance model |
| Data Quality | Crowdsourced (variable) | Tiered quality system |

Recommendation: Use GHCID as primary PID, link to Wikidata Q-number as secondary identifier


Implementation Roadmap

Phase 1: Foundation (2025 Q1-Q2) IN PROGRESS

Status: Currently implementing

Deliverables:

  • GHCID specification document (this document)
  • UUID generation library (src/glam_extractor/identifiers/ghcid.py)
  • LinkML schema with GHCID fields (schemas/core.yaml)
  • Test suite (UUID determinism, collision resistance)
  • 🔄 GeoNames integration for city codes
  • 🔄 ISO 3166-2 lookup tables

Milestones:

  • GHCID format design
  • UUID v5 implementation
  • UUID SHA-256 implementation
  • Numeric ID implementation
  • GeoNames geocoding service
  • ISO 3166-2 reference data

Phase 2: Data Production (2025 Q2-Q3)

Deliverables:

  • Dutch institutions with GHCID (1,351 organizations)
  • ISIL registry with GHCID (364 institutions)
  • Conversation data with GHCID (estimated 2,000-5,000 institutions)
  • Cross-linked dataset (merged by GHCID UUID)

Milestones:

  • Generate GHCID for all Dutch datasets
  • Generate GHCID for conversation extractions
  • Cross-link by UUID v5
  • Publish test dataset (100 institutions)

Phase 3: Resolution Service (2025 Q4)

Deliverables:

  • GHCID resolver prototype (Python FastAPI)
  • JSON-LD API endpoint
  • HTML landing pages
  • Content negotiation support
  • Registry database (PostgreSQL + RDF triplestore)

Milestones:

  • Resolver API implementation
  • Database schema for GHCID registry
  • Load Dutch dataset into resolver
  • Public demo deployment

Phase 4: Community Engagement (2026 Q1-Q2)

Deliverables:

  • GHCID specification v1.0 (finalized)
  • Outreach to Europeana, DPLA, IIIF communities
  • Proposal to Wikidata for new external ID property
  • Coordination meetings with ISIL agencies
  • Community feedback incorporation

Milestones:

  • Present at CIDOC, ICA, IFLA conferences
  • Publish RFC or W3C Community Group Note
  • Partner with 3-5 heritage institutions for pilot
  • Gather feedback, iterate on specification

Phase 5: Governance Establishment (2026 Q3-Q4)

Deliverables:

  • GHCID Consortium formation
  • Steering Committee election
  • Long-term funding secured
  • Resolver production deployment
  • Governance policies published

Milestones:

  • Incorporate GHCID Consortium (non-profit)
  • Secure 3-year funding commitment
  • Deploy production resolver (multi-region)
  • Establish community governance processes

Phase 6: Scaling and Adoption (2027+)

Deliverables:

  • Global dataset (50,000+ institutions)
  • Integration with major aggregators
  • Resolver SLA (99.9% uptime)
  • Annual community meetings
  • Ongoing maintenance and updates

Milestones:

  • 10,000 institutions with GHCID
  • Europeana integration
  • DPLA integration
  • Wikidata property approved
  • 50-year persistence commitment

Technical Specifications

Data Model

Core GHCID Fields (LinkML schema):

HeritageCustodian:
  slots:
    # Primary identifiers
    - id                    # UUID v5 (primary key)
    - record_id             # UUID v7 (database PK, time-ordered)
    - ghcid_uuid            # UUID v5 (same as id)
    - ghcid_uuid_sha256     # UUID SHA-256 (future-proof)
    - ghcid_numeric         # Numeric (64-bit)
    
    # Human-readable GHCIDs
    - ghcid_current         # Current GHCID string (may change)
    - ghcid_original        # Original GHCID string (FROZEN, true PID)
    
    # GHCID history
    - ghcid_history         # List of GHCIDHistoryEntry

GHCID History Entry:

GHCIDHistoryEntry:
  description: Tracks changes to GHCID over time (relocations, name changes)
  slots:
    - ghcid_value           # GHCID string at this point in time
    - valid_from            # ISO 8601 date (when this GHCID became active)
    - valid_to              # ISO 8601 date (when this GHCID was superseded)
    - change_reason         # Reason for change (relocation, name change, etc.)
    - related_event         # Link to ChangeEvent if applicable

Example:

# Noord-Hollands Archief (formed 2001 via merger)
id: 550e8400-e29b-41d4-a716-446655440000  # UUID v5
ghcid_original: NL-NH-2750053-A-NHA       # Frozen forever
ghcid_current: NL-NH-2750053-A-NHA        # Same (no changes yet)

ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: null  # Still valid
    change_reason: FOUNDING
    related_event: ghcid:event-nha-merger-2001

# If it relocates in 2030:
ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: "2030-06-15"
  - ghcid_value: NL-NH-9876543-A-NHA  # New city GeoNames ID
    valid_from: "2030-06-15"
    valid_to: null
    change_reason: RELOCATION
    related_event: ghcid:event-nha-relocation-2030
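
The relocation scenario above amounts to closing the open history entry and appending a new one while leaving ghcid_original untouched. A minimal sketch, assuming the record is held as a plain dictionary with the field names of the data model; `record_ghcid_change` is a hypothetical helper.

```python
from datetime import date


def record_ghcid_change(record: dict, new_ghcid: str, when: date, reason: str) -> None:
    """Close the currently valid history entry and append the new GHCID."""
    history = record.setdefault("ghcid_history", [])
    for entry in history:
        if entry.get("valid_to") is None:
            entry["valid_to"] = when.isoformat()   # supersede the open entry
    history.append({
        "ghcid_value": new_ghcid,
        "valid_from": when.isoformat(),
        "valid_to": None,
        "change_reason": reason,
    })
    record["ghcid_current"] = new_ghcid            # ghcid_original is never modified
```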

API Specification

Resolver API Endpoints:

1. Resolve by UUID v5

GET /uuid/{uuid} HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json

Response: 200 OK
Content-Type: application/ld+json
{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
  ...
}

2. Resolve by GHCID String

GET /ghcid/NL-NH-2759794-M-RM HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000

3. Search by Name

GET /search?name=Rijksmuseum&country=NL HTTP/1.1

Response: 200 OK
Content-Type: application/json
{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Rijksmuseum",
      "ghcid": "NL-NH-2759794-M-RM",
      "url": "https://www.rijksmuseum.nl"
    }
  ],
  "total": 1
}

4. Reverse Lookup (Numeric → UUID)

GET /numeric/12345678901234567 HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000

Database Schema (PostgreSQL)

CREATE TABLE ghcid_registry (
    -- Primary keys
    id UUID PRIMARY KEY,                    -- UUID v5 (primary identifier)
    record_id UUID UNIQUE NOT NULL,         -- UUID v7 (database record ID)

    -- Identifiers
    ghcid_uuid UUID UNIQUE NOT NULL,        -- UUID v5 (same as id)
    ghcid_uuid_sha256 UUID UNIQUE NOT NULL, -- UUID SHA-256
    -- Unsigned 64-bit values can exceed the signed BIGINT range in PostgreSQL,
    -- so a 20-digit NUMERIC column is used here
    ghcid_numeric NUMERIC(20, 0) UNIQUE NOT NULL, -- Numeric (unsigned 64-bit)
    ghcid_original VARCHAR(100) UNIQUE NOT NULL,  -- Frozen GHCID string
    ghcid_current VARCHAR(100) NOT NULL,    -- Current GHCID string

    -- Metadata
    name TEXT NOT NULL,
    institution_type VARCHAR(50) NOT NULL,
    organization_status VARCHAR(20) DEFAULT 'ACTIVE',

    -- Geographic
    country CHAR(2) NOT NULL,               -- ISO 3166-1
    region VARCHAR(10),                     -- ISO 3166-2
    city_geonames_id INTEGER,               -- GeoNames ID

    -- External identifiers
    isil_code VARCHAR(50),
    wikidata_id VARCHAR(20),
    viaf_id VARCHAR(50),

    -- Provenance
    data_source VARCHAR(50) NOT NULL,
    data_tier VARCHAR(20) NOT NULL,
    extraction_date TIMESTAMPTZ NOT NULL,

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Secondary indexes. PostgreSQL has no inline INDEX clause in CREATE TABLE,
-- and UNIQUE columns such as ghcid_original are indexed automatically.
CREATE INDEX idx_ghcid_current ON ghcid_registry (ghcid_current);
CREATE INDEX idx_name ON ghcid_registry (name);
CREATE INDEX idx_country_region ON ghcid_registry (country, region);
CREATE INDEX idx_isil ON ghcid_registry (isil_code);
CREATE INDEX idx_wikidata ON ghcid_registry (wikidata_id);

CREATE TABLE ghcid_history (
    id SERIAL PRIMARY KEY,
    ghcid_uuid UUID NOT NULL REFERENCES ghcid_registry(id),
    ghcid_value VARCHAR(100) NOT NULL,
    valid_from DATE NOT NULL,
    valid_to DATE,
    change_reason VARCHAR(50),
    related_event_id VARCHAR(200)
);

CREATE INDEX idx_ghcid_history_uuid ON ghcid_history (ghcid_uuid);
CREATE INDEX idx_ghcid_history_valid_dates ON ghcid_history (valid_from, valid_to);

GeoNames Settlement Resolution

Overview

The City component of GHCID relies on GeoNames for standardized settlement names and geographic resolution. This section defines critical rules for GeoNames integration.

Feature Code Filtering (CRITICAL)

NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.

ALLOWED Feature Codes

| Code | Description | Example |
| --- | --- | --- |
| PPL | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| PPLA | Seat of first-order admin division | Provincial capitals |
| PPLA2 | Seat of second-order admin division | Municipal seats |
| PPLA3 | Seat of third-order admin division | District seats |
| PPLA4 | Seat of fourth-order admin division | Sub-district seats |
| PPLC | Capital of a political entity | Amsterdam, Brussels |
| PPLS | Populated places (multiple) | Settlement clusters |
| PPLG | Seat of government | The Hague |

EXCLUDED Feature Codes

| Code | Description | Why Excluded |
| --- | --- | --- |
| PPLX | Section of populated place | Neighborhoods, districts, quarters |

Problem Example:

Without feature code filtering, reverse geocoding may return:

  • "Binnenstad" (PPLX, neighborhood, pop 4,900) - WRONG
  • "Apeldoorn" (PPL, city, pop 136,670) - CORRECT

SQL Implementation

SELECT 
    name, ascii_name, admin1_code, admin1_name,
    latitude, longitude, geonames_id, population, feature_code,
    ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) as distance_sq
FROM cities
WHERE country_code = ?
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance_sq
LIMIT 1
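
The same selection can be sketched in plain Python for readers without a GeoNames database at hand. The row dictionaries mirror the column names of the query. Unlike the SQL, which compares squared differences in raw degrees, this sketch scales the longitude difference by cos(latitude); that is an assumed refinement, not something the specification requires. `ALLOWED_FEATURE_CODES` and `nearest_settlement` are hypothetical names.

```python
import math
from typing import Iterable, Optional

ALLOWED_FEATURE_CODES = {
    "PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG",
}


def nearest_settlement(rows: Iterable[dict], lat: float, lon: float) -> Optional[dict]:
    """Return the closest row whose feature code is an allowed settlement type."""
    cos_lat = math.cos(math.radians(lat))

    def distance_sq(row: dict) -> float:
        d_lat = row["latitude"] - lat
        d_lon = (row["longitude"] - lon) * cos_lat  # shrink longitude degrees by latitude
        return d_lat * d_lat + d_lon * d_lon

    candidates = [r for r in rows if r["feature_code"] in ALLOWED_FEATURE_CODES]
    return min(candidates, key=distance_sq) if candidates else None
```

A PPLX neighborhood is never returned, even when it lies closer to the query point than any proper settlement.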

Country Code Detection

CRITICAL: Determine country code from entry data BEFORE calling GeoNames reverse geocoding.

GeoNames queries are country-specific. Using the wrong country code will return incorrect results.

Country Code Resolution Priority:

  1. zcbs_enrichment.country - Most explicit source
  2. location.country - Direct location field
  3. locations[].country - Array location field
  4. original_entry.country - CSV source field
  5. google_maps_enrichment.address - Parse from address string
  6. wikidata_enrichment.located_in.label - Infer from Wikidata
  7. Default: "NL" (Netherlands) - Only if no other source
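
The priority list can be read as a first-match lookup. The sketch below covers sources 1 to 4 and the default; sources 5 and 6 require address parsing and label-to-country mapping and are left out. It assumes the country fields already hold ISO 3166-1 alpha-2 codes, and `resolve_country_code` is a hypothetical name.

```python
def resolve_country_code(entry: dict, default: str = "NL") -> str:
    """Return the first non-empty country code found, in priority order."""

    def dig(*path):
        # Walk nested dictionaries, returning None if any level is missing.
        node = entry
        for key in path:
            if not isinstance(node, dict):
                return None
            node = node.get(key)
        return node

    candidates = [
        dig("zcbs_enrichment", "country"),   # 1. most explicit source
        dig("location", "country"),          # 2. direct location field
    ]
    candidates += [                          # 3. array location field
        loc.get("country")
        for loc in (entry.get("locations") or [])
        if isinstance(loc, dict)
    ]
    candidates.append(dig("original_entry", "country"))  # 4. CSV source field

    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip().upper()
    return default                           # 7. fallback only when nothing else matched
```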

Provenance Tracking

Record GeoNames resolution in entry metadata:

location_resolution:
  method: REVERSE_GEOCODE
  geonames_id: 2759706
  geonames_name: Apeldoorn
  feature_code: PPL  # MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
  admin1_code: '03'
  region_code: GE
  country_code: NL
  source_coordinates:
    latitude: 52.21116
    longitude: 5.96978
  distance_km: 0.5

Validation: If feature_code: PPLX appears in metadata, the GHCID is WRONG and must be regenerated.


Use Cases and Applications

1. Academic Citations

Scenario: Researcher cites archival collection in academic paper

Without GHCID:

"See the Municipal Archives of Haarlem for records from 1245-1800."

Problem: Name may change, institution may merge, citation becomes ambiguous

With GHCID:

"See the Noord-Hollands Archief (urn:uuid:550e8400-e29b-41d4-a716-446655440000) 
for Haarlem municipal records from 1245-1800."

Benefit: Persistent identifier remains valid even if organization changes name or merges

2. Metadata Aggregation (Europeana, DPLA)

Scenario: Europeana aggregates metadata from 4,000 institutions

Without GHCID:

{
  "institution": "National Library",
  "country": "Netherlands"
}

Problem: Multiple "National Library" institutions, ambiguous

With GHCID:

{
  "@id": "urn:uuid:a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
  "institution": "Koninklijke Bibliotheek",
  "sameAs": "https://id.heritage.example.org/uuid/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"
}

Benefit: Unique identifier enables deduplication, cross-referencing, provenance tracking

3. IIIF Manifests

Scenario: Museum publishes IIIF manifest for digitized collection

GHCID Integration:

{
  "@context": "http://iiif.io/api/presentation/3/context.json",
  "id": "https://iiif.rijksmuseum.nl/manifest/123",
  "type": "Manifest",
  "provider": [{
    "id": "urn:uuid:550e8400-e29b-41d4-a716-446655440000",
    "type": "Agent",
    "label": {"en": ["Rijksmuseum"]},
    "sameAs": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000"
  }]
}

Benefit: IIIF consumers can resolve provider to authoritative metadata

4. Wikidata Integration

Scenario: Link Wikidata item to heritage institution

Wikidata Property Proposal: P-GHCID (GHCID identifier)

# Wikidata SPARQL query
SELECT ?item ?itemLabel ?ghcid WHERE {
  ?item wdt:P-GHCID ?ghcid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Benefit: Bidirectional linking between Wikidata and GHCID registry

5. Merger/Split Tracking

Scenario: Two archives merge in 2001

GHCID Representation:

# Gemeentearchief Haarlem (predecessor)
- id: old-uuid-1
  ghcid_original: NL-NH-2755003-A-GH  # 2755003 = GeoNames ID for Haarlem
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Rijksarchief in Noord-Holland (predecessor)
- id: old-uuid-2
  ghcid_original: NL-NH-2755003-A-RNH
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Noord-Hollands Archief (successor)
- id: new-uuid
  ghcid_original: NL-NH-2755003-A-NHA
  organization_status: ACTIVE
  change_history:
    - change_type: FOUNDING
      event_date: "2001-01-01"
      affected_organization: [old-uuid-1, old-uuid-2]

Resolver Behavior:

GET /uuid/old-uuid-1 HTTP/1.1

Response: 410 Gone
Location: /uuid/new-uuid
{
  "status": "CLOSED",
  "reason": "Merged into Noord-Hollands Archief",
  "successor": "https://id.heritage.example.org/uuid/new-uuid",
  "effective_date": "2001-01-01"
}
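
The response above could be produced by resolver logic along the following lines. This is a minimal sketch under stated assumptions: `REGISTRY`, `resolve_uuid`, and the record field names are hypothetical, the registry is an in-memory dict rather than a database, and the 404 response for unknown identifiers is an assumption not taken from this specification.

```python
BASE_URL = "https://id.heritage.example.org"  # example host used in this document

# Hypothetical in-memory registry keyed by UUID; a real resolver would query a database.
REGISTRY = {
    "old-uuid-1": {
        "status": "CLOSED",
        "reason": "Merged into Noord-Hollands Archief",
        "successor": "new-uuid",
        "effective_date": "2001-01-01",
    },
    "new-uuid": {"status": "ACTIVE"},
}


def resolve_uuid(uuid_str: str) -> tuple[int, dict, dict]:
    """Return (HTTP status, headers, JSON body) for a GET /uuid/{id} request."""
    record = REGISTRY.get(uuid_str)
    if record is None:
        # Assumption: unknown identifiers yield 404
        return 404, {}, {"error": "Unknown identifier"}
    if record["status"] == "CLOSED" and "successor" in record:
        body = {
            "status": record["status"],
            "reason": record["reason"],
            "successor": f"{BASE_URL}/uuid/{record['successor']}",
            "effective_date": record["effective_date"],
        }
        # Closed record with a successor: 410 Gone plus a pointer to the successor
        return 410, {"Location": f"/uuid/{record['successor']}"}, body
    return 200, {}, {"status": record["status"]}
```

Keeping the closed record resolvable, rather than deleting it, is what lets old citations still lead to the successor organization.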

GHCID Collision Resolution and Timeline Examples

Collision Scenarios

GHCID collisions occur when two different institutions generate the same base GHCID string (identical country, region, city, type, and abbreviation). Resolution strategy depends on temporal context: when were the colliding institutions discovered?

Core Principle: Temporal Priority Determines Strategy

Rule: The timing of institution discovery determines which institutions receive native language name suffixes.

Collision Suffix: Collisions are resolved by appending the full legal name in native language in snake_case format.

| Scenario | When Detected | Resolution Strategy | Rationale |
|----------|---------------|---------------------|-----------|
| First Batch Collision | Multiple institutions discovered simultaneously (e.g., batch CSV import) | ALL colliding institutions receive name suffixes | Fair treatment: no institution has temporal priority |
| Historical Addition | New institution collides with already published GHCID | ONLY new institution receives name suffix | PID stability: preserve existing published identifiers |

Name Suffix Generation

Converting institution names to snake_case suffixes:

import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.
    
    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    
    # Convert to lowercase
    lowercase = ascii_name.lower()
    
    # Remove apostrophes (straight and typographic), commas, and other punctuation
    no_punct = re.sub(r"['\u2018\u2019`\",.:;!?()\[\]{}]", '', lowercase)
    
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    
    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    
    return final

Timeline Example 1: First Batch Collision

Date: 2025-11-01
Event: Dutch ISIL Registry batch import (364 institutions)

┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import from Dutch ISIL Registry        │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-SM
  
  Institution 1: Stedelijk Museum Amsterdam
    - Discovered: 2025-11-01 (batch import)
    → Resolution: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

  Institution 2: Science Museum Amsterdam
    - Discovered: 2025-11-01 (batch import)
    → Resolution: NL-NH-AMS-M-SM-science_museum_amsterdam

Strategy: FIRST_BATCH
Reason: Both institutions discovered simultaneously in batch import
Result: BOTH receive name suffixes derived from their native language names

GHCID Records:

# Institution 1
- id: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 of full GHCID (placeholder value)
  ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  name: Stedelijk Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"

# Institution 2
- id: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v5 of full GHCID (placeholder value)
  ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-science_museum_amsterdam
  name: Science Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"

Key Point: Both institutions created with name suffixes from the start. No existing PID was changed.
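
The first-batch rule can be sketched as follows. This is an illustration under assumptions: `resolve_first_batch` is a hypothetical helper, `name_suffix` is a compact stand-in for the `generate_name_suffix` function, and institutions are keyed by name only for brevity.

```python
import re
import unicodedata
from collections import defaultdict


def name_suffix(native_name: str) -> str:
    """Compact stand-in for generate_name_suffix, used only in this sketch."""
    decomposed = unicodedata.normalize("NFD", native_name)
    text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()
    text = re.sub(r"['\u2018\u2019`\",.:;!?()\[\]{}]", "", text)  # drop punctuation
    text = re.sub(r"[\s\-]+", "_", text)  # spaces and hyphens become underscores
    return re.sub(r"_+", "_", re.sub(r"[^a-z0-9_]", "", text)).strip("_")


def resolve_first_batch(batch: list[tuple[str, str]]) -> dict[str, str]:
    """Map institution name to final GHCID for one simultaneous import.

    `batch` holds (base_ghcid, native_name) pairs. Every member of a
    colliding group receives a suffix; unique base GHCIDs stay unchanged.
    """
    groups = defaultdict(list)
    for base, name in batch:
        groups[base].append(name)
    resolved = {}
    for base, names in groups.items():
        for name in names:
            resolved[name] = f"{base}-{name_suffix(name)}" if len(names) > 1 else base
    return resolved


batch = [
    ("NL-NH-AMS-M-SM", "Stedelijk Museum Amsterdam"),
    ("NL-NH-AMS-M-SM", "Science Museum Amsterdam"),
    ("NL-NH-AMS-M-RM", "Rijksmuseum"),
]
for name, ghcid in resolve_first_batch(batch).items():
    print(f"{name}: {ghcid}")
```

Grouping by base GHCID before assigning any identifier is what makes the treatment symmetric: no member of the batch is processed "first" and left without a suffix.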


Timeline Example 2: Historical Addition (Single Institution)

Date: 2025-11-15
Event: Historical research reveals Amsterdam Historical Museum (closed 2001)

┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import                                  │
│   Hermitage Museum Amsterdam → NL-NH-AMS-M-HM (PUBLISHED)      │
└─────────────────────────────────────────────────────────────────┘
         │
         │ (14 days pass, GHCID used in citations, APIs)
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-15: Historical Institution Added                        │
│   Amsterdam Historical Museum → Collides with published GHCID!  │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-HM
  
  Existing Institution (PUBLISHED 2025-11-01):
    - Name: Hermitage Museum Amsterdam
    - GHCID: NL-NH-AMS-M-HM  ← UNCHANGED (published, immutable)
    - Publication date: 2025-11-01T10:00:00Z
  
  New Institution (Being added 2025-11-15):
    - Name: Amsterdam Historical Museum (historical, 1926-2001)
    - GHCID: NL-NH-AMS-M-HM-amsterdam_historical_museum  ← Gets name suffix

Strategy: HISTORICAL_ADDITION
Reason: New institution collides with already published GHCID
Result: ONLY new institution receives name suffix; existing GHCID preserved

GHCID Records:

# Existing institution (UNCHANGED)
- id: existing-uuid-1234  # Unchanged
  ghcid_original: NL-NH-AMS-M-HM  # FROZEN (no name suffix)
  ghcid_current: NL-NH-AMS-M-HM   # FROZEN (no name suffix)
  name: Hermitage Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"  # PUBLISHED
    data_source: CSV_REGISTRY
  # No collision resolution metadata (institution not modified)

# New historical institution (Gets name suffix)
- id: new-uuid-5678  # UUID v5 of full GHCID with name suffix
  ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
  ghcid_current: NL-NH-AMS-M-HM-amsterdam_historical_museum
  name: Amsterdam Historical Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"
    publication_date: "2025-11-15T14:30:00Z"
    data_source: CONVERSATION_NLP
    notes: >-
      Historical addition collision with published GHCID NL-NH-AMS-M-HM
      (Hermitage Museum Amsterdam, published 2025-11-01). Added name suffix
      to preserve existing PID stability.      
  change_history:
    - change_type: FOUNDING
      event_date: "1926-01-01"
    - change_type: CLOSURE
      event_date: "2001-12-31"
      event_description: "Closed; collections transferred to Amsterdam Museum"

Why Preserve Existing GHCID?

The existing GHCID NL-NH-AMS-M-HM may already be:

  • Cited in academic publications
  • Referenced in third-party datasets (Europeana, Wikidata)
  • Used in API responses
  • Embedded in RDF triple stores

Changing it would break citations and external references → violates PID stability principle ("Cool URIs don't change").


Timeline Example 3: Multiple Historical Additions

Date: 2025-12-01
Event: Two historical naval museums discovered in archival research

┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import                                  │
│   Maritime Museum Amsterdam → NL-NH-AMS-M-MM (PUBLISHED)       │
└─────────────────────────────────────────────────────────────────┘
         │
         │ (30 days pass)
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2025-12-01: Historical Research Uncovers Two Naval Museums      │
│   Both collide with published NL-NH-AMS-M-MM                   │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-MM
  
  Existing Institution (PUBLISHED 2025-11-01):
    - Name: Maritime Museum Amsterdam
    - GHCID: NL-NH-AMS-M-MM  ← UNCHANGED
  
  New Institution 1 (Being added 2025-12-01):
    - Name: Dutch Navy Museum (historical, 1906-1955)
    - GHCID: NL-NH-AMS-M-MM-dutch_navy_museum  ← Gets name suffix
  
  New Institution 2 (Being added 2025-12-01):
    - Name: Amsterdam Naval Archive (historical, 1820-1901)
    - GHCID: NL-NH-AMS-M-MM-amsterdam_naval_archive  ← Gets name suffix

Strategy: HISTORICAL_ADDITION (multiple)
Reason: Multiple new institutions collide with same published GHCID
Result: All new institutions get name suffixes; existing GHCID unchanged

GHCID Records:

# Existing institution (UNCHANGED)
- id: maritime-uuid
  ghcid_original: NL-NH-AMS-M-MM  # No name suffix (published first)
  ghcid_current: NL-NH-AMS-M-MM
  name: Maritime Museum Amsterdam
  organization_status: ACTIVE
  provenance:
    publication_date: "2025-11-01T10:00:00Z"

# New historical institution 1
- id: navy-museum-uuid
  ghcid_original: NL-NH-AMS-M-MM-dutch_navy_museum
  ghcid_current: NL-NH-AMS-M-MM-dutch_navy_museum
  name: Dutch Navy Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1906-01-01"
    - change_type: CLOSURE
      event_date: "1955-12-31"

# New historical institution 2
- id: naval-archive-uuid
  ghcid_original: NL-NH-AMS-M-MM-amsterdam_naval_archive
  ghcid_current: NL-NH-AMS-M-MM-amsterdam_naval_archive
  name: Amsterdam Naval Archive
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1820-01-01"
    - change_type: CLOSURE
      event_date: "1901-12-31"

Pattern: When multiple historical institutions are added simultaneously and collide with the same existing GHCID:

  • Existing: No change
  • All new: Each gets name suffix derived from native language name

Collision Resolution Workflow Diagram

New Institution Detected
        │
        ▼
Generate Base GHCID
        │
        ▼
Check Existing Registry
        │
        ├─── No Collision Found ────────────► Use Base GHCID (no name suffix)
        │
        └─── Collision Found
                │
                ▼
        Check Publication Date of Existing Record
                │
                ├─── Existing Published ────────► HISTORICAL_ADDITION
                │                                   - Existing: UNCHANGED
                │                                   - New: Add name suffix
                │
                └─── Both Being Created ────────► FIRST_BATCH
                                                    - All: Add name suffixes

Implementation Guidance

When implementing collision resolution, always check:

  1. Does base GHCID exist in registry?

    • No → Use base GHCID without name suffix
    • Yes → Proceed to step 2
  2. Does existing record have publication_date?

    • Yes → HISTORICAL_ADDITION strategy (only new gets name suffix)
    • No → FIRST_BATCH strategy (all get name suffixes)
  3. Track resolution in provenance:

    provenance:
      collision_resolution:
        strategy: HISTORICAL_ADDITION | FIRST_BATCH
        collides_with: existing_ghcid (if historical addition)
        existing_publication_date: ISO 8601 timestamp
        reason: "Human-readable explanation"
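
The three checks above can be sketched in Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and `registry` is assumed to be a plain mapping from base GHCID to the existing record.

```python
from typing import Optional


def choose_collision_strategy(base_ghcid: str, registry: dict) -> Optional[str]:
    """Return the collision strategy, or None when there is no collision.

    `registry` is a hypothetical mapping from base GHCID to the existing
    record (a dict that may contain provenance.publication_date).
    """
    existing = registry.get(base_ghcid)
    if existing is None:
        return None  # Step 1: no collision, use the base GHCID as-is
    published = existing.get("provenance", {}).get("publication_date")
    # Step 2: a published record is frozen, so only the newcomer gets a suffix
    return "HISTORICAL_ADDITION" if published else "FIRST_BATCH"


def collision_provenance(strategy: str, base_ghcid: str, existing: dict, reason: str) -> dict:
    """Step 3: build the collision_resolution provenance block."""
    block = {"strategy": strategy, "reason": reason}
    if strategy == "HISTORICAL_ADDITION":
        block["collides_with"] = base_ghcid
        block["existing_publication_date"] = existing["provenance"]["publication_date"]
    return {"collision_resolution": block}
```

The presence of `publication_date` is the only signal used to distinguish the two strategies, which matches step 2: an unpublished record can still be renamed without breaking external references.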
    

Why This Matters: PID Stability Principle

Cool URIs Don't Change (W3C Architecture):

  • Once a persistent identifier is published, it should never be modified
  • External systems may reference the PID (citations, links, datasets)
  • Changing published PIDs breaks citations, causes 404 errors, violates trust

GHCID Approach:

  • First batch: Fair treatment, all receive name suffixes (no existing PIDs to protect)
  • Historical addition: Asymmetric treatment preserves existing PIDs (new institution accommodates)

Alternative (Rejected):

  • Could retroactively add name suffixes to existing records when collisions occur
  • Problem: Breaks existing citations, violates PID persistence guarantee
  • Principle: "First publisher wins" → existing GHCID has temporal priority

Migration and Transition Strategies

From ISIL to GHCID

Step 1: Preserve ISIL Codes

- id: uuid-from-isil-ghcid
  ghcid_original: NL-NH-2759794-M-RM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdRM  # Preserved!
    - identifier_scheme: GHCID
      identifier_value: NL-NH-2759794-M-RM

Step 2: Provide ISIL → GHCID Lookup

GET /isil/NL-AsdRM HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000

Step 3: Gradual Migration

  • Years 1-2: Dual identifiers (ISIL + GHCID)
  • Years 3-5: GHCID primary, ISIL secondary
  • Years 6+: GHCID primary, ISIL legacy

From Wikidata Q-Numbers

Strategy: Complement, don't replace

- id: ghcid-uuid
  identifiers:
    - identifier_scheme: Wikidata
      identifier_value: Q190804
      identifier_url: https://www.wikidata.org/wiki/Q190804

Bidirectional linking:

  • GHCID record → sameAs: wikidata:Q190804
  • Wikidata item → P-GHCID: uuid (proposed property)

Legacy System Integration

For systems requiring numeric IDs:

# Python: map GHCID UUID to numeric ID
# (ghcid_components is the GHCIDComponents object produced during generation)
uuid_str = "550e8400-e29b-41d4-a716-446655440000"
numeric_id = ghcid_components.to_numeric()  # e.g. 12345678901234567

-- SQL: store mapping in legacy database
INSERT INTO institutions (id, uuid_reference)
VALUES (12345678901234567, '550e8400-e29b-41d4-a716-446655440000');
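
One practical detail when storing the numeric form: Appendix B derives it from the first 8 bytes of a SHA-256 hash, so it is an unsigned 64-bit value and can exceed the range of a signed 64-bit column. The sketch below checks for that. The function names are hypothetical, and whether a given legacy database uses signed columns is an assumption to verify per system.

```python
import hashlib

SIGNED_BIGINT_MAX = 2**63 - 1  # upper bound of a signed 64-bit integer column


def ghcid_to_numeric(ghcid_string: str) -> int:
    """First 8 bytes of SHA-256 as an unsigned big-endian integer (as in Appendix B)."""
    digest = hashlib.sha256(ghcid_string.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


def fits_signed_bigint(value: int) -> bool:
    """True if the value can be stored in a signed 64-bit integer column."""
    return 0 <= value <= SIGNED_BIGINT_MAX


numeric_id = ghcid_to_numeric("NL-NH-2759794-M-RM")
print(numeric_id, fits_signed_bigint(numeric_id))
```

Values for which `fits_signed_bigint` returns False would need an unsigned or wider column type, or storage as a decimal string.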

For systems requiring ISIL:

# Provide ISIL fallback
isil_code = record.get_identifier('ISIL') or f"GHCID-{record.ghcid_numeric}"

Appendices

Appendix A: Collision Probability Calculations

Birthday Paradox Formula:

P(collision) ≈ n² / (2 × 2^bits)

Where:
  n = number of institutions
  bits = identifier bit length

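UUID v5 (128 bits):

n = 1,000,000 institutions
P = (10^6)² / (2 × 2^128)
P ≈ 1.5 × 10^-27
P ≈ 1.5 × 10^-25 %

Note: 6 of the 128 bits in a UUID v5 are fixed version/variant bits, so the
effective length is 122 bits, giving P ≈ 9.4 × 10^-26 (still negligible).

Numeric (64 bits):

n = 1,000,000 institutions
P = (10^6)² / (2 × 2^64)
P ≈ 2.7 × 10^-8
P ≈ 0.0000027%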

Conclusion: Both formats provide negligible collision risk for heritage domain (<10M institutions expected).
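
The figures can be reproduced with a few lines of Python. This is a direct evaluation of the approximation formula given above; the function name is illustrative.

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: P ≈ n² / (2 × 2^bits)."""
    return n**2 / (2 * 2**bits)


n = 1_000_000
print(f"128 bits: {collision_probability(n, 128):.2e}")  # 1.47e-27
print(f" 64 bits: {collision_probability(n, 64):.2e}")   # 2.71e-08
```

At the upper estimate of 10 million institutions, the 64-bit value rises by a factor of 100 (to about 2.7 × 10^-6), since the approximation grows with n².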

Appendix B: GHCID Generation Pseudocode

import hashlib
import uuid

def generate_ghcid(
    name: str,
    institution_type: InstitutionTypeEnum,
    country: str,  # ISO 3166-1 alpha-2
    region: str,   # ISO 3166-2 or GeoNames admin1
    city_geonames_id: int  # GeoNames ID
) -> GHCIDComponents:
    """
    Generate GHCID components from institution metadata.
    """
    # Normalize inputs
    country = country.upper()
    region = region.upper()
    
    # Convert GeoNames ID to string
    city_code = str(city_geonames_id)
    
    # Get institution type code
    type_code = INSTITUTION_TYPE_CODES[institution_type]  # "M", "L", "A", etc.
    
    # Generate abbreviation from emic (native language) name
    # Uses first letter of each significant word, skipping prepositions/articles
    abbreviation = extract_abbreviation_from_name(name)  # "RM", "LC", etc.
    
    # Construct GHCID string
    ghcid_string = f"{country}-{region}-{city_code}-{type_code}-{abbreviation}"
    
    # Generate UUID v5
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid_string)
    
    # Generate UUID SHA-256
    hash_bytes = hashlib.sha256(ghcid_string.encode('utf-8')).digest()
    ghcid_uuid_sha256 = uuid.UUID(bytes=hash_bytes[:16])  # Truncate to 128 bits
    
    # Generate numeric
    ghcid_numeric = int.from_bytes(hash_bytes[:8], byteorder='big')
    
    return GHCIDComponents(
        country=country,
        region=region,
        city_code=city_code,
        type_code=type_code,
        abbreviation=abbreviation,
        ghcid_string=ghcid_string,
        ghcid_uuid=str(ghcid_uuid),
        ghcid_uuid_sha256=str(ghcid_uuid_sha256),
        ghcid_numeric=ghcid_numeric
    )
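
The three derived formats can be tried out in isolation with the standard library. In this sketch, `EXAMPLE_NAMESPACE` is a hypothetical namespace created only for illustration; it is not the actual `GHCID_NAMESPACE`, so the UUIDs it yields will differ from real GHCID UUIDs.

```python
import hashlib
import uuid

# Hypothetical namespace for illustration only; the real GHCID_NAMESPACE
# is defined by the specification, not here.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://id.heritage.example.org/")

ghcid_string = "NL-NH-2759794-M-RM"

# UUID v5: SHA-1 based, deterministic for a given namespace and name
ghcid_uuid = uuid.uuid5(EXAMPLE_NAMESPACE, ghcid_string)

# SHA-256 based variants: first 16 bytes as a UUID, first 8 bytes as an integer
digest = hashlib.sha256(ghcid_string.encode("utf-8")).digest()
ghcid_uuid_sha256 = uuid.UUID(bytes=digest[:16])
ghcid_numeric = int.from_bytes(digest[:8], byteorder="big")

print(ghcid_uuid, ghcid_uuid.version)
print(ghcid_uuid_sha256)
print(ghcid_numeric)
```

Because every value is derived from the GHCID string alone, regenerating identifiers from the same string always yields the same results. Note that `uuid.UUID(bytes=...)` stores the 16 bytes unchanged, so the truncated SHA-256 UUID carries no version or variant bits unless they are set explicitly.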

Appendix C: References and Standards

IETF RFCs:

ISO Standards:

  • ISO 15511: International Standard Identifier for Libraries and Related Organizations (ISIL)
  • ISO 3166-1: Codes for the representation of names of countries and their subdivisions, Part 1: Country codes
  • ISO 3166-2: Codes for the representation of names of countries and their subdivisions, Part 2: Country subdivision codes
  • ISO 26324: Information and documentation — Digital object identifier system
  • ISO 21127: Information and documentation — CIDOC Conceptual Reference Model (CRM)

W3C Standards:

Heritage Standards:

GeoNames:

Related PID Systems:


Acknowledgments

This specification builds on the foundational work of:

  • ISIL agencies worldwide for pioneering library/archive identifiers
  • DOI Foundation for persistent identifier governance models
  • California Digital Library for ARK design principles
  • Wikimedia Foundation for crowdsourced identifier systems
  • GeoNames for geographic identifier infrastructure
  • Europeana and DPLA for cultural heritage aggregation standards

Special thanks to the heritage informatics community for feedback and guidance.


Version: 1.0
Date: 2025-11-06
Status: Draft for Community Review
Next Review: 2026-01-01
Contact: GLAM Data Extraction Project


License: This specification is released under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.