glam/docs/GHCID_PID_SCHEME.md
2025-12-01 16:06:34 +01:00

# Global Heritage Custodian Identifier (GHCID) Persistent Identifier Scheme
**Version**: 1.0
**Date**: 2025-11-06
**Status**: Formal Specification (Draft for Community Review)
**Authors**: GLAM Data Extraction Project
---
## Table of Contents
1. [Introduction](#introduction)
2. [Persistent Identifier Requirements](#persistent-identifier-requirements)
3. [GHCID Identifier Formats](#ghcid-identifier-formats)
4. [Resolution Architecture](#resolution-architecture)
5. [Governance Model](#governance-model)
6. [Comparison with Existing PID Systems](#comparison-with-existing-pid-systems)
7. [Implementation Roadmap](#implementation-roadmap)
8. [Technical Specifications](#technical-specifications)
9. [Use Cases and Applications](#use-cases-and-applications)
10. [Migration and Transition Strategies](#migration-and-transition-strategies)
---
## Introduction
### Purpose
The **Global Heritage Custodian Identifier (GHCID)** is a persistent identifier (PID) scheme designed to uniquely and permanently identify heritage custodian organizations worldwide—including galleries, libraries, archives, museums (GLAM), research centers, botanical gardens, collecting societies, and other cultural heritage institutions.
GHCID addresses critical gaps in existing identifier systems:
1. **Global Coverage**: Extends beyond ISIL's limited geographic coverage to include institutions in all countries
2. **Non-Registry Institutions**: Provides identifiers for organizations without ISIL codes (estimated 70-80% of heritage institutions worldwide)
3. **Change Tracking**: Models organizational evolution through mergers, splits, relocations, and name changes
4. **Multi-Format Support**: Offers human-readable, UUID, and numeric formats for diverse system requirements
5. **Linked Data Integration**: Aligns with W3C, ISO, and IETF standards for semantic web compatibility
### Scope
GHCID is intended for:
- **Heritage custodian organizations**: Museums, archives, libraries, galleries, research centers, botanical gardens, zoos, collecting societies, and heritage platforms
- **Cross-system references**: Citations, metadata aggregation, Linked Open Data (LOD) graphs
- **Long-term persistence**: Identifiers designed to remain stable for decades or centuries
- **Global interoperability**: Compatible with Europeana, DPLA, IIIF, Wikidata, GeoNames, and other aggregators
GHCID is NOT intended for:
- Individual collection items (use ARK, DOI, Handle for objects)
- Digital files or surrogates (use IIIF, ARK for digital objects)
- Person identifiers (use ORCID, ISNI, VIAF for people)
- Geographic locations (use GeoNames, OSM for places)
### Design Principles
1. **Transparency**: Publicly documented algorithms, verifiable by anyone
2. **Determinism**: Same input always produces same identifier
3. **Persistence**: Identifiers remain valid even if organizations change names or relocate
4. **Interoperability**: Compatible with existing PID systems (ISIL, VIAF, Wikidata)
5. **Open Standards**: Based on IETF RFCs, ISO standards, W3C recommendations
6. **No Vendor Lock-In**: Open-source implementation, no proprietary dependencies
---
## Persistent Identifier Requirements
### Core PID Properties
A persistent identifier system must satisfy:
1. **Uniqueness**: No two entities share the same identifier
2. **Persistence**: Identifiers remain valid indefinitely (decades to centuries)
3. **Resolvability**: Identifiers can be resolved to authoritative metadata
4. **Transparency**: Generation algorithm is publicly documented
5. **Governance**: Clear authority and policies for identifier assignment
6. **Actionability**: Identifiers can be used in URLs, APIs, citations
### GHCID Compliance
| Requirement | GHCID Implementation | Status |
|-------------|---------------------|--------|
| **Uniqueness** | UUID v5 (128-bit, of which 122 bits come from the hash; P(collision) ≈ 10^-25 for 1M institutions), SHA-256 fallback | ✅ Implemented |
| **Persistence** | Deterministic generation from stable metadata, `ghcid_original` frozen on first assignment | ✅ Implemented |
| **Resolvability** | HTTP resolution service (planned), JSON-LD API | 🔄 In Design |
| **Transparency** | Open-source code, RFC 4122 standard, public algorithm docs | ✅ Implemented |
| **Governance** | Community governance model (proposed), ISIL coordination | ⏳ Pending |
| **Actionability** | URN and HTTPS URI formats, embeddable in RDF/JSON-LD | ✅ Implemented |
---
## GHCID Identifier Formats
GHCID provides **four complementary identifier formats**, each optimized for specific use cases. All formats are **deterministic** and derived from the same underlying GHCID components.
### 1. Human-Readable GHCID String
**Format**: `{Country}-{Region}-{City}-{Type}-{Abbreviation}`
**Components**:
| Component | Format | Standard | Example |
|-----------|--------|----------|---------|
| **Country** | ISO 3166-1 alpha-2 (2 chars) | ISO 3166-1 | `NL`, `US`, `BR` |
| **Region** | ISO 3166-2 subdivision (2-3 chars) OR GeoNames admin1 code | ISO 3166-2, GeoNames | `NH` (Noord-Holland), `CA` (California) |
| **City** | GeoNames city code (3-4 chars, base36 encoding) | GeoNames | `2759794` (Amsterdam) |
| **Type** | Institution type (1 char) | GHCID taxonomy | `M` (Museum), `L` (Library), `A` (Archive) |
| **Abbreviation** | First letter of each word in emic name (2-10 chars) | Derived | `RM` (Rijksmuseum) |
**Examples**:
```
NL-NH-2759794-M-RM # Rijksmuseum, Amsterdam, Netherlands
US-DC-4140963-L-LC # Library of Congress, Washington DC, USA
BR-RJ-3451190-L-BNB # Biblioteca Nacional do Brasil, Rio de Janeiro, Brazil
GB-EN-2643743-M-BM # British Museum, London, United Kingdom
FR-IL-2988507-M-LM # Louvre Museum, Paris, France
```
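A minimal sketch of how the five components could be joined and checked. The regular expression is inferred from the component table and the examples above (numeric GeoNames IDs for the city, ASCII abbreviations); it is an assumption, not a normative grammar, and `build_ghcid` and `parse_ghcid` are hypothetical helper names.

```python
import re

# Pattern inferred from the examples in this section (assumption, not normative).
GHCID_PATTERN = re.compile(
    r"^(?P<country>[A-Z]{2})-"
    r"(?P<region>[A-Z0-9]{1,3})-"
    r"(?P<city>[0-9]+)-"
    r"(?P<type>[A-Z])-"
    r"(?P<abbreviation>[A-Z0-9]{2,10})$"
)


def build_ghcid(country: str, region: str, city: str, inst_type: str, abbreviation: str) -> str:
    """Join the components with hyphens and reject malformed results."""
    ghcid = "-".join(
        [country.upper(), region.upper(), str(city), inst_type.upper(), abbreviation.upper()]
    )
    if GHCID_PATTERN.match(ghcid) is None:
        raise ValueError(f"not a well-formed GHCID string: {ghcid!r}")
    return ghcid


def parse_ghcid(ghcid: str) -> dict:
    """Split a GHCID string back into its named components."""
    match = GHCID_PATTERN.match(ghcid)
    if match is None:
        raise ValueError(f"not a well-formed GHCID string: {ghcid!r}")
    return match.groupdict()


print(build_ghcid("nl", "nh", "2759794", "m", "rm"))
print(parse_ghcid("US-DC-4140963-L-LC")["abbreviation"])
```

Validating at construction time keeps malformed strings out of the hash-based formats, which would otherwise silently produce a different identifier.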
**Use Cases**:
- Academic citations
- Documentation and reports
- Debugging and logging
- Human-readable data exchange
**Persistence Note**:
- Organizations may relocate or change names over time
- **`ghcid_original`**: Frozen on first assignment, never changes (TRUE PID)
- **`ghcid_current`**: Updated if organization changes (convenience field)
- Both stored in record; `ghcid_original` used for citations and cross-system references
**Historical Institutions Rule** (Added 2025-11-06):
GHCID supports **historical heritage institutions** (e.g., 17th-century cabinet collections, defunct museums, closed archives) using modern geographic coordinates:
- **Geographic Components**: Country, region, and city codes are taken from where the institution's coordinates, as of the last recorded date of its existence, fall on a **modern world map (2025)**
- **Temporal Projection**: Historical locations are projected onto current ISO 3166-1/3166-2 and GeoNames geocoding standards
- **Abbreviation**: Institution abbreviations use the **first letter of each significant word in the official emic (native language) name**, skipping prepositions, articles, and conjunctions in all languages
- **Institution Type**: Uses the full GHCID taxonomy (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.) - historical context preserved in metadata, not identifier
- **Flexibility**: The GHCID format is deliberately designed to accommodate institutions from any historical period
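The abbreviation rule can be sketched as follows. The stop-word list here is a tiny illustrative sample (the rule covers all languages, so a real list would be far larger and curated), and `abbreviate` is a hypothetical helper name. Single-word names such as Rijksmuseum, abbreviated `RM` elsewhere in this document, evidently need an additional rule that this sketch does not attempt.

```python
import re

# Illustrative sample only; a production list would cover many more languages.
STOP_WORDS = {
    "de", "het", "een", "van", "voor", "en",      # Dutch
    "the", "of", "and", "for",                    # English
    "do", "da", "dos", "das", "e",                # Portuguese
    "le", "la", "les", "du", "des", "et",         # French
}


def abbreviate(emic_name: str) -> str:
    """First letter of each significant word, upper-cased."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    words = re.findall(r"[^\W\d_]+", emic_name)
    significant = [word for word in words if word.lower() not in STOP_WORDS]
    return "".join(word[0].upper() for word in significant)


print(abbreviate("Biblioteca Nacional do Brasil"))  # BNB
print(abbreviate("Noord-Hollands Archief"))         # NHA
print(abbreviate("Library of Congress"))            # LC
```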
**Examples of Historical Institutions**:
```
# Wunderkammer of Ole Worm (Copenhagen, closed 1654)
DK-84-2618425-P-OW # Denmark, Capital Region, Copenhagen (modern), Personal Collection, Ole Worm
# Bibliotheca Corviniana (Buda, 1490 - dispersed 1526)
HU-BU-3054643-L-BC # Hungary, Budapest (modern coords), Library, Bibliotheca Corviniana
# Cabinet of Curiosities of Ferdinand II (Innsbruck, 1620s)
AT-7-2775216-P-FII # Austria, Tyrol, Innsbruck (modern), Personal Collection, Ferdinand II
# Dutch East India Company Archives (Jakarta, 1602-1800)
ID-JK-1642911-C-VOC # Indonesia (modern country), Jakarta (modern), Corporation, VOC
```
**Rationale**:
- Historical institutions existed at specific geographic coordinates
- Modern political boundaries and city identifiers provide stable reference points
- Emic name abbreviations preserve original cultural/linguistic context while ensuring deterministic generation
- Metadata fields capture full historical context (founding/closure dates, historical names, organizational changes)
- GHCID history tracks temporal evolution via `ghcid_history` entries
- Enables citation of historical collections in modern scholarship using persistent identifiers
### 2. UUID v5 (SHA-1) - Primary Persistent Identifier
**Format**: RFC 4122 UUID v5 (128-bit, hyphenated)
**Algorithm**:
1. Construct GHCID string: `{Country}-{Region}-{City}-{Type}-{Abbreviation}`
2. Apply RFC 4122 UUID v5 with:
- **Namespace**: `6ba7b810-9dad-11d1-80b4-00c04fd430c8` (DNS namespace from RFC 4122)
- **Name**: GHCID string (UTF-8 encoded)
- **Hash**: SHA-1 (per RFC 4122 specification)
3. Format as UUID: `xxxxxxxx-xxxx-5xxx-yxxx-xxxxxxxxxxxx`
**Examples** (placeholder values for illustration, not computed hashes):
```
NL-NH-2759794-M-RM → 550e8400-e29b-41d4-a716-446655440000
US-DC-4140963-L-LC → 8b3e6f12-a4d5-5c89-b123-456789abcdef
```
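The algorithm above maps directly onto Python's standard `uuid` module. This is a minimal sketch assuming the GHCID string is passed unchanged as the UUID v5 name; `ghcid_to_uuid5` is a hypothetical helper name, and no expected output is shown because the example values in this section are placeholders.

```python
import uuid

# DNS namespace from RFC 4122, as named in the algorithm above.
GHCID_NAMESPACE = uuid.NAMESPACE_DNS


def ghcid_to_uuid5(ghcid: str) -> uuid.UUID:
    """Derive the UUID v5 (SHA-1 based) for a GHCID string."""
    # uuid5 encodes the name as UTF-8 before hashing.
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


rijksmuseum = ghcid_to_uuid5("NL-NH-2759794-M-RM")
print(f"urn:uuid:{rijksmuseum}")
print(rijksmuseum.version)  # 5
```

Because the function has no state, any party can recompute the UUID from the GHCID string and check it against a stored value.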
**Properties**:
- **Standard Compliance**: RFC 4122 (2005), IETF standard
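- **Collision Resistance**: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (birthday bound over the 122 hash-derived bits; 6 of the 128 bits are fixed version and variant bits)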
- **Deterministic**: Same GHCID always produces same UUID
- **Interoperable**: Compatible with Europeana, DPLA, IIIF, Wikidata
- **Transparent**: UUID v5 is available in the standard library or a widely used package of most major programming languages
**SHA-1 Safety**:
- SHA-1 is **deprecated for cryptographic security** (digital signatures, TLS)
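- SHA-1 is **appropriate for identifier generation** in a non-adversarial setting, where only accidental collisions matter
- The chance of an accidental UUID v5 collision is governed by the 122 hash-derived bits of the output, not by SHA-1's resistance to deliberate collision attacks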
- See [WHY_UUID_V5_SHA1.md](WHY_UUID_V5_SHA1.md) for detailed rationale
**Use Cases**:
- **Primary identifier** for all GHCID records
- RDF/JSON-LD `@id` field
- IIIF manifest identifiers
- Wikidata external ID
- Database foreign keys
- Cross-system references
**URN Format**:
```
urn:uuid:550e8400-e29b-41d4-a716-446655440000
```
**HTTPS URI Format** (with resolution service):
```
https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000
```
### 3. UUID SHA-256 (Future-Proof Alternative)
**Format**: RFC 9562 UUID v8 (128-bit, custom SHA-256)
**Algorithm**:
1. Construct GHCID string
2. Hash with SHA-256 → 256 bits
3. Truncate to first 128 bits (16 bytes)
4. Set version bits to `8` (custom/experimental)
5. Set variant bits to RFC 4122 (`0b10xxxxxx`)
**Examples**:
```
NL-NH-2759794-M-RM → a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
```
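The five steps above can be sketched with `hashlib` and `uuid` from the standard library. The version and variant bits are set by hand on the truncated digest, which avoids relying on the `version=` argument of `uuid.UUID`; `ghcid_to_uuid8_sha256` is a hypothetical helper name.

```python
import hashlib
import uuid


def ghcid_to_uuid8_sha256(ghcid: str) -> uuid.UUID:
    """Truncate SHA-256 to 128 bits and stamp UUID v8 version/variant bits."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    raw = bytearray(digest[:16])          # first 128 bits of the hash
    raw[6] = (raw[6] & 0x0F) | 0x80       # version nibble = 8
    raw[8] = (raw[8] & 0x3F) | 0x80       # variant bits = 10xxxxxx
    return uuid.UUID(bytes=bytes(raw))


sha256_uuid = ghcid_to_uuid8_sha256("NL-NH-2759794-M-RM")
print(sha256_uuid, sha256_uuid.version)
```

Overwriting six bits leaves 122 bits taken from the hash, the same amount as in UUID v5.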
**Properties**:
- **Cryptographic Strength**: SHA-256 (NIST-approved through 2030+)
- **Collision Resistance**: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (same as UUID v5: 122 hash-derived bits)
- **Future-Proof**: No known practical attacks against SHA-256
- **Deterministic**: Same GHCID always produces same UUID
- **Less Transparent**: Custom algorithm requires sharing implementation code
**Use Cases**:
- Security policies mandating SHA-256
- Future migration path if SHA-1 fully deprecated
- Custom identifier resolution services
- Internal systems with strict cryptographic requirements
**Status**: Generated alongside UUID v5, stored as secondary identifier
### 4. Numeric (64-bit Integer)
**Format**: Unsigned 64-bit integer (0 to 18,446,744,073,709,551,615)
**Algorithm**:
1. Hash GHCID string with SHA-256 → 256 bits
2. Extract first 8 bytes (64 bits)
3. Convert to unsigned integer (big-endian)
**Examples**:
```
NL-NH-2759794-M-RM → 12345678901234567
US-DC-4140963-L-LC → 98765432109876543
```
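A minimal sketch of the numeric derivation, assuming the same UTF-8 encoding of the GHCID string as the UUID formats. The `to_signed_64` helper is an addition that is not part of the scheme: it shows one way to fit the unsigned value into a signed 64-bit column without losing information.

```python
import hashlib


def ghcid_to_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256, read as a big-endian unsigned integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=False)


def to_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return value - 2**64 if value >= 2**63 else value


numeric_id = ghcid_to_numeric("NL-NH-2759794-M-RM")
print(numeric_id, to_signed_64(numeric_id))
```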
**Properties**:
- **Compact**: 8 bytes (vs. 36 bytes for UUID string)
- **Deterministic**: Same GHCID always produces same numeric ID
- **Fast Indexing**: Integer comparisons faster than string UUIDs
- **CSV-Friendly**: No special characters
- **Reduced Collision Resistance**: P(collision) ≈ 2.7 × 10^-8 for 1M institutions (still negligible)
**Use Cases**:
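- Database keys in 64-bit integer columns (PostgreSQL `BIGINT` is signed, so values of 2^63 or more need `NUMERIC(20, 0)` or a signed reinterpretation)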
- CSV exports for spreadsheet analysis
- Numeric sorting requirements
- Systems without UUID support
- Legacy system integration
**Limitations**:
- NOT recommended as primary PID (use UUID v5 instead)
- Suitable for heritage domain (<10M institutions expected)
- For 100M institutions the collision probability rises to about 0.027% and grows quadratically beyond that, so the risk is no longer negligible
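The collision figures for the hash-based formats follow from the birthday bound, P ≈ 1 − exp(−n(n−1) / 2N) for n identifiers drawn from N possible values. The sketch below evaluates it for the 64-bit numeric format and for the 122 hash-derived bits of a UUID; `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_ids: int, hash_bits: int) -> float:
    """Birthday-bound probability of at least one collision among n_ids values."""
    pairs = n_ids * (n_ids - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2**hash_bits)


print(collision_probability(10**6, 64))    # about 2.7e-08
print(collision_probability(10**8, 64))    # about 2.7e-04, i.e. 0.027%
print(collision_probability(10**6, 122))   # about 9.4e-26
```

The probability grows with the square of the number of identifiers, which is why the 64-bit format is suited to a domain of millions of institutions but not to item-level identifiers.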
---
## Resolution Architecture
### Resolution Service Design
A persistent identifier is only as persistent as its resolution infrastructure. GHCID requires a **long-term, reliable resolution service** to resolve identifiers to authoritative metadata.
### Resolver Endpoints
**Base URL**: `https://id.heritage.example.org/` (example domain)
| Endpoint Pattern | Format | Example |
|-----------------|--------|---------|
| `/uuid/{uuid}` | UUID v5 | `/uuid/550e8400-e29b-41d4-a716-446655440000` |
| `/uuid-sha256/{uuid}` | UUID SHA-256 | `/uuid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d` |
| `/numeric/{id}` | Numeric | `/numeric/12345678901234567` |
| `/ghcid/{string}` | Human-readable | `/ghcid/NL-NH-2759794-M-RM` |
**All four endpoints resolve to the SAME institutional record.**
### Resolution Protocol
**HTTP GET Request**:
```http
GET /uuid/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json
```
**HTTP Response** (JSON-LD):
```json
{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": ["HeritageCustodian", "schema:Museum", "org:Organization"],
  "@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
  "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
  "ghcid_original": "NL-NH-2759794-M-RM",
  "ghcid_current": "NL-NH-2759794-M-RM",
  "name": "Rijksmuseum",
  "alternateName": ["Rijksmuseum Amsterdam", "Rijks"],
  "description": "The national museum of the Netherlands, dedicated to arts and history.",
  "institution_type": "MUSEUM",
  "url": "https://www.rijksmuseum.nl",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q190804",
    "https://viaf.org/viaf/131511535",
    "urn:isil:NL-AsdRM"
  ],
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Museumstraat 1",
    "addressLocality": "Amsterdam",
    "postalCode": "1071 XX",
    "addressCountry": "NL",
    "geonames": "https://sws.geonames.org/2759794/"
  },
  "foundingDate": "1800-01-01",
  "provenance": {
    "data_source": "CSV_REGISTRY",
    "data_tier": "TIER_1_AUTHORITATIVE",
    "extraction_date": "2025-11-06T10:30:00Z"
  }
}
```
**Content Negotiation**:
| Accept Header | Response Format | Use Case |
|---------------|----------------|----------|
| `application/ld+json` | JSON-LD | Linked Data applications |
| `application/json` | Plain JSON | APIs, JavaScript |
| `text/turtle` | RDF Turtle | SPARQL, semantic web |
| `application/rdf+xml` | RDF/XML | Legacy RDF systems |
| `text/html` | HTML landing page | Human browsing |
| `text/plain` | Plain text summary | Simple debugging |
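One way a resolver might choose among the formats in the table is sketched below. It handles quality values and wildcards in a simplified way and is not a full implementation of HTTP content negotiation; the JSON-LD default and the `None` result (which a server could answer with 406 Not Acceptable) are assumptions, as is the function name `negotiate`.

```python
from typing import Optional

# Formats from the content negotiation table, in the order listed there.
SUPPORTED = [
    "application/ld+json",
    "application/json",
    "text/turtle",
    "application/rdf+xml",
    "text/html",
    "text/plain",
]


def negotiate(accept_header: str, default: str = "application/ld+json") -> Optional[str]:
    """Pick a response media type for an Accept header, or None if nothing fits."""
    if not accept_header.strip():
        return default
    ranges = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        ranges.append((-quality, position, media_range))
    # Highest quality first; ties keep the order given by the client.
    for neg_quality, _, media_range in sorted(ranges):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_range == "*/*":
            return default
        if media_range in SUPPORTED:
            return media_range
        if media_range.endswith("/*"):
            prefix = media_range[:-1]  # e.g. "text/"
            for candidate in SUPPORTED:
                if candidate.startswith(prefix):
                    return candidate
    return None


print(negotiate("text/html;q=0.5, application/ld+json"))
```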
### HTTP Status Codes
| Code | Meaning | When Used |
|------|---------|-----------|
| **200 OK** | Identifier resolved successfully | Record found |
| **303 See Other** | Redirect to canonical URL | Multiple URLs for same resource |
| **404 Not Found** | Identifier not in registry | Unknown GHCID |
| **410 Gone** | Institution closed/merged | Record marked as inactive |
| **500 Internal Server Error** | Resolver malfunction | Service downtime |
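The status code table can be read as a small decision procedure. The sketch below assumes an in-memory registry keyed by UUID and an alias map for non-canonical paths; the `Record` class, the `resolve` function, and the use of a `Link` header to point at a successor are illustrative assumptions rather than part of the specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    uuid: str
    status: str = "ACTIVE"                  # "ACTIVE" or "CLOSED"
    successor_uuid: Optional[str] = None    # set when a successor exists


def resolve(
    path: str, records: Dict[str, Record], aliases: Dict[str, str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status code, response headers) for a resolver request path."""
    if path in aliases:
        # Non-canonical identifier form: redirect to the UUID URL.
        return 303, {"Location": f"/uuid/{aliases[path]}"}
    if not path.startswith("/uuid/"):
        return 404, {}
    record = records.get(path[len("/uuid/"):])
    if record is None:
        return 404, {}
    if record.status != "ACTIVE":
        headers: Dict[str, str] = {}
        if record.successor_uuid:
            headers["Link"] = f'</uuid/{record.successor_uuid}>; rel="successor-version"'
        return 410, headers
    return 200, {"Content-Type": "application/ld+json"}


registry = {"550e8400-e29b-41d4-a716-446655440000": Record("550e8400-e29b-41d4-a716-446655440000")}
alias_map = {"/ghcid/NL-NH-2759794-M-RM": "550e8400-e29b-41d4-a716-446655440000"}
print(resolve("/ghcid/NL-NH-2759794-M-RM", registry, alias_map))
```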
### Persistence Commitment
**Requirement**: Resolution service must commit to:
1. **Minimum 50-year operation** (heritage institutions have multi-century lifespans)
2. **High availability** (99.9% uptime SLA)
3. **Multi-region redundancy** (geographic distribution)
4. **Daily backups** with disaster recovery plan
5. **Transparent governance** (public policies, community oversight)
6. **Open-source resolver code** (forkable by community if needed)
**Funding Model** (options):
- Grant funding (national libraries, heritage foundations)
- Membership fees (GLAM consortia, aggregators)
- Government support (cultural heritage agencies)
- Cloud provider donations (Google, AWS, Azure)
---
## Governance Model
### Organizational Structure
**GHCID is proposed as a community-governed persistent identifier scheme**, modeled on successful PID systems like DOI, ARK, and Handle.
#### Proposed Governance Body
**GHCID Consortium** (working name):
1. **Steering Committee** (7-9 members)
- Representatives from: National libraries, international archives, museum networks
- Terms: 3 years, staggered rotation
- Responsibilities: Policy decisions, budget oversight, strategic direction
2. **Technical Working Group**
- Developers, data scientists, Linked Data experts
- Responsibilities: Specification updates, resolver development, community tools
3. **Community Advisory Board**
- Heritage institutions, researchers, aggregators (Europeana, DPLA)
- Responsibilities: Use case feedback, adoption guidance
4. **Secretariat**
- Permanent staff (2-3 FTE)
- Responsibilities: Day-to-day operations, resolver maintenance, documentation
### Coordination with Existing Systems
GHCID does NOT replace existing identifier systems; it **complements and coordinates** with:
1. **ISIL (ISO 15511)**
- Store ISIL codes as secondary identifiers
- Cross-reference GHCID ↔ ISIL mapping
- Collaborate with national ISIL agencies
2. **Wikidata**
- Propose GHCID as new external identifier property
- Link GHCID records to Wikidata Q-numbers
- Enable bidirectional cross-referencing
3. **VIAF (Virtual International Authority File)**
- For institutions with VIAF records, store VIAF ID
- Coordinate with OCLC on authority control
4. **GeoNames**
- Use GeoNames IDs for geographic components
- Link GHCID locations to GeoNames URIs
5. **Europeana / DPLA**
- Integrate GHCID into aggregator metadata
- Use UUID v5 format for interoperability
### Identifier Assignment Policies
**Who can assign a GHCID?**
Option 1: **Open Generation** (preferred for transparency)
- Anyone can generate a GHCID using the open-source algorithm
- Deterministic generation ensures same institution → same ID
- Conflicts resolved via community review
Option 2: **Registry-Based** (traditional PID model)
- Institutions apply to GHCID Consortium for assignment
- Manual review ensures accuracy
- Slower, but higher quality control
**Recommendation**: Hybrid approach
- Open generation for most institutions (self-service)
- Optional manual review for complex cases (mergers, disputes)
- Community validation via Wikidata, ISIL cross-checks
### Dispute Resolution
**Scenario**: Two GHCID records claim to represent the same institution
**Resolution Process**:
1. Automated detection (name similarity, ISIL code match)
2. Community flagging (anyone can report suspected duplicates)
3. Review by Technical Working Group
4. Merge records, redirect old GHCID to canonical GHCID (HTTP 303)
5. Update provenance metadata with merge event
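Step 1 of the process, automated detection, could look roughly like this. The 0.9 similarity threshold, the same-country requirement, and the record field names other than `isil_code` are assumptions chosen for illustration; flagged pairs would still go to human review as described above.

```python
from difflib import SequenceMatcher


def likely_duplicates(a: dict, b: dict, name_threshold: float = 0.9) -> bool:
    """Flag two records as possible duplicates by ISIL match or name similarity."""
    # A shared ISIL code is treated as a strong signal on its own.
    if a.get("isil_code") and a.get("isil_code") == b.get("isil_code"):
        return True
    # Otherwise only compare names within the same country.
    if a.get("country") != b.get("country"):
        return False
    ratio = SequenceMatcher(None, a["name"].casefold(), b["name"].casefold()).ratio()
    return ratio >= name_threshold


first = {"name": "Noord-Hollands Archief", "country": "NL", "isil_code": None}
second = {"name": "Noord-Hollands archief", "country": "NL", "isil_code": None}
print(likely_duplicates(first, second))  # True
```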
### Versioning and Deprecation
**GHCID Specification Versioning**:
- Semantic versioning: `MAJOR.MINOR.PATCH`
- Current version: `1.0.0` (this document)
- Backward compatibility guaranteed within a MAJOR version
**Identifier Deprecation**:
- GHCIDs are **never deleted** (persistence requirement)
- Closed/merged institutions marked as `organization_status: CLOSED`
- HTTP 410 Gone response with pointer to successor organization
- Change history tracked in `change_history` field
---
## Comparison with Existing PID Systems
### Feature Comparison Table
| Feature | **GHCID** | **ISIL** | **DOI** | **ARK** | **Handle** | **Wikidata** |
|---------|----------|---------|---------|---------|------------|--------------|
| **Domain** | Heritage institutions | Libraries/archives | Scholarly objects | Cultural heritage | Digital objects | Entities (all types) |
| **Coverage** | Global (any institution) | Limited (registry-based) | Scholarly publications | Libraries, museums | Repositories | Global (crowdsourced) |
| **Registration** | Open (deterministic) | Required | Required | Required | Required | Open (crowdsourced) |
| **Format** | UUID, GHCID string, numeric | Country-local code | `10.xxxx/yyyy` | `ark:/nnnnn/xxx` | `hdl:xxxx/yyyy` | `Q12345` |
| **Resolution** | HTTPS (planned) | No standard resolver | doi.org | n2t.net | handle.net | wikidata.org |
| **Governance** | Proposed consortium | National agencies | IDF (non-profit) | CDL (California) | CNRI (non-profit) | Wikimedia Foundation |
| **Cost** | Free (open) | Free | Paid (varies) | Free | Paid (varies) | Free |
| **Standard** | RFC 4122, ISO 3166 | ISO 15511 | ISO 26324 | IETF draft | IETF RFC 3650 | Community-driven |
| **Change Tracking** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Edit history |
| **Multi-Format** | ✅ 4 formats | ❌ Single | ❌ Single | ❌ Single | ❌ Single | ❌ Single |
| **Adoption** | New (0) | High (libraries) | Very High | Medium | Medium | Very High |
### Unique GHCID Advantages
1. **Change History Tracking**: Built-in organizational evolution modeling (mergers, splits, relocations)
2. **Multi-Format Flexibility**: Human-readable, UUID, numeric formats from same base ID
3. **Open Generation**: Deterministic algorithm, no registration bureaucracy
4. **Global Coverage**: Not limited to countries with ISIL registries
5. **Linked Data Native**: Designed for RDF/JSON-LD from the start
6. **Data Quality Tiers**: 4-tier provenance system (TIER_1 through TIER_4)
### Why Not Just Use Wikidata?
**Wikidata is excellent but has limitations for PIDs**:
| Aspect | Wikidata | GHCID |
|--------|----------|-------|
| **Identifier Format** | `Q12345` (sequential) | UUID (content-addressed) |
| **Determinism** | ❌ No (assigned sequentially) | ✅ Yes (hash-based) |
| **Regeneration** | ❌ Lost if database corrupted | ✅ Can regenerate from metadata |
| **Governance** | Wikimedia Foundation | Heritage community |
| **Specialization** | General knowledge base | Heritage institutions only |
| **Provenance** | Edit history | Structured provenance model |
| **Data Quality** | Crowdsourced (variable) | Tiered quality system |
**Recommendation**: Use GHCID as **primary PID**, link to Wikidata Q-number as **secondary identifier**
---
## Implementation Roadmap
### Phase 1: Foundation (2025 Q1-Q2) ✅ IN PROGRESS
**Status**: Currently implementing
**Deliverables**:
- ✅ GHCID specification document (this document)
- ✅ UUID generation library (`src/glam_extractor/identifiers/ghcid.py`)
- ✅ LinkML schema with GHCID fields (`schemas/core.yaml`)
- ✅ Test suite (UUID determinism, collision resistance)
- 🔄 GeoNames integration for city codes
- 🔄 ISO 3166-2 lookup tables
**Milestones**:
- [x] GHCID format design
- [x] UUID v5 implementation
- [x] UUID SHA-256 implementation
- [x] Numeric ID implementation
- [ ] GeoNames geocoding service
- [ ] ISO 3166-2 reference data
### Phase 2: Data Production (2025 Q2-Q3)
**Deliverables**:
- Dutch institutions with GHCID (1,351 organizations)
- ISIL registry with GHCID (364 institutions)
- Conversation data with GHCID (estimated 2,000-5,000 institutions)
- Cross-linked dataset (merged by GHCID UUID)
**Milestones**:
- [ ] Generate GHCID for all Dutch datasets
- [ ] Generate GHCID for conversation extractions
- [ ] Cross-link by UUID v5
- [ ] Publish test dataset (100 institutions)
### Phase 3: Resolution Service (2025 Q4)
**Deliverables**:
- GHCID resolver prototype (Python FastAPI)
- JSON-LD API endpoint
- HTML landing pages
- Content negotiation support
- Registry database (PostgreSQL + RDF triplestore)
**Milestones**:
- [ ] Resolver API implementation
- [ ] Database schema for GHCID registry
- [ ] Load Dutch dataset into resolver
- [ ] Public demo deployment
### Phase 4: Community Engagement (2026 Q1-Q2)
**Deliverables**:
- GHCID specification v1.0 (finalized)
- Outreach to Europeana, DPLA, IIIF communities
- Proposal to Wikidata for new external ID property
- Coordination meetings with ISIL agencies
- Community feedback incorporation
**Milestones**:
- [ ] Present at CIDOC, ICA, IFLA conferences
- [ ] Publish RFC or W3C Community Group Note
- [ ] Partner with 3-5 heritage institutions for pilot
- [ ] Gather feedback, iterate on specification
### Phase 5: Governance Establishment (2026 Q3-Q4)
**Deliverables**:
- GHCID Consortium formation
- Steering Committee election
- Long-term funding secured
- Resolver production deployment
- Governance policies published
**Milestones**:
- [ ] Incorporate GHCID Consortium (non-profit)
- [ ] Secure 3-year funding commitment
- [ ] Deploy production resolver (multi-region)
- [ ] Establish community governance processes
### Phase 6: Scaling and Adoption (2027+)
**Deliverables**:
- Global dataset (50,000+ institutions)
- Integration with major aggregators
- Resolver SLA (99.9% uptime)
- Annual community meetings
- Ongoing maintenance and updates
**Milestones**:
- [ ] 10,000 institutions with GHCID
- [ ] Europeana integration
- [ ] DPLA integration
- [ ] Wikidata property approved
- [ ] 50-year persistence commitment
---
## Technical Specifications
### Data Model
**Core GHCID Fields** (LinkML schema):
```yaml
HeritageCustodian:
  slots:
    # Primary identifiers
    - id                  # UUID v5 (primary key)
    - record_id           # UUID v7 (database PK, time-ordered)
    - ghcid_uuid          # UUID v5 (same as id)
    - ghcid_uuid_sha256   # UUID SHA-256 (future-proof)
    - ghcid_numeric       # Numeric (64-bit)
    # Human-readable GHCIDs
    - ghcid_current       # Current GHCID string (may change)
    - ghcid_original      # Original GHCID string (FROZEN, true PID)
    # GHCID history
    - ghcid_history       # List of GHCIDHistoryEntry
```
**GHCID History Entry**:
```yaml
GHCIDHistoryEntry:
  description: Tracks changes to GHCID over time (relocations, name changes)
  slots:
    - ghcid_value     # GHCID string at this point in time
    - valid_from      # ISO 8601 date (when this GHCID became active)
    - valid_to        # ISO 8601 date (when this GHCID was superseded)
    - change_reason   # Reason for change (relocation, name change, etc.)
    - related_event   # Link to ChangeEvent if applicable
```
**Example**:
```yaml
# Noord-Hollands Archief (formed 2001 via merger)
id: 550e8400-e29b-41d4-a716-446655440000  # UUID v5
ghcid_original: NL-NH-2750053-A-NHA  # Frozen forever
ghcid_current: NL-NH-2750053-A-NHA  # Same (no changes yet)
ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: null  # Still valid
    change_reason: FOUNDING
    related_event: ghcid:event-nha-merger-2001
---
# If it relocates in 2030, ghcid_history becomes:
ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: "2030-06-15"
  - ghcid_value: NL-NH-9876543-A-NHA  # New city GeoNames ID
    valid_from: "2030-06-15"
    valid_to: null
    change_reason: RELOCATION
    related_event: ghcid:event-nha-relocation-2030
```
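The history model can be exercised with a few lines of Python. This is a sketch under two assumptions: validity intervals are half-open (an entry stops applying on its `valid_to` date, when the next one starts), and the helper names `record_change` and `ghcid_on` are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class GHCIDHistoryEntry:
    ghcid_value: str
    valid_from: date
    valid_to: Optional[date] = None
    change_reason: Optional[str] = None
    related_event: Optional[str] = None


@dataclass
class Custodian:
    ghcid_original: str   # frozen on first assignment
    ghcid_current: str
    ghcid_history: List[GHCIDHistoryEntry] = field(default_factory=list)


def record_change(custodian: Custodian, new_ghcid: str, on: date, reason: str,
                  event: Optional[str] = None) -> None:
    """Close the open history entry and append the new GHCID string."""
    if custodian.ghcid_history and custodian.ghcid_history[-1].valid_to is None:
        custodian.ghcid_history[-1].valid_to = on
    custodian.ghcid_history.append(GHCIDHistoryEntry(new_ghcid, on, None, reason, event))
    custodian.ghcid_current = new_ghcid  # ghcid_original is deliberately left alone


def ghcid_on(custodian: Custodian, day: date) -> Optional[str]:
    """Return the GHCID string that was valid on a given day, if any."""
    for entry in custodian.ghcid_history:
        if entry.valid_from <= day and (entry.valid_to is None or day < entry.valid_to):
            return entry.ghcid_value
    return None


nha = Custodian(
    "NL-NH-2750053-A-NHA",
    "NL-NH-2750053-A-NHA",
    [GHCIDHistoryEntry("NL-NH-2750053-A-NHA", date(2001, 1, 1), change_reason="FOUNDING")],
)
record_change(nha, "NL-NH-9876543-A-NHA", date(2030, 6, 15), "RELOCATION")
print(nha.ghcid_original, nha.ghcid_current, ghcid_on(nha, date(2010, 1, 1)))
```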
### API Specification
**Resolver API Endpoints**:
#### 1. Resolve by UUID v5
```http
GET /uuid/{uuid} HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json
Response: 200 OK
Content-Type: application/ld+json
{
"@context": "https://w3id.org/heritage/custodian/context.jsonld",
"@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
...
}
```
#### 2. Resolve by GHCID String
```http
GET /ghcid/NL-NH-2759794-M-RM HTTP/1.1
Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```
#### 3. Search by Name
```http
GET /search?name=Rijksmuseum&country=NL HTTP/1.1
Response: 200 OK
Content-Type: application/json
{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Rijksmuseum",
      "ghcid": "NL-NH-2759794-M-RM",
      "url": "https://www.rijksmuseum.nl"
    }
  ],
  "total": 1
}
```
#### 4. Reverse Lookup (Numeric → UUID)
```http
GET /numeric/12345678901234567 HTTP/1.1
Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```
### Database Schema (PostgreSQL)
```sql
CREATE TABLE ghcid_registry (
    -- Primary keys
    id UUID PRIMARY KEY,                           -- UUID v5 (primary identifier)
    record_id UUID UNIQUE NOT NULL,                -- UUID v7 (database record ID)
    -- Identifiers
    ghcid_uuid UUID UNIQUE NOT NULL,               -- UUID v5 (same as id)
    ghcid_uuid_sha256 UUID UNIQUE NOT NULL,        -- UUID SHA-256
    ghcid_numeric NUMERIC(20, 0) UNIQUE NOT NULL,  -- Numeric (unsigned 64-bit; signed BIGINT would overflow)
    ghcid_original VARCHAR(100) UNIQUE NOT NULL,   -- Frozen GHCID string
    ghcid_current VARCHAR(100) NOT NULL,           -- Current GHCID string
    -- Metadata
    name TEXT NOT NULL,
    institution_type VARCHAR(50) NOT NULL,
    organization_status VARCHAR(20) DEFAULT 'ACTIVE',
    -- Geographic
    country CHAR(2) NOT NULL,                      -- ISO 3166-1
    region VARCHAR(10),                            -- ISO 3166-2
    city_geonames_id INTEGER,                      -- GeoNames ID
    -- External identifiers
    isil_code VARCHAR(50),
    wikidata_id VARCHAR(20),
    viaf_id VARCHAR(50),
    -- Provenance
    data_source VARCHAR(50) NOT NULL,
    data_tier VARCHAR(20) NOT NULL,
    extraction_date TIMESTAMPTZ NOT NULL,
    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes: PostgreSQL has no inline INDEX clause, and UNIQUE columns
-- (such as ghcid_original) are already indexed by their constraint.
CREATE INDEX idx_ghcid_current ON ghcid_registry (ghcid_current);
CREATE INDEX idx_name ON ghcid_registry (name);
CREATE INDEX idx_country_region ON ghcid_registry (country, region);
CREATE INDEX idx_isil ON ghcid_registry (isil_code);
CREATE INDEX idx_wikidata ON ghcid_registry (wikidata_id);

CREATE TABLE ghcid_history (
    id SERIAL PRIMARY KEY,
    ghcid_uuid UUID NOT NULL REFERENCES ghcid_registry(id),
    ghcid_value VARCHAR(100) NOT NULL,
    valid_from DATE NOT NULL,
    valid_to DATE,
    change_reason VARCHAR(50),
    related_event_id VARCHAR(200)
);

CREATE INDEX idx_ghcid_uuid ON ghcid_history (ghcid_uuid);
CREATE INDEX idx_valid_dates ON ghcid_history (valid_from, valid_to);
```
---
## GeoNames Settlement Resolution
### Overview
The City component of GHCID relies on GeoNames for standardized settlement names and geographic resolution. This section defines critical rules for GeoNames integration.
### Feature Code Filtering (CRITICAL)
**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**
GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.
#### ALLOWED Feature Codes
| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague |
#### EXCLUDED Feature Codes
| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters |
**Problem Example**:
Without feature code filtering, reverse geocoding may return:
- ❌ "Binnenstad" (PPLX, neighborhood, pop 4,900) - WRONG
- ✅ "Apeldoorn" (PPL, city, pop 136,670) - CORRECT
#### SQL Implementation
```sql
SELECT
name, ascii_name, admin1_code, admin1_name,
latitude, longitude, geonames_id, population, feature_code,
((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) as distance_sq
FROM cities
WHERE country_code = ?
AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance_sq
LIMIT 1
```
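The same filter can be applied in application code. This sketch swaps the squared-degree distance of the SQL query for a haversine great-circle distance, which also behaves sensibly away from the equator; the record layout mirrors the column names in the query, and the place coordinates in the sample data are illustrative values, not GeoNames data.

```python
import math
from typing import Iterable, Optional

ALLOWED_FEATURE_CODES = frozenset(
    {"PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG"}
)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points."""
    radius_km = 6371.0088
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_settlement(lat: float, lon: float, country_code: str,
                       places: Iterable[dict]) -> Optional[dict]:
    """Nearest place in the country whose feature code is allowed (never PPLX)."""
    candidates = [
        place for place in places
        if place["country_code"] == country_code
        and place["feature_code"] in ALLOWED_FEATURE_CODES
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda place: haversine_km(lat, lon, place["latitude"], place["longitude"]),
    )


# Illustrative coordinates: the neighbourhood is closer but must be skipped.
sample_places = [
    {"name": "Binnenstad", "feature_code": "PPLX", "country_code": "NL",
     "latitude": 52.2112, "longitude": 5.9698},
    {"name": "Apeldoorn", "feature_code": "PPL", "country_code": "NL",
     "latitude": 52.2100, "longitude": 5.9694},
]
print(nearest_settlement(52.21116, 5.96978, "NL", sample_places)["name"])  # Apeldoorn
```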
### Country Code Detection
**CRITICAL**: Determine country code from entry data BEFORE calling GeoNames reverse geocoding.
GeoNames queries are country-specific. Using the wrong country code will return incorrect results.
**Country Code Resolution Priority**:
1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse from address string
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata
7. Default: `"NL"` (Netherlands) - Only if no other source
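The priority list can be expressed as an ordered lookup. The field paths come from the list above; the country-name table used for steps 5 and 6 is a tiny illustrative sample, the substring matching is deliberately naive, and the helper names are hypothetical.

```python
from typing import Any, Optional

# Illustrative sample; a real table would cover all ISO 3166-1 countries.
COUNTRY_NAME_TO_CODE = {
    "netherlands": "NL", "nederland": "NL",
    "belgium": "BE", "belgië": "BE",
    "germany": "DE", "deutschland": "DE",
}


def _dig(data: Any, *keys: str) -> Any:
    """Follow nested dict keys, returning None if any level is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def _as_code(value: Any) -> Optional[str]:
    """Accept only two-letter alphabetic strings as country codes."""
    if isinstance(value, str) and len(value.strip()) == 2 and value.strip().isalpha():
        return value.strip().upper()
    return None


def _code_from_text(text: Any) -> Optional[str]:
    """Naive lookup of a known country name inside free text."""
    if not isinstance(text, str):
        return None
    lowered = text.casefold()
    for name, code in COUNTRY_NAME_TO_CODE.items():
        if name in lowered:
            return code
    return None


def resolve_country_code(entry: dict) -> str:
    """Walk the sources in priority order; fall back to NL only at the end."""
    explicit = [
        _dig(entry, "zcbs_enrichment", "country"),
        _dig(entry, "location", "country"),
    ]
    locations = entry.get("locations")
    if isinstance(locations, list):
        explicit.extend(_dig(location, "country") for location in locations)
    explicit.append(_dig(entry, "original_entry", "country"))
    for value in explicit:
        code = _as_code(value)
        if code:
            return code
    inferred = (
        _code_from_text(_dig(entry, "google_maps_enrichment", "address"))
        or _code_from_text(_dig(entry, "wikidata_enrichment", "located_in", "label"))
    )
    return inferred or "NL"


print(resolve_country_code({"google_maps_enrichment": {"address": "Stad 1, 3930 Hamont, Belgium"}}))
```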
### Provenance Tracking
Record GeoNames resolution in entry metadata:
```yaml
location_resolution:
  method: REVERSE_GEOCODE
  geonames_id: 2759706
  geonames_name: Apeldoorn
  feature_code: PPL  # MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
  admin1_code: '03'
  region_code: GE
  country_code: NL
  source_coordinates:
    latitude: 52.21116
    longitude: 5.96978
  distance_km: 0.5
```
**Validation**: If `feature_code: PPLX` appears in metadata, the GHCID is WRONG and must be regenerated.
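
A minimal sketch of this check, assuming entries are plain dictionaries shaped like the YAML above. The function name `has_invalid_feature_code` is hypothetical, and entries without a recorded feature code are deliberately not flagged here.

```python
ALLOWED_FEATURE_CODES = {"PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG"}


def has_invalid_feature_code(entry: dict) -> bool:
    """True when a recorded feature code is outside the allowed set (e.g. PPLX)."""
    resolution = entry.get("location_resolution") or {}
    feature_code = resolution.get("feature_code")
    # Only flag entries that actually recorded a feature code.
    return feature_code is not None and feature_code not in ALLOWED_FEATURE_CODES


entries = [
    {"location_resolution": {"geonames_name": "Apeldoorn", "feature_code": "PPL"}},
    {"location_resolution": {"geonames_name": "Binnenstad", "feature_code": "PPLX"}},
]
to_regenerate = [e for e in entries if has_invalid_feature_code(e)]
print(len(to_regenerate))
```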
---
## Use Cases and Applications
### 1. Academic Citations
**Scenario**: Researcher cites archival collection in academic paper
**Without GHCID**:
```
"See the Municipal Archives of Haarlem for records from 1245-1800."
```
Problem: Name may change, institution may merge, citation becomes ambiguous
**With GHCID**:
```
"See the Noord-Hollands Archief (urn:uuid:550e8400-e29b-41d4-a716-446655440000)
for Haarlem municipal records from 1245-1800."
```
Benefit: Persistent identifier remains valid even if organization changes name or merges
### 2. Metadata Aggregation (Europeana, DPLA)
**Scenario**: Europeana aggregates metadata from 4,000 institutions
**Without GHCID**:
```json
{
  "institution": "National Library",
  "country": "Netherlands"
}
```
Problem: Multiple "National Library" institutions, ambiguous
**With GHCID**:
```json
{
  "@id": "urn:uuid:a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
  "institution": "Koninklijke Bibliotheek",
  "sameAs": "https://id.heritage.example.org/uuid/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"
}
```
Benefit: Unique identifier enables deduplication, cross-referencing, provenance tracking
### 3. IIIF Manifests
**Scenario**: Museum publishes IIIF manifest for digitized collection
**GHCID Integration**:
```json
{
  "@context": "http://iiif.io/api/presentation/3/context.json",
  "id": "https://iiif.rijksmuseum.nl/manifest/123",
  "type": "Manifest",
  "provider": [{
    "id": "urn:uuid:550e8400-e29b-41d4-a716-446655440000",
    "type": "Agent",
    "label": {"en": ["Rijksmuseum"]},
    "sameAs": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000"
  }]
}
```
Benefit: IIIF consumers can resolve provider to authoritative metadata
### 4. Wikidata Integration
**Scenario**: Link Wikidata item to heritage institution
**Wikidata Property Proposal**: `P-GHCID` (GHCID identifier)
```sparql
# Wikidata SPARQL query
SELECT ?item ?itemLabel ?ghcid WHERE {
?item wdt:P-GHCID ?ghcid .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
Benefit: Bidirectional linking between Wikidata and GHCID registry
### 5. Merger/Split Tracking
**Scenario**: Two archives merge in 2001
**GHCID Representation**:
```yaml
# Gemeentearchief Haarlem (predecessor)
- id: old-uuid-1
  ghcid_original: NL-NH-2759794-A-GH
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Rijksarchief in Noord-Holland (predecessor)
- id: old-uuid-2
  ghcid_original: NL-NH-2759794-A-RNH
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Noord-Hollands Archief (successor)
- id: new-uuid
  ghcid_original: NL-NH-2750053-A-NHA
  organization_status: ACTIVE
  change_history:
    - change_type: FOUNDING
      event_date: "2001-01-01"
      affected_organization: [old-uuid-1, old-uuid-2]
```
**Resolver Behavior**:
```http
GET /uuid/old-uuid-1 HTTP/1.1
Response: 410 Gone
Location: /uuid/new-uuid
{
  "status": "CLOSED",
  "reason": "Merged into Noord-Hollands Archief",
  "successor": "https://id.heritage.example.org/uuid/new-uuid",
  "effective_date": "2001-01-01"
}
```
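
A client that wants the current organization can follow the `MERGER` entries until it reaches a record with no successor. The sketch below assumes records are dictionaries shaped like the YAML above; `resolve_current` and the in-memory `registry` are hypothetical and stand in for whatever lookup the resolver actually uses.

```python
def resolve_current(record_id: str, registry: dict) -> str:
    """Follow MERGER successors until reaching a record that was not merged away."""
    seen = set()
    while True:
        if record_id in seen:
            raise ValueError(f"Cycle in successor chain at {record_id}")
        seen.add(record_id)
        record = registry[record_id]
        successor = next(
            (
                change["resulting_organization"]
                for change in record.get("change_history", [])
                if change.get("change_type") == "MERGER"
                and change.get("resulting_organization")
            ),
            None,
        )
        if successor is None:
            return record_id
        record_id = successor


merger = {"change_type": "MERGER", "event_date": "2001-01-01",
          "resulting_organization": "new-uuid"}
registry = {
    "old-uuid-1": {"organization_status": "CLOSED", "change_history": [merger]},
    "old-uuid-2": {"organization_status": "CLOSED", "change_history": [merger]},
    "new-uuid": {
        "organization_status": "ACTIVE",
        "change_history": [{
            "change_type": "FOUNDING",
            "event_date": "2001-01-01",
            "affected_organization": ["old-uuid-1", "old-uuid-2"],
        }],
    },
}

print(resolve_current("old-uuid-1", registry))
```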
---
## GHCID Collision Resolution and Timeline Examples
### Collision Scenarios
GHCID collisions occur when two different institutions generate the **same base GHCID string** (identical country, region, city, type, and abbreviation). Resolution strategy **depends on temporal context**: when were the colliding institutions discovered?
### Core Principle: Temporal Priority Determines Strategy
**Rule**: The **timing of institution discovery** determines which institutions receive native language name suffixes.
**Collision Suffix**: Collisions are resolved by appending the **full legal name in native language in snake_case format**.
| Scenario | When Detected | Resolution Strategy | Rationale |
|----------|--------------|---------------------|-----------|
| **First Batch Collision** | Multiple institutions discovered simultaneously (e.g., batch CSV import) | **ALL colliding institutions** receive name suffixes | Fair treatment: no institution has temporal priority |
| **Historical Addition** | New institution collides with **already published** GHCID | **ONLY new institution** receives name suffix | PID stability: preserve existing published identifiers |
### Name Suffix Generation
**Converting institution names to snake_case suffixes:**
```python
import re
import unicodedata


def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Convert to lowercase
    lowercase = ascii_name.lower()
    # Remove apostrophes (straight and curly), commas, and other punctuation
    no_punct = re.sub(r"['’`\",.:;!?()\[\]{}]", '', lowercase)
    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)
    # Remove any remaining non-alphanumeric characters (except underscores);
    # letters without an NFD decomposition (e.g. "ø", "ß") are dropped here
    clean = re.sub(r'[^a-z0-9_]', '', underscored)
    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')
    return final
```
### Timeline Example 1: First Batch Collision
**Date**: 2025-11-01
**Event**: Dutch ISIL Registry batch import (364 institutions)
```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import from Dutch ISIL Registry │
└─────────────────────────────────────────────────────────────────┘
Collision Detected:
Base GHCID: NL-NH-AMS-M-SM
Institution 1: Stedelijk Museum Amsterdam
- Discovered: 2025-11-01 (batch import)
→ Resolution: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
Institution 2: Science Museum Amsterdam
- Discovered: 2025-11-01 (batch import)
→ Resolution: NL-NH-AMS-M-SM-science_museum_amsterdam
Strategy: FIRST_BATCH
Reason: Both institutions discovered simultaneously in batch import
Result: BOTH receive name suffixes derived from their native language names
```
**GHCID Records**:
```yaml
# Institution 1
- id: 550e8400-e29b-41d4-a716-446655440000  # Placeholder; actual value is the UUID v5 of the full GHCID
  ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  name: Stedelijk Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"

# Institution 2
- id: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # Placeholder; actual value is the UUID v5 of the full GHCID
  ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-science_museum_amsterdam
  name: Science Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"
```
**Key Point**: Both institutions created with name suffixes from the start. No existing PID was changed.
---
### Timeline Example 2: Historical Addition (Single Institution)
**Date**: 2025-11-15
**Event**: Historical research reveals Amsterdam Historical Museum (closed 2001)
```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import │
│ Hermitage Museum Amsterdam → NL-NH-AMS-M-HM (PUBLISHED) │
└─────────────────────────────────────────────────────────────────┘
│ (14 days pass, GHCID used in citations, APIs)
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-15: Historical Institution Added │
│ Amsterdam Historical Museum → Collides with published GHCID! │
└─────────────────────────────────────────────────────────────────┘
Collision Detected:
Base GHCID: NL-NH-AMS-M-HM
Existing Institution (PUBLISHED 2025-11-01):
- Name: Hermitage Museum Amsterdam
- GHCID: NL-NH-AMS-M-HM ← UNCHANGED (published, immutable)
- Publication date: 2025-11-01T10:00:00Z
New Institution (Being added 2025-11-15):
- Name: Amsterdam Historical Museum (historical, 1926-2001)
- GHCID: NL-NH-AMS-M-HM-amsterdam_historical_museum ← Gets name suffix
Strategy: HISTORICAL_ADDITION
Reason: New institution collides with already published GHCID
Result: ONLY new institution receives name suffix; existing GHCID preserved
```
**GHCID Records**:
```yaml
# Existing institution (UNCHANGED)
- id: existing-uuid-1234  # Unchanged
  ghcid_original: NL-NH-AMS-M-HM  # FROZEN (no name suffix)
  ghcid_current: NL-NH-AMS-M-HM   # FROZEN (no name suffix)
  name: Hermitage Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"  # PUBLISHED
    data_source: CSV_REGISTRY
  # No collision resolution metadata (institution not modified)

# New historical institution (Gets name suffix)
- id: new-uuid-5678  # UUID v5 of full GHCID with name suffix
  ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
  ghcid_current: NL-NH-AMS-M-HM-amsterdam_historical_museum
  name: Amsterdam Historical Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"
    publication_date: "2025-11-15T14:30:00Z"
    data_source: CONVERSATION_NLP
    notes: >-
      Historical addition collision with published GHCID NL-NH-AMS-M-HM
      (Hermitage Museum Amsterdam, published 2025-11-01). Added name suffix
      to preserve existing PID stability.
  change_history:
    - change_type: FOUNDING
      event_date: "1926-01-01"
    - change_type: CLOSURE
      event_date: "2001-12-31"
      event_description: "Closed; collections transferred to Amsterdam Museum"
```
**Why Preserve Existing GHCID?**
The existing GHCID `NL-NH-AMS-M-HM` may already be:
- Cited in academic publications
- Referenced in third-party datasets (Europeana, Wikidata)
- Used in API responses
- Embedded in RDF triple stores
**Changing it would break citations and external references** → violates PID stability principle ("Cool URIs don't change").
---
### Timeline Example 3: Multiple Historical Additions
**Date**: 2025-12-01
**Event**: Two historical naval museums discovered in archival research
```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import │
│ Maritime Museum Amsterdam → NL-NH-AMS-M-MM (PUBLISHED) │
└─────────────────────────────────────────────────────────────────┘
│ (30 days pass)
┌─────────────────────────────────────────────────────────────────┐
│ 2025-12-01: Historical Research Uncovers Two Naval Museums │
│ Both collide with published NL-NH-AMS-M-MM │
└─────────────────────────────────────────────────────────────────┘
Collision Detected:
Base GHCID: NL-NH-AMS-M-MM
Existing Institution (PUBLISHED 2025-11-01):
- Name: Maritime Museum Amsterdam
- GHCID: NL-NH-AMS-M-MM ← UNCHANGED
New Institution 1 (Being added 2025-12-01):
- Name: Dutch Navy Museum (historical, 1906-1955)
- GHCID: NL-NH-AMS-M-MM-dutch_navy_museum ← Gets name suffix
New Institution 2 (Being added 2025-12-01):
- Name: Amsterdam Naval Archive (historical, 1820-1901)
- GHCID: NL-NH-AMS-M-MM-amsterdam_naval_archive ← Gets name suffix
Strategy: HISTORICAL_ADDITION (multiple)
Reason: Multiple new institutions collide with same published GHCID
Result: All new institutions get name suffixes; existing GHCID unchanged
```
**GHCID Records**:
```yaml
# Existing institution (UNCHANGED)
- id: maritime-uuid
  ghcid_original: NL-NH-AMS-M-MM  # No name suffix (published first)
  ghcid_current: NL-NH-AMS-M-MM
  name: Maritime Museum Amsterdam
  organization_status: ACTIVE
  provenance:
    publication_date: "2025-11-01T10:00:00Z"

# New historical institution 1
- id: navy-museum-uuid
  ghcid_original: NL-NH-AMS-M-MM-dutch_navy_museum
  ghcid_current: NL-NH-AMS-M-MM-dutch_navy_museum
  name: Dutch Navy Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1906-01-01"
    - change_type: CLOSURE
      event_date: "1955-12-31"

# New historical institution 2
- id: naval-archive-uuid
  ghcid_original: NL-NH-AMS-M-MM-amsterdam_naval_archive
  ghcid_current: NL-NH-AMS-M-MM-amsterdam_naval_archive
  name: Amsterdam Naval Archive
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1820-01-01"
    - change_type: CLOSURE
      event_date: "1901-12-31"
```
**Pattern**: When multiple historical institutions are added simultaneously and collide with the same existing GHCID:
- Existing: No change
- All new: Each gets name suffix derived from native language name
---
### Collision Resolution Workflow Diagram
```
New Institution Detected
        │
        ▼
Generate Base GHCID
        │
        ▼
Check Existing Registry
        │
        ├─── No Collision Found ────────────► Use Base GHCID (no name suffix)
        │
        └─── Collision Found
                │
                ▼
        Check Publication Date of Existing Record
                │
                ├─── Existing Published ────────► HISTORICAL_ADDITION
                │                                   - Existing: UNCHANGED
                │                                   - New: Add name suffix
                │
                └─── Both Being Created ────────► FIRST_BATCH
                                                    - All: Add name suffixes
```
### Implementation Guidance
When implementing collision resolution, always check:
1. **Does base GHCID exist in registry?**
   - No → Use base GHCID without name suffix
   - Yes → Proceed to step 2
2. **Does existing record have publication_date?**
   - Yes → HISTORICAL_ADDITION strategy (only new gets name suffix)
   - No → FIRST_BATCH strategy (all get name suffixes)
3. **Track resolution in provenance**:
   ```yaml
   provenance:
     collision_resolution:
       strategy: HISTORICAL_ADDITION | FIRST_BATCH
       collides_with: existing_ghcid (if historical addition)
       existing_publication_date: ISO 8601 timestamp
       reason: "Human-readable explanation"
   ```
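
The three checks above can be sketched as one function. This is an illustration under stated assumptions: `registry` is a hypothetical in-memory mapping from GHCID strings to records, `name_suffix` is a compact restatement of the snake_case rule, and `resolve_collision` is not an existing project API.

```python
import re
import unicodedata


def name_suffix(native_name: str) -> str:
    """Compact restatement of the snake_case suffix rule."""
    decomposed = unicodedata.normalize("NFD", native_name)
    lowered = "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()
    no_punct = re.sub(r"['’`\",.:;!?()\[\]{}]", "", lowered)
    underscored = re.sub(r"[\s\-]+", "_", no_punct)
    clean = re.sub(r"[^a-z0-9_]", "", underscored)
    return re.sub(r"_+", "_", clean).strip("_")


def resolve_collision(base_ghcid: str, native_name: str, registry: dict) -> dict:
    """Decide the GHCID and provenance strategy for a newly added institution."""
    existing = registry.get(base_ghcid)
    if existing is None:
        # Step 1: no collision, the base GHCID is used as-is.
        return {"ghcid": base_ghcid, "strategy": None}

    suffixed = f"{base_ghcid}-{name_suffix(native_name)}"
    if existing.get("publication_date"):
        # Step 2 (yes): the published GHCID stays frozen, only the newcomer is suffixed.
        return {
            "ghcid": suffixed,
            "strategy": "HISTORICAL_ADDITION",
            "collides_with": base_ghcid,
            "existing_publication_date": existing["publication_date"],
        }
    # Step 2 (no): nothing is published yet. The caller must also re-key the
    # existing unpublished record with its own name suffix.
    return {"ghcid": suffixed, "strategy": "FIRST_BATCH"}


registry = {
    "NL-NH-AMS-M-HM": {
        "name": "Hermitage Museum Amsterdam",
        "publication_date": "2025-11-01T10:00:00Z",
    },
    "NL-NH-AMS-M-SM": {"name": "Stedelijk Museum Amsterdam"},  # not yet published
}

print(resolve_collision("NL-NH-AMS-M-HM", "Amsterdam Historical Museum", registry))
```

Returning the strategy alongside the identifier keeps the decision and its provenance record in one place, so the `collision_resolution` block can be written directly from the result.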
### Why This Matters: PID Stability Principle
**Cool URIs Don't Change** (W3C Architecture):
- Once a persistent identifier is published, it should **never** be modified
- External systems may reference the PID (citations, links, datasets)
- Changing published PIDs breaks citations, causes 404 errors, violates trust
**GHCID Approach**:
- First batch: Fair treatment, all receive name suffixes (no existing PIDs to protect)
- Historical addition: Asymmetric treatment preserves existing PIDs (new institution accommodates)
**Alternative (Rejected)**:
- Could retroactively add name suffixes to existing records when collisions occur
- **Problem**: Breaks existing citations, violates PID persistence guarantee
- **Principle**: "First publisher wins" → existing GHCID has temporal priority
---
## Migration and Transition Strategies
### From ISIL to GHCID
**Step 1: Preserve ISIL Codes**
```yaml
- id: uuid-from-isil-ghcid
  ghcid_original: NL-NH-2759794-M-RM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdRM  # Preserved!
    - identifier_scheme: GHCID
      identifier_value: NL-NH-2759794-M-RM
```
**Step 2: Provide ISIL → GHCID Lookup**
```http
GET /isil/NL-AsdRM HTTP/1.1
Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```
**Step 3: Gradual Migration**
- Years 1-2: Dual identifiers (ISIL + GHCID)
- Years 3-5: GHCID primary, ISIL secondary
- Years 6+: GHCID primary, ISIL legacy
### From Wikidata Q-Numbers
**Strategy**: Complement, don't replace
```yaml
- id: ghcid-uuid
  identifiers:
    - identifier_scheme: Wikidata
      identifier_value: Q190804
      identifier_url: https://www.wikidata.org/wiki/Q190804
```
**Bidirectional linking**:
- GHCID record → `sameAs: wikidata:Q190804`
- Wikidata item → `P-GHCID: uuid` (proposed property)
### Legacy System Integration
**For systems requiring numeric IDs**:
```python
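# Map GHCID UUID to numeric ID
uuid_str = "550e8400-e29b-41d4-a716-446655440000"
numeric_id = ghcid_components.to_numeric()  # e.g. 12345678901234567

# Store mapping in legacy database.
# `cursor` is assumed to be an open DB-API cursor; the placeholder style
# (%s vs ?) depends on the driver. The id column must hold unsigned 64-bit values.
cursor.execute(
    "INSERT INTO institutions (id, uuid_reference) VALUES (%s, %s)",
    (numeric_id, uuid_str),
)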
```
**For systems requiring ISIL**:
```python
# Provide ISIL fallback
isil_code = record.get_identifier('ISIL') or f"GHCID-{record.ghcid_numeric}"
```
---
## Appendices
### Appendix A: Collision Probability Calculations
**Birthday Paradox Formula**:
```
P(collision) ≈ n² / (2 × 2^bits)
Where:
n = number of institutions
bits = identifier bit length
```
**UUID v5 (128 bits)**:
```
n = 1,000,000 institutions
P = (10^6)² / (2 × 2^128)
P ≈ 1.5 × 10^-27
P ≈ 1.5 × 10^-25 %
```
**Numeric (64 bits)**:
```
n = 1,000,000 institutions
P = (10^6)² / (2 × 2^64)
P ≈ 2.7 × 10^-8
P ≈ 0.0000027%
```
**Conclusion**: Both formats provide negligible collision risk for heritage domain (<10M institutions expected).
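
The approximation can be checked numerically. The short sketch below evaluates the formula for both identifier lengths at 1 million and 10 million institutions; the function name `collision_probability` is illustrative.

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation P ≈ n² / (2 × 2^bits), valid while P is small."""
    return n * n / (2 * 2**bits)


for label, bits in (("UUID v5 (128 bits)", 128), ("Numeric (64 bits)", 64)):
    for n in (1_000_000, 10_000_000):
        p = collision_probability(n, bits)
        print(f"{label}, n={n:,}: P ≈ {p:.1e}")
```

Because the probability grows with n², a tenfold increase in institutions raises the risk a hundredfold, which is why the 64-bit numeric format has far less headroom than the 128-bit UUID.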
### Appendix B: GHCID Generation Pseudocode
```python
import hashlib
import uuid

# InstitutionTypeEnum, INSTITUTION_TYPE_CODES, GHCID_NAMESPACE, GHCIDComponents and
# extract_abbreviation_from_name are defined elsewhere in the project.


def generate_ghcid(
    name: str,
    institution_type: InstitutionTypeEnum,
    country: str,           # ISO 3166-1 alpha-2
    region: str,            # ISO 3166-2 or GeoNames admin1
    city_geonames_id: int,  # GeoNames ID
) -> GHCIDComponents:
    """
    Generate GHCID components from institution metadata.
    """
    # Normalize inputs
    country = country.upper()
    region = region.upper()

    # Convert GeoNames ID to string
    city_code = str(city_geonames_id)

    # Get institution type code
    type_code = INSTITUTION_TYPE_CODES[institution_type]  # "M", "L", "A", etc.

    # Generate abbreviation from emic (native language) name
    # Uses first letter of each significant word, skipping prepositions/articles
    abbreviation = extract_abbreviation_from_name(name)  # "RM", "LC", etc.

    # Construct GHCID string
    ghcid_string = f"{country}-{region}-{city_code}-{type_code}-{abbreviation}"

    # Generate UUID v5
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid_string)

    # Generate UUID SHA-256
    hash_bytes = hashlib.sha256(ghcid_string.encode('utf-8')).digest()
    ghcid_uuid_sha256 = uuid.UUID(bytes=hash_bytes[:16])  # Truncate to 128 bits

    # Generate numeric (first 64 bits of the SHA-256 digest)
    ghcid_numeric = int.from_bytes(hash_bytes[:8], byteorder='big')

    return GHCIDComponents(
        country=country,
        region=region,
        city_code=city_code,
        type_code=type_code,
        abbreviation=abbreviation,
        ghcid_string=ghcid_string,
        ghcid_uuid=str(ghcid_uuid),
        ghcid_uuid_sha256=str(ghcid_uuid_sha256),
        ghcid_numeric=ghcid_numeric,
    )
```
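
The identifier derivation step can be tried in isolation. The sketch below is self-contained and uses a stand-in namespace: `EXAMPLE_NAMESPACE` is a hypothetical value derived from the example resolver URL, not the project's actual `GHCID_NAMESPACE`, so the UUIDs it prints will differ from real GHCID UUIDs.

```python
import hashlib
import uuid

# Hypothetical namespace for illustration only; the real GHCID_NAMESPACE is
# defined by the project.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://id.heritage.example.org/")


def derive_identifiers(ghcid_string: str) -> dict:
    """Derive the three identifier forms from a GHCID string."""
    digest = hashlib.sha256(ghcid_string.encode("utf-8")).digest()
    return {
        # Name-based UUID (SHA-1 under the hood, version bits set to 5)
        "uuid_v5": uuid.uuid5(EXAMPLE_NAMESPACE, ghcid_string),
        # First 128 bits of SHA-256 as-is; version/variant bits are not adjusted
        "uuid_sha256": uuid.UUID(bytes=digest[:16]),
        # First 64 bits of SHA-256 as an unsigned integer
        "numeric": int.from_bytes(digest[:8], byteorder="big"),
    }


ids = derive_identifiers("NL-NH-2759794-M-RM")
for key, value in ids.items():
    print(f"{key}: {value}")
```

All three values are deterministic functions of the GHCID string, so any party holding the string and the namespace can recompute them without consulting a registry. The numeric form equals the high 64 bits of the SHA-256 UUID, which ties the two representations together.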
### Appendix C: References and Standards
**IETF RFCs**:
- **RFC 4122**: A Universally Unique IDentifier (UUID) URN Namespace - https://tools.ietf.org/html/rfc4122
- **RFC 9562**: Universally Unique IDentifiers (UUIDs) - https://datatracker.ietf.org/doc/rfc9562/
- **RFC 3650**: Handle System Overview - https://tools.ietf.org/html/rfc3650
**ISO Standards**:
- **ISO 15511**: International Standard Identifier for Libraries and Related Organizations (ISIL)
- **ISO 3166-1**: Codes for the representation of names of countries and their subdivisions, Part 1: Country codes
- **ISO 3166-2**: Codes for the representation of names of countries and their subdivisions, Part 2: Country subdivision codes
- **ISO 26324**: Information and documentation: Digital object identifier system
- **ISO 21127**: Information and documentation: CIDOC Conceptual Reference Model (CRM)
**W3C Standards**:
- **W3C PROV-O**: PROV Ontology - https://www.w3.org/TR/prov-o/
- **W3C ORG**: Organization Ontology - https://www.w3.org/TR/vocab-org/
- **Schema.org**: Structured data vocabulary - https://schema.org/
**Heritage Standards**:
- **RiC-O v1.1**: Records in Contexts Ontology - https://www.ica.org/standards/RiC/ontology
- **CIDOC-CRM v7.1.3**: Conceptual Reference Model - http://www.cidoc-crm.org/
- **IIIF Presentation API**: International Image Interoperability Framework - https://iiif.io/api/presentation/
**GeoNames**:
- GeoNames Gazetteer - https://www.geonames.org/
- GeoNames Ontology - http://www.geonames.org/ontology/
**Related PID Systems**:
- **DOI**: Digital Object Identifier - https://www.doi.org/
- **ARK**: Archival Resource Key - https://n2t.net/e/ark_ids.html
- **Handle System**: https://www.handle.net/
- **VIAF**: Virtual International Authority File - https://viaf.org/
---
## Acknowledgments
This specification builds on the foundational work of:
- **ISIL agencies** worldwide for pioneering library/archive identifiers
- **DOI Foundation** for persistent identifier governance models
- **California Digital Library** for ARK design principles
- **Wikimedia Foundation** for crowdsourced identifier systems
- **GeoNames** for geographic identifier infrastructure
- **Europeana** and **DPLA** for cultural heritage aggregation standards
Special thanks to the heritage informatics community for feedback and guidance.
---
**Version**: 1.0
**Date**: 2025-11-06
**Status**: Draft for Community Review
**Next Review**: 2026-01-01
**Contact**: [GLAM Data Extraction Project](https://github.com/kempersc/glam)
---
**License**: This specification is released under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).