# Global Heritage Custodian Identifier (GHCID) Persistent Identifier Scheme

**Version**: 1.0
**Date**: 2025-11-06
**Status**: Formal Specification (Draft for Community Review)
**Authors**: GLAM Data Extraction Project

---

## Table of Contents

1. [Introduction](#introduction)
2. [Persistent Identifier Requirements](#persistent-identifier-requirements)
3. [GHCID Identifier Formats](#ghcid-identifier-formats)
4. [Resolution Architecture](#resolution-architecture)
5. [Governance Model](#governance-model)
6. [Comparison with Existing PID Systems](#comparison-with-existing-pid-systems)
7. [Implementation Roadmap](#implementation-roadmap)
8. [Technical Specifications](#technical-specifications)
9. [Use Cases and Applications](#use-cases-and-applications)
10. [Migration and Transition Strategies](#migration-and-transition-strategies)

---

## Introduction

### Purpose

The **Global Heritage Custodian Identifier (GHCID)** is a persistent identifier (PID) scheme designed to uniquely and permanently identify heritage custodian organizations worldwide, including galleries, libraries, archives, museums (GLAM), research centers, botanical gardens, collecting societies, and other cultural heritage institutions.

GHCID addresses critical gaps in existing identifier systems:

1. **Global Coverage**: Extends beyond ISIL's limited geographic coverage to include institutions in all countries
2. **Non-Registry Institutions**: Provides identifiers for organizations without ISIL codes (estimated 70-80% of heritage institutions worldwide)
3. **Change Tracking**: Models organizational evolution through mergers, splits, relocations, and name changes
4. **Multi-Format Support**: Offers human-readable, UUID, and numeric formats for diverse system requirements
5. **Linked Data Integration**: Aligns with W3C, ISO, and IETF standards for semantic web compatibility

### Scope

GHCID is intended for:

- **Heritage custodian organizations**: Museums, archives, libraries, galleries, research centers, botanical gardens, zoos, collecting societies, and heritage platforms
- **Cross-system references**: Citations, metadata aggregation, Linked Open Data (LOD) graphs
- **Long-term persistence**: Identifiers designed to remain stable for decades or centuries
- **Global interoperability**: Compatible with Europeana, DPLA, IIIF, Wikidata, GeoNames, and other aggregators

GHCID is NOT intended for:

- Individual collection items (use ARK, DOI, Handle for objects)
- Digital files or surrogates (use IIIF, ARK for digital objects)
- Person identifiers (use ORCID, ISNI, VIAF for people)
- Geographic locations (use GeoNames, OSM for places)

### Design Principles

1. **Transparency**: Publicly documented algorithms, verifiable by anyone
2. **Determinism**: Same input always produces same identifier
3. **Persistence**: Identifiers remain valid even if organizations change names or relocate
4. **Interoperability**: Compatible with existing PID systems (ISIL, VIAF, Wikidata)
5. **Open Standards**: Based on IETF RFCs, ISO standards, W3C recommendations
6. **No Vendor Lock-In**: Open-source implementation, no proprietary dependencies

---

## Persistent Identifier Requirements

### Core PID Properties

A persistent identifier system must satisfy:

1. **Uniqueness**: No two entities share the same identifier
2. **Persistence**: Identifiers remain valid indefinitely (decades to centuries)
3. **Resolvability**: Identifiers can be resolved to authoritative metadata
4. **Transparency**: Generation algorithm is publicly documented
5. **Governance**: Clear authority and policies for identifier assignment
6. **Actionability**: Identifiers can be used in URLs, APIs, citations

### GHCID Compliance

| Requirement | GHCID Implementation | Status |
|-------------|---------------------|--------|
| **Uniqueness** | UUID v5 (128-bit value with 122 hash-derived bits; P(collision) ≈ 10^-25 for 1M institutions), SHA-256 fallback | ✅ Implemented |
| **Persistence** | Deterministic generation from stable metadata, `ghcid_original` frozen on first assignment | ✅ Implemented |
| **Resolvability** | HTTP resolution service (planned), JSON-LD API | 🔄 In Design |
| **Transparency** | Open-source code, RFC 4122 standard, public algorithm docs | ✅ Implemented |
| **Governance** | Community governance model (proposed), ISIL coordination | ⏳ Pending |
| **Actionability** | URN and HTTPS URI formats, embeddable in RDF/JSON-LD | ✅ Implemented |

---

## GHCID Identifier Formats

GHCID provides **four complementary identifier formats**, each optimized for specific use cases. All formats are **deterministic** and derived from the same underlying GHCID components.

### 1. Human-Readable GHCID String

**Format**: `{Country}-{Region}-{City}-{Type}-{Abbreviation}`

**Components**:

| Component | Format | Standard | Example |
|-----------|--------|----------|---------|
| **Country** | ISO 3166-1 alpha-2 (2 chars) | ISO 3166-1 | `NL`, `US`, `BR` |
| **Region** | ISO 3166-2 subdivision (1-3 chars) OR GeoNames admin1 code | ISO 3166-2, GeoNames | `NH` (Noord-Holland), `CA` (California) |
| **City** | GeoNames ID of the city (numeric) | GeoNames | `2759794` (Amsterdam) |
| **Type** | Institution type (1 char) | GHCID taxonomy | `M` (Museum), `L` (Library), `A` (Archive) |
| **Abbreviation** | First letter of each word in emic name (2-10 chars) | Derived | `RM` (Rijksmuseum) |

**Examples**:

```
NL-NH-2759794-M-RM      # Rijksmuseum, Amsterdam, Netherlands
US-DC-4140963-L-LC      # Library of Congress, Washington DC, USA
BR-RJ-3451190-L-BNB     # Biblioteca Nacional do Brasil, Rio de Janeiro, Brazil
GB-ENG-2643743-M-BM     # British Museum, London, United Kingdom
FR-IDF-2988507-M-ML     # Musée du Louvre (Louvre Museum), Paris, France
```
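
A GHCID string can be split back into its five components with a regular expression. The sketch below is illustrative only: the pattern (including the assumed 1-3 character region and all-digit city component) and the names `GHCID_PATTERN`, `GHCIDParts`, and `parse_ghcid` are assumptions, not part of the specification.

```python
import re
from typing import NamedTuple


class GHCIDParts(NamedTuple):
    """Components of a human-readable GHCID (hypothetical container type)."""
    country: str
    region: str
    city: str
    type_code: str
    abbreviation: str


# Assumed pattern: 2-letter country, 1-3 char region, numeric GeoNames ID,
# 1-letter type, 2-10 char abbreviation.
GHCID_PATTERN = re.compile(
    r"(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]{1,3})-(?P<city>[0-9]+)"
    r"-(?P<type_code>[A-Z])-(?P<abbreviation>[A-Z0-9]{2,10})"
)


def parse_ghcid(value: str) -> GHCIDParts:
    """Split a GHCID string into its components, or raise ValueError."""
    match = GHCID_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a well-formed GHCID string: {value!r}")
    return GHCIDParts(**match.groupdict())


parts = parse_ghcid("NL-NH-2759794-M-RM")
```

Keeping the city component as a string avoids silently normalizing leading zeros; callers that need the GeoNames ID as a number can convert it explicitly.
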

**Use Cases**:
- Academic citations
- Documentation and reports
- Debugging and logging
- Human-readable data exchange

**Persistence Note**:
- Organizations may relocate or change names over time
- **`ghcid_original`**: Frozen on first assignment, never changes (TRUE PID)
- **`ghcid_current`**: Updated if organization changes (convenience field)
- Both stored in record; `ghcid_original` used for citations and cross-system references

**Historical Institutions Rule** (Added 2025-11-06):

GHCID supports **historical heritage institutions** (e.g., 17th-century cabinet collections, defunct museums, closed archives) using modern geographic coordinates:

- **Geographic Components**: Country, region, and city codes are taken from where the institution's last recorded location falls on a **modern world map (2025)**
- **Temporal Projection**: Historical locations are projected onto current ISO 3166-1/3166-2 and GeoNames geocoding standards
- **Abbreviation**: Institution abbreviations use the **first letter of each significant word in the official emic (native language) name**, skipping prepositions, articles, and conjunctions in all languages
- **Institution Type**: Uses the full GHCID taxonomy (GALLERY, LIBRARY, ARCHIVE, MUSEUM, etc.); historical context is preserved in metadata, not in the identifier
- **Flexibility**: The GHCID format is deliberately designed to accommodate institutions from any historical period
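
The abbreviation rule and the five-part format can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `SKIP_WORDS` list is a tiny hypothetical sample (a real list would have to cover prepositions, articles, and conjunctions in every supported language), and `abbreviate` and `build_ghcid` are assumed helper names.

```python
# Hypothetical, deliberately incomplete stop-word sample for illustration.
SKIP_WORDS = {"of", "the", "and", "de", "do", "du", "der", "van", "het", "la", "le"}


def abbreviate(emic_name: str) -> str:
    """First letter of each significant word in the emic name, upper-cased."""
    significant = [w for w in emic_name.split() if w.lower() not in SKIP_WORDS]
    return "".join(word[0].upper() for word in significant)


def build_ghcid(country: str, region: str, city_geonames_id: int,
                type_code: str, emic_name: str) -> str:
    """Join the components as {Country}-{Region}-{City}-{Type}-{Abbreviation}."""
    return "-".join([
        country.upper(),
        region.upper(),
        str(city_geonames_id),
        type_code.upper(),
        abbreviate(emic_name),
    ])


library_of_congress = build_ghcid("US", "DC", 4140963, "L", "Library of Congress")
```

Because the result depends on the stop-word list, that list would need to be versioned together with the specification for generation to stay deterministic across implementations.
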

**Examples of Historical Institutions**:

```
# Wunderkammer of Ole Worm (Copenhagen, closed 1654)
DK-84-2618425-P-OW      # Denmark, Capital Region, Copenhagen (modern), Personal Collection, Ole Worm

# Bibliotheca Corviniana (Buda, 1490 - dispersed 1526)
HU-BU-3054643-L-BC      # Hungary, Budapest (modern coords), Library, Bibliotheca Corviniana

# Cabinet of Curiosities of Ferdinand II (Innsbruck, 1620s)
AT-7-2775216-P-FII      # Austria, Tyrol, Innsbruck (modern), Personal Collection, Ferdinand II

# Dutch East India Company Archives (Jakarta, 1602-1800)
ID-JK-1642911-C-VOC     # Indonesia (modern country), Jakarta (modern), Corporation, VOC
```

**Rationale**:
- Historical institutions existed at specific geographic coordinates
- Modern political boundaries and city identifiers provide stable reference points
- Emic name abbreviations preserve original cultural/linguistic context while ensuring deterministic generation
- Metadata fields capture full historical context (founding/closure dates, historical names, organizational changes)
- GHCID history tracks temporal evolution via `ghcid_history` entries
- Enables citation of historical collections in modern scholarship using persistent identifiers

### 2. UUID v5 (SHA-1) - Primary Persistent Identifier

**Format**: RFC 4122 UUID v5 (128-bit, hyphenated)

**Algorithm**:
1. Construct GHCID string: `{Country}-{Region}-{City}-{Type}-{Abbreviation}`
2. Apply RFC 4122 UUID v5 with:
   - **Namespace**: `6ba7b810-9dad-11d1-80b4-00c04fd430c8` (DNS namespace from RFC 4122)
   - **Name**: GHCID string (UTF-8 encoded)
   - **Hash**: SHA-1 (per RFC 4122 specification)
3. Format as UUID: `xxxxxxxx-xxxx-5xxx-yxxx-xxxxxxxxxxxx`
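
In Python the three steps above map onto the standard library's `uuid.uuid5`. The following is a minimal sketch; the helper name `ghcid_uuid_v5` is an assumption, while the namespace value is the one given in the algorithm.

```python
import uuid

# Namespace stated in the algorithm (the RFC 4122 DNS namespace).
GHCID_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")


def ghcid_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive the UUID v5 for a GHCID string (hypothetical helper name)."""
    # uuid.uuid5 hashes the namespace bytes plus the UTF-8 encoded name with
    # SHA-1, then sets the version (5) and variant bits.
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)


rijksmuseum_uuid = ghcid_uuid_v5("NL-NH-2759794-M-RM")
rijksmuseum_urn = rijksmuseum_uuid.urn  # "urn:uuid:..." form
```

Any RFC-compliant UUID library in another language should yield the same value for the same namespace and name, which is what makes the identifier independently verifiable.
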

**Examples** (placeholder values for illustration, not computed outputs; a real UUID v5 carries `5` as its version digit):

```
NL-NH-2759794-M-RM → 550e8400-e29b-41d4-a716-446655440000
US-DC-4140963-L-LC → 8b3e6f12-a4d5-5c89-b123-456789abcdef
```

**Properties**:
- **Standard Compliance**: RFC 4122 (2005), superseded by RFC 9562 (2024), which keeps UUID v5 unchanged
- **Collision Resistance**: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (birthday bound over the 122 hash-derived bits)
- **Deterministic**: Same GHCID always produces same UUID
- **Interoperable**: Compatible with Europeana, DPLA, IIIF, Wikidata
- **Transparent**: Available as a built-in or standard-library function in most major programming languages

**SHA-1 Safety**:
- SHA-1 is **deprecated for cryptographic security** (digital signatures, TLS)
- SHA-1 is **appropriate for identifier generation**, where inputs are not chosen by an adversary and only accidental collisions matter
- The accidental-collision risk of UUID v5 is governed by its 122 hash-derived bits, not by SHA-1's resistance to deliberately constructed collisions
- See [WHY_UUID_V5_SHA1.md](WHY_UUID_V5_SHA1.md) for detailed rationale

**Use Cases**:
- **Primary identifier** for all GHCID records
- RDF/JSON-LD `@id` field
- IIIF manifest identifiers
- Wikidata external ID
- Database foreign keys
- Cross-system references

**URN Format**:
```
urn:uuid:550e8400-e29b-41d4-a716-446655440000
```

**HTTPS URI Format** (with resolution service):
```
https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000
```

### 3. UUID SHA-256 (Future-Proof Alternative)

**Format**: RFC 9562 UUID v8 (128-bit, custom SHA-256)

**Algorithm**:
1. Construct GHCID string
2. Hash with SHA-256 → 256 bits
3. Truncate to first 128 bits (16 bytes)
4. Set version bits to `8` (custom/experimental)
5. Set variant bits to RFC 4122 (`0b10xxxxxx`)
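
The five steps can be sketched with `hashlib` and `uuid` from the Python standard library. This is an illustration under stated assumptions: the function name `ghcid_uuid_sha256` is hypothetical, the GHCID string is hashed as UTF-8 without any namespace prefix (the steps above mention none), and the version and variant bits are placed in bytes 6 and 8 as in the standard UUID layout.

```python
import hashlib
import uuid


def ghcid_uuid_sha256(ghcid: str) -> uuid.UUID:
    """Build a UUID v8 from the first 128 bits of SHA-256(ghcid)."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()  # 32 bytes
    raw = bytearray(digest[:16])                             # truncate to 16 bytes
    raw[6] = (raw[6] & 0x0F) | 0x80   # high nibble of byte 6 = version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # top two bits of byte 8 = variant 0b10
    # The bits are set manually; no version argument is passed to uuid.UUID.
    return uuid.UUID(bytes=bytes(raw))


sha256_uuid = ghcid_uuid_sha256("NL-NH-2759794-M-RM")
```

Six of the 128 bits are overwritten by the version and variant fields, so 122 bits of the SHA-256 output survive in the identifier.
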

**Examples** (placeholder value for illustration, not a computed output):

```
NL-NH-2759794-M-RM → a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d
```

**Properties**:
- **Cryptographic Strength**: SHA-256 (NIST-approved through 2030+)
- **Collision Resistance**: P(collision) ≈ 9.4 × 10^-26 for 1M institutions (same 122 usable bits as UUID v5)
- **Future-Proof**: No known practical attacks against SHA-256
- **Deterministic**: Same GHCID always produces same UUID
- **Less Transparent**: Custom algorithm requires sharing implementation code

**Use Cases**:
- Security policies mandating SHA-256
- Future migration path if SHA-1 fully deprecated
- Custom identifier resolution services
- Internal systems with strict cryptographic requirements

**Status**: Generated alongside UUID v5, stored as secondary identifier

### 4. Numeric (64-bit Integer)

**Format**: Unsigned 64-bit integer (0 to 18,446,744,073,709,551,615)

**Algorithm**:
1. Hash GHCID string with SHA-256 → 256 bits
2. Extract first 8 bytes (64 bits)
3. Convert to unsigned integer (big-endian)
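
A minimal Python sketch of these steps follows. The names `ghcid_numeric` and `to_signed64` are assumptions; the second helper is an optional extra showing one way to fit the unsigned value into a signed 64-bit column, which the steps above do not prescribe.

```python
import hashlib


def ghcid_numeric(ghcid: str) -> int:
    """First 8 bytes of SHA-256(ghcid), read as an unsigned big-endian integer."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=False)


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return value - 2**64 if value >= 2**63 else value


numeric_id = ghcid_numeric("NL-NH-2759794-M-RM")
signed_id = to_signed64(numeric_id)
```

Roughly half of all generated values are at or above 2^63, so any storage layer limited to signed 64-bit integers needs a conversion of this kind or a wider column type.
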

**Examples** (placeholder values for illustration, not computed outputs):

```
NL-NH-2759794-M-RM → 12345678901234567
US-DC-4140963-L-LC → 98765432109876543
```

**Properties**:
- **Compact**: 8 bytes (vs. 36 bytes for UUID string)
- **Deterministic**: Same GHCID always produces same numeric ID
- **Fast Indexing**: Integer comparisons faster than string UUIDs
- **CSV-Friendly**: No special characters
- **Reduced Collision Resistance**: P(collision) ≈ 2.7 × 10^-8 for 1M institutions (still negligible)
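
Collision estimates of this kind follow from the birthday-bound approximation p ≈ n(n−1) / 2^(b+1) for n identifiers drawn from b uniformly random bits. The sketch below evaluates it for the 64-bit numeric format and for the 122 hash-derived bits of a UUID; the function name is an assumption, and the approximation only holds while p is small.

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound estimate of at least one collision among n random IDs."""
    # Number of pairs n(n-1)/2 divided by the size of the ID space 2^bits.
    return n * (n - 1) / 2 ** (bits + 1)


p_numeric_1m = collision_probability(1_000_000, 64)      # 64-bit numeric ID
p_numeric_100m = collision_probability(100_000_000, 64)
p_uuid_1m = collision_probability(1_000_000, 122)        # UUID with 122 usable bits
```

Evaluated this way, the 64-bit format gives roughly 2.7 × 10^-8 for one million institutions and about 0.03% for one hundred million, while the UUID formats stay near 10^-25 for one million.
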

**Use Cases**:
- Database primary keys (PostgreSQL `BIGINT` is signed, so values at or above 2^63 need `NUMERIC(20,0)` or a signed reinterpretation)
- CSV exports for spreadsheet analysis
- Numeric sorting requirements
- Systems without UUID support
- Legacy system integration

**Limitations**:
- NOT recommended as primary PID (use UUID v5 instead)
- Suitable for heritage domain (<10M institutions expected)
- For >100M institutions, collision risk becomes non-negligible (about 0.03%)

---

## Resolution Architecture

### Resolution Service Design

A persistent identifier is only as persistent as its resolution infrastructure. GHCID requires a **long-term, reliable resolution service** to resolve identifiers to authoritative metadata.

### Resolver Endpoints

**Base URL**: `https://id.heritage.example.org/` (example domain)

| Endpoint Pattern | Format | Example |
|-----------------|--------|---------|
| `/uuid/{uuid}` | UUID v5 | `/uuid/550e8400-e29b-41d4-a716-446655440000` |
| `/uuid-sha256/{uuid}` | UUID SHA-256 | `/uuid-sha256/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d` |
| `/numeric/{id}` | Numeric | `/numeric/12345678901234567` |
| `/ghcid/{string}` | Human-readable | `/ghcid/NL-NH-2759794-M-RM` |

**All four endpoints resolve to the SAME institutional record.**
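
As a small illustration of the endpoint patterns, the sketch below builds the four resolver URLs for one record. The function name `resolver_urls` is hypothetical, the base URL is the example domain used in this section, and the identifier values passed in are the placeholder values from the examples.

```python
from urllib.parse import quote

BASE_URL = "https://id.heritage.example.org"  # example domain, not a live service


def resolver_urls(ghcid: str, uuid_v5: str, uuid_sha256: str, numeric: int) -> dict:
    """Return the four resolver URLs that should lead to the same record."""
    return {
        "uuid": f"{BASE_URL}/uuid/{uuid_v5}",
        "uuid-sha256": f"{BASE_URL}/uuid-sha256/{uuid_sha256}",
        "numeric": f"{BASE_URL}/numeric/{numeric}",
        # Percent-encode defensively in case an abbreviation ever contains
        # characters outside the unreserved URL set.
        "ghcid": f"{BASE_URL}/ghcid/{quote(ghcid, safe='-')}",
    }


urls = resolver_urls(
    "NL-NH-2759794-M-RM",
    "550e8400-e29b-41d4-a716-446655440000",
    "a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
    12345678901234567,
)
```
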

### Resolution Protocol

**HTTP GET Request**:

```http
GET /uuid/550e8400-e29b-41d4-a716-446655440000 HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json
```

**HTTP Response** (JSON-LD):

```json
{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@type": ["HeritageCustodian", "schema:Museum", "org:Organization"],
  "@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
  "ghcid_uuid": "550e8400-e29b-41d4-a716-446655440000",
  "ghcid_original": "NL-NH-2759794-M-RM",
  "ghcid_current": "NL-NH-2759794-M-RM",
  "name": "Rijksmuseum",
  "alternateName": ["Rijksmuseum Amsterdam", "Rijks"],
  "description": "The national museum of the Netherlands, dedicated to arts and history.",
  "institution_type": "MUSEUM",
  "url": "https://www.rijksmuseum.nl",
  "sameAs": [
    "https://www.wikidata.org/wiki/Q190804",
    "https://viaf.org/viaf/131511535",
    "urn:isil:NL-AsdRM"
  ],
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Museumstraat 1",
    "addressLocality": "Amsterdam",
    "postalCode": "1071 XX",
    "addressCountry": "NL",
    "geonames": "https://sws.geonames.org/2759794/"
  },
  "foundingDate": "1800-01-01",
  "provenance": {
    "data_source": "CSV_REGISTRY",
    "data_tier": "TIER_1_AUTHORITATIVE",
    "extraction_date": "2025-11-06T10:30:00Z"
  }
}
```

**Content Negotiation**:

| Accept Header | Response Format | Use Case |
|---------------|----------------|----------|
| `application/ld+json` | JSON-LD | Linked Data applications |
| `application/json` | Plain JSON | APIs, JavaScript |
| `text/turtle` | RDF Turtle | SPARQL, semantic web |
| `application/rdf+xml` | RDF/XML | Legacy RDF systems |
| `text/html` | HTML landing page | Human browsing |
| `text/plain` | Plain text summary | Simple debugging |
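
Server-side, content negotiation amounts to picking the supported media type with the highest `q` value from the `Accept` header. The sketch below is a simplified illustration: the function name `negotiate`, the JSON-LD default, and the decision to ignore wildcards such as `*/*` are assumptions, and a production resolver would more likely rely on its web framework's negotiation support.

```python
SUPPORTED_TYPES = [
    "application/ld+json",
    "application/json",
    "text/turtle",
    "application/rdf+xml",
    "text/html",
    "text/plain",
]


def negotiate(accept_header: str, default: str = "application/ld+json") -> str:
    """Pick the supported media type with the highest q value."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # malformed q value: treat as not acceptable
        if media_type in SUPPORTED_TYPES and quality > 0:
            # Sort by descending quality, then by order of appearance.
            candidates.append((-quality, position, media_type))
    if not candidates:
        return default  # assumption: fall back instead of answering 406
    return min(candidates)[2]


chosen = negotiate("text/html;q=0.5, application/json")
```
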

### HTTP Status Codes

| Code | Meaning | When Used |
|------|---------|-----------|
| **200 OK** | Identifier resolved successfully | Record found |
| **303 See Other** | Redirect to canonical URL | Multiple URLs for same resource |
| **404 Not Found** | Identifier not in registry | Unknown GHCID |
| **410 Gone** | Institution closed/merged | Record marked as inactive |
| **500 Internal Server Error** | Resolver malfunction | Service downtime |
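
The decision logic behind the status table can be sketched as a small function. Everything here is illustrative: `resolution_status` and `INACTIVE_STATUSES` are hypothetical names, records are modeled as plain dictionaries, and the precedence (inactive records answer 410 even when reached through an alias URL) is an assumption the table does not settle.

```python
from typing import Optional

# Assumption: organization_status values that mark a record as inactive.
INACTIVE_STATUSES = {"CLOSED"}


def resolution_status(record: Optional[dict], is_alias: bool = False) -> int:
    """Map a registry lookup result to an HTTP status code."""
    if record is None:
        return 404  # identifier not in registry
    if record.get("organization_status") in INACTIVE_STATUSES:
        return 410  # institution closed or merged
    if is_alias:
        return 303  # non-canonical URL: redirect to the canonical one
    return 200      # record found at its canonical URL


active = {"name": "Rijksmuseum", "organization_status": "ACTIVE"}
status = resolution_status(active)
```

Unexpected failures are left to the surrounding web framework, which would translate an unhandled exception into a 500 response.
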

### Persistence Commitment

**Requirement**: Resolution service must commit to:

1. **Minimum 50-year operation** (heritage institutions have multi-century lifespans)
2. **High availability** (99.9% uptime SLA)
3. **Multi-region redundancy** (geographic distribution)
4. **Daily backups** with disaster recovery plan
5. **Transparent governance** (public policies, community oversight)
6. **Open-source resolver code** (forkable by community if needed)

**Funding Model** (options):
- Grant funding (national libraries, heritage foundations)
- Membership fees (GLAM consortia, aggregators)
- Government support (cultural heritage agencies)
- Cloud provider donations (Google, AWS, Azure)

---

## Governance Model

### Organizational Structure

**GHCID is proposed as a community-governed persistent identifier scheme**, modeled on successful PID systems like DOI, ARK, and Handle.

#### Proposed Governance Body

**GHCID Consortium** (working name):

1. **Steering Committee** (7-9 members)
   - Representatives from: National libraries, international archives, museum networks
   - Terms: 3 years, staggered rotation
   - Responsibilities: Policy decisions, budget oversight, strategic direction

2. **Technical Working Group**
   - Developers, data scientists, Linked Data experts
   - Responsibilities: Specification updates, resolver development, community tools

3. **Community Advisory Board**
   - Heritage institutions, researchers, aggregators (Europeana, DPLA)
   - Responsibilities: Use case feedback, adoption guidance

4. **Secretariat**
   - Permanent staff (2-3 FTE)
   - Responsibilities: Day-to-day operations, resolver maintenance, documentation

### Coordination with Existing Systems

GHCID does NOT replace existing identifier systems; it **complements and coordinates** with:

1. **ISIL (ISO 15511)**
   - Store ISIL codes as secondary identifiers
   - Cross-reference GHCID ↔ ISIL mapping
   - Collaborate with national ISIL agencies

2. **Wikidata**
   - Propose GHCID as new external identifier property
   - Link GHCID records to Wikidata Q-numbers
   - Enable bidirectional cross-referencing

3. **VIAF (Virtual International Authority File)**
   - For institutions with VIAF records, store VIAF ID
   - Coordinate with OCLC on authority control

4. **GeoNames**
   - Use GeoNames IDs for geographic components
   - Link GHCID locations to GeoNames URIs

5. **Europeana / DPLA**
   - Integrate GHCID into aggregator metadata
   - Use UUID v5 format for interoperability

### Identifier Assignment Policies

**Who can assign a GHCID?**

Option 1: **Open Generation** (preferred for transparency)
- Anyone can generate a GHCID using the open-source algorithm
- Deterministic generation ensures same institution → same ID
- Conflicts resolved via community review

Option 2: **Registry-Based** (traditional PID model)
- Institutions apply to GHCID Consortium for assignment
- Manual review ensures accuracy
- Slower, but higher quality control

**Recommendation**: Hybrid approach
- Open generation for most institutions (self-service)
- Optional manual review for complex cases (mergers, disputes)
- Community validation via Wikidata, ISIL cross-checks

### Dispute Resolution

**Scenario**: Two GHCID records claim to represent the same institution

**Resolution Process**:
1. Automated detection (name similarity, ISIL code match)
2. Community flagging (anyone can report suspected duplicates)
3. Review by Technical Working Group
4. Merge records, redirect old GHCID to canonical GHCID (HTTP 303)
5. Update provenance metadata with merge event
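
Step 1 of this process, automated detection, could start from something as simple as the sketch below. It is a rough illustration only: `likely_duplicates`, the dictionary record shape, and the 0.9 similarity threshold are assumptions, and `difflib.SequenceMatcher` stands in for whatever name-matching method the project eventually adopts.

```python
from difflib import SequenceMatcher


def likely_duplicates(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Flag two records as candidate duplicates for human review."""
    # An identical, non-empty ISIL code is treated as a strong signal.
    isil_a, isil_b = a.get("isil_code"), b.get("isil_code")
    if isil_a and isil_a == isil_b:
        return True
    # Otherwise fall back to case-insensitive name similarity.
    ratio = SequenceMatcher(None, a["name"].casefold(), b["name"].casefold()).ratio()
    return ratio >= threshold


flagged = likely_duplicates(
    {"name": "Rijksmuseum Amsterdam", "isil_code": "NL-AsdRM"},
    {"name": "Rijksmuseum", "isil_code": "NL-AsdRM"},
)
```

A match here would only queue the pair for the community flagging and review steps; it would not merge records automatically.
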

### Versioning and Deprecation

**GHCID Specification Versioning**:
- Semantic versioning: `MAJOR.MINOR.PATCH`
- Current version: `1.0.0` (this document)
- Backward compatibility guaranteed within a MAJOR version (across MINOR and PATCH releases)

**Identifier Deprecation**:
- GHCIDs are **never deleted** (persistence requirement)
- Closed/merged institutions marked as `organization_status: CLOSED`
- HTTP 410 Gone response with pointer to successor organization
- Change history tracked in `change_history` field

---

## Comparison with Existing PID Systems

### Feature Comparison Table

| Feature | **GHCID** | **ISIL** | **DOI** | **ARK** | **Handle** | **Wikidata** |
|---------|----------|---------|---------|---------|------------|--------------|
| **Domain** | Heritage institutions | Libraries/archives | Scholarly objects | Cultural heritage | Digital objects | Entities (all types) |
| **Coverage** | Global (any institution) | Limited (registry-based) | Scholarly publications | Libraries, museums | Repositories | Global (crowdsourced) |
| **Registration** | Open (deterministic) | Required | Required | Required | Required | Open (crowdsourced) |
| **Format** | UUID, GHCID string, numeric | Country-local code | `10.xxxx/yyyy` | `ark:/nnnnn/xxx` | `hdl:xxxx/yyyy` | `Q12345` |
| **Resolution** | HTTPS (planned) | No standard resolver | doi.org | n2t.net | handle.net | wikidata.org |
| **Governance** | Proposed consortium | National agencies | IDF (non-profit) | CDL (California) | CNRI (non-profit) | Wikimedia Foundation |
| **Cost** | Free (open) | Free | Paid (varies) | Free | Paid (varies) | Free |
| **Standard** | RFC 4122, ISO 3166 | ISO 15511 | ISO 26324 | IETF draft | IETF RFC 3650 | Community-driven |
| **Change Tracking** | ✅ Built-in | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Edit history |
| **Multi-Format** | ✅ 4 formats | ❌ Single | ❌ Single | ❌ Single | ❌ Single | ❌ Single |
| **Adoption** | New (0) | High (libraries) | Very High | Medium | Medium | Very High |

### Unique GHCID Advantages

1. **Change History Tracking**: Built-in organizational evolution modeling (mergers, splits, relocations)
2. **Multi-Format Flexibility**: Human-readable, UUID, numeric formats from same base ID
3. **Open Generation**: Deterministic algorithm, no registration bureaucracy
4. **Global Coverage**: Not limited to countries with ISIL registries
5. **Linked Data Native**: Designed for RDF/JSON-LD from the start
6. **Data Quality Tiers**: 4-tier provenance system (TIER_1 through TIER_4)

### Why Not Just Use Wikidata?

**Wikidata is excellent but has limitations for PIDs**:

| Aspect | Wikidata | GHCID |
|--------|----------|-------|
| **Identifier Format** | `Q12345` (sequential) | UUID (content-addressed) |
| **Determinism** | ❌ No (assigned sequentially) | ✅ Yes (hash-based) |
| **Regeneration** | ❌ Lost if database corrupted | ✅ Can regenerate from metadata |
| **Governance** | Wikimedia Foundation | Heritage community |
| **Specialization** | General knowledge base | Heritage institutions only |
| **Provenance** | Edit history | Structured provenance model |
| **Data Quality** | Crowdsourced (variable) | Tiered quality system |

**Recommendation**: Use GHCID as **primary PID**, link to Wikidata Q-number as **secondary identifier**

---

## Implementation Roadmap

### Phase 1: Foundation (2025 Q1-Q2) 🔄 IN PROGRESS

**Status**: Currently implementing

**Deliverables**:
- ✅ GHCID specification document (this document)
- ✅ UUID generation library (`src/glam_extractor/identifiers/ghcid.py`)
- ✅ LinkML schema with GHCID fields (`schemas/core.yaml`)
- ✅ Test suite (UUID determinism, collision resistance)
- 🔄 GeoNames integration for city codes
- 🔄 ISO 3166-2 lookup tables

**Milestones**:
- [x] GHCID format design
- [x] UUID v5 implementation
- [x] UUID SHA-256 implementation
- [x] Numeric ID implementation
- [ ] GeoNames geocoding service
- [ ] ISO 3166-2 reference data

### Phase 2: Data Production (2025 Q2-Q3)

**Deliverables**:
- Dutch institutions with GHCID (1,351 organizations)
- ISIL registry with GHCID (364 institutions)
- Conversation data with GHCID (estimated 2,000-5,000 institutions)
- Cross-linked dataset (merged by GHCID UUID)

**Milestones**:
- [ ] Generate GHCID for all Dutch datasets
- [ ] Generate GHCID for conversation extractions
- [ ] Cross-link by UUID v5
- [ ] Publish test dataset (100 institutions)

### Phase 3: Resolution Service (2025 Q4)

**Deliverables**:
- GHCID resolver prototype (Python FastAPI)
- JSON-LD API endpoint
- HTML landing pages
- Content negotiation support
- Registry database (PostgreSQL + RDF triplestore)

**Milestones**:
- [ ] Resolver API implementation
- [ ] Database schema for GHCID registry
- [ ] Load Dutch dataset into resolver
- [ ] Public demo deployment

### Phase 4: Community Engagement (2026 Q1-Q2)

**Deliverables**:
- GHCID specification v1.0 (finalized)
- Outreach to Europeana, DPLA, IIIF communities
- Proposal to Wikidata for new external ID property
- Coordination meetings with ISIL agencies
- Community feedback incorporation

**Milestones**:
- [ ] Present at CIDOC, ICA, IFLA conferences
- [ ] Publish RFC or W3C Community Group Note
- [ ] Partner with 3-5 heritage institutions for pilot
- [ ] Gather feedback, iterate on specification

### Phase 5: Governance Establishment (2026 Q3-Q4)

**Deliverables**:
- GHCID Consortium formation
- Steering Committee election
- Long-term funding secured
- Resolver production deployment
- Governance policies published

**Milestones**:
- [ ] Incorporate GHCID Consortium (non-profit)
- [ ] Secure 3-year funding commitment
- [ ] Deploy production resolver (multi-region)
- [ ] Establish community governance processes

### Phase 6: Scaling and Adoption (2027+)

**Deliverables**:
- Global dataset (50,000+ institutions)
- Integration with major aggregators
- Resolver SLA (99.9% uptime)
- Annual community meetings
- Ongoing maintenance and updates

**Milestones**:
- [ ] 10,000 institutions with GHCID
- [ ] Europeana integration
- [ ] DPLA integration
- [ ] Wikidata property approved
- [ ] 50-year persistence commitment

---

## Technical Specifications

### Data Model

**Core GHCID Fields** (LinkML schema):

```yaml
HeritageCustodian:
  slots:
    # Primary identifiers
    - id                  # UUID v5 (primary key)
    - record_id           # UUID v7 (database PK, time-ordered)
    - ghcid_uuid          # UUID v5 (same as id)
    - ghcid_uuid_sha256   # UUID SHA-256 (future-proof)
    - ghcid_numeric       # Numeric (64-bit)

    # Human-readable GHCIDs
    - ghcid_current       # Current GHCID string (may change)
    - ghcid_original      # Original GHCID string (FROZEN, true PID)

    # GHCID history
    - ghcid_history       # List of GHCIDHistoryEntry
```

**GHCID History Entry**:

```yaml
GHCIDHistoryEntry:
  description: Tracks changes to GHCID over time (relocations, name changes)
  slots:
    - ghcid_value         # GHCID string at this point in time
    - valid_from          # ISO 8601 date (when this GHCID became active)
    - valid_to            # ISO 8601 date (when this GHCID was superseded)
    - change_reason       # Reason for change (relocation, name change, etc.)
    - related_event       # Link to ChangeEvent if applicable
```

**Example**:

```yaml
# Noord-Hollands Archief (formed 2001 via merger)
id: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 (placeholder value)
ghcid_original: NL-NH-2750053-A-NHA       # Frozen forever
ghcid_current: NL-NH-2750053-A-NHA        # Same (no changes yet)

ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: null  # Still valid
    change_reason: FOUNDING
    related_event: ghcid:event-nha-merger-2001

---
# If it relocates in 2030, ghcid_history becomes (separate YAML document):
ghcid_history:
  - ghcid_value: NL-NH-2750053-A-NHA
    valid_from: "2001-01-01"
    valid_to: "2030-06-15"
  - ghcid_value: NL-NH-9876543-A-NHA  # New city GeoNames ID
    valid_from: "2030-06-15"
    valid_to: null
    change_reason: RELOCATION
    related_event: ghcid:event-nha-relocation-2030
```
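
The relocation scenario in the example can be expressed as a small update routine: close the currently open history entry and append the new GHCID. The sketch below assumes history entries are plain dictionaries with the slot names from `GHCIDHistoryEntry`; the function name `record_ghcid_change` is hypothetical.

```python
from datetime import date
from typing import Optional


def record_ghcid_change(history: list, new_ghcid: str, changed_on: date,
                        reason: str, related_event: Optional[str] = None) -> list:
    """Return a new history list with the open entry closed and a new one added."""
    updated = [dict(entry) for entry in history]  # leave the input untouched
    for entry in updated:
        if entry.get("valid_to") is None:
            entry["valid_to"] = changed_on.isoformat()
    updated.append({
        "ghcid_value": new_ghcid,
        "valid_from": changed_on.isoformat(),
        "valid_to": None,
        "change_reason": reason,
        "related_event": related_event,
    })
    return updated


history = [{
    "ghcid_value": "NL-NH-2750053-A-NHA",
    "valid_from": "2001-01-01",
    "valid_to": None,
    "change_reason": "FOUNDING",
}]
after_move = record_ghcid_change(
    history, "NL-NH-9876543-A-NHA", date(2030, 6, 15), "RELOCATION",
    related_event="ghcid:event-nha-relocation-2030",
)
```

Only `ghcid_current` and the history would change in such an update; `ghcid_original` and the UUID derived from it stay frozen.
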

### API Specification

**Resolver API Endpoints**:

#### 1. Resolve by UUID v5

```http
GET /uuid/{uuid} HTTP/1.1
Host: id.heritage.example.org
Accept: application/ld+json

Response: 200 OK
Content-Type: application/ld+json
{
  "@context": "https://w3id.org/heritage/custodian/context.jsonld",
  "@id": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000",
  ...
}
```

#### 2. Resolve by GHCID String

```http
GET /ghcid/NL-NH-2759794-M-RM HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```

#### 3. Search by Name

```http
GET /search?name=Rijksmuseum&country=NL HTTP/1.1

Response: 200 OK
Content-Type: application/json
{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Rijksmuseum",
      "ghcid": "NL-NH-2759794-M-RM",
      "url": "https://www.rijksmuseum.nl"
    }
  ],
  "total": 1
}
```

#### 4. Reverse Lookup (Numeric → UUID)

```http
GET /numeric/12345678901234567 HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```

### Database Schema (PostgreSQL)
|
||
|
||
```sql
|
||
CREATE TABLE ghcid_registry (
|
||
-- Primary keys
|
||
id UUID PRIMARY KEY, -- UUID v5 (primary identifier)
|
||
record_id UUID UNIQUE NOT NULL, -- UUID v7 (database record ID)
|
||
|
||
-- Identifiers
|
||
ghcid_uuid UUID UNIQUE NOT NULL, -- UUID v5 (same as id)
|
||
ghcid_uuid_sha256 UUID UNIQUE NOT NULL, -- UUID SHA-256
|
||
ghcid_numeric BIGINT UNIQUE NOT NULL, -- Numeric (64-bit)
|
||
ghcid_original VARCHAR(100) UNIQUE NOT NULL, -- Frozen GHCID string
|
||
ghcid_current VARCHAR(100) NOT NULL, -- Current GHCID string
|
||
|
||
-- Metadata
|
||
name TEXT NOT NULL,
|
||
institution_type VARCHAR(50) NOT NULL,
|
||
organization_status VARCHAR(20) DEFAULT 'ACTIVE',
|
||
|
||
-- Geographic
|
||
country CHAR(2) NOT NULL, -- ISO 3166-1
|
||
region VARCHAR(10), -- ISO 3166-2
|
||
city_geonames_id INTEGER, -- GeoNames ID
|
||
|
||
-- External identifiers
|
||
isil_code VARCHAR(50),
|
||
wikidata_id VARCHAR(20),
|
||
viaf_id VARCHAR(50),
|
||
|
||
-- Provenance
|
||
data_source VARCHAR(50) NOT NULL,
|
||
data_tier VARCHAR(20) NOT NULL,
|
||
extraction_date TIMESTAMPTZ NOT NULL,
|
||
|
||
-- Timestamps
|
||
created_at TIMESTAMPTZ DEFAULT NOW(),
|
||
updated_at TIMESTAMPTZ DEFAULT NOW(),
|
||
|
||
-- Indexes
|
||
INDEX idx_ghcid_original (ghcid_original),
|
||
INDEX idx_ghcid_current (ghcid_current),
|
||
INDEX idx_name (name),
|
||
INDEX idx_country_region (country, region),
|
||
INDEX idx_isil (isil_code),
|
||
INDEX idx_wikidata (wikidata_id)
|
||
);
|
||
|
||
CREATE TABLE ghcid_history (
|
||
id SERIAL PRIMARY KEY,
|
||
ghcid_uuid UUID NOT NULL REFERENCES ghcid_registry(id),
|
||
ghcid_value VARCHAR(100) NOT NULL,
|
||
valid_from DATE NOT NULL,
|
||
valid_to DATE,
|
||
change_reason VARCHAR(50),
|
||
related_event_id VARCHAR(200),
|
||
|
||
INDEX idx_ghcid_uuid (ghcid_uuid),
|
||
INDEX idx_valid_dates (valid_from, valid_to)
|
||
);
|
||
```
|
||
|
---

## GeoNames Settlement Resolution

### Overview

The City component of GHCID relies on GeoNames for standardized settlement names and geographic resolution. This section defines critical rules for GeoNames integration.

### Feature Code Filtering (CRITICAL)

**NEVER use neighborhoods or districts (PPLX) for GHCID generation. ONLY use proper settlements (cities, towns, villages).**

GeoNames classifies populated places with feature codes. When reverse geocoding coordinates to find a settlement, you MUST filter by feature code.

#### ALLOWED Feature Codes

| Code | Description | Example |
|------|-------------|---------|
| **PPL** | Populated place (city/town/village) | Apeldoorn, Hamont, Lelystad |
| **PPLA** | Seat of first-order admin division | Provincial capitals |
| **PPLA2** | Seat of second-order admin division | Municipal seats |
| **PPLA3** | Seat of third-order admin division | District seats |
| **PPLA4** | Seat of fourth-order admin division | Sub-district seats |
| **PPLC** | Capital of a political entity | Amsterdam, Brussels |
| **PPLS** | Populated places (multiple) | Settlement clusters |
| **PPLG** | Seat of government | The Hague |

#### EXCLUDED Feature Codes

| Code | Description | Why Excluded |
|------|-------------|--------------|
| **PPLX** | Section of populated place | Neighborhoods, districts, quarters |

**Problem Example**:

Without feature code filtering, reverse geocoding may return:
- ❌ "Binnenstad" (PPLX, neighborhood, pop 4,900) - WRONG
- ✅ "Apeldoorn" (PPL, city, pop 136,670) - CORRECT

#### SQL Implementation

```sql
SELECT
    name, ascii_name, admin1_code, admin1_name,
    latitude, longitude, geonames_id, population, feature_code,
    ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) AS distance_sq
FROM cities
WHERE country_code = ?
  AND feature_code IN ('PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLS', 'PPLG')
ORDER BY distance_sq
LIMIT 1
```

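The filtered lookup can be exercised end to end with Python's built-in `sqlite3` module. The sketch below is illustrative only: the `nearest_settlement` helper, the reduced `cities` table layout, and the two sample rows (including the made-up ID and coordinates for "Binnenstad") are assumptions, not part of the GHCID tooling.

```python
import sqlite3

ALLOWED_FEATURE_CODES = ("PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG")


def nearest_settlement(conn, country_code, lat, lon):
    """Return the closest allowed settlement as (name, geonames_id, feature_code, distance_sq)."""
    placeholders = ", ".join("?" for _ in ALLOWED_FEATURE_CODES)
    sql = f"""
        SELECT name, geonames_id, feature_code,
               ((latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)) AS distance_sq
        FROM cities
        WHERE country_code = ?
          AND feature_code IN ({placeholders})
        ORDER BY distance_sq
        LIMIT 1
    """
    params = (lat, lat, lon, lon, country_code, *ALLOWED_FEATURE_CODES)
    return conn.execute(sql, params).fetchone()


# Hypothetical sample data: the PPLX row is deliberately the closer of the two.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (name TEXT, geonames_id INTEGER, country_code TEXT, "
    "feature_code TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Binnenstad", 1, "NL", "PPLX", 52.2115, 5.9690),
        ("Apeldoorn", 2759706, "NL", "PPL", 52.21, 5.96944),
    ],
)

row = nearest_settlement(conn, "NL", 52.21116, 5.96978)
print(row[0], row[2])  # Apeldoorn PPL
```

The squared-degree distance is kept from the query above. It is adequate for ranking nearby candidates, but it is not a great-circle distance and should not be reported as kilometres.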
### Country Code Detection

**CRITICAL**: Determine country code from entry data BEFORE calling GeoNames reverse geocoding.

GeoNames queries are country-specific. Using the wrong country code will return incorrect results.

**Country Code Resolution Priority**:

1. `zcbs_enrichment.country` - Most explicit source
2. `location.country` - Direct location field
3. `locations[].country` - Array location field
4. `original_entry.country` - CSV source field
5. `google_maps_enrichment.address` - Parse from address string
6. `wikidata_enrichment.located_in.label` - Infer from Wikidata
7. Default: `"NL"` (Netherlands) - Only if no other source

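The priority list can be read as a first-match-wins lookup over the entry dictionary. The following is a minimal sketch under stated assumptions: `resolve_country_code`, `_dig`, and `_from_text` are hypothetical helpers, and the tiny `COUNTRY_NAMES` table only stands in for whatever address parsing and Wikidata inference steps 5 and 6 actually use.

```python
from typing import Any, Dict, Optional

# Hypothetical lookup for steps 5 and 6; a real implementation needs a fuller table.
COUNTRY_NAMES = {"netherlands": "NL", "nederland": "NL", "belgium": "BE", "germany": "DE"}


def _dig(entry: Dict[str, Any], *keys: str) -> Optional[Any]:
    """Follow nested dictionary keys, returning None if any level is missing."""
    current: Any = entry
    for key in keys:
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def _from_text(text: Optional[str]) -> Optional[str]:
    """Find a known country name inside free text (steps 5 and 6)."""
    if not text:
        return None
    lowered = text.lower()
    for name, code in COUNTRY_NAMES.items():
        if name in lowered:
            return code
    return None


def resolve_country_code(entry: Dict[str, Any]) -> str:
    """Apply the priority list in order and return the first country code found."""
    first_array_country = next(
        (loc.get("country") for loc in entry.get("locations") or [] if loc.get("country")),
        None,
    )
    candidates = [
        _dig(entry, "zcbs_enrichment", "country"),
        _dig(entry, "location", "country"),
        first_array_country,
        _dig(entry, "original_entry", "country"),
        _from_text(_dig(entry, "google_maps_enrichment", "address")),
        _from_text(_dig(entry, "wikidata_enrichment", "located_in", "label")),
    ]
    for value in candidates:
        if value:
            return str(value).upper()
    return "NL"  # Step 7: default, used only when no other source is available


print(resolve_country_code({"google_maps_enrichment": {"address": "Hamont, Belgium"}}))  # BE
```

Building the candidate list in priority order keeps the rule set visible in one place, so adding or reordering a source is a one-line change.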
### Provenance Tracking

Record GeoNames resolution in entry metadata:

```yaml
location_resolution:
  method: REVERSE_GEOCODE
  geonames_id: 2759706
  geonames_name: Apeldoorn
  feature_code: PPL  # MUST be PPL, PPLA*, PPLC, PPLS, or PPLG
  admin1_code: '03'
  region_code: GE
  country_code: NL
  source_coordinates:
    latitude: 52.21116
    longitude: 5.96978
  distance_km: 0.5
```

**Validation**: If `feature_code: PPLX` appears in metadata, the GHCID is WRONG and must be regenerated.

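The validation rule lends itself to an automated check over stored entries. The sketch below assumes entries are plain dictionaries shaped like the `location_resolution` example. The function name and the wording of the messages are hypothetical.

```python
from typing import Any, Dict, List

ALLOWED_FEATURE_CODES = {"PPL", "PPLA", "PPLA2", "PPLA3", "PPLA4", "PPLC", "PPLS", "PPLG"}


def validate_location_resolution(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    resolution = entry.get("location_resolution") or {}
    code = resolution.get("feature_code")
    problems: List[str] = []
    if code is None:
        problems.append("missing feature_code")
    elif code == "PPLX":
        # A neighborhood was used as the settlement, so the GHCID is invalid.
        problems.append("feature_code PPLX: GHCID must be regenerated")
    elif code not in ALLOWED_FEATURE_CODES:
        problems.append(f"feature_code {code} is not an allowed settlement code")
    return problems


good = {"location_resolution": {"feature_code": "PPL", "geonames_id": 2759706}}
bad = {"location_resolution": {"feature_code": "PPLX"}}
print(validate_location_resolution(good))  # []
print(validate_location_resolution(bad))
```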
---

## Use Cases and Applications

### 1. Academic Citations

**Scenario**: A researcher cites an archival collection in an academic paper

**Without GHCID**:
```
"See the Municipal Archives of Haarlem for records from 1245-1800."
```
Problem: The name may change or the institution may merge, and the citation becomes ambiguous

**With GHCID**:
```
"See the Noord-Hollands Archief (urn:uuid:550e8400-e29b-41d4-a716-446655440000)
for Haarlem municipal records from 1245-1800."
```
Benefit: The persistent identifier remains valid even if the organization changes name or merges

### 2. Metadata Aggregation (Europeana, DPLA)

**Scenario**: Europeana aggregates metadata from 4,000 institutions

**Without GHCID**:
```json
{
  "institution": "National Library",
  "country": "Netherlands"
}
```
Problem: Multiple institutions are called "National Library", so the reference is ambiguous

**With GHCID**:
```json
{
  "@id": "urn:uuid:a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d",
  "institution": "Koninklijke Bibliotheek",
  "sameAs": "https://id.heritage.example.org/uuid/a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d"
}
```
Benefit: A unique identifier enables deduplication, cross-referencing, and provenance tracking

### 3. IIIF Manifests

**Scenario**: A museum publishes an IIIF manifest for a digitized collection

**GHCID Integration**:
```json
{
  "@context": "http://iiif.io/api/presentation/3/context.json",
  "id": "https://iiif.rijksmuseum.nl/manifest/123",
  "type": "Manifest",
  "provider": [{
    "id": "urn:uuid:550e8400-e29b-41d4-a716-446655440000",
    "type": "Agent",
    "label": {"en": ["Rijksmuseum"]},
    "sameAs": "https://id.heritage.example.org/uuid/550e8400-e29b-41d4-a716-446655440000"
  }]
}
```
Benefit: IIIF consumers can resolve the provider to authoritative metadata

### 4. Wikidata Integration

**Scenario**: Link a Wikidata item to a heritage institution

**Wikidata Property Proposal**: `P-GHCID` (GHCID identifier)

```sparql
# Wikidata SPARQL query
SELECT ?item ?itemLabel ?ghcid WHERE {
  ?item wdt:P-GHCID ?ghcid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

Benefit: Bidirectional linking between Wikidata and the GHCID registry

### 5. Merger/Split Tracking

**Scenario**: Two archives merge in 2001

**GHCID Representation**:

```yaml
# Gemeentearchief Haarlem (predecessor)
- id: old-uuid-1
  ghcid_original: NL-NH-2759794-A-GH
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Rijksarchief in Noord-Holland (predecessor)
- id: old-uuid-2
  ghcid_original: NL-NH-2759794-A-RNH
  organization_status: CLOSED
  change_history:
    - change_type: MERGER
      event_date: "2001-01-01"
      resulting_organization: new-uuid

# Noord-Hollands Archief (successor)
- id: new-uuid
  ghcid_original: NL-NH-2750053-A-NHA
  organization_status: ACTIVE
  change_history:
    - change_type: FOUNDING
      event_date: "2001-01-01"
      affected_organization: [old-uuid-1, old-uuid-2]
```

**Resolver Behavior**:

```http
GET /uuid/old-uuid-1 HTTP/1.1

Response: 410 Gone
Location: /uuid/new-uuid
{
  "status": "CLOSED",
  "reason": "Merged into Noord-Hollands Archief",
  "successor": "https://id.heritage.example.org/uuid/new-uuid",
  "effective_date": "2001-01-01"
}
```

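A client or resolver can find the current successor of a closed institution by walking the `MERGER` events in `change_history`. The following is a minimal sketch, assuming records are held in a dictionary keyed by `id`. The `current_successor` function and the cycle check are illustrative additions, not part of the specification.

```python
from typing import Dict

# Records shaped like the YAML example above (abbreviated).
records: Dict[str, dict] = {
    "old-uuid-1": {
        "organization_status": "CLOSED",
        "change_history": [{"change_type": "MERGER", "resulting_organization": "new-uuid"}],
    },
    "old-uuid-2": {
        "organization_status": "CLOSED",
        "change_history": [{"change_type": "MERGER", "resulting_organization": "new-uuid"}],
    },
    "new-uuid": {
        "organization_status": "ACTIVE",
        "change_history": [{"change_type": "FOUNDING"}],
    },
}


def current_successor(registry: Dict[str, dict], record_id: str) -> str:
    """Follow MERGER events until a record without a known successor is reached."""
    seen = set()
    current = record_id
    while current not in seen:
        seen.add(current)
        successor = None
        for event in registry[current].get("change_history", []):
            if event.get("change_type") == "MERGER" and event.get("resulting_organization"):
                successor = event["resulting_organization"]
        if successor is None or successor not in registry:
            return current
        current = successor
    # Guard against malformed data in which merger links form a loop.
    raise ValueError(f"cycle in change history starting at {record_id}")


print(current_successor(records, "old-uuid-1"))  # new-uuid
```

Following the chain rather than a single hop matters when a successor has itself merged again later.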
---

## GHCID Collision Resolution and Timeline Examples

### Collision Scenarios

GHCID collisions occur when two different institutions generate the **same base GHCID string** (identical country, region, city, type, and abbreviation). Resolution strategy **depends on temporal context**: when were the colliding institutions discovered?

### Core Principle: Temporal Priority Determines Strategy

**Rule**: The **timing of institution discovery** determines which institutions receive native language name suffixes.

**Collision Suffix**: Collisions are resolved by appending the **full legal name in native language in snake_case format**.

| Scenario | When Detected | Resolution Strategy | Rationale |
|----------|--------------|---------------------|-----------|
| **First Batch Collision** | Multiple institutions discovered simultaneously (e.g., batch CSV import) | **ALL colliding institutions** receive name suffixes | Fair treatment: no institution has temporal priority |
| **Historical Addition** | New institution collides with **already published** GHCID | **ONLY new institution** receives name suffix | PID stability: preserve existing published identifiers |

### Name Suffix Generation

**Converting institution names to snake_case suffixes:**

```python
import re
import unicodedata

def generate_name_suffix(native_name: str) -> str:
    """Convert native language institution name to snake_case suffix.

    Examples:
        "Stedelijk Museum Amsterdam" → "stedelijk_museum_amsterdam"
        "Musée d'Orsay" → "musee_dorsay"
        "Österreichische Nationalbibliothek" → "osterreichische_nationalbibliothek"
    """
    # Normalize unicode (NFD decomposition) and remove diacritics
    normalized = unicodedata.normalize('NFD', native_name)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')

    # Convert to lowercase
    lowercase = ascii_name.lower()

    # Remove apostrophes (straight and curly), commas, and other punctuation
    no_punct = re.sub(r"['’‘`\",.:;!?()\[\]{}]", '', lowercase)

    # Replace spaces and hyphens with underscores
    underscored = re.sub(r'[\s\-]+', '_', no_punct)

    # Remove any remaining non-alphanumeric characters (except underscores)
    clean = re.sub(r'[^a-z0-9_]', '', underscored)

    # Collapse multiple underscores
    final = re.sub(r'_+', '_', clean).strip('_')

    return final
```

### Timeline Example 1: First Batch Collision

**Date**: 2025-11-01
**Event**: Dutch ISIL Registry batch import (364 institutions)

```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import from Dutch ISIL Registry         │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-SM

Institution 1: Stedelijk Museum Amsterdam
  - Discovered: 2025-11-01 (batch import)
  → Resolution: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam

Institution 2: Science Museum Amsterdam
  - Discovered: 2025-11-01 (batch import)
  → Resolution: NL-NH-AMS-M-SM-science_museum_amsterdam

Strategy: FIRST_BATCH
Reason: Both institutions discovered simultaneously in batch import
Result: BOTH receive name suffixes derived from their native language names
```

**GHCID Records**:

```yaml
# Institution 1
- id: 550e8400-e29b-41d4-a716-446655440000  # UUID v5 of full GHCID
  ghcid_original: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
  name: Stedelijk Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"

# Institution 2
- id: a1b2c3d4-e5f6-8a1b-9c2d-3e4f5a6b7c8d  # UUID v5 of full GHCID
  ghcid_original: NL-NH-AMS-M-SM-science_museum_amsterdam
  ghcid_current: NL-NH-AMS-M-SM-science_museum_amsterdam
  name: Science Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"
    data_source: CSV_REGISTRY
    notes: "First batch collision: name suffix added during initial import"
```

**Key Point**: Both institutions created with name suffixes from the start. No existing PID was changed.

---

### Timeline Example 2: Historical Addition (Single Institution)

**Date**: 2025-11-15
**Event**: Historical research reveals Amsterdam Historical Museum (closed 2001)

```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import                                  │
│ Hermitage Museum Amsterdam → NL-NH-AMS-M-HM (PUBLISHED)         │
└─────────────────────────────────────────────────────────────────┘
         │
         │ (14 days pass, GHCID used in citations, APIs)
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-15: Historical Institution Added                        │
│ Amsterdam Historical Museum → Collides with published GHCID!    │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-HM

Existing Institution (PUBLISHED 2025-11-01):
  - Name: Hermitage Museum Amsterdam
  - GHCID: NL-NH-AMS-M-HM ← UNCHANGED (published, immutable)
  - Publication date: 2025-11-01T10:00:00Z

New Institution (Being added 2025-11-15):
  - Name: Amsterdam Historical Museum (historical, 1926-2001)
  - GHCID: NL-NH-AMS-M-HM-amsterdam_historical_museum ← Gets name suffix

Strategy: HISTORICAL_ADDITION
Reason: New institution collides with already published GHCID
Result: ONLY new institution receives name suffix; existing GHCID preserved
```

**GHCID Records**:

```yaml
# Existing institution (UNCHANGED)
- id: existing-uuid-1234  # Unchanged
  ghcid_original: NL-NH-AMS-M-HM  # FROZEN (no name suffix)
  ghcid_current: NL-NH-AMS-M-HM   # FROZEN (no name suffix)
  name: Hermitage Museum Amsterdam
  provenance:
    extraction_date: "2025-11-01T10:00:00Z"
    publication_date: "2025-11-01T10:00:00Z"  # PUBLISHED
    data_source: CSV_REGISTRY
    # No collision resolution metadata (institution not modified)

# New historical institution (Gets name suffix)
- id: new-uuid-5678  # UUID v5 of full GHCID with name suffix
  ghcid_original: NL-NH-AMS-M-HM-amsterdam_historical_museum
  ghcid_current: NL-NH-AMS-M-HM-amsterdam_historical_museum
  name: Amsterdam Historical Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-11-15T14:30:00Z"
    publication_date: "2025-11-15T14:30:00Z"
    data_source: CONVERSATION_NLP
    notes: >-
      Historical addition collision with published GHCID NL-NH-AMS-M-HM
      (Hermitage Museum Amsterdam, published 2025-11-01). Added name suffix
      to preserve existing PID stability.
  change_history:
    - change_type: FOUNDING
      event_date: "1926-01-01"
    - change_type: CLOSURE
      event_date: "2001-12-31"
      event_description: "Closed; collections transferred to Amsterdam Museum"
```

**Why Preserve Existing GHCID?**

The existing GHCID `NL-NH-AMS-M-HM` may already be:
- Cited in academic publications
- Referenced in third-party datasets (Europeana, Wikidata)
- Used in API responses
- Embedded in RDF triple stores

**Changing it would break citations and external references** → violates PID stability principle ("Cool URIs don't change").

---

### Timeline Example 3: Multiple Historical Additions

**Date**: 2025-12-01
**Event**: Two historical naval museums discovered in archival research

```
┌─────────────────────────────────────────────────────────────────┐
│ 2025-11-01: First Batch Import                                  │
│ Maritime Museum Amsterdam → NL-NH-AMS-M-MM (PUBLISHED)          │
└─────────────────────────────────────────────────────────────────┘
         │
         │ (30 days pass)
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2025-12-01: Historical Research Uncovers Two Naval Museums      │
│ Both collide with published NL-NH-AMS-M-MM                      │
└─────────────────────────────────────────────────────────────────┘

Collision Detected:
  Base GHCID: NL-NH-AMS-M-MM

Existing Institution (PUBLISHED 2025-11-01):
  - Name: Maritime Museum Amsterdam
  - GHCID: NL-NH-AMS-M-MM ← UNCHANGED

New Institution 1 (Being added 2025-12-01):
  - Name: Dutch Navy Museum (historical, 1906-1955)
  - GHCID: NL-NH-AMS-M-MM-dutch_navy_museum ← Gets name suffix

New Institution 2 (Being added 2025-12-01):
  - Name: Amsterdam Naval Archive (historical, 1820-1901)
  - GHCID: NL-NH-AMS-M-MM-amsterdam_naval_archive ← Gets name suffix

Strategy: HISTORICAL_ADDITION (multiple)
Reason: Multiple new institutions collide with same published GHCID
Result: All new institutions get name suffixes; existing GHCID unchanged
```

**GHCID Records**:

```yaml
# Existing institution (UNCHANGED)
- id: maritime-uuid
  ghcid_original: NL-NH-AMS-M-MM  # No name suffix (published first)
  ghcid_current: NL-NH-AMS-M-MM
  name: Maritime Museum Amsterdam
  organization_status: ACTIVE
  provenance:
    publication_date: "2025-11-01T10:00:00Z"

# New historical institution 1
- id: navy-museum-uuid
  ghcid_original: NL-NH-AMS-M-MM-dutch_navy_museum
  ghcid_current: NL-NH-AMS-M-MM-dutch_navy_museum
  name: Dutch Navy Museum
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1906-01-01"
    - change_type: CLOSURE
      event_date: "1955-12-31"

# New historical institution 2
- id: naval-archive-uuid
  ghcid_original: NL-NH-AMS-M-MM-amsterdam_naval_archive
  ghcid_current: NL-NH-AMS-M-MM-amsterdam_naval_archive
  name: Amsterdam Naval Archive
  organization_status: CLOSED
  provenance:
    extraction_date: "2025-12-01T09:00:00Z"
    notes: "Historical addition: name suffix added to avoid collision with NL-NH-AMS-M-MM"
  change_history:
    - change_type: FOUNDING
      event_date: "1820-01-01"
    - change_type: CLOSURE
      event_date: "1901-12-31"
```

**Pattern**: When multiple historical institutions are added simultaneously and collide with the same existing GHCID:
- Existing: No change
- All new: Each gets name suffix derived from native language name

---

### Collision Resolution Workflow Diagram

```
New Institution Detected
         │
         ▼
Generate Base GHCID
         │
         ▼
Check Existing Registry
         │
         ├─── No Collision Found ────────────► Use Base GHCID (no name suffix)
         │
         └─── Collision Found
                   │
                   ▼
         Check Publication Date of Existing Record
                   │
                   ├─── Existing Published ────────► HISTORICAL_ADDITION
                   │                                   - Existing: UNCHANGED
                   │                                   - New: Add name suffix
                   │
                   └─── Both Being Created ────────► FIRST_BATCH
                                                       - All: Add name suffixes
```

### Implementation Guidance

When implementing collision resolution, always check:

1. **Does base GHCID exist in registry?**
   - No → Use base GHCID without name suffix
   - Yes → Proceed to step 2

2. **Does existing record have publication_date?**
   - Yes → HISTORICAL_ADDITION strategy (only new gets name suffix)
   - No → FIRST_BATCH strategy (all get name suffixes)

3. **Track resolution in provenance**:
   ```yaml
   provenance:
     collision_resolution:
       strategy: HISTORICAL_ADDITION | FIRST_BATCH
       collides_with: existing_ghcid (if historical addition)
       existing_publication_date: ISO 8601 timestamp
       reason: "Human-readable explanation"
   ```

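The checks above can be condensed into one decision function. This is a minimal sketch under stated assumptions: `assign_ghcids` is a hypothetical name, the registry is modeled as a dictionary from GHCID string to record, and the name suffixes are assumed to have been produced already (for example by `generate_name_suffix`).

```python
from typing import Dict, List, Optional, Tuple


def assign_ghcids(
    base_ghcid: str,
    new_suffixes: List[str],
    registry: Dict[str, dict],
) -> Tuple[Optional[str], List[str]]:
    """Return (strategy, GHCIDs for the new institutions, in input order).

    `new_suffixes` holds one name suffix per institution that produced
    `base_ghcid` in the current import.
    """
    existing = registry.get(base_ghcid)
    if existing is None and len(new_suffixes) == 1:
        # Step 1: no collision, so the base GHCID is used as-is.
        return None, [base_ghcid]
    suffixed = [f"{base_ghcid}-{suffix}" for suffix in new_suffixes]
    if existing is not None and existing.get("publication_date"):
        # Step 2, "Yes": the published GHCID stays untouched.
        return "HISTORICAL_ADDITION", suffixed
    # Step 2, "No": nothing is published yet, so every colliding record is
    # suffixed. An unpublished existing record must be suffixed by the caller too.
    return "FIRST_BATCH", suffixed


published = {"NL-NH-AMS-M-HM": {"publication_date": "2025-11-01T10:00:00Z"}}
print(assign_ghcids("NL-NH-AMS-M-HM", ["amsterdam_historical_museum"], published))
```

The returned strategy string can be written directly into `provenance.collision_resolution.strategy`.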
### Why This Matters: PID Stability Principle

**Cool URIs Don't Change** (W3C Architecture):
- Once a persistent identifier is published, it should **never** be modified
- External systems may reference the PID (citations, links, datasets)
- Changing published PIDs breaks citations, causes 404 errors, violates trust

**GHCID Approach**:
- First batch: Fair treatment, all receive name suffixes (no existing PIDs to protect)
- Historical addition: Asymmetric treatment preserves existing PIDs (new institution accommodates)

**Alternative (Rejected)**:
- Could retroactively add name suffixes to existing records when collisions occur
- **Problem**: Breaks existing citations, violates PID persistence guarantee
- **Principle**: "First publisher wins" → existing GHCID has temporal priority

---

## Migration and Transition Strategies

### From ISIL to GHCID

**Step 1: Preserve ISIL Codes**

```yaml
- id: uuid-from-isil-ghcid
  ghcid_original: NL-NH-2759794-M-RM
  identifiers:
    - identifier_scheme: ISIL
      identifier_value: NL-AsdRM  # Preserved!
    - identifier_scheme: GHCID
      identifier_value: NL-NH-2759794-M-RM
```

**Step 2: Provide ISIL → GHCID Lookup**

```http
GET /isil/NL-AsdRM HTTP/1.1

Response: 303 See Other
Location: /uuid/550e8400-e29b-41d4-a716-446655440000
```

**Step 3: Gradual Migration**

- Years 1-2: Dual identifiers (ISIL + GHCID)
- Years 3-5: GHCID primary, ISIL secondary
- Years 6+: GHCID primary, ISIL legacy

### From Wikidata Q-Numbers

**Strategy**: Complement, don't replace

```yaml
- id: ghcid-uuid
  identifiers:
    - identifier_scheme: Wikidata
      identifier_value: Q190804
      identifier_url: https://www.wikidata.org/wiki/Q190804
```

**Bidirectional linking**:
- GHCID record → `sameAs: wikidata:Q190804`
- Wikidata item → `P-GHCID: uuid` (proposed property)

### Legacy System Integration

**For systems requiring numeric IDs**:

```python
# Map GHCID UUID to numeric ID
uuid_str = "550e8400-e29b-41d4-a716-446655440000"
numeric_id = ghcid_components.to_numeric()  # e.g. 12345678901234567

# Store mapping in legacy database. `cursor` is any DB-API 2.0 cursor;
# the placeholder style (%s, ?, ...) depends on the driver.
cursor.execute(
    "INSERT INTO institutions (id, uuid_reference) VALUES (%s, %s)",
    (numeric_id, uuid_str),
)
```

**For systems requiring ISIL**:

```python
# Provide ISIL fallback
isil_code = record.get_identifier('ISIL') or f"GHCID-{record.ghcid_numeric}"
```

---

## Appendices

### Appendix A: Collision Probability Calculations

**Birthday Paradox Formula**:

```
P(collision) ≈ n² / (2 × 2^bits)

Where:
  n = number of institutions
  bits = identifier bit length
```

**UUID v5 (128 bits)**:

```
n = 1,000,000 institutions
P = (10^6)² / (2 × 2^128)
P ≈ 1.5 × 10^-27
P ≈ 1.5 × 10^-25 %
```

Only 122 of a UUID v5's 128 bits come from the hash (6 bits encode the version and variant), which raises the estimate to about 9.4 × 10^-26. This is still negligible.

**Numeric (64 bits)**:

```
n = 1,000,000 institutions
P = (10^6)² / (2 × 2^64)
P ≈ 2.7 × 10^-8
P ≈ 0.0000027%
```

**Conclusion**: Both formats provide negligible collision risk for the heritage domain (<10M institutions expected).

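The estimates can be reproduced with a few lines of Python. This is a sketch of the approximation only: `collision_probability` is an illustrative helper, and the 122-bit case reflects the fact that a UUID v5 reserves 6 of its 128 bits for version and variant.

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation n^2 / (2 * 2^bits), valid while the result is small."""
    return n * n / (2 * 2 ** bits)


for bits in (128, 122, 64):
    print(bits, f"{collision_probability(1_000_000, bits):.1e}")
# 128 1.5e-27
# 122 9.4e-26
# 64 2.7e-08
```

At the 10 million institutions mentioned as an upper bound, the 64-bit figure rises a hundredfold to roughly 2.7 × 10^-6, which is still small but worth monitoring through the `UNIQUE` constraint on the numeric column.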
### Appendix B: GHCID Generation Pseudocode

```python
import hashlib
import uuid

# GHCID_NAMESPACE, INSTITUTION_TYPE_CODES, InstitutionTypeEnum, GHCIDComponents,
# and extract_abbreviation_from_name are assumed to be defined elsewhere.

def generate_ghcid(
    name: str,
    institution_type: InstitutionTypeEnum,
    country: str,          # ISO 3166-1 alpha-2
    region: str,           # ISO 3166-2 or GeoNames admin1
    city_geonames_id: int  # GeoNames ID
) -> GHCIDComponents:
    """
    Generate GHCID components from institution metadata.
    """
    # Normalize inputs
    country = country.upper()
    region = region.upper()

    # Convert GeoNames ID to string
    city_code = str(city_geonames_id)

    # Get institution type code
    type_code = INSTITUTION_TYPE_CODES[institution_type]  # "M", "L", "A", etc.

    # Generate abbreviation from emic (native language) name
    # Uses first letter of each significant word, skipping prepositions/articles
    abbreviation = extract_abbreviation_from_name(name)  # "RM", "LC", etc.

    # Construct GHCID string
    ghcid_string = f"{country}-{region}-{city_code}-{type_code}-{abbreviation}"

    # Generate UUID v5
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, ghcid_string)

    # Generate UUID SHA-256
    hash_bytes = hashlib.sha256(ghcid_string.encode('utf-8')).digest()
    ghcid_uuid_sha256 = uuid.UUID(bytes=hash_bytes[:16])  # Truncate to 128 bits

    # Generate numeric (unsigned 64-bit integer from the first 8 hash bytes)
    ghcid_numeric = int.from_bytes(hash_bytes[:8], byteorder='big')

    return GHCIDComponents(
        country=country,
        region=region,
        city_code=city_code,
        type_code=type_code,
        abbreviation=abbreviation,
        ghcid_string=ghcid_string,
        ghcid_uuid=str(ghcid_uuid),
        ghcid_uuid_sha256=str(ghcid_uuid_sha256),
        ghcid_numeric=ghcid_numeric
    )
```

### Appendix C: References and Standards

**IETF RFCs**:
- **RFC 4122**: A Universally Unique IDentifier (UUID) URN Namespace - https://tools.ietf.org/html/rfc4122
- **RFC 9562**: Universally Unique IDentifiers (UUIDs) - https://datatracker.ietf.org/doc/rfc9562/
- **RFC 3650**: Handle System Overview - https://tools.ietf.org/html/rfc3650

**ISO Standards**:
- **ISO 15511**: International Standard Identifier for Libraries and Related Organizations (ISIL)
- **ISO 3166-1**: Codes for the representation of names of countries and their subdivisions – Part 1: Country codes
- **ISO 3166-2**: Codes for the representation of names of countries and their subdivisions – Part 2: Country subdivision codes
- **ISO 26324**: Information and documentation — Digital object identifier system
- **ISO 21127**: Information and documentation — CIDOC Conceptual Reference Model (CRM)

**W3C Standards**:
- **W3C PROV-O**: PROV Ontology - https://www.w3.org/TR/prov-o/
- **W3C ORG**: Organization Ontology - https://www.w3.org/TR/vocab-org/
- **Schema.org**: Structured data vocabulary - https://schema.org/

**Heritage Standards**:
- **RiC-O v1.1**: Records in Contexts Ontology - https://www.ica.org/standards/RiC/ontology
- **CIDOC-CRM v7.1.3**: Conceptual Reference Model - http://www.cidoc-crm.org/
- **IIIF Presentation API**: International Image Interoperability Framework - https://iiif.io/api/presentation/

**GeoNames**:
- GeoNames Gazetteer - https://www.geonames.org/
- GeoNames Ontology - http://www.geonames.org/ontology/

**Related PID Systems**:
- **DOI**: Digital Object Identifier - https://www.doi.org/
- **ARK**: Archival Resource Key - https://n2t.net/e/ark_ids.html
- **Handle System**: https://www.handle.net/
- **VIAF**: Virtual International Authority File - https://viaf.org/

---

## Acknowledgments

This specification builds on the foundational work of:

- **ISIL agencies** worldwide for pioneering library/archive identifiers
- **DOI Foundation** for persistent identifier governance models
- **California Digital Library** for ARK design principles
- **Wikimedia Foundation** for crowdsourced identifier systems
- **GeoNames** for geographic identifier infrastructure
- **Europeana** and **DPLA** for cultural heritage aggregation standards

Special thanks to the heritage informatics community for feedback and guidance.

---

**Version**: 1.0
**Date**: 2025-11-06
**Status**: Draft for Community Review
**Next Review**: 2026-01-01
**Contact**: [GLAM Data Extraction Project](https://github.com/kempersc/glam)

---

**License**: This specification is released under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).