Identifier Structure Design
Version: 0.1.0
Last Updated: 2025-01-09
Related: SOTA Identifier Systems | Implementation Guidelines
1. Overview
This document specifies the technical structure of PPID identifiers, including:
- Format and syntax
- Checksum algorithm
- Namespace design
- URI structure
- Generation algorithms
2. Design Principles
2.1 Core Requirements
| Requirement | Rationale |
|---|---|
| Opaque | No personal information encoded |
| Persistent | Never reused, stable for life |
| Resolvable | Valid HTTP URIs |
| Verifiable | Checksum for validation |
| Interoperable | Compatible with ORCID/ISNI format |
| Scalable | Support billions of identifiers |
2.2 Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Length | 16 characters (excluding prefix and hyphens) | ORCID/ISNI compatible |
| Character set | Hex (0-9, a-f) + type prefix | URL-safe, case-insensitive |
| Checksum | MOD 11-2 | ISO standard, ORCID compatible |
| Type distinction | Prefix: POID/PRID | Clear observation vs reconstruction |
| UUID backing | UUID v5 (SHA-1) | Deterministic, reproducible |
3. Identifier Format
3.1 Structure Overview
Format: {TYPE}-{xxxx}-{xxxx}-{xxxx}-{xxxx}
          │      │      │      │      └── Block 4 (3 hex + check digit)
          │      │      │      └── Block 3 (4 hex digits)
          │      │      └── Block 2 (4 hex digits)
          │      └── Block 1 (4 hex digits)
          └── Type prefix (POID or PRID)
Examples:
POID-7a3b-c4d5-e6f7-8903 (Person Observation ID)
PRID-1234-5678-90ab-cde4 (Person Reconstruction ID)
3.2 Component Breakdown
| Component | Format | Description |
|---|---|---|
| Type prefix | POID or PRID | Observation vs Reconstruction |
| Block 1 | [0-9a-f]{4} | 4 hex digits |
| Block 2 | [0-9a-f]{4} | 4 hex digits |
| Block 3 | [0-9a-f]{4} | 4 hex digits |
| Block 4 | [0-9a-f]{3}[0-9X] | 3 hex + check digit |
| Separator | - | Hyphen between blocks |
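The component layout above maps onto a small parser. The sketch below is illustrative only: `parse_ppid` is named in the implementation checklist, but the `ParsedPPID` return type and its field names are assumptions, and the function checks syntax only, not the check digit.

```python
import re
from typing import NamedTuple, Optional

# Pattern mirrors the component table: prefix, three 4-hex blocks,
# then 3 hex characters plus a check digit (0-9 or X).
_PPID_RE = re.compile(
    r'(POID|PRID)-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{3})([0-9X])'
)


class ParsedPPID(NamedTuple):  # hypothetical return type
    kind: str         # 'POID' or 'PRID'
    hex_digits: str   # 15 hex characters, hyphens removed
    check_digit: str  # '0'-'9' or 'X'


def parse_ppid(ppid: str) -> Optional[ParsedPPID]:
    """Split a PPID into its components; return None if the syntax is wrong."""
    match = _PPID_RE.fullmatch(ppid)
    if match is None:
        return None
    prefix, b1, b2, b3, b4, check = match.groups()
    return ParsedPPID(prefix, b1 + b2 + b3 + b4, check)
```

Returning `None` rather than raising keeps the parser usable as a cheap pre-filter before checksum validation.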
3.3 Identifier Types
| Type | Prefix | Purpose | Creation Trigger |
|---|---|---|---|
| Person Observation ID | POID | Raw source observation | Data extraction from source |
| Person Reconstruction ID | PRID | Curated person identity | Entity resolution / curation |
4. Checksum Algorithm
4.1 MOD 11-2 (ISO/IEC 7064:2003)
PPID uses the same check-digit algorithm as ORCID for interoperability, applied here to hex digit values (0-15) rather than decimal digits:
def calculate_ppid_checksum(digits: str) -> str:
    """
    Calculate PPID check digit using ISO/IEC 7064 MOD 11-2.

    Args:
        digits: 15-character hex string (without check digit)

    Returns:
        Check digit (0-9 or X)

    Algorithm:
        1. For each digit, add to running total and multiply by 2
        2. Take result modulo 11
        3. Subtract from 12, take modulo 11
        4. If result is 10, use 'X'
    """
    # Convert hex digits to integers (0-15 for 0-9, a-f)
    total = 0
    for char in digits.lower():
        if char.isdigit():
            value = int(char)
        else:
            value = ord(char) - ord('a') + 10
        total = (total + value) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return 'X' if result == 10 else str(result)

def validate_ppid(ppid: str) -> bool:
    """
    Validate a complete PPID identifier.

    Args:
        ppid: Full PPID string (e.g., "POID-7a3b-c4d5-e6f7-8903")

    Returns:
        True if valid, False otherwise
    """
    # Split into type prefix and hex blocks
    parts = ppid.upper().split('-')
    # Validate structure (prefix + 4 blocks)
    if len(parts) != 5:
        return False
    # Validate prefix
    if parts[0] not in ('POID', 'PRID'):
        return False
    # Validate block lengths (4 blocks of 4 chars each)
    if any(len(block) != 4 for block in parts[1:]):
        return False
    # Separate the 15 payload characters from the check digit
    hex_part = ''.join(parts[1:])
    hex_digits = hex_part[:15]
    check_digit = hex_part[15]
    # Validate hex characters (the check digit may also be X)
    if not all(c in '0123456789ABCDEF' for c in hex_digits):
        return False
    if check_digit not in '0123456789X':
        return False
    # Validate checksum
    return calculate_ppid_checksum(hex_digits) == check_digit
4.2 Checksum Examples
| Hex Portion (15 chars) | Check Digit | Full ID |
|---|---|---|
| 7a3bc4d5e6f7890 | 3 | POID-7a3b-c4d5-e6f7-8903 |
| 1234567890abcde | 4 | PRID-1234-5678-90ab-cde4 |
| 000000000000000 | 1 | POID-0000-0000-0000-0001 |
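To make the arithmetic easy to reproduce, the sketch below restates the MOD 11-2 loop from section 4.1 in compact form and prints the check digit for each hex portion listed above. The helper name `mod11_2_hex` is hypothetical; it is intended to behave the same as `calculate_ppid_checksum` for lowercase or uppercase hex input.

```python
def mod11_2_hex(digits: str) -> str:
    """Compact restatement of the MOD 11-2 loop over hex digit values."""
    total = 0
    for char in digits:
        # int(char, 16) maps 0-9 and a-f (either case) to 0-15.
        total = (total + int(char, 16)) * 2
    result = (12 - total % 11) % 11
    return 'X' if result == 10 else str(result)


for hex_portion in ("7a3bc4d5e6f7890", "1234567890abcde", "000000000000000"):
    print(hex_portion, "->", mod11_2_hex(hex_portion))
```

Running this against any table row is a quick way to cross-check a published example before it is used as a test fixture.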
5. UUID Generation
5.1 UUID v5 for Deterministic IDs
PPID uses UUID v5 (SHA-1 based) to generate deterministic identifiers:
import uuid

# PPID namespace UUID (generated once, used forever)
# Example value only: this is the RFC 4122 DNS namespace UUID (uuid.NAMESPACE_DNS);
# a deployment should generate its own namespace UUID.
PPID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Sub-namespaces for different ID types
POID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonObservation')
PRID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonReconstruction')

def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """
    Generate deterministic POID from source metadata.

    The same source + timestamp + content will always produce the same POID.

    Args:
        source_url: URL where observation was extracted
        retrieval_timestamp: ISO 8601 timestamp of extraction
        content_hash: SHA-256 hash of extracted content

    Returns:
        POID string (e.g., "POID-7a3b-c4d5-e6f7-8903")
    """
    # Create deterministic input string
    input_string = f"{source_url}|{retrieval_timestamp}|{content_hash}"
    # Generate UUID v5
    raw_uuid = uuid.uuid5(POID_NAMESPACE, input_string)
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'POID')

def generate_prid(observation_ids: list[str], curator_id: str, timestamp: str) -> str:
    """
    Generate deterministic PRID from linked observations.

    Args:
        observation_ids: POIDs that comprise this reconstruction (sorted internally)
        curator_id: Identifier of curator/algorithm creating reconstruction
        timestamp: ISO 8601 timestamp of reconstruction

    Returns:
        PRID string (e.g., "PRID-1234-5678-90ab-cde4")
    """
    # Sort observations for deterministic ordering
    sorted_obs = sorted(observation_ids)
    # Create deterministic input string
    input_string = f"{'|'.join(sorted_obs)}|{curator_id}|{timestamp}"
    # Generate UUID v5
    raw_uuid = uuid.uuid5(PRID_NAMESPACE, input_string)
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'PRID')

def uuid_to_ppid(raw_uuid: uuid.UUID, prefix: str) -> str:
    """
    Convert UUID to PPID format with checksum.

    Args:
        raw_uuid: UUID object
        prefix: 'POID' or 'PRID'

    Returns:
        Formatted PPID string
    """
    # Get hex representation (32 chars)
    hex_str = raw_uuid.hex
    # Take first 15 characters (character 13 is the fixed UUID version digit)
    hex_15 = hex_str[:15]
    # Calculate checksum
    check_digit = calculate_ppid_checksum(hex_15)
    # Keep the check digit uppercase so 'X' matches the documented format
    hex_16 = hex_15 + check_digit
    # Format with hyphens
    return f"{prefix}-{hex_16[0:4]}-{hex_16[4:8]}-{hex_16[8:12]}-{hex_16[12:16]}"
5.2 Why UUID v5?
| Property | UUID v5 | UUID v4 (Random) | UUID v7 (Time-ordered) |
|---|---|---|---|
| Deterministic | Yes | No | No |
| Reproducible | Yes | No | No |
| No state required | Yes | Yes | No |
| Standard algorithm | Yes (RFC 4122) | Yes (RFC 4122) | Yes (RFC 9562) |
| Collision resistance | 122 hash-derived bits | 122 random bits | 48-bit time + 74-bit random |
Key advantage: Same input always produces same PPID, enabling deduplication and verification.
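The determinism property can be checked directly with the standard library. This is a minimal sketch: `DEMO_NAMESPACE` and the input string are placeholders invented for illustration, not official PPID values.

```python
import uuid

# Placeholder namespace for illustration only (not an official PPID namespace).
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ppid.org/")

# Hypothetical "source|timestamp|content hash" input string.
key = "https://example.org/staff|2025-01-09T10:30:00Z|abc123"

first = uuid.uuid5(DEMO_NAMESPACE, key)
second = uuid.uuid5(DEMO_NAMESPACE, key)
other = uuid.uuid5(DEMO_NAMESPACE, key + "x")

print(first == second)   # same namespace and input give the same UUID
print(first == other)    # any change to the input gives a different UUID
print(uuid.uuid4() == uuid.uuid4())  # v4 is random, so repeated calls differ
print(first.version, first.hex[12])  # the 13th hex character encodes the version
```

The last line is worth noting when truncating a UUID: the version digit is constant across all v5 UUIDs, so it contributes no entropy to any prefix that includes it.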
6. URI Structure
6.1 HTTP URIs
PPID identifiers are resolvable HTTP URIs:
Base URI: https://ppid.org/
POID URI: https://ppid.org/POID-7a3b-c4d5-e6f7-8903
PRID URI: https://ppid.org/PRID-1234-5678-90ab-cde4
6.2 Content Negotiation
| Accept Header | Response Format |
|---|---|
| text/html | Human-readable webpage |
| application/json | JSON-LD representation |
| application/ld+json | JSON-LD representation |
| text/turtle | RDF Turtle |
| application/rdf+xml | RDF/XML |
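One way a resolver might choose among these formats is to rank the media types in the Accept header by q-value. The sketch below is an assumption about how that could be done, not a description of the PPID service: the `negotiate` function, the format labels, and the choice of text/html as the fallback for `*/*` are all hypothetical.

```python
from typing import Optional

# Media types from the table above, mapped to an illustrative format label.
SUPPORTED = {
    "text/html": "html",
    "application/json": "json-ld",
    "application/ld+json": "json-ld",
    "text/turtle": "turtle",
    "application/rdf+xml": "rdf-xml",
}


def negotiate(accept_header: str, default: str = "text/html") -> Optional[str]:
    """Return the supported media type with the highest q-value, or None."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0  # q defaults to 1 when omitted
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type == "*/*":
            media_type = default
        if media_type in SUPPORTED and quality > 0:
            # Rank by descending quality, then by order of appearance.
            candidates.append((-quality, position, media_type))
    return min(candidates)[2] if candidates else None
```

A caller would typically answer with HTTP 406 when `negotiate` returns `None`; partial wildcards such as `text/*` are deliberately left out to keep the sketch short.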
6.3 URI Patterns
# Person observation
https://ppid.org/POID-7a3b-c4d5-e6f7-8903
# Person reconstruction
https://ppid.org/PRID-1234-5678-90ab-cde4
# Observation's source claims
https://ppid.org/POID-7a3b-c4d5-e6f7-8903/claims
# Reconstruction's derived-from observations
https://ppid.org/PRID-1234-5678-90ab-cde4/observations
# Version history
https://ppid.org/PRID-1234-5678-90ab-cde4/history
# Specific version
https://ppid.org/PRID-1234-5678-90ab-cde4/v2
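These patterns are regular enough to build with a small helper. The sketch below is illustrative: the function names are hypothetical, and the rule that version numbers start at 1 is an assumption inferred from the `/v1` and `/v2` examples in this document.

```python
BASE_URI = "https://ppid.org/"  # base URI from section 6.1


def ppid_uri(ppid: str, *segments: str) -> str:
    """Join a PPID and optional path segments onto the base URI."""
    return BASE_URI + "/".join((ppid,) + segments)


def version_uri(prid: str, version: int) -> str:
    """URI of a specific reconstruction version (assumed to start at v1)."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    return ppid_uri(prid, f"v{version}")
```

Centralising URI construction in one place makes it harder for the `/claims`, `/observations`, and `/history` paths to drift apart across services.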
7. Namespace Design
7.1 RDF Namespaces
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
7.2 Vocabulary Terms
# Classes
ppidt:PersonObservation a owl:Class ;
    rdfs:subClassOf picom:PersonObservation .
ppidt:PersonReconstruction a owl:Class ;
    rdfs:subClassOf picom:PersonReconstruction .
# Properties
ppidv:poid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonObservation ;
    rdfs:range xsd:string ;
    rdfs:label "Person Observation ID" .
ppidv:prid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range xsd:string ;
    rdfs:label "Person Reconstruction ID" .
ppidv:hasObservation a owl:ObjectProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range ppidt:PersonObservation .
8. Interoperability Mapping
8.1 External Identifier Links
<https://ppid.org/PRID-1234-5678-90ab-cde4>
    # Same-as links to other systems
    owl:sameAs <https://orcid.org/0000-0002-1825-0097> ;
    owl:sameAs <https://isni.org/isni/0000000121032683> ;
    owl:sameAs <http://viaf.org/viaf/102333412> ;
    owl:sameAs <https://www.wikidata.org/wiki/Q12345> ;
    # SKOS mapping for partial matches
    skos:closeMatch <http://id.loc.gov/authorities/names/n12345678> ;
    # External ID properties
    ppidv:orcid "0000-0002-1825-0097" ;
    ppidv:isni "0000000121032683" ;
    ppidv:viaf "102333412" ;
    ppidv:wikidata "Q12345" .
8.2 GHCID Integration
Link persons to heritage institutions via GHCID:
<https://ppid.org/PRID-1234-5678-90ab-cde4>
    ppidv:employedAt <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;
    ppidv:employmentRole "Senior Archivist" ;
    ppidv:employmentStart "2015"^^xsd:gYear .
9. Collision Handling
9.1 Collision Probability
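With 15 hex characters, the identifier space is at most 60 bits:
- Total identifiers possible: 2^60 ≈ 1.15 × 10^18
- Chance that one newly generated identifier collides with any of 1 billion existing ones: 10^9 / 2^60 ≈ 8.7 × 10^-10
- Chance of at least one collision anywhere among 1 billion identifiers (birthday approximation, 1 − exp(−n²/2N)): ≈ 0.35
- uuid_to_ppid (section 5.1) keeps the first 15 hex characters of a UUID v5, and the 13th of those is always the version digit 5, so only 56 bits actually vary and real collision rates are higher than the figures above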
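The probability of at least one collision among n identifiers drawn from a space of N values is commonly estimated with the birthday approximation 1 − exp(−n²/2N). The following is a minimal sketch, assuming identifiers are uniformly distributed over the stated number of bits; the function name is illustrative.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation: P(at least one collision) = 1 - exp(-n^2 / (2N))."""
    space = 2 ** bits
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-(n * n) / (2 * space))


for n in (10**6, 10**8, 10**9):
    print(f"n = {n:>13,}  P(60 bits) = {collision_probability(n, 60):.3g}")
```

Because the estimate grows with the square of n, scaling from millions to billions of identifiers changes the picture substantially, which is why a database-level uniqueness check remains necessary.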
9.2 Collision Detection
def check_collision(new_ppid: str, existing_ppids: set[str]) -> bool:
    """
    Check if generated PPID collides with existing identifiers.

    In practice, use database unique constraint instead.
    """
    return new_ppid in existing_ppids
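The docstring's advice to rely on a database unique constraint can be sketched with the standard-library `sqlite3` module. The table name `ppid_registry` and the `register` helper are hypothetical; a production system would use its own schema and database.

```python
import sqlite3

# In-memory database for illustration; the PRIMARY KEY enforces uniqueness.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ppid_registry (ppid VARCHAR(24) PRIMARY KEY)")


def register(ppid: str) -> bool:
    """Insert a PPID; return False if it already exists (a collision)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO ppid_registry (ppid) VALUES (?)", (ppid,))
        return True
    except sqlite3.IntegrityError:
        return False
```

Letting the database reject duplicates avoids the race between a separate existence check and the insert that an in-memory set cannot prevent.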
9.3 Collision Resolution
If a collision is detected (rare for any individual identifier):
- For POID: add microsecond precision to the timestamp and regenerate
- For PRID: add a version suffix and regenerate
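def handle_collision(base_ppid: str, collision_count: int) -> str:
    """
    Resolve collision by adding entropy.
    """
    # Assumption: derive the replacement with uuid5 + uuid_to_ppid from section 5.1,
    # reusing the type prefix and matching namespace of the colliding identifier.
    prefix = base_ppid.split('-')[0]
    namespace = POID_NAMESPACE if prefix == 'POID' else PRID_NAMESPACE
    input_with_entropy = f"{base_ppid}|collision:{collision_count}"
    return uuid_to_ppid(uuid.uuid5(namespace, input_with_entropy), prefix)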
10. Versioning Strategy
10.1 Observation Versioning
Observations are immutable; each new extraction creates a new POID:
Source extracted 2025-01-09: POID-7a3b-c4d5-e6f7-8903
Same source extracted 2025-02-15: POID-8c4d-e5f6-a7b8-9015 (different)
10.2 Reconstruction Versioning
Reconstructions can be revised - same PRID, new version:
# Version 1 (original)
<https://ppid.org/PRID-1234-5678-90ab-cde4/v1>
    prov:generatedAtTime "2025-01-09T10:30:00Z"^^xsd:dateTime ;
    prov:wasDerivedFrom <https://ppid.org/POID-7a3b-...> ,
                        <https://ppid.org/POID-8c4d-...> .
# Version 2 (revised with new evidence)
<https://ppid.org/PRID-1234-5678-90ab-cde4/v2>
    prov:generatedAtTime "2025-02-15T14:00:00Z"^^xsd:dateTime ;
    prov:wasRevisionOf <https://ppid.org/PRID-1234-5678-90ab-cde4/v1> ;
    prov:wasDerivedFrom <https://ppid.org/POID-7a3b-...> ,
                        <https://ppid.org/POID-8c4d-...> ,
                        <https://ppid.org/POID-9d5e-...> . # New observation
# Current version (alias)
<https://ppid.org/PRID-1234-5678-90ab-cde4>
    owl:sameAs <https://ppid.org/PRID-1234-5678-90ab-cde4/v2> .
11. Validation Rules
11.1 Syntax Validation
import re
# The final character is the check digit, so it may only be 0-9 or X
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{3}[0-9xX]$'
)


def validate_ppid_syntax(ppid: str) -> bool:
    """Validate PPID syntax without checksum verification."""
    return bool(PPID_PATTERN.match(ppid))


def validate_ppid_full(ppid: str) -> tuple[bool, str]:
    """
    Full PPID validation including checksum.

    Returns:
        Tuple of (is_valid, error_message)
    """
    if not validate_ppid_syntax(ppid):
        return False, "Invalid syntax"
    if not validate_ppid(ppid):  # Checksum validation
        return False, "Invalid checksum"
    return True, "Valid"
11.2 Semantic Validation
| Rule | Description |
|---|---|
| POID must have source | Every POID must link to source URL |
| PRID must have observations | Every PRID must link to at least one POID |
| No circular references | PRIDs cannot derive from themselves |
| Valid timestamps | All timestamps must be valid ISO 8601 |
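The rules above could be enforced with a small checker. This is a sketch under stated assumptions: the record shape (a dict with `ppid`, `source_url`, `derived_from`, and `timestamp` keys) is hypothetical, and the circular-reference check only catches a PRID that lists itself directly, not longer cycles.

```python
from datetime import datetime


def is_iso8601(value: str) -> bool:
    """Accept ISO 8601 strings that datetime.fromisoformat can parse."""
    try:
        # Older Python versions reject a trailing 'Z', so normalise it first.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def check_record(record: dict) -> list[str]:
    """Return the names of violated rules for one record (empty if none)."""
    errors = []
    ppid = record.get("ppid", "")
    if ppid.startswith("POID") and not record.get("source_url"):
        errors.append("POID must have source")
    if ppid.startswith("PRID"):
        derived = record.get("derived_from", [])
        if not any(d.startswith("POID") for d in derived):
            errors.append("PRID must have observations")
        if ppid in derived:  # direct self-reference only
            errors.append("No circular references")
    if not is_iso8601(record.get("timestamp", "")):
        errors.append("Valid timestamps")
    return errors
```

Returning rule names rather than a boolean lets callers report every violation at once instead of stopping at the first.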
12. Implementation Checklist
12.1 Core Functions
- generate_poid(source_url, timestamp, content_hash) → POID
- generate_prid(observation_ids, curator_id, timestamp) → PRID
- validate_ppid(ppid) → bool
- parse_ppid(ppid) → {type, hex, checksum}
- ppid_to_uuid(ppid) → UUID
- uuid_to_ppid(uuid, type) → PPID
12.2 Storage Requirements
| Field | Type | Index |
|---|---|---|
| ppid | VARCHAR(24) | PRIMARY KEY |
| ppid_type | ENUM('POID', 'PRID') | INDEX |
| created_at | TIMESTAMP | INDEX |
| uuid_raw | UUID | UNIQUE |
12.3 API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/poid | POST | Create new observation |
| /api/v1/prid | POST | Create new reconstruction |
| /api/v1/{ppid} | GET | Retrieve record |
| /api/v1/{ppid}/validate | GET | Validate identifier |
| /api/v1/{prid}/observations | GET | List linked observations |
13. References
- ISO/IEC 7064:2003 - Check character systems
- RFC 4122 - UUID URN Namespace
- ORCID Identifier Structure: https://support.orcid.org/hc/en-us/articles/360006897674
- W3C Cool URIs: https://www.w3.org/TR/cooluris/