# Identifier Structure Design
**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [SOTA Identifier Systems](./02_sota_identifier_systems.md) | [Implementation Guidelines](./08_implementation_guidelines.md)
---
## 1. Overview
This document specifies the technical structure of PPID identifiers, including:
- Format and syntax
- Checksum algorithm
- Namespace design
- URI structure
- Generation algorithms
---
## 2. Design Principles
### 2.1 Core Requirements
| Requirement | Rationale |
|-------------|-----------|
| **Opaque** | No personal information encoded |
| **Persistent** | Never reused, stable for life |
| **Resolvable** | Valid HTTP URIs |
| **Verifiable** | Checksum for validation |
| **Interoperable** | Same 16-character, four-block layout and check-digit scheme as ORCID/ISNI |
| **Scalable** | Support billions of identifiers |
### 2.2 Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| **Length** | 16 characters (excluding the type prefix) | Same length as ORCID/ISNI |
| **Character set** | Hex (0-9, a-f) + type prefix | URL-safe, case-insensitive |
| **Checksum** | MOD 11-2 | ISO standard, ORCID compatible |
| **Type distinction** | Prefix: POID/PRID | Clear observation vs reconstruction |
| **UUID backing** | UUID v5 (SHA-1) | Deterministic, reproducible |
---
## 3. Identifier Format
### 3.1 Structure Overview
```
Format: {TYPE}-{xxxx}-{xxxx}-{xxxx}-{xxxx}
          │      │      │      │      └── Block 4 (3 hex + check digit)
          │      │      │      └── Block 3 (4 hex digits)
          │      │      └── Block 2 (4 hex digits)
          │      └── Block 1 (4 hex digits)
          └── Type prefix (POID or PRID)

Examples:
  POID-7a3b-c4d5-e6f7-890X  (Person Observation ID)
  PRID-1234-5678-90ab-cdeX  (Person Reconstruction ID)
```
### 3.2 Component Breakdown
| Component | Format | Description |
|-----------|--------|-------------|
| **Type prefix** | `POID` or `PRID` | Observation vs Reconstruction |
| **Block 1** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 2** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 3** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 4** | `[0-9a-f]{3}[0-9X]` | 3 hex + check digit |
| **Separator** | `-` | Hyphen between blocks |
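The component table translates directly into a parser. The following is a minimal sketch rather than a reference implementation: `ParsedPPID` and `_PPID_RE` are hypothetical names, the pattern is case-strict exactly as the table is written (lowercase hex, uppercase `X`), and the check digit is extracted but not verified.

```python
import re
from typing import NamedTuple, Optional

# Pattern mirrors the component table: prefix, three 4-hex blocks,
# then 3 hex digits plus a check character (0-9 or X).
_PPID_RE = re.compile(
    r"(POID|PRID)-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{3})([0-9X])"
)


class ParsedPPID(NamedTuple):  # hypothetical container, not part of the spec
    id_type: str  # "POID" or "PRID"
    payload: str  # 15 hex digits, hyphens removed
    check: str    # check character, "0"-"9" or "X"


def parse_ppid(ppid: str) -> Optional[ParsedPPID]:
    """Split a PPID into its components, or return None if malformed."""
    match = _PPID_RE.fullmatch(ppid)
    if match is None:
        return None
    prefix, b1, b2, b3, b4, check = match.groups()
    return ParsedPPID(prefix, b1 + b2 + b3 + b4, check)


print(parse_ppid("POID-7a3b-c4d5-e6f7-890X"))
print(parse_ppid("POID-7a3b-c4d5-e6f7-890g"))  # None: 'g' is not a valid check character
```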
### 3.3 Identifier Types
| Type | Prefix | Purpose | Creation Trigger |
|------|--------|---------|------------------|
| **Person Observation ID** | `POID` | Raw source observation | Data extraction from source |
| **Person Reconstruction ID** | `PRID` | Curated person identity | Entity resolution / curation |
---
## 4. Checksum Algorithm
### 4.1 MOD 11-2 (ISO/IEC 7064:2003)
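PPID uses the same check-digit procedure as ORCID, applied to hex digit values (0-15) instead of decimal digits. ISO/IEC 7064 defines MOD 11-2 for decimal strings; with hex input, two values that differ by 11 (for example `0` and `b`) yield the same check digit, so that class of substitution goes undetected: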
```python
def calculate_ppid_checksum(digits: str) -> str:
    """
    Calculate PPID check digit using ISO/IEC 7064 MOD 11-2.

    Args:
        digits: 15-character hex string (without check digit)

    Returns:
        Check digit (0-9 or X)

    Algorithm:
        1. For each digit, add to running total and multiply by 2
        2. Take result modulo 11
        3. Subtract from 12, take modulo 11
        4. If result is 10, use 'X'
    """
    total = 0
    for char in digits.lower():
        # int(char, 16) maps 0-9 and a-f to 0-15 and raises ValueError otherwise
        total = (total + int(char, 16)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return 'X' if result == 10 else str(result)


def validate_ppid(ppid: str) -> bool:
    """
    Validate a complete PPID identifier.

    Args:
        ppid: Full PPID string (e.g., "POID-7a3b-c4d5-e6f7-890X")

    Returns:
        True if valid, False otherwise
    """
    parts = ppid.split('-')

    # Expect a type prefix followed by 4 blocks of 4 characters each
    if len(parts) != 5:
        return False
    if parts[0].upper() not in ('POID', 'PRID'):
        return False
    if any(len(block) != 4 for block in parts[1:]):
        return False

    # Split the hex portion into payload and check digit
    hex_part = ''.join(parts[1:])
    hex_digits = hex_part[:15]
    check_digit = hex_part[15]

    # Payload must be hex; the check digit may be 0-9 or X
    if not all(c in '0123456789abcdefABCDEF' for c in hex_digits):
        return False
    if check_digit not in '0123456789Xx':
        return False

    # Validate checksum
    calculated = calculate_ppid_checksum(hex_digits)
    return calculated == check_digit.upper()
```
### 4.2 Checksum Examples
| Hex Portion (15 chars) | Check Digit | Full ID |
|------------------------|-------------|---------|
| `7a3bc4d5e6f7890` | `3` | `POID-7a3b-c4d5-e6f7-8903` |
| `1234567890abcde` | `4` | `PRID-1234-5678-90ab-cde4` |
| `000000000000000` | `1` | `POID-0000-0000-0000-0001` |
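To see what the check digit does and does not catch, the sketch below recomputes it for a payload and then tests three corrupted variants. It restates the Section 4.1 procedure in a self-contained form; `mod11_2_check` and `is_consistent` are illustrative names, not part of the specification.

```python
def mod11_2_check(payload: str) -> str:
    """Check character for a hex payload, following the Section 4.1 procedure."""
    total = 0
    for char in payload:
        total = (total + int(char, 16)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_consistent(payload: str, check: str) -> bool:
    """True if the check character matches the payload."""
    return mod11_2_check(payload) == check.upper()


payload = "1234567890abcde"
check = mod11_2_check(payload)

typo = "1244567890abcde"     # one mistyped character (3 -> 4)
swapped = "2134567890abcde"  # two adjacent characters transposed (12 -> 21)
wrapped = "123456789babcde"  # 0 replaced by b: values differ by exactly 11

print(is_consistent(payload, check))  # True
print(is_consistent(typo, check))     # False: substitution detected
print(is_consistent(swapped, check))  # False: transposition detected
print(is_consistent(wrapped, check))  # True: 0 and b are congruent modulo 11
```

The last case shows the cost of applying a modulus of 11 to a 16-symbol alphabet: the pairs `0`/`b`, `1`/`c`, `2`/`d`, `3`/`e`, and `4`/`f` are interchangeable as far as the check digit is concerned.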
---
## 5. UUID Generation
### 5.1 UUID v5 for Deterministic IDs
PPID uses UUID v5 (SHA-1 based) to generate deterministic identifiers:
```python
import uuid

# PPID namespace UUID (generated once, used forever).
# NOTE: the value below is the standard RFC 4122 DNS namespace
# (uuid.NAMESPACE_DNS), shown as a placeholder only. Mint a dedicated
# namespace UUID before issuing real identifiers.
PPID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')  # Example

# Sub-namespaces for different ID types
POID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonObservation')
PRID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonReconstruction')


def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """
    Generate deterministic POID from source metadata.

    The same source + timestamp + content will always produce the same POID.

    Args:
        source_url: URL where observation was extracted
        retrieval_timestamp: ISO 8601 timestamp of extraction
        content_hash: SHA-256 hash of extracted content

    Returns:
        POID string (e.g., "POID-7a3b-c4d5-e6f7-890X")
    """
    # Create deterministic input string
    input_string = f"{source_url}|{retrieval_timestamp}|{content_hash}"
    # Generate UUID v5
    raw_uuid = uuid.uuid5(POID_NAMESPACE, input_string)
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'POID')


def generate_prid(observation_ids: list[str], curator_id: str, timestamp: str) -> str:
    """
    Generate deterministic PRID from linked observations.

    Args:
        observation_ids: List of POIDs that comprise this reconstruction
        curator_id: Identifier of curator/algorithm creating reconstruction
        timestamp: ISO 8601 timestamp of reconstruction

    Returns:
        PRID string (e.g., "PRID-1234-5678-90ab-cde5")
    """
    # Sort observations so that input order does not affect the result
    sorted_obs = sorted(observation_ids)
    # Create deterministic input string
    input_string = f"{'|'.join(sorted_obs)}|{curator_id}|{timestamp}"
    # Generate UUID v5
    raw_uuid = uuid.uuid5(PRID_NAMESPACE, input_string)
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'PRID')


def uuid_to_ppid(raw_uuid: uuid.UUID, prefix: str) -> str:
    """
    Convert UUID to PPID format with checksum.

    Args:
        raw_uuid: UUID object
        prefix: 'POID' or 'PRID'

    Returns:
        Formatted PPID string
    """
    # Get hex representation (32 chars)
    hex_str = raw_uuid.hex
    # Take first 15 characters. Index 12 is the UUID version nibble, which is
    # always '5' for UUID v5, so only 14 of these characters actually vary.
    hex_15 = hex_str[:15]
    # Calculate checksum (calculate_ppid_checksum is defined in Section 4.1)
    check_digit = calculate_ppid_checksum(hex_15)
    # Keep 'X' uppercase so the result matches the Block 4 format [0-9a-f]{3}[0-9X]
    hex_16 = hex_15 + check_digit
    # Format with hyphens
    return f"{prefix}-{hex_16[0:4]}-{hex_16[4:8]}-{hex_16[8:12]}-{hex_16[12:16]}"
```
### 5.2 Why UUID v5?
| Property | UUID v5 | UUID v4 (Random) | UUID v7 (Time-ordered) |
|----------|---------|------------------|------------------------|
| **Deterministic** | Yes | No | No |
| **Reproducible** | Yes | No | No |
| **No state required** | Yes | Yes | No |
| **Standard algorithm** | Yes (RFC 4122) | Yes (RFC 4122) | Yes (RFC 9562) |
| **Collision resistance** | 122 bits (from SHA-1 hash) | 122 bits random | 48-bit time + 74-bit random |
**Key advantage**: Same input always produces same PPID, enabling deduplication and verification.
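The determinism property can be checked directly with Python's standard `uuid` module. This is a small illustration only: `DEMO_NAMESPACE` and the input string are made up for the example and are not the project's real namespace or data.

```python
import uuid

# Hypothetical namespace for illustration only; a real deployment would mint
# its own namespace UUID once and keep it fixed.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ppid.org/")

key = "https://example.org/staff|2025-01-09T10:30:00Z|abc123"

first = uuid.uuid5(DEMO_NAMESPACE, key)
second = uuid.uuid5(DEMO_NAMESPACE, key)
changed = uuid.uuid5(DEMO_NAMESPACE, key.replace("abc123", "abc124"))

print(first == second)   # True: identical input, identical UUID
print(first == changed)  # False: any change to the input yields a new UUID
print(first.version)     # 5
print(uuid.uuid4() == uuid.uuid4())  # False: v4 values are random
```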
---
## 6. URI Structure
### 6.1 HTTP URIs
PPID identifiers are resolvable HTTP URIs:
```
Base URI: https://ppid.org/
POID URI: https://ppid.org/POID-7a3b-c4d5-e6f7-890X
PRID URI: https://ppid.org/PRID-1234-5678-90ab-cde5
```
### 6.2 Content Negotiation
| Accept Header | Response Format |
|---------------|-----------------|
| `text/html` | Human-readable webpage |
| `application/json` | JSON-LD representation |
| `application/ld+json` | JSON-LD representation |
| `text/turtle` | RDF Turtle |
| `application/rdf+xml` | RDF/XML |
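One way a resolver might pick a response format from the table above is sketched here. It is a simplified reading of the `Accept` header, assuming plain `q` parameters and treating `*/*` as the default; partial wildcards such as `text/*` are ignored. The names `SUPPORTED` and `choose_format` are hypothetical.

```python
from typing import Optional

# Media types from the table above, mapped to a format label.
SUPPORTED = {
    "text/html": "html",
    "application/json": "json-ld",
    "application/ld+json": "json-ld",
    "text/turtle": "turtle",
    "application/rdf+xml": "rdf-xml",
}


def choose_format(accept_header: str, default: str = "text/html") -> Optional[str]:
    """Return the supported media type with the highest q-value, or None."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type == "*/*":
            media_type = default
        if media_type in SUPPORTED and quality > 0:
            # Sort by quality (descending), then by order of appearance.
            candidates.append((-quality, position, media_type))
    return min(candidates)[2] if candidates else None


print(choose_format("application/rdf+xml;q=0.5, text/turtle;q=0.9"))  # text/turtle
print(choose_format("image/png"))  # None: caller would answer 406 or fall back
```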
### 6.3 URI Patterns
```
# Person observation
https://ppid.org/POID-7a3b-c4d5-e6f7-890X
# Person reconstruction
https://ppid.org/PRID-1234-5678-90ab-cde5
# Observation's source claims
https://ppid.org/POID-7a3b-c4d5-e6f7-890X/claims
# Reconstruction's derived-from observations
https://ppid.org/PRID-1234-5678-90ab-cde5/observations
# Version history
https://ppid.org/PRID-1234-5678-90ab-cde5/history
# Specific version
https://ppid.org/PRID-1234-5678-90ab-cde5/v2
```
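The patterns above could be produced by a small helper like the one below. This is a sketch under assumptions: `ppid_uri` and `SUB_RESOURCES` are hypothetical names, versions are assumed to apply only to PRIDs and to be numbered from 1, and the identifier itself is not validated here.

```python
from typing import Optional

BASE_URI = "https://ppid.org/"

# Sub-resources per identifier type, as listed in the patterns above.
SUB_RESOURCES = {
    "POID": {"claims"},
    "PRID": {"observations", "history"},
}


def ppid_uri(ppid: str, resource: Optional[str] = None, version: Optional[int] = None) -> str:
    """Build a URI for a PPID, optionally with a sub-resource or version."""
    id_type = ppid.split("-", 1)[0]
    if id_type not in SUB_RESOURCES:
        raise ValueError(f"unknown identifier type: {id_type!r}")
    if resource is not None and version is not None:
        raise ValueError("specify either a sub-resource or a version, not both")
    uri = BASE_URI + ppid
    if resource is not None:
        if resource not in SUB_RESOURCES[id_type]:
            raise ValueError(f"{id_type} has no {resource!r} sub-resource")
        return f"{uri}/{resource}"
    if version is not None:
        # Assumption: only reconstructions are versioned, starting at v1.
        if id_type != "PRID" or version < 1:
            raise ValueError("only PRIDs have versions, numbered from 1")
        return f"{uri}/v{version}"
    return uri


print(ppid_uri("POID-7a3b-c4d5-e6f7-890X", resource="claims"))
print(ppid_uri("PRID-1234-5678-90ab-cde5", version=2))
```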
---
## 7. Namespace Design
### 7.1 RDF Namespaces
```turtle
@prefix ppid: .
@prefix ppidv: .
@prefix ppidt: .
```
### 7.2 Vocabulary Terms
```turtle
# Classes
ppidt:PersonObservation a owl:Class ;
    rdfs:subClassOf picom:PersonObservation .

ppidt:PersonReconstruction a owl:Class ;
    rdfs:subClassOf picom:PersonReconstruction .

# Properties
ppidv:poid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonObservation ;
    rdfs:range xsd:string ;
    rdfs:label "Person Observation ID" .

ppidv:prid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range xsd:string ;
    rdfs:label "Person Reconstruction ID" .

ppidv:hasObservation a owl:ObjectProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range ppidt:PersonObservation .
```
---
## 8. Interoperability Mapping
### 8.1 External Identifier Links
```turtle
# Same-as links to other systems
owl:sameAs ;
owl:sameAs ;
owl:sameAs ;
owl:sameAs ;
# SKOS mapping for partial matches
skos:closeMatch ;
# External ID properties
ppidv:orcid "0000-0002-1825-0097" ;
ppidv:isni "0000000121032683" ;
ppidv:viaf "102333412" ;
ppidv:wikidata "Q12345" .
```
### 8.2 GHCID Integration
Link persons to heritage institutions via GHCID:
```turtle
ppidv:employedAt ;
ppidv:employmentRole "Senior Archivist" ;
ppidv:employmentStart "2015"^^xsd:gYear .
```
---
## 9. Collision Handling
### 9.1 Collision Probability
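With 15 hex characters there are at most 60 bits of entropy. As written, `uuid_to_ppid` keeps the UUID version nibble (always `5`) as the 13th character, so only 56 bits actually vary:
- Total identifiers possible: 2^60 ≈ 1.15 × 10^18 (2^56 ≈ 7.2 × 10^16 with the fixed nibble)
- Birthday bound: P(collision) ≈ 1 − exp(−n² / 2N) for n identifiers in a space of size N
- For 1 million identifiers in 2^60: P(collision) ≈ 4.3 × 10^-7
- For 1 billion identifiers in 2^60: P(collision) ≈ 0.35, so at that scale collisions must be expected and handled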
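The probability that at least one pair of identifiers collides follows the birthday approximation, which grows with the square of the number of identifiers. The sketch below evaluates it for a 60-bit space; `collision_probability` is an illustrative helper, not part of the specification.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday approximation: P ≈ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


for n in (10**6, 10**9):
    print(f"n={n:.0e}: P(collision) ≈ {collision_probability(n, 60):.2e}")
```

For a 60-bit space this gives a probability on the order of 4 × 10^-7 at one million identifiers and roughly 0.35 at one billion. Passing `bits=56` shows the effect of one fixed nibble.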
### 9.2 Collision Detection
```python
def check_collision(new_ppid: str, existing_ppids: set[str]) -> bool:
    """
    Check if generated PPID collides with existing identifiers.

    In practice, use database unique constraint instead.
    """
    return new_ppid in existing_ppids
```
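A database unique constraint makes the collision check atomic with the insert. The following sketch uses Python's built-in `sqlite3` with an in-memory database; the table name `identifiers` and the helper `register` are assumptions for illustration, and a production schema would follow the storage requirements of the target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE identifiers (
        ppid      TEXT PRIMARY KEY,
        ppid_type TEXT NOT NULL CHECK (ppid_type IN ('POID', 'PRID'))
    )
    """
)


def register(ppid: str) -> bool:
    """Insert a PPID; return False if it already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO identifiers (ppid, ppid_type) VALUES (?, ?)",
                (ppid, ppid.split("-", 1)[0]),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(register("POID-7a3b-c4d5-e6f7-890X"))  # True: first insert succeeds
print(register("POID-7a3b-c4d5-e6f7-890X"))  # False: primary key rejects the duplicate
```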
### 9.3 Collision Resolution
If a collision is detected (rare for small corpora, increasingly likely as the corpus approaches a billion identifiers):
1. **For POID**: Add microsecond precision to timestamp, regenerate
2. **For PRID**: Add version suffix, regenerate
```python
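def handle_collision(base_ppid: str, collision_count: int) -> str:
    """
    Resolve collision by mixing a collision counter into the input.

    Relies on uuid, POID_NAMESPACE, PRID_NAMESPACE and uuid_to_ppid
    from Section 5.1.
    """
    prefix = base_ppid.split('-', 1)[0]
    namespace = POID_NAMESPACE if prefix == 'POID' else PRID_NAMESPACE
    input_with_entropy = f"{base_ppid}|collision:{collision_count}"
    return uuid_to_ppid(uuid.uuid5(namespace, input_with_entropy), prefix)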
```
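Resolution is naturally a bounded retry loop: derive a candidate, test it, and mix in a counter only when needed. The sketch below shows that control flow on the 15-character payload alone, leaving out prefix and check digit. `allocate`, `DEMO_NAMESPACE`, and the `is_taken` callback are hypothetical names, and the attempt limit is an arbitrary choice.

```python
import uuid
from typing import Callable

# Hypothetical namespace for this sketch only.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ppid.org/")


def allocate(seed: str, is_taken: Callable[[str], bool], max_attempts: int = 5) -> str:
    """Derive a 15-hex payload from seed, retrying with a counter on collision."""
    for attempt in range(max_attempts):
        material = seed if attempt == 0 else f"{seed}|collision:{attempt}"
        payload = uuid.uuid5(DEMO_NAMESPACE, material).hex[:15]
        if not is_taken(payload):
            return payload
    raise RuntimeError(f"no free identifier after {max_attempts} attempts")


first_try = uuid.uuid5(DEMO_NAMESPACE, "obs-1").hex[:15]
taken = {first_try}  # simulate an existing identifier with the same payload

resolved = allocate("obs-1", taken.__contains__)
print(resolved != first_try)  # True: the counter produced a different payload
```

Because the counter is part of the hashed input, the resolved identifier remains reproducible as long as the attempt number is recorded alongside it.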
---
## 10. Versioning Strategy
### 10.1 Observation Versioning
Observations are **immutable** - new extraction creates new POID:
```
Source extracted 2025-01-09: POID-7a3b-c4d5-e6f7-890X
Same source extracted 2025-02-15: POID-8c4d-e5f6-a7b8-9015 (different)
```
### 10.2 Reconstruction Versioning
Reconstructions can be **revised** - same PRID, new version:
```turtle
# Version 1 (original)
prov:generatedAtTime "2025-01-09T10:30:00Z"^^xsd:dateTime ;
prov:wasDerivedFrom ,
.
# Version 2 (revised with new evidence)
prov:generatedAtTime "2025-02-15T14:00:00Z"^^xsd:dateTime ;
prov:wasRevisionOf ;
prov:wasDerivedFrom ,
,
. # New observation
# Current version (alias)
owl:sameAs .
```
---
## 11. Validation Rules
### 11.1 Syntax Validation
```python
import re

# Check digit is 0-9 or X only; hex letters a-f are not valid in that position
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{3}[0-9xX]$'
)


def validate_ppid_syntax(ppid: str) -> bool:
    """Validate PPID syntax without checksum verification."""
    # fullmatch() also rejects a trailing newline, which '$' alone would allow
    return bool(PPID_PATTERN.fullmatch(ppid))


def validate_ppid_full(ppid: str) -> tuple[bool, str]:
    """
    Full PPID validation including checksum.

    Returns:
        Tuple of (is_valid, error_message)
    """
    if not validate_ppid_syntax(ppid):
        return False, "Invalid syntax"
    if not validate_ppid(ppid):  # Checksum validation (Section 4.1)
        return False, "Invalid checksum"
    return True, "Valid"
```
### 11.2 Semantic Validation
| Rule | Description |
|------|-------------|
| **POID must have source** | Every POID must link to source URL |
| **PRID must have observations** | Every PRID must link to at least one POID |
| **No circular references** | PRIDs cannot derive from themselves |
| **Valid timestamps** | All timestamps must be valid ISO 8601 |
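The rules in the table can be expressed as simple record-level checks. The sketch below is one possible reading: the function names and argument layout are hypothetical, a source URL is assumed to mean an `http` or `https` URL, and only direct self-reference is tested (detecting longer derivation cycles would need a graph traversal).

```python
from datetime import datetime


def is_iso8601(value: str) -> bool:
    """Return True if value parses as an ISO 8601 date or date-time."""
    if value.endswith("Z"):
        # fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
        value = value[:-1] + "+00:00"
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_observation(source_url: str, retrieved_at: str) -> list[str]:
    """Collect rule violations for a POID record (empty list means valid)."""
    errors = []
    if not source_url.startswith(("http://", "https://")):
        errors.append("POID must link to a source URL")
    if not is_iso8601(retrieved_at):
        errors.append("timestamp is not valid ISO 8601")
    return errors


def check_reconstruction(prid: str, derived_from: list[str], created_at: str) -> list[str]:
    """Collect rule violations for a PRID record (empty list means valid)."""
    errors = []
    if not any(item.startswith("POID-") for item in derived_from):
        errors.append("PRID must link to at least one POID")
    if prid in derived_from:
        errors.append("PRID cannot derive from itself")
    if not is_iso8601(created_at):
        errors.append("timestamp is not valid ISO 8601")
    return errors


print(check_reconstruction("PRID-1234-5678-90ab-cde5", [], "2025-01-09T10:30:00Z"))
```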
---
## 12. Implementation Checklist
### 12.1 Core Functions
- [ ] `generate_poid(source_url, timestamp, content_hash) → POID`
- [ ] `generate_prid(observation_ids, curator_id, timestamp) → PRID`
- [ ] `validate_ppid(ppid) → bool`
- [ ] `parse_ppid(ppid) → {type, hex, checksum}`
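- [ ] `ppid_to_uuid(ppid) → UUID` (lookup via stored `uuid_raw`; a PPID keeps only the first 15 hex characters of the UUID, so it cannot be inverted)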
- [ ] `uuid_to_ppid(uuid, type) → PPID`
### 12.2 Storage Requirements
| Field | Type | Index |
|-------|------|-------|
| `ppid` | VARCHAR(24) | PRIMARY KEY |
| `ppid_type` | ENUM('POID', 'PRID') | INDEX |
| `created_at` | TIMESTAMP | INDEX |
| `uuid_raw` | UUID | UNIQUE |
### 12.3 API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/poid` | POST | Create new observation |
| `/api/v1/prid` | POST | Create new reconstruction |
| `/api/v1/{ppid}` | GET | Retrieve record |
| `/api/v1/{ppid}/validate` | GET | Validate identifier |
| `/api/v1/{prid}/observations` | GET | List linked observations |
---
## 13. References
- ISO/IEC 7064:2003 - Check character systems
- RFC 4122 - UUID URN Namespace (obsoleted by RFC 9562, which also defines UUID v7)
- ORCID Identifier Structure: https://support.orcid.org/hc/en-us/articles/360006897674
- W3C Cool URIs: https://www.w3.org/TR/cooluris/