# Identifier Structure Design

**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [SOTA Identifier Systems](./02_sota_identifier_systems.md) | [Implementation Guidelines](./08_implementation_guidelines.md)

---

## 1. Overview

This document specifies the technical structure of PPID identifiers, including:

- Format and syntax
- Checksum algorithm
- Namespace design
- URI structure
- Generation algorithms

---

## 2. Design Principles

### 2.1 Core Requirements

| Requirement | Rationale |
|-------------|-----------|
| **Opaque** | No personal information encoded |
| **Persistent** | Never reused, stable for life |
| **Resolvable** | Valid HTTP URIs |
| **Verifiable** | Checksum for validation |
| **Interoperable** | Follows the ORCID/ISNI 16-character, four-block layout |
| **Scalable** | Support billions of identifiers |

### 2.2 Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| **Length** | 16 characters | ORCID/ISNI compatible |
| **Character set** | Hex (0-9, a-f) + type prefix | URL-safe, case-insensitive |
| **Checksum** | MOD 11-2 | ISO standard, same algorithm as ORCID |
| **Type distinction** | Prefix: POID/PRID | Clear observation vs reconstruction |
| **UUID backing** | UUID v5 (SHA-1) | Deterministic, reproducible |

---

## 3. Identifier Format

### 3.1 Structure Overview

```
Format: {TYPE}-{xxxx}-{xxxx}-{xxxx}-{xxxx}
          │      │      │      │      └── Block 4 (3 hex + check digit)
          │      │      │      └── Block 3 (4 hex digits)
          │      │      └── Block 2 (4 hex digits)
          │      └── Block 1 (4 hex digits)
          └── Type prefix (POID or PRID)

Examples:
  POID-7a3b-c4d5-e6f7-8903  (Person Observation ID)
  PRID-1234-5678-90ab-cde4  (Person Reconstruction ID)
```

### 3.2 Component Breakdown

| Component | Format | Description |
|-----------|--------|-------------|
| **Type prefix** | `POID` or `PRID` | Observation vs Reconstruction |
| **Block 1** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 2** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 3** | `[0-9a-f]{4}` | 4 hex digits |
| **Block 4** | `[0-9a-f]{3}[0-9X]` | 3 hex + check digit |
| **Separator** | `-` | Hyphen between blocks |

### 3.3 Identifier Types

| Type | Prefix | Purpose | Creation Trigger |
|------|--------|---------|------------------|
| **Person Observation ID** | `POID` | Raw source observation | Data extraction from source |
| **Person Reconstruction ID** | `PRID` | Curated person identity | Entity resolution / curation |

---

## 4. Checksum Algorithm

### 4.1 MOD 11-2 (ISO/IEC 7064:2003)

PPID uses the same checksum algorithm as ORCID, applied to hex digit values (0-15) rather than decimal digits only:

```python
def calculate_ppid_checksum(digits: str) -> str:
    """
    Calculate PPID check digit using ISO/IEC 7064 MOD 11-2.

    Args:
        digits: 15-character hex string (without check digit)

    Returns:
        Check digit (0-9 or X)

    Algorithm:
        1. For each digit, add to running total and multiply by 2
        2. Take result modulo 11
        3. Subtract from 12, take modulo 11
        4. If result is 10, use 'X'
    """
    total = 0
    for char in digits:
        # Convert each hex digit to its integer value (0-15);
        # int() raises ValueError on non-hex characters
        value = int(char, 16)
        total = (total + value) * 2

    remainder = total % 11
    result = (12 - remainder) % 11

    return 'X' if result == 10 else str(result)


def validate_ppid(ppid: str) -> bool:
    """
    Validate a complete PPID identifier.

    Args:
        ppid: Full PPID string (e.g., "POID-7a3b-c4d5-e6f7-8903")

    Returns:
        True if valid, False otherwise
    """
    # Split into prefix and hex blocks
    parts = ppid.upper().split('-')

    # Validate structure (prefix plus 4 blocks)
    if len(parts) != 5:
        return False

    # Validate prefix
    if parts[0] not in ('POID', 'PRID'):
        return False

    # Validate block lengths (4 blocks of 4 chars each)
    if any(len(block) != 4 for block in parts[1:]):
        return False

    # Extract hex portion (without prefix)
    hex_part = ''.join(parts[1:])

    # Validate hex characters (except last, which can be X)
    hex_digits = hex_part[:15]
    check_digit = hex_part[15]

    if not all(c in '0123456789ABCDEF' for c in hex_digits):
        return False

    if check_digit not in '0123456789X':
        return False

    # Validate checksum
    calculated = calculate_ppid_checksum(hex_digits)
    return calculated.upper() == check_digit
```

### 4.2 Checksum Examples

| Hex Portion (15 chars) | Check Digit | Full ID |
|------------------------|-------------|---------|
| `7a3bc4d5e6f7890` | `3` | `POID-7a3b-c4d5-e6f7-8903` |
| `1234567890abcde` | `4` | `PRID-1234-5678-90ab-cde4` |
| `000000000000000` | `1` | `POID-0000-0000-0000-0001` |
| `000000000000001` | `X` | `POID-0000-0000-0000-001X` |

---

## 5. UUID Generation

### 5.1 UUID v5 for Deterministic IDs

PPID uses UUID v5 (SHA-1 based) to generate deterministic identifiers:

```python
import uuid

# PPID namespace UUID (generated once, used forever).
# Example only: this value is the RFC 4122 DNS namespace UUID;
# a deployment should mint its own namespace UUID.
PPID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')

# Sub-namespaces for different ID types
POID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonObservation')
PRID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonReconstruction')


def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """
    Generate deterministic POID from source metadata.

    The same source + timestamp + content will always produce the same POID.

    Args:
        source_url: URL where observation was extracted
        retrieval_timestamp: ISO 8601 timestamp of extraction
        content_hash: SHA-256 hash of extracted content

    Returns:
        POID string (e.g., "POID-7a3b-c4d5-e6f7-8903")
    """
    # Create deterministic input string
    input_string = f"{source_url}|{retrieval_timestamp}|{content_hash}"

    # Generate UUID v5
    raw_uuid = uuid.uuid5(POID_NAMESPACE, input_string)

    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'POID')


def generate_prid(observation_ids: list[str], curator_id: str, timestamp: str) -> str:
    """
    Generate deterministic PRID from linked observations.

    Args:
        observation_ids: List of POIDs that comprise this reconstruction
            (sorted internally, so input order does not matter)
        curator_id: Identifier of curator/algorithm creating reconstruction
        timestamp: ISO 8601 timestamp of reconstruction

    Returns:
        PRID string (e.g., "PRID-1234-5678-90ab-cde4")
    """
    # Sort observations for deterministic ordering
    sorted_obs = sorted(observation_ids)

    # Create deterministic input string
    input_string = f"{'|'.join(sorted_obs)}|{curator_id}|{timestamp}"

    # Generate UUID v5
    raw_uuid = uuid.uuid5(PRID_NAMESPACE, input_string)

    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'PRID')


def uuid_to_ppid(raw_uuid: uuid.UUID, prefix: str) -> str:
    """
    Convert UUID to PPID format with checksum.

    Args:
        raw_uuid: UUID object
        prefix: 'POID' or 'PRID'

    Returns:
        Formatted PPID string
    """
    # Get hex representation (32 chars)
    hex_str = raw_uuid.hex

    # Take first 15 characters. Note: character 13 is the UUID version
    # digit, which is always '5' for UUID v5.
    hex_15 = hex_str[:15]

    # Calculate checksum
    check_digit = calculate_ppid_checksum(hex_15)

    # Format with hyphens; the check digit keeps its case ('X' stays uppercase)
    hex_16 = hex_15 + check_digit
    formatted = f"{prefix}-{hex_16[0:4]}-{hex_16[4:8]}-{hex_16[8:12]}-{hex_16[12:16]}"

    return formatted
```

### 5.2 Why UUID v5?
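
The properties compared in this section can be checked directly with Python's standard `uuid` module. The following is a minimal sketch, not part of the PPID implementation; `DEMO_NAMESPACE`, `demo_uuid`, and the sample input string are hypothetical names and values chosen for illustration.

```python
import uuid

# Hypothetical namespace, derived here only for illustration; it is not
# the project's real PPID namespace UUID.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ppid.org/")


def demo_uuid(name: str) -> uuid.UUID:
    """Name-based (v5) UUID: a pure function of namespace and name."""
    return uuid.uuid5(DEMO_NAMESPACE, name)


# Sample input in the "source|timestamp|hash" shape used by generate_poid.
name = "https://example.org/staff|2025-01-09T10:30:00Z|abc123"

# Same input, same UUID: the value can be re-derived without stored state.
assert demo_uuid(name) == demo_uuid(name)

# Different input, different UUID.
assert demo_uuid(name) != demo_uuid(name + "x")

# Random (v4) UUIDs differ on every call, so they cannot be re-derived.
assert uuid.uuid4() != uuid.uuid4()

# The 13th hex character (index 12) is the fixed version digit.
assert demo_uuid(name).hex[12] == "5"
```

The last assertion matters for PPID because the first 15 hex characters of the UUID are kept: one of them is constant, so it contributes no entropy.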

| Property | UUID v5 | UUID v4 (Random) | UUID v7 (Time-ordered) |
|----------|---------|------------------|------------------------|
| **Deterministic** | Yes | No | No |
| **Reproducible** | Yes | No | No |
| **No state required** | Yes | Yes | No |
| **Standard algorithm** | Yes (RFC 4122) | Yes (RFC 4122) | Yes (RFC 9562) |
| **Collision resistance** | 122-bit (truncated SHA-1) | 122-bit | 48-bit time + 74-bit random |

**Key advantage**: The same input always produces the same PPID, enabling deduplication and verification.

---

## 6. URI Structure

### 6.1 HTTP URIs

PPID identifiers are resolvable HTTP URIs:

```
Base URI:     https://ppid.org/
POID URI:     https://ppid.org/POID-7a3b-c4d5-e6f7-8903
PRID URI:     https://ppid.org/PRID-1234-5678-90ab-cde4
```

### 6.2 Content Negotiation

| Accept Header | Response Format |
|---------------|-----------------|
| `text/html` | Human-readable webpage |
| `application/json` | JSON-LD representation |
| `application/ld+json` | JSON-LD representation |
| `text/turtle` | RDF Turtle |
| `application/rdf+xml` | RDF/XML |

### 6.3 URI Patterns

```
# Person observation
https://ppid.org/POID-7a3b-c4d5-e6f7-8903

# Person reconstruction
https://ppid.org/PRID-1234-5678-90ab-cde4

# Observation's source claims
https://ppid.org/POID-7a3b-c4d5-e6f7-8903/claims

# Reconstruction's derived-from observations
https://ppid.org/PRID-1234-5678-90ab-cde4/observations

# Version history
https://ppid.org/PRID-1234-5678-90ab-cde4/history

# Specific version
https://ppid.org/PRID-1234-5678-90ab-cde4/v2
```

---

## 7. Namespace Design

### 7.1 RDF Namespaces

```turtle
@prefix ppid: .
@prefix ppidv: .
@prefix ppidt: .
```

### 7.2 Vocabulary Terms

```turtle
# Classes
ppidt:PersonObservation a owl:Class ;
    rdfs:subClassOf picom:PersonObservation .

ppidt:PersonReconstruction a owl:Class ;
    rdfs:subClassOf picom:PersonReconstruction .

# Properties
ppidv:poid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonObservation ;
    rdfs:range xsd:string ;
    rdfs:label "Person Observation ID" .

ppidv:prid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range xsd:string ;
    rdfs:label "Person Reconstruction ID" .

ppidv:hasObservation a owl:ObjectProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range ppidt:PersonObservation .
```

---

## 8. Interoperability Mapping

### 8.1 External Identifier Links

```turtle
# Same-as links to other systems
    owl:sameAs ;
    owl:sameAs ;
    owl:sameAs ;
    owl:sameAs ;

# SKOS mapping for partial matches
    skos:closeMatch ;

# External ID properties
    ppidv:orcid "0000-0002-1825-0097" ;
    ppidv:isni "0000000121032683" ;
    ppidv:viaf "102333412" ;
    ppidv:wikidata "Q12345" .
```

### 8.2 GHCID Integration

Link persons to heritage institutions via GHCID:

```turtle
    ppidv:employedAt ;
    ppidv:employmentRole "Senior Archivist" ;
    ppidv:employmentStart "2015"^^xsd:gYear .
```

---

## 9. Collision Handling

### 9.1 Collision Probability

With 15 hex characters the identifier space is nominally 60 bits. However, the first 15 hex characters of a UUID v5 include the fixed version digit (`5`, the 13th character), so only 56 bits actually vary:

- Total identifiers possible: 2^56 ≈ 7.2 × 10^16 (2^60 ≈ 1.15 × 10^18 if the version digit were skipped)
- Birthday approximation for N possible values and n identifiers: P(at least one collision) ≈ 1 − exp(−n² / 2N)
- For 1 million identifiers: P(collision) ≈ 6.9 × 10^-6
- For 1 billion identifiers: P(collision) ≈ 0.999 at 56 bits (≈ 0.35 at 60 bits)

At the billion-identifier scale collisions must therefore be expected, and the detection and resolution steps below are required rather than optional.

### 9.2 Collision Detection

```python
def check_collision(new_ppid: str, existing_ppids: set[str]) -> bool:
    """
    Check if generated PPID collides with existing identifiers.

    In practice, use database unique constraint instead.
    """
    return new_ppid in existing_ppids
```

### 9.3 Collision Resolution

If a collision is detected:

1. **For POID**: Add microsecond precision to timestamp, regenerate
2. **For PRID**: Add version suffix, regenerate

```python
def handle_collision(base_ppid: str, collision_count: int) -> str:
    """
    Resolve collision by adding entropy.
    """
    input_with_entropy = f"{base_ppid}|collision:{collision_count}"
    # generate_ppid_from_string is not defined in this document; it stands
    # for whichever generator (POID or PRID) produced base_ppid.
    return generate_ppid_from_string(input_with_entropy)
```

---

## 10. Versioning Strategy

### 10.1 Observation Versioning

Observations are **immutable**; each new extraction creates a new POID:

```
Source extracted 2025-01-09:      POID-7a3b-c4d5-e6f7-8903
Same source extracted 2025-02-15: POID-8c4d-e5f6-a7b8-9015 (different)
```

### 10.2 Reconstruction Versioning

Reconstructions can be **revised**; the PRID stays the same and a new version is added:

```turtle
# Version 1 (original)
    prov:generatedAtTime "2025-01-09T10:30:00Z"^^xsd:dateTime ;
    prov:wasDerivedFrom , .

# Version 2 (revised with new evidence)
    prov:generatedAtTime "2025-02-15T14:00:00Z"^^xsd:dateTime ;
    prov:wasRevisionOf ;
    prov:wasDerivedFrom , , .  # New observation

# Current version (alias)
    owl:sameAs .
```

---

## 11. Validation Rules

### 11.1 Syntax Validation

```python
import re

# The final character is the check digit, so it may be 0-9 or X but not a-f.
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{3}[0-9xX]$'
)


def validate_ppid_syntax(ppid: str) -> bool:
    """Validate PPID syntax without checksum verification."""
    return bool(PPID_PATTERN.match(ppid))


def validate_ppid_full(ppid: str) -> tuple[bool, str]:
    """
    Full PPID validation including checksum.

    Returns:
        Tuple of (is_valid, error_message)
    """
    if not validate_ppid_syntax(ppid):
        return False, "Invalid syntax"

    if not validate_ppid(ppid):  # Checksum validation
        return False, "Invalid checksum"

    return True, "Valid"
```

### 11.2 Semantic Validation

| Rule | Description |
|------|-------------|
| **POID must have source** | Every POID must link to source URL |
| **PRID must have observations** | Every PRID must link to at least one POID |
| **No circular references** | PRIDs cannot derive from themselves |
| **Valid timestamps** | All timestamps must be valid ISO 8601 |

---

## 12. Implementation Checklist

### 12.1 Core Functions

- [ ] `generate_poid(source_url, timestamp, content_hash) → POID`
- [ ] `generate_prid(observation_ids, curator_id, timestamp) → PRID`
- [ ] `validate_ppid(ppid) → bool`
- [ ] `parse_ppid(ppid) → {type, hex, checksum}`
- [ ] `ppid_to_uuid(ppid) → UUID` (requires a lookup via the stored `uuid_raw`, since a PPID keeps only 15 of the UUID's 32 hex digits)
- [ ] `uuid_to_ppid(uuid, type) → PPID`

### 12.2 Storage Requirements

| Field | Type | Index |
|-------|------|-------|
| `ppid` | VARCHAR(24) | PRIMARY KEY |
| `ppid_type` | ENUM('POID', 'PRID') | INDEX |
| `created_at` | TIMESTAMP | INDEX |
| `uuid_raw` | UUID | UNIQUE |

### 12.3 API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/v1/poid` | POST | Create new observation |
| `/api/v1/prid` | POST | Create new reconstruction |
| `/api/v1/{ppid}` | GET | Retrieve record |
| `/api/v1/{ppid}/validate` | GET | Validate identifier |
| `/api/v1/{prid}/observations` | GET | List linked observations |

---

## 13. References

- ISO/IEC 7064:2003 - Check character systems
- RFC 4122 - UUID URN Namespace
- ORCID Identifier Structure: https://support.orcid.org/hc/en-us/articles/360006897674
- W3C Cool URIs: https://www.w3.org/TR/cooluris/
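
The `parse_ppid` entry in the Section 12.1 checklist is not specified elsewhere in this document. The following is a minimal sketch under stated assumptions: the return value is a plain dict with the keys `type`, `hex`, and `checksum`; malformed input raises `ValueError`; and input is accepted case-insensitively and normalised. The helper names `mod11_2_hex` and `format_ppid` are hypothetical and exist only to keep the example self-contained.

```python
import re

# Assumed pattern, mirroring the block layout in Section 3.2.
_PPID_RE = re.compile(
    r'^(POID|PRID)-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{3})([0-9x])$',
    re.IGNORECASE,
)


def mod11_2_hex(hex_15: str) -> str:
    """MOD 11-2 check digit over hex digit values (0-15), as in Section 4.1."""
    total = 0
    for char in hex_15:
        total = (total + int(char, 16)) * 2
    result = (12 - total % 11) % 11
    return 'X' if result == 10 else str(result)


def parse_ppid(ppid: str) -> dict[str, str]:
    """Split a PPID into type, hex body, and check digit.

    Raises ValueError if the syntax or the check digit is wrong.
    """
    match = _PPID_RE.match(ppid)
    if match is None:
        raise ValueError(f"Invalid PPID syntax: {ppid!r}")
    prefix, b1, b2, b3, b4, check = match.groups()
    hex_15 = (b1 + b2 + b3 + b4).lower()
    check = check.upper()
    if mod11_2_hex(hex_15) != check:
        raise ValueError(f"Invalid PPID checksum: {ppid!r}")
    return {"type": prefix.upper(), "hex": hex_15, "checksum": check}


def format_ppid(prefix: str, hex_15: str) -> str:
    """Inverse of parse_ppid: append the check digit and insert hyphens."""
    body = hex_15.lower() + mod11_2_hex(hex_15)
    return f"{prefix}-{body[0:4]}-{body[4:8]}-{body[8:12]}-{body[12:16]}"


if __name__ == "__main__":
    # Round trip: build an identifier, then parse it back.
    ppid = format_ppid("POID", "7a3bc4d5e6f7890")
    print(ppid, parse_ppid(ppid))
```

On decimal-only input the check-digit helper agrees with ORCID: the first 15 digits of `0000-0002-1825-0097` yield the check digit `7`, which gives some confidence that the hex extension is a faithful generalisation of the same algorithm.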