Identifier Structure Design

Version: 0.1.0
Last Updated: 2025-01-09
Related: SOTA Identifier Systems | Implementation Guidelines


1. Overview

This document specifies the technical structure of PPID identifiers, including:

  • Format and syntax
  • Checksum algorithm
  • Namespace design
  • URI structure
  • Generation algorithms

2. Design Principles

2.1 Core Requirements

| Requirement | Rationale |
|---|---|
| Opaque | No personal information encoded |
| Persistent | Never reused, stable for life |
| Resolvable | Valid HTTP URIs |
| Verifiable | Checksum for validation |
| Interoperable | Compatible with ORCID/ISNI format |
| Scalable | Support billions of identifiers |

2.2 Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Length | 16 characters | ORCID/ISNI compatible |
| Character set | Hex (0-9, a-f) + type prefix | URL-safe, case-insensitive |
| Checksum | MOD 11-2 | ISO standard, ORCID compatible |
| Type distinction | Prefix: POID/PRID | Clear observation vs reconstruction |
| UUID backing | UUID v5 (SHA-1) | Deterministic, reproducible |

3. Identifier Format

3.1 Structure Overview

Format: {TYPE}-{xxxx}-{xxxx}-{xxxx}-{xxxx}
        │      │      │      │      └── Block 4 (3 hex + check digit)
        │      │      │      └── Block 3 (4 hex digits)
        │      │      └── Block 2 (4 hex digits)
        │      └── Block 1 (4 hex digits)
        └── Type prefix (POID or PRID)

Examples:
  POID-7a3b-c4d5-e6f7-8903   (Person Observation ID)
  PRID-1234-5678-90ab-cde4   (Person Reconstruction ID)

3.2 Component Breakdown

| Component | Format | Description |
|---|---|---|
| Type prefix | POID or PRID | Observation vs Reconstruction |
| Block 1 | [0-9a-f]{4} | 4 hex digits |
| Block 2 | [0-9a-f]{4} | 4 hex digits |
| Block 3 | [0-9a-f]{4} | 4 hex digits |
| Block 4 | [0-9a-f]{3}[0-9X] | 3 hex + check digit |
| Separator | - | Hyphen between blocks |

3.3 Identifier Types

| Type | Prefix | Purpose | Creation Trigger |
|---|---|---|---|
| Person Observation ID | POID | Raw source observation | Data extraction from source |
| Person Reconstruction ID | PRID | Curated person identity | Entity resolution / curation |

4. Checksum Algorithm

4.1 MOD 11-2 (ISO/IEC 7064:2003)

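PPID uses the same check digit algorithm as ORCID (ISO/IEC 7064 MOD 11-2) for interoperability, applied here to hex digit values 0-15 rather than decimal digits: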

def calculate_ppid_checksum(digits: str) -> str:
    """
    Calculate PPID check digit using ISO/IEC 7064 MOD 11-2.
    
    Args:
        digits: 15-character hex string (without check digit)
    
    Returns:
        Check digit (0-9 or X)
    
    Algorithm:
        1. For each digit, add to running total and multiply by 2
        2. Take result modulo 11
        3. Subtract from 12, take modulo 11
        4. If result is 10, use 'X'
    """
    # Convert each hex digit to its integer value (0-15 for 0-9, a-f).
    # int(char, 16) raises ValueError for anything that is not a hex digit.
    total = 0
    for char in digits:
        total = (total + int(char, 16)) * 2
    
    remainder = total % 11
    result = (12 - remainder) % 11
    
    return 'X' if result == 10 else str(result)


def validate_ppid(ppid: str) -> bool:
    """
    Validate a complete PPID identifier.
    
    Args:
        ppid: Full PPID string (e.g., "POID-7a3b-c4d5-e6f7-8903")
    
    Returns:
        True if valid, False otherwise
    """
    # Split into prefix and blocks (upper-cased, so input is case-insensitive)
    parts = ppid.upper().split('-')
    
    # Validate structure: type prefix plus 4 blocks
    if len(parts) != 5:
        return False
    
    # Validate prefix
    if parts[0] not in ('POID', 'PRID'):
        return False
    
    # Validate block lengths (4 chars each), then join without the prefix
    if any(len(block) != 4 for block in parts[1:]):
        return False
    hex_part = ''.join(parts[1:])
    
    # Validate hex characters (except last which can be X)
    hex_digits = hex_part[:15]
    check_digit = hex_part[15]
    
    if not all(c in '0123456789abcdefABCDEF' for c in hex_digits):
        return False
    
    if check_digit not in '0123456789Xx':
        return False
    
    # Validate checksum
    calculated = calculate_ppid_checksum(hex_digits)
    return calculated.upper() == check_digit.upper()

4.2 Checksum Examples

| Hex Portion (15 chars) | Check Digit | Full ID |
|---|---|---|
| 7a3bc4d5e6f7890 | 3 | POID-7a3b-c4d5-e6f7-8903 |
| 1234567890abcde | 4 | PRID-1234-5678-90ab-cde4 |
| 000000000000000 | 1 | POID-0000-0000-0000-0001 |
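
Any row in a table like this can be re-derived by running the section 4.1 algorithm directly. The sketch below is a compact restatement of that algorithm (the name `mod_11_2_check_digit` is illustrative, not part of the PPID API), and it adds one input that yields the `X` check digit.

```python
def mod_11_2_check_digit(hex_digits: str) -> str:
    """Restates the section 4.1 algorithm; hex digits count as values 0-15."""
    total = 0
    for char in hex_digits:
        # int(char, 16) raises ValueError on non-hex input
        total = (total + int(char, 16)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


for hex_15 in ("7a3bc4d5e6f7890", "1234567890abcde",
               "000000000000000", "000000000000001"):
    print(hex_15, mod_11_2_check_digit(hex_15))
# 7a3bc4d5e6f7890 3
# 1234567890abcde 4
# 000000000000000 1
# 000000000000001 X
```

Note that an all-zero input gives check digit 1, not 0, because the final step computes (12 - 0) mod 11.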

5. UUID Generation

5.1 UUID v5 for Deterministic IDs

PPID uses UUID v5 (SHA-1 based) to generate deterministic identifiers:

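import uuid

# PPID namespace UUID (generated once, used forever).
# Example value only: this is the RFC 4122 DNS namespace (uuid.NAMESPACE_DNS);
# a real deployment should mint its own namespace UUID.
PPID_NAMESPACE = uuid.UUID('6ba7b810-9dad-11d1-80b4-00c04fd430c8')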

# Sub-namespaces for different ID types
POID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonObservation')
PRID_NAMESPACE = uuid.uuid5(PPID_NAMESPACE, 'PersonReconstruction')


def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """
    Generate deterministic POID from source metadata.
    
    The same source + timestamp + content will always produce the same POID.
    
    Args:
        source_url: URL where observation was extracted
        retrieval_timestamp: ISO 8601 timestamp of extraction
        content_hash: SHA-256 hash of extracted content
    
    Returns:
        POID string (e.g., "POID-7a3b-c4d5-e6f7-8903")
    """
    # Create deterministic input string
    input_string = f"{source_url}|{retrieval_timestamp}|{content_hash}"
    
    # Generate UUID v5
    raw_uuid = uuid.uuid5(POID_NAMESPACE, input_string)
    
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'POID')


def generate_prid(observation_ids: list[str], curator_id: str, timestamp: str) -> str:
    """
    Generate deterministic PRID from linked observations.
    
    Args:
        observation_ids: Sorted list of POIDs that comprise this reconstruction
        curator_id: Identifier of curator/algorithm creating reconstruction
        timestamp: ISO 8601 timestamp of reconstruction
    
    Returns:
        PRID string (e.g., "PRID-1234-5678-90ab-cde4")
    """
    # Sort observations for deterministic ordering
    sorted_obs = sorted(observation_ids)
    
    # Create deterministic input string
    input_string = f"{'|'.join(sorted_obs)}|{curator_id}|{timestamp}"
    
    # Generate UUID v5
    raw_uuid = uuid.uuid5(PRID_NAMESPACE, input_string)
    
    # Convert to PPID format
    return uuid_to_ppid(raw_uuid, 'PRID')


def uuid_to_ppid(raw_uuid: uuid.UUID, prefix: str) -> str:
    """
    Convert UUID to PPID format with checksum.
    
    Args:
        raw_uuid: UUID object
        prefix: 'POID' or 'PRID'
    
    Returns:
        Formatted PPID string
    """
    # Get hex representation (32 chars)
    hex_str = raw_uuid.hex
    
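    # Take first 15 characters. Note: in a UUID v5 the 13th hex character
    # is always the version digit '5', so only 14 of these characters vary.
    hex_15 = hex_str[:15]
    
    # Calculate checksum
    check_digit = calculate_ppid_checksum(hex_15)
    
    # Format with hyphens (check digit stays as returned: 0-9 or uppercase 'X')
    hex_16 = hex_15 + check_digit
    formatted = f"{prefix}-{hex_16[0:4]}-{hex_16[4:8]}-{hex_16[8:12]}-{hex_16[12:16]}"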
    
    return formatted

5.2 Why UUID v5?

| Property | UUID v5 | UUID v4 (Random) | UUID v7 (Time-ordered) |
|---|---|---|---|
| Deterministic | Yes | No | No |
| Reproducible | Yes | No | No |
| No state required | Yes | Yes | No |
| Standard algorithm | Yes (RFC 4122) | Yes (RFC 4122) | Yes (RFC 9562) |
| Collision resistance | 122-bit hash output | 122-bit random | 48-bit time + 74-bit random |

Key advantage: Same input always produces same PPID, enabling deduplication and verification.
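
The determinism property can be checked with the standard library alone. The following is a minimal sketch, assuming a throwaway namespace (`DEMO_NAMESPACE`) and a helper name (`observation_uuid`) that are both hypothetical; it builds the input string the same way `generate_poid` does.

```python
import uuid

# Hypothetical namespace, for illustration only; not the real PPID namespace.
DEMO_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ppid.org/")


def observation_uuid(source_url: str, retrieved_at: str, content_hash: str) -> uuid.UUID:
    """Derive a UUID v5 from the same pipe-joined input used by generate_poid."""
    return uuid.uuid5(DEMO_NAMESPACE, f"{source_url}|{retrieved_at}|{content_hash}")


first = observation_uuid("https://example.org/staff", "2025-01-09T10:30:00Z", "abc123")
again = observation_uuid("https://example.org/staff", "2025-01-09T10:30:00Z", "abc123")
later = observation_uuid("https://example.org/staff", "2025-02-15T14:00:00Z", "abc123")

print(first == again)   # True: identical input, identical UUID
print(first == later)   # False: a different timestamp changes the UUID
print(first.version)    # 5
print(first.hex[12])    # 5 (the version digit sits at hex position 13)
```

The last line matters for PPID: any identifier cut from the first 15 hex characters of a UUID v5 includes that fixed version digit.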


6. URI Structure

6.1 HTTP URIs

PPID identifiers are resolvable HTTP URIs:

Base URI: https://ppid.org/

POID URI: https://ppid.org/POID-7a3b-c4d5-e6f7-8903
PRID URI: https://ppid.org/PRID-1234-5678-90ab-cde4

6.2 Content Negotiation

| Accept Header | Response Format |
|---|---|
| text/html | Human-readable webpage |
| application/json | JSON-LD representation |
| application/ld+json | JSON-LD representation |
| text/turtle | RDF Turtle |
| application/rdf+xml | RDF/XML |
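
As a rough illustration of how a resolver might pick among these formats, the sketch below parses an `Accept` header and returns the best supported media type. It is a simplified assumption, not the PPID server's logic: the function name `choose_media_type` is hypothetical, wildcards such as `*/*` are ignored, and unsupported requests fall back to `text/html`.

```python
SUPPORTED_MEDIA_TYPES = (
    "text/html",
    "application/json",
    "application/ld+json",
    "text/turtle",
    "application/rdf+xml",
)


def choose_media_type(accept_header: str, default: str = "text/html") -> str:
    """Return the supported media type with the highest q-value."""
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        media_type, _, params = item.strip().partition(";")
        media_type = media_type.strip().lower()
        quality = 1.0  # q defaults to 1 when absent
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type in SUPPORTED_MEDIA_TYPES and quality > 0:
            # Sort by quality (descending), then by position in the header
            candidates.append((-quality, position, media_type))
    if not candidates:
        return default
    return min(candidates)[2]


print(choose_media_type("text/turtle;q=0.9, application/ld+json"))
# application/ld+json
```

In practice a web framework's own content negotiation would normally replace hand-written parsing like this.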

6.3 URI Patterns

# Person observation
https://ppid.org/POID-7a3b-c4d5-e6f7-8903

# Person reconstruction
https://ppid.org/PRID-1234-5678-90ab-cde4

# Observation's source claims
https://ppid.org/POID-7a3b-c4d5-e6f7-8903/claims

# Reconstruction's derived-from observations
https://ppid.org/PRID-1234-5678-90ab-cde4/observations

# Version history
https://ppid.org/PRID-1234-5678-90ab-cde4/history

# Specific version
https://ppid.org/PRID-1234-5678-90ab-cde4/v2
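
These patterns are regular enough to build with one helper. The sketch below is illustrative only: the function name `ppid_uri`, its parameters, and the assumption that version numbers start at 1 are not taken from the specification.

```python
from typing import Optional

BASE_URI = "https://ppid.org/"
SUB_RESOURCES = {"claims", "observations", "history"}


def ppid_uri(ppid: str, sub_resource: Optional[str] = None,
             version: Optional[int] = None) -> str:
    """Build a PPID URI following the patterns listed above."""
    if sub_resource is not None and version is not None:
        raise ValueError("use either sub_resource or version, not both")
    uri = BASE_URI + ppid
    if sub_resource is not None:
        if sub_resource not in SUB_RESOURCES:
            raise ValueError(f"unknown sub-resource: {sub_resource}")
        return f"{uri}/{sub_resource}"
    if version is not None:
        if version < 1:  # assumption: versions are numbered from v1
            raise ValueError("version numbers start at 1")
        return f"{uri}/v{version}"
    return uri


print(ppid_uri("PRID-1234-5678-90ab-cde4", version=2))
# https://ppid.org/PRID-1234-5678-90ab-cde4/v2
```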

7. Namespace Design

7.1 RDF Namespaces

@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .

7.2 Vocabulary Terms

# Classes
ppidt:PersonObservation a owl:Class ;
    rdfs:subClassOf picom:PersonObservation .

ppidt:PersonReconstruction a owl:Class ;
    rdfs:subClassOf picom:PersonReconstruction .

# Properties
ppidv:poid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonObservation ;
    rdfs:range xsd:string ;
    rdfs:label "Person Observation ID" .

ppidv:prid a owl:DatatypeProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range xsd:string ;
    rdfs:label "Person Reconstruction ID" .

ppidv:hasObservation a owl:ObjectProperty ;
    rdfs:domain ppidt:PersonReconstruction ;
    rdfs:range ppidt:PersonObservation .

8. Interoperability Mapping

<https://ppid.org/PRID-1234-5678-90ab-cde4>
    # Same-as links to other systems
    owl:sameAs <https://orcid.org/0000-0002-1825-0097> ;
    owl:sameAs <https://isni.org/isni/0000000121032683> ;
    owl:sameAs <http://viaf.org/viaf/102333412> ;
    owl:sameAs <http://www.wikidata.org/entity/Q12345> ;
    
    # SKOS mapping for partial matches
    skos:closeMatch <http://id.loc.gov/authorities/names/n12345678> ;
    
    # External ID properties
    ppidv:orcid "0000-0002-1825-0097" ;
    ppidv:isni "0000000121032683" ;
    ppidv:viaf "102333412" ;
    ppidv:wikidata "Q12345" .

8.2 GHCID Integration

Link persons to heritage institutions via GHCID:

<https://ppid.org/PRID-1234-5678-90ab-cde4>
    ppidv:employedAt <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;
    ppidv:employmentRole "Senior Archivist" ;
    ppidv:employmentStart "2015"^^xsd:gYear .

9. Collision Handling

9.1 Collision Probability

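With 15 hex characters (at most 60 bits of entropy):

  • Total identifiers possible: 2^60 ≈ 1.15 × 10^18
  • Chance that one new identifier collides with any of 1 billion existing ones: 10^9 / 2^60 ≈ 8.7 × 10^-10
  • Chance of at least one collision somewhere among 1 billion identifiers (birthday approximation, 1 − exp(−n²/2N)): ≈ 0.35

Because uuid_to_ppid keeps the first 15 hex characters of a UUID v5, and the 13th of those is always the version digit 5, only 56 bits actually vary. The same three figures then become 2^56 ≈ 7.2 × 10^16, ≈ 1.4 × 10^-8, and ≈ 0.999, so collisions must be expected at billion-identifier scale.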
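
Collision estimates of this kind follow from the birthday approximation P ≈ 1 − exp(−n²/2N), where N is the size of the identifier space and n the number of identifiers issued. The sketch below evaluates it for two sizes: 60 bits (15 free hex characters) and 56 bits (15 characters cut from a UUID v5, where one character is the fixed version digit). The function name is illustrative.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday approximation: P = 1 - exp(-n^2 / 2N), with N = 2**bits."""
    space = 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x
    return -math.expm1(-(n_ids ** 2) / (2 * space))


for bits in (60, 56):
    for n_ids in (10**6, 10**8, 10**9):
        probability = collision_probability(n_ids, bits)
        print(f"{bits} bits, {n_ids:.0e} ids: {probability:.3g}")
```

At one billion identifiers the result is roughly 0.35 for 60 bits and above 0.99 for 56 bits, which is why a uniqueness check at insert time is needed rather than optional.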

9.2 Collision Detection

def check_collision(new_ppid: str, existing_ppids: set[str]) -> bool:
    """
    Check if generated PPID collides with existing identifiers.
    
    In practice, use database unique constraint instead.
    """
    return new_ppid in existing_ppids
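
A database-backed version of this check might look like the sketch below, which uses an in-memory SQLite database so it runs as-is. The table name `ppid_registry` and the function `register_ppid` are hypothetical; the point is only that the primary-key constraint, not application code, rejects a duplicate.

```python
import sqlite3


def register_ppid(conn: sqlite3.Connection, ppid: str) -> bool:
    """Insert a PPID; return False if the primary key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO ppid_registry (ppid) VALUES (?)", (ppid,))
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ppid_registry (ppid TEXT PRIMARY KEY)")

print(register_ppid(conn, "POID-0000-0000-0000-0001"))  # True
print(register_ppid(conn, "POID-0000-0000-0000-0001"))  # False
```

Because identifiers are deterministic, a rejected insert can mean either a true collision or the same input submitted twice; telling the two apart requires comparing the stored input (or full UUID) with the new one.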

9.3 Collision Resolution

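If a collision is detected:

  1. For POID: Add microsecond precision to timestamp, regenerate
  2. For PRID: Add version suffix, regenerate

def handle_collision(base_ppid: str, collision_count: int) -> str:
    """
    Resolve collision by adding entropy.

    Hashes the colliding PPID plus a counter under the matching namespace
    (POID_NAMESPACE / PRID_NAMESPACE and uuid_to_ppid from section 5.1).
    """
    prefix = base_ppid.split('-')[0]
    namespace = POID_NAMESPACE if prefix == 'POID' else PRID_NAMESPACE
    input_with_entropy = f"{base_ppid}|collision:{collision_count}"
    return uuid_to_ppid(uuid.uuid5(namespace, input_with_entropy), prefix)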

10. Versioning Strategy

10.1 Observation Versioning

Observations are immutable - new extraction creates new POID:

Source extracted 2025-01-09:      POID-7a3b-c4d5-e6f7-8903
Same source extracted 2025-02-15: POID-8c4d-e5f6-a7b8-9015  (different)

10.2 Reconstruction Versioning

Reconstructions can be revised - same PRID, new version:

# Version 1 (original)
<https://ppid.org/PRID-1234-5678-90ab-cde4/v1>
    prov:generatedAtTime "2025-01-09T10:30:00Z"^^xsd:dateTime ;
    prov:wasDerivedFrom <https://ppid.org/POID-7a3b-...> ,
                        <https://ppid.org/POID-8c4d-...> .

# Version 2 (revised with new evidence)
<https://ppid.org/PRID-1234-5678-90ab-cde4/v2>
    prov:generatedAtTime "2025-02-15T14:00:00Z"^^xsd:dateTime ;
    prov:wasRevisionOf <https://ppid.org/PRID-1234-5678-90ab-cde4/v1> ;
    prov:wasDerivedFrom <https://ppid.org/POID-7a3b-...> ,
                        <https://ppid.org/POID-8c4d-...> ,
                        <https://ppid.org/POID-9d5e-...> .  # New observation

# Current version (alias)
<https://ppid.org/PRID-1234-5678-90ab-cde4>
    owl:sameAs <https://ppid.org/PRID-1234-5678-90ab-cde4/v2> .

11. Validation Rules

11.1 Syntax Validation

import re

# The final character is the check digit: 0-9 or X, never a-f (see section 3.2)
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{3}[0-9xX]$'
)

def validate_ppid_syntax(ppid: str) -> bool:
    """Validate PPID syntax without checksum verification."""
    return bool(PPID_PATTERN.match(ppid))

def validate_ppid_full(ppid: str) -> tuple[bool, str]:
    """
    Full PPID validation including checksum.
    
    Returns:
        Tuple of (is_valid, error_message)
    """
    if not validate_ppid_syntax(ppid):
        return False, "Invalid syntax"
    
    if not validate_ppid(ppid):  # Checksum validation
        return False, "Invalid checksum"
    
    return True, "Valid"

11.2 Semantic Validation

| Rule | Description |
|---|---|
| POID must have source | Every POID must link to source URL |
| PRID must have observations | Every PRID must link to at least one POID |
| No circular references | PRIDs cannot derive from themselves |
| Valid timestamps | All timestamps must be valid ISO 8601 |
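
One way these rules could be checked is sketched below. The record layout (a dict per PPID with optional `source_url`, `derived_from`, and `timestamps` keys) and all function names are assumptions made for illustration; the specification does not define a storage model at this level. The timestamp check relies on `datetime.fromisoformat`, which accepts a common subset of ISO 8601 rather than the full standard.

```python
from datetime import datetime


def is_iso8601(value: str) -> bool:
    """Approximate ISO 8601 check; a trailing 'Z' is mapped to '+00:00'."""
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def derives_from_itself(prid: str, records: dict[str, dict]) -> bool:
    """Follow derived_from links and report whether they lead back to prid."""
    stack = list(records.get(prid, {}).get("derived_from", []))
    seen: set[str] = set()
    while stack:
        current = stack.pop()
        if current == prid:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(records.get(current, {}).get("derived_from", []))
    return False


def semantic_errors(ppid: str, records: dict[str, dict]) -> list[str]:
    """Return the names of the rules in the table that this record violates."""
    record = records[ppid]
    errors = []
    if ppid.startswith("POID") and not record.get("source_url"):
        errors.append("POID must have source")
    if ppid.startswith("PRID"):
        linked = record.get("derived_from", [])
        if not any(item.startswith("POID") for item in linked):
            errors.append("PRID must have observations")
        if derives_from_itself(ppid, records):
            errors.append("No circular references")
    if not all(is_iso8601(ts) for ts in record.get("timestamps", [])):
        errors.append("Valid timestamps")
    return errors


records = {
    "POID-0000-0000-0000-0001": {"source_url": "https://example.org/staff",
                                 "timestamps": ["2025-01-09T10:30:00Z"]},
    "PRID-0000-0000-0000-001X": {"derived_from": ["POID-0000-0000-0000-0001"],
                                 "timestamps": ["2025-01-09"]},
    "PRID-1234-5678-90ab-cde4": {"derived_from": ["PRID-1234-5678-90ab-cde4"],
                                 "timestamps": ["09/01/2025"]},
}
for ppid in records:
    print(ppid, semantic_errors(ppid, records))
```

The first two records pass; the third violates all three PRID-applicable rules.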

12. Implementation Checklist

12.1 Core Functions

  • generate_poid(source_url, timestamp, content_hash) → POID
  • generate_prid(observation_ids, curator_id, timestamp) → PRID
  • validate_ppid(ppid) → bool
  • parse_ppid(ppid) → {type, hex, checksum}
  • ppid_to_uuid(ppid) → UUID (lookup of the stored uuid_raw; the truncated hex cannot be inverted to the full UUID)
  • uuid_to_ppid(uuid, type) → PPID
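
Of the functions listed, `parse_ppid` is the only one with no definition elsewhere in this document. A minimal sketch, assuming case-insensitive input and a normalized result (uppercase type and check digit, lowercase hex), could look like this; the exact return shape beyond `{type, hex, checksum}` is an assumption. It checks syntax only and does not verify the check digit.

```python
import re

_PPID_RE = re.compile(
    r"(POID|PRID)-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{4})-([0-9a-f]{3})([0-9x])",
    re.IGNORECASE,
)


def parse_ppid(ppid: str) -> dict[str, str]:
    """Split a PPID into type prefix, 15-character hex body, and check digit."""
    match = _PPID_RE.fullmatch(ppid.strip())
    if match is None:
        raise ValueError(f"not a well-formed PPID: {ppid!r}")
    prefix, *blocks, check = match.groups()
    return {
        "type": prefix.upper(),
        "hex": "".join(blocks).lower(),
        "checksum": check.upper(),
    }


print(parse_ppid("POID-7a3b-c4d5-e6f7-8903"))
# {'type': 'POID', 'hex': '7a3bc4d5e6f7890', 'checksum': '3'}
```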

12.2 Storage Requirements

| Field | Type | Index |
|---|---|---|
| ppid | VARCHAR(24) | PRIMARY KEY |
| ppid_type | ENUM('POID', 'PRID') | INDEX |
| created_at | TIMESTAMP | INDEX |
| uuid_raw | UUID | UNIQUE |

12.3 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/poid | POST | Create new observation |
| /api/v1/prid | POST | Create new reconstruction |
| /api/v1/{ppid} | GET | Retrieve record |
| /api/v1/{ppid}/validate | GET | Validate identifier |
| /api/v1/{prid}/observations | GET | List linked observations |
13. References