State-of-the-Art Identifier Systems Analysis

Version: 0.1.0
Last Updated: 2025-01-09
Related: Executive Summary | Identifier Structure Design


1. Overview

This document analyzes three major person identifier systems to inform the design of PPID:

  1. ORCID - Open Researcher and Contributor ID
  2. ISNI - International Standard Name Identifier
  3. VIAF - Virtual International Authority File

Each system has distinct design philosophies, governance models, and technical implementations that offer valuable lessons.


2. ORCID (Open Researcher and Contributor ID)

2.1 Background

  • Founded: 2010 (launched 2012)
  • Governance: Non-profit organization
  • Purpose: Uniquely identify researchers and their scholarly contributions
  • Website: https://orcid.org/

2.2 Identifier Structure

Format: xxxx-xxxx-xxxx-xxxx
Example: 0000-0002-1825-0097

Components:
- 16 characters total (15 digits + 1 check character)
- Grouped in 4 blocks of 4 characters
- Hyphen-separated for readability
- Last character: check digit (0-9 or X)

2.3 Technical Specifications

| Aspect | Specification |
| --- | --- |
| Length | 16 characters (excluding hyphens) |
| Character set | Digits 0-9, plus X for check digit |
| Checksum | ISO/IEC 7064:2003, MOD 11-2 |
| Namespace | https://orcid.org/ |
| URI format | https://orcid.org/0000-0002-1825-0097 |
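
The specification above implies a normalization step before any checksum validation. A minimal sketch, assuming the hypothetical helper names `normalize_orcid` and `orcid_uri` (they are not part of any ORCID library), and checking only the shape of the identifier, not its check digit:

```python
import re

ORCID_NAMESPACE = "https://orcid.org/"
# 15 ASCII digits followed by a check character (digit or X), hyphens removed.
_COMPACT = re.compile(r"[0-9]{15}[0-9X]")


def normalize_orcid(value: str) -> str:
    """Return the hyphenated form xxxx-xxxx-xxxx-xxxx, or raise ValueError.

    Only the shape is checked here; the check digit is not verified.
    """
    text = value.strip()
    if text.lower().startswith(ORCID_NAMESPACE):
        text = text[len(ORCID_NAMESPACE):]
    compact = text.replace("-", "").upper()
    if not _COMPACT.fullmatch(compact):
        raise ValueError(f"not an ORCID-shaped identifier: {value!r}")
    # Regroup into four blocks of four characters.
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))


def orcid_uri(value: str) -> str:
    """Build the URI form from any accepted input form."""
    return ORCID_NAMESPACE + normalize_orcid(value)
```

Keeping shape checking separate from checksum validation lets callers distinguish "not an ORCID at all" from "mistyped ORCID".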

2.4 Checksum Algorithm (MOD 11-2)

def calculate_orcid_checksum(digits: str) -> str:
    """
    Calculate ORCID check digit using ISO/IEC 7064 MOD 11-2.
    
    Args:
        digits: 15-digit string (without check digit)
    
    Returns:
        Check digit (0-9 or X)
    """
    total = 0
    for digit in digits:
        total = (total + int(digit)) * 2
    
    remainder = total % 11
    result = (12 - remainder) % 11
    
    return 'X' if result == 10 else str(result)


def validate_orcid(orcid: str) -> bool:
    """
    Validate complete ORCID identifier.
    
    Args:
        orcid: 16-character ORCID (with or without hyphens)
    
    Returns:
        True if valid, False otherwise
    """
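    # Strip the URL prefix first, then the hyphens; accept a lowercase 'x'
    clean = orcid.strip().replace('https://orcid.org/', '').replace('-', '').upper()

    # Expect 15 digits followed by one check character (0-9 or X)
    if len(clean) != 16:
        return False

    digits = clean[:15]
    check_digit = clean[15]

    # Reject non-digit input before the checksum routine calls int() on it
    if not all(c in '0123456789' for c in digits):
        return False

    return calculate_orcid_checksum(digits) == check_digit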
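
As a worked check of the algorithm, the sketch below records the running total after each of the 15 digits of the example identifier 0000-0002-1825-0097. The helper name `mod_11_2_trace` is hypothetical and exists only for illustration.

```python
def mod_11_2_trace(digits: str) -> list[int]:
    """Return the running total after each digit (ISO/IEC 7064 MOD 11-2)."""
    totals = []
    total = 0
    for ch in digits:
        total = (total + int(ch)) * 2
        totals.append(total)
    return totals


trace = mod_11_2_trace("000000021825009")
remainder = trace[-1] % 11          # 1314 % 11
check = (12 - remainder) % 11       # (12 - 5) % 11
print(trace[-1], remainder, check)  # 1314 5 7
```

The seven leading zeros leave the total at 0; the remaining digits produce 4, 10, 36, 76, 162, 324, 648, and 1314, giving the check digit 7 that ends the example.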

2.5 Key Design Decisions

| Decision | Rationale | Lesson for PPID |
| --- | --- | --- |
| Opaque identifiers | No personal info encoded - prevents discrimination, ensures persistence | Adopt: Privacy-first design |
| Random assignment | Prevents inference of registration date or status | Adopt: Avoid sequential IDs |
| Self-registration | Researchers control their own record | Adapt: Heritage sector may need institutional registration |
| Single ID per person | One identifier for life | Adopt: Career-long persistence |
| ISNI compatible | 16-digit format matches ISO 27729 | Adopt: Interoperability with ISNI |
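
The "opaque" and "random assignment" decisions can be sketched as follows. This is an illustration under stated assumptions, not a description of how ORCID mints identifiers: the function names are hypothetical, and a real registry would also need to check each candidate against identifiers already issued and against number ranges reserved by other systems.

```python
import secrets


def mod_11_2_check_char(digits: str) -> str:
    """ISO/IEC 7064 MOD 11-2 check character for a string of digits."""
    total = 0
    for ch in digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def new_opaque_id() -> str:
    """Draw 15 random digits and append the check character.

    Assumption: uniqueness and reserved-range checks happen elsewhere.
    """
    body = "".join(str(secrets.randbelow(10)) for _ in range(15))
    compact = body + mod_11_2_check_char(body)
    # Group as xxxx-xxxx-xxxx-xxxx for readability.
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))
```

Because the digits come from a cryptographic random source rather than a counter, the identifier reveals nothing about registration order.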

2.6 Strengths

  • Wide adoption: 18+ million registrations
  • Self-service: Researchers manage own profiles
  • API-first: Robust REST API with OAuth
  • Open data: CC0 public data file available
  • Integration: Works with publishers, funders, institutions

2.7 Limitations for Heritage Domain

| Limitation | Impact on Heritage Use |
| --- | --- |
| Living persons only | Cannot identify historical figures |
| Self-registration model | Deceased persons cannot register |
| Research focus | Not designed for archivists, curators, donors |
| Notability bias | Assumes published output |
| English-centric metadata | Limited support for historical name forms |

3. ISNI (International Standard Name Identifier)

3.1 Background

  • Standard: ISO 27729:2012
  • Governance: ISNI International Agency (ISNI-IA)
  • Purpose: Identify public identities of contributors to creative works
  • Website: https://isni.org/

3.2 Identifier Structure

Format: xxxx xxxx xxxx xxxx
Example: 0000 0001 2103 2683

Components:
- 16 characters total (15 digits + 1 check character)
- Typically displayed with spaces
- Last character: check digit (0-9 or X)
- Same format as ORCID (by design)
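
Because the two formats share the same length and check character scheme, one validator can serve both. A minimal sketch, assuming a hypothetical function name `validate_isni` and accepting either spaces or hyphens as separators:

```python
def validate_isni(isni: str) -> bool:
    """Validate a 16-character ISNI (or ORCID) using MOD 11-2."""
    compact = isni.replace(" ", "").replace("-", "").upper()
    if len(compact) != 16:
        return False
    body, check = compact[:15], compact[15]
    # Only ASCII digits are allowed in the first 15 positions.
    if not all(c in "0123456789" for c in body):
        return False
    total = 0
    for ch in body:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return check == expected


print(validate_isni("0000 0001 2103 2683"))  # True
print(validate_isni("0000 0001 2103 2684"))  # False
```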

3.3 Registration Agencies

ISNI uses a federated model with multiple registration agencies:

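| Agency | Domain |
| --- | --- |
| OCLC | Libraries, publishers |
| BnF (France) | French cultural heritage |
| ORCID | Researchers (ORCID iDs are drawn from a block reserved within the ISNI number range) |
| Ringgold | Organizations |
| Bowker | Publishers, authors |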

3.4 Key Differences from ORCID

| Aspect | ORCID | ISNI |
| --- | --- | --- |
| Scope | Researchers only | All public identities |
| Registration | Self-service | Agency-mediated |
| Cost | Free | Fee-based (agencies charge) |
| Historical persons | No | Yes |
| Data control | Individual | Agency |

3.5 Strengths

  • Broader scope: Covers authors, performers, artists, historical figures
  • Quality control: Curated by registration agencies
  • Linked data: Published as RDF with owl:sameAs links
  • Disambiguation: Explicit clustering of variant names

3.6 Limitations for Heritage Domain

| Limitation | Impact |
| --- | --- |
| Cost | Registration fees may limit adoption |
| Slow assignment | Weeks/months to receive ISNI |
| Agency dependency | Must work through intermediary |
| Limited coverage | Heritage staff rarely have ISNIs |
| Metadata constraints | Fixed schema may not fit genealogical data |

4. VIAF (Virtual International Authority File)

4.1 Background

  • Founded: 2003 (OCLC-hosted since 2012)
  • Governance: OCLC with contributing libraries
  • Purpose: Link national library authority files
  • Website: https://viaf.org/

4.2 Architecture

                    ┌─────────────────────────────────┐
                    │           VIAF Cluster          │
                    │    viaf.org/viaf/102333412      │
                    └─────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
    ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
    │  Library of     │   │  Deutsche       │   │  Bibliothèque   │
    │  Congress       │   │  Nationalbiblio │   │  nationale de   │
    │  n79021164      │   │  thek 118529579 │   │  France 11908666│
    └─────────────────┘   └─────────────────┘   └─────────────────┘
              │                     │                     │
              ▼                     ▼                     ▼
         "Twain, Mark"        "Twain, Mark"       "Twain, Mark"
         "Clemens, Samuel"    "Clemens, Samuel     "Clemens, Samuel
                              Langhorne"           Langhorne"

4.3 Key Concepts

| Concept | Description |
| --- | --- |
| Cluster | A VIAF record grouping authority records from multiple sources |
| Contributor | A library or agency providing authority data |
| Link | owl:sameAs relationship between contributor records |
| Heading | The authorized form of name from a contributor |
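
These concepts map naturally onto a small data model. The sketch below is illustrative only: the class and field names are assumptions rather than VIAF's own schema, and the contributor labels and identifiers are taken from the architecture diagram in section 4.2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuthorityRecord:
    contributor: str            # label for the contributing library
    local_id: str               # identifier within that contributor's file
    headings: tuple[str, ...]   # name forms supplied by the contributor


@dataclass
class Cluster:
    viaf_id: str
    records: list[AuthorityRecord] = field(default_factory=list)

    @property
    def uri(self) -> str:
        return f"https://viaf.org/viaf/{self.viaf_id}"

    def all_headings(self) -> set[str]:
        """Union of every name form across contributors."""
        return {h for r in self.records for h in r.headings}


cluster = Cluster("102333412", [
    AuthorityRecord("LC", "n79021164", ("Twain, Mark", "Clemens, Samuel")),
    AuthorityRecord("DNB", "118529579",
                    ("Twain, Mark", "Clemens, Samuel Langhorne")),
])
print(sorted(cluster.all_headings()))
```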

4.4 Identifier Format

Format: Numeric ID (variable length)
Example: 102333412

URI: https://viaf.org/viaf/102333412

4.5 Matching Algorithm

VIAF clusters records by combining several matching signals, with human review as a fallback:

  1. Name normalization: Standardize name forms
  2. Date matching: Birth/death dates when available
  3. Work matching: Shared bibliographic works
  4. Manual review: Disputed clusters resolved by humans
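
The four steps above can be sketched as a simple decision function. This is not VIAF's actual algorithm, whose rules and thresholds are not described here; the names `normalize_name`, `Candidate`, and `match_decision` and the decision rules are assumptions chosen to show how the signals might combine.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in stripped.lower())
    return " ".join(sorted(cleaned.split()))


@dataclass
class Candidate:
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    works: frozenset = frozenset()


def match_decision(a: Candidate, b: Candidate) -> str:
    """Return 'match', 'no-match', or 'review' (illustrative rules)."""
    if normalize_name(a.name) != normalize_name(b.name):
        return "no-match"
    # Conflicting known dates veto the match.
    for x, y in ((a.birth_year, b.birth_year), (a.death_year, b.death_year)):
        if x is not None and y is not None and x != y:
            return "no-match"
    dates_agree = a.birth_year is not None and a.birth_year == b.birth_year
    shares_work = bool(a.works & b.works)
    if dates_agree or shares_work:
        return "match"
    # Same name but no corroborating evidence: leave it to a human.
    return "review"
```

Sorting the tokens makes "Twain, Mark" and "Mark Twain" compare equal, at the cost of conflating names that differ only in word order.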

4.6 Strengths

  • Comprehensive: 40+ national libraries contributing
  • Algorithmic matching: Automatic clustering of variant names
  • Linked data: RDF with rich relationships
  • Free access: Open data, no registration fees
  • Historical coverage: Excellent for historical figures

4.7 Limitations for Heritage Domain

| Limitation | Impact |
| --- | --- |
| Library focus | Primarily bibliographic authority control |
| Passive creation | Cannot request VIAF for new person |
| Work-centric | Expects persons to have authored works |
| No provenance model | Limited tracking of source assertions |
| Cluster instability | Records can be split/merged over time |
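
Cluster instability matters to any system that stores VIAF IDs. One way a consumer might cope with merges is to keep a local table of superseded IDs and follow it before use. The sketch below assumes such a table exists; the `redirects` mapping and the function name are hypothetical, the IDs in the example are placeholders, and no VIAF API is being described.

```python
def resolve_cluster(viaf_id: str, redirects: dict[str, str]) -> str:
    """Follow merge redirects until a current cluster ID is reached."""
    seen = {viaf_id}
    current = viaf_id
    while current in redirects:
        current = redirects[current]
        if current in seen:
            # A loop means the redirect table itself is corrupt.
            raise ValueError(f"redirect cycle at {current}")
        seen.add(current)
    return current


# Placeholder IDs: cluster 111 was merged into 222, which was merged into 333.
redirects = {"111": "222", "222": "333"}
print(resolve_cluster("111", redirects))  # 333
```

Splits are harder: a single old ID may correspond to several new clusters, so they generally need review rather than automatic redirection.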

5. Comparative Analysis

5.1 Feature Matrix

| Feature | ORCID | ISNI | VIAF | PPID (Proposed) |
| --- | --- | --- | --- | --- |
| Format | 16-digit | 16-digit | Numeric | 16-digit |
| Checksum | MOD 11-2 | MOD 11-2 | None | MOD 11-2 |
| Living persons | Yes | Yes | Yes | Yes |
| Historical persons | No | Yes | Yes | Yes |
| Self-registration | Yes | No | No | Hybrid |
| Free registration | Yes | No | N/A | Yes |
| Observation/reconstruction | No | No | Partial | Yes |
| Provenance tracking | Limited | Limited | Limited | Full |
| Cultural name support | Limited | Limited | Good | Comprehensive |
| Heritage sector focus | No | No | Partial | Yes |

5.2 Identifier Assignment Models

┌─────────────────────────────────────────────────────────────────┐
│                    SELF-SERVICE (ORCID)                         │
│                                                                  │
│  Person → Registers → Gets ID immediately → Manages own record  │
│                                                                  │
│  Pros: Fast, empowering, scalable                               │
│  Cons: Quality control, no historical persons, spam risk        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    MEDIATED (ISNI)                              │
│                                                                  │
│  Institution → Submits → Agency reviews → ID assigned           │
│                                                                  │
│  Pros: Quality control, historical persons, authority           │
│  Cons: Slow, costly, dependency on agencies                     │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    ALGORITHMIC (VIAF)                           │
│                                                                  │
│  Library catalogs → Matching algorithm → Cluster created        │
│                                                                  │
│  Pros: Automatic, comprehensive, existing data                  │
│  Cons: No new persons, cluster instability, opaque              │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    HYBRID (PPID PROPOSED)                       │
│                                                                  │
│  Source observation (POID) → Created automatically              │
│  Person reconstruction (PRID) → Curated with provenance         │
│                                                                  │
│  Pros: Best of all models, full provenance, heritage focus      │
│  Cons: Complexity, requires clear governance                    │
└─────────────────────────────────────────────────────────────────┘

6. Interoperability Considerations

6.1 Linking Between Systems

All three systems can be cross-linked, either directly or through a hub such as Wikidata (prefix declarations omitted):

# VIAF links to external identifiers
<http://viaf.org/viaf/102333412>
    owl:sameAs <http://id.loc.gov/authorities/names/n79021164> ;
    owl:sameAs <http://d-nb.info/gnd/118529579> ;
    schema:sameAs <http://www.wikidata.org/entity/Q7245> .

# ORCID links via Wikidata
<http://www.wikidata.org/entity/Q7245>
    wdt:P496 "0000-0002-1825-0097"^^xsd:string .  # ORCID

6.2 PPID Interoperability Design

PPID should support bidirectional linking:

# PPID links to external systems
<ppid:PRID-1234-5678-90ab-cdef>
    owl:sameAs <https://orcid.org/0000-0002-1825-0097> ;
    owl:sameAs <https://isni.org/isni/0000000121032683> ;
    owl:sameAs <http://viaf.org/viaf/102333412> ;
    owl:sameAs <http://www.wikidata.org/entity/Q7245> ;
    skos:exactMatch <http://id.loc.gov/authorities/names/n79021164> .
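
A record's outbound links could be serialized with a few lines of code. The sketch below uses plain string formatting rather than an RDF library, and the function name `links_to_turtle` is hypothetical. It assumes the URIs are already valid and contain no characters that need escaping in Turtle.

```python
PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
)


def links_to_turtle(subject: str, same_as: list[str],
                    exact_match: list[str]) -> str:
    """Serialize owl:sameAs and skos:exactMatch links for one subject."""
    pairs = [("owl:sameAs", uri) for uri in same_as]
    pairs += [("skos:exactMatch", uri) for uri in exact_match]
    if not pairs:
        raise ValueError("at least one link is required")
    # Predicate-object pairs share the subject and are separated by ';'.
    body = " ;\n".join(f"    {pred} <{uri}>" for pred, uri in pairs)
    return f"{PREFIXES}\n<{subject}>\n{body} .\n"


print(links_to_turtle(
    "ppid:PRID-1234-5678-90ab-cdef",
    ["https://orcid.org/0000-0002-1825-0097",
     "http://viaf.org/viaf/102333412"],
    ["http://id.loc.gov/authorities/names/n79021164"],
))
```

For production use, an RDF library would handle escaping and validation that this sketch leaves out.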

7. Lessons for PPID Design

7.1 What to Adopt

| From | Lesson | Implementation |
| --- | --- | --- |
| ORCID | Opaque 16-digit format | Use same structure for recognizability |
| ORCID | MOD 11-2 checksum | Implement for validation |
| ORCID | URI-based identifiers | https://ppid.org/xxxx-xxxx-xxxx-xxxx |
| ISNI | Historical person support | No restriction to living persons |
| VIAF | Algorithmic matching | Support automatic clustering |
| VIAF | Multiple name forms | Store all variant names |

7.2 What to Avoid

| System | Pitfall | PPID Approach |
| --- | --- | --- |
| ORCID | Self-registration only | Hybrid: institutional + algorithmic |
| ISNI | Costly registration | Free for heritage sector |
| VIAF | Passive creation only | Active creation supported |
| All | No observation/reconstruction distinction | PiCo-based two-level model |
| All | Limited provenance | Full claim tracking |

7.3 Novel PPID Features

Features not found in existing systems:

  1. Observation-level identifiers (POID): Track raw source data separately
  2. Reconstruction-level identifiers (PRID): Curated person records with provenance
  3. Claim-based assertions: Every fact traceable to source
  4. Confidence scoring: Quantified certainty for assertions
  5. Heritage sector focus: Designed for archivists, curators, donors
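
The first four features can be pictured as a small data model. The sketch below is one possible shape, not the PPID specification: every class, field, and identifier value is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Claim:
    predicate: str      # e.g. "birth_year"
    value: str
    source_poid: str    # observation (POID) the claim was derived from
    confidence: float   # quantified certainty, 0.0 to 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


@dataclass
class Reconstruction:
    prid: str
    claims: list[Claim] = field(default_factory=list)

    def best(self, predicate: str) -> Optional[Claim]:
        """Highest-confidence claim for a predicate, or None."""
        candidates = [c for c in self.claims if c.predicate == predicate]
        return max(candidates, key=lambda c: c.confidence, default=None)

    def sources(self) -> set[str]:
        """Every observation that contributes at least one claim."""
        return {c.source_poid for c in self.claims}


person = Reconstruction("PRID-example", [
    Claim("birth_year", "1835", "POID-a", 0.9),
    Claim("birth_year", "1836", "POID-b", 0.4),
])
print(person.best("birth_year").value)  # 1835
```

Conflicting claims are kept side by side rather than overwritten, so a curator can see why one value was preferred and which source supports it.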

8. References

Standards

  • ISO 27729:2012 - International Standard Name Identifier (ISNI)
  • ISO/IEC 7064:2003 - Check character systems

Technical Documentation

Research

  • Haak, L.L., et al. (2012). "ORCID: A system to uniquely identify researchers." Learned Publishing, 25(4), 259-264.
  • Hickey, T.B., & Toves, J.A. (2014). "VIAF: Linking the World's Library Data." Cataloging & Classification Quarterly, 52(2), 155-166.