Implementation Guidelines
Version: 0.1.0
Last Updated: 2025-01-09
Related: Identifier Structure | Claims and Provenance
1. Overview
This document provides comprehensive technical specifications for implementing the PPID system:
- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability
2. Architecture Overview
2.1 System Components
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 PPID System                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐            │
│   │  Ingestion   │──────▶│  Processing  │──────▶│   Storage    │            │
│   │    Layer     │       │    Layer     │       │    Layer     │            │
│   └──────────────┘       └──────────────┘       └──────────────┘            │
│          │                      │                      │                    │
│          ▼                      ▼                      ▼                    │
│   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐            │
│   │ Web Scrapers │       │ Entity Res.  │       │  PostgreSQL  │            │
│   │ API Clients  │       │   NER/NLP    │       │ (Relational) │            │
│   │ File Import  │       │  Validation  │       ├──────────────┤            │
│   └──────────────┘       └──────────────┘       │ Apache Jena  │            │
│                                                 │ (RDF/SPARQL) │            │
│                                                 ├──────────────┤            │
│                                                 │    Redis     │            │
│                                                 │   (Cache)    │            │
│                                                 └──────────────┘            │
│                                      │                                      │
│                  ┌───────────────────┴───────────────────┐                  │
│                  ▼                                       ▼                  │
│           ┌──────────────┐                          ┌──────────┐            │
│           │   REST API   │                          │ GraphQL  │            │
│           │  (FastAPI)   │                          │ (Ariadne)│            │
│           └──────────────┘                          └──────────┘            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
2.2 Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI (Python 3.11+) | REST API, async support |
| GraphQL | Ariadne | GraphQL endpoint |
| Relational DB | PostgreSQL 16 | Primary data store |
| Triple Store | Apache Jena Fuseki | RDF/SPARQL queries |
| Cache | Redis 7 | Session, rate limiting, caching |
| Queue | Apache Kafka | Async processing pipeline |
| Search | Elasticsearch 8 | Full-text search |
| Object Storage | MinIO / S3 | HTML archives, files |
| Container | Docker + Kubernetes | Deployment |
| Monitoring | Prometheus + Grafana | Metrics and alerting |
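For local development, the stack above can be stood up with a compose file. A minimal sketch (image tags, the community Fuseki image, and the port mappings are assumptions, not pinned choices of this document; Kafka, Elasticsearch, and MinIO are omitted for brevity):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ppid
      POSTGRES_PASSWORD: changeme    # dev only, never in production
    ports: ["5432:5432"]
  fuseki:
    image: stain/jena-fuseki         # community image, assumption
    environment:
      ADMIN_PASSWORD: changeme
    ports: ["3030:3030"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
  api:
    build: .
    depends_on: [postgres, fuseki, redis]
    ports: ["8000:8000"]
```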
3. Database Schema
3.1 PostgreSQL Schema
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm"; -- For fuzzy matching
-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
'official_registry',
'institutional_website',
'professional_network',
'social_media',
'news_article',
'academic_publication',
'user_submitted',
'inferred'
);
-- Person Observations (POID)
CREATE TABLE person_observations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
poid VARCHAR(24) UNIQUE NOT NULL, -- POID-xxxx-xxxx-xxxx-xxxx
-- Source metadata
source_url TEXT NOT NULL,
source_type source_type NOT NULL,
retrieved_at TIMESTAMPTZ NOT NULL,
content_hash VARCHAR(64) NOT NULL, -- SHA-256
html_archive_path TEXT,
-- Extracted name components (PNV-compatible)
literal_name TEXT,
given_name TEXT,
surname TEXT,
surname_prefix TEXT, -- van, de, etc.
patronymic TEXT,
generation_suffix TEXT, -- Jr., III, etc.
-- Metadata
extraction_agent VARCHAR(100),
extraction_confidence DECIMAL(3,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
-- Indexes
CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);
-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid VARCHAR(24) UNIQUE NOT NULL, -- PRID-xxxx-xxxx-xxxx-xxxx
-- Canonical name (resolved from observations)
canonical_name TEXT NOT NULL,
given_name TEXT,
surname TEXT,
surname_prefix TEXT,
    -- Curation metadata (assumes a separate users table, not shown here)
    curator_id UUID REFERENCES users(id),
curation_method VARCHAR(50), -- 'manual', 'algorithmic', 'hybrid'
confidence_score DECIMAL(3,2),
-- Versioning
version INTEGER DEFAULT 1,
previous_version_id UUID REFERENCES person_reconstructions(id),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);
-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
linked_at TIMESTAMPTZ DEFAULT NOW(),
linked_by UUID REFERENCES users(id),
link_confidence DECIMAL(3,2),
PRIMARY KEY (prid_id, poid_id)
);
-- Claims (assertions with provenance)
CREATE TABLE claims (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
claim_id VARCHAR(50) UNIQUE NOT NULL,
-- Subject (what this claim is about)
poid_id UUID REFERENCES person_observations(id),
-- Claim content
claim_type VARCHAR(50) NOT NULL, -- 'job_title', 'employer', 'email', etc.
claim_value TEXT NOT NULL,
-- Provenance (MANDATORY per Rule 6)
source_url TEXT NOT NULL,
retrieved_on TIMESTAMPTZ NOT NULL,
xpath TEXT NOT NULL,
html_file TEXT NOT NULL,
xpath_match_score DECIMAL(3,2) NOT NULL,
content_hash VARCHAR(64),
-- Quality
confidence DECIMAL(3,2),
extraction_agent VARCHAR(100),
status claim_status DEFAULT 'active',
-- Relationships
supersedes_id UUID REFERENCES claims(id),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
verified_at TIMESTAMPTZ,
    -- Constraints
    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);
-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
relationship_type VARCHAR(20) NOT NULL, -- 'supports', 'conflicts_with'
notes TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);
-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
identifier_scheme VARCHAR(20) NOT NULL, -- 'orcid', 'isni', 'viaf', 'wikidata'
identifier_value VARCHAR(100) NOT NULL,
verified BOOLEAN DEFAULT FALSE,
verified_at TIMESTAMPTZ,
source_url TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (prid_id, identifier_scheme, identifier_value)
);
-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
ghcid VARCHAR(50) NOT NULL, -- e.g., NL-NH-HAA-A-NHA
-- Affiliation details
role_title TEXT,
department TEXT,
affiliation_start DATE,
affiliation_end DATE,
is_current BOOLEAN DEFAULT TRUE,
-- Provenance
source_poid_id UUID REFERENCES person_observations(id),
confidence DECIMAL(3,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
USING gin (literal_name gin_trgm_ops);
CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
USING gin (canonical_name gin_trgm_ops);
CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);
CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
WHERE is_current = TRUE;
-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
USING gin (to_tsvector('english', canonical_name));
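The POID/PRID CHECK constraints above can be mirrored in application code before a round trip to the database. A minimal sketch (format check only; checksum semantics live in `ppid.identifiers` and are not reproduced here):

```python
import re

# Mirrors the valid_poid / valid_prid CHECK constraints: four lowercase-hex
# groups, where the final character may also be the literal check character
# "x" or "X".
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
)

def check_ppid_format(value: str) -> bool:
    """Return True if the string is a well-formed POID or PRID."""
    return PPID_PATTERN.match(value) is not None

print(check_ppid_format("POID-7a3b-c4d5-e6f7-890X"))  # True
print(check_ppid_format("PRID-1234-5678-90ab-cde5"))  # True
print(check_ppid_format("POID-8c4d-e5f6-g7h8-901Y"))  # False: "g" and "Y" are not hex
```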
3.2 RDF Triple Store Schema
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
@prefix picom: <https://personsincontext.org/model#> .
@prefix pnv: <https://w3id.org/pnv#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix schema: <http://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X a ppidt:PersonObservation, picom:PersonObservation ;
ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;
# Name (PNV structured)
pnv:hasName [
a pnv:PersonName ;
pnv:literalName "Jan van den Berg" ;
pnv:givenName "Jan" ;
pnv:surnamePrefix "van den" ;
pnv:baseSurname "Berg"
] ;
# Provenance
prov:wasDerivedFrom <https://linkedin.com/in/jan-van-den-berg> ;
prov:wasGeneratedBy ppid:extraction-activity-001 ;
prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;
# Claims
ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .
# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5 a ppidt:PersonReconstruction, picom:PersonReconstruction ;
ppidv:prid "PRID-1234-5678-90ab-cde5" ;
# Canonical name
schema:name "Jan van den Berg" ;
pnv:hasName [
a pnv:PersonName ;
pnv:literalName "Jan van den Berg" ;
pnv:givenName "Jan" ;
pnv:surnamePrefix "van den" ;
pnv:baseSurname "Berg"
] ;
# Derived from observations
prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
    ppid:POID-8c4d-e5f6-a7b8-9012 ;
# GHCID affiliation
ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;
# External identifiers
ppidv:orcid "0000-0002-1234-5678" ;
owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
4. API Design
4.1 REST API Endpoints
openapi: 3.1.0
info:
title: PPID API
version: 1.0.0
description: Person Persistent Identifier API
servers:
- url: https://api.ppid.org/v1
paths:
# Person Observations
/observations:
post:
summary: Create new person observation
operationId: createObservation
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateObservationRequest'
responses:
'201':
description: Observation created
content:
application/json:
schema:
$ref: '#/components/schemas/PersonObservation'
get:
summary: Search observations
operationId: searchObservations
parameters:
- name: name
in: query
schema:
type: string
- name: source_url
in: query
schema:
type: string
- name: limit
in: query
schema:
type: integer
default: 20
- name: offset
in: query
schema:
type: integer
default: 0
responses:
'200':
description: List of observations
content:
application/json:
schema:
$ref: '#/components/schemas/ObservationList'
/observations/{poid}:
get:
summary: Get observation by POID
operationId: getObservation
parameters:
- name: poid
in: path
required: true
schema:
type: string
pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
responses:
'200':
description: Person observation
content:
application/json:
schema:
$ref: '#/components/schemas/PersonObservation'
text/turtle:
schema:
type: string
application/ld+json:
schema:
type: object
/observations/{poid}/claims:
get:
summary: Get claims for observation
operationId: getObservationClaims
parameters:
- name: poid
in: path
required: true
schema:
type: string
responses:
'200':
description: List of claims
content:
application/json:
schema:
$ref: '#/components/schemas/ClaimList'
# Person Reconstructions
/reconstructions:
post:
summary: Create person reconstruction from observations
operationId: createReconstruction
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateReconstructionRequest'
responses:
'201':
description: Reconstruction created
content:
application/json:
schema:
$ref: '#/components/schemas/PersonReconstruction'
/reconstructions/{prid}:
get:
summary: Get reconstruction by PRID
operationId: getReconstruction
parameters:
- name: prid
in: path
required: true
schema:
type: string
responses:
'200':
description: Person reconstruction
content:
application/json:
schema:
$ref: '#/components/schemas/PersonReconstruction'
  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations
  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history
# Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
content:
application/json:
schema:
type: object
properties:
valid:
type: boolean
type:
type: string
enum: [POID, PRID]
exists:
type: boolean
# Search
/search:
get:
summary: Full-text search across all records
operationId: search
parameters:
- name: q
in: query
required: true
schema:
type: string
- name: type
in: query
schema:
type: string
enum: [observation, reconstruction, all]
default: all
responses:
'200':
description: Search results
# Entity Resolution
/resolve:
post:
summary: Find matching records for input data
operationId: resolveEntity
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/EntityResolutionRequest'
responses:
'200':
description: Resolution candidates
content:
application/json:
schema:
$ref: '#/components/schemas/ResolutionCandidates'
components:
schemas:
CreateObservationRequest:
type: object
required:
- source_url
- retrieved_at
- claims
properties:
source_url:
type: string
format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network,
                 social_media, news_article, academic_publication, user_submitted,
                 inferred]
retrieved_at:
type: string
format: date-time
content_hash:
type: string
html_archive_path:
type: string
extraction_agent:
type: string
claims:
type: array
items:
$ref: '#/components/schemas/ClaimInput'
ClaimInput:
type: object
required:
- claim_type
- claim_value
- xpath
- xpath_match_score
properties:
claim_type:
type: string
claim_value:
type: string
xpath:
type: string
xpath_match_score:
type: number
minimum: 0
maximum: 1
confidence:
type: number
PersonObservation:
type: object
properties:
poid:
type: string
source_url:
type: string
literal_name:
type: string
claims:
type: array
items:
$ref: '#/components/schemas/Claim'
created_at:
type: string
format: date-time
CreateReconstructionRequest:
type: object
required:
- observation_ids
properties:
observation_ids:
type: array
items:
type: string
minItems: 1
canonical_name:
type: string
external_identifiers:
type: object
PersonReconstruction:
type: object
properties:
prid:
type: string
canonical_name:
type: string
observations:
type: array
items:
type: string
external_identifiers:
type: object
ghcid_affiliations:
type: array
items:
$ref: '#/components/schemas/GhcidAffiliation'
GhcidAffiliation:
type: object
properties:
ghcid:
type: string
role_title:
type: string
is_current:
type: boolean
securitySchemes:
bearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
apiKey:
type: apiKey
in: header
name: X-API-Key
security:
- bearerAuth: []
- apiKey: []
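An example CreateObservationRequest body satisfying the schema above (the URL, XPath, and hash values are illustrative):

```json
{
  "source_url": "https://example.org/staff/jan-van-den-berg",
  "source_type": "institutional_website",
  "retrieved_at": "2025-01-09T14:30:00Z",
  "content_hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Jan van den Berg",
      "xpath": "//h1[@class='staff-name']",
      "xpath_match_score": 1.0,
      "confidence": 0.95
    }
  ]
}
```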
4.2 GraphQL Schema
type Query {
# Observations
observation(poid: ID!): PersonObservation
observations(
name: String
sourceUrl: String
limit: Int = 20
offset: Int = 0
): ObservationConnection!
# Reconstructions
reconstruction(prid: ID!): PersonReconstruction
reconstructions(
name: String
ghcid: String
limit: Int = 20
offset: Int = 0
): ReconstructionConnection!
# Search
search(query: String!, type: SearchType = ALL): SearchResults!
# Validation
validatePpid(ppid: String!): ValidationResult!
# Entity Resolution
resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}
type Mutation {
# Create observation
createObservation(input: CreateObservationInput!): PersonObservation!
# Create reconstruction
createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!
# Link observation to reconstruction
linkObservation(prid: ID!, poid: ID!): PersonReconstruction!
# Update reconstruction
updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!
# Add external identifier
addExternalIdentifier(
prid: ID!
scheme: IdentifierScheme!
value: String!
): PersonReconstruction!
# Add GHCID affiliation
addGhcidAffiliation(
prid: ID!
ghcid: String!
roleTitle: String
isCurrent: Boolean = true
): PersonReconstruction!
}
type PersonObservation {
poid: ID!
sourceUrl: String!
sourceType: SourceType!
retrievedAt: DateTime!
contentHash: String
htmlArchivePath: String
# Name components
literalName: String
givenName: String
surname: String
surnamePrefix: String
# Related data
claims: [Claim!]!
linkedReconstructions: [PersonReconstruction!]!
# Metadata
extractionAgent: String
extractionConfidence: Float
createdAt: DateTime!
}
type PersonReconstruction {
prid: ID!
canonicalName: String!
givenName: String
surname: String
surnamePrefix: String
# Linked observations
observations: [PersonObservation!]!
# External identifiers
orcid: String
isni: String
viaf: String
wikidata: String
externalIdentifiers: [ExternalIdentifier!]!
# GHCID affiliations
ghcidAffiliations: [GhcidAffiliation!]!
currentAffiliations: [GhcidAffiliation!]!
# Versioning
version: Int!
previousVersion: PersonReconstruction
history: [PersonReconstruction!]!
# Curation
curator: User
curationMethod: CurationMethod
confidenceScore: Float
createdAt: DateTime!
updatedAt: DateTime!
}
type Claim {
id: ID!
claimType: ClaimType!
claimValue: String!
# Provenance (MANDATORY)
sourceUrl: String!
retrievedOn: DateTime!
xpath: String!
htmlFile: String!
xpathMatchScore: Float!
# Quality
confidence: Float
extractionAgent: String
status: ClaimStatus!
# Relationships
supports: [Claim!]!
conflictsWith: [Claim!]!
supersedes: Claim
createdAt: DateTime!
}
type GhcidAffiliation {
ghcid: String!
institution: HeritageCustodian # Resolved from GHCID
roleTitle: String
department: String
startDate: Date
endDate: Date
isCurrent: Boolean!
confidence: Float
}
type HeritageCustodian {
ghcid: String!
name: String!
institutionType: String!
city: String
country: String
}
type ExternalIdentifier {
scheme: IdentifierScheme!
value: String!
verified: Boolean!
verifiedAt: DateTime
}
enum SourceType {
OFFICIAL_REGISTRY
INSTITUTIONAL_WEBSITE
PROFESSIONAL_NETWORK
SOCIAL_MEDIA
NEWS_ARTICLE
ACADEMIC_PUBLICATION
USER_SUBMITTED
INFERRED
}
enum ClaimType {
FULL_NAME
GIVEN_NAME
FAMILY_NAME
JOB_TITLE
EMPLOYER
EMPLOYER_GHCID
EMAIL
LINKEDIN_URL
ORCID
BIRTH_DATE
EDUCATION
}
enum ClaimStatus {
ACTIVE
SUPERSEDED
RETRACTED
}
enum IdentifierScheme {
ORCID
ISNI
VIAF
WIKIDATA
LOC_NAF
}
enum CurationMethod {
MANUAL
ALGORITHMIC
HYBRID
}
enum SearchType {
OBSERVATION
RECONSTRUCTION
ALL
}
input CreateObservationInput {
sourceUrl: String!
sourceType: SourceType!
retrievedAt: DateTime!
contentHash: String
htmlArchivePath: String
extractionAgent: String
claims: [ClaimInput!]!
}
input ClaimInput {
claimType: ClaimType!
claimValue: String!
xpath: String!
xpathMatchScore: Float!
confidence: Float
}
input CreateReconstructionInput {
observationIds: [ID!]!
canonicalName: String
externalIdentifiers: ExternalIdentifiersInput
}
input EntityResolutionInput {
name: String!
employer: String
jobTitle: String
email: String
linkedinUrl: String
}
type ResolutionCandidate {
reconstruction: PersonReconstruction
observation: PersonObservation
matchScore: Float!
matchFactors: [MatchFactor!]!
}
type MatchFactor {
field: String!
score: Float!
method: String!
}
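A typical query against this schema, fetching a reconstruction together with its linked observations and their provenance-bearing claims:

```graphql
query GetPerson($prid: ID!) {
  reconstruction(prid: $prid) {
    prid
    canonicalName
    version
    observations {
      poid
      sourceUrl
      claims {
        claimType
        claimValue
        xpath
        xpathMatchScore
      }
    }
    ghcidAffiliations {
      ghcid
      roleTitle
      isCurrent
    }
  }
}
```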
4.3 FastAPI Implementation
from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
app = FastAPI(
title="PPID API",
description="Person Persistent Identifier API",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# --- Pydantic Models ---
class ClaimInput(BaseModel):
claim_type: str
claim_value: str
xpath: str
xpath_match_score: float = Field(ge=0, le=1)
confidence: Optional[float] = Field(None, ge=0, le=1)
class CreateObservationRequest(BaseModel):
source_url: str
source_type: str = "institutional_website"
retrieved_at: datetime
content_hash: Optional[str] = None
html_archive_path: Optional[str] = None
extraction_agent: Optional[str] = None
claims: list[ClaimInput]
class PersonObservationResponse(BaseModel):
poid: str
source_url: str
source_type: str
retrieved_at: datetime
literal_name: Optional[str] = None
claims: list[dict]
created_at: datetime
class CreateReconstructionRequest(BaseModel):
observation_ids: list[str]
canonical_name: Optional[str] = None
external_identifiers: Optional[dict] = None
class ValidationResult(BaseModel):
valid: bool
ppid_type: Optional[str] = None
exists: Optional[bool] = None
error: Optional[str] = None
# --- Dependencies ---
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

async def get_db():
    """Database connection dependency."""
    # In production, use a connection pool
    pass

async def get_current_user(token: str = Depends(oauth2_scheme)):
    """Authenticate user from API key or JWT."""
    pass
# --- Endpoints ---
@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
request: CreateObservationRequest,
db = Depends(get_db)
):
"""
Create a new Person Observation from extracted data.
The POID is generated deterministically from source metadata.
"""
from ppid.identifiers import generate_poid
# Generate deterministic POID
poid = generate_poid(
source_url=request.source_url,
retrieval_timestamp=request.retrieved_at.isoformat(),
content_hash=request.content_hash or ""
)
# Check for existing observation with same POID
existing = await db.get_observation(poid)
if existing:
return existing
# Extract name from claims
literal_name = None
for claim in request.claims:
if claim.claim_type == "full_name":
literal_name = claim.claim_value
break
# Create observation record
observation = await db.create_observation(
poid=poid,
source_url=request.source_url,
source_type=request.source_type,
retrieved_at=request.retrieved_at,
content_hash=request.content_hash,
html_archive_path=request.html_archive_path,
literal_name=literal_name,
extraction_agent=request.extraction_agent,
        claims=[c.model_dump() for c in request.claims]
)
return observation
@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db = Depends(get_db)):
"""Get Person Observation by POID."""
from ppid.identifiers import validate_ppid
is_valid, error = validate_ppid(poid)
if not is_valid:
raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")
observation = await db.get_observation(poid)
if not observation:
raise HTTPException(status_code=404, detail="Observation not found")
return observation
@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db = Depends(get_db)):
"""Validate PPID format, checksum, and existence."""
from ppid.identifiers import validate_ppid_full
is_valid, error = validate_ppid_full(ppid)
if not is_valid:
return ValidationResult(valid=False, error=error)
ppid_type = "POID" if ppid.startswith("POID") else "PRID"
# Check existence
if ppid_type == "POID":
exists = await db.observation_exists(ppid)
else:
exists = await db.reconstruction_exists(ppid)
return ValidationResult(
valid=True,
ppid_type=ppid_type,
exists=exists
)
@app.post("/api/v1/reconstructions")
async def create_reconstruction(
request: CreateReconstructionRequest,
db = Depends(get_db),
user = Depends(get_current_user)
):
"""Create Person Reconstruction from linked observations."""
from ppid.identifiers import generate_prid
# Validate all POIDs exist
for poid in request.observation_ids:
if not await db.observation_exists(poid):
raise HTTPException(
status_code=400,
detail=f"Observation not found: {poid}"
)
# Generate deterministic PRID
prid = generate_prid(
observation_ids=request.observation_ids,
curator_id=str(user.id),
timestamp=datetime.utcnow().isoformat()
)
# Determine canonical name
if request.canonical_name:
canonical_name = request.canonical_name
else:
# Use name from highest-confidence observation
observations = await db.get_observations(request.observation_ids)
canonical_name = max(
observations,
key=lambda o: o.extraction_confidence or 0
).literal_name
# Create reconstruction
reconstruction = await db.create_reconstruction(
prid=prid,
canonical_name=canonical_name,
observation_ids=request.observation_ids,
curator_id=user.id,
external_identifiers=request.external_identifiers
)
return reconstruction
@app.get("/api/v1/search")
async def search(
q: str = Query(..., min_length=2),
    type: str = Query("all", pattern="^(observation|reconstruction|all)$"),
limit: int = Query(20, ge=1, le=100),
offset: int = Query(0, ge=0),
db = Depends(get_db)
):
"""Full-text search across observations and reconstructions."""
results = await db.search(
query=q,
search_type=type,
limit=limit,
offset=offset
)
return results
@app.post("/api/v1/resolve")
async def resolve_entity(
name: str,
employer: Optional[str] = None,
job_title: Optional[str] = None,
email: Optional[str] = None,
db = Depends(get_db)
):
"""
Entity resolution: find matching records for input data.
Returns ranked candidates with match scores.
"""
from ppid.entity_resolution import find_candidates
candidates = await find_candidates(
db=db,
name=name,
employer=employer,
job_title=job_title,
email=email
)
return {
"candidates": candidates,
"query": {
"name": name,
"employer": employer,
"job_title": job_title,
"email": email
}
}
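The endpoints above import `generate_poid` from `ppid.identifiers` without showing it. A minimal sketch of a deterministic generator consistent with the POID pattern used throughout this document (the real module presumably also computes a proper check character; drawing the four groups straight from the digest is an assumption for illustration):

```python
import hashlib

def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """Derive a deterministic POID from source metadata.

    The same inputs always yield the same identifier, which is what makes
    ingestion idempotent (see the existence check in create_observation).
    """
    digest = hashlib.sha256(
        f"{source_url}|{retrieval_timestamp}|{content_hash}".encode()
    ).hexdigest()
    # Four 4-char hex groups drawn from the digest: POID-xxxx-xxxx-xxxx-xxxx
    groups = [digest[i:i + 4] for i in range(0, 16, 4)]
    return "POID-" + "-".join(groups)

poid = generate_poid(
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00+00:00",
    "9f86d081884c7d65",
)
# Deterministic: repeating the call with identical inputs yields the same POID
assert poid == generate_poid(
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00+00:00",
    "9f86d081884c7d65",
)
```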
5. Data Ingestion Pipeline
5.1 Pipeline Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                        DATA INGESTION PIPELINE                         │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │   Source   │───▶│  Extract   │───▶│ Transform  │───▶│    Load    │  │
│  │   Fetch    │    │   (NER)    │    │ (Validate) │    │  (Store)   │  │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘  │
│        │                 │                 │                 │         │
│        ▼                 ▼                 ▼                 ▼         │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │  Archive   │    │   Claims   │    │    POID    │    │  Postgres  │  │
│  │    HTML    │    │   XPath    │    │  Generate  │    │   + RDF    │  │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
5.2 Pipeline Implementation
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator
import hashlib
import asyncio
from kafka import KafkaProducer
@dataclass
class SourceDocument:
url: str
html_content: str
retrieved_at: datetime
content_hash: str
archive_path: str
@dataclass
class ExtractedClaim:
claim_type: str
claim_value: str
xpath: str
xpath_match_score: float
confidence: float
@dataclass
class PersonObservationData:
source: SourceDocument
claims: list[ExtractedClaim]
poid: str
class IngestionPipeline:
"""
Main data ingestion pipeline for PPID.
Stages:
1. Fetch: Retrieve web pages, archive HTML
2. Extract: NER/NLP to identify person data, generate claims with XPath
3. Transform: Validate, generate POID, structure data
4. Load: Store in PostgreSQL and RDF triple store
"""
def __init__(
self,
db_pool,
rdf_store,
kafka_producer: KafkaProducer,
archive_storage,
llm_extractor
):
self.db = db_pool
self.rdf = rdf_store
self.kafka = kafka_producer
self.archive = archive_storage
self.extractor = llm_extractor
async def process_url(self, url: str) -> list[PersonObservationData]:
"""
Full pipeline for a single URL.
Returns list of PersonObservations extracted from the page.
"""
# Stage 1: Fetch and archive
source = await self._fetch_and_archive(url)
# Stage 2: Extract claims with XPath
observations = await self._extract_observations(source)
        # Stage 3: Transform and validate
        validated = await self._transform_and_validate(observations, source)
# Stage 4: Load to databases
await self._load_observations(validated)
return validated
async def _fetch_and_archive(self, url: str) -> SourceDocument:
"""Fetch URL and archive HTML."""
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch()
page = await browser.new_page()
await page.goto(url, wait_until='networkidle')
html_content = await page.content()
await browser.close()
# Calculate content hash
content_hash = hashlib.sha256(html_content.encode()).hexdigest()
# Archive HTML
retrieved_at = datetime.utcnow()
archive_path = await self.archive.store(
url=url,
content=html_content,
timestamp=retrieved_at
)
return SourceDocument(
url=url,
html_content=html_content,
retrieved_at=retrieved_at,
content_hash=content_hash,
archive_path=archive_path
)
    async def _extract_observations(
        self,
        source: SourceDocument
    ) -> list[tuple[list[ExtractedClaim], str | None]]:
"""
Extract person observations with XPath provenance.
Uses LLM for extraction, then validates XPath.
"""
from lxml import html
# Parse HTML
tree = html.fromstring(source.html_content)
# Use LLM to extract person data with XPath
extraction_result = await self.extractor.extract_persons(
html_content=source.html_content,
source_url=source.url
)
observations = []
for person in extraction_result.persons:
validated_claims = []
for claim in person.claims:
# Verify XPath points to expected value
try:
elements = tree.xpath(claim.xpath)
if elements:
actual_value = elements[0].text_content().strip()
# Calculate match score
if actual_value == claim.claim_value:
match_score = 1.0
else:
from difflib import SequenceMatcher
match_score = SequenceMatcher(
None, actual_value, claim.claim_value
).ratio()
if match_score >= 0.8: # Accept if 80%+ match
validated_claims.append(ExtractedClaim(
claim_type=claim.claim_type,
claim_value=claim.claim_value,
xpath=claim.xpath,
xpath_match_score=match_score,
confidence=claim.confidence * match_score
))
except Exception as e:
# Skip claims with invalid XPath
continue
if validated_claims:
# Get literal name from claims
literal_name = next(
(c.claim_value for c in validated_claims if c.claim_type == 'full_name'),
None
)
observations.append((validated_claims, literal_name))
return observations
async def _transform_and_validate(
self,
observations: list[tuple[list[ExtractedClaim], str]],
source: SourceDocument
) -> list[PersonObservationData]:
"""Transform extracted data and generate POIDs."""
from ppid.identifiers import generate_poid
results = []
for claims, literal_name in observations:
# Generate deterministic POID
claims_hash = hashlib.sha256(
str(sorted([c.claim_value for c in claims])).encode()
).hexdigest()
poid = generate_poid(
source_url=source.url,
retrieval_timestamp=source.retrieved_at.isoformat(),
content_hash=f"{source.content_hash}:{claims_hash}"
)
results.append(PersonObservationData(
source=source,
claims=claims,
poid=poid
))
return results
async def _load_observations(
self,
observations: list[PersonObservationData]
) -> None:
"""Load observations to PostgreSQL and RDF store."""
for obs in observations:
# Check if already exists (idempotent)
existing = await self.db.get_observation(obs.poid)
if existing:
continue
# Insert to PostgreSQL
await self.db.create_observation(
poid=obs.poid,
source_url=obs.source.url,
source_type='institutional_website',
retrieved_at=obs.source.retrieved_at,
content_hash=obs.source.content_hash,
html_archive_path=obs.source.archive_path,
literal_name=next(
(c.claim_value for c in obs.claims if c.claim_type == 'full_name'),
None
),
claims=[{
'claim_type': c.claim_type,
'claim_value': c.claim_value,
'xpath': c.xpath,
'xpath_match_score': c.xpath_match_score,
'confidence': c.confidence
} for c in obs.claims]
)
# Insert to RDF triple store
await self._insert_rdf(obs)
# Publish to Kafka for downstream processing
self.kafka.send('ppid.observations.created', {
'poid': obs.poid,
'source_url': obs.source.url,
'timestamp': obs.source.retrieved_at.isoformat()
})
async def _insert_rdf(self, obs: PersonObservationData) -> None:
"""Insert observation as RDF triples."""
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD
PPID = Namespace("https://ppid.org/")
PPIDV = Namespace("https://ppid.org/vocab#")
PROV = Namespace("http://www.w3.org/ns/prov#")
g = Graph()
obs_uri = PPID[obs.poid]
g.add((obs_uri, RDF.type, PPIDV.PersonObservation))
g.add((obs_uri, PPIDV.poid, Literal(obs.poid)))
g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url)))
g.add((obs_uri, PROV.generatedAtTime, Literal(
obs.source.retrieved_at.isoformat(),
datatype=XSD.dateTime
)))
# Add claims
for i, claim in enumerate(obs.claims):
claim_uri = PPID[f"{obs.poid}/claim/{i}"]
g.add((obs_uri, PPIDV.hasClaim, claim_uri))
g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type)))
g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value)))
g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath)))
g.add((claim_uri, PPIDV.xpathMatchScore, Literal(
claim.xpath_match_score, datatype=XSD.decimal
)))
# Insert to triple store
await self.rdf.insert(g)
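The determinism of the claims hash above can be checked in isolation. The sketch below (stdlib only, with hypothetical claim values) shows that the same set of claim values yields the same digest regardless of extraction order, which is what makes POID generation idempotent across repeated ingestion runs:

```python
import hashlib

def claims_hash(claim_values: list[str]) -> str:
    """Order-independent digest over claim values, as in _transform_and_validate."""
    return hashlib.sha256(str(sorted(claim_values)).encode()).hexdigest()

# Hypothetical claims extracted in two different orders
a = claims_hash(["Jan van den Berg", "Senior Archivist", "j.vandenberg@example.org"])
b = claims_hash(["Senior Archivist", "j.vandenberg@example.org", "Jan van den Berg"])
assert a == b  # same observation content, same hash, same POID
```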
6. GHCID Integration
6.1 Linking Persons to Institutions
from dataclasses import dataclass
from typing import Optional
from datetime import date
@dataclass
class GhcidAffiliation:
"""
Link between a person (PRID) and a heritage institution (GHCID).
"""
ghcid: str # e.g., "NL-NH-HAA-A-NHA"
role_title: Optional[str] = None
department: Optional[str] = None
start_date: Optional[date] = None
end_date: Optional[date] = None
is_current: bool = True
source_poid: Optional[str] = None
confidence: float = 0.9
async def link_person_to_institution(
db,
prid: str,
ghcid: str,
    role_title: Optional[str] = None,
    source_poid: Optional[str] = None
) -> GhcidAffiliation:
"""
Create link between person and heritage institution.
Args:
prid: Person Reconstruction ID
ghcid: Global Heritage Custodian ID
role_title: Job title at institution
source_poid: Observation where affiliation was extracted
Returns:
Created affiliation record
"""
# Validate PRID exists
reconstruction = await db.get_reconstruction(prid)
if not reconstruction:
raise ValueError(f"Reconstruction not found: {prid}")
# Validate GHCID format
if not validate_ghcid_format(ghcid):
raise ValueError(f"Invalid GHCID format: {ghcid}")
# Create affiliation
affiliation = await db.create_ghcid_affiliation(
prid_id=reconstruction.id,
ghcid=ghcid,
role_title=role_title,
source_poid_id=source_poid,
is_current=True
)
return affiliation
def validate_ghcid_format(ghcid: str) -> bool:
"""Validate GHCID format."""
import re
# Pattern: CC-RR-SSS-T-ABBREV
pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
return bool(re.match(pattern, ghcid))
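As a quick sanity check, the pattern accepts the canonical five-segment `CC-RR-SSS-T-ABBREV` form and rejects near misses (the rejected strings are illustrative):

```python
import re

GHCID_PATTERN = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

def validate_ghcid_format(ghcid: str) -> bool:
    """Validate the CC-RR-SSS-T-ABBREV shape of a GHCID."""
    return bool(GHCID_PATTERN.match(ghcid))

assert validate_ghcid_format("NL-NH-HAA-A-NHA")      # canonical example
assert not validate_ghcid_format("NL-NH-HAA-NHA")    # missing type segment
assert not validate_ghcid_format("nl-nh-haa-a-nha")  # lowercase not allowed
```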
6.2 RDF Integration
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ghcid: <https://w3id.org/heritage/custodian/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
a ppidv:PersonReconstruction ;
schema:name "Jan van den Berg" ;
# Current employment
org:memberOf [
a org:Membership ;
org:organization ghcid:NL-NH-HAA-A-NHA ;
org:role [
a org:Role ;
schema:name "Senior Archivist"
] ;
ppidv:isCurrent true ;
schema:startDate "2015"^^xsd:gYear
] ;
# Direct link for simple queries
ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .
# The heritage institution (from GHCID system)
ghcid:NL-NH-HAA-A-NHA
a ppidv:HeritageCustodian ;
schema:name "Noord-Hollands Archief" ;
ppidv:ghcid "NL-NH-HAA-A-NHA" ;
schema:address [
schema:addressLocality "Haarlem" ;
schema:addressCountry "NL"
] .
6.3 SPARQL Queries
# Find all persons affiliated with a specific institution
PREFIX ppid: <https://ppid.org/>
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>
SELECT ?prid ?name ?role ?isCurrent
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
ppidv:isCurrent ?isCurrent .
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) ?name
# Find all institutions a person has worked at
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent ?isCurrent .
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
OPTIONAL { ?membership schema:startDate ?startDate }
OPTIONAL { ?membership schema:endDate ?endDate }
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) DESC(?startDate)
# Find archivists across all Dutch archives
SELECT ?prid ?name ?institution ?institutionName
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent true ;
org:role ?roleNode .
?roleNode schema:name ?role .
FILTER(CONTAINS(LCASE(?role), "archivist"))
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
7. Security and Access Control
7.1 Authentication
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, APIKeyHeader
from jose import JWTError, jwt
from passlib.context import CryptContext
from datetime import datetime, timedelta, timezone
from typing import Optional
import os

# Configuration
SECRET_KEY = os.environ["JWT_SECRET"]  # never hardcode the signing secret
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token", auto_error=False)
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

class User:
    def __init__(self, id: str, email: str, roles: list[str]):
        self.id = id
        self.email = email
        self.roles = roles

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -> str:
    """Create a signed JWT access token."""
    to_encode = data.copy()
    expire = datetime.now(timezone.utc) + (
        expires_delta or timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    )
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
async def get_current_user(
token: Optional[str] = Depends(oauth2_scheme),
api_key: Optional[str] = Depends(api_key_header),
db = Depends(get_db)
) -> User:
"""
Authenticate user via JWT token or API key.
"""
credentials_exception = HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Could not validate credentials",
headers={"WWW-Authenticate": "Bearer"},
)
# Try JWT token first
if token:
try:
payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
user_id: str = payload.get("sub")
if user_id is None:
raise credentials_exception
user = await db.get_user(user_id)
if user is None:
raise credentials_exception
return user
except JWTError:
pass
# Try API key
if api_key:
user = await db.get_user_by_api_key(api_key)
if user:
return user
raise credentials_exception
def require_role(required_roles: list[str]):
"""Dependency to require specific roles."""
async def role_checker(user: User = Depends(get_current_user)):
if not any(role in user.roles for role in required_roles):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="Insufficient permissions"
)
return user
return role_checker
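For intuition about what `jose.jwt.encode`/`decode` do under the hood, here is a dependency-free sketch of HS256 signing and verification. This is illustrative only: it omits `exp` handling and header validation, so a maintained JWT library should always be used in production.

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def hs256_encode(payload: dict, secret: str) -> str:
    """Build an HS256 JWT: base64url(header).base64url(payload).base64url(signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(sig)}"

def hs256_decode(token: str, secret: str) -> dict:
    """Verify the signature with a constant-time compare, then return the payload."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(body))

token = hs256_encode({"sub": "user-42"}, "dev-secret")
assert hs256_decode(token, "dev-secret")["sub"] == "user-42"
```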
7.2 Authorization Roles
| Role | Permissions |
|---|---|
| `reader` | Read observations, reconstructions, claims |
| `contributor` | Create observations, add claims |
| `curator` | Create reconstructions, link observations, resolve conflicts |
| `admin` | Manage users, API keys, system configuration |
| `api_client` | Programmatic access via API key |
7.3 Rate Limiting
from fastapi import Request
import redis
import time

class RateLimiter:
    """
    Sliding-window rate limiter using Redis sorted sets.

    Each request is stored as a timestamped sorted-set member; entries
    older than the window are pruned before counting.
    """
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def is_allowed(
        self,
        key: str,
        max_requests: int = 100,
        window_seconds: int = 60
    ) -> tuple[bool, dict]:
        """
        Check if request is allowed under rate limit.

        Returns:
            Tuple of (is_allowed, rate_limit_info)
        """
        now = time.time()
        window_start = now - window_seconds
        pipe = self.redis.pipeline()
        # Remove requests that have left the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count requests in window
        pipe.zcard(key)
        # Add current request
        pipe.zadd(key, {str(now): now})
        # Set expiry so idle keys are cleaned up
        pipe.expire(key, window_seconds)
        results = pipe.execute()
        request_count = results[1]
        is_allowed = request_count < max_requests
        return is_allowed, {
            "limit": max_requests,
            "remaining": max(0, max_requests - request_count - 1),
            "reset": int(now + window_seconds)
        }
# Rate limit tiers
RATE_LIMITS = {
"anonymous": {"requests": 60, "window": 60},
"reader": {"requests": 100, "window": 60},
"contributor": {"requests": 500, "window": 60},
"curator": {"requests": 1000, "window": 60},
"api_client": {"requests": 5000, "window": 60},
}
8. Performance Requirements
8.1 SLOs (Service Level Objectives)
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Monthly uptime |
| API Latency (p50) | < 50ms | Response time |
| API Latency (p99) | < 500ms | Response time |
| Search Latency | < 200ms | Full-text search |
| SPARQL Query | < 1s | Simple queries |
| Throughput | 1000 req/s | Sustained load |
8.2 Scaling Strategy
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ppid-api
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ppid-api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 100
8.3 Caching Strategy
import redis
from functools import wraps
import json
import hashlib
class CacheManager:
"""
Multi-tier caching for PPID.
Tiers:
1. L1: In-memory (per-instance)
2. L2: Redis (shared)
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.local_cache = {}
def cache_observation(self, ttl: int = 3600):
"""Cache observation lookups."""
def decorator(func):
@wraps(func)
async def wrapper(poid: str, *args, **kwargs):
cache_key = f"observation:{poid}"
# Check L1 cache
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Check L2 cache
cached = self.redis.get(cache_key)
if cached:
data = json.loads(cached)
self.local_cache[cache_key] = data
return data
# Fetch from database
result = await func(poid, *args, **kwargs)
if result:
# Store in both caches
self.redis.setex(cache_key, ttl, json.dumps(result))
self.local_cache[cache_key] = result
return result
return wrapper
return decorator
def cache_search(self, ttl: int = 300):
"""Cache search results (shorter TTL)."""
def decorator(func):
@wraps(func)
async def wrapper(query: str, *args, **kwargs):
# Create deterministic cache key from query params
key_data = {"query": query, "args": args, "kwargs": kwargs}
cache_key = f"search:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"
cached = self.redis.get(cache_key)
if cached:
return json.loads(cached)
result = await func(query, *args, **kwargs)
self.redis.setex(cache_key, ttl, json.dumps(result))
return result
return wrapper
return decorator
    def invalidate_observation(self, poid: str):
        """Invalidate cache when observation is updated.

        Note: this clears Redis and this instance's L1 cache only; other
        instances may briefly serve stale L1 entries. In production the
        plain dict should be replaced by a bounded TTL cache (e.g. an LRU
        with expiry) so stale entries age out.
        """
        cache_key = f"observation:{poid}"
        self.redis.delete(cache_key)
        self.local_cache.pop(cache_key, None)
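The search cache key must be deterministic across processes for the L2 cache to be shared. The derivation used by `cache_search` reduces to the following (MD5 is acceptable here because the key is not security-sensitive):

```python
import hashlib
import json

def search_cache_key(query: str, *args, **kwargs) -> str:
    """Deterministic key mirroring CacheManager.cache_search."""
    key_data = {"query": query, "args": args, "kwargs": kwargs}
    payload = json.dumps(key_data, sort_keys=True).encode()
    return f"search:{hashlib.md5(payload).hexdigest()}"

# Same parameters always map to the same key; any change busts the cache entry
assert search_cache_key("van den berg", limit=10) == search_cache_key("van den berg", limit=10)
assert search_cache_key("van den berg", limit=10) != search_cache_key("van den berg", limit=20)
```

Note that this only works while all arguments are JSON-serializable; `sort_keys=True` is what keeps the digest stable across Python processes.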
9. Deployment Architecture
9.1 Kubernetes Deployment
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: ppid-api
labels:
app: ppid
component: api
spec:
replicas: 3
selector:
matchLabels:
app: ppid
component: api
template:
metadata:
labels:
app: ppid
component: api
spec:
containers:
- name: api
image: ppid/api:latest
ports:
- containerPort: 8000
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: database-url
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: redis-url
- name: JWT_SECRET
valueFrom:
secretKeyRef:
name: ppid-secrets
key: jwt-secret
resources:
requests:
cpu: "250m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "2Gi"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
# Service
apiVersion: v1
kind: Service
metadata:
name: ppid-api
spec:
selector:
app: ppid
component: api
ports:
- port: 80
targetPort: 8000
type: ClusterIP
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ppid-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
tls:
- hosts:
- api.ppid.org
secretName: ppid-tls
rules:
- host: api.ppid.org
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ppid-api
port:
number: 80
9.2 Docker Compose (Development)
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
- REDIS_URL=redis://redis:6379
- FUSEKI_URL=http://fuseki:3030
depends_on:
- postgres
- redis
- fuseki
volumes:
- ./src:/app/src
- ./archives:/app/archives
postgres:
image: postgres:16
environment:
POSTGRES_USER: ppid
POSTGRES_PASSWORD: ppid
POSTGRES_DB: ppid
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
redis:
image: redis:7-alpine
ports:
- "6379:6379"
fuseki:
image: stain/jena-fuseki
environment:
ADMIN_PASSWORD: admin
FUSEKI_DATASET_1: ppid
volumes:
- fuseki_data:/fuseki
ports:
- "3030:3030"
elasticsearch:
image: elasticsearch:8.11.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
volumes:
- es_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
depends_on:
- zookeeper
ports:
- "9092:9092"
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
volumes:
postgres_data:
fuseki_data:
es_data:
10. Monitoring and Observability
10.1 Prometheus Metrics
from prometheus_client import Counter, Histogram, Gauge
import time
# Metrics
REQUEST_COUNT = Counter(
'ppid_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'ppid_request_latency_seconds',
'Request latency in seconds',
['method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
OBSERVATIONS_CREATED = Counter(
'ppid_observations_created_total',
'Total observations created'
)
RECONSTRUCTIONS_CREATED = Counter(
'ppid_reconstructions_created_total',
'Total reconstructions created'
)
ENTITY_RESOLUTION_LATENCY = Histogram(
'ppid_entity_resolution_seconds',
'Entity resolution latency',
buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)
CACHE_HITS = Counter(
'ppid_cache_hits_total',
'Cache hits',
['cache_type']
)
CACHE_MISSES = Counter(
'ppid_cache_misses_total',
'Cache misses',
['cache_type']
)
DB_CONNECTIONS = Gauge(
'ppid_db_connections',
'Active database connections'
)
# Middleware for request metrics (assumes the FastAPI `app` instance and
# `Request` import from the API layer)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
latency = time.time() - start_time
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.url.path
).observe(latency)
return response
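The `histogram_quantile` function used in the dashboard queries interpolates within cumulative buckets. A simplified pure-Python rendering of that calculation (approximate, following PromQL's linear interpolation; bucket values below are made up) clarifies what a p99 panel actually reports:

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Approximate PromQL histogram_quantile over cumulative (le, count) buckets.

    `buckets` must be sorted by upper bound and end with (inf, total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the open +Inf bucket
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 40 between 100ms and 500ms, 10 between 0.5s and 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
assert math.isclose(histogram_quantile(0.5, buckets), 0.1)
assert math.isclose(histogram_quantile(0.9, buckets), 0.5)
```

This also shows why bucket boundaries matter: quantile estimates are only as precise as the bucket edges chosen in `REQUEST_LATENCY` above.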
10.2 Logging
import logging
import json
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Structured JSON logging for observability."""
    def format(self, record):
        log_record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
}
# Add extra fields
if hasattr(record, 'poid'):
log_record['poid'] = record.poid
if hasattr(record, 'prid'):
log_record['prid'] = record.prid
if hasattr(record, 'request_id'):
log_record['request_id'] = record.request_id
if hasattr(record, 'user_id'):
log_record['user_id'] = record.user_id
if hasattr(record, 'duration_ms'):
log_record['duration_ms'] = record.duration_ms
if record.exc_info:
log_record['exception'] = self.formatException(record.exc_info)
return json.dumps(log_record)
# Configure logging
def setup_logging():
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
# Reduce noise from libraries
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
10.3 Grafana Dashboard (JSON)
{
"dashboard": {
"title": "PPID System Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)"
}
]
},
{
"title": "Request Latency (p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))"
}
]
},
{
"title": "Observations Created",
"type": "stat",
"targets": [
{
"expr": "sum(ppid_observations_created_total)"
}
]
},
{
"title": "Cache Hit Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))"
}
]
}
]
}
}
11. Implementation Checklist
11.1 Phase 1: Core Infrastructure
- Set up PostgreSQL database with schema
- Set up Apache Jena Fuseki for RDF
- Set up Redis for caching
- Implement POID/PRID generation
- Implement checksum validation
- Create basic REST API endpoints
11.2 Phase 2: Data Ingestion
- Build web scraping infrastructure
- Implement HTML archival
- Integrate LLM for extraction
- Implement XPath validation
- Set up Kafka for async processing
- Create ingestion pipeline
11.3 Phase 3: Entity Resolution
- Implement blocking strategies
- Implement similarity metrics
- Build clustering algorithm
- Create human-in-loop review UI
- Integrate with reconstruction creation
11.4 Phase 4: GHCID Integration
- Implement affiliation linking
- Add SPARQL queries for institution lookups
- Create bidirectional navigation
- Sync with GHCID registry updates
11.5 Phase 5: Production Readiness
- Implement authentication/authorization
- Set up rate limiting
- Configure monitoring and alerting
- Create backup and recovery procedures
- Performance testing and optimization
- Security audit
12. References
Standards
- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/
Related PPID Documents
- Identifier Structure
- Claims and Provenance
Technologies
- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/