# Implementation Guidelines

**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [Identifier Structure](./05_identifier_structure_design.md) | [Claims and Provenance](./07_claims_and_provenance.md)

---

## 1. Overview

This document provides comprehensive technical specifications for implementing the PPID system:

- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability

---

## 2. Architecture Overview

### 2.1 System Components

```
┌─────────────────────────────────────────────────────────────────┐
│                           PPID System                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │  Ingestion   │────▶│  Processing  │────▶│   Storage    │     │
│  │    Layer     │     │    Layer     │     │    Layer     │     │
│  └──────────────┘     └──────────────┘     └──────────────┘     │
│         │                    │                    │             │
│         ▼                    ▼                    ▼             │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │ Web Scrapers │     │ Entity Res.  │     │  PostgreSQL  │     │
│  │ API Clients  │     │   NER/NLP    │     │ (Relational) │     │
│  │ File Import  │     │  Validation  │     ├──────────────┤     │
│  └──────────────┘     └──────────────┘     │ Apache Jena  │     │
│                                            │ (RDF/SPARQL) │     │
│                                            ├──────────────┤     │
│                                            │    Redis     │     │
│                                            │   (Cache)    │     │
│                                            └──────────────┘     │
│                                                   │             │
│              ┌────────────────────────────────────┤             │
│              ▼                                    ▼             │
│       ┌──────────────┐                     ┌──────────────┐     │
│       │   REST API   │                     │   GraphQL    │     │
│       │  (FastAPI)   │                     │  (Ariadne)   │     │
│       └──────────────┘                     └──────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 2.2 Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| **API Framework** | FastAPI (Python 3.11+) | REST API, async support |
| **GraphQL** | Ariadne | GraphQL endpoint |
| **Relational DB** | PostgreSQL 16 | Primary data store |
| **Triple Store** | Apache Jena Fuseki | RDF/SPARQL queries |
| **Cache** | Redis 7 | Sessions, rate limiting, caching |
| **Queue** | Apache Kafka | Async processing pipeline |
| **Search** | Elasticsearch 8 | Full-text search |
| **Object Storage** | MinIO / S3 | HTML archives, files |
| **Container** | Docker + Kubernetes | Deployment |
| **Monitoring** | Prometheus + Grafana | Metrics and alerting |
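
For local development, the core of this stack can be brought up with a compose file along these lines. This is a sketch, not part of the specification: image tags, the Fuseki community image, ports, and credentials are illustrative assumptions.

```yaml
version: "3.9"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ppid
      POSTGRES_PASSWORD: changeme   # placeholder; use secrets in production
    ports: ["5432:5432"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
  fuseki:
    image: stain/jena-fuseki        # community-maintained image; assumption
    ports: ["3030:3030"]
  api:
    build: .
    depends_on: [postgres, redis, fuseki]
    ports: ["8000:8000"]
```

Kafka, Elasticsearch, and MinIO are omitted here for brevity; they join the same network in the full deployment.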
---

## 3. Database Schema

### 3.1 PostgreSQL Schema

```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm"; -- For fuzzy matching

-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
    'official_registry',
    'institutional_website',
    'professional_network',
    'social_media',
    'news_article',
    'academic_publication',
    'user_submitted',
    'inferred'
);

-- Users (curators and API clients); referenced by the curation and link tables below
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    username VARCHAR(100) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Person Observations (POID)
CREATE TABLE person_observations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    poid VARCHAR(24) UNIQUE NOT NULL, -- POID-xxxx-xxxx-xxxx-xxxx

    -- Source metadata
    source_url TEXT NOT NULL,
    source_type source_type NOT NULL,
    retrieved_at TIMESTAMPTZ NOT NULL,
    content_hash VARCHAR(64) NOT NULL, -- SHA-256
    html_archive_path TEXT,

    -- Extracted name components (PNV-compatible)
    literal_name TEXT,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT, -- van, de, etc.
    patronymic TEXT,
    generation_suffix TEXT, -- Jr., III, etc.

    -- Metadata
    extraction_agent VARCHAR(100),
    extraction_confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- Format check
    CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid VARCHAR(24) UNIQUE NOT NULL, -- PRID-xxxx-xxxx-xxxx-xxxx

    -- Canonical name (resolved from observations)
    canonical_name TEXT NOT NULL,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,

    -- Curation metadata
    curator_id UUID REFERENCES users(id),
    curation_method VARCHAR(50), -- 'manual', 'algorithmic', 'hybrid'
    confidence_score DECIMAL(3,2),

    -- Versioning
    version INTEGER DEFAULT 1,
    previous_version_id UUID REFERENCES person_reconstructions(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
    linked_at TIMESTAMPTZ DEFAULT NOW(),
    linked_by UUID REFERENCES users(id),
    link_confidence DECIMAL(3,2),
    PRIMARY KEY (prid_id, poid_id)
);

-- Claims (assertions with provenance)
CREATE TABLE claims (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    claim_id VARCHAR(50) UNIQUE NOT NULL,

    -- Subject (what this claim is about)
    poid_id UUID REFERENCES person_observations(id),

    -- Claim content
    claim_type VARCHAR(50) NOT NULL, -- 'job_title', 'employer', 'email', etc.
    claim_value TEXT NOT NULL,

    -- Provenance (MANDATORY per Rule 6)
    source_url TEXT NOT NULL,
    retrieved_on TIMESTAMPTZ NOT NULL,
    xpath TEXT NOT NULL,
    html_file TEXT NOT NULL,
    xpath_match_score DECIMAL(3,2) NOT NULL,
    content_hash VARCHAR(64),

    -- Quality
    confidence DECIMAL(3,2),
    extraction_agent VARCHAR(100),
    status claim_status DEFAULT 'active',

    -- Relationships
    supersedes_id UUID REFERENCES claims(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    verified_at TIMESTAMPTZ,

    -- Range check
    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);

-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
    claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    relationship_type VARCHAR(20) NOT NULL, -- 'supports', 'conflicts_with'
    notes TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);

-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    identifier_scheme VARCHAR(20) NOT NULL, -- 'orcid', 'isni', 'viaf', 'wikidata'
    identifier_value VARCHAR(100) NOT NULL,
    verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (prid_id, identifier_scheme, identifier_value)
);

-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    ghcid VARCHAR(50) NOT NULL, -- e.g., NL-NH-HAA-A-NHA

    -- Affiliation details
    role_title TEXT,
    department TEXT,
    affiliation_start DATE,
    affiliation_end DATE,
    is_current BOOLEAN DEFAULT TRUE,

    -- Provenance
    source_poid_id UUID REFERENCES person_observations(id),
    confidence DECIMAL(3,2),

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
    USING gin (literal_name gin_trgm_ops);

CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
    USING gin (canonical_name gin_trgm_ops);

CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);

CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
    WHERE is_current = TRUE;

-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
    USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
    USING gin (to_tsvector('english', canonical_name));
```
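
The `valid_poid` constraint accepts identifiers shaped `POID-xxxx-xxxx-xxxx-xxxx` in lowercase hex. The normative derivation rules live in the Identifier Structure chapter; as a minimal sketch of deterministic generation consistent with that shape (the SHA-256-over-metadata recipe here is an illustrative assumption, not the specified algorithm):

```python
import hashlib
import re

# Same shape the valid_poid CHECK constraint enforces
POID_PATTERN = re.compile(
    r"^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$"
)

def sketch_poid(source_url: str, retrieved_at: str, content_hash: str) -> str:
    """Derive a POID-shaped identifier from source metadata.

    Illustrative only: hashes the metadata with SHA-256 and formats the
    first 16 hex digits into the four 4-character groups the schema expects.
    """
    digest = hashlib.sha256(
        f"{source_url}|{retrieved_at}|{content_hash}".encode()
    ).hexdigest()
    groups = [digest[i:i + 4] for i in range(0, 16, 4)]
    return "POID-" + "-".join(groups)

poid = sketch_poid(
    "https://example.org/staff/jan", "2025-01-09T14:30:00Z", "abc123"
)
assert POID_PATTERN.match(poid)  # deterministic and schema-valid
```

Because the input includes the retrieval timestamp and content hash, re-observing the same page at a different time yields a new POID, which matches the observation-per-snapshot model.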

### 3.2 RDF Triple Store Schema

```turtle
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
@prefix picom: <https://personsincontext.org/model#> .
@prefix pnv: <https://w3id.org/pnv#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix schema: <http://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X a ppidt:PersonObservation, picom:PersonObservation ;
    ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;

    # Name (PNV structured)
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Provenance
    prov:wasDerivedFrom <https://linkedin.com/in/jan-van-den-berg> ;
    prov:wasGeneratedBy ppid:extraction-activity-001 ;
    prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;

    # Claims
    ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .

# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5 a ppidt:PersonReconstruction, picom:PersonReconstruction ;
    ppidv:prid "PRID-1234-5678-90ab-cde5" ;

    # Canonical name
    schema:name "Jan van den Berg" ;
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Derived from observations
    prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
        ppid:POID-8c4d-e5f6-a7b8-901X ;

    # GHCID affiliation
    ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;

    # External identifiers
    ppidv:orcid "0000-0002-1234-5678" ;
    owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
```
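
A typical query against the Fuseki SPARQL endpoint, using the vocabulary above to find reconstructions affiliated with a given GHCID (the type IRI and property names follow the examples; the query itself is illustrative):

```sparql
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ppidt: <https://ppid.org/type#>
PREFIX schema: <http://schema.org/>

SELECT ?prid ?name WHERE {
  ?person a ppidt:PersonReconstruction ;
          ppidv:prid ?prid ;
          schema:name ?name ;
          ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> .
}
```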

---

## 4. API Design

### 4.1 REST API Endpoints

```yaml
openapi: 3.1.0
info:
  title: PPID API
  version: 1.0.0
  description: Person Persistent Identifier API

servers:
  - url: https://api.ppid.org/v1

paths:
  # Person Observations
  /observations:
    post:
      summary: Create new person observation
      operationId: createObservation
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateObservationRequest'
      responses:
        '201':
          description: Observation created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'

    get:
      summary: Search observations
      operationId: searchObservations
      parameters:
        - name: name
          in: query
          schema:
            type: string
        - name: source_url
          in: query
          schema:
            type: string
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        '200':
          description: List of observations
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ObservationList'

  /observations/{poid}:
    get:
      summary: Get observation by POID
      operationId: getObservation
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
            pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
      responses:
        '200':
          description: Person observation
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
            text/turtle:
              schema:
                type: string
            application/ld+json:
              schema:
                type: object

  /observations/{poid}/claims:
    get:
      summary: Get claims for observation
      operationId: getObservationClaims
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: List of claims
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClaimList'

  # Person Reconstructions
  /reconstructions:
    post:
      summary: Create person reconstruction from observations
      operationId: createReconstruction
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateReconstructionRequest'
      responses:
        '201':
          description: Reconstruction created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}:
    get:
      summary: Get reconstruction by PRID
      operationId: getReconstruction
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Person reconstruction
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations

  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history

  # Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
          content:
            application/json:
              schema:
                type: object
                properties:
                  valid:
                    type: boolean
                  type:
                    type: string
                    enum: [POID, PRID]
                  exists:
                    type: boolean

  # Search
  /search:
    get:
      summary: Full-text search across all records
      operationId: search
      parameters:
        - name: q
          in: query
          required: true
          schema:
            type: string
        - name: type
          in: query
          schema:
            type: string
            enum: [observation, reconstruction, all]
            default: all
      responses:
        '200':
          description: Search results

  # Entity Resolution
  /resolve:
    post:
      summary: Find matching records for input data
      operationId: resolveEntity
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/EntityResolutionRequest'
      responses:
        '200':
          description: Resolution candidates
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ResolutionCandidates'

components:
  schemas:
    # NOTE: ObservationList, ClaimList, Claim, EntityResolutionRequest, and
    # ResolutionCandidates are referenced above but elided here for brevity.
    CreateObservationRequest:
      type: object
      required:
        - source_url
        - retrieved_at
        - claims
      properties:
        source_url:
          type: string
          format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network, social_media]
        retrieved_at:
          type: string
          format: date-time
        content_hash:
          type: string
        html_archive_path:
          type: string
        extraction_agent:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/ClaimInput'

    ClaimInput:
      type: object
      required:
        - claim_type
        - claim_value
        - xpath
        - xpath_match_score
      properties:
        claim_type:
          type: string
        claim_value:
          type: string
        xpath:
          type: string
        xpath_match_score:
          type: number
          minimum: 0
          maximum: 1
        confidence:
          type: number

    PersonObservation:
      type: object
      properties:
        poid:
          type: string
        source_url:
          type: string
        literal_name:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/Claim'
        created_at:
          type: string
          format: date-time

    CreateReconstructionRequest:
      type: object
      required:
        - observation_ids
      properties:
        observation_ids:
          type: array
          items:
            type: string
          minItems: 1
        canonical_name:
          type: string
        external_identifiers:
          type: object

    PersonReconstruction:
      type: object
      properties:
        prid:
          type: string
        canonical_name:
          type: string
        observations:
          type: array
          items:
            type: string
        external_identifiers:
          type: object
        ghcid_affiliations:
          type: array
          items:
            $ref: '#/components/schemas/GhcidAffiliation'

    GhcidAffiliation:
      type: object
      properties:
        ghcid:
          type: string
        role_title:
          type: string
        is_current:
          type: boolean

  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
    apiKey:
      type: apiKey
      in: header
      name: X-API-Key

security:
  - bearerAuth: []
  - apiKey: []
```
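
Clients can cheaply pre-validate identifiers against the path pattern above before calling `/validate/{ppid}`; the server remains authoritative for checksum and existence. A sketch generalizing the spec's POID pattern to both prefixes:

```python
import re

# Path pattern from the OpenAPI spec, generalized to both identifier prefixes
PPID_PATTERN = re.compile(
    r"^(POID|PRID)-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$"
)

def looks_like_ppid(value: str) -> bool:
    """Client-side format check; does not verify checksum or existence."""
    return PPID_PATTERN.match(value) is not None

assert looks_like_ppid("POID-7a3b-c4d5-e6f7-890X")
assert not looks_like_ppid("POID-7a3b-c4d5-e6f7")       # too short
assert not looks_like_ppid("PRID-7A3B-c4d5-e6f7-8901")  # uppercase hex rejected
```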

### 4.2 GraphQL Schema

```graphql
# Custom scalars used throughout
scalar DateTime
scalar Date

type Query {
  # Observations
  observation(poid: ID!): PersonObservation
  observations(
    name: String
    sourceUrl: String
    limit: Int = 20
    offset: Int = 0
  ): ObservationConnection!

  # Reconstructions
  reconstruction(prid: ID!): PersonReconstruction
  reconstructions(
    name: String
    ghcid: String
    limit: Int = 20
    offset: Int = 0
  ): ReconstructionConnection!

  # Search
  search(query: String!, type: SearchType = ALL): SearchResults!

  # Validation
  validatePpid(ppid: String!): ValidationResult!

  # Entity Resolution
  resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}

type Mutation {
  # Create observation
  createObservation(input: CreateObservationInput!): PersonObservation!

  # Create reconstruction
  createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!

  # Link observation to reconstruction
  linkObservation(prid: ID!, poid: ID!): PersonReconstruction!

  # Update reconstruction
  updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!

  # Add external identifier
  addExternalIdentifier(
    prid: ID!
    scheme: IdentifierScheme!
    value: String!
  ): PersonReconstruction!

  # Add GHCID affiliation
  addGhcidAffiliation(
    prid: ID!
    ghcid: String!
    roleTitle: String
    isCurrent: Boolean = true
  ): PersonReconstruction!
}

type PersonObservation {
  poid: ID!
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String

  # Name components
  literalName: String
  givenName: String
  surname: String
  surnamePrefix: String

  # Related data
  claims: [Claim!]!
  linkedReconstructions: [PersonReconstruction!]!

  # Metadata
  extractionAgent: String
  extractionConfidence: Float
  createdAt: DateTime!
}

type PersonReconstruction {
  prid: ID!
  canonicalName: String!
  givenName: String
  surname: String
  surnamePrefix: String

  # Linked observations
  observations: [PersonObservation!]!

  # External identifiers
  orcid: String
  isni: String
  viaf: String
  wikidata: String
  externalIdentifiers: [ExternalIdentifier!]!

  # GHCID affiliations
  ghcidAffiliations: [GhcidAffiliation!]!
  currentAffiliations: [GhcidAffiliation!]!

  # Versioning
  version: Int!
  previousVersion: PersonReconstruction
  history: [PersonReconstruction!]!

  # Curation
  curator: User
  curationMethod: CurationMethod
  confidenceScore: Float

  createdAt: DateTime!
  updatedAt: DateTime!
}

type Claim {
  id: ID!
  claimType: ClaimType!
  claimValue: String!

  # Provenance (MANDATORY)
  sourceUrl: String!
  retrievedOn: DateTime!
  xpath: String!
  htmlFile: String!
  xpathMatchScore: Float!

  # Quality
  confidence: Float
  extractionAgent: String
  status: ClaimStatus!

  # Relationships
  supports: [Claim!]!
  conflictsWith: [Claim!]!
  supersedes: Claim

  createdAt: DateTime!
}

type GhcidAffiliation {
  ghcid: String!
  institution: HeritageCustodian # Resolved from GHCID
  roleTitle: String
  department: String
  startDate: Date
  endDate: Date
  isCurrent: Boolean!
  confidence: Float
}

type HeritageCustodian {
  ghcid: String!
  name: String!
  institutionType: String!
  city: String
  country: String
}

type ExternalIdentifier {
  scheme: IdentifierScheme!
  value: String!
  verified: Boolean!
  verifiedAt: DateTime
}

enum SourceType {
  OFFICIAL_REGISTRY
  INSTITUTIONAL_WEBSITE
  PROFESSIONAL_NETWORK
  SOCIAL_MEDIA
  NEWS_ARTICLE
  ACADEMIC_PUBLICATION
  USER_SUBMITTED
  INFERRED
}

enum ClaimType {
  FULL_NAME
  GIVEN_NAME
  FAMILY_NAME
  JOB_TITLE
  EMPLOYER
  EMPLOYER_GHCID
  EMAIL
  LINKEDIN_URL
  ORCID
  BIRTH_DATE
  EDUCATION
}

enum ClaimStatus {
  ACTIVE
  SUPERSEDED
  RETRACTED
}

enum IdentifierScheme {
  ORCID
  ISNI
  VIAF
  WIKIDATA
  LOC_NAF
}

enum CurationMethod {
  MANUAL
  ALGORITHMIC
  HYBRID
}

enum SearchType {
  OBSERVATION
  RECONSTRUCTION
  ALL
}

input CreateObservationInput {
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String
  extractionAgent: String
  claims: [ClaimInput!]!
}

input ClaimInput {
  claimType: ClaimType!
  claimValue: String!
  xpath: String!
  xpathMatchScore: Float!
  confidence: Float
}

input CreateReconstructionInput {
  observationIds: [ID!]!
  canonicalName: String
  externalIdentifiers: ExternalIdentifiersInput
}

input EntityResolutionInput {
  name: String!
  employer: String
  jobTitle: String
  email: String
  linkedinUrl: String
}

type ResolutionCandidate {
  reconstruction: PersonReconstruction
  observation: PersonObservation
  matchScore: Float!
  matchFactors: [MatchFactor!]!
}

type MatchFactor {
  field: String!
  score: Float!
  method: String!
}

# Supporting types referenced above (minimal definitions; shapes illustrative)
type ObservationConnection {
  nodes: [PersonObservation!]!
  totalCount: Int!
}

type ReconstructionConnection {
  nodes: [PersonReconstruction!]!
  totalCount: Int!
}

type SearchResults {
  observations: [PersonObservation!]!
  reconstructions: [PersonReconstruction!]!
}

type ValidationResult {
  valid: Boolean!
  type: String
  exists: Boolean
  error: String
}

type User {
  id: ID!
  username: String!
}

input UpdateReconstructionInput {
  canonicalName: String
}

input ExternalIdentifiersInput {
  orcid: String
  isni: String
  viaf: String
  wikidata: String
}
```
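
An example query against this schema, fetching a reconstruction together with its provenance-bearing claims and current affiliations (the PRID is the example identifier from §3.2):

```graphql
query GetReconstruction {
  reconstruction(prid: "PRID-1234-5678-90ab-cde5") {
    canonicalName
    observations {
      poid
      sourceUrl
      claims {
        claimType
        claimValue
        sourceUrl
        xpathMatchScore
      }
    }
    currentAffiliations {
      ghcid
      roleTitle
    }
  }
}
```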

### 4.3 FastAPI Implementation

```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel, Field

app = FastAPI(
    title="PPID API",
    description="Person Persistent Identifier API",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to known origins in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")


# --- Pydantic Models ---

class ClaimInput(BaseModel):
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float = Field(ge=0, le=1)
    confidence: Optional[float] = Field(None, ge=0, le=1)


class CreateObservationRequest(BaseModel):
    source_url: str
    source_type: str = "institutional_website"
    retrieved_at: datetime
    content_hash: Optional[str] = None
    html_archive_path: Optional[str] = None
    extraction_agent: Optional[str] = None
    claims: list[ClaimInput]


class PersonObservationResponse(BaseModel):
    poid: str
    source_url: str
    source_type: str
    retrieved_at: datetime
    literal_name: Optional[str] = None
    claims: list[dict]
    created_at: datetime


class CreateReconstructionRequest(BaseModel):
    observation_ids: list[str]
    canonical_name: Optional[str] = None
    external_identifiers: Optional[dict] = None


class ValidationResult(BaseModel):
    valid: bool
    ppid_type: Optional[str] = None
    exists: Optional[bool] = None
    error: Optional[str] = None


# --- Dependencies ---

async def get_db():
    """Database connection dependency."""
    # In production, yield a connection from a pool
    pass


async def get_current_user(token: str = Depends(oauth2_scheme)):
    """Authenticate user from API key or JWT."""
    pass


# --- Endpoints ---

@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
    request: CreateObservationRequest,
    db = Depends(get_db)
):
    """
    Create a new Person Observation from extracted data.

    The POID is generated deterministically from source metadata.
    """
    from ppid.identifiers import generate_poid

    # Generate deterministic POID
    poid = generate_poid(
        source_url=request.source_url,
        retrieval_timestamp=request.retrieved_at.isoformat(),
        content_hash=request.content_hash or ""
    )

    # Check for existing observation with same POID
    existing = await db.get_observation(poid)
    if existing:
        return existing

    # Extract name from claims
    literal_name = None
    for claim in request.claims:
        if claim.claim_type == "full_name":
            literal_name = claim.claim_value
            break

    # Create observation record
    observation = await db.create_observation(
        poid=poid,
        source_url=request.source_url,
        source_type=request.source_type,
        retrieved_at=request.retrieved_at,
        content_hash=request.content_hash,
        html_archive_path=request.html_archive_path,
        literal_name=literal_name,
        extraction_agent=request.extraction_agent,
        claims=[c.model_dump() for c in request.claims]
    )

    return observation


@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db = Depends(get_db)):
    """Get Person Observation by POID."""
    from ppid.identifiers import validate_ppid

    is_valid, error = validate_ppid(poid)
    if not is_valid:
        raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")

    observation = await db.get_observation(poid)
    if not observation:
        raise HTTPException(status_code=404, detail="Observation not found")

    return observation


@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db = Depends(get_db)):
    """Validate PPID format, checksum, and existence."""
    from ppid.identifiers import validate_ppid_full

    is_valid, error = validate_ppid_full(ppid)

    if not is_valid:
        return ValidationResult(valid=False, error=error)

    ppid_type = "POID" if ppid.startswith("POID") else "PRID"

    # Check existence
    if ppid_type == "POID":
        exists = await db.observation_exists(ppid)
    else:
        exists = await db.reconstruction_exists(ppid)

    return ValidationResult(
        valid=True,
        ppid_type=ppid_type,
        exists=exists
    )


@app.post("/api/v1/reconstructions")
async def create_reconstruction(
    request: CreateReconstructionRequest,
    db = Depends(get_db),
    user = Depends(get_current_user)
):
    """Create Person Reconstruction from linked observations."""
    from ppid.identifiers import generate_prid

    # Validate all POIDs exist
    for poid in request.observation_ids:
        if not await db.observation_exists(poid):
            raise HTTPException(
                status_code=400,
                detail=f"Observation not found: {poid}"
            )

    # Generate deterministic PRID
    prid = generate_prid(
        observation_ids=request.observation_ids,
        curator_id=str(user.id),
        timestamp=datetime.utcnow().isoformat()
    )

    # Determine canonical name
    if request.canonical_name:
        canonical_name = request.canonical_name
    else:
        # Use name from highest-confidence observation
        observations = await db.get_observations(request.observation_ids)
        canonical_name = max(
            observations,
            key=lambda o: o.extraction_confidence or 0
        ).literal_name

    # Create reconstruction
    reconstruction = await db.create_reconstruction(
        prid=prid,
        canonical_name=canonical_name,
        observation_ids=request.observation_ids,
        curator_id=user.id,
        external_identifiers=request.external_identifiers
    )

    return reconstruction


@app.get("/api/v1/search")
async def search(
    q: str = Query(..., min_length=2),
    type: str = Query("all", pattern="^(observation|reconstruction|all)$"),
    limit: int = Query(20, ge=1, le=100),
    offset: int = Query(0, ge=0),
    db = Depends(get_db)
):
    """Full-text search across observations and reconstructions."""
    results = await db.search(
        query=q,
        search_type=type,
        limit=limit,
        offset=offset
    )
    return results


@app.post("/api/v1/resolve")
async def resolve_entity(
    name: str,
    employer: Optional[str] = None,
    job_title: Optional[str] = None,
    email: Optional[str] = None,
    db = Depends(get_db)
):
    """
    Entity resolution: find matching records for input data.

    Returns ranked candidates with match scores.
    """
    from ppid.entity_resolution import find_candidates

    candidates = await find_candidates(
        db=db,
        name=name,
        employer=employer,
        job_title=job_title,
        email=email
    )

    return {
        "candidates": candidates,
        "query": {
            "name": name,
            "employer": employer,
            "job_title": job_title,
            "email": email
        }
    }
```
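
The `/resolve` endpoint delegates to `ppid.entity_resolution.find_candidates`, whose matching logic is not shown in this chapter. The core name-similarity scoring it relies on can be sketched with the standard library; the field weights here are illustrative assumptions, not the specified algorithm:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(query: dict, candidate: dict) -> float:
    """Weighted combination of field similarities (weights illustrative)."""
    score = 0.6 * name_similarity(query["name"], candidate["name"])
    if query.get("employer") and candidate.get("employer"):
        score += 0.4 * name_similarity(query["employer"], candidate["employer"])
    return round(score, 3)

s = match_score(
    {"name": "Jan van den Berg", "employer": "Noord-Hollands Archief"},
    {"name": "J. van den Berg", "employer": "Noord-Hollands Archief"},
)
assert 0.0 <= s <= 1.0
```

A production resolver would add per-field match methods (exact email, phonetic name, GHCID equality) and report them as the `MatchFactor` entries the API exposes.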

---

## 5. Data Ingestion Pipeline

### 5.1 Pipeline Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION PIPELINE                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Source   │───▶│  Extract  │───▶│ Transform │───▶│   Load    │  │
│  │  (Fetch)  │    │   (NER)   │    │(Validate) │    │  (Store)  │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│        │                │                │                │        │
│        ▼                ▼                ▼                ▼        │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Archive  │    │  Claims   │    │   POID    │    │ Postgres  │  │
│  │   HTML    │    │   XPath   │    │ Generate  │    │  + RDF    │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
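
The Fetch stage archives raw HTML keyed by a SHA-256 content hash, matching the `content_hash VARCHAR(64)` column in the schema. A minimal sketch of that hashing step; the line-ending normalization is an illustrative assumption so the same page fetched over different transports hashes identically:

```python
import hashlib

def content_hash(html: str) -> str:
    """SHA-256 hex digest of the page body: 64 chars, fits VARCHAR(64)."""
    # Normalize CRLF and trim surrounding whitespace before hashing (assumption)
    normalized = html.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

h = content_hash("<html><body>Jan van den Berg</body></html>\r\n")
assert len(h) == 64
assert h == content_hash("<html><body>Jan van den Berg</body></html>\n")
```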

### 5.2 Pipeline Implementation

```python
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator
import hashlib
import asyncio

from kafka import KafkaProducer, KafkaConsumer


@dataclass
class SourceDocument:
    url: str
    html_content: str
    retrieved_at: datetime
    content_hash: str
    archive_path: str


@dataclass
class ExtractedClaim:
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float
    confidence: float


@dataclass
class PersonObservationData:
    source: SourceDocument
    claims: list[ExtractedClaim]
    poid: str


class IngestionPipeline:
    """
    Main data ingestion pipeline for PPID.

    Stages:
    1. Fetch: Retrieve web pages, archive HTML
    2. Extract: NER/NLP to identify person data, generate claims with XPath
    3. Transform: Validate, generate POID, structure data
    4. Load: Store in PostgreSQL and RDF triple store
    """

    def __init__(
        self,
        db_pool,
        rdf_store,
        kafka_producer: KafkaProducer,
        archive_storage,
        llm_extractor
    ):
        self.db = db_pool
        self.rdf = rdf_store
        self.kafka = kafka_producer
        self.archive = archive_storage
        self.extractor = llm_extractor

    async def process_url(self, url: str) -> list[PersonObservationData]:
        """
        Full pipeline for a single URL.

        Returns list of PersonObservations extracted from the page.
        """
        # Stage 1: Fetch and archive
        source = await self._fetch_and_archive(url)

        # Stage 2: Extract claims with XPath
        observations = await self._extract_observations(source)

        # Stage 3: Transform and validate
        validated = await self._transform_and_validate(observations)

        # Stage 4: Load to databases
        await self._load_observations(validated)

        return validated

    async def _fetch_and_archive(self, url: str) -> SourceDocument:
        """Fetch URL and archive HTML."""
        from playwright.async_api import async_playwright
|
|
|
async with async_playwright() as p:
|
|
browser = await p.chromium.launch()
|
|
page = await browser.new_page()
|
|
|
|
await page.goto(url, wait_until='networkidle')
|
|
html_content = await page.content()
|
|
|
|
await browser.close()
|
|
|
|
# Calculate content hash
|
|
content_hash = hashlib.sha256(html_content.encode()).hexdigest()
|
|
|
|
# Archive HTML
|
|
retrieved_at = datetime.utcnow()
|
|
archive_path = await self.archive.store(
|
|
url=url,
|
|
content=html_content,
|
|
timestamp=retrieved_at
|
|
)
|
|
|
|
return SourceDocument(
|
|
url=url,
|
|
html_content=html_content,
|
|
retrieved_at=retrieved_at,
|
|
content_hash=content_hash,
|
|
archive_path=archive_path
|
|
)
|
|
|
|
async def _extract_observations(
|
|
self,
|
|
source: SourceDocument
|
|
) -> list[tuple[list[ExtractedClaim], str]]:
|
|
"""
|
|
Extract person observations with XPath provenance.
|
|
|
|
Uses LLM for extraction, then validates XPath.
|
|
"""
|
|
from lxml import html
|
|
|
|
# Parse HTML
|
|
tree = html.fromstring(source.html_content)
|
|
|
|
# Use LLM to extract person data with XPath
|
|
extraction_result = await self.extractor.extract_persons(
|
|
html_content=source.html_content,
|
|
source_url=source.url
|
|
)
|
|
|
|
observations = []
|
|
|
|
for person in extraction_result.persons:
|
|
validated_claims = []
|
|
|
|
for claim in person.claims:
|
|
# Verify XPath points to expected value
|
|
try:
|
|
elements = tree.xpath(claim.xpath)
|
|
if elements:
|
|
actual_value = elements[0].text_content().strip()
|
|
|
|
# Calculate match score
|
|
if actual_value == claim.claim_value:
|
|
match_score = 1.0
|
|
else:
|
|
from difflib import SequenceMatcher
|
|
match_score = SequenceMatcher(
|
|
None, actual_value, claim.claim_value
|
|
).ratio()
|
|
|
|
if match_score >= 0.8: # Accept if 80%+ match
|
|
validated_claims.append(ExtractedClaim(
|
|
claim_type=claim.claim_type,
|
|
claim_value=claim.claim_value,
|
|
xpath=claim.xpath,
|
|
xpath_match_score=match_score,
|
|
confidence=claim.confidence * match_score
|
|
))
|
|
except Exception as e:
|
|
# Skip claims with invalid XPath
|
|
continue
|
|
|
|
if validated_claims:
|
|
# Get literal name from claims
|
|
literal_name = next(
|
|
(c.claim_value for c in validated_claims if c.claim_type == 'full_name'),
|
|
None
|
|
)
|
|
observations.append((validated_claims, literal_name))
|
|
|
|
return observations
|
|
|
|
async def _transform_and_validate(
|
|
self,
|
|
observations: list[tuple[list[ExtractedClaim], str]],
|
|
source: SourceDocument
|
|
) -> list[PersonObservationData]:
|
|
"""Transform extracted data and generate POIDs."""
|
|
from ppid.identifiers import generate_poid
|
|
|
|
results = []
|
|
|
|
for claims, literal_name in observations:
|
|
# Generate deterministic POID
|
|
claims_hash = hashlib.sha256(
|
|
str(sorted([c.claim_value for c in claims])).encode()
|
|
).hexdigest()
|
|
|
|
poid = generate_poid(
|
|
source_url=source.url,
|
|
retrieval_timestamp=source.retrieved_at.isoformat(),
|
|
content_hash=f"{source.content_hash}:{claims_hash}"
|
|
)
|
|
|
|
results.append(PersonObservationData(
|
|
source=source,
|
|
claims=claims,
|
|
poid=poid
|
|
))
|
|
|
|
return results
|
|
|
|
async def _load_observations(
|
|
self,
|
|
observations: list[PersonObservationData]
|
|
) -> None:
|
|
"""Load observations to PostgreSQL and RDF store."""
|
|
for obs in observations:
|
|
# Check if already exists (idempotent)
|
|
existing = await self.db.get_observation(obs.poid)
|
|
if existing:
|
|
continue
|
|
|
|
# Insert to PostgreSQL
|
|
await self.db.create_observation(
|
|
poid=obs.poid,
|
|
source_url=obs.source.url,
|
|
source_type='institutional_website',
|
|
retrieved_at=obs.source.retrieved_at,
|
|
content_hash=obs.source.content_hash,
|
|
html_archive_path=obs.source.archive_path,
|
|
literal_name=next(
|
|
(c.claim_value for c in obs.claims if c.claim_type == 'full_name'),
|
|
None
|
|
),
|
|
claims=[{
|
|
'claim_type': c.claim_type,
|
|
'claim_value': c.claim_value,
|
|
'xpath': c.xpath,
|
|
'xpath_match_score': c.xpath_match_score,
|
|
'confidence': c.confidence
|
|
} for c in obs.claims]
|
|
)
|
|
|
|
# Insert to RDF triple store
|
|
await self._insert_rdf(obs)
|
|
|
|
# Publish to Kafka for downstream processing
|
|
self.kafka.send('ppid.observations.created', {
|
|
'poid': obs.poid,
|
|
'source_url': obs.source.url,
|
|
'timestamp': obs.source.retrieved_at.isoformat()
|
|
})
|
|
|
|
async def _insert_rdf(self, obs: PersonObservationData) -> None:
|
|
"""Insert observation as RDF triples."""
|
|
from rdflib import Graph, Namespace, Literal, URIRef
|
|
from rdflib.namespace import RDF, XSD
|
|
|
|
PPID = Namespace("https://ppid.org/")
|
|
PPIDV = Namespace("https://ppid.org/vocab#")
|
|
PROV = Namespace("http://www.w3.org/ns/prov#")
|
|
|
|
g = Graph()
|
|
|
|
obs_uri = PPID[obs.poid]
|
|
|
|
g.add((obs_uri, RDF.type, PPIDV.PersonObservation))
|
|
g.add((obs_uri, PPIDV.poid, Literal(obs.poid)))
|
|
g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url)))
|
|
g.add((obs_uri, PROV.generatedAtTime, Literal(
|
|
obs.source.retrieved_at.isoformat(),
|
|
datatype=XSD.dateTime
|
|
)))
|
|
|
|
# Add claims
|
|
for i, claim in enumerate(obs.claims):
|
|
claim_uri = PPID[f"{obs.poid}/claim/{i}"]
|
|
g.add((obs_uri, PPIDV.hasClaim, claim_uri))
|
|
g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type)))
|
|
g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value)))
|
|
g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath)))
|
|
g.add((claim_uri, PPIDV.xpathMatchScore, Literal(
|
|
claim.xpath_match_score, datatype=XSD.decimal
|
|
)))
|
|
|
|
# Insert to triple store
|
|
await self.rdf.insert(g)
|
|
```
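
The acceptance rule in `_extract_observations` (an exact match scores 1.0, otherwise a `SequenceMatcher` ratio with a 0.8 cutoff) can be exercised in isolation; the names below are illustrative:

```python
from difflib import SequenceMatcher

def xpath_match_score(actual: str, claimed: str) -> float:
    """Score agreement between the value found at an XPath and the claimed value."""
    if actual == claimed:
        return 1.0
    return SequenceMatcher(None, actual, claimed).ratio()

# An honorific prefix still clears the 0.8 threshold:
score = xpath_match_score("Dr. Jan van den Berg", "Jan van den Berg")
assert abs(score - 32 / 36) < 1e-9   # 2 * 16 matching chars / 36 total chars
assert score >= 0.8                  # claim would be accepted
assert xpath_match_score("Jan", "Piet") < 0.8   # unrelated value rejected
```

The cutoff trades recall for precision: markup drift (added titles, reordered whitespace) survives, while a stale XPath pointing at a different person is dropped.
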

---

## 6. GHCID Integration

### 6.1 Linking Persons to Institutions

```python
from dataclasses import dataclass
from typing import Optional
from datetime import date

@dataclass
class GhcidAffiliation:
    """
    Link between a person (PRID) and a heritage institution (GHCID).
    """
    ghcid: str  # e.g., "NL-NH-HAA-A-NHA"
    role_title: Optional[str] = None
    department: Optional[str] = None
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    is_current: bool = True
    source_poid: Optional[str] = None
    confidence: float = 0.9


async def link_person_to_institution(
    db,
    prid: str,
    ghcid: str,
    role_title: Optional[str] = None,
    source_poid: Optional[str] = None
) -> GhcidAffiliation:
    """
    Create link between person and heritage institution.

    Args:
        prid: Person Reconstruction ID
        ghcid: Global Heritage Custodian ID
        role_title: Job title at institution
        source_poid: Observation where affiliation was extracted

    Returns:
        Created affiliation record
    """
    # Validate PRID exists
    reconstruction = await db.get_reconstruction(prid)
    if not reconstruction:
        raise ValueError(f"Reconstruction not found: {prid}")

    # Validate GHCID format
    if not validate_ghcid_format(ghcid):
        raise ValueError(f"Invalid GHCID format: {ghcid}")

    # Create affiliation
    affiliation = await db.create_ghcid_affiliation(
        prid_id=reconstruction.id,
        ghcid=ghcid,
        role_title=role_title,
        source_poid_id=source_poid,
        is_current=True
    )

    return affiliation


def validate_ghcid_format(ghcid: str) -> bool:
    """Validate GHCID format."""
    import re
    # Pattern: CC-RR-SSS-T-ABBREV
    pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
    return bool(re.match(pattern, ghcid))
```
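
As a quick self-check of the GHCID grammar, here is a compiled-pattern variant of `validate_ghcid_format` with a few illustrative inputs:

```python
import re

# CC-RR-SSS-T-ABBREV: country, region, city, institution type, abbreviation
GHCID_PATTERN = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

def validate_ghcid_format(ghcid: str) -> bool:
    return bool(GHCID_PATTERN.match(ghcid))

assert validate_ghcid_format("NL-NH-HAA-A-NHA")        # Noord-Hollands Archief
assert not validate_ghcid_format("nl-nh-haa-a-nha")    # lowercase rejected
assert not validate_ghcid_format("NL-NH-HAA-A")        # missing abbreviation
```
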

### 6.2 RDF Integration

```turtle
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ghcid: <https://w3id.org/heritage/custodian/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
    a ppidv:PersonReconstruction ;
    schema:name "Jan van den Berg" ;

    # Current employment
    org:memberOf [
        a org:Membership ;
        org:organization ghcid:NL-NH-HAA-A-NHA ;
        org:role [
            a org:Role ;
            schema:name "Senior Archivist"
        ] ;
        ppidv:isCurrent true ;
        schema:startDate "2015"^^xsd:gYear
    ] ;

    # Direct link for simple queries
    ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .

# The heritage institution (from GHCID system)
ghcid:NL-NH-HAA-A-NHA
    a ppidv:HeritageCustodian ;
    schema:name "Noord-Hollands Archief" ;
    ppidv:ghcid "NL-NH-HAA-A-NHA" ;
    schema:address [
        schema:addressLocality "Haarlem" ;
        schema:addressCountry "NL"
    ] .
```

### 6.3 SPARQL Queries

```sparql
# Find all persons affiliated with a specific institution
PREFIX ppid: <https://ppid.org/>
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>

SELECT ?prid ?name ?role ?isCurrent
WHERE {
  ?prid a ppidv:PersonReconstruction ;
        schema:name ?name ;
        org:memberOf ?membership .

  ?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
              ppidv:isCurrent ?isCurrent .

  OPTIONAL {
    ?membership org:role ?roleNode .
    ?roleNode schema:name ?role .
  }
}
ORDER BY DESC(?isCurrent) ?name


# Find all institutions a person has worked at
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
  ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .

  ?membership org:organization ?institution ;
              ppidv:isCurrent ?isCurrent .

  ?institution ppidv:ghcid ?ghcid ;
               schema:name ?institutionName .

  OPTIONAL { ?membership schema:startDate ?startDate }
  OPTIONAL { ?membership schema:endDate ?endDate }
  OPTIONAL {
    ?membership org:role ?roleNode .
    ?roleNode schema:name ?role .
  }
}
ORDER BY DESC(?isCurrent) DESC(?startDate)


# Find archivists across all Dutch archives
SELECT ?prid ?name ?institution ?institutionName
WHERE {
  ?prid a ppidv:PersonReconstruction ;
        schema:name ?name ;
        org:memberOf ?membership .

  ?membership org:organization ?institution ;
              ppidv:isCurrent true ;
              org:role ?roleNode .

  ?roleNode schema:name ?role .
  FILTER(CONTAINS(LCASE(?role), "archivist"))

  ?institution ppidv:ghcid ?ghcid ;
               schema:name ?institutionName .
  FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
```
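
Applications typically parameterize such queries rather than hard-code identifiers. A minimal sketch that builds the first query for an arbitrary GHCID, using the strict format grammar as injection protection (the helper names are illustrative; submitting the string to a SPARQL endpoint such as Fuseki is left out):

```python
import re

GHCID_RE = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

PREFIXES = (
    "PREFIX ppidv: <https://ppid.org/vocab#>\n"
    "PREFIX ghcid: <https://w3id.org/heritage/custodian/>\n"
    "PREFIX schema: <http://schema.org/>\n"
    "PREFIX org: <http://www.w3.org/ns/org#>\n"
)

def persons_at_institution_query(ghcid: str) -> str:
    # Validate before interpolating: the strict grammar doubles as
    # protection against SPARQL injection.
    if not GHCID_RE.match(ghcid):
        raise ValueError(f"invalid GHCID: {ghcid}")
    return PREFIXES + (
        "SELECT ?prid ?name WHERE {\n"
        "  ?prid a ppidv:PersonReconstruction ;\n"
        "        schema:name ?name ;\n"
        "        org:memberOf [ org:organization ghcid:" + ghcid + " ] .\n"
        "}\n"
    )

q = persons_at_institution_query("NL-NH-HAA-A-NHA")
assert "ghcid:NL-NH-HAA-A-NHA" in q
```
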

---

## 7. Security and Access Control

### 7.1 Authentication

```python
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, APIKeyHeader
from jose import JWTError, jwt
from passlib.context import CryptContext
from datetime import datetime, timedelta
from typing import Optional
import os

# Configuration (never hard-code the secret; read it from the environment)
SECRET_KEY = os.environ.get("JWT_SECRET", "dev-only-change-me")
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token", auto_error=False)
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


class User:
    def __init__(self, id: str, email: str, roles: list[str]):
        self.id = id
        self.email = email
        self.roles = roles


def create_access_token(data: dict, expires_delta: Optional[timedelta] = None):
    """Create JWT access token."""
    to_encode = data.copy()
    expire = datetime.utcnow() + (expires_delta or timedelta(minutes=15))
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)


async def get_current_user(
    token: Optional[str] = Depends(oauth2_scheme),
    api_key: Optional[str] = Depends(api_key_header),
    db = Depends(get_db)
) -> User:
    """
    Authenticate user via JWT token or API key.
    """
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )

    # Try JWT token first
    if token:
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
            user_id: str = payload.get("sub")
            if user_id is None:
                raise credentials_exception

            user = await db.get_user(user_id)
            if user is None:
                raise credentials_exception

            return user
        except JWTError:
            pass

    # Try API key
    if api_key:
        user = await db.get_user_by_api_key(api_key)
        if user:
            return user

    raise credentials_exception


def require_role(required_roles: list[str]):
    """Dependency to require specific roles."""
    async def role_checker(user: User = Depends(get_current_user)):
        if not any(role in user.roles for role in required_roles):
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail="Insufficient permissions"
            )
        return user
    return role_checker
```

### 7.2 Authorization Roles

| Role | Permissions |
|------|-------------|
| `reader` | Read observations, reconstructions, claims |
| `contributor` | Create observations, add claims |
| `curator` | Create reconstructions, link observations, resolve conflicts |
| `admin` | Manage users, API keys, system configuration |
| `api_client` | Programmatic access via API key |

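
The table maps naturally onto a permission check. A minimal sketch (role names come from the table; the permission strings are illustrative):

```python
# Role -> permission mapping, mirroring the table above.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "reader": {"read"},
    "contributor": {"read", "create_observation", "add_claim"},
    "curator": {"read", "create_observation", "add_claim",
                "create_reconstruction", "link_observation", "resolve_conflict"},
    "admin": {"manage_users", "manage_api_keys", "configure_system"},
    "api_client": {"read"},
}

def is_allowed(user_roles: list[str], permission: str) -> bool:
    """A user is allowed if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(r, set()) for r in user_roles)

assert is_allowed(["curator"], "create_reconstruction")
assert not is_allowed(["reader"], "add_claim")
assert is_allowed(["reader", "admin"], "manage_users")   # roles accumulate
```

In the API, the same check is enforced per endpoint via the `require_role` dependency from 7.1.
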
### 7.3 Rate Limiting

```python
from fastapi import Request
import redis
from datetime import datetime

class RateLimiter:
    """
    Sliding-window rate limiter using Redis sorted sets.
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def is_allowed(
        self,
        key: str,
        max_requests: int = 100,
        window_seconds: int = 60
    ) -> tuple[bool, dict]:
        """
        Check if request is allowed under rate limit.

        Returns:
            Tuple of (is_allowed, rate_limit_info)
        """
        now = datetime.utcnow().timestamp()
        window_start = now - window_seconds

        pipe = self.redis.pipeline()

        # Remove old requests
        pipe.zremrangebyscore(key, 0, window_start)

        # Count requests in window
        pipe.zcard(key)

        # Add current request
        pipe.zadd(key, {str(now): now})

        # Set expiry
        pipe.expire(key, window_seconds)

        results = pipe.execute()
        request_count = results[1]

        is_allowed = request_count < max_requests

        return is_allowed, {
            "limit": max_requests,
            "remaining": max(0, max_requests - request_count - 1),
            "reset": int(now + window_seconds)
        }


# Rate limit tiers
RATE_LIMITS = {
    "anonymous": {"requests": 60, "window": 60},
    "reader": {"requests": 100, "window": 60},
    "contributor": {"requests": 500, "window": 60},
    "curator": {"requests": 1000, "window": 60},
    "api_client": {"requests": 5000, "window": 60},
}
```

---

## 8. Performance Requirements

### 8.1 SLOs (Service Level Objectives)

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | 99.9% | Monthly uptime |
| **API Latency (p50)** | < 50ms | Response time |
| **API Latency (p99)** | < 500ms | Response time |
| **Search Latency** | < 200ms | Full-text search |
| **SPARQL Query** | < 1s | Simple queries |
| **Throughput** | 1000 req/s | Sustained load |

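
Whether collected latency samples meet the p50/p99 targets can be checked directly with a simple nearest-rank percentile (the samples below are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over the sorted samples (simplified)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = [12, 18, 25, 30, 41, 47, 55, 230, 460, 480]
assert percentile(latencies_ms, 50) <= 50    # p50 target: < 50ms
assert percentile(latencies_ms, 99) <= 500   # p99 target: < 500ms
```

In production these figures come from the Prometheus histograms in Section 10 rather than raw samples.
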
### 8.2 Scaling Strategy

```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ppid-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ppid-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
```

### 8.3 Caching Strategy

```python
import redis
from functools import wraps
import json
import hashlib

class CacheManager:
    """
    Multi-tier caching for PPID.

    Tiers:
    1. L1: In-memory (per-instance)
    2. L2: Redis (shared)
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        # NOTE: unbounded dict; in production use a bounded LRU to cap memory
        self.local_cache = {}

    def cache_observation(self, ttl: int = 3600):
        """Cache observation lookups."""
        def decorator(func):
            @wraps(func)
            async def wrapper(poid: str, *args, **kwargs):
                cache_key = f"observation:{poid}"

                # Check L1 cache
                if cache_key in self.local_cache:
                    return self.local_cache[cache_key]

                # Check L2 cache
                cached = self.redis.get(cache_key)
                if cached:
                    data = json.loads(cached)
                    self.local_cache[cache_key] = data
                    return data

                # Fetch from database
                result = await func(poid, *args, **kwargs)

                if result:
                    # Store in both caches
                    self.redis.setex(cache_key, ttl, json.dumps(result))
                    self.local_cache[cache_key] = result

                return result
            return wrapper
        return decorator

    def cache_search(self, ttl: int = 300):
        """Cache search results (shorter TTL)."""
        def decorator(func):
            @wraps(func)
            async def wrapper(query: str, *args, **kwargs):
                # Create deterministic cache key from query params
                key_data = {"query": query, "args": args, "kwargs": kwargs}
                cache_key = f"search:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"

                cached = self.redis.get(cache_key)
                if cached:
                    return json.loads(cached)

                result = await func(query, *args, **kwargs)

                self.redis.setex(cache_key, ttl, json.dumps(result))

                return result
            return wrapper
        return decorator

    def invalidate_observation(self, poid: str):
        """Invalidate cache when observation is updated."""
        cache_key = f"observation:{poid}"
        self.redis.delete(cache_key)
        self.local_cache.pop(cache_key, None)
```
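
The search cache relies on `json.dumps(..., sort_keys=True)` to make keys independent of argument order. A quick check of that property (the helper name is illustrative):

```python
import hashlib
import json

def search_cache_key(query: str, **kwargs) -> str:
    """Deterministic cache key: sorted JSON serialization hashed with MD5."""
    payload = json.dumps({"query": query, "kwargs": kwargs}, sort_keys=True)
    return "search:" + hashlib.md5(payload.encode()).hexdigest()

# Identical parameters in a different order produce the same key...
assert search_cache_key("berg", limit=10, offset=0) == search_cache_key("berg", offset=0, limit=10)
# ...while any changed parameter produces a different one.
assert search_cache_key("berg", limit=10) != search_cache_key("berg", limit=20)
```
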

---

## 9. Deployment Architecture

### 9.1 Kubernetes Deployment

```yaml
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ppid-api
  labels:
    app: ppid
    component: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ppid
      component: api
  template:
    metadata:
      labels:
        app: ppid
        component: api
    spec:
      containers:
        - name: api
          image: ppid/api:latest
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

---
# Service
apiVersion: v1
kind: Service
metadata:
  name: ppid-api
spec:
  selector:
    app: ppid
    component: api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP

---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ppid-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - api.ppid.org
      secretName: ppid-tls
  rules:
    - host: api.ppid.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ppid-api
                port:
                  number: 80
```

### 9.2 Docker Compose (Development)

```yaml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
      - REDIS_URL=redis://redis:6379
      - FUSEKI_URL=http://fuseki:3030
    depends_on:
      - postgres
      - redis
      - fuseki
    volumes:
      - ./src:/app/src
      - ./archives:/app/archives

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ppid
      POSTGRES_PASSWORD: ppid
      POSTGRES_DB: ppid
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  fuseki:
    image: stain/jena-fuseki
    environment:
      ADMIN_PASSWORD: admin
      FUSEKI_DATASET_1: ppid
    volumes:
      - fuseki_data:/fuseki
    ports:
      - "3030:3030"

  elasticsearch:
    image: elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

volumes:
  postgres_data:
  fuseki_data:
  es_data:
```
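
Application code reaches these services through the environment variables set on the `api` container. A sketch of parsing `DATABASE_URL` with the standard library (the URL matches the Compose file above):

```python
from urllib.parse import urlparse

url = "postgresql://ppid:ppid@postgres:5432/ppid"  # DATABASE_URL from the Compose file
parts = urlparse(url)

assert parts.scheme == "postgresql"
assert parts.username == "ppid"
assert parts.hostname == "postgres"      # Compose service name doubles as hostname
assert parts.port == 5432
assert parts.path.lstrip("/") == "ppid"  # database name
```

Inside the Compose network, service names (`postgres`, `redis`, `fuseki`) resolve via Docker's DNS, so the same URL works for every `api` replica.
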

---

## 10. Monitoring and Observability

### 10.1 Prometheus Metrics

```python
from fastapi import Request
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
REQUEST_COUNT = Counter(
    'ppid_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'ppid_request_latency_seconds',
    'Request latency in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

OBSERVATIONS_CREATED = Counter(
    'ppid_observations_created_total',
    'Total observations created'
)

RECONSTRUCTIONS_CREATED = Counter(
    'ppid_reconstructions_created_total',
    'Total reconstructions created'
)

ENTITY_RESOLUTION_LATENCY = Histogram(
    'ppid_entity_resolution_seconds',
    'Entity resolution latency',
    buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)

CACHE_HITS = Counter(
    'ppid_cache_hits_total',
    'Cache hits',
    ['cache_type']
)

CACHE_MISSES = Counter(
    'ppid_cache_misses_total',
    'Cache misses',
    ['cache_type']
)

DB_CONNECTIONS = Gauge(
    'ppid_db_connections',
    'Active database connections'
)


# Middleware for request metrics ("app" is the FastAPI instance defined earlier)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    response = await call_next(request)

    latency = time.time() - start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response
```
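
Quantiles over these histograms come from PromQL's `histogram_quantile`, which linearly interpolates within cumulative buckets. The arithmetic can be sketched as follows (bucket data is illustrative; the real function also handles `+Inf` and per-series aggregation):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted by ascending bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts for the bounds 0.05s, 0.1s, 0.5s:
data = [(0.05, 60), (0.1, 90), (0.5, 100)]
assert abs(histogram_quantile(0.5, data) - 0.05 * 50 / 60) < 1e-12
assert abs(histogram_quantile(0.99, data) - 0.46) < 1e-9   # inside the 0.1-0.5s bucket
```

This is why bucket boundaries matter: the p99 estimate is only as precise as the bucket it lands in, which motivates the dense `buckets=` lists above.
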

### 10.2 Logging

```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    """Structured JSON logging for observability."""

    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

        # Add extra fields
        if hasattr(record, 'poid'):
            log_record['poid'] = record.poid
        if hasattr(record, 'prid'):
            log_record['prid'] = record.prid
        if hasattr(record, 'request_id'):
            log_record['request_id'] = record.request_id
        if hasattr(record, 'user_id'):
            log_record['user_id'] = record.user_id
        if hasattr(record, 'duration_ms'):
            log_record['duration_ms'] = record.duration_ms

        if record.exc_info:
            log_record['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_record)


# Configure logging
def setup_logging():
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())

    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
```
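
Contextual fields travel via `logging`'s `extra` mechanism. A trimmed, self-contained sketch of driving a formatter like the one above (the field set is reduced for brevity):

```python
import io
import json
import logging

class MiniJSONFormatter(logging.Formatter):
    """Reduced variant of JSONFormatter: level, message, optional poid."""

    def format(self, record):
        out = {"level": record.levelname, "message": record.getMessage()}
        if hasattr(record, "poid"):
            out["poid"] = record.poid
        return json.dumps(out)

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(MiniJSONFormatter())
log = logging.getLogger("ppid.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# Fields passed via extra= become attributes on the LogRecord.
log.info("observation stored", extra={"poid": "POID-example"})
line = json.loads(buf.getvalue())
assert line == {"level": "INFO", "message": "observation stored", "poid": "POID-example"}
```

One JSON object per line keeps the output directly ingestible by log aggregators.
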

### 10.3 Grafana Dashboard (JSON)

```json
{
  "dashboard": {
    "title": "PPID System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Observations Created",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(ppid_observations_created_total)"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))"
          }
        ]
      }
    ]
  }
}
```

---

## 11. Implementation Checklist

### 11.1 Phase 1: Core Infrastructure

- [ ] Set up PostgreSQL database with schema
- [ ] Set up Apache Jena Fuseki for RDF
- [ ] Set up Redis for caching
- [ ] Implement POID/PRID generation
- [ ] Implement checksum validation
- [ ] Create basic REST API endpoints

### 11.2 Phase 2: Data Ingestion

- [ ] Build web scraping infrastructure
- [ ] Implement HTML archival
- [ ] Integrate LLM for extraction
- [ ] Implement XPath validation
- [ ] Set up Kafka for async processing
- [ ] Create ingestion pipeline

### 11.3 Phase 3: Entity Resolution

- [ ] Implement blocking strategies
- [ ] Implement similarity metrics
- [ ] Build clustering algorithm
- [ ] Create human-in-loop review UI
- [ ] Integrate with reconstruction creation

### 11.4 Phase 4: GHCID Integration

- [ ] Implement affiliation linking
- [ ] Add SPARQL queries for institution lookups
- [ ] Create bidirectional navigation
- [ ] Sync with GHCID registry updates

### 11.5 Phase 5: Production Readiness

- [ ] Implement authentication/authorization
- [ ] Set up rate limiting
- [ ] Configure monitoring and alerting
- [ ] Create backup and recovery procedures
- [ ] Performance testing and optimization
- [ ] Security audit

---

## 12. References

### Standards

- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/

### Related PPID Documents

- [Identifier Structure Design](./05_identifier_structure_design.md)
- [Entity Resolution Patterns](./06_entity_resolution_patterns.md)
- [Claims and Provenance](./07_claims_and_provenance.md)

### Technologies

- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/