# Implementation Guidelines

**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [Identifier Structure](./05_identifier_structure_design.md) | [Claims and Provenance](./07_claims_and_provenance.md)

---

## 1. Overview

This document provides comprehensive technical specifications for implementing the PPID system:

- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability

---

## 2. Architecture Overview

### 2.1 System Components

```
┌──────────────────────────────────────────────────────────────────┐
│                           PPID System                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│  │  Ingestion   │────▶│  Processing  │────▶│   Storage    │      │
│  │    Layer     │     │    Layer     │     │    Layer     │      │
│  └──────────────┘     └──────────────┘     └──────────────┘      │
│         │                    │                    │              │
│         ▼                    ▼                    ▼              │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│  │ Web Scrapers │     │ Entity Res.  │     │  PostgreSQL  │      │
│  │ API Clients  │     │ NER/NLP      │     │ (Relational) │      │
│  │ File Import  │     │ Validation   │     ├──────────────┤      │
│  └──────────────┘     └──────────────┘     │ Apache Jena  │      │
│                                            │ (RDF/SPARQL) │      │
│                                            ├──────────────┤      │
│                                            │    Redis     │      │
│                                            │   (Cache)    │      │
│                                            └──────────────┘      │
│                                                   │              │
│              ┌────────────────────────────────────┴───┐          │
│              ▼                                        ▼          │
│       ┌──────────────┐                        ┌──────────────┐   │
│       │   REST API   │                        │   GraphQL    │   │
│       │  (FastAPI)   │                        │  (Ariadne)   │   │
│       └──────────────┘                        └──────────────┘   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

### 2.2 Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| **API Framework** | FastAPI (Python 3.11+) | REST API, async support |
| **GraphQL** | Ariadne | GraphQL endpoint |
| **Relational DB** | PostgreSQL 16 | Primary data store |
| **Triple Store** | Apache Jena Fuseki | RDF/SPARQL queries |
| **Cache** | Redis 7 | Session, rate limiting, caching |
| **Queue** | Apache Kafka | Async processing pipeline |
| **Search** | Elasticsearch 8 | Full-text search |
| **Object Storage** | MinIO / S3 | HTML archives, files |
| **Container** | Docker + Kubernetes | Deployment |
| **Monitoring** | Prometheus + Grafana | Metrics and alerting |

---

## 3. Database Schema

### 3.1 PostgreSQL Schema

```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";  -- For fuzzy matching

-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
    'official_registry', 'institutional_website', 'professional_network',
    'social_media', 'news_article', 'academic_publication',
    'user_submitted', 'inferred'
);

-- Person Observations (POID)
CREATE TABLE person_observations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    poid VARCHAR(24) UNIQUE NOT NULL,  -- POID-xxxx-xxxx-xxxx-xxxx

    -- Source metadata
    source_url TEXT NOT NULL,
    source_type source_type NOT NULL,
    retrieved_at TIMESTAMPTZ NOT NULL,
    content_hash VARCHAR(64) NOT NULL,  -- SHA-256
    html_archive_path TEXT,

    -- Extracted name components (PNV-compatible)
    literal_name TEXT,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,     -- van, de, etc.
    patronymic TEXT,
    generation_suffix TEXT,  -- Jr., III, etc.

    -- Metadata
    extraction_agent VARCHAR(100),
    extraction_confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- Constraints
    CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid VARCHAR(24) UNIQUE NOT NULL,  -- PRID-xxxx-xxxx-xxxx-xxxx

    -- Canonical name (resolved from observations)
    canonical_name TEXT NOT NULL,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,

    -- Curation metadata
    curator_id UUID REFERENCES users(id),  -- assumes a users table (not defined in this document)
    curation_method VARCHAR(50),           -- 'manual', 'algorithmic', 'hybrid'
    confidence_score DECIMAL(3,2),

    -- Versioning
    version INTEGER DEFAULT 1,
    previous_version_id UUID REFERENCES person_reconstructions(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
    linked_at TIMESTAMPTZ DEFAULT NOW(),
    linked_by UUID REFERENCES users(id),
    link_confidence DECIMAL(3,2),
    PRIMARY KEY (prid_id, poid_id)
);

-- Claims (assertions with provenance)
CREATE TABLE claims (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    claim_id VARCHAR(50) UNIQUE NOT NULL,

    -- Subject (what this claim is about)
    poid_id UUID REFERENCES person_observations(id),

    -- Claim content
    claim_type VARCHAR(50) NOT NULL,  -- 'job_title', 'employer', 'email', etc.
    claim_value TEXT NOT NULL,

    -- Provenance (MANDATORY per Rule 6)
    source_url TEXT NOT NULL,
    retrieved_on TIMESTAMPTZ NOT NULL,
    xpath TEXT NOT NULL,
    html_file TEXT NOT NULL,
    xpath_match_score DECIMAL(3,2) NOT NULL,
    content_hash VARCHAR(64),

    -- Quality
    confidence DECIMAL(3,2),
    extraction_agent VARCHAR(100),
    status claim_status DEFAULT 'active',

    -- Relationships
    supersedes_id UUID REFERENCES claims(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    verified_at TIMESTAMPTZ,

    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);

-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
    claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    relationship_type VARCHAR(20) NOT NULL,  -- 'supports', 'conflicts_with'
    notes TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);

-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    identifier_scheme VARCHAR(20) NOT NULL,  -- 'orcid', 'isni', 'viaf', 'wikidata'
    identifier_value VARCHAR(100) NOT NULL,
    verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (prid_id, identifier_scheme, identifier_value)
);

-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    ghcid VARCHAR(50) NOT NULL,  -- e.g., NL-NH-HAA-A-NHA

    -- Affiliation details
    role_title TEXT,
    department TEXT,
    affiliation_start DATE,
    affiliation_end DATE,
    is_current BOOLEAN DEFAULT TRUE,

    -- Provenance
    source_poid_id UUID REFERENCES person_observations(id),
    confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
    USING gin (literal_name gin_trgm_ops);
CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
    USING gin (canonical_name gin_trgm_ops);
CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);
CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
    WHERE is_current = TRUE;

-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
    USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
    USING gin (to_tsvector('english', canonical_name));
```

### 3.2 RDF Triple Store Schema

```turtle
@prefix ppid:   <https://ppid.org/> .
@prefix ppidv:  <https://ppid.org/vocab#> .
@prefix ppidt:  <https://ppid.org/types#> .
@prefix picom:  <https://personsincontext.org/model#> .
@prefix pnv:    <https://w3id.org/pnv#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix schema: <https://schema.org/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix ghcid:  <https://ghcid.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X
    a ppidt:PersonObservation, picom:PersonObservation ;
    ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;

    # Name (PNV structured)
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Provenance
    prov:wasDerivedFrom <https://example.org/staff/jan-van-den-berg> ;  # source page (placeholder URL)
    prov:wasGeneratedBy ppid:extraction-activity-001 ;
    prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;

    # Claims
    ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .
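# One of the claims referenced above, spelled out with its mandatory
# provenance fields. This is a sketch: claimType, claimValue, xpath and
# xpathMatchScore follow the RDF mapping used by the ingestion pipeline
# below; the ppidv:Claim class and the sourceUrl/retrievedOn property
# names are assumptions, and the source URL is a placeholder.
ppid:claim-001
    a ppidv:Claim ;
    ppidv:claimType "job_title" ;
    ppidv:claimValue "Senior Archivist" ;
    ppidv:sourceUrl <https://example.org/staff/jan-van-den-berg> ;
    ppidv:retrievedOn "2025-01-09T14:30:00Z"^^xsd:dateTime ;
    ppidv:xpath "//div[@class='staff-title']" ;
    ppidv:xpathMatchScore "1.0"^^xsd:decimal .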
# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5
    a ppidt:PersonReconstruction, picom:PersonReconstruction ;
    ppidv:prid "PRID-1234-5678-90ab-cde5" ;

    # Canonical name
    schema:name "Jan van den Berg" ;
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Derived from observations
    prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
                        ppid:POID-8c4d-e5f6-a7b8-901X ;

    # GHCID affiliation
    ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA ;

    # External identifiers
    ppidv:orcid "0000-0002-1234-5678" ;
    owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
```

---

## 4. API Design

### 4.1 REST API Endpoints

```yaml
openapi: 3.1.0
info:
  title: PPID API
  version: 1.0.0
  description: Person Persistent Identifier API

servers:
  - url: https://api.ppid.org/v1

paths:
  # Person Observations
  /observations:
    post:
      summary: Create new person observation
      operationId: createObservation
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateObservationRequest'
      responses:
        '201':
          description: Observation created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
    get:
      summary: Search observations
      operationId: searchObservations
      parameters:
        - name: name
          in: query
          schema:
            type: string
        - name: source_url
          in: query
          schema:
            type: string
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        '200':
          description: List of observations
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ObservationList'

  /observations/{poid}:
    get:
      summary: Get observation by POID
      operationId: getObservation
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
            pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
      responses:
        '200':
          description: Person observation
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
            text/turtle:
              schema:
                type: string
            application/ld+json:
              schema:
                type: object

  /observations/{poid}/claims:
    get:
      summary: Get claims for observation
      operationId: getObservationClaims
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: List of claims
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClaimList'

  # Person Reconstructions
  /reconstructions:
    post:
      summary: Create person reconstruction from observations
      operationId: createReconstruction
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateReconstructionRequest'
      responses:
        '201':
          description: Reconstruction created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}:
    get:
      summary: Get reconstruction by PRID
      operationId: getReconstruction
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Person reconstruction
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations

  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history

  # Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
          content:
            application/json:
              schema:
                type: object
                properties:
                  valid:
                    type: boolean
                  type:
                    type: string
                    enum: [POID, PRID]
                  exists:
                    type: boolean

  # Search
  /search:
    get:
      summary: Full-text search across all records
      operationId: search
      parameters:
        - name: q
          in: query
          required: true
          schema:
            type: string
        - name: type
          in: query
          schema:
            type: string
            enum: [observation, reconstruction, all]
            default: all
      responses:
        '200':
          description: Search results

  # Entity Resolution
  /resolve:
    post:
      summary: Find matching records for input data
      operationId: resolveEntity
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/EntityResolutionRequest'
      responses:
        '200':
          description: Resolution candidates
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ResolutionCandidates'

components:
  schemas:
    CreateObservationRequest:
      type: object
      required:
        - source_url
        - retrieved_at
        - claims
      properties:
        source_url:
          type: string
          format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network, social_media]
        retrieved_at:
          type: string
          format: date-time
        content_hash:
          type: string
        html_archive_path:
          type: string
        extraction_agent:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/ClaimInput'

    ClaimInput:
      type: object
      required:
        - claim_type
        - claim_value
        - xpath
        - xpath_match_score
      properties:
        claim_type:
          type: string
        claim_value:
          type: string
        xpath:
          type: string
        xpath_match_score:
          type: number
          minimum: 0
          maximum: 1
        confidence:
          type: number

    PersonObservation:
      type: object
      properties:
        poid:
          type: string
        source_url:
          type: string
        literal_name:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/Claim'
        created_at:
          type: string
          format: date-time

    CreateReconstructionRequest:
      type: object
      required:
        - observation_ids
      properties:
        observation_ids:
          type: array
          items:
            type: string
          minItems: 1
        canonical_name:
          type: string
        external_identifiers:
          type: object

    PersonReconstruction:
      type: object
      properties:
        prid:
          type: string
        canonical_name:
          type: string
        observations:
          type: array
          items:
            type: string
        external_identifiers:
          type: object
        ghcid_affiliations:
          type: array
          items:
            $ref: '#/components/schemas/GhcidAffiliation'

    GhcidAffiliation:
      type: object
      properties:
        ghcid:
          type: string
        role_title:
          type: string
        is_current:
          type: boolean

  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
    apiKey:
      type: apiKey
      in: header
      name: X-API-Key

security:
  - bearerAuth: []
  - apiKey: []
```

### 4.2 GraphQL Schema

```graphql
# Custom scalars
scalar DateTime
scalar Date

type Query {
  # Observations
  observation(poid: ID!): PersonObservation
  observations(
    name: String
    sourceUrl: String
    limit: Int = 20
    offset: Int = 0
  ): ObservationConnection!

  # Reconstructions
  reconstruction(prid: ID!): PersonReconstruction
  reconstructions(
    name: String
    ghcid: String
    limit: Int = 20
    offset: Int = 0
  ): ReconstructionConnection!

  # Search
  search(query: String!, type: SearchType = ALL): SearchResults!

  # Validation
  validatePpid(ppid: String!): ValidationResult!

  # Entity Resolution
  resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}

type Mutation {
  # Create observation
  createObservation(input: CreateObservationInput!): PersonObservation!

  # Create reconstruction
  createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!

  # Link observation to reconstruction
  linkObservation(prid: ID!, poid: ID!): PersonReconstruction!

  # Update reconstruction
  updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!

  # Add external identifier
  addExternalIdentifier(
    prid: ID!
    scheme: IdentifierScheme!
    value: String!
  ): PersonReconstruction!

  # Add GHCID affiliation
  addGhcidAffiliation(
    prid: ID!
    ghcid: String!
    roleTitle: String
    isCurrent: Boolean = true
  ): PersonReconstruction!
}

type PersonObservation {
  poid: ID!
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String

  # Name components
  literalName: String
  givenName: String
  surname: String
  surnamePrefix: String

  # Related data
  claims: [Claim!]!
  linkedReconstructions: [PersonReconstruction!]!

  # Metadata
  extractionAgent: String
  extractionConfidence: Float
  createdAt: DateTime!
}

type PersonReconstruction {
  prid: ID!
  canonicalName: String!
  givenName: String
  surname: String
  surnamePrefix: String

  # Linked observations
  observations: [PersonObservation!]!

  # External identifiers
  orcid: String
  isni: String
  viaf: String
  wikidata: String
  externalIdentifiers: [ExternalIdentifier!]!

  # GHCID affiliations
  ghcidAffiliations: [GhcidAffiliation!]!
  currentAffiliations: [GhcidAffiliation!]!

  # Versioning
  version: Int!
  previousVersion: PersonReconstruction
  history: [PersonReconstruction!]!

  # Curation
  curator: User
  curationMethod: CurationMethod
  confidenceScore: Float
  createdAt: DateTime!
  updatedAt: DateTime!
}

type Claim {
  id: ID!
  claimType: ClaimType!
  claimValue: String!

  # Provenance (MANDATORY)
  sourceUrl: String!
  retrievedOn: DateTime!
  xpath: String!
  htmlFile: String!
  xpathMatchScore: Float!

  # Quality
  confidence: Float
  extractionAgent: String
  status: ClaimStatus!

  # Relationships
  supports: [Claim!]!
  conflictsWith: [Claim!]!
  supersedes: Claim
  createdAt: DateTime!
}

type GhcidAffiliation {
  ghcid: String!
  institution: HeritageCustodian  # Resolved from GHCID
  roleTitle: String
  department: String
  startDate: Date
  endDate: Date
  isCurrent: Boolean!
  confidence: Float
}

type HeritageCustodian {
  ghcid: String!
  name: String!
  institutionType: String!
  city: String
  country: String
}

type ExternalIdentifier {
  scheme: IdentifierScheme!
  value: String!
  verified: Boolean!
  verifiedAt: DateTime
}

enum SourceType {
  OFFICIAL_REGISTRY
  INSTITUTIONAL_WEBSITE
  PROFESSIONAL_NETWORK
  SOCIAL_MEDIA
  NEWS_ARTICLE
  ACADEMIC_PUBLICATION
  USER_SUBMITTED
  INFERRED
}

enum ClaimType {
  FULL_NAME
  GIVEN_NAME
  FAMILY_NAME
  JOB_TITLE
  EMPLOYER
  EMPLOYER_GHCID
  EMAIL
  LINKEDIN_URL
  ORCID
  BIRTH_DATE
  EDUCATION
}

enum ClaimStatus {
  ACTIVE
  SUPERSEDED
  RETRACTED
}

enum IdentifierScheme {
  ORCID
  ISNI
  VIAF
  WIKIDATA
  LOC_NAF
}

enum CurationMethod {
  MANUAL
  ALGORITHMIC
  HYBRID
}

enum SearchType {
  OBSERVATION
  RECONSTRUCTION
  ALL
}

input CreateObservationInput {
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String
  extractionAgent: String
  claims: [ClaimInput!]!
}

input ClaimInput {
  claimType: ClaimType!
  claimValue: String!
  xpath: String!
  xpathMatchScore: Float!
  confidence: Float
}

input CreateReconstructionInput {
  observationIds: [ID!]!
  canonicalName: String
  externalIdentifiers: ExternalIdentifiersInput
}

input EntityResolutionInput {
  name: String!
  employer: String
  jobTitle: String
  email: String
  linkedinUrl: String
}

type ResolutionCandidate {
  reconstruction: PersonReconstruction
  observation: PersonObservation
  matchScore: Float!
  matchFactors: [MatchFactor!]!
}

type MatchFactor {
  field: String!
  score: Float!
  method: String!
}
```

### 4.3 FastAPI Implementation

```python
from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime

app = FastAPI(
    title="PPID API",
    description="Person Persistent Identifier API",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# --- Pydantic Models ---

class ClaimInput(BaseModel):
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float = Field(ge=0, le=1)
    confidence: Optional[float] = Field(None, ge=0, le=1)


class CreateObservationRequest(BaseModel):
    source_url: str
    source_type: str = "institutional_website"
    retrieved_at: datetime
    content_hash: Optional[str] = None
    html_archive_path: Optional[str] = None
    extraction_agent: Optional[str] = None
    claims: list[ClaimInput]


class PersonObservationResponse(BaseModel):
    poid: str
    source_url: str
    source_type: str
    retrieved_at: datetime
    literal_name: Optional[str] = None
    claims: list[dict]
    created_at: datetime


class CreateReconstructionRequest(BaseModel):
    observation_ids: list[str]
    canonical_name: Optional[str] = None
    external_identifiers: Optional[dict] = None


class ValidationResult(BaseModel):
    valid: bool
    ppid_type: Optional[str] = None
    exists: Optional[bool] = None
    error: Optional[str] = None


# --- Dependencies ---

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")


async def get_db():
    """Database connection dependency."""
    # In production, use a connection pool
    pass


async def get_current_user(token: str = Depends(oauth2_scheme)):
    """Authenticate user from API key or JWT."""
    pass


# --- Endpoints ---

@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
    request: CreateObservationRequest,
    db = Depends(get_db)
):
    """
    Create a new Person Observation from extracted data.

    The POID is generated deterministically from source metadata.
    """
    from ppid.identifiers import generate_poid

    # Generate deterministic POID
    poid = generate_poid(
        source_url=request.source_url,
        retrieval_timestamp=request.retrieved_at.isoformat(),
        content_hash=request.content_hash or ""
    )

    # Check for existing observation with same POID
    existing = await db.get_observation(poid)
    if existing:
        return existing

    # Extract name from claims
    literal_name = None
    for claim in request.claims:
        if claim.claim_type == "full_name":
            literal_name = claim.claim_value
            break

    # Create observation record
    observation = await db.create_observation(
        poid=poid,
        source_url=request.source_url,
        source_type=request.source_type,
        retrieved_at=request.retrieved_at,
        content_hash=request.content_hash,
        html_archive_path=request.html_archive_path,
        literal_name=literal_name,
        extraction_agent=request.extraction_agent,
        claims=[c.dict() for c in request.claims]
    )

    return observation


@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db = Depends(get_db)):
    """Get Person Observation by POID."""
    from ppid.identifiers import validate_ppid

    is_valid, error = validate_ppid(poid)
    if not is_valid:
        raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")

    observation = await db.get_observation(poid)
    if not observation:
        raise HTTPException(status_code=404, detail="Observation not found")

    return observation


@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db = Depends(get_db)):
    """Validate PPID format, checksum, and existence."""
    from ppid.identifiers import validate_ppid_full

    is_valid, error = validate_ppid_full(ppid)
    if not is_valid:
        return ValidationResult(valid=False, error=error)

    ppid_type = "POID" if ppid.startswith("POID") else "PRID"

    # Check existence
    if ppid_type == "POID":
        exists = await db.observation_exists(ppid)
    else:
        exists = await db.reconstruction_exists(ppid)

    return ValidationResult(
        valid=True,
        ppid_type=ppid_type,
        exists=exists
    )


@app.post("/api/v1/reconstructions")
async def create_reconstruction(
    request: CreateReconstructionRequest,
    db = Depends(get_db),
    user = Depends(get_current_user)
):
    """Create Person Reconstruction from linked observations."""
    from ppid.identifiers import generate_prid

    # Validate all POIDs exist
    for poid in request.observation_ids:
        if not await db.observation_exists(poid):
            raise HTTPException(
                status_code=400,
                detail=f"Observation not found: {poid}"
            )

    # Generate deterministic PRID
    prid = generate_prid(
        observation_ids=request.observation_ids,
        curator_id=str(user.id),
        timestamp=datetime.utcnow().isoformat()
    )

    # Determine canonical name
    if request.canonical_name:
        canonical_name = request.canonical_name
    else:
        # Use name from highest-confidence observation
        observations = await db.get_observations(request.observation_ids)
        canonical_name = max(
            observations,
            key=lambda o: o.extraction_confidence or 0
        ).literal_name

    # Create reconstruction
    reconstruction = await db.create_reconstruction(
        prid=prid,
        canonical_name=canonical_name,
        observation_ids=request.observation_ids,
        curator_id=user.id,
        external_identifiers=request.external_identifiers
    )

    return reconstruction


@app.get("/api/v1/search")
async def search(
    q: str = Query(..., min_length=2),
    type: str = Query("all", regex="^(observation|reconstruction|all)$"),
    limit: int = Query(20, ge=1, le=100),
    offset: int = Query(0, ge=0),
    db = Depends(get_db)
):
    """Full-text search across observations and reconstructions."""
    results = await db.search(
        query=q,
        search_type=type,
        limit=limit,
        offset=offset
    )
    return results
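# --- Sketch: deterministic POID generation ---------------------------------
# The endpoints above import generate_poid from ppid.identifiers, which is
# not shown in this document. The sketch below illustrates one way the
# deterministic derivation could work (the function name and hashing scheme
# are assumptions, not the canonical implementation): hash the source
# metadata and format the digest as POID-xxxx-xxxx-xxxx-xxxx, so that
# re-ingesting the same page snapshot always yields the same identifier.
import hashlib


def generate_poid_sketch(source_url: str, retrieval_timestamp: str,
                         content_hash: str) -> str:
    digest = hashlib.sha256(
        f"{source_url}|{retrieval_timestamp}|{content_hash}".encode()
    ).hexdigest()
    h = digest[:16]  # 16 hex chars -> four groups of four
    return f"POID-{h[0:4]}-{h[4:8]}-{h[8:12]}-{h[12:16]}"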
@app.post("/api/v1/resolve") async def resolve_entity( name: str, employer: Optional[str] = None, job_title: Optional[str] = None, email: Optional[str] = None, db = Depends(get_db) ): """ Entity resolution: find matching records for input data. Returns ranked candidates with match scores. """ from ppid.entity_resolution import find_candidates candidates = await find_candidates( db=db, name=name, employer=employer, job_title=job_title, email=email ) return { "candidates": candidates, "query": { "name": name, "employer": employer, "job_title": job_title, "email": email } } ``` --- ## 5. Data Ingestion Pipeline ### 5.1 Pipeline Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ DATA INGESTION PIPELINE │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Source │───▶│ Extract │───▶│ Transform│───▶│ Load │ │ │ │ Fetch │ │ (NER) │ │ (Validate) │ (Store) │ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │ Archive │ │ Claims │ │ POID │ │ Postgres│ │ │ │ HTML │ │ XPath │ │ Generate│ │ + RDF │ │ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` ### 5.2 Pipeline Implementation ```python from dataclasses import dataclass from datetime import datetime from typing import AsyncIterator import hashlib import asyncio from kafka import KafkaProducer, KafkaConsumer @dataclass class SourceDocument: url: str html_content: str retrieved_at: datetime content_hash: str archive_path: str @dataclass class ExtractedClaim: claim_type: str claim_value: str xpath: str xpath_match_score: float confidence: float @dataclass class PersonObservationData: source: SourceDocument claims: list[ExtractedClaim] poid: str class IngestionPipeline: """ Main data ingestion pipeline for PPID. Stages: 1. 
Fetch: Retrieve web pages, archive HTML 2. Extract: NER/NLP to identify person data, generate claims with XPath 3. Transform: Validate, generate POID, structure data 4. Load: Store in PostgreSQL and RDF triple store """ def __init__( self, db_pool, rdf_store, kafka_producer: KafkaProducer, archive_storage, llm_extractor ): self.db = db_pool self.rdf = rdf_store self.kafka = kafka_producer self.archive = archive_storage self.extractor = llm_extractor async def process_url(self, url: str) -> list[PersonObservationData]: """ Full pipeline for a single URL. Returns list of PersonObservations extracted from the page. """ # Stage 1: Fetch and archive source = await self._fetch_and_archive(url) # Stage 2: Extract claims with XPath observations = await self._extract_observations(source) # Stage 3: Transform and validate validated = await self._transform_and_validate(observations) # Stage 4: Load to databases await self._load_observations(validated) return validated async def _fetch_and_archive(self, url: str) -> SourceDocument: """Fetch URL and archive HTML.""" from playwright.async_api import async_playwright async with async_playwright() as p: browser = await p.chromium.launch() page = await browser.new_page() await page.goto(url, wait_until='networkidle') html_content = await page.content() await browser.close() # Calculate content hash content_hash = hashlib.sha256(html_content.encode()).hexdigest() # Archive HTML retrieved_at = datetime.utcnow() archive_path = await self.archive.store( url=url, content=html_content, timestamp=retrieved_at ) return SourceDocument( url=url, html_content=html_content, retrieved_at=retrieved_at, content_hash=content_hash, archive_path=archive_path ) async def _extract_observations( self, source: SourceDocument ) -> list[tuple[list[ExtractedClaim], str]]: """ Extract person observations with XPath provenance. Uses LLM for extraction, then validates XPath. 
""" from lxml import html # Parse HTML tree = html.fromstring(source.html_content) # Use LLM to extract person data with XPath extraction_result = await self.extractor.extract_persons( html_content=source.html_content, source_url=source.url ) observations = [] for person in extraction_result.persons: validated_claims = [] for claim in person.claims: # Verify XPath points to expected value try: elements = tree.xpath(claim.xpath) if elements: actual_value = elements[0].text_content().strip() # Calculate match score if actual_value == claim.claim_value: match_score = 1.0 else: from difflib import SequenceMatcher match_score = SequenceMatcher( None, actual_value, claim.claim_value ).ratio() if match_score >= 0.8: # Accept if 80%+ match validated_claims.append(ExtractedClaim( claim_type=claim.claim_type, claim_value=claim.claim_value, xpath=claim.xpath, xpath_match_score=match_score, confidence=claim.confidence * match_score )) except Exception as e: # Skip claims with invalid XPath continue if validated_claims: # Get literal name from claims literal_name = next( (c.claim_value for c in validated_claims if c.claim_type == 'full_name'), None ) observations.append((validated_claims, literal_name)) return observations async def _transform_and_validate( self, observations: list[tuple[list[ExtractedClaim], str]], source: SourceDocument ) -> list[PersonObservationData]: """Transform extracted data and generate POIDs.""" from ppid.identifiers import generate_poid results = [] for claims, literal_name in observations: # Generate deterministic POID claims_hash = hashlib.sha256( str(sorted([c.claim_value for c in claims])).encode() ).hexdigest() poid = generate_poid( source_url=source.url, retrieval_timestamp=source.retrieved_at.isoformat(), content_hash=f"{source.content_hash}:{claims_hash}" ) results.append(PersonObservationData( source=source, claims=claims, poid=poid )) return results async def _load_observations( self, observations: list[PersonObservationData] ) -> None: 
"""Load observations to PostgreSQL and RDF store.""" for obs in observations: # Check if already exists (idempotent) existing = await self.db.get_observation(obs.poid) if existing: continue # Insert to PostgreSQL await self.db.create_observation( poid=obs.poid, source_url=obs.source.url, source_type='institutional_website', retrieved_at=obs.source.retrieved_at, content_hash=obs.source.content_hash, html_archive_path=obs.source.archive_path, literal_name=next( (c.claim_value for c in obs.claims if c.claim_type == 'full_name'), None ), claims=[{ 'claim_type': c.claim_type, 'claim_value': c.claim_value, 'xpath': c.xpath, 'xpath_match_score': c.xpath_match_score, 'confidence': c.confidence } for c in obs.claims] ) # Insert to RDF triple store await self._insert_rdf(obs) # Publish to Kafka for downstream processing self.kafka.send('ppid.observations.created', { 'poid': obs.poid, 'source_url': obs.source.url, 'timestamp': obs.source.retrieved_at.isoformat() }) async def _insert_rdf(self, obs: PersonObservationData) -> None: """Insert observation as RDF triples.""" from rdflib import Graph, Namespace, Literal, URIRef from rdflib.namespace import RDF, XSD PPID = Namespace("https://ppid.org/") PPIDV = Namespace("https://ppid.org/vocab#") PROV = Namespace("http://www.w3.org/ns/prov#") g = Graph() obs_uri = PPID[obs.poid] g.add((obs_uri, RDF.type, PPIDV.PersonObservation)) g.add((obs_uri, PPIDV.poid, Literal(obs.poid))) g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url))) g.add((obs_uri, PROV.generatedAtTime, Literal( obs.source.retrieved_at.isoformat(), datatype=XSD.dateTime ))) # Add claims for i, claim in enumerate(obs.claims): claim_uri = PPID[f"{obs.poid}/claim/{i}"] g.add((obs_uri, PPIDV.hasClaim, claim_uri)) g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type))) g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value))) g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath))) g.add((claim_uri, PPIDV.xpathMatchScore, Literal( claim.xpath_match_score, 
datatype=XSD.decimal ))) # Insert to triple store await self.rdf.insert(g) ``` --- ## 6. GHCID Integration ### 6.1 Linking Persons to Institutions ```python from dataclasses import dataclass from typing import Optional from datetime import date @dataclass class GhcidAffiliation: """ Link between a person (PRID) and a heritage institution (GHCID). """ ghcid: str # e.g., "NL-NH-HAA-A-NHA" role_title: Optional[str] = None department: Optional[str] = None start_date: Optional[date] = None end_date: Optional[date] = None is_current: bool = True source_poid: Optional[str] = None confidence: float = 0.9 async def link_person_to_institution( db, prid: str, ghcid: str, role_title: str = None, source_poid: str = None ) -> GhcidAffiliation: """ Create link between person and heritage institution. Args: prid: Person Reconstruction ID ghcid: Global Heritage Custodian ID role_title: Job title at institution source_poid: Observation where affiliation was extracted Returns: Created affiliation record """ # Validate PRID exists reconstruction = await db.get_reconstruction(prid) if not reconstruction: raise ValueError(f"Reconstruction not found: {prid}") # Validate GHCID format if not validate_ghcid_format(ghcid): raise ValueError(f"Invalid GHCID format: {ghcid}") # Create affiliation affiliation = await db.create_ghcid_affiliation( prid_id=reconstruction.id, ghcid=ghcid, role_title=role_title, source_poid_id=source_poid, is_current=True ) return affiliation def validate_ghcid_format(ghcid: str) -> bool: """Validate GHCID format.""" import re # Pattern: CC-RR-SSS-T-ABBREV pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$' return bool(re.match(pattern, ghcid)) ``` ### 6.2 RDF Integration ```turtle @prefix ppid: . @prefix ppidv: . @prefix ghcid: . @prefix org: . @prefix schema: . 
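The format check is pure string validation, so it can be exercised in isolation. A quick sanity check (the function is repeated here so the snippet is self-contained; the surrounding `db` calls are out of scope):

```python
import re


def validate_ghcid_format(ghcid: str) -> bool:
    """Validate GHCID format (pattern: CC-RR-SSS-T-ABBREV)."""
    pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
    return bool(re.match(pattern, ghcid))


# Well-formed: country, region, city, type, abbreviation segments
assert validate_ghcid_format("NL-NH-HAA-A-NHA")
# Lowercase input, wrong segment lengths, or missing segments are rejected
assert not validate_ghcid_format("nl-nh-haa-a-nha")
assert not validate_ghcid_format("NL-NH-HAARLEM-A-NHA")
assert not validate_ghcid_format("NL-NH-HAA-NHA")
```

Note that the regex is anchored on both ends, so trailing garbage after the abbreviation also fails the check.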
### 6.2 RDF Integration

```turtle
@prefix ppid:   <https://ppid.org/> .
@prefix ppidv:  <https://ppid.org/vocab#> .
@prefix ghcid:  <https://ghcid.org/> .    # base URI assumed for the GHCID registry
@prefix org:    <http://www.w3.org/ns/org#> .
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
    a ppidv:PersonReconstruction ;
    schema:name "Jan van den Berg" ;

    # Current employment
    org:memberOf [
        a org:Membership ;
        org:organization ghcid:NL-NH-HAA-A-NHA ;
        org:role [
            a org:Role ;
            schema:name "Senior Archivist"
        ] ;
        ppidv:isCurrent true ;
        schema:startDate "2015"^^xsd:gYear
    ] ;

    # Direct link for simple queries
    ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .

# The heritage institution (from the GHCID system)
ghcid:NL-NH-HAA-A-NHA
    a ppidv:HeritageCustodian ;
    schema:name "Noord-Hollands Archief" ;
    ppidv:ghcid "NL-NH-HAA-A-NHA" ;
    schema:address [
        schema:addressLocality "Haarlem" ;
        schema:addressCountry "NL"
    ] .
```

### 6.3 SPARQL Queries

```sparql
# Find all persons affiliated with a specific institution
PREFIX ppid:   <https://ppid.org/>
PREFIX ppidv:  <https://ppid.org/vocab#>
PREFIX ghcid:  <https://ghcid.org/>
PREFIX schema: <https://schema.org/>
PREFIX org:    <http://www.w3.org/ns/org#>

SELECT ?prid ?name ?role ?isCurrent
WHERE {
    ?prid a ppidv:PersonReconstruction ;
          schema:name ?name ;
          org:memberOf ?membership .
    ?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
                ppidv:isCurrent ?isCurrent .
    OPTIONAL {
        ?membership org:role ?roleNode .
        ?roleNode schema:name ?role .
    }
}
ORDER BY DESC(?isCurrent) ?name

# Find all institutions a person has worked at
# (assumes the same PREFIX declarations as above)
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
    ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .
    ?membership org:organization ?institution ;
                ppidv:isCurrent ?isCurrent .
    ?institution ppidv:ghcid ?ghcid ;
                 schema:name ?institutionName .
    OPTIONAL { ?membership schema:startDate ?startDate }
    OPTIONAL { ?membership schema:endDate ?endDate }
    OPTIONAL {
        ?membership org:role ?roleNode .
        ?roleNode schema:name ?role .
    }
}
ORDER BY DESC(?isCurrent) DESC(?startDate)

# Find archivists across all Dutch archives
# (assumes the same PREFIX declarations as above)
SELECT ?prid ?name ?institution ?institutionName
WHERE {
    ?prid a ppidv:PersonReconstruction ;
          schema:name ?name ;
          org:memberOf ?membership .
    ?membership org:organization ?institution ;
                ppidv:isCurrent true ;
                org:role ?roleNode .
    ?roleNode schema:name ?role .
    FILTER(CONTAINS(LCASE(?role), "archivist"))
    ?institution ppidv:ghcid ?ghcid ;
                 schema:name ?institutionName .
    FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
```
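When queries like these are issued programmatically, the institution ID usually ends up interpolated into the query text. Validating it against the §6.1 format first doubles as injection protection. A minimal sketch — the templating helper and constant names here are illustrative, not part of the spec:

```python
import re

GHCID_PATTERN = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

# Doubled braces survive str.format() as literal SPARQL braces
PERSONS_AT_INSTITUTION = """
PREFIX ppidv:  <https://ppid.org/vocab#>
PREFIX ghcid:  <https://ghcid.org/>
PREFIX schema: <https://schema.org/>
PREFIX org:    <http://www.w3.org/ns/org#>

SELECT ?prid ?name WHERE {{
    ?prid a ppidv:PersonReconstruction ;
          schema:name ?name ;
          org:memberOf ?m .
    ?m org:organization ghcid:{ghcid} .
}}
"""


def persons_at_institution_query(ghcid: str) -> str:
    """Build the per-institution query, rejecting malformed IDs up front."""
    if not GHCID_PATTERN.match(ghcid):
        raise ValueError(f"Invalid GHCID: {ghcid}")
    return PERSONS_AT_INSTITUTION.format(ghcid=ghcid)


query = persons_at_institution_query("NL-NH-HAA-A-NHA")
assert "ghcid:NL-NH-HAA-A-NHA ." in query
```

Because the GHCID grammar only admits uppercase letters, digits, and hyphens, a validated ID can be spliced into the query safely; anything else raises before the query ever reaches the triple store.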
""" credentials_exception = HTTPException( status_code=status.HTTP_401_UNAUTHORIZED, detail="Could not validate credentials", headers={"WWW-Authenticate": "Bearer"}, ) # Try JWT token first if token: try: payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM]) user_id: str = payload.get("sub") if user_id is None: raise credentials_exception user = await db.get_user(user_id) if user is None: raise credentials_exception return user except JWTError: pass # Try API key if api_key: user = await db.get_user_by_api_key(api_key) if user: return user raise credentials_exception def require_role(required_roles: list[str]): """Dependency to require specific roles.""" async def role_checker(user: User = Depends(get_current_user)): if not any(role in user.roles for role in required_roles): raise HTTPException( status_code=status.HTTP_403_FORBIDDEN, detail="Insufficient permissions" ) return user return role_checker ``` ### 7.2 Authorization Roles | Role | Permissions | |------|-------------| | `reader` | Read observations, reconstructions, claims | | `contributor` | Create observations, add claims | | `curator` | Create reconstructions, link observations, resolve conflicts | | `admin` | Manage users, API keys, system configuration | | `api_client` | Programmatic access via API key | ### 7.3 Rate Limiting ```python from fastapi import Request import redis from datetime import datetime class RateLimiter: """ Token bucket rate limiter using Redis. """ def __init__(self, redis_client: redis.Redis): self.redis = redis_client async def is_allowed( self, key: str, max_requests: int = 100, window_seconds: int = 60 ) -> tuple[bool, dict]: """ Check if request is allowed under rate limit. 
Returns: Tuple of (is_allowed, rate_limit_info) """ now = datetime.utcnow().timestamp() window_start = now - window_seconds pipe = self.redis.pipeline() # Remove old requests pipe.zremrangebyscore(key, 0, window_start) # Count requests in window pipe.zcard(key) # Add current request pipe.zadd(key, {str(now): now}) # Set expiry pipe.expire(key, window_seconds) results = pipe.execute() request_count = results[1] is_allowed = request_count < max_requests return is_allowed, { "limit": max_requests, "remaining": max(0, max_requests - request_count - 1), "reset": int(now + window_seconds) } # Rate limit tiers RATE_LIMITS = { "anonymous": {"requests": 60, "window": 60}, "reader": {"requests": 100, "window": 60}, "contributor": {"requests": 500, "window": 60}, "curator": {"requests": 1000, "window": 60}, "api_client": {"requests": 5000, "window": 60}, } ``` --- ## 8. Performance Requirements ### 8.1 SLOs (Service Level Objectives) | Metric | Target | Measurement | |--------|--------|-------------| | **Availability** | 99.9% | Monthly uptime | | **API Latency (p50)** | < 50ms | Response time | | **API Latency (p99)** | < 500ms | Response time | | **Search Latency** | < 200ms | Full-text search | | **SPARQL Query** | < 1s | Simple queries | | **Throughput** | 1000 req/s | Sustained load | ### 8.2 Scaling Strategy ```yaml # Kubernetes HPA (Horizontal Pod Autoscaler) apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: ppid-api spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: ppid-api minReplicas: 3 maxReplicas: 20 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 70 - type: Resource resource: name: memory target: type: Utilization averageUtilization: 80 - type: Pods pods: metric: name: http_requests_per_second target: type: AverageValue averageValue: 100 ``` ### 8.3 Caching Strategy ```python import redis from functools import wraps import json import hashlib class CacheManager: """ Multi-tier 
---

## 8. Performance Requirements

### 8.1 SLOs (Service Level Objectives)

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | 99.9% | Monthly uptime |
| **API Latency (p50)** | < 50ms | Response time |
| **API Latency (p99)** | < 500ms | Response time |
| **Search Latency** | < 200ms | Full-text search |
| **SPARQL Query** | < 1s | Simple queries |
| **Throughput** | 1000 req/s | Sustained load |

### 8.2 Scaling Strategy

```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ppid-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ppid-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
```

### 8.3 Caching Strategy

```python
import redis
from functools import wraps
import json
import hashlib


class CacheManager:
    """
    Multi-tier caching for PPID.

    Tiers:
        1. L1: in-memory (per instance)
        2. L2: Redis (shared)
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.local_cache = {}

    def cache_observation(self, ttl: int = 3600):
        """Cache observation lookups."""
        def decorator(func):
            @wraps(func)
            async def wrapper(poid: str, *args, **kwargs):
                cache_key = f"observation:{poid}"

                # Check L1 cache
                if cache_key in self.local_cache:
                    return self.local_cache[cache_key]

                # Check L2 cache
                cached = self.redis.get(cache_key)
                if cached:
                    data = json.loads(cached)
                    self.local_cache[cache_key] = data
                    return data

                # Fetch from database
                result = await func(poid, *args, **kwargs)
                if result:
                    # Store in both caches
                    self.redis.setex(cache_key, ttl, json.dumps(result))
                    self.local_cache[cache_key] = result
                return result
            return wrapper
        return decorator

    def cache_search(self, ttl: int = 300):
        """Cache search results (shorter TTL)."""
        def decorator(func):
            @wraps(func)
            async def wrapper(query: str, *args, **kwargs):
                # Create a deterministic cache key from the query parameters
                key_data = {"query": query, "args": args, "kwargs": kwargs}
                cache_key = "search:" + hashlib.md5(
                    json.dumps(key_data, sort_keys=True).encode()
                ).hexdigest()

                cached = self.redis.get(cache_key)
                if cached:
                    return json.loads(cached)

                result = await func(query, *args, **kwargs)
                self.redis.setex(cache_key, ttl, json.dumps(result))
                return result
            return wrapper
        return decorator

    def invalidate_observation(self, poid: str):
        """Invalidate caches when an observation is updated.

        Note: this clears Redis and the local L1 cache of the current
        instance only; other instances keep their L1 entry until eviction.
        """
        cache_key = f"observation:{poid}"
        self.redis.delete(cache_key)
        self.local_cache.pop(cache_key, None)
```

---

## 9. Deployment Architecture

### 9.1 Kubernetes Deployment

```yaml
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ppid-api
  labels:
    app: ppid
    component: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ppid
      component: api
  template:
    metadata:
      labels:
        app: ppid
        component: api
    spec:
      containers:
        - name: api
          image: ppid/api:latest
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: ppid-api
spec:
  selector:
    app: ppid
    component: api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ppid-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - api.ppid.org
      secretName: ppid-tls
  rules:
    - host: api.ppid.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ppid-api
                port:
                  number: 80
```
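The probes above expect `/health` and `/ready` endpoints in the API, and the split matters: liveness should only assert the process is responsive, while readiness should verify dependencies so a degraded pod is pulled from the Service without being restarted. A framework-agnostic sketch of that handler logic (the check names are illustrative):

```python
from typing import Callable


def health() -> tuple[int, dict]:
    """Liveness: only asserts the process can serve requests."""
    return 200, {"status": "ok"}


def ready(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Readiness: every dependency must answer; otherwise return 503 so
    the pod is temporarily removed from the Service until it recovers."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    code = 200 if all(results.values()) else 503
    return code, {"status": "ready" if code == 200 else "degraded",
                  "checks": results}


# Example: postgres reachable, redis down -> pod reports not-ready
code, body = ready({"postgres": lambda: True, "redis": lambda: False})
print(code, body["status"])  # 503 degraded
```

Keeping dependency checks out of `/health` avoids restart loops: a Redis outage should make pods not-ready, not make Kubernetes kill and recreate them.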
### 9.2 Docker Compose (Development)

```yaml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
      - REDIS_URL=redis://redis:6379
      - FUSEKI_URL=http://fuseki:3030
    depends_on:
      - postgres
      - redis
      - fuseki
    volumes:
      - ./src:/app/src
      - ./archives:/app/archives

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ppid
      POSTGRES_PASSWORD: ppid
      POSTGRES_DB: ppid
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  fuseki:
    image: stain/jena-fuseki
    environment:
      ADMIN_PASSWORD: admin
      FUSEKI_DATASET_1: ppid
    volumes:
      - fuseki_data:/fuseki
    ports:
      - "3030:3030"

  elasticsearch:
    image: elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1  # required for a single-broker setup
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

volumes:
  postgres_data:
  fuseki_data:
  es_data:
```

---

## 10. Monitoring and Observability

### 10.1 Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
REQUEST_COUNT = Counter(
    'ppid_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'ppid_request_latency_seconds',
    'Request latency in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

OBSERVATIONS_CREATED = Counter(
    'ppid_observations_created_total',
    'Total observations created'
)

RECONSTRUCTIONS_CREATED = Counter(
    'ppid_reconstructions_created_total',
    'Total reconstructions created'
)

ENTITY_RESOLUTION_LATENCY = Histogram(
    'ppid_entity_resolution_seconds',
    'Entity resolution latency',
    buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)

CACHE_HITS = Counter(
    'ppid_cache_hits_total',
    'Cache hits',
    ['cache_type']
)

CACHE_MISSES = Counter(
    'ppid_cache_misses_total',
    'Cache misses',
    ['cache_type']
)

DB_CONNECTIONS = Gauge(
    'ppid_db_connections',
    'Active database connections'
)


# Middleware for request metrics (assumes the FastAPI `app` and `Request`
# from the API module)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    latency = time.time() - start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response
```
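The latency histograms are exported as cumulative bucket counters; dashboards recover quantiles by interpolating linearly inside the first bucket whose cumulative count crosses the target rank, which is conceptually what PromQL's `histogram_quantile` does. A stdlib sketch of that estimate (the bucket data is made up):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, cumulative_count)
    histogram buckets, interpolating linearly inside the crossing bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Cumulative counts for buckets <=0.05s, <=0.1s, <=0.25s, <=0.5s (illustrative)
buckets = [(0.05, 60), (0.1, 80), (0.25, 95), (0.5, 100)]
p50 = histogram_quantile(0.50, buckets)
p99 = histogram_quantile(0.99, buckets)
print(round(p50, 4), round(p99, 2))  # 0.0417 0.45
```

This is why the bucket boundaries above matter: an SLO like "p99 < 500 ms" is only as precise as the buckets around 0.5 s, since the quantile is an interpolation, not an exact percentile.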
### 10.2 Logging

```python
import logging
import json
from datetime import datetime


class JSONFormatter(logging.Formatter):
    """Structured JSON logging for observability."""

    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

        # Add extra fields when present
        if hasattr(record, 'poid'):
            log_record['poid'] = record.poid
        if hasattr(record, 'prid'):
            log_record['prid'] = record.prid
        if hasattr(record, 'request_id'):
            log_record['request_id'] = record.request_id
        if hasattr(record, 'user_id'):
            log_record['user_id'] = record.user_id
        if hasattr(record, 'duration_ms'):
            log_record['duration_ms'] = record.duration_ms

        if record.exc_info:
            log_record['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_record)


def setup_logging():
    """Configure root logging with the JSON formatter."""
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())

    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
```

### 10.3 Grafana Dashboard (JSON)

```json
{
  "dashboard": {
    "title": "PPID System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          { "expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)" }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          { "expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Observations Created",
        "type": "stat",
        "targets": [
          { "expr": "sum(ppid_observations_created_total)" }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "targets": [
          { "expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))" }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          { "expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))" }
        ]
      }
    ]
  }
}
```

---

## 11. Implementation Checklist

### 11.1 Phase 1: Core Infrastructure

- [ ] Set up PostgreSQL database with schema
- [ ] Set up Apache Jena Fuseki for RDF
- [ ] Set up Redis for caching
- [ ] Implement POID/PRID generation
- [ ] Implement checksum validation
- [ ] Create basic REST API endpoints

### 11.2 Phase 2: Data Ingestion

- [ ] Build web scraping infrastructure
- [ ] Implement HTML archival
- [ ] Integrate LLM for extraction
- [ ] Implement XPath validation
- [ ] Set up Kafka for async processing
- [ ] Create ingestion pipeline

### 11.3 Phase 3: Entity Resolution

- [ ] Implement blocking strategies
- [ ] Implement similarity metrics
- [ ] Build clustering algorithm
- [ ] Create human-in-the-loop review UI
- [ ] Integrate with reconstruction creation

### 11.4 Phase 4: GHCID Integration

- [ ] Implement affiliation linking
- [ ] Add SPARQL queries for institution lookups
- [ ] Create bidirectional navigation
- [ ] Sync with GHCID registry updates

### 11.5 Phase 5: Production Readiness

- [ ] Implement authentication/authorization
- [ ] Set up rate limiting
- [ ] Configure monitoring and alerting
- [ ] Create backup and recovery procedures
- [ ] Performance testing and optimization
- [ ] Security audit

---

## 12. References

### Standards

- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/

### Related PPID Documents

- [Identifier Structure Design](./05_identifier_structure_design.md)
- [Entity Resolution Patterns](./06_entity_resolution_patterns.md)
- [Claims and Provenance](./07_claims_and_provenance.md)

### Technologies

- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/