Implementation Guidelines
Version: 0.1.0
Last Updated: 2025-01-09
Related: Identifier Structure | Claims and Provenance
1. Overview
This document provides comprehensive technical specifications for implementing the PPID system:
- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability
2. Architecture Overview
2.1 System Components
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 PPID System                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐            │
│   │  Ingestion   │──────▶│  Processing  │──────▶│   Storage    │            │
│   │    Layer     │       │    Layer     │       │    Layer     │            │
│   └──────────────┘       └──────────────┘       └──────────────┘            │
│          │                      │                      │                    │
│          ▼                      ▼                      ▼                    │
│   ┌──────────────┐       ┌──────────────┐       ┌──────────────┐            │
│   │ Web Scrapers │       │ Entity Res.  │       │  PostgreSQL  │            │
│   │ API Clients  │       │   NER/NLP    │       │ (Relational) │            │
│   │ File Import  │       │  Validation  │       ├──────────────┤            │
│   └──────────────┘       └──────────────┘       │ Apache Jena  │            │
│                                                 │ (RDF/SPARQL) │            │
│                                                 ├──────────────┤            │
│                                                 │    Redis     │            │
│                                                 │   (Cache)    │            │
│                                                 └──────────────┘            │
│                                      │                                      │
│                  ┌───────────────────┴───────────────────┐                  │
│                  ▼                                       ▼                  │
│           ┌──────────────┐                          ┌──────────┐            │
│           │   REST API   │                          │ GraphQL  │            │
│           │  (FastAPI)   │                          │ (Ariadne)│            │
│           └──────────────┘                          └──────────┘            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
2.2 Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI (Python 3.11+) | REST API, async support |
| GraphQL | Ariadne | GraphQL endpoint |
| Relational DB | PostgreSQL 16 | Primary data store |
| Triple Store | Apache Jena Fuseki | RDF/SPARQL queries |
| Cache | Redis 7 | Session, rate limiting, caching |
| Queue | Apache Kafka | Async processing pipeline |
| Search | Elasticsearch 8 | Full-text search |
| Object Storage | MinIO / S3 | HTML archives, files |
| Container | Docker + Kubernetes | Deployment |
| Monitoring | Prometheus + Grafana | Metrics and alerting |
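For local development, the stack above can be stood up with a compose file. A minimal sketch (image tags, the community Fuseki image, and the port mappings are assumptions, not pinned choices of this document; Kafka, Elasticsearch, and MinIO are omitted for brevity):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ppid
      POSTGRES_PASSWORD: changeme    # dev only, never in production
    ports: ["5432:5432"]
  fuseki:
    image: stain/jena-fuseki         # community image, assumption
    environment:
      ADMIN_PASSWORD: changeme
    ports: ["3030:3030"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
  api:
    build: .
    depends_on: [postgres, fuseki, redis]
    ports: ["8000:8000"]
```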
3. Database Schema
3.1 PostgreSQL Schema
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm"; -- For fuzzy matching
-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
'official_registry',
'institutional_website',
'professional_network',
'social_media',
'news_article',
'academic_publication',
'user_submitted',
'inferred'
);
-- Person Observations (POID)
CREATE TABLE person_observations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
poid VARCHAR(24) UNIQUE NOT NULL, -- POID-xxxx-xxxx-xxxx-xxxx
-- Source metadata
source_url TEXT NOT NULL,
source_type source_type NOT NULL,
retrieved_at TIMESTAMPTZ NOT NULL,
content_hash VARCHAR(64) NOT NULL, -- SHA-256
html_archive_path TEXT,
-- Extracted name components (PNV-compatible)
literal_name TEXT,
given_name TEXT,
surname TEXT,
surname_prefix TEXT, -- van, de, etc.
patronymic TEXT,
generation_suffix TEXT, -- Jr., III, etc.
-- Metadata
extraction_agent VARCHAR(100),
extraction_confidence DECIMAL(3,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
-- Indexes
CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);
-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid VARCHAR(24) UNIQUE NOT NULL, -- PRID-xxxx-xxxx-xxxx-xxxx
-- Canonical name (resolved from observations)
canonical_name TEXT NOT NULL,
given_name TEXT,
surname TEXT,
surname_prefix TEXT,
    -- Curation metadata (assumes a separate users table, not shown here)
    curator_id UUID REFERENCES users(id),
curation_method VARCHAR(50), -- 'manual', 'algorithmic', 'hybrid'
confidence_score DECIMAL(3,2),
-- Versioning
version INTEGER DEFAULT 1,
previous_version_id UUID REFERENCES person_reconstructions(id),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);
-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
linked_at TIMESTAMPTZ DEFAULT NOW(),
linked_by UUID REFERENCES users(id),
link_confidence DECIMAL(3,2),
PRIMARY KEY (prid_id, poid_id)
);
-- Claims (assertions with provenance)
CREATE TABLE claims (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
claim_id VARCHAR(50) UNIQUE NOT NULL,
-- Subject (what this claim is about)
poid_id UUID REFERENCES person_observations(id),
-- Claim content
claim_type VARCHAR(50) NOT NULL, -- 'job_title', 'employer', 'email', etc.
claim_value TEXT NOT NULL,
-- Provenance (MANDATORY per Rule 6)
source_url TEXT NOT NULL,
retrieved_on TIMESTAMPTZ NOT NULL,
xpath TEXT NOT NULL,
html_file TEXT NOT NULL,
xpath_match_score DECIMAL(3,2) NOT NULL,
content_hash VARCHAR(64),
-- Quality
confidence DECIMAL(3,2),
extraction_agent VARCHAR(100),
status claim_status DEFAULT 'active',
-- Relationships
supersedes_id UUID REFERENCES claims(id),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
verified_at TIMESTAMPTZ,
    -- Constraints
    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);
-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
relationship_type VARCHAR(20) NOT NULL, -- 'supports', 'conflicts_with'
notes TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);
-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
identifier_scheme VARCHAR(20) NOT NULL, -- 'orcid', 'isni', 'viaf', 'wikidata'
identifier_value VARCHAR(100) NOT NULL,
verified BOOLEAN DEFAULT FALSE,
verified_at TIMESTAMPTZ,
source_url TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (prid_id, identifier_scheme, identifier_value)
);
-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
ghcid VARCHAR(50) NOT NULL, -- e.g., NL-NH-HAA-A-NHA
-- Affiliation details
role_title TEXT,
department TEXT,
affiliation_start DATE,
affiliation_end DATE,
is_current BOOLEAN DEFAULT TRUE,
-- Provenance
source_poid_id UUID REFERENCES person_observations(id),
confidence DECIMAL(3,2),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
USING gin (literal_name gin_trgm_ops);
CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
USING gin (canonical_name gin_trgm_ops);
CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);
CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
WHERE is_current = TRUE;
-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
USING gin (to_tsvector('english', canonical_name));
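The POID/PRID CHECK constraints above can be mirrored in application code before a round trip to the database. A minimal sketch (format check only; checksum semantics live in `ppid.identifiers` and are not reproduced here):

```python
import re

# Mirrors the valid_poid / valid_prid CHECK constraints: four lowercase-hex
# groups, where the final character may also be the literal check character
# "x" or "X".
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
)

def check_ppid_format(value: str) -> bool:
    """Return True if the string is a well-formed POID or PRID."""
    return PPID_PATTERN.match(value) is not None

print(check_ppid_format("POID-7a3b-c4d5-e6f7-890X"))  # True
print(check_ppid_format("PRID-1234-5678-90ab-cde5"))  # True
print(check_ppid_format("POID-8c4d-e5f6-g7h8-901Y"))  # False: "g" and "Y" are not hex
```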
3.2 RDF Triple Store Schema
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
@prefix picom: <https://personsincontext.org/model#> .
@prefix pnv: <https://w3id.org/pnv#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix schema: <http://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X a ppidt:PersonObservation, picom:PersonObservation ;
ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;
# Name (PNV structured)
pnv:hasName [
a pnv:PersonName ;
pnv:literalName "Jan van den Berg" ;
pnv:givenName "Jan" ;
pnv:surnamePrefix "van den" ;
pnv:baseSurname "Berg"
] ;
# Provenance
prov:wasDerivedFrom <https://linkedin.com/in/jan-van-den-berg> ;
prov:wasGeneratedBy ppid:extraction-activity-001 ;
prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;
# Claims
ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .
# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5 a ppidt:PersonReconstruction, picom:PersonReconstruction ;
ppidv:prid "PRID-1234-5678-90ab-cde5" ;
# Canonical name
schema:name "Jan van den Berg" ;
pnv:hasName [
a pnv:PersonName ;
pnv:literalName "Jan van den Berg" ;
pnv:givenName "Jan" ;
pnv:surnamePrefix "van den" ;
pnv:baseSurname "Berg"
] ;
# Derived from observations
prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
    ppid:POID-8c4d-e5f6-a7b8-9012 ;
# GHCID affiliation
ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;
# External identifiers
ppidv:orcid "0000-0002-1234-5678" ;
owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
4. API Design
4.1 REST API Endpoints
openapi: 3.1.0
info:
title: PPID API
version: 1.0.0
description: Person Persistent Identifier API
servers:
- url: https://api.ppid.org/v1
paths:
# Person Observations
/observations:
post:
summary: Create new person observation
operationId: createObservation
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateObservationRequest'
responses:
'201':
description: Observation created
content:
application/json:
schema:
$ref: '#/components/schemas/PersonObservation'
get:
summary: Search observations
operationId: searchObservations
parameters:
- name: name
in: query
schema:
type: string
- name: source_url
in: query
schema:
type: string
- name: limit
in: query
schema:
type: integer
default: 20
- name: offset
in: query
schema:
type: integer
default: 0
responses:
'200':
description: List of observations
content:
application/json:
schema:
$ref: '#/components/schemas/ObservationList'
/observations/{poid}:
get:
summary: Get observation by POID
operationId: getObservation
parameters:
- name: poid
in: path
required: true
schema:
type: string
pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
responses:
'200':
description: Person observation
content:
application/json:
schema:
$ref: '#/components/schemas/PersonObservation'
text/turtle:
schema:
type: string
application/ld+json:
schema:
type: object
/observations/{poid}/claims:
get:
summary: Get claims for observation
operationId: getObservationClaims
parameters:
- name: poid
in: path
required: true
schema:
type: string
responses:
'200':
description: List of claims
content:
application/json:
schema:
$ref: '#/components/schemas/ClaimList'
# Person Reconstructions
/reconstructions:
post:
summary: Create person reconstruction from observations
operationId: createReconstruction
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateReconstructionRequest'
responses:
'201':
description: Reconstruction created
content:
application/json:
schema:
$ref: '#/components/schemas/PersonReconstruction'
/reconstructions/{prid}:
get:
summary: Get reconstruction by PRID
operationId: getReconstruction
parameters:
- name: prid
in: path
required: true
schema:
type: string
responses:
'200':
description: Person reconstruction
content:
application/json:
schema:
$ref: '#/components/schemas/PersonReconstruction'
  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations
  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history
# Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
content:
application/json:
schema:
type: object
properties:
valid:
type: boolean
type:
type: string
enum: [POID, PRID]
exists:
type: boolean
# Search
/search:
get:
summary: Full-text search across all records
operationId: search
parameters:
- name: q
in: query
required: true
schema:
type: string
- name: type
in: query
schema:
type: string
enum: [observation, reconstruction, all]
default: all
responses:
'200':
description: Search results
# Entity Resolution
/resolve:
post:
summary: Find matching records for input data
operationId: resolveEntity
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/EntityResolutionRequest'
responses:
'200':
description: Resolution candidates
content:
application/json:
schema:
$ref: '#/components/schemas/ResolutionCandidates'
components:
schemas:
CreateObservationRequest:
type: object
required:
- source_url
- retrieved_at
- claims
properties:
source_url:
type: string
format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network,
                 social_media, news_article, academic_publication, user_submitted,
                 inferred]
retrieved_at:
type: string
format: date-time
content_hash:
type: string
html_archive_path:
type: string
extraction_agent:
type: string
claims:
type: array
items:
$ref: '#/components/schemas/ClaimInput'
ClaimInput:
type: object
required:
- claim_type
- claim_value
- xpath
- xpath_match_score
properties:
claim_type:
type: string
claim_value:
type: string
xpath:
type: string
xpath_match_score:
type: number
minimum: 0
maximum: 1
confidence:
type: number
PersonObservation:
type: object
properties:
poid:
type: string
source_url:
type: string
literal_name:
type: string
claims:
type: array
items:
$ref: '#/components/schemas/Claim'
created_at:
type: string
format: date-time
CreateReconstructionRequest:
type: object
required:
- observation_ids
properties:
observation_ids:
type: array
items:
type: string
minItems: 1
canonical_name:
type: string
external_identifiers:
type: object
PersonReconstruction:
type: object
properties:
prid:
type: string
canonical_name:
type: string
observations:
type: array
items:
type: string
external_identifiers:
type: object
ghcid_affiliations:
type: array
items:
$ref: '#/components/schemas/GhcidAffiliation'
GhcidAffiliation:
type: object
properties:
ghcid:
type: string
role_title:
type: string
is_current:
type: boolean
securitySchemes:
bearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
apiKey:
type: apiKey
in: header
name: X-API-Key
security:
- bearerAuth: []
- apiKey: []
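An example CreateObservationRequest body satisfying the schema above (the URL, XPath, and hash values are illustrative):

```json
{
  "source_url": "https://example.org/staff/jan-van-den-berg",
  "source_type": "institutional_website",
  "retrieved_at": "2025-01-09T14:30:00Z",
  "content_hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Jan van den Berg",
      "xpath": "//h1[@class='staff-name']",
      "xpath_match_score": 1.0,
      "confidence": 0.95
    }
  ]
}
```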
4.2 GraphQL Schema
type Query {
# Observations
observation(poid: ID!): PersonObservation
observations(
name: String
sourceUrl: String
limit: Int = 20
offset: Int = 0
): ObservationConnection!
# Reconstructions
reconstruction(prid: ID!): PersonReconstruction
reconstructions(
name: String
ghcid: String
limit: Int = 20
offset: Int = 0
): ReconstructionConnection!
# Search
search(query: String!, type: SearchType = ALL): SearchResults!
# Validation
validatePpid(ppid: String!): ValidationResult!
# Entity Resolution
resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}
type Mutation {
# Create observation
createObservation(input: CreateObservationInput!): PersonObservation!
# Create reconstruction
createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!
# Link observation to reconstruction
linkObservation(prid: ID!, poid: ID!): PersonReconstruction!
# Update reconstruction
updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!
# Add external identifier
addExternalIdentifier(
prid: ID!
scheme: IdentifierScheme!
value: String!
): PersonReconstruction!
# Add GHCID affiliation
addGhcidAffiliation(
prid: ID!
ghcid: String!
roleTitle: String
isCurrent: Boolean = true
): PersonReconstruction!
}
type PersonObservation {
poid: ID!
sourceUrl: String!
sourceType: SourceType!
retrievedAt: DateTime!
contentHash: String
htmlArchivePath: String
# Name components
literalName: String
givenName: String
surname: String
surnamePrefix: String
# Related data
claims: [Claim!]!
linkedReconstructions: [PersonReconstruction!]!
# Metadata
extractionAgent: String
extractionConfidence: Float
createdAt: DateTime!
}
type PersonReconstruction {
prid: ID!
canonicalName: String!
givenName: String
surname: String
surnamePrefix: String
# Linked observations
observations: [PersonObservation!]!
# External identifiers
orcid: String
isni: String
viaf: String
wikidata: String
externalIdentifiers: [ExternalIdentifier!]!
# GHCID affiliations
ghcidAffiliations: [GhcidAffiliation!]!
currentAffiliations: [GhcidAffiliation!]!
# Versioning
version: Int!
previousVersion: PersonReconstruction
history: [PersonReconstruction!]!
# Curation
curator: User
curationMethod: CurationMethod
confidenceScore: Float
createdAt: DateTime!
updatedAt: DateTime!
}
type Claim {
id: ID!
claimType: ClaimType!
claimValue: String!
# Provenance (MANDATORY)
sourceUrl: String!
retrievedOn: DateTime!
xpath: String!
htmlFile: String!
xpathMatchScore: Float!
# Quality
confidence: Float
extractionAgent: String
status: ClaimStatus!
# Relationships
supports: [Claim!]!
conflictsWith: [Claim!]!
supersedes: Claim
createdAt: DateTime!
}
type GhcidAffiliation {
ghcid: String!
institution: HeritageCustodian # Resolved from GHCID
roleTitle: String
department: String
startDate: Date
endDate: Date
isCurrent: Boolean!
confidence: Float
}
type HeritageCustodian {
ghcid: String!
name: String!
institutionType: String!
city: String
country: String
}
type ExternalIdentifier {
scheme: IdentifierScheme!
value: String!
verified: Boolean!
verifiedAt: DateTime
}
enum SourceType {
OFFICIAL_REGISTRY
INSTITUTIONAL_WEBSITE
PROFESSIONAL_NETWORK
SOCIAL_MEDIA
NEWS_ARTICLE
ACADEMIC_PUBLICATION
USER_SUBMITTED
INFERRED
}
enum ClaimType {
FULL_NAME
GIVEN_NAME
FAMILY_NAME
JOB_TITLE
EMPLOYER
EMPLOYER_GHCID
EMAIL
LINKEDIN_URL
ORCID
BIRTH_DATE
EDUCATION
}
enum ClaimStatus {
ACTIVE
SUPERSEDED
RETRACTED
}
enum IdentifierScheme {
ORCID
ISNI
VIAF
WIKIDATA
LOC_NAF
}
enum CurationMethod {
MANUAL
ALGORITHMIC
HYBRID
}
enum SearchType {
OBSERVATION
RECONSTRUCTION
ALL
}
input CreateObservationInput {
sourceUrl: String!
sourceType: SourceType!
retrievedAt: DateTime!
contentHash: String
htmlArchivePath: String
extractionAgent: String
claims: [ClaimInput!]!
}
input ClaimInput {
claimType: ClaimType!
claimValue: String!
xpath: String!
xpathMatchScore: Float!
confidence: Float
}
input CreateReconstructionInput {
observationIds: [ID!]!
canonicalName: String
externalIdentifiers: ExternalIdentifiersInput
}
input EntityResolutionInput {
name: String!
employer: String
jobTitle: String
email: String
linkedinUrl: String
}
type ResolutionCandidate {
reconstruction: PersonReconstruction
observation: PersonObservation
matchScore: Float!
matchFactors: [MatchFactor!]!
}
type MatchFactor {
field: String!
score: Float!
method: String!
}
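A typical query against this schema, fetching a reconstruction together with its linked observations and their provenance-bearing claims:

```graphql
query GetPerson($prid: ID!) {
  reconstruction(prid: $prid) {
    prid
    canonicalName
    version
    observations {
      poid
      sourceUrl
      claims {
        claimType
        claimValue
        xpath
        xpathMatchScore
      }
    }
    ghcidAffiliations {
      ghcid
      roleTitle
      isCurrent
    }
  }
}
```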
4.3 FastAPI Implementation
from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
app = FastAPI(
title="PPID API",
description="Person Persistent Identifier API",
version="1.0.0"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# --- Pydantic Models ---
class ClaimInput(BaseModel):
claim_type: str
claim_value: str
xpath: str
xpath_match_score: float = Field(ge=0, le=1)
confidence: Optional[float] = Field(None, ge=0, le=1)
class CreateObservationRequest(BaseModel):
source_url: str
source_type: str = "institutional_website"
retrieved_at: datetime
content_hash: Optional[str] = None
html_archive_path: Optional[str] = None
extraction_agent: Optional[str] = None
claims: list[ClaimInput]
class PersonObservationResponse(BaseModel):
poid: str
source_url: str
source_type: str
retrieved_at: datetime
literal_name: Optional[str] = None
claims: list[dict]
created_at: datetime
class CreateReconstructionRequest(BaseModel):
observation_ids: list[str]
canonical_name: Optional[str] = None
external_identifiers: Optional[dict] = None
class ValidationResult(BaseModel):
valid: bool
ppid_type: Optional[str] = None
exists: Optional[bool] = None
error: Optional[str] = None
# --- Dependencies ---
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

async def get_db():
    """Database connection dependency."""
    # In production, use a connection pool
    pass

async def get_current_user(token: str = Depends(oauth2_scheme)):
    """Authenticate user from API key or JWT."""
    pass
# --- Endpoints ---
@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
request: CreateObservationRequest,
db = Depends(get_db)
):
"""
Create a new Person Observation from extracted data.
The POID is generated deterministically from source metadata.
"""
from ppid.identifiers import generate_poid
# Generate deterministic POID
poid = generate_poid(
source_url=request.source_url,
retrieval_timestamp=request.retrieved_at.isoformat(),
content_hash=request.content_hash or ""
)
# Check for existing observation with same POID
existing = await db.get_observation(poid)
if existing:
return existing
# Extract name from claims
literal_name = None
for claim in request.claims:
if claim.claim_type == "full_name":
literal_name = claim.claim_value
break
# Create observation record
observation = await db.create_observation(
poid=poid,
source_url=request.source_url,
source_type=request.source_type,
retrieved_at=request.retrieved_at,
content_hash=request.content_hash,
html_archive_path=request.html_archive_path,
literal_name=literal_name,
extraction_agent=request.extraction_agent,
        claims=[c.model_dump() for c in request.claims]
)
return observation
@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db = Depends(get_db)):
"""Get Person Observation by POID."""
from ppid.identifiers import validate_ppid
is_valid, error = validate_ppid(poid)
if not is_valid:
raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")
observation = await db.get_observation(poid)
if not observation:
raise HTTPException(status_code=404, detail="Observation not found")
return observation
@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db = Depends(get_db)):
"""Validate PPID format, checksum, and existence."""
from ppid.identifiers import validate_ppid_full
is_valid, error = validate_ppid_full(ppid)
if not is_valid:
return ValidationResult(valid=False, error=error)
ppid_type = "POID" if ppid.startswith("POID") else "PRID"
# Check existence
if ppid_type == "POID":
exists = await db.observation_exists(ppid)
else:
exists = await db.reconstruction_exists(ppid)
return ValidationResult(
valid=True,
ppid_type=ppid_type,
exists=exists
)
@app.post("/api/v1/reconstructions")
async def create_reconstruction(
request: CreateReconstructionRequest,
db = Depends(get_db),
user = Depends(get_current_user)
):
"""Create Person Reconstruction from linked observations."""
from ppid.identifiers import generate_prid
# Validate all POIDs exist
for poid in request.observation_ids:
if not await db.observation_exists(poid):
raise HTTPException(
status_code=400,
detail=f"Observation not found: {poid}"
)
# Generate deterministic PRID
prid = generate_prid(
observation_ids=request.observation_ids,
curator_id=str(user.id),
timestamp=datetime.utcnow().isoformat()
)
# Determine canonical name
if request.canonical_name:
canonical_name = request.canonical_name
else:
# Use name from highest-confidence observation
observations = await db.get_observations(request.observation_ids)
canonical_name = max(
observations,
key=lambda o: o.extraction_confidence or 0
).literal_name
# Create reconstruction
reconstruction = await db.create_reconstruction(
prid=prid,
canonical_name=canonical_name,
observation_ids=request.observation_ids,
curator_id=user.id,
external_identifiers=request.external_identifiers
)
return reconstruction
@app.get("/api/v1/search")
async def search(
q: str = Query(..., min_length=2),
    type: str = Query("all", pattern="^(observation|reconstruction|all)$"),
limit: int = Query(20, ge=1, le=100),
offset: int = Query(0, ge=0),
db = Depends(get_db)
):
"""Full-text search across observations and reconstructions."""
results = await db.search(
query=q,
search_type=type,
limit=limit,
offset=offset
)
return results
@app.post("/api/v1/resolve")
async def resolve_entity(
name: str,
employer: Optional[str] = None,
job_title: Optional[str] = None,
email: Optional[str] = None,
db = Depends(get_db)
):
"""
Entity resolution: find matching records for input data.
Returns ranked candidates with match scores.
"""
from ppid.entity_resolution import find_candidates
candidates = await find_candidates(
db=db,
name=name,
employer=employer,
job_title=job_title,
email=email
)
return {
"candidates": candidates,
"query": {
"name": name,
"employer": employer,
"job_title": job_title,
"email": email
}
}
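The endpoints above import `generate_poid` from `ppid.identifiers` without showing it. A minimal sketch of a deterministic generator consistent with the POID pattern used throughout this document (the real module presumably also computes a proper check character; drawing the four groups straight from the digest is an assumption for illustration):

```python
import hashlib

def generate_poid(source_url: str, retrieval_timestamp: str, content_hash: str) -> str:
    """Derive a deterministic POID from source metadata.

    The same inputs always yield the same identifier, which is what makes
    ingestion idempotent (see the existence check in create_observation).
    """
    digest = hashlib.sha256(
        f"{source_url}|{retrieval_timestamp}|{content_hash}".encode()
    ).hexdigest()
    # Four 4-char hex groups drawn from the digest: POID-xxxx-xxxx-xxxx-xxxx
    groups = [digest[i:i + 4] for i in range(0, 16, 4)]
    return "POID-" + "-".join(groups)

poid = generate_poid(
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00+00:00",
    "9f86d081884c7d65",
)
# Deterministic: repeating the call with identical inputs yields the same POID
assert poid == generate_poid(
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00+00:00",
    "9f86d081884c7d65",
)
```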
5. Data Ingestion Pipeline
5.1 Pipeline Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                        DATA INGESTION PIPELINE                         │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │   Source   │───▶│  Extract   │───▶│ Transform  │───▶│    Load    │  │
│  │   Fetch    │    │   (NER)    │    │ (Validate) │    │  (Store)   │  │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘  │
│        │                 │                 │                 │         │
│        ▼                 ▼                 ▼                 ▼         │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │  Archive   │    │   Claims   │    │    POID    │    │  Postgres  │  │
│  │    HTML    │    │   XPath    │    │  Generate  │    │   + RDF    │  │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
5.2 Pipeline Implementation
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator
import hashlib
import asyncio
from kafka import KafkaProducer
@dataclass
class SourceDocument:
url: str
html_content: str
retrieved_at: datetime
content_hash: str
archive_path: str
@dataclass
class ExtractedClaim:
claim_type: str
claim_value: str
xpath: str
xpath_match_score: float
confidence: float
@dataclass
class PersonObservationData:
source: SourceDocument
claims: list[ExtractedClaim]
poid: str
class IngestionPipeline:
"""
Main data ingestion pipeline for PPID.
Stages:
1. Fetch: Retrieve web pages, archive HTML
2. Extract: NER/NLP to identify person data, generate claims with XPath
3. Transform: Validate, generate POID, structure data
4. Load: Store in PostgreSQL and RDF triple store
"""
def __init__(
self,
db_pool,
rdf_store,
kafka_producer: KafkaProducer,
archive_storage,
llm_extractor
):
self.db = db_pool
self.rdf = rdf_store
self.kafka = kafka_producer
self.archive = archive_storage
self.extractor = llm_extractor
async def process_url(self, url: str) -> list[PersonObservationData]:
"""
Full pipeline for a single URL.
Returns list of PersonObservations extracted from the page.
"""
# Stage 1: Fetch and archive
source = await self._fetch_and_archive(url)
# Stage 2: Extract claims with XPath
observations = await self._extract_observations(source)
        # Stage 3: Transform and validate
        validated = await self._transform_and_validate(observations, source)
# Stage 4: Load to databases
await self._load_observations(validated)
return validated
async def _fetch_and_archive(self, url: str) -> SourceDocument:
"""Fetch URL and archive HTML."""
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch()
page = await browser.new_page()
await page.goto(url, wait_until='networkidle')
html_content = await page.content()
await browser.close()
# Calculate content hash
content_hash = hashlib.sha256(html_content.encode()).hexdigest()
# Archive HTML
retrieved_at = datetime.utcnow()
archive_path = await self.archive.store(
url=url,
content=html_content,
timestamp=retrieved_at
)
return SourceDocument(
url=url,
html_content=html_content,
retrieved_at=retrieved_at,
content_hash=content_hash,
archive_path=archive_path
)
    async def _extract_observations(
        self,
        source: SourceDocument
    ) -> list[tuple[list[ExtractedClaim], str | None]]:
"""
Extract person observations with XPath provenance.
Uses LLM for extraction, then validates XPath.
"""
from lxml import html
# Parse HTML
tree = html.fromstring(source.html_content)
# Use LLM to extract person data with XPath
extraction_result = await self.extractor.extract_persons(
html_content=source.html_content,
source_url=source.url
)
observations = []
for person in extraction_result.persons:
validated_claims = []
for claim in person.claims:
# Verify XPath points to expected value
try:
elements = tree.xpath(claim.xpath)
if elements:
actual_value = elements[0].text_content().strip()
# Calculate match score
if actual_value == claim.claim_value:
match_score = 1.0
else:
from difflib import SequenceMatcher
match_score = SequenceMatcher(
None, actual_value, claim.claim_value
).ratio()
if match_score >= 0.8: # Accept if 80%+ match
validated_claims.append(ExtractedClaim(
claim_type=claim.claim_type,
claim_value=claim.claim_value,
xpath=claim.xpath,
xpath_match_score=match_score,
confidence=claim.confidence * match_score
))
except Exception as e:
# Skip claims with invalid XPath
continue
if validated_claims:
# Get literal name from claims
literal_name = next(
(c.claim_value for c in validated_claims if c.claim_type == 'full_name'),
None
)
observations.append((validated_claims, literal_name))
return observations
async def _transform_and_validate(
self,
observations: list[tuple[list[ExtractedClaim], str]],
source: SourceDocument
) -> list[PersonObservationData]:
"""Transform extracted data and generate POIDs."""
from ppid.identifiers import generate_poid
results = []
for claims, literal_name in observations:
# Generate deterministic POID
claims_hash = hashlib.sha256(
str(sorted([c.claim_value for c in claims])).encode()
).hexdigest()
poid = generate_poid(
source_url=source.url,
retrieval_timestamp=source.retrieved_at.isoformat(),
content_hash=f"{source.content_hash}:{claims_hash}"
)
results.append(PersonObservationData(
source=source,
claims=claims,
poid=poid
))
return results
async def _load_observations(
self,
observations: list[PersonObservationData]
) -> None:
"""Load observations to PostgreSQL and RDF store."""
for obs in observations:
# Check if already exists (idempotent)
existing = await self.db.get_observation(obs.poid)
if existing:
continue
# Insert to PostgreSQL
await self.db.create_observation(
poid=obs.poid,
source_url=obs.source.url,
source_type='institutional_website',
retrieved_at=obs.source.retrieved_at,
content_hash=obs.source.content_hash,
html_archive_path=obs.source.archive_path,
literal_name=next(
(c.claim_value for c in obs.claims if c.claim_type == 'full_name'),
None
),
claims=[{
'claim_type': c.claim_type,
'claim_value': c.claim_value,
'xpath': c.xpath,
'xpath_match_score': c.xpath_match_score,
'confidence': c.confidence
} for c in obs.claims]
)
# Insert to RDF triple store
await self._insert_rdf(obs)
# Publish to Kafka for downstream processing
self.kafka.send('ppid.observations.created', {
'poid': obs.poid,
'source_url': obs.source.url,
'timestamp': obs.source.retrieved_at.isoformat()
})
async def _insert_rdf(self, obs: PersonObservationData) -> None:
"""Insert observation as RDF triples."""
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD
PPID = Namespace("https://ppid.org/")
PPIDV = Namespace("https://ppid.org/vocab#")
PROV = Namespace("http://www.w3.org/ns/prov#")
g = Graph()
obs_uri = PPID[obs.poid]
g.add((obs_uri, RDF.type, PPIDV.PersonObservation))
g.add((obs_uri, PPIDV.poid, Literal(obs.poid)))
g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url)))
g.add((obs_uri, PROV.generatedAtTime, Literal(
obs.source.retrieved_at.isoformat(),
datatype=XSD.dateTime
)))
# Add claims
for i, claim in enumerate(obs.claims):
claim_uri = PPID[f"{obs.poid}/claim/{i}"]
g.add((obs_uri, PPIDV.hasClaim, claim_uri))
g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type)))
g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value)))
g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath)))
g.add((claim_uri, PPIDV.xpathMatchScore, Literal(
claim.xpath_match_score, datatype=XSD.decimal
)))
# Insert to triple store
await self.rdf.insert(g)
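The determinism of the claims hash above can be checked in isolation. The sketch below (stdlib only, with hypothetical claim values) shows that the same set of claim values yields the same digest regardless of extraction order, which is what makes POID generation idempotent across repeated ingestion runs:

```python
import hashlib

def claims_hash(claim_values: list[str]) -> str:
    """Order-independent digest over claim values, as in _transform_and_validate."""
    return hashlib.sha256(str(sorted(claim_values)).encode()).hexdigest()

# Hypothetical claims extracted in two different orders
a = claims_hash(["Jan van den Berg", "Senior Archivist", "j.vandenberg@example.org"])
b = claims_hash(["Senior Archivist", "j.vandenberg@example.org", "Jan van den Berg"])
assert a == b  # same observation content, same hash, same POID
```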
6. GHCID Integration
6.1 Linking Persons to Institutions
from dataclasses import dataclass
from typing import Optional
from datetime import date
@dataclass
class GhcidAffiliation:
"""
Link between a person (PRID) and a heritage institution (GHCID).
"""
ghcid: str # e.g., "NL-NH-HAA-A-NHA"
role_title: Optional[str] = None
department: Optional[str] = None
start_date: Optional[date] = None
end_date: Optional[date] = None
is_current: bool = True
source_poid: Optional[str] = None
confidence: float = 0.9
async def link_person_to_institution(
db,
prid: str,
ghcid: str,
    role_title: Optional[str] = None,
    source_poid: Optional[str] = None
) -> GhcidAffiliation:
"""
Create link between person and heritage institution.
Args:
prid: Person Reconstruction ID
ghcid: Global Heritage Custodian ID
role_title: Job title at institution
source_poid: Observation where affiliation was extracted
Returns:
Created affiliation record
"""
# Validate PRID exists
reconstruction = await db.get_reconstruction(prid)
if not reconstruction:
raise ValueError(f"Reconstruction not found: {prid}")
# Validate GHCID format
if not validate_ghcid_format(ghcid):
raise ValueError(f"Invalid GHCID format: {ghcid}")
# Create affiliation
affiliation = await db.create_ghcid_affiliation(
prid_id=reconstruction.id,
ghcid=ghcid,
role_title=role_title,
source_poid_id=source_poid,
is_current=True
)
return affiliation
def validate_ghcid_format(ghcid: str) -> bool:
"""Validate GHCID format."""
import re
# Pattern: CC-RR-SSS-T-ABBREV
pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
return bool(re.match(pattern, ghcid))
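As a quick sanity check, the pattern accepts the canonical five-segment `CC-RR-SSS-T-ABBREV` form and rejects near misses (the rejected strings are illustrative):

```python
import re

GHCID_PATTERN = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

def validate_ghcid_format(ghcid: str) -> bool:
    """Validate the CC-RR-SSS-T-ABBREV shape of a GHCID."""
    return bool(GHCID_PATTERN.match(ghcid))

assert validate_ghcid_format("NL-NH-HAA-A-NHA")      # canonical example
assert not validate_ghcid_format("NL-NH-HAA-NHA")    # missing type segment
assert not validate_ghcid_format("nl-nh-haa-a-nha")  # lowercase not allowed
```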
6.2 RDF Integration
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ghcid: <https://w3id.org/heritage/custodian/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
a ppidv:PersonReconstruction ;
schema:name "Jan van den Berg" ;
# Current employment
org:memberOf [
a org:Membership ;
org:organization ghcid:NL-NH-HAA-A-NHA ;
org:role [
a org:Role ;
schema:name "Senior Archivist"
] ;
ppidv:isCurrent true ;
schema:startDate "2015"^^xsd:gYear
] ;
# Direct link for simple queries
ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .
# The heritage institution (from GHCID system)
ghcid:NL-NH-HAA-A-NHA
a ppidv:HeritageCustodian ;
schema:name "Noord-Hollands Archief" ;
ppidv:ghcid "NL-NH-HAA-A-NHA" ;
schema:address [
schema:addressLocality "Haarlem" ;
schema:addressCountry "NL"
] .
6.3 SPARQL Queries
# Find all persons affiliated with a specific institution
PREFIX ppid: <https://ppid.org/>
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>
SELECT ?prid ?name ?role ?isCurrent
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
ppidv:isCurrent ?isCurrent .
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) ?name
# Find all institutions a person has worked at
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent ?isCurrent .
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
OPTIONAL { ?membership schema:startDate ?startDate }
OPTIONAL { ?membership schema:endDate ?endDate }
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) DESC(?startDate)
# Find archivists across all Dutch archives
SELECT ?prid ?name ?institution ?institutionName
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent true ;
org:role ?roleNode .
?roleNode schema:name ?role .
FILTER(CONTAINS(LCASE(?role), "archivist"))
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
7. Security and Access Control
7.1 Authentication
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, APIKeyHeader
from jose import JWTError, jwt
from passlib.context import CryptContext
from datetime import datetime, timedelta, timezone
from typing import Optional
import os

# Configuration
SECRET_KEY = os.environ["JWT_SECRET"]  # never hardcode the signing secret
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token", auto_error=False)
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

class User:
    def __init__(self, id: str, email: str, roles: list[str]):
        self.id = id
        self.email = email
        self.roles = roles

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -> str:
    """Create a signed JWT access token."""
    to_encode = data.copy()
    expire = datetime.now(timezone.utc) + (
        expires_delta or timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    )
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
async def get_current_user(
token: Optional[str] = Depends(oauth2_scheme),
api_key: Optional[str] = Depends(api_key_header),
db = Depends(get_db)
) -> User:
"""
Authenticate user via JWT token or API key.
"""
credentials_exception = HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Could not validate credentials",
headers={"WWW-Authenticate": "Bearer"},
)
# Try JWT token first
if token:
try:
payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
user_id: str = payload.get("sub")
if user_id is None:
raise credentials_exception
user = await db.get_user(user_id)
if user is None:
raise credentials_exception
return user
except JWTError:
pass
# Try API key
if api_key:
user = await db.get_user_by_api_key(api_key)
if user:
return user
raise credentials_exception
def require_role(required_roles: list[str]):
"""Dependency to require specific roles."""
async def role_checker(user: User = Depends(get_current_user)):
if not any(role in user.roles for role in required_roles):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="Insufficient permissions"
)
return user
return role_checker
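For intuition about what `jose.jwt.encode`/`decode` do under the hood, here is a dependency-free sketch of HS256 signing and verification. This is illustrative only: it omits `exp` handling and header validation, so a maintained JWT library should always be used in production.

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def hs256_encode(payload: dict, secret: str) -> str:
    """Build an HS256 JWT: base64url(header).base64url(payload).base64url(signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(sig)}"

def hs256_decode(token: str, secret: str) -> dict:
    """Verify the signature with a constant-time compare, then return the payload."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(body))

token = hs256_encode({"sub": "user-42"}, "dev-secret")
assert hs256_decode(token, "dev-secret")["sub"] == "user-42"
```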
7.2 Authorization Roles
| Role | Permissions |
|---|---|
| `reader` | Read observations, reconstructions, claims |
| `contributor` | Create observations, add claims |
| `curator` | Create reconstructions, link observations, resolve conflicts |
| `admin` | Manage users, API keys, system configuration |
| `api_client` | Programmatic access via API key |
7.3 Rate Limiting
from fastapi import Request
import redis
import time

class RateLimiter:
    """
    Sliding-window rate limiter using Redis sorted sets.

    Each request is stored as a timestamped sorted-set member; entries
    older than the window are pruned before counting.
    """
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def is_allowed(
        self,
        key: str,
        max_requests: int = 100,
        window_seconds: int = 60
    ) -> tuple[bool, dict]:
        """
        Check if request is allowed under rate limit.

        Returns:
            Tuple of (is_allowed, rate_limit_info)
        """
        now = time.time()
        window_start = now - window_seconds
        pipe = self.redis.pipeline()
        # Remove requests that have left the window
        pipe.zremrangebyscore(key, 0, window_start)
        # Count requests in window
        pipe.zcard(key)
        # Add current request
        pipe.zadd(key, {str(now): now})
        # Set expiry so idle keys are cleaned up
        pipe.expire(key, window_seconds)
        results = pipe.execute()
        request_count = results[1]
        is_allowed = request_count < max_requests
        return is_allowed, {
            "limit": max_requests,
            "remaining": max(0, max_requests - request_count - 1),
            "reset": int(now + window_seconds)
        }
# Rate limit tiers
RATE_LIMITS = {
"anonymous": {"requests": 60, "window": 60},
"reader": {"requests": 100, "window": 60},
"contributor": {"requests": 500, "window": 60},
"curator": {"requests": 1000, "window": 60},
"api_client": {"requests": 5000, "window": 60},
}
8. Performance Requirements
8.1 SLOs (Service Level Objectives)
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Monthly uptime |
| API Latency (p50) | < 50ms | Response time |
| API Latency (p99) | < 500ms | Response time |
| Search Latency | < 200ms | Full-text search |
| SPARQL Query | < 1s | Simple queries |
| Throughput | 1000 req/s | Sustained load |
8.2 Scaling Strategy
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ppid-api
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ppid-api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 100
8.3 Caching Strategy
import redis
from functools import wraps
import json
import hashlib
class CacheManager:
"""
Multi-tier caching for PPID.
Tiers:
1. L1: In-memory (per-instance)
2. L2: Redis (shared)
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.local_cache = {}
def cache_observation(self, ttl: int = 3600):
"""Cache observation lookups."""
def decorator(func):
@wraps(func)
async def wrapper(poid: str, *args, **kwargs):
cache_key = f"observation:{poid}"
# Check L1 cache
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Check L2 cache
cached = self.redis.get(cache_key)
if cached:
data = json.loads(cached)
self.local_cache[cache_key] = data
return data
# Fetch from database
result = await func(poid, *args, **kwargs)
if result:
# Store in both caches
self.redis.setex(cache_key, ttl, json.dumps(result))
self.local_cache[cache_key] = result
return result
return wrapper
return decorator
def cache_search(self, ttl: int = 300):
"""Cache search results (shorter TTL)."""
def decorator(func):
@wraps(func)
async def wrapper(query: str, *args, **kwargs):
# Create deterministic cache key from query params
key_data = {"query": query, "args": args, "kwargs": kwargs}
cache_key = f"search:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"
cached = self.redis.get(cache_key)
if cached:
return json.loads(cached)
result = await func(query, *args, **kwargs)
self.redis.setex(cache_key, ttl, json.dumps(result))
return result
return wrapper
return decorator
    def invalidate_observation(self, poid: str):
        """Invalidate cache when observation is updated.

        Note: this clears Redis and this instance's L1 cache only; other
        instances may briefly serve stale L1 entries. In production the
        plain dict should be replaced by a bounded TTL cache (e.g. an LRU
        with expiry) so stale entries age out.
        """
        cache_key = f"observation:{poid}"
        self.redis.delete(cache_key)
        self.local_cache.pop(cache_key, None)
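The search cache key must be deterministic across processes for the L2 cache to be shared. The derivation used by `cache_search` reduces to the following (MD5 is acceptable here because the key is not security-sensitive):

```python
import hashlib
import json

def search_cache_key(query: str, *args, **kwargs) -> str:
    """Deterministic key mirroring CacheManager.cache_search."""
    key_data = {"query": query, "args": args, "kwargs": kwargs}
    payload = json.dumps(key_data, sort_keys=True).encode()
    return f"search:{hashlib.md5(payload).hexdigest()}"

# Same parameters always map to the same key; any change busts the cache entry
assert search_cache_key("van den berg", limit=10) == search_cache_key("van den berg", limit=10)
assert search_cache_key("van den berg", limit=10) != search_cache_key("van den berg", limit=20)
```

Note that this only works while all arguments are JSON-serializable; `sort_keys=True` is what keeps the digest stable across Python processes.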
9. Deployment Architecture
9.1 Kubernetes Deployment
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: ppid-api
labels:
app: ppid
component: api
spec:
replicas: 3
selector:
matchLabels:
app: ppid
component: api
template:
metadata:
labels:
app: ppid
component: api
spec:
containers:
- name: api
image: ppid/api:latest
ports:
- containerPort: 8000
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: database-url
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: redis-url
- name: JWT_SECRET
valueFrom:
secretKeyRef:
name: ppid-secrets
key: jwt-secret
resources:
requests:
cpu: "250m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "2Gi"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
# Service
apiVersion: v1
kind: Service
metadata:
name: ppid-api
spec:
selector:
app: ppid
component: api
ports:
- port: 80
targetPort: 8000
type: ClusterIP
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ppid-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
tls:
- hosts:
- api.ppid.org
secretName: ppid-tls
rules:
- host: api.ppid.org
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ppid-api
port:
number: 80
9.2 Docker Compose (Development)
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
- REDIS_URL=redis://redis:6379
- FUSEKI_URL=http://fuseki:3030
depends_on:
- postgres
- redis
- fuseki
volumes:
- ./src:/app/src
- ./archives:/app/archives
postgres:
image: postgres:16
environment:
POSTGRES_USER: ppid
POSTGRES_PASSWORD: ppid
POSTGRES_DB: ppid
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
redis:
image: redis:7-alpine
ports:
- "6379:6379"
fuseki:
image: stain/jena-fuseki
environment:
ADMIN_PASSWORD: admin
FUSEKI_DATASET_1: ppid
volumes:
- fuseki_data:/fuseki
ports:
- "3030:3030"
elasticsearch:
image: elasticsearch:8.11.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
volumes:
- es_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
depends_on:
- zookeeper
ports:
- "9092:9092"
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
volumes:
postgres_data:
fuseki_data:
es_data:
10. Monitoring and Observability
10.1 Prometheus Metrics
from prometheus_client import Counter, Histogram, Gauge
import time
# Metrics
REQUEST_COUNT = Counter(
'ppid_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'ppid_request_latency_seconds',
'Request latency in seconds',
['method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
OBSERVATIONS_CREATED = Counter(
'ppid_observations_created_total',
'Total observations created'
)
RECONSTRUCTIONS_CREATED = Counter(
'ppid_reconstructions_created_total',
'Total reconstructions created'
)
ENTITY_RESOLUTION_LATENCY = Histogram(
'ppid_entity_resolution_seconds',
'Entity resolution latency',
buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)
CACHE_HITS = Counter(
'ppid_cache_hits_total',
'Cache hits',
['cache_type']
)
CACHE_MISSES = Counter(
'ppid_cache_misses_total',
'Cache misses',
['cache_type']
)
DB_CONNECTIONS = Gauge(
'ppid_db_connections',
'Active database connections'
)
# Middleware for request metrics (assumes the FastAPI `app` instance and
# `Request` import from the API layer)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
latency = time.time() - start_time
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.url.path
).observe(latency)
return response
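The `histogram_quantile` function used in the dashboard queries interpolates within cumulative buckets. A simplified pure-Python rendering of that calculation (approximate, following PromQL's linear interpolation; bucket values below are made up) clarifies what a p99 panel actually reports:

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Approximate PromQL histogram_quantile over cumulative (le, count) buckets.

    `buckets` must be sorted by upper bound and end with (inf, total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the open +Inf bucket
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 40 between 100ms and 500ms, 10 between 0.5s and 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
assert math.isclose(histogram_quantile(0.5, buckets), 0.1)
assert math.isclose(histogram_quantile(0.9, buckets), 0.5)
```

This also shows why bucket boundaries matter: quantile estimates are only as precise as the bucket edges chosen in `REQUEST_LATENCY` above.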
10.2 Logging
import logging
import json
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Structured JSON logging for observability."""
    def format(self, record):
        log_record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
}
# Add extra fields
if hasattr(record, 'poid'):
log_record['poid'] = record.poid
if hasattr(record, 'prid'):
log_record['prid'] = record.prid
if hasattr(record, 'request_id'):
log_record['request_id'] = record.request_id
if hasattr(record, 'user_id'):
log_record['user_id'] = record.user_id
if hasattr(record, 'duration_ms'):
log_record['duration_ms'] = record.duration_ms
if record.exc_info:
log_record['exception'] = self.formatException(record.exc_info)
return json.dumps(log_record)
# Configure logging
def setup_logging():
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
# Reduce noise from libraries
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
10.3 Grafana Dashboard (JSON)
{
"dashboard": {
"title": "PPID System Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)"
}
]
},
{
"title": "Request Latency (p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))"
}
]
},
{
"title": "Observations Created",
"type": "stat",
"targets": [
{
"expr": "sum(ppid_observations_created_total)"
}
]
},
{
"title": "Cache Hit Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))"
}
]
}
]
}
}
11. Implementation Checklist
11.1 Phase 1: Core Infrastructure
- Set up PostgreSQL database with schema
- Set up Apache Jena Fuseki for RDF
- Set up Redis for caching
- Implement POID/PRID generation
- Implement checksum validation
- Create basic REST API endpoints
11.2 Phase 2: Data Ingestion
- Build web scraping infrastructure
- Implement HTML archival
- Integrate LLM for extraction
- Implement XPath validation
- Set up Kafka for async processing
- Create ingestion pipeline
11.3 Phase 3: Entity Resolution
- Implement blocking strategies
- Implement similarity metrics
- Build clustering algorithm
- Create human-in-loop review UI
- Integrate with reconstruction creation
11.4 Phase 4: GHCID Integration
- Implement affiliation linking
- Add SPARQL queries for institution lookups
- Create bidirectional navigation
- Sync with GHCID registry updates
11.5 Phase 5: Production Readiness
- Implement authentication/authorization
- Set up rate limiting
- Configure monitoring and alerting
- Create backup and recovery procedures
- Performance testing and optimization
- Security audit
12. References
Standards
- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/
Related PPID Documents
- Identifier Structure
- Claims and Provenance
Technologies
- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/