
# Implementation Guidelines
**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [Identifier Structure](./05_identifier_structure_design.md) | [Claims and Provenance](./07_claims_and_provenance.md)
---
## 1. Overview
This document provides comprehensive technical specifications for implementing the PPID system:
- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability
---
## 2. Architecture Overview
### 2.1 System Components
```
┌────────────────────────────────────────────────────────────────────┐
│                            PPID System                             │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐      │
│  │  Ingestion   │─────▶│  Processing  │─────▶│   Storage    │      │
│  │    Layer     │      │    Layer     │      │    Layer     │      │
│  └──────────────┘      └──────────────┘      └──────────────┘      │
│         │                     │                     │              │
│         ▼                     ▼                     ▼              │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐      │
│  │ Web Scrapers │      │ Entity Res.  │      │ PostgreSQL   │      │
│  │ API Clients  │      │ NER/NLP      │      │ (Relational) │      │
│  │ File Import  │      │ Validation   │      ├──────────────┤      │
│  └──────────────┘      └──────────────┘      │ Apache Jena  │      │
│                                              │ (RDF/SPARQL) │      │
│                                              ├──────────────┤      │
│                                              │    Redis     │      │
│                                              │   (Cache)    │      │
│                                              └──────────────┘      │
│                                                     │              │
│                    ┌────────────────────────────────┤              │
│                    ▼                                ▼              │
│             ┌──────────────┐                ┌──────────────┐       │
│             │   REST API   │                │   GraphQL    │       │
│             │  (FastAPI)   │                │  (Ariadne)   │       │
│             └──────────────┘                └──────────────┘       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
### 2.2 Technology Stack
| Layer | Technology | Purpose |
|-------|------------|---------|
| **API Framework** | FastAPI (Python 3.11+) | REST API, async support |
| **GraphQL** | Ariadne | GraphQL endpoint |
| **Relational DB** | PostgreSQL 16 | Primary data store |
| **Triple Store** | Apache Jena Fuseki | RDF/SPARQL queries |
| **Cache** | Redis 7 | Session, rate limiting, caching |
| **Queue** | Apache Kafka | Async processing pipeline |
| **Search** | Elasticsearch 8 | Full-text search |
| **Object Storage** | MinIO / S3 | HTML archives, files |
| **Container** | Docker + Kubernetes | Deployment |
| **Monitoring** | Prometheus + Grafana | Metrics and alerting |
---
## 3. Database Schema
### 3.1 PostgreSQL Schema
```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";  -- For fuzzy matching

-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
    'official_registry',
    'institutional_website',
    'professional_network',
    'social_media',
    'news_article',
    'academic_publication',
    'user_submitted',
    'inferred'
);

-- Person Observations (POID)
CREATE TABLE person_observations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    poid VARCHAR(24) UNIQUE NOT NULL,  -- POID-xxxx-xxxx-xxxx-xxxx

    -- Source metadata
    source_url TEXT NOT NULL,
    source_type source_type NOT NULL,
    retrieved_at TIMESTAMPTZ NOT NULL,
    content_hash VARCHAR(64) NOT NULL,  -- SHA-256
    html_archive_path TEXT,

    -- Extracted name components (PNV-compatible)
    literal_name TEXT,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,       -- van, de, etc.
    patronymic TEXT,
    generation_suffix TEXT,    -- Jr., III, etc.

    -- Metadata
    extraction_agent VARCHAR(100),
    extraction_confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- Constraints
    CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid VARCHAR(24) UNIQUE NOT NULL,  -- PRID-xxxx-xxxx-xxxx-xxxx

    -- Canonical name (resolved from observations)
    canonical_name TEXT NOT NULL,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,

    -- Curation metadata
    curator_id UUID REFERENCES users(id),
    curation_method VARCHAR(50),  -- 'manual', 'algorithmic', 'hybrid'
    confidence_score DECIMAL(3,2),

    -- Versioning
    version INTEGER DEFAULT 1,
    previous_version_id UUID REFERENCES person_reconstructions(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
    linked_at TIMESTAMPTZ DEFAULT NOW(),
    linked_by UUID REFERENCES users(id),
    link_confidence DECIMAL(3,2),
    PRIMARY KEY (prid_id, poid_id)
);

-- Claims (assertions with provenance)
CREATE TABLE claims (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    claim_id VARCHAR(50) UNIQUE NOT NULL,

    -- Subject (what this claim is about)
    poid_id UUID REFERENCES person_observations(id),

    -- Claim content
    claim_type VARCHAR(50) NOT NULL,  -- 'job_title', 'employer', 'email', etc.
    claim_value TEXT NOT NULL,

    -- Provenance (MANDATORY per Rule 6)
    source_url TEXT NOT NULL,
    retrieved_on TIMESTAMPTZ NOT NULL,
    xpath TEXT NOT NULL,
    html_file TEXT NOT NULL,
    xpath_match_score DECIMAL(3,2) NOT NULL,
    content_hash VARCHAR(64),

    -- Quality
    confidence DECIMAL(3,2),
    extraction_agent VARCHAR(100),
    status claim_status DEFAULT 'active',

    -- Relationships
    supersedes_id UUID REFERENCES claims(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    verified_at TIMESTAMPTZ,

    -- Constraints
    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);

-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
    claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    relationship_type VARCHAR(20) NOT NULL,  -- 'supports', 'conflicts_with'
    notes TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);

-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    identifier_scheme VARCHAR(20) NOT NULL,  -- 'orcid', 'isni', 'viaf', 'wikidata'
    identifier_value VARCHAR(100) NOT NULL,
    verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (prid_id, identifier_scheme, identifier_value)
);

-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    ghcid VARCHAR(50) NOT NULL,  -- e.g., NL-NH-HAA-A-NHA

    -- Affiliation details
    role_title TEXT,
    department TEXT,
    affiliation_start DATE,
    affiliation_end DATE,
    is_current BOOLEAN DEFAULT TRUE,

    -- Provenance
    source_poid_id UUID REFERENCES person_observations(id),
    confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
    USING gin (literal_name gin_trgm_ops);
CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
    USING gin (canonical_name gin_trgm_ops);
CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);
CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
    WHERE is_current = TRUE;

-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
    USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
    USING gin (to_tsvector('english', canonical_name));
```
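The format rule encoded in the `valid_poid` / `valid_prid` CHECK constraints can be mirrored at the application layer so malformed identifiers are rejected before they reach the database. A minimal sketch (the full checksum validation lives in `ppid.identifiers` and is specified in the Identifier Structure chapter):

```python
import re

# Same pattern as the valid_poid / valid_prid CHECK constraints above.
# The final character may be a hex digit or the 'x'/'X' placeholder.
PPID_PATTERN = re.compile(
    r'^(POID|PRID)-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
)

def check_ppid_format(ppid: str) -> bool:
    """Return True if the identifier matches the schema-level format rule."""
    return PPID_PATTERN.match(ppid) is not None

print(check_ppid_format("POID-7a3b-c4d5-e6f7-890X"))  # True
print(check_ppid_format("PRID-1234-5678-90ab-cde5"))  # True
print(check_ppid_format("POID-7A3B-c4d5-e6f7-890X"))  # False: uppercase hex
```

Note that the pattern deliberately rejects uppercase hex digits, matching the database constraint exactly.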
### 3.2 RDF Triple Store Schema
```turtle
@prefix ppid:  <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
@prefix picom: <https://personsincontext.org/model#> .
@prefix pnv:   <https://w3id.org/pnv#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X a ppidt:PersonObservation, picom:PersonObservation ;
    ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;

    # Name (PNV structured)
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Provenance
    prov:wasDerivedFrom <https://linkedin.com/in/jan-van-den-berg> ;
    prov:wasGeneratedBy ppid:extraction-activity-001 ;
    prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;

    # Claims
    ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .

# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5 a ppidt:PersonReconstruction, picom:PersonReconstruction ;
    ppidv:prid "PRID-1234-5678-90ab-cde5" ;

    # Canonical name
    schema:name "Jan van den Berg" ;
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Derived from observations
    prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
        ppid:POID-8c4d-e5f6-a7b8-901X ;

    # GHCID affiliation
    ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;

    # External identifiers
    ppidv:orcid "0000-0002-1234-5678" ;
    owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
```
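The Turtle shape above can be rendered directly from an observation record. A minimal string-templating sketch (illustrative only; production code would use an RDF library with proper escaping, as the pipeline in §5.2 does with rdflib):

```python
from textwrap import dedent

def observation_turtle(poid: str, source_url: str, retrieved_at: str,
                       literal_name: str) -> str:
    """Render a minimal Turtle fragment for a Person Observation.
    String templating for illustration; values are not RDF-escaped."""
    return dedent(f"""\
        @prefix ppid:  <https://ppid.org/> .
        @prefix ppidv: <https://ppid.org/vocab#> .
        @prefix pnv:   <https://w3id.org/pnv#> .
        @prefix prov:  <http://www.w3.org/ns/prov#> .
        @prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

        ppid:{poid} a ppidv:PersonObservation ;
            ppidv:poid "{poid}" ;
            pnv:hasName [ a pnv:PersonName ; pnv:literalName "{literal_name}" ] ;
            prov:wasDerivedFrom <{source_url}> ;
            prov:generatedAtTime "{retrieved_at}"^^xsd:dateTime .
        """)

# The source URL below is invented for illustration.
ttl = observation_turtle(
    "POID-7a3b-c4d5-e6f7-890X",
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00Z",
    "Jan van den Berg",
)
print(ttl)
```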
---
## 4. API Design
### 4.1 REST API Endpoints
```yaml
openapi: 3.1.0
info:
  title: PPID API
  version: 1.0.0
  description: Person Persistent Identifier API
servers:
  - url: https://api.ppid.org/v1
paths:
  # Person Observations
  /observations:
    post:
      summary: Create new person observation
      operationId: createObservation
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateObservationRequest'
      responses:
        '201':
          description: Observation created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
    get:
      summary: Search observations
      operationId: searchObservations
      parameters:
        - name: name
          in: query
          schema:
            type: string
        - name: source_url
          in: query
          schema:
            type: string
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        '200':
          description: List of observations
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ObservationList'
  /observations/{poid}:
    get:
      summary: Get observation by POID
      operationId: getObservation
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
            pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
      responses:
        '200':
          description: Person observation
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
            text/turtle:
              schema:
                type: string
            application/ld+json:
              schema:
                type: object
  /observations/{poid}/claims:
    get:
      summary: Get claims for observation
      operationId: getObservationClaims
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: List of claims
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClaimList'
  # Person Reconstructions
  /reconstructions:
    post:
      summary: Create person reconstruction from observations
      operationId: createReconstruction
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateReconstructionRequest'
      responses:
        '201':
          description: Reconstruction created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'
  /reconstructions/{prid}:
    get:
      summary: Get reconstruction by PRID
      operationId: getReconstruction
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Person reconstruction
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'
  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations
  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history
  # Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
          content:
            application/json:
              schema:
                type: object
                properties:
                  valid:
                    type: boolean
                  type:
                    type: string
                    enum: [POID, PRID]
                  exists:
                    type: boolean
  # Search
  /search:
    get:
      summary: Full-text search across all records
      operationId: search
      parameters:
        - name: q
          in: query
          required: true
          schema:
            type: string
        - name: type
          in: query
          schema:
            type: string
            enum: [observation, reconstruction, all]
            default: all
      responses:
        '200':
          description: Search results
  # Entity Resolution
  /resolve:
    post:
      summary: Find matching records for input data
      operationId: resolveEntity
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/EntityResolutionRequest'
      responses:
        '200':
          description: Resolution candidates
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ResolutionCandidates'
components:
  schemas:
    CreateObservationRequest:
      type: object
      required:
        - source_url
        - retrieved_at
        - claims
      properties:
        source_url:
          type: string
          format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network, social_media]
        retrieved_at:
          type: string
          format: date-time
        content_hash:
          type: string
        html_archive_path:
          type: string
        extraction_agent:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/ClaimInput'
    ClaimInput:
      type: object
      required:
        - claim_type
        - claim_value
        - xpath
        - xpath_match_score
      properties:
        claim_type:
          type: string
        claim_value:
          type: string
        xpath:
          type: string
        xpath_match_score:
          type: number
          minimum: 0
          maximum: 1
        confidence:
          type: number
    PersonObservation:
      type: object
      properties:
        poid:
          type: string
        source_url:
          type: string
        literal_name:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/Claim'
        created_at:
          type: string
          format: date-time
    CreateReconstructionRequest:
      type: object
      required:
        - observation_ids
      properties:
        observation_ids:
          type: array
          minItems: 1
          items:
            type: string
        canonical_name:
          type: string
        external_identifiers:
          type: object
    PersonReconstruction:
      type: object
      properties:
        prid:
          type: string
        canonical_name:
          type: string
        observations:
          type: array
          items:
            type: string
        external_identifiers:
          type: object
        ghcid_affiliations:
          type: array
          items:
            $ref: '#/components/schemas/GhcidAffiliation'
    GhcidAffiliation:
      type: object
      properties:
        ghcid:
          type: string
        role_title:
          type: string
        is_current:
          type: boolean
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
    apiKey:
      type: apiKey
      in: header
      name: X-API-Key
security:
  - bearerAuth: []
  - apiKey: []
```
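A hypothetical client payload for `POST /observations`, with a client-side check of the schema's `required` lists before sending. The URL, hash, and XPath values are invented for illustration:

```python
import json

# Example request body matching CreateObservationRequest above.
payload = {
    "source_url": "https://example.org/about/staff",
    "source_type": "institutional_website",
    "retrieved_at": "2025-01-09T14:30:00Z",
    "content_hash": "e3b0c44298fc1c149afbf4c8996fb924"
                    "27ae41e4649b934ca495991b7852b855",
    "claims": [
        {
            "claim_type": "full_name",
            "claim_value": "Jan van den Berg",
            "xpath": "//div[@class='staff-card']/h2",
            "xpath_match_score": 1.0,
        }
    ],
}

# Required fields per the OpenAPI schema's `required` lists.
REQUIRED = {"source_url", "retrieved_at", "claims"}
REQUIRED_CLAIM = {"claim_type", "claim_value", "xpath", "xpath_match_score"}

missing = REQUIRED - payload.keys()
assert not missing, f"missing fields: {missing}"
for claim in payload["claims"]:
    assert not (REQUIRED_CLAIM - claim.keys())

body = json.dumps(payload)  # ready to POST with any HTTP client
```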
### 4.2 GraphQL Schema
```graphql
# Custom scalars used throughout the schema
scalar DateTime
scalar Date

# Note: wrapper and helper types referenced below (ObservationConnection,
# ReconstructionConnection, SearchResults, ValidationResult, User,
# UpdateReconstructionInput, ExternalIdentifiersInput) are not shown here.

type Query {
  # Observations
  observation(poid: ID!): PersonObservation
  observations(
    name: String
    sourceUrl: String
    limit: Int = 20
    offset: Int = 0
  ): ObservationConnection!

  # Reconstructions
  reconstruction(prid: ID!): PersonReconstruction
  reconstructions(
    name: String
    ghcid: String
    limit: Int = 20
    offset: Int = 0
  ): ReconstructionConnection!

  # Search
  search(query: String!, type: SearchType = ALL): SearchResults!

  # Validation
  validatePpid(ppid: String!): ValidationResult!

  # Entity Resolution
  resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}

type Mutation {
  # Create observation
  createObservation(input: CreateObservationInput!): PersonObservation!

  # Create reconstruction
  createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!

  # Link observation to reconstruction
  linkObservation(prid: ID!, poid: ID!): PersonReconstruction!

  # Update reconstruction
  updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!

  # Add external identifier
  addExternalIdentifier(
    prid: ID!
    scheme: IdentifierScheme!
    value: String!
  ): PersonReconstruction!

  # Add GHCID affiliation
  addGhcidAffiliation(
    prid: ID!
    ghcid: String!
    roleTitle: String
    isCurrent: Boolean = true
  ): PersonReconstruction!
}

type PersonObservation {
  poid: ID!
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String

  # Name components
  literalName: String
  givenName: String
  surname: String
  surnamePrefix: String

  # Related data
  claims: [Claim!]!
  linkedReconstructions: [PersonReconstruction!]!

  # Metadata
  extractionAgent: String
  extractionConfidence: Float
  createdAt: DateTime!
}

type PersonReconstruction {
  prid: ID!
  canonicalName: String!
  givenName: String
  surname: String
  surnamePrefix: String

  # Linked observations
  observations: [PersonObservation!]!

  # External identifiers
  orcid: String
  isni: String
  viaf: String
  wikidata: String
  externalIdentifiers: [ExternalIdentifier!]!

  # GHCID affiliations
  ghcidAffiliations: [GhcidAffiliation!]!
  currentAffiliations: [GhcidAffiliation!]!

  # Versioning
  version: Int!
  previousVersion: PersonReconstruction
  history: [PersonReconstruction!]!

  # Curation
  curator: User
  curationMethod: CurationMethod
  confidenceScore: Float
  createdAt: DateTime!
  updatedAt: DateTime!
}

type Claim {
  id: ID!
  claimType: ClaimType!
  claimValue: String!

  # Provenance (MANDATORY)
  sourceUrl: String!
  retrievedOn: DateTime!
  xpath: String!
  htmlFile: String!
  xpathMatchScore: Float!

  # Quality
  confidence: Float
  extractionAgent: String
  status: ClaimStatus!

  # Relationships
  supports: [Claim!]!
  conflictsWith: [Claim!]!
  supersedes: Claim
  createdAt: DateTime!
}

type GhcidAffiliation {
  ghcid: String!
  institution: HeritageCustodian # Resolved from GHCID
  roleTitle: String
  department: String
  startDate: Date
  endDate: Date
  isCurrent: Boolean!
  confidence: Float
}

type HeritageCustodian {
  ghcid: String!
  name: String!
  institutionType: String!
  city: String
  country: String
}

type ExternalIdentifier {
  scheme: IdentifierScheme!
  value: String!
  verified: Boolean!
  verifiedAt: DateTime
}

enum SourceType {
  OFFICIAL_REGISTRY
  INSTITUTIONAL_WEBSITE
  PROFESSIONAL_NETWORK
  SOCIAL_MEDIA
  NEWS_ARTICLE
  ACADEMIC_PUBLICATION
  USER_SUBMITTED
  INFERRED
}

enum ClaimType {
  FULL_NAME
  GIVEN_NAME
  FAMILY_NAME
  JOB_TITLE
  EMPLOYER
  EMPLOYER_GHCID
  EMAIL
  LINKEDIN_URL
  ORCID
  BIRTH_DATE
  EDUCATION
}

enum ClaimStatus {
  ACTIVE
  SUPERSEDED
  RETRACTED
}

enum IdentifierScheme {
  ORCID
  ISNI
  VIAF
  WIKIDATA
  LOC_NAF
}

enum CurationMethod {
  MANUAL
  ALGORITHMIC
  HYBRID
}

enum SearchType {
  OBSERVATION
  RECONSTRUCTION
  ALL
}

input CreateObservationInput {
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String
  extractionAgent: String
  claims: [ClaimInput!]!
}

input ClaimInput {
  claimType: ClaimType!
  claimValue: String!
  xpath: String!
  xpathMatchScore: Float!
  confidence: Float
}

input CreateReconstructionInput {
  observationIds: [ID!]!
  canonicalName: String
  externalIdentifiers: ExternalIdentifiersInput
}

input EntityResolutionInput {
  name: String!
  employer: String
  jobTitle: String
  email: String
  linkedinUrl: String
}

type ResolutionCandidate {
  reconstruction: PersonReconstruction
  observation: PersonObservation
  matchScore: Float!
  matchFactors: [MatchFactor!]!
}

type MatchFactor {
  field: String!
  score: Float!
  method: String!
}
```
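The `ResolutionCandidate` and `MatchFactor` types suggest per-field scoring with an aggregate match score. A hedged sketch of one possible scoring scheme — the field weights and methods here are assumptions for illustration, not the specified resolution algorithm:

```python
from difflib import SequenceMatcher

# Assumed field weights; the real weighting is an open design decision.
WEIGHTS = {"name": 0.6, "employer": 0.3, "email": 0.1}

def match_factors(query: dict, record: dict) -> list[dict]:
    """Compute per-field MatchFactor-style entries for fields present
    in both the query and the candidate record."""
    factors = []
    for field, weight in WEIGHTS.items():
        q, r = query.get(field), record.get(field)
        if not q or not r:
            continue
        if field == "email":
            score = 1.0 if q.lower() == r.lower() else 0.0
            method = "exact"
        else:
            score = SequenceMatcher(None, q.lower(), r.lower()).ratio()
            method = "string_similarity"
        factors.append(
            {"field": field, "score": score, "method": method, "weight": weight}
        )
    return factors

def match_score(factors: list[dict]) -> float:
    """Weighted average over the factors that could be computed."""
    total = sum(f["weight"] for f in factors)
    return sum(f["score"] * f["weight"] for f in factors) / total if total else 0.0

query = {"name": "Jan van den Berg", "employer": "Noord-Hollands Archief"}
record = {"name": "J. van den Berg", "employer": "Noord-Hollands Archief"}
score = match_score(match_factors(query, record))
```

Normalizing by the sum of available weights keeps scores comparable when a record lacks some fields (e.g., no email on file).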
### 4.3 FastAPI Implementation
```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import APIKeyHeader
from pydantic import BaseModel, Field

app = FastAPI(
    title="PPID API",
    description="Person Persistent Identifier API",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# --- Pydantic Models ---

class ClaimInput(BaseModel):
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float = Field(ge=0, le=1)
    confidence: Optional[float] = Field(None, ge=0, le=1)

class CreateObservationRequest(BaseModel):
    source_url: str
    source_type: str = "institutional_website"
    retrieved_at: datetime
    content_hash: Optional[str] = None
    html_archive_path: Optional[str] = None
    extraction_agent: Optional[str] = None
    claims: list[ClaimInput]

class PersonObservationResponse(BaseModel):
    poid: str
    source_url: str
    source_type: str
    retrieved_at: datetime
    literal_name: Optional[str] = None
    claims: list[dict]
    created_at: datetime

class CreateReconstructionRequest(BaseModel):
    observation_ids: list[str]
    canonical_name: Optional[str] = None
    external_identifiers: Optional[dict] = None

class ValidationResult(BaseModel):
    valid: bool
    ppid_type: Optional[str] = None
    exists: Optional[bool] = None
    error: Optional[str] = None

# --- Dependencies ---

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def get_db():
    """Database connection dependency."""
    # In production, yield a connection from a pool
    pass

async def get_current_user(api_key: str = Depends(api_key_header)):
    """Authenticate user from API key or JWT."""
    pass

# --- Endpoints ---

@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
    request: CreateObservationRequest,
    db=Depends(get_db)
):
    """
    Create a new Person Observation from extracted data.

    The POID is generated deterministically from source metadata.
    """
    from ppid.identifiers import generate_poid

    # Generate deterministic POID
    poid = generate_poid(
        source_url=request.source_url,
        retrieval_timestamp=request.retrieved_at.isoformat(),
        content_hash=request.content_hash or ""
    )

    # Check for existing observation with same POID (idempotent create)
    existing = await db.get_observation(poid)
    if existing:
        return existing

    # Extract name from claims
    literal_name = None
    for claim in request.claims:
        if claim.claim_type == "full_name":
            literal_name = claim.claim_value
            break

    # Create observation record
    observation = await db.create_observation(
        poid=poid,
        source_url=request.source_url,
        source_type=request.source_type,
        retrieved_at=request.retrieved_at,
        content_hash=request.content_hash,
        html_archive_path=request.html_archive_path,
        literal_name=literal_name,
        extraction_agent=request.extraction_agent,
        claims=[c.dict() for c in request.claims]
    )
    return observation

@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db=Depends(get_db)):
    """Get Person Observation by POID."""
    from ppid.identifiers import validate_ppid

    is_valid, error = validate_ppid(poid)
    if not is_valid:
        raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")

    observation = await db.get_observation(poid)
    if not observation:
        raise HTTPException(status_code=404, detail="Observation not found")
    return observation

@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db=Depends(get_db)):
    """Validate PPID format, checksum, and existence."""
    from ppid.identifiers import validate_ppid_full

    is_valid, error = validate_ppid_full(ppid)
    if not is_valid:
        return ValidationResult(valid=False, error=error)

    ppid_type = "POID" if ppid.startswith("POID") else "PRID"

    # Check existence
    if ppid_type == "POID":
        exists = await db.observation_exists(ppid)
    else:
        exists = await db.reconstruction_exists(ppid)

    return ValidationResult(
        valid=True,
        ppid_type=ppid_type,
        exists=exists
    )

@app.post("/api/v1/reconstructions")
async def create_reconstruction(
    request: CreateReconstructionRequest,
    db=Depends(get_db),
    user=Depends(get_current_user)
):
    """Create Person Reconstruction from linked observations."""
    from ppid.identifiers import generate_prid

    # Validate all POIDs exist
    for poid in request.observation_ids:
        if not await db.observation_exists(poid):
            raise HTTPException(
                status_code=400,
                detail=f"Observation not found: {poid}"
            )

    # Generate deterministic PRID
    prid = generate_prid(
        observation_ids=request.observation_ids,
        curator_id=str(user.id),
        timestamp=datetime.utcnow().isoformat()
    )

    # Determine canonical name
    if request.canonical_name:
        canonical_name = request.canonical_name
    else:
        # Use name from highest-confidence observation
        observations = await db.get_observations(request.observation_ids)
        canonical_name = max(
            observations,
            key=lambda o: o.extraction_confidence or 0
        ).literal_name

    # Create reconstruction
    reconstruction = await db.create_reconstruction(
        prid=prid,
        canonical_name=canonical_name,
        observation_ids=request.observation_ids,
        curator_id=user.id,
        external_identifiers=request.external_identifiers
    )
    return reconstruction

@app.get("/api/v1/search")
async def search(
    q: str = Query(..., min_length=2),
    type: str = Query("all", regex="^(observation|reconstruction|all)$"),
    limit: int = Query(20, ge=1, le=100),
    offset: int = Query(0, ge=0),
    db=Depends(get_db)
):
    """Full-text search across observations and reconstructions."""
    results = await db.search(
        query=q,
        search_type=type,
        limit=limit,
        offset=offset
    )
    return results

@app.post("/api/v1/resolve")
async def resolve_entity(
    name: str,
    employer: Optional[str] = None,
    job_title: Optional[str] = None,
    email: Optional[str] = None,
    db=Depends(get_db)
):
    """
    Entity resolution: find matching records for input data.

    Returns ranked candidates with match scores.
    """
    from ppid.entity_resolution import find_candidates

    candidates = await find_candidates(
        db=db,
        name=name,
        employer=employer,
        job_title=job_title,
        email=email
    )
    return {
        "candidates": candidates,
        "query": {
            "name": name,
            "employer": employer,
            "job_title": job_title,
            "email": email
        }
    }
```
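`generate_poid` is imported from `ppid.identifiers` above but not shown; its normative derivation and checksum rules are specified in the Identifier Structure chapter. As an illustration only, a deterministic hash-based sketch in that spirit:

```python
import hashlib

def generate_poid(source_url: str, retrieval_timestamp: str,
                  content_hash: str) -> str:
    """Illustrative deterministic POID: hash the source metadata and format
    the first 16 hex digits as POID-xxxx-xxxx-xxxx-xxxx. This sketch omits
    the checksum character defined in the Identifier Structure chapter."""
    digest = hashlib.sha256(
        f"{source_url}|{retrieval_timestamp}|{content_hash}".encode()
    ).hexdigest()
    groups = [digest[i:i + 4] for i in range(0, 16, 4)]
    return "POID-" + "-".join(groups)

# The URL and hash below are invented for illustration.
poid = generate_poid(
    "https://example.org/staff/jan",
    "2025-01-09T14:30:00Z",
    "e3b0c442",
)

# Same inputs always yield the same identifier (idempotent ingestion).
assert poid == generate_poid("https://example.org/staff/jan",
                             "2025-01-09T14:30:00Z", "e3b0c442")
```

Determinism is what makes the `POST /observations` handler above idempotent: re-submitting the same source snapshot regenerates the same POID and returns the existing record.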
---
## 5. Data Ingestion Pipeline
### 5.1 Pipeline Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION PIPELINE                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Source   │───▶│  Extract  │───▶│ Transform │───▶│   Load    │  │
│  │   Fetch   │    │   (NER)   │    │(Validate) │    │  (Store)  │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│        │                │                │                │        │
│        ▼                ▼                ▼                ▼        │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Archive  │    │  Claims   │    │   POID    │    │ Postgres  │  │
│  │   HTML    │    │   XPath   │    │ Generate  │    │  + RDF    │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
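The Fetch stage's content hashing and archiving can be sketched in isolation. The date-partitioned, hash-named key layout below is an assumption for illustration; the actual archive layout is up to the object-storage integration:

```python
import hashlib
from datetime import datetime, timezone

def archive_key(html: str, retrieved_at: datetime) -> tuple[str, str]:
    """Return the SHA-256 content hash and a deterministic object-storage
    key for an archived page (hypothetical key layout)."""
    content_hash = hashlib.sha256(html.encode()).hexdigest()
    day = retrieved_at.strftime("%Y/%m/%d")
    return content_hash, f"html-archive/{day}/{content_hash}.html"

h, key = archive_key(
    "<html><body>Jan van den Berg</body></html>",
    datetime(2025, 1, 9, 14, 30, tzinfo=timezone.utc),
)
```

Keying archives by content hash deduplicates identical snapshots automatically and lets a claim's `html_file` reference be verified against its `content_hash`.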
### 5.2 Pipeline Implementation
```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

from kafka import KafkaProducer

@dataclass
class SourceDocument:
    url: str
    html_content: str
    retrieved_at: datetime
    content_hash: str
    archive_path: str

@dataclass
class ExtractedClaim:
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float
    confidence: float

@dataclass
class PersonObservationData:
    source: SourceDocument
    claims: list[ExtractedClaim]
    poid: str

class IngestionPipeline:
    """
    Main data ingestion pipeline for PPID.

    Stages:
    1. Fetch: Retrieve web pages, archive HTML
    2. Extract: NER/NLP to identify person data, generate claims with XPath
    3. Transform: Validate, generate POID, structure data
    4. Load: Store in PostgreSQL and RDF triple store
    """

    def __init__(
        self,
        db_pool,
        rdf_store,
        kafka_producer: KafkaProducer,
        archive_storage,
        llm_extractor
    ):
        self.db = db_pool
        self.rdf = rdf_store
        self.kafka = kafka_producer
        self.archive = archive_storage
        self.extractor = llm_extractor

    async def process_url(self, url: str) -> list[PersonObservationData]:
        """
        Full pipeline for a single URL.

        Returns list of PersonObservations extracted from the page.
        """
        # Stage 1: Fetch and archive
        source = await self._fetch_and_archive(url)

        # Stage 2: Extract claims with XPath
        observations = await self._extract_observations(source)

        # Stage 3: Transform and validate
        validated = await self._transform_and_validate(observations, source)

        # Stage 4: Load to databases
        await self._load_observations(validated)

        return validated

    async def _fetch_and_archive(self, url: str) -> SourceDocument:
        """Fetch URL and archive HTML."""
        from playwright.async_api import async_playwright

        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url, wait_until='networkidle')
            html_content = await page.content()
            await browser.close()

        # Calculate content hash
        content_hash = hashlib.sha256(html_content.encode()).hexdigest()

        # Archive HTML
        retrieved_at = datetime.utcnow()
        archive_path = await self.archive.store(
            url=url,
            content=html_content,
            timestamp=retrieved_at
        )

        return SourceDocument(
            url=url,
            html_content=html_content,
            retrieved_at=retrieved_at,
            content_hash=content_hash,
            archive_path=archive_path
        )

    async def _extract_observations(
        self,
        source: SourceDocument
    ) -> list[tuple[list[ExtractedClaim], str | None]]:
        """
        Extract person observations with XPath provenance.

        Uses LLM for extraction, then validates each XPath against the DOM.
        """
        from difflib import SequenceMatcher

        from lxml import html

        # Parse HTML
        tree = html.fromstring(source.html_content)

        # Use LLM to extract person data with XPath
        extraction_result = await self.extractor.extract_persons(
            html_content=source.html_content,
            source_url=source.url
        )

        observations = []
        for person in extraction_result.persons:
            validated_claims = []
            for claim in person.claims:
                # Verify XPath points to expected value
                try:
                    elements = tree.xpath(claim.xpath)
                except Exception:
                    # Skip claims with invalid XPath
                    continue
                if not elements:
                    continue
                actual_value = elements[0].text_content().strip()

                # Calculate match score
                if actual_value == claim.claim_value:
                    match_score = 1.0
                else:
                    match_score = SequenceMatcher(
                        None, actual_value, claim.claim_value
                    ).ratio()

                if match_score >= 0.8:  # Accept if 80%+ match
                    validated_claims.append(ExtractedClaim(
                        claim_type=claim.claim_type,
                        claim_value=claim.claim_value,
                        xpath=claim.xpath,
                        xpath_match_score=match_score,
                        confidence=claim.confidence * match_score
                    ))

            if validated_claims:
                # Get literal name from claims
                literal_name = next(
                    (c.claim_value for c in validated_claims
                     if c.claim_type == 'full_name'),
                    None
                )
                observations.append((validated_claims, literal_name))

        return observations

    async def _transform_and_validate(
        self,
        observations: list[tuple[list[ExtractedClaim], str | None]],
        source: SourceDocument
    ) -> list[PersonObservationData]:
        """Transform extracted data and generate POIDs."""
        from ppid.identifiers import generate_poid

        results = []
        for claims, _ in observations:
            # Generate deterministic POID, salted per person so multiple
            # persons on one page get distinct identifiers
            claims_hash = hashlib.sha256(
                str(sorted(c.claim_value for c in claims)).encode()
            ).hexdigest()

            poid = generate_poid(
                source_url=source.url,
                retrieval_timestamp=source.retrieved_at.isoformat(),
                content_hash=f"{source.content_hash}:{claims_hash}"
            )

            results.append(PersonObservationData(
                source=source,
                claims=claims,
                poid=poid
            ))

        return results

    async def _load_observations(
        self,
        observations: list[PersonObservationData]
    ) -> None:
        """Load observations to PostgreSQL and RDF store."""
        for obs in observations:
            # Check if already exists (idempotent)
            existing = await self.db.get_observation(obs.poid)
            if existing:
                continue

            # Insert to PostgreSQL
            await self.db.create_observation(
                poid=obs.poid,
                source_url=obs.source.url,
                source_type='institutional_website',
                retrieved_at=obs.source.retrieved_at,
                content_hash=obs.source.content_hash,
                html_archive_path=obs.source.archive_path,
                literal_name=next(
                    (c.claim_value for c in obs.claims
                     if c.claim_type == 'full_name'),
                    None
                ),
                claims=[{
                    'claim_type': c.claim_type,
                    'claim_value': c.claim_value,
                    'xpath': c.xpath,
                    'xpath_match_score': c.xpath_match_score,
                    'confidence': c.confidence
                } for c in obs.claims]
            )

            # Insert to RDF triple store
            await self._insert_rdf(obs)

            # Publish to Kafka for downstream processing
            self.kafka.send('ppid.observations.created', {
                'poid': obs.poid,
                'source_url': obs.source.url,
                'timestamp': obs.source.retrieved_at.isoformat()
            })

    async def _insert_rdf(self, obs: PersonObservationData) -> None:
        """Insert observation as RDF triples."""
        from rdflib import Graph, Namespace, Literal, URIRef
        from rdflib.namespace import RDF, XSD

        PPID = Namespace("https://ppid.org/")
        PPIDV = Namespace("https://ppid.org/vocab#")
        PROV = Namespace("http://www.w3.org/ns/prov#")

        g = Graph()
        obs_uri = PPID[obs.poid]

        g.add((obs_uri, RDF.type, PPIDV.PersonObservation))
        g.add((obs_uri, PPIDV.poid, Literal(obs.poid)))
        g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url)))
        g.add((obs_uri, PROV.generatedAtTime, Literal(
            obs.source.retrieved_at.isoformat(),
            datatype=XSD.dateTime
        )))

        # Add claims
        for i, claim in enumerate(obs.claims):
            claim_uri = PPID[f"{obs.poid}/claim/{i}"]
            g.add((obs_uri, PPIDV.hasClaim, claim_uri))
            g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type)))
            g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value)))
            g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath)))
            g.add((claim_uri, PPIDV.xpathMatchScore, Literal(
                claim.xpath_match_score, datatype=XSD.decimal
            )))

        # Insert to triple store
        await self.rdf.insert(g)
```
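The claims hash that feeds `generate_poid` above is deterministic because claim values are sorted before hashing. The same construction, isolated into a standalone sketch:

```python
import hashlib

def claims_hash(claim_values: list[str]) -> str:
    # Sorting first makes the digest independent of extraction order,
    # so re-processing the same page yields the same POID input.
    return hashlib.sha256(str(sorted(claim_values)).encode()).hexdigest()
```

Two observations with the same claim values, however ordered, therefore hash identically.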
---
## 6. GHCID Integration
### 6.1 Linking Persons to Institutions
```python
from dataclasses import dataclass
from typing import Optional
from datetime import date
@dataclass
class GhcidAffiliation:
"""
Link between a person (PRID) and a heritage institution (GHCID).
"""
ghcid: str # e.g., "NL-NH-HAA-A-NHA"
role_title: Optional[str] = None
department: Optional[str] = None
start_date: Optional[date] = None
end_date: Optional[date] = None
is_current: bool = True
source_poid: Optional[str] = None
confidence: float = 0.9
async def link_person_to_institution(
db,
prid: str,
ghcid: str,
    role_title: Optional[str] = None,
    source_poid: Optional[str] = None
) -> GhcidAffiliation:
"""
Create link between person and heritage institution.
Args:
prid: Person Reconstruction ID
ghcid: Global Heritage Custodian ID
role_title: Job title at institution
source_poid: Observation where affiliation was extracted
Returns:
Created affiliation record
"""
# Validate PRID exists
reconstruction = await db.get_reconstruction(prid)
if not reconstruction:
raise ValueError(f"Reconstruction not found: {prid}")
# Validate GHCID format
if not validate_ghcid_format(ghcid):
raise ValueError(f"Invalid GHCID format: {ghcid}")
# Create affiliation
affiliation = await db.create_ghcid_affiliation(
prid_id=reconstruction.id,
ghcid=ghcid,
role_title=role_title,
source_poid_id=source_poid,
is_current=True
)
return affiliation
def validate_ghcid_format(ghcid: str) -> bool:
"""Validate GHCID format."""
import re
# Pattern: CC-RR-SSS-T-ABBREV
pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
return bool(re.match(pattern, ghcid))
```
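For illustration, the `CC-RR-SSS-T-ABBREV` pattern can also be decomposed into its segments. The field names below (country, region, place, type code, abbreviation) are inferred from that pattern comment and are an assumption, not part of the GHCID specification:

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedGhcid:
    country: str    # CC     -- e.g. "NL"
    region: str     # RR     -- e.g. "NH"
    place: str      # SSS    -- e.g. "HAA"
    type_code: str  # T      -- e.g. "A"
    abbrev: str     # ABBREV -- e.g. "NHA"

def parse_ghcid(ghcid: str) -> ParsedGhcid:
    """Split a GHCID into its five segments, validating the format first."""
    if not re.match(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$', ghcid):
        raise ValueError(f"Invalid GHCID format: {ghcid}")
    country, region, place, type_code, abbrev = ghcid.split("-")
    return ParsedGhcid(country, region, place, type_code, abbrev)
```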
### 6.2 RDF Integration
```turtle
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ghcid: <https://w3id.org/heritage/custodian/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
a ppidv:PersonReconstruction ;
schema:name "Jan van den Berg" ;
# Current employment
org:memberOf [
a org:Membership ;
org:organization ghcid:NL-NH-HAA-A-NHA ;
org:role [
a org:Role ;
schema:name "Senior Archivist"
] ;
ppidv:isCurrent true ;
schema:startDate "2015"^^xsd:gYear
] ;
# Direct link for simple queries
ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .
# The heritage institution (from GHCID system)
ghcid:NL-NH-HAA-A-NHA
a ppidv:HeritageCustodian ;
schema:name "Noord-Hollands Archief" ;
ppidv:ghcid "NL-NH-HAA-A-NHA" ;
schema:address [
schema:addressLocality "Haarlem" ;
schema:addressCountry "NL"
] .
```
### 6.3 SPARQL Queries
```sparql
# Find all persons affiliated with a specific institution
PREFIX ppid: <https://ppid.org/>
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>
SELECT ?prid ?name ?role ?isCurrent
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
ppidv:isCurrent ?isCurrent .
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) ?name
# Find all institutions a person has worked at
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent ?isCurrent .
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
OPTIONAL { ?membership schema:startDate ?startDate }
OPTIONAL { ?membership schema:endDate ?endDate }
OPTIONAL {
?membership org:role ?roleNode .
?roleNode schema:name ?role .
}
}
ORDER BY DESC(?isCurrent) DESC(?startDate)
# Find archivists across all Dutch archives
SELECT ?prid ?name ?institution ?institutionName
WHERE {
?prid a ppidv:PersonReconstruction ;
schema:name ?name ;
org:memberOf ?membership .
?membership org:organization ?institution ;
ppidv:isCurrent true ;
org:role ?roleNode .
?roleNode schema:name ?role .
FILTER(CONTAINS(LCASE(?role), "archivist"))
?institution ppidv:ghcid ?ghcid ;
schema:name ?institutionName .
FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
```
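When institution identifiers come from user input, interpolating them into SPARQL should go through the same format check, which doubles as a cheap injection guard. A hypothetical helper (the template is a trimmed variant of the first query above):

```python
import re

_GHCID_RE = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

_STAFF_QUERY = """\
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>

SELECT ?prid ?name WHERE {{
  ?prid a ppidv:PersonReconstruction ;
        schema:name ?name ;
        org:memberOf ?m .
  ?m org:organization ghcid:{ghcid} ;
     ppidv:isCurrent true .
}}
"""

def staff_query(ghcid: str) -> str:
    # Only identifiers matching the strict GHCID shape reach the query
    # text, so no SPARQL metacharacters can be smuggled in.
    if not _GHCID_RE.match(ghcid):
        raise ValueError(f"Invalid GHCID: {ghcid}")
    return _STAFF_QUERY.format(ghcid=ghcid)
```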
---
## 7. Security and Access Control
### 7.1 Authentication
```python
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, APIKeyHeader
from jose import JWTError, jwt
from passlib.context import CryptContext
from datetime import datetime, timedelta
from typing import Optional
import os

# Configuration
SECRET_KEY = os.environ.get("JWT_SECRET", "dev-only-secret")  # set via env; never ship the fallback
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token", auto_error=False)
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
class User:
def __init__(self, id: str, email: str, roles: list[str]):
self.id = id
self.email = email
self.roles = roles
def create_access_token(data: dict, expires_delta: Optional[timedelta] = None):
"""Create JWT access token."""
to_encode = data.copy()
    expire = datetime.utcnow() + (expires_delta or timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES))
to_encode.update({"exp": expire})
return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
async def get_current_user(
token: Optional[str] = Depends(oauth2_scheme),
api_key: Optional[str] = Depends(api_key_header),
db = Depends(get_db)
) -> User:
"""
Authenticate user via JWT token or API key.
"""
credentials_exception = HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Could not validate credentials",
headers={"WWW-Authenticate": "Bearer"},
)
# Try JWT token first
if token:
try:
payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
user_id: str = payload.get("sub")
if user_id is None:
raise credentials_exception
user = await db.get_user(user_id)
if user is None:
raise credentials_exception
return user
except JWTError:
pass
# Try API key
if api_key:
user = await db.get_user_by_api_key(api_key)
if user:
return user
raise credentials_exception
def require_role(required_roles: list[str]):
"""Dependency to require specific roles."""
async def role_checker(user: User = Depends(get_current_user)):
if not any(role in user.roles for role in required_roles):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="Insufficient permissions"
)
return user
return role_checker
```
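For intuition, the HS256 token produced by `jwt.encode` is just `base64url(header).base64url(payload).base64url(signature)` with an HMAC-SHA256 signature. A stdlib-only illustration of that signing step (not a replacement for `python-jose`, which also handles `exp` validation and other algorithms):

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def hs256_sign(payload: dict, secret: str) -> str:
    """Build a signed HS256 JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                                separators=(",", ":")).encode())
    body = _b64url(json.dumps(payload, separators=(",", ":")).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(),
                   hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(sig)}"

def hs256_verify(token: str, secret: str) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(_b64url(expected), sig)
```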
### 7.2 Authorization Roles
| Role | Permissions |
|------|-------------|
| `reader` | Read observations, reconstructions, claims |
| `contributor` | Create observations, add claims |
| `curator` | Create reconstructions, link observations, resolve conflicts |
| `admin` | Manage users, API keys, system configuration |
| `api_client` | Programmatic access via API key |
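The role table translates naturally into a permission map that an endpoint guard can consult. The permission names below are illustrative, not a fixed vocabulary:

```python
# Illustrative permission sets derived from the role table; roles from
# reader up to admin form a strict superset chain.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "reader":      {"read"},
    "contributor": {"read", "create_observation", "add_claim"},
    "curator":     {"read", "create_observation", "add_claim",
                    "create_reconstruction", "link_observation",
                    "resolve_conflict"},
    "admin":       {"read", "create_observation", "add_claim",
                    "create_reconstruction", "link_observation",
                    "resolve_conflict", "manage_users", "manage_api_keys",
                    "manage_config"},
    "api_client":  {"read", "create_observation", "add_claim"},
}

def has_permission(user_roles: list[str], permission: str) -> bool:
    """True if any of the user's roles grants the requested permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)
```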
### 7.3 Rate Limiting
```python
from fastapi import Request
import redis
from datetime import datetime
class RateLimiter:
"""
Token bucket rate limiter using Redis.
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
async def is_allowed(
self,
key: str,
max_requests: int = 100,
window_seconds: int = 60
) -> tuple[bool, dict]:
"""
Check if request is allowed under rate limit.
Returns:
Tuple of (is_allowed, rate_limit_info)
"""
        # datetime.utcnow().timestamp() would misread UTC wall time as local time
        from datetime import timezone
        now = datetime.now(timezone.utc).timestamp()
window_start = now - window_seconds
pipe = self.redis.pipeline()
# Remove old requests
pipe.zremrangebyscore(key, 0, window_start)
# Count requests in window
pipe.zcard(key)
# Add current request
pipe.zadd(key, {str(now): now})
# Set expiry
pipe.expire(key, window_seconds)
results = pipe.execute()
request_count = results[1]
is_allowed = request_count < max_requests
return is_allowed, {
"limit": max_requests,
"remaining": max(0, max_requests - request_count - 1),
"reset": int(now + window_seconds)
}
# Rate limit tiers
RATE_LIMITS = {
"anonymous": {"requests": 60, "window": 60},
"reader": {"requests": 100, "window": 60},
"contributor": {"requests": 500, "window": 60},
"curator": {"requests": 1000, "window": 60},
"api_client": {"requests": 5000, "window": 60},
}
```
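The sliding-window semantics of the Redis limiter can be exercised without Redis; the in-memory stand-in below keeps the same behaviour (at most `max_requests` per `window_seconds`) and is handy in unit tests:

```python
import time
from collections import deque
from typing import Optional

class LocalRateLimiter:
    """In-memory sliding-window limiter mirroring RateLimiter's semantics."""

    def __init__(self, max_requests: int = 100, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque] = {}

    def is_allowed(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        q = self.hits.setdefault(key, deque())
        # Evict requests that have slid out of the window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```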
---
## 8. Performance Requirements
### 8.1 SLOs (Service Level Objectives)
| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | 99.9% | Monthly uptime |
| **API Latency (p50)** | < 50ms | Response time |
| **API Latency (p99)** | < 500ms | Response time |
| **Search Latency** | < 200ms | Full-text search |
| **SPARQL Query** | < 1s | Simple queries |
| **Throughput** | 1000 req/s | Sustained load |
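The availability target implies a concrete error budget. A quick way to compute it (assuming a 30-day month):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

# 99.9% over a 30-day month leaves roughly 43.2 minutes of downtime budget
```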
### 8.2 Scaling Strategy
```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ppid-api
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ppid-api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 100
```
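For reference, the HPA derives its replica count from the documented scaling rule `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the configured bounds. In code:

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Kubernetes HPA scaling rule, clamped to the min/max from the spec above."""
    desired = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# e.g. 3 pods at 90% CPU against a 70% target scale to ceil(3 * 90/70) = 4 pods
```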
### 8.3 Caching Strategy
```python
import redis
from functools import wraps
import json
import hashlib
class CacheManager:
"""
Multi-tier caching for PPID.
Tiers:
1. L1: In-memory (per-instance)
2. L2: Redis (shared)
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.local_cache = {}
def cache_observation(self, ttl: int = 3600):
"""Cache observation lookups."""
def decorator(func):
@wraps(func)
async def wrapper(poid: str, *args, **kwargs):
cache_key = f"observation:{poid}"
# Check L1 cache
if cache_key in self.local_cache:
return self.local_cache[cache_key]
# Check L2 cache
cached = self.redis.get(cache_key)
if cached:
data = json.loads(cached)
self.local_cache[cache_key] = data
return data
# Fetch from database
result = await func(poid, *args, **kwargs)
if result:
# Store in both caches
self.redis.setex(cache_key, ttl, json.dumps(result))
self.local_cache[cache_key] = result
return result
return wrapper
return decorator
def cache_search(self, ttl: int = 300):
"""Cache search results (shorter TTL)."""
def decorator(func):
@wraps(func)
async def wrapper(query: str, *args, **kwargs):
# Create deterministic cache key from query params
key_data = {"query": query, "args": args, "kwargs": kwargs}
cache_key = f"search:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"
cached = self.redis.get(cache_key)
if cached:
return json.loads(cached)
result = await func(query, *args, **kwargs)
self.redis.setex(cache_key, ttl, json.dumps(result))
return result
return wrapper
return decorator
def invalidate_observation(self, poid: str):
"""Invalidate cache when observation is updated."""
cache_key = f"observation:{poid}"
self.redis.delete(cache_key)
self.local_cache.pop(cache_key, None)
```
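The deterministic search-key construction inside `cache_search` can be factored out and checked in isolation. A standalone mirror (assuming all arguments are JSON-serialisable):

```python
import hashlib
import json

def search_cache_key(query: str, *args, **kwargs) -> str:
    """Mirror of cache_search's key derivation: sort_keys makes the
    JSON canonical, so keyword-argument order does not matter."""
    key_data = {"query": query, "args": args, "kwargs": kwargs}
    digest = hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()
    return f"search:{digest}"
```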
---
## 9. Deployment Architecture
### 9.1 Kubernetes Deployment
```yaml
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: ppid-api
labels:
app: ppid
component: api
spec:
replicas: 3
selector:
matchLabels:
app: ppid
component: api
template:
metadata:
labels:
app: ppid
component: api
spec:
containers:
- name: api
image: ppid/api:latest
ports:
- containerPort: 8000
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: database-url
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: ppid-secrets
key: redis-url
- name: JWT_SECRET
valueFrom:
secretKeyRef:
name: ppid-secrets
key: jwt-secret
resources:
requests:
cpu: "250m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "2Gi"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
# Service
apiVersion: v1
kind: Service
metadata:
name: ppid-api
spec:
selector:
app: ppid
component: api
ports:
- port: 80
targetPort: 8000
type: ClusterIP
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ppid-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
tls:
- hosts:
- api.ppid.org
secretName: ppid-tls
rules:
- host: api.ppid.org
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ppid-api
port:
number: 80
```
### 9.2 Docker Compose (Development)
```yaml
services:
api:
build: .
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
- REDIS_URL=redis://redis:6379
- FUSEKI_URL=http://fuseki:3030
depends_on:
- postgres
- redis
- fuseki
volumes:
- ./src:/app/src
- ./archives:/app/archives
postgres:
image: postgres:16
environment:
POSTGRES_USER: ppid
POSTGRES_PASSWORD: ppid
POSTGRES_DB: ppid
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
redis:
image: redis:7-alpine
ports:
- "6379:6379"
fuseki:
image: stain/jena-fuseki
environment:
ADMIN_PASSWORD: admin
FUSEKI_DATASET_1: ppid
volumes:
- fuseki_data:/fuseki
ports:
- "3030:3030"
elasticsearch:
image: elasticsearch:8.11.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
volumes:
- es_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
depends_on:
- zookeeper
ports:
- "9092:9092"
zookeeper:
image: confluentinc/cp-zookeeper:7.5.0
environment:
ZOOKEEPER_CLIENT_PORT: 2181
volumes:
postgres_data:
fuseki_data:
es_data:
```
---
## 10. Monitoring and Observability
### 10.1 Prometheus Metrics
```python
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, Gauge
import time

app = FastAPI()
# Metrics
REQUEST_COUNT = Counter(
'ppid_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'ppid_request_latency_seconds',
'Request latency in seconds',
['method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
OBSERVATIONS_CREATED = Counter(
'ppid_observations_created_total',
'Total observations created'
)
RECONSTRUCTIONS_CREATED = Counter(
'ppid_reconstructions_created_total',
'Total reconstructions created'
)
ENTITY_RESOLUTION_LATENCY = Histogram(
'ppid_entity_resolution_seconds',
'Entity resolution latency',
buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)
CACHE_HITS = Counter(
'ppid_cache_hits_total',
'Cache hits',
['cache_type']
)
CACHE_MISSES = Counter(
'ppid_cache_misses_total',
'Cache misses',
['cache_type']
)
DB_CONNECTIONS = Gauge(
'ppid_db_connections',
'Active database connections'
)
# Middleware for request metrics
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
latency = time.time() - start_time
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.url.path
).observe(latency)
return response
```
### 10.2 Logging
```python
import logging
import json
from datetime import datetime
class JSONFormatter(logging.Formatter):
"""Structured JSON logging for observability."""
def format(self, record):
log_record = {
"timestamp": datetime.utcnow().isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
}
# Add extra fields
if hasattr(record, 'poid'):
log_record['poid'] = record.poid
if hasattr(record, 'prid'):
log_record['prid'] = record.prid
if hasattr(record, 'request_id'):
log_record['request_id'] = record.request_id
if hasattr(record, 'user_id'):
log_record['user_id'] = record.user_id
if hasattr(record, 'duration_ms'):
log_record['duration_ms'] = record.duration_ms
if record.exc_info:
log_record['exception'] = self.formatException(record.exc_info)
return json.dumps(log_record)
# Configure logging
def setup_logging():
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
# Reduce noise from libraries
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
```
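The formatter reads the custom fields (`poid`, `prid`, `request_id`, ...) off the `LogRecord`; callers attach them through logging's standard `extra` mechanism, which turns each key into a record attribute:

```python
import logging

captured: list[logging.LogRecord] = []

class CaptureHandler(logging.Handler):
    """Demo handler that keeps records so their attributes can be inspected."""
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record)

logger = logging.getLogger("ppid.demo")
logger.addHandler(CaptureHandler())
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

# Each key in `extra` becomes an attribute on the LogRecord,
# which is exactly where JSONFormatter looks for it.
logger.info("observation stored", extra={"poid": "poid-demo", "duration_ms": 12})
```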
### 10.3 Grafana Dashboard (JSON)
```json
{
"dashboard": {
"title": "PPID System Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)"
}
]
},
{
"title": "Request Latency (p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))"
}
]
},
{
"title": "Observations Created",
"type": "stat",
"targets": [
{
"expr": "sum(ppid_observations_created_total)"
}
]
},
{
"title": "Cache Hit Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))"
}
]
}
]
}
}
```
---
## 11. Implementation Checklist
### 11.1 Phase 1: Core Infrastructure
- [ ] Set up PostgreSQL database with schema
- [ ] Set up Apache Jena Fuseki for RDF
- [ ] Set up Redis for caching
- [ ] Implement POID/PRID generation
- [ ] Implement checksum validation
- [ ] Create basic REST API endpoints
### 11.2 Phase 2: Data Ingestion
- [ ] Build web scraping infrastructure
- [ ] Implement HTML archival
- [ ] Integrate LLM for extraction
- [ ] Implement XPath validation
- [ ] Set up Kafka for async processing
- [ ] Create ingestion pipeline
### 11.3 Phase 3: Entity Resolution
- [ ] Implement blocking strategies
- [ ] Implement similarity metrics
- [ ] Build clustering algorithm
- [ ] Create human-in-loop review UI
- [ ] Integrate with reconstruction creation
### 11.4 Phase 4: GHCID Integration
- [ ] Implement affiliation linking
- [ ] Add SPARQL queries for institution lookups
- [ ] Create bidirectional navigation
- [ ] Sync with GHCID registry updates
### 11.5 Phase 5: Production Readiness
- [ ] Implement authentication/authorization
- [ ] Set up rate limiting
- [ ] Configure monitoring and alerting
- [ ] Create backup and recovery procedures
- [ ] Performance testing and optimization
- [ ] Security audit
---
## 12. References
### Standards
- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/
### Related PPID Documents
- [Identifier Structure Design](./05_identifier_structure_design.md)
- [Entity Resolution Patterns](./06_entity_resolution_patterns.md)
- [Claims and Provenance](./07_claims_and_provenance.md)
### Technologies
- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/