# Implementation Guidelines

**Version**: 0.1.0
**Last Updated**: 2025-01-09
**Related**: [Identifier Structure](./05_identifier_structure_design.md) | [Claims and Provenance](./07_claims_and_provenance.md)

---

## 1. Overview

This document provides comprehensive technical specifications for implementing the PPID system:

- Database architecture (PostgreSQL + RDF triple store)
- API design (REST + GraphQL)
- Data ingestion pipeline
- GHCID integration patterns
- Security and access control
- Performance requirements
- Technology stack recommendations
- Deployment architecture
- Monitoring and observability

---

## 2. Architecture Overview

### 2.1 System Components

```
┌─────────────────────────────────────────────────────────────────┐
│                           PPID System                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │  Ingestion   │────▶│  Processing  │────▶│   Storage    │     │
│  │    Layer     │     │    Layer     │     │    Layer     │     │
│  └──────────────┘     └──────────────┘     └──────────────┘     │
│         │                    │                    │             │
│         ▼                    ▼                    ▼             │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
│  │ Web Scrapers │     │ Entity Res.  │     │  PostgreSQL  │     │
│  │ API Clients  │     │   NER/NLP    │     │ (Relational) │     │
│  │ File Import  │     │  Validation  │     ├──────────────┤     │
│  └──────────────┘     └──────────────┘     │ Apache Jena  │     │
│                                            │ (RDF/SPARQL) │     │
│                                            ├──────────────┤     │
│                                            │    Redis     │     │
│                                            │   (Cache)    │     │
│                                            └──────────────┘     │
│                                                   │             │
│              ┌────────────────────────────────────┤             │
│              ▼                                    ▼             │
│       ┌──────────────┐                     ┌──────────────┐     │
│       │   REST API   │                     │   GraphQL    │     │
│       │  (FastAPI)   │                     │  (Ariadne)   │     │
│       └──────────────┘                     └──────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 2.2 Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| **API Framework** | FastAPI (Python 3.11+) | REST API, async support |
| **GraphQL** | Ariadne | GraphQL endpoint |
| **Relational DB** | PostgreSQL 16 | Primary data store |
| **Triple Store** | Apache Jena Fuseki | RDF/SPARQL queries |
| **Cache** | Redis 7 | Sessions, rate limiting, caching |
| **Queue** | Apache Kafka | Async processing pipeline |
| **Search** | Elasticsearch 8 | Full-text search |
| **Object Storage** | MinIO / S3 | HTML archives, files |
| **Container** | Docker + Kubernetes | Deployment |
| **Monitoring** | Prometheus + Grafana | Metrics and alerting |
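
For local development, the core of this stack can be brought up with a compose file along these lines. This is a sketch, not part of the specification: image tags, the Fuseki community image, ports, and credentials are illustrative assumptions.

```yaml
version: "3.9"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ppid
      POSTGRES_PASSWORD: changeme   # placeholder; use secrets in production
    ports: ["5432:5432"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
  fuseki:
    image: stain/jena-fuseki        # community-maintained image; assumption
    ports: ["3030:3030"]
  api:
    build: .
    depends_on: [postgres, redis, fuseki]
    ports: ["8000:8000"]
```

Kafka, Elasticsearch, and MinIO are omitted here for brevity; they join the same network in the full deployment.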
---

## 3. Database Schema

### 3.1 PostgreSQL Schema

```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "pg_trgm"; -- For fuzzy matching

-- Enum types
CREATE TYPE ppid_type AS ENUM ('POID', 'PRID');
CREATE TYPE claim_status AS ENUM ('active', 'superseded', 'retracted');
CREATE TYPE source_type AS ENUM (
    'official_registry',
    'institutional_website',
    'professional_network',
    'social_media',
    'news_article',
    'academic_publication',
    'user_submitted',
    'inferred'
);

-- Users (curators and API clients); referenced by the curation and link tables below
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    username VARCHAR(100) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Person Observations (POID)
CREATE TABLE person_observations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    poid VARCHAR(24) UNIQUE NOT NULL, -- POID-xxxx-xxxx-xxxx-xxxx

    -- Source metadata
    source_url TEXT NOT NULL,
    source_type source_type NOT NULL,
    retrieved_at TIMESTAMPTZ NOT NULL,
    content_hash VARCHAR(64) NOT NULL, -- SHA-256
    html_archive_path TEXT,

    -- Extracted name components (PNV-compatible)
    literal_name TEXT,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT, -- van, de, etc.
    patronymic TEXT,
    generation_suffix TEXT, -- Jr., III, etc.

    -- Metadata
    extraction_agent VARCHAR(100),
    extraction_confidence DECIMAL(3,2),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- Format check
    CONSTRAINT valid_poid CHECK (poid ~ '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Person Reconstructions (PRID)
CREATE TABLE person_reconstructions (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid VARCHAR(24) UNIQUE NOT NULL, -- PRID-xxxx-xxxx-xxxx-xxxx

    -- Canonical name (resolved from observations)
    canonical_name TEXT NOT NULL,
    given_name TEXT,
    surname TEXT,
    surname_prefix TEXT,

    -- Curation metadata
    curator_id UUID REFERENCES users(id),
    curation_method VARCHAR(50), -- 'manual', 'algorithmic', 'hybrid'
    confidence_score DECIMAL(3,2),

    -- Versioning
    version INTEGER DEFAULT 1,
    previous_version_id UUID REFERENCES person_reconstructions(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT valid_prid CHECK (prid ~ '^PRID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$')
);

-- Link table: PRID derives from POIDs
CREATE TABLE reconstruction_observations (
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    poid_id UUID REFERENCES person_observations(id) ON DELETE CASCADE,
    linked_at TIMESTAMPTZ DEFAULT NOW(),
    linked_by UUID REFERENCES users(id),
    link_confidence DECIMAL(3,2),
    PRIMARY KEY (prid_id, poid_id)
);

-- Claims (assertions with provenance)
CREATE TABLE claims (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    claim_id VARCHAR(50) UNIQUE NOT NULL,

    -- Subject (what this claim is about)
    poid_id UUID REFERENCES person_observations(id),

    -- Claim content
    claim_type VARCHAR(50) NOT NULL, -- 'job_title', 'employer', 'email', etc.
    claim_value TEXT NOT NULL,

    -- Provenance (MANDATORY per Rule 6)
    source_url TEXT NOT NULL,
    retrieved_on TIMESTAMPTZ NOT NULL,
    xpath TEXT NOT NULL,
    html_file TEXT NOT NULL,
    xpath_match_score DECIMAL(3,2) NOT NULL,
    content_hash VARCHAR(64),

    -- Quality
    confidence DECIMAL(3,2),
    extraction_agent VARCHAR(100),
    status claim_status DEFAULT 'active',

    -- Relationships
    supersedes_id UUID REFERENCES claims(id),

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    verified_at TIMESTAMPTZ,

    -- Range check
    CONSTRAINT valid_xpath_score CHECK (xpath_match_score >= 0 AND xpath_match_score <= 1)
);

-- Claim relationships (supports, conflicts)
CREATE TABLE claim_relationships (
    claim_a_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    claim_b_id UUID REFERENCES claims(id) ON DELETE CASCADE,
    relationship_type VARCHAR(20) NOT NULL, -- 'supports', 'conflicts_with'
    notes TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (claim_a_id, claim_b_id, relationship_type)
);

-- External identifiers (ORCID, ISNI, VIAF, Wikidata)
CREATE TABLE external_identifiers (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    identifier_scheme VARCHAR(20) NOT NULL, -- 'orcid', 'isni', 'viaf', 'wikidata'
    identifier_value VARCHAR(100) NOT NULL,
    verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    source_url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (prid_id, identifier_scheme, identifier_value)
);

-- GHCID links (person to heritage institution)
CREATE TABLE ghcid_affiliations (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    prid_id UUID REFERENCES person_reconstructions(id) ON DELETE CASCADE,
    ghcid VARCHAR(50) NOT NULL, -- e.g., NL-NH-HAA-A-NHA

    -- Affiliation details
    role_title TEXT,
    department TEXT,
    affiliation_start DATE,
    affiliation_end DATE,
    is_current BOOLEAN DEFAULT TRUE,

    -- Provenance
    source_poid_id UUID REFERENCES person_observations(id),
    confidence DECIMAL(3,2),

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_observations_source_url ON person_observations(source_url);
CREATE INDEX idx_observations_content_hash ON person_observations(content_hash);
CREATE INDEX idx_observations_name_trgm ON person_observations
    USING gin (literal_name gin_trgm_ops);

CREATE INDEX idx_reconstructions_name_trgm ON person_reconstructions
    USING gin (canonical_name gin_trgm_ops);

CREATE INDEX idx_claims_type ON claims(claim_type);
CREATE INDEX idx_claims_poid ON claims(poid_id);
CREATE INDEX idx_claims_status ON claims(status);

CREATE INDEX idx_ghcid_affiliations_ghcid ON ghcid_affiliations(ghcid);
CREATE INDEX idx_ghcid_affiliations_current ON ghcid_affiliations(prid_id)
    WHERE is_current = TRUE;

-- Full-text search
CREATE INDEX idx_observations_fts ON person_observations
    USING gin (to_tsvector('english', literal_name));
CREATE INDEX idx_reconstructions_fts ON person_reconstructions
    USING gin (to_tsvector('english', canonical_name));
```
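
The `valid_poid` constraint accepts identifiers shaped `POID-xxxx-xxxx-xxxx-xxxx` in lowercase hex. The normative derivation rules live in the Identifier Structure chapter; as a minimal sketch of deterministic generation consistent with that shape (the SHA-256-over-metadata recipe here is an illustrative assumption, not the specified algorithm):

```python
import hashlib
import re

# Same shape the valid_poid CHECK constraint enforces
POID_PATTERN = re.compile(
    r"^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$"
)

def sketch_poid(source_url: str, retrieved_at: str, content_hash: str) -> str:
    """Derive a POID-shaped identifier from source metadata.

    Illustrative only: hashes the metadata with SHA-256 and formats the
    first 16 hex digits into the four 4-character groups the schema expects.
    """
    digest = hashlib.sha256(
        f"{source_url}|{retrieved_at}|{content_hash}".encode()
    ).hexdigest()
    groups = [digest[i:i + 4] for i in range(0, 16, 4)]
    return "POID-" + "-".join(groups)

poid = sketch_poid(
    "https://example.org/staff/jan", "2025-01-09T14:30:00Z", "abc123"
)
assert POID_PATTERN.match(poid)  # deterministic and schema-valid
```

Because the input includes the retrieval timestamp and content hash, re-observing the same page at a different time yields a new POID, which matches the observation-per-snapshot model.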

### 3.2 RDF Triple Store Schema

```turtle
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ppidt: <https://ppid.org/type#> .
@prefix picom: <https://personsincontext.org/model#> .
@prefix pnv: <https://w3id.org/pnv#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix schema: <http://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Example Person Observation
ppid:POID-7a3b-c4d5-e6f7-890X a ppidt:PersonObservation, picom:PersonObservation ;
    ppidv:poid "POID-7a3b-c4d5-e6f7-890X" ;

    # Name (PNV structured)
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Provenance
    prov:wasDerivedFrom <https://linkedin.com/in/jan-van-den-berg> ;
    prov:wasGeneratedBy ppid:extraction-activity-001 ;
    prov:generatedAtTime "2025-01-09T14:30:00Z"^^xsd:dateTime ;

    # Claims
    ppidv:hasClaim ppid:claim-001, ppid:claim-002, ppid:claim-003 .

# Example Person Reconstruction
ppid:PRID-1234-5678-90ab-cde5 a ppidt:PersonReconstruction, picom:PersonReconstruction ;
    ppidv:prid "PRID-1234-5678-90ab-cde5" ;

    # Canonical name
    schema:name "Jan van den Berg" ;
    pnv:hasName [
        a pnv:PersonName ;
        pnv:literalName "Jan van den Berg" ;
        pnv:givenName "Jan" ;
        pnv:surnamePrefix "van den" ;
        pnv:baseSurname "Berg"
    ] ;

    # Derived from observations
    prov:wasDerivedFrom ppid:POID-7a3b-c4d5-e6f7-890X,
        ppid:POID-8c4d-e5f6-a7b8-901X ;

    # GHCID affiliation
    ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> ;

    # External identifiers
    ppidv:orcid "0000-0002-1234-5678" ;
    owl:sameAs <https://orcid.org/0000-0002-1234-5678> .
```
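
A typical query against the Fuseki SPARQL endpoint, using the vocabulary above to find reconstructions affiliated with a given GHCID (the type IRI and property names follow the examples; the query itself is illustrative):

```sparql
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ppidt: <https://ppid.org/type#>
PREFIX schema: <http://schema.org/>

SELECT ?prid ?name WHERE {
  ?person a ppidt:PersonReconstruction ;
          ppidv:prid ?prid ;
          schema:name ?name ;
          ppidv:affiliatedWith <https://w3id.org/heritage/custodian/NL-NH-HAA-A-NHA> .
}
```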

---

## 4. API Design

### 4.1 REST API Endpoints

```yaml
openapi: 3.1.0
info:
  title: PPID API
  version: 1.0.0
  description: Person Persistent Identifier API

servers:
  - url: https://api.ppid.org/v1

paths:
  # Person Observations
  /observations:
    post:
      summary: Create new person observation
      operationId: createObservation
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateObservationRequest'
      responses:
        '201':
          description: Observation created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'

    get:
      summary: Search observations
      operationId: searchObservations
      parameters:
        - name: name
          in: query
          schema:
            type: string
        - name: source_url
          in: query
          schema:
            type: string
        - name: limit
          in: query
          schema:
            type: integer
            default: 20
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        '200':
          description: List of observations
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ObservationList'

  /observations/{poid}:
    get:
      summary: Get observation by POID
      operationId: getObservation
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
            pattern: '^POID-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$'
      responses:
        '200':
          description: Person observation
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonObservation'
            text/turtle:
              schema:
                type: string
            application/ld+json:
              schema:
                type: object

  /observations/{poid}/claims:
    get:
      summary: Get claims for observation
      operationId: getObservationClaims
      parameters:
        - name: poid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: List of claims
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ClaimList'

  # Person Reconstructions
  /reconstructions:
    post:
      summary: Create person reconstruction from observations
      operationId: createReconstruction
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateReconstructionRequest'
      responses:
        '201':
          description: Reconstruction created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}:
    get:
      summary: Get reconstruction by PRID
      operationId: getReconstruction
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Person reconstruction
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PersonReconstruction'

  /reconstructions/{prid}/observations:
    get:
      summary: Get observations linked to reconstruction
      operationId: getReconstructionObservations
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Linked observations

  /reconstructions/{prid}/history:
    get:
      summary: Get version history
      operationId: getReconstructionHistory
      parameters:
        - name: prid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Version history

  # Validation
  /validate/{ppid}:
    get:
      summary: Validate PPID format and checksum
      operationId: validatePpid
      parameters:
        - name: ppid
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Validation result
          content:
            application/json:
              schema:
                type: object
                properties:
                  valid:
                    type: boolean
                  type:
                    type: string
                    enum: [POID, PRID]
                  exists:
                    type: boolean

  # Search
  /search:
    get:
      summary: Full-text search across all records
      operationId: search
      parameters:
        - name: q
          in: query
          required: true
          schema:
            type: string
        - name: type
          in: query
          schema:
            type: string
            enum: [observation, reconstruction, all]
            default: all
      responses:
        '200':
          description: Search results

  # Entity Resolution
  /resolve:
    post:
      summary: Find matching records for input data
      operationId: resolveEntity
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/EntityResolutionRequest'
      responses:
        '200':
          description: Resolution candidates
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ResolutionCandidates'

components:
  schemas:
    # NOTE: ObservationList, ClaimList, Claim, EntityResolutionRequest, and
    # ResolutionCandidates are referenced above but elided here for brevity.
    CreateObservationRequest:
      type: object
      required:
        - source_url
        - retrieved_at
        - claims
      properties:
        source_url:
          type: string
          format: uri
        source_type:
          type: string
          enum: [official_registry, institutional_website, professional_network, social_media]
        retrieved_at:
          type: string
          format: date-time
        content_hash:
          type: string
        html_archive_path:
          type: string
        extraction_agent:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/ClaimInput'

    ClaimInput:
      type: object
      required:
        - claim_type
        - claim_value
        - xpath
        - xpath_match_score
      properties:
        claim_type:
          type: string
        claim_value:
          type: string
        xpath:
          type: string
        xpath_match_score:
          type: number
          minimum: 0
          maximum: 1
        confidence:
          type: number

    PersonObservation:
      type: object
      properties:
        poid:
          type: string
        source_url:
          type: string
        literal_name:
          type: string
        claims:
          type: array
          items:
            $ref: '#/components/schemas/Claim'
        created_at:
          type: string
          format: date-time

    CreateReconstructionRequest:
      type: object
      required:
        - observation_ids
      properties:
        observation_ids:
          type: array
          items:
            type: string
          minItems: 1
        canonical_name:
          type: string
        external_identifiers:
          type: object

    PersonReconstruction:
      type: object
      properties:
        prid:
          type: string
        canonical_name:
          type: string
        observations:
          type: array
          items:
            type: string
        external_identifiers:
          type: object
        ghcid_affiliations:
          type: array
          items:
            $ref: '#/components/schemas/GhcidAffiliation'

    GhcidAffiliation:
      type: object
      properties:
        ghcid:
          type: string
        role_title:
          type: string
        is_current:
          type: boolean

  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
    apiKey:
      type: apiKey
      in: header
      name: X-API-Key

security:
  - bearerAuth: []
  - apiKey: []
```
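
Clients can cheaply pre-validate identifiers against the path pattern above before calling `/validate/{ppid}`; the server remains authoritative for checksum and existence. A sketch generalizing the spec's POID pattern to both prefixes:

```python
import re

# Path pattern from the OpenAPI spec, generalized to both identifier prefixes
PPID_PATTERN = re.compile(
    r"^(POID|PRID)-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{3}[0-9a-fxX]$"
)

def looks_like_ppid(value: str) -> bool:
    """Client-side format check; does not verify checksum or existence."""
    return PPID_PATTERN.match(value) is not None

assert looks_like_ppid("POID-7a3b-c4d5-e6f7-890X")
assert not looks_like_ppid("POID-7a3b-c4d5-e6f7")       # too short
assert not looks_like_ppid("PRID-7A3B-c4d5-e6f7-8901")  # uppercase hex rejected
```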

### 4.2 GraphQL Schema

```graphql
# Custom scalars used throughout
scalar DateTime
scalar Date

type Query {
  # Observations
  observation(poid: ID!): PersonObservation
  observations(
    name: String
    sourceUrl: String
    limit: Int = 20
    offset: Int = 0
  ): ObservationConnection!

  # Reconstructions
  reconstruction(prid: ID!): PersonReconstruction
  reconstructions(
    name: String
    ghcid: String
    limit: Int = 20
    offset: Int = 0
  ): ReconstructionConnection!

  # Search
  search(query: String!, type: SearchType = ALL): SearchResults!

  # Validation
  validatePpid(ppid: String!): ValidationResult!

  # Entity Resolution
  resolveEntity(input: EntityResolutionInput!): [ResolutionCandidate!]!
}

type Mutation {
  # Create observation
  createObservation(input: CreateObservationInput!): PersonObservation!

  # Create reconstruction
  createReconstruction(input: CreateReconstructionInput!): PersonReconstruction!

  # Link observation to reconstruction
  linkObservation(prid: ID!, poid: ID!): PersonReconstruction!

  # Update reconstruction
  updateReconstruction(prid: ID!, input: UpdateReconstructionInput!): PersonReconstruction!

  # Add external identifier
  addExternalIdentifier(
    prid: ID!
    scheme: IdentifierScheme!
    value: String!
  ): PersonReconstruction!

  # Add GHCID affiliation
  addGhcidAffiliation(
    prid: ID!
    ghcid: String!
    roleTitle: String
    isCurrent: Boolean = true
  ): PersonReconstruction!
}

type PersonObservation {
  poid: ID!
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String

  # Name components
  literalName: String
  givenName: String
  surname: String
  surnamePrefix: String

  # Related data
  claims: [Claim!]!
  linkedReconstructions: [PersonReconstruction!]!

  # Metadata
  extractionAgent: String
  extractionConfidence: Float
  createdAt: DateTime!
}

type PersonReconstruction {
  prid: ID!
  canonicalName: String!
  givenName: String
  surname: String
  surnamePrefix: String

  # Linked observations
  observations: [PersonObservation!]!

  # External identifiers
  orcid: String
  isni: String
  viaf: String
  wikidata: String
  externalIdentifiers: [ExternalIdentifier!]!

  # GHCID affiliations
  ghcidAffiliations: [GhcidAffiliation!]!
  currentAffiliations: [GhcidAffiliation!]!

  # Versioning
  version: Int!
  previousVersion: PersonReconstruction
  history: [PersonReconstruction!]!

  # Curation
  curator: User
  curationMethod: CurationMethod
  confidenceScore: Float

  createdAt: DateTime!
  updatedAt: DateTime!
}

type Claim {
  id: ID!
  claimType: ClaimType!
  claimValue: String!

  # Provenance (MANDATORY)
  sourceUrl: String!
  retrievedOn: DateTime!
  xpath: String!
  htmlFile: String!
  xpathMatchScore: Float!

  # Quality
  confidence: Float
  extractionAgent: String
  status: ClaimStatus!

  # Relationships
  supports: [Claim!]!
  conflictsWith: [Claim!]!
  supersedes: Claim

  createdAt: DateTime!
}

type GhcidAffiliation {
  ghcid: String!
  institution: HeritageCustodian # Resolved from GHCID
  roleTitle: String
  department: String
  startDate: Date
  endDate: Date
  isCurrent: Boolean!
  confidence: Float
}

type HeritageCustodian {
  ghcid: String!
  name: String!
  institutionType: String!
  city: String
  country: String
}

type ExternalIdentifier {
  scheme: IdentifierScheme!
  value: String!
  verified: Boolean!
  verifiedAt: DateTime
}

enum SourceType {
  OFFICIAL_REGISTRY
  INSTITUTIONAL_WEBSITE
  PROFESSIONAL_NETWORK
  SOCIAL_MEDIA
  NEWS_ARTICLE
  ACADEMIC_PUBLICATION
  USER_SUBMITTED
  INFERRED
}

enum ClaimType {
  FULL_NAME
  GIVEN_NAME
  FAMILY_NAME
  JOB_TITLE
  EMPLOYER
  EMPLOYER_GHCID
  EMAIL
  LINKEDIN_URL
  ORCID
  BIRTH_DATE
  EDUCATION
}

enum ClaimStatus {
  ACTIVE
  SUPERSEDED
  RETRACTED
}

enum IdentifierScheme {
  ORCID
  ISNI
  VIAF
  WIKIDATA
  LOC_NAF
}

enum CurationMethod {
  MANUAL
  ALGORITHMIC
  HYBRID
}

enum SearchType {
  OBSERVATION
  RECONSTRUCTION
  ALL
}

input CreateObservationInput {
  sourceUrl: String!
  sourceType: SourceType!
  retrievedAt: DateTime!
  contentHash: String
  htmlArchivePath: String
  extractionAgent: String
  claims: [ClaimInput!]!
}

input ClaimInput {
  claimType: ClaimType!
  claimValue: String!
  xpath: String!
  xpathMatchScore: Float!
  confidence: Float
}

input CreateReconstructionInput {
  observationIds: [ID!]!
  canonicalName: String
  externalIdentifiers: ExternalIdentifiersInput
}

input EntityResolutionInput {
  name: String!
  employer: String
  jobTitle: String
  email: String
  linkedinUrl: String
}

type ResolutionCandidate {
  reconstruction: PersonReconstruction
  observation: PersonObservation
  matchScore: Float!
  matchFactors: [MatchFactor!]!
}

type MatchFactor {
  field: String!
  score: Float!
  method: String!
}

# Supporting types referenced above (minimal definitions; shapes illustrative)
type ObservationConnection {
  nodes: [PersonObservation!]!
  totalCount: Int!
}

type ReconstructionConnection {
  nodes: [PersonReconstruction!]!
  totalCount: Int!
}

type SearchResults {
  observations: [PersonObservation!]!
  reconstructions: [PersonReconstruction!]!
}

type ValidationResult {
  valid: Boolean!
  type: String
  exists: Boolean
  error: String
}

type User {
  id: ID!
  username: String!
}

input UpdateReconstructionInput {
  canonicalName: String
}

input ExternalIdentifiersInput {
  orcid: String
  isni: String
  viaf: String
  wikidata: String
}
```
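
An example query against this schema, fetching a reconstruction together with its provenance-bearing claims and current affiliations (the PRID is the example identifier from §3.2):

```graphql
query GetReconstruction {
  reconstruction(prid: "PRID-1234-5678-90ab-cde5") {
    canonicalName
    observations {
      poid
      sourceUrl
      claims {
        claimType
        claimValue
        sourceUrl
        xpathMatchScore
      }
    }
    currentAffiliations {
      ghcid
      roleTitle
    }
  }
}
```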

### 4.3 FastAPI Implementation

```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI, HTTPException, Depends, Query
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel, Field

app = FastAPI(
    title="PPID API",
    description="Person Persistent Identifier API",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to known origins in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")


# --- Pydantic Models ---

class ClaimInput(BaseModel):
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float = Field(ge=0, le=1)
    confidence: Optional[float] = Field(None, ge=0, le=1)


class CreateObservationRequest(BaseModel):
    source_url: str
    source_type: str = "institutional_website"
    retrieved_at: datetime
    content_hash: Optional[str] = None
    html_archive_path: Optional[str] = None
    extraction_agent: Optional[str] = None
    claims: list[ClaimInput]


class PersonObservationResponse(BaseModel):
    poid: str
    source_url: str
    source_type: str
    retrieved_at: datetime
    literal_name: Optional[str] = None
    claims: list[dict]
    created_at: datetime


class CreateReconstructionRequest(BaseModel):
    observation_ids: list[str]
    canonical_name: Optional[str] = None
    external_identifiers: Optional[dict] = None


class ValidationResult(BaseModel):
    valid: bool
    ppid_type: Optional[str] = None
    exists: Optional[bool] = None
    error: Optional[str] = None


# --- Dependencies ---

async def get_db():
    """Database connection dependency."""
    # In production, yield a connection from a pool
    pass


async def get_current_user(token: str = Depends(oauth2_scheme)):
    """Authenticate user from API key or JWT."""
    pass


# --- Endpoints ---

@app.post("/api/v1/observations", response_model=PersonObservationResponse)
async def create_observation(
    request: CreateObservationRequest,
    db = Depends(get_db)
):
    """
    Create a new Person Observation from extracted data.

    The POID is generated deterministically from source metadata.
    """
    from ppid.identifiers import generate_poid

    # Generate deterministic POID
    poid = generate_poid(
        source_url=request.source_url,
        retrieval_timestamp=request.retrieved_at.isoformat(),
        content_hash=request.content_hash or ""
    )

    # Check for existing observation with same POID
    existing = await db.get_observation(poid)
    if existing:
        return existing

    # Extract name from claims
    literal_name = None
    for claim in request.claims:
        if claim.claim_type == "full_name":
            literal_name = claim.claim_value
            break

    # Create observation record
    observation = await db.create_observation(
        poid=poid,
        source_url=request.source_url,
        source_type=request.source_type,
        retrieved_at=request.retrieved_at,
        content_hash=request.content_hash,
        html_archive_path=request.html_archive_path,
        literal_name=literal_name,
        extraction_agent=request.extraction_agent,
        claims=[c.model_dump() for c in request.claims]
    )

    return observation


@app.get("/api/v1/observations/{poid}", response_model=PersonObservationResponse)
async def get_observation(poid: str, db = Depends(get_db)):
    """Get Person Observation by POID."""
    from ppid.identifiers import validate_ppid

    is_valid, error = validate_ppid(poid)
    if not is_valid:
        raise HTTPException(status_code=400, detail=f"Invalid POID: {error}")

    observation = await db.get_observation(poid)
    if not observation:
        raise HTTPException(status_code=404, detail="Observation not found")

    return observation


@app.get("/api/v1/validate/{ppid}", response_model=ValidationResult)
async def validate_ppid_endpoint(ppid: str, db = Depends(get_db)):
    """Validate PPID format, checksum, and existence."""
    from ppid.identifiers import validate_ppid_full

    is_valid, error = validate_ppid_full(ppid)

    if not is_valid:
        return ValidationResult(valid=False, error=error)

    ppid_type = "POID" if ppid.startswith("POID") else "PRID"

    # Check existence
    if ppid_type == "POID":
        exists = await db.observation_exists(ppid)
    else:
        exists = await db.reconstruction_exists(ppid)

    return ValidationResult(
        valid=True,
        ppid_type=ppid_type,
        exists=exists
    )


@app.post("/api/v1/reconstructions")
async def create_reconstruction(
    request: CreateReconstructionRequest,
    db = Depends(get_db),
    user = Depends(get_current_user)
):
    """Create Person Reconstruction from linked observations."""
    from ppid.identifiers import generate_prid

    # Validate all POIDs exist
    for poid in request.observation_ids:
        if not await db.observation_exists(poid):
            raise HTTPException(
                status_code=400,
                detail=f"Observation not found: {poid}"
            )

    # Generate deterministic PRID
    prid = generate_prid(
        observation_ids=request.observation_ids,
        curator_id=str(user.id),
        timestamp=datetime.utcnow().isoformat()
    )

    # Determine canonical name
    if request.canonical_name:
        canonical_name = request.canonical_name
    else:
        # Use name from highest-confidence observation
        observations = await db.get_observations(request.observation_ids)
        canonical_name = max(
            observations,
            key=lambda o: o.extraction_confidence or 0
        ).literal_name

    # Create reconstruction
    reconstruction = await db.create_reconstruction(
        prid=prid,
        canonical_name=canonical_name,
        observation_ids=request.observation_ids,
        curator_id=user.id,
        external_identifiers=request.external_identifiers
    )

    return reconstruction


@app.get("/api/v1/search")
async def search(
    q: str = Query(..., min_length=2),
    type: str = Query("all", pattern="^(observation|reconstruction|all)$"),
    limit: int = Query(20, ge=1, le=100),
    offset: int = Query(0, ge=0),
    db = Depends(get_db)
):
    """Full-text search across observations and reconstructions."""
    results = await db.search(
        query=q,
        search_type=type,
        limit=limit,
        offset=offset
    )
    return results


@app.post("/api/v1/resolve")
async def resolve_entity(
    name: str,
    employer: Optional[str] = None,
    job_title: Optional[str] = None,
    email: Optional[str] = None,
    db = Depends(get_db)
):
    """
    Entity resolution: find matching records for input data.

    Returns ranked candidates with match scores.
    """
    from ppid.entity_resolution import find_candidates

    candidates = await find_candidates(
        db=db,
        name=name,
        employer=employer,
        job_title=job_title,
        email=email
    )

    return {
        "candidates": candidates,
        "query": {
            "name": name,
            "employer": employer,
            "job_title": job_title,
            "email": email
        }
    }
```
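
The `/resolve` endpoint delegates to `ppid.entity_resolution.find_candidates`, whose matching logic is not shown in this chapter. The core name-similarity scoring it relies on can be sketched with the standard library; the field weights here are illustrative assumptions, not the specified algorithm:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(query: dict, candidate: dict) -> float:
    """Weighted combination of field similarities (weights illustrative)."""
    score = 0.6 * name_similarity(query["name"], candidate["name"])
    if query.get("employer") and candidate.get("employer"):
        score += 0.4 * name_similarity(query["employer"], candidate["employer"])
    return round(score, 3)

s = match_score(
    {"name": "Jan van den Berg", "employer": "Noord-Hollands Archief"},
    {"name": "J. van den Berg", "employer": "Noord-Hollands Archief"},
)
assert 0.0 <= s <= 1.0
```

A production resolver would add per-field match methods (exact email, phonetic name, GHCID equality) and report them as the `MatchFactor` entries the API exposes.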

---

## 5. Data Ingestion Pipeline

### 5.1 Pipeline Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION PIPELINE                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Source   │───▶│  Extract  │───▶│ Transform │───▶│   Load    │  │
│  │  (Fetch)  │    │   (NER)   │    │(Validate) │    │  (Store)  │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│        │                │                │                │        │
│        ▼                ▼                ▼                ▼        │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Archive  │    │  Claims   │    │   POID    │    │ Postgres  │  │
│  │   HTML    │    │   XPath   │    │ Generate  │    │  + RDF    │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
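
The Fetch stage archives raw HTML keyed by a SHA-256 content hash, matching the `content_hash VARCHAR(64)` column in the schema. A minimal sketch of that hashing step; the line-ending normalization is an illustrative assumption so the same page fetched over different transports hashes identically:

```python
import hashlib

def content_hash(html: str) -> str:
    """SHA-256 hex digest of the page body: 64 chars, fits VARCHAR(64)."""
    # Normalize CRLF and trim surrounding whitespace before hashing (assumption)
    normalized = html.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

h = content_hash("<html><body>Jan van den Berg</body></html>\r\n")
assert len(h) == 64
assert h == content_hash("<html><body>Jan van den Berg</body></html>\n")
```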

### 5.2 Pipeline Implementation

```python
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator
import hashlib
import asyncio

from kafka import KafkaProducer, KafkaConsumer


@dataclass
class SourceDocument:
    url: str
    html_content: str
    retrieved_at: datetime
    content_hash: str
    archive_path: str


@dataclass
class ExtractedClaim:
    claim_type: str
    claim_value: str
    xpath: str
    xpath_match_score: float
    confidence: float


@dataclass
class PersonObservationData:
    source: SourceDocument
    claims: list[ExtractedClaim]
    poid: str


class IngestionPipeline:
    """
    Main data ingestion pipeline for PPID.

    Stages:
    1. Fetch: Retrieve web pages, archive HTML
    2. Extract: NER/NLP to identify person data, generate claims with XPath
    3. Transform: Validate, generate POID, structure data
    4. Load: Store in PostgreSQL and RDF triple store
    """

    def __init__(
        self,
        db_pool,
        rdf_store,
        kafka_producer: KafkaProducer,
        archive_storage,
        llm_extractor
    ):
        self.db = db_pool
        self.rdf = rdf_store
        self.kafka = kafka_producer
        self.archive = archive_storage
        self.extractor = llm_extractor

    async def process_url(self, url: str) -> list[PersonObservationData]:
        """
        Full pipeline for a single URL.

        Returns list of PersonObservations extracted from the page.
        """
        # Stage 1: Fetch and archive
        source = await self._fetch_and_archive(url)

        # Stage 2: Extract claims with XPath
        observations = await self._extract_observations(source)

        # Stage 3: Transform and validate
        validated = await self._transform_and_validate(observations)

        # Stage 4: Load to databases
        await self._load_observations(validated)

        return validated

    async def _fetch_and_archive(self, url: str) -> SourceDocument:
        """Fetch URL and archive HTML."""
        from playwright.async_api import async_playwright
|
|
|
async with async_playwright() as p:
|
|
browser = await p.chromium.launch()
|
|
page = await browser.new_page()
|
|
|
|
await page.goto(url, wait_until='networkidle')
|
|
html_content = await page.content()
|
|
|
|
await browser.close()
|
|
|
|
# Calculate content hash
|
|
content_hash = hashlib.sha256(html_content.encode()).hexdigest()
|
|
|
|
# Archive HTML
|
|
retrieved_at = datetime.utcnow()
|
|
archive_path = await self.archive.store(
|
|
url=url,
|
|
content=html_content,
|
|
timestamp=retrieved_at
|
|
)
|
|
|
|
return SourceDocument(
|
|
url=url,
|
|
html_content=html_content,
|
|
retrieved_at=retrieved_at,
|
|
content_hash=content_hash,
|
|
archive_path=archive_path
|
|
)
|
|
|
|
async def _extract_observations(
|
|
self,
|
|
source: SourceDocument
|
|
) -> list[tuple[list[ExtractedClaim], str]]:
|
|
"""
|
|
Extract person observations with XPath provenance.
|
|
|
|
Uses LLM for extraction, then validates XPath.
|
|
"""
|
|
from lxml import html
|
|
|
|
# Parse HTML
|
|
tree = html.fromstring(source.html_content)
|
|
|
|
# Use LLM to extract person data with XPath
|
|
extraction_result = await self.extractor.extract_persons(
|
|
html_content=source.html_content,
|
|
source_url=source.url
|
|
)
|
|
|
|
observations = []
|
|
|
|
for person in extraction_result.persons:
|
|
validated_claims = []
|
|
|
|
for claim in person.claims:
|
|
# Verify XPath points to expected value
|
|
try:
|
|
elements = tree.xpath(claim.xpath)
|
|
if elements:
|
|
actual_value = elements[0].text_content().strip()
|
|
|
|
# Calculate match score
|
|
if actual_value == claim.claim_value:
|
|
match_score = 1.0
|
|
else:
|
|
from difflib import SequenceMatcher
|
|
match_score = SequenceMatcher(
|
|
None, actual_value, claim.claim_value
|
|
).ratio()
|
|
|
|
if match_score >= 0.8: # Accept if 80%+ match
|
|
validated_claims.append(ExtractedClaim(
|
|
claim_type=claim.claim_type,
|
|
claim_value=claim.claim_value,
|
|
xpath=claim.xpath,
|
|
xpath_match_score=match_score,
|
|
confidence=claim.confidence * match_score
|
|
))
|
|
except Exception as e:
|
|
# Skip claims with invalid XPath
|
|
continue
|
|
|
|
if validated_claims:
|
|
# Get literal name from claims
|
|
literal_name = next(
|
|
(c.claim_value for c in validated_claims if c.claim_type == 'full_name'),
|
|
None
|
|
)
|
|
observations.append((validated_claims, literal_name))
|
|
|
|
return observations
|
|
|
|
async def _transform_and_validate(
|
|
self,
|
|
observations: list[tuple[list[ExtractedClaim], str]],
|
|
source: SourceDocument
|
|
) -> list[PersonObservationData]:
|
|
"""Transform extracted data and generate POIDs."""
|
|
from ppid.identifiers import generate_poid
|
|
|
|
results = []
|
|
|
|
for claims, literal_name in observations:
|
|
# Generate deterministic POID
|
|
claims_hash = hashlib.sha256(
|
|
str(sorted([c.claim_value for c in claims])).encode()
|
|
).hexdigest()
|
|
|
|
poid = generate_poid(
|
|
source_url=source.url,
|
|
retrieval_timestamp=source.retrieved_at.isoformat(),
|
|
content_hash=f"{source.content_hash}:{claims_hash}"
|
|
)
|
|
|
|
results.append(PersonObservationData(
|
|
source=source,
|
|
claims=claims,
|
|
poid=poid
|
|
))
|
|
|
|
return results
|
|
|
|
async def _load_observations(
|
|
self,
|
|
observations: list[PersonObservationData]
|
|
) -> None:
|
|
"""Load observations to PostgreSQL and RDF store."""
|
|
for obs in observations:
|
|
# Check if already exists (idempotent)
|
|
existing = await self.db.get_observation(obs.poid)
|
|
if existing:
|
|
continue
|
|
|
|
# Insert to PostgreSQL
|
|
await self.db.create_observation(
|
|
poid=obs.poid,
|
|
source_url=obs.source.url,
|
|
source_type='institutional_website',
|
|
retrieved_at=obs.source.retrieved_at,
|
|
content_hash=obs.source.content_hash,
|
|
html_archive_path=obs.source.archive_path,
|
|
literal_name=next(
|
|
(c.claim_value for c in obs.claims if c.claim_type == 'full_name'),
|
|
None
|
|
),
|
|
claims=[{
|
|
'claim_type': c.claim_type,
|
|
'claim_value': c.claim_value,
|
|
'xpath': c.xpath,
|
|
'xpath_match_score': c.xpath_match_score,
|
|
'confidence': c.confidence
|
|
} for c in obs.claims]
|
|
)
|
|
|
|
# Insert to RDF triple store
|
|
await self._insert_rdf(obs)
|
|
|
|
# Publish to Kafka for downstream processing
|
|
self.kafka.send('ppid.observations.created', {
|
|
'poid': obs.poid,
|
|
'source_url': obs.source.url,
|
|
'timestamp': obs.source.retrieved_at.isoformat()
|
|
})
|
|
|
|
async def _insert_rdf(self, obs: PersonObservationData) -> None:
|
|
"""Insert observation as RDF triples."""
|
|
from rdflib import Graph, Namespace, Literal, URIRef
|
|
from rdflib.namespace import RDF, XSD
|
|
|
|
PPID = Namespace("https://ppid.org/")
|
|
PPIDV = Namespace("https://ppid.org/vocab#")
|
|
PROV = Namespace("http://www.w3.org/ns/prov#")
|
|
|
|
g = Graph()
|
|
|
|
obs_uri = PPID[obs.poid]
|
|
|
|
g.add((obs_uri, RDF.type, PPIDV.PersonObservation))
|
|
g.add((obs_uri, PPIDV.poid, Literal(obs.poid)))
|
|
g.add((obs_uri, PROV.wasDerivedFrom, URIRef(obs.source.url)))
|
|
g.add((obs_uri, PROV.generatedAtTime, Literal(
|
|
obs.source.retrieved_at.isoformat(),
|
|
datatype=XSD.dateTime
|
|
)))
|
|
|
|
# Add claims
|
|
for i, claim in enumerate(obs.claims):
|
|
claim_uri = PPID[f"{obs.poid}/claim/{i}"]
|
|
g.add((obs_uri, PPIDV.hasClaim, claim_uri))
|
|
g.add((claim_uri, PPIDV.claimType, Literal(claim.claim_type)))
|
|
g.add((claim_uri, PPIDV.claimValue, Literal(claim.claim_value)))
|
|
g.add((claim_uri, PPIDV.xpath, Literal(claim.xpath)))
|
|
g.add((claim_uri, PPIDV.xpathMatchScore, Literal(
|
|
claim.xpath_match_score, datatype=XSD.decimal
|
|
)))
|
|
|
|
# Insert to triple store
|
|
await self.rdf.insert(g)
|
|
```
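
The acceptance rule in `_extract_observations` (an exact match scores 1.0, otherwise a `SequenceMatcher` ratio with a 0.8 cutoff) can be exercised in isolation; the names below are illustrative:

```python
from difflib import SequenceMatcher

def xpath_match_score(actual: str, claimed: str) -> float:
    """Score agreement between the value found at an XPath and the claimed value."""
    if actual == claimed:
        return 1.0
    return SequenceMatcher(None, actual, claimed).ratio()

# An honorific prefix still clears the 0.8 threshold:
score = xpath_match_score("Dr. Jan van den Berg", "Jan van den Berg")
assert abs(score - 32 / 36) < 1e-9   # 2 * 16 matching chars / 36 total chars
assert score >= 0.8                  # claim would be accepted
assert xpath_match_score("Jan", "Piet") < 0.8   # unrelated value rejected
```

The cutoff trades recall for precision: markup drift (added titles, reordered whitespace) survives, while a stale XPath pointing at a different person is dropped.
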

---

## 6. GHCID Integration

### 6.1 Linking Persons to Institutions

```python
from dataclasses import dataclass
from typing import Optional
from datetime import date

@dataclass
class GhcidAffiliation:
    """
    Link between a person (PRID) and a heritage institution (GHCID).
    """
    ghcid: str  # e.g., "NL-NH-HAA-A-NHA"
    role_title: Optional[str] = None
    department: Optional[str] = None
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    is_current: bool = True
    source_poid: Optional[str] = None
    confidence: float = 0.9


async def link_person_to_institution(
    db,
    prid: str,
    ghcid: str,
    role_title: Optional[str] = None,
    source_poid: Optional[str] = None
) -> GhcidAffiliation:
    """
    Create link between person and heritage institution.

    Args:
        prid: Person Reconstruction ID
        ghcid: Global Heritage Custodian ID
        role_title: Job title at institution
        source_poid: Observation where affiliation was extracted

    Returns:
        Created affiliation record
    """
    # Validate PRID exists
    reconstruction = await db.get_reconstruction(prid)
    if not reconstruction:
        raise ValueError(f"Reconstruction not found: {prid}")

    # Validate GHCID format
    if not validate_ghcid_format(ghcid):
        raise ValueError(f"Invalid GHCID format: {ghcid}")

    # Create affiliation
    affiliation = await db.create_ghcid_affiliation(
        prid_id=reconstruction.id,
        ghcid=ghcid,
        role_title=role_title,
        source_poid_id=source_poid,
        is_current=True
    )

    return affiliation


def validate_ghcid_format(ghcid: str) -> bool:
    """Validate GHCID format."""
    import re
    # Pattern: CC-RR-SSS-T-ABBREV
    pattern = r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$'
    return bool(re.match(pattern, ghcid))
```
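
As a quick self-check of the GHCID grammar, here is a compiled-pattern variant of `validate_ghcid_format` with a few illustrative inputs:

```python
import re

# CC-RR-SSS-T-ABBREV: country, region, city, institution type, abbreviation
GHCID_PATTERN = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

def validate_ghcid_format(ghcid: str) -> bool:
    return bool(GHCID_PATTERN.match(ghcid))

assert validate_ghcid_format("NL-NH-HAA-A-NHA")        # Noord-Hollands Archief
assert not validate_ghcid_format("nl-nh-haa-a-nha")    # lowercase rejected
assert not validate_ghcid_format("NL-NH-HAA-A")        # missing abbreviation
```
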

### 6.2 RDF Integration

```turtle
@prefix ppid: <https://ppid.org/> .
@prefix ppidv: <https://ppid.org/vocab#> .
@prefix ghcid: <https://w3id.org/heritage/custodian/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Person with GHCID affiliation
ppid:PRID-1234-5678-90ab-cde5
    a ppidv:PersonReconstruction ;
    schema:name "Jan van den Berg" ;

    # Current employment
    org:memberOf [
        a org:Membership ;
        org:organization ghcid:NL-NH-HAA-A-NHA ;
        org:role [
            a org:Role ;
            schema:name "Senior Archivist"
        ] ;
        ppidv:isCurrent true ;
        schema:startDate "2015"^^xsd:gYear
    ] ;

    # Direct link for simple queries
    ppidv:affiliatedWith ghcid:NL-NH-HAA-A-NHA .

# The heritage institution (from GHCID system)
ghcid:NL-NH-HAA-A-NHA
    a ppidv:HeritageCustodian ;
    schema:name "Noord-Hollands Archief" ;
    ppidv:ghcid "NL-NH-HAA-A-NHA" ;
    schema:address [
        schema:addressLocality "Haarlem" ;
        schema:addressCountry "NL"
    ] .
```

### 6.3 SPARQL Queries

```sparql
# Find all persons affiliated with a specific institution
PREFIX ppid: <https://ppid.org/>
PREFIX ppidv: <https://ppid.org/vocab#>
PREFIX ghcid: <https://w3id.org/heritage/custodian/>
PREFIX schema: <http://schema.org/>
PREFIX org: <http://www.w3.org/ns/org#>

SELECT ?prid ?name ?role ?isCurrent
WHERE {
  ?prid a ppidv:PersonReconstruction ;
        schema:name ?name ;
        org:memberOf ?membership .

  ?membership org:organization ghcid:NL-NH-HAA-A-NHA ;
              ppidv:isCurrent ?isCurrent .

  OPTIONAL {
    ?membership org:role ?roleNode .
    ?roleNode schema:name ?role .
  }
}
ORDER BY DESC(?isCurrent) ?name


# Find all institutions a person has worked at
SELECT ?ghcid ?institutionName ?role ?startDate ?endDate ?isCurrent
WHERE {
  ppid:PRID-1234-5678-90ab-cde5 org:memberOf ?membership .

  ?membership org:organization ?institution ;
              ppidv:isCurrent ?isCurrent .

  ?institution ppidv:ghcid ?ghcid ;
               schema:name ?institutionName .

  OPTIONAL { ?membership schema:startDate ?startDate }
  OPTIONAL { ?membership schema:endDate ?endDate }
  OPTIONAL {
    ?membership org:role ?roleNode .
    ?roleNode schema:name ?role .
  }
}
ORDER BY DESC(?isCurrent) DESC(?startDate)


# Find archivists across all Dutch archives
SELECT ?prid ?name ?institution ?institutionName
WHERE {
  ?prid a ppidv:PersonReconstruction ;
        schema:name ?name ;
        org:memberOf ?membership .

  ?membership org:organization ?institution ;
              ppidv:isCurrent true ;
              org:role ?roleNode .

  ?roleNode schema:name ?role .
  FILTER(CONTAINS(LCASE(?role), "archivist"))

  ?institution ppidv:ghcid ?ghcid ;
               schema:name ?institutionName .
  FILTER(STRSTARTS(?ghcid, "NL-"))
}
ORDER BY ?institutionName ?name
```
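
Applications typically parameterize such queries rather than hard-code identifiers. A minimal sketch that builds the first query for an arbitrary GHCID, using the strict format grammar as injection protection (the helper names are illustrative; submitting the string to a SPARQL endpoint such as Fuseki is left out):

```python
import re

GHCID_RE = re.compile(r'^[A-Z]{2}-[A-Z]{2}-[A-Z]{3}-[A-Z]-[A-Z0-9]+$')

PREFIXES = (
    "PREFIX ppidv: <https://ppid.org/vocab#>\n"
    "PREFIX ghcid: <https://w3id.org/heritage/custodian/>\n"
    "PREFIX schema: <http://schema.org/>\n"
    "PREFIX org: <http://www.w3.org/ns/org#>\n"
)

def persons_at_institution_query(ghcid: str) -> str:
    # Validate before interpolating: the strict grammar doubles as
    # protection against SPARQL injection.
    if not GHCID_RE.match(ghcid):
        raise ValueError(f"invalid GHCID: {ghcid}")
    return PREFIXES + (
        "SELECT ?prid ?name WHERE {\n"
        "  ?prid a ppidv:PersonReconstruction ;\n"
        "        schema:name ?name ;\n"
        "        org:memberOf [ org:organization ghcid:" + ghcid + " ] .\n"
        "}\n"
    )

q = persons_at_institution_query("NL-NH-HAA-A-NHA")
assert "ghcid:NL-NH-HAA-A-NHA" in q
```
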

---

## 7. Security and Access Control

### 7.1 Authentication

```python
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, APIKeyHeader
from jose import JWTError, jwt
from passlib.context import CryptContext
from datetime import datetime, timedelta
from typing import Optional
import os

# Configuration (never hard-code the secret; read it from the environment)
SECRET_KEY = os.environ.get("JWT_SECRET", "dev-only-change-me")
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token", auto_error=False)
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


class User:
    def __init__(self, id: str, email: str, roles: list[str]):
        self.id = id
        self.email = email
        self.roles = roles


def create_access_token(data: dict, expires_delta: Optional[timedelta] = None):
    """Create JWT access token."""
    to_encode = data.copy()
    expire = datetime.utcnow() + (expires_delta or timedelta(minutes=15))
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)


async def get_current_user(
    token: Optional[str] = Depends(oauth2_scheme),
    api_key: Optional[str] = Depends(api_key_header),
    db = Depends(get_db)
) -> User:
    """
    Authenticate user via JWT token or API key.
    """
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )

    # Try JWT token first
    if token:
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
            user_id: str = payload.get("sub")
            if user_id is None:
                raise credentials_exception

            user = await db.get_user(user_id)
            if user is None:
                raise credentials_exception

            return user
        except JWTError:
            pass

    # Try API key
    if api_key:
        user = await db.get_user_by_api_key(api_key)
        if user:
            return user

    raise credentials_exception


def require_role(required_roles: list[str]):
    """Dependency to require specific roles."""
    async def role_checker(user: User = Depends(get_current_user)):
        if not any(role in user.roles for role in required_roles):
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail="Insufficient permissions"
            )
        return user
    return role_checker
```

### 7.2 Authorization Roles

| Role | Permissions |
|------|-------------|
| `reader` | Read observations, reconstructions, claims |
| `contributor` | Create observations, add claims |
| `curator` | Create reconstructions, link observations, resolve conflicts |
| `admin` | Manage users, API keys, system configuration |
| `api_client` | Programmatic access via API key |

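
The table maps naturally onto a permission check. A minimal sketch (role names come from the table; the permission strings are illustrative):

```python
# Role -> permission mapping, mirroring the table above.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "reader": {"read"},
    "contributor": {"read", "create_observation", "add_claim"},
    "curator": {"read", "create_observation", "add_claim",
                "create_reconstruction", "link_observation", "resolve_conflict"},
    "admin": {"manage_users", "manage_api_keys", "configure_system"},
    "api_client": {"read"},
}

def is_allowed(user_roles: list[str], permission: str) -> bool:
    """A user is allowed if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(r, set()) for r in user_roles)

assert is_allowed(["curator"], "create_reconstruction")
assert not is_allowed(["reader"], "add_claim")
assert is_allowed(["reader", "admin"], "manage_users")   # roles accumulate
```

In the API, the same check is enforced per endpoint via the `require_role` dependency from 7.1.
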
### 7.3 Rate Limiting

```python
from fastapi import Request
import redis
from datetime import datetime

class RateLimiter:
    """
    Sliding-window rate limiter using Redis sorted sets.
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def is_allowed(
        self,
        key: str,
        max_requests: int = 100,
        window_seconds: int = 60
    ) -> tuple[bool, dict]:
        """
        Check if request is allowed under rate limit.

        Returns:
            Tuple of (is_allowed, rate_limit_info)
        """
        now = datetime.utcnow().timestamp()
        window_start = now - window_seconds

        pipe = self.redis.pipeline()

        # Remove old requests
        pipe.zremrangebyscore(key, 0, window_start)

        # Count requests in window
        pipe.zcard(key)

        # Add current request
        pipe.zadd(key, {str(now): now})

        # Set expiry
        pipe.expire(key, window_seconds)

        results = pipe.execute()
        request_count = results[1]

        is_allowed = request_count < max_requests

        return is_allowed, {
            "limit": max_requests,
            "remaining": max(0, max_requests - request_count - 1),
            "reset": int(now + window_seconds)
        }


# Rate limit tiers
RATE_LIMITS = {
    "anonymous": {"requests": 60, "window": 60},
    "reader": {"requests": 100, "window": 60},
    "contributor": {"requests": 500, "window": 60},
    "curator": {"requests": 1000, "window": 60},
    "api_client": {"requests": 5000, "window": 60},
}
```

---

## 8. Performance Requirements

### 8.1 SLOs (Service Level Objectives)

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | 99.9% | Monthly uptime |
| **API Latency (p50)** | < 50ms | Response time |
| **API Latency (p99)** | < 500ms | Response time |
| **Search Latency** | < 200ms | Full-text search |
| **SPARQL Query** | < 1s | Simple queries |
| **Throughput** | 1000 req/s | Sustained load |

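
Whether collected latency samples meet the p50/p99 targets can be checked directly with a simple nearest-rank percentile (the samples below are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over the sorted samples (simplified)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = [12, 18, 25, 30, 41, 47, 55, 230, 460, 480]
assert percentile(latencies_ms, 50) <= 50    # p50 target: < 50ms
assert percentile(latencies_ms, 99) <= 500   # p99 target: < 500ms
```

In production these figures come from the Prometheus histograms in Section 10 rather than raw samples.
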
### 8.2 Scaling Strategy

```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ppid-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ppid-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
```

### 8.3 Caching Strategy

```python
import redis
from functools import wraps
import json
import hashlib

class CacheManager:
    """
    Multi-tier caching for PPID.

    Tiers:
    1. L1: In-memory (per-instance)
    2. L2: Redis (shared)
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        # NOTE: unbounded dict; in production use a bounded LRU to cap memory
        self.local_cache = {}

    def cache_observation(self, ttl: int = 3600):
        """Cache observation lookups."""
        def decorator(func):
            @wraps(func)
            async def wrapper(poid: str, *args, **kwargs):
                cache_key = f"observation:{poid}"

                # Check L1 cache
                if cache_key in self.local_cache:
                    return self.local_cache[cache_key]

                # Check L2 cache
                cached = self.redis.get(cache_key)
                if cached:
                    data = json.loads(cached)
                    self.local_cache[cache_key] = data
                    return data

                # Fetch from database
                result = await func(poid, *args, **kwargs)

                if result:
                    # Store in both caches
                    self.redis.setex(cache_key, ttl, json.dumps(result))
                    self.local_cache[cache_key] = result

                return result
            return wrapper
        return decorator

    def cache_search(self, ttl: int = 300):
        """Cache search results (shorter TTL)."""
        def decorator(func):
            @wraps(func)
            async def wrapper(query: str, *args, **kwargs):
                # Create deterministic cache key from query params
                key_data = {"query": query, "args": args, "kwargs": kwargs}
                cache_key = f"search:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"

                cached = self.redis.get(cache_key)
                if cached:
                    return json.loads(cached)

                result = await func(query, *args, **kwargs)

                self.redis.setex(cache_key, ttl, json.dumps(result))

                return result
            return wrapper
        return decorator

    def invalidate_observation(self, poid: str):
        """Invalidate cache when observation is updated."""
        cache_key = f"observation:{poid}"
        self.redis.delete(cache_key)
        self.local_cache.pop(cache_key, None)
```
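
The search cache relies on `json.dumps(..., sort_keys=True)` to make keys independent of argument order. A quick check of that property (the helper name is illustrative):

```python
import hashlib
import json

def search_cache_key(query: str, **kwargs) -> str:
    """Deterministic cache key: sorted JSON serialization hashed with MD5."""
    payload = json.dumps({"query": query, "kwargs": kwargs}, sort_keys=True)
    return "search:" + hashlib.md5(payload.encode()).hexdigest()

# Identical parameters in a different order produce the same key...
assert search_cache_key("berg", limit=10, offset=0) == search_cache_key("berg", offset=0, limit=10)
# ...while any changed parameter produces a different one.
assert search_cache_key("berg", limit=10) != search_cache_key("berg", limit=20)
```
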

---

## 9. Deployment Architecture

### 9.1 Kubernetes Deployment

```yaml
# API Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ppid-api
  labels:
    app: ppid
    component: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ppid
      component: api
  template:
    metadata:
      labels:
        app: ppid
        component: api
    spec:
      containers:
        - name: api
          image: ppid/api:latest
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: redis-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: ppid-secrets
                  key: jwt-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5

---
# Service
apiVersion: v1
kind: Service
metadata:
  name: ppid-api
spec:
  selector:
    app: ppid
    component: api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP

---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ppid-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - api.ppid.org
      secretName: ppid-tls
  rules:
    - host: api.ppid.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ppid-api
                port:
                  number: 80
```

### 9.2 Docker Compose (Development)

```yaml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://ppid:ppid@postgres:5432/ppid
      - REDIS_URL=redis://redis:6379
      - FUSEKI_URL=http://fuseki:3030
    depends_on:
      - postgres
      - redis
      - fuseki
    volumes:
      - ./src:/app/src
      - ./archives:/app/archives

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ppid
      POSTGRES_PASSWORD: ppid
      POSTGRES_DB: ppid
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  fuseki:
    image: stain/jena-fuseki
    environment:
      ADMIN_PASSWORD: admin
      FUSEKI_DATASET_1: ppid
    volumes:
      - fuseki_data:/fuseki
    ports:
      - "3030:3030"

  elasticsearch:
    image: elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

volumes:
  postgres_data:
  fuseki_data:
  es_data:
```
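
Application code reaches these services through the environment variables set on the `api` container. A sketch of parsing `DATABASE_URL` with the standard library (the URL matches the Compose file above):

```python
from urllib.parse import urlparse

url = "postgresql://ppid:ppid@postgres:5432/ppid"  # DATABASE_URL from the Compose file
parts = urlparse(url)

assert parts.scheme == "postgresql"
assert parts.username == "ppid"
assert parts.hostname == "postgres"      # Compose service name doubles as hostname
assert parts.port == 5432
assert parts.path.lstrip("/") == "ppid"  # database name
```

Inside the Compose network, service names (`postgres`, `redis`, `fuseki`) resolve via Docker's DNS, so the same URL works for every `api` replica.
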

---

## 10. Monitoring and Observability

### 10.1 Prometheus Metrics

```python
from fastapi import Request
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
REQUEST_COUNT = Counter(
    'ppid_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'ppid_request_latency_seconds',
    'Request latency in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

OBSERVATIONS_CREATED = Counter(
    'ppid_observations_created_total',
    'Total observations created'
)

RECONSTRUCTIONS_CREATED = Counter(
    'ppid_reconstructions_created_total',
    'Total reconstructions created'
)

ENTITY_RESOLUTION_LATENCY = Histogram(
    'ppid_entity_resolution_seconds',
    'Entity resolution latency',
    buckets=[.1, .25, .5, 1, 2.5, 5, 10, 30]
)

CACHE_HITS = Counter(
    'ppid_cache_hits_total',
    'Cache hits',
    ['cache_type']
)

CACHE_MISSES = Counter(
    'ppid_cache_misses_total',
    'Cache misses',
    ['cache_type']
)

DB_CONNECTIONS = Gauge(
    'ppid_db_connections',
    'Active database connections'
)


# Middleware for request metrics ("app" is the FastAPI instance defined earlier)
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    response = await call_next(request)

    latency = time.time() - start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response
```
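
Quantiles over these histograms come from PromQL's `histogram_quantile`, which linearly interpolates within cumulative buckets. The arithmetic can be sketched as follows (bucket data is illustrative; the real function also handles `+Inf` and per-series aggregation):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted by ascending bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket, as Prometheus does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts for the bounds 0.05s, 0.1s, 0.5s:
data = [(0.05, 60), (0.1, 90), (0.5, 100)]
assert abs(histogram_quantile(0.5, data) - 0.05 * 50 / 60) < 1e-12
assert abs(histogram_quantile(0.99, data) - 0.46) < 1e-9   # inside the 0.1-0.5s bucket
```

This is why bucket boundaries matter: the p99 estimate is only as precise as the bucket it lands in, which motivates the dense `buckets=` lists above.
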

### 10.2 Logging

```python
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    """Structured JSON logging for observability."""

    def format(self, record):
        log_record = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

        # Add extra fields
        if hasattr(record, 'poid'):
            log_record['poid'] = record.poid
        if hasattr(record, 'prid'):
            log_record['prid'] = record.prid
        if hasattr(record, 'request_id'):
            log_record['request_id'] = record.request_id
        if hasattr(record, 'user_id'):
            log_record['user_id'] = record.user_id
        if hasattr(record, 'duration_ms'):
            log_record['duration_ms'] = record.duration_ms

        if record.exc_info:
            log_record['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_record)


# Configure logging
def setup_logging():
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())

    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # Reduce noise from libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
```
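
Contextual fields travel via `logging`'s `extra` mechanism. A trimmed, self-contained sketch of driving a formatter like the one above (the field set is reduced for brevity):

```python
import io
import json
import logging

class MiniJSONFormatter(logging.Formatter):
    """Reduced variant of JSONFormatter: level, message, optional poid."""

    def format(self, record):
        out = {"level": record.levelname, "message": record.getMessage()}
        if hasattr(record, "poid"):
            out["poid"] = record.poid
        return json.dumps(out)

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(MiniJSONFormatter())
log = logging.getLogger("ppid.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# Fields passed via extra= become attributes on the LogRecord.
log.info("observation stored", extra={"poid": "POID-example"})
line = json.loads(buf.getvalue())
assert line == {"level": "INFO", "message": "observation stored", "poid": "POID-example"}
```

One JSON object per line keeps the output directly ingestible by log aggregators.
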

### 10.3 Grafana Dashboard (JSON)

```json
{
  "dashboard": {
    "title": "PPID System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(ppid_requests_total[5m])) by (endpoint)"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(ppid_request_latency_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Observations Created",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(ppid_observations_created_total)"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(ppid_cache_hits_total[5m])) / (sum(rate(ppid_cache_hits_total[5m])) + sum(rate(ppid_cache_misses_total[5m])))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(ppid_requests_total{status=~\"5..\"}[5m])) / sum(rate(ppid_requests_total[5m]))"
          }
        ]
      }
    ]
  }
}
```

---

## 11. Implementation Checklist

### 11.1 Phase 1: Core Infrastructure

- [ ] Set up PostgreSQL database with schema
- [ ] Set up Apache Jena Fuseki for RDF
- [ ] Set up Redis for caching
- [ ] Implement POID/PRID generation
- [ ] Implement checksum validation
- [ ] Create basic REST API endpoints

### 11.2 Phase 2: Data Ingestion

- [ ] Build web scraping infrastructure
- [ ] Implement HTML archival
- [ ] Integrate LLM for extraction
- [ ] Implement XPath validation
- [ ] Set up Kafka for async processing
- [ ] Create ingestion pipeline

### 11.3 Phase 3: Entity Resolution

- [ ] Implement blocking strategies
- [ ] Implement similarity metrics
- [ ] Build clustering algorithm
- [ ] Create human-in-loop review UI
- [ ] Integrate with reconstruction creation

### 11.4 Phase 4: GHCID Integration

- [ ] Implement affiliation linking
- [ ] Add SPARQL queries for institution lookups
- [ ] Create bidirectional navigation
- [ ] Sync with GHCID registry updates

### 11.5 Phase 5: Production Readiness

- [ ] Implement authentication/authorization
- [ ] Set up rate limiting
- [ ] Configure monitoring and alerting
- [ ] Create backup and recovery procedures
- [ ] Performance testing and optimization
- [ ] Security audit

---

## 12. References

### Standards

- OAuth 2.0: https://oauth.net/2/
- OpenAPI 3.1: https://spec.openapis.org/oas/latest.html
- GraphQL: https://graphql.org/learn/
- SPARQL 1.1: https://www.w3.org/TR/sparql11-query/

### Related PPID Documents

- [Identifier Structure Design](./05_identifier_structure_design.md)
- [Entity Resolution Patterns](./06_entity_resolution_patterns.md)
- [Claims and Provenance](./07_claims_and_provenance.md)

### Technologies

- FastAPI: https://fastapi.tiangolo.com/
- Apache Jena: https://jena.apache.org/
- PostgreSQL: https://www.postgresql.org/
- Kubernetes: https://kubernetes.io/