# DSPy RAG Pipeline for Heritage Custodian Ontology

**Version:** 1.0.0
**Status:** Documentation Complete
**Last Updated:** 2025-12-12
## Overview
This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.
## Quick Start

```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic

# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"
# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```

```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql

# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")

# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en",
)
print(result["sparql"])
```
## Documentation Suite
| # | Document | Description | Lines |
|---|---|---|---|
| 00 | overview.md | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | architecture.md | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | dspy-signatures.md | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | chunking-strategy.md | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | entity-extraction.md | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | entity-linking.md | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | retrieval-patterns.md | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | sparql-templates.md | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | evaluation.md | Metrics, gold standard datasets, benchmarks | ~700 |
**Total:** ~4,700 lines of documentation
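The hybrid retrieval covered in 06-retrieval-patterns.md fuses dense and sparse rankings with Reciprocal Rank Fusion. A minimal sketch of the fusion step (document IDs are illustrative, not from the corpus):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one overall ranking.

    Each input list is ordered best-first; k is the standard RRF smoothing
    constant (60 in the original formulation).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Fuse a dense (vector) ranking with a sparse (keyword) ranking.
dense = ["doc_rijksmuseum", "doc_stedelijk", "doc_van_gogh"]
sparse = ["doc_van_gogh", "doc_rijksmuseum", "doc_tropenmuseum"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents agreed upon by both retrievers float to the top even when neither ranks them first, which is why RRF needs no score normalization across heterogeneous retrievers.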
## Architecture Overview

```text
┌───────────────────────────────────────────────────────────────────┐
│                         DSPy RAG Pipeline                         │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐   │
│  │    Chunker     │───▶│    Embedder    │───▶│    ChromaDB    │   │
│  │ (03-chunking)  │    │                │    │   (Vectors)    │   │
│  └────────────────┘    └────────────────┘    └────────────────┘   │
│          │                                           │            │
│          ▼                                           ▼            │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐   │
│  │     Entity     │───▶│     Entity     │───▶│     TypeDB     │   │
│  │   Extractor    │    │     Linker     │    │      (KG)      │   │
│  │(04-extraction) │    │  (05-linking)  │    │                │   │
│  └────────────────┘    └────────────────┘    └────────────────┘   │
│          │                     │                                  │
│          ▼                     ▼                                  │
│  ┌────────────────┐    ┌────────────────┐                         │
│  │    Wikidata    │◀──▶│     SPARQL     │                         │
│  │      VIAF      │    │   Generator    │                         │
│  │      ISIL      │    │ (07-templates) │                         │
│  └────────────────┘    └────────────────┘                         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
## Existing Implementation

The following code already implements parts of this documentation:

### `src/glam_extractor/api/dspy_sparql.py` (347 lines)

- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info
### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)

- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow
## Key Design Decisions

### 1. CH-Annotator v1.7.0 Convention
All entity extraction follows the CH-Annotator convention with 9 hypernyms:
| Code | Hypernym | Primary Ontology |
|---|---|---|
| AGT | AGENT | crm:E39_Actor |
| GRP | GROUP | crm:E74_Group |
| TOP | TOPONYM | crm:E53_Place |
| GEO | GEOMETRY | geo:Geometry |
| TMP | TEMPORAL | crm:E52_Time-Span |
| APP | APPELLATION | crm:E41_Appellation |
| ROL | ROLE | org:Role |
| WRK | WORK | frbroo:F1_Work |
| QTY | QUANTITY | crm:E54_Dimension |
Heritage institutions use GRP.HER.* with GLAMORCUBESFIXPHDNT subtypes.
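The code-to-ontology mapping above can be expressed as a plain lookup. The convention file `ch_annotator-v1_7_0.yaml` remains the authoritative source; the helper below is only an illustrative sketch, and its name is an assumption:

```python
# Illustrative lookup for the 9 CH-Annotator hypernyms (mirrors the table above).
HYPERNYM_ONTOLOGY = {
    "AGT": "crm:E39_Actor",
    "GRP": "crm:E74_Group",
    "TOP": "crm:E53_Place",
    "GEO": "geo:Geometry",
    "TMP": "crm:E52_Time-Span",
    "APP": "crm:E41_Appellation",
    "ROL": "org:Role",
    "WRK": "frbroo:F1_Work",
    "QTY": "crm:E54_Dimension",
}

def ontology_class(tag: str) -> str:
    """Map a dotted tag such as 'GRP.HER.M' to its hypernym's primary class."""
    return HYPERNYM_ONTOLOGY[tag.split(".")[0]]
```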
### 2. GLAMORCUBESFIXPHDNT Taxonomy

A 19-type classification for heritage custodians:

```text
G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
```
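The letter codes can be resolved with a simple dictionary. `CustodianPrimaryTypeEnum.yaml` is the authoritative definition; this expansion and the helper name are illustrative:

```python
# Illustrative expansion of the 19 GLAMORCUBESFIXPHDNT letter codes.
GLAMORCUBESFIXPHDNT = {
    "G": "Gallery", "L": "Library", "A": "Archive", "M": "Museum",
    "O": "Official", "R": "Research", "C": "Corporation", "U": "Unknown",
    "B": "Botanical/Zoo", "E": "Education", "S": "Society", "F": "Feature",
    "I": "Intangible", "X": "Mixed", "P": "Personal", "H": "Holy site",
    "D": "Digital", "N": "NGO", "T": "Taste/Smell",
}

def custodian_type(code: str) -> str:
    """Resolve a one-letter code; unrecognized codes fall back to 'Unknown'."""
    return GLAMORCUBESFIXPHDNT.get(code.upper(), "Unknown")
```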
### 3. Hybrid Extraction Strategy

Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See 04-entity-extraction.md.
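One way to sketch the span-overlap merge (one possible policy, not necessarily the exact algorithm in 04-entity-extraction.md): accept every high-precision pattern span, then admit an LLM span only if it overlaps no accepted span. The `Span` shape is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # CH-Annotator tag, e.g. "GRP.HER.M"
    source: str  # "pattern" or "llm"

def _overlaps(a: Span, b: Span) -> bool:
    # Half-open intervals overlap iff each starts before the other ends.
    return a.start < b.end and b.start < a.end

def merge_spans(pattern_spans: list[Span], llm_spans: list[Span]) -> list[Span]:
    """Keep every pattern span; add LLM spans that overlap none of them."""
    merged = list(pattern_spans)
    for span in llm_spans:
        if not any(_overlaps(span, p) for p in pattern_spans):
            merged.append(span)
    return sorted(merged, key=lambda s: s.start)
```

When a pattern span and an LLM span disagree about the same text, this policy sides with the pattern, preserving precision while the non-conflicting LLM spans add recall.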
### 4. Multi-Store Architecture
| Store | Purpose | Technology |
|---|---|---|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |
### 5. XPath Provenance (AGENTS.md Rule 6)
Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.
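A minimal sketch of this Rule 6 gate, using the stdlib's limited XPath subset (the pipeline's `lxml` dependency supports full XPath 1.0); the claim shape and function name are assumptions:

```python
import xml.etree.ElementTree as ET

def claim_has_valid_provenance(claim: dict, page_html: str) -> bool:
    """Accept a claim only if its XPath resolves to a node in the source.

    A claim without an 'xpath' key, or whose XPath matches nothing,
    is rejected as fabricated.
    """
    xpath = claim.get("xpath")
    if not xpath:
        return False  # no provenance at all
    try:
        tree = ET.fromstring(page_html)
        return tree.find(xpath) is not None
    except (ET.ParseError, SyntaxError):
        return False  # unparseable page or malformed XPath
```

Note that `ElementTree` only understands a subset of XPath (e.g. `.//h1` but not `//h1`), so production validation would go through `lxml.html` instead.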
## External Dependencies

### Knowledge Bases
- Wikidata - Primary entity linking target
- VIAF - Authority file for agents
- ISIL - Library/archive identifiers
- GeoNames - Place disambiguation
- ROR - Research organization registry
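Wikidata queries like the Quick Start example can be produced from parameterized templates. A hypothetical template in the spirit of 07-sparql-templates.md (names and structure are assumptions): museums (`wd:Q33506`, via the subclass closure) located in a given place, e.g. Amsterdam (`wd:Q727`):

```python
# Hypothetical template: museums located in a given administrative unit.
MUSEUMS_IN_PLACE = """\
SELECT ?museum ?museumLabel WHERE {{
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;  # instance of museum (or subclass)
          wdt:P131 wd:{place_qid} .      # located in the administrative unit
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
"""

def render_museums_query(place_qid: str, limit: int = 100) -> str:
    """Fill the template; e.g. place_qid='Q727' for Amsterdam."""
    return MUSEUMS_IN_PLACE.format(place_qid=place_qid, limit=limit)
```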
### LinkML Schemas

```text
schemas/20251121/linkml/
├── modules/classes/
│   ├── CustodianObservation.yaml
│   ├── CustodianName.yaml
│   ├── CustodianReconstruction.yaml
│   ├── CustodianLegalStatus.yaml
│   ├── CustodianPlace.yaml
│   ├── CustodianCollection.yaml
│   └── ...
└── modules/enums/
    ├── CustodianPrimaryTypeEnum.yaml   # GLAMORCUBESFIXPHDNT
    └── CanonicalClaimTypes.yaml        # 3-tier claim types
```
### Convention Files

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2,000+ lines)
## Performance Targets
From 08-evaluation.md:
| Task | Metric | Target |
|---|---|---|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |
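The entity-linking targets (Hits@1, MRR) can be computed directly from ranked candidate lists; a minimal sketch with illustrative predictions:

```python
def linking_metrics(predictions: list[tuple[str, list[str]]]) -> dict[str, float]:
    """Hits@1 and MRR over (gold_id, ranked_candidates) pairs.

    A gold entity absent from its candidate list contributes 0 to both.
    """
    hits1 = rr_sum = 0.0
    for gold, ranked in predictions:
        if ranked and ranked[0] == gold:
            hits1 += 1
        if gold in ranked:
            rr_sum += 1.0 / (ranked.index(gold) + 1)  # reciprocal of 1-based rank
    n = len(predictions)
    return {"hits@1": hits1 / n, "mrr": rr_sum / n}

# Illustrative predictions: (gold QID, ranked candidate QIDs).
preds = [("Q190804", ["Q190804", "Q9920"]), ("Q727", ["Q9899", "Q727"])]
metrics = linking_metrics(preds)
```

Here the first mention is linked correctly at rank 1 and the second at rank 2, giving Hits@1 = 0.5 and MRR = (1 + 0.5) / 2 = 0.75.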
## Next Steps for Implementation

- Entity Extraction Pipeline - Implement hybrid extraction from 04-entity-extraction.md
- Entity Linking Pipeline - Implement candidate generation/ranking from 05-entity-linking.md
- Gold Standard Dataset - Create evaluation data following the schema in 08-evaluation.md
- ChromaDB Integration - Extend `RAGSPARQLGenerator` with ChromaDB
- TypeDB Schema - Translate LinkML to TypeDB for KG storage
## References

### Project Documentation
- AGENTS.md - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- docs/SCHEMA_MODULES.md - LinkML schema architecture
- docs/PERSISTENT_IDENTIFIERS.md - GHCID identifier system
### External Standards
Maintainer: GLAM Data Extraction Project
License: MIT