# DSPy RAG Pipeline for Heritage Custodian Ontology

**Version:** 1.0.0
**Status:** Documentation Complete
**Last Updated:** 2025-12-12
## Overview
This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.
## Quick Start

```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic

# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"
# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```

```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql

# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")

# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en",
)
print(result["sparql"])
```
## Documentation Suite
| # | Document | Description | Lines |
|---|---|---|---|
| 00 | overview.md | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | architecture.md | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | dspy-signatures.md | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | chunking-strategy.md | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | entity-extraction.md | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | entity-linking.md | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | retrieval-patterns.md | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | sparql-templates.md | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | evaluation.md | Metrics, gold standard datasets, benchmarks | ~700 |
**Total:** ~4,700 lines of documentation
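The hybrid retrieval covered in 06-retrieval-patterns.md fuses dense and sparse rankings with Reciprocal Rank Fusion. A minimal sketch of the fusion step (document IDs are illustrative, not from the corpus):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one overall ranking.

    Each input list is ordered best-first; k is the standard RRF smoothing
    constant (60 in the original formulation).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Fuse a dense (vector) ranking with a sparse (keyword) ranking.
dense = ["doc_rijksmuseum", "doc_stedelijk", "doc_van_gogh"]
sparse = ["doc_van_gogh", "doc_rijksmuseum", "doc_tropenmuseum"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents agreed upon by both retrievers float to the top even when neither ranks them first, which is why RRF needs no score normalization across heterogeneous retrievers.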
## Architecture Overview

```text
┌───────────────────────────────────────────────────────────────────┐
│                         DSPy RAG Pipeline                         │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐   │
│  │    Chunker     │───▶│    Embedder    │───▶│    ChromaDB    │   │
│  │ (03-chunking)  │    │                │    │   (Vectors)    │   │
│  └────────────────┘    └────────────────┘    └────────────────┘   │
│          │                                           │            │
│          ▼                                           ▼            │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐   │
│  │     Entity     │───▶│     Entity     │───▶│     TypeDB     │   │
│  │   Extractor    │    │     Linker     │    │      (KG)      │   │
│  │(04-extraction) │    │  (05-linking)  │    │                │   │
│  └────────────────┘    └────────────────┘    └────────────────┘   │
│          │                     │                                  │
│          ▼                     ▼                                  │
│  ┌────────────────┐    ┌────────────────┐                         │
│  │    Wikidata    │◀──▶│     SPARQL     │                         │
│  │      VIAF      │    │   Generator    │                         │
│  │      ISIL      │    │ (07-templates) │                         │
│  └────────────────┘    └────────────────┘                         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
## Existing Implementation

The following code already implements parts of this documentation:

### `src/glam_extractor/api/dspy_sparql.py` (347 lines)

- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info
### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)

- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow
## Key Design Decisions

### 1. CH-Annotator v1.7.0 Convention
All entity extraction follows the CH-Annotator convention with 9 hypernyms:
| Code | Hypernym | Primary Ontology |
|---|---|---|
| AGT | AGENT | crm:E39_Actor |
| GRP | GROUP | crm:E74_Group |
| TOP | TOPONYM | crm:E53_Place |
| GEO | GEOMETRY | geo:Geometry |
| TMP | TEMPORAL | crm:E52_Time-Span |
| APP | APPELLATION | crm:E41_Appellation |
| ROL | ROLE | org:Role |
| WRK | WORK | frbroo:F1_Work |
| QTY | QUANTITY | crm:E54_Dimension |
Heritage institutions use GRP.HER.* with GLAMORCUBESFIXPHDNT subtypes.
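The code-to-ontology mapping above can be expressed as a plain lookup. The convention file `ch_annotator-v1_7_0.yaml` remains the authoritative source; the helper below is only an illustrative sketch, and its name is an assumption:

```python
# Illustrative lookup for the 9 CH-Annotator hypernyms (mirrors the table above).
HYPERNYM_ONTOLOGY = {
    "AGT": "crm:E39_Actor",
    "GRP": "crm:E74_Group",
    "TOP": "crm:E53_Place",
    "GEO": "geo:Geometry",
    "TMP": "crm:E52_Time-Span",
    "APP": "crm:E41_Appellation",
    "ROL": "org:Role",
    "WRK": "frbroo:F1_Work",
    "QTY": "crm:E54_Dimension",
}

def ontology_class(tag: str) -> str:
    """Map a dotted tag such as 'GRP.HER.M' to its hypernym's primary class."""
    return HYPERNYM_ONTOLOGY[tag.split(".")[0]]
```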
### 2. GLAMORCUBESFIXPHDNT Taxonomy

A 19-type classification for heritage custodians:

```text
G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
```
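The letter codes can be resolved with a simple dictionary. `CustodianPrimaryTypeEnum.yaml` is the authoritative definition; this expansion and the helper name are illustrative:

```python
# Illustrative expansion of the 19 GLAMORCUBESFIXPHDNT letter codes.
GLAMORCUBESFIXPHDNT = {
    "G": "Gallery", "L": "Library", "A": "Archive", "M": "Museum",
    "O": "Official", "R": "Research", "C": "Corporation", "U": "Unknown",
    "B": "Botanical/Zoo", "E": "Education", "S": "Society", "F": "Feature",
    "I": "Intangible", "X": "Mixed", "P": "Personal", "H": "Holy site",
    "D": "Digital", "N": "NGO", "T": "Taste/Smell",
}

def custodian_type(code: str) -> str:
    """Resolve a one-letter code; unrecognized codes fall back to 'Unknown'."""
    return GLAMORCUBESFIXPHDNT.get(code.upper(), "Unknown")
```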
### 3. Hybrid Extraction Strategy

Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See 04-entity-extraction.md.
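One way to sketch the span-overlap merge (one possible policy, not necessarily the exact algorithm in 04-entity-extraction.md): accept every high-precision pattern span, then admit an LLM span only if it overlaps no accepted span. The `Span` shape is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # CH-Annotator tag, e.g. "GRP.HER.M"
    source: str  # "pattern" or "llm"

def _overlaps(a: Span, b: Span) -> bool:
    # Half-open intervals overlap iff each starts before the other ends.
    return a.start < b.end and b.start < a.end

def merge_spans(pattern_spans: list[Span], llm_spans: list[Span]) -> list[Span]:
    """Keep every pattern span; add LLM spans that overlap none of them."""
    merged = list(pattern_spans)
    for span in llm_spans:
        if not any(_overlaps(span, p) for p in pattern_spans):
            merged.append(span)
    return sorted(merged, key=lambda s: s.start)
```

When a pattern span and an LLM span disagree about the same text, this policy sides with the pattern, preserving precision while the non-conflicting LLM spans add recall.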
### 4. Multi-Store Architecture
| Store | Purpose | Technology |
|---|---|---|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |
### 5. XPath Provenance (AGENTS.md Rule 6)
Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.
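A minimal sketch of this Rule 6 gate, using the stdlib's limited XPath subset (the pipeline's `lxml` dependency supports full XPath 1.0); the claim shape and function name are assumptions:

```python
import xml.etree.ElementTree as ET

def claim_has_valid_provenance(claim: dict, page_html: str) -> bool:
    """Accept a claim only if its XPath resolves to a node in the source.

    A claim without an 'xpath' key, or whose XPath matches nothing,
    is rejected as fabricated.
    """
    xpath = claim.get("xpath")
    if not xpath:
        return False  # no provenance at all
    try:
        tree = ET.fromstring(page_html)
        return tree.find(xpath) is not None
    except (ET.ParseError, SyntaxError):
        return False  # unparseable page or malformed XPath
```

Note that `ElementTree` only understands a subset of XPath (e.g. `.//h1` but not `//h1`), so production validation would go through `lxml.html` instead.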
## External Dependencies

### Knowledge Bases
- Wikidata - Primary entity linking target
- VIAF - Authority file for agents
- ISIL - Library/archive identifiers
- GeoNames - Place disambiguation
- ROR - Research organization registry
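Wikidata queries like the Quick Start example can be produced from parameterized templates. A hypothetical template in the spirit of 07-sparql-templates.md (names and structure are assumptions): museums (`wd:Q33506`, via the subclass closure) located in a given place, e.g. Amsterdam (`wd:Q727`):

```python
# Hypothetical template: museums located in a given administrative unit.
MUSEUMS_IN_PLACE = """\
SELECT ?museum ?museumLabel WHERE {{
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;  # instance of museum (or subclass)
          wdt:P131 wd:{place_qid} .      # located in the administrative unit
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
"""

def render_museums_query(place_qid: str, limit: int = 100) -> str:
    """Fill the template; e.g. place_qid='Q727' for Amsterdam."""
    return MUSEUMS_IN_PLACE.format(place_qid=place_qid, limit=limit)
```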
### LinkML Schemas

```text
schemas/20251121/linkml/
├── modules/classes/
│   ├── CustodianObservation.yaml
│   ├── CustodianName.yaml
│   ├── CustodianReconstruction.yaml
│   ├── CustodianLegalStatus.yaml
│   ├── CustodianPlace.yaml
│   ├── CustodianCollection.yaml
│   └── ...
└── modules/enums/
    ├── CustodianPrimaryTypeEnum.yaml   # GLAMORCUBESFIXPHDNT
    └── CanonicalClaimTypes.yaml        # 3-tier claim types
```
### Convention Files

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2,000+ lines)
## Performance Targets
From 08-evaluation.md:
| Task | Metric | Target |
|---|---|---|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |
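The entity-linking targets (Hits@1, MRR) can be computed directly from ranked candidate lists; a minimal sketch with illustrative predictions:

```python
def linking_metrics(predictions: list[tuple[str, list[str]]]) -> dict[str, float]:
    """Hits@1 and MRR over (gold_id, ranked_candidates) pairs.

    A gold entity absent from its candidate list contributes 0 to both.
    """
    hits1 = rr_sum = 0.0
    for gold, ranked in predictions:
        if ranked and ranked[0] == gold:
            hits1 += 1
        if gold in ranked:
            rr_sum += 1.0 / (ranked.index(gold) + 1)  # reciprocal of 1-based rank
    n = len(predictions)
    return {"hits@1": hits1 / n, "mrr": rr_sum / n}

# Illustrative predictions: (gold QID, ranked candidate QIDs).
preds = [("Q190804", ["Q190804", "Q9920"]), ("Q727", ["Q9899", "Q727"])]
metrics = linking_metrics(preds)
```

Here the first mention is linked correctly at rank 1 and the second at rank 2, giving Hits@1 = 0.5 and MRR = (1 + 0.5) / 2 = 0.75.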
## Next Steps for Implementation

- Entity Extraction Pipeline - Implement hybrid extraction from 04-entity-extraction.md
- Entity Linking Pipeline - Implement candidate generation/ranking from 05-entity-linking.md
- Gold Standard Dataset - Create evaluation data following the schema in 08-evaluation.md
- ChromaDB Integration - Extend `RAGSPARQLGenerator` with ChromaDB
- TypeDB Schema - Translate LinkML to TypeDB for KG storage
## References

### Project Documentation
- AGENTS.md - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- docs/SCHEMA_MODULES.md - LinkML schema architecture
- docs/PERSISTENT_IDENTIFIERS.md - GHCID identifier system
### External Standards
Maintainer: GLAM Data Extraction Project
License: MIT