# DSPy RAG Pipeline for Heritage Custodian Ontology

**Version**: 1.0.0
**Status**: Documentation Complete
**Last Updated**: 2025-12-12

## Overview

This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.

## Quick Start

```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic

# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"

# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```

```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql

# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")

# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en",
)
print(result["sparql"])
```

## Documentation Suite

| # | Document | Description | Lines |
|---|----------|-------------|-------|
| 00 | [overview.md](00-overview.md) | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | [architecture.md](01-architecture.md) | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | [dspy-signatures.md](02-dspy-signatures.md) | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | [chunking-strategy.md](03-chunking-strategy.md) | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | [entity-extraction.md](04-entity-extraction.md) | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | [entity-linking.md](05-entity-linking.md) | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | [retrieval-patterns.md](06-retrieval-patterns.md) | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | [sparql-templates.md](07-sparql-templates.md) | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | [evaluation.md](08-evaluation.md) | Metrics, gold standard datasets, benchmarks | ~700 |

**Total**: ~4,700 lines of documentation

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                        DSPy RAG Pipeline                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │    Chunker    │────▶│   Embedder    │────▶│   ChromaDB    │  │
│  │ (03-chunking) │     │               │     │   (Vectors)   │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
│          │                                           │          │
│          ▼                                           ▼          │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │    Entity     │────▶│    Entity     │────▶│    TypeDB     │  │
│  │   Extractor   │     │    Linker     │     │     (KG)      │  │
│  │(04-extraction)│     │ (05-linking)  │     │               │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
│          │                     │                                │
│          ▼                     ▼                                │
│  ┌───────────────┐     ┌───────────────┐                        │
│  │   Wikidata    │◀───▶│    SPARQL     │                        │
│  │     VIAF      │     │   Generator   │                        │
│  │     ISIL      │     │(07-templates) │                        │
│  └───────────────┘     └───────────────┘                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Existing Implementation

The following code already implements parts of this documentation:

### `src/glam_extractor/api/dspy_sparql.py` (347 lines)

- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info

### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)

- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow

## Key Design Decisions

### 1. CH-Annotator v1.7.0 Convention

All entity extraction follows the CH-Annotator convention with 9 hypernyms:

| Code | Hypernym | Primary Ontology |
|------|----------|------------------|
| AGT | AGENT | `crm:E39_Actor` |
| GRP | GROUP | `crm:E74_Group` |
| TOP | TOPONYM | `crm:E53_Place` |
| GEO | GEOMETRY | `geo:Geometry` |
| TMP | TEMPORAL | `crm:E52_Time-Span` |
| APP | APPELLATION | `crm:E41_Appellation` |
| ROL | ROLE | `org:Role` |
| WRK | WORK | `frbroo:F1_Work` |
| QTY | QUANTITY | `crm:E54_Dimension` |

Heritage institutions use `GRP.HER.*` with GLAMORCUBESFIXPHDNT subtypes.

### 2. GLAMORCUBESFIXPHDNT Taxonomy

A 19-type classification for heritage custodians:

```
G-Gallery, L-Library, A-Archive, M-Museum, O-Official,
R-Research, C-Corporation, U-Unknown, B-Botanical/Zoo,
E-Education, S-Society, F-Feature, I-Intangible, X-Mixed,
P-Personal, H-Holy site, D-Digital, N-NGO, T-Taste/Smell
```

### 3. Hybrid Extraction Strategy

Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See [04-entity-extraction.md](04-entity-extraction.md).

### 4. Multi-Store Architecture

| Store | Purpose | Technology |
|-------|---------|------------|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |

### 5. XPath Provenance (AGENTS.md Rule 6)

Every claim requires XPath provenance pointing to its source HTML location. Claims without an XPath are rejected as fabricated.
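The span-overlap merge of design decision 3 can be sketched as follows. The `Span` dataclass and `merge_extractions` helper below are hypothetical names for illustration, not the identifiers used in [04-entity-extraction.md](04-entity-extraction.md); the policy shown (pattern spans win, non-overlapping LLM spans are kept) is one straightforward reading of "high precision merged with high recall":

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # CH-Annotator hypernym code, e.g. "GRP"
    source: str  # "pattern" or "llm"

def overlaps(a: Span, b: Span) -> bool:
    # Two half-open intervals overlap iff each starts before the other ends
    return a.start < b.end and b.start < a.end

def merge_extractions(pattern_spans, llm_spans):
    """Prefer high-precision pattern spans; keep an LLM span only if it
    overlaps no pattern span (recall boost without precision loss)."""
    merged = list(pattern_spans)
    for cand in llm_spans:
        if not any(overlaps(cand, kept) for kept in merged):
            merged.append(cand)
    return sorted(merged, key=lambda s: s.start)

pattern = [Span(0, 16, "GRP", "pattern")]
llm = [Span(4, 16, "GRP", "llm"), Span(20, 29, "TOP", "llm")]
result = merge_extractions(pattern, llm)
# The overlapping LLM GRP span is dropped; the novel TOP span survives.
```

A real implementation also has to reconcile conflicting labels on partially overlapping spans; this sketch simply discards the LLM candidate.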
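The provenance rule in design decision 5 amounts to a simple guard before a claim is persisted. The sketch below assumes a hypothetical dict-shaped claim and uses the standard library's limited XPath support for brevity; the pipeline itself evaluates full XPath expressions against the source HTML:

```python
import xml.etree.ElementTree as ET

def validate_claim(claim: dict, source_doc: str) -> bool:
    """Accept a claim only if it carries an XPath that resolves to at
    least one node in the source document (AGENTS.md Rule 6)."""
    xpath = claim.get("xpath")
    if not xpath:
        return False  # no provenance -> treated as fabricated
    root = ET.fromstring(source_doc)
    return len(root.findall(xpath)) > 0

doc = "<html><body><p id='name'>Rijksmuseum</p></body></html>"
good = {"value": "Rijksmuseum", "xpath": ".//p[@id='name']"}
bad = {"value": "Louvre"}  # missing xpath -> rejected
```

Checking that the XPath actually resolves (rather than merely being present) also catches stale provenance after a page changes.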
## External Dependencies

### Knowledge Bases

- **Wikidata** - Primary entity linking target
- **VIAF** - Authority file for agents
- **ISIL** - Library/archive identifiers
- **GeoNames** - Place disambiguation
- **ROR** - Research organization registry

### LinkML Schemas

```
schemas/20251121/linkml/
├── modules/classes/
│   ├── CustodianObservation.yaml
│   ├── CustodianName.yaml
│   ├── CustodianReconstruction.yaml
│   ├── CustodianLegalStatus.yaml
│   ├── CustodianPlace.yaml
│   ├── CustodianCollection.yaml
│   └── ...
└── modules/enums/
    ├── CustodianPrimaryTypeEnum.yaml   # GLAMORCUBESFIXPHDNT
    └── CanonicalClaimTypes.yaml        # 3-tier claim types
```

### Convention Files

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2,000+ lines)

## Performance Targets

From [08-evaluation.md](08-evaluation.md):

| Task | Metric | Target |
|------|--------|--------|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |

## Next Steps for Implementation

1. **Entity Extraction Pipeline** - Implement hybrid extraction from [04-entity-extraction.md](04-entity-extraction.md)
2. **Entity Linking Pipeline** - Implement candidate generation/ranking from [05-entity-linking.md](05-entity-linking.md)
3. **Gold Standard Dataset** - Create evaluation data following the schema in [08-evaluation.md](08-evaluation.md)
4. **ChromaDB Integration** - Extend `RAGSPARQLGenerator` with ChromaDB
5. **TypeDB Schema** - Translate the LinkML schemas to TypeDB for KG storage

## References

### Project Documentation

- **AGENTS.md** - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- **docs/SCHEMA_MODULES.md** - LinkML schema architecture
- **docs/PERSISTENT_IDENTIFIERS.md** - GHCID identifier system

### External Standards

- [DSPy Documentation](https://dspy-docs.vercel.app/)
- [CH-Annotator Convention](../entity_annotation/ch_annotator-v1_7_0.yaml)
- [CIDOC-CRM 7.1.3](https://cidoc-crm.org/)
- [Records in Contexts (RiC-O)](https://www.ica.org/standards/RiC/RiC-O_v0-2.html)

---

**Maintainer**: GLAM Data Extraction Project
**License**: MIT