DSPy RAG Pipeline for Heritage Custodian Ontology

Version: 1.0.0
Status: Documentation Complete
Last Updated: 2025-12-12

Overview

This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.

Quick Start

# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic

# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"

# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"

import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql

# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")

# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en"
)
print(result["sparql"])

Documentation Suite

| #  | Document | Description | Lines |
|----|----------|-------------|-------|
| 00 | overview.md | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | architecture.md | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | dspy-signatures.md | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | chunking-strategy.md | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | entity-extraction.md | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | entity-linking.md | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | retrieval-patterns.md | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | sparql-templates.md | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | evaluation.md | Metrics, gold standard datasets, benchmarks | ~700 |

Total: ~4,700 lines of documentation
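
06-retrieval-patterns.md covers hybrid retrieval with Reciprocal Rank Fusion (RRF). As a rough illustration only, and not the pipeline's own code, the standard RRF formula scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. The function name, the k = 60 default, and the document IDs below are assumptions made for this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a keyword search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, without needing the two retrievers' raw scores to be on a comparable scale.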

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        DSPy RAG Pipeline                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐          │
│  │   Chunker    │───▶│   Embedder   │───▶│  ChromaDB    │          │
│  │ (03-chunking)│    │              │    │  (Vectors)   │          │
│  └──────────────┘    └──────────────┘    └──────────────┘          │
│         │                                       │                   │
│         ▼                                       ▼                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐          │
│  │   Entity     │───▶│   Entity     │───▶│   TypeDB     │          │
│  │  Extractor   │    │   Linker     │    │    (KG)      │          │
│  │(04-extraction)│    │(05-linking)  │    │              │          │
│  └──────────────┘    └──────────────┘    └──────────────┘          │
│                             │                   │                   │
│                             ▼                   ▼                   │
│                      ┌──────────────┐    ┌──────────────┐          │
│                      │  Wikidata    │◀──▶│   SPARQL     │          │
│                      │   VIAF       │    │  Generator   │          │
│                      │   ISIL       │    │(07-templates)│          │
│                      └──────────────┘    └──────────────┘          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Existing Implementation

The following code already implements parts of this documentation:

src/glam_extractor/api/dspy_sparql.py (347 lines)

  • QuestionToSPARQL - DSPy signature for SPARQL generation
  • SPARQLGenerator - Basic DSPy module
  • RAGSPARQLGenerator - RAG-enhanced with Qdrant retrieval
  • configure_dspy() - LLM configuration
  • generate_sparql() - Main entry point
  • ONTOLOGY_CONTEXT - Prompt context with ontology info

src/glam_extractor/annotators/dspy_optimizer.py (1,030 lines)

  • EntitySpan - Entity representation for evaluation
  • NERMetrics - Comprehensive metrics (P/R/F1, per-type)
  • compute_ner_metrics() - CoNLL-style evaluation
  • GLAMNERSignature - DSPy signature for NER
  • GLAMRelationshipSignature - Relationship extraction
  • GLAMClaimSignature - Claim extraction
  • GLAMFullPipelineSignature - Full pipeline signature
  • PromptVersion - Version tracking
  • PromptArchive - Version management with rollback
  • OptimizationConfig - MIPROv2/BootstrapFewShot config
  • PromptOptimizer - DSPy optimization workflow

Key Design Decisions

1. CH-Annotator v1.7.0 Convention

All entity extraction follows the CH-Annotator convention with 9 hypernyms:

| Code | Hypernym | Primary Ontology |
|------|----------|------------------|
| AGT | AGENT | crm:E39_Actor |
| GRP | GROUP | crm:E74_Group |
| TOP | TOPONYM | crm:E53_Place |
| GEO | GEOMETRY | geo:Geometry |
| TMP | TEMPORAL | crm:E52_Time-Span |
| APP | APPELLATION | crm:E41_Appellation |
| ROL | ROLE | org:Role |
| WRK | WORK | frbroo:F1_Work |
| QTY | QUANTITY | crm:E54_Dimension |

Heritage institutions use GRP.HER.* with GLAMORCUBESFIXPHDNT subtypes.
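
The hypernym table can be held as a simple lookup. The sketch below copies the nine codes from the table; the `primary_class` helper is hypothetical and assumes that the hypernym code is always the first dot-separated segment of a type label, as in `GRP.HER.*`. The authoritative definitions live in the CH-Annotator convention file.

```python
# Hypernym codes and their primary ontology classes, copied from the table.
HYPERNYMS = {
    "AGT": ("AGENT", "crm:E39_Actor"),
    "GRP": ("GROUP", "crm:E74_Group"),
    "TOP": ("TOPONYM", "crm:E53_Place"),
    "GEO": ("GEOMETRY", "geo:Geometry"),
    "TMP": ("TEMPORAL", "crm:E52_Time-Span"),
    "APP": ("APPELLATION", "crm:E41_Appellation"),
    "ROL": ("ROLE", "org:Role"),
    "WRK": ("WORK", "frbroo:F1_Work"),
    "QTY": ("QUANTITY", "crm:E54_Dimension"),
}


def primary_class(type_label):
    """Return the primary ontology class for a dotted type label.

    Assumption: the hypernym code is the first dot-separated segment.
    """
    code = type_label.split(".", 1)[0]
    if code not in HYPERNYMS:
        raise ValueError(f"unknown hypernym code: {code!r}")
    return HYPERNYMS[code][1]


print(primary_class("GRP.HER"))
```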

2. GLAMORCUBESFIXPHDNT Taxonomy

19-type classification for heritage custodians:

G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
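
One way to mirror this taxonomy in Python is an `Enum` keyed by the single-letter codes. This is a sketch only: the class name `CustodianType` and the member names are assumptions, and the authoritative definition is the LinkML enum in CustodianPrimaryTypeEnum.yaml.

```python
from enum import Enum


class CustodianType(Enum):
    """Single-letter codes from the list above (member names are illustrative)."""
    GALLERY = "G"
    LIBRARY = "L"
    ARCHIVE = "A"
    MUSEUM = "M"
    OFFICIAL = "O"
    RESEARCH = "R"
    CORPORATION = "C"
    UNKNOWN = "U"
    BOTANICAL_ZOO = "B"
    EDUCATION = "E"
    SOCIETY = "S"
    FEATURE = "F"
    INTANGIBLE = "I"
    MIXED = "X"
    PERSONAL = "P"
    HOLY_SITE = "H"
    DIGITAL = "D"
    NGO = "N"
    TASTE_SMELL = "T"


# Enum iteration follows definition order, so the codes spell the mnemonic.
mnemonic = "".join(member.value for member in CustodianType)
print(mnemonic, len(CustodianType))
```

Looking up a member by code, for example `CustodianType("M")`, gives a cheap validity check for type codes coming out of the extractor.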

3. Hybrid Extraction Strategy

Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See 04-entity-extraction.md.
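
A minimal sketch of one possible span-overlap merge follows. It assumes that pattern spans win whenever they overlap an LLM span, and that LLM spans are kept only where no pattern span covers the same text. The `Span` class, its fields, and this precedence rule are illustrative assumptions; the actual rules are specified in 04-entity-extraction.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # hypernym code, e.g. "GRP"
    source: str  # "pattern" or "llm"


def overlaps(a, b):
    """True if the two half-open character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_spans(pattern_spans, llm_spans):
    """Keep every pattern span; add LLM spans that overlap none of them."""
    merged = list(pattern_spans)
    for span in llm_spans:
        if not any(overlaps(span, kept) for kept in pattern_spans):
            merged.append(span)
    return sorted(merged, key=lambda s: (s.start, s.end))


# Hypothetical spans over the text "Rijksmuseum in Amsterdam".
pattern = [Span(0, 11, "GRP", "pattern")]
llm = [Span(0, 11, "GRP", "llm"), Span(15, 24, "TOP", "llm")]
merged = merge_spans(pattern, llm)
print([(s.start, s.end, s.label, s.source) for s in merged])
```

Preferring the pattern span on overlap keeps the high-precision result, while the LLM contributes entities the patterns missed.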

4. Multi-Store Architecture

| Store | Purpose | Technology |
|-------|---------|------------|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |

5. XPath Provenance (AGENTS.md Rule 6)

Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.
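
The rejection rule can be sketched as a simple filter. The `Claim` class and its field names are hypothetical, and this sketch only checks that an XPath is present. It does not verify that the XPath resolves against the source HTML, which a real implementation would presumably also need to do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    claim_type: str
    value: str
    xpath: Optional[str] = None  # location of the value in the source HTML


def partition_claims(claims):
    """Split claims into (accepted, rejected) by presence of a non-empty XPath."""
    accepted, rejected = [], []
    for claim in claims:
        if claim.xpath and claim.xpath.strip():
            accepted.append(claim)
        else:
            rejected.append(claim)
    return accepted, rejected


# Hypothetical claims: the second has no provenance and is rejected.
claims = [
    Claim("name", "Rijksmuseum", xpath="/html/body/header/h1"),
    Claim("founding_year", "1800"),
]
accepted, rejected = partition_claims(claims)
print(len(accepted), len(rejected))
```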

External Dependencies

Knowledge Bases

  • Wikidata - Primary entity linking target
  • VIAF - Authority file for agents
  • ISIL - Library/archive identifiers
  • GeoNames - Place disambiguation
  • ROR - Research organization registry

LinkML Schemas

schemas/20251121/linkml/
├── modules/classes/
│   ├── CustodianObservation.yaml
│   ├── CustodianName.yaml
│   ├── CustodianReconstruction.yaml
│   ├── CustodianLegalStatus.yaml
│   ├── CustodianPlace.yaml
│   ├── CustodianCollection.yaml
│   └── ...
└── modules/enums/
    ├── CustodianPrimaryTypeEnum.yaml  # GLAMORCUBESFIXPHDNT
    └── CanonicalClaimTypes.yaml       # 3-tier claim types

Convention Files

  • data/entity_annotation/ch_annotator-v1_7_0.yaml - CH-Annotator convention (2000+ lines)

Performance Targets

From 08-evaluation.md:

| Task | Metric | Target |
|------|--------|--------|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |
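
For reference, the two entity-linking metrics can be computed as below. This is a sketch using the standard definitions of Hits@1 and Mean Reciprocal Rank, not the evaluation code described in 08-evaluation.md. The function names and the placeholder IDs are assumptions, and both functions expect a non-empty gold list.

```python
def hits_at_1(ranked_candidates, gold):
    """Fraction of mentions whose top-ranked candidate is the gold entity."""
    hits = sum(
        1 for candidates, answer in zip(ranked_candidates, gold)
        if candidates and candidates[0] == answer
    )
    return hits / len(gold)


def mean_reciprocal_rank(ranked_candidates, gold):
    """Mean of 1 / rank of the gold entity; a missing gold entity scores 0."""
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)
    return total / len(gold)


# Placeholder entity IDs for three mentions: gold at rank 1, rank 2, and absent.
ranked = [["Q1", "Q2"], ["Q3", "Q4"], ["Q5", "Q6"]]
gold = ["Q1", "Q4", "Q9"]
print(hits_at_1(ranked, gold), mean_reciprocal_rank(ranked, gold))
```

On this toy input Hits@1 is 1/3 and MRR is 0.5, both well below the targets in the table.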

Next Steps for Implementation

  1. Entity Extraction Pipeline - Implement hybrid extraction from 04-entity-extraction.md
  2. Entity Linking Pipeline - Implement candidate generation/ranking from 05-entity-linking.md
  3. Gold Standard Dataset - Create evaluation data following schema in 08-evaluation.md
  4. ChromaDB Integration - Extend RAGSPARQLGenerator with ChromaDB
  5. TypeDB Schema - Translate LinkML to TypeDB for KG storage

References

Project Documentation

  • AGENTS.md - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
  • docs/SCHEMA_MODULES.md - LinkML schema architecture
  • docs/PERSISTENT_IDENTIFIERS.md - GHCID identifier system

External Standards


Maintainer: GLAM Data Extraction Project
License: MIT