# DSPy RAG Pipeline for Heritage Custodian Ontology
**Version**: 1.0.0

**Status**: Documentation Complete

**Last Updated**: 2025-12-12
## Overview
This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.
## Quick Start
```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic
# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"
# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```
```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql
# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")
# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en",
)
print(result["sparql"])
```
## Documentation Suite
| # | Document | Description | Lines |
|---|----------|-------------|-------|
| 00 | [overview.md](00-overview.md) | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | [architecture.md](01-architecture.md) | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | [dspy-signatures.md](02-dspy-signatures.md) | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | [chunking-strategy.md](03-chunking-strategy.md) | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | [entity-extraction.md](04-entity-extraction.md) | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | [entity-linking.md](05-entity-linking.md) | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | [retrieval-patterns.md](06-retrieval-patterns.md) | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | [sparql-templates.md](07-sparql-templates.md) | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | [evaluation.md](08-evaluation.md) | Metrics, gold standard datasets, benchmarks | ~700 |
**Total**: ~4,700 lines of documentation
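Document 06 covers hybrid retrieval with Reciprocal Rank Fusion (RRF). As a minimal sketch of the fusion step (function name and the `k=60` default are illustrative, not the documented API):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it; higher totals rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked highly by both dense and sparse retrieval wins:
dense = ["rijksmuseum", "van_gogh", "stedelijk"]
sparse = ["rijksmuseum", "tropenmuseum", "van_gogh"]
fused = rrf_fuse([dense, sparse])
```

The constant `k` damps the influence of top ranks so that broad agreement across retrievers beats a single first-place vote.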
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ DSPy RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunker │───▶│ Embedder │───▶│ ChromaDB │ │
│ │ (03-chunking)│ │ │ │ (Vectors) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Entity │───▶│ Entity │───▶│ TypeDB │ │
│ │ Extractor │ │ Linker │ │ (KG) │ │
│ │(04-extraction)│ │(05-linking) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Wikidata │◀──▶│ SPARQL │ │
│ │ VIAF │ │ Generator │ │
│ │ ISIL │ │(07-templates)│ │
│ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
## Existing Implementation
The following modules already implement parts of the design described in this documentation:
### `src/glam_extractor/api/dspy_sparql.py` (347 lines)
- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info
### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)
- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow
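The exact-match variant of the CoNLL-style evaluation behind `compute_ner_metrics()` reduces to set intersection over `(start, end, label)` triples. A simplified illustration (not the module's actual signature):

```python
def span_f1(gold: set[tuple[int, int, str]],
            pred: set[tuple[int, int, str]]) -> dict[str, float]:
    """Exact-match precision/recall/F1 over (start, end, label) spans."""
    tp = len(gold & pred)  # true positives: spans matching exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {(0, 11, "GRP.HER.M"), (15, 24, "TOP")}
pred = {(0, 11, "GRP.HER.M"), (30, 35, "TMP")}
m = span_f1(gold, pred)  # precision 0.5, recall 0.5, f1 0.5
```

The relaxed variant in the real module additionally credits partial span overlaps; see [08-evaluation.md](08-evaluation.md).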
## Key Design Decisions
### 1. CH-Annotator v1.7.0 Convention
All entity extraction follows the CH-Annotator convention with 9 hypernyms:

| Code | Hypernym | Primary Ontology |
|------|----------|------------------|
| AGT | AGENT | `crm:E39_Actor` |
| GRP | GROUP | `crm:E74_Group` |
| TOP | TOPONYM | `crm:E53_Place` |
| GEO | GEOMETRY | `geo:Geometry` |
| TMP | TEMPORAL | `crm:E52_Time-Span` |
| APP | APPELLATION | `crm:E41_Appellation` |
| ROL | ROLE | `org:Role` |
| WRK | WORK | `frbroo:F1_Work` |
| QTY | QUANTITY | `crm:E54_Dimension` |
Heritage institutions use `GRP.HER.*` with GLAMORCUBESFIXPHDNT subtypes.
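The table above as a lookup, with the `GRP.HER.*` resolution rule (the dict and helper are illustrative; the ontology CURIEs are copied verbatim from the convention):

```python
# Hypernym code -> (label, primary ontology class), per CH-Annotator v1.7.0
HYPERNYMS = {
    "AGT": ("AGENT", "crm:E39_Actor"),
    "GRP": ("GROUP", "crm:E74_Group"),
    "TOP": ("TOPONYM", "crm:E53_Place"),
    "GEO": ("GEOMETRY", "geo:Geometry"),
    "TMP": ("TEMPORAL", "crm:E52_Time-Span"),
    "APP": ("APPELLATION", "crm:E41_Appellation"),
    "ROL": ("ROLE", "org:Role"),
    "WRK": ("WORK", "frbroo:F1_Work"),
    "QTY": ("QUANTITY", "crm:E54_Dimension"),
}

def ontology_class(tag: str) -> str:
    # A dotted tag like "GRP.HER.M" resolves via its leading hypernym code.
    return HYPERNYMS[tag.split(".")[0]][1]
```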
### 2. GLAMORCUBESFIXPHDNT Taxonomy
19-type classification for heritage custodians:
```
G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
```
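The same taxonomy as a code mapping; the acronym is simply the key order:

```python
# 19-type custodian taxonomy; keys spell GLAMORCUBESFIXPHDNT in order.
GLAMORCUBESFIXPHDNT = {
    "G": "Gallery", "L": "Library", "A": "Archive", "M": "Museum",
    "O": "Official", "R": "Research", "C": "Corporation", "U": "Unknown",
    "B": "Botanical/Zoo", "E": "Education", "S": "Society", "F": "Feature",
    "I": "Intangible", "X": "Mixed", "P": "Personal", "H": "Holy site",
    "D": "Digital", "N": "NGO", "T": "Taste/Smell",
}

assert len(GLAMORCUBESFIXPHDNT) == 19
assert "".join(GLAMORCUBESFIXPHDNT) == "GLAMORCUBESFIXPHDNT"
```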
### 3. Hybrid Extraction Strategy
Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See [04-entity-extraction.md](04-entity-extraction.md).
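A minimal sketch of that merge, assuming spans are `(start, end, label)` tuples and that pattern spans win on overlap (names and tuple layout are this sketch's own, not the documented API):

```python
def merge_spans(pattern_spans, llm_spans):
    """Union of both extractors; on character overlap, keep the
    high-precision pattern span and drop the LLM span."""
    def overlaps(a, b):
        # Half-open intervals [start, end) overlap iff each starts
        # before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    merged = list(pattern_spans)
    for span in llm_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)

patterns = [(0, 11, "GRP.HER.M")]
llm = [(0, 10, "GRP"), (15, 24, "TOP")]  # first overlaps a pattern span
result = merge_spans(patterns, llm)
```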
### 4. Multi-Store Architecture
| Store | Purpose | Technology |
|-------|---------|------------|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |
### 5. XPath Provenance (AGENTS.md Rule 6)
Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.
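Rule 6 can be enforced with a simple guard at ingestion time (the claim shape and function name here are hypothetical):

```python
def validate_claims(claims: list[dict]) -> list[dict]:
    """Keep only claims carrying XPath provenance (AGENTS.md Rule 6);
    anything without a source location is treated as fabricated."""
    accepted = []
    for claim in claims:
        xpath = claim.get("xpath", "").strip()
        if xpath.startswith("/"):  # minimal sanity check on the XPath
            accepted.append(claim)
    return accepted

claims = [
    {"value": "Rijksmuseum", "xpath": "/html/body/h1"},
    {"value": "founded 1800"},  # no provenance: rejected
]
assert validate_claims(claims) == [claims[0]]
```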
## External Dependencies
### Knowledge Bases
- **Wikidata** - Primary entity linking target
- **VIAF** - Authority file for agents
- **ISIL** - Library/archive identifiers
- **GeoNames** - Place disambiguation
- **ROR** - Research organization registry
### LinkML Schemas
```
schemas/20251121/linkml/
├── modules/classes/
│ ├── CustodianObservation.yaml
│ ├── CustodianName.yaml
│ ├── CustodianReconstruction.yaml
│ ├── CustodianLegalStatus.yaml
│ ├── CustodianPlace.yaml
│ ├── CustodianCollection.yaml
│ └── ...
└── modules/enums/
├── CustodianPrimaryTypeEnum.yaml # GLAMORCUBESFIXPHDNT
└── CanonicalClaimTypes.yaml # 3-tier claim types
```
### Convention Files
- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2000+ lines)
## Performance Targets
From [08-evaluation.md](08-evaluation.md):

| Task | Metric | Target |
|------|--------|--------|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |
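The entity-linking targets (Hits@1, MRR) are computed over ranked candidate lists per query; a minimal sketch with generic IDs (function names illustrative):

```python
def hits_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold ID appears in the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr(ranked: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank of the gold ID (0 when it is absent)."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

ranked = [["Q1", "Q2"], ["Q2", "Q1"]]  # candidates per query, best first
gold = ["Q1", "Q1"]
# hits@1 = 0.5 (first query correct), MRR = (1 + 1/2) / 2 = 0.75
```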
## Next Steps for Implementation
1. **Entity Extraction Pipeline** - Implement hybrid extraction from [04-entity-extraction.md](04-entity-extraction.md)
2. **Entity Linking Pipeline** - Implement candidate generation/ranking from [05-entity-linking.md](05-entity-linking.md)
3. **Gold Standard Dataset** - Create evaluation data following schema in [08-evaluation.md](08-evaluation.md)
4. **ChromaDB Integration** - Extend `RAGSPARQLGenerator` with ChromaDB
5. **TypeDB Schema** - Translate LinkML to TypeDB for KG storage
## References
### Project Documentation
- **AGENTS.md** - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- **docs/SCHEMA_MODULES.md** - LinkML schema architecture
- **docs/PERSISTENT_IDENTIFIERS.md** - GHCID identifier system
### External Standards
- [DSPy Documentation](https://dspy-docs.vercel.app/)
- [CH-Annotator Convention](../entity_annotation/ch_annotator-v1_7_0.yaml)
- [CIDOC-CRM 7.1.3](https://cidoc-crm.org/)
- [Records in Contexts (RiC-O)](https://www.ica.org/standards/RiC/RiC-O_v0-2.html)
---
**Maintainer**: GLAM Data Extraction Project

**License**: MIT