# DSPy RAG Pipeline for Heritage Custodian Ontology
**Version**: 1.0.0

**Status**: Documentation Complete

**Last Updated**: 2025-12-12
## Overview
This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.
## Quick Start
```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic
# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"
# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```
```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql
# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")
# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en",
)
print(result["sparql"])
```
## Documentation Suite
| # | Document | Description | Lines |
|---|----------|-------------|-------|
| 00 | [overview.md](00-overview.md) | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | [architecture.md](01-architecture.md) | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | [dspy-signatures.md](02-dspy-signatures.md) | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | [chunking-strategy.md](03-chunking-strategy.md) | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | [entity-extraction.md](04-entity-extraction.md) | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | [entity-linking.md](05-entity-linking.md) | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | [retrieval-patterns.md](06-retrieval-patterns.md) | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | [sparql-templates.md](07-sparql-templates.md) | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | [evaluation.md](08-evaluation.md) | Metrics, gold standard datasets, benchmarks | ~700 |
**Total**: ~4,700 lines of documentation
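Document 06 covers hybrid retrieval with Reciprocal Rank Fusion (RRF). As a minimal sketch of the fusion step (function name and the `k=60` default are illustrative, not the documented API):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it; higher totals rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked highly by both dense and sparse retrieval wins:
dense = ["rijksmuseum", "van_gogh", "stedelijk"]
sparse = ["rijksmuseum", "tropenmuseum", "van_gogh"]
fused = rrf_fuse([dense, sparse])
```

The constant `k` damps the influence of top ranks so that broad agreement across retrievers beats a single first-place vote.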
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ DSPy RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunker │───▶│ Embedder │───▶│ ChromaDB │ │
│ │ (03-chunking)│ │ │ │ (Vectors) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Entity │───▶│ Entity │───▶│ TypeDB │ │
│ │ Extractor │ │ Linker │ │ (KG) │ │
│ │(04-extraction)│ │(05-linking) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Wikidata │◀──▶│ SPARQL │ │
│ │ VIAF │ │ Generator │ │
│ │ ISIL │ │(07-templates)│ │
│ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
## Existing Implementation
The following modules already implement parts of the design described in this documentation:
### `src/glam_extractor/api/dspy_sparql.py` (347 lines)
- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info
### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)
- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow
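The exact-match variant of the CoNLL-style evaluation behind `compute_ner_metrics()` reduces to set intersection over `(start, end, label)` triples. A simplified illustration (not the module's actual signature):

```python
def span_f1(gold: set[tuple[int, int, str]],
            pred: set[tuple[int, int, str]]) -> dict[str, float]:
    """Exact-match precision/recall/F1 over (start, end, label) spans."""
    tp = len(gold & pred)  # true positives: spans matching exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {(0, 11, "GRP.HER.M"), (15, 24, "TOP")}
pred = {(0, 11, "GRP.HER.M"), (30, 35, "TMP")}
m = span_f1(gold, pred)  # precision 0.5, recall 0.5, f1 0.5
```

The relaxed variant in the real module additionally credits partial span overlaps; see [08-evaluation.md](08-evaluation.md).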
## Key Design Decisions
### 1. CH-Annotator v1.7.0 Convention
All entity extraction follows the CH-Annotator convention with 9 hypernyms:

| Code | Hypernym | Primary Ontology |
|------|----------|------------------|
| AGT | AGENT | `crm:E39_Actor` |
| GRP | GROUP | `crm:E74_Group` |
| TOP | TOPONYM | `crm:E53_Place` |
| GEO | GEOMETRY | `geo:Geometry` |
| TMP | TEMPORAL | `crm:E52_Time-Span` |
| APP | APPELLATION | `crm:E41_Appellation` |
| ROL | ROLE | `org:Role` |
| WRK | WORK | `frbroo:F1_Work` |
| QTY | QUANTITY | `crm:E54_Dimension` |
Heritage institutions use `GRP.HER.*` with GLAMORCUBESFIXPHDNT subtypes.
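The table above as a lookup, with the `GRP.HER.*` resolution rule (the dict and helper are illustrative; the ontology CURIEs are copied verbatim from the convention):

```python
# Hypernym code -> (label, primary ontology class), per CH-Annotator v1.7.0
HYPERNYMS = {
    "AGT": ("AGENT", "crm:E39_Actor"),
    "GRP": ("GROUP", "crm:E74_Group"),
    "TOP": ("TOPONYM", "crm:E53_Place"),
    "GEO": ("GEOMETRY", "geo:Geometry"),
    "TMP": ("TEMPORAL", "crm:E52_Time-Span"),
    "APP": ("APPELLATION", "crm:E41_Appellation"),
    "ROL": ("ROLE", "org:Role"),
    "WRK": ("WORK", "frbroo:F1_Work"),
    "QTY": ("QUANTITY", "crm:E54_Dimension"),
}

def ontology_class(tag: str) -> str:
    # A dotted tag like "GRP.HER.M" resolves via its leading hypernym code.
    return HYPERNYMS[tag.split(".")[0]][1]
```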
### 2. GLAMORCUBESFIXPHDNT Taxonomy
19-type classification for heritage custodians:
```
G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
```
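The same taxonomy as a code mapping; the acronym is simply the key order:

```python
# 19-type custodian taxonomy; keys spell GLAMORCUBESFIXPHDNT in order.
GLAMORCUBESFIXPHDNT = {
    "G": "Gallery", "L": "Library", "A": "Archive", "M": "Museum",
    "O": "Official", "R": "Research", "C": "Corporation", "U": "Unknown",
    "B": "Botanical/Zoo", "E": "Education", "S": "Society", "F": "Feature",
    "I": "Intangible", "X": "Mixed", "P": "Personal", "H": "Holy site",
    "D": "Digital", "N": "NGO", "T": "Taste/Smell",
}

assert len(GLAMORCUBESFIXPHDNT) == 19
assert "".join(GLAMORCUBESFIXPHDNT) == "GLAMORCUBESFIXPHDNT"
```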
### 3. Hybrid Extraction Strategy
Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See [04-entity-extraction.md](04-entity-extraction.md).
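A minimal sketch of that merge, assuming spans are `(start, end, label)` tuples and that pattern spans win on overlap (names and tuple layout are this sketch's own, not the documented API):

```python
def merge_spans(pattern_spans, llm_spans):
    """Union of both extractors; on character overlap, keep the
    high-precision pattern span and drop the LLM span."""
    def overlaps(a, b):
        # Half-open intervals [start, end) overlap iff each starts
        # before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    merged = list(pattern_spans)
    for span in llm_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)

patterns = [(0, 11, "GRP.HER.M")]
llm = [(0, 10, "GRP"), (15, 24, "TOP")]  # first overlaps a pattern span
result = merge_spans(patterns, llm)
```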
### 4. Multi-Store Architecture
| Store | Purpose | Technology |
|-------|---------|------------|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |
### 5. XPath Provenance (AGENTS.md Rule 6)
Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.
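Rule 6 can be enforced with a simple guard at ingestion time (the claim shape and function name here are hypothetical):

```python
def validate_claims(claims: list[dict]) -> list[dict]:
    """Keep only claims carrying XPath provenance (AGENTS.md Rule 6);
    anything without a source location is treated as fabricated."""
    accepted = []
    for claim in claims:
        xpath = claim.get("xpath", "").strip()
        if xpath.startswith("/"):  # minimal sanity check on the XPath
            accepted.append(claim)
    return accepted

claims = [
    {"value": "Rijksmuseum", "xpath": "/html/body/h1"},
    {"value": "founded 1800"},  # no provenance: rejected
]
assert validate_claims(claims) == [claims[0]]
```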
## External Dependencies
### Knowledge Bases
- **Wikidata** - Primary entity linking target
- **VIAF** - Authority file for agents
- **ISIL** - Library/archive identifiers
- **GeoNames** - Place disambiguation
- **ROR** - Research organization registry
### LinkML Schemas
```
schemas/20251121/linkml/
├── modules/classes/
│ ├── CustodianObservation.yaml
│ ├── CustodianName.yaml
│ ├── CustodianReconstruction.yaml
│ ├── CustodianLegalStatus.yaml
│ ├── CustodianPlace.yaml
│ ├── CustodianCollection.yaml
│ └── ...
└── modules/enums/
├── CustodianPrimaryTypeEnum.yaml # GLAMORCUBESFIXPHDNT
└── CanonicalClaimTypes.yaml # 3-tier claim types
```
### Convention Files
- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2000+ lines)
## Performance Targets
From [08-evaluation.md](08-evaluation.md):

| Task | Metric | Target |
|------|--------|--------|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |
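The entity-linking targets (Hits@1, MRR) are computed over ranked candidate lists per query; a minimal sketch with generic IDs (function names illustrative):

```python
def hits_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold ID appears in the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr(ranked: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank of the gold ID (0 when it is absent)."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

ranked = [["Q1", "Q2"], ["Q2", "Q1"]]  # candidates per query, best first
gold = ["Q1", "Q1"]
# hits@1 = 0.5 (first query correct), MRR = (1 + 1/2) / 2 = 0.75
```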
## Next Steps for Implementation
1. **Entity Extraction Pipeline** - Implement hybrid extraction from [04-entity-extraction.md](04-entity-extraction.md)
2. **Entity Linking Pipeline** - Implement candidate generation/ranking from [05-entity-linking.md](05-entity-linking.md)
3. **Gold Standard Dataset** - Create evaluation data following schema in [08-evaluation.md](08-evaluation.md)
4. **ChromaDB Integration** - Extend `RAGSPARQLGenerator` with ChromaDB
5. **TypeDB Schema** - Translate LinkML to TypeDB for KG storage
## References
### Project Documentation
- **AGENTS.md** - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- **docs/SCHEMA_MODULES.md** - LinkML schema architecture
- **docs/PERSISTENT_IDENTIFIERS.md** - GHCID identifier system
### External Standards
- [DSPy Documentation](https://dspy-docs.vercel.app/)
- [CH-Annotator Convention](../entity_annotation/ch_annotator-v1_7_0.yaml)
- [CIDOC-CRM 7.1.3](https://cidoc-crm.org/)
- [Records in Contexts (RiC-O)](https://www.ica.org/standards/RiC/RiC-O_v0-2.html)
---
**Maintainer**: GLAM Data Extraction Project

**License**: MIT