# DSPy RAG Pipeline for Heritage Custodian Ontology

**Version**: 1.0.0
**Status**: Documentation Complete
**Last Updated**: 2025-12-12

## Overview

This documentation suite describes the DSPy-based Retrieval-Augmented Generation (RAG) pipeline for extracting, linking, and querying heritage custodian data. The pipeline processes web pages, archival documents, and structured data to populate the Heritage Custodian Ontology.

## Quick Start

```bash
# Install dependencies
pip install dspy-ai chromadb qdrant-client httpx lxml pydantic

# Configure DSPy with Claude
export ANTHROPIC_API_KEY="your-key"

# Or use Z.AI GLM (free, per AGENTS.md Rule 11)
export ZAI_API_TOKEN="your-token"
```

```python
import dspy
from glam_extractor.api.dspy_sparql import configure_dspy, generate_sparql

# Configure with Anthropic
configure_dspy(provider="anthropic", model="claude-sonnet-4-20250514")

# Generate SPARQL from natural language
result = generate_sparql(
    question="Find all museums in Amsterdam with Wikidata IDs",
    language="en"
)
print(result["sparql"])
```

## Documentation Suite

| # | Document | Description | Lines |
|---|----------|-------------|-------|
| 00 | [overview.md](00-overview.md) | Executive summary, architecture diagram, key design decisions | ~200 |
| 01 | [architecture.md](01-architecture.md) | System components, data flow, storage design (ChromaDB, TypeDB) | ~400 |
| 02 | [dspy-signatures.md](02-dspy-signatures.md) | 7 DSPy module definitions with Pydantic output schemas | 584 |
| 03 | [chunking-strategy.md](03-chunking-strategy.md) | Schema-aware chunking for HTML, PDF, JSON-LD sources | ~400 |
| 04 | [entity-extraction.md](04-entity-extraction.md) | NER following CH-Annotator v1.7.0, hybrid extraction | 571 |
| 05 | [entity-linking.md](05-entity-linking.md) | Multi-KB candidate generation (Wikidata, VIAF, ISIL), ranking, disambiguation | ~650 |
| 06 | [retrieval-patterns.md](06-retrieval-patterns.md) | Hybrid retrieval with Reciprocal Rank Fusion | 601 |
| 07 | [sparql-templates.md](07-sparql-templates.md) | 40+ SPARQL query templates with Wikidata federation | 563 |
| 08 | [evaluation.md](08-evaluation.md) | Metrics, gold standard datasets, benchmarks | ~700 |

**Total**: ~4,700 lines of documentation

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          DSPy RAG Pipeline                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   Chunker    │───▶│   Embedder   │───▶│   ChromaDB   │           │
│  │ (03-chunking)│    │              │    │  (Vectors)   │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│         │                   │                                       │
│         ▼                   ▼                                       │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   Entity     │───▶│   Entity     │───▶│   TypeDB     │           │
│  │  Extractor   │    │   Linker     │    │    (KG)      │           │
│  │(04-extraction)│   │(05-linking)  │    │              │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│         │                   │                                       │
│         ▼                   ▼                                       │
│  ┌──────────────┐    ┌──────────────┐                               │
│  │   Wikidata   │◀──▶│   SPARQL     │                               │
│  │   VIAF       │    │  Generator   │                               │
│  │   ISIL       │    │(07-templates)│                               │
│  └──────────────┘    └──────────────┘                               │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Existing Implementation

The following code already implements parts of this documentation:

### `src/glam_extractor/api/dspy_sparql.py` (347 lines)

- `QuestionToSPARQL` - DSPy signature for SPARQL generation
- `SPARQLGenerator` - Basic DSPy module
- `RAGSPARQLGenerator` - RAG-enhanced with Qdrant retrieval
- `configure_dspy()` - LLM configuration
- `generate_sparql()` - Main entry point
- `ONTOLOGY_CONTEXT` - Prompt context with ontology info

### `src/glam_extractor/annotators/dspy_optimizer.py` (1,030 lines)

- `EntitySpan` - Entity representation for evaluation
- `NERMetrics` - Comprehensive metrics (P/R/F1, per-type)
- `compute_ner_metrics()` - CoNLL-style evaluation
- `GLAMNERSignature` - DSPy signature for NER
- `GLAMRelationshipSignature` - Relationship extraction
- `GLAMClaimSignature` - Claim extraction
- `GLAMFullPipelineSignature` - Full pipeline signature
- `PromptVersion` - Version tracking
- `PromptArchive` - Version management with rollback
- `OptimizationConfig` - MIPROv2/BootstrapFewShot config
- `PromptOptimizer` - DSPy optimization workflow

## Key Design Decisions

### 1. CH-Annotator v1.7.0 Convention

All entity extraction follows the CH-Annotator convention with 9 hypernyms:

| Code | Hypernym | Primary Ontology |
|------|----------|------------------|
| AGT | AGENT | `crm:E39_Actor` |
| GRP | GROUP | `crm:E74_Group` |
| TOP | TOPONYM | `crm:E53_Place` |
| GEO | GEOMETRY | `geo:Geometry` |
| TMP | TEMPORAL | `crm:E52_Time-Span` |
| APP | APPELLATION | `crm:E41_Appellation` |
| ROL | ROLE | `org:Role` |
| WRK | WORK | `frbroo:F1_Work` |
| QTY | QUANTITY | `crm:E54_Dimension` |

Heritage institutions use `GRP.HER.*` with GLAMORCUBESFIXPHDNT subtypes.

### 2. GLAMORCUBESFIXPHDNT Taxonomy

19-type classification for heritage custodians:

```
G-Gallery, L-Library, A-Archive, M-Museum, O-Official, R-Research,
C-Corporation, U-Unknown, B-Botanical/Zoo, E-Education, S-Society,
F-Feature, I-Intangible, X-Mixed, P-Personal, H-Holy site, D-Digital,
N-NGO, T-Taste/Smell
```

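
For illustration, the taxonomy could be held in code as an enumeration keyed by its one-letter codes. This is only a sketch: the class name `CustodianType` and the helper `expand_codes` are hypothetical and are not taken from the project's `CustodianPrimaryTypeEnum` schema. Only the 19 letter/label pairs come from the list above.

```python
from enum import Enum


class CustodianType(Enum):
    """Hypothetical enum: one member per GLAMORCUBESFIXPHDNT letter."""

    G = "Gallery"
    L = "Library"
    A = "Archive"
    M = "Museum"
    O = "Official"
    R = "Research"
    C = "Corporation"
    U = "Unknown"
    B = "Botanical/Zoo"
    E = "Education"
    S = "Society"
    F = "Feature"
    I = "Intangible"
    X = "Mixed"
    P = "Personal"
    H = "Holy site"
    D = "Digital"
    N = "NGO"
    T = "Taste/Smell"


def expand_codes(codes: str) -> list[str]:
    """Map a string of one-letter codes (e.g. "LA") to their labels."""
    return [CustodianType[letter].value for letter in codes.upper()]


# The taxonomy name is the concatenation of the 19 codes in order.
taxonomy_name = "".join(member.name for member in CustodianType)
```

Because the members are declared in taxonomy order, `taxonomy_name` reproduces the acronym, which gives a cheap consistency check if the list is ever edited.
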
### 3. Hybrid Extraction Strategy

Pattern-based extraction (high precision) is merged with LLM extraction (high recall) using a span-overlap algorithm. See [04-entity-extraction.md](04-entity-extraction.md).

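
One way such a merge might look is sketched below. It assumes that a pattern match wins whenever an LLM span overlaps it, and that LLM spans with no overlap are added to recover recall. The `Span` dataclass, `overlaps`, and `merge_spans` are hypothetical names, and the algorithm specified in 04-entity-extraction.md may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical entity span: character offsets [start, end) plus a type code."""

    start: int
    end: int
    label: str
    source: str  # "pattern" or "llm"


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_spans(pattern_spans: list[Span], llm_spans: list[Span]) -> list[Span]:
    """Keep every pattern span; add only LLM spans that overlap no pattern span."""
    merged = list(pattern_spans)
    for candidate in llm_spans:
        if not any(overlaps(candidate, kept) for kept in pattern_spans):
            merged.append(candidate)
    return sorted(merged, key=lambda s: (s.start, s.end))


text = "Rijksmuseum in Amsterdam"
pattern_spans = [Span(0, 11, "GRP", "pattern")]
llm_spans = [
    Span(0, 11, "GRP", "llm"),   # duplicates the pattern match, so it is dropped
    Span(15, 24, "TOP", "llm"),  # no overlap, so it is kept
]
merged = merge_spans(pattern_spans, llm_spans)
```

Preferring the pattern span on conflict is a design choice that favours precision. A variant could instead keep the longer span or reconcile conflicting labels.
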
### 4. Multi-Store Architecture

| Store | Purpose | Technology |
|-------|---------|------------|
| Vector Store | Semantic search | ChromaDB |
| Knowledge Graph | Entity relationships | TypeDB |
| External KBs | Authority linking | Wikidata, VIAF, ISIL |

### 5. XPath Provenance (AGENTS.md Rule 6)

Every claim requires XPath provenance pointing to the source HTML location. Claims without XPath are rejected as fabricated.

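
The rejection rule could be enforced with a small filter such as the one below. This is a sketch under assumptions: the `Claim` record, its field names, and `partition_claims` are hypothetical, the sample values are placeholders, and the check only tests that an XPath is present. Resolving the XPath against the archived HTML is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical claim record; field names are illustrative only."""

    claim_type: str
    value: str
    xpath: Optional[str] = None  # location of the value in the source HTML


def partition_claims(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (accepted, rejected) by presence of XPath provenance."""
    accepted: list[Claim] = []
    rejected: list[Claim] = []
    for claim in claims:
        has_xpath = bool(claim.xpath and claim.xpath.strip())
        (accepted if has_xpath else rejected).append(claim)
    return accepted, rejected


claims = [
    Claim("name", "Example Museum", xpath="/html/body/header/h1"),
    Claim("city", "Example City"),                 # no provenance: rejected
    Claim("type", "Example Type", xpath="   "),    # blank provenance: rejected
]
accepted, rejected = partition_claims(claims)
```

Keeping the rejected claims in a separate list, rather than dropping them silently, makes it possible to log or review what was filtered out.
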
## External Dependencies

### Knowledge Bases

- **Wikidata** - Primary entity linking target
- **VIAF** - Authority file for agents
- **ISIL** - Library/archive identifiers
- **GeoNames** - Place disambiguation
- **ROR** - Research organization registry

### LinkML Schemas

```
schemas/20251121/linkml/
├── modules/classes/
│   ├── CustodianObservation.yaml
│   ├── CustodianName.yaml
│   ├── CustodianReconstruction.yaml
│   ├── CustodianLegalStatus.yaml
│   ├── CustodianPlace.yaml
│   ├── CustodianCollection.yaml
│   └── ...
└── modules/enums/
    ├── CustodianPrimaryTypeEnum.yaml  # GLAMORCUBESFIXPHDNT
    └── CanonicalClaimTypes.yaml       # 3-tier claim types
```

### Convention Files

- `data/entity_annotation/ch_annotator-v1_7_0.yaml` - CH-Annotator convention (2000+ lines)

## Performance Targets

From [08-evaluation.md](08-evaluation.md):

| Task | Metric | Target |
|------|--------|--------|
| NER (exact) | F1 | ≥0.85 |
| NER (relaxed) | F1 | ≥0.90 |
| Type Classification | Macro-F1 | ≥0.80 |
| Entity Linking | Hits@1 | ≥0.75 |
| Entity Linking | MRR | ≥0.80 |
| Retrieval | NDCG@10 | ≥0.70 |
| QA Faithfulness | Score | ≥0.85 |

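
As a reminder of what the two entity-linking metrics measure, the sketch below computes Hits@k and MRR using their common textbook definitions. The function names and the placeholder candidate IDs are assumptions made for this example. The exact definitions used by the project are those in 08-evaluation.md and may differ, for instance in how queries with no correct candidate are counted.

```python
def hits_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold ID appears among the top-k candidates."""
    hits = sum(1 for candidates, g in zip(ranked, gold) if g in candidates[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold ID; a query that misses it contributes 0."""
    total = 0.0
    for candidates, g in zip(ranked, gold):
        if g in candidates:
            total += 1.0 / (candidates.index(g) + 1)
    return total / len(gold)


# Placeholder IDs, one ranked candidate list per mention.
ranked = [["A", "B"], ["E", "C"], ["G", "H"]]
gold = ["A", "C", "Z"]

hits1 = hits_at_k(ranked, gold, k=1)      # only the first mention is right at rank 1
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```

On this toy input Hits@1 is one third and MRR is 0.5, so both would fall short of the targets in the table.
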
## Next Steps for Implementation

1. **Entity Extraction Pipeline** - Implement hybrid extraction from [04-entity-extraction.md](04-entity-extraction.md)
2. **Entity Linking Pipeline** - Implement candidate generation/ranking from [05-entity-linking.md](05-entity-linking.md)
3. **Gold Standard Dataset** - Create evaluation data following schema in [08-evaluation.md](08-evaluation.md)
4. **ChromaDB Integration** - Extend `RAGSPARQLGenerator` with ChromaDB
5. **TypeDB Schema** - Translate LinkML to TypeDB for KG storage

## References

### Project Documentation

- **AGENTS.md** - AI agent instructions (Rule 6: XPath provenance, Rule 10: CH-Annotator, Rule 11: Z.AI API)
- **docs/SCHEMA_MODULES.md** - LinkML schema architecture
- **docs/PERSISTENT_IDENTIFIERS.md** - GHCID identifier system

### External Standards

- [DSPy Documentation](https://dspy-docs.vercel.app/)
- [CH-Annotator Convention](../entity_annotation/ch_annotator-v1_7_0.yaml)
- [CIDOC-CRM 7.1.3](https://cidoc-crm.org/)
- [Records in Contexts (RiC-O)](https://www.ica.org/standards/RiC/RiC-O_v0-2.html)

---

**Maintainer**: GLAM Data Extraction Project
**License**: MIT