# DSPy RAG Pipeline Design for Heritage Custodian Ontology
## Executive Summary
This document describes the design of a DSPy-based Retrieval-Augmented Generation (RAG) pipeline for the Heritage Custodian Ontology. The pipeline uses the semantic structure of the ontology's LinkML schema to support retrieval, entity extraction, and knowledge graph construction for heritage institutions worldwide.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DSPy RAG Pipeline                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │   Document   │    │   Semantic   │    │    Entity    │    │  Entity   │  │
│  │   Chunking   │───▶│   Routing    │───▶│  Extraction  │───▶│  Linking  │  │
│  │   (Schema-   │    │  (19 types)  │    │    (DSPy)    │    │ (Wikidata)│  │
│  │    Aware)    │    │              │    │              │    │           │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘  │
│         │                   │                   │                  │        │
│         ▼                   ▼                   ▼                  ▼        │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                  Vector Store (ChromaDB/Weaviate)                   │    │
│  │  • Schema-aware embeddings                                          │    │
│  │  • Ontology-mapped metadata                                         │    │
│  │  • SPARQL-queryable                                                 │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Knowledge Graph (TypeDB)                       │    │
│  │  • Custodian hub entities                                           │    │
│  │  • Reconstructed aspects (Legal, Name, Place, Collection, Platform) │    │
│  │  • Provenance (PROV-O)                                              │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Key Design Principles
### 1. Schema-Aware Chunking
Documents are chunked along ontology class boundaries, and each chunk carries the metadata listed for its class:
| Class Type | Chunking Strategy | Metadata Added |
|------------|-------------------|----------------|
| `Custodian` | Hub entity boundary | `hc_id`, `custodian_type`, `ghcid` |
| `CustodianObservation` | Evidence unit | `source_url`, `retrieved_on`, `xpath` |
| `CustodianCollection` | Collection description | `collection_name`, `temporal_extent` |
| `EncompassingBody` | Organization context | `body_type`, `member_custodians` |
| `Project` | Initiative context | `project_status`, `date_range` |
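
As a minimal sketch of how the table could be applied, the snippet below attaches the per-class metadata keys to a chunk and rejects chunks that lack them. The `Chunk` dataclass, the `REQUIRED_METADATA` mapping, and `make_chunk` are hypothetical names used for illustration, not the pipeline's actual API; only the class names and metadata keys come from the table, and the sample values are invented.

```python
from dataclasses import dataclass, field

# Metadata keys per ontology class, copied from the table above.
REQUIRED_METADATA = {
    "Custodian": ("hc_id", "custodian_type", "ghcid"),
    "CustodianObservation": ("source_url", "retrieved_on", "xpath"),
    "CustodianCollection": ("collection_name", "temporal_extent"),
    "EncompassingBody": ("body_type", "member_custodians"),
    "Project": ("project_status", "date_range"),
}


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus ontology-mapped metadata."""
    class_type: str
    text: str
    metadata: dict = field(default_factory=dict)


def make_chunk(class_type: str, text: str, **metadata) -> Chunk:
    """Build a chunk, failing early if required metadata is missing."""
    if class_type not in REQUIRED_METADATA:
        raise ValueError(f"unknown class type: {class_type}")
    missing = [k for k in REQUIRED_METADATA[class_type] if k not in metadata]
    if missing:
        raise ValueError(f"{class_type} chunk is missing metadata: {missing}")
    return Chunk(class_type, text, dict(metadata))


# Illustrative values only; they are not taken from real data.
chunk = make_chunk(
    "CustodianCollection",
    "The print room holds drawings and prints.",
    collection_name="Print Room",
    temporal_extent="1500/1900",
)
print(chunk.metadata["collection_name"])
```

Failing at chunk creation keeps incomplete metadata out of the vector store, where it would otherwise surface later as an unfilterable record.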
### 2. Semantic Routing (GLAMORCUBESFIXPHDNT)
Queries are routed to one of the 19 custodian types in the GLAMORCUBESFIXPHDNT taxonomy, where each letter of the acronym is a one-letter type code:
```python
CUSTODIAN_ROUTES = {
    "G": ("GALLERY", ["art", "exhibition", "kunsthalle", "visual arts"]),
    "L": ("LIBRARY", ["library", "bibliothek", "biblioteca", "books"]),
    "A": ("ARCHIVE", ["archive", "archief", "archivo", "records", "documents"]),
    "M": ("MUSEUM", ["museum", "museu", "museo", "collection"]),
    "O": ("OFFICIAL_INSTITUTION", ["government", "agency", "platform"]),
    "R": ("RESEARCH_CENTER", ["research", "institute", "documentation"]),
    "C": ("COMMERCIAL", ["corporate", "company", "brand"]),
    "U": ("UNSPECIFIED", []),  # Data quality flag
    "B": ("BIO_CUSTODIAN", ["botanical", "zoo", "aquarium", "herbarium"]),
    "E": ("EDUCATION_PROVIDER", ["university", "school", "training"]),
    "S": ("HERITAGE_SOCIETY", ["society", "vereniging", "club"]),
    "F": ("FEATURE_CUSTODIAN", ["monument", "mansion", "palace"]),
    "I": ("INTANGIBLE_HERITAGE_GROUP", ["performance", "folklore", "oral"]),
    "X": ("MIXED", []),  # Multiple types
    "P": ("PERSONAL_COLLECTION", ["collector", "family", "private"]),
    "H": ("HOLY_SACRED_SITE", ["church", "temple", "mosque", "monastery"]),
    "D": ("DIGITAL_PLATFORM", ["online", "digital", "virtual"]),
    "N": ("NON_PROFIT", ["ngo", "foundation", "charity"]),
    "T": ("TASTE_SCENT_HERITAGE", ["culinary", "perfume", "distillery"]),
}
```
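
One way to read the route table is as a keyword matcher: score each type by how many of its keywords occur in the query, and fall back to `U` (UNSPECIFIED) when nothing matches. The sketch below assumes this simple substring scoring, which the design does not specify. `route_query` is a hypothetical helper, and only a few routes are repeated so the example runs on its own.

```python
# Subset of CUSTODIAN_ROUTES, repeated so the example is self-contained.
ROUTES = {
    "L": ("LIBRARY", ["library", "bibliothek", "biblioteca", "books"]),
    "A": ("ARCHIVE", ["archive", "archief", "archivo", "records", "documents"]),
    "M": ("MUSEUM", ["museum", "museu", "museo", "collection"]),
    "U": ("UNSPECIFIED", []),
}


def route_query(query: str, routes: dict = ROUTES) -> str:
    """Return the type code whose keywords occur most often in the query."""
    text = query.lower()
    best_code, best_score = "U", 0
    for code, (_name, keywords) in routes.items():
        # Count keywords that appear as substrings of the query.
        score = sum(1 for kw in keywords if kw in text)
        if score > best_score:  # on a tie, the earlier route wins
            best_code, best_score = code, score
    return best_code


print(route_query("Which archives hold parish records?"))  # A
print(route_query("Opening hours"))                        # U
```

Substring matching is deliberately crude: `museu` also matches inside `museum`, so some queries are counted twice. A production router would more plausibly combine these keywords with embeddings or an LLM-based classifier.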
### 3. Ontology-Grounded Entity Extraction
All extracted entities are mapped to ontology classes:
| Entity Type | LinkML Class | Primary Ontology | Wikidata Mapping |
|-------------|--------------|------------------|------------------|
| Institution | `Custodian` | `crm:E39_Actor` | Instance Q-number |
| Legal Form | `CustodianLegalStatus` | `org:FormalOrganization` | ISO 20275 ELF |
| Place | `CustodianPlace` | `crm:E53_Place` | GeoNames, Wikidata |
| Collection | `CustodianCollection` | `crm:E78_Curated_Holding` | - |
| Platform | `DigitalPlatform` | `schema:WebSite` | - |
| Identifier | `Identifier` | `dct:identifier` | ISIL, VIAF, ISNI |
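
The mapping can be held as a lookup table so that every extracted entity is stamped with its LinkML class and ontology term. This is a sketch under the assumption that extraction yields `(entity_type, value)` pairs. `OntologyTarget`, `ONTOLOGY_MAP`, and `ground_entity` are illustrative names, not part of the schema or of DSPy.

```python
from typing import NamedTuple


class OntologyTarget(NamedTuple):
    linkml_class: str
    ontology_term: str


# Entity type -> (LinkML class, primary ontology term), from the table above.
ONTOLOGY_MAP = {
    "Institution": OntologyTarget("Custodian", "crm:E39_Actor"),
    "Legal Form": OntologyTarget("CustodianLegalStatus", "org:FormalOrganization"),
    "Place": OntologyTarget("CustodianPlace", "crm:E53_Place"),
    "Collection": OntologyTarget("CustodianCollection", "crm:E78_Curated_Holding"),
    "Platform": OntologyTarget("DigitalPlatform", "schema:WebSite"),
    "Identifier": OntologyTarget("Identifier", "dct:identifier"),
}


def ground_entity(entity_type: str, value: str) -> dict:
    """Attach the ontology class to an extracted value; reject unmapped types."""
    try:
        target = ONTOLOGY_MAP[entity_type]
    except KeyError:
        raise ValueError(f"no ontology mapping for entity type {entity_type!r}") from None
    return {
        "value": value,
        "linkml_class": target.linkml_class,
        "ontology_term": target.ontology_term,
    }


print(ground_entity("Place", "Amsterdam"))
```

Rejecting unmapped types, rather than passing them through, keeps ungrounded entities out of the knowledge graph.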
### 4. Provenance-First Design
Every extracted claim must carry provenance that records where, when, and by which agent it was extracted:
```yaml
claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                    # Ontology prefix
    path: /html/body/h1[1]             # XPath to source
    timestamp: "2025-12-06T10:00:00Z"  # Extraction time
    agent: claude-opus-4.5             # Extraction model
    confidence_score: 0.95             # Confidence
    tier: 3                            # TIER_3 = NLP extraction
```
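
The rule can be enforced with a small check that runs before a claim is stored. The sketch below assumes the claim has already been parsed into a dictionary with `provenance` nested under the claim, and that `confidence_score` lies between 0 and 1; neither assumption is stated by the design. `validate_claim` and `REQUIRED_PROVENANCE` are hypothetical names.

```python
# Provenance keys taken from the YAML example above.
REQUIRED_PROVENANCE = (
    "namespace", "path", "timestamp", "agent", "confidence_score", "tier",
)


def validate_claim(claim: dict) -> list:
    """Return a list of problems; an empty list means the claim is acceptable."""
    provenance = claim.get("provenance")
    if not isinstance(provenance, dict):
        return ["claim has no provenance block"]
    problems = [
        f"provenance is missing '{key}'"
        for key in REQUIRED_PROVENANCE
        if key not in provenance
    ]
    score = provenance.get("confidence_score")
    # Assumed range check: the document does not define the score's bounds.
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append("confidence_score must be between 0 and 1")
    return problems


claim = {
    "claim_type": "full_name",
    "claim_value": "Rijksmuseum Amsterdam",
    "provenance": {
        "namespace": "skos",
        "path": "/html/body/h1[1]",
        "timestamp": "2025-12-06T10:00:00Z",
        "agent": "claude-opus-4.5",
        "confidence_score": 0.95,
        "tier": 3,
    },
}
print(validate_claim(claim))  # []
```

Returning a list of problems instead of raising on the first one lets a batch job report every defect in a claim at once.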
## Document Structure
1. [**Architecture**](./01-architecture.md) - System components and data flow
2. [**DSPy Signatures**](./02-dspy-signatures.md) - Module definitions
3. [**Chunking Strategy**](./03-chunking-strategy.md) - Schema-aware document processing
4. [**Entity Extraction**](./04-entity-extraction.md) - NER patterns for heritage domain
5. [**Entity Linking**](./05-entity-linking.md) - Wikidata/VIAF/ISIL resolution
6. [**Retrieval Patterns**](./06-retrieval-patterns.md) - Hybrid search strategies
7. [**SPARQL Templates**](./07-sparql-templates.md) - Query patterns
8. [**Evaluation**](./08-evaluation.md) - Metrics and benchmarks
## Quick Start
```python
from dspy_heritage import HeritageRAG

# Initialize pipeline
rag = HeritageRAG(
    schema_path="schemas/20251121/linkml/",
    vector_store="chromadb",
    kg_backend="typedb",
)

# Extract entities from text
entities = rag.extract(
    text="The Rijksmuseum in Amsterdam holds over 1 million objects...",
    expected_types=["MUSEUM", "COLLECTION"],
)

# Query knowledge graph
results = rag.query(
    "Which Dutch museums have digitized collections on Europeana?"
)
```
## Dependencies
- **DSPy** >= 2.4.0 - LLM orchestration framework
- **ChromaDB** / **Weaviate** - Vector storage
- **TypeDB** 2.x - Knowledge graph backend
- **LinkML** >= 1.6.0 - Schema validation
- **sentence-transformers** - Embeddings
- **langchain** (optional) - Document loaders
## Related Documentation
- [Heritage Custodian Ontology](../../schemas/20251121/linkml/)
- [AGENTS.md](../../AGENTS.md) - Agent extraction rules
- [PERSISTENT_IDENTIFIERS.md](../PERSISTENT_IDENTIFIERS.md) - GHCID format
- [CH-Annotator Convention](../../data/entity_annotation/ch_annotator-v1_7_0.yaml)