DSPy RAG Pipeline Design for Heritage Custodian Ontology

Executive Summary

This document outlines the design of a DSPy-based Retrieval-Augmented Generation (RAG) pipeline for the Heritage Custodian Ontology. The pipeline uses the class structure of the ontology's LinkML schema to drive retrieval, entity extraction, and knowledge graph construction for heritage institutions worldwide.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          DSPy RAG Pipeline                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐ │
│  │   Document   │    │   Semantic   │    │   Entity     │    │  Entity   │ │
│  │   Chunking   │───▶│   Routing    │───▶│  Extraction  │───▶│  Linking  │ │
│  │   (Schema-   │    │  (19 types)  │    │   (DSPy)     │    │ (Wikidata)│ │
│  │   Aware)     │    │              │    │              │    │           │ │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘ │
│         │                   │                   │                   │       │
│         ▼                   ▼                   ▼                   ▼       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                     Vector Store (ChromaDB/Weaviate)                 │   │
│  │   • Schema-aware embeddings                                          │   │
│  │   • Ontology-mapped metadata                                         │   │
│  │   • SPARQL-queryable                                                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                      Knowledge Graph (TypeDB)                        │   │
│  │   • Custodian hub entities                                           │   │
│  │   • Reconstructed aspects (Legal, Name, Place, Collection, Platform) │   │
│  │   • Provenance (PROV-O)                                              │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Design Principles

1. Schema-Aware Chunking

Documents are chunked according to the ontology class structure:

Class Type	Chunking Strategy	Metadata Added
`Custodian`	Hub entity boundary	`hc_id`, `custodian_type`, `ghcid`
`CustodianObservation`	Evidence unit	`source_url`, `retrieved_on`, `xpath`
`CustodianCollection`	Collection description	`collection_name`, `temporal_extent`
`EncompassingBody`	Organization context	`body_type`, `member_custodians`
`Project`	Initiative context	`project_status`, `date_range`
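
As a rough illustration of the table above, the sketch below attaches only the metadata keys listed for a given class to each chunk. The `Chunk` dataclass, the `make_chunk` helper, and the `record` argument are hypothetical names introduced here, not part of the pipeline's API; the key names come from the table.

```python
from dataclasses import dataclass, field

# Metadata keys per ontology class, copied from the table above.
CHUNK_METADATA_KEYS = {
    "Custodian": ("hc_id", "custodian_type", "ghcid"),
    "CustodianObservation": ("source_url", "retrieved_on", "xpath"),
    "CustodianCollection": ("collection_name", "temporal_extent"),
    "EncompassingBody": ("body_type", "member_custodians"),
    "Project": ("project_status", "date_range"),
}


@dataclass
class Chunk:
    """Hypothetical container for one schema-aware chunk."""
    class_type: str
    text: str
    metadata: dict = field(default_factory=dict)


def make_chunk(class_type, text, record):
    """Build a chunk whose metadata holds only the keys defined for its class."""
    if class_type not in CHUNK_METADATA_KEYS:
        raise ValueError(f"Unknown ontology class: {class_type}")
    keys = CHUNK_METADATA_KEYS[class_type]
    # Keys absent from the record are skipped rather than filled with defaults.
    metadata = {key: record[key] for key in keys if key in record}
    return Chunk(class_type, text, metadata)
```

Rejecting unknown class names early keeps chunks that do not map to the schema out of the vector store.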

2. Semantic Routing (GLAMORCUBESFIXPHDNT)

Queries are routed to one of 19 custodian types. The GLAMORCUBESFIXPHDNT mnemonic concatenates the one-letter type codes used as keys below:

CUSTODIAN_ROUTES = {
    "G": ("GALLERY", ["art", "exhibition", "kunsthalle", "visual arts"]),
    "L": ("LIBRARY", ["library", "bibliothek", "biblioteca", "books"]),
    "A": ("ARCHIVE", ["archive", "archief", "archivo", "records", "documents"]),
    "M": ("MUSEUM", ["museum", "museu", "museo", "collection"]),
    "O": ("OFFICIAL_INSTITUTION", ["government", "agency", "platform"]),
    "R": ("RESEARCH_CENTER", ["research", "institute", "documentation"]),
    "C": ("COMMERCIAL", ["corporate", "company", "brand"]),
    "U": ("UNSPECIFIED", []),  # Data quality flag
    "B": ("BIO_CUSTODIAN", ["botanical", "zoo", "aquarium", "herbarium"]),
    "E": ("EDUCATION_PROVIDER", ["university", "school", "training"]),
    "S": ("HERITAGE_SOCIETY", ["society", "vereniging", "club"]),
    "F": ("FEATURE_CUSTODIAN", ["monument", "mansion", "palace"]),
    "I": ("INTANGIBLE_HERITAGE_GROUP", ["performance", "folklore", "oral"]),
    "X": ("MIXED", []),  # Multiple types
    "P": ("PERSONAL_COLLECTION", ["collector", "family", "private"]),
    "H": ("HOLY_SACRED_SITE", ["church", "temple", "mosque", "monastery"]),
    "D": ("DIGITAL_PLATFORM", ["online", "digital", "virtual"]),
    "N": ("NON_PROFIT", ["ngo", "foundation", "charity"]),
    "T": ("TASTE_SCENT_HERITAGE", ["culinary", "perfume", "distillery"]),
}
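
One simple way to use such a table is keyword counting. The sketch below is an assumption about how routing could work, not the pipeline's actual router: `route_query` and `ROUTES_EXCERPT` are hypothetical names, and the excerpt repeats a few entries of the table so the example runs on its own. It matches whole lowercase tokens only, so plurals and multi-word keywords such as "visual arts" would need extra handling.

```python
import re

# Excerpt of the routing table above, repeated so the sketch is self-contained.
ROUTES_EXCERPT = {
    "L": ("LIBRARY", ["library", "bibliothek", "biblioteca", "books"]),
    "A": ("ARCHIVE", ["archive", "archief", "archivo", "records", "documents"]),
    "M": ("MUSEUM", ["museum", "museu", "museo", "collection"]),
    "U": ("UNSPECIFIED", []),
}


def route_query(query, routes=ROUTES_EXCERPT, fallback="U"):
    """Return type codes ranked by keyword hits, or [fallback] if none match."""
    tokens = set(re.findall(r"\w+", query.lower()))
    scores = {}
    for code, (_name, keywords) in routes.items():
        hits = sum(1 for keyword in keywords if keyword in tokens)
        if hits:
            scores[code] = hits
    if not scores:
        return [fallback]
    # Highest score first; ties are broken alphabetically by type code.
    return sorted(scores, key=lambda code: (-scores[code], code))
```

Returning a ranked list rather than a single code leaves room for queries that span several custodian types.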

3. Ontology-Grounded Entity Extraction

All extracted entities are mapped to ontology classes:

Entity Type	LinkML Class	Primary Ontology	Wikidata Mapping
Institution	`Custodian`	`crm:E39_Actor`	Instance Q-number
Legal Form	`CustodianLegalStatus`	`org:FormalOrganization`	ISO 20275 ELF
Place	`CustodianPlace`	`crm:E53_Place`	GeoNames, Wikidata
Collection	`CustodianCollection`	`crm:E78_Curated_Holding`	-
Platform	`DigitalPlatform`	`schema:WebSite`	-
Identifier	`Identifier`	`dct:identifier`	ISIL, VIAF, ISNI
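
The mapping above can be held as a plain lookup so that every extracted entity is tagged with its class and ontology term. This is a minimal sketch under that assumption; `ENTITY_CLASS_MAP` and `ground_entity` are hypothetical names, and the values are copied from the table.

```python
# Entity type -> (LinkML class, primary ontology term), copied from the table above.
ENTITY_CLASS_MAP = {
    "Institution": ("Custodian", "crm:E39_Actor"),
    "Legal Form": ("CustodianLegalStatus", "org:FormalOrganization"),
    "Place": ("CustodianPlace", "crm:E53_Place"),
    "Collection": ("CustodianCollection", "crm:E78_Curated_Holding"),
    "Platform": ("DigitalPlatform", "schema:WebSite"),
    "Identifier": ("Identifier", "dct:identifier"),
}


def ground_entity(entity_type, surface_form):
    """Attach the LinkML class and ontology term to an extracted mention."""
    if entity_type not in ENTITY_CLASS_MAP:
        # Entities without an ontology class are rejected instead of passed through.
        raise KeyError(f"No ontology class for entity type: {entity_type}")
    linkml_class, ontology_term = ENTITY_CLASS_MAP[entity_type]
    return {
        "text": surface_form,
        "linkml_class": linkml_class,
        "ontology_term": ontology_term,
    }
```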

4. Provenance-First Design

Every extracted claim must have provenance:

claim:
  claim_type: full_name
  claim_value: "Rijksmuseum Amsterdam"
  provenance:
    namespace: skos                        # Ontology prefix
    path: /html/body/h1[1]                 # XPath to source
    timestamp: "2025-12-06T10:00:00Z"      # Extraction time
    agent: claude-opus-4.5                 # Extraction model
    confidence_score: 0.95                 # Confidence
    tier: 3                                # TIER_3 = NLP extraction
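
A provenance requirement like this can be enforced with a small check before a claim is stored. The sketch below assumes the claim has already been parsed into a dictionary shaped like the example above. `validate_claim` is a hypothetical helper, and the rules that `confidence_score` lies in [0, 1] and that `timestamp` is ISO 8601 are assumptions inferred from the example values.

```python
from datetime import datetime

# Provenance fields shown in the example claim above.
REQUIRED_PROVENANCE_FIELDS = (
    "namespace", "path", "timestamp", "agent", "confidence_score", "tier",
)


def validate_claim(claim):
    """Return a list of provenance problems; an empty list means the claim passes."""
    provenance = claim.get("provenance")
    if not isinstance(provenance, dict):
        return ["missing provenance block"]
    problems = [
        f"missing provenance field: {name}"
        for name in REQUIRED_PROVENANCE_FIELDS
        if name not in provenance
    ]
    score = provenance.get("confidence_score")
    # Assumption: confidence is a probability-like value in [0, 1].
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append("confidence_score outside [0, 1]")
    timestamp = provenance.get("timestamp")
    if timestamp is not None:
        try:
            # fromisoformat in older Python versions does not accept a trailing "Z".
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems
```

Collecting all problems in one pass, rather than raising on the first, makes it easier to report every gap in a claim at once.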

Document Structure

Architecture - System components and data flow
DSPy Signatures - Module definitions
Chunking Strategy - Schema-aware document processing
Entity Extraction - NER patterns for heritage domain
Entity Linking - Wikidata/VIAF/ISIL resolution
Retrieval Patterns - Hybrid search strategies
SPARQL Templates - Query patterns
Evaluation - Metrics and benchmarks

Quick Start

from dspy_heritage import HeritageRAG

# Initialize pipeline
rag = HeritageRAG(
    schema_path="schemas/20251121/linkml/",
    vector_store="chromadb",
    kg_backend="typedb"
)

# Extract entities from text
entities = rag.extract(
    text="The Rijksmuseum in Amsterdam holds over 1 million objects...",
    expected_types=["MUSEUM", "COLLECTION"]
)

# Query knowledge graph
results = rag.query(
    "Which Dutch museums have digitized collections on Europeana?"
)

Dependencies

DSPy >= 2.4.0 - LLM orchestration framework
ChromaDB / Weaviate - Vector storage
TypeDB 2.x - Knowledge graph backend
LinkML >= 1.6.0 - Schema validation
sentence-transformers - Embeddings
langchain (optional) - Document loaders

Related Documentation

Heritage Custodian Ontology
AGENTS.md - Agent extraction rules
PERSISTENT_IDENTIFIERS.md - GHCID format
CH-Annotator Convention
