DSPy RAG Architecture for Heritage Custodian Ontology
System Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│ INPUT LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Conversation│ │ Website │ │ CSV │ │ Wikidata │ │
│ │ JSON │ │ Archive │ │ Registry │ │ SPARQL │ │
│ │ (139 files)│ │ (HTML) │ │ (ISIL, NDE) │ │ Queries │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Document Loader (Docling/LangChain) │ │
│ │ │ │
│ │ • HTML → Markdown (Docling) │ │
│ │ • JSON → Structured Messages │ │
│ │ • CSV → DataFrames with Schema Mapping │ │
│ │ • SPARQL Results → JSON-LD │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────┬──────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Schema-Aware Chunker │ │
│ │ │ │
│ │ Input: Raw documents │ │
│ │ Output: Chunks with ontology metadata │ │
│ │ │ │
│ │ Strategies: │ │
│ │ • Entity-boundary chunking (Custodian mentions) │ │
│ │ • Semantic section detection (Collection, Platform, Location) │ │
│ │ • Observation unit extraction (CustodianObservation) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DSPy Semantic Router │ │
│ │ │ │
│ │ Input: Chunk text + metadata │ │
│ │ Output: Custodian type classification (GLAMORCUBESFIXPHDNT) │ │
│ │ │ │
│ │ Module: CustodianTypeClassifier │ │
│ │ • 19 classes with Wikidata meanings │ │
│ │ • Multi-label support for MIXED type │ │
│ │ • Confidence scoring per type │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DSPy Entity Extractor │ │
│ │ │ │
│ │ Input: Classified chunks │ │
│ │ Output: Structured entities (LinkML-compliant) │ │
│ │ │ │
│ │ Modules: │ │
│ │ • CustodianExtractor (names, types, descriptions) │ │
│ │ • IdentifierExtractor (ISIL, Wikidata, VIAF, KvK) │ │
│ │ • LocationExtractor (addresses, GeoNames linking) │ │
│ │ • CollectionExtractor (holdings, temporal extent) │ │
│ │ • RelationshipExtractor (encompassing bodies, projects) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Entity Linker (Wikidata/VIAF/ISIL) │ │
│ │ │ │
│ │ Input: Extracted entities │ │
│ │ Output: Linked entities with external identifiers │ │
│ │ │ │
│ │ Services: │ │
│ │ • Wikidata MCP tool (wikidata-authenticated_*) │ │
│ │ • VIAF API │ │
│ │ • ISIL registry lookup │ │
│ │ • GeoNames for location normalization │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────┬──────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Vector Store (ChromaDB) │ │ Knowledge Graph (TypeDB) │ │
│ │ │ │ │ │
│ │ Collections: │ │ Entity Types: │ │
│ │ • custodian_chunks │ │ • custodian (hub) │ │
│ │ • observation_chunks │ │ • custodian-legal-status │ │
│ │ • collection_chunks │ │ • custodian-name │ │
│ │ • project_chunks │ │ • custodian-place │ │
│ │ │ │ • custodian-collection │ │
│ │ Metadata: │ │ • digital-platform │ │
│ │ • custodian_type │ │ • encompassing-body │ │
│ │ • country_code │ │ • project │ │
│ │ • ghcid │ │ │ │
│ │ • source_tier │ │ Relationships: │ │
│ │ • confidence_score │ │ • has-legal-status │ │
│ │ │ │ • has-place │ │
│ │ Embedding Model: │ │ • manages-collection │ │
│ │ • BGE-M3 (multilingual) │ │ • has-platform │ │
│ │ • domain-finetuned variant │ │ • member-of │ │
│ │ for heritage terminology │ │ • participated-in-project │ │
│ └─────────────────────────────────┘ └─────────────────────────────────────┘ │
│ │
└──────────────────────────────────┬──────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ QUERY LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Hybrid Retriever │ │
│ │ │ │
│ │ Input: User query │ │
│ │ Output: Relevant chunks + KG entities │ │
│ │ │ │
│ │ Strategies: │ │
│ │ 1. Vector similarity (semantic match) │ │
│ │ 2. TypeDB traversal (relationship-based) │ │
│ │ 3. SPARQL federation (Wikidata enrichment) │ │
│ │ 4. Filtered retrieval (by type, country, tier) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DSPy Response Generator │ │
│ │ │ │
│ │ Input: Retrieved context + user query │ │
│ │ Output: Grounded response with citations │ │
│ │ │ │
│ │ Features: │ │
│ │ • Citation to source chunks │ │
│ │ • Provenance tracking (PROV-O compatible) │ │
│ │ • Confidence indication │ │
│ │ • Multi-language support (NL, EN, DE, FR, ES, PT) │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Data Flow
1. Ingestion Pipeline
Source Document
│
▼
┌──────────────────┐
│ Load Document │ • Docling for HTML
│ │ • JSON parser for conversations
│ │ • pandas for CSV registries
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Schema-Aware │ • Detect custodian boundaries
│ Chunking │ • Identify observation units
│ │ • Preserve ontology structure
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Type │ • GLAMORCUBESFIXPHDNT classification
│ Classification │ • Multi-label for MIXED
│ │ • Confidence scoring
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Entity │ • Extract Custodian attributes
│ Extraction │ • Extract identifiers
│ │ • Extract relationships
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Entity │ • Wikidata Q-number lookup
│ Linking │ • VIAF/ISIL resolution
│ │ • GeoNames normalization
└────────┬─────────┘
│
├────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Vector Store │ │ Knowledge Graph │
│ (embeddings) │ │ (entities) │
└──────────────────┘ └──────────────────┘
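The type classification step (multi-label for MIXED, with confidence scoring) could be sketched as follows. This is a minimal illustration, not the project's classifier: the function names, the 0.5 threshold, and the type names other than MUSEUM are assumptions.

```python
from typing import Dict, List


def select_types(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every type whose confidence meets the threshold, best first."""
    picked = sorted(
        (t for t, s in scores.items() if s >= threshold),
        key=lambda t: scores[t],
        reverse=True,
    )
    # Fall back to the single best-scoring type so a chunk is never unlabeled.
    if not picked and scores:
        picked = [max(scores, key=scores.get)]
    return picked


def is_mixed(labels: List[str]) -> bool:
    """A chunk counts as MIXED when more than one type passes the threshold."""
    return len(labels) > 1


# Hypothetical per-type confidences for one chunk
scores = {"MUSEUM": 0.91, "ARCHIVE": 0.62, "LIBRARY": 0.08}
labels = select_types(scores)
print(labels, is_mixed(labels))
```

Keeping the per-type scores alongside the selected labels lets later stages store a confidence_score in the chunk metadata.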
2. Query Pipeline
User Query
│
▼
┌──────────────────┐
│ Query │ • Intent classification
│ Understanding │ • Entity mention detection
│ │ • Type filtering extraction
└────────┬─────────┘
│
├────────────────────┬────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Vector │ │ Graph │ │ SPARQL │
│ Retrieval │ │ Traversal │ │ Federation │
│ (semantic) │ │ (relationships) │ │ (Wikidata) │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└─────────────────────┴─────────────────────┘
│
▼
┌──────────────────┐
│ Context │ • Rank by relevance
│ Aggregation │ • Deduplicate
│ │ • Filter by tier
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Response │ • Generate answer
│ Generation │ • Add citations
│ │ • Track provenance
└────────┬─────────┘
│
▼
Final Response
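The context aggregation step (rank, deduplicate, filter by tier) might look like the sketch below. The `Hit` structure is hypothetical. The code also assumes that scores from the three retrievers are comparable and that a lower tier number means a more authoritative source.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float      # relevance, higher is better (assumed comparable across retrievers)
    source_tier: int  # assumed: lower number = more authoritative
    origin: str       # "vector", "graph", or "sparql"


def aggregate(hits: Iterable[Hit], max_tier: int = 2, top_k: int = 5) -> List[Hit]:
    """Filter by tier, keep the best hit per chunk, and rank by score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.source_tier > max_tier:
            continue  # drop sources below the requested tier
        current = best.get(hit.chunk_id)
        if current is None or hit.score > current.score:
            best[hit.chunk_id] = hit  # deduplicate on chunk_id
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


hits = [
    Hit("a", 0.8, 1, "vector"),
    Hit("a", 0.9, 2, "graph"),
    Hit("b", 0.7, 3, "sparql"),
    Hit("c", 0.5, 1, "vector"),
]
ranked = aggregate(hits)
print([h.chunk_id for h in ranked])
```

If retriever scores are not on a common scale, a rank-based fusion would be a safer choice than comparing raw scores.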
Component Details
Schema-Aware Chunker
The chunker understands the LinkML class hierarchy and chunks documents accordingly:
from typing import List

# Document, Chunk, and load_linkml_schema are project-level names defined elsewhere;
# the private helper methods called below are omitted from this excerpt.


class SchemaAwareChunker:
    """Chunks documents respecting ontology boundaries."""

    def __init__(self, schema_path: str):
        self.schema = load_linkml_schema(schema_path)
        self.entity_patterns = self._build_entity_patterns()

    def chunk(self, document: Document) -> List[Chunk]:
        # 1. Identify entity boundaries
        boundaries = self._detect_custodian_boundaries(document)

        # 2. Split at boundaries
        raw_chunks = self._split_at_boundaries(document, boundaries)

        # 3. Enrich with ontology metadata
        enriched_chunks = []
        for chunk in raw_chunks:
            chunk.metadata.update({
                "primary_class": self._classify_chunk(chunk),
                "mentioned_classes": self._extract_class_mentions(chunk),
                "relationships": self._extract_relationships(chunk),
            })
            enriched_chunks.append(chunk)
        return enriched_chunks
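The splitting step is not shown in the chunker above. A minimal sketch of what it might do, assuming boundaries are character offsets into plain text (the real method works on Document objects and may differ):

```python
from typing import List


def split_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Split text at the given character offsets, dropping empty pieces."""
    # Keep only offsets strictly inside the text, sorted and de-duplicated
    cuts = sorted({b for b in boundaries if 0 < b < len(text)})
    edges = [0, *cuts, len(text)]
    pieces = (text[start:end].strip() for start, end in zip(edges, edges[1:]))
    return [p for p in pieces if p]


# Made-up sample text with one custodian mention per sentence
text = "The Rijksmuseum is in Amsterdam. The Van Gogh Museum is nearby."
boundary = text.index("The Van Gogh")
chunks = split_at_boundaries(text, [boundary])
print(chunks)
```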
DSPy Modules
See 02-dspy-signatures.md for detailed module definitions.
Vector Store Schema
ChromaDB collections are structured to support ontology-aware retrieval:
# Collection: custodian_chunks
{
    "id": "chunk-uuid",
    "embedding": [0.1, 0.2, ...],  # BGE-M3 embedding
    "document": "The Rijksmuseum in Amsterdam...",
    "metadata": {
        # Ontology metadata
        "primary_class": "Custodian",
        "custodian_type": "MUSEUM",
        "custodian_type_wikidata": "Q33506",

        # Identifier metadata
        "ghcid": "NL-NH-AMS-M-RM",
        "wikidata_id": "Q190804",
        "isil_code": "NL-AmRM",

        # Geographic metadata
        "country_code": "NL",
        "region_code": "NH",
        "settlement_code": "AMS",

        # Provenance metadata
        "source_tier": 2,  # TIER_2_VERIFIED
        "source_url": "https://www.rijksmuseum.nl/",
        "extraction_date": "2025-12-12",
        "confidence_score": 0.95,
    }
}
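Filtered retrieval by type, country, and tier can be expressed as a ChromaDB `where` filter over this metadata. The helper below is a hypothetical sketch that only builds the filter dictionary. It uses ChromaDB's `$eq`, `$lte`, and `$and` operators and assumes that a lower source_tier value is more authoritative.

```python
from typing import Any, Dict, List, Optional


def build_where(
    custodian_type: Optional[str] = None,
    country_code: Optional[str] = None,
    max_tier: Optional[int] = None,
) -> Optional[Dict[str, Any]]:
    """Build a ChromaDB metadata filter from the optional criteria."""
    clauses: List[Dict[str, Any]] = []
    if custodian_type is not None:
        clauses.append({"custodian_type": {"$eq": custodian_type}})
    if country_code is not None:
        clauses.append({"country_code": {"$eq": country_code}})
    if max_tier is not None:
        clauses.append({"source_tier": {"$lte": max_tier}})
    if not clauses:
        return None  # no filtering requested
    if len(clauses) == 1:
        return clauses[0]  # $and needs at least two expressions
    return {"$and": clauses}


where = build_where(custodian_type="MUSEUM", country_code="NL", max_tier=2)
print(where)
```

The resulting dictionary would typically be passed as the `where` argument of a collection query, alongside the query embedding.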
TypeDB Schema
The TypeDB schema mirrors the LinkML structure:
define

# Hub entity
custodian sub entity,
    owns hc-id @key,
    owns ghcid,
    owns custodian-type,
    plays has-legal-status:custodian,
    plays has-place:custodian,
    plays manages-collection:custodian,
    plays has-platform:custodian,
    plays member-of:member;

# Reconstructed aspects
custodian-legal-status sub entity,
    owns legal-form,
    owns registration-number,
    plays has-legal-status:status;

custodian-place sub entity,
    owns preferred-label,
    owns geonames-id,
    plays has-place:place;

# Relationships
has-legal-status sub relation,
    relates custodian,
    relates status;

has-place sub relation,
    relates custodian,
    relates place;

member-of sub relation,
    relates member,
    relates body;
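A relationship-based lookup against this schema could be sketched as below. The helper name is hypothetical, and the query text assumes TypeDB 2.x TypeQL (a `match` clause followed by `get`), which fits the `sub entity` syntax used in the schema. Sending the query through a TypeDB driver is left out.

```python
def places_for_ghcid_query(ghcid: str) -> str:
    """Build a TypeQL query returning place labels for one custodian."""
    # Refuse characters that would break out of the quoted string literal
    if '"' in ghcid or "\\" in ghcid:
        raise ValueError("ghcid must not contain quotes or backslashes")
    return (
        "match\n"
        f'  $c isa custodian, has ghcid "{ghcid}";\n'
        "  (custodian: $c, place: $p) isa has-place;\n"
        "  $p has preferred-label $label;\n"
        "get $label;"
    )


query = places_for_ghcid_query("NL-NH-AMS-M-RM")
print(query)
```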
Integration Points
Wikidata MCP Tools
The pipeline integrates with Wikidata via MCP tools:
# Search for institution
entity_id = wikidata_search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804
# Get properties
properties = wikidata_get_properties(entity_id)
# Returns: ["P31", "P17", "P131", "P791", ...]
# Get identifiers
identifiers = wikidata_get_identifiers(entity_id)
# Returns: {"ISIL": "NL-AmRM", "VIAF": "148691498", ...}
GeoNames Integration
Location normalization uses a local GeoNames SQLite database:
# Reverse geocode coordinates
settlement = geonames_reverse_geocode(
    lat=52.3600,
    lon=4.8852,
    country_code="NL",
)
# Returns: {"name": "Amsterdam", "admin1": "NH", "geonames_id": 2759794}
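One way such a lookup could work is a nearest-neighbour search over a places table. The sketch below uses an in-memory SQLite database. The table layout and the sample rows are illustrative assumptions, not the project's actual GeoNames schema; only Amsterdam's ID is taken from the example above.

```python
import math
import sqlite3


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def reverse_geocode(conn: sqlite3.Connection, lat: float, lon: float, country_code: str):
    """Return the nearest place in the given country, or None if there is none."""
    rows = conn.execute(
        "SELECT geonames_id, name, admin1, lat, lon FROM places WHERE country_code = ?",
        (country_code,),
    ).fetchall()
    if not rows:
        return None
    gid, name, admin1, _, _ = min(rows, key=lambda r: haversine_km(lat, lon, r[3], r[4]))
    return {"name": name, "admin1": admin1, "geonames_id": gid}


# Hypothetical table layout with two illustrative rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE places (geonames_id INTEGER, name TEXT, admin1 TEXT,"
    " country_code TEXT, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2759794, "Amsterdam", "NH", "NL", 52.374, 4.890),
        (2747891, "Rotterdam", "ZH", "NL", 51.9225, 4.47917),
    ],
)
result = reverse_geocode(conn, 52.3600, 4.8852, "NL")
print(result)
```

This version scans every row for the country. A larger database would likely need a bounding-box prefilter on indexed lat/lon columns before computing distances.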
CH-Annotator Convention
Entity extraction follows CH-Annotator v1.7.0:
entity:
  type: GRP.HER.MUS  # Heritage custodian - Museum
  text: "Rijksmuseum"
  start: 4
  end: 14
provenance:
  agent: claude-opus-4.5
  convention: ch_annotator-v1_7_0
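Annotation offsets can be sanity-checked against the source text. The sketch below assumes 0-based offsets with an inclusive `end`, which is inferred from the example (11 characters spanning positions 4 to 14) and not confirmed by the convention itself. The source sentence is borrowed from the vector store example for illustration.

```python
def span_matches(source: str, text: str, start: int, end: int) -> bool:
    """Check that source[start..end] (end inclusive, assumed) equals the annotated text."""
    return source[start:end + 1] == text


source = "The Rijksmuseum in Amsterdam..."
ok = span_matches(source, "Rijksmuseum", 4, 14)
print(ok)
```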
Deployment Options
Option 1: Local Development
# Start services
docker-compose up -d chromadb typedb
# Run pipeline
python -m dspy_heritage.pipeline --input data/custodian/ --schema schemas/20251121/linkml/
Option 2: Production (Hetzner)
# Deploy via rsync (no CI/CD per AGENTS.md Rule 7)
./infrastructure/deploy.sh --data
./infrastructure/deploy.sh --frontend
Performance Considerations
| Component | Target Latency | Strategy |
|---|---|---|
| Chunking | < 100ms/doc | Batch processing |
| Type Classification | < 50ms/chunk | Cached embeddings |
| Entity Extraction | < 200ms/chunk | DSPy caching |
| Entity Linking | < 500ms/entity | Wikidata cache |
| Vector Retrieval | < 50ms/query | HNSW index |
| Graph Traversal | < 100ms/query | TypeDB indexing |
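The caching strategies in the table could be as simple as memoizing lookups in process. The sketch below wraps a stand-in lookup with `functools.lru_cache`. The `slow_lookup` function and its canned result are hypothetical placeholders for a remote Wikidata call.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the slow path actually runs


def slow_lookup(label: str) -> str:
    """Stand-in for a remote Wikidata search (hypothetical, returns canned data)."""
    CALLS["count"] += 1
    return {"Rijksmuseum Amsterdam": "Q190804"}.get(label, "")


@lru_cache(maxsize=10_000)
def cached_lookup(label: str) -> str:
    """Memoized wrapper: repeated labels skip the slow lookup."""
    return slow_lookup(label)


for _ in range(3):
    qid = cached_lookup("Rijksmuseum Amsterdam")
print(qid, CALLS["count"])
```

An in-process cache is lost on restart, so a persistent store may be a better fit for long-running batch ingestion.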