# DSPy RAG Architecture for Heritage Custodian Ontology

## System Architecture

```
┌───────────────────────────────────────────────────────────────────────────┐
│                                INPUT LAYER                                │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │
│   │ Conversation│   │   Website   │   │     CSV     │   │  Wikidata   │   │
│   │    JSON     │   │   Archive   │   │  Registry   │   │   SPARQL    │   │
│   │ (139 files) │   │   (HTML)    │   │ (ISIL, NDE) │   │   Queries   │   │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   │
│          │                 │                 │                 │          │
│          ▼                 ▼                 ▼                 ▼          │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                  Document Loader (Docling/LangChain)                  │ │
│ │                                                                       │ │
│ │  • HTML → Markdown (Docling)                                          │ │
│ │  • JSON → Structured Messages                                         │ │
│ │  • CSV → DataFrames with Schema Mapping                               │ │
│ │  • SPARQL Results → JSON-LD                                           │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                                                           │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                             PROCESSING LAYER                              │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                         Schema-Aware Chunker                          │ │
│ │                                                                       │ │
│ │  Input: Raw documents                                                 │ │
│ │  Output: Chunks with ontology metadata                                │ │
│ │                                                                       │ │
│ │  Strategies:                                                          │ │
│ │  • Entity-boundary chunking (Custodian mentions)                      │ │
│ │  • Semantic section detection (Collection, Platform, Location)        │ │
│ │  • Observation unit extraction (CustodianObservation)                 │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                     │                                     │
│                                     ▼                                     │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                         DSPy Semantic Router                          │ │
│ │                                                                       │ │
│ │  Input: Chunk text + metadata                                         │ │
│ │  Output: Custodian type classification (GLAMORCUBESFIXPHDNT)          │ │
│ │                                                                       │ │
│ │  Module: CustodianTypeClassifier                                      │ │
│ │  • 19 classes with Wikidata meanings                                  │ │
│ │  • Multi-label support for MIXED type                                 │ │
│ │  • Confidence scoring per type                                        │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                     │                                     │
│                                     ▼                                     │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                         DSPy Entity Extractor                         │ │
│ │                                                                       │ │
│ │  Input: Classified chunks                                             │ │
│ │  Output: Structured entities (LinkML-compliant)                       │ │
│ │                                                                       │ │
│ │  Modules:                                                             │ │
│ │  • CustodianExtractor (names, types, descriptions)                    │ │
│ │  • IdentifierExtractor (ISIL, Wikidata, VIAF, KvK)                    │ │
│ │  • LocationExtractor (addresses, GeoNames linking)                    │ │
│ │  • CollectionExtractor (holdings, temporal extent)                    │ │
│ │  • RelationshipExtractor (encompassing bodies, projects)              │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                     │                                     │
│                                     ▼                                     │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                  Entity Linker (Wikidata/VIAF/ISIL)                   │ │
│ │                                                                       │ │
│ │  Input: Extracted entities                                            │ │
│ │  Output: Linked entities with external identifiers                    │ │
│ │                                                                       │ │
│ │  Services:                                                            │ │
│ │  • Wikidata MCP tool (wikidata-authenticated_*)                       │ │
│ │  • VIAF API                                                           │ │
│ │  • ISIL registry lookup                                               │ │
│ │  • GeoNames for location normalization                                │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                                                           │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               STORAGE LAYER                               │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ ┌─────────────────────────────────┐   ┌─────────────────────────────────┐ │
│ │     Vector Store (ChromaDB)     │   │    Knowledge Graph (TypeDB)     │ │
│ │                                 │   │                                 │ │
│ │ Collections:                    │   │ Entity Types:                   │ │
│ │ • custodian_chunks              │   │ • custodian (hub)               │ │
│ │ • observation_chunks            │   │ • custodian-legal-status        │ │
│ │ • collection_chunks             │   │ • custodian-name                │ │
│ │ • project_chunks                │   │ • custodian-place               │ │
│ │                                 │   │ • custodian-collection          │ │
│ │ Metadata:                       │   │ • digital-platform              │ │
│ │ • custodian_type                │   │ • encompassing-body             │ │
│ │ • country_code                  │   │ • project                       │ │
│ │ • ghcid                         │   │                                 │ │
│ │ • source_tier                   │   │ Relationships:                  │ │
│ │ • confidence_score              │   │ • has-legal-status              │ │
│ │                                 │   │ • has-place                     │ │
│ │ Embedding Model:                │   │ • manages-collection            │ │
│ │ • BGE-M3 (multilingual)         │   │ • has-platform                  │ │
│ │ • domain-finetuned variant      │   │ • member-of                     │ │
│ │   for heritage terminology      │   │ • participated-in-project       │ │
│ └─────────────────────────────────┘   └─────────────────────────────────┘ │
│                                                                           │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                                QUERY LAYER                                │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                           Hybrid Retriever                            │ │
│ │                                                                       │ │
│ │  Input: User query                                                    │ │
│ │  Output: Relevant chunks + KG entities                                │ │
│ │                                                                       │ │
│ │  Strategies:                                                          │ │
│ │  1. Vector similarity (semantic match)                                │ │
│ │  2. TypeDB traversal (relationship-based)                             │ │
│ │  3. SPARQL federation (Wikidata enrichment)                           │ │
│ │  4. Filtered retrieval (by type, country, tier)                       │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                     │                                     │
│                                     ▼                                     │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │                        DSPy Response Generator                        │ │
│ │                                                                       │ │
│ │  Input: Retrieved context + user query                                │ │
│ │  Output: Grounded response with citations                             │ │
│ │                                                                       │ │
│ │  Features:                                                            │ │
│ │  • Citation to source chunks                                          │ │
│ │  • Provenance tracking (PROV-O compatible)                            │ │
│ │  • Confidence indication                                              │ │
│ │  • Multi-language support (NL, EN, DE, FR, ES, PT)                    │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```

## Data Flow

### 1. Ingestion Pipeline

```
  Source Document
         │
         ▼
┌──────────────────┐
│  Load Document   │   • Docling for HTML
│                  │   • JSON parser for conversations
│                  │   • pandas for CSV registries
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Schema-Aware    │   • Detect custodian boundaries
│  Chunking        │   • Identify observation units
│                  │   • Preserve ontology structure
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      Type        │   • GLAMORCUBESFIXPHDNT classification
│  Classification  │   • Multi-label for MIXED
│                  │   • Confidence scoring
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Entity       │   • Extract Custodian attributes
│   Extraction     │   • Extract identifiers
│                  │   • Extract relationships
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Entity       │   • Wikidata Q-number lookup
│    Linking       │   • VIAF/ISIL resolution
│                  │   • GeoNames normalization
└────────┬─────────┘
         │
         ├──────────────────────┐
         ▼                      ▼
┌──────────────────┐   ┌──────────────────┐
│   Vector Store   │   │ Knowledge Graph  │
│   (embeddings)   │   │    (entities)    │
└──────────────────┘   └──────────────────┘
```

### 2. Query Pipeline

```
    User Query
         │
         ▼
┌──────────────────┐
│      Query       │   • Intent classification
│  Understanding   │   • Entity mention detection
│                  │   • Type filtering extraction
└────────┬─────────┘
         │
         ├──────────────────────┬──────────────────────┐
         ▼                      ▼                      ▼
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│      Vector      │   │      Graph       │   │      SPARQL      │
│    Retrieval     │   │    Traversal     │   │    Federation    │
│    (semantic)    │   │ (relationships)  │   │    (Wikidata)    │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                      │                      │
         └──────────────────────┴──────────────────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │     Context      │   • Rank by relevance
                       │   Aggregation    │   • Deduplicate
                       │                  │   • Filter by tier
                       └────────┬─────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │     Response     │   • Generate answer
                       │    Generation    │   • Add citations
                       │                  │   • Track provenance
                       └────────┬─────────┘
                                │
                                ▼
                          Final Response
```

## Component Details

### Schema-Aware Chunker

The chunker understands the LinkML class hierarchy and chunks documents accordingly:

```python
from typing import List


class SchemaAwareChunker:
    """Chunks documents respecting ontology boundaries."""

    def __init__(self, schema_path: str):
        # Document, Chunk, and load_linkml_schema are defined elsewhere
        # in the pipeline package.
        self.schema = load_linkml_schema(schema_path)
        self.entity_patterns = self._build_entity_patterns()

    def chunk(self, document: Document) -> List[Chunk]:
        # 1. Identify entity boundaries
        boundaries = self._detect_custodian_boundaries(document)

        # 2. Split at boundaries
        raw_chunks = self._split_at_boundaries(document, boundaries)

        # 3. Enrich with ontology metadata
        enriched_chunks = []
        for chunk in raw_chunks:
            chunk.metadata.update({
                "primary_class": self._classify_chunk(chunk),
                "mentioned_classes": self._extract_class_mentions(chunk),
                "relationships": self._extract_relationships(chunk),
            })
            enriched_chunks.append(chunk)

        return enriched_chunks
```

### DSPy Modules

See [02-dspy-signatures.md](./02-dspy-signatures.md) for detailed module definitions.
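The chunker sketch above leaves `_detect_custodian_boundaries` and `_split_at_boundaries` undefined. A minimal, stdlib-only sketch of those two steps — with hand-written patterns that are purely illustrative, standing in for the schema-derived ones `_build_entity_patterns` would produce — could look like:

```python
import re
from typing import List

# Illustrative custodian-mention patterns (NOT the schema-derived ones).
CUSTODIAN_PATTERNS = [
    re.compile(r"\b(museum|archief|archive|bibliotheek|library)\b", re.IGNORECASE),
]


def detect_custodian_boundaries(text: str) -> List[int]:
    """Return sorted character offsets where a new chunk should start.

    Every custodian-pattern match opens a new chunk; offset 0 is always
    a boundary so the full text is covered.
    """
    boundaries = {0}
    for pattern in CUSTODIAN_PATTERNS:
        for match in pattern.finditer(text):
            boundaries.add(match.start())
    return sorted(boundaries)


def split_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Split text into contiguous chunks at the detected boundaries."""
    spans = zip(boundaries, boundaries[1:] + [len(text)])
    return [text[start:end] for start, end in spans]
```

The real implementation would also carry the match metadata (which pattern fired, at which offset) into the chunk metadata; this sketch only shows the splitting logic.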
### Vector Store Schema

ChromaDB collections are structured to support ontology-aware retrieval:

```python
# Collection: custodian_chunks
{
    "id": "chunk-uuid",
    "embedding": [0.1, 0.2, ...],  # BGE-M3 embedding
    "document": "The Rijksmuseum in Amsterdam...",
    "metadata": {
        # Ontology metadata
        "primary_class": "Custodian",
        "custodian_type": "MUSEUM",
        "custodian_type_wikidata": "Q33506",

        # Identifier metadata
        "ghcid": "NL-NH-AMS-M-RM",
        "wikidata_id": "Q190804",
        "isil_code": "NL-AmRM",

        # Geographic metadata
        "country_code": "NL",
        "region_code": "NH",
        "settlement_code": "AMS",

        # Provenance metadata
        "source_tier": 2,  # TIER_2_VERIFIED
        "source_url": "https://www.rijksmuseum.nl/",
        "extraction_date": "2025-12-12",
        "confidence_score": 0.95,
    }
}
```

### TypeDB Schema

The TypeDB schema mirrors the LinkML structure:

```typeql
define

# Hub entity
custodian sub entity,
    owns hc-id @key,
    owns ghcid,
    owns custodian-type,
    plays has-legal-status:custodian,
    plays has-place:custodian,
    plays manages-collection:custodian,
    plays has-platform:custodian,
    plays member-of:member;

# Reconstructed aspects
custodian-legal-status sub entity,
    owns legal-form,
    owns registration-number,
    plays has-legal-status:status;

custodian-place sub entity,
    owns preferred-label,
    owns geonames-id,
    plays has-place:place;

# Relationships
has-legal-status sub relation,
    relates custodian,
    relates status;

has-place sub relation,
    relates custodian,
    relates place;

member-of sub relation,
    relates member,
    relates body;
```

## Integration Points

### Wikidata MCP Tools

The pipeline integrates with Wikidata via MCP tools:

```python
# Search for institution
entity_id = wikidata_search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804

# Get properties
properties = wikidata_get_properties(entity_id)
# Returns: ["P31", "P17", "P131", "P791", ...]
# Get identifiers
identifiers = wikidata_get_identifiers(entity_id)
# Returns: {"ISIL": "NL-AmRM", "VIAF": "148691498", ...}
```

### GeoNames Integration

Location normalization uses the GeoNames SQLite database:

```python
# Reverse geocode coordinates
settlement = geonames_reverse_geocode(
    lat=52.3600,
    lon=4.8852,
    country_code="NL",
)
# Returns: {"name": "Amsterdam", "admin1": "NH", "geonames_id": 2759794}
```

### CH-Annotator Convention

Entity extraction follows CH-Annotator v1.7.0:

```yaml
entity:
  type: GRP.HER.MUS  # Heritage custodian - Museum
  text: "Rijksmuseum"
  start: 4
  end: 14
  provenance:
    agent: claude-opus-4.5
    convention: ch_annotator-v1_7_0
```

## Deployment Options

### Option 1: Local Development

```bash
# Start services
docker-compose up -d chromadb typedb

# Run pipeline
python -m dspy_heritage.pipeline --input data/custodian/ --schema schemas/20251121/linkml/
```

### Option 2: Production (Hetzner)

```bash
# Deploy via rsync (no CI/CD per AGENTS.md Rule 7)
./infrastructure/deploy.sh --data
./infrastructure/deploy.sh --frontend
```

## Performance Considerations

| Component           | Target Latency  | Strategy          |
|---------------------|-----------------|-------------------|
| Chunking            | < 100ms/doc     | Batch processing  |
| Type Classification | < 50ms/chunk    | Cached embeddings |
| Entity Extraction   | < 200ms/chunk   | DSPy caching      |
| Entity Linking      | < 500ms/entity  | Wikidata cache    |
| Vector Retrieval    | < 50ms/query    | HNSW index        |
| Graph Traversal     | < 100ms/query   | TypeDB indexing   |