# DSPy RAG Architecture for Heritage Custodian Ontology

## System Architecture

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  INPUT LAYER                                  │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐  │
│  │ Conversation│     │   Website   │     │     CSV     │     │  Wikidata   │  │
│  │    JSON     │     │   Archive   │     │  Registry   │     │   SPARQL    │  │
│  │  (139 files)│     │   (HTML)    │     │ (ISIL, NDE) │     │   Queries   │  │
│  └──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘  │
│         │                   │                   │                   │         │
│         ▼                   ▼                   ▼                   ▼         │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                   Document Loader (Docling/LangChain)                   │  │
│  │                                                                         │  │
│  │  • HTML → Markdown (Docling)                                            │  │
│  │  • JSON → Structured Messages                                           │  │
│  │  • CSV → DataFrames with Schema Mapping                                 │  │
│  │  • SPARQL Results → JSON-LD                                             │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                               │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                               PROCESSING LAYER                                │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                          Schema-Aware Chunker                           │  │
│  │                                                                         │  │
│  │  Input:  Raw documents                                                  │  │
│  │  Output: Chunks with ontology metadata                                  │  │
│  │                                                                         │  │
│  │  Strategies:                                                            │  │
│  │  • Entity-boundary chunking (Custodian mentions)                        │  │
│  │  • Semantic section detection (Collection, Platform, Location)          │  │
│  │  • Observation unit extraction (CustodianObservation)                   │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       │                                       │
│                                       ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                          DSPy Semantic Router                           │  │
│  │                                                                         │  │
│  │  Input:  Chunk text + metadata                                          │  │
│  │  Output: Custodian type classification (GLAMORCUBESFIXPHDNT)            │  │
│  │                                                                         │  │
│  │  Module: CustodianTypeClassifier                                        │  │
│  │  • 19 classes with Wikidata meanings                                    │  │
│  │  • Multi-label support for MIXED type                                   │  │
│  │  • Confidence scoring per type                                          │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       │                                       │
│                                       ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                          DSPy Entity Extractor                          │  │
│  │                                                                         │  │
│  │  Input:  Classified chunks                                              │  │
│  │  Output: Structured entities (LinkML-compliant)                         │  │
│  │                                                                         │  │
│  │  Modules:                                                               │  │
│  │  • CustodianExtractor (names, types, descriptions)                      │  │
│  │  • IdentifierExtractor (ISIL, Wikidata, VIAF, KvK)                      │  │
│  │  • LocationExtractor (addresses, GeoNames linking)                      │  │
│  │  • CollectionExtractor (holdings, temporal extent)                      │  │
│  │  • RelationshipExtractor (encompassing bodies, projects)                │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       │                                       │
│                                       ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                   Entity Linker (Wikidata/VIAF/ISIL)                    │  │
│  │                                                                         │  │
│  │  Input:  Extracted entities                                             │  │
│  │  Output: Linked entities with external identifiers                      │  │
│  │                                                                         │  │
│  │  Services:                                                              │  │
│  │  • Wikidata MCP tool (wikidata-authenticated_*)                         │  │
│  │  • VIAF API                                                             │  │
│  │  • ISIL registry lookup                                                 │  │
│  │  • GeoNames for location normalization                                  │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                               │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                                 STORAGE LAYER                                 │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  ┌─────────────────────────────────┐ ┌─────────────────────────────────────┐  │
│  │     Vector Store (ChromaDB)     │ │      Knowledge Graph (TypeDB)       │  │
│  │                                 │ │                                     │  │
│  │  Collections:                   │ │  Entity Types:                      │  │
│  │  • custodian_chunks             │ │  • custodian (hub)                  │  │
│  │  • observation_chunks           │ │  • custodian-legal-status           │  │
│  │  • collection_chunks            │ │  • custodian-name                   │  │
│  │  • project_chunks               │ │  • custodian-place                  │  │
│  │                                 │ │  • custodian-collection             │  │
│  │  Metadata:                      │ │  • digital-platform                 │  │
│  │  • custodian_type               │ │  • encompassing-body                │  │
│  │  • country_code                 │ │  • project                          │  │
│  │  • ghcid                        │ │                                     │  │
│  │  • source_tier                  │ │  Relationships:                     │  │
│  │  • confidence_score             │ │  • has-legal-status                 │  │
│  │                                 │ │  • has-place                        │  │
│  │  Embedding Model:               │ │  • manages-collection               │  │
│  │  • BGE-M3 (multilingual)        │ │  • has-platform                     │  │
│  │  • domain-finetuned variant     │ │  • member-of                        │  │
│  │    for heritage terminology     │ │  • participated-in-project          │  │
│  └─────────────────────────────────┘ └─────────────────────────────────────┘  │
│                                                                               │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  QUERY LAYER                                  │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                            Hybrid Retriever                             │  │
│  │                                                                         │  │
│  │  Input:  User query                                                     │  │
│  │  Output: Relevant chunks + KG entities                                  │  │
│  │                                                                         │  │
│  │  Strategies:                                                            │  │
│  │  1. Vector similarity (semantic match)                                  │  │
│  │  2. TypeDB traversal (relationship-based)                               │  │
│  │  3. SPARQL federation (Wikidata enrichment)                             │  │
│  │  4. Filtered retrieval (by type, country, tier)                         │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       │                                       │
│                                       ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │                         DSPy Response Generator                         │  │
│  │                                                                         │  │
│  │  Input:  Retrieved context + user query                                 │  │
│  │  Output: Grounded response with citations                               │  │
│  │                                                                         │  │
│  │  Features:                                                              │  │
│  │  • Citation to source chunks                                            │  │
│  │  • Provenance tracking (PROV-O compatible)                              │  │
│  │  • Confidence indication                                                │  │
│  │  • Multi-language support (NL, EN, DE, FR, ES, PT)                      │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
```

## Data Flow

### 1. Ingestion Pipeline

```
  Source Document
         │
         ▼
┌──────────────────┐
│  Load Document   │  • Docling for HTML
│                  │  • JSON parser for conversations
│                  │  • pandas for CSV registries
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Schema-Aware    │  • Detect custodian boundaries
│  Chunking        │  • Identify observation units
│                  │  • Preserve ontology structure
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Type            │  • GLAMORCUBESFIXPHDNT classification
│  Classification  │  • Multi-label for MIXED
│                  │  • Confidence scoring
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Entity          │  • Extract Custodian attributes
│  Extraction      │  • Extract identifiers
│                  │  • Extract relationships
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Entity          │  • Wikidata Q-number lookup
│  Linking         │  • VIAF/ISIL resolution
│                  │  • GeoNames normalization
└────────┬─────────┘
         │
         ├────────────────────┐
         ▼                    ▼
┌──────────────────┐ ┌──────────────────┐
│  Vector Store    │ │  Knowledge Graph │
│  (embeddings)    │ │  (entities)      │
└──────────────────┘ └──────────────────┘
```

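The type-classification step yields a confidence per type and allows several labels when an institution is MIXED. A minimal sketch of that selection logic follows, assuming the classifier returns a plain dictionary of scores. The function names, the 0.5 threshold, and the `ARCHIVE` and `LIBRARY` labels are illustrative assumptions, not part of the pipeline's documented API.

```python
from typing import Dict, List


def select_types(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every type whose confidence meets the threshold, best first."""
    selected = [label for label, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda label: scores[label], reverse=True)


def is_mixed(labels: List[str]) -> bool:
    """Treat a custodian as MIXED when more than one type passes the threshold."""
    return len(labels) > 1


# Hypothetical classifier output for one chunk.
scores = {"MUSEUM": 0.91, "ARCHIVE": 0.64, "LIBRARY": 0.12}
labels = select_types(scores)
print(labels, "MIXED" if is_mixed(labels) else "single type")
```

Keeping the per-type scores alongside the selected labels lets later stages store `confidence_score` without re-running the classifier.
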
### 2. Query Pipeline

```
    User Query
         │
         ▼
┌──────────────────┐
│  Query           │  • Intent classification
│  Understanding   │  • Entity mention detection
│                  │  • Type filtering extraction
└────────┬─────────┘
         │
         ├────────────────────┬────────────────────┐
         ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│  Vector          │ │  Graph           │ │  SPARQL          │
│  Retrieval       │ │  Traversal       │ │  Federation      │
│  (semantic)      │ │  (relationships) │ │  (Wikidata)      │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │  Context         │  • Rank by relevance
                     │  Aggregation     │  • Deduplicate
                     │                  │  • Filter by tier
                     └────────┬─────────┘
                              │
                              ▼
                     ┌──────────────────┐
                     │  Response        │  • Generate answer
                     │  Generation      │  • Add citations
                     │                  │  • Track provenance
                     └────────┬─────────┘
                              │
                              ▼
                       Final Response
```

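The context-aggregation step (rank, deduplicate, filter by tier) can be sketched in a few lines. This is an illustration under stated assumptions: the `Hit` record, its field names, and the idea that scores from the three retrievers are already normalized to a common scale are all hypothetical, and lower tier numbers are assumed to be more authoritative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float      # relevance, higher is better (assumed normalized)
    source_tier: int  # assumed: 1 is the most authoritative tier
    origin: str       # "vector", "graph", or "sparql"


def aggregate(hits: Iterable[Hit], max_tier: int = 2, top_k: int = 5) -> List[Hit]:
    """Filter by tier, keep the best hit per chunk, then rank by score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.source_tier > max_tier:
            continue  # drop sources below the requested tier
        current = best.get(hit.chunk_id)
        if current is None or hit.score > current.score:
            best[hit.chunk_id] = hit  # deduplicate on chunk_id
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


hits = [
    Hit("c1", 0.82, 2, "vector"),
    Hit("c1", 0.90, 2, "graph"),   # same chunk found by a second retriever
    Hit("c2", 0.75, 1, "vector"),
    Hit("c3", 0.99, 4, "sparql"),  # filtered out: tier 4 exceeds max_tier
]
context = aggregate(hits)
print([(h.chunk_id, h.origin) for h in context])
```
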
## Component Details

### Schema-Aware Chunker

The chunker understands the LinkML class hierarchy and chunks documents accordingly:

```python
from __future__ import annotations

from typing import List

# Document, Chunk, and load_linkml_schema are assumed to be defined
# elsewhere in the project; they are not shown in this section.


class SchemaAwareChunker:
    """Chunks documents respecting ontology boundaries."""

    def __init__(self, schema_path: str):
        self.schema = load_linkml_schema(schema_path)
        self.entity_patterns = self._build_entity_patterns()

    def chunk(self, document: Document) -> List[Chunk]:
        # 1. Identify entity boundaries
        boundaries = self._detect_custodian_boundaries(document)

        # 2. Split at boundaries
        raw_chunks = self._split_at_boundaries(document, boundaries)

        # 3. Enrich with ontology metadata
        enriched_chunks = []
        for chunk in raw_chunks:
            chunk.metadata.update({
                "primary_class": self._classify_chunk(chunk),
                "mentioned_classes": self._extract_class_mentions(chunk),
                "relationships": self._extract_relationships(chunk),
            })
            enriched_chunks.append(chunk)

        return enriched_chunks
```

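The boundary detection and splitting steps are not spelled out in this section. One simple way to approximate them, shown here only as a sketch, is to start a new chunk whenever a paragraph mentions a different custodian than the current one. The function name, the blank-line paragraph convention, and plain substring matching are assumptions made for illustration; the real chunker works from LinkML-derived entity patterns.

```python
from typing import List, Optional


def chunk_by_custodian(text: str, names: List[str]) -> List[str]:
    """Group paragraphs into chunks, cutting when a new custodian is mentioned."""
    chunks: List[str] = []
    current: List[str] = []
    current_name: Optional[str] = None

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for para in paragraphs:
        # First known custodian name found in this paragraph, if any.
        mentioned = next((n for n in names if n in para), None)
        if mentioned and mentioned != current_name and current:
            chunks.append("\n\n".join(current))  # close the previous chunk
            current = []
        if mentioned:
            current_name = mentioned
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = (
    "The Rijksmuseum is in Amsterdam.\n\n"
    "It holds Dutch masters.\n\n"
    "The Nationaal Archief is in The Hague."
)
chunks = chunk_by_custodian(sample, ["Rijksmuseum", "Nationaal Archief"])
print(len(chunks))
```

Paragraphs without any mention stay attached to the preceding custodian, which keeps follow-up sentences such as "It holds..." in the right chunk.
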
### DSPy Modules

See [02-dspy-signatures.md](./02-dspy-signatures.md) for detailed module definitions.

### Vector Store Schema

ChromaDB collections are structured to support ontology-aware retrieval:

```python
# Collection: custodian_chunks
{
    "id": "chunk-uuid",
    "embedding": [0.1, 0.2, ...],  # BGE-M3 embedding
    "document": "The Rijksmuseum in Amsterdam...",
    "metadata": {
        # Ontology metadata
        "primary_class": "Custodian",
        "custodian_type": "MUSEUM",
        "custodian_type_wikidata": "Q33506",

        # Identifier metadata
        "ghcid": "NL-NH-AMS-M-RM",
        "wikidata_id": "Q190804",
        "isil_code": "NL-AmRM",

        # Geographic metadata
        "country_code": "NL",
        "region_code": "NH",
        "settlement_code": "AMS",

        # Provenance metadata
        "source_tier": 2,  # TIER_2_VERIFIED
        "source_url": "https://www.rijksmuseum.nl/",
        "extraction_date": "2025-12-12",
        "confidence_score": 0.95,
    }
}
```

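These metadata fields are what make filtered retrieval by type, country, and tier possible. The sketch below builds a ChromaDB-style `where` filter from optional criteria. The helper name `build_where` and its parameters are hypothetical; the `$eq`, `$lte`, and `$and` operators are ChromaDB's metadata filter operators, and the filter is only constructed here, not executed against a database.

```python
from typing import Any, Dict, List, Optional


def build_where(
    custodian_type: Optional[str] = None,
    country_code: Optional[str] = None,
    max_tier: Optional[int] = None,
) -> Optional[Dict[str, Any]]:
    """Build a metadata filter; returns None when no criterion is given."""
    clauses: List[Dict[str, Any]] = []
    if custodian_type is not None:
        clauses.append({"custodian_type": {"$eq": custodian_type}})
    if country_code is not None:
        clauses.append({"country_code": {"$eq": country_code}})
    if max_tier is not None:
        clauses.append({"source_tier": {"$lte": max_tier}})

    if not clauses:
        return None
    # "$and" needs at least two clauses, so a single clause is returned bare.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


where = build_where(custodian_type="MUSEUM", country_code="NL", max_tier=2)
print(where)
```

The result would typically be passed as the `where` argument of a collection query, for example `collection.query(query_embeddings=[...], n_results=5, where=where)`.
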
### TypeDB Schema

The TypeDB schema mirrors the LinkML structure:

```typeql
define

# Hub entity
custodian sub entity,
    owns hc-id @key,
    owns ghcid,
    owns custodian-type,
    plays has-legal-status:custodian,
    plays has-place:custodian,
    plays manages-collection:custodian,
    plays has-platform:custodian,
    plays member-of:member;

# Reconstructed aspects
custodian-legal-status sub entity,
    owns legal-form,
    owns registration-number,
    plays has-legal-status:status;

custodian-place sub entity,
    owns preferred-label,
    owns geonames-id,
    plays has-place:place;

# Relationships
has-legal-status sub relation,
    relates custodian,
    relates status;

has-place sub relation,
    relates custodian,
    relates place;

member-of sub relation,
    relates member,
    relates body;
```

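To show how the graph-traversal strategy might use this schema, here is a small Python helper that builds a TypeQL query for the place labels of one custodian. It is a sketch under assumptions: the helper name is hypothetical, the query uses TypeQL 2.x `match ... get` syntax (suggested by the `sub entity` style of the schema), and `ghcid` and `preferred-label` are assumed to be string-valued attributes. The query text is only generated, not sent to a TypeDB server.

```python
def place_labels_query(ghcid: str) -> str:
    """Build a TypeQL 2.x query returning place labels for one custodian."""
    # Escape backslashes and double quotes so the value stays inside the literal.
    safe = ghcid.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "match\n"
        f'  $c isa custodian, has ghcid "{safe}";\n'
        "  (custodian: $c, place: $p) isa has-place;\n"
        "  $p has preferred-label $label;\n"
        "get $label;"
    )


query = place_labels_query("NL-NH-AMS-M-RM")
print(query)
```
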
## Integration Points

### Wikidata MCP Tools

The pipeline integrates with Wikidata via MCP tools:

```python
# Search for institution
entity_id = wikidata_search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804

# Get properties
properties = wikidata_get_properties(entity_id)
# Returns: ["P31", "P17", "P131", "P791", ...]

# Get identifiers
identifiers = wikidata_get_identifiers(entity_id)
# Returns: {"ISIL": "NL-AmRM", "VIAF": "148691498", ...}
```

### GeoNames Integration

Location normalization uses a GeoNames SQLite database:

```python
# Reverse geocode coordinates
settlement = geonames_reverse_geocode(
    lat=52.3600,
    lon=4.8852,
    country_code="NL"
)
# Returns: {"name": "Amsterdam", "admin1": "NH", "geonames_id": 2759794}
```

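How such a lookup could work against SQLite is sketched below: a bounding-box query narrows the candidates, then the haversine distance picks the nearest settlement. The table name `settlements`, its columns, the `box_deg` parameter, and the sample rows (with approximate coordinates) are assumptions for illustration and do not describe the project's actual database layout.

```python
import math
import sqlite3
from typing import Optional


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def reverse_geocode(conn: sqlite3.Connection, lat: float, lon: float,
                    country_code: str, box_deg: float = 0.5) -> Optional[dict]:
    """Return the nearest settlement inside a bounding box, or None."""
    rows = conn.execute(
        "SELECT name, admin1, geonames_id, latitude, longitude FROM settlements "
        "WHERE country_code = ? AND latitude BETWEEN ? AND ? "
        "AND longitude BETWEEN ? AND ?",
        (country_code, lat - box_deg, lat + box_deg, lon - box_deg, lon + box_deg),
    ).fetchall()
    if not rows:
        return None
    name, admin1, gid, _, _ = min(
        rows, key=lambda r: haversine_km(lat, lon, r[3], r[4])
    )
    return {"name": name, "admin1": admin1, "geonames_id": gid}


# In-memory stand-in for the real database (hypothetical schema, sample rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settlements (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "admin1 TEXT, country_code TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO settlements VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2759794, "Amsterdam", "NH", "NL", 52.374, 4.890),
        (2755003, "Haarlem", "NH", "NL", 52.381, 4.637),
    ],
)
settlement = reverse_geocode(conn, lat=52.3600, lon=4.8852, country_code="NL")
print(settlement)
```
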
### CH-Annotator Convention

Entity extraction follows CH-Annotator v1.7.0:

```yaml
entity:
  type: GRP.HER.MUS  # Heritage custodian - Museum
  text: "Rijksmuseum"
  start: 4
  end: 14
  provenance:
    agent: claude-opus-4.5
    convention: ch_annotator-v1_7_0
```

## Deployment Options

### Option 1: Local Development

```bash
# Start services
docker-compose up -d chromadb typedb

# Run pipeline
python -m dspy_heritage.pipeline --input data/custodian/ --schema schemas/20251121/linkml/
```

### Option 2: Production (Hetzner)

```bash
# Deploy via rsync (no CI/CD per AGENTS.md Rule 7)
./infrastructure/deploy.sh --data
./infrastructure/deploy.sh --frontend
```

## Performance Considerations

| Component | Target Latency | Strategy |
|-----------|----------------|----------|
| Chunking | < 100ms/doc | Batch processing |
| Type Classification | < 50ms/chunk | Cached embeddings |
| Entity Extraction | < 200ms/chunk | DSPy caching |
| Entity Linking | < 500ms/entity | Wikidata cache |
| Vector Retrieval | < 50ms/query | HNSW index |
| Graph Traversal | < 100ms/query | TypeDB indexing |