kempersc b1f93b6f22 enrich person profiles

2025-12-12 12:51:10 +01:00

26 KiB


DSPy RAG Architecture for Heritage Custodian Ontology

System Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              INPUT LAYER                                         │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│   │ Conversation│  │   Website   │  │    CSV      │  │  Wikidata   │            │
│   │    JSON     │  │   Archive   │  │  Registry   │  │   SPARQL    │            │
│   │  (139 files)│  │   (HTML)    │  │ (ISIL, NDE) │  │   Queries   │            │
│   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            │
│          │                │                │                │                    │
│          ▼                ▼                ▼                ▼                    │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    Document Loader (Docling/LangChain)                   │   │
│   │                                                                          │   │
│   │  • HTML → Markdown (Docling)                                             │   │
│   │  • JSON → Structured Messages                                            │   │
│   │  • CSV → DataFrames with Schema Mapping                                  │   │
│   │  • SPARQL Results → JSON-LD                                              │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
└──────────────────────────────────┬──────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                           PROCESSING LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    Schema-Aware Chunker                                  │   │
│   │                                                                          │   │
│   │  Input: Raw documents                                                    │   │
│   │  Output: Chunks with ontology metadata                                   │   │
│   │                                                                          │   │
│   │  Strategies:                                                             │   │
│   │  • Entity-boundary chunking (Custodian mentions)                         │   │
│   │  • Semantic section detection (Collection, Platform, Location)           │   │
│   │  • Observation unit extraction (CustodianObservation)                    │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                   │                                              │
│                                   ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    DSPy Semantic Router                                  │   │
│   │                                                                          │   │
│   │  Input: Chunk text + metadata                                            │   │
│   │  Output: Custodian type classification (GLAMORCUBESFIXPHDNT)             │   │
│   │                                                                          │   │
│   │  Module: CustodianTypeClassifier                                         │   │
│   │  • 19 classes with Wikidata meanings                                     │   │
│   │  • Multi-label support for MIXED type                                    │   │
│   │  • Confidence scoring per type                                           │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                   │                                              │
│                                   ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    DSPy Entity Extractor                                 │   │
│   │                                                                          │   │
│   │  Input: Classified chunks                                                │   │
│   │  Output: Structured entities (LinkML-compliant)                          │   │
│   │                                                                          │   │
│   │  Modules:                                                                │   │
│   │  • CustodianExtractor (names, types, descriptions)                       │   │
│   │  • IdentifierExtractor (ISIL, Wikidata, VIAF, KvK)                       │   │
│   │  • LocationExtractor (addresses, GeoNames linking)                       │   │
│   │  • CollectionExtractor (holdings, temporal extent)                       │   │
│   │  • RelationshipExtractor (encompassing bodies, projects)                 │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                   │                                              │
│                                   ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    Entity Linker (Wikidata/VIAF/ISIL)                    │   │
│   │                                                                          │   │
│   │  Input: Extracted entities                                               │   │
│   │  Output: Linked entities with external identifiers                       │   │
│   │                                                                          │   │
│   │  Services:                                                               │   │
│   │  • Wikidata MCP tool (wikidata-authenticated_*)                          │   │
│   │  • VIAF API                                                              │   │
│   │  • ISIL registry lookup                                                  │   │
│   │  • GeoNames for location normalization                                   │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
└──────────────────────────────────┬──────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            STORAGE LAYER                                         │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌─────────────────────────────────┐  ┌─────────────────────────────────────┐  │
│   │      Vector Store (ChromaDB)    │  │    Knowledge Graph (TypeDB)         │  │
│   │                                 │  │                                     │  │
│   │  Collections:                   │  │  Entity Types:                      │  │
│   │  • custodian_chunks             │  │  • custodian (hub)                  │  │
│   │  • observation_chunks           │  │  • custodian-legal-status           │  │
│   │  • collection_chunks            │  │  • custodian-name                   │  │
│   │  • project_chunks               │  │  • custodian-place                  │  │
│   │                                 │  │  • custodian-collection             │  │
│   │  Metadata:                      │  │  • digital-platform                 │  │
│   │  • custodian_type               │  │  • encompassing-body                │  │
│   │  • country_code                 │  │  • project                          │  │
│   │  • ghcid                        │  │                                     │  │
│   │  • source_tier                  │  │  Relationships:                     │  │
│   │  • confidence_score             │  │  • has-legal-status                 │  │
│   │                                 │  │  • has-place                        │  │
│   │  Embedding Model:               │  │  • manages-collection               │  │
│   │  • BGE-M3 (multilingual)        │  │  • has-platform                     │  │
│   │  • domain-finetuned variant     │  │  • member-of                        │  │
│   │    for heritage terminology     │  │  • participated-in-project          │  │
│   └─────────────────────────────────┘  └─────────────────────────────────────┘  │
│                                                                                  │
└──────────────────────────────────┬──────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            QUERY LAYER                                           │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    Hybrid Retriever                                      │   │
│   │                                                                          │   │
│   │  Input: User query                                                       │   │
│   │  Output: Relevant chunks + KG entities                                   │   │
│   │                                                                          │   │
│   │  Strategies:                                                             │   │
│   │  1. Vector similarity (semantic match)                                   │   │
│   │  2. TypeDB traversal (relationship-based)                                │   │
│   │  3. SPARQL federation (Wikidata enrichment)                              │   │
│   │  4. Filtered retrieval (by type, country, tier)                          │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                   │                                              │
│                                   ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                    DSPy Response Generator                               │   │
│   │                                                                          │   │
│   │  Input: Retrieved context + user query                                   │   │
│   │  Output: Grounded response with citations                                │   │
│   │                                                                          │   │
│   │  Features:                                                               │   │
│   │  • Citation to source chunks                                             │   │
│   │  • Provenance tracking (PROV-O compatible)                               │   │
│   │  • Confidence indication                                                 │   │
│   │  • Multi-language support (NL, EN, DE, FR, ES, PT)                       │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘

Data Flow

1. Ingestion Pipeline

Source Document
       │
       ▼
┌──────────────────┐
│  Load Document   │  • Docling for HTML
│                  │  • JSON parser for conversations
│                  │  • pandas for CSV registries
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Schema-Aware     │  • Detect custodian boundaries
│ Chunking         │  • Identify observation units
│                  │  • Preserve ontology structure
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Type             │  • GLAMORCUBESFIXPHDNT classification
│ Classification   │  • Multi-label for MIXED
│                  │  • Confidence scoring
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Entity           │  • Extract Custodian attributes
│ Extraction       │  • Extract identifiers
│                  │  • Extract relationships
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Entity           │  • Wikidata Q-number lookup
│ Linking          │  • VIAF/ISIL resolution
│                  │  • GeoNames normalization
└────────┬─────────┘
         │
         ├────────────────────┐
         ▼                    ▼
┌──────────────────┐  ┌──────────────────┐
│ Vector Store     │  │ Knowledge Graph  │
│ (embeddings)     │  │ (entities)       │
└──────────────────┘  └──────────────────┘

2. Query Pipeline

User Query
       │
       ▼
┌──────────────────┐
│ Query            │  • Intent classification
│ Understanding    │  • Entity mention detection
│                  │  • Type filtering extraction
└────────┬─────────┘
         │
         ├────────────────────┬────────────────────┐
         ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ Vector           │  │ Graph            │  │ SPARQL           │
│ Retrieval        │  │ Traversal        │  │ Federation       │
│ (semantic)       │  │ (relationships)  │  │ (Wikidata)       │
└────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
         │                     │                     │
         └─────────────────────┴─────────────────────┘
                               │
                               ▼
                    ┌──────────────────┐
                    │ Context          │  • Rank by relevance
                    │ Aggregation      │  • Deduplicate
                    │                  │  • Filter by tier
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │ Response         │  • Generate answer
                    │ Generation       │  • Add citations
                    │                  │  • Track provenance
                    └────────┬─────────┘
                             │
                             ▼
                       Final Response
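
The Context Aggregation step above (rank, deduplicate, filter by tier) can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Hit` record, the `aggregate` function, and the assumption that a lower `source_tier` number means a more authoritative source are all hypothetical, and scores from the three retrievers are assumed to be directly comparable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical retrieval result shared by all three retrievers."""
    chunk_id: str     # identifier used to detect duplicates
    score: float      # relevance score, higher is better (assumed comparable)
    source_tier: int  # assumption: 1 is the most authoritative tier
    retriever: str    # "vector", "graph", or "sparql"


def aggregate(hits: Iterable[Hit], max_tier: int = 2, top_k: int = 5) -> List[Hit]:
    """Filter by tier, deduplicate by chunk_id, then rank by score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.source_tier > max_tier:
            continue  # drop sources below the requested trust level
        current = best.get(hit.chunk_id)
        if current is None or hit.score > current.score:
            best[hit.chunk_id] = hit  # keep the highest-scoring duplicate
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


hits = [
    Hit("c1", 0.82, 2, "vector"),
    Hit("c1", 0.91, 2, "graph"),   # same chunk found by graph traversal
    Hit("c2", 0.75, 1, "vector"),
    Hit("c3", 0.99, 4, "sparql"),  # filtered out: tier 4 > max_tier 2
]
ranked = aggregate(hits)
print([(h.chunk_id, h.retriever) for h in ranked])
# [('c1', 'graph'), ('c2', 'vector')]
```

If scores from different retrievers are not on a common scale, a rank-based fusion would be a safer choice than sorting on raw scores.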

Component Details

Schema-Aware Chunker

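The chunker loads the LinkML class hierarchy, splits each document at custodian boundaries, and tags every chunk with ontology metadata:

from typing import List

# Document, Chunk, and load_linkml_schema are project types/helpers defined elsewhere.
class SchemaAwareChunker: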
    """Chunks documents respecting ontology boundaries."""
    
    def __init__(self, schema_path: str):
        self.schema = load_linkml_schema(schema_path)
        self.entity_patterns = self._build_entity_patterns()
    
    def chunk(self, document: Document) -> List[Chunk]:
        # 1. Identify entity boundaries
        boundaries = self._detect_custodian_boundaries(document)
        
        # 2. Split at boundaries
        raw_chunks = self._split_at_boundaries(document, boundaries)
        
        # 3. Enrich with ontology metadata
        enriched_chunks = []
        for chunk in raw_chunks:
            chunk.metadata.update({
                "primary_class": self._classify_chunk(chunk),
                "mentioned_classes": self._extract_class_mentions(chunk),
                "relationships": self._extract_relationships(chunk),
            })
            enriched_chunks.append(chunk)
        
        return enriched_chunks
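
Steps 1 and 2 of `chunk` might look roughly like the sketch below. It is a simplified stand-in, not the actual private methods: it works on a list of paragraph strings rather than a `Document`, and the trigger terms in `CUSTODIAN_PATTERN` are hypothetical (a real pattern set would presumably be derived from the schema).

```python
import re
from typing import List

# Hypothetical trigger terms. No leading \b, so compounds such as
# "Rijksmuseum" or "Stadsarchief" also match.
CUSTODIAN_PATTERN = re.compile(
    r"(?:museum|archive|library|archief|bibliotheek)\b", re.IGNORECASE
)


def detect_boundaries(paragraphs: List[str]) -> List[int]:
    """Return indices of paragraphs that contain a custodian mention."""
    return [i for i, p in enumerate(paragraphs) if CUSTODIAN_PATTERN.search(p)]


def split_at_boundaries(paragraphs: List[str], boundaries: List[int]) -> List[str]:
    """Start a new chunk at every boundary; leading text forms its own chunk."""
    starts = sorted(set([0] + boundaries))
    ends = starts[1:] + [len(paragraphs)]
    return ["\n".join(paragraphs[s:e]) for s, e in zip(starts, ends) if s < e]


paragraphs = [
    "Overview of Dutch heritage institutions.",
    "The Rijksmuseum in Amsterdam holds Dutch masters.",
    "It opened in 1885.",
    "The Stadsarchief Amsterdam preserves city records.",
]
boundaries = detect_boundaries(paragraphs)
chunks = split_at_boundaries(paragraphs, boundaries)
print(boundaries)   # [1, 3]
print(len(chunks))  # 3
```

Follow-up paragraphs without a new mention ("It opened in 1885.") stay attached to the preceding custodian, which is the point of splitting at entity boundaries rather than at fixed token counts.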

DSPy Modules

See 02-dspy-signatures.md for detailed module definitions.

Vector Store Schema

ChromaDB collections are structured to support ontology-aware retrieval:

# Collection: custodian_chunks
{
    "id": "chunk-uuid",
    "embedding": [0.1, 0.2, ...],  # BGE-M3 embedding
    "document": "The Rijksmuseum in Amsterdam...",
    "metadata": {
        # Ontology metadata
        "primary_class": "Custodian",
        "custodian_type": "MUSEUM",
        "custodian_type_wikidata": "Q33506",
        
        # Identifier metadata
        "ghcid": "NL-NH-AMS-M-RM",
        "wikidata_id": "Q190804",
        "isil_code": "NL-AmRM",
        
        # Geographic metadata
        "country_code": "NL",
        "region_code": "NH",
        "settlement_code": "AMS",
        
        # Provenance metadata
        "source_tier": 2,  # TIER_2_VERIFIED
        "source_url": "https://www.rijksmuseum.nl/",
        "extraction_date": "2025-12-12",
        "confidence_score": 0.95,
    }
}
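
The metadata fields above are what make filtered retrieval (by type, country, tier) possible. The helper below is a hedged sketch of how such a filter could be assembled using ChromaDB's `where` operator syntax (`$eq`, `$lte`, `$and`); the function name `build_where` and its parameters are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_where(
    custodian_type: Optional[str] = None,
    country_code: Optional[str] = None,
    max_tier: Optional[int] = None,
) -> Optional[Dict[str, Any]]:
    """Build a ChromaDB-style `where` filter from optional constraints."""
    clauses: List[Dict[str, Any]] = []
    if custodian_type is not None:
        clauses.append({"custodian_type": {"$eq": custodian_type}})
    if country_code is not None:
        clauses.append({"country_code": {"$eq": country_code}})
    if max_tier is not None:
        clauses.append({"source_tier": {"$lte": max_tier}})
    if not clauses:
        return None        # no filtering requested
    if len(clauses) == 1:
        return clauses[0]  # `$and` expects at least two operands
    return {"$and": clauses}


where = build_where(custodian_type="MUSEUM", country_code="NL", max_tier=2)
print(where)
# The result would be passed along the lines of:
#   collection.query(query_texts=["..."], n_results=5, where=where)
```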

TypeDB Schema

The TypeDB schema mirrors the LinkML structure:

define

# Hub entity
custodian sub entity,
    owns hc-id @key,
    owns ghcid,
    owns custodian-type,
    plays has-legal-status:custodian,
    plays has-place:custodian,
    plays manages-collection:custodian,
    plays has-platform:custodian,
    plays member-of:member;

# Reconstructed aspects
custodian-legal-status sub entity,
    owns legal-form,
    owns registration-number,
    plays has-legal-status:status;

custodian-place sub entity,
    owns preferred-label,
    owns geonames-id,
    plays has-place:place;

# Relationships
has-legal-status sub relation,
    relates custodian,
    relates status;

has-place sub relation,
    relates custodian,
    relates place;

member-of sub relation,
    relates member,
    relates body;
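
A relationship-based lookup against this schema could be expressed as a TypeQL query built from Python. The sketch below assumes TypeQL 2.x syntax (which the `sub entity` / `sub relation` style above suggests) and only constructs the query string; it does not open a TypeDB session. The function name and the GHCID validation pattern are hypothetical.

```python
import re

# Assumed GHCID shape (uppercase letters, digits, hyphens),
# inferred from the example "NL-NH-AMS-M-RM".
_GHCID_RE = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")


def place_labels_query(ghcid: str) -> str:
    """Return a TypeQL 2.x query for the place labels of one custodian."""
    if not _GHCID_RE.fullmatch(ghcid):
        # Reject anything unexpected instead of interpolating it into the query.
        raise ValueError(f"unexpected GHCID format: {ghcid!r}")
    return (
        "match\n"
        f'  $c isa custodian, has ghcid "{ghcid}";\n'
        "  (custodian: $c, place: $p) isa has-place;\n"
        "  $p has preferred-label $label;\n"
        "get $label;"
    )


print(place_labels_query("NL-NH-AMS-M-RM"))
```

The role names `custodian` and `place` come from the `has-place` relation defined above; the other relations can be traversed with the same pattern.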

Integration Points

Wikidata MCP Tools

The pipeline integrates with Wikidata via MCP tools:

# Search for institution
entity_id = wikidata_search_entity("Rijksmuseum Amsterdam")
# Returns: Q190804

# Get properties
properties = wikidata_get_properties(entity_id)
# Returns: ["P31", "P17", "P131", "P791", ...]

# Get identifiers
identifiers = wikidata_get_identifiers(entity_id)
# Returns: {"ISIL": "NL-AmRM", "VIAF": "148691498", ...}

GeoNames Integration

Location normalization uses a local GeoNames SQLite database:

# Reverse geocode coordinates
settlement = geonames_reverse_geocode(
    lat=52.3600,
    lon=4.8852,
    country_code="NL"
)
# Returns: {"name": "Amsterdam", "admin1": "NH", "geonames_id": 2759794}
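
One way such a lookup could work is a nearest-settlement search over the SQLite table. The sketch below is an assumption-laden illustration: the `settlements` table layout, the `reverse_geocode` function, and the brute-force haversine scan are all hypothetical, and the inserted rows are illustrative data with approximate coordinates.

```python
import math
import sqlite3
from typing import Any, Dict, Optional


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def reverse_geocode(conn: sqlite3.Connection, lat: float, lon: float,
                    country_code: str) -> Optional[Dict[str, Any]]:
    """Return the nearest settlement within the given country, if any."""
    rows = conn.execute(
        "SELECT geonames_id, name, admin1, lat, lon "
        "FROM settlements WHERE country_code = ?",
        (country_code,),
    ).fetchall()
    if not rows:
        return None
    gid, name, admin1, _, _ = min(
        rows, key=lambda r: haversine_km(lat, lon, r[3], r[4])
    )
    return {"name": name, "admin1": admin1, "geonames_id": gid}


# Hypothetical table layout with illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settlements (geonames_id INTEGER PRIMARY KEY, name TEXT, "
    "admin1 TEXT, country_code TEXT, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO settlements VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2759794, "Amsterdam", "NH", "NL", 52.3740, 4.8897),
        (2747891, "Rotterdam", "ZH", "NL", 51.9225, 4.4792),
    ],
)
result = reverse_geocode(conn, lat=52.3600, lon=4.8852, country_code="NL")
print(result)
```

A full scan per country is fine for a sketch; a bounding-box prefilter on `lat`/`lon` would be the natural next step for larger tables.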

CH-Annotator Convention

Entity extraction follows CH-Annotator v1.7.0:

entity:
  type: GRP.HER.MUS  # Heritage custodian - Museum
  text: "Rijksmuseum"
  start: 4
  end: 14
  provenance:
    agent: claude-opus-4.5
    convention: ch_annotator-v1_7_0

Deployment Options

Option 1: Local Development

# Start services
docker-compose up -d chromadb typedb

# Run pipeline
python -m dspy_heritage.pipeline --input data/custodian/ --schema schemas/20251121/linkml/

Option 2: Production (Hetzner)

# Deploy via rsync (no CI/CD per AGENTS.md Rule 7)
./infrastructure/deploy.sh --data
./infrastructure/deploy.sh --frontend

Performance Considerations

Component             Target Latency    Strategy
Chunking              < 100 ms/doc      Batch processing
Type Classification   < 50 ms/chunk     Cached embeddings
Entity Extraction     < 200 ms/chunk    DSPy caching
Entity Linking        < 500 ms/entity   Wikidata cache
Vector Retrieval      < 50 ms/query     HNSW index
Graph Traversal       < 100 ms/query    TypeDB indexing
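
The "Wikidata cache" strategy for entity linking can be illustrated with an in-process memoization layer. This is a sketch under stated assumptions: `_remote_search` is a stand-in for the real Wikidata MCP call, `_FAKE_INDEX` is placeholder data, and `functools.lru_cache` is one possible cache (a persistent store may suit production better).

```python
from functools import lru_cache
from typing import Optional

# Stand-in for the remote Wikidata search; placeholder data only.
_FAKE_INDEX = {"rijksmuseum amsterdam": "Q190804"}
stats = {"remote_calls": 0}


def _remote_search(label: str) -> Optional[str]:
    """Pretend to call the remote service and count the calls."""
    stats["remote_calls"] += 1
    return _FAKE_INDEX.get(label)


@lru_cache(maxsize=10_000)
def _cached_search(normalized: str) -> Optional[str]:
    """Memoize lookups so repeated labels skip the remote call."""
    return _remote_search(normalized)


def link_entity(label: str) -> Optional[str]:
    """Normalize case and whitespace so trivial variants share a cache entry."""
    return _cached_search(" ".join(label.lower().split()))


print(link_entity("Rijksmuseum Amsterdam"))        # Q190804
print(link_entity("  rijksmuseum   AMSTERDAM  "))  # Q190804, served from cache
print(stats["remote_calls"])                       # 1
```

Normalizing before the cache lookup matters here: institution names extracted from different sources often differ only in case or spacing.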
