# RAG Integration: SPARQL-First Retrieval Flow

## Overview

This document describes how template-based SPARQL generation integrates with the existing Heritage RAG system in `dspy_heritage_rag.py`. The key principle is **SPARQL-first retrieval**: use precise SPARQL queries to filter structured data before falling back to semantic vector search.
## Current RAG Architecture

The existing `HeritageRAG` class in `backend/rag/dspy_heritage_rag.py` follows this flow:

```
User Question
      |
      v
+-------------------------+
| HeritageQueryRouter     |  <-- Classifies intent, entity_type, sources
+-------------------------+
      |
      v
+-------------------------+
| HeritageSPARQLGenerator |  <-- LLM generates SPARQL (PROBLEMATIC)
+-------------------------+
      |
      v
+-------------------------+
| execute_sparql()        |  <-- Execute against Oxigraph endpoint
+-------------------------+
      |
      v
+-------------------------+
| Qdrant Vector Search    |  <-- Semantic search for context enrichment
+-------------------------+
      |
      v
+-------------------------+
| HeritageAnswerGenerator |  <-- Synthesize final answer
+-------------------------+
```
## Problem with Current Flow

The `HeritageSPARQLGenerator` (lines 344-478) uses an LLM to generate SPARQL queries. Despite extensive docstrings with examples, this produces:

- **Syntax errors**: orphaned punctuation, missing prefixes
- **Invalid predicates**: uses CIDOC-CRM properties that are not in our ontology
- **Inconsistent results**: the same question generates different queries
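To make the contrast concrete, here is a minimal sketch of a template entry and its instantiation, written as plain Python. The template ID, slot names, and predicates are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical template entry; the ID, slots, and predicates are
# illustrative assumptions, not the project's actual schema.
ARCHIVES_IN_PROVINCE = {
    "id": "archives_in_province",
    "description": "List archives located in a given province",
    "slots": ["province_code"],
    "sparql": """
        PREFIX schema: <https://schema.org/>
        SELECT ?institution ?name WHERE {
            ?institution a schema:ArchiveOrganization ;
                         schema:name ?name ;
                         schema:identifier ?id .
            FILTER(STRSTARTS(?id, "{province_code}"))
        }
    """,
}


def instantiate(template: dict, slots: dict) -> str:
    """Fill slot placeholders; raise if a required slot is missing."""
    query = template["sparql"]
    for slot in template["slots"]:
        if slot not in slots:
            raise ValueError(f"Missing slot: {slot}")
        # Plain string replacement: only slot values vary between runs.
        query = query.replace("{" + slot + "}", slots[slot])
    return query
```

Because prefixes, syntax, and predicates are fixed in the template, only the slot values can vary between runs, which removes all three failure modes above.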
## Enhanced Flow with Template-Based SPARQL

```
User Question
      |
      v
+-------------------------+
| HeritageQueryRouter     |  <-- Classifies intent, extracts entities
+-------------------------+
      |
      v
+-------------------------+
| TemplateClassifier      |  <-- NEW: DSPy module to select template
+-------------------------+
      |
      +--- Template found? ---+
      |                       |
     Yes                      No
      |                       |
      v                       v
+--------------+   +-------------------------+
| Template     |   | HeritageSPARQLGenerator |  <-- Fallback to LLM
| Router       |   +-------------------------+
+--------------+              |
      |                       |
      v                       |
+--------------+              |
| Slot Filler  |              |
+--------------+              |
      |                       |
      v                       |
+--------------+              |
| Template     |              |
| Instantiate  |              |
+--------------+              |
      |                       |
      +-----------+-----------+
                  |
                  v
     +-------------------------+
     | SPARQL Validator        |  <-- Validate before execution
     +-------------------------+
                  |
                  v
     +-------------------------+
     | execute_sparql()        |  <-- Execute validated query
     +-------------------------+
                  |
                  v
     +-------------------------+
     | Qdrant Filter/Enrich    |  <-- Use SPARQL results to filter Qdrant
     +-------------------------+
                  |
                  v
     +-------------------------+
     | HeritageAnswerGenerator |
     +-------------------------+
```
## Integration Points

### 1. HeritageQueryRouter Enhancement

The existing `HeritageQueryRouter` (lines 1442-1594) already extracts:

- `intent`: geographic, statistical, relational, etc.
- `entities`: named entities in the question
- `entity_type`: person, institution, or both
- `target_custodian_type`: MUSEUM, LIBRARY, ARCHIVE, etc.

We add new fields for template routing:
```python
class HeritageQueryRouter(dspy.Module):
    def forward(
        self,
        question: str,
        language: str = "nl",
        history: History | None = None,
    ) -> Prediction:
        # ... existing classification ...

        # NEW: Attempt template classification
        template_match = self.template_classifier.classify(
            question=question,
            intent=result.intent,
            entities=result.entities,
            custodian_type=target_custodian_type,
        )
        return Prediction(
            # ... existing fields ...
            template_id=template_match.template_id,          # NEW
            template_slots=template_match.slots,             # NEW
            template_confidence=template_match.confidence,   # NEW
        )
```
### 2. New TemplateClassifier Module

```python
class TemplateClassifier(dspy.Module):
    """Classify a question to a pre-defined SPARQL template.

    Uses a DSPy Signature with Pydantic output for structured
    classification. This module is optimizable with GEPA alongside
    the existing modules.
    """

    def __init__(self, templates_path: str = "data/sparql_templates.yaml"):
        super().__init__()
        self.templates = self._load_templates(templates_path)
        self.classifier = dspy.ChainOfThought(ClassifyTemplateSignature)

    def forward(
        self,
        question: str,
        intent: str,
        entities: list[str],
        custodian_type: str | None = None,
    ) -> Prediction:
        """Classify the question to a template with slot values."""
        # Build template descriptions for the classifier prompt
        template_descriptions = self._format_template_descriptions()
        result = self.classifier(
            question=question,
            intent=intent,
            entities=entities,
            custodian_type=custodian_type or "UNKNOWN",
            available_templates=template_descriptions,
        )
        return Prediction(
            template_id=result.template_id,
            slots=result.slots,
            confidence=result.confidence,
            reasoning=result.reasoning,
        )
```
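`_format_template_descriptions` is referenced above but not shown. A minimal sketch, assuming each loaded template carries `description` and `slots` keys (that shape is an assumption, not the project's actual schema):

```python
def format_template_descriptions(templates: dict) -> str:
    """Render a one-line-per-template menu for the classifier prompt.

    Assumes each template dict has "description" and "slots" keys;
    this shape is an assumption, not the project's actual schema.
    """
    lines = []
    for template_id, t in sorted(templates.items()):
        slots = ", ".join(t.get("slots", [])) or "none"
        lines.append(f"- {template_id}: {t['description']} (slots: {slots})")
    return "\n".join(lines)
```

Sorting keeps the prompt stable across runs, which matters when the classifier is optimized with GEPA.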
### 3. SPARQL-First Retrieval Pattern

The key integration pattern is using SPARQL results to filter Qdrant vector search:

```python
async def retrieve_with_sparql_filter(
    self,
    question: str,
    routing: Prediction,
    language: str = "nl",
) -> list[dict]:
    """SPARQL-first retrieval with Qdrant enrichment.

    1. Execute SPARQL query (template or LLM-generated)
    2. Extract entity URIs/IDs from results
    3. Use URIs to filter Qdrant search
    4. Combine structured + semantic results
    """
    # Step 1: Generate SPARQL
    if routing.template_id and routing.template_confidence >= 0.7:
        # High confidence: use template-based SPARQL
        # (0.7 matches TEMPLATE_CONFIDENCE_THRESHOLD defined below)
        sparql = self.template_instantiator.instantiate(
            template_id=routing.template_id,
            slots=routing.template_slots,
        )
    else:
        # Fallback to LLM-generated SPARQL
        sparql_result = self.sparql_generator(
            question=question,
            intent=routing.intent,
            entities=routing.entities,
        )
        sparql = sparql_result.sparql

    # Step 2: Validate and execute SPARQL
    is_valid, errors = validate_sparql(sparql)
    if not is_valid:
        logger.warning(f"SPARQL validation failed: {errors}")
        # Fall back to pure vector search
        return await self.qdrant_retriever.search(question, k=10)

    sparql_results = await self.execute_sparql(sparql)

    # Step 3: Extract entity identifiers for Qdrant filtering
    entity_uris = [r.get("institution") or r.get("s") for r in sparql_results]
    ghcids = [self._uri_to_ghcid(uri) for uri in entity_uris if uri]

    # Step 4: Qdrant search with SPARQL-derived filters
    if ghcids:
        # Filter Qdrant to only these entities
        qdrant_results = await self.qdrant_retriever.search(
            query=question,
            k=20,
            filter={"ghcid": {"$in": ghcids}},
        )
    else:
        # SPARQL returned no results: use semantic search
        qdrant_results = await self.qdrant_retriever.search(question, k=10)

    # Step 5: Merge and deduplicate
    return self._merge_results(sparql_results, qdrant_results)
```
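`_uri_to_ghcid` and `_merge_results` are used above but not defined in this document. A possible sketch, assuming the ghcid is the last path segment of the entity URI and that results may carry a `ghcid`, `institution`, or `s` field (all assumptions):

```python
def uri_to_ghcid(uri: str) -> str:
    # Assumption: the ghcid is the last path segment of the entity URI.
    return uri.rstrip("/").rsplit("/", 1)[-1]


def merge_results(sparql_results: list, qdrant_results: list) -> list:
    """Keep structured results first, then semantic hits not already seen.

    Deduplicates on a best-effort key; the field names are assumptions.
    """
    def key(r: dict) -> str:
        return r.get("ghcid") or r.get("institution") or r.get("s") or repr(r)

    merged: dict = {}
    for r in [*sparql_results, *qdrant_results]:
        # setdefault keeps the first (SPARQL) record for each key
        merged.setdefault(key(r), r)
    return list(merged.values())
```

Putting SPARQL results first means structured matches win ties over semantic ones during deduplication.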
### 4. Integration with execute_sparql()

The existing `execute_sparql()` method (around line 1945 in the tool definitions) needs enhancement:

```python
async def execute_sparql(
    self,
    sparql_query: str,
    validate: bool = True,
    auto_correct: bool = True,
) -> list[dict]:
    """Execute a SPARQL query with validation and auto-correction.

    Args:
        sparql_query: SPARQL query string
        validate: Whether to validate before execution
        auto_correct: Whether to attempt auto-correction on syntax errors

    Returns:
        List of result bindings as dictionaries
    """
    # Import auto-correction from the existing module
    from glam_extractor.api.sparql_linter import auto_correct_sparql, validate_sparql

    # Step 1: Validate
    if validate:
        is_valid, errors = validate_sparql(sparql_query)
        if not is_valid:
            if auto_correct:
                sparql_query, was_modified = auto_correct_sparql(sparql_query)
                if was_modified:
                    logger.info("SPARQL auto-corrected")
                # Re-validate after correction
                is_valid, errors = validate_sparql(sparql_query)
                if not is_valid:
                    raise ValueError(f"SPARQL still invalid after correction: {errors}")
            else:
                raise ValueError(f"Invalid SPARQL: {errors}")

    # Step 2: Execute
    try:
        response = await self.http_client.post(
            self.sparql_endpoint,
            data={"query": sparql_query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=30.0,
        )
        response.raise_for_status()
        data = response.json()
        bindings = data.get("results", {}).get("bindings", [])
        # Convert SPARQL JSON bindings to simple dicts
        return [
            {k: v.get("value") for k, v in binding.items()}
            for binding in bindings
        ]
    except httpx.HTTPStatusError as e:
        logger.error(f"SPARQL execution failed: {e.response.status_code}")
        if e.response.status_code == 400:
            # Query syntax error: log the query for debugging
            logger.error(f"Query: {sparql_query}")
        raise
```
## Fallback Behavior

When template-based SPARQL fails or returns no results:

```python
class HeritageRAG(dspy.Module):
    async def answer(self, question: str, language: str = "nl") -> Prediction:
        # Route the question
        routing = self.router(question=question, language=language)

        # Try template-based retrieval first
        results = []
        sparql_used = None
        if routing.template_id:
            try:
                sparql_used = self.instantiate_template(routing)
                results = await self.execute_sparql(sparql_used)
            except Exception as e:
                logger.warning(f"Template SPARQL failed: {e}")
                # Fall through to LLM generation

        # Fallback 1: LLM-generated SPARQL
        if not results:
            try:
                sparql_result = self.sparql_generator(
                    question=question,
                    intent=routing.intent,
                    entities=routing.entities,
                )
                sparql_used = sparql_result.sparql
                results = await self.execute_sparql(sparql_used)
            except Exception as e:
                logger.warning(f"LLM SPARQL failed: {e}")

        # Fallback 2: Pure vector search
        if not results:
            logger.info("Falling back to pure vector search")
            results = await self.qdrant_retriever.search(question, k=10)
            sparql_used = None

        # Generate the answer
        context = self._format_results(results)
        answer = self.answer_generator(
            question=question,
            context=context,
            sources=["sparql" if sparql_used else "qdrant"],
            language=language,
        )
        return Prediction(
            answer=answer.answer,
            citations=answer.citations,
            sparql_query=sparql_used,
            result_count=len(results),
        )
```
## Template Confidence Thresholds

| Confidence | Behavior |
|---|---|
| > 0.9 | Use template directly, skip LLM |
| 0.7 - 0.9 | Use template, validate with LLM |
| 0.5 - 0.7 | Use template but prepare LLM fallback |
| < 0.5 | Skip template, use LLM directly |

```python
TEMPLATE_CONFIDENCE_THRESHOLD = 0.7


def should_use_template(confidence: float) -> bool:
    return confidence >= TEMPLATE_CONFIDENCE_THRESHOLD
## Qdrant Collection Integration

For person queries (`entity_type == "person"`), the flow integrates with the `heritage_persons` collection:
```python
async def retrieve_persons_with_sparql(
    self,
    question: str,
    routing: Prediction,
) -> list[dict]:
    """Retrieve person records using a SPARQL + Qdrant hybrid.

    SPARQL queries the RDF graph for schema:Person records.
    Qdrant filters by custodian_slug if extracted from routing.
    """
    if routing.entity_type != "person":
        # Guard: `sparql` below is only defined for person queries
        return []

    # Generate and execute person-specific SPARQL
    sparql = self.person_sparql_generator(
        question=question,
        intent=routing.intent,
        entities=routing.entities,
    )
    sparql_results = await self.execute_sparql(sparql.sparql)

    # Use results to filter Qdrant
    if routing.target_custodian_slug:
        # Filter by institution
        qdrant_results = await self.qdrant_retriever.search_persons(
            query=question,
            filter_custodian=routing.target_custodian_slug,
            k=20,
        )
    else:
        # Post-filter semantic hits by SPARQL-derived names
        names = {r["name"] for r in sparql_results if r.get("name")}
        qdrant_results = await self.qdrant_retriever.search_persons(
            query=question,
            k=20,
        )
        if names:
            qdrant_results = [
                r for r in qdrant_results if r.get("name") in names
            ]

    return self._merge_person_results(sparql_results, qdrant_results)
```
## Metrics and Monitoring

Track template vs LLM query performance:

```python
@dataclass
class QueryMetrics:
    """Metrics for SPARQL query execution."""

    template_used: bool
    template_id: str | None
    confidence: float
    sparql_valid: bool
    execution_time_ms: float
    result_count: int
    fallback_used: bool
    error: str | None = None


async def execute_with_metrics(
    self,
    sparql: str,
    template_id: str | None = None,
    confidence: float = 0.0,
) -> tuple[list[dict], QueryMetrics]:
    """Execute SPARQL and collect metrics."""
    start = time.perf_counter()
    metrics = QueryMetrics(
        template_used=template_id is not None,
        template_id=template_id,
        confidence=confidence,
        sparql_valid=True,
        execution_time_ms=0,
        result_count=0,
        fallback_used=False,
    )
    try:
        results = await self.execute_sparql(sparql)
        metrics.result_count = len(results)
    except Exception as e:
        metrics.sparql_valid = False
        metrics.error = str(e)
        results = []
    metrics.execution_time_ms = (time.perf_counter() - start) * 1000

    # Log for monitoring
    logger.info(
        f"SPARQL {'template' if metrics.template_used else 'LLM'} "
        f"[{template_id or 'generated'}] "
        f"valid={metrics.sparql_valid} "
        f"results={metrics.result_count} "
        f"time={metrics.execution_time_ms:.1f}ms"
    )
    return results, metrics
```
## File Changes Required

| File | Changes |
|---|---|
| `backend/rag/dspy_heritage_rag.py` | Add `TemplateClassifier`, modify `HeritageQueryRouter` |
| `backend/rag/template_sparql.py` | NEW: template loading, instantiation, validation |
| `backend/rag/sparql_templates.yaml` | NEW: template definitions |
| `src/glam_extractor/api/sparql_linter.py` | Enhance validation for template output |
| `data/validation/sparql_validation_rules.json` | Already exists; source for slot values |
## Testing Integration

```python
# tests/rag/test_template_integration.py

@pytest.mark.asyncio
async def test_template_sparql_execution():
    """Test that template SPARQL executes successfully."""
    rag = HeritageRAG()
    # Question that should match a template:
    # "Which archives are there in Drenthe?"
    result = await rag.answer(
        question="Welke archieven zijn er in Drenthe?",
        language="nl",
    )
    assert result.sparql_query is not None
    assert "NL-DR" in result.sparql_query
    assert result.result_count > 0


@pytest.mark.asyncio
async def test_template_fallback_to_llm():
    """Test fallback to LLM for unmatched questions."""
    rag = HeritageRAG()
    # Question unlikely to match any template:
    # "What is the relationship between the Rijksmuseum and Vincent van Gogh?"
    result = await rag.answer(
        question="Wat is de relatie tussen het Rijksmuseum en Vincent van Gogh?",
        language="nl",
    )
    # Should still produce an answer via the LLM fallback
    assert result.answer is not None
```
## Summary

The template-based SPARQL integration follows these principles:

- **SPARQL-first**: try precise structured queries before semantic search
- **Graceful degradation**: template → LLM SPARQL → vector search
- **DSPy compatibility**: the template classifier is a DSPy module, optimizable with GEPA
- **Metrics-driven**: track template vs LLM performance for continuous improvement
- **Hybrid results**: merge SPARQL + Qdrant results for comprehensive answers