# RAG Integration: SPARQL-First Retrieval Flow

## Overview

This document describes how template-based SPARQL generation integrates with the existing Heritage RAG system in `dspy_heritage_rag.py`. The key principle is **SPARQL-first retrieval**: use precise SPARQL queries to filter structured data before falling back to semantic vector search.
## Current RAG Architecture

The existing `HeritageRAG` class in `backend/rag/dspy_heritage_rag.py` follows this flow:

```
User Question
      |
      v
+-------------------------+
| HeritageQueryRouter     |  <-- Classifies intent, entity_type, sources
+-------------------------+
      |
      v
+-------------------------+
| HeritageSPARQLGenerator |  <-- LLM generates SPARQL (PROBLEMATIC)
+-------------------------+
      |
      v
+-------------------------+
| execute_sparql()        |  <-- Execute against Oxigraph endpoint
+-------------------------+
      |
      v
+-------------------------+
| Qdrant Vector Search    |  <-- Semantic search for context enrichment
+-------------------------+
      |
      v
+-------------------------+
| HeritageAnswerGenerator |  <-- Synthesize final answer
+-------------------------+
```
## Problem with Current Flow

The `HeritageSPARQLGenerator` (lines 344-478) uses an LLM to generate SPARQL queries. Despite extensive docstrings with examples, this produces:

- **Syntax errors**: orphaned punctuation, missing prefixes
- **Invalid predicates**: uses CIDOC-CRM properties that are not in our ontology
- **Inconsistent results**: the same question generates different queries
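To make the contrast concrete, here is a minimal sketch of a template entry and its instantiation, written as plain Python. The template ID, slot names, and predicates are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical template entry; the ID, slots, and predicates are
# illustrative assumptions, not the project's actual schema.
ARCHIVES_IN_PROVINCE = {
    "id": "archives_in_province",
    "description": "List archives located in a given province",
    "slots": ["province_code"],
    "sparql": """
        PREFIX schema: <https://schema.org/>
        SELECT ?institution ?name WHERE {
            ?institution a schema:ArchiveOrganization ;
                         schema:name ?name ;
                         schema:identifier ?id .
            FILTER(STRSTARTS(?id, "{province_code}"))
        }
    """,
}


def instantiate(template: dict, slots: dict) -> str:
    """Fill slot placeholders; raise if a required slot is missing."""
    query = template["sparql"]
    for slot in template["slots"]:
        if slot not in slots:
            raise ValueError(f"Missing slot: {slot}")
        # Plain string replacement: only slot values vary between runs.
        query = query.replace("{" + slot + "}", slots[slot])
    return query
```

Because prefixes, syntax, and predicates are fixed in the template, only the slot values can vary between runs, which removes all three failure modes above.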
## Enhanced Flow with Template-Based SPARQL

```
User Question
      |
      v
+-------------------------+
| HeritageQueryRouter     |  <-- Classifies intent, extracts entities
+-------------------------+
      |
      v
+-------------------------+
| TemplateClassifier      |  <-- NEW: DSPy module to select template
+-------------------------+
      |
      +--- Template found? ---+
      |                       |
     Yes                      No
      |                       |
      v                       v
+--------------+   +-------------------------+
| Template     |   | HeritageSPARQLGenerator |  <-- Fallback to LLM
| Router       |   +-------------------------+
+--------------+              |
      |                       |
      v                       |
+--------------+              |
| Slot Filler  |              |
+--------------+              |
      |                       |
      v                       |
+--------------+              |
| Template     |              |
| Instantiate  |              |
+--------------+              |
      |                       |
      +-----------+-----------+
                  |
                  v
     +-------------------------+
     | SPARQL Validator        |  <-- Validate before execution
     +-------------------------+
                  |
                  v
     +-------------------------+
     | execute_sparql()        |  <-- Execute validated query
     +-------------------------+
                  |
                  v
     +-------------------------+
     | Qdrant Filter/Enrich    |  <-- Use SPARQL results to filter Qdrant
     +-------------------------+
                  |
                  v
     +-------------------------+
     | HeritageAnswerGenerator |
     +-------------------------+
```
## Integration Points

### 1. HeritageQueryRouter Enhancement

The existing `HeritageQueryRouter` (lines 1442-1594) already extracts:

- `intent`: geographic, statistical, relational, etc.
- `entities`: named entities in the question
- `entity_type`: person, institution, or both
- `target_custodian_type`: MUSEUM, LIBRARY, ARCHIVE, etc.

We add new fields for template routing:
```python
class HeritageQueryRouter(dspy.Module):
    def forward(
        self,
        question: str,
        language: str = "nl",
        history: History | None = None,
    ) -> Prediction:
        # ... existing classification ...

        # NEW: Attempt template classification
        template_match = self.template_classifier.classify(
            question=question,
            intent=result.intent,
            entities=result.entities,
            custodian_type=target_custodian_type,
        )
        return Prediction(
            # ... existing fields ...
            template_id=template_match.template_id,          # NEW
            template_slots=template_match.slots,             # NEW
            template_confidence=template_match.confidence,   # NEW
        )
```
### 2. New TemplateClassifier Module

```python
class TemplateClassifier(dspy.Module):
    """Classify a question to a pre-defined SPARQL template.

    Uses a DSPy Signature with Pydantic output for structured
    classification. This module is optimizable with GEPA alongside
    the existing modules.
    """

    def __init__(self, templates_path: str = "data/sparql_templates.yaml"):
        super().__init__()
        self.templates = self._load_templates(templates_path)
        self.classifier = dspy.ChainOfThought(ClassifyTemplateSignature)

    def forward(
        self,
        question: str,
        intent: str,
        entities: list[str],
        custodian_type: str | None = None,
    ) -> Prediction:
        """Classify the question to a template with slot values."""
        # Build template descriptions for the classifier prompt
        template_descriptions = self._format_template_descriptions()
        result = self.classifier(
            question=question,
            intent=intent,
            entities=entities,
            custodian_type=custodian_type or "UNKNOWN",
            available_templates=template_descriptions,
        )
        return Prediction(
            template_id=result.template_id,
            slots=result.slots,
            confidence=result.confidence,
            reasoning=result.reasoning,
        )
```
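`_format_template_descriptions` is referenced above but not shown. A minimal sketch, assuming each loaded template carries `description` and `slots` keys (that shape is an assumption, not the project's actual schema):

```python
def format_template_descriptions(templates: dict) -> str:
    """Render a one-line-per-template menu for the classifier prompt.

    Assumes each template dict has "description" and "slots" keys;
    this shape is an assumption, not the project's actual schema.
    """
    lines = []
    for template_id, t in sorted(templates.items()):
        slots = ", ".join(t.get("slots", [])) or "none"
        lines.append(f"- {template_id}: {t['description']} (slots: {slots})")
    return "\n".join(lines)
```

Sorting keeps the prompt stable across runs, which matters when the classifier is optimized with GEPA.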
### 3. SPARQL-First Retrieval Pattern

The key integration pattern is using SPARQL results to filter Qdrant vector search:

```python
async def retrieve_with_sparql_filter(
    self,
    question: str,
    routing: Prediction,
    language: str = "nl",
) -> list[dict]:
    """SPARQL-first retrieval with Qdrant enrichment.

    1. Execute SPARQL query (template or LLM-generated)
    2. Extract entity URIs/IDs from results
    3. Use URIs to filter Qdrant search
    4. Combine structured + semantic results
    """
    # Step 1: Generate SPARQL
    if routing.template_id and routing.template_confidence >= 0.7:
        # High confidence: use template-based SPARQL
        # (0.7 matches TEMPLATE_CONFIDENCE_THRESHOLD defined below)
        sparql = self.template_instantiator.instantiate(
            template_id=routing.template_id,
            slots=routing.template_slots,
        )
    else:
        # Fallback to LLM-generated SPARQL
        sparql_result = self.sparql_generator(
            question=question,
            intent=routing.intent,
            entities=routing.entities,
        )
        sparql = sparql_result.sparql

    # Step 2: Validate and execute SPARQL
    is_valid, errors = validate_sparql(sparql)
    if not is_valid:
        logger.warning(f"SPARQL validation failed: {errors}")
        # Fall back to pure vector search
        return await self.qdrant_retriever.search(question, k=10)

    sparql_results = await self.execute_sparql(sparql)

    # Step 3: Extract entity identifiers for Qdrant filtering
    entity_uris = [r.get("institution") or r.get("s") for r in sparql_results]
    ghcids = [self._uri_to_ghcid(uri) for uri in entity_uris if uri]

    # Step 4: Qdrant search with SPARQL-derived filters
    if ghcids:
        # Filter Qdrant to only these entities
        qdrant_results = await self.qdrant_retriever.search(
            query=question,
            k=20,
            filter={"ghcid": {"$in": ghcids}},
        )
    else:
        # SPARQL returned no results: use semantic search
        qdrant_results = await self.qdrant_retriever.search(question, k=10)

    # Step 5: Merge and deduplicate
    return self._merge_results(sparql_results, qdrant_results)
```
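`_uri_to_ghcid` and `_merge_results` are used above but not defined in this document. A possible sketch, assuming the ghcid is the last path segment of the entity URI and that results may carry a `ghcid`, `institution`, or `s` field (all assumptions):

```python
def uri_to_ghcid(uri: str) -> str:
    # Assumption: the ghcid is the last path segment of the entity URI.
    return uri.rstrip("/").rsplit("/", 1)[-1]


def merge_results(sparql_results: list, qdrant_results: list) -> list:
    """Keep structured results first, then semantic hits not already seen.

    Deduplicates on a best-effort key; the field names are assumptions.
    """
    def key(r: dict) -> str:
        return r.get("ghcid") or r.get("institution") or r.get("s") or repr(r)

    merged: dict = {}
    for r in [*sparql_results, *qdrant_results]:
        # setdefault keeps the first (SPARQL) record for each key
        merged.setdefault(key(r), r)
    return list(merged.values())
```

Putting SPARQL results first means structured matches win ties over semantic ones during deduplication.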
### 4. Integration with execute_sparql()

The existing `execute_sparql()` method (around line 1945 in the tool definitions) needs enhancement:

```python
async def execute_sparql(
    self,
    sparql_query: str,
    validate: bool = True,
    auto_correct: bool = True,
) -> list[dict]:
    """Execute a SPARQL query with validation and auto-correction.

    Args:
        sparql_query: SPARQL query string
        validate: Whether to validate before execution
        auto_correct: Whether to attempt auto-correction on syntax errors

    Returns:
        List of result bindings as dictionaries
    """
    # Import auto-correction from the existing module
    from glam_extractor.api.sparql_linter import auto_correct_sparql, validate_sparql

    # Step 1: Validate
    if validate:
        is_valid, errors = validate_sparql(sparql_query)
        if not is_valid:
            if auto_correct:
                sparql_query, was_modified = auto_correct_sparql(sparql_query)
                if was_modified:
                    logger.info("SPARQL auto-corrected")
                # Re-validate after correction
                is_valid, errors = validate_sparql(sparql_query)
                if not is_valid:
                    raise ValueError(f"SPARQL still invalid after correction: {errors}")
            else:
                raise ValueError(f"Invalid SPARQL: {errors}")

    # Step 2: Execute
    try:
        response = await self.http_client.post(
            self.sparql_endpoint,
            data={"query": sparql_query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=30.0,
        )
        response.raise_for_status()
        data = response.json()
        bindings = data.get("results", {}).get("bindings", [])
        # Convert SPARQL JSON bindings to simple dicts
        return [
            {k: v.get("value") for k, v in binding.items()}
            for binding in bindings
        ]
    except httpx.HTTPStatusError as e:
        logger.error(f"SPARQL execution failed: {e.response.status_code}")
        if e.response.status_code == 400:
            # Query syntax error: log the query for debugging
            logger.error(f"Query: {sparql_query}")
        raise
```
## Fallback Behavior

When template-based SPARQL fails or returns no results:

```python
class HeritageRAG(dspy.Module):
    async def answer(self, question: str, language: str = "nl") -> Prediction:
        # Route the question
        routing = self.router(question=question, language=language)

        # Try template-based retrieval first
        results = []
        sparql_used = None
        if routing.template_id:
            try:
                sparql_used = self.instantiate_template(routing)
                results = await self.execute_sparql(sparql_used)
            except Exception as e:
                logger.warning(f"Template SPARQL failed: {e}")
                # Fall through to LLM generation

        # Fallback 1: LLM-generated SPARQL
        if not results:
            try:
                sparql_result = self.sparql_generator(
                    question=question,
                    intent=routing.intent,
                    entities=routing.entities,
                )
                sparql_used = sparql_result.sparql
                results = await self.execute_sparql(sparql_used)
            except Exception as e:
                logger.warning(f"LLM SPARQL failed: {e}")

        # Fallback 2: Pure vector search
        if not results:
            logger.info("Falling back to pure vector search")
            results = await self.qdrant_retriever.search(question, k=10)
            sparql_used = None

        # Generate the answer
        context = self._format_results(results)
        answer = self.answer_generator(
            question=question,
            context=context,
            sources=["sparql" if sparql_used else "qdrant"],
            language=language,
        )
        return Prediction(
            answer=answer.answer,
            citations=answer.citations,
            sparql_query=sparql_used,
            result_count=len(results),
        )
```
## Template Confidence Thresholds

| Confidence | Behavior |
|---|---|
| > 0.9 | Use template directly, skip LLM |
| 0.7 - 0.9 | Use template, validate with LLM |
| 0.5 - 0.7 | Use template but prepare LLM fallback |
| < 0.5 | Skip template, use LLM directly |

```python
TEMPLATE_CONFIDENCE_THRESHOLD = 0.7


def should_use_template(confidence: float) -> bool:
    return confidence >= TEMPLATE_CONFIDENCE_THRESHOLD
## Qdrant Collection Integration

For person queries (`entity_type == "person"`), the flow integrates with the `heritage_persons` collection:
```python
async def retrieve_persons_with_sparql(
    self,
    question: str,
    routing: Prediction,
) -> list[dict]:
    """Retrieve person records using a SPARQL + Qdrant hybrid.

    SPARQL queries the RDF graph for schema:Person records.
    Qdrant filters by custodian_slug if extracted from routing.
    """
    if routing.entity_type != "person":
        # Guard: `sparql` below is only defined for person queries
        return []

    # Generate and execute person-specific SPARQL
    sparql = self.person_sparql_generator(
        question=question,
        intent=routing.intent,
        entities=routing.entities,
    )
    sparql_results = await self.execute_sparql(sparql.sparql)

    # Use results to filter Qdrant
    if routing.target_custodian_slug:
        # Filter by institution
        qdrant_results = await self.qdrant_retriever.search_persons(
            query=question,
            filter_custodian=routing.target_custodian_slug,
            k=20,
        )
    else:
        # Post-filter semantic hits by SPARQL-derived names
        names = {r["name"] for r in sparql_results if r.get("name")}
        qdrant_results = await self.qdrant_retriever.search_persons(
            query=question,
            k=20,
        )
        if names:
            qdrant_results = [
                r for r in qdrant_results if r.get("name") in names
            ]

    return self._merge_person_results(sparql_results, qdrant_results)
```
## Metrics and Monitoring

Track template vs LLM query performance:

```python
@dataclass
class QueryMetrics:
    """Metrics for SPARQL query execution."""

    template_used: bool
    template_id: str | None
    confidence: float
    sparql_valid: bool
    execution_time_ms: float
    result_count: int
    fallback_used: bool
    error: str | None = None


async def execute_with_metrics(
    self,
    sparql: str,
    template_id: str | None = None,
    confidence: float = 0.0,
) -> tuple[list[dict], QueryMetrics]:
    """Execute SPARQL and collect metrics."""
    start = time.perf_counter()
    metrics = QueryMetrics(
        template_used=template_id is not None,
        template_id=template_id,
        confidence=confidence,
        sparql_valid=True,
        execution_time_ms=0,
        result_count=0,
        fallback_used=False,
    )
    try:
        results = await self.execute_sparql(sparql)
        metrics.result_count = len(results)
    except Exception as e:
        metrics.sparql_valid = False
        metrics.error = str(e)
        results = []
    metrics.execution_time_ms = (time.perf_counter() - start) * 1000

    # Log for monitoring
    logger.info(
        f"SPARQL {'template' if metrics.template_used else 'LLM'} "
        f"[{template_id or 'generated'}] "
        f"valid={metrics.sparql_valid} "
        f"results={metrics.result_count} "
        f"time={metrics.execution_time_ms:.1f}ms"
    )
    return results, metrics
```
## File Changes Required

| File | Changes |
|---|---|
| `backend/rag/dspy_heritage_rag.py` | Add `TemplateClassifier`, modify `HeritageQueryRouter` |
| `backend/rag/template_sparql.py` | NEW: template loading, instantiation, validation |
| `backend/rag/sparql_templates.yaml` | NEW: template definitions |
| `src/glam_extractor/api/sparql_linter.py` | Enhance validation for template output |
| `data/validation/sparql_validation_rules.json` | Already exists; source for slot values |
## Testing Integration

```python
# tests/rag/test_template_integration.py

@pytest.mark.asyncio
async def test_template_sparql_execution():
    """Test that template SPARQL executes successfully."""
    rag = HeritageRAG()
    # Question that should match a template:
    # "Which archives are there in Drenthe?"
    result = await rag.answer(
        question="Welke archieven zijn er in Drenthe?",
        language="nl",
    )
    assert result.sparql_query is not None
    assert "NL-DR" in result.sparql_query
    assert result.result_count > 0


@pytest.mark.asyncio
async def test_template_fallback_to_llm():
    """Test fallback to LLM for unmatched questions."""
    rag = HeritageRAG()
    # Question unlikely to match any template:
    # "What is the relationship between the Rijksmuseum and Vincent van Gogh?"
    result = await rag.answer(
        question="Wat is de relatie tussen het Rijksmuseum en Vincent van Gogh?",
        language="nl",
    )
    # Should still produce an answer via the LLM fallback
    assert result.answer is not None
```
## Summary

The template-based SPARQL integration follows these principles:

- **SPARQL-first**: try precise structured queries before semantic search
- **Graceful degradation**: template → LLM SPARQL → vector search
- **DSPy compatibility**: the template classifier is a DSPy module, optimizable with GEPA
- **Metrics-driven**: track template vs LLM performance for continuous improvement
- **Hybrid results**: merge SPARQL + Qdrant results for comprehensive answers