# SOTA Analysis: Template-Based SPARQL Generation

Date: 2025-01-07 | Status: Active Research | Author: OpenCode
## Executive Summary
Based on comprehensive research of 2024-2025 academic papers and industry practices, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.
Key Finding: Our 3-tier architecture (regex → embedding → LLM) is well aligned with SOTA hybrid approaches. The primary improvement opportunities are:
- Add a RAG-enhanced tier between embedding and LLM
- Implement a SPARQL validation feedback loop
- Add schema-aware slot filling
- Apply GEPA optimization to DSPy modules
## 1. Research Survey
### 1.1 SPARQL-LLM (arXiv:2512.14277, Dec 2024)
Key Innovation: Real-time SPARQL generation with 24% F1 improvement over TEXT2SPARQL winners.
Architecture:
```
User Question
                    ↓
┌─────────────────────────────────────────┐
│ Metadata Indexer                        │
│ - Schema classes/properties indexed     │
│ - Example Q&A pairs vectorized          │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Prompt Builder (RAG)                    │
│ - Retrieve similar examples             │
│ - Retrieve relevant schema fragments    │
│ - Compose context-rich prompt           │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ SPARQL Generator                        │
│ - LLM generates SPARQL                  │
│ - Validation against schema             │
│ - Iterative correction loop             │
└─────────────────────────────────────────┘
                    ↓
Validated SPARQL
```
Relevance to GLAM:
- ✅ We have schema (LinkML) but don't use it in prompts
- ✅ We have example Q&A in templates but don't retrieve semantically
- ❌ Missing: Schema-aware validation loop
### 1.2 COT-SPARQL (SEMANTICS 2024)
Key Innovation: Chain-of-Thought prompting with context injection.
Two Context Types:
- Context A: Entity and relation extraction from question
- Context B: Most semantically similar example from training set
Performance: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9
Relevance to GLAM:
- ✅ Our embedding matcher finds similar patterns (partial Context B)
- ❌ Missing: Entity/relation extraction step (Context A)
- ❌ Missing: CoT prompting in LLM tier
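As a concrete illustration, the two context types could be injected into a CoT prompt roughly as follows (a minimal sketch; the `Example` structure and field names are assumptions for illustration, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    sparql: str

def build_cot_prompt(question: str, entities: list[str], relations: list[str],
                     similar: Example) -> str:
    """Compose a Chain-of-Thought prompt carrying both context types."""
    return "\n".join([
        # Context A: entities and relations extracted from the question
        f"Entities: {', '.join(entities)}",
        f"Relations: {', '.join(relations)}",
        # Context B: the most semantically similar solved example
        f"Similar question: {similar.question}",
        f"Its SPARQL: {similar.sparql}",
        "Think step by step, then write the SPARQL query for:",
        question,
    ])
```

The entity/relation extraction feeding Context A would itself be a small DSPy module or NER step; Context B can reuse the Tier 2 embedding index.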
### 1.3 KGQuest (arXiv:2511.11258, Nov 2024)
Key Innovation: Deterministic template generation + LLM refinement.
Architecture:
```
KG Triplets
     ↓
Cluster by relation type
     ↓
Generate rule-based templates (deterministic)
     ↓
LLM refinement for fluency (lightweight, controlled)
```
Relevance to GLAM:
- ✅ Validates our template-first approach
- ✅ We use deterministic templates with LLM fallback
- 💡 Insight: Use LLM only for refinement, not generation
### 1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)
Key Innovation: Explicit tiered architecture with fallback.
Recommended Pattern:
```python
def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)
    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)
```
Relevance to GLAM:
- ✅ We already implement this pattern
- 💡 Our threshold is 0.75, could consider raising for higher precision
### 1.5 GEPA Optimization (DSPy, 2024-2025)
Key Innovation: Genetic-Pareto optimization for prompt evolution.
Approach:
- Dual-model: Cheap student LM + Smart reflection LM
- Iterate: Run → Analyze failures → Generate improved prompts
- Results: 10-20% accuracy improvements typical
Relevance to GLAM:
- ❌ We use static DSPy signatures without optimization
- 💡 Could apply GEPA to TemplateClassifier and SlotExtractor
### 1.6 Intent-Driven Hybrid Architecture (2024)
Key Pattern: Intent classification → Template selection → Slot filling → LLM fallback
```
User Query
                    ↓
┌─────────────────────────────────────────┐
│ Intent Classifier                       │
│ - Embedding-based classification        │
│ - Hierarchical intent taxonomy          │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Template Selector                       │
│ - Map intent → available templates      │
│ - FAISS/vector retrieval for similar    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Slot Filler                             │
│ - Schema-aware extraction               │
│ - Validation against ontology           │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ LLM Fallback                            │
│ - Only when template fails              │
│ - Constrained generation                │
└─────────────────────────────────────────┘
```
Relevance to GLAM:
- ✅ We have semantic router for intent
- ✅ We have template classification
- ❌ Missing: Hierarchical intent taxonomy
- ❌ Missing: Schema-aware slot validation
## 2. Current GLAM Architecture
### 2.1 Current 3-Tier System
```
User Question
                    ↓
┌─────────────────────────────────────────┐
│ TIER 1: Pattern Matching                │
│ - Regex-based template matching         │
│ - Slot type validation                  │
│ - Confidence ≥ 0.75 required            │
│ - ~1ms latency                          │
└─────────────────────────────────────────┘
                    ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 2: Embedding Matching              │
│ - Sentence-transformer embeddings       │
│ - Cosine similarity ≥ 0.70              │
│ - ~50ms latency (cached)                │
└─────────────────────────────────────────┘
                    ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 3: LLM Classification              │
│ - DSPy ChainOfThought                   │
│ - Template ID classification            │
│ - ~500-2000ms latency                   │
└─────────────────────────────────────────┘
                    ↓
Slot Extraction (DSPy)
                    ↓
Template Instantiation (Jinja2)
                    ↓
SPARQL Query
```
### 2.2 Strengths
| Aspect | Current Implementation | Rating |
|---|---|---|
| Deterministic first | Regex before embeddings before LLM | ⭐⭐⭐⭐⭐ |
| Semantic similarity | Sentence-transformer embeddings | ⭐⭐⭐⭐ |
| Multilingual | Dutch/English/German patterns | ⭐⭐⭐⭐ |
| Conversation context | Context resolver for follow-ups | ⭐⭐⭐⭐ |
| Relevance filtering | Fyke filter for out-of-scope | ⭐⭐⭐⭐ |
| Slot resolution | Synonym resolver with fuzzy match | ⭐⭐⭐⭐ |
| Template variants | Region/country/ISIL variants | ⭐⭐⭐⭐ |
### 2.3 Gaps vs SOTA
| Gap | SOTA Reference | Impact | Priority |
|---|---|---|---|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |
## 3. Proposed Improvements
### 3.1 Add Tier 2.5: RAG-Enhanced Matching
Insert a RAG tier between embedding matching and LLM fallback:
```python
from collections import Counter
from typing import Optional

class RAGEnhancedMatcher:
    """Context-enriched matching using similar examples.

    Runs after the embedding match fails, before the LLM fallback."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve the top-3 most similar Q&A examples from the template YAML
        similar_examples = self._retrieve_similar_examples(question, k=3)
        if not similar_examples:
            return None
        # Check whether the examples strongly suggest a single template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]
        if count >= 2:  # 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.85 (2/3) or 0.90 (3/3)
                reasoning=f"RAG: {count}/3 similar examples use {top_template}",
            )
        return None
```
Benefits:
- Handles paraphrases that embeddings miss
- Uses existing example data in templates
- Cheaper than LLM fallback
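The `_retrieve_similar_examples` step above is left abstract. One way to implement it, assuming the template examples are pre-embedded with the same sentence-transformer used in Tier 2 (the names here are illustrative):

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class IndexedExample:
    template_id: str
    embedding: np.ndarray  # precomputed sentence embedding of the example question

def retrieve_similar_examples(query_vec: np.ndarray,
                              index: list[IndexedExample],
                              k: int = 3) -> list[IndexedExample]:
    """Return the k indexed examples closest to the query by cosine similarity."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    # Brute-force scan; fine at the scale of a few hundred template examples
    return sorted(index, key=lambda ex: cosine(query_vec, ex.embedding),
                  reverse=True)[:k]
```

At this corpus size a linear scan is cheap; a FAISS index only pays off at a much larger example count.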
### 3.2 Add SPARQL Validation Feedback Loop
After template instantiation, validate SPARQL against schema:
````python
import re
from pathlib import Path

class SPARQLValidator:
    """Validates generated SPARQL against the ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)

    def validate(self, sparql: str) -> ValidationResult:
        errors = []
        # Extract predicates used in the query
        predicates = re.findall(r'(hc:\w+|schema:\w+)', sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")
        # Extract classes
        classes = re.findall(r'a\s+(hcc:\w+)', sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")
        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors),
        )

    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use an LLM to correct validation errors."""
        prompt = f"""
The following SPARQL query has errors:
```sparql
{sparql}
```
Errors found:
{chr(10).join(f'- {e}' for e in errors)}
Correct the query. Return only the corrected SPARQL.
"""
        # Call the LLM for correction
        return self._call_llm(prompt)
````
Benefits:
- Catches schema mismatches before execution
- Enables iterative correction (SPARQL-LLM pattern)
- Reduces runtime errors
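The iterative correction itself can be a small driver around `validate` and `correct_with_llm`; sketched here with the validator and LLM passed in as callables so the retry policy stays testable (the bound of three rounds is an assumption, not a fixed requirement):

```python
from typing import Callable

def validate_with_retries(sparql: str,
                          validate: Callable[[str], list[str]],
                          correct: Callable[[str, list[str]], str],
                          max_rounds: int = 3) -> tuple[str, list[str]]:
    """Iteratively validate and LLM-correct a query, SPARQL-LLM style."""
    for _ in range(max_rounds):
        errors = validate(sparql)
        if not errors:
            return sparql, []             # query passed validation
        sparql = correct(sparql, errors)  # ask the LLM for a fix
    return sparql, validate(sparql)       # give up; report remaining errors
```

Bounding the rounds keeps a stubbornly invalid query from looping on LLM calls; the caller can fall back to an error message when errors remain.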
### 3.3 Schema-Aware Slot Filling
Use ontology to validate extracted slot values:
```python
import logging
from pathlib import Path

import dspy

logger = logging.getLogger(__name__)

class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""

    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)

    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction
        raw_slots = self.extract(question=question, ...)

        # Validate against the ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check if the value maps to a valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try a fuzzy match against the ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} → {match}")
            else:
                # Slots without an ontology constraint pass through unchanged
                validated_slots[slot_name] = value
        return validated_slots
```
Benefits:
- Ensures slot values are ontology-compliant
- Auto-corrects minor extraction errors
- Reduces downstream SPARQL errors
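The `_fuzzy_match_ontology` helper can be as simple as a `difflib` lookup over the ontology's terms (a sketch; in practice it could reuse the existing synonym resolver's fuzzy matcher, and the 0.8 cutoff is an assumption to tune):

```python
import difflib
from typing import Optional

def fuzzy_match_ontology(value: str, valid_terms: list[str],
                         cutoff: float = 0.8) -> Optional[str]:
    """Map a raw slot value to the closest ontology term, or None if too far off."""
    # Compare case-insensitively but return the ontology's original casing
    by_lower = {term.lower(): term for term in valid_terms}
    matches = difflib.get_close_matches(value.lower(), list(by_lower),
                                        n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None
```

Returning `None` rather than a weak guess matters here: a silently wrong slot value produces a valid-looking SPARQL query with wrong results.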
### 3.4 GEPA Optimization for DSPy Modules
Add GEPA optimization training for key modules:
```python
# backend/rag/optimization/gepa_training.py
import dspy
from dspy import GEPA

def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""
    # Load training data from template examples
    training_data = load_training_examples()

    # Define the metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0

    # Initialize the GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )

    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )

    # Save the optimized module
    optimized.save("optimized_template_classifier.json")
    return optimized
```
Benefits:
- 10-20% accuracy improvement typical
- Automated prompt refinement
- Domain-specific optimization
### 3.5 Hierarchical Intent Classification
Structure intents hierarchically for scalability:
```yaml
# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - expense_comparison
```
```python
class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""

    def classify(self, question: str) -> IntentResult:
        # Stage 1: Classify into a top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.
        # Stage 2: Classify into a specific intent within that category
        specific = self._classify_specific(question, top_level)
        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence),
        )
```
Benefits:
- Scales to 50+ templates without accuracy loss
- Faster classification (fewer options per stage)
- Better organized codebase
## 4. Implementation Priority
### Phase 1: High Impact (1-2 days)

1. **SPARQL Validation Loop (3.2)**
   - Load schema from LinkML
   - Validate predicates/classes
   - Add LLM correction step
2. **Metrics Enhancement**
   - Track tier usage distribution
   - Track latency per tier
   - Track validation error rates
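For the metrics item, a minimal in-process counter is enough to start (class and method names are illustrative; a production setup would export these figures to whatever monitoring stack the backend already uses):

```python
import time
from collections import Counter

class TierMetrics:
    """Tracks tier usage distribution, per-tier latency, and validation errors."""

    def __init__(self):
        self.tier_hits = Counter()
        self.latencies: dict[str, list[float]] = {}
        self.validation_errors = 0

    def record(self, tier: str, started: float, validation_failed: bool = False):
        """Record one resolved query; `started` is a time.perf_counter() stamp."""
        self.tier_hits[tier] += 1
        self.latencies.setdefault(tier, []).append(time.perf_counter() - started)
        if validation_failed:
            self.validation_errors += 1

    def usage_distribution(self) -> dict[str, float]:
        total = sum(self.tier_hits.values()) or 1
        return {tier: hits / total for tier, hits in self.tier_hits.items()}
```

The usage distribution directly answers the key tuning question for the tiered design: what fraction of traffic actually reaches the expensive LLM tier.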
### Phase 2: Medium Impact (2-3 days)

1. **RAG-Enhanced Tier (3.1)**
   - Index template examples
   - Implement retrieval
   - Add as Tier 2.5
2. **Schema-Aware Slot Filling (3.3)**
   - Load ontology
   - Validate extracted values
   - Auto-correct mismatches
### Phase 3: Optimization (3-5 days)

1. **GEPA Training (3.4)**
   - Create training dataset
   - Define metrics
   - Run optimization
   - Deploy optimized modules
2. **Hierarchical Intents (3.5)**
   - Design taxonomy
   - Implement two-stage classifier
   - Migrate existing templates
## 5. Expected Outcomes
| Improvement | Expected Impact | Measurement |
|---|---|---|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |
## 6. References
- SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
- COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
- KGQuest (arXiv:2511.11258) - Deterministic template + LLM refinement
- FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
- Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
- DSPy GEPA - Reflective prompt evolution
- Hybrid NLQ→SPARQL (LinkedIn 2024) - Template-first with LLM fallback