# SOTA Analysis: Template-Based SPARQL Generation
**Date**: 2026-01-07
**Status**: Active Research
**Author**: OpenCode
## Executive Summary
Based on comprehensive research of 2024-2025 academic papers and industry practices, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.
**Key Finding**: Our 3-tier architecture (regex → embedding → LLM) is well-aligned with SOTA hybrid approaches. The primary improvement opportunities are:
1. Add RAG-enhanced tier between embedding and LLM
2. Implement SPARQL validation feedback loop
3. Schema-aware slot filling
4. GEPA optimization for DSPy modules
---
## 1. Research Survey
### 1.1 SPARQL-LLM (arXiv 2512.14277, Dec 2025)
**Key Innovation**: Real-time SPARQL generation with 24% F1 improvement over TEXT2SPARQL winners.
**Architecture**:
```
User Question
      ↓
┌─────────────────────────────────────────┐
│ Metadata Indexer                        │
│ - Schema classes/properties indexed     │
│ - Example Q&A pairs vectorized          │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Prompt Builder (RAG)                    │
│ - Retrieve similar examples             │
│ - Retrieve relevant schema fragments    │
│ - Compose context-rich prompt           │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ SPARQL Generator                        │
│ - LLM generates SPARQL                  │
│ - Validation against schema             │
│ - Iterative correction loop             │
└─────────────────────────────────────────┘
      ↓
Validated SPARQL
```
**Relevance to GLAM**:
- ✅ We have schema (LinkML) but don't use it in prompts
- ✅ We have example Q&A in templates but don't retrieve semantically
- ❌ Missing: Schema-aware validation loop
### 1.2 COT-SPARQL (SEMANTICS 2024)
**Key Innovation**: Chain-of-Thought prompting with context injection.
**Two Context Types**:
- **Context A**: Entity and relation extraction from question
- **Context B**: Most semantically similar example from training set
**Performance**: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9
**Relevance to GLAM**:
- ✅ Our embedding matcher finds similar patterns (partial Context B)
- ❌ Missing: Entity/relation extraction step (Context A)
- ❌ Missing: CoT prompting in LLM tier
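Applied to our LLM tier, the two contexts could be injected into a CoT prompt roughly like this (a sketch; the entity extraction and example retrieval steps are assumed to exist elsewhere, and `build_cot_prompt` is an illustrative name, not COT-SPARQL's actual code):

```python
def build_cot_prompt(question: str,
                     entities: list[str],
                     relations: list[str],
                     similar_example: dict) -> str:
    """Compose a Chain-of-Thought prompt carrying both COT-SPARQL contexts."""
    # Context A: entities and relations extracted from the question
    context_a = (f"Entities: {', '.join(entities)}\n"
                 f"Relations: {', '.join(relations)}")
    # Context B: the most semantically similar training example
    context_b = (f"Similar question: {similar_example['question']}\n"
                 f"Its SPARQL: {similar_example['sparql']}")
    return (f"{context_a}\n\n{context_b}\n\n"
            f"Question: {question}\n"
            "Think step by step, then write the SPARQL query.")
```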
### 1.3 KGQuest (arXiv 2511.11258, Nov 2025)
**Key Innovation**: Deterministic template generation + LLM refinement.
**Architecture**:
```
KG Triplets
      ↓
Cluster by relation type
      ↓
Generate rule-based templates (deterministic)
      ↓
LLM refinement for fluency (lightweight, controlled)
```
**Relevance to GLAM**:
- ✅ Validates our template-first approach
- ✅ We use deterministic templates with LLM fallback
- 💡 Insight: Use LLM only for refinement, not generation
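KGQuest's deterministic step could be approximated for our KG like this (a sketch; the relation-to-template table and names are illustrative, not KGQuest's actual rules — the LLM's only job would be to rephrase the output for fluency):

```python
# Rule-based question templates keyed by relation type (deterministic step);
# the relations and phrasings below are illustrative placeholders.
RELATION_TEMPLATES = {
    "hc:locatedIn": "Which {subject_type} are located in {object}?",
    "hc:foundingDate": "When was {subject} founded?",
}

def generate_question(relation: str, **slots: str) -> str:
    """Instantiate the deterministic template for a relation type."""
    template = RELATION_TEMPLATES.get(relation)
    if template is None:
        raise KeyError(f"No template for relation {relation}")
    return template.format(**slots)
```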
### 1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)
**Key Innovation**: Explicit tiered architecture with fallback.
**Recommended Pattern**:
```python
def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)
    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)
```
**Relevance to GLAM**:
- ✅ We already implement this pattern
- 💡 Our threshold is 0.75; we could raise it (the pattern above uses 0.85) for higher precision
### 1.5 GEPA Optimization (DSPy, 2024-2025)
**Key Innovation**: Genetic-Pareto optimization for prompt evolution.
**Approach**:
- Dual-model: Cheap student LM + Smart reflection LM
- Iterate: Run → Analyze failures → Generate improved prompts
- Results: 10-20% accuracy improvements typical
**Relevance to GLAM**:
- ❌ We use static DSPy signatures without optimization
- 💡 Could apply GEPA to TemplateClassifier and SlotExtractor
### 1.6 Intent-Driven Hybrid Architecture (2024)
**Key Pattern**: Intent classification → Template selection → Slot filling → LLM fallback
```
User Query
      ↓
┌─────────────────────────────────────────┐
│ Intent Classifier                       │
│ - Embedding-based classification        │
│ - Hierarchical intent taxonomy          │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Template Selector                       │
│ - Map intent → available templates      │
│ - FAISS/vector retrieval for similar    │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Slot Filler                             │
│ - Schema-aware extraction               │
│ - Validation against ontology           │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ LLM Fallback                            │
│ - Only when template fails              │
│ - Constrained generation                │
└─────────────────────────────────────────┘
```
**Relevance to GLAM**:
- ✅ We have semantic router for intent
- ✅ We have template classification
- ❌ Missing: Hierarchical intent taxonomy
- ❌ Missing: Schema-aware slot validation
---
## 2. Current GLAM Architecture
### 2.1 Current 3-Tier System
```
User Question
      ↓
┌─────────────────────────────────────────┐
│ TIER 1: Pattern Matching                │
│ - Regex-based template matching         │
│ - Slot type validation                  │
│ - Confidence ≥ 0.75 required            │
│ - ~1ms latency                          │
└─────────────────────────────────────────┘
      ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 2: Embedding Matching              │
│ - Sentence-transformer embeddings       │
│ - Cosine similarity ≥ 0.70              │
│ - ~50ms latency (cached)                │
└─────────────────────────────────────────┘
      ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 3: LLM Classification              │
│ - DSPy ChainOfThought                   │
│ - Template ID classification            │
│ - ~500-2000ms latency                   │
└─────────────────────────────────────────┘
      ↓
Slot Extraction (DSPy)
      ↓
Template Instantiation (Jinja2)
      ↓
SPARQL Query
```
### 2.2 Strengths
| Aspect | Current Implementation | Rating |
|--------|----------------------|--------|
| Deterministic first | Regex before embeddings before LLM | ⭐⭐⭐⭐⭐ |
| Semantic similarity | Sentence-transformer embeddings | ⭐⭐⭐⭐ |
| Multilingual | Dutch/English/German patterns | ⭐⭐⭐⭐ |
| Conversation context | Context resolver for follow-ups | ⭐⭐⭐⭐ |
| Relevance filtering | Fyke filter for out-of-scope | ⭐⭐⭐⭐ |
| Slot resolution | Synonym resolver with fuzzy match | ⭐⭐⭐⭐ |
| Template variants | Region/country/ISIL variants | ⭐⭐⭐⭐ |
### 2.3 Gaps vs SOTA
| Gap | SOTA Reference | Impact | Priority |
|-----|---------------|--------|----------|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |
---
## 3. Proposed Improvements
### 3.1 Add Tier 2.5: RAG-Enhanced Matching
Insert a RAG tier between embedding matching and LLM fallback:
```python
# After embedding match fails:
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RAGEnhancedMatch:
    """Context-enriched matching using similar examples."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve the top-3 most similar Q&A examples from the template YAML
        similar_examples = self._retrieve_similar_examples(question, k=3)
        # Check whether the examples strongly suggest a single template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]
        if count >= 2:  # 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.85-0.90
                reasoning=f"RAG: {count}/3 similar examples use {top_template}",
            )
        return None
```
**Benefits**:
- Handles paraphrases that embeddings miss
- Uses existing example data in templates
- Cheaper than LLM fallback
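The `_retrieve_similar_examples` helper referenced above could be sketched as follows; `difflib` stands in for the similarity scoring here, while the real implementation would reuse the cached sentence-transformer embeddings from Tier 2 (the `Example` shape is illustrative):

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Example:
    question: str
    template_id: str


def retrieve_similar_examples(question: str,
                              examples: list[Example],
                              k: int = 3) -> list[Example]:
    """Return the k examples most similar to the question."""
    def score(ex: Example) -> float:
        # Stand-in similarity; production code would use cosine similarity
        # over the cached sentence-transformer embeddings from Tier 2.
        return SequenceMatcher(None, question.lower(), ex.question.lower()).ratio()
    return sorted(examples, key=score, reverse=True)[:k]
```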
### 3.2 Add SPARQL Validation Feedback Loop
After template instantiation, validate SPARQL against schema:
```python
import re
from pathlib import Path


class SPARQLValidator:
    """Validates generated SPARQL against ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)

    def validate(self, sparql: str) -> ValidationResult:
        errors = []
        # Extract predicates used in the query
        predicates = re.findall(r"(hc:\w+|schema:\w+)", sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")
        # Extract classes
        classes = re.findall(r"a\s+(hcc:\w+)", sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")
        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors),
        )

    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use LLM to correct validation errors."""
        error_list = "\n".join(f"- {e}" for e in errors)
        prompt = (
            "The following SPARQL query has errors:\n"
            f"```sparql\n{sparql}\n```\n\n"
            f"Errors found:\n{error_list}\n\n"
            "Correct the query. Return only the corrected SPARQL."
        )
        # Call LLM for correction
        return self._call_llm(prompt)
```
**Benefits**:
- Catches schema mismatches before execution
- Enables iterative correction (SPARQL-LLM pattern)
- Reduces runtime errors
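The iterative correction loop could tie `validate` and `correct_with_llm` together as in this sketch; the retry cap and the give-up behaviour (returning the last attempt) are assumptions, not part of the SPARQL-LLM paper:

```python
def validate_and_correct(validator, sparql: str, max_retries: int = 2) -> str:
    """Validate SPARQL and let the LLM repair it, up to max_retries rounds."""
    for _ in range(max_retries + 1):
        result = validator.validate(sparql)
        if result.valid:
            return sparql
        sparql = validator.correct_with_llm(sparql, result.errors)
    # Still invalid after the retry budget: surface the last attempt to the caller
    return sparql
```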
### 3.3 Schema-Aware Slot Filling
Use ontology to validate extracted slot values:
```python
class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""

    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)

    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction
        raw_slots = self.extract(question=question, ...)
        # Validate against the ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check if the value maps to a valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try a fuzzy match against the ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} → {match}")
            else:
                # Slots without ontology constraints pass through unchanged
                validated_slots[slot_name] = value
        return validated_slots
```
**Benefits**:
- Ensures slot values are ontology-compliant
- Auto-corrects minor extraction errors
- Reduces downstream SPARQL errors
### 3.4 GEPA Optimization for DSPy Modules
Add GEPA optimization training for key modules:
```python
# backend/rag/optimization/gepa_training.py
import dspy
from dspy import GEPA


def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""
    # Load training data from template examples
    training_data = load_training_examples()

    # Define metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0

    # Initialize GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )

    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )

    # Save optimized module
    optimized.save("optimized_template_classifier.json")
    return optimized
```
**Benefits**:
- 10-20% accuracy improvement typical
- Automated prompt refinement
- Domain-specific optimization
### 3.5 Hierarchical Intent Classification
Structure intents hierarchically for scalability:
```yaml
# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - expense_comparison
```
```python
class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""

    def classify(self, question: str) -> IntentResult:
        # Stage 1: Classify into a top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.
        # Stage 2: Classify into a specific intent within that category
        specific = self._classify_specific(question, top_level)
        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence),
        )
```
**Benefits**:
- Scales to 50+ templates without accuracy loss
- Faster classification (fewer options per stage)
- Better organized codebase
---
## 4. Implementation Priority
### Phase 1: High Impact (1-2 days)
1. **SPARQL Validation Loop** (3.2)
- Load schema from LinkML
- Validate predicates/classes
- Add LLM correction step
2. **Metrics Enhancement**
- Track tier usage distribution
- Track latency per tier
- Track validation error rates
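The tier-usage and latency tracking could start as simply as the sketch below (the class and attribute names are illustrative, not an existing module):

```python
import time
from collections import Counter, defaultdict


class TierMetrics:
    """Tracks which tier resolved each query and the per-tier latency."""

    def __init__(self) -> None:
        self.tier_hits: Counter = Counter()
        self.latencies_ms: dict = defaultdict(list)

    def record(self, tier: str, started_at: float) -> None:
        """Record a resolved query; started_at comes from time.perf_counter()."""
        self.tier_hits[tier] += 1
        self.latencies_ms[tier].append((time.perf_counter() - started_at) * 1000)

    def distribution(self) -> dict:
        """Fraction of queries resolved per tier."""
        total = sum(self.tier_hits.values())
        return {t: n / total for t, n in self.tier_hits.items()} if total else {}
```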
### Phase 2: Medium Impact (2-3 days)
3. **RAG-Enhanced Tier** (3.1)
- Index template examples
- Implement retrieval
- Add as Tier 2.5
4. **Schema-Aware Slot Filling** (3.3)
- Load ontology
- Validate extracted values
- Auto-correct mismatches
### Phase 3: Optimization (3-5 days)
5. **GEPA Training** (3.4)
- Create training dataset
- Define metrics
- Run optimization
- Deploy optimized modules
6. **Hierarchical Intents** (3.5)
- Design taxonomy
- Implement two-stage classifier
- Migrate existing templates
---
## 5. Expected Outcomes
| Improvement | Expected Impact | Measurement |
|-------------|-----------------|-------------|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |
---
## 6. References
1. SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
2. COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
3. KGQuest (arXiv:2511.11258) - Deterministic template + LLM refinement
4. FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
5. Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
6. DSPy GEPA - Reflective prompt evolution
7. Hybrid NLQ→SPARQL (LinkedIn 2024) - Template-first with LLM fallback