# SOTA Analysis: Template-Based SPARQL Generation

**Date**: 2025-01-07

**Status**: Active Research

**Author**: OpenCode

## Executive Summary

Based on a survey of 2024-2025 academic papers and industry practice, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.

**Key Finding**: Our 3-tier architecture (regex → embedding → LLM) is well-aligned with SOTA hybrid approaches. The primary improvement opportunities are:

1. Add a RAG-enhanced tier between embedding and LLM
2. Implement a SPARQL validation feedback loop
3. Implement schema-aware slot filling
4. Apply GEPA optimization to the DSPy modules

---

## 1. Research Survey

### 1.1 SPARQL-LLM (arXiv 2512.14277, Dec 2024)

**Key Innovation**: Real-time SPARQL generation with a 24% F1 improvement over the TEXT2SPARQL winners.

**Architecture**:

```
User Question
      ↓
┌─────────────────────────────────────────┐
│ Metadata Indexer                        │
│ - Schema classes/properties indexed     │
│ - Example Q&A pairs vectorized          │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Prompt Builder (RAG)                    │
│ - Retrieve similar examples             │
│ - Retrieve relevant schema fragments    │
│ - Compose context-rich prompt           │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ SPARQL Generator                        │
│ - LLM generates SPARQL                  │
│ - Validation against schema             │
│ - Iterative correction loop             │
└─────────────────────────────────────────┘
      ↓
Validated SPARQL
```

**Relevance to GLAM**:

- ✅ We have a schema (LinkML) but don't use it in prompts
- ✅ We have example Q&A in templates but don't retrieve them semantically
- ❌ Missing: schema-aware validation loop
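
The Metadata Indexer and Prompt Builder stages can be sketched as follows. This is a toy version that uses token overlap in place of a real vector index; the names (`Example`, `build_prompt`, `jaccard`) are illustrative, not from the paper:

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    sparql: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity as a cheap stand-in for embedding retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_prompt(question: str, examples: list[Example],
                 schema_fragment: str, k: int = 2) -> str:
    """Compose a context-rich prompt from the k most similar Q&A pairs."""
    ranked = sorted(examples, key=lambda ex: jaccard(question, ex.question),
                    reverse=True)
    shots = "\n\n".join(f"Q: {ex.question}\nSPARQL: {ex.sparql}"
                        for ex in ranked[:k])
    return f"Schema:\n{schema_fragment}\n\nExamples:\n{shots}\n\nQ: {question}\nSPARQL:"


prompt = build_prompt(
    "Which museums are in Amsterdam?",
    [Example("Which archives are in Utrecht?", "SELECT ..."),
     Example("Which museums are in Amsterdam or Rotterdam?", "SELECT ...")],
    schema_fragment="hc:Museum, hc:locatedIn",
)
```

The real system would swap `jaccard` for the sentence-transformer similarity we already use in Tier 2.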

### 1.2 COT-SPARQL (SEMANTICS 2024)

**Key Innovation**: Chain-of-Thought prompting with context injection.

**Two Context Types**:

- **Context A**: Entity and relation extraction from the question
- **Context B**: The most semantically similar example from the training set

**Performance**: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9

**Relevance to GLAM**:

- ✅ Our embedding matcher finds similar patterns (partial Context B)
- ❌ Missing: entity/relation extraction step (Context A)
- ❌ Missing: CoT prompting in the LLM tier
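
Combining the two context types into a single prompt might look like the sketch below. The function name and phrasing are illustrative; only the Context A / Context B split follows the paper:

```python
def compose_cot_prompt(question: str, entities: list[str], relations: list[str],
                       similar_example: tuple[str, str]) -> str:
    """Inject Context A (extracted entities/relations) and Context B
    (the nearest training example) into a chain-of-thought prompt."""
    ex_question, ex_sparql = similar_example
    return (
        f"Context A - entities: {', '.join(entities)}; "
        f"relations: {', '.join(relations)}\n"
        f"Context B - similar question: {ex_question}\n"
        f"Context B - its SPARQL: {ex_sparql}\n\n"
        f"Question: {question}\n"
        "Let's think step by step, then write the SPARQL query."
    )


prompt = compose_cot_prompt(
    "When was the Rijksmuseum founded?",
    entities=["Rijksmuseum"],
    relations=["foundingDate"],
    similar_example=("When was the Mauritshuis founded?",
                     "SELECT ?d WHERE { hc:Mauritshuis hc:foundingDate ?d }"),
)
```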

### 1.3 KGQuest (arXiv 2511.11258, Nov 2024)

**Key Innovation**: Deterministic template generation + LLM refinement.

**Architecture**:

```
KG Triplets
      ↓
Cluster by relation type
      ↓
Generate rule-based templates (deterministic)
      ↓
LLM refinement for fluency (lightweight, controlled)
```

**Relevance to GLAM**:

- ✅ Validates our template-first approach
- ✅ We use deterministic templates with LLM fallback
- 💡 Insight: Use the LLM only for refinement, not generation
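
The first two pipeline steps (cluster by relation, generate rule-based templates) can be sketched in a few lines. The phrasing rule is a placeholder; KGQuest's actual template rules are more elaborate:

```python
from collections import defaultdict


def templates_from_triples(triples: list[tuple[str, str, str]]) -> dict[str, str]:
    """Cluster triples by relation and derive one deterministic question
    template per relation; LLM refinement for fluency would run afterwards."""
    by_relation: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
    for subj, pred, obj in triples:
        by_relation[pred].append((subj, pred, obj))
    # One rule-based template per relation cluster (illustrative phrasing).
    return {pred: f"What is the {pred.replace('_', ' ')} of {{subject}}?"
            for pred in by_relation}


templates = templates_from_triples([
    ("Rijksmuseum", "founding_date", "1800"),
    ("Mauritshuis", "founding_date", "1822"),
    ("Rijksmuseum", "located_in", "Amsterdam"),
])
```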

### 1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)

**Key Innovation**: Explicit tiered architecture with fallback.

**Recommended Pattern**:

```python
def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)

    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)
```

**Relevance to GLAM**:

- ✅ We already implement this pattern
- 💡 Our threshold is 0.75; we could consider raising it for higher precision

### 1.5 GEPA Optimization (DSPy, 2024-2025)

**Key Innovation**: Genetic-Pareto optimization for prompt evolution.

**Approach**:

- Dual-model setup: a cheap student LM plus a smart reflection LM
- Iterate: run → analyze failures → generate improved prompts
- Results: 10-20% accuracy improvements are typical

**Relevance to GLAM**:

- ❌ We use static DSPy signatures without optimization
- 💡 Could apply GEPA to TemplateClassifier and SlotExtractor

### 1.6 Intent-Driven Hybrid Architecture (2024)

**Key Pattern**: Intent classification → Template selection → Slot filling → LLM fallback

```
User Query
      ↓
┌─────────────────────────────────────────┐
│ Intent Classifier                       │
│ - Embedding-based classification        │
│ - Hierarchical intent taxonomy          │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Template Selector                       │
│ - Map intent → available templates      │
│ - FAISS/vector retrieval for similar    │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ Slot Filler                             │
│ - Schema-aware extraction               │
│ - Validation against ontology           │
└─────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────┐
│ LLM Fallback                            │
│ - Only when template fails              │
│ - Constrained generation                │
└─────────────────────────────────────────┘
```

**Relevance to GLAM**:

- ✅ We have a semantic router for intent
- ✅ We have template classification
- ❌ Missing: hierarchical intent taxonomy
- ❌ Missing: schema-aware slot validation

---

## 2. Current GLAM Architecture

### 2.1 Current 3-Tier System

```
User Question
      ↓
┌─────────────────────────────────────────┐
│ TIER 1: Pattern Matching                │
│ - Regex-based template matching         │
│ - Slot type validation                  │
│ - Confidence ≥ 0.75 required            │
│ - ~1ms latency                          │
└─────────────────────────────────────────┘
      ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 2: Embedding Matching              │
│ - Sentence-transformer embeddings       │
│ - Cosine similarity ≥ 0.70              │
│ - ~50ms latency (cached)                │
└─────────────────────────────────────────┘
      ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 3: LLM Classification              │
│ - DSPy ChainOfThought                   │
│ - Template ID classification            │
│ - ~500-2000ms latency                   │
└─────────────────────────────────────────┘
      ↓
Slot Extraction (DSPy)
      ↓
Template Instantiation (Jinja2)
      ↓
SPARQL Query
```
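
The tier fallback in this diagram amounts to a cost-ordered dispatch: try the cheapest matcher first, and only pay for the LLM when everything else declines. A minimal sketch (the matcher callables stand in for our real tier implementations):

```python
from typing import Callable, Optional


def dispatch(question: str,
             regex_match: Callable[[str], Optional[str]],
             embed_match: Callable[[str], Optional[str]],
             llm_classify: Callable[[str], str]) -> tuple[str, str]:
    """Run the tiers in order of cost; each tier returns a template id or None."""
    for tier_name, matcher in (("regex", regex_match), ("embedding", embed_match)):
        template_id = matcher(question)
        if template_id is not None:
            return tier_name, template_id
    # Tier 3 always produces an answer (the most expensive path).
    return "llm", llm_classify(question)


tier, template_id = dispatch(
    "how many museums are in Gelderland?",
    regex_match=lambda q: "count_by_location" if q.startswith("how many") else None,
    embed_match=lambda q: None,
    llm_classify=lambda q: "llm_fallback_template",
)
```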

### 2.2 Strengths

| Aspect | Current Implementation | Rating |
|--------|------------------------|--------|
| Deterministic first | Regex before embeddings before LLM | ⭐⭐⭐⭐⭐ |
| Semantic similarity | Sentence-transformer embeddings | ⭐⭐⭐⭐ |
| Multilingual | Dutch/English/German patterns | ⭐⭐⭐⭐ |
| Conversation context | Context resolver for follow-ups | ⭐⭐⭐⭐ |
| Relevance filtering | Fyke filter for out-of-scope questions | ⭐⭐⭐⭐ |
| Slot resolution | Synonym resolver with fuzzy matching | ⭐⭐⭐⭐ |
| Template variants | Region/country/ISIL variants | ⭐⭐⭐⭐ |

### 2.3 Gaps vs SOTA

| Gap | SOTA Reference | Impact | Priority |
|-----|----------------|--------|----------|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |

---

## 3. Proposed Improvements

### 3.1 Add Tier 2.5: RAG-Enhanced Matching

Insert a RAG tier between embedding matching and the LLM fallback:

```python
# Runs after the embedding match fails:
from collections import Counter
from typing import Optional


class RAGEnhancedMatch:
    """Context-enriched matching using similar examples."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve the top-3 most similar Q&A examples from the YAML templates
        similar_examples = self._retrieve_similar_examples(question, k=3)

        # Check whether the examples strongly suggest a single template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]

        if count >= 2:  # at least 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.85-0.90
                reasoning=f"RAG: {count}/3 similar examples use {top_template}",
            )
        return None
```

**Benefits**:

- Handles paraphrases that embeddings miss
- Uses the example data already present in templates
- Cheaper than the LLM fallback

### 3.2 Add SPARQL Validation Feedback Loop

After template instantiation, validate the SPARQL against the schema:

```python
import re
from pathlib import Path


class SPARQLValidator:
    """Validates generated SPARQL against the ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)

    def validate(self, sparql: str) -> ValidationResult:
        errors = []

        # Extract the predicates used in the query
        predicates = re.findall(r'(hc:\w+|schema:\w+)', sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")

        # Extract the classes
        classes = re.findall(r'a\s+(hcc:\w+)', sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")

        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors),
        )

    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use an LLM to correct validation errors."""
        error_list = "\n".join(f"- {e}" for e in errors)
        prompt = (
            "The following SPARQL query has errors:\n\n"
            f"{sparql}\n\n"
            f"Errors found:\n{error_list}\n\n"
            "Correct the query. Return only the corrected SPARQL."
        )
        return self._call_llm(prompt)
```

**Benefits**:

- Catches schema mismatches before execution
- Enables iterative correction (the SPARQL-LLM pattern)
- Reduces runtime errors

### 3.3 Schema-Aware Slot Filling

Use the ontology to validate extracted slot values:

```python
class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""

    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)

    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction (assumes the signature yields
        # a dict-like mapping of slot names to values)
        raw_slots = self.extract(question=question, ...)

        # Validate against the ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check whether the value maps to a valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try a fuzzy match against the ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} → {match}")
            else:
                # Slots without ontology constraints pass through unchanged
                validated_slots[slot_name] = value

        return validated_slots
```

**Benefits**:

- Ensures slot values are ontology-compliant
- Auto-corrects minor extraction errors
- Reduces downstream SPARQL errors

### 3.4 GEPA Optimization for DSPy Modules

Add GEPA optimization training for the key modules:

```python
# backend/rag/optimization/gepa_training.py

import dspy
from dspy import GEPA


def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""

    # Load training data from the template examples
    training_data = load_training_examples()

    # Define the metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0

    # Initialize the GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )

    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )

    # Save the optimized module
    optimized.save("optimized_template_classifier.json")

    return optimized
```

**Benefits**:

- 10-20% accuracy improvements are typical
- Automated prompt refinement
- Domain-specific optimization

### 3.5 Hierarchical Intent Classification

Structure intents hierarchically for scalability:

```yaml
# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - expense_comparison
```

```python
class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""

    def classify(self, question: str) -> IntentResult:
        # Stage 1: classify into a top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.

        # Stage 2: classify into a specific intent within that category
        specific = self._classify_specific(question, top_level)

        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence),
        )
```

**Benefits**:

- Scales to 50+ templates without accuracy loss
- Faster classification (fewer options per stage)
- Better organized codebase

---

## 4. Implementation Priority

### Phase 1: High Impact (1-2 days)

1. **SPARQL Validation Loop** (3.2)
   - Load schema from LinkML
   - Validate predicates/classes
   - Add LLM correction step

2. **Metrics Enhancement**
   - Track tier usage distribution
   - Track latency per tier
   - Track validation error rates
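
The "load schema from LinkML" step can be sketched as follows, assuming the LinkML YAML has already been parsed into a plain dict (the `linkml-runtime` package offers a richer API; this sketch avoids depending on it, and the schema fragment is hypothetical):

```python
def load_linkml_terms(schema: dict) -> tuple[set[str], set[str]]:
    """Collect class and slot names from a parsed LinkML schema dict,
    for use as the validator's vocabulary."""
    classes = set(schema.get("classes") or {})
    slots = set(schema.get("slots") or {})
    return classes, slots


# Hypothetical LinkML fragment, already parsed from YAML into a dict.
SCHEMA = {
    "classes": {"HeritageInstitution": {}},
    "slots": {"institutionType": {}, "foundingDate": {}},
}
classes, slots = load_linkml_terms(SCHEMA)
```

The resulting sets would feed the `valid_classes` / `valid_predicates` checks described in section 3.2.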

### Phase 2: Medium Impact (2-3 days)

3. **RAG-Enhanced Tier** (3.1)
   - Index template examples
   - Implement retrieval
   - Add as Tier 2.5

4. **Schema-Aware Slot Filling** (3.3)
   - Load ontology
   - Validate extracted values
   - Auto-correct mismatches

### Phase 3: Optimization (3-5 days)

5. **GEPA Training** (3.4)
   - Create training dataset
   - Define metrics
   - Run optimization
   - Deploy optimized modules

6. **Hierarchical Intents** (3.5)
   - Design taxonomy
   - Implement two-stage classifier
   - Migrate existing templates

---

## 5. Expected Outcomes

| Improvement | Expected Impact | Measurement |
|-------------|-----------------|-------------|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |

---

## 6. References

1. SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
2. COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
3. KGQuest (arXiv:2511.11258) - Deterministic templates + LLM refinement
4. FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
5. Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
6. DSPy GEPA - Reflective prompt evolution
7. Hybrid NLQ→SPARQL (LinkedIn, 2024) - Template-first with LLM fallback