# SOTA Analysis: Template-Based SPARQL Generation

**Date**: 2025-01-07
**Status**: Active Research
**Author**: OpenCode

## Executive Summary

Based on a survey of 2024-2025 academic papers and industry practices, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.

**Key Finding**: Our 3-tier architecture (regex → embedding → LLM) is well-aligned with SOTA hybrid approaches. The primary improvement opportunities are:

1. Add a RAG-enhanced tier between embedding and LLM
2. Implement a SPARQL validation feedback loop
3. Schema-aware slot filling
4. GEPA optimization for DSPy modules

---

## 1. Research Survey

### 1.1 SPARQL-LLM (arXiv 2512.14277, Dec 2024)

**Key Innovation**: Real-time SPARQL generation with a 24% F1 improvement over TEXT2SPARQL winners.

**Architecture**:

```
User Question
                    ↓
┌─────────────────────────────────────────┐
│ Metadata Indexer                        │
│ - Schema classes/properties indexed     │
│ - Example Q&A pairs vectorized          │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Prompt Builder (RAG)                    │
│ - Retrieve similar examples             │
│ - Retrieve relevant schema fragments    │
│ - Compose context-rich prompt           │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ SPARQL Generator                        │
│ - LLM generates SPARQL                  │
│ - Validation against schema             │
│ - Iterative correction loop             │
└─────────────────────────────────────────┘
                    ↓
Validated SPARQL
```

**Relevance to GLAM**:

- ✅ We have a schema (LinkML) but don't use it in prompts
- ✅ We have example Q&A in templates but don't retrieve them semantically
- ❌ Missing: schema-aware validation loop

### 1.2 COT-SPARQL (SEMANTICS 2024)

**Key Innovation**: Chain-of-Thought prompting with context injection.
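As a rough illustration (the function name, prompt wording, and `hc:`/`hcc:` example query are hypothetical, not taken from the paper), a context-injected CoT prompt could be assembled like this:

```python
def build_cot_prompt(question: str, entities: list[str], relations: list[str],
                     similar_example: tuple[str, str]) -> str:
    """Compose a Chain-of-Thought prompt that injects two kinds of context:
    extracted entities/relations and the most similar solved example."""
    example_q, example_sparql = similar_example
    return (
        f"Entities detected: {', '.join(entities)}\n"
        f"Relations detected: {', '.join(relations)}\n"
        f"Similar solved example:\nQ: {example_q}\nSPARQL: {example_sparql}\n\n"
        f"Question: {question}\n"
        "Let's reason step by step, then output the final SPARQL query."
    )

prompt = build_cot_prompt(
    "Which museums are in Amsterdam?",
    entities=["Amsterdam"],
    relations=["locatedIn"],
    similar_example=("Which libraries are in Utrecht?",
                     "SELECT ?s WHERE { ?s a hcc:Library ; hc:city 'Utrecht' }"),
)
print(prompt.splitlines()[0])  # → Entities detected: Amsterdam
```

The two injected context blocks correspond to the context types described below.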
**Two Context Types**:

- **Context A**: Entity and relation extraction from the question
- **Context B**: Most semantically similar example from the training set

**Performance**: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9

**Relevance to GLAM**:

- ✅ Our embedding matcher finds similar patterns (partial Context B)
- ❌ Missing: entity/relation extraction step (Context A)
- ❌ Missing: CoT prompting in the LLM tier

### 1.3 KGQuest (arXiv 2511.11258, Nov 2024)

**Key Innovation**: Deterministic template generation + LLM refinement.

**Architecture**:

```
KG Triplets
    ↓
Cluster by relation type
    ↓
Generate rule-based templates (deterministic)
    ↓
LLM refinement for fluency (lightweight, controlled)
```

**Relevance to GLAM**:

- ✅ Validates our template-first approach
- ✅ We use deterministic templates with LLM fallback
- 💡 Insight: use the LLM only for refinement, not generation

### 1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)

**Key Innovation**: Explicit tiered architecture with fallback.

**Recommended Pattern**:

```python
def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)

    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)
```

**Relevance to GLAM**:

- ✅ We already implement this pattern
- 💡 Our threshold is 0.75; consider raising it for higher precision

### 1.5 GEPA Optimization (DSPy, 2024-2025)

**Key Innovation**: Genetic-Pareto optimization for prompt evolution.
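The "Pareto" part of the idea can be shown with a toy selection step (hypothetical scores and metric names, not the DSPy implementation): a candidate prompt survives only if no other candidate beats it on every metric at once.

```python
def pareto_front(candidates: dict[str, tuple[float, float]]) -> set[str]:
    """Keep candidate prompts that are not dominated on (accuracy, brevity).

    A candidate is dominated if some other candidate scores at least as
    high on both metrics (assuming unique score tuples in this toy)."""
    front = set()
    for name, score in candidates.items():
        dominated = any(
            other != score and other[0] >= score[0] and other[1] >= score[1]
            for other in candidates.values()
        )
        if not dominated:
            front.add(name)
    return front

# Hypothetical (accuracy, brevity) scores for three prompt variants:
# v3 is beaten by v2 on both metrics, so only v1 and v2 survive.
scores = {"v1": (0.80, 0.30), "v2": (0.75, 0.90), "v3": (0.70, 0.50)}
print(sorted(pareto_front(scores)))  # → ['v1', 'v2']
```

GEPA evolves prompts by mutating candidates and keeping such a non-dominated set, rather than optimizing a single scalar score.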
**Approach**:

- Dual-model: cheap student LM + smart reflection LM
- Iterate: run → analyze failures → generate improved prompts
- Results: 10-20% accuracy improvements are typical

**Relevance to GLAM**:

- ❌ We use static DSPy signatures without optimization
- 💡 Could apply GEPA to TemplateClassifier and SlotExtractor

### 1.6 Intent-Driven Hybrid Architecture (2024)

**Key Pattern**: Intent classification → template selection → slot filling → LLM fallback

```
User Query
                    ↓
┌─────────────────────────────────────────┐
│ Intent Classifier                       │
│ - Embedding-based classification        │
│ - Hierarchical intent taxonomy          │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Template Selector                       │
│ - Map intent → available templates      │
│ - FAISS/vector retrieval for similar    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ Slot Filler                             │
│ - Schema-aware extraction               │
│ - Validation against ontology           │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ LLM Fallback                            │
│ - Only when template fails              │
│ - Constrained generation                │
└─────────────────────────────────────────┘
```

**Relevance to GLAM**:

- ✅ We have a semantic router for intent
- ✅ We have template classification
- ❌ Missing: hierarchical intent taxonomy
- ❌ Missing: schema-aware slot validation

---

## 2. Current GLAM Architecture

### 2.1 Current 3-Tier System

```
User Question
                    ↓
┌─────────────────────────────────────────┐
│ TIER 1: Pattern Matching                │
│ - Regex-based template matching         │
│ - Slot type validation                  │
│ - Confidence ≥ 0.75 required            │
│ - ~1ms latency                          │
└─────────────────────────────────────────┘
                    ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 2: Embedding Matching              │
│ - Sentence-transformer embeddings       │
│ - Cosine similarity ≥ 0.70              │
│ - ~50ms latency (cached)                │
└─────────────────────────────────────────┘
                    ↓ (if no match)
┌─────────────────────────────────────────┐
│ TIER 3: LLM Classification              │
│ - DSPy ChainOfThought                   │
│ - Template ID classification            │
│ - ~500-2000ms latency                   │
└─────────────────────────────────────────┘
                    ↓
Slot Extraction (DSPy)
                    ↓
Template Instantiation (Jinja2)
                    ↓
SPARQL Query
```

### 2.2 Strengths

| Aspect | Current Implementation | Rating |
|--------|------------------------|--------|
| Deterministic first | Regex before embeddings before LLM | ⭐⭐⭐⭐⭐ |
| Semantic similarity | Sentence-transformer embeddings | ⭐⭐⭐⭐ |
| Multilingual | Dutch/English/German patterns | ⭐⭐⭐⭐ |
| Conversation context | Context resolver for follow-ups | ⭐⭐⭐⭐ |
| Relevance filtering | Fyke filter for out-of-scope | ⭐⭐⭐⭐ |
| Slot resolution | Synonym resolver with fuzzy match | ⭐⭐⭐⭐ |
| Template variants | Region/country/ISIL variants | ⭐⭐⭐⭐ |

### 2.3 Gaps vs SOTA

| Gap | SOTA Reference | Impact | Priority |
|-----|----------------|--------|----------|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |

---

## 3. Proposed Improvements

### 3.1 Add Tier 2.5: RAG-Enhanced Matching

Insert a RAG tier between embedding matching and LLM fallback:

```python
from collections import Counter
from typing import Optional

class RAGEnhancedMatch:
    """Context-enriched matching using similar examples."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve the top-3 most similar Q&A examples from YAML (after the
        # embedding match has failed)
        similar_examples = self._retrieve_similar_examples(question, k=3)

        # Check whether the examples strongly suggest a template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]

        if count >= 2:  # 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.80-0.90
                reasoning=f"RAG: {count}/3 similar examples use {top_template}",
            )
        return None
```

**Benefits**:

- Handles paraphrases that embeddings miss
- Uses existing example data in templates
- Cheaper than LLM fallback

### 3.2 Add SPARQL Validation Feedback Loop

After template instantiation, validate the SPARQL against the schema:

```python
import re
from pathlib import Path

class SPARQLValidator:
    """Validates generated SPARQL against the ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)

    def validate(self, sparql: str) -> ValidationResult:
        errors = []

        # Extract predicates used in the query
        predicates = re.findall(r'(hc:\w+|schema:\w+)', sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")

        # Extract classes
        classes = re.findall(r'a\s+(hcc:\w+)', sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")

        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors),
        )

    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use an LLM to correct validation errors."""
        prompt = f"""
The following SPARQL query has errors:

```sparql
{sparql}
```

Errors found:
{chr(10).join(f'- {e}' for e in errors)}

Correct the query. Return only the corrected SPARQL.
"""
        # Call the LLM for correction
        return self._call_llm(prompt)
```

**Benefits**:

- Catches schema mismatches before execution
- Enables iterative correction (SPARQL-LLM pattern)
- Reduces runtime errors

### 3.3 Schema-Aware Slot Filling

Use the ontology to validate extracted slot values:

```python
class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""

    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)

    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction
        raw_slots = self.extract(question=question, ...)

        # Validate against the ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check whether the value maps to a valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try a fuzzy match against the ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} → {match}")

        return validated_slots
```

**Benefits**:

- Ensures slot values are ontology-compliant
- Auto-corrects minor extraction errors
- Reduces downstream SPARQL errors

### 3.4 GEPA Optimization for DSPy Modules

Add GEPA optimization training for key modules:

```python
# backend/rag/optimization/gepa_training.py
import dspy
from dspy import GEPA

def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""
    # Load training data from template examples
    training_data = load_training_examples()

    # Define the metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0

    # Initialize the GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )

    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )

    # Save the optimized module
    optimized.save("optimized_template_classifier.json")
    return optimized
```

**Benefits**:

- 10-20% accuracy improvement typical
- Automated prompt refinement
- Domain-specific optimization

### 3.5 Hierarchical Intent Classification

Structure intents hierarchically for scalability:

```yaml
# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - expense_comparison
```

```python
class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""

    def classify(self, question: str) -> IntentResult:
        # Stage 1: classify into a top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.

        # Stage 2: classify into a specific intent within that category
        specific = self._classify_specific(question, top_level)

        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence),
        )
```

**Benefits**:

- Scales to 50+ templates without accuracy loss
- Faster classification (fewer options per stage)
- Better organized codebase

---

## 4. Implementation Priority

### Phase 1: High Impact (1-2 days)

1. **SPARQL Validation Loop** (3.2)
   - Load schema from LinkML
   - Validate predicates/classes
   - Add LLM correction step
2. **Metrics Enhancement**
   - Track tier usage distribution
   - Track latency per tier
   - Track validation error rates

### Phase 2: Medium Impact (2-3 days)

3. **RAG-Enhanced Tier** (3.1)
   - Index template examples
   - Implement retrieval
   - Add as Tier 2.5
4. **Schema-Aware Slot Filling** (3.3)
   - Load ontology
   - Validate extracted values
   - Auto-correct mismatches

### Phase 3: Optimization (3-5 days)

5. **GEPA Training** (3.4)
   - Create training dataset
   - Define metrics
   - Run optimization
   - Deploy optimized modules
6. **Hierarchical Intents** (3.5)
   - Design taxonomy
   - Implement two-stage classifier
   - Migrate existing templates

---

## 5. Expected Outcomes

| Improvement | Expected Impact | Measurement |
|-------------|-----------------|-------------|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |

---

## 6. References

1. SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
2. COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
3. KGQuest (arXiv:2511.11258) - Deterministic template + LLM refinement
4. FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
5. Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
6. DSPy GEPA - Reflective prompt evolution
7. Hybrid NLQ→SPARQL (LinkedIn 2024) - Template-first with LLM fallback