
SOTA Analysis: Template-Based SPARQL Generation

Date: 2025-01-07 Status: Active Research Author: OpenCode

Executive Summary

Based on comprehensive research of 2024-2025 academic papers and industry practices, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.

Key Finding: Our 3-tier architecture (regex → embedding → LLM) is well-aligned with SOTA hybrid approaches. The primary improvement opportunities are:

  1. Add a RAG-enhanced tier between embedding and LLM
  2. Implement a SPARQL validation feedback loop
  3. Add schema-aware slot filling
  4. Apply GEPA optimization to the DSPy modules

1. Research Survey

1.1 SPARQL-LLM (arXiv 2512.14277, Dec 2024)

Key Innovation: Real-time SPARQL generation with 24% F1 improvement over TEXT2SPARQL winners.

Architecture:

User Question
    ↓
┌─────────────────────────────────────────┐
│  Metadata Indexer                       │
│  - Schema classes/properties indexed    │
│  - Example Q&A pairs vectorized         │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Prompt Builder (RAG)                   │
│  - Retrieve similar examples            │
│  - Retrieve relevant schema fragments   │
│  - Compose context-rich prompt          │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  SPARQL Generator                       │
│  - LLM generates SPARQL                 │
│  - Validation against schema            │
│  - Iterative correction loop            │
└─────────────────────────────────────────┘
    ↓
Validated SPARQL

Relevance to GLAM:

  • We have schema (LinkML) but don't use it in prompts
  • We have example Q&A in templates but don't retrieve semantically
  • Missing: Schema-aware validation loop

1.2 COT-SPARQL (SEMANTICS 2024)

Key Innovation: Chain-of-Thought prompting with context injection.

Two Context Types:

  • Context A: Entity and relation extraction from question
  • Context B: Most semantically similar example from training set

Performance: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9

Relevance to GLAM:

  • Our embedding matcher finds similar patterns (partial Context B)
  • Missing: Entity/relation extraction step (Context A)
  • Missing: CoT prompting in LLM tier
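
To make the two context types concrete, here is a hedged, stdlib-only sketch of how Context A (entity extraction) and Context B (nearest solved example) could be composed into a CoT prompt. All names (`extract_entities`, `EXAMPLE_STORE`, `build_cot_prompt`) and the toy similarity measure are illustrative, not taken from the COT-SPARQL paper's code.

```python
# Illustrative sketch of COT-SPARQL's two context types; helper names
# and the token-overlap similarity are placeholders, not the paper's method.
EXAMPLE_STORE = [
    {"question": "Which museums are in Amsterdam?",
     "sparql": "SELECT ?m WHERE { ?m a hcc:Museum ; hc:location \"Amsterdam\" }"},
]

def extract_entities(question: str) -> list[str]:
    """Context A: naive entity extraction (capitalized tokens as a placeholder)."""
    return [tok.strip("?.,") for tok in question.split() if tok[:1].isupper()]

def most_similar_example(question: str) -> dict:
    """Context B: pick the stored example with the highest token overlap."""
    q_tokens = set(question.lower().split())
    return max(EXAMPLE_STORE,
               key=lambda ex: len(q_tokens & set(ex["question"].lower().split())))

def build_cot_prompt(question: str) -> str:
    ents = extract_entities(question)
    ex = most_similar_example(question)
    return (f"Entities and relations found: {ents}\n"
            f"Similar solved example:\nQ: {ex['question']}\nSPARQL: {ex['sparql']}\n"
            f"Think step by step, then write SPARQL for: {question}")
```

A production version would replace both placeholders: a NER step for Context A and the same sentence-transformer index used in Tier 2 for Context B.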

1.3 KGQuest (arXiv 2511.11258, Nov 2024)

Key Innovation: Deterministic template generation + LLM refinement.

Architecture:

KG Triplets
    ↓
Cluster by relation type
    ↓
Generate rule-based templates (deterministic)
    ↓
LLM refinement for fluency (lightweight, controlled)

Relevance to GLAM:

  • Validates our template-first approach
  • We use deterministic templates with LLM fallback
  • 💡 Insight: Use LLM only for refinement, not generation
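
A minimal sketch of that pipeline, assuming nothing about KGQuest's actual code: triplets are clustered by relation and a rule-based template is emitted per cluster, with the LLM fluency pass represented by a stub.

```python
# Illustrative KGQuest-style flow: deterministic templates first,
# LLM used only as a lightweight refinement step (stubbed here).
from collections import defaultdict

def build_templates(triplets: list[tuple[str, str, str]]) -> dict[str, str]:
    by_relation = defaultdict(list)
    for subj, rel, obj in triplets:
        by_relation[rel].append((subj, obj))
    # One deterministic, rule-based template per relation cluster
    return {rel: f"What is the {rel.replace('_', ' ')} of {{subject}}?"
            for rel in by_relation}

def refine_with_llm(template: str) -> str:
    """Stub: in KGQuest an LLM lightly rewrites templates for fluency."""
    return template  # no-op placeholder

templates = build_templates([
    ("Rijksmuseum", "founding_date", "1800"),
    ("Mauritshuis", "founding_date", "1822"),
    ("Rijksmuseum", "location", "Amsterdam"),
])
# templates["founding_date"] == "What is the founding date of {subject}?"
```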

1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)

Key Innovation: Explicit tiered architecture with fallback.

Recommended Pattern:

def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)
    
    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)

Relevance to GLAM:

  • We already implement this pattern
  • 💡 Our threshold is 0.75, could consider raising for higher precision

1.5 GEPA Optimization (DSPy, 2024-2025)

Key Innovation: Genetic-Pareto optimization for prompt evolution.

Approach:

  • Dual-model: Cheap student LM + Smart reflection LM
  • Iterate: Run → Analyze failures → Generate improved prompts
  • Results: 10-20% accuracy improvements typical

Relevance to GLAM:

  • We use static DSPy signatures without optimization
  • 💡 Could apply GEPA to TemplateClassifier and SlotExtractor

1.6 Intent-Driven Hybrid Architecture (2024)

Key Pattern: Intent classification → Template selection → Slot filling → LLM fallback

User Query
    ↓
┌─────────────────────────────────────────┐
│  Intent Classifier                      │
│  - Embedding-based classification       │
│  - Hierarchical intent taxonomy         │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Template Selector                      │
│  - Map intent → available templates     │
│  - FAISS/vector retrieval for similar   │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Slot Filler                            │
│  - Schema-aware extraction              │
│  - Validation against ontology          │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  LLM Fallback                           │
│  - Only when template fails             │
│  - Constrained generation               │
└─────────────────────────────────────────┘

Relevance to GLAM:

  • We have semantic router for intent
  • We have template classification
  • Missing: Hierarchical intent taxonomy
  • Missing: Schema-aware slot validation

2. Current GLAM Architecture

2.1 Current 3-Tier System

User Question
    ↓
┌─────────────────────────────────────────┐
│  TIER 1: Pattern Matching               │
│  - Regex-based template matching        │
│  - Slot type validation                 │
│  - Confidence ≥ 0.75 required           │
│  - ~1ms latency                         │
└─────────────────────────────────────────┘
    ↓ (if no match)
┌─────────────────────────────────────────┐
│  TIER 2: Embedding Matching             │
│  - Sentence-transformer embeddings      │
│  - Cosine similarity ≥ 0.70             │
│  - ~50ms latency (cached)               │
└─────────────────────────────────────────┘
    ↓ (if no match)
┌─────────────────────────────────────────┐
│  TIER 3: LLM Classification             │
│  - DSPy ChainOfThought                  │
│  - Template ID classification           │
│  - ~500-2000ms latency                  │
└─────────────────────────────────────────┘
    ↓
Slot Extraction (DSPy)
    ↓
Template Instantiation (Jinja2)
    ↓
SPARQL Query
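
The cascade above can be sketched as a single dispatch function. This is a minimal sketch of the control flow only; the `Match` dataclass and the lambda matchers are stand-ins, and the actual GLAM class names may differ.

```python
# Minimal sketch of the 3-tier cascade with the thresholds described above.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Match:
    template_id: str
    confidence: float
    tier: str

def classify(question: str,
             tier1: Callable[[str], Optional[Match]],
             tier2: Callable[[str], Optional[Match]],
             tier3: Callable[[str], Match]) -> Match:
    # Tier 1: regex patterns, confidence >= 0.75 required
    m = tier1(question)
    if m and m.confidence >= 0.75:
        return m
    # Tier 2: embedding similarity, cosine >= 0.70 required
    m = tier2(question)
    if m and m.confidence >= 0.70:
        return m
    # Tier 3: LLM classification always produces a result
    return tier3(question)

# Usage with stub matchers (regex misses, embedding hits):
result = classify(
    "Which museums are in Utrecht?",
    tier1=lambda q: None,
    tier2=lambda q: Match("museums_by_location", 0.82, "embedding"),
    tier3=lambda q: Match("fallback", 0.5, "llm"),
)
# result.tier == "embedding"
```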

2.2 Strengths

| Aspect | Current Implementation | Rating |
|---|---|---|
| Deterministic first | Regex before embeddings before LLM | ✅ |
| Semantic similarity | Sentence-transformer embeddings | ✅ |
| Multilingual | Dutch/English/German patterns | ✅ |
| Conversation context | Context resolver for follow-ups | ✅ |
| Relevance filtering | Fyke filter for out-of-scope | ✅ |
| Slot resolution | Synonym resolver with fuzzy match | ✅ |
| Template variants | Region/country/ISIL variants | ✅ |

2.3 Gaps vs SOTA

| Gap | SOTA Reference | Impact | Priority |
|---|---|---|---|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |

3. Proposed Improvements

3.1 Add Tier 2.5: RAG-Enhanced Matching

Insert a RAG tier between embedding matching and LLM fallback:

# After embedding match fails:
from collections import Counter
from typing import Optional

class RAGEnhancedMatch:
    """Context-enriched matching using similar examples."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve top-3 most similar Q&A examples from YAML
        similar_examples = self._retrieve_similar_examples(question, k=3)
        if not similar_examples:
            return None

        # Check if examples strongly suggest a template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]

        if count >= 2:  # 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.85 (2/3) or 0.90 (3/3)
                reasoning=f"RAG: {count}/3 similar examples use {top_template}"
            )
        return None

Benefits:

  • Handles paraphrases that embeddings miss
  • Uses existing example data in templates
  • Cheaper than LLM fallback
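
The matcher above assumes a `_retrieve_similar_examples` helper. A minimal stdlib-only sketch of that retrieval step follows; a real implementation would rank with the same sentence-transformer embeddings as Tier 2 rather than string similarity, and the `Example` dataclass is illustrative.

```python
# Stdlib-only sketch of the example-retrieval step the RAG tier depends on.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Example:
    question: str
    template_id: str

def retrieve_similar_examples(question: str,
                              examples: list[Example],
                              k: int = 3) -> list[Example]:
    """Rank template Q&A examples by similarity to the incoming question."""
    scored = sorted(examples,
                    key=lambda ex: SequenceMatcher(
                        None, question.lower(), ex.question.lower()).ratio(),
                    reverse=True)
    return scored[:k]

examples = [
    Example("Which museums are in Amsterdam?", "museums_by_location"),
    Example("List museums located in Utrecht", "museums_by_location"),
    Example("When was the Rijksmuseum founded?", "founding_date"),
]
top = retrieve_similar_examples("Which museums are in Rotterdam?", examples)
# top[0].template_id == "museums_by_location"
```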

3.2 Add SPARQL Validation Feedback Loop

After template instantiation, validate SPARQL against schema:

import re
from pathlib import Path

class SPARQLValidator:
    """Validates generated SPARQL against ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)
    def validate(self, sparql: str) -> ValidationResult:
        errors = []
        
        # Extract predicates used in query
        predicates = re.findall(r'(hc:\w+|schema:\w+)', sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")
        
        # Extract classes
        classes = re.findall(r'a\s+(hcc:\w+)', sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")
        
        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors)
        )
    
    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use LLM to correct validation errors."""
        prompt = f"""
        The following SPARQL query has errors:
        
        ```sparql
        {sparql}
        ```
        
        Errors found:
        {chr(10).join(f'- {e}' for e in errors)}
        
        Correct the query. Return only the corrected SPARQL.
        """
        # Call LLM for correction
        return self._call_llm(prompt)

Benefits:

  • Catches schema mismatches before execution
  • Enables iterative correction (SPARQL-LLM pattern)
  • Reduces runtime errors
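
The iterative correction loop the validator enables can be sketched as follows. The `validate`/`correct` callables stand in for the `SPARQLValidator.validate` and `correct_with_llm` interfaces sketched above; the stubs and `MAX_ATTEMPTS` bound are assumptions, not existing GLAM code.

```python
# Sketch of the SPARQL-LLM-style validate -> correct -> re-validate loop.
MAX_ATTEMPTS = 3

def validate_and_correct(sparql: str, validate, correct) -> str:
    """Re-validate after each LLM correction, up to MAX_ATTEMPTS rounds."""
    for _ in range(MAX_ATTEMPTS):
        errors = validate(sparql)
        if not errors:
            return sparql
        sparql = correct(sparql, errors)
    raise ValueError(f"SPARQL still invalid after {MAX_ATTEMPTS} corrections: {errors}")

# Stubbed example: one bad predicate that the "LLM" fixes on the first round.
fixed = validate_and_correct(
    "SELECT ?m WHERE { ?m hc:locatedIn ?c }",
    validate=lambda q: ["Unknown predicate: hc:locatedIn"] if "locatedIn" in q else [],
    correct=lambda q, errs: q.replace("hc:locatedIn", "hc:location"),
)
# fixed == "SELECT ?m WHERE { ?m hc:location ?c }"
```

Bounding the loop matters: without `MAX_ATTEMPTS`, a query the LLM cannot fix would loop forever.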

3.3 Schema-Aware Slot Filling

Use ontology to validate extracted slot values:

class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""
    
    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)
    
    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction
        raw_slots = self.extract(question=question, ...)
        
        # Validate against ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check if value maps to valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try fuzzy match against ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} -> {match}")
            else:
                # Pass through slots that have no ontology constraint
                validated_slots[slot_name] = value

        return validated_slots

Benefits:

  • Ensures slot values are ontology-compliant
  • Auto-corrects minor extraction errors
  • Reduces downstream SPARQL errors

3.4 GEPA Optimization for DSPy Modules

Add GEPA optimization training for key modules:

# backend/rag/optimization/gepa_training.py

import dspy
from dspy import GEPA

def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""
    
    # Load training data from template examples
    training_data = load_training_examples()
    
    # Define metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0
    
    # Initialize GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )
    
    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )
    
    # Save optimized module
    optimized.save("optimized_template_classifier.json")
    
    return optimized

Benefits:

  • 10-20% accuracy improvement typical
  • Automated prompt refinement
  • Domain-specific optimization

3.5 Hierarchical Intent Classification

Structure intents hierarchically for scalability:

# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - budget_threshold
    - expense_comparison

class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""
    
    def classify(self, question: str) -> IntentResult:
        # Stage 1: Classify into top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.
        
        # Stage 2: Classify into specific intent within category
        specific = self._classify_specific(question, top_level)
        
        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence)
        )

Benefits:

  • Scales to 50+ templates without accuracy loss
  • Faster classification (fewer options per stage)
  • Better organized codebase

4. Implementation Priority

Phase 1: High Impact (1-2 days)

  1. SPARQL Validation Loop (3.2)

    • Load schema from LinkML
    • Validate predicates/classes
    • Add LLM correction step
  2. Metrics Enhancement

    • Track tier usage distribution
    • Track latency per tier
    • Track validation error rates
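
A minimal stdlib sketch of the three metrics listed above; the `TierMetrics` class and its method names are illustrative (a real deployment would export these counters to a monitoring backend rather than keep them in memory).

```python
# Sketch of tier-usage, latency, and validation-error tracking.
from collections import Counter, defaultdict

class TierMetrics:
    def __init__(self):
        self.tier_counts = Counter()
        self.latencies = defaultdict(list)
        self.validation_errors = 0

    def record(self, tier: str, latency_ms: float, validation_failed: bool = False):
        self.tier_counts[tier] += 1
        self.latencies[tier].append(latency_ms)
        if validation_failed:
            self.validation_errors += 1

    def summary(self) -> dict:
        total = sum(self.tier_counts.values())
        return {
            "distribution": dict(self.tier_counts),
            "avg_latency_ms": {t: sum(v) / len(v) for t, v in self.latencies.items()},
            "validation_error_rate": self.validation_errors / max(1, total),
        }

metrics = TierMetrics()
metrics.record("pattern", 1.2)
metrics.record("embedding", 48.0)
metrics.record("llm", 900.0, validation_failed=True)
```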

Phase 2: Medium Impact (2-3 days)

  1. RAG-Enhanced Tier (3.1)

    • Index template examples
    • Implement retrieval
    • Add as Tier 2.5
  2. Schema-Aware Slot Filling (3.3)

    • Load ontology
    • Validate extracted values
    • Auto-correct mismatches

Phase 3: Optimization (3-5 days)

  1. GEPA Training (3.4)

    • Create training dataset
    • Define metrics
    • Run optimization
    • Deploy optimized modules
  2. Hierarchical Intents (3.5)

    • Design taxonomy
    • Implement two-stage classifier
    • Migrate existing templates

5. Expected Outcomes

| Improvement | Expected Impact | Measurement |
|---|---|---|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |

6. References

  1. SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
  2. COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
  3. KGQuest (arXiv:2511.11258) - Deterministic template + LLM refinement
  4. FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
  5. Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
  6. DSPy GEPA - Reflective prompt evolution
  7. Hybrid NLQ→SPARQL (LinkedIn 2024) - Template-first with LLM fallback