
SOTA Analysis: Template-Based SPARQL Generation

Date: 2025-01-07 Status: Active Research Author: OpenCode

Executive Summary

Based on comprehensive research of 2024-2025 academic papers and industry practices, this document compares our current implementation against state-of-the-art (SOTA) approaches and recommends improvements.

Key Finding: Our 3-tier architecture (regex → embedding → LLM) is well-aligned with SOTA hybrid approaches. The primary improvement opportunities are:

  1. Add a RAG-enhanced tier between embedding and LLM
  2. Implement a SPARQL validation feedback loop
  3. Add schema-aware slot filling
  4. Apply GEPA optimization to the DSPy modules

1. Research Survey

1.1 SPARQL-LLM (arXiv 2512.14277, Dec 2024)

Key Innovation: Real-time SPARQL generation with 24% F1 improvement over TEXT2SPARQL winners.

Architecture:

User Question
    ↓
┌─────────────────────────────────────────┐
│  Metadata Indexer                       │
│  - Schema classes/properties indexed    │
│  - Example Q&A pairs vectorized         │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Prompt Builder (RAG)                   │
│  - Retrieve similar examples            │
│  - Retrieve relevant schema fragments   │
│  - Compose context-rich prompt          │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  SPARQL Generator                       │
│  - LLM generates SPARQL                 │
│  - Validation against schema            │
│  - Iterative correction loop            │
└─────────────────────────────────────────┘
    ↓
Validated SPARQL

Relevance to GLAM:

  • We have schema (LinkML) but don't use it in prompts
  • We have example Q&A in templates but don't retrieve semantically
  • Missing: Schema-aware validation loop

1.2 COT-SPARQL (SEMANTICS 2024)

Key Innovation: Chain-of-Thought prompting with context injection.

Two Context Types:

  • Context A: Entity and relation extraction from question
  • Context B: Most semantically similar example from training set

Performance: 4.4% F1 improvement on QALD-10, 3.0% on QALD-9

Relevance to GLAM:

  • Our embedding matcher finds similar patterns (partial Context B)
  • Missing: Entity/relation extraction step (Context A)
  • Missing: CoT prompting in LLM tier
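
To make the two context types concrete, here is a hedged, stdlib-only sketch of how Context A (entity extraction) and Context B (nearest solved example) could be composed into a CoT prompt. All names (`extract_entities`, `EXAMPLE_STORE`, `build_cot_prompt`) and the toy similarity measure are illustrative, not taken from the COT-SPARQL paper's code.

```python
# Illustrative sketch of COT-SPARQL's two context types; helper names
# and the token-overlap similarity are placeholders, not the paper's method.
EXAMPLE_STORE = [
    {"question": "Which museums are in Amsterdam?",
     "sparql": "SELECT ?m WHERE { ?m a hcc:Museum ; hc:location \"Amsterdam\" }"},
]

def extract_entities(question: str) -> list[str]:
    """Context A: naive entity extraction (capitalized tokens as a placeholder)."""
    return [tok.strip("?.,") for tok in question.split() if tok[:1].isupper()]

def most_similar_example(question: str) -> dict:
    """Context B: pick the stored example with the highest token overlap."""
    q_tokens = set(question.lower().split())
    return max(EXAMPLE_STORE,
               key=lambda ex: len(q_tokens & set(ex["question"].lower().split())))

def build_cot_prompt(question: str) -> str:
    ents = extract_entities(question)
    ex = most_similar_example(question)
    return (f"Entities and relations found: {ents}\n"
            f"Similar solved example:\nQ: {ex['question']}\nSPARQL: {ex['sparql']}\n"
            f"Think step by step, then write SPARQL for: {question}")
```

A production version would replace both placeholders: a NER step for Context A and the same sentence-transformer index used in Tier 2 for Context B.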

1.3 KGQuest (arXiv 2511.11258, Nov 2024)

Key Innovation: Deterministic template generation + LLM refinement.

Architecture:

KG Triplets
    ↓
Cluster by relation type
    ↓
Generate rule-based templates (deterministic)
    ↓
LLM refinement for fluency (lightweight, controlled)

Relevance to GLAM:

  • Validates our template-first approach
  • We use deterministic templates with LLM fallback
  • 💡 Insight: Use LLM only for refinement, not generation
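
A minimal sketch of that pipeline, assuming nothing about KGQuest's actual code: triplets are clustered by relation and a rule-based template is emitted per cluster, with the LLM fluency pass represented by a stub.

```python
# Illustrative KGQuest-style flow: deterministic templates first,
# LLM used only as a lightweight refinement step (stubbed here).
from collections import defaultdict

def build_templates(triplets: list[tuple[str, str, str]]) -> dict[str, str]:
    by_relation = defaultdict(list)
    for subj, rel, obj in triplets:
        by_relation[rel].append((subj, obj))
    # One deterministic, rule-based template per relation cluster
    return {rel: f"What is the {rel.replace('_', ' ')} of {{subject}}?"
            for rel in by_relation}

def refine_with_llm(template: str) -> str:
    """Stub: in KGQuest an LLM lightly rewrites templates for fluency."""
    return template  # no-op placeholder

templates = build_templates([
    ("Rijksmuseum", "founding_date", "1800"),
    ("Mauritshuis", "founding_date", "1822"),
    ("Rijksmuseum", "location", "Amsterdam"),
])
# templates["founding_date"] == "What is the founding date of {subject}?"
```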

1.4 Hybrid Template + LLM Fallback (LinkedIn, May 2024)

Key Innovation: Explicit tiered architecture with fallback.

Recommended Pattern:

def process_query(question):
    # Tier 1: Template matching (deterministic, high accuracy)
    match = template_matcher.match(question)
    if match and match.confidence >= 0.85:
        return render_template(match)
    
    # Tier 2: LLM generation (fallback)
    return llm_generate_sparql(question, schema_context)

Relevance to GLAM:

  • We already implement this pattern
  • 💡 Our threshold is 0.75, could consider raising for higher precision

1.5 GEPA Optimization (DSPy, 2024-2025)

Key Innovation: Genetic-Pareto optimization for prompt evolution.

Approach:

  • Dual-model: Cheap student LM + Smart reflection LM
  • Iterate: Run → Analyze failures → Generate improved prompts
  • Results: 10-20% accuracy improvements typical

Relevance to GLAM:

  • We use static DSPy signatures without optimization
  • 💡 Could apply GEPA to TemplateClassifier and SlotExtractor

1.6 Intent-Driven Hybrid Architecture (2024)

Key Pattern: Intent classification → Template selection → Slot filling → LLM fallback

User Query
    ↓
┌─────────────────────────────────────────┐
│  Intent Classifier                      │
│  - Embedding-based classification       │
│  - Hierarchical intent taxonomy         │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Template Selector                      │
│  - Map intent → available templates     │
│  - FAISS/vector retrieval for similar   │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Slot Filler                            │
│  - Schema-aware extraction              │
│  - Validation against ontology          │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  LLM Fallback                           │
│  - Only when template fails             │
│  - Constrained generation               │
└─────────────────────────────────────────┘

Relevance to GLAM:

  • We have semantic router for intent
  • We have template classification
  • Missing: Hierarchical intent taxonomy
  • Missing: Schema-aware slot validation

2. Current GLAM Architecture

2.1 Current 3-Tier System

User Question
    ↓
┌─────────────────────────────────────────┐
│  TIER 1: Pattern Matching               │
│  - Regex-based template matching        │
│  - Slot type validation                 │
│  - Confidence ≥ 0.75 required           │
│  - ~1ms latency                         │
└─────────────────────────────────────────┘
    ↓ (if no match)
┌─────────────────────────────────────────┐
│  TIER 2: Embedding Matching             │
│  - Sentence-transformer embeddings      │
│  - Cosine similarity ≥ 0.70             │
│  - ~50ms latency (cached)               │
└─────────────────────────────────────────┘
    ↓ (if no match)
┌─────────────────────────────────────────┐
│  TIER 3: LLM Classification             │
│  - DSPy ChainOfThought                  │
│  - Template ID classification           │
│  - ~500-2000ms latency                  │
└─────────────────────────────────────────┘
    ↓
Slot Extraction (DSPy)
    ↓
Template Instantiation (Jinja2)
    ↓
SPARQL Query
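
The cascade above can be sketched as a single dispatch function. This is a minimal sketch of the control flow only; the `Match` dataclass and the lambda matchers are stand-ins, and the actual GLAM class names may differ.

```python
# Minimal sketch of the 3-tier cascade with the thresholds described above.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Match:
    template_id: str
    confidence: float
    tier: str

def classify(question: str,
             tier1: Callable[[str], Optional[Match]],
             tier2: Callable[[str], Optional[Match]],
             tier3: Callable[[str], Match]) -> Match:
    # Tier 1: regex patterns, confidence >= 0.75 required
    m = tier1(question)
    if m and m.confidence >= 0.75:
        return m
    # Tier 2: embedding similarity, cosine >= 0.70 required
    m = tier2(question)
    if m and m.confidence >= 0.70:
        return m
    # Tier 3: LLM classification always produces a result
    return tier3(question)

# Usage with stub matchers (regex misses, embedding hits):
result = classify(
    "Which museums are in Utrecht?",
    tier1=lambda q: None,
    tier2=lambda q: Match("museums_by_location", 0.82, "embedding"),
    tier3=lambda q: Match("fallback", 0.5, "llm"),
)
# result.tier == "embedding"
```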

2.2 Strengths

| Aspect | Current Implementation | Rating |
|---|---|---|
| Deterministic first | Regex before embeddings before LLM | ✅ |
| Semantic similarity | Sentence-transformer embeddings | ✅ |
| Multilingual | Dutch/English/German patterns | ✅ |
| Conversation context | Context resolver for follow-ups | ✅ |
| Relevance filtering | Fyke filter for out-of-scope | ✅ |
| Slot resolution | Synonym resolver with fuzzy match | ✅ |
| Template variants | Region/country/ISIL variants | ✅ |

2.3 Gaps vs SOTA

| Gap | SOTA Reference | Impact | Priority |
|---|---|---|---|
| No RAG-enhanced tier | SPARQL-LLM, FIRESPARQL | Medium | High |
| No SPARQL validation loop | SPARQL-LLM | High | High |
| No schema-aware slot filling | Auto-KGQA, LLM-based NL2SPARQL | Medium | Medium |
| No GEPA optimization | DSPy GEPA tutorials | Medium | Medium |
| No hierarchical intents | Intent classification patterns | Low | Low |
| Limited metrics | SPARQL-LLM | Low | Low |

3. Proposed Improvements

3.1 Add Tier 2.5: RAG-Enhanced Matching

Insert a RAG tier between embedding matching and LLM fallback:

# After embedding match fails:
from collections import Counter
from typing import Optional

class RAGEnhancedMatch:
    """Context-enriched matching using similar examples."""

    def match(self, question: str, templates: dict) -> Optional[TemplateMatchResult]:
        # Retrieve top-3 most similar Q&A examples from YAML
        similar_examples = self._retrieve_similar_examples(question, k=3)
        if not similar_examples:
            return None

        # Check if examples strongly suggest a template
        template_votes = Counter(ex.template_id for ex in similar_examples)
        top_template, count = template_votes.most_common(1)[0]

        if count >= 2:  # 2 of 3 examples agree
            return TemplateMatchResult(
                matched=True,
                template_id=top_template,
                confidence=0.75 + (count / 3) * 0.15,  # 0.85 (2/3) or 0.90 (3/3)
                reasoning=f"RAG: {count}/3 similar examples use {top_template}"
            )
        return None

Benefits:

  • Handles paraphrases that embeddings miss
  • Uses existing example data in templates
  • Cheaper than LLM fallback
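
The matcher above assumes a `_retrieve_similar_examples` helper. A minimal stdlib-only sketch of that retrieval step follows; a real implementation would rank with the same sentence-transformer embeddings as Tier 2 rather than string similarity, and the `Example` dataclass is illustrative.

```python
# Stdlib-only sketch of the example-retrieval step the RAG tier depends on.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Example:
    question: str
    template_id: str

def retrieve_similar_examples(question: str,
                              examples: list[Example],
                              k: int = 3) -> list[Example]:
    """Rank template Q&A examples by similarity to the incoming question."""
    scored = sorted(examples,
                    key=lambda ex: SequenceMatcher(
                        None, question.lower(), ex.question.lower()).ratio(),
                    reverse=True)
    return scored[:k]

examples = [
    Example("Which museums are in Amsterdam?", "museums_by_location"),
    Example("List museums located in Utrecht", "museums_by_location"),
    Example("When was the Rijksmuseum founded?", "founding_date"),
]
top = retrieve_similar_examples("Which museums are in Rotterdam?", examples)
# top[0].template_id == "museums_by_location"
```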

3.2 Add SPARQL Validation Feedback Loop

After template instantiation, validate SPARQL against schema:

import re
from pathlib import Path

class SPARQLValidator:
    """Validates generated SPARQL against ontology schema."""

    def __init__(self, schema_path: Path):
        self.valid_predicates = self._load_predicates(schema_path)
        self.valid_classes = self._load_classes(schema_path)
    def validate(self, sparql: str) -> ValidationResult:
        errors = []
        
        # Extract predicates used in query
        predicates = re.findall(r'(hc:\w+|schema:\w+)', sparql)
        for pred in predicates:
            if pred not in self.valid_predicates:
                errors.append(f"Unknown predicate: {pred}")
        
        # Extract classes
        classes = re.findall(r'a\s+(hcc:\w+)', sparql)
        for cls in classes:
            if cls not in self.valid_classes:
                errors.append(f"Unknown class: {cls}")
        
        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            suggestions=self._suggest_fixes(errors)
        )
    
    def correct_with_llm(self, sparql: str, errors: list[str]) -> str:
        """Use LLM to correct validation errors."""
        prompt = f"""
        The following SPARQL query has errors:
        
        ```sparql
        {sparql}
        ```
        
        Errors found:
        {chr(10).join(f'- {e}' for e in errors)}
        
        Correct the query. Return only the corrected SPARQL.
        """
        # Call LLM for correction
        return self._call_llm(prompt)

Benefits:

  • Catches schema mismatches before execution
  • Enables iterative correction (SPARQL-LLM pattern)
  • Reduces runtime errors
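
The iterative correction loop the validator enables can be sketched as follows. The `validate`/`correct` callables stand in for the `SPARQLValidator.validate` and `correct_with_llm` interfaces sketched above; the stubs and `MAX_ATTEMPTS` bound are assumptions, not existing GLAM code.

```python
# Sketch of the SPARQL-LLM-style validate -> correct -> re-validate loop.
MAX_ATTEMPTS = 3

def validate_and_correct(sparql: str, validate, correct) -> str:
    """Re-validate after each LLM correction, up to MAX_ATTEMPTS rounds."""
    for _ in range(MAX_ATTEMPTS):
        errors = validate(sparql)
        if not errors:
            return sparql
        sparql = correct(sparql, errors)
    raise ValueError(f"SPARQL still invalid after {MAX_ATTEMPTS} corrections: {errors}")

# Stubbed example: one bad predicate that the "LLM" fixes on the first round.
fixed = validate_and_correct(
    "SELECT ?m WHERE { ?m hc:locatedIn ?c }",
    validate=lambda q: ["Unknown predicate: hc:locatedIn"] if "locatedIn" in q else [],
    correct=lambda q, errs: q.replace("hc:locatedIn", "hc:location"),
)
# fixed == "SELECT ?m WHERE { ?m hc:location ?c }"
```

Bounding the loop matters: without `MAX_ATTEMPTS`, a query the LLM cannot fix would loop forever.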

3.3 Schema-Aware Slot Filling

Use ontology to validate extracted slot values:

class SchemaAwareSlotExtractor(dspy.Module):
    """Slot extraction with ontology validation."""
    
    def __init__(self, ontology_path: Path):
        super().__init__()
        self.extract = dspy.ChainOfThought(SlotExtractorSignature)
        self.ontology = self._load_ontology(ontology_path)
    
    def forward(self, question: str, template_id: str, ...) -> dict[str, str]:
        # Standard DSPy extraction
        raw_slots = self.extract(question=question, ...)
        
        # Validate against ontology
        validated_slots = {}
        for slot_name, value in raw_slots.items():
            if slot_name == "institution_type":
                # Check if value maps to valid hc:institutionType
                if value in self.ontology.institution_types:
                    validated_slots[slot_name] = value
                else:
                    # Try fuzzy match against ontology
                    match = self._fuzzy_match_ontology(value, "institution_types")
                    if match:
                        validated_slots[slot_name] = match
                        logger.info(f"Corrected slot: {value} -> {match}")
            else:
                # Pass through slots that have no ontology constraint
                validated_slots[slot_name] = value

        return validated_slots

Benefits:

  • Ensures slot values are ontology-compliant
  • Auto-corrects minor extraction errors
  • Reduces downstream SPARQL errors

3.4 GEPA Optimization for DSPy Modules

Add GEPA optimization training for key modules:

# backend/rag/optimization/gepa_training.py

import dspy
from dspy import GEPA

def optimize_template_classifier():
    """Optimize TemplateClassifier using GEPA."""
    
    # Load training data from template examples
    training_data = load_training_examples()
    
    # Define metric
    def classification_metric(example, prediction):
        return 1.0 if prediction.template_id == example.expected_template else 0.0
    
    # Initialize GEPA optimizer
    optimizer = GEPA(
        metric=classification_metric,
        num_candidates=10,
        num_threads=4,
    )
    
    # Optimize
    classifier = TemplateClassifier()
    optimized = optimizer.compile(
        classifier,
        trainset=training_data,
        max_rounds=5,
    )
    
    # Save optimized module
    optimized.save("optimized_template_classifier.json")
    
    return optimized

Benefits:

  • 10-20% accuracy improvement typical
  • Automated prompt refinement
  • Domain-specific optimization

3.5 Hierarchical Intent Classification

Structure intents hierarchically for scalability:

# Intent taxonomy for 50+ intents
intent_hierarchy:
  geographic:
    - list_by_location
    - count_by_location
    - compare_locations
  temporal:
    - point_in_time
    - timeline
    - events_in_period
    - founding_date
  entity:
    - find_by_name
    - find_by_identifier
  statistical:
    - count_by_type
    - distribution
    - aggregation
  financial:
    - budget_threshold
    - budget_threshold
    - expense_comparison

class HierarchicalIntentClassifier:
    """Two-stage intent classification for scalability."""
    
    def classify(self, question: str) -> IntentResult:
        # Stage 1: Classify into top-level category (5 options)
        top_level = self._classify_top_level(question)  # geographic, temporal, etc.
        
        # Stage 2: Classify into specific intent within category
        specific = self._classify_specific(question, top_level)
        
        return IntentResult(
            top_level=top_level,
            specific=specific,
            confidence=min(top_level.confidence, specific.confidence)
        )

Benefits:

  • Scales to 50+ templates without accuracy loss
  • Faster classification (fewer options per stage)
  • Better organized codebase

4. Implementation Priority

Phase 1: High Impact (1-2 days)

  1. SPARQL Validation Loop (3.2)

    • Load schema from LinkML
    • Validate predicates/classes
    • Add LLM correction step
  2. Metrics Enhancement

    • Track tier usage distribution
    • Track latency per tier
    • Track validation error rates
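
A minimal stdlib sketch of the three metrics listed above; the `TierMetrics` class and its method names are illustrative (a real deployment would export these counters to a monitoring backend rather than keep them in memory).

```python
# Sketch of tier-usage, latency, and validation-error tracking.
from collections import Counter, defaultdict

class TierMetrics:
    def __init__(self):
        self.tier_counts = Counter()
        self.latencies = defaultdict(list)
        self.validation_errors = 0

    def record(self, tier: str, latency_ms: float, validation_failed: bool = False):
        self.tier_counts[tier] += 1
        self.latencies[tier].append(latency_ms)
        if validation_failed:
            self.validation_errors += 1

    def summary(self) -> dict:
        total = sum(self.tier_counts.values())
        return {
            "distribution": dict(self.tier_counts),
            "avg_latency_ms": {t: sum(v) / len(v) for t, v in self.latencies.items()},
            "validation_error_rate": self.validation_errors / max(1, total),
        }

metrics = TierMetrics()
metrics.record("pattern", 1.2)
metrics.record("embedding", 48.0)
metrics.record("llm", 900.0, validation_failed=True)
```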

Phase 2: Medium Impact (2-3 days)

  1. RAG-Enhanced Tier (3.1)

    • Index template examples
    • Implement retrieval
    • Add as Tier 2.5
  2. Schema-Aware Slot Filling (3.3)

    • Load ontology
    • Validate extracted values
    • Auto-correct mismatches

Phase 3: Optimization (3-5 days)

  1. GEPA Training (3.4)

    • Create training dataset
    • Define metrics
    • Run optimization
    • Deploy optimized modules
  2. Hierarchical Intents (3.5)

    • Design taxonomy
    • Implement two-stage classifier
    • Migrate existing templates

5. Expected Outcomes

| Improvement | Expected Impact | Measurement |
|---|---|---|
| SPARQL Validation | -50% runtime errors | Error rate tracking |
| RAG-Enhanced Tier | +5-10% template match rate | Tier 2.5 success rate |
| Schema-Aware Slots | -30% slot errors | Validation error logs |
| GEPA Optimization | +10-20% LLM tier accuracy | Template classification F1 |
| Hierarchical Intents | Ready for 50+ templates | Intent classification latency |

6. References

  1. SPARQL-LLM (arXiv:2512.14277) - Real-time SPARQL generation
  2. COT-SPARQL (SEMANTICS 2024) - Chain-of-Thought prompting
  3. KGQuest (arXiv:2511.11258) - Deterministic template + LLM refinement
  4. FIRESPARQL (arXiv:2508.10467) - Modular framework with fine-tuning
  5. Auto-KGQA (ESWC 2024) - Autonomous KG subgraph selection
  6. DSPy GEPA - Reflective prompt evolution
  7. Hybrid NLQ→SPARQL (LinkedIn 2024) - Template-first with LLM fallback