# Template-Based SPARQL Query Generation

## Overview

This plan documents the implementation of a **template-based SPARQL query generation system** for the de Aa ArchiefAssistent (archief.support). The system replaces unreliable LLM-generated SPARQL with validated, pre-defined query templates that are selected and instantiated based on user intent.

## Problem Statement

The current RAG system generates SPARQL queries with an LLM, which leads to four recurring problems:

1. **Syntax errors** - Orphaned punctuation (a lone `.` on an otherwise empty line) left behind by auto-correction
2. **Invalid predicates** - e.g. `crm:P53_has_former_or_current_location` (CIDOC-CRM is not part of the ontology)
3. **Undefined prefixes** - e.g. `wd:` used without a `PREFIX` declaration
4. **Inconsistent results** - The same question generates different (sometimes broken) SPARQL across runs

### Example Failure

Question: "Welke archieven zijn er in Drenthe?"

```sparql
# LLM-generated query fragment; the orphaned "." triggers an HTTP 400 error:
?archive skos:prefLabel ?name .
  .                              # <-- SYNTAX ERROR
  FILTER(CONTAINS(STR(?archive), "NL-DR")) .
```
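
Two of the failure modes above (orphaned punctuation and undeclared prefixes) can be caught with cheap textual checks before a query reaches the endpoint. The following is a minimal sketch, not the project's `sparql_linter.py`: the function names are hypothetical, the checks are regex heuristics rather than a real SPARQL parser, and `wd:Q_PLACEHOLDER` is a made-up identifier used only to trigger the prefix check.

```python
import re

PREFIX_DECL = re.compile(r"\bPREFIX\s+([A-Za-z][\w-]*):", re.IGNORECASE)
PREFIXED_NAME = re.compile(r"(?<![\w:])([A-Za-z][\w-]*):(?=[A-Za-z_])")


def find_orphan_punctuation(query: str) -> list:
    """Return 1-based numbers of lines that hold only '.', ';' or ','."""
    return [
        number
        for number, line in enumerate(query.splitlines(), start=1)
        if line.strip() in {".", ";", ","}
    ]


def find_undeclared_prefixes(query: str) -> list:
    """Return prefixes that are used in the query but never declared."""
    declared = set(PREFIX_DECL.findall(query))
    # Drop IRIs, string literals and comments so their contents are not
    # mistaken for prefixed names.
    body = re.sub(r"<[^<>\s]*>", " ", query)
    body = re.sub(r'"(?:[^"\\]|\\.)*"', " ", body)
    body = re.sub(r"#.*", " ", body)
    used = set(PREFIXED_NAME.findall(body))
    return sorted(used - declared)


# Hypothetical broken query combining both problems.
broken = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?archive ?name WHERE {
  ?archive skos:prefLabel ?name .
  .
  ?archive skos:exactMatch wd:Q_PLACEHOLDER .
}"""

print(find_orphan_punctuation(broken))   # [4]
print(find_undeclared_prefixes(broken))  # ['wd']
```

Checks like these only flag known failure patterns; they do not prove a query is valid, so a real parser or the endpoint itself remains the final authority.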

## Solution: Template-Based KBQA

Formica et al. (2023) report that **template-based Knowledge Base Question Answering (KBQA)** achieves **65.44% precision**, compared with 10.52% for LLM-only approaches.

### Architecture

```
User Question
     |
     v
+--------------------+
| Intent Classifier  |  <-- DSPy Signature for question classification
+--------------------+
     |
     v
+--------------------+
| Template Router    |  <-- Selects appropriate SPARQL template
+--------------------+
     |
     v
+--------------------+
| Entity Extractor   |  <-- Extracts slots (province, type, etc.)
+--------------------+
     |
     v
+--------------------+
| Template Filler    |  <-- Instantiates template with slot values
+--------------------+
     |
     v
+--------------------+
| SPARQL Validator   |  <-- Validates syntax before execution
+--------------------+
     |
     v
Valid SPARQL Query
```
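
The diagram can be read as five functions applied in sequence. The sketch below wires them together with injected callables so each stage can be tested and replaced on its own. `TemplatePipeline` and all stage implementations are hypothetical toys for illustration; in particular, the lambda classifier stands in for the DSPy Signature and is not how the real classifier would work.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TemplatePipeline:
    """Chains the five stages from the diagram; every stage is injectable."""

    classify: Callable[[str], str]          # question -> intent label
    route: Callable[[str], Optional[str]]   # intent label -> template text
    extract: Callable[[str], dict]          # question -> slot values
    fill: Callable[[str, dict], str]        # template + slots -> query
    validate: Callable[[str], list]         # query -> list of problems

    def run(self, question: str) -> Optional[str]:
        template = self.route(self.classify(question))
        if template is None:
            return None  # no template covers this intent
        query = self.fill(template, self.extract(question))
        problems = self.validate(query)
        if problems:
            raise ValueError(f"Invalid SPARQL: {problems}")
        return query


# Toy stage implementations, for illustration only.
pipeline = TemplatePipeline(
    classify=lambda q: "region_institution_search" if " in " in q else "unknown",
    route={
        "region_institution_search": 'FILTER(CONTAINS(STR(?s), "{{province_code}}"))'
    }.get,
    extract=lambda q: {"province_code": "NL-DR"} if "Drenthe" in q else {},
    fill=lambda t, slots: t.replace("{{province_code}}", slots["province_code"]),
    validate=lambda q: ["unfilled placeholder"] if "{{" in q else [],
)

print(pipeline.run("Welke archieven zijn er in Drenthe?"))
print(pipeline.run("Hallo"))  # None: no template matches
```

Returning `None` for an unmatched intent, rather than raising, leaves the caller free to decide what happens next.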

## Documentation Index

| Document | Description |
|----------|-------------|
| [methodology.md](methodology.md) | Academic methodology and research citations |
| [design-patterns.md](design-patterns.md) | Software patterns (Strategy, Template Method, Chain of Responsibility) |
| [tdd.md](tdd.md) | Test-driven development approach with test cases |
| [dspy-compatibility.md](dspy-compatibility.md) | DSPy integration for template classification |
| [rag-integration.md](rag-integration.md) | SPARQL-first retrieval flow integration |
| [external-dependencies.md](external-dependencies.md) | Required libraries and services |
| [templates-schema.md](templates-schema.md) | YAML/JSON schema for template definitions |
| [competency-questions.md](competency-questions.md) | Ontology coverage testing, fyke filter for irrelevant questions |
| [conversation-context.md](conversation-context.md) | Follow-up prompt handling with DSPy History |

## Quick Start

### 1. Template Definition

```yaml
templates:
  region_institution_search:
    id: "region_institution_search"
    question_patterns:
      - "Welke {institution_type_nl} zijn er in {province}?"
      - "{institution_type_nl} in {province}"
    sparql_template: |
      PREFIX hc: <https://nde.nl/ontology/hc/class/>
      PREFIX hcp: <https://nde.nl/ontology/hc/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?institution ?name WHERE {
        ?institution a hc:Custodian ;
                     hcp:institutionType "{{institution_type_code}}" ;
                     skos:prefLabel ?name .
        FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
      }
    slots:
      institution_type_code:
        source: sparql_validation_rules.json#institution_types
      province_code:
        source: sparql_validation_rules.json#subregions
```
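
To illustrate how the `question_patterns` entries relate to incoming questions, the sketch below compiles each pattern into a regular expression with one named group per `{slot}`. This is a plain regex stand-in chosen for illustration, not the DSPy classifier described in this plan; `pattern_to_regex` and `match_question` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Patterns copied from the template definition above.
TEMPLATES = {
    "region_institution_search": [
        "Welke {institution_type_nl} zijn er in {province}?",
        "{institution_type_nl} in {province}",
    ],
}


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Turn 'text {slot} text' into an anchored regex with named groups."""
    # re.split with a capture group alternates: literal, slot name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+?)"   # slot placeholder
    return re.compile(rf"^\s*{regex}\s*$", re.IGNORECASE)


def match_question(question: str, templates: dict) -> Optional[Tuple[str, dict]]:
    """Return (template_id, raw slot values) for the first matching pattern."""
    for template_id, patterns in templates.items():
        for pattern in patterns:
            match = pattern_to_regex(pattern).match(question)
            if match:
                return template_id, match.groupdict()
    return None


print(match_question("Welke archieven zijn er in Drenthe?", TEMPLATES))
print(match_question("Hoe laat is het?", TEMPLATES))  # None
```

Pattern order matters in this sketch: the short second pattern would also match the full question (capturing "Welke archieven zijn er" as the institution type), so more specific patterns should be listed first.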

### 2. Usage Flow

```python
# 1. Classify question intent
intent = classifier.classify("Welke archieven zijn er in Drenthe?")
# -> Intent(template_id="region_institution_search", slots={"province": "Drenthe", "institution_type_nl": "archieven"})

# 2. Extract slot values
slots = extractor.extract(intent)
# -> {"institution_type_code": "A", "province_code": "NL-DR"}

# 3. Instantiate template
query = filler.fill("region_institution_search", slots)
# -> Valid SPARQL query

# 4. Execute against SPARQL endpoint
results = execute_sparql(query)
```
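
Steps 2 and 3 can be sketched without any LLM involvement. The example below assumes the lookup tables are small in-memory dictionaries standing in for `sparql_validation_rules.json` (only the two mappings shown in this document are included), and that slot values must match a conservative character whitelist before substitution. `extract_slots`, `fill_template`, and `SAFE_VALUE` are hypothetical names.

```python
import re

# Hypothetical stand-ins for sparql_validation_rules.json; only the values
# that appear in this document are listed.
INSTITUTION_TYPES = {"archieven": "A"}
SUBREGIONS = {"drenthe": "NL-DR"}

SPARQL_TEMPLATE = """PREFIX hc: <https://nde.nl/ontology/hc/class/>
PREFIX hcp: <https://nde.nl/ontology/hc/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?institution ?name WHERE {
  ?institution a hc:Custodian ;
               hcp:institutionType "{{institution_type_code}}" ;
               skos:prefLabel ?name .
  FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
}"""

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
SAFE_VALUE = re.compile(r"^[A-Za-z0-9-]+$")  # assumption: codes are alphanumeric


def extract_slots(raw: dict) -> dict:
    """Map natural-language slot values to the codes used in the template."""
    try:
        return {
            "institution_type_code": INSTITUTION_TYPES[raw["institution_type_nl"].lower()],
            "province_code": SUBREGIONS[raw["province"].lower()],
        }
    except KeyError as exc:
        raise ValueError(f"Unknown or missing slot value: {exc}") from exc


def fill_template(template: str, slots: dict) -> str:
    """Replace every {{slot}} placeholder, rejecting missing or unsafe values."""
    def substitute(match):
        name = match.group(1)
        if name not in slots:
            raise ValueError(f"Missing slot: {name}")
        if not SAFE_VALUE.match(slots[name]):
            raise ValueError(f"Unsafe value for slot {name!r}")
        return slots[name]

    return PLACEHOLDER.sub(substitute, template)


slots = extract_slots({"institution_type_nl": "archieven", "province": "Drenthe"})
print(fill_template(SPARQL_TEMPLATE, slots))
```

Because only whitelisted codes are ever substituted into the query, user text never reaches the SPARQL string directly, which is what keeps the output syntactically predictable.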

## Integration Points

| Component | File | Integration |
|-----------|------|-------------|
| Query Router | `dspy_heritage_rag.py` | Add template classification before LLM |
| Slot Filler | `ontology_mapping.py` | Use for entity extraction |
| Validation | `sparql_linter.py` | Validate instantiated templates |
| Rules | `sparql_validation_rules.json` | Source for slot enum values |

## Key Metrics

| Metric | Current (LLM-only) | Target (Template-based) |
|--------|-------------------|-------------------------|
| Query success rate | ~40% | >95% |
| Syntax errors | ~30% | 0% |
| Response time | ~3s | <1s |
| Consistency | Low | High |

## Next Steps After Planning

1. **Create template definitions** for top 10 question types
2. **Implement TemplateRouter** (DSPy Signature that classifies the question and selects a template)
3. **Implement SlotFiller** (entity extraction using `ontology_mapping.py`)
4. **Implement TemplateInstantiator** (fills slots to produce valid SPARQL)
5. **Add fallback** to LLM generation for unmatched questions
6. **Write tests** for each template (TDD approach)
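
The fallback in step 5 could look roughly like the sketch below: try the template path first, and only call the LLM when no template matches or the templated query fails validation. `generate_query` and its callable parameters are hypothetical; the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple


def generate_query(
    question: str,
    from_template: Callable[[str], Optional[str]],
    from_llm: Callable[[str], str],
    validate: Callable[[str], List[str]],
) -> Tuple[str, str]:
    """Return (query, source), where source is "template" or "llm"."""
    query = from_template(question)
    if query is not None and not validate(query):
        return query, "template"
    # No template matched (or it failed validation): fall back to the LLM,
    # but still refuse to return a query that does not validate.
    query = from_llm(question)
    problems = validate(query)
    if problems:
        raise ValueError(f"LLM fallback produced invalid SPARQL: {problems}")
    return query, "llm"


# Stub stages, for illustration only.
def template_stub(question: str) -> Optional[str]:
    return "SELECT ?s WHERE { ?s ?p ?o }" if "Drenthe" in question else None


def llm_stub(question: str) -> str:
    return "SELECT ?x WHERE { ?x ?y ?z }"


def validate_stub(query: str) -> List[str]:
    return [] if query.startswith("SELECT") else ["not a SELECT query"]


print(generate_query("Welke archieven zijn er in Drenthe?", template_stub, llm_stub, validate_stub))
print(generate_query("Iets heel anders", template_stub, llm_stub, validate_stub))
```

Returning the source alongside the query makes it straightforward to log how often the fallback is used, which in turn indicates which question types still need a template.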

## References

- Formica et al. (2023) - Template-based approach for QA over knowledge bases
- Steinmetz et al. (2019) - Pattern-based NL to SPARQL
- Zheng et al. (2018) - Template decomposition for complex questions
- SPARQL-LLM (2025) - Hybrid template + LLM approach