# Template-Based SPARQL Query Generation
## Overview
This plan documents the implementation of a **template-based SPARQL query generation system** for the de Aa ArchiefAssistent (archief.support). The system replaces unreliable LLM-generated SPARQL with validated, pre-defined query templates that are selected and instantiated based on user intent.
## Problem Statement
The current RAG system generates SPARQL queries via LLM, which produces:
1. **Syntax errors** - Orphaned punctuation (`.` on empty lines) after auto-correction
2. **Invalid predicates** - `crm:P53_has_former_or_current_location` (a CIDOC-CRM property that is not part of the ontology)
3. **Undefined prefixes** - `wd:` used without declaration
4. **Inconsistent results** - Same question generates different (sometimes broken) SPARQL
### Example Failure
Question: "Welke archieven zijn er in Drenthe?"
```sparql
# LLM-generated query with orphaned "." causing 400 error:
?archive skos:prefLabel ?name .
. <-- SYNTAX ERROR
FILTER(CONTAINS(STR(?archive), "NL-DR")) .
```
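A line-level check is one way to catch this class of error before the query reaches the endpoint. The sketch below is a heuristic only: `find_orphaned_punctuation` is a hypothetical helper, not an existing function in `sparql_linter.py`, and it is no substitute for a real SPARQL parser.

```python
def find_orphaned_punctuation(query: str) -> list[int]:
    """Return 1-based numbers of lines that hold only '.', ';' or ','
    and directly follow a line that is already terminated."""
    orphans = []
    previous = ""  # last non-empty line seen, stripped
    for number, line in enumerate(query.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        # A lone terminator is only suspicious when the previous statement
        # already ended; "?s ?p ?o" followed by "." on the next line is legal.
        if stripped in {".", ";", ","} and previous.endswith((".", ";", ",", "{")):
            orphans.append(number)
        previous = stripped
    return orphans


broken = '''?archive skos:prefLabel ?name .
.
FILTER(CONTAINS(STR(?archive), "NL-DR")) .'''
print(find_orphaned_punctuation(broken))  # -> [2]
```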
## Solution: Template-Based KBQA
Formica et al. (2023) report that **template-based Knowledge Base Question Answering (KBQA)** achieves **65.44% precision**, compared with 10.52% for LLM-only approaches.
### Architecture
```
    User Question
          |
          v
+--------------------+
| Intent Classifier  |  <-- DSPy Signature for question classification
+--------------------+
          |
          v
+--------------------+
| Template Router    |  <-- Selects appropriate SPARQL template
+--------------------+
          |
          v
+--------------------+
| Entity Extractor   |  <-- Extracts slots (province, type, etc.)
+--------------------+
          |
          v
+--------------------+
| Template Filler    |  <-- Instantiates template with slot values
+--------------------+
          |
          v
+--------------------+
| SPARQL Validator   |  <-- Validates syntax before execution
+--------------------+
          |
          v
  Valid SPARQL Query
```
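The five stages can be wired together as plain callables, which keeps each stage replaceable and testable in isolation. The following is a minimal sketch under that assumption; the `Pipeline` class, its field names, and the stub stages are all hypothetical and only mirror the boxes in the diagram.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Chains the five stages from the diagram; every stage is injected."""
    classify: Callable[[str], str]                 # question -> intent label
    route: Callable[[str], str]                    # intent label -> template id
    extract: Callable[[str, str], dict[str, str]]  # (question, template id) -> slots
    fill: Callable[[str, dict[str, str]], str]     # (template id, slots) -> SPARQL
    validate: Callable[[str], list[str]]           # SPARQL -> list of problems

    def run(self, question: str) -> str:
        intent = self.classify(question)
        template_id = self.route(intent)
        slots = self.extract(question, template_id)
        query = self.fill(template_id, slots)
        problems = self.validate(query)
        if problems:
            # Never hand an invalid query to the endpoint.
            raise ValueError(f"invalid SPARQL: {problems}")
        return query


# Stub stages, for illustration only.
pipeline = Pipeline(
    classify=lambda q: "institutions_in_region",
    route=lambda intent: "region_institution_search",
    extract=lambda q, t: {"province_code": "NL-DR"},
    fill=lambda t, s: f'SELECT ?i WHERE {{ FILTER(CONTAINS(STR(?i), "{s["province_code"]}")) }}',
    validate=lambda query: [] if query.startswith("SELECT") else ["not a SELECT query"],
)
print(pipeline.run("Welke archieven zijn er in Drenthe?"))
```

Raising on validation problems, rather than returning them, is one possible design choice; a caller that wants to fall back to another strategy can catch the `ValueError`.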
## Documentation Index
| Document | Description |
|----------|-------------|
| [methodology.md](methodology.md) | Academic methodology and research citations |
| [design-patterns.md](design-patterns.md) | Software patterns (Strategy, Template Method, Chain of Responsibility) |
| [tdd.md](tdd.md) | Test-driven development approach with test cases |
| [dspy-compatibility.md](dspy-compatibility.md) | DSPy integration for template classification |
| [rag-integration.md](rag-integration.md) | SPARQL-first retrieval flow integration |
| [external-dependencies.md](external-dependencies.md) | Required libraries and services |
| [templates-schema.md](templates-schema.md) | YAML/JSON schema for template definitions |
| [competency-questions.md](competency-questions.md) | Ontology coverage testing, fyke filter for irrelevant questions |
| [conversation-context.md](conversation-context.md) | Follow-up prompt handling with DSPy History |
## Quick Start
### 1. Template Definition
```yaml
templates:
  region_institution_search:
    id: "region_institution_search"
    question_patterns:
      - "Welke {institution_type_nl} zijn er in {province}?"
      - "{institution_type_nl} in {province}"
    sparql_template: |
      PREFIX hc: <https://nde.nl/ontology/hc/class/>
      PREFIX hcp: <https://nde.nl/ontology/hc/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?institution ?name WHERE {
        ?institution a hc:Custodian ;
                     hcp:institutionType "{{institution_type_code}}" ;
                     skos:prefLabel ?name .
        FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
      }
    slots:
      institution_type_code:
        source: sparql_validation_rules.json#institution_types
      province_code:
        source: sparql_validation_rules.json#subregions
```
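Instantiating such a template amounts to replacing each `{{slot}}` placeholder with a validated value. The sketch below shows one way this could work; `fill_template` and the character whitelist are assumptions made for illustration, and the whitelist would need to be aligned with the actual values in `sparql_validation_rules.json`.

```python
import re

SPARQL_TEMPLATE = """PREFIX hc: <https://nde.nl/ontology/hc/class/>
PREFIX hcp: <https://nde.nl/ontology/hc/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?institution ?name WHERE {
  ?institution a hc:Custodian ;
               hcp:institutionType "{{institution_type_code}}" ;
               skos:prefLabel ?name .
  FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
}"""

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
# Assumed whitelist: codes such as "A" or "NL-DR" only use these characters.
SAFE_VALUE = re.compile(r"[A-Za-z0-9-]+")


def fill_template(template: str, slots: dict[str, str]) -> str:
    """Replace every {{name}} placeholder; reject missing or unsafe values."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in slots:
            raise KeyError(f"missing slot: {name}")
        value = slots[name]
        if not SAFE_VALUE.fullmatch(value):
            # Blocks quotes, braces, etc. that could alter the query structure.
            raise ValueError(f"unsafe value for slot {name}: {value!r}")
        return value

    return PLACEHOLDER.sub(substitute, template)


query = fill_template(
    SPARQL_TEMPLATE,
    {"institution_type_code": "A", "province_code": "NL-DR"},
)
print(query)
```

Because slot values come from a closed set of codes, rejecting anything outside the whitelist keeps the instantiated query structurally identical to the reviewed template.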
### 2. Usage Flow
```python
# 1. Classify question intent
intent = classifier.classify("Welke archieven zijn er in Drenthe?")
# -> Intent(template_id="region_institution_search", slots={"province": "Drenthe", "type": "archieven"})

# 2. Extract slot values
slots = extractor.extract(intent)
# -> {"institution_type_code": "A", "province_code": "NL-DR"}

# 3. Instantiate template
query = filler.fill("region_institution_search", slots)
# -> Valid SPARQL query

# 4. Execute against SPARQL endpoint
results = execute_sparql(query)
```
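Step 2 maps the surface forms found in the question to the codes the template expects. A minimal sketch of that lookup follows; `resolve_slots` and the two tables are hypothetical, and they contain only the one mapping shown in the example above. In practice the tables would presumably be loaded from `sparql_validation_rules.json`, whose exact structure is not shown here.

```python
# Hypothetical lookup tables; only the values from the example above are included.
INSTITUTION_TYPE_CODES = {"archieven": "A"}
PROVINCE_CODES = {"drenthe": "NL-DR"}


def resolve_slots(raw_slots: dict[str, str]) -> dict[str, str]:
    """Map raw slot text (as produced by the classifier) to template codes."""
    type_label = raw_slots["type"].strip().lower()
    province_label = raw_slots["province"].strip().lower()
    if type_label not in INSTITUTION_TYPE_CODES:
        raise LookupError(f"unknown institution type: {raw_slots['type']!r}")
    if province_label not in PROVINCE_CODES:
        raise LookupError(f"unknown province: {raw_slots['province']!r}")
    return {
        "institution_type_code": INSTITUTION_TYPE_CODES[type_label],
        "province_code": PROVINCE_CODES[province_label],
    }


print(resolve_slots({"province": "Drenthe", "type": "archieven"}))
# -> {'institution_type_code': 'A', 'province_code': 'NL-DR'}
```

Failing loudly on an unknown label, instead of passing the raw text through, is what keeps free-form user input out of the query.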
## Integration Points
| Component | File | Integration |
|-----------|------|-------------|
| Query Router | `dspy_heritage_rag.py` | Add template classification before LLM |
| Slot Filler | `ontology_mapping.py` | Use for entity extraction |
| Validation | `sparql_linter.py` | Validate instantiated templates |
| Rules | `sparql_validation_rules.json` | Source for slot enum values |
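As an illustration of the validation step, the sketch below checks for the undefined-prefix failure listed in the problem statement. It is a regex-based approximation, not the actual contents of `sparql_linter.py`; `undefined_prefixes` is a hypothetical name, and edge cases such as the empty prefix (`:name`) or comparison operators written without spaces are not handled.

```python
import re

IRI = re.compile(r"<[^<>\s]*>")      # <http://...> references
STRING = re.compile(r'"[^"\n]*"')    # simple double-quoted literals
COMMENT = re.compile(r"#[^\n]*")     # comments (safe once IRIs and strings are gone)
DECLARED = re.compile(r"PREFIX\s+([A-Za-z][\w-]*):", re.IGNORECASE)
USED = re.compile(r"(?<![\w-])([A-Za-z][\w-]*):\w")


def undefined_prefixes(query: str) -> set[str]:
    """Return prefixes that appear in prefixed names but have no PREFIX line."""
    # Remove IRIs first (they may contain '#'), then strings, then comments.
    stripped = COMMENT.sub(" ", STRING.sub(" ", IRI.sub(" ", query)))
    declared = set(DECLARED.findall(stripped))
    used = set(USED.findall(stripped))
    return used - declared


valid = """PREFIX hc: <https://nde.nl/ontology/hc/class/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?institution ?name WHERE {
  ?institution a hc:Custodian ;
               skos:prefLabel ?name .
}"""
# wd:Q42 is an arbitrary identifier, used only to trigger the check.
broken = valid.replace("}", "  ?institution ?p wd:Q42 .\n}")

print(undefined_prefixes(valid))   # -> set()
print(undefined_prefixes(broken))  # -> {'wd'}
```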
## Key Metrics
| Metric | Current (LLM-only) | Target (Template-based) |
|--------|-------------------|-------------------------|
| Query success rate | ~40% | >95% |
| Syntax errors | ~30% | 0% |
| Response time | ~3s | <1s |
| Consistency | Low | High |
## Next Steps After Planning
1. **Create template definitions** for top 10 question types
2. **Implement TemplateRouter** (DSPy Signature for classification)
3. **Implement SlotFiller** (entity extraction using ontology_mapping.py)
4. **Implement TemplateInstantiator** (fill slots -> valid SPARQL)
5. **Add fallback** to LLM generation for unmatched questions
6. **Write tests** for each template (TDD approach)
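
Step 5 can be sketched as a thin wrapper that tries templates first and only then calls the LLM. In the sketch below, plain regex matching on `question_patterns` stands in for the DSPy classifier, purely to keep the example self-contained; `pattern_to_regex`, `match_template`, `generate_query`, and the stub callables are all hypothetical names.

```python
import re
from typing import Callable, Optional

QUESTION_PATTERNS = {
    "region_institution_search": [
        "Welke {institution_type_nl} zijn er in {province}?",
        "{institution_type_nl} in {province}",
    ],
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '{slot}' placeholders into named groups; escape everything else."""
    escaped = re.escape(pattern)  # placeholders become \{slot\}
    body = re.sub(r"\\\{(\w+)\\\}", lambda m: f"(?P<{m.group(1)}>.+?)", escaped)
    return re.compile(body, re.IGNORECASE)


def match_template(question: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (template_id, raw slots) for the first matching pattern, else None."""
    for template_id, patterns in QUESTION_PATTERNS.items():
        for pattern in patterns:
            found = pattern_to_regex(pattern).fullmatch(question.strip())
            if found:
                return template_id, found.groupdict()
    return None


def generate_query(
    question: str,
    fill: Callable[[str, dict[str, str]], str],
    llm_generate: Callable[[str], str],
) -> tuple[str, str]:
    """Return (source, query), where source is 'template' or 'llm'."""
    matched = match_template(question)
    if matched is None:
        return "llm", llm_generate(question)  # fallback for unmatched questions
    template_id, raw_slots = matched
    return "template", fill(template_id, raw_slots)


# Stub callables, for illustration only.
stub_fill = lambda template_id, slots: f"# {template_id} {sorted(slots.items())}"
stub_llm = lambda question: "# LLM-generated query"

print(generate_query("Welke archieven zijn er in Drenthe?", stub_fill, stub_llm)[0])
print(generate_query("Wie was de eerste archivaris?", stub_fill, stub_llm)[0])
```

Returning the source alongside the query makes it straightforward to report how often the fallback path is taken.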
## References
- Formica et al. (2023) - Template-based approach for QA over knowledge bases
- Steinmetz et al. (2019) - Pattern-based NL to SPARQL
- Zheng et al. (2018) - Template decomposition for complex questions
- SPARQL-LLM (2025) - Hybrid template + LLM approach