# Template-Based SPARQL Query Generation

## Overview

This plan documents the implementation of a **template-based SPARQL query generation system** for the de Aa ArchiefAssistent (archief.support). The system replaces unreliable LLM-generated SPARQL with validated, pre-defined query templates that are selected and instantiated based on user intent.

## Problem Statement

The current RAG system generates SPARQL queries via an LLM, which produces:

1. **Syntax errors** - Orphaned punctuation (`.` on empty lines) after auto-correction
2. **Invalid predicates** - `crm:P53_has_former_or_current_location` (a CIDOC-CRM property that is not in the ontology)
3. **Undefined prefixes** - `wd:` used without declaration
4. **Inconsistent results** - The same question generates different (sometimes broken) SPARQL

### Example Failure

Question: "Welke archieven zijn er in Drenthe?" ("Which archives are there in Drenthe?")

```sparql
# LLM-generated query with orphaned "." causing 400 error:
?archive skos:prefLabel ?name .
.    # <-- SYNTAX ERROR
FILTER(CONTAINS(STR(?archive), "NL-DR")) .
```

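The orphaned `.` shown above can be caught with a simple line-based check before a query is sent to the endpoint. The following is a minimal sketch, not the project's `sparql_linter.py`: the function name `find_orphaned_dots` is hypothetical, and the heuristic only flags a dot-only line when the previous non-empty line already ended a triple (or opened a group), so it is not a full SPARQL parser.

```python
def find_orphaned_dots(query: str) -> list[int]:
    """Return 1-based line numbers of lines that consist solely of an orphaned '.'.

    Heuristic (an assumption of this sketch): a dot-only line is orphaned when
    it is the first non-empty line, or when the previous non-empty line already
    ends with '.' or '{'.
    """
    orphaned = []
    previous = ""
    for lineno, raw in enumerate(query.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines do not change the context
        if line == "." and (previous == "" or previous.endswith((".", "{"))):
            orphaned.append(lineno)
        previous = line
    return orphaned


broken = "\n".join([
    "?archive skos:prefLabel ?name .",
    ".",
    'FILTER(CONTAINS(STR(?archive), "NL-DR")) .',
])
print(find_orphaned_dots(broken))  # -> [2]
```

A check like this only detects one failure class; the template approach described next avoids producing such queries in the first place.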
## Solution: Template-Based KBQA

Published results indicate that **template-based Knowledge Base Question Answering (KBQA)** achieves **65.44% precision**, versus 10.52% for LLM-only approaches (Formica et al., 2023).

### Architecture

```
User Question
          |
          v
+--------------------+
| Intent Classifier  |  <-- DSPy Signature for question classification
+--------------------+
          |
          v
+--------------------+
| Template Router    |  <-- Selects appropriate SPARQL template
+--------------------+
          |
          v
+--------------------+
| Entity Extractor   |  <-- Extracts slots (province, type, etc.)
+--------------------+
          |
          v
+--------------------+
| Template Filler    |  <-- Instantiates template with slot values
+--------------------+
          |
          v
+--------------------+
| SPARQL Validator   |  <-- Validates syntax before execution
+--------------------+
          |
          v
Valid SPARQL Query
```

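The stage order in the diagram can be sketched as a small pipeline object that wires five callables together. This is an illustrative sketch only: the class `TemplatePipeline`, its field names, and the stage signatures are assumptions made for this example, and the real Intent Classifier is planned as a DSPy Signature rather than a plain function.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TemplatePipeline:
    """Hypothetical wiring of the five stages from the architecture diagram."""

    classify: Callable[[str], str]                  # question -> intent label
    route: Callable[[str], str]                     # intent label -> template id
    extract: Callable[[str, str], dict[str, str]]   # (question, template id) -> slots
    fill: Callable[[str, dict[str, str]], str]      # (template id, slots) -> SPARQL
    validate: Callable[[str], list[str]]            # SPARQL -> problems (empty = valid)

    def run(self, question: str) -> str:
        intent = self.classify(question)
        template_id = self.route(intent)
        slots = self.extract(question, template_id)
        query = self.fill(template_id, slots)
        problems = self.validate(query)
        if problems:
            # Never hand an invalid query to the endpoint.
            raise ValueError(f"template {template_id!r} produced invalid SPARQL: {problems}")
        return query
```

Keeping each stage behind a plain callable means a stage can be replaced (for example, a keyword matcher swapped for a DSPy module) without touching the others.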
## Documentation Index

| Document | Description |
|----------|-------------|
| [methodology.md](methodology.md) | Academic methodology and research citations |
| [design-patterns.md](design-patterns.md) | Software patterns (Strategy, Template Method, CoR) |
| [tdd.md](tdd.md) | Test-driven development approach with test cases |
| [dspy-compatibility.md](dspy-compatibility.md) | DSPy integration for template classification |
| [rag-integration.md](rag-integration.md) | SPARQL-first retrieval flow integration |
| [external-dependencies.md](external-dependencies.md) | Required libraries and services |
| [templates-schema.md](templates-schema.md) | YAML/JSON schema for template definitions |
| [competency-questions.md](competency-questions.md) | Ontology coverage testing, fyke filter for irrelevant questions |
| [conversation-context.md](conversation-context.md) | Follow-up prompt handling with DSPy History |

## Quick Start

### 1. Template Definition

```yaml
templates:
  region_institution_search:
    id: "region_institution_search"
    question_patterns:
      - "Welke {institution_type_nl} zijn er in {province}?"
      - "{institution_type_nl} in {province}"
    sparql_template: |
      PREFIX hc: <https://nde.nl/ontology/hc/class/>
      PREFIX hcp: <https://nde.nl/ontology/hc/>
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?institution ?name WHERE {
        ?institution a hc:Custodian ;
                     hcp:institutionType "{{institution_type_code}}" ;
                     skos:prefLabel ?name .
        FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
      }
    slots:
      institution_type_code:
        source: sparql_validation_rules.json#institution_types
      province_code:
        source: sparql_validation_rules.json#subregions
```

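To illustrate how the `{{slot}}` placeholders in `sparql_template` might be instantiated, here is a minimal sketch of a filler. The function name `fill_template` and the character whitelist are assumptions for this example, not the planned TemplateInstantiator; the whitelist is one possible guard against slot values that would break out of the quoted string literal.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
# Assumption: slot codes such as "A" or "NL-DR" only need these characters.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def fill_template(sparql_template: str, slots: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its slot value.

    Raises KeyError for a placeholder without a value and ValueError for a
    value containing characters outside the whitelist.
    """

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in slots:
            raise KeyError(f"missing slot value: {name}")
        value = slots[name]
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for slot {name}: {value!r}")
        return value

    return PLACEHOLDER.sub(substitute, sparql_template)


fragment = 'FILTER(CONTAINS(STR(?institution), "{{province_code}}"))'
print(fill_template(fragment, {"province_code": "NL-DR"}))
# -> FILTER(CONTAINS(STR(?institution), "NL-DR"))
```

Failing loudly on a missing or suspicious slot value keeps a half-filled template from ever reaching the validator or the endpoint.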
### 2. Usage Flow

```python
# classifier, extractor, filler, and execute_sparql stand for the planned
# components from the architecture above; they are not defined in this snippet.

# 1. Classify question intent
intent = classifier.classify("Welke archieven zijn er in Drenthe?")
# -> Intent(template_id="region_institution_search", slots={"province": "Drenthe", "type": "archieven"})

# 2. Extract slot values
slots = extractor.extract(intent)
# -> {"institution_type_code": "A", "province_code": "NL-DR"}

# 3. Instantiate template
query = filler.fill("region_institution_search", slots)
# -> Valid SPARQL query

# 4. Execute against SPARQL endpoint
results = execute_sparql(query)
```

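Step 2 of the flow above turns surface forms such as "Drenthe" into the codes the template expects. A minimal sketch of that lookup follows. The two dictionaries are stand-ins for the `institution_types` and `subregions` sections of `sparql_validation_rules.json`, whose actual structure is not shown in this document; only the two entries taken from the example are included, and the function name `resolve_slots` is hypothetical.

```python
# Stand-in lookup tables (assumption: the real values come from
# sparql_validation_rules.json). Keys are stored lowercase.
INSTITUTION_TYPES = {"archieven": "A"}
SUBREGIONS = {"drenthe": "NL-DR"}


def resolve_slots(raw_slots: dict[str, str]) -> dict[str, str]:
    """Map raw slot text from the classifier to template slot codes."""
    type_key = raw_slots["type"].strip().casefold()
    province_key = raw_slots["province"].strip().casefold()
    if type_key not in INSTITUTION_TYPES:
        raise LookupError(f"unknown institution type: {raw_slots['type']!r}")
    if province_key not in SUBREGIONS:
        raise LookupError(f"unknown province: {raw_slots['province']!r}")
    return {
        "institution_type_code": INSTITUTION_TYPES[type_key],
        "province_code": SUBREGIONS[province_key],
    }


print(resolve_slots({"province": "Drenthe", "type": "archieven"}))
# -> {'institution_type_code': 'A', 'province_code': 'NL-DR'}
```

Because slot values are drawn from a closed set of codes, an unknown value can be rejected here instead of surfacing later as an empty or failing query.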
## Integration Points

| Component | File | Integration |
|-----------|------|-------------|
| Query Router | `dspy_heritage_rag.py` | Add template classification before LLM |
| Slot Filler | `ontology_mapping.py` | Use for entity extraction |
| Validation | `sparql_linter.py` | Validate instantiated templates |
| Rules | `sparql_validation_rules.json` | Source for slot enum values |

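The Query Router row places template classification before the LLM. One way to express that ordering is a small dispatcher that tries the template path first and only falls back to LLM generation when no template matches. This is a sketch under assumptions: `generate_query` and both callables are hypothetical names, and the actual integration in `dspy_heritage_rag.py` may look different.

```python
from typing import Callable, Optional, Tuple


def generate_query(
    question: str,
    template_generate: Callable[[str], Optional[str]],
    llm_generate: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (sparql, source), where source is "template" or "llm".

    template_generate is expected to return None when no template matches
    the question; only then is the LLM path used.
    """
    query = template_generate(question)
    if query is not None:
        return query, "template"
    return llm_generate(question), "llm"
```

Returning the source alongside the query makes it possible to log how often the fallback is used, which is one input for deciding which templates to add next.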
## Key Metrics

| Metric | Current (LLM-only) | Target (Template-based) |
|--------|--------------------|-------------------------|
| Query success rate | ~40% | >95% |
| Syntax errors | ~30% | 0% |
| Response time | ~3s | <1s |
| Consistency | Low | High |

## Next Steps After Planning

1. **Create template definitions** for the top 10 question types
2. **Implement TemplateRouter** (DSPy Signature for classification)
3. **Implement SlotFiller** (entity extraction using `ontology_mapping.py`)
4. **Implement TemplateInstantiator** (fill slots -> valid SPARQL)
5. **Add fallback** to LLM generation for unmatched questions
6. **Write tests** for each template (TDD approach)

## References

- Formica et al. (2023) - Template-based approach for QA over knowledge bases
- Steinmetz et al. (2019) - Pattern-based NL to SPARQL
- Zheng et al. (2018) - Template decomposition for complex questions
- SPARQL-LLM (2025) - Hybrid template + LLM approach