# Template-Based SPARQL Query Generation

## Overview

This plan documents the implementation of a **template-based SPARQL query generation system** for the de Aa ArchiefAssistent (archief.support). The system replaces unreliable LLM-generated SPARQL with validated, pre-defined query templates that are selected and instantiated based on user intent.

## Problem Statement

The current RAG system generates SPARQL queries via LLM, which produces:

1. **Syntax errors** - Orphaned punctuation (`.` on empty lines) after auto-correction
2. **Invalid predicates** - `crm:P53_has_former_or_current_location` (CIDOC-CRM not in ontology)
3. **Undefined prefixes** - `wd:` used without declaration
4. **Inconsistent results** - Same question generates different (sometimes broken) SPARQL

### Example Failure

Question: "Welke archieven zijn er in Drenthe?" ("Which archives are there in Drenthe?")

```sparql
# LLM-generated query with orphaned "." causing 400 error:
?archive skos:prefLabel ?name .
.   <-- SYNTAX ERROR
FILTER(CONTAINS(STR(?archive), "NL-DR")) .
```

## Solution: Template-Based KBQA

Research confirms **template-based Knowledge Base Question Answering (KBQA)** achieves **65.44% precision** vs 10.52% for LLM-only approaches (Formica et al., 2023).

### Architecture

```
User Question
      |
      v
+--------------------+
| Intent Classifier  |  <-- DSPy Signature for question classification
+--------------------+
      |
      v
+--------------------+
| Template Router    |  <-- Selects appropriate SPARQL template
+--------------------+
      |
      v
+--------------------+
| Entity Extractor   |  <-- Extracts slots (province, type, etc.)
+--------------------+
      |
      v
+--------------------+
| Template Filler    |  <-- Instantiates template with slot values
+--------------------+
      |
      v
+--------------------+
| SPARQL Validator   |  <-- Validates syntax before execution
+--------------------+
      |
      v
Valid SPARQL Query
```

## Documentation Index

| Document | Description |
|----------|-------------|
| [methodology.md](methodology.md) | Academic methodology and research citations |
| [design-patterns.md](design-patterns.md) | Software patterns (Strategy, Template Method, CoR) |
| [tdd.md](tdd.md) | Test-driven development approach with test cases |
| [dspy-compatibility.md](dspy-compatibility.md) | DSPy integration for template classification |
| [rag-integration.md](rag-integration.md) | SPARQL-first retrieval flow integration |
| [external-dependencies.md](external-dependencies.md) | Required libraries and services |
| [templates-schema.md](templates-schema.md) | YAML/JSON schema for template definitions |
| [competency-questions.md](competency-questions.md) | Ontology coverage testing, fyke filter for irrelevant questions |
| [conversation-context.md](conversation-context.md) | Follow-up prompt handling with DSPy History |

## Quick Start

### 1. Template Definition

```yaml
templates:
  region_institution_search:
    id: "region_institution_search"
    question_patterns:
      - "Welke {institution_type_nl} zijn er in {province}?"
      - "{institution_type_nl} in {province}"
    sparql_template: |
      PREFIX hc:
      PREFIX hcp:
      PREFIX skos:
      SELECT ?institution ?name WHERE {
        ?institution a hc:Custodian ;
                     hcp:institutionType "{{institution_type_code}}" ;
                     skos:prefLabel ?name .
        FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
      }
    slots:
      institution_type_code:
        source: sparql_validation_rules.json#institution_types
      province_code:
        source: sparql_validation_rules.json#subregions
```

### 2. Usage Flow

```python
# 1. Classify question intent
intent = classifier.classify("Welke archieven zijn er in Drenthe?")
# -> Intent(template_id="region_institution_search", slots={"province": "Drenthe", "type": "archieven"})

# 2. Extract slot values
slots = extractor.extract(intent)
# -> {"institution_type_code": "A", "province_code": "NL-DR"}

# 3. Instantiate template
query = filler.fill("region_institution_search", slots)
# -> Valid SPARQL query

# 4. Execute against SPARQL endpoint
results = execute_sparql(query)
```

## Integration Points

| Component | File | Integration |
|-----------|------|-------------|
| Query Router | `dspy_heritage_rag.py` | Add template classification before LLM |
| Slot Filler | `ontology_mapping.py` | Use for entity extraction |
| Validation | `sparql_linter.py` | Validate instantiated templates |
| Rules | `sparql_validation_rules.json` | Source for slot enum values |

## Key Metrics

| Metric | Current (LLM-only) | Target (Template-based) |
|--------|-------------------|-------------------------|
| Query success rate | ~40% | >95% |
| Syntax errors | ~30% | 0% |
| Response time | ~3s | <1s |
| Consistency | Low | High |

## Next Steps After Planning

1. **Create template definitions** for top 10 question types
2. **Implement TemplateRouter** (DSPy Signature for classification)
3. **Implement SlotFiller** (entity extraction using `ontology_mapping.py`)
4. **Implement TemplateInstantiator** (fill slots -> valid SPARQL)
5. **Add fallback** to LLM generation for unmatched questions
6. **Write tests** for each template (TDD approach)

## References

- Formica et al. (2023) - Template-based approach for QA over knowledge bases
- Steinmetz et al. (2019) - Pattern-based NL to SPARQL
- Zheng et al. (2018) - Template decomposition for complex questions
- SPARQL-LLM (2025) - Hybrid template + LLM approach
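
## Appendix: Template Filling Sketch

The template-filling step from the Quick Start can be illustrated with a minimal sketch. It is not the project's implementation: `fill_template`, `ALLOWED_VALUES`, and the inline `TEMPLATE` string are hypothetical names introduced here, and the allowed values are hard-coded stand-ins for whatever `sparql_validation_rules.json` actually contains. The `PREFIX` declarations are left out because the source does not show their IRIs. The idea is that a slot value is only substituted if it appears in a closed vocabulary, so free text from a user question never reaches the query string.

```python
import re

# Hypothetical stand-in for the enum values that the plan sources from
# sparql_validation_rules.json; the real file layout is not shown in this plan.
ALLOWED_VALUES = {
    "institution_type_code": {"A"},
    "province_code": {"NL-DR"},
}

# Query body from the Quick Start template (PREFIX lines omitted, see above).
TEMPLATE = """SELECT ?institution ?name WHERE {
  ?institution a hc:Custodian ;
               hcp:institutionType "{{institution_type_code}}" ;
               skos:prefLabel ?name .
  FILTER(CONTAINS(STR(?institution), "{{province_code}}"))
}"""

# Matches {{slot_name}} placeholders.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, slots: dict) -> str:
    """Substitute every {{slot}} placeholder with a validated slot value."""

    def substitute(match):
        name = match.group(1)
        if name not in slots:
            raise KeyError(f"missing slot: {name}")
        value = slots[name]
        # Only values from the closed vocabulary may enter the query.
        if value not in ALLOWED_VALUES.get(name, set()):
            raise ValueError(f"invalid value for {name}: {value!r}")
        return value

    return PLACEHOLDER.sub(substitute, template)


query = fill_template(
    TEMPLATE,
    {"institution_type_code": "A", "province_code": "NL-DR"},
)
```

Rejecting unknown values instead of escaping them keeps the filler simple. This fits the plan's assumption that slots are enum-like codes, but it would need rethinking for free-text slots such as institution names.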