
Template-Based KBQA Methodology

Academic Foundation

Template-based Knowledge Base Question Answering (KBQA) is a well-established methodology with significant academic backing. This document summarizes the key research papers and their contributions to our implementation.

Key Research Papers

1. Formica et al. (2023) - Template-Based Approach for QA over Knowledge Bases

Citation:

Formica, A., Mazzei, M., Pourabbas, E., & Rafanelli, M. (2023). A template-based approach for question answering over knowledge bases. Knowledge and Information Systems. https://link.springer.com/article/10.1007/s10115-023-01966-8

Key Findings:

  • 65.44% precision for the template-based approach versus 10.52% for an LLM-only baseline
  • 21 question classes identified for template categorization
  • SVM and GBDT classifiers for question classification
  • Template instantiation with entity linking

Architecture:

Question -> Feature Extraction -> Question Classification -> 
Template Selection -> Entity Linking -> SPARQL Generation

Relevance to Our System:

  • Validates the template-based approach for structured KBs
  • Provides classification methodology we can adapt for Dutch heritage questions
  • Demonstrates dramatic improvement over direct LLM generation

2. Steinmetz et al. (2019) - From Natural Language Questions to SPARQL Queries

Citation:

Steinmetz, N., Arning, A. K., & Sattler, K. U. (2019). From Natural Language Questions to SPARQL Queries: A Pattern-based Approach. BTW 2019. https://dl.gi.de/items/83f0bcd6-5ab5-4111-9181-4ad3abae0e89

Key Findings:

  • KB-independent template patterns
  • Question pattern matching with regular expressions
  • Slot-based template instantiation
  • Works across different knowledge bases

Pattern Structure:

Question Pattern: "What is the {property} of {entity}?"
SPARQL Template:  SELECT ?value WHERE { ?entity wdt:{property_id} ?value }
Slot Mappings:    {entity} -> entity_id, {property} -> property_id

Relevance to Our System:

  • KB-independent approach applicable to Heritage Custodian ontology
  • Pattern matching methodology for Dutch question forms
  • Slot extraction techniques

3. Zheng et al. (2018) - Question Understanding Via Template Decomposition

Citation:

Zheng, W., Yu, J. X., Zou, L., & Cheng, H. (2018). Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition. PVLDB, 11(11). https://www.vldb.org/pvldb/vol11/p1373-zheng.pdf

Key Findings:

  • Binary template decomposition for complex questions
  • Handles multi-clause questions (joins, filters, aggregations)
  • Template composition for complex queries

Decomposition Example:

Complex Question: "Which archives in Drenthe have more than 1000 items?"

Decomposition:
  Template 1: type_filter(archive)
  Template 2: location_filter(Drenthe)
  Template 3: count_filter(items > 1000)
  
Composition: JOIN(Template 1, Template 2, Template 3)

Relevance to Our System:

  • Handle complex heritage questions with multiple constraints
  • Compose simple templates into complex queries
  • Aggregation support (counting, grouping)
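Decomposition and composition can be sketched as small pattern builders joined on a shared variable. The `ex:` properties and template bodies below are assumptions, not the actual ontology:

```python
# Each simple template contributes a SPARQL graph pattern; composition joins
# the patterns on the shared ?inst variable.

def type_filter(type_uri: str) -> str:
    return f"?inst a {type_uri} ."

def location_filter(region: str) -> str:
    return f'?inst ex:subregion "{region}" .'

def count_filter(var: str, minimum: int) -> str:
    # Aggregation as a subquery, then a FILTER on the aggregated count.
    return (f"{{ SELECT ?inst (COUNT({var}) AS ?n) "
            f"WHERE {{ ?inst ex:hasItem {var} }} GROUP BY ?inst }} "
            f"FILTER(?n > {minimum})")

def compose(*patterns: str) -> str:
    body = "\n  ".join(patterns)
    return f"SELECT ?inst WHERE {{\n  {body}\n}}"

# "Which archives in Drenthe have more than 1000 items?"
query = compose(type_filter("ex:Archive"),
                location_filter("NL-DR"),
                count_filter("?item", 1000))
```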

4. GRASP (2025) - Generic Reasoning And SPARQL Generation

Citation:

GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs (2025). https://arxiv.org/abs/2507.08107

Key Findings:

  • LLM-based SPARQL generation without training data
  • Zero-shot SPARQL generation
  • Works across heterogeneous KGs

Relevance to Our System:

  • Fallback mechanism for unmatched questions
  • LLM-assisted slot filling for ambiguous entities

5. SPARQL-LLM (2025) - Real-Time SPARQL Query Generation

Citation:

SPARQL-LLM: Real-Time SPARQL Query Generation from Natural Language (2025). https://arxiv.org/html/2512.14277v1

Key Findings:

  • Hybrid approach combining templates with LLM
  • Real-time query refinement
  • Error correction with LLM feedback

Hybrid Architecture:

Question -> Template Matcher -> [Match Found?]
                                    |
                    Yes: Template Instantiation
                    No:  LLM Generation with Schema Context

Relevance to Our System:

  • Hybrid fallback approach when templates don't match
  • Schema-aware LLM generation for edge cases

Methodology Selection

Based on the research, we adopt a hybrid template-first approach:

Primary Path: Template-Based (95% of queries)

  1. Question Classification (Formica et al.)

    • DSPy-based intent classifier
    • Map to predefined template categories
    • Feature extraction for slot identification
  2. Pattern Matching (Steinmetz et al.)

    • Regular expression patterns for Dutch questions
    • Fuzzy matching for variations
    • Multi-pattern support per template
  3. Template Composition (Zheng et al.)

    • Simple templates for basic queries
    • Composition rules for complex queries
    • Aggregation templates (COUNT, GROUP BY)

Fallback Path: LLM-Assisted (5% of queries)

  1. Schema-Aware Generation (GRASP)

    • Full ontology context provided to LLM
    • Valid prefix declarations
    • Property constraint hints
  2. Error Correction Loop (SPARQL-LLM)

    • Auto-correction with sparql_linter.py
    • LLM retry with error feedback
    • Maximum 2 retry attempts

Question Classification Taxonomy

Based on Formica et al., we define question classes for heritage domain:

Class 1: Location-Based Queries

  • "Welke {type} zijn er in {province}?"
  • "Archives in {city}"
  • "Museums near {location}"

Class 2: Type-Based Queries

  • "Hoeveel {type} zijn er in Nederland?"
  • "List all {type}"
  • "What types of archives exist?"

Class 3: Entity Lookup Queries

  • "Wat is {institution_name}?"
  • "Information about {institution}"
  • "Details of {ghcid}"

Class 4: Relationship Queries

  • "Who works at {institution}?"
  • "What is the parent organization of {institution}?"
  • "Which institutions have {collection_type}?"

Class 5: Aggregation Queries

  • "How many archives are in each province?"
  • "Count museums by type"
  • "Total institutions in {region}"

Class 6: Comparison Queries

  • "Compare archives in {province1} and {province2}"
  • "Difference between {type1} and {type2}"

Class 7: Temporal Queries

  • "Institutions founded after {year}"
  • "Archives closed in {decade}"
  • "Recent additions to {collection}"

Feature Extraction

Following Formica et al., we extract features for classification:

Linguistic Features

  • Question word (Welke, Hoeveel, Wat, Waar)
  • Named entities (institution names, locations)
  • Temporal expressions (years, dates)
  • Numeric patterns (counts, comparisons)

Semantic Features

  • Institution type mentions (archief, museum, bibliotheek)
  • Geographic mentions (provinces, cities)
  • Relationship terms (werkt bij, onderdeel van)

Structural Features

  • Question length
  • Clause count
  • Presence of conjunctions (en, of)
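A feature extractor covering the three feature groups above can be sketched as one function; the vocabularies below are small illustrative samples, not the full lists:

```python
import re

QUESTION_WORDS = {"welke", "hoeveel", "wat", "waar"}
TYPE_TERMS = {"archief", "archieven", "museum", "musea",
              "bibliotheek", "bibliotheken"}
CONJUNCTIONS = {"en", "of"}

def extract_features(question: str) -> dict:
    tokens = re.findall(r"[a-zà-ÿ-]+|\d+", question.lower())
    return {
        # Linguistic features
        "question_word": tokens[0] if tokens and tokens[0] in QUESTION_WORDS else None,
        "years": [t for t in tokens if t.isdigit() and len(t) == 4],
        # Semantic features
        "type_mentions": [t for t in tokens if t in TYPE_TERMS],
        # Structural features
        "length": len(tokens),
        "has_conjunction": any(t in CONJUNCTIONS for t in tokens),
    }

feats = extract_features("Hoeveel musea zijn er in Nederland en Drenthe?")
```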

Slot Value Resolution

From LinkML Schema

All slot values are resolved from sparql_validation_rules.json:

```json
{
  "institution_types": {
    "archief": "A", "archieven": "A",
    "museum": "M", "musea": "M",
    "bibliotheek": "L", "bibliotheken": "L"
  },
  "subregions": {
    "drenthe": "NL-DR",
    "utrecht": "NL-UT",
    "noord-holland": "NL-NH"
  }
}
```
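Resolution is then a normalise-and-look-up step. In this sketch the rules are inlined so the example is self-contained; the real system would load them from sparql_validation_rules.json:

```python
import json

# Mapping tables in the shape of sparql_validation_rules.json (abridged).
RULES = json.loads("""{
  "institution_types": {"archief": "A", "archieven": "A",
                        "museum": "M", "musea": "M"},
  "subregions": {"drenthe": "NL-DR", "utrecht": "NL-UT"}
}""")

def resolve_slot(table: str, surface: str) -> str:
    # Normalise the surface form, then look it up in the given table.
    value = RULES[table].get(surface.strip().lower())
    if value is None:
        raise KeyError(f"unresolvable {table} slot: {surface!r}")
    return value

code = resolve_slot("subregions", "Drenthe")
```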

Entity Linking

For institution names, we use:

  1. Exact match against GHCID registry
  2. Fuzzy match with rapidfuzz
  3. LLM-assisted disambiguation for ambiguous cases

Evaluation Metrics

Following academic methodology, we measure:

| Metric | Definition | Target |
| --- | --- | --- |
| Precision | Correct queries / Total queries | >90% |
| Recall | Answered questions / Total questions | >85% |
| Template Coverage | Questions matching templates / Total | >95% |
| Slot Accuracy | Correct slot values / Total slots | >98% |
| Latency | Query generation time | <100ms |
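The first three ratios can be computed directly from an evaluation run. The per-question log structure below (a list of dicts with these keys) is an assumption for this sketch:

```python
# Compute precision, recall, and template coverage from evaluation results.
def evaluate(results: list[dict]) -> dict:
    total = len(results)
    generated = [r for r in results if r["query"] is not None]   # answered
    matched = [r for r in results if r["template_matched"]]
    correct = [r for r in generated if r["correct"]]
    return {
        "precision": len(correct) / len(generated) if generated else 0.0,
        "recall": len(generated) / total if total else 0.0,
        "template_coverage": len(matched) / total if total else 0.0,
    }

metrics = evaluate([
    {"query": "q1", "template_matched": True, "correct": True},
    {"query": "q2", "template_matched": True, "correct": False},
    {"query": None, "template_matched": False, "correct": False},
])
```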

References

  1. Formica, A., et al. (2023). A template-based approach for question answering over knowledge bases. KAIS.
  2. Steinmetz, N., et al. (2019). From Natural Language Questions to SPARQL Queries. BTW.
  3. Zheng, W., et al. (2018). Question Answering Over Knowledge Graphs. PVLDB.
  4. GRASP (2025). Generic Reasoning And SPARQL Generation. arXiv.
  5. SPARQL-LLM (2025). Real-Time SPARQL Query Generation. arXiv.