Template-Based KBQA Methodology
Academic Foundation
Template-based Knowledge Base Question Answering (KBQA) is a well-established methodology with significant academic backing. This document summarizes the key research papers and their contributions to our implementation.
Key Research Papers
1. Formica et al. (2023) - Template-Based Approach for QA over Knowledge Bases
Citation:
Formica, A., Mazzei, M., Pourabbas, E., & Rafanelli, M. (2023). A template-based approach for question answering over knowledge bases. Knowledge and Information Systems. https://link.springer.com/article/10.1007/s10115-023-01966-8
Key Findings:
- 65.44% precision for the template-based approach versus 10.52% for LLM-only generation
- 21 question classes identified for template categorization
- SVM and GBDT classifiers for question classification
- Template instantiation with entity linking
Architecture:
Question -> Feature Extraction -> Question Classification ->
Template Selection -> Entity Linking -> SPARQL Generation
Relevance to Our System:
- Validates the template-based approach for structured KBs
- Provides classification methodology we can adapt for Dutch heritage questions
- Demonstrates dramatic improvement over direct LLM generation
2. Steinmetz et al. (2019) - From Natural Language Questions to SPARQL Queries
Citation:
Steinmetz, N., Arning, A. K., & Sattler, K. U. (2019). From Natural Language Questions to SPARQL Queries: A Pattern-based Approach. BTW 2019. https://dl.gi.de/items/83f0bcd6-5ab5-4111-9181-4ad3abae0e89
Key Findings:
- KB-independent template patterns
- Question pattern matching with regular expressions
- Slot-based template instantiation
- Works across different knowledge bases
Pattern Structure:
Question Pattern: "What is the {property} of {entity}?"
SPARQL Template: SELECT ?value WHERE { ?entity wdt:{property_id} ?value }
Slot Mappings: {entity} -> entity_id, {property} -> property_id
Relevance to Our System:
- KB-independent approach applicable to Heritage Custodian ontology
- Pattern matching methodology for Dutch question forms
- Slot extraction techniques
3. Zheng et al. (2018) - Question Understanding Via Template Decomposition
Citation:
Zheng, W., Yu, J. X., Zou, L., & Cheng, H. (2018). Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition. PVLDB, 11(11). https://www.vldb.org/pvldb/vol11/p1373-zheng.pdf
Key Findings:
- Binary template decomposition for complex questions
- Handles multi-clause questions (joins, filters, aggregations)
- Template composition for complex queries
Decomposition Example:
Complex Question: "Which archives in Drenthe have more than 1000 items?"
Decomposition:
Template 1: type_filter(archive)
Template 2: location_filter(Drenthe)
Template 3: count_filter(items > 1000)
Composition: JOIN(Template 1, Template 2, Template 3)
Relevance to Our System:
- Handle complex heritage questions with multiple constraints
- Compose simple templates into complex queries
- Aggregation support (counting, grouping)
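The decomposition example can be made concrete with a small composition helper. The `ghc:` prefix, property names, and function signatures below are our own illustrative choices, not Zheng et al.'s notation:

```python
# Illustrative template-composition helpers; the ghc: properties are
# hypothetical stand-ins for the Heritage Custodian ontology.
def type_filter(var: str, type_code: str) -> str:
    return f'{var} ghc:institutionType "{type_code}" .'

def location_filter(var: str, region: str) -> str:
    return f'{var} ghc:subregion "{region}" .'

def compose(select_var: str, triples: list[str], count_over=None) -> str:
    """JOIN simple templates into one query; optionally add a COUNT filter."""
    body = "\n  ".join(triples)
    query = f"SELECT {select_var} WHERE {{\n  {body}\n}}"
    if count_over is not None:
        item_var, minimum = count_over
        query += f"\nGROUP BY {select_var}\nHAVING (COUNT({item_var}) > {minimum})"
    return query

# The complex question from the example above, composed from three templates:
query = compose(
    "?archive",
    [type_filter("?archive", "A"),
     location_filter("?archive", "NL-DR"),
     "?archive ghc:hasItem ?item ."],
    count_over=("?item", 1000),
)
```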
4. GRASP (2025) - Generic Reasoning And SPARQL Generation
Citation:
GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs (2025). https://arxiv.org/abs/2507.08107
Key Findings:
- Zero-shot, LLM-based SPARQL generation with no task-specific training data
- Works across heterogeneous KGs
Relevance to Our System:
- Fallback mechanism for unmatched questions
- LLM-assisted slot filling for ambiguous entities
5. SPARQL-LLM (2025) - Real-Time SPARQL Query Generation
Citation:
SPARQL-LLM: Real-Time SPARQL Query Generation from Natural Language (2025). https://arxiv.org/html/2512.14277v1
Key Findings:
- Hybrid approach combining templates with LLM
- Real-time query refinement
- Error correction with LLM feedback
Hybrid Architecture:
Question -> Template Matcher -> [Match Found?]
  Yes -> Template Instantiation
  No  -> LLM Generation with Schema Context
Relevance to Our System:
- Hybrid fallback approach when templates don't match
- Schema-aware LLM generation for edge cases
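The hybrid branch above is essentially a first-match dispatcher. A dependency-free sketch, where `templates` and `llm_generate` are stand-ins rather than any specific library's API:

```python
import re

def generate_sparql(question, templates, llm_generate):
    """Try each (pattern, builder) template in order; fall back to the LLM."""
    for pattern, builder in templates:
        match = pattern.match(question)
        if match:
            return builder(**match.groupdict())  # template instantiation
    return llm_generate(question)                # schema-aware LLM fallback

# Hypothetical usage with one Dutch location-count template (ghc: terms
# are illustrative placeholders):
templates = [
    (re.compile(r"Hoeveel (?P<t>\w+) zijn er in (?P<r>[\w-]+)\?"),
     lambda t, r: f'SELECT (COUNT(?x) AS ?n) WHERE {{ ?x ghc:type "{t}" ; ghc:subregion "{r}" }}'),
]
```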
Methodology Selection
Based on the research, we adopt a hybrid template-first approach:
Primary Path: Template-Based (95% of queries)
1. Question Classification (Formica et al.)
- DSPy-based intent classifier
- Map to predefined template categories
- Feature extraction for slot identification
2. Pattern Matching (Steinmetz et al.)
- Regular expression patterns for Dutch questions
- Fuzzy matching for variations
- Multi-pattern support per template
3. Template Composition (Zheng et al.)
- Simple templates for basic queries
- Composition rules for complex queries
- Aggregation templates (COUNT, GROUP BY)
Fallback Path: LLM-Assisted (5% of queries)
1. Schema-Aware Generation (GRASP)
- Full ontology context provided to LLM
- Valid prefix declarations
- Property constraint hints
2. Error Correction Loop (SPARQL-LLM)
- Auto-correction with sparql_linter.py
- LLM retry with error feedback
- Maximum 2 retry attempts
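The retry loop can be sketched as follows; `lint` and `llm_fix` are hypothetical callables standing in for sparql_linter.py and the LLM correction call:

```python
def correct_query(query, lint, llm_fix, max_retries=2):
    """Lint the query; on errors, ask the LLM to repair it, at most max_retries times."""
    for _ in range(max_retries):
        errors = lint(query)
        if not errors:
            return query                # query is already clean
        query = llm_fix(query, errors)  # retry with error feedback
    return query if not lint(query) else None  # give up after the retry budget
```

Returning `None` on exhaustion lets the caller surface a "could not generate a valid query" response instead of executing a broken one.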
Question Classification Taxonomy
Based on Formica et al., we define the following question classes for the heritage domain:
Class 1: Location-Based Queries
- "Welke {type} zijn er in {province}?"
- "Archives in {city}"
- "Museums near {location}"
Class 2: Type-Based Queries
- "Hoeveel {type} zijn er in Nederland?"
- "List all {type}"
- "What types of archives exist?"
Class 3: Entity Lookup Queries
- "Wat is {institution_name}?"
- "Information about {institution}"
- "Details of {ghcid}"
Class 4: Relationship Queries
- "Who works at {institution}?"
- "What is the parent organization of {institution}?"
- "Which institutions have {collection_type}?"
Class 5: Aggregation Queries
- "How many archives are in each province?"
- "Count museums by type"
- "Total institutions in {region}"
Class 6: Comparison Queries
- "Compare archives in {province1} and {province2}"
- "Difference between {type1} and {type2}"
Class 7: Temporal Queries
- "Institutions founded after {year}"
- "Archives closed in {decade}"
- "Recent additions to {collection}"
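A rule-based first pass over these classes might look like the sketch below. It covers only a subset of the seven classes, the keyword rules are hand-written stand-ins for the DSPy classifier, and rule order matters because the first match wins:

```python
import re

# Illustrative keyword rules for a subset of the question classes above;
# ordered so that more specific signals (temporal, aggregation) are
# checked before generic ones (location).
RULES = [
    ("temporal", re.compile(r"\b(founded|closed|after|before|\d{4})\b", re.I)),
    ("aggregation", re.compile(r"\b(hoeveel|how many|count|total)\b", re.I)),
    ("comparison", re.compile(r"\b(compare|difference|versus)\b", re.I)),
    ("location", re.compile(r"\b(in|near)\b", re.I)),
    ("entity_lookup", re.compile(r"\b(wat is|information about|details of)\b", re.I)),
]

def classify(question: str) -> str:
    for label, pattern in RULES:
        if pattern.search(question):
            return label
    return "unknown"  # unmatched: route to the LLM fallback path
```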
Feature Extraction
Following Formica et al., we extract features for classification:
Linguistic Features
- Question word (Welke, Hoeveel, Wat, Waar)
- Named entities (institution names, locations)
- Temporal expressions (years, dates)
- Numeric patterns (counts, comparisons)
Semantic Features
- Institution type mentions (archief, museum, bibliotheek)
- Geographic mentions (provinces, cities)
- Relationship terms (werkt bij, onderdeel van)
Structural Features
- Question length
- Clause count
- Presence of conjunctions (en, of)
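A toy extractor for the three feature groups above; the vocabularies here are small illustrative subsets, not the full lexicon:

```python
import re

# Small illustrative vocabularies; the real lexicon is larger.
QUESTION_WORDS = {"welke", "hoeveel", "wat", "waar"}
TYPE_TERMS = {"archief", "archieven", "museum", "musea", "bibliotheek", "bibliotheken"}

def extract_features(question: str) -> dict:
    """Extract linguistic, semantic, and structural features for classification."""
    tokens = re.findall(r"\w+", question.lower())
    return {
        "question_word": next((t for t in tokens if t in QUESTION_WORDS), None),
        "type_mentions": [t for t in tokens if t in TYPE_TERMS],
        "has_year": any(re.fullmatch(r"\d{4}", t) for t in tokens),
        "length": len(tokens),
        "has_conjunction": any(t in {"en", "of"} for t in tokens),
    }
```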
Slot Value Resolution
From LinkML Schema
All slot values are resolved from sparql_validation_rules.json:
{
"institution_types": {
"archief": "A", "archieven": "A",
"museum": "M", "musea": "M",
"bibliotheek": "L", "bibliotheken": "L"
},
"subregions": {
"drenthe": "NL-DR",
"utrecht": "NL-UT",
"noord-holland": "NL-NH"
}
}
Entity Linking
For institution names, we use:
- Exact match against GHCID registry
- Fuzzy match with rapidfuzz
- LLM-assisted disambiguation for ambiguous cases
Evaluation Metrics
Following academic methodology, we measure:
| Metric | Definition | Target |
|---|---|---|
| Precision | Correct queries / Total queries | >90% |
| Recall | Answered questions / Total questions | >85% |
| Template Coverage | Questions matching templates / Total | >95% |
| Slot Accuracy | Correct slot values / Total slots | >98% |
| Latency | Query generation time | <100ms |
References
- Formica, A., et al. (2023). A template-based approach for question answering over knowledge bases. KAIS.
- Steinmetz, N., et al. (2019). From Natural Language Questions to SPARQL Queries. BTW.
- Zheng, W., et al. (2018). Question Answering Over Knowledge Graphs. PVLDB.
- GRASP (2025). Generic Reasoning And SPARQL Generation. arXiv.
- SPARQL-LLM (2025). Real-Time SPARQL Query Generation. arXiv.