External Dependencies

Overview

This document lists the external dependencies required for the template-based SPARQL query generation system. Dependencies are categorized by purpose and include both required and optional packages.

Required Dependencies

Core Python Packages

These packages are essential for the template system to function:

| Package | Version | Purpose | PyPI |
| --- | --- | --- | --- |
| pydantic | >=2.0 | Structured output validation, slot schemas | pydantic |
| pyyaml | >=6.0 | Template definition loading | PyYAML |
| dspy-ai | >=2.6 | DSPy framework for template classification | dspy-ai |
| httpx | >=0.25 | SPARQL endpoint HTTP client | httpx |
| jinja2 | >=3.0 | Template instantiation engine | Jinja2 |

Already in Project

These core packages are already declared in pyproject.toml and will be available. jinja2 is the only core package not yet listed there:

# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "httpx>=0.25",
]
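As a quick sanity check before wiring up the template system, the core packages can be probed at runtime. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the check only reports presence and version, it does not enforce the minimum versions from the table.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the core package table in this document.
CORE_PACKAGES = ["pydantic", "pyyaml", "dspy-ai", "httpx", "jinja2"]


def installed_versions(packages: list[str]) -> dict[str, str | None]:
    """Map each distribution name to its installed version, or None if absent."""
    found: dict[str, str | None] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(CORE_PACKAGES).items():
        print(f"{name}: {ver or 'MISSING'}")
```

Running the module directly prints one line per package, which makes a missing jinja2 easy to spot.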

Optional Dependencies

For improved slot value resolution when user input doesn't exactly match enum values:

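| Package | Version | Purpose | PyPI |
| --- | --- | --- | --- |
| rapidfuzz | >=3.0 | Fast fuzzy string matching for slot values | rapidfuzz |
| python-Levenshtein | >=0.21 | Standalone Levenshtein distance functions. rapidfuzz ships its own native implementation and is not accelerated by this package, so install it only if code imports `Levenshtein` directly | python-Levenshtein |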

Usage Example:

from rapidfuzz import fuzz, process

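# Match user input to valid province names
PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]

def match_province(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match user input to a valid province name."""
    result = process.extractOne(
        user_input,
        PROVINCES,
        scorer=fuzz.WRatio,
        # rapidfuzz >= 3.0 no longer preprocesses strings by default, so
        # lowercase both sides to keep the comparison case-insensitive.
        processor=str.lower,
        score_cutoff=threshold,
    )
    # extractOne returns (choice, score, index), or None below the cutoff
    return result[0] if result else None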

# Examples
match_province("drente")  # -> "Drenthe"
match_province("N-Holland")  # -> "Noord-Holland"
match_province("zuudholland")  # -> "Zuid-Holland"

Installation:

pip install rapidfuzz python-Levenshtein
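Because rapidfuzz is optional, code that uses it should degrade gracefully when it is not installed. The sketch below shows one way to do that, assuming a fallback to the standard library's `difflib`; the function name `match_province_safe` and the fallback choice are illustrative assumptions, not part of the planned implementation. The two backends score differently, so borderline inputs may resolve differently depending on which one is active.

```python
from __future__ import annotations

import difflib

try:
    from rapidfuzz import fuzz, process
    HAVE_RAPIDFUZZ = True
except ImportError:  # optional dependency not installed
    HAVE_RAPIDFUZZ = False

PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]


def match_province_safe(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match a province name, with or without rapidfuzz (hypothetical helper)."""
    if HAVE_RAPIDFUZZ:
        result = process.extractOne(
            user_input,
            PROVINCES,
            scorer=fuzz.WRatio,
            processor=str.lower,  # case-insensitive comparison
            score_cutoff=threshold,
        )
        return result[0] if result else None

    # Fallback (assumption): difflib works on a 0..1 scale, so rescale the threshold.
    lowered = {name.lower(): name for name in PROVINCES}
    hits = difflib.get_close_matches(
        user_input.lower(), list(lowered), n=1, cutoff=threshold / 100.0
    )
    return lowered[hits[0]] if hits else None
```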

Semantic Similarity (Optional)

For intent classification when questions don't match patterns exactly:

| Package | Version | Purpose | PyPI |
| --- | --- | --- | --- |
| sentence-transformers | >=2.2 | Semantic similarity for template matching | sentence-transformers |

Usage Example:

from sentence_transformers import SentenceTransformer, util

# Load multilingual model for Dutch/English
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Template question patterns
PATTERNS = [
    "Welke archieven zijn er in {province}?",
    "Hoeveel musea zijn er in Nederland?",
    "Wat is het oudste archief?",
]

# Encode the patterns once at import time instead of on every call
PATTERN_EMBEDDINGS = model.encode(PATTERNS)

def find_best_template(question: str, threshold: float = 0.7) -> int | None:
    """Find best matching template by semantic similarity."""
    question_embedding = model.encode(question)

    # cos_sim returns a 1 x len(PATTERNS) tensor; take its only row
    similarities = util.cos_sim(question_embedding, PATTERN_EMBEDDINGS)[0]
    best_idx = similarities.argmax().item()
    best_score = similarities[best_idx].item()

    return best_idx if best_score >= threshold else None

# Example
find_best_template("Welke archieven heeft Drenthe?")  # expected: 0, provided the score clears the threshold

Installation:

pip install sentence-transformers

Note: This adds ~500MB of model weights. Only use if DSPy classification is insufficient.

SPARQL Validation (Optional)

For deeper SPARQL syntax validation beyond regex:

| Package | Version | Purpose | PyPI |
| --- | --- | --- | --- |
| rdflib | >=6.0 | RDF/SPARQL parsing and validation | rdflib |

Usage Example:

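# Import ParseException from pyparsing (an rdflib dependency) directly;
# rdflib's parser module does not reliably re-export it across versions.
from pyparsing import ParseException
from rdflib.plugins.sparql import prepareQuery

def validate_sparql_syntax(query: str) -> tuple[bool, str | None]:
    """Validate SPARQL syntax using rdflib parser."""
    try:
        prepareQuery(query)
        return True, None
    except ParseException as e:
        # Grammar-level syntax error
        return False, str(e)
    except Exception as e:
        # Other failures raised while preparing the query, e.g. an undeclared prefix
        return False, str(e)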

# Example
valid, error = validate_sparql_syntax("""
    PREFIX hc: <https://nde.nl/ontology/hc/>
    SELECT ?s WHERE { ?s a hc:Custodian }
""")
# -> (True, None)

Installation:

pip install rdflib

External Services

Required Services

| Service | Endpoint | Purpose |
| --- | --- | --- |
| Oxigraph SPARQL | http://localhost:7878/query | SPARQL query execution |
| Qdrant Vector DB | http://localhost:6333 | Semantic search fallback |

Service Availability Checks

import httpx

async def check_sparql_endpoint(
    endpoint: str = "http://localhost:7878/query",
    timeout: float = 5.0,
) -> bool:
    """Check if SPARQL endpoint is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                endpoint.replace("/query", "/"),
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False

async def check_qdrant(
    host: str = "localhost",
    port: int = 6333,
    timeout: float = 5.0,
) -> bool:
    """Check if Qdrant is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://{host}:{port}/",
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False
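Both checks are independent, so they can run concurrently at startup. The following is a small sketch of a generic runner built on `asyncio.gather`; the name `run_health_checks` is hypothetical, and it takes the check coroutines as arguments so that it does not depend on any particular HTTP client.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_health_checks(
    checks: dict[str, Callable[[], Awaitable[bool]]],
) -> dict[str, bool]:
    """Run all checks concurrently and map each service name to its result."""
    names = list(checks)
    # gather preserves argument order, so results line up with names.
    results = await asyncio.gather(*(checks[name]() for name in names))
    return dict(zip(names, results))


# Usage with the two checks defined above (assumed to be in scope):
# status = asyncio.run(run_health_checks({
#     "sparql": check_sparql_endpoint,
#     "qdrant": check_qdrant,
# }))
```

The total wait is then bounded by the slowest check rather than by the sum of both timeouts.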

Project Files Required

Existing Files

These files must exist for the template system to function:

| File | Purpose | Status |
| --- | --- | --- |
| data/validation/sparql_validation_rules.json | Slot enum values (provinces, types) | Exists |
| backend/rag/ontology_mapping.py | Entity extraction, fuzzy matching | Exists |
| src/glam_extractor/api/sparql_linter.py | SPARQL validation/correction | Exists |
| backend/rag/dspy_heritage_rag.py | Integration point | Exists |
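The presence of these files can be verified up front rather than discovered through an import error later. This is a minimal sketch with a hypothetical helper, `missing_files`, which assumes the paths in the table are relative to the repository root.

```python
from pathlib import Path

# Paths copied from the table above; assumed to be relative to the repo root.
REQUIRED_FILES = [
    "data/validation/sparql_validation_rules.json",
    "backend/rag/ontology_mapping.py",
    "src/glam_extractor/api/sparql_linter.py",
    "backend/rag/dspy_heritage_rag.py",
]


def missing_files(project_root: Path, required: list[str] = REQUIRED_FILES) -> list[str]:
    """Return the required paths that do not exist under project_root."""
    return [rel for rel in required if not (project_root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files(Path.cwd()):
        print(f"missing: {rel}")
```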

New Files to Create

| File | Purpose | Status |
| --- | --- | --- |
| backend/rag/template_sparql.py | Template loading, classification, instantiation | To create |
| data/sparql_templates.yaml | Template definitions | To create |
| tests/rag/test_template_sparql.py | Unit tests | To create |

pyproject.toml Updates

Add optional dependency groups for the template system:

[project.optional-dependencies]
# Template-based SPARQL generation
sparql-templates = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
]

# Full template system with semantic matching
sparql-templates-full = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
    "sentence-transformers>=2.2",
    "rdflib>=6.0",
]

Installation:

# Minimal template support
pip install -e ".[sparql-templates]"

# Full template support with semantic matching
pip install -e ".[sparql-templates-full]"

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| SPARQL_ENDPOINT | http://localhost:7878/query | SPARQL endpoint URL |
| QDRANT_HOST | localhost | Qdrant host |
| QDRANT_PORT | 6333 | Qdrant port |
| TEMPLATE_CONFIDENCE_THRESHOLD | 0.7 | Minimum confidence for template use |
| ENABLE_FUZZY_MATCHING | true | Enable rapidfuzz for slot matching |
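One way to read these variables with their defaults is a small settings object. The sketch below uses only the standard library; the `TemplateSettings` class, the `load_settings` function, and the set of strings accepted as "true" are assumptions made for illustration.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateSettings:
    sparql_endpoint: str
    qdrant_host: str
    qdrant_port: int
    confidence_threshold: float
    enable_fuzzy_matching: bool


def _as_bool(raw: str) -> bool:
    # Assumption: these spellings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] | None = None) -> TemplateSettings:
    """Build settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return TemplateSettings(
        sparql_endpoint=env.get("SPARQL_ENDPOINT", "http://localhost:7878/query"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        confidence_threshold=float(env.get("TEMPLATE_CONFIDENCE_THRESHOLD", "0.7")),
        enable_fuzzy_matching=_as_bool(env.get("ENABLE_FUZZY_MATCHING", "true")),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in unit tests.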

Version Compatibility Matrix

| Python | DSPy | Pydantic | Status |
| --- | --- | --- | --- |
| 3.11+ | 2.6+ | 2.0+ | Supported |
| 3.10 | 2.6+ | 2.0+ | Supported |
| 3.9 | 2.5+ | 2.0+ | ⚠️ Limited (no match statements) |
| <3.9 | - | - | Not supported |
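The Python column of the matrix can be turned into a startup guard. This is a sketch only; the function name `python_support_status` and the three status strings are hypothetical, and the DSPy and Pydantic columns are not checked here.

```python
from __future__ import annotations

import sys


def python_support_status(version_info: tuple[int, int] = sys.version_info[:2]) -> str:
    """Classify a (major, minor) Python version according to the matrix above."""
    if version_info >= (3, 10):
        return "supported"
    if version_info == (3, 9):
        # 3.9 lacks structural pattern matching (match statements).
        return "limited"
    return "unsupported"


if __name__ == "__main__":
    status = python_support_status()
    if status == "unsupported":
        raise SystemExit("Python 3.9 or newer is required")
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```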

Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

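# Python dependencies
# Version specifiers are quoted: in shell-form RUN, an unquoted ">" is a
# redirect, so pydantic>=2.0 would install an unpinned pydantic and write
# pip's output to a file named "=2.0".
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "httpx>=0.25" \
    "jinja2>=3.0" \
    "rapidfuzz>=3.0"

# Optional: sentence-transformers (adds ~500MB)
# RUN pip install "sentence-transformers>=2.2"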

Dependency Security

All recommended packages are actively maintained and have no known critical CVEs as of 2025-06.

| Package | Last Updated | Security Status |
| --- | --- | --- |
| pydantic | 2025-05 | No known CVEs |
| rapidfuzz | 2025-06 | No known CVEs |
| dspy-ai | 2025-06 | No known CVEs |
| jinja2 | 2025-04 | No known CVEs |

Run security audit:

pip-audit --requirement requirements.txt

Summary

Minimum viable installation:

pip install pydantic pyyaml dspy-ai httpx jinja2

Recommended installation:

pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein

Full installation (with semantic matching):

pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein sentence-transformers rdflib