kempersc 84904e344b Make AGENTS more succint by referring to opencode rules & enrich custodians

2025-12-28 14:56:35 +01:00


External Dependencies

Overview

This document lists the external dependencies required for the template-based SPARQL query generation system. Dependencies are categorized by purpose and include both required and optional packages.

Required Dependencies

Core Python Packages

These packages are essential for the template system to function:

Package	Version	Purpose	PyPI
`pydantic`	>=2.0	Structured output validation, slot schemas	pydantic
`pyyaml`	>=6.0	Template definition loading	PyYAML
`dspy-ai`	>=2.6	DSPy framework for template classification	dspy-ai
`httpx`	>=0.25	SPARQL endpoint HTTP client	httpx
`jinja2`	>=3.0	Template instantiation engine	Jinja2
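As a rough illustration of how three of these packages might fit together, the sketch below loads a template definition with `pyyaml`, validates a slot with `pydantic`, and instantiates the query with `jinja2`. The template id, the YAML layout, the `ProvinceSlots` model, and the `hc:province` property are all hypothetical; the real schema of `data/sparql_templates.yaml` is not defined in this document.

```python
from typing import Literal

import yaml
from jinja2 import Environment, StrictUndefined
from pydantic import BaseModel, ValidationError

# Hypothetical template definition; the real file layout may differ.
TEMPLATES_YAML = """
archives_in_province:
  sparql: |
    PREFIX hc: <https://nde.nl/ontology/hc/>
    SELECT ?s WHERE {
      ?s a hc:Custodian ;
         hc:province "{{ province }}" .
    }
"""


class ProvinceSlots(BaseModel):
    """Hypothetical slot schema: only listed province names validate."""

    province: Literal[
        "Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"
    ]


def render_query(template_id: str, raw_slots: dict) -> str:
    """Validate the slots, then fill them into the named template."""
    templates = yaml.safe_load(TEMPLATES_YAML)       # pyyaml: load definitions
    slots = ProvinceSlots.model_validate(raw_slots)  # pydantic: reject bad values
    env = Environment(undefined=StrictUndefined)     # jinja2: error on missing slots
    template = env.from_string(templates[template_id]["sparql"])
    return template.render(**slots.model_dump())
```

Calling `render_query("archives_in_province", {"province": "Drenthe"})` returns the query with the province filled in, while a value outside the enum raises `ValidationError` before any SPARQL text is produced. Validating against a closed set of values before rendering is one way to keep free-form user input out of the query string.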

Already in Project

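These packages are already declared in pyproject.toml, so nothing needs to be added for them. `jinja2` does not appear in this excerpt; it is added through the optional extras described under pyproject.toml Updates.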

# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "httpx>=0.25",
]

Optional Dependencies

Fuzzy Matching (Recommended)

For improved slot value resolution when user input doesn't exactly match enum values:

Package	Version	Purpose	PyPI
`rapidfuzz`	>=3.0	Fast fuzzy string matching for slot values	rapidfuzz
`python-Levenshtein`	>=0.21	Levenshtein distance helpers; rapidfuzz ships its own C++ implementation and does not need this package for speed	python-Levenshtein

Usage Example:

from rapidfuzz import fuzz, process

# Match user input to valid province codes
PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]

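def match_province(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match user input to valid province."""
    result = process.extractOne(
        user_input,
        PROVINCES,
        scorer=fuzz.WRatio,
        # rapidfuzz >=3.0 no longer preprocesses strings by default,
        # so lowercase both sides to keep the match case-insensitive.
        processor=str.lower,
        score_cutoff=threshold,
    )
    # extractOne returns (choice, score, index), or None below the cutoff
    return result[0] if result else None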

# Examples
match_province("drente")  # -> "Drenthe"
match_province("N-Holland")  # -> "Noord-Holland"
match_province("zuudholland")  # -> "Zuid-Holland"

Installation:

pip install rapidfuzz python-Levenshtein
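Because `rapidfuzz` is optional, slot matching can degrade gracefully when it is not installed. The sketch below is one possible approach, not the project's implementation: it falls back to the standard library's `difflib`, whose similarity ratio is on a 0 to 1 scale and is not identical to `fuzz.WRatio`, so scores near the threshold may differ between the two paths.

```python
import difflib
from typing import Optional

try:
    from rapidfuzz import fuzz, process
    HAVE_RAPIDFUZZ = True
except ImportError:  # optional dependency not installed
    HAVE_RAPIDFUZZ = False

PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]


def match_province(user_input: str, threshold: float = 70.0) -> Optional[str]:
    """Fuzzy match a province name, with or without rapidfuzz."""
    if HAVE_RAPIDFUZZ:
        result = process.extractOne(
            user_input,
            PROVINCES,
            scorer=fuzz.WRatio,
            processor=str.lower,  # case-insensitive comparison
            score_cutoff=threshold,
        )
        return result[0] if result else None

    # Fallback: difflib compares lowercased names; map hits back to the
    # canonical spelling. The cutoff is rescaled from 0-100 to 0-1.
    canonical = {name.lower(): name for name in PROVINCES}
    hits = difflib.get_close_matches(
        user_input.lower(), list(canonical), n=1, cutoff=threshold / 100
    )
    return canonical[hits[0]] if hits else None
```

With either backend, `match_province("drente")` resolves to `"Drenthe"` and an unrelated input such as `"Limburg"` yields `None`.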

Semantic Similarity (Optional)

For intent classification when questions don't match patterns exactly:

Package	Version	Purpose	PyPI
`sentence-transformers`	>=2.2	Semantic similarity for template matching	sentence-transformers

Usage Example:

from sentence_transformers import SentenceTransformer, util

# Load multilingual model for Dutch/English
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Template question patterns
PATTERNS = [
    "Welke archieven zijn er in {province}?",
    "Hoeveel musea zijn er in Nederland?",
    "Wat is het oudste archief?",
]

# Encode the patterns once at import time instead of on every call
PATTERN_EMBEDDINGS = model.encode(PATTERNS)

def find_best_template(question: str, threshold: float = 0.7) -> int | None:
    """Find best matching template by semantic similarity."""
    question_embedding = model.encode(question)

    # cos_sim returns a 1 x len(PATTERNS) tensor; take its only row
    similarities = util.cos_sim(question_embedding, PATTERN_EMBEDDINGS)[0]
    best_idx = similarities.argmax().item()
    best_score = similarities[best_idx].item()

    return best_idx if best_score >= threshold else None

# Example
find_best_template("Welke archieven heeft Drenthe?")  # -> 0 if the score reaches the threshold, else None

Installation:

pip install sentence-transformers

Note: The model download adds ~500MB of weights, and sentence-transformers also pulls in PyTorch as a dependency. Only use this if DSPy classification is insufficient.

SPARQL Validation (Optional)

For deeper SPARQL syntax validation beyond regex:

Package	Version	Purpose	PyPI
`rdflib`	>=6.0	RDF/SPARQL parsing and validation	rdflib

Usage Example:

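# Import ParseException from pyparsing directly; rdflib's SPARQL parser is
# built on pyparsing, and a re-export from rdflib is not guaranteed.
from pyparsing import ParseException
from rdflib.plugins.sparql import prepareQuery

def validate_sparql_syntax(query: str) -> tuple[bool, str | None]:
    """Validate SPARQL syntax using rdflib parser."""
    try:
        prepareQuery(query)
        return True, None
    except ParseException as e:
        # Grammar-level syntax errors
        return False, str(e)
    except Exception as e:
        # Errors raised after parsing, e.g. an undeclared prefix
        return False, str(e)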

# Example
valid, error = validate_sparql_syntax("""
    PREFIX hc: <https://nde.nl/ontology/hc/>
    SELECT ?s WHERE { ?s a hc:Custodian }
""")
# -> (True, None)

Installation:

pip install rdflib

External Services

Required Services

Service	Endpoint	Purpose
Oxigraph SPARQL	`http://localhost:7878/query`	SPARQL query execution
Qdrant Vector DB	`http://localhost:6333`	Semantic search fallback

Service Availability Checks

import httpx

async def check_sparql_endpoint(
    endpoint: str = "http://localhost:7878/query",
    timeout: float = 5.0,
) -> bool:
    """Check if SPARQL endpoint is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                endpoint.replace("/query", "/"),
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False

async def check_qdrant(
    host: str = "localhost",
    port: int = 6333,
    timeout: float = 5.0,
) -> bool:
    """Check if Qdrant is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://{host}:{port}/",
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False
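The two checks are independent, so they can run concurrently at startup. The helper below is a sketch under that assumption; `check_services` is a hypothetical name, and it accepts any zero-argument coroutine functions so that it does not depend on the exact signatures above.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[bool]]


async def check_services(checks: dict[str, Check]) -> dict[str, bool]:
    """Run all availability checks concurrently and report each result."""
    names = list(checks)
    results = await asyncio.gather(
        *(checks[name]() for name in names),
        return_exceptions=True,  # one failing check must not cancel the rest
    )
    # Anything other than a literal True (False or an exception) counts as down.
    return {name: result is True for name, result in zip(names, results)}
```

With the functions defined above, a call such as `asyncio.run(check_services({"sparql": check_sparql_endpoint, "qdrant": check_qdrant}))` would return a mapping like `{"sparql": True, "qdrant": False}`, and the total wait is bounded by the slowest check rather than the sum of both timeouts.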

Project Files Required

Existing Files

These files must exist for the template system to function:

File	Purpose	Status
`data/validation/sparql_validation_rules.json`	Slot enum values (provinces, types)	✅ Exists
`backend/rag/ontology_mapping.py`	Entity extraction, fuzzy matching	✅ Exists
`src/glam_extractor/api/sparql_linter.py`	SPARQL validation/correction	✅ Exists
`backend/rag/dspy_heritage_rag.py`	Integration point	✅ Exists
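Since the template system depends on these files, it may be useful to verify them before loading any templates. A minimal sketch, assuming the paths are relative to the repository root (the function name is hypothetical):

```python
from pathlib import Path

# Paths copied from the table above, relative to the repository root.
REQUIRED_FILES = [
    "data/validation/sparql_validation_rules.json",
    "backend/rag/ontology_mapping.py",
    "src/glam_extractor/api/sparql_linter.py",
    "backend/rag/dspy_heritage_rag.py",
]


def missing_files(root: Path, required: list[str] = REQUIRED_FILES) -> list[str]:
    """Return the required paths that do not exist as files under root."""
    return [rel for rel in required if not (root / rel).is_file()]
```

An empty list from `missing_files(Path("."))` indicates that all prerequisites are in place; otherwise the returned paths can be reported in a single error message.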

New Files to Create

File	Purpose	Status
`backend/rag/template_sparql.py`	Template loading, classification, instantiation	❌ To create
`data/sparql_templates.yaml`	Template definitions	❌ To create
`tests/rag/test_template_sparql.py`	Unit tests	❌ To create

pyproject.toml Updates

Add optional dependencies for the template system:

[project.optional-dependencies]
# Template-based SPARQL generation
sparql-templates = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
]

# Full template system with semantic matching
sparql-templates-full = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
    "sentence-transformers>=2.2",
    "rdflib>=6.0",
]

Installation:

# Minimal template support
pip install -e ".[sparql-templates]"

# Full template support with semantic matching
pip install -e ".[sparql-templates-full]"

Environment Variables

Variable	Default	Purpose
`SPARQL_ENDPOINT`	`http://localhost:7878/query`	SPARQL endpoint URL
`QDRANT_HOST`	`localhost`	Qdrant host
`QDRANT_PORT`	`6333`	Qdrant port
`TEMPLATE_CONFIDENCE_THRESHOLD`	`0.7`	Minimum confidence for template use
`ENABLE_FUZZY_MATCHING`	`true`	Enable rapidfuzz for slot matching
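One way to read these variables with their documented defaults is sketched below. The `TemplateSettings` class and `load_settings` function are hypothetical names, and the set of strings accepted as "true" is an assumption, since the document only shows the value `true`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TemplateSettings:
    sparql_endpoint: str
    qdrant_host: str
    qdrant_port: int
    confidence_threshold: float
    enable_fuzzy_matching: bool


def _as_bool(value: str) -> bool:
    """Assumed convention: 1/true/yes/on (any case) mean enabled."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> TemplateSettings:
    """Build settings from the environment, using the table's defaults."""
    return TemplateSettings(
        sparql_endpoint=env.get("SPARQL_ENDPOINT", "http://localhost:7878/query"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        confidence_threshold=float(env.get("TEMPLATE_CONFIDENCE_THRESHOLD", "0.7")),
        enable_fuzzy_matching=_as_bool(env.get("ENABLE_FUZZY_MATCHING", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to test, for example `load_settings({"QDRANT_PORT": "7000"}).qdrant_port` evaluates to `7000`.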

Version Compatibility Matrix

Python	DSPy	Pydantic	Status
3.11+	2.6+	2.0+	✅ Supported
3.10	2.6+	2.0+	✅ Supported
3.9	2.5+	2.0+	⚠️ Limited (no `match` statements; `X | None` annotations in function signatures require `from __future__ import annotations`)
<3.9	-	-	❌ Not supported
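The matrix can be turned into a simple runtime gate. This is a sketch with hypothetical names; it encodes only the Python column of the table and does not check DSPy or Pydantic versions.

```python
import sys

MIN_SUPPORTED = (3, 9)   # below this: not supported
MIN_FULL = (3, 10)       # from here on: fully supported


def support_level(version: tuple[int, int] = sys.version_info[:2]) -> str:
    """Classify a (major, minor) Python version according to the matrix."""
    if version < MIN_SUPPORTED:
        return "unsupported"
    if version < MIN_FULL:
        return "limited"
    return "supported"
```

A caller might abort on `"unsupported"` and log a warning on `"limited"`, so that problems on Python 3.9 surface at startup rather than at the first `match` statement.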

Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

# Python dependencies
# Quote each specifier: an unquoted ">=" is parsed by the shell as an output redirect
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "httpx>=0.25" \
    "jinja2>=3.0" \
    "rapidfuzz>=3.0"

# Optional: sentence-transformers (adds ~500MB)
# RUN pip install "sentence-transformers>=2.2"

Dependency Security

All recommended packages are actively maintained and have no known critical CVEs as of 2025-06.

Package	Last Updated	Security Status
pydantic	2025-05	✅ No known CVEs
rapidfuzz	2025-06	✅ No known CVEs
dspy-ai	2025-06	✅ No known CVEs
jinja2	2025-04	✅ No known CVEs

Run security audit:

pip-audit --requirement requirements.txt

Summary

Minimum viable installation:

pip install pydantic pyyaml dspy-ai httpx jinja2

Recommended installation:

pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein

Full installation (with semantic matching):

pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein sentence-transformers rdflib
