# External Dependencies
## Overview
This document lists the external dependencies required for the template-based SPARQL query generation system. Dependencies are categorized by purpose and include both required and optional packages.
## Required Dependencies
### Core Python Packages
These packages are essential for the template system to function:
| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `pydantic` | >=2.0 | Structured output validation, slot schemas | [pydantic](https://pypi.org/project/pydantic/) |
| `pyyaml` | >=6.0 | Template definition loading | [PyYAML](https://pypi.org/project/PyYAML/) |
| `dspy-ai` | >=2.6 | DSPy framework for template classification | [dspy-ai](https://pypi.org/project/dspy-ai/) |
| `httpx` | >=0.25 | SPARQL endpoint HTTP client | [httpx](https://pypi.org/project/httpx/) |
| `jinja2` | >=3.0 | Template instantiation engine | [Jinja2](https://pypi.org/project/Jinja2/) |
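
To make the division of labour between these packages concrete, here is a minimal sketch of how they could fit together: PyYAML loads a template definition, pydantic validates the slot values, and Jinja2 fills the template. The template content, the `ListCustodiansSlots` model, and `render_query` are hypothetical names chosen for illustration; the actual schema of the template file is not defined in this document.

```python
import yaml
from jinja2 import Environment, StrictUndefined
from pydantic import BaseModel, Field, ValidationError

# Hypothetical template definition; the real template schema may differ.
TEMPLATE_YAML = """
id: list_custodians
sparql: |
  PREFIX hc: <https://nde.nl/ontology/hc/>
  SELECT ?s WHERE { ?s a hc:Custodian } LIMIT {{ limit }}
"""


class ListCustodiansSlots(BaseModel):
    """Illustrative slot schema with a single bounded integer slot."""

    limit: int = Field(default=10, ge=1, le=1000)


def render_query(raw_slots: dict) -> str:
    """Validate slot values with pydantic, then fill the Jinja2 template."""
    template_def = yaml.safe_load(TEMPLATE_YAML)
    # Raises pydantic.ValidationError if a slot value is out of range
    slots = ListCustodiansSlots(**raw_slots)
    # StrictUndefined makes rendering fail if the template uses a missing slot
    env = Environment(undefined=StrictUndefined)
    return env.from_string(template_def["sparql"]).render(**slots.model_dump())


print(render_query({"limit": 25}))
```

Validating before rendering means only typed, bounded values reach the query string, which is one reason to keep slot schemas separate from the templates themselves.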
### Already in Project
Four of the five core packages are already declared in `pyproject.toml` and need no changes; `jinja2` is the only one not yet listed:
```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "httpx>=0.25",
]
```
## Optional Dependencies
### Fuzzy Matching (Recommended)
For improved slot value resolution when user input doesn't exactly match enum values:
| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rapidfuzz` | >=3.0 | Fast fuzzy string matching for slot values | [rapidfuzz](https://pypi.org/project/rapidfuzz/) |
| `python-Levenshtein` | >=0.21 | Provides the separate `Levenshtein` module API; not needed by rapidfuzz, which ships its own compiled implementation | [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) |
**Usage Example:**
```python
from rapidfuzz import fuzz, process, utils

# Match user input to valid province names
PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]


def match_province(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match user input to a valid province name."""
    result = process.extractOne(
        user_input,
        PROVINCES,
        scorer=fuzz.WRatio,
        # rapidfuzz >=3.0 no longer normalizes strings by default, so
        # lowercase and strip punctuation explicitly before scoring.
        processor=utils.default_process,
        score_cutoff=threshold,
    )
    # extractOne returns (choice, score, index), or None below the cutoff
    return result[0] if result else None


# Examples
match_province("drente")       # -> "Drenthe"
match_province("N-Holland")    # -> "Noord-Holland"
match_province("zuudholland")  # -> "Zuid-Holland"
```
**Installation:**
```bash
pip install rapidfuzz python-Levenshtein
```
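
Because rapidfuzz is optional, callers may want a fallback when it is not installed. The following is a minimal sketch, assuming the standard library's `difflib` is an acceptable substitute (it is slower and scores differently). The `HAVE_RAPIDFUZZ` flag and the fallback branch are illustrative and not part of the planned module.

```python
import difflib
from typing import Optional

try:
    from rapidfuzz import fuzz, process, utils
    HAVE_RAPIDFUZZ = True
except ImportError:
    # rapidfuzz is optional; fall back to the standard library
    HAVE_RAPIDFUZZ = False

PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]


def match_province(user_input: str, threshold: float = 70.0) -> Optional[str]:
    """Fuzzy match a province name, with or without rapidfuzz."""
    if HAVE_RAPIDFUZZ:
        result = process.extractOne(
            user_input,
            PROVINCES,
            scorer=fuzz.WRatio,
            processor=utils.default_process,
            score_cutoff=threshold,
        )
        return result[0] if result else None

    # difflib scores in the range 0..1 and is case-sensitive, so compare
    # lowercased strings and map the hit back to its canonical spelling.
    canonical = {name.lower(): name for name in PROVINCES}
    hits = difflib.get_close_matches(
        user_input.lower(), list(canonical), n=1, cutoff=threshold / 100.0
    )
    return canonical[hits[0]] if hits else None
```

The two branches use different similarity measures, so borderline inputs can resolve differently depending on which library is present; the threshold would need tuning per branch.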
### Semantic Similarity (Optional)
For intent classification when questions don't match patterns exactly:
| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `sentence-transformers` | >=2.2 | Semantic similarity for template matching | [sentence-transformers](https://pypi.org/project/sentence-transformers/) |
**Usage Example:**
```python
from sentence_transformers import SentenceTransformer, util

# Load multilingual model for Dutch/English
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Template question patterns (Dutch, with English glosses)
PATTERNS = [
    "Welke archieven zijn er in {province}?",  # Which archives are there in {province}?
    "Hoeveel musea zijn er in Nederland?",     # How many museums are there in the Netherlands?
    "Wat is het oudste archief?",              # What is the oldest archive?
]

# Encode the patterns once instead of on every call
PATTERN_EMBEDDINGS = model.encode(PATTERNS, convert_to_tensor=True)


def find_best_template(question: str, threshold: float = 0.7) -> int | None:
    """Find the best matching template by cosine similarity."""
    question_embedding = model.encode(question, convert_to_tensor=True)
    # cos_sim returns a 1 x len(PATTERNS) matrix; take its only row
    similarities = util.cos_sim(question_embedding, PATTERN_EMBEDDINGS)[0]
    best_idx = int(similarities.argmax().item())
    best_score = similarities[best_idx].item()
    return best_idx if best_score >= threshold else None


# Example ("Which archives does Drenthe have?")
find_best_template("Welke archieven heeft Drenthe?")  # -> 0 if the score clears the threshold
```
**Installation:**
```bash
pip install sentence-transformers
```
**Note:** The model download adds ~500MB of weights, and `sentence-transformers` also pulls in PyTorch as a dependency. Only use this if DSPy classification is insufficient.
### SPARQL Validation (Optional)
For deeper SPARQL syntax validation beyond regex:
| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rdflib` | >=6.0 | RDF/SPARQL parsing and validation | [rdflib](https://pypi.org/project/rdflib/) |
**Usage Example:**
```python
# rdflib's SPARQL parser is built on pyparsing, which defines ParseException
from pyparsing import ParseException
from rdflib.plugins.sparql import prepareQuery


def validate_sparql_syntax(query: str) -> tuple[bool, str | None]:
    """Validate SPARQL syntax using the rdflib parser."""
    try:
        prepareQuery(query)
        return True, None
    except ParseException as e:
        # Only grammar-level errors are caught here; other exception
        # types raised after parsing propagate to the caller.
        return False, str(e)


# Example
valid, error = validate_sparql_syntax("""
PREFIX hc: <https://nde.nl/ontology/hc/>
SELECT ?s WHERE { ?s a hc:Custodian }
""")
# -> (True, None)
```
**Installation:**
```bash
pip install rdflib
```
## External Services
### Required Services
| Service | Endpoint | Purpose |
|---------|----------|---------|
| Oxigraph SPARQL | `http://localhost:7878/query` | SPARQL query execution |
| Qdrant Vector DB | `http://localhost:6333` | Semantic search fallback |
### Service Availability Checks
```python
import httpx


async def check_sparql_endpoint(
    endpoint: str = "http://localhost:7878/query",
    timeout: float = 5.0,
) -> bool:
    """Check if SPARQL endpoint is available."""
    try:
        async with httpx.AsyncClient() as client:
            # Probe the server root rather than sending a query
            response = await client.get(
                endpoint.replace("/query", "/"),
                timeout=timeout,
            )
            return response.status_code == 200
    except httpx.HTTPError:
        # Connection failures and timeouts both count as unavailable
        return False


async def check_qdrant(
    host: str = "localhost",
    port: int = 6333,
    timeout: float = 5.0,
) -> bool:
    """Check if Qdrant is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://{host}:{port}/",
                timeout=timeout,
            )
            return response.status_code == 200
    except httpx.HTTPError:
        return False
```
## Project Files Required
### Existing Files
These files must exist for the template system to function:
| File | Purpose | Status |
|------|---------|--------|
| `data/validation/sparql_validation_rules.json` | Slot enum values (provinces, types) | ✅ Exists |
| `backend/rag/ontology_mapping.py` | Entity extraction, fuzzy matching | ✅ Exists |
| `src/glam_extractor/api/sparql_linter.py` | SPARQL validation/correction | ✅ Exists |
| `backend/rag/dspy_heritage_rag.py` | Integration point | ✅ Exists |
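
A startup check can confirm that these files are present before the template system is enabled. This is a minimal sketch; `missing_files` is a hypothetical helper, and it assumes the paths in the table are relative to the project root.

```python
from pathlib import Path

# Paths copied from the table above, assumed relative to the project root
REQUIRED_FILES = [
    "data/validation/sparql_validation_rules.json",
    "backend/rag/ontology_mapping.py",
    "src/glam_extractor/api/sparql_linter.py",
    "backend/rag/dspy_heritage_rag.py",
]


def missing_files(project_root: Path) -> list[str]:
    """Return the required files that do not exist under project_root."""
    return [rel for rel in REQUIRED_FILES if not (project_root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(Path.cwd())
    if missing:
        print("Missing required files:")
        for rel in missing:
            print(f"  {rel}")
```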
### New Files to Create
| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/template_sparql.py` | Template loading, classification, instantiation | ❌ To create |
| `data/sparql_templates.yaml` | Template definitions | ❌ To create |
| `tests/rag/test_template_sparql.py` | Unit tests | ❌ To create |
## pyproject.toml Updates
Add optional dependency groups for the template system:
```toml
[project.optional-dependencies]
# Template-based SPARQL generation
sparql-templates = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
]
# Full template system with semantic matching
sparql-templates-full = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
    "sentence-transformers>=2.2",
    "rdflib>=6.0",
]
```
**Installation:**
```bash
# Minimal template support
pip install -e ".[sparql-templates]"
# Full template support with semantic matching
pip install -e ".[sparql-templates-full]"
```
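
Since the extras are optional, code that depends on them may need to detect at runtime which ones are installed. A minimal sketch using only the standard library follows; the feature names and the `available_features` helper are hypothetical, while the module names are the import names of the packages listed above.

```python
from importlib.util import find_spec

# Feature name (illustrative) -> top-level module that provides it
OPTIONAL_FEATURES = {
    "fuzzy_matching": "rapidfuzz",
    "semantic_matching": "sentence_transformers",
    "sparql_validation": "rdflib",
}


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def available_features() -> dict[str, bool]:
    """Report which optional features have their backing package installed."""
    return {
        feature: is_installed(module)
        for feature, module in OPTIONAL_FEATURES.items()
    }


if __name__ == "__main__":
    for feature, present in available_features().items():
        print(f"{feature}: {'available' if present else 'not installed'}")
```

Using `find_spec` avoids importing heavy packages such as `sentence_transformers` just to test for their presence.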
## Environment Variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `SPARQL_ENDPOINT` | `http://localhost:7878/query` | SPARQL endpoint URL |
| `QDRANT_HOST` | `localhost` | Qdrant host |
| `QDRANT_PORT` | `6333` | Qdrant port |
| `TEMPLATE_CONFIDENCE_THRESHOLD` | `0.7` | Minimum confidence for template use |
| `ENABLE_FUZZY_MATCHING` | `true` | Enable rapidfuzz for slot matching |
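
One way to read these variables with their documented defaults is sketched below. The `TemplateSettings` dataclass and `load_settings` function are hypothetical names, and the set of strings accepted as "true" is an assumption rather than something this document specifies.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TemplateSettings:
    sparql_endpoint: str
    qdrant_host: str
    qdrant_port: int
    confidence_threshold: float
    enable_fuzzy_matching: bool


def _as_bool(value: str) -> bool:
    # Assumed convention: these spellings count as true, anything else as false
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> TemplateSettings:
    """Build settings from environment variables, using the table's defaults."""
    env = os.environ if env is None else env
    return TemplateSettings(
        sparql_endpoint=env.get("SPARQL_ENDPOINT", "http://localhost:7878/query"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        confidence_threshold=float(env.get("TEMPLATE_CONFIDENCE_THRESHOLD", "0.7")),
        enable_fuzzy_matching=_as_bool(env.get("ENABLE_FUZZY_MATCHING", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests, for example `load_settings({"QDRANT_PORT": "7000"})`.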
## Version Compatibility Matrix
| Python | DSPy | Pydantic | Status |
|--------|------|----------|--------|
| 3.11+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.10 | 2.6+ | 2.0+ | ✅ Supported |
| 3.9 | 2.5+ | 2.0+ | ⚠️ Limited (no `match` statements or PEP 604 union annotations) |
| <3.9 | - | - | Not supported |
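
The matrix can be checked at startup with the standard library. The sketch below is illustrative: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, the minimums are copied from the core package table, and the version parsing is deliberately naive (it ignores pre-release suffixes rather than ordering them).

```python
import sys
from importlib import metadata

# Minimum versions copied from the core package table
MINIMUMS = {
    "pydantic": (2, 0),
    "pyyaml": (6, 0),
    "dspy-ai": (2, 6),
    "httpx": (0, 25),
    "jinja2": (3, 0),
}


def parse_version(text: str) -> tuple:
    """Parse the leading numeric components, e.g. '3.0.0rc1' -> (3, 0, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple) -> bool:
    """Compare an installed version string against a minimum tuple."""
    parsed = parse_version(installed)
    # Pad so that '2' is treated the same as '2.0'
    padded = parsed + (0,) * max(0, len(minimum) - len(parsed))
    return padded >= minimum


def check_environment() -> list:
    """Return a list of human-readable problems; empty means compatible."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9 or newer is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems
```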
## Docker Considerations
If deploying in Docker, ensure these are in the Dockerfile:
```dockerfile
# Python dependencies
# Quote each specifier: unquoted, the shell treats ">" as an output redirect
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "httpx>=0.25" \
    "jinja2>=3.0" \
    "rapidfuzz>=3.0"

# Optional: sentence-transformers (adds ~500MB)
# RUN pip install "sentence-transformers>=2.2"
```
## Dependency Security
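All recommended packages are actively maintained, and their latest releases had no known critical CVEs as of 2025-06. Because the `>=` constraints above also admit older releases, confirm the versions actually resolved with an audit rather than relying on this table alone.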
| Package | Last Updated | Security Status |
|---------|--------------|-----------------|
| pydantic | 2025-05 | No known CVEs |
| rapidfuzz | 2025-06 | No known CVEs |
| dspy-ai | 2025-06 | No known CVEs |
| jinja2 | 2025-04 | No known CVEs |
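Run a security audit (this assumes a `requirements.txt` export exists; `pip-audit` with no arguments audits the active environment instead):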
```bash
pip-audit --requirement requirements.txt
```
## Summary
**Minimum viable installation:**
```bash
pip install pydantic pyyaml dspy-ai httpx jinja2
```
**Recommended installation:**
```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein
```
**Full installation (with semantic matching):**
```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein sentence-transformers rdflib
```