# External Dependencies

## Overview

This document lists the external dependencies required for the template-based SPARQL query generation system. Dependencies are categorized by purpose and include both required and optional packages.

## Required Dependencies

### Core Python Packages

These packages are essential for the template system to function:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `pydantic` | >=2.0 | Structured output validation, slot schemas | [pydantic](https://pypi.org/project/pydantic/) |
| `pyyaml` | >=6.0 | Template definition loading | [PyYAML](https://pypi.org/project/PyYAML/) |
| `dspy-ai` | >=2.6 | DSPy framework for template classification | [dspy-ai](https://pypi.org/project/dspy-ai/) |
| `httpx` | >=0.25 | SPARQL endpoint HTTP client | [httpx](https://pypi.org/project/httpx/) |
| `jinja2` | >=3.0 | Template instantiation engine | [Jinja2](https://pypi.org/project/Jinja2/) |

### Already in Project

Four of the five core packages are already declared in `pyproject.toml` and are available without further changes. `jinja2` is not among them; it is added through the optional dependency groups described under "pyproject.toml Updates".

```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "httpx>=0.25",
]
```

## Optional Dependencies

### Fuzzy Matching (Recommended)

For improved slot value resolution when user input doesn't exactly match enum values:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rapidfuzz` | >=3.0 | Fast fuzzy string matching for slot values | [rapidfuzz](https://pypi.org/project/rapidfuzz/) |
| `python-Levenshtein` | >=0.21 | Standalone Levenshtein bindings; rapidfuzz ships its own compiled implementation and does not need this package for speed | [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) |

**Usage Example:**

```python
from rapidfuzz import fuzz, process

# Match user input to valid province names
PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]

def match_province(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match user input to valid province."""
    result = process.extractOne(
        user_input,
        PROVINCES,
        scorer=fuzz.WRatio,
        # rapidfuzz >=3.0 no longer preprocesses strings by default,
        # so lowercase both sides to make matching case-insensitive.
        processor=str.lower,
        score_cutoff=threshold,
    )
    # extractOne returns (choice, score, index), or None below the cutoff.
    return result[0] if result else None

# Examples
match_province("drente")       # -> "Drenthe"
match_province("N-Holland")    # -> "Noord-Holland"
match_province("zuudholland")  # -> "Zuid-Holland"
```

**Installation:**

```bash
pip install rapidfuzz python-Levenshtein
```

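
Because `rapidfuzz` is optional, slot matching may need a path that still works when it is not installed. The sketch below is one possible fallback built only on the standard library's `difflib`. The function name `match_province_fallback` and the `0.7` cutoff are illustrative assumptions, not part of the project, and `difflib` ratios are on a 0 to 1 scale that is not interchangeable with rapidfuzz scores.

```python
import difflib
from typing import Optional

PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]

# Map lowercased names back to their canonical spelling.
_CANONICAL = {name.lower(): name for name in PROVINCES}


def match_province_fallback(user_input: str, cutoff: float = 0.7) -> Optional[str]:
    """Hypothetical stdlib-only fallback for when rapidfuzz is unavailable."""
    candidates = difflib.get_close_matches(
        user_input.lower(),   # compare case-insensitively
        list(_CANONICAL),     # lowercased province names
        n=1,                  # only the best match is needed
        cutoff=cutoff,        # similarity ratio in [0, 1]
    )
    return _CANONICAL[candidates[0]] if candidates else None
```

With these assumptions, `match_province_fallback("drente")` resolves to `"Drenthe"`, while an unrelated string such as `"xyz"` yields `None`.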
### Semantic Similarity (Optional)

For intent classification when questions don't match patterns exactly:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `sentence-transformers` | >=2.2 | Semantic similarity for template matching | [sentence-transformers](https://pypi.org/project/sentence-transformers/) |

**Usage Example:**

```python
from sentence_transformers import SentenceTransformer, util

# Load multilingual model for Dutch/English
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Template question patterns (Dutch, with English glosses)
PATTERNS = [
    "Welke archieven zijn er in {province}?",  # Which archives are there in {province}?
    "Hoeveel musea zijn er in Nederland?",     # How many museums are there in the Netherlands?
    "Wat is het oudste archief?",              # What is the oldest archive?
]

# Encode the fixed patterns once instead of on every call
PATTERN_EMBEDDINGS = model.encode(PATTERNS)

def find_best_template(question: str, threshold: float = 0.7) -> int | None:
    """Find best matching template by semantic similarity."""
    question_embedding = model.encode(question)

    # cos_sim returns a 1 x len(PATTERNS) tensor; take its only row
    similarities = util.cos_sim(question_embedding, PATTERN_EMBEDDINGS)[0]
    best_idx = similarities.argmax().item()
    best_score = similarities[best_idx].item()

    return best_idx if best_score >= threshold else None

# Example ("Which archives does Drenthe have?")
find_best_template("Welke archieven heeft Drenthe?")  # -> 0 if the score reaches the threshold
```

**Installation:**

```bash
pip install sentence-transformers
```

**Note:** This adds ~500MB of model weights and pulls in PyTorch as a dependency. Only use it if DSPy classification is insufficient.

### SPARQL Validation (Optional)

For deeper SPARQL syntax validation beyond regex:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rdflib` | >=6.0 | RDF/SPARQL parsing and validation | [rdflib](https://pypi.org/project/rdflib/) |

**Usage Example:**

```python
# rdflib's SPARQL parser is built on pyparsing (an rdflib dependency),
# so import the exception from there rather than relying on a re-export.
from pyparsing import ParseException
from rdflib.plugins.sparql import prepareQuery

def validate_sparql_syntax(query: str) -> tuple[bool, str | None]:
    """Validate SPARQL syntax using rdflib parser."""
    try:
        prepareQuery(query)
        return True, None
    except ParseException as e:
        return False, str(e)
    except Exception as e:
        # Non-syntax failures (e.g. an undeclared prefix) may surface
        # as other exception types; report them as invalid too.
        return False, str(e)

# Example
valid, error = validate_sparql_syntax("""
PREFIX hc: <https://nde.nl/ontology/hc/>
SELECT ?s WHERE { ?s a hc:Custodian }
""")
# -> (True, None)
```

**Installation:**

```bash
pip install rdflib
```

## External Services

### Required Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| Oxigraph SPARQL | `http://localhost:7878/query` | SPARQL query execution |
| Qdrant Vector DB | `http://localhost:6333` | Semantic search fallback |

### Service Availability Checks

```python
import httpx

async def check_sparql_endpoint(
    endpoint: str = "http://localhost:7878/query",
    timeout: float = 5.0,
) -> bool:
    """Check if SPARQL endpoint is available."""
    try:
        async with httpx.AsyncClient() as client:
            # Probe the server root rather than sending a real query
            response = await client.get(
                endpoint.replace("/query", "/"),
                timeout=timeout,
            )
            return response.status_code == 200
    except httpx.HTTPError:
        # Covers connection failures and timeouts
        return False

async def check_qdrant(
    host: str = "localhost",
    port: int = 6333,
    timeout: float = 5.0,
) -> bool:
    """Check if Qdrant is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://{host}:{port}/",
                timeout=timeout,
            )
            return response.status_code == 200
    except httpx.HTTPError:
        # Covers connection failures and timeouts
        return False
```

## Project Files Required

### Existing Files

These files must exist for the template system to function:

| File | Purpose | Status |
|------|---------|--------|
| `data/validation/sparql_validation_rules.json` | Slot enum values (provinces, types) | ✅ Exists |
| `backend/rag/ontology_mapping.py` | Entity extraction, fuzzy matching | ✅ Exists |
| `src/glam_extractor/api/sparql_linter.py` | SPARQL validation/correction | ✅ Exists |
| `backend/rag/dspy_heritage_rag.py` | Integration point | ✅ Exists |

### New Files to Create

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/template_sparql.py` | Template loading, classification, instantiation | ❌ To create |
| `data/sparql_templates.yaml` | Template definitions | ❌ To create |
| `tests/rag/test_template_sparql.py` | Unit tests | ❌ To create |

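
Since the template system depends on the existing files listed above, a startup check can fail early with a clear message instead of a late import or I/O error. The following is a minimal sketch using only `pathlib`; the helper name `find_missing_files` and the idea of passing a project root are assumptions made for illustration.

```python
from pathlib import Path

# Paths taken from the "Existing Files" table, relative to the project root.
REQUIRED_FILES = [
    "data/validation/sparql_validation_rules.json",
    "backend/rag/ontology_mapping.py",
    "src/glam_extractor/api/sparql_linter.py",
    "backend/rag/dspy_heritage_rag.py",
]


def find_missing_files(project_root: Path, required: list = REQUIRED_FILES) -> list:
    """Return the required relative paths that are not files under project_root."""
    return [rel for rel in required if not (project_root / rel).is_file()]
```

A caller could run `find_missing_files(Path.cwd())` at startup and raise an error listing whatever it returns.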
## pyproject.toml Updates

Add optional dependency groups for the template system:

```toml
[project.optional-dependencies]
# Template-based SPARQL generation
sparql-templates = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
]

# Full template system with semantic matching
sparql-templates-full = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
    "sentence-transformers>=2.2",
    "rdflib>=6.0",
]
```

**Installation:**

```bash
# Minimal template support
pip install -e ".[sparql-templates]"

# Full template support with semantic matching
pip install -e ".[sparql-templates-full]"
```

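
Because the extras are optional, code that uses them may want to detect at runtime which ones are installed before enabling a feature. One way to do that, sketched here with the standard library's `importlib.util.find_spec`, is shown below. The feature names and the `available_features` helper are hypothetical; only the import names of the packages come from this document.

```python
from importlib.util import find_spec

# Feature label -> top-level import name (not the PyPI name).
OPTIONAL_MODULES = {
    "fuzzy_matching": "rapidfuzz",
    "semantic_matching": "sentence_transformers",
    "sparql_validation": "rdflib",
}


def available_features(modules: dict = OPTIONAL_MODULES) -> dict:
    """Report which optional features can be enabled in this environment."""
    # find_spec returns None for a top-level module that is not installed,
    # without importing the module itself.
    return {
        feature: find_spec(module) is not None
        for feature, module in modules.items()
    }
```

Checking with `find_spec` avoids importing heavy packages such as `sentence_transformers` just to learn whether they are present.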
## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `SPARQL_ENDPOINT` | `http://localhost:7878/query` | SPARQL endpoint URL |
| `QDRANT_HOST` | `localhost` | Qdrant host |
| `QDRANT_PORT` | `6333` | Qdrant port |
| `TEMPLATE_CONFIDENCE_THRESHOLD` | `0.7` | Minimum confidence for template use |
| `ENABLE_FUZZY_MATCHING` | `true` | Enable rapidfuzz for slot matching |

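
The variables above could be read into a single typed settings object. The sketch below is one possible approach using only the standard library; the `TemplateSettings` class and `load_settings` function are illustrative names, and the set of strings accepted as "true" is an assumption, since the table only documents the default.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TemplateSettings:
    sparql_endpoint: str
    qdrant_host: str
    qdrant_port: int
    confidence_threshold: float
    enable_fuzzy_matching: bool


def load_settings(env: Mapping[str, str] = os.environ) -> TemplateSettings:
    """Read the documented variables, falling back to the documented defaults."""
    fuzzy = env.get("ENABLE_FUZZY_MATCHING", "true").strip().lower()
    return TemplateSettings(
        sparql_endpoint=env.get("SPARQL_ENDPOINT", "http://localhost:7878/query"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        confidence_threshold=float(env.get("TEMPLATE_CONFIDENCE_THRESHOLD", "0.7")),
        # Assumption: "true", "1" and "yes" (any case) mean enabled.
        enable_fuzzy_matching=fuzzy in {"true", "1", "yes"},
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in unit tests.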
## Version Compatibility Matrix

| Python | DSPy | Pydantic | Status |
|--------|------|----------|--------|
| 3.11+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.10 | 2.6+ | 2.0+ | ✅ Supported |
| 3.9 | 2.5+ | 2.0+ | ⚠️ Limited (no `match` statements or `X \| Y` type hints) |
| <3.9 | - | - | ❌ Not supported |

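
The Python column of the matrix can be enforced with a small guard at import time. This is a sketch under the assumption that the three status labels map directly to the rows above; the function name `python_support_status` and the returned strings are hypothetical.

```python
import sys

MIN_SUPPORTED = (3, 9)  # anything older is "Not supported"
MIN_FULL = (3, 10)      # 3.9 falls between the two and is "Limited"


def python_support_status(version: tuple = sys.version_info[:2]) -> str:
    """Map a (major, minor) Python version to its status in the matrix."""
    if version < MIN_SUPPORTED:
        return "unsupported"
    if version < MIN_FULL:
        return "limited"
    return "supported"
```

A package `__init__` could call this once and raise `RuntimeError` when the result is `"unsupported"`.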
## Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

```dockerfile
# Python dependencies
# Quote each requirement so the shell does not treat ">" as a redirect
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "httpx>=0.25" \
    "jinja2>=3.0" \
    "rapidfuzz>=3.0"

# Optional: sentence-transformers (adds ~500MB)
# RUN pip install "sentence-transformers>=2.2"
```

## Dependency Security

All recommended packages are actively maintained and, in their latest releases, have no known critical CVEs as of 2025-06. Older releases that still satisfy the version bounds above may be affected, so audit the versions that are actually installed.

| Package | Last Updated | Security Status |
|---------|--------------|-----------------|
| pydantic | 2025-05 | ✅ No known CVEs |
| rapidfuzz | 2025-06 | ✅ No known CVEs |
| dspy-ai | 2025-06 | ✅ No known CVEs |
| jinja2 | 2025-04 | ✅ No known CVEs |

Run a security audit:

```bash
pip-audit --requirement requirements.txt
```

## Summary

**Minimum viable installation:**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2
```

**Recommended installation:**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein
```

**Full installation (with semantic matching):**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein sentence-transformers rdflib
```