# External Dependencies

## Overview

This document lists the external dependencies required for the template-based SPARQL query generation system. Dependencies are categorized by purpose and include both required and optional packages.

## Required Dependencies

### Core Python Packages

These packages are essential for the template system to function:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `pydantic` | >=2.0 | Structured output validation, slot schemas | [pydantic](https://pypi.org/project/pydantic/) |
| `pyyaml` | >=6.0 | Template definition loading | [PyYAML](https://pypi.org/project/PyYAML/) |
| `dspy-ai` | >=2.6 | DSPy framework for template classification | [dspy-ai](https://pypi.org/project/dspy-ai/) |
| `httpx` | >=0.25 | SPARQL endpoint HTTP client | [httpx](https://pypi.org/project/httpx/) |
| `jinja2` | >=3.0 | Template instantiation engine | [Jinja2](https://pypi.org/project/Jinja2/) |

### Already in Project

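These packages are already declared in `pyproject.toml`. Of the required packages, only `jinja2` is missing from this list; it is added through the `sparql-templates` extra described under "pyproject.toml Updates".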

```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "httpx>=0.25",
]
```
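
To confirm that an environment satisfies these minimums, the installed versions can be compared at startup. The following is a minimal sketch using only the standard library. The `MINIMUMS` mapping mirrors the required-dependencies table, while `parse_version`, `is_at_least`, and `check_minimums` are hypothetical helpers, not part of the project. The parser handles plain numeric versions only and ignores pre-release suffixes.

```python
from importlib import metadata

# Minimum versions copied from the required-dependencies table.
MINIMUMS = {
    "pydantic": "2.0",
    "pyyaml": "6.0",
    "dspy-ai": "2.6",
    "httpx": "0.25",
    "jinja2": "3.0",
}


def parse_version(version: str) -> tuple:
    """Turn '2.6.1' into (2, 6, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimums(minimums: dict) -> dict:
    """Return {package: problem} for every missing or outdated package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not is_at_least(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    for package, problem in check_minimums(MINIMUMS).items():
        print(f"{package}: {problem}")
```

An empty result means every required package is present at or above its minimum version.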

## Optional Dependencies

### Fuzzy Matching (Recommended)

For improved slot value resolution when user input doesn't exactly match enum values:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rapidfuzz` | >=3.0 | Fast fuzzy string matching for slot values | [rapidfuzz](https://pypi.org/project/rapidfuzz/) |
| `python-Levenshtein` | >=0.21 | Levenshtein distance API (compatibility wrapper built on rapidfuzz; rapidfuzz itself does not need it for speed) | [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) |

**Usage Example:**

```python
from rapidfuzz import fuzz, process

# Match user input to valid province codes
PROVINCES = ["Noord-Holland", "Zuid-Holland", "Utrecht", "Drenthe", "Gelderland"]

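def match_province(user_input: str, threshold: float = 70.0) -> str | None:
    """Fuzzy match user input to a valid province (case-insensitive)."""
    result = process.extractOne(
        user_input,
        PROVINCES,
        scorer=fuzz.WRatio,
        # rapidfuzz >= 3.0 no longer preprocesses strings by default, so
        # lowercase both sides explicitly; otherwise case mismatches cost score.
        processor=str.lower,
        score_cutoff=threshold,
    )
    return result[0] if result else None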

# Examples
match_province("drente")  # -> "Drenthe"
match_province("N-Holland")  # -> "Noord-Holland"
match_province("zuudholland")  # -> "Zuid-Holland"
```

**Installation:**

```bash
pip install rapidfuzz python-Levenshtein
```
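
Because `rapidfuzz` is optional, calling code needs a path that still works when it is absent. The following is a minimal sketch, assuming a standard-library `difflib` fallback is acceptable in that case. `match_slot_value` and `match_with_difflib` are hypothetical names, and the two scorers are not numerically equivalent, so the cutoff is only roughly comparable between them.

```python
import difflib
from typing import List, Optional

try:  # rapidfuzz is optional; without it we fall back to difflib
    from rapidfuzz import fuzz, process
    HAVE_RAPIDFUZZ = True
except ImportError:
    HAVE_RAPIDFUZZ = False


def match_with_difflib(
    user_input: str, choices: List[str], cutoff: float = 0.7
) -> Optional[str]:
    """Stdlib fallback: case-insensitive closest match, or None."""
    # Map lowercased choices back to their original spelling.
    lowered = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(
        user_input.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


def match_slot_value(
    user_input: str, choices: List[str], cutoff: float = 0.7
) -> Optional[str]:
    """Use rapidfuzz when installed, otherwise the difflib fallback."""
    if HAVE_RAPIDFUZZ:
        result = process.extractOne(
            user_input,
            choices,
            scorer=fuzz.WRatio,
            processor=str.lower,
            score_cutoff=cutoff * 100,  # rapidfuzz scores run from 0 to 100
        )
        return result[0] if result else None
    return match_with_difflib(user_input, choices, cutoff)
```

Keeping the fallback behind one function means the rest of the slot-resolution code does not need to know which backend is installed.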

### Semantic Similarity (Optional)

For intent classification when questions don't match patterns exactly:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `sentence-transformers` | >=2.2 | Semantic similarity for template matching | [sentence-transformers](https://pypi.org/project/sentence-transformers/) |

**Usage Example:**

```python
from sentence_transformers import SentenceTransformer, util

# Load multilingual model for Dutch/English
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Template question patterns
PATTERNS = [
    "Welke archieven zijn er in {province}?",
    "Hoeveel musea zijn er in Nederland?",
    "Wat is het oudste archief?",
]

# Encode the patterns once at import time instead of on every call
PATTERN_EMBEDDINGS = model.encode(PATTERNS)

def find_best_template(question: str, threshold: float = 0.7) -> int | None:
    """Find best matching template by semantic similarity."""
    question_embedding = model.encode(question)

    # cos_sim returns a 1 x len(PATTERNS) matrix; take its only row
    similarities = util.cos_sim(question_embedding, PATTERN_EMBEDDINGS)[0]
    best_idx = similarities.argmax().item()
    best_score = similarities[best_idx].item()

    return best_idx if best_score >= threshold else None

# Example
find_best_template("Welke archieven heeft Drenthe?")  # -> 0 if the score clears the threshold, else None
```

**Installation:**

```bash
pip install sentence-transformers
```

**Note:** The model download adds ~500MB of weights, and `sentence-transformers` also pulls in PyTorch as a dependency. Only use this if DSPy classification is insufficient.

### SPARQL Validation (Optional)

For deeper SPARQL syntax validation beyond regex:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `rdflib` | >=6.0 | RDF/SPARQL parsing and validation | [rdflib](https://pypi.org/project/rdflib/) |

**Usage Example:**

```python
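# Import from pyparsing (installed with rdflib) rather than relying on
# rdflib's parser module re-exporting the exception.
from pyparsing import ParseException
from rdflib.plugins.sparql import prepareQuery

def validate_sparql_syntax(query: str) -> tuple[bool, str | None]:
    """Validate SPARQL syntax using rdflib parser."""
    try:
        prepareQuery(query)
        return True, None
    except ParseException as e:
        # Grammar-level syntax error
        return False, str(e)
    except Exception as e:
        # Raised after parsing, e.g. for an undeclared namespace prefix
        return False, str(e)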

# Example
valid, error = validate_sparql_syntax("""
    PREFIX hc: <https://nde.nl/ontology/hc/>
    SELECT ?s WHERE { ?s a hc:Custodian }
""")
# -> (True, None)
```

**Installation:**

```bash
pip install rdflib
```

## External Services

### Required Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| Oxigraph SPARQL | `http://localhost:7878/query` | SPARQL query execution |
| Qdrant Vector DB | `http://localhost:6333` | Semantic search fallback |

### Service Availability Checks

```python
import httpx

async def check_sparql_endpoint(
    endpoint: str = "http://localhost:7878/query",
    timeout: float = 5.0,
) -> bool:
    """Check if SPARQL endpoint is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                endpoint.replace("/query", "/"),
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False

async def check_qdrant(
    host: str = "localhost",
    port: int = 6333,
    timeout: float = 5.0,
) -> bool:
    """Check if Qdrant is available."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://{host}:{port}/",
                timeout=timeout,
            )
            return response.status_code == 200
    except Exception:
        return False
```

## Project Files Required

### Existing Files

These files must exist for the template system to function:

| File | Purpose | Status |
|------|---------|--------|
| `data/validation/sparql_validation_rules.json` | Slot enum values (provinces, types) | ✅ Exists |
| `backend/rag/ontology_mapping.py` | Entity extraction, fuzzy matching | ✅ Exists |
| `src/glam_extractor/api/sparql_linter.py` | SPARQL validation/correction | ✅ Exists |
| `backend/rag/dspy_heritage_rag.py` | Integration point | ✅ Exists |
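
A startup check can report which of these files are absent before the template system is initialized. The following is a minimal sketch. The paths come from the table above, while `missing_files` and the idea of passing a project root are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Sequence

# Paths copied from the "Existing Files" table, relative to the project root.
REQUIRED_FILES = (
    "data/validation/sparql_validation_rules.json",
    "backend/rag/ontology_mapping.py",
    "src/glam_extractor/api/sparql_linter.py",
    "backend/rag/dspy_heritage_rag.py",
)


def missing_files(
    project_root: str, required: Sequence[str] = REQUIRED_FILES
) -> List[str]:
    """Return the required paths that do not exist as files under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files("."):
        print(f"missing: {rel}")
```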

### New Files to Create

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/template_sparql.py` | Template loading, classification, instantiation | ❌ To create |
| `data/sparql_templates.yaml` | Template definitions | ❌ To create |
| `tests/rag/test_template_sparql.py` | Unit tests | ❌ To create |

## pyproject.toml Updates

Add these optional dependency groups for the template system:

```toml
[project.optional-dependencies]
# Template-based SPARQL generation
sparql-templates = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
]

# Full template system with semantic matching
sparql-templates-full = [
    "rapidfuzz>=3.0",
    "python-Levenshtein>=0.21",
    "jinja2>=3.0",
    "sentence-transformers>=2.2",
    "rdflib>=6.0",
]
```

**Installation:**

```bash
# Minimal template support
pip install -e ".[sparql-templates]"

# Full template support with semantic matching
pip install -e ".[sparql-templates-full]"
```
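
At runtime, code can detect which optional packages were actually installed and enable features accordingly. The following is a minimal sketch using `importlib.util.find_spec`, which looks a module up without importing it. The feature names in `OPTIONAL_FEATURES` and the `available_features` helper are hypothetical; only the module names correspond to the packages listed above.

```python
from importlib.util import find_spec
from typing import Dict

# Feature name (illustrative) -> top-level module that provides it.
OPTIONAL_FEATURES = {
    "fuzzy_matching": "rapidfuzz",
    "semantic_matching": "sentence_transformers",
    "sparql_validation": "rdflib",
}


def available_features(
    features: Dict[str, str] = OPTIONAL_FEATURES,
) -> Dict[str, bool]:
    """Report, per feature, whether its backing module can be imported."""
    return {
        feature: find_spec(module) is not None
        for feature, module in features.items()
    }


if __name__ == "__main__":
    for feature, available in available_features().items():
        print(f"{feature}: {'enabled' if available else 'not installed'}")
```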

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `SPARQL_ENDPOINT` | `http://localhost:7878/query` | SPARQL endpoint URL |
| `QDRANT_HOST` | `localhost` | Qdrant host |
| `QDRANT_PORT` | `6333` | Qdrant port |
| `TEMPLATE_CONFIDENCE_THRESHOLD` | `0.7` | Minimum confidence for template use |
| `ENABLE_FUZZY_MATCHING` | `true` | Enable rapidfuzz for slot matching |
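
These variables can be read once into a typed settings object. The following is a minimal sketch using only the standard library. The variable names and defaults come from the table above, while `TemplateSettings`, `load_settings`, and the set of accepted truthy spellings for `ENABLE_FUZZY_MATCHING` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Accepted spellings for "true" are an assumption; adjust to project policy.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TemplateSettings:
    sparql_endpoint: str
    qdrant_host: str
    qdrant_port: int
    confidence_threshold: float
    enable_fuzzy_matching: bool


def load_settings(env: Mapping[str, str] = os.environ) -> TemplateSettings:
    """Build settings from environment variables, applying documented defaults."""
    return TemplateSettings(
        sparql_endpoint=env.get("SPARQL_ENDPOINT", "http://localhost:7878/query"),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        confidence_threshold=float(
            env.get("TEMPLATE_CONFIDENCE_THRESHOLD", "0.7")
        ),
        enable_fuzzy_matching=(
            env.get("ENABLE_FUZZY_MATCHING", "true").strip().lower() in _TRUTHY
        ),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a mapping keeps the loader easy to test: a plain dictionary can stand in for `os.environ`.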

## Version Compatibility Matrix

| Python | DSPy | Pydantic | Status |
|--------|------|----------|--------|
| 3.11+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.10 | 2.6+ | 2.0+ | ✅ Supported |
| 3.9 | 2.5+ | 2.0+ | ⚠️ Limited (no `match` statements or PEP 604 union annotations) |
| <3.9 | - | - | ❌ Not supported |
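
The Python column of this matrix can be enforced with a small guard at import time. The following is a minimal sketch; `support_status` is a hypothetical helper, and the three status strings are labels chosen for illustration.

```python
import sys
from typing import Tuple


def support_status(version: Tuple[int, int] = sys.version_info[:2]) -> str:
    """Map a (major, minor) Python version to the matrix status."""
    if version >= (3, 10):
        return "supported"
    if version == (3, 9):
        return "limited"
    return "unsupported"


if __name__ == "__main__":
    status = support_status()
    if status == "unsupported":
        raise SystemExit("Python 3.9 or newer is required")
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```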

## Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

```dockerfile
# Python dependencies
# Quote each specifier: the shell treats an unquoted ">" as output redirection
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "httpx>=0.25" \
    "jinja2>=3.0" \
    "rapidfuzz>=3.0"

# Optional: sentence-transformers (adds ~500MB)
# RUN pip install "sentence-transformers>=2.2"
```

## Dependency Security

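The latest releases of all recommended packages were actively maintained and had no known critical CVEs as of 2025-06. The minimum versions listed in this document also admit older releases that may carry since-fixed advisories, so install recent versions and re-run the audit below regularly.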

| Package | Last Updated | Security Status |
|---------|--------------|-----------------|
| pydantic | 2025-05 | ✅ No known CVEs |
| rapidfuzz | 2025-06 | ✅ No known CVEs |
| dspy-ai | 2025-06 | ✅ No known CVEs |
| jinja2 | 2025-04 | ✅ No known CVEs |

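Run a security audit. The command below assumes the dependencies have been exported to `requirements.txt`; running `pip-audit` without arguments audits the active environment instead: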

```bash
pip-audit --requirement requirements.txt
```

## Summary

**Minimum viable installation:**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2
```

**Recommended installation:**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein
```

**Full installation (with semantic matching):**

```bash
pip install pydantic pyyaml dspy-ai httpx jinja2 rapidfuzz python-Levenshtein sentence-transformers rdflib
```