# Specificity Score System - External Dependencies

## Overview

This document lists the external dependencies required for the specificity score system. Dependencies are categorized by purpose and include both required and optional packages.

> **INTEGRATION NOTE**: This document has been updated to reflect the **existing infrastructure** in the codebase. Several components listed as "to create" already exist and should be **extended** rather than recreated.

---

## Required Dependencies

### Core Python Packages

These packages are essential for the specificity score system to function:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `pydantic` | >=2.0 | Score model validation and structured output | [pydantic](https://pypi.org/project/pydantic/) |
| `pyyaml` | >=6.0 | LinkML schema parsing, template definitions | [PyYAML](https://pypi.org/project/PyYAML/) |
| `dspy-ai` | >=2.6 | Template classification, RAG integration | [dspy-ai](https://pypi.org/project/dspy-ai/) |
| `linkml` | >=1.6 | Schema validation, annotations access | [linkml](https://pypi.org/project/linkml/) |

### Already in Project

These packages are already declared in `pyproject.toml` and will be available:

```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "linkml>=1.6",
]
```

---

## Optional Dependencies

### Schema Processing (Recommended)

For batch processing of LinkML schema annotations:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `linkml-runtime` | >=1.6 | Runtime schema loading and traversal | [linkml-runtime](https://pypi.org/project/linkml-runtime/) |
| `linkml-validator` | >=0.5 | Validate annotated schemas | [linkml-validator](https://pypi.org/project/linkml-validator/) |

**Usage Example:**

```python
from linkml_runtime import SchemaView

# Load schema and access annotations
schema = SchemaView("schemas/20251121/linkml/01_custodian_name.yaml")

# Get specificity score for a class
archive_class = \
    schema.get_class("Archive")
specificity = archive_class.annotations.get("specificity_score")
rationale = archive_class.annotations.get("specificity_rationale")

print(f"Archive specificity: {specificity.value}")
# Output: Archive specificity: 0.75
```

**Installation:**

```bash
pip install linkml-runtime linkml-validator
```

---

### Caching (Recommended)

For caching computed scores during RAG retrieval:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `cachetools` | >=5.0 | In-memory LRU cache for scores | [cachetools](https://pypi.org/project/cachetools/) |
| `diskcache` | >=5.6 | Persistent disk cache for large deployments | [diskcache](https://pypi.org/project/diskcache/) |

**Usage Example:**

```python
from cachetools import TTLCache

# Cache with 1-hour TTL, max 1000 entries
_score_cache = TTLCache(maxsize=1000, ttl=3600)

def cached_template_score(class_name: str, template_id: str) -> float:
    """Get template-specific score with caching."""
    cache_key = f"{template_id}:{class_name}"
    if cache_key in _score_cache:
        return _score_cache[cache_key]
    score = compute_template_score(class_name, template_id)
    _score_cache[cache_key] = score
    return score
```

**Installation:**

```bash
pip install cachetools diskcache
```

---

### UML Visualization (Optional)

For generating filtered UML diagrams based on specificity scores:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `graphviz` | >=0.20 | DOT graph generation for UML | [graphviz](https://pypi.org/project/graphviz/) |
| `pydot` | >=1.4 | DOT file parsing and manipulation | [pydot](https://pypi.org/project/pydot/) |
| `plantuml` | >=0.3 | PlantUML diagram generation | [plantuml](https://pypi.org/project/plantuml/) |

**Usage Example:**

```python
from graphviz import Digraph
from linkml_runtime import SchemaView

def create_filtered_uml(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5
) -> Digraph:
    """Generate UML with classes filtered by specificity threshold."""
    dot \
        = Digraph(comment=f"Heritage Ontology - {template_id}")
    dot.attr(rankdir="TB", splines="ortho")

    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        if score >= threshold:
            # Add node with opacity based on score
            opacity = int(score * 255)
            color = f"#4A90D9{opacity:02X}"
            dot.node(class_name, fillcolor=color, style="filled")

    return dot
```

**System Dependency:**

```bash
# macOS
brew install graphviz

# Ubuntu/Debian
sudo apt-get install graphviz

# Windows
choco install graphviz
```

**Installation:**

```bash
pip install graphviz pydot plantuml
```

---

### Monitoring & Observability (Optional)

For production monitoring of score calculations:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| `prometheus-client` | >=0.17 | Metrics collection for score usage | [prometheus-client](https://pypi.org/project/prometheus-client/) |
| `structlog` | >=23.0 | Structured logging for score decisions | [structlog](https://pypi.org/project/structlog/) |

**Usage Example:**

```python
from prometheus_client import Counter, Histogram

# Track template classification distribution
TEMPLATE_COUNTER = Counter(
    "specificity_template_classifications_total",
    "Number of questions classified per template",
    ["template_id"]
)

# Track score computation latency
SCORE_LATENCY = Histogram(
    "specificity_score_computation_seconds",
    "Time to compute specificity scores",
    ["score_type"]  # "general" or "template"
)

def classify_with_metrics(question: str) -> str:
    """Classify question and record metrics."""
    with SCORE_LATENCY.labels(score_type="template").time():
        template_id = classify_template(question)
    TEMPLATE_COUNTER.labels(template_id=template_id).inc()
    return template_id
```

**Installation:**

```bash
pip install prometheus-client structlog
```

---

## External Services

### Required Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| None | - | Specificity scoring is self-contained |

The
specificity score system is **fully self-contained** and does not require external services. All scores are computed from:

1. Static annotations in LinkML schema files
2. In-memory template definitions
3. DSPy classification (optional LLM backend)

### Optional Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| Qdrant Vector DB | `http://localhost:6333` | RAG integration for score-weighted retrieval |
| Oxigraph SPARQL | `http://localhost:7878/query` | Schema metadata queries |
| LLM API (OpenAI, Z.AI) | Varies | DSPy template classification |

---

## Project Files Required

### Existing Files (DO NOT RECREATE)

These files **already exist** and provide the foundation for specificity scoring:

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/template_sparql.py` | **TemplateClassifier** (line 1104), **SlotExtractor**, **ConversationContextResolver** | ✅ Exists - EXTEND |
| `backend/rag/template_sparql.py:634` | **TemplateClassifierSignature** (DSPy Signature) | ✅ Exists - EXTEND |
| `data/sparql_templates.yaml` | SPARQL template definitions (11+ templates) | ✅ Exists - EXTEND |
| `schemas/20251121/linkml/01_custodian_name.yaml` | Main schema with annotations | ✅ Exists |
| `schemas/20251121/linkml/modules/classes/*.yaml` | 304 class YAML files to annotate | ✅ Exists |
| `backend/rag/dspy_heritage_rag.py` | RAG integration point | ✅ Exists |
| `docs/plan/specificity_score/04-prompt-conversation-templates.md` | Template definitions | ✅ Exists |

### New Files to Create

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/specificity_scorer.py` | Score calculation engine | ❌ To create |
| `backend/rag/sparql_to_context_mapper.py` | Maps SPARQL templates → Context templates | ❌ To create |
| `backend/rag/specificity_lookup.py` | Reads scores from LinkML annotations | ❌ To create |
| `backend/rag/specificity_aware_retriever.py` | Score-weighted retrieval | ❌ To create |
| `data/validation/specificity_scores.json` | Cached general scores | ❌ To create |
| `tests/rag/test_specificity_scorer.py` | Unit tests | ❌ To create |
| `scripts/annotate_specificity_scores.py` | Batch annotation script | ❌ To create |

### Key Integration Points

The existing `TemplateClassifier` in `backend/rag/template_sparql.py:1104` already:

- Classifies questions to SPARQL template IDs
- Extracts slots (institution_type, location, etc.)
- Uses DSPy for classification

**New code should WRAP this classifier**, not replace it:

```python
# backend/rag/specificity_aware_classifier.py
from backend.rag.template_sparql import TemplateClassifier

class SpecificityAwareClassifier:
    """Wraps existing TemplateClassifier with specificity score lookup."""

    def __init__(self, base_classifier: TemplateClassifier, specificity_lookup):
        self.base_classifier = base_classifier
        self.specificity_lookup = specificity_lookup

    def classify_with_scores(self, question: str) -> ClassificationWithScores:
        # Use existing classifier
        result = self.base_classifier.classify(question)

        # Map SPARQL template → context template
        context_template = self._map_to_context_template(
            result.template_id, result.slots
        )

        # Look up specificity scores for context template
        scores = self.specificity_lookup.get_scores(context_template)

        return ClassificationWithScores(
            sparql_template=result.template_id,
            context_template=context_template,
            slots=result.slots,
            class_scores=scores
        )
```

---

## pyproject.toml Updates

Add optional dependencies for specificity scoring:

```toml
[project.optional-dependencies]
# Core specificity scoring
specificity = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
]

# Full specificity system with visualization
specificity-full = [
    "linkml-runtime>=1.6",
    "linkml-validator>=0.5",
    "cachetools>=5.0",
    "diskcache>=5.6",
    "graphviz>=0.20",
    "pydot>=1.4",
]

# Specificity with monitoring
specificity-monitored = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
    "prometheus-client>=0.17",
    "structlog>=23.0",
]
```
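Because caching and visualization ship as optional extras, runtime code should guard those imports and degrade gracefully when an extra is not installed. A minimal sketch of that pattern, assuming the `TTLCache` parameters from the caching example above (the `make_score_cache` helper and `_NoCache` fallback are illustrative names, not existing project code):

```python
# Graceful degradation when the optional "specificity" extra is absent.
# make_score_cache and _NoCache are illustrative, not part of the codebase.
try:
    from cachetools import TTLCache
    HAVE_CACHETOOLS = True
except ImportError:
    HAVE_CACHETOOLS = False


class _NoCache(dict):
    """Unbounded dict fallback, used only when cachetools is unavailable."""


def make_score_cache(maxsize: int = 1000, ttl: int = 3600):
    """Return a TTL-bounded cache if cachetools is installed, else a plain dict."""
    if HAVE_CACHETOOLS:
        return TTLCache(maxsize=maxsize, ttl=ttl)
    return _NoCache()


# Usage: score lookups work identically with either backing store
cache = make_score_cache()
cache["temple_names:Archive"] = 0.75
```

Keeping the fallback behind a single factory function means callers never need to know which extra is installed.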
**Installation:**

```bash
# Minimal specificity support
pip install -e ".[specificity]"

# Full specificity support with visualization
pip install -e ".[specificity-full]"

# Specificity with production monitoring
pip install -e ".[specificity-monitored]"
```

---

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `SPECIFICITY_CACHE_TTL` | `3600` | Cache TTL in seconds |
| `SPECIFICITY_DEFAULT_THRESHOLD` | `0.5` | Default filtering threshold |
| `SPECIFICITY_TEMPLATE_FALLBACK` | `general_heritage` | Fallback template ID |
| `SPECIFICITY_ENABLE_METRICS` | `false` | Enable Prometheus metrics |
| `ZAI_API_TOKEN` | (required for DSPy) | Z.AI API token for classification |

---

## Version Compatibility Matrix

| Python | LinkML | DSPy | Pydantic | Status |
|--------|--------|------|----------|--------|
| 3.11+ | 1.6+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.10 | 1.6+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.9 | 1.5+ | 2.5+ | 2.0+ | ⚠️ Limited |
| <3.9 | - | - | - | ❌ Not supported |

---

## Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile (note the quotes around version specifiers, which prevent the shell from treating `>=` as a redirect):

```dockerfile
# System dependencies for graphviz (if using UML visualization)
RUN apt-get update && apt-get install -y graphviz && rm -rf /var/lib/apt/lists/*

# Python dependencies
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "linkml>=1.6" \
    "linkml-runtime>=1.6" \
    "cachetools>=5.0"

# Optional: graphviz Python bindings
# RUN pip install "graphviz>=0.20" "pydot>=1.4"
```

---

## Dependency Security

All recommended packages are actively maintained and have no known critical CVEs as of 2025-01.
| Package | Last Updated | Security Status |
|---------|--------------|-----------------|
| pydantic | 2024-12 | ✅ No known CVEs |
| linkml | 2024-12 | ✅ No known CVEs |
| linkml-runtime | 2024-12 | ✅ No known CVEs |
| dspy-ai | 2025-01 | ✅ No known CVEs |
| cachetools | 2024-11 | ✅ No known CVEs |

Run a security audit:

```bash
pip-audit --requirement requirements.txt
```

---

## Dependency Graph

```
specificity_scorer.py
├── linkml-runtime (schema loading)
│   └── pyyaml
├── pydantic (data models)
├── cachetools (performance)
└── dspy-ai (classification)
    └── httpx (LLM API calls)

specificity_aware_retriever.py
├── specificity_scorer.py
├── qdrant-client (vector store)
└── numpy (score calculations)

uml_visualizer.py (optional)
├── graphviz
├── pydot
└── specificity_scorer.py
```

---

## Summary

**Minimum viable installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime
```

**Recommended installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime cachetools dspy-ai
```

**Full installation (with visualization and monitoring):**

```bash
pip install pydantic pyyaml linkml linkml-runtime linkml-validator cachetools diskcache dspy-ai graphviz pydot prometheus-client structlog
```

---

## References

- `docs/plan/prompt-query_template_mapping/external-dependencies.md` - Related dependencies
- `docs/plan/specificity_score/03-rag-dspy-integration.md` - DSPy integration details
- `pyproject.toml` - Current project dependencies
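As a quick sanity check of the minimum viable installation listed in the Summary, a small stdlib-only sketch can report which required distributions are importable and at what version (the `check_minimal_install` helper is illustrative, not an existing script):

```python
# Smoke test for the "Minimum viable installation" dependency set.
# check_minimal_install is an illustrative helper, not project code.
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

# Distribution name -> import name (they differ for PyYAML and linkml-runtime)
REQUIRED = {
    "pydantic": "pydantic",
    "PyYAML": "yaml",
    "linkml": "linkml",
    "linkml-runtime": "linkml_runtime",
}


def check_minimal_install() -> dict:
    """Return {distribution: version or None} without raising on missing packages."""
    found = {}
    for dist, module in REQUIRED.items():
        try:
            import_module(module)
            found[dist] = version(dist)
        except (ImportError, PackageNotFoundError):
            found[dist] = None  # report the gap instead of crashing
    return found


if __name__ == "__main__":
    for dist, ver in check_minimal_install().items():
        print(f"{dist}: {ver or 'MISSING'}")
```

Running this after `pip install` gives a one-line-per-package report, which is easier to act on than the first `ImportError` raised deep inside application code.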