# Specificity Score System - External Dependencies

## Overview

This document lists the external dependencies required for the specificity score system. Dependencies are categorized by purpose and include both required and optional packages.

**INTEGRATION NOTE:** This document has been updated to reflect the existing infrastructure in the codebase. Several components listed as "to create" already exist and should be extended rather than recreated.

## Required Dependencies

### Core Python Packages

These packages are essential for the specificity score system to function:
| Package | Version | Purpose | PyPI |
|---|---|---|---|
| `pydantic` | >=2.0 | Score model validation and structured output | pydantic |
| `pyyaml` | >=6.0 | LinkML schema parsing, template definitions | PyYAML |
| `dspy-ai` | >=2.6 | Template classification, RAG integration | dspy-ai |
| `linkml` | >=1.6 | Schema validation, annotations access | linkml |
### Already in Project

These packages are already in `pyproject.toml` and will be available:

```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "linkml>=1.6",
]
```
## Optional Dependencies

### Schema Processing (Recommended)

For batch processing of LinkML schema annotations:

| Package | Version | Purpose | PyPI |
|---|---|---|---|
| `linkml-runtime` | >=1.6 | Runtime schema loading and traversal | linkml-runtime |
| `linkml-validator` | >=0.5 | Validate annotated schemas | linkml-validator |
**Usage Example:**

```python
from linkml_runtime import SchemaView

# Load schema and access annotations
schema = SchemaView("schemas/20251121/linkml/01_custodian_name.yaml")

# Get specificity score for a class
archive_class = schema.get_class("Archive")
specificity = archive_class.annotations.get("specificity_score")
rationale = archive_class.annotations.get("specificity_rationale")

print(f"Archive specificity: {specificity.value}")
# Output: Archive specificity: 0.75
```

**Installation:**

```bash
pip install linkml-runtime linkml-validator
```
### Caching (Recommended)

For caching computed scores during RAG retrieval:

| Package | Version | Purpose | PyPI |
|---|---|---|---|
| `cachetools` | >=5.0 | In-memory LRU cache for scores | cachetools |
| `diskcache` | >=5.6 | Persistent disk cache for large deployments | diskcache |
**Usage Example:**

```python
from cachetools import TTLCache

# Cache with 1-hour TTL, max 1000 entries
_score_cache = TTLCache(maxsize=1000, ttl=3600)

def cached_template_score(class_name: str, template_id: str) -> float:
    """Get template-specific score with caching."""
    cache_key = f"{template_id}:{class_name}"
    if cache_key in _score_cache:
        return _score_cache[cache_key]
    score = compute_template_score(class_name, template_id)
    _score_cache[cache_key] = score
    return score
```

**Installation:**

```bash
pip install cachetools diskcache
```
### UML Visualization (Optional)

For generating filtered UML diagrams based on specificity scores:

| Package | Version | Purpose | PyPI |
|---|---|---|---|
| `graphviz` | >=0.20 | DOT graph generation for UML | graphviz |
| `pydot` | >=1.4 | DOT file parsing and manipulation | pydot |
| `plantuml` | >=0.3 | PlantUML diagram generation | plantuml |
**Usage Example:**

```python
from graphviz import Digraph
from linkml_runtime import SchemaView

def create_filtered_uml(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5
) -> Digraph:
    """Generate UML with classes filtered by specificity threshold."""
    dot = Digraph(comment=f"Heritage Ontology - {template_id}")
    dot.attr(rankdir="TB", splines="ortho")
    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)
        if score >= threshold:
            # Add node with an alpha channel derived from the score
            opacity = int(score * 255)
            color = f"#4A90D9{opacity:02X}"
            dot.node(class_name, fillcolor=color, style="filled")
    return dot
```

**System Dependency:**

```bash
# macOS
brew install graphviz

# Ubuntu/Debian
sudo apt-get install graphviz

# Windows
choco install graphviz
```

**Installation:**

```bash
pip install graphviz pydot plantuml
```
### Monitoring & Observability (Optional)

For production monitoring of score calculations:

| Package | Version | Purpose | PyPI |
|---|---|---|---|
| `prometheus-client` | >=0.17 | Metrics collection for score usage | prometheus-client |
| `structlog` | >=23.0 | Structured logging for score decisions | structlog |
**Usage Example:**

```python
from prometheus_client import Counter, Histogram

# Track template classification distribution
TEMPLATE_COUNTER = Counter(
    "specificity_template_classifications_total",
    "Number of questions classified per template",
    ["template_id"]
)

# Track score computation latency
SCORE_LATENCY = Histogram(
    "specificity_score_computation_seconds",
    "Time to compute specificity scores",
    ["score_type"]  # "general" or "template"
)

def classify_with_metrics(question: str) -> str:
    """Classify question and record metrics."""
    with SCORE_LATENCY.labels(score_type="template").time():
        template_id = classify_template(question)
    TEMPLATE_COUNTER.labels(template_id=template_id).inc()
    return template_id
```

**Installation:**

```bash
pip install prometheus-client structlog
```
## External Services

### Required Services

| Service | Endpoint | Purpose |
|---|---|---|
| None | - | Specificity scoring is self-contained |
The specificity score system is fully self-contained and does not require external services. All scores are computed from:
- Static annotations in LinkML schema files
- In-memory template definitions
- DSPy classification (optional LLM backend)
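To illustrate that self-containment, the scoring path can be sketched end to end without a single network call. The annotation dict below stands in for values normally read from the LinkML schema files, and `filter_by_specificity` is a hypothetical helper, not the project's actual scorer:

```python
# Sketch: self-contained specificity filtering. ANNOTATIONS stands in for
# static scores read from schemas/20251121/linkml/modules/classes/*.yaml.
ANNOTATIONS = {
    "Archive": {"specificity_score": 0.75},
    "Institution": {"specificity_score": 0.40},
    "Entity": {"specificity_score": 0.10},
}

def filter_by_specificity(class_names, threshold=0.5):
    """Keep only classes whose static score meets the threshold."""
    return [
        name for name in class_names
        if ANNOTATIONS.get(name, {}).get("specificity_score", 0.0) >= threshold
    ]

print(filter_by_specificity(["Archive", "Institution", "Entity"]))
# ['Archive']
```

Everything here is a dictionary lookup over data shipped with the repository, which is why the required-services table above is empty.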
### Optional Services

| Service | Endpoint | Purpose |
|---|---|---|
| Qdrant Vector DB | http://localhost:6333 | RAG integration for score-weighted retrieval |
| Oxigraph SPARQL | http://localhost:7878/query | Schema metadata queries |
| LLM API (OpenAI, Z.AI) | Varies | DSPy template classification |
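The Qdrant integration stays optional because score weighting is a pure post-processing step over whatever the vector store returns. A minimal sketch, assuming hits arrive as `(class_name, similarity)` pairs and that a linear `alpha` blend is used (both are illustrative assumptions, not the project's actual retriever):

```python
# Sketch: score-weighted reranking of vector-store hits. Specificity scores
# come from the LinkML annotations; alpha blends the two signals.
def rerank(hits, specificity_scores, alpha=0.7):
    """Order hits by a blend of vector similarity and specificity."""
    def combined(hit):
        name, similarity = hit
        specificity = specificity_scores.get(name, 0.0)
        return alpha * similarity + (1 - alpha) * specificity
    return sorted(hits, key=combined, reverse=True)

hits = [("Entity", 0.90), ("Archive", 0.85)]
scores = {"Entity": 0.10, "Archive": 0.75}
print(rerank(hits, scores))
# Archive wins despite lower raw similarity: 0.82 combined vs 0.66
```

Because the blend runs after retrieval, the same function works against Qdrant results or any other backend.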
## Project Files Required

### Existing Files (DO NOT RECREATE)

These files already exist and provide the foundation for specificity scoring:

| File | Purpose | Status |
|---|---|---|
| `backend/rag/template_sparql.py` | `TemplateClassifier` (line 1104), `SlotExtractor`, `ConversationContextResolver` | ✅ Exists - EXTEND |
| `backend/rag/template_sparql.py:634` | `TemplateClassifierSignature` (DSPy Signature) | ✅ Exists - EXTEND |
| `data/sparql_templates.yaml` | SPARQL template definitions (11+ templates) | ✅ Exists - EXTEND |
| `schemas/20251121/linkml/01_custodian_name.yaml` | Main schema with annotations | ✅ Exists |
| `schemas/20251121/linkml/modules/classes/*.yaml` | 304 class YAML files to annotate | ✅ Exists |
| `backend/rag/dspy_heritage_rag.py` | RAG integration point | ✅ Exists |
| `docs/plan/specificity_score/04-prompt-conversation-templates.md` | Template definitions | ✅ Exists |
### New Files to Create

| File | Purpose | Status |
|---|---|---|
| `backend/rag/specificity_scorer.py` | Score calculation engine | ❌ To create |
| `backend/rag/sparql_to_context_mapper.py` | Maps SPARQL templates → context templates | ❌ To create |
| `backend/rag/specificity_lookup.py` | Reads scores from LinkML annotations | ❌ To create |
| `backend/rag/specificity_aware_retriever.py` | Score-weighted retrieval | ❌ To create |
| `data/validation/specificity_scores.json` | Cached general scores | ❌ To create |
| `tests/rag/test_specificity_scorer.py` | Unit tests | ❌ To create |
| `scripts/annotate_specificity_scores.py` | Batch annotation script | ❌ To create |
## Key Integration Points

The existing `TemplateClassifier` in `backend/rag/template_sparql.py:1104` already:

- Classifies questions to SPARQL template IDs
- Extracts slots (`institution_type`, `location`, etc.)
- Uses DSPy for classification

New code should WRAP this classifier, not replace it:
```python
# backend/rag/specificity_aware_classifier.py
from dataclasses import dataclass

from backend.rag.template_sparql import TemplateClassifier


@dataclass
class ClassificationWithScores:
    sparql_template: str
    context_template: str
    slots: dict
    class_scores: dict


class SpecificityAwareClassifier:
    """Wraps the existing TemplateClassifier with specificity score lookup."""

    def __init__(self, base_classifier: TemplateClassifier, specificity_lookup):
        self.base_classifier = base_classifier
        self.specificity_lookup = specificity_lookup

    def classify_with_scores(self, question: str) -> ClassificationWithScores:
        # Use the existing classifier
        result = self.base_classifier.classify(question)
        # Map SPARQL template -> context template
        # (delegated to the planned sparql_to_context_mapper.py)
        context_template = self._map_to_context_template(
            result.template_id,
            result.slots
        )
        # Look up specificity scores for the context template
        scores = self.specificity_lookup.get_scores(context_template)
        return ClassificationWithScores(
            sparql_template=result.template_id,
            context_template=context_template,
            slots=result.slots,
            class_scores=scores,
        )
```
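The mapping step delegated above is the job of the planned `sparql_to_context_mapper.py`. One plausible shape is a static lookup table refined by extracted slots; the table entries and template IDs below are illustrative assumptions, with only the `general_heritage` fallback taken from this document's environment-variable defaults:

```python
# Sketch: SPARQL template -> context template mapping. The table contents are
# hypothetical; the real mapping would sit alongside data/sparql_templates.yaml.
SPARQL_TO_CONTEXT = {
    "custodian_by_location": "location_context",
    "custodian_by_type": "institution_type_context",
}

def map_to_context_template(sparql_template_id, slots, fallback="general_heritage"):
    """Resolve a context template, honouring the configured fallback."""
    context = SPARQL_TO_CONTEXT.get(sparql_template_id, fallback)
    # Slot-based refinement: a location slot narrows an unmapped template.
    if context == fallback and "location" in slots:
        context = "location_context"
    return context

print(map_to_context_template("custodian_by_type", {}))           # institution_type_context
print(map_to_context_template("unknown", {"location": "Kyoto"}))  # location_context
```

Keeping the mapper a pure function over `(template_id, slots)` makes it trivial to unit-test in `tests/rag/test_specificity_scorer.py` without touching DSPy.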
## pyproject.toml Updates

Add optional dependencies for specificity scoring:

```toml
[project.optional-dependencies]
# Core specificity scoring
specificity = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
]

# Full specificity system with visualization
specificity-full = [
    "linkml-runtime>=1.6",
    "linkml-validator>=0.5",
    "cachetools>=5.0",
    "diskcache>=5.6",
    "graphviz>=0.20",
    "pydot>=1.4",
]

# Specificity with monitoring
specificity-monitored = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
    "prometheus-client>=0.17",
    "structlog>=23.0",
]
```
**Installation:**

```bash
# Minimal specificity support
pip install -e ".[specificity]"

# Full specificity support with visualization
pip install -e ".[specificity-full]"

# Specificity with production monitoring
pip install -e ".[specificity-monitored]"
```
## Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `SPECIFICITY_CACHE_TTL` | `3600` | Cache TTL in seconds |
| `SPECIFICITY_DEFAULT_THRESHOLD` | `0.5` | Default filtering threshold |
| `SPECIFICITY_TEMPLATE_FALLBACK` | `general_heritage` | Fallback template ID |
| `SPECIFICITY_ENABLE_METRICS` | `false` | Enable Prometheus metrics |
| `ZAI_API_TOKEN` | (required for DSPy) | Z.AI API token for classification |
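These variables can be read once at startup into a typed config object. A minimal standard-library sketch, with the defaults taken from the table above (the `SpecificityConfig` name and `load_config` helper are assumptions, not existing code):

```python
# Sketch: load specificity settings from the environment with the documented
# defaults. SpecificityConfig is a hypothetical name.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecificityConfig:
    cache_ttl: int
    default_threshold: float
    template_fallback: str
    enable_metrics: bool

def load_config(env=os.environ):
    return SpecificityConfig(
        cache_ttl=int(env.get("SPECIFICITY_CACHE_TTL", "3600")),
        default_threshold=float(env.get("SPECIFICITY_DEFAULT_THRESHOLD", "0.5")),
        template_fallback=env.get("SPECIFICITY_TEMPLATE_FALLBACK", "general_heritage"),
        enable_metrics=env.get("SPECIFICITY_ENABLE_METRICS", "false").lower() == "true",
    )

print(load_config(env={}))
```

Parsing each value eagerly means a malformed setting (e.g. a non-numeric TTL) fails at startup rather than mid-request.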
## Version Compatibility Matrix
| Python | LinkML | DSPy | Pydantic | Status |
|---|---|---|---|---|
| 3.11+ | 1.6+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.10 | 1.6+ | 2.6+ | 2.0+ | ✅ Supported |
| 3.9 | 1.5+ | 2.5+ | 2.0+ | ⚠️ Limited |
| <3.9 | - | - | - | ❌ Not supported |
## Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

```dockerfile
# System dependencies for graphviz (if using UML visualization)
RUN apt-get update && apt-get install -y graphviz && rm -rf /var/lib/apt/lists/*

# Python dependencies (quote version specifiers so the shell does not
# interpret ">=" as a redirect)
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "linkml>=1.6" \
    "linkml-runtime>=1.6" \
    "cachetools>=5.0"

# Optional: graphviz Python bindings
# RUN pip install "graphviz>=0.20" "pydot>=1.4"
```
## Dependency Security

All recommended packages are actively maintained and have no known critical CVEs as of 2025-01.
| Package | Last Updated | Security Status |
|---|---|---|
| pydantic | 2024-12 | ✅ No known CVEs |
| linkml | 2024-12 | ✅ No known CVEs |
| linkml-runtime | 2024-12 | ✅ No known CVEs |
| dspy-ai | 2025-01 | ✅ No known CVEs |
| cachetools | 2024-11 | ✅ No known CVEs |
Run a security audit with:

```bash
pip-audit --requirement requirements.txt
```
## Dependency Graph

```text
specificity_scorer.py
├── linkml-runtime (schema loading)
│   └── pyyaml
├── pydantic (data models)
├── cachetools (performance)
└── dspy-ai (classification)
    └── httpx (LLM API calls)

specificity_aware_retriever.py
├── specificity_scorer.py
├── qdrant-client (vector store)
└── numpy (score calculations)

uml_visualizer.py (optional)
├── graphviz
├── pydot
└── specificity_scorer.py
```
## Summary

**Minimum viable installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime
```

**Recommended installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime cachetools dspy-ai
```

**Full installation (with visualization and monitoring):**

```bash
pip install pydantic pyyaml linkml linkml-runtime linkml-validator cachetools diskcache dspy-ai graphviz pydot prometheus-client structlog
```
## References

- `docs/plan/prompt-query_template_mapping/external-dependencies.md` - Related dependencies
- `docs/plan/specificity_score/03-rag-dspy-integration.md` - DSPy integration details
- `pyproject.toml` - Current project dependencies