glam/docs/plan/specificity_score/05-dependencies.md

# Specificity Score System - External Dependencies

## Overview

This document lists the external dependencies required for the specificity score system. Dependencies are categorized by purpose and include both required and optional packages.

> **INTEGRATION NOTE:** This document has been updated to reflect the existing infrastructure in the codebase. Several components listed as "to create" already exist and should be extended rather than recreated.


## Required Dependencies

### Core Python Packages

These packages are essential for the specificity score system to function:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| pydantic | >=2.0 | Score model validation and structured output | [pydantic](https://pypi.org/project/pydantic/) |
| pyyaml | >=6.0 | LinkML schema parsing, template definitions | [PyYAML](https://pypi.org/project/PyYAML/) |
| dspy-ai | >=2.6 | Template classification, RAG integration | [dspy-ai](https://pypi.org/project/dspy-ai/) |
| linkml | >=1.6 | Schema validation, annotations access | [linkml](https://pypi.org/project/linkml/) |

### Already in Project

These packages are already in pyproject.toml and will be available:

```toml
# From pyproject.toml
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "dspy-ai>=2.6",
    "linkml>=1.6",
]
```

## Optional Dependencies

### LinkML Schema Annotations

For batch processing of LinkML schema annotations:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| linkml-runtime | >=1.6 | Runtime schema loading and traversal | [linkml-runtime](https://pypi.org/project/linkml-runtime/) |
| linkml-validator | >=0.5 | Validate annotated schemas | [linkml-validator](https://pypi.org/project/linkml-validator/) |

**Usage Example:**

```python
from linkml_runtime import SchemaView

# Load schema and access annotations
schema = SchemaView("schemas/20251121/linkml/01_custodian_name.yaml")

# Get specificity score for a class
archive_class = schema.get_class("Archive")
specificity = archive_class.annotations.get("specificity_score")
rationale = archive_class.annotations.get("specificity_rationale")

print(f"Archive specificity: {specificity.value}")
# Output: Archive specificity: 0.75
```

**Installation:**

```bash
pip install linkml-runtime linkml-validator
```

### Score Caching

For caching computed scores during RAG retrieval:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| cachetools | >=5.0 | In-memory LRU cache for scores | [cachetools](https://pypi.org/project/cachetools/) |
| diskcache | >=5.6 | Persistent disk cache for large deployments | [diskcache](https://pypi.org/project/diskcache/) |

**Usage Example:**

```python
from cachetools import TTLCache

# Cache with 1-hour TTL, max 1000 entries
_score_cache = TTLCache(maxsize=1000, ttl=3600)

def cached_template_score(class_name: str, template_id: str) -> float:
    """Get template-specific score with caching."""
    cache_key = f"{template_id}:{class_name}"

    if cache_key in _score_cache:
        return _score_cache[cache_key]

    score = compute_template_score(class_name, template_id)  # provided by the scorer
    _score_cache[cache_key] = score
    return score
```

**Installation:**

```bash
pip install cachetools diskcache
```

### UML Visualization (Optional)

For generating filtered UML diagrams based on specificity scores:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| graphviz | >=0.20 | DOT graph generation for UML | [graphviz](https://pypi.org/project/graphviz/) |
| pydot | >=1.4 | DOT file parsing and manipulation | [pydot](https://pypi.org/project/pydot/) |
| plantuml | >=0.3 | PlantUML diagram generation | [plantuml](https://pypi.org/project/plantuml/) |

**Usage Example:**

```python
from graphviz import Digraph
from linkml_runtime import SchemaView

def create_filtered_uml(
    schema: SchemaView,
    template_id: str,
    threshold: float = 0.5
) -> Digraph:
    """Generate UML with classes filtered by specificity threshold."""
    dot = Digraph(comment=f"Heritage Ontology - {template_id}")
    dot.attr(rankdir="TB", splines="ortho")

    for class_name in schema.all_classes():
        cls = schema.get_class(class_name)
        score = get_template_score(cls, template_id)  # provided by the scorer

        if score >= threshold:
            # Encode the score as fill opacity (RGBA hex)
            opacity = int(score * 255)
            color = f"#4A90D9{opacity:02X}"
            dot.node(class_name, fillcolor=color, style="filled")

    return dot
```

**System Dependency:**

```bash
# macOS
brew install graphviz

# Ubuntu/Debian
sudo apt-get install graphviz

# Windows
choco install graphviz
```

**Installation:**

```bash
pip install graphviz pydot plantuml
```
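
Because the Python graphviz bindings shell out to the system `dot` binary, a quick preflight check avoids confusing runtime errors when the system package is missing (stdlib-only sketch; the helper name is illustrative):

```python
import shutil

def graphviz_available() -> bool:
    """Return True if the Graphviz `dot` binary is on PATH."""
    return shutil.which("dot") is not None

# Fall back gracefully when the system dependency is missing.
if not graphviz_available():
    print("Graphviz not found; skipping UML diagram generation.")
```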

### Monitoring & Observability (Optional)

For production monitoring of score calculations:

| Package | Version | Purpose | PyPI |
|---------|---------|---------|------|
| prometheus-client | >=0.17 | Metrics collection for score usage | [prometheus-client](https://pypi.org/project/prometheus-client/) |
| structlog | >=23.0 | Structured logging for score decisions | [structlog](https://pypi.org/project/structlog/) |

**Usage Example:**

```python
from prometheus_client import Counter, Histogram

# Track template classification distribution
TEMPLATE_COUNTER = Counter(
    "specificity_template_classifications_total",
    "Number of questions classified per template",
    ["template_id"]
)

# Track score computation latency
SCORE_LATENCY = Histogram(
    "specificity_score_computation_seconds",
    "Time to compute specificity scores",
    ["score_type"]  # "general" or "template"
)

def classify_with_metrics(question: str) -> str:
    """Classify question and record metrics."""
    with SCORE_LATENCY.labels(score_type="template").time():
        template_id = classify_template(question)  # existing classifier

    TEMPLATE_COUNTER.labels(template_id=template_id).inc()
    return template_id
```

**Installation:**

```bash
pip install prometheus-client structlog
```

## External Services

### Required Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| None | - | Specificity scoring is self-contained |

The specificity score system is fully self-contained and does not require external services. All scores are computed from:

  1. Static annotations in LinkML schema files
  2. In-memory template definitions
  3. DSPy classification (optional LLM backend)
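
One plausible resolution order for these sources can be sketched as a fallback chain (illustrative only; the function name, dictionary shapes, and the 0.5 neutral default — which mirrors `SPECIFICITY_DEFAULT_THRESHOLD` — are assumptions, not existing code):

```python
DEFAULT_SCORE = 0.5  # neutral fallback when no source provides a score

def resolve_score(class_name: str, template_id: str,
                  annotations: dict, template_scores: dict) -> float:
    """Resolve a specificity score: template override, then schema annotation,
    then a neutral default."""
    # 1. In-memory template definition (template-specific override)
    override = template_scores.get(template_id, {}).get(class_name)
    if override is not None:
        return override
    # 2. Static annotation from the LinkML schema files
    annotated = annotations.get(class_name)
    if annotated is not None:
        return float(annotated)
    # 3. Neutral default (DSPy only selects the template, it does not score)
    return DEFAULT_SCORE
```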

### Optional Services

| Service | Endpoint | Purpose |
|---------|----------|---------|
| Qdrant Vector DB | http://localhost:6333 | RAG integration for score-weighted retrieval |
| Oxigraph SPARQL | http://localhost:7878/query | Schema metadata queries |
| LLM API (OpenAI, Z.AI) | Varies | DSPy template classification |

## Project Files Required

### Existing Files (DO NOT RECREATE)

These files already exist and provide the foundation for specificity scoring:

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/template_sparql.py` | TemplateClassifier (line 1104), SlotExtractor, ConversationContextResolver | Exists - EXTEND |
| `backend/rag/template_sparql.py:634` | TemplateClassifierSignature (DSPy Signature) | Exists - EXTEND |
| `data/sparql_templates.yaml` | SPARQL template definitions (11+ templates) | Exists - EXTEND |
| `schemas/20251121/linkml/01_custodian_name.yaml` | Main schema with annotations | Exists |
| `schemas/20251121/linkml/modules/classes/*.yaml` | 304 class YAML files to annotate | Exists |
| `backend/rag/dspy_heritage_rag.py` | RAG integration point | Exists |
| `docs/plan/specificity_score/04-prompt-conversation-templates.md` | Template definitions | Exists |

### New Files to Create

| File | Purpose | Status |
|------|---------|--------|
| `backend/rag/specificity_scorer.py` | Score calculation engine | To create |
| `backend/rag/sparql_to_context_mapper.py` | Maps SPARQL templates → Context templates | To create |
| `backend/rag/specificity_lookup.py` | Reads scores from LinkML annotations | To create |
| `backend/rag/specificity_aware_retriever.py` | Score-weighted retrieval | To create |
| `data/validation/specificity_scores.json` | Cached general scores | To create |
| `tests/rag/test_specificity_scorer.py` | Unit tests | To create |
| `scripts/annotate_specificity_scores.py` | Batch annotation script | To create |
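
The batch annotation script could follow a pattern like this (a minimal sketch using PyYAML; the flat `annotations:` layout and the helper name are assumptions — the real script must emit whatever annotation structure the LinkML schema expects):

```python
import yaml

def annotate_class_yaml(yaml_text: str, score: float, rationale: str) -> str:
    """Add specificity annotations to a single class YAML document."""
    doc = yaml.safe_load(yaml_text)
    annotations = doc.setdefault("annotations", {})
    annotations["specificity_score"] = score
    annotations["specificity_rationale"] = rationale
    return yaml.safe_dump(doc, sort_keys=False)

# Example: annotate one class definition.
source = "name: Archive\ndescription: A heritage archive institution\n"
print(annotate_class_yaml(source, 0.75, "Institution-level class; moderately specific"))
```

Running this over `schemas/20251121/linkml/modules/classes/*.yaml` would cover the 304 class files listed above.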

## Key Integration Points

The existing TemplateClassifier in `backend/rag/template_sparql.py:1104` already:

- Classifies questions to SPARQL template IDs
- Extracts slots (institution_type, location, etc.)
- Uses DSPy for classification

New code should WRAP this classifier, not replace it:

```python
# backend/rag/specificity_aware_classifier.py
from backend.rag.template_sparql import TemplateClassifier

class SpecificityAwareClassifier:
    """Wraps existing TemplateClassifier with specificity score lookup."""

    def __init__(self, base_classifier: TemplateClassifier, specificity_lookup):
        self.base_classifier = base_classifier
        self.specificity_lookup = specificity_lookup

    def classify_with_scores(self, question: str) -> ClassificationWithScores:
        # Use existing classifier
        result = self.base_classifier.classify(question)

        # Map SPARQL template → context template
        context_template = self._map_to_context_template(
            result.template_id,
            result.slots
        )

        # Look up specificity scores for context template
        scores = self.specificity_lookup.get_scores(context_template)

        return ClassificationWithScores(
            sparql_template=result.template_id,
            context_template=context_template,
            slots=result.slots,
            class_scores=scores
        )
```
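
The `_map_to_context_template` step could start as a simple lookup table with the documented fallback (illustrative sketch; the template IDs are hypothetical, not actual entries from data/sparql_templates.yaml):

```python
# Hypothetical SPARQL-template → context-template mapping.
# Real IDs come from data/sparql_templates.yaml and the context
# templates in 04-prompt-conversation-templates.md.
SPARQL_TO_CONTEXT = {
    "institutions_by_location": "institution_context",
    "collections_by_type": "collection_context",
}

FALLBACK_CONTEXT = "general_heritage"  # matches SPECIFICITY_TEMPLATE_FALLBACK

def map_to_context_template(sparql_template_id: str, slots: dict) -> str:
    """Map a SPARQL template ID to its context template, with a safe fallback."""
    return SPARQL_TO_CONTEXT.get(sparql_template_id, FALLBACK_CONTEXT)
```

Slot contents are passed through here so a later version can disambiguate templates that map differently depending on the extracted slots.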

## pyproject.toml Updates

Add optional dependencies for specificity scoring:

```toml
[project.optional-dependencies]
# Core specificity scoring
specificity = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
]

# Full specificity system with visualization
specificity-full = [
    "linkml-runtime>=1.6",
    "linkml-validator>=0.5",
    "cachetools>=5.0",
    "diskcache>=5.6",
    "graphviz>=0.20",
    "pydot>=1.4",
]

# Specificity with monitoring
specificity-monitored = [
    "linkml-runtime>=1.6",
    "cachetools>=5.0",
    "prometheus-client>=0.17",
    "structlog>=23.0",
]
```

**Installation:**

```bash
# Minimal specificity support
pip install -e ".[specificity]"

# Full specificity support with visualization
pip install -e ".[specificity-full]"

# Specificity with production monitoring
pip install -e ".[specificity-monitored]"
```

## Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| SPECIFICITY_CACHE_TTL | 3600 | Cache TTL in seconds |
| SPECIFICITY_DEFAULT_THRESHOLD | 0.5 | Default filtering threshold |
| SPECIFICITY_TEMPLATE_FALLBACK | general_heritage | Fallback template ID |
| SPECIFICITY_ENABLE_METRICS | false | Enable Prometheus metrics |
| ZAI_API_TOKEN | (required for DSPy) | Z.AI API token for classification |
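
These variables could be read once at startup along these lines (stdlib-only sketch; the function name and returned dict shape are assumptions):

```python
import os

def load_specificity_settings() -> dict:
    """Read specificity settings from the environment, applying documented defaults."""
    return {
        "cache_ttl": int(os.getenv("SPECIFICITY_CACHE_TTL", "3600")),
        "default_threshold": float(os.getenv("SPECIFICITY_DEFAULT_THRESHOLD", "0.5")),
        "template_fallback": os.getenv("SPECIFICITY_TEMPLATE_FALLBACK", "general_heritage"),
        "enable_metrics": os.getenv("SPECIFICITY_ENABLE_METRICS", "false").lower() == "true",
    }
```

ZAI_API_TOKEN is deliberately left out: it has no default and should fail loudly wherever the DSPy classifier is constructed.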

## Version Compatibility Matrix

| Python | LinkML | DSPy | Pydantic | Status |
|--------|--------|------|----------|--------|
| 3.11+ | 1.6+ | 2.6+ | 2.0+ | Supported |
| 3.10 | 1.6+ | 2.6+ | 2.0+ | Supported |
| 3.9 | 1.5+ | 2.5+ | 2.0+ | ⚠️ Limited |
| <3.9 | - | - | - | Not supported |

## Docker Considerations

If deploying in Docker, ensure these are in the Dockerfile:

```dockerfile
# System dependencies for graphviz (if using UML visualization)
RUN apt-get update && apt-get install -y graphviz && rm -rf /var/lib/apt/lists/*

# Python dependencies (quote the specifiers so ">=" is not treated as shell redirection)
RUN pip install --no-cache-dir \
    "pydantic>=2.0" \
    "pyyaml>=6.0" \
    "dspy-ai>=2.6" \
    "linkml>=1.6" \
    "linkml-runtime>=1.6" \
    "cachetools>=5.0"

# Optional: graphviz Python bindings
# RUN pip install "graphviz>=0.20" "pydot>=1.4"
```

## Dependency Security

All recommended packages are actively maintained and have no known critical CVEs as of 2025-01.

| Package | Last Updated | Security Status |
|---------|--------------|-----------------|
| pydantic | 2024-12 | No known CVEs |
| linkml | 2024-12 | No known CVEs |
| linkml-runtime | 2024-12 | No known CVEs |
| dspy-ai | 2025-01 | No known CVEs |
| cachetools | 2024-11 | No known CVEs |

Run a security audit with:

```bash
pip-audit --requirement requirements.txt
```

## Dependency Graph

```text
specificity_scorer.py
├── linkml-runtime (schema loading)
│   └── pyyaml
├── pydantic (data models)
├── cachetools (performance)
└── dspy-ai (classification)
    └── httpx (LLM API calls)

specificity_aware_retriever.py
├── specificity_scorer.py
├── qdrant-client (vector store)
└── numpy (score calculations)

uml_visualizer.py (optional)
├── graphviz
├── pydot
└── specificity_scorer.py
```

## Summary

**Minimum viable installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime
```

**Recommended installation:**

```bash
pip install pydantic pyyaml linkml linkml-runtime cachetools dspy-ai
```

**Full installation (with visualization and monitoring):**

```bash
pip install pydantic pyyaml linkml linkml-runtime linkml-validator cachetools diskcache dspy-ai graphviz pydot prometheus-client structlog
```

## References

- `docs/plan/prompt-query_template_mapping/external-dependencies.md` - Related dependencies
- `docs/plan/specificity_score/03-rag-dspy-integration.md` - DSPy integration details
- `pyproject.toml` - Current project dependencies