# DSPy + GitOps Testing Framework for Heritage RAG

## Executive Summary

This document outlines a comprehensive testing framework that integrates **DSPy evaluation patterns** with **GitOps CI/CD workflows** for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.

## Goals

1. **Continuous Evaluation**: Run DSPy evaluations on every PR to catch regressions
2. **Reproducible Results**: Version-control test datasets, prompts, and evaluation metrics
3. **Quality Gates**: Block merges when evaluation scores drop below thresholds
4. **Progressive Testing**: Fast smoke tests on PR, comprehensive evals on merge
5. **Observability**: Track evaluation metrics over time

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                       GitOps Testing Pipeline                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   GitHub    │───>│   GitHub    │───>│  DSPy Evaluation        │  │
│  │   Push/PR   │    │   Actions   │    │  (pytest + dspy.Eval)   │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                     │               │
│                                                     ▼               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                     Evaluation Layers                       │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │  Layer 1: Unit Tests (pytest)                               │    │
│  │    - Intent classification accuracy                         │    │
│  │    - Entity extraction correctness                          │    │
│  │    - SPARQL syntax validation                               │    │
│  │                                                             │    │
│  │  Layer 2: Integration Tests (DSPy Evaluate)                 │    │
│  │    - End-to-end RAG pipeline                                │    │
│  │    - Answer quality metrics                                 │    │
│  │    - Response latency                                       │    │
│  │                                                             │    │
│  │  Layer 3: Smoke Tests (Live API)                            │    │
│  │    - API endpoint health                                    │    │
│  │    - SPARQL endpoint connectivity                           │    │
│  │    - Sample queries                                         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                 │                                   │
│                                 ▼                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Quality   │───>│    Merge    │───>│  Deploy to              │  │
│  │    Gate     │    │   Allowed   │    │  Production             │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Key Insights from Research

### DSPy Evaluation Best Practices

1. **Start Simple, Iterate**: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
2. **Metric as DSPy Program**: Complex metrics can be DSPy modules themselves
3. **Dev Set Size**: 20-200 examples for development, more for optimization
4. **Multiple Properties**: Metrics should check multiple aspects (correctness, groundedness, format)

### GitOps LLM Testing Patterns

1. **Arize Phoenix Pattern**: Experiments API + GitHub Actions for automated evals
2. **Promptfoo Pattern**: YAML-based test definitions with assertions
3. **CircleCI Evals Orb Pattern**: Declarative evaluation jobs with CEL assertions
4. **Evidently Pattern**: Regression testing with LLM-as-judge metrics

## Testing Layers

### Layer 1: Fast Unit Tests (< 10 seconds)

No LLM calls. Test deterministic logic:

- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting

### Layer 2: DSPy Module Tests (< 2 minutes)

LLM-powered tests with cached/mock responses:

- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring

### Layer 3: Integration Tests (< 5 minutes)

Live system tests with real LLM/database:

- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior

### Layer 4: Comprehensive Evaluation (nightly)

Full dataset evaluation:

- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection

## Quality Gates

| Layer | Metric | Threshold | Block Merge? |
|-------|--------|-----------|--------------|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |

## File Structure

```
tests/
├── dspy_gitops/
│   ├── __init__.py
│   ├── conftest.py                     # Pytest fixtures for DSPy
│   ├── datasets/
│   │   ├── heritage_rag_dev.json       # Development set (20-50 examples)
│   │   ├── heritage_rag_test.json      # Test set (100+ examples)
│   │   └── golden_queries.yaml         # Golden test cases
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── intent_accuracy.py          # Intent classification metric
│   │   ├── entity_extraction.py        # Entity F1 metric
│   │   ├── sparql_correctness.py       # SPARQL validation metric
│   │   └── answer_relevance.py         # LLM-as-judge metric
│   ├── test_layer1_unit.py             # Fast unit tests
│   ├── test_layer2_dspy.py             # DSPy module tests
│   ├── test_layer3_integration.py      # Integration tests
│   └── test_layer4_comprehensive.py    # Full evaluation
├── pytest.ini                          # Pytest configuration
└── .github/
    └── workflows/
        └── dspy-eval.yml               # GitHub Actions workflow
```

## Implementation Plan

See subsequent documents:

- `01-datasets.md` - Test dataset design
- `02-metrics.md` - Evaluation metrics implementation
- `03-github-actions.md` - GitOps workflow configuration
- `04-pytest-integration.md` - pytest + DSPy integration
- `05-quality-gates.md` - Quality gate implementation
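To make the Entity F1 quality gate (Layer 2, ≥ 80%) concrete: the metric reduces to set-based precision/recall over extracted entities, written in the `(example, prediction, trace)` shape that DSPy metric functions use. This is a minimal sketch of what `metrics/entity_extraction.py` could contain, not the final implementation; the `entities` attribute name is an assumption about the example/prediction schema.

```python
def entity_f1(example, prediction, trace=None) -> float:
    """Set-based F1 over extracted entities (case-insensitive).

    Follows the DSPy metric convention: takes a gold example and a
    module prediction, returns a score in [0, 1].
    """
    gold = {e.lower() for e in example.entities}
    pred = {e.lower() for e in prediction.entities}

    # Both empty: the module correctly extracted nothing.
    if not gold and not pred:
        return 1.0
    # One side empty: total miss or pure hallucination.
    if not gold or not pred:
        return 0.0

    true_positives = len(gold & pred)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A function with this signature can be passed directly as the `metric` argument to DSPy's `Evaluate`, or averaged over a dataset in a plain pytest test against the 0.80 threshold.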
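The merge-blocking behavior in the Quality Gates table can be sketched as a pair of threshold maps (blocking vs. warning-only) checked against a run's metric scores. The sketch below is illustrative only: metric keys and the function name are assumptions, and the real gate (`05-quality-gates.md`) would additionally cover unit-test pass rate and API health checks.

```python
# Thresholds mirror the Quality Gates table; keys are hypothetical metric names.
BLOCKING_THRESHOLDS = {
    "intent_accuracy": 0.85,        # Layer 2
    "entity_f1": 0.80,              # Layer 2
    "sample_query_success": 0.90,   # Layer 3
}
WARNING_THRESHOLDS = {
    "overall_rag_score": 0.75,      # Layer 4: warn, do not block
}


def evaluate_gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, messages) for one evaluation run.

    A missing metric is treated as a score of 0.0, so an evaluation
    that failed to report a blocking metric also blocks the merge.
    """
    allowed, messages = True, []
    for metric, threshold in BLOCKING_THRESHOLDS.items():
        score = scores.get(metric, 0.0)
        if score < threshold:
            allowed = False
            messages.append(f"BLOCK: {metric} {score:.2f} < {threshold:.2f}")
    for metric, threshold in WARNING_THRESHOLDS.items():
        score = scores.get(metric, 0.0)
        if score < threshold:
            messages.append(f"WARN: {metric} {score:.2f} < {threshold:.2f}")
    return allowed, messages
```

In CI this would run after the evaluation layers, with a non-zero exit code (and the messages printed to the job log) whenever `merge_allowed` is false.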