DSPy + GitOps Testing Framework for Heritage RAG
Executive Summary
This document outlines a comprehensive testing framework that integrates DSPy evaluation patterns with GitOps CI/CD workflows for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.
Goals
- Continuous Evaluation: Run DSPy evaluations on every PR to catch regressions
- Reproducible Results: Version-control test datasets, prompts, and evaluation metrics
- Quality Gates: Block merges when evaluation scores drop below thresholds
- Progressive Testing: Fast smoke tests on PR, comprehensive evals on merge
- Observability: Track evaluation metrics over time
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ GitOps Testing Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ GitHub │───>│ GitHub │───>│ DSPy Evaluation │ │
│ │ Push/PR │ │ Actions │ │ (pytest + dspy.Eval) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Evaluation Layers │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ Layer 1: Unit Tests (pytest) │ │
│ │ - Intent classification accuracy │ │
│ │ - Entity extraction correctness │ │
│ │ - SPARQL syntax validation │ │
│ │ │ │
│ │ Layer 2: Integration Tests (DSPy Evaluate) │ │
│ │ - End-to-end RAG pipeline │ │
│ │ - Answer quality metrics │ │
│ │ - Response latency │ │
│ │ │ │
│ │ Layer 3: Smoke Tests (Live API) │ │
│ │ - API endpoint health │ │
│ │ - SPARQL endpoint connectivity │ │
│ │ - Sample queries │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Quality │───>│ Merge │───>│ Deploy to │ │
│ │ Gate │ │ Allowed │ │ Production │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key Insights from Research
DSPy Evaluation Best Practices
- Start Simple, Iterate: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
- Metric as DSPy Program: Complex metrics can be DSPy modules themselves
- Dev Set Size: 20-200 examples for development, more for optimization
- Multiple Properties: Metrics should check multiple aspects (correctness, groundedness, format)
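To make the "multiple properties" idea concrete, here is a sketch of a composite metric in the `(example, prediction, trace)` shape that DSPy metric functions use. The `.answer`/`.sources` field names and the property weights are illustrative assumptions, not the project's actual schema:

```python
def heritage_answer_metric(example, prediction, trace=None):
    """Composite metric checking several properties of one RAG answer.

    Follows the DSPy metric convention of (example, prediction, trace);
    the field names and weights here are hypothetical.
    """
    score = 0.0
    # Property 1: correctness (normalized exact match against the gold answer)
    if prediction.answer.strip().lower() == example.answer.strip().lower():
        score += 0.5
    # Property 2: groundedness (the answer cites at least one retrieved source)
    if getattr(prediction, "sources", None):
        score += 0.25
    # Property 3: format (non-empty and reasonably concise)
    if 0 < len(prediction.answer) <= 500:
        score += 0.25
    return score
```

A metric like this can later be replaced property-by-property with an LLM-as-judge module while keeping the same signature, which is what makes "start simple, iterate" workable.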
GitOps LLM Testing Patterns
- Arize Phoenix Pattern: Experiments API + GitHub Actions for automated evals
- Promptfoo Pattern: YAML-based test definitions with assertions
- CircleCI Evals Orb Pattern: Declarative evaluation jobs with CEL assertions
- Evidently Pattern: Regression testing with LLM-as-judge metrics
Testing Layers
Layer 1: Fast Unit Tests (< 10 seconds)
No LLM calls. Test deterministic logic:
- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting
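A Layer 1 test might look like the following sketch. The regex pattern, helper names, and routing rule are hypothetical stand-ins for the project's deterministic logic:

```python
import re

# Hypothetical deterministic helpers of the kind Layer 1 covers;
# pattern and routing rule are illustrative, not the project's actual code.
YEAR_RANGE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\s*[-\u2013]\s*(1[0-9]{3}|20[0-9]{2})\b")

def extract_year_range(query: str):
    """Pull a 'YYYY-YYYY' range out of a user query, if present."""
    m = YEAR_RANGE.search(query)
    return (int(m.group(1)), int(m.group(2))) if m else None

def route_query(query: str) -> str:
    """Toy routing rule: period questions go to SPARQL, the rest to vector search."""
    return "sparql" if extract_year_range(query) else "vector"

# Layer 1 tests: no LLM calls, millisecond runtime.
def test_extract_year_range():
    assert extract_year_range("paintings made 1600-1650") == (1600, 1650)
    assert extract_year_range("who painted this?") is None

def test_route_query():
    assert route_query("sculpture from 1800-1850") == "sparql"
    assert route_query("describe this artifact") == "vector"
```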
Layer 2: DSPy Module Tests (< 2 minutes)
LLM-powered tests with cached/mock responses:
- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring
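The cached/mock pattern can be sketched as follows: a stub stands in for the compiled DSPy intent classifier, returning canned responses so the evaluation loop runs without live LLM calls. The queries, intents, and the 85% gate value mirror the quality gates below, but the stub itself is an illustrative assumption:

```python
from types import SimpleNamespace

# Stand-in for a compiled DSPy intent classifier; in the real Layer 2 suite
# this would be the actual module running against cached LM responses.
CACHED_INTENTS = {
    "who painted the night watch?": "artist_lookup",
    "show artworks from 1600-1650": "period_search",
    "what is in room 5?": "location_search",
}

class StubIntentClassifier:
    def __call__(self, query: str):
        return SimpleNamespace(intent=CACHED_INTENTS.get(query.lower(), "unknown"))

def intent_accuracy(classifier, dev_set) -> float:
    """Fraction of dev examples whose predicted intent matches the gold label."""
    hits = sum(classifier(ex["query"]).intent == ex["intent"] for ex in dev_set)
    return hits / len(dev_set)

def test_intent_accuracy_meets_gate():
    dev_set = [
        {"query": "Who painted The Night Watch?", "intent": "artist_lookup"},
        {"query": "Show artworks from 1600-1650", "intent": "period_search"},
        {"query": "What is in Room 5?", "intent": "location_search"},
    ]
    # 85% matches the Layer 2 quality gate for intent accuracy.
    assert intent_accuracy(StubIntentClassifier(), dev_set) >= 0.85
```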
Layer 3: Integration Tests (< 5 minutes)
Live system tests with real LLM/database:
- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior
Layer 4: Comprehensive Evaluation (nightly)
Full dataset evaluation:
- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection
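Regression detection in the nightly run reduces to comparing the current scores against a version-controlled baseline. A minimal sketch, assuming both are stored as metric-name-to-score mappings:

```python
def detect_regressions(baseline: dict, current: dict, tolerance: float = 0.02):
    """Return metrics whose score dropped more than `tolerance` below baseline.

    Both dicts map metric name -> score in [0, 1]. The baseline would be a
    version-controlled artifact (e.g. JSON) from the last passing nightly run;
    the 0.02 tolerance is an illustrative default.
    """
    return {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }
```

A non-empty result would fail the nightly job (or, per the quality gates, emit a warning), with the offending metric names in the report.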
Quality Gates
| Layer | Metric | Threshold | Block Merge? |
|---|---|---|---|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |
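The table above maps directly onto a small gate-evaluation routine. In this sketch each gate is a named threshold with a blocking flag; the binary "all endpoints OK" check is modeled as a 0/1 score for uniformity, which is an assumption about how the real pipeline would encode it:

```python
from dataclasses import dataclass

@dataclass
class Gate:
    metric: str
    threshold: float
    blocking: bool  # True -> failing this gate blocks the merge

# Gates mirroring the quality-gate table (Layer 4 is warn-only);
# api_health is the boolean endpoint check encoded as 0.0/1.0.
GATES = [
    Gate("unit_pass_rate", 1.00, blocking=True),
    Gate("intent_accuracy", 0.85, blocking=True),
    Gate("entity_f1", 0.80, blocking=True),
    Gate("api_health", 1.00, blocking=True),
    Gate("sample_query_success", 0.90, blocking=True),
    Gate("overall_rag_score", 0.75, blocking=False),
]

def evaluate_gates(scores: dict) -> tuple[bool, list[str]]:
    """Return (merge_allowed, warnings) for one evaluation run's scores."""
    merge_allowed, warnings = True, []
    for gate in GATES:
        if scores.get(gate.metric, 0.0) >= gate.threshold:
            continue
        if gate.blocking:
            merge_allowed = False
        else:
            warnings.append(f"{gate.metric} below {gate.threshold}")
    return merge_allowed, warnings
```

In CI, a `False` first element would fail the workflow step and block the merge, while warnings would surface in the job summary without blocking.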
File Structure
tests/
├── dspy_gitops/
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures for DSPy
│ ├── datasets/
│ │ ├── heritage_rag_dev.json # Development set (20-50 examples)
│ │ ├── heritage_rag_test.json # Test set (100+ examples)
│ │ └── golden_queries.yaml # Golden test cases
│ ├── metrics/
│ │ ├── __init__.py
│ │ ├── intent_accuracy.py # Intent classification metric
│ │ ├── entity_extraction.py # Entity F1 metric
│ │ ├── sparql_correctness.py # SPARQL validation metric
│ │ └── answer_relevance.py # LLM-as-judge metric
│ ├── test_layer1_unit.py # Fast unit tests
│ ├── test_layer2_dspy.py # DSPy module tests
│ ├── test_layer3_integration.py # Integration tests
│ └── test_layer4_comprehensive.py # Full evaluation
└── pytest.ini                        # Pytest configuration
.github/
└── workflows/
    └── dspy-eval.yml                 # GitHub Actions workflow (must live at repo root)
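The `conftest.py` in that tree could look like the following sketch: it loads the version-controlled dev set and registers layer markers so CI can select layers (e.g. `pytest -m layer1` on PRs, `-m layer4` nightly). Fixture and helper names are hypothetical:

```python
# conftest.py (sketch): shared fixtures for the dspy_gitops suite.
import json
from pathlib import Path

import pytest

DATASET_DIR = Path(__file__).parent / "datasets"

def load_examples(path: Path) -> list:
    """Load a version-controlled JSON dataset file into plain dicts."""
    return json.loads(path.read_text())

@pytest.fixture(scope="session")
def dev_set():
    # heritage_rag_dev.json is the 20-50 example development set above.
    return load_examples(DATASET_DIR / "heritage_rag_dev.json")

def pytest_configure(config):
    # Register one marker per testing layer so CI can select subsets.
    for layer in ("layer1", "layer2", "layer3", "layer4"):
        config.addinivalue_line("markers", f"{layer}: {layer} tests")
```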
Implementation Plan
See subsequent documents:
- 01-datasets.md - Test dataset design
- 02-metrics.md - Evaluation metrics implementation
- 03-github-actions.md - GitOps workflow configuration
- 04-pytest-integration.md - pytest + DSPy integration
- 05-quality-gates.md - Quality gate implementation