# DSPy + GitOps Testing Framework for Heritage RAG

## Executive Summary

This document outlines a comprehensive testing framework that integrates **DSPy evaluation patterns** with **GitOps CI/CD workflows** for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.

## Goals

1. **Continuous Evaluation**: Run DSPy evaluations on every PR to catch regressions
2. **Reproducible Results**: Version-control test datasets, prompts, and evaluation metrics
3. **Quality Gates**: Block merges when evaluation scores drop below thresholds
4. **Progressive Testing**: Fast smoke tests on PRs, comprehensive evals on merge
5. **Observability**: Track evaluation metrics over time

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                       GitOps Testing Pipeline                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   GitHub    │───>│   GitHub    │───>│   DSPy Evaluation       │  │
│  │   Push/PR   │    │   Actions   │    │   (pytest + dspy.Eval)  │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                     │               │
│                                                     ▼               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                      Evaluation Layers                      │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │  Layer 1: Unit Tests (pytest)                               │    │
│  │    - Intent classification accuracy                         │    │
│  │    - Entity extraction correctness                          │    │
│  │    - SPARQL syntax validation                               │    │
│  │                                                             │    │
│  │  Layer 2: Integration Tests (DSPy Evaluate)                 │    │
│  │    - End-to-end RAG pipeline                                │    │
│  │    - Answer quality metrics                                 │    │
│  │    - Response latency                                       │    │
│  │                                                             │    │
│  │  Layer 3: Smoke Tests (Live API)                            │    │
│  │    - API endpoint health                                    │    │
│  │    - SPARQL endpoint connectivity                           │    │
│  │    - Sample queries                                         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                     │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Quality   │───>│   Merge     │───>│   Deploy to             │  │
│  │   Gate      │    │   Allowed   │    │   Production            │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Key Insights from Research

### DSPy Evaluation Best Practices

1. **Start Simple, Iterate**: Begin with exact-match/accuracy metrics, evolve to LLM-as-judge
2. **Metric as DSPy Program**: Complex metrics can be DSPy modules themselves
3. **Dev Set Size**: 20-200 examples for development, more for optimization
4. **Multiple Properties**: Metrics should check multiple aspects (correctness, groundedness, format)
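
To make points 1, 2, and 4 concrete: a DSPy metric is an ordinary Python callable over `(example, pred, trace=None)`. The sketch below is illustrative only — field names such as `expected_intent` are hypothetical, not this project's actual schema:

```python
# Sketch of DSPy-style metrics. Field names (`expected_intent`, `intent`)
# are hypothetical placeholders, not the project's real schema.

def intent_accuracy(example, pred, trace=None):
    """Simple exact-match metric for intent classification."""
    return example.expected_intent.strip().lower() == pred.intent.strip().lower()

def composite_metric(example, pred, trace=None):
    """Check multiple properties at once: correctness AND output format."""
    correct = example.expected_intent.strip().lower() == pred.intent.strip().lower()
    well_formed = pred.intent in {"factual", "exploratory", "aggregate"}
    # During optimization (trace is not None), DSPy conventionally expects
    # a strict boolean; during evaluation, a partial score is fine.
    if trace is not None:
        return correct and well_formed
    return 0.5 * correct + 0.5 * well_formed
```

The same pattern scales up to LLM-as-judge: the metric function simply calls another DSPy module internally and returns its verdict.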
### GitOps LLM Testing Patterns

1. **Arize Phoenix Pattern**: Experiments API + GitHub Actions for automated evals
2. **Promptfoo Pattern**: YAML-based test definitions with assertions
3. **CircleCI Evals Orb Pattern**: Declarative evaluation jobs with CEL assertions
4. **Evidently Pattern**: Regression testing with LLM-as-judge metrics

## Testing Layers

### Layer 1: Fast Unit Tests (< 10 seconds)

No LLM calls. Test only deterministic logic:

- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting
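
A Layer 1 test might look like the following sketch. The router and its regex rules are illustrative, not the project's real routing code:

```python
# Hypothetical Layer 1 unit test: pure-Python routing logic, no LLM calls.
import re

def route_query(question: str) -> str:
    """Toy deterministic router: regex rules map a question to a handler."""
    if re.search(r"\bhow many\b|\bcount\b", question, re.IGNORECASE):
        return "aggregate"
    if re.search(r"\bwho\b|\bwhen\b|\bwhere\b", question, re.IGNORECASE):
        return "factual"
    return "exploratory"

def test_routing_rules():
    assert route_query("How many paintings are from 1890?") == "aggregate"
    assert route_query("Who painted The Night Watch?") == "factual"
    assert route_query("Tell me about Dutch Golden Age art") == "exploratory"
```

Because these tests never touch an LLM, they can run on every commit in seconds.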
### Layer 2: DSPy Module Tests (< 2 minutes)

LLM-powered tests with cached/mocked responses:

- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring
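
In the real suite, this layer would use `dspy.Evaluate` with a cached LM. The sketch below mimics that loop with a stubbed classifier and canned responses so the run is deterministic and offline; all names and examples are illustrative:

```python
# Layer 2 sketch: score an LLM-backed module against a small dev set without
# live API calls. The classifier is stubbed with canned (cached) responses.

DEV_SET = [
    {"question": "Who painted The Night Watch?", "expected_intent": "factual"},
    {"question": "How many etchings are in the collection?", "expected_intent": "aggregate"},
]

CACHED_RESPONSES = {  # stands in for a cached/mocked LM
    "Who painted The Night Watch?": "factual",
    "How many etchings are in the collection?": "aggregate",
}

def stub_intent_classifier(question: str) -> str:
    """Deterministic stand-in for the DSPy intent-classification module."""
    return CACHED_RESPONSES[question]

def intent_match(example: dict, prediction: str) -> float:
    return float(example["expected_intent"] == prediction)

def evaluate(module, dev_set, metric) -> float:
    """Average metric score over the dev set (the number dspy.Evaluate reports)."""
    scores = [metric(example, module(example["question"])) for example in dev_set]
    return sum(scores) / len(scores)
```

The returned average is exactly the score the quality gates below compare against their thresholds.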
### Layer 3: Integration Tests (< 5 minutes)

Live-system tests with a real LLM and database:

- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior
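
A smoke check of endpoint health before running sample queries could be sketched as follows. The `fetch` callable is injected so the check can be exercised without a live server; the endpoint paths are illustrative assumptions, not the service's actual routes:

```python
# Layer 3 smoke-test sketch: verify each endpoint answers with HTTP 200.
from urllib.request import urlopen

ENDPOINTS = ["/health", "/api/query", "/sparql"]  # hypothetical paths

def check_endpoints(base_url: str, fetch=None) -> dict:
    """Map each endpoint path to True if it answered with HTTP 200."""
    if fetch is None:
        # Default: real HTTP call (used against the live system in CI).
        fetch = lambda url: urlopen(url, timeout=5).status
    results = {}
    for path in ENDPOINTS:
        try:
            results[path] = fetch(base_url + path) == 200
        except OSError:
            results[path] = False
    return results
```

Injecting `fetch` keeps the gate logic itself unit-testable while the CI job runs it against the deployed service.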
### Layer 4: Comprehensive Evaluation (nightly)

Full-dataset evaluation:

- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection
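
Regression detection here can be as simple as comparing the nightly scores against a stored baseline and flagging drops beyond a tolerance. A minimal sketch, with illustrative metric names:

```python
# Sketch of nightly regression detection: flag metrics whose score dropped
# more than `tolerance` relative to the versioned baseline.

def detect_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return {metric: (baseline_score, current_score)} for regressed metrics."""
    return {
        metric: (baseline[metric], current.get(metric, 0.0))
        for metric in baseline
        if baseline[metric] - current.get(metric, 0.0) > tolerance
    }
```

Committing the baseline scores to the repository keeps regression detection reproducible, in line with the GitOps goals above.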
## Quality Gates

| Layer | Metric | Threshold | Block Merge? |
|-------|--------|-----------|--------------|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |
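
A CI job can enforce these gates with a small script that reads the evaluation results and exits non-zero when a blocking metric is under threshold. The thresholds below mirror the table; the metric names are illustrative:

```python
# Sketch of a quality-gate check run after the evaluation step in CI.

THRESHOLDS = {  # metric -> (minimum score, blocks merge?)
    "intent_accuracy": (0.85, True),
    "entity_f1": (0.80, True),
    "sample_query_success": (0.90, True),
    "overall_rag_score": (0.75, False),  # warning only
}

def check_gates(results: dict) -> int:
    """Return non-zero if any blocking metric falls below its threshold."""
    exit_code = 0
    for metric, (minimum, blocking) in THRESHOLDS.items():
        score = results.get(metric, 0.0)
        if score < minimum:
            level = "FAIL" if blocking else "WARN"
            print(f"{level}: {metric} = {score:.2f} (threshold {minimum:.2f})")
            if blocking:
                exit_code = 1
    return exit_code
```

In CI, the workflow would feed the metrics artifact from the evaluation step into `check_gates` and fail the build (blocking the merge) on a non-zero return.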
## File Structure

```
tests/
├── dspy_gitops/
│   ├── __init__.py
│   ├── conftest.py                   # Pytest fixtures for DSPy
│   ├── datasets/
│   │   ├── heritage_rag_dev.json     # Development set (20-50 examples)
│   │   ├── heritage_rag_test.json    # Test set (100+ examples)
│   │   └── golden_queries.yaml       # Golden test cases
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── intent_accuracy.py        # Intent classification metric
│   │   ├── entity_extraction.py      # Entity F1 metric
│   │   ├── sparql_correctness.py     # SPARQL validation metric
│   │   └── answer_relevance.py       # LLM-as-judge metric
│   ├── test_layer1_unit.py           # Fast unit tests
│   ├── test_layer2_dspy.py           # DSPy module tests
│   ├── test_layer3_integration.py    # Integration tests
│   └── test_layer4_comprehensive.py  # Full evaluation
├── pytest.ini                        # Pytest configuration
└── .github/
    └── workflows/
        └── dspy-eval.yml             # GitHub Actions workflow
```

## Implementation Plan

See subsequent documents:

- `01-datasets.md` - Test dataset design
- `02-metrics.md` - Evaluation metrics implementation
- `03-github-actions.md` - GitOps workflow configuration
- `04-pytest-integration.md` - pytest + DSPy integration
- `05-quality-gates.md` - Quality gate implementation