glam/docs/plan/dspy_gitops/00-overview.md
2026-01-11 18:08:40 +01:00

# DSPy + GitOps Testing Framework for Heritage RAG
## Executive Summary
This document outlines a comprehensive testing framework that integrates **DSPy evaluation patterns** with **GitOps CI/CD workflows** for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.
## Goals
1. **Continuous Evaluation**: Run DSPy evaluations on every PR to catch regressions
2. **Reproducible Results**: Version-control test datasets, prompts, and evaluation metrics
3. **Quality Gates**: Block merges when evaluation scores drop below thresholds
4. **Progressive Testing**: Fast checks on every PR, comprehensive evaluations nightly
5. **Observability**: Track evaluation metrics over time
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                     GitOps Testing Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────┐    ┌───────────┐    ┌───────────────────────┐    │
│  │  GitHub   │───>│  GitHub   │───>│    DSPy Evaluation    │    │
│  │  Push/PR  │    │  Actions  │    │ (pytest + dspy.Eval)  │    │
│  └───────────┘    └───────────┘    └───────────┬───────────┘    │
│                                                │                │
│                                                ▼                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     Evaluation Layers                     │  │
│  ├───────────────────────────────────────────────────────────┤  │
│  │  Layer 1: Unit Tests (pytest, no LLM)                     │  │
│  │    - Query routing and SPARQL template logic              │  │
│  │                                                           │  │
│  │  Layer 2: DSPy Module Tests (cached LLM)                  │  │
│  │    - Intent, entity, and SPARQL generation metrics        │  │
│  │                                                           │  │
│  │  Layer 3: Integration Tests (live LLM + Oxigraph)         │  │
│  │    - End-to-end RAG pipeline and API checks               │  │
│  │                                                           │  │
│  │  Layer 4: Comprehensive Eval (nightly, non-blocking)      │  │
│  │    - Full dataset, edge cases, regression detection       │  │
│  └─────┬─────────────────────────────────────────────────────┘  │
│        │                                                        │
│        ▼                                                        │
│  ┌───────────┐    ┌───────────┐    ┌───────────────────────┐    │
│  │  Quality  │───>│   Merge   │───>│       Deploy to       │    │
│  │   Gate    │    │  Allowed  │    │      Production       │    │
│  └───────────┘    └───────────┘    └───────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Key Insights from Research
### DSPy Evaluation Best Practices
1. **Start Simple, Iterate**: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
2. **Metric as DSPy Program**: Complex metrics can be DSPy modules themselves
3. **Dev Set Size**: 20-200 examples for development, more for optimization
4. **Multiple Properties**: Metrics should check multiple aspects (correctness, groundedness, format)
### GitOps LLM Testing Patterns
1. **Arize Phoenix Pattern**: Experiments API + GitHub Actions for automated evals
2. **Promptfoo Pattern**: YAML-based test definitions with assertions
3. **CircleCI Evals Orb Pattern**: Declarative evaluation jobs with CEL assertions
4. **Evidently Pattern**: Regression testing with LLM-as-judge metrics
## Testing Layers
### Layer 1: Fast Unit Tests (< 10 seconds)
No LLM calls. Test deterministic logic:
- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting
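As a sketch of what a Layer 1 test looks like, the routing rule and regex below are hypothetical stand-ins for the real GLAM logic; the point is that every assertion runs deterministically, without an LLM call:

```python
import re

# Hypothetical stand-ins for the real GLAM routing logic: a date-range regex
# and a template selector. Nothing here calls an LLM.
YEAR_RANGE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\s*(?:-|to)\s*(1[0-9]{3}|20[0-9]{2})\b")

def route_query(question: str) -> str:
    """Pick a SPARQL template family from surface features alone."""
    if YEAR_RANGE.search(question):
        return "temporal"
    if question.strip().endswith("?") and question.lower().startswith("who"):
        return "person"
    return "keyword"

# Layer 1 tests: deterministic, sub-millisecond, no network.
def test_temporal_routing():
    assert route_query("paintings from 1880-1900") == "temporal"

def test_person_routing():
    assert route_query("Who painted the altarpiece?") == "person"

def test_default_routing():
    assert route_query("amsterdam canal houses") == "keyword"
```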
### Layer 2: DSPy Module Tests (< 2 minutes)
LLM-powered tests with cached/mock responses:
- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring
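A minimal sketch of the Layer 2 harness, with hypothetical dataset fields (`question`, `intent`): in the real suite `dspy.Evaluate(devset=..., metric=...)` would drive the loop, with the live module swapped for recorded responses so CI never calls an LLM:

```python
# Layer 2 sketch: score intent classification offline. In the real suite,
# dspy.Evaluate(devset=..., metric=...) drives this; here the live module is
# replaced by a dict of recorded responses. Field names are illustrative.
import json

def load_devset(path):
    """Load examples from e.g. datasets/heritage_rag_dev.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def cached_predictor(cache):
    """Stand-in for the DSPy intent module, backed by recorded outputs."""
    def predict(question):
        return cache[question]
    return predict

def intent_accuracy(devset, predict):
    """Fraction of examples whose predicted intent matches the label."""
    hits = sum(predict(ex["question"]) == ex["intent"] for ex in devset)
    return hits / len(devset)
```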
### Layer 3: Integration Tests (< 5 minutes)
Live system tests with real LLM/database:
- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior
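The endpoint health checks above can be sketched as follows; the URLs and ports are placeholders (Oxigraph's default HTTP port is 7878), and `fetch` is injectable so the check itself stays testable without a network:

```python
# Layer 3 sketch: smoke-check the API and the Oxigraph SPARQL endpoint.
# URLs/ports are placeholders; `fetch` is injectable so the check itself
# stays unit-testable without a network.
from urllib.request import urlopen

ENDPOINTS = {
    "api": "http://localhost:8000/health",
    # Oxigraph's SPARQL endpoint answering a trivial ASK {} query
    "sparql": "http://localhost:7878/query?query=ASK%20%7B%7D",
}

def fetch_status(url):
    with urlopen(url, timeout=5) as resp:
        return resp.status

def check_endpoints(endpoints=ENDPOINTS, fetch=fetch_status):
    """Return {name: status-or-error} for every endpoint that is unhealthy."""
    failures = {}
    for name, url in endpoints.items():
        try:
            status = fetch(url)
        except OSError as exc:        # urllib's URLError subclasses OSError
            failures[name] = repr(exc)
            continue
        if status != 200:
            failures[name] = status
    return failures
```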
### Layer 4: Comprehensive Evaluation (nightly)
Full dataset evaluation:
- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection
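Regression detection can be as simple as diffing tonight's scores against a stored baseline (e.g. a metrics JSON committed after the last green nightly run); the tolerance value below is an assumption:

```python
# Layer 4 sketch: flag metrics that dropped against a stored baseline.
def detect_regressions(current, baseline, tolerance=0.02):
    """Return {metric: (baseline, current)} for every metric that fell
    more than `tolerance` below its baseline score."""
    return {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }
```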
## Quality Gates
| Layer | Metric | Threshold | Block Merge? |
|-------|--------|-----------|--------------|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |
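One way to enforce the table above in CI (the metric keys are assumptions to be aligned with the real evaluation output):

```python
# Quality-gate sketch mirroring the table above. Metric keys are assumptions;
# a CI step would call evaluate_gates() on the collected scores and exit
# non-zero (failing the required check) whenever hard failures exist.
GATES = [
    # (metric key, minimum score, blocks merge?)
    ("unit_pass_rate",       1.00, True),
    ("intent_accuracy",      0.85, True),
    ("entity_f1",            0.80, True),
    ("sample_query_success", 0.90, True),
    ("overall_rag_score",    0.75, False),  # Layer 4: warn, don't block
]

def evaluate_gates(scores, gates=GATES):
    """Split failing gates into (hard_failures, warnings) for a score dict."""
    hard, soft = [], []
    for name, minimum, blocking in gates:
        if scores.get(name, 0.0) < minimum:  # missing metric counts as failed
            (hard if blocking else soft).append(name)
    return hard, soft
```

Treating a missing metric as a failure is deliberate: a gate that silently passes when the evaluation step is skipped defeats its purpose.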
## File Structure
```
.
├── tests/
│   └── dspy_gitops/
│       ├── __init__.py
│       ├── conftest.py                  # Pytest fixtures for DSPy
│       ├── datasets/
│       │   ├── heritage_rag_dev.json    # Development set (20-50 examples)
│       │   ├── heritage_rag_test.json   # Test set (100+ examples)
│       │   └── golden_queries.yaml      # Golden test cases
│       ├── metrics/
│       │   ├── __init__.py
│       │   ├── intent_accuracy.py       # Intent classification metric
│       │   ├── entity_extraction.py     # Entity F1 metric
│       │   ├── sparql_correctness.py    # SPARQL validation metric
│       │   └── answer_relevance.py      # LLM-as-judge metric
│       ├── test_layer1_unit.py          # Fast unit tests
│       ├── test_layer2_dspy.py          # DSPy module tests
│       ├── test_layer3_integration.py   # Integration tests
│       └── test_layer4_comprehensive.py # Full evaluation
├── pytest.ini                           # Pytest configuration
└── .github/
    └── workflows/
        └── dspy-eval.yml                # GitHub Actions workflow (repo root)
```
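A sketch of the `dspy-eval.yml` workflow referenced above; the job names, pytest markers, secret name, and Python version are placeholders to be aligned with the real configuration:

```yaml
# .github/workflows/dspy-eval.yml -- sketch only; markers and secrets
# are assumptions.
name: dspy-eval
on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"   # nightly Layer 4 run

jobs:
  fast-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # Layers 1-2 on every PR; a failure blocks the merge
      - run: pytest tests/dspy_gitops -m "layer1 or layer2" --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  nightly-comprehensive:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # Layer 4: full-dataset evaluation, reported but non-blocking
      - run: pytest tests/dspy_gitops -m layer4 --tb=short
        continue-on-error: true
```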
## Implementation Plan
See subsequent documents:
- `01-datasets.md` - Test dataset design
- `02-metrics.md` - Evaluation metrics implementation
- `03-github-actions.md` - GitOps workflow configuration
- `04-pytest-integration.md` - pytest + DSPy integration
- `05-quality-gates.md` - Quality gate implementation