# DSPy + GitOps Testing Framework for Heritage RAG

## Executive Summary

This document outlines a comprehensive testing framework that integrates **DSPy evaluation patterns** with **GitOps CI/CD workflows** for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.

## Goals

1. **Continuous Evaluation**: Run DSPy evaluations on every PR to catch regressions
2. **Reproducible Results**: Version-control test datasets, prompts, and evaluation metrics
3. **Quality Gates**: Block merges when evaluation scores drop below thresholds
4. **Progressive Testing**: Fast smoke tests on PR, comprehensive evals on merge
5. **Observability**: Track evaluation metrics over time

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                       GitOps Testing Pipeline                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   GitHub    │───>│   GitHub    │───>│  DSPy Evaluation        │  │
│  │   Push/PR   │    │   Actions   │    │  (pytest + dspy.Eval)   │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                     │               │
│                                                     ▼               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                     Evaluation Layers                       │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │  Layer 1: Unit Tests (pytest)                               │    │
│  │    - Intent classification accuracy                         │    │
│  │    - Entity extraction correctness                          │    │
│  │    - SPARQL syntax validation                               │    │
│  │                                                             │    │
│  │  Layer 2: Integration Tests (DSPy Evaluate)                 │    │
│  │    - End-to-end RAG pipeline                                │    │
│  │    - Answer quality metrics                                 │    │
│  │    - Response latency                                       │    │
│  │                                                             │    │
│  │  Layer 3: Smoke Tests (Live API)                            │    │
│  │    - API endpoint health                                    │    │
│  │    - SPARQL endpoint connectivity                           │    │
│  │    - Sample queries                                         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                 │                                   │
│                                 ▼                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Quality   │───>│    Merge    │───>│  Deploy to              │  │
│  │    Gate     │    │   Allowed   │    │  Production             │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Key Insights from Research

### DSPy Evaluation Best Practices

1. **Start Simple, Iterate**: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
2. **Metric as DSPy Program**: Complex metrics can be DSPy modules themselves
3. **Dev Set Size**: 20-200 examples for development, more for optimization
4. **Multiple Properties**: Metrics should check multiple aspects (correctness, groundedness, format)

### GitOps LLM Testing Patterns

1. **Arize Phoenix Pattern**: Experiments API + GitHub Actions for automated evals
2. **Promptfoo Pattern**: YAML-based test definitions with assertions
3. **CircleCI Evals Orb Pattern**: Declarative evaluation jobs with CEL assertions
4. **Evidently Pattern**: Regression testing with LLM-as-judge metrics

## Testing Layers

### Layer 1: Fast Unit Tests (< 10 seconds)

No LLM calls. Test deterministic logic:

- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting

### Layer 2: DSPy Module Tests (< 2 minutes)

LLM-powered tests with cached/mock responses:

- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring

### Layer 3: Integration Tests (< 5 minutes)

Live system tests with real LLM/database:

- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior

### Layer 4: Comprehensive Evaluation (nightly)

Full dataset evaluation:

- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection

## Quality Gates

| Layer | Metric | Threshold | Block Merge? |
|-------|--------|-----------|--------------|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |

## File Structure

```
tests/
├── dspy_gitops/
│   ├── __init__.py
│   ├── conftest.py                     # Pytest fixtures for DSPy
│   ├── datasets/
│   │   ├── heritage_rag_dev.json       # Development set (20-50 examples)
│   │   ├── heritage_rag_test.json      # Test set (100+ examples)
│   │   └── golden_queries.yaml         # Golden test cases
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── intent_accuracy.py          # Intent classification metric
│   │   ├── entity_extraction.py        # Entity F1 metric
│   │   ├── sparql_correctness.py       # SPARQL validation metric
│   │   └── answer_relevance.py         # LLM-as-judge metric
│   ├── test_layer1_unit.py             # Fast unit tests
│   ├── test_layer2_dspy.py             # DSPy module tests
│   ├── test_layer3_integration.py      # Integration tests
│   └── test_layer4_comprehensive.py    # Full evaluation
├── pytest.ini                          # Pytest configuration
└── .github/
    └── workflows/
        └── dspy-eval.yml               # GitHub Actions workflow
```

## Implementation Plan

See subsequent documents:

- `01-datasets.md` - Test dataset design
- `02-metrics.md` - Evaluation metrics implementation
- `03-github-actions.md` - GitOps workflow configuration
- `04-pytest-integration.md` - pytest + DSPy integration
- `05-quality-gates.md` - Quality gate implementation
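To make the Entity F1 quality gate (Layer 2, ≥ 80%) concrete: the metric reduces to set-based precision/recall over extracted entities, written in the `(example, prediction, trace)` shape that DSPy metric functions use. This is a minimal sketch of what `metrics/entity_extraction.py` could contain, not the final implementation; the `entities` attribute name is an assumption about the example/prediction schema.

```python
def entity_f1(example, prediction, trace=None) -> float:
    """Set-based F1 over extracted entities (case-insensitive).

    Follows the DSPy metric convention: takes a gold example and a
    module prediction, returns a score in [0, 1].
    """
    gold = {e.lower() for e in example.entities}
    pred = {e.lower() for e in prediction.entities}

    # Both empty: the module correctly extracted nothing.
    if not gold and not pred:
        return 1.0
    # One side empty: total miss or pure hallucination.
    if not gold or not pred:
        return 0.0

    true_positives = len(gold & pred)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A function with this signature can be passed directly as the `metric` argument to DSPy's `Evaluate`, or averaged over a dataset in a plain pytest test against the 0.80 threshold.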
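The merge-blocking behavior in the Quality Gates table can be sketched as a pair of threshold maps (blocking vs. warning-only) checked against a run's metric scores. The sketch below is illustrative only: metric keys and the function name are assumptions, and the real gate (`05-quality-gates.md`) would additionally cover unit-test pass rate and API health checks.

```python
# Thresholds mirror the Quality Gates table; keys are hypothetical metric names.
BLOCKING_THRESHOLDS = {
    "intent_accuracy": 0.85,        # Layer 2
    "entity_f1": 0.80,              # Layer 2
    "sample_query_success": 0.90,   # Layer 3
}
WARNING_THRESHOLDS = {
    "overall_rag_score": 0.75,      # Layer 4: warn, do not block
}


def evaluate_gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, messages) for one evaluation run.

    A missing metric is treated as a score of 0.0, so an evaluation
    that failed to report a blocking metric also blocks the merge.
    """
    allowed, messages = True, []
    for metric, threshold in BLOCKING_THRESHOLDS.items():
        score = scores.get(metric, 0.0)
        if score < threshold:
            allowed = False
            messages.append(f"BLOCK: {metric} {score:.2f} < {threshold:.2f}")
    for metric, threshold in WARNING_THRESHOLDS.items():
        score = scores.get(metric, 0.0)
        if score < threshold:
            messages.append(f"WARN: {metric} {score:.2f} < {threshold:.2f}")
    return allowed, messages
```

In CI this would run after the evaluation layers, with a non-zero exit code (and the messages printed to the job log) whenever `merge_allowed` is false.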