DSPy + GitOps Testing Framework for Heritage RAG
Executive Summary
This document outlines a comprehensive testing framework that integrates DSPy evaluation patterns with GitOps CI/CD workflows for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.
Goals
- Continuous Evaluation: Run DSPy evaluations on every PR to catch regressions
- Reproducible Results: Version-control test datasets, prompts, and evaluation metrics
- Quality Gates: Block merges when evaluation scores drop below thresholds
- Progressive Testing: Fast smoke tests on PR, comprehensive evals on merge
- Observability: Track evaluation metrics over time
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ GitOps Testing Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ GitHub │───>│ GitHub │───>│ DSPy Evaluation │ │
│ │ Push/PR │ │ Actions │ │ (pytest + dspy.Eval) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Evaluation Layers │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ Layer 1: Unit Tests (pytest) │ │
│ │ - Intent classification accuracy │ │
│ │ - Entity extraction correctness │ │
│ │ - SPARQL syntax validation │ │
│ │ │ │
│ │ Layer 2: Integration Tests (DSPy Evaluate) │ │
│ │ - End-to-end RAG pipeline │ │
│ │ - Answer quality metrics │ │
│ │ - Response latency │ │
│ │ │ │
│ │ Layer 3: Smoke Tests (Live API) │ │
│ │ - API endpoint health │ │
│ │ - SPARQL endpoint connectivity │ │
│ │ - Sample queries │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Quality │───>│ Merge │───>│ Deploy to │ │
│ │ Gate │ │ Allowed │ │ Production │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key Insights from Research
DSPy Evaluation Best Practices
- Start Simple, Iterate: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
- Metric as DSPy Program: Complex metrics can be DSPy modules themselves
- Dev Set Size: 20-200 examples for development, more for optimization
- Multiple Properties: Metrics should check multiple aspects (correctness, groundedness, format)
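To make the "multiple properties" idea concrete, here is a sketch of a composite metric in the `(example, prediction, trace)` shape that DSPy metric functions use. The `.answer`/`.sources` field names and the property weights are illustrative assumptions, not the project's actual schema:

```python
def heritage_answer_metric(example, prediction, trace=None):
    """Composite metric checking several properties of one RAG answer.

    Follows the DSPy metric convention of (example, prediction, trace);
    the field names and weights here are hypothetical.
    """
    score = 0.0
    # Property 1: correctness (normalized exact match against the gold answer)
    if prediction.answer.strip().lower() == example.answer.strip().lower():
        score += 0.5
    # Property 2: groundedness (the answer cites at least one retrieved source)
    if getattr(prediction, "sources", None):
        score += 0.25
    # Property 3: format (non-empty and reasonably concise)
    if 0 < len(prediction.answer) <= 500:
        score += 0.25
    return score
```

A metric like this can later be replaced property-by-property with an LLM-as-judge module while keeping the same signature, which is what makes "start simple, iterate" workable.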
GitOps LLM Testing Patterns
- Arize Phoenix Pattern: Experiments API + GitHub Actions for automated evals
- Promptfoo Pattern: YAML-based test definitions with assertions
- CircleCI Evals Orb Pattern: Declarative evaluation jobs with CEL assertions
- Evidently Pattern: Regression testing with LLM-as-judge metrics
Testing Layers
Layer 1: Fast Unit Tests (< 10 seconds)
No LLM calls. Test deterministic logic:
- Query routing rules
- Entity extraction regex patterns
- SPARQL template selection
- Response formatting
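A Layer 1 test might look like the following sketch. The regex pattern, helper names, and routing rule are hypothetical stand-ins for the project's deterministic logic:

```python
import re

# Hypothetical deterministic helpers of the kind Layer 1 covers;
# pattern and routing rule are illustrative, not the project's actual code.
YEAR_RANGE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\s*[-\u2013]\s*(1[0-9]{3}|20[0-9]{2})\b")

def extract_year_range(query: str):
    """Pull a 'YYYY-YYYY' range out of a user query, if present."""
    m = YEAR_RANGE.search(query)
    return (int(m.group(1)), int(m.group(2))) if m else None

def route_query(query: str) -> str:
    """Toy routing rule: period questions go to SPARQL, the rest to vector search."""
    return "sparql" if extract_year_range(query) else "vector"

# Layer 1 tests: no LLM calls, millisecond runtime.
def test_extract_year_range():
    assert extract_year_range("paintings made 1600-1650") == (1600, 1650)
    assert extract_year_range("who painted this?") is None

def test_route_query():
    assert route_query("sculpture from 1800-1850") == "sparql"
    assert route_query("describe this artifact") == "vector"
```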
Layer 2: DSPy Module Tests (< 2 minutes)
LLM-powered tests with cached/mock responses:
- Intent classification evaluation
- Entity extraction accuracy
- SPARQL generation correctness
- Answer relevance scoring
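The cached/mock pattern can be sketched as follows: a stub stands in for the compiled DSPy intent classifier, returning canned responses so the evaluation loop runs without live LLM calls. The queries, intents, and the 85% gate value mirror the quality gates below, but the stub itself is an illustrative assumption:

```python
from types import SimpleNamespace

# Stand-in for a compiled DSPy intent classifier; in the real Layer 2 suite
# this would be the actual module running against cached LM responses.
CACHED_INTENTS = {
    "who painted the night watch?": "artist_lookup",
    "show artworks from 1600-1650": "period_search",
    "what is in room 5?": "location_search",
}

class StubIntentClassifier:
    def __call__(self, query: str):
        return SimpleNamespace(intent=CACHED_INTENTS.get(query.lower(), "unknown"))

def intent_accuracy(classifier, dev_set) -> float:
    """Fraction of dev examples whose predicted intent matches the gold label."""
    hits = sum(classifier(ex["query"]).intent == ex["intent"] for ex in dev_set)
    return hits / len(dev_set)

def test_intent_accuracy_meets_gate():
    dev_set = [
        {"query": "Who painted The Night Watch?", "intent": "artist_lookup"},
        {"query": "Show artworks from 1600-1650", "intent": "period_search"},
        {"query": "What is in Room 5?", "intent": "location_search"},
    ]
    # 85% matches the Layer 2 quality gate for intent accuracy.
    assert intent_accuracy(StubIntentClassifier(), dev_set) >= 0.85
```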
Layer 3: Integration Tests (< 5 minutes)
Live system tests with real LLM/database:
- End-to-end RAG pipeline
- Oxigraph SPARQL queries
- API response validation
- Streaming endpoint behavior
Layer 4: Comprehensive Evaluation (nightly)
Full dataset evaluation:
- All training examples
- Edge cases and adversarial queries
- Performance benchmarking
- Regression detection
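Regression detection in the nightly run reduces to comparing the current scores against a version-controlled baseline. A minimal sketch, assuming both are stored as metric-name-to-score mappings:

```python
def detect_regressions(baseline: dict, current: dict, tolerance: float = 0.02):
    """Return metrics whose score dropped more than `tolerance` below baseline.

    Both dicts map metric name -> score in [0, 1]. The baseline would be a
    version-controlled artifact (e.g. JSON) from the last passing nightly run;
    the 0.02 tolerance is an illustrative default.
    """
    return {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }
```

A non-empty result would fail the nightly job (or, per the quality gates, emit a warning), with the offending metric names in the report.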
Quality Gates
| Layer | Metric | Threshold | Block Merge? |
|---|---|---|---|
| 1 | Unit test pass rate | 100% | Yes |
| 2 | Intent accuracy | ≥ 85% | Yes |
| 2 | Entity F1 | ≥ 80% | Yes |
| 3 | API health | All endpoints OK | Yes |
| 3 | Sample query success | ≥ 90% | Yes |
| 4 | Overall RAG score | ≥ 75% | No (warning) |
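The table above maps directly onto a small gate-evaluation routine. In this sketch each gate is a named threshold with a blocking flag; the binary "all endpoints OK" check is modeled as a 0/1 score for uniformity, which is an assumption about how the real pipeline would encode it:

```python
from dataclasses import dataclass

@dataclass
class Gate:
    metric: str
    threshold: float
    blocking: bool  # True -> failing this gate blocks the merge

# Gates mirroring the quality-gate table (Layer 4 is warn-only);
# api_health is the boolean endpoint check encoded as 0.0/1.0.
GATES = [
    Gate("unit_pass_rate", 1.00, blocking=True),
    Gate("intent_accuracy", 0.85, blocking=True),
    Gate("entity_f1", 0.80, blocking=True),
    Gate("api_health", 1.00, blocking=True),
    Gate("sample_query_success", 0.90, blocking=True),
    Gate("overall_rag_score", 0.75, blocking=False),
]

def evaluate_gates(scores: dict) -> tuple[bool, list[str]]:
    """Return (merge_allowed, warnings) for one evaluation run's scores."""
    merge_allowed, warnings = True, []
    for gate in GATES:
        if scores.get(gate.metric, 0.0) >= gate.threshold:
            continue
        if gate.blocking:
            merge_allowed = False
        else:
            warnings.append(f"{gate.metric} below {gate.threshold}")
    return merge_allowed, warnings
```

In CI, a `False` first element would fail the workflow step and block the merge, while warnings would surface in the job summary without blocking.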
File Structure
tests/
├── dspy_gitops/
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures for DSPy
│ ├── datasets/
│ │ ├── heritage_rag_dev.json # Development set (20-50 examples)
│ │ ├── heritage_rag_test.json # Test set (100+ examples)
│ │ └── golden_queries.yaml # Golden test cases
│ ├── metrics/
│ │ ├── __init__.py
│ │ ├── intent_accuracy.py # Intent classification metric
│ │ ├── entity_extraction.py # Entity F1 metric
│ │ ├── sparql_correctness.py # SPARQL validation metric
│ │ └── answer_relevance.py # LLM-as-judge metric
│ ├── test_layer1_unit.py # Fast unit tests
│ ├── test_layer2_dspy.py # DSPy module tests
│ ├── test_layer3_integration.py # Integration tests
│ └── test_layer4_comprehensive.py # Full evaluation
└── pytest.ini                        # Pytest configuration
.github/
└── workflows/
    └── dspy-eval.yml                 # GitHub Actions workflow (must live at repo root)
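The `conftest.py` in that tree could look like the following sketch: it loads the version-controlled dev set and registers layer markers so CI can select layers (e.g. `pytest -m layer1` on PRs, `-m layer4` nightly). Fixture and helper names are hypothetical:

```python
# conftest.py (sketch): shared fixtures for the dspy_gitops suite.
import json
from pathlib import Path

import pytest

DATASET_DIR = Path(__file__).parent / "datasets"

def load_examples(path: Path) -> list:
    """Load a version-controlled JSON dataset file into plain dicts."""
    return json.loads(path.read_text())

@pytest.fixture(scope="session")
def dev_set():
    # heritage_rag_dev.json is the 20-50 example development set above.
    return load_examples(DATASET_DIR / "heritage_rag_dev.json")

def pytest_configure(config):
    # Register one marker per testing layer so CI can select subsets.
    for layer in ("layer1", "layer2", "layer3", "layer4"):
        config.addinivalue_line("markers", f"{layer}: {layer} tests")
```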
Implementation Plan
See subsequent documents:
- 01-datasets.md - Test dataset design
- 02-metrics.md - Evaluation metrics implementation
- 03-github-actions.md - GitOps workflow configuration
- 04-pytest-integration.md - pytest + DSPy integration
- 05-quality-gates.md - Quality gate implementation