glam/docs/plan/dspy_gitops/00-overview.md

DSPy + GitOps Testing Framework for Heritage RAG

Executive Summary

This document outlines a comprehensive testing framework that integrates DSPy evaluation patterns with GitOps CI/CD workflows for the GLAM Heritage RAG system. The framework ensures that LLM-powered components maintain quality through automated testing on every code change.

Goals

  1. Continuous Evaluation: Run DSPy evaluations on every PR to catch regressions
  2. Reproducible Results: Version-control test datasets, prompts, and evaluation metrics
  3. Quality Gates: Block merges when evaluation scores drop below thresholds
  4. Progressive Testing: Fast smoke tests on PR, comprehensive evals on merge
  5. Observability: Track evaluation metrics over time

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        GitOps Testing Pipeline                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   GitHub    │───>│   GitHub    │───>│   DSPy Evaluation       │  │
│  │   Push/PR   │    │   Actions   │    │   (pytest + dspy.Eval)  │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                 │                    │
│                                                 ▼                    │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    Evaluation Layers                         │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │  Layer 1: Unit Tests (pytest)                                │    │
│  │  - Intent classification accuracy                            │    │
│  │  - Entity extraction correctness                             │    │
│  │  - SPARQL syntax validation                                  │    │
│  │                                                              │    │
│  │  Layer 2: Integration Tests (DSPy Evaluate)                  │    │
│  │  - End-to-end RAG pipeline                                   │    │
│  │  - Answer quality metrics                                    │    │
│  │  - Response latency                                          │    │
│  │                                                              │    │
│  │  Layer 3: Smoke Tests (Live API)                             │    │
│  │  - API endpoint health                                       │    │
│  │  - SPARQL endpoint connectivity                              │    │
│  │  - Sample queries                                            │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                 │                    │
│                                                 ▼                    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Quality   │───>│   Merge     │───>│   Deploy to             │  │
│  │   Gate      │    │   Allowed   │    │   Production            │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key Insights from Research

DSPy Evaluation Best Practices

  1. Start Simple, Iterate: Begin with exact match/accuracy metrics, evolve to LLM-as-judge
  2. Metric as DSPy Program: Complex metrics can be DSPy modules themselves
  3. Dev Set Size: 20-200 examples for development, more for optimization
  4. Multiple Properties: Metrics should check multiple aspects (correctness, groundedness, format)
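The "metric as a DSPy program" idea can be sketched as a plain callable following DSPy's metric signature `(example, prediction, trace=None)`. The field names (`gold_answer`, `context`, `answer`) are illustrative stand-ins for the project's real `dspy.Example` fields, and `SimpleNamespace` stands in for DSPy's example/prediction objects:

```python
from types import SimpleNamespace


def heritage_answer_metric(example, pred, trace=None):
    """Composite metric checking multiple properties of a RAG answer.

    Follows DSPy's metric signature (example, prediction, trace) -> score.
    Checks correctness, groundedness, and format in one pass.
    """
    answer = pred.answer.lower()
    checks = [
        example.gold_answer.lower() in answer,                    # correctness
        any(c.lower() in answer for c in example.context),        # groundedness
        len(answer.split()) <= 150,                               # format: concise
    ]
    score = sum(checks) / len(checks)
    # During optimization DSPy passes a trace and expects a bool;
    # during evaluation a float score is returned.
    return score >= 1.0 if trace is not None else score


# Illustrative example/prediction pair (field names are assumptions):
example = SimpleNamespace(gold_answer="Rembrandt",
                          context=["Rembrandt van Rijn"])
pred = SimpleNamespace(answer="The painting is by Rembrandt van Rijn.")
print(heritage_answer_metric(example, pred))
```

Because the metric is itself a program, it can later be upgraded to call an LLM judge for the groundedness check without changing its signature.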

GitOps LLM Testing Patterns

  1. Arize Phoenix Pattern: Experiments API + GitHub Actions for automated evals
  2. Promptfoo Pattern: YAML-based test definitions with assertions
  3. CircleCI Evals Orb Pattern: Declarative evaluation jobs with CEL assertions
  4. Evidently Pattern: Regression testing with LLM-as-judge metrics

Testing Layers

Layer 1: Fast Unit Tests (< 10 seconds)

No LLM calls. Test deterministic logic:

  • Query routing rules
  • Entity extraction regex patterns
  • SPARQL template selection
  • Response formatting
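A Layer-1 test might look like the following sketch. The year-extraction regex and function name are hypothetical stand-ins for the real deterministic logic; the point is that no LLM call is involved, so the tests run in milliseconds:

```python
import re

# Hypothetical Layer-1 logic: a regex that pulls 4-digit years out of a
# heritage query, e.g. for SPARQL template selection. No LLM calls.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(query: str) -> list[str]:
    """Return all plausible year mentions (1000-2099) in the query."""
    return YEAR_PATTERN.findall(query)


def test_extract_years_single():
    assert extract_years("paintings from 1642") == ["1642"]


def test_extract_years_none():
    assert extract_years("show me Dutch portraits") == []


# pytest collects these automatically; called directly here for illustration:
test_extract_years_single()
test_extract_years_none()
print("layer 1 checks passed")
```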

Layer 2: DSPy Module Tests (< 2 minutes)

LLM-powered tests with cached/mock responses:

  • Intent classification evaluation
  • Entity extraction accuracy
  • SPARQL generation correctness
  • Answer relevance scoring
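With cached responses standing in for the LM, a Layer-2 evaluation reduces to replaying stored outputs against gold labels. In the real suite this would go through `dspy.Evaluate` with a configured model; the queries and intent labels below are illustrative:

```python
# Cached-response dict standing in for the LLM, keyed by query, so the
# test runs without network calls (keys and labels are illustrative):
CACHED_RESPONSES = {
    "Who painted the Night Watch?": "artist_lookup",
    "List museums in Amsterdam": "institution_search",
    "What is GLAM?": "definition",
}

# (query, gold intent) pairs, a miniature stand-in for the dev set:
DEV_SET = [
    ("Who painted the Night Watch?", "artist_lookup"),
    ("List museums in Amsterdam", "institution_search"),
    ("What is GLAM?", "definition"),
]


def classify_intent(query: str) -> str:
    """Mocked DSPy module: replay the cached LLM response."""
    return CACHED_RESPONSES[query]


def evaluate_intent(dev_set) -> float:
    """Fraction of dev examples where the predicted intent matches gold."""
    hits = sum(classify_intent(q) == gold for q, gold in dev_set)
    return hits / len(dev_set)


accuracy = evaluate_intent(DEV_SET)
print(f"intent accuracy: {accuracy:.2%}")
```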

Layer 3: Integration Tests (< 5 minutes)

Live system tests with real LLM/database:

  • End-to-end RAG pipeline
  • Oxigraph SPARQL queries
  • API response validation
  • Streaming endpoint behavior
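A minimal Layer-3 smoke check could probe each endpoint and fail the job on any non-200 response. The URLs below are placeholders, not the real deployment, and the probe function is injected so the gating logic itself stays testable offline:

```python
from urllib.request import urlopen

# Illustrative endpoints; the real list would come from configuration.
ENDPOINTS = [
    "http://localhost:8000/health",
    "http://localhost:7878/query",   # Oxigraph SPARQL endpoint
]


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def smoke_test(endpoints, probe=http_probe) -> dict[str, bool]:
    """Probe every endpoint; the CI job fails if any comes back False."""
    return {url: probe(url) for url in endpoints}


# In CI this runs against the live stack; a stub probe shows the shape:
report = smoke_test(ENDPOINTS, probe=lambda url: True)
assert all(report.values())
```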

Layer 4: Comprehensive Evaluation (nightly)

Full dataset evaluation:

  • All training examples
  • Edge cases and adversarial queries
  • Performance benchmarking
  • Regression detection
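Regression detection for the nightly run can be as simple as diffing the current scores against a stored baseline and flagging drops beyond a noise tolerance. The tolerance value and metric names below are illustrative:

```python
# Allow 2 points of run-to-run noise before calling it a regression
# (illustrative value; the real tolerance would be tuned per metric).
TOLERANCE = 0.02


def find_regressions(baseline: dict[str, float],
                     current: dict[str, float]) -> dict[str, tuple]:
    """Return {metric: (baseline, current)} for metrics that dropped
    below baseline minus tolerance; a missing metric counts as 0.0."""
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if current.get(name, 0.0) < baseline[name] - TOLERANCE
    }


baseline = {"intent_accuracy": 0.90, "entity_f1": 0.84}
current = {"intent_accuracy": 0.91, "entity_f1": 0.78}
print(find_regressions(baseline, current))  # entity_f1 dropped
```

The baseline itself can be a version-controlled JSON file updated on merge, which keeps regression thresholds reproducible along with the datasets.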

Quality Gates

Layer   Metric                 Threshold          Block Merge?
1       Unit test pass rate    100%               Yes
2       Intent accuracy        ≥ 85%              Yes
2       Entity F1              ≥ 80%              Yes
3       API health             All endpoints OK   Yes
3       Sample query success   ≥ 90%              Yes
4       Overall RAG score      ≥ 75%              No (warning)
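The gate table above translates directly into a small check that CI can run after evaluation: each gate carries a threshold and a blocking flag, and only blocking failures fail the job. The thresholds mirror the table; the score keys are illustrative:

```python
# (name, threshold, blocks_merge) tuples mirroring the quality-gate table:
GATES = [
    ("unit_pass_rate", 1.00, True),
    ("intent_accuracy", 0.85, True),
    ("entity_f1", 0.80, True),
    ("sample_query_success", 0.90, True),
    ("overall_rag_score", 0.75, False),
]


def apply_gates(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, messages) for a set of evaluation scores.

    A missing score counts as 0.0, so forgetting to report a gated
    metric fails closed rather than open.
    """
    allowed, messages = True, []
    for name, threshold, blocks in GATES:
        score = scores.get(name, 0.0)
        if score < threshold:
            tag = "FAIL" if blocks else "WARN"
            messages.append(f"{tag}: {name} {score:.2f} < {threshold:.2f}")
            allowed = allowed and not blocks
    return allowed, messages


ok, msgs = apply_gates({
    "unit_pass_rate": 1.0, "intent_accuracy": 0.91, "entity_f1": 0.83,
    "sample_query_success": 0.95, "overall_rag_score": 0.70,
})
print(ok, msgs)  # merge allowed, with a warning on the RAG score
```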

File Structure

tests/
├── dspy_gitops/
│   ├── __init__.py
│   ├── conftest.py              # Pytest fixtures for DSPy
│   ├── datasets/
│   │   ├── heritage_rag_dev.json    # Development set (20-50 examples)
│   │   ├── heritage_rag_test.json   # Test set (100+ examples)
│   │   └── golden_queries.yaml      # Golden test cases
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── intent_accuracy.py       # Intent classification metric
│   │   ├── entity_extraction.py     # Entity F1 metric
│   │   ├── sparql_correctness.py    # SPARQL validation metric
│   │   └── answer_relevance.py      # LLM-as-judge metric
│   ├── test_layer1_unit.py          # Fast unit tests
│   ├── test_layer2_dspy.py          # DSPy module tests
│   ├── test_layer3_integration.py   # Integration tests
│   └── test_layer4_comprehensive.py # Full evaluation
├── pytest.ini                       # Pytest configuration
└── .github/
    └── workflows/
        └── dspy-eval.yml            # GitHub Actions workflow

Implementation Plan

See subsequent documents:

  • 01-datasets.md - Test dataset design
  • 02-metrics.md - Evaluation metrics implementation
  • 03-github-actions.md - GitOps workflow configuration
  • 04-pytest-integration.md - pytest + DSPy integration
  • 05-quality-gates.md - Quality gate implementation