# DSPy RAG Automation: TDD and Fine-tuning Plan
## Executive Summary
This plan outlines a comprehensive approach to automatically testing and improving the GLAM Heritage RAG system using:
- DSPy Evaluate - Programmatic evaluation of RAG responses against golden datasets
- DSPy Optimizers (MIPROv2, BootstrapFewShot) - Automatic prompt tuning and few-shot optimization
- Playwright - End-to-end UI validation to ensure answers render correctly in the frontend
- Continuous Improvement Loop - Automated feedback cycle from production to training
## Problem Statement
Currently, RAG quality is validated manually by checking the frontend output. This approach:
- Doesn't scale with the number of query types
- Misses regression bugs when prompts or pipelines change
- Provides no mechanism for systematic improvement
- Lacks quantitative metrics for answer quality
## Solution Architecture
```
┌─────────────────────┐
│   Golden Dataset    │
│   (200+ examples)   │
└─────────┬───────────┘
          │
┌─────────────────────────┼─────────────────────────┐
│                         │                         │
▼                         ▼                         ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ DSPy Evaluate │ │    MIPROv2    │ │  Playwright   │
│   (Metrics)   │ │   Optimizer   │ │   E2E Tests   │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
        │                 │                 │
        │                 │                 │
        ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Quality Score │ │   Optimized   │ │  UI Renders   │
│ Report (JSON) │ │    Prompts    │ │  Correctly?   │
└───────────────┘ └───────────────┘ └───────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                          ▼
              ┌─────────────────────┐
              │   CI/CD Pipeline    │
              │  (Pass/Fail Gate)   │
              └─────────────────────┘
```
## Key Components
### 1. Golden Dataset (`data/rag_eval/golden_dataset.json`)
- 200+ curated question-answer pairs
- Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
- Includes expected SPARQL patterns and slot values
- Multiple languages (NL, EN)
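One entry could look like the following Python sketch; every field name and value here (the ID, slot keys, the Utrecht example, the SPARQL substring) is an illustrative assumption, not the final schema:

```python
# Sketch of one golden-dataset entry. Field names and values are
# placeholders; the real schema lives in data/rag_eval/golden_dataset.json.
GOLDEN_EXAMPLE = {
    "id": "count-001",                      # hypothetical stable ID
    "language": "nl",                       # NL or EN
    "query_type": "COUNT",                  # COUNT | LIST | DETAIL | TEMPORAL | PERSON
    "question": "Hoeveel musea zijn er in Utrecht?",
    "expected_slots": {"institution_type": "museum", "location": "Utrecht"},
    # Substring the generated SPARQL is expected to contain:
    "expected_sparql_pattern": "SELECT (COUNT(",
}
```

Keeping the expected SPARQL as a pattern rather than a full query lets the dataset survive harmless query rewrites.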
### 2. DSPy Metrics (`backend/rag/evaluation/metrics.py`)
- `count_accuracy` - COUNT queries return correct numbers
- `list_completeness` - LIST queries return expected institutions
- `slot_extraction_accuracy` - Correct `institution_type` and `location` extracted
- `answer_relevance` - LLM-judged answer quality (0-1 score)
- `faithfulness` - Answer grounded in retrieved context
### 3. DSPy Optimizer (`backend/rag/evaluation/optimizer.py`)
- MIPROv2 for instruction + few-shot optimization
- BootstrapFewShot for rapid example collection
- Automatic prompt improvement based on metrics
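A minimal sketch of the optimization step, assuming DSPy's `MIPROv2` and `BootstrapFewShot` optimizers; the function name and arguments (`program`, `trainset`) are placeholders, and `dspy` is imported lazily so the sketch itself carries no hard dependency:

```python
def tune_prompts(program, trainset, metric, use_mipro=True):
    """Return a compiled copy of `program` with optimized prompts/demos.

    MIPROv2 tunes instructions plus few-shot demos; BootstrapFewShot only
    collects demos, which is faster and cheaper for quick iterations.
    Exact constructor options should be checked against the installed
    DSPy version.
    """
    import dspy  # lazy import: only needed when optimization actually runs

    if use_mipro:
        optimizer = dspy.MIPROv2(metric=metric, auto="light")
    else:
        optimizer = dspy.BootstrapFewShot(metric=metric,
                                          max_bootstrapped_demos=4)
    return optimizer.compile(program, trainset=trainset)
```

The metric passed in would be one of the functions from `backend/rag/evaluation/metrics.py`, so optimization optimizes exactly what the evaluation harness measures.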
### 4. Playwright E2E Suite (`tests/e2e/rag_tests.py`)
- Test answer text appears in chat bubble
- Test institution cards render with correct data
- Test map markers appear for geographic queries
- Test debug panel shows correct SPARQL
### 5. CI/CD Integration (`scripts/run_rag_evaluation.py`)
- Run full evaluation suite on each PR
- Block merge if quality drops below threshold
- Generate regression reports
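The pass/fail gate can be a small stdlib-only script; the report shape (`{"metrics": {...}}`) and the threshold values below are illustrative assumptions, not agreed targets:

```python
import json
import sys

# Placeholder thresholds; real values would be agreed with the team.
THRESHOLDS = {"count_accuracy": 0.90, "faithfulness": 0.85}


def gate(report: dict) -> int:
    """Return a process exit code: 0 if every metric clears its threshold."""
    failures = {
        name: (report["metrics"].get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if report["metrics"].get(name, 0.0) < minimum
    }
    for name, (score, minimum) in failures.items():
        print(f"FAIL {name}: {score:.2f} < {minimum:.2f}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        sys.exit(gate(json.load(f)))
```

A nonzero exit code is all CI needs: the workflow step fails, and the merge is blocked.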
## Benefits
| Metric | Before | After |
|---|---|---|
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |
## Implementation Timeline
| Phase | Duration | Deliverables |
|---|---|---|
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |
Total: ~10 days
## Files in This Plan
- `00-overview.md` - This document
- `01-golden-dataset.md` - Dataset structure and examples
- `02-dspy-metrics.md` - Custom metric function specifications
- `03-evaluation-harness.md` - Running evaluations programmatically
- `04-playwright-tests.md` - E2E UI test specifications
- `05-optimizer-config.md` - MIPROv2 optimization settings
- `06-ci-cd-integration.md` - GitHub Actions workflow