# DSPy RAG Automation: TDD and Fine-tuning Plan
## Executive Summary
This plan outlines a comprehensive approach to automatically testing and improving the GLAM Heritage RAG system using:
- DSPy Evaluate - Programmatic evaluation of RAG responses against golden datasets
- DSPy Optimizers (MIPROv2, BootstrapFewShot) - Automatic prompt tuning and few-shot optimization
- Playwright - End-to-end UI validation to ensure answers render correctly in the frontend
- Continuous Improvement Loop - Automated feedback cycle from production to training
## Problem Statement
Currently, RAG quality is validated manually by checking the frontend output. This approach:
- Doesn't scale with the number of query types
- Misses regression bugs when prompts or pipelines change
- Provides no mechanism for systematic improvement
- Lacks quantitative metrics for answer quality
## Solution Architecture
```
┌─────────────────────┐
│   Golden Dataset    │
│   (200+ examples)   │
└─────────┬───────────┘
          │
┌─────────────────────────┼─────────────────────────┐
│                         │                         │
▼                         ▼                         ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ DSPy Evaluate │ │    MIPROv2    │ │  Playwright   │
│   (Metrics)   │ │   Optimizer   │ │   E2E Tests   │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
        │                 │                 │
        │                 │                 │
        ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Quality Score │ │   Optimized   │ │  UI Renders   │
│ Report (JSON) │ │    Prompts    │ │  Correctly?   │
└───────────────┘ └───────────────┘ └───────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                          ▼
              ┌─────────────────────┐
              │   CI/CD Pipeline    │
              │  (Pass/Fail Gate)   │
              └─────────────────────┘
```
## Key Components
### 1. Golden Dataset (`data/rag_eval/golden_dataset.json`)
- 200+ curated question-answer pairs
- Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
- Includes expected SPARQL patterns and slot values
- Multiple languages (NL, EN)
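One entry could look like the following Python sketch; every field name and value here (the ID, slot keys, the Utrecht example, the SPARQL substring) is an illustrative assumption, not the final schema:

```python
# Sketch of one golden-dataset entry. Field names and values are
# placeholders; the real schema lives in data/rag_eval/golden_dataset.json.
GOLDEN_EXAMPLE = {
    "id": "count-001",                      # hypothetical stable ID
    "language": "nl",                       # NL or EN
    "query_type": "COUNT",                  # COUNT | LIST | DETAIL | TEMPORAL | PERSON
    "question": "Hoeveel musea zijn er in Utrecht?",
    "expected_slots": {"institution_type": "museum", "location": "Utrecht"},
    # Substring the generated SPARQL is expected to contain:
    "expected_sparql_pattern": "SELECT (COUNT(",
}
```

Keeping the expected SPARQL as a pattern rather than a full query lets the dataset survive harmless query rewrites.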
### 2. DSPy Metrics (`backend/rag/evaluation/metrics.py`)
- `count_accuracy` - COUNT queries return correct numbers
- `list_completeness` - LIST queries return expected institutions
- `slot_extraction_accuracy` - Correct `institution_type` and `location` extracted
- `answer_relevance` - LLM-judged answer quality (0-1 score)
- `faithfulness` - Answer grounded in retrieved context
### 3. DSPy Optimizer (`backend/rag/evaluation/optimizer.py`)
- MIPROv2 for instruction + few-shot optimization
- BootstrapFewShot for rapid example collection
- Automatic prompt improvement based on metrics
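A minimal sketch of the optimization step, assuming DSPy's `MIPROv2` and `BootstrapFewShot` optimizers; the function name and arguments (`program`, `trainset`) are placeholders, and `dspy` is imported lazily so the sketch itself carries no hard dependency:

```python
def tune_prompts(program, trainset, metric, use_mipro=True):
    """Return a compiled copy of `program` with optimized prompts/demos.

    MIPROv2 tunes instructions plus few-shot demos; BootstrapFewShot only
    collects demos, which is faster and cheaper for quick iterations.
    Exact constructor options should be checked against the installed
    DSPy version.
    """
    import dspy  # lazy import: only needed when optimization actually runs

    if use_mipro:
        optimizer = dspy.MIPROv2(metric=metric, auto="light")
    else:
        optimizer = dspy.BootstrapFewShot(metric=metric,
                                          max_bootstrapped_demos=4)
    return optimizer.compile(program, trainset=trainset)
```

The metric passed in would be one of the functions from `backend/rag/evaluation/metrics.py`, so optimization optimizes exactly what the evaluation harness measures.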
### 4. Playwright E2E Suite (`tests/e2e/rag_tests.py`)
- Test answer text appears in chat bubble
- Test institution cards render with correct data
- Test map markers appear for geographic queries
- Test debug panel shows correct SPARQL
### 5. CI/CD Integration (`scripts/run_rag_evaluation.py`)
- Run full evaluation suite on each PR
- Block merge if quality drops below threshold
- Generate regression reports
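The pass/fail gate can be a small stdlib-only script; the report shape (`{"metrics": {...}}`) and the threshold values below are illustrative assumptions, not agreed targets:

```python
import json
import sys

# Placeholder thresholds; real values would be agreed with the team.
THRESHOLDS = {"count_accuracy": 0.90, "faithfulness": 0.85}


def gate(report: dict) -> int:
    """Return a process exit code: 0 if every metric clears its threshold."""
    failures = {
        name: (report["metrics"].get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if report["metrics"].get(name, 0.0) < minimum
    }
    for name, (score, minimum) in failures.items():
        print(f"FAIL {name}: {score:.2f} < {minimum:.2f}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        sys.exit(gate(json.load(f)))
```

A nonzero exit code is all CI needs: the workflow step fails, and the merge is blocked.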
## Benefits
| Metric | Before | After |
|---|---|---|
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |
## Implementation Timeline
| Phase | Duration | Deliverables |
|---|---|---|
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |
Total: ~10 days
## Files in This Plan
- `00-overview.md` - This document
- `01-golden-dataset.md` - Dataset structure and examples
- `02-dspy-metrics.md` - Custom metric function specifications
- `03-evaluation-harness.md` - Running evaluations programmatically
- `04-playwright-tests.md` - E2E UI test specifications
- `05-optimizer-config.md` - MIPROv2 optimization settings
- `06-ci-cd-integration.md` - GitHub Actions workflow