# DSPy RAG Automation: TDD and Fine-tuning Plan

## Executive Summary

This plan outlines a comprehensive approach to automatically testing and improving the GLAM Heritage RAG system using:

1. **DSPy Evaluate** - programmatic evaluation of RAG responses against a golden dataset
2. **DSPy Optimizers** (MIPROv2, BootstrapFewShot) - automatic prompt tuning and few-shot optimization
3. **Playwright** - end-to-end UI validation to ensure answers render correctly in the frontend
4. **Continuous Improvement Loop** - automated feedback cycle from production back to training

## Problem Statement

Currently, RAG quality is validated manually by checking the frontend output. This approach:

- Doesn't scale with the number of query types
- Misses regression bugs when prompts or pipelines change
- Provides no mechanism for systematic improvement
- Lacks quantitative metrics for answer quality

## Solution Architecture

```
                       ┌─────────────────────┐
                       │   Golden Dataset    │
                       │   (200+ examples)   │
                       └──────────┬──────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ DSPy Evaluate │         │    MIPROv2    │         │  Playwright   │
│   (Metrics)   │         │   Optimizer   │         │   E2E Tests   │
└───────┬───────┘         └───────┬───────┘         └───────┬───────┘
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ Quality Score │         │   Optimized   │         │  UI Renders   │
│ Report (JSON) │         │    Prompts    │         │  Correctly?   │
└───────┬───────┘         └───────┬───────┘         └───────┬───────┘
        │                         │                         │
        └─────────────────────────┼─────────────────────────┘
                                  │
                                  ▼
                       ┌─────────────────────┐
                       │   CI/CD Pipeline    │
                       │  (Pass/Fail Gate)   │
                       └─────────────────────┘
```

## Key Components

### 1. Golden Dataset (`data/rag_eval/golden_dataset.json`)

- 200+ curated question-answer pairs
- Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
- Includes expected SPARQL patterns and slot values
- Multiple languages (NL, EN)

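One way to structure an entry, sketched below; the field names (`query_type`, `expected_slots`, `expected_sparql_pattern`) and the sample values are illustrative assumptions, not a confirmed schema:

```python
import json

# Hypothetical record shape for data/rag_eval/golden_dataset.json.
# Every field name and value here is an illustrative assumption.
example_entry = {
    "question": "Hoeveel musea zijn er in Utrecht?",  # Dutch (NL) example
    "language": "nl",
    "query_type": "COUNT",
    "expected_slots": {"institution_type": "museum", "location": "Utrecht"},
    "expected_sparql_pattern": "SELECT (COUNT(?org) AS ?n) WHERE { ... }",
}

def load_golden_dataset(path):
    """Load the curated question-answer pairs from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```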
### 2. DSPy Metrics (`backend/rag/evaluation/metrics.py`)

- `count_accuracy` - COUNT queries return the correct number
- `list_completeness` - LIST queries return the expected institutions
- `slot_extraction_accuracy` - correct `institution_type` and `location` extracted
- `answer_relevance` - LLM-judged answer quality (0-1 score)
- `faithfulness` - answer is grounded in the retrieved context

### 3. DSPy Optimizer (`backend/rag/evaluation/optimizer.py`)

- MIPROv2 for joint instruction and few-shot optimization
- BootstrapFewShot for rapid example collection
- Automatic prompt improvement driven by the metrics

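Conceptually, the optimizer scores candidate instructions against the golden set and keeps the winner; MIPROv2 additionally proposes the candidate instructions and few-shot demos itself. A toy sketch of that selection loop, where `run_pipeline` and `metric` are placeholder callables rather than the project's real ones:

```python
def best_instruction(candidates, examples, run_pipeline, metric):
    """Return the (instruction, mean_score) pair that scores highest
    on the golden examples under the given metric."""
    scored = []
    for instruction in candidates:
        # Run the pipeline with this candidate instruction on every
        # golden example and average the metric over the results.
        scores = [metric(ex, run_pipeline(instruction, ex)) for ex in examples]
        scored.append((instruction, sum(scores) / len(scores)))
    return max(scored, key=lambda pair: pair[1])
```

With DSPy itself, this whole loop is what a single `optimizer.compile(program, trainset=...)` call automates.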
### 4. Playwright E2E Suite (`tests/e2e/rag_tests.py`)

- Answer text appears in the chat bubble
- Institution cards render with the correct data
- Map markers appear for geographic queries
- Debug panel shows the correct SPARQL

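The chat-bubble check could be written as a helper against Playwright's sync `page` API; the `data-testid` selectors and the frontend URL below are assumptions about this UI, not details confirmed by the plan:

```python
# Hypothetical selectors; adjust to the real frontend markup.
CHAT_INPUT = '[data-testid="chat-input"]'
ANSWER_BUBBLE = '[data-testid="chat-bubble"]'

def assert_answer_renders(page, question, expected_substring,
                          base_url="http://localhost:3000"):
    """Submit a question through the chat input and assert the
    answer bubble contains the expected text."""
    page.goto(base_url)
    page.fill(CHAT_INPUT, question)
    page.keyboard.press("Enter")
    page.wait_for_selector(ANSWER_BUBBLE)
    assert expected_substring.lower() in page.inner_text(ANSWER_BUBBLE).lower()
```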
### 5. CI/CD Integration (`scripts/run_rag_evaluation.py`)

- Run the full evaluation suite on each PR
- Block merges if quality drops below threshold
- Generate regression reports

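The gate itself can be a small comparison of the evaluation report against per-metric floors; the threshold values below are illustrative assumptions, not agreed targets:

```python
# Illustrative minimum scores; tune per metric once baselines exist.
THRESHOLDS = {"count_accuracy": 0.90, "faithfulness": 0.85}

def quality_gate(report):
    """Return the metrics that fall below their threshold; an empty
    list means the PR passes the gate."""
    return [name for name, floor in THRESHOLDS.items()
            if report.get(name, 0.0) < floor]
```

`run_rag_evaluation.py` would then exit non-zero whenever `quality_gate(report)` is non-empty, which is what blocks the merge in CI.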
## Benefits

| Metric | Before | After |
|--------|--------|-------|
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |

## Implementation Timeline

| Phase | Duration | Deliverables |
|-------|----------|--------------|
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |

**Total: ~10 days**

## Files in This Plan

1. `00-overview.md` - This document
2. `01-golden-dataset.md` - Dataset structure and examples
3. `02-dspy-metrics.md` - Custom metric function specifications
4. `03-evaluation-harness.md` - Running evaluations programmatically
5. `04-playwright-tests.md` - E2E UI test specifications
6. `05-optimizer-config.md` - MIPROv2 optimization settings
7. `06-ci-cd-integration.md` - GitHub Actions workflow