glam/docs/plan/dspy_rag_automation/00-overview.md
2026-01-09 20:35:19 +01:00
# DSPy RAG Automation: TDD and Fine-tuning Plan
## Executive Summary
This plan outlines a comprehensive approach to automatically test and improve the GLAM Heritage RAG system using:
1. **DSPy Evaluate** - Programmatic evaluation of RAG responses against golden datasets
2. **DSPy Optimizers** (MIPROv2, BootstrapFewShot) - Automatic prompt tuning and few-shot optimization
3. **Playwright** - End-to-end UI validation to ensure answers render correctly in the frontend
4. **Continuous Improvement Loop** - Automated feedback cycle from production to training
## Problem Statement
Currently, RAG quality is validated manually by checking the frontend output. This approach:
- Doesn't scale with the number of query types
- Misses regression bugs when prompts or pipelines change
- Provides no mechanism for systematic improvement
- Lacks quantitative metrics for answer quality
## Solution Architecture
```
                  ┌─────────────────────┐
                  │   Golden Dataset    │
                  │   (200+ examples)   │
                  └─────────┬───────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ DSPy Evaluate │   │    MIPROv2    │   │  Playwright   │
│   (Metrics)   │   │   Optimizer   │   │   E2E Tests   │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Quality Score │   │   Optimized   │   │  UI Renders   │
│ Report (JSON) │   │    Prompts    │   │  Correctly?   │
└───────────────┘   └───────────────┘   └───────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │   CI/CD Pipeline    │
                 │  (Pass/Fail Gate)   │
                 └─────────────────────┘
```
## Key Components
### 1. Golden Dataset (`data/rag_eval/golden_dataset.json`)
- 200+ curated question-answer pairs
- Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
- Includes expected SPARQL patterns and slot values
- Multiple languages (NL, EN)
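A dataset entry combining the points above might look like the sketch below. The field names, the Dutch example question, and the validator are illustrative assumptions; the final schema lives in `01-golden-dataset.md`.

```python
import json

# Illustrative golden-dataset entry; field names and values are
# assumptions, not the final schema from 01-golden-dataset.md.
EXAMPLE_ENTRY = {
    "id": "count-001",
    "question": "Hoeveel musea zijn er in Amsterdam?",  # NL example
    "language": "nl",
    "query_type": "COUNT",
    "expected_slots": {"institution_type": "museum", "location": "Amsterdam"},
    "expected_sparql_pattern": "SELECT (COUNT(?inst) AS ?count)",
    "expected_answer": "61",  # placeholder count for illustration
}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of schema violations for one dataset entry."""
    errors = []
    required = {"id", "question", "language", "query_type", "expected_answer"}
    errors += [f"missing field: {f}" for f in sorted(required - entry.keys())]
    if entry.get("query_type") not in {"COUNT", "LIST", "DETAIL", "TEMPORAL", "PERSON"}:
        errors.append(f"unknown query_type: {entry.get('query_type')}")
    if entry.get("language") not in {"nl", "en"}:
        errors.append(f"unsupported language: {entry.get('language')}")
    return errors

if __name__ == "__main__":
    print(json.dumps(EXAMPLE_ENTRY, ensure_ascii=False, indent=2))
    print("violations:", validate_entry(EXAMPLE_ENTRY))
```

Running a validator like this over `golden_dataset.json` before every evaluation run keeps schema drift from silently corrupting metrics.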
### 2. DSPy Metrics (`backend/rag/evaluation/metrics.py`)
- `count_accuracy` - COUNT queries return correct numbers
- `list_completeness` - LIST queries return expected institutions
- `slot_extraction_accuracy` - Correct institution_type, location extracted
- `answer_relevance` - LLM-judged answer quality (0-1 score)
- `faithfulness` - Answer grounded in retrieved context
### 3. DSPy Optimizer (`backend/rag/evaluation/optimizer.py`)
- MIPROv2 for instruction + few-shot optimization
- BootstrapFewShot for rapid example collection
- Automatic prompt improvement based on metrics
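The optimization step could be wrapped as follows. This is a sketch, not the final `optimizer.py`: it requires `dspy` installed, and the `auto="light"` budget and demo count are assumed starting points, not tuned settings.

```python
# Sketch of the prompt-tuning entry point; dspy is imported lazily so the
# module loads (and can be tested) without the dependency present.

def tune_prompts(program, trainset, metric, fast: bool = False):
    """Return a copy of `program` with optimized instructions and demos."""
    import dspy

    if fast:
        # BootstrapFewShot: cheap pass that only collects working few-shot
        # examples, useful for rapid iteration during development.
        optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
        return optimizer.compile(program, trainset=trainset)

    # MIPROv2: jointly proposes instructions and selects few-shot demos,
    # scored against the metric on the training set.
    optimizer = dspy.MIPROv2(metric=metric, auto="light")
    return optimizer.compile(program, trainset=trainset)
```

The `fast` flag mirrors the plan's two-tier approach: BootstrapFewShot during development, MIPROv2 for the heavier scheduled optimization runs.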
### 4. Playwright E2E Suite (`tests/e2e/rag_tests.py`)
- Test answer text appears in chat bubble
- Test institution cards render with correct data
- Test map markers appear for geographic queries
- Test debug panel shows correct SPARQL
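One such check might look like the sketch below. It requires `playwright` installed (with browsers fetched via `playwright install`); the dev-server URL and the CSS selectors (`textarea.chat-input`, `.chat-bubble.assistant`) are assumptions about the frontend markup, not verified selectors.

```python
APP_URL = "http://localhost:3000"  # assumed dev-server address

def check_answer_renders(question: str, expected_fragment: str) -> bool:
    """Ask a question in the chat UI and verify the answer bubble shows it."""
    from playwright.sync_api import sync_playwright  # lazy: optional dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(APP_URL)
        page.fill("textarea.chat-input", question)
        page.keyboard.press("Enter")
        # Wait for the assistant's newest bubble, then inspect its text.
        bubble = page.locator(".chat-bubble.assistant").last
        bubble.wait_for(timeout=15_000)
        ok = expected_fragment in bubble.inner_text()
        browser.close()
        return ok
```

The same pattern extends to the card, map-marker, and debug-panel checks: locate the element, wait for it, and assert on its content.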
### 5. CI/CD Integration (`scripts/run_rag_evaluation.py`)
- Run full evaluation suite on each PR
- Block merge if quality drops below threshold
- Generate regression reports
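The pass/fail gate at the heart of `run_rag_evaluation.py` can be a few lines: read the metric report emitted by the evaluation harness and exit non-zero if any metric falls below its floor. The report keys and threshold values here are assumptions to be tuned against real baselines.

```python
import json
import sys

# Per-metric minimums; placeholder values, to be calibrated on real runs.
THRESHOLDS = {"count_accuracy": 0.90, "answer_relevance": 0.85, "faithfulness": 0.85}

def gate(report: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return a list of failed metrics (empty list means the gate passes)."""
    return [
        f"{name}: {report.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if report.get(name, 0.0) < minimum
    ]

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        report = json.load(f)
    for line in gate(report):
        print("FAIL", line)
    sys.exit(1 if gate(report) else 0)  # non-zero exit blocks the merge
```

Treating a missing metric as 0.0 (via `report.get(name, 0.0)`) means a report that silently drops a metric fails the gate rather than passing by omission.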
## Benefits
| Metric | Before | After |
|--------|--------|-------|
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |
## Implementation Timeline
| Phase | Duration | Deliverables |
|-------|----------|--------------|
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |
**Total: ~10 days**
## Files in This Plan
1. `00-overview.md` - This document
2. `01-golden-dataset.md` - Dataset structure and examples
3. `02-dspy-metrics.md` - Custom metric function specifications
4. `03-evaluation-harness.md` - Running evaluations programmatically
5. `04-playwright-tests.md` - E2E UI test specifications
6. `05-optimizer-config.md` - MIPROv2 optimization settings
7. `06-ci-cd-integration.md` - GitHub Actions workflow