# DSPy RAG Automation: TDD and Fine-tuning Plan

## Executive Summary

This plan outlines a comprehensive approach to automatically testing and improving the GLAM Heritage RAG system using:

1. **DSPy Evaluate** - programmatic evaluation of RAG responses against a golden dataset
2. **DSPy Optimizers** (MIPROv2, BootstrapFewShot) - automatic prompt tuning and few-shot optimization
3. **Playwright** - end-to-end UI validation to ensure answers render correctly in the frontend
4. **Continuous Improvement Loop** - automated feedback cycle from production back to training

## Problem Statement

Currently, RAG quality is validated manually by checking the frontend output. This approach:

- Doesn't scale with the number of query types
- Misses regression bugs when prompts or pipelines change
- Provides no mechanism for systematic improvement
- Lacks quantitative metrics for answer quality

## Solution Architecture

```
                       ┌─────────────────────┐
                       │   Golden Dataset    │
                       │   (200+ examples)   │
                       └──────────┬──────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ DSPy Evaluate │         │    MIPROv2    │         │  Playwright   │
│   (Metrics)   │         │   Optimizer   │         │   E2E Tests   │
└───────┬───────┘         └───────┬───────┘         └───────┬───────┘
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ Quality Score │         │   Optimized   │         │  UI Renders   │
│ Report (JSON) │         │    Prompts    │         │  Correctly?   │
└───────┬───────┘         └───────┬───────┘         └───────┬───────┘
        │                         │                         │
        └─────────────────────────┼─────────────────────────┘
                                  │
                                  ▼
                       ┌─────────────────────┐
                       │   CI/CD Pipeline    │
                       │  (Pass/Fail Gate)   │
                       └─────────────────────┘
```

## Key Components

### 1. Golden Dataset (`data/rag_eval/golden_dataset.json`)

- 200+ curated question-answer pairs
- Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
- Includes expected SPARQL patterns and slot values
- Multiple languages (NL, EN)

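One way to structure an entry, sketched below; the field names (`query_type`, `expected_slots`, `expected_sparql_pattern`) and the sample values are illustrative assumptions, not a confirmed schema:

```python
import json

# Hypothetical record shape for data/rag_eval/golden_dataset.json.
# Every field name and value here is an illustrative assumption.
example_entry = {
    "question": "Hoeveel musea zijn er in Utrecht?",  # Dutch (NL) example
    "language": "nl",
    "query_type": "COUNT",
    "expected_slots": {"institution_type": "museum", "location": "Utrecht"},
    "expected_sparql_pattern": "SELECT (COUNT(?org) AS ?n) WHERE { ... }",
}

def load_golden_dataset(path):
    """Load the curated question-answer pairs from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```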
### 2. DSPy Metrics (`backend/rag/evaluation/metrics.py`)

- `count_accuracy` - COUNT queries return the correct number
- `list_completeness` - LIST queries return the expected institutions
- `slot_extraction_accuracy` - correct `institution_type` and `location` extracted
- `answer_relevance` - LLM-judged answer quality (0-1 score)
- `faithfulness` - answer is grounded in the retrieved context

### 3. DSPy Optimizer (`backend/rag/evaluation/optimizer.py`)

- MIPROv2 for joint instruction and few-shot optimization
- BootstrapFewShot for rapid example collection
- Automatic prompt improvement driven by the metrics

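Conceptually, the optimizer scores candidate instructions against the golden set and keeps the winner; MIPROv2 additionally proposes the candidate instructions and few-shot demos itself. A toy sketch of that selection loop, where `run_pipeline` and `metric` are placeholder callables rather than the project's real ones:

```python
def best_instruction(candidates, examples, run_pipeline, metric):
    """Return the (instruction, mean_score) pair that scores highest
    on the golden examples under the given metric."""
    scored = []
    for instruction in candidates:
        # Run the pipeline with this candidate instruction on every
        # golden example and average the metric over the results.
        scores = [metric(ex, run_pipeline(instruction, ex)) for ex in examples]
        scored.append((instruction, sum(scores) / len(scores)))
    return max(scored, key=lambda pair: pair[1])
```

With DSPy itself, this whole loop is what a single `optimizer.compile(program, trainset=...)` call automates.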
### 4. Playwright E2E Suite (`tests/e2e/rag_tests.py`)

- Answer text appears in the chat bubble
- Institution cards render with the correct data
- Map markers appear for geographic queries
- Debug panel shows the correct SPARQL

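The chat-bubble check could be written as a helper against Playwright's sync `page` API; the `data-testid` selectors and the frontend URL below are assumptions about this UI, not details confirmed by the plan:

```python
# Hypothetical selectors; adjust to the real frontend markup.
CHAT_INPUT = '[data-testid="chat-input"]'
ANSWER_BUBBLE = '[data-testid="chat-bubble"]'

def assert_answer_renders(page, question, expected_substring,
                          base_url="http://localhost:3000"):
    """Submit a question through the chat input and assert the
    answer bubble contains the expected text."""
    page.goto(base_url)
    page.fill(CHAT_INPUT, question)
    page.keyboard.press("Enter")
    page.wait_for_selector(ANSWER_BUBBLE)
    assert expected_substring.lower() in page.inner_text(ANSWER_BUBBLE).lower()
```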
### 5. CI/CD Integration (`scripts/run_rag_evaluation.py`)

- Run the full evaluation suite on each PR
- Block merges if quality drops below threshold
- Generate regression reports

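The gate itself can be a small comparison of the evaluation report against per-metric floors; the threshold values below are illustrative assumptions, not agreed targets:

```python
# Illustrative minimum scores; tune per metric once baselines exist.
THRESHOLDS = {"count_accuracy": 0.90, "faithfulness": 0.85}

def quality_gate(report):
    """Return the metrics that fall below their threshold; an empty
    list means the PR passes the gate."""
    return [name for name, floor in THRESHOLDS.items()
            if report.get(name, 0.0) < floor]
```

`run_rag_evaluation.py` would then exit non-zero whenever `quality_gate(report)` is non-empty, which is what blocks the merge in CI.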
## Benefits

| Metric | Before | After |
|--------|--------|-------|
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |

## Implementation Timeline

| Phase | Duration | Deliverables |
|-------|----------|--------------|
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |

**Total: ~10 days**

## Files in This Plan

1. `00-overview.md` - This document
2. `01-golden-dataset.md` - Dataset structure and examples
3. `02-dspy-metrics.md` - Custom metric function specifications
4. `03-evaluation-harness.md` - Running evaluations programmatically
5. `04-playwright-tests.md` - E2E UI test specifications
6. `05-optimizer-config.md` - MIPROv2 optimization settings
7. `06-ci-cd-integration.md` - GitHub Actions workflow