glam/docs/plan/dspy_rag_automation/00-overview.md
2026-01-09 20:35:19 +01:00


DSPy RAG Automation: TDD and Fine-tuning Plan

Executive Summary

This plan outlines a comprehensive approach to automatically testing and improving the GLAM Heritage RAG system using:

  1. DSPy Evaluate - Programmatic evaluation of RAG responses against golden datasets
  2. DSPy Optimizers (MIPROv2, BootstrapFewShot) - Automatic prompt tuning and few-shot optimization
  3. Playwright - End-to-end UI validation to ensure answers render correctly in the frontend
  4. Continuous Improvement Loop - Automated feedback cycle from production to training

Problem Statement

Currently, RAG quality is validated manually by checking the frontend output. This approach:

  • Doesn't scale with the number of query types
  • Misses regression bugs when prompts or pipelines change
  • Provides no mechanism for systematic improvement
  • Lacks quantitative metrics for answer quality

Solution Architecture

                                    ┌─────────────────────┐
                                    │   Golden Dataset    │
                                    │   (200+ examples)   │
                                    └─────────┬───────────┘
                                              │
                    ┌─────────────────────────┼─────────────────────────┐
                    │                         │                         │
                    ▼                         ▼                         ▼
           ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
           │ DSPy Evaluate │         │   MIPROv2     │         │  Playwright   │
           │  (Metrics)    │         │  Optimizer    │         │  E2E Tests    │
           └───────┬───────┘         └───────┬───────┘         └───────┬───────┘
                   │                         │                         │
                   │                         │                         │
                   ▼                         ▼                         ▼
           ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
           │ Quality Score │         │  Optimized    │         │  UI Renders   │
           │ Report (JSON) │         │  Prompts      │         │  Correctly?   │
           └───────────────┘         └───────────────┘         └───────────────┘
                   │                         │                         │
                   └─────────────────────────┼─────────────────────────┘
                                             │
                                             ▼
                                    ┌─────────────────────┐
                                    │   CI/CD Pipeline    │
                                    │   (Pass/Fail Gate)  │
                                    └─────────────────────┘

Key Components

1. Golden Dataset (data/rag_eval/golden_dataset.json)

  • 200+ curated question-answer pairs
  • Covers all query types: COUNT, LIST, DETAIL, TEMPORAL, PERSON
  • Includes expected SPARQL patterns and slot values
  • Multiple languages (NL, EN)

2. DSPy Metrics (backend/rag/evaluation/metrics.py)

  • count_accuracy - COUNT queries return correct numbers
  • list_completeness - LIST queries return expected institutions
  • slot_extraction_accuracy - Correct institution_type, location extracted
  • answer_relevance - LLM-judged answer quality (0-1 score)
  • faithfulness - Answer grounded in retrieved context

3. DSPy Optimizer (backend/rag/evaluation/optimizer.py)

  • MIPROv2 for instruction + few-shot optimization
  • BootstrapFewShot for rapid example collection
  • Automatic prompt improvement based on metrics
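
A sketch of how the optimizer step might be wired up. `HeritageRAG`, `trainset`, the metric import path, and the output location are placeholders from this plan, not existing code; only the `MIPROv2` calls are the DSPy API:

```python
import dspy

# Hypothetical: HeritageRAG is the DSPy module wrapping the RAG pipeline,
# trainset is a list of dspy.Example built from the golden dataset, and
# count_accuracy is one of the metric functions from metrics.py.
optimizer = dspy.MIPROv2(metric=count_accuracy, auto="light")
optimized_rag = optimizer.compile(HeritageRAG(), trainset=trainset)

# Persist the tuned instructions/demos so CI can load a fixed artifact.
optimized_rag.save("artifacts/optimized_rag.json")
```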

4. Playwright E2E Suite (tests/e2e/rag_tests.py)

  • Test answer text appears in chat bubble
  • Test institution cards render with correct data
  • Test map markers appear for geographic queries
  • Test debug panel shows correct SPARQL
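
The first check above could be sketched with Playwright's sync API as follows. The dev-server URL and the `data-testid` selectors are assumptions about the frontend, not verified selectors:

```python
from playwright.sync_api import sync_playwright, expect

def test_count_query_renders_answer():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")  # assumed dev-server URL
        page.fill("[data-testid='chat-input']", "Hoeveel musea zijn er in Utrecht?")
        page.press("[data-testid='chat-input']", "Enter")
        # Answer text should appear in the most recent chat bubble;
        # expect() auto-waits, covering the RAG round-trip latency.
        expect(page.locator("[data-testid='chat-bubble']").last).to_contain_text("musea")
        browser.close()
```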

5. CI/CD Integration (scripts/run_rag_evaluation.py)

  • Run full evaluation suite on each PR
  • Block merge if quality drops below threshold
  • Generate regression reports
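
The pass/fail gate reduces to comparing the evaluation report against per-metric thresholds and exiting nonzero on any miss. A minimal sketch, assuming the report is a flat JSON object of metric scores; the threshold values are illustrative:

```python
import json
import sys

# Illustrative thresholds; real values would live in CI configuration.
THRESHOLDS = {
    "count_accuracy": 0.90,
    "list_completeness": 0.85,
    "faithfulness": 0.80,
}

def gate(report: dict) -> list:
    """Return the names of metrics that fall below their threshold."""
    return [
        name for name, minimum in THRESHOLDS.items()
        if report.get(name, 0.0) < minimum
    ]

def main(path: str) -> int:
    with open(path) as f:
        report = json.load(f)
    failures = gate(report)
    for name in failures:
        print(f"FAIL: {name}={report.get(name, 0.0):.2f} < {THRESHOLDS[name]:.2f}")
    return 1 if failures else 0  # nonzero exit status blocks the merge

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```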

Benefits

| Metric | Before | After |
| --- | --- | --- |
| Manual testing time | 30 min/day | 0 min/day |
| Regression detection | Manual | Automatic |
| Answer quality tracking | None | Quantitative |
| Prompt optimization | Manual iteration | Automatic |
| UI validation | Manual | Automatic |

Implementation Timeline

| Phase | Duration | Deliverables |
| --- | --- | --- |
| 1. Golden Dataset | 2 days | 200+ curated examples |
| 2. DSPy Metrics | 2 days | Custom metric functions |
| 3. Evaluation Harness | 1 day | Run evaluations, generate reports |
| 4. Playwright Suite | 2 days | E2E tests for chat UI |
| 5. Optimizer Integration | 2 days | MIPROv2 prompt tuning |
| 6. CI/CD Pipeline | 1 day | Automated quality gates |

Total: ~10 days

Files in This Plan

  1. 00-overview.md - This document
  2. 01-golden-dataset.md - Dataset structure and examples
  3. 02-dspy-metrics.md - Custom metric function specifications
  4. 03-evaluation-harness.md - Running evaluations programmatically
  5. 04-playwright-tests.md - E2E UI test specifications
  6. 05-optimizer-config.md - MIPROv2 optimization settings
  7. 06-ci-cd-integration.md - GitHub Actions workflow