
# DSPy Test Datasets

## Overview

Test datasets for DSPy evaluation follow the `dspy.Example` format and are versioned alongside the code. This makes evaluations reproducible and lets regressions be detected against a fixed baseline.

## Dataset Structure

### Example Format

```python
dspy.Example(
    # Inputs (used by the module)
    question="Hoeveel musea zijn er in Amsterdam?",
    language="nl",

    # Expected outputs (used by metrics)
    expected_intent="statistical",
    expected_entities=["amsterdam", "musea"],
    expected_entity_type="institution",
    expected_sources=["oxigraph", "sparql"],
    gold_answer="Er zijn 127 musea in Amsterdam.",  # Optional
).with_inputs("question", "language")
```

### Dataset Files

```json
// datasets/heritage_rag_dev.json
{
  "version": "1.0.0",
  "created_at": "2026-01-11",
  "description": "Development set for Heritage RAG evaluation",
  "examples": [
    {
      "id": "nl_statistical_001",
      "question": "Hoeveel musea zijn er in Amsterdam?",
      "language": "nl",
      "expected_intent": "statistical",
      "expected_entities": ["amsterdam", "musea"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "count_query",
      "difficulty": "easy"
    },
    {
      "id": "nl_entity_001",
      "question": "Waar is het Rijksmuseum gevestigd?",
      "language": "nl",
      "expected_intent": "entity_lookup",
      "expected_entities": ["rijksmuseum"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "location_query",
      "difficulty": "easy"
    }
  ]
}
```

## Dataset Categories

### By Intent Type

| Intent | Count | Description |
|---|---|---|
| statistical | 15 | Count/aggregate queries |
| entity_lookup | 20 | Find specific institution |
| temporal | 10 | Time-based queries (mergers, founding) |
| geographic | 15 | Location-based queries |
| exploration | 10 | Open-ended discovery |
| comparative | 5 | Compare multiple institutions |
| relational | 10 | Network/relationship queries |

### By Language

| Language | Count | Description |
|---|---|---|
| nl | 50 | Dutch queries |
| en | 35 | English queries |

### By Difficulty

| Level | Count | Description |
|---|---|---|
| easy | 30 | Single-hop, clear intent |
| medium | 35 | Multi-entity, some ambiguity |
| hard | 20 | Multi-hop, complex temporal |
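The `category` and `difficulty` tags make it easy to slice the dev set for targeted runs (e.g. a quick smoke test over only the hard examples). A minimal stdlib sketch, with plain dicts standing in for loaded examples and `slice_by` as a hypothetical helper:

```python
from collections import defaultdict

def slice_by(examples: list[dict], field: str) -> dict[str, list[dict]]:
    """Group examples by a tag field such as 'difficulty' or 'category'."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        groups[ex[field]].append(ex)
    return dict(groups)

examples = [
    {"id": "nl_statistical_001", "difficulty": "easy", "category": "count_query"},
    {"id": "nl_entity_001", "difficulty": "easy", "category": "location_query"},
    {"id": "nl_temporal_001", "difficulty": "hard", "category": "merger_query"},
]

by_difficulty = slice_by(examples, "difficulty")
hard_set = by_difficulty.get("hard", [])  # evaluate only this subset in a smoke run
```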

## Golden Test Cases

High-priority test cases that must always pass:

```yaml
# datasets/golden_queries.yaml
golden_tests:
  - id: "golden_amsterdam_museums"
    question: "Hoeveel musea zijn er in Amsterdam?"
    language: nl
    expected_intent: statistical
    min_answer_contains: ["127", "musea", "Amsterdam"]
    max_latency_ms: 5000

  - id: "golden_rijksmuseum_location"
    question: "Waar is het Rijksmuseum gevestigd?"
    language: nl
    expected_intent: entity_lookup
    expected_answer_contains: ["Amsterdam", "Museumstraat"]

  - id: "golden_nationaal_archief_staff"
    question: "Who works at the Nationaal Archief?"
    language: en
    expected_intent: entity_lookup
    expected_entity_type: person
    expected_sources: ["oxigraph"]
```
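Asserting these golden cases amounts to substring and latency checks against a pipeline answer. A minimal sketch: `check_golden` is a hypothetical helper, and the answer and latency are hard-coded here where a real test would obtain them by running the RAG pipeline:

```python
def check_golden(answer: str, latency_ms: float, case: dict) -> list[str]:
    """Return failure messages for one golden case (empty list = pass)."""
    failures = []
    # Both YAML keys are optional, so fall back to empty lists.
    tokens = case.get("min_answer_contains", []) + case.get("expected_answer_contains", [])
    for token in tokens:
        if token.lower() not in answer.lower():
            failures.append(f"answer missing {token!r}")
    if "max_latency_ms" in case and latency_ms > case["max_latency_ms"]:
        failures.append(f"latency {latency_ms}ms exceeds {case['max_latency_ms']}ms")
    return failures

case = {
    "id": "golden_amsterdam_museums",
    "min_answer_contains": ["127", "musea", "Amsterdam"],
    "max_latency_ms": 5000,
}
failures = check_golden("Er zijn 127 musea in Amsterdam.", 1200.0, case)
# → [] (all checks pass)
```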

## Dataset Versioning

```text
datasets/
├── v1.0/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
├── v1.1/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
└── current -> v1.1/  # Symlink to latest
```

## Loading Datasets

```python
# tests/dspy_gitops/conftest.py
import json
from pathlib import Path

import dspy
import pytest

DATASETS_DIR = Path(__file__).parent / "datasets"

def load_dspy_examples(filename: str) -> list[dspy.Example]:
    """Load examples from a JSON file into DSPy format."""
    with open(DATASETS_DIR / filename) as f:
        data = json.load(f)

    examples = []
    for ex in data["examples"]:
        example = dspy.Example(
            question=ex["question"],
            language=ex["language"],
            expected_intent=ex["expected_intent"],
            expected_entities=ex["expected_entities"],
            expected_entity_type=ex.get("expected_entity_type", "institution"),
            expected_sources=ex["expected_sources"],
        ).with_inputs("question", "language")
        examples.append(example)

    return examples

@pytest.fixture
def dev_set() -> list[dspy.Example]:
    """Load development set for evaluation."""
    return load_dspy_examples("heritage_rag_dev.json")

@pytest.fixture
def test_set() -> list[dspy.Example]:
    """Load test set for final evaluation."""
    return load_dspy_examples("heritage_rag_test.json")

@pytest.fixture
def golden_tests() -> list[dict]:
    """Load golden test cases."""
    import yaml
    with open(DATASETS_DIR / "golden_queries.yaml") as f:
        data = yaml.safe_load(f)
    return data["golden_tests"]
```

## Dataset Maintenance

### Adding New Examples

  1. Identify gap in coverage (new intent type, edge case, etc.)
  2. Create example with all required fields
  3. Validate against schema
  4. Run full evaluation to establish baseline
  5. Commit with description of addition
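Step 3's schema check can be as simple as asserting the required keys from the dataset file format shown above. A minimal stdlib sketch; `check_schema` is a hypothetical helper and the field set is inferred from the example JSON, not an official schema:

```python
# Required keys, per the heritage_rag_dev.json example format above.
REQUIRED_FIELDS = {
    "id", "question", "language", "expected_intent",
    "expected_entities", "expected_sources",
}

def check_schema(raw: dict) -> list[str]:
    """Report missing required keys for each raw example in a dataset file."""
    problems = []
    for ex in raw.get("examples", []):
        missing = REQUIRED_FIELDS - ex.keys()
        if missing:
            problems.append(f"{ex.get('id', '<no id>')}: missing {sorted(missing)}")
    return problems
```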

### Updating Expected Outputs

Only update expected outputs when:

  1. Ground truth changes (data update)
  2. Intent classification rules change
  3. New sources become available

Never update expected outputs just to make tests pass.

## Dataset Quality Checks

```python
from collections import Counter

# Valid intents, matching the "By Intent Type" table above.
VALID_INTENTS = {
    "statistical", "entity_lookup", "temporal", "geographic",
    "exploration", "comparative", "relational",
}

def validate_dataset(examples: list[dspy.Example]) -> list[str]:
    """Validate dataset quality; returns a list of error messages."""
    errors = []

    # Check required fields
    for ex in examples:
        if not ex.question:
            errors.append(f"Missing question: {ex}")
        if ex.language not in ("nl", "en"):
            errors.append(f"Invalid language: {ex.language}")
        if ex.expected_intent not in VALID_INTENTS:
            errors.append(f"Invalid intent: {ex.expected_intent}")

    # Check for duplicates
    questions = [ex.question for ex in examples]
    if len(questions) != len(set(questions)):
        errors.append("Duplicate questions found")

    # Check balance: every intent needs at least a handful of examples
    intent_counts = Counter(ex.expected_intent for ex in examples)
    if intent_counts and min(intent_counts.values()) < 3:
        errors.append(f"Underrepresented intents: {intent_counts}")

    return errors
```