
# DSPy Test Datasets

## Overview

Test datasets for DSPy evaluation follow the `dspy.Example` format and are versioned alongside the code. This makes evaluations reproducible and lets regressions be detected against a fixed baseline.

## Dataset Structure

### Example Format

```python
dspy.Example(
    # Inputs (used by the module)
    question="Hoeveel musea zijn er in Amsterdam?",
    language="nl",

    # Expected outputs (used by metrics)
    expected_intent="statistical",
    expected_entities=["amsterdam", "musea"],
    expected_entity_type="institution",
    expected_sources=["oxigraph", "sparql"],
    gold_answer="Er zijn 127 musea in Amsterdam.",  # Optional
).with_inputs("question", "language")
```

### Dataset Files

```json
// datasets/heritage_rag_dev.json
{
  "version": "1.0.0",
  "created_at": "2026-01-11",
  "description": "Development set for Heritage RAG evaluation",
  "examples": [
    {
      "id": "nl_statistical_001",
      "question": "Hoeveel musea zijn er in Amsterdam?",
      "language": "nl",
      "expected_intent": "statistical",
      "expected_entities": ["amsterdam", "musea"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "count_query",
      "difficulty": "easy"
    },
    {
      "id": "nl_entity_001",
      "question": "Waar is het Rijksmuseum gevestigd?",
      "language": "nl",
      "expected_intent": "entity_lookup",
      "expected_entities": ["rijksmuseum"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "location_query",
      "difficulty": "easy"
    }
  ]
}
```

## Dataset Categories

### By Intent Type

| Intent | Count | Description |
|---|---|---|
| statistical | 15 | Count/aggregate queries |
| entity_lookup | 20 | Find specific institution |
| temporal | 10 | Time-based queries (mergers, founding) |
| geographic | 15 | Location-based queries |
| exploration | 10 | Open-ended discovery |
| comparative | 5 | Compare multiple institutions |
| relational | 10 | Network/relationship queries |

### By Language

| Language | Count | Description |
|---|---|---|
| nl | 50 | Dutch queries |
| en | 35 | English queries |

### By Difficulty

| Level | Count | Description |
|---|---|---|
| easy | 30 | Single-hop, clear intent |
| medium | 35 | Multi-entity, some ambiguity |
| hard | 20 | Multi-hop, complex temporal |
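The `category` and `difficulty` tags make it easy to slice the dev set for targeted runs (e.g. a quick smoke test over only the hard examples). A minimal stdlib sketch, with plain dicts standing in for loaded examples and `slice_by` as a hypothetical helper:

```python
from collections import defaultdict

def slice_by(examples: list[dict], field: str) -> dict[str, list[dict]]:
    """Group examples by a tag field such as 'difficulty' or 'category'."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        groups[ex[field]].append(ex)
    return dict(groups)

examples = [
    {"id": "nl_statistical_001", "difficulty": "easy", "category": "count_query"},
    {"id": "nl_entity_001", "difficulty": "easy", "category": "location_query"},
    {"id": "nl_temporal_001", "difficulty": "hard", "category": "merger_query"},
]

by_difficulty = slice_by(examples, "difficulty")
hard_set = by_difficulty.get("hard", [])  # evaluate only this subset in a smoke run
```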

## Golden Test Cases

High-priority test cases that must always pass:

```yaml
# datasets/golden_queries.yaml
golden_tests:
  - id: "golden_amsterdam_museums"
    question: "Hoeveel musea zijn er in Amsterdam?"
    language: nl
    expected_intent: statistical
    min_answer_contains: ["127", "musea", "Amsterdam"]
    max_latency_ms: 5000

  - id: "golden_rijksmuseum_location"
    question: "Waar is het Rijksmuseum gevestigd?"
    language: nl
    expected_intent: entity_lookup
    expected_answer_contains: ["Amsterdam", "Museumstraat"]

  - id: "golden_nationaal_archief_staff"
    question: "Who works at the Nationaal Archief?"
    language: en
    expected_intent: entity_lookup
    expected_entity_type: person
    expected_sources: ["oxigraph"]
```
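Asserting these golden cases amounts to substring and latency checks against a pipeline answer. A minimal sketch: `check_golden` is a hypothetical helper, and the answer and latency are hard-coded here where a real test would obtain them by running the RAG pipeline:

```python
def check_golden(answer: str, latency_ms: float, case: dict) -> list[str]:
    """Return failure messages for one golden case (empty list = pass)."""
    failures = []
    # Both YAML keys are optional, so fall back to empty lists.
    tokens = case.get("min_answer_contains", []) + case.get("expected_answer_contains", [])
    for token in tokens:
        if token.lower() not in answer.lower():
            failures.append(f"answer missing {token!r}")
    if "max_latency_ms" in case and latency_ms > case["max_latency_ms"]:
        failures.append(f"latency {latency_ms}ms exceeds {case['max_latency_ms']}ms")
    return failures

case = {
    "id": "golden_amsterdam_museums",
    "min_answer_contains": ["127", "musea", "Amsterdam"],
    "max_latency_ms": 5000,
}
failures = check_golden("Er zijn 127 musea in Amsterdam.", 1200.0, case)
# → [] (all checks pass)
```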

## Dataset Versioning

```text
datasets/
├── v1.0/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
├── v1.1/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
└── current -> v1.1/  # Symlink to latest
```

## Loading Datasets

```python
# tests/dspy_gitops/conftest.py
import json
from pathlib import Path

import dspy
import pytest

DATASETS_DIR = Path(__file__).parent / "datasets"

def load_dspy_examples(filename: str) -> list[dspy.Example]:
    """Load examples from a JSON file into DSPy format."""
    with open(DATASETS_DIR / filename) as f:
        data = json.load(f)

    examples = []
    for ex in data["examples"]:
        example = dspy.Example(
            question=ex["question"],
            language=ex["language"],
            expected_intent=ex["expected_intent"],
            expected_entities=ex["expected_entities"],
            expected_entity_type=ex.get("expected_entity_type", "institution"),
            expected_sources=ex["expected_sources"],
        ).with_inputs("question", "language")
        examples.append(example)

    return examples

@pytest.fixture
def dev_set() -> list[dspy.Example]:
    """Load development set for evaluation."""
    return load_dspy_examples("heritage_rag_dev.json")

@pytest.fixture
def test_set() -> list[dspy.Example]:
    """Load test set for final evaluation."""
    return load_dspy_examples("heritage_rag_test.json")

@pytest.fixture
def golden_tests() -> list[dict]:
    """Load golden test cases."""
    import yaml
    with open(DATASETS_DIR / "golden_queries.yaml") as f:
        data = yaml.safe_load(f)
    return data["golden_tests"]
```

## Dataset Maintenance

### Adding New Examples

  1. Identify gap in coverage (new intent type, edge case, etc.)
  2. Create example with all required fields
  3. Validate against schema
  4. Run full evaluation to establish baseline
  5. Commit with description of addition
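Step 3's schema check can be as simple as asserting the required keys from the dataset file format shown above. A minimal stdlib sketch; `check_schema` is a hypothetical helper and the field set is inferred from the example JSON, not an official schema:

```python
# Required keys, per the heritage_rag_dev.json example format above.
REQUIRED_FIELDS = {
    "id", "question", "language", "expected_intent",
    "expected_entities", "expected_sources",
}

def check_schema(raw: dict) -> list[str]:
    """Report missing required keys for each raw example in a dataset file."""
    problems = []
    for ex in raw.get("examples", []):
        missing = REQUIRED_FIELDS - ex.keys()
        if missing:
            problems.append(f"{ex.get('id', '<no id>')}: missing {sorted(missing)}")
    return problems
```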

### Updating Expected Outputs

Only update expected outputs when:

  1. Ground truth changes (data update)
  2. Intent classification rules change
  3. New sources become available

Never update expected outputs just to make tests pass.

## Dataset Quality Checks

```python
from collections import Counter

# Valid intents, matching the "By Intent Type" table above.
VALID_INTENTS = {
    "statistical", "entity_lookup", "temporal", "geographic",
    "exploration", "comparative", "relational",
}

def validate_dataset(examples: list[dspy.Example]) -> list[str]:
    """Validate dataset quality; returns a list of error messages."""
    errors = []

    # Check required fields
    for ex in examples:
        if not ex.question:
            errors.append(f"Missing question: {ex}")
        if ex.language not in ("nl", "en"):
            errors.append(f"Invalid language: {ex.language}")
        if ex.expected_intent not in VALID_INTENTS:
            errors.append(f"Invalid intent: {ex.expected_intent}")

    # Check for duplicates
    questions = [ex.question for ex in examples]
    if len(questions) != len(set(questions)):
        errors.append("Duplicate questions found")

    # Check balance: every intent needs at least a handful of examples
    intent_counts = Counter(ex.expected_intent for ex in examples)
    if intent_counts and min(intent_counts.values()) < 3:
        errors.append(f"Underrepresented intents: {intent_counts}")

    return errors
```