# DSPy Test Datasets
## Overview
Test datasets for DSPy evaluation follow the `dspy.Example` format and are versioned alongside code. This enables reproducible evaluations and regression detection.
## Dataset Structure
### Example Format
```python
dspy.Example(
    # Inputs (used by the module)
    question="Hoeveel musea zijn er in Amsterdam?",
    language="nl",

    # Expected outputs (used by metrics)
    expected_intent="statistical",
    expected_entities=["amsterdam", "musea"],
    expected_entity_type="institution",
    expected_sources=["oxigraph", "sparql"],
    gold_answer="Er zijn 127 musea in Amsterdam.",  # Optional
).with_inputs("question", "language")
```
### Dataset Files
```json
// datasets/heritage_rag_dev.json
{
  "version": "1.0.0",
  "created_at": "2026-01-11",
  "description": "Development set for Heritage RAG evaluation",
  "examples": [
    {
      "id": "nl_statistical_001",
      "question": "Hoeveel musea zijn er in Amsterdam?",
      "language": "nl",
      "expected_intent": "statistical",
      "expected_entities": ["amsterdam", "musea"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "count_query",
      "difficulty": "easy"
    },
    {
      "id": "nl_entity_001",
      "question": "Waar is het Rijksmuseum gevestigd?",
      "language": "nl",
      "expected_intent": "entity_lookup",
      "expected_entities": ["rijksmuseum"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "location_query",
      "difficulty": "easy"
    }
  ]
}
```
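
Before committing a new file, each record can be sanity-checked for the required fields. A minimal stdlib-only sketch (field names taken from the example file above; `check_example` is a hypothetical helper, not part of the project):

```python
import json

# Required per-record fields, matching the example dataset file above.
REQUIRED_FIELDS = {
    "id", "question", "language", "expected_intent",
    "expected_entities", "expected_sources", "category", "difficulty",
}

def check_example(ex: dict) -> list[str]:
    """Return the required fields missing from one example record."""
    return sorted(REQUIRED_FIELDS - ex.keys())

record = json.loads("""
{
  "id": "nl_statistical_001",
  "question": "Hoeveel musea zijn er in Amsterdam?",
  "language": "nl",
  "expected_intent": "statistical",
  "expected_entities": ["amsterdam", "musea"],
  "expected_sources": ["oxigraph"],
  "category": "count_query",
  "difficulty": "easy"
}
""")
print(check_example(record))  # []
```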
## Dataset Categories
### By Intent Type
| Intent | Count | Description |
|--------|-------|-------------|
| statistical | 15 | Count/aggregate queries |
| entity_lookup | 20 | Find a specific institution |
| temporal | 10 | Time-based queries (mergers, founding) |
| geographic | 15 | Location-based queries |
| exploration | 10 | Open-ended discovery |
| comparative | 5 | Compare multiple institutions |
| relational | 10 | Network/relationship queries |
### By Language
| Language | Count | Description |
|----------|-------|-------------|
| nl | 50 | Dutch queries |
| en | 35 | English queries |
### By Difficulty
| Level | Count | Description |
|-------|-------|-------------|
| easy | 30 | Single-hop, clear intent |
| medium | 35 | Multi-entity, some ambiguity |
| hard | 20 | Multi-hop, complex temporal |
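
The three breakdowns above should partition the same pool of examples (85 in total). A quick consistency check on the counts:

```python
# Counts copied from the three category tables above.
intent = {"statistical": 15, "entity_lookup": 20, "temporal": 10,
          "geographic": 15, "exploration": 10, "comparative": 5,
          "relational": 10}
language = {"nl": 50, "en": 35}
difficulty = {"easy": 30, "medium": 35, "hard": 20}

# Every breakdown should sum to the same total number of examples.
totals = {sum(d.values()) for d in (intent, language, difficulty)}
print(totals)  # {85}
```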
## Golden Test Cases
High-priority test cases that must always pass:
```yaml
# datasets/golden_queries.yaml
golden_tests:
  - id: "golden_amsterdam_museums"
    question: "Hoeveel musea zijn er in Amsterdam?"
    language: nl
    expected_intent: statistical
    min_answer_contains: ["127", "musea", "Amsterdam"]
    max_latency_ms: 5000

  - id: "golden_rijksmuseum_location"
    question: "Waar is het Rijksmuseum gevestigd?"
    language: nl
    expected_intent: entity_lookup
    expected_answer_contains: ["Amsterdam", "Museumstraat"]

  - id: "golden_nationaal_archief_staff"
    question: "Who works at the Nationaal Archief?"
    language: en
    expected_intent: entity_lookup
    expected_entity_type: person
    expected_sources: ["oxigraph"]
```
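
One way a runner could evaluate such cases (a sketch only; `check_golden` is a hypothetical helper, the case dict mirrors the YAML above, and a real runner would obtain `answer` and `latency_ms` from the system under test):

```python
def check_golden(case: dict, answer: str, latency_ms: float) -> list[str]:
    """Compare a system answer against one golden case; return failure messages."""
    failures = []
    # Cover both substring-check keys that appear in the YAML above.
    for key in ("min_answer_contains", "expected_answer_contains"):
        for needle in case.get(key, []):
            if needle not in answer:
                failures.append(f"answer missing {needle!r}")
    if latency_ms > case.get("max_latency_ms", float("inf")):
        failures.append(f"latency {latency_ms}ms over budget")
    return failures

case = {
    "id": "golden_amsterdam_museums",
    "min_answer_contains": ["127", "musea", "Amsterdam"],
    "max_latency_ms": 5000,
}
print(check_golden(case, "Er zijn 127 musea in Amsterdam.", 1200))  # []
```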
## Dataset Versioning
```
datasets/
├── v1.0/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
├── v1.1/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
└── current -> v1.1/   # Symlink to latest
```
## Loading Datasets
```python
# tests/dspy_gitops/conftest.py
import json
from pathlib import Path

import dspy
import pytest

DATASETS_DIR = Path(__file__).parent / "datasets"


def load_dspy_examples(filename: str) -> list[dspy.Example]:
    """Load examples from a JSON file into DSPy format."""
    with open(DATASETS_DIR / filename) as f:
        data = json.load(f)

    examples = []
    for ex in data["examples"]:
        example = dspy.Example(
            question=ex["question"],
            language=ex["language"],
            expected_intent=ex["expected_intent"],
            expected_entities=ex["expected_entities"],
            expected_entity_type=ex.get("expected_entity_type", "institution"),
            expected_sources=ex["expected_sources"],
        ).with_inputs("question", "language")
        examples.append(example)

    return examples


@pytest.fixture
def dev_set() -> list[dspy.Example]:
    """Load the development set for evaluation."""
    return load_dspy_examples("heritage_rag_dev.json")


@pytest.fixture
def test_set() -> list[dspy.Example]:
    """Load the test set for final evaluation."""
    return load_dspy_examples("heritage_rag_test.json")


@pytest.fixture
def golden_tests() -> list[dict]:
    """Load golden test cases."""
    import yaml

    with open(DATASETS_DIR / "golden_queries.yaml") as f:
        data = yaml.safe_load(f)
    return data["golden_tests"]
```
## Dataset Maintenance
### Adding New Examples
1. Identify a gap in coverage (new intent type, edge case, etc.)
2. Create the example with all required fields
3. Validate it against the schema
4. Run the full evaluation to establish a baseline
5. Commit with a description of the addition
### Updating Expected Outputs
Only update expected outputs when:

1. Ground truth changes (data update)
2. Intent classification rules change
3. New sources become available
**Never** update expected outputs just to make tests pass.
### Dataset Quality Checks
```python
from collections import Counter

import dspy

# Intent labels from the "By Intent Type" table above.
VALID_INTENTS = {
    "statistical", "entity_lookup", "temporal", "geographic",
    "exploration", "comparative", "relational",
}


def validate_dataset(examples: list[dspy.Example]) -> list[str]:
    """Validate dataset quality; return a list of error messages."""
    errors = []

    # Check required fields
    for ex in examples:
        if not ex.question:
            errors.append(f"Missing question: {ex}")
        if ex.language not in ["nl", "en"]:
            errors.append(f"Invalid language: {ex.language}")
        if ex.expected_intent not in VALID_INTENTS:
            errors.append(f"Invalid intent: {ex.expected_intent}")

    # Check for duplicates
    questions = [ex.question for ex in examples]
    if len(questions) != len(set(questions)):
        errors.append("Duplicate questions found")

    # Check balance
    intent_counts = Counter(ex.expected_intent for ex in examples)
    if min(intent_counts.values()) < 3:
        errors.append(f"Underrepresented intents: {intent_counts}")

    return errors
```