# DSPy Test Datasets
## Overview
Test datasets for DSPy evaluation follow the `dspy.Example` format and are versioned alongside code. This enables reproducible evaluations and regression detection.
## Dataset Structure
### Example Format
```python
dspy.Example(
    # Inputs (used by the module)
    question="Hoeveel musea zijn er in Amsterdam?",
    language="nl",

    # Expected outputs (used by metrics)
    expected_intent="statistical",
    expected_entities=["amsterdam", "musea"],
    expected_entity_type="institution",
    expected_sources=["oxigraph", "sparql"],
    gold_answer="Er zijn 127 musea in Amsterdam.",  # Optional
).with_inputs("question", "language")
```
### Dataset Files
```json
// datasets/heritage_rag_dev.json
{
  "version": "1.0.0",
  "created_at": "2026-01-11",
  "description": "Development set for Heritage RAG evaluation",
  "examples": [
    {
      "id": "nl_statistical_001",
      "question": "Hoeveel musea zijn er in Amsterdam?",
      "language": "nl",
      "expected_intent": "statistical",
      "expected_entities": ["amsterdam", "musea"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "count_query",
      "difficulty": "easy"
    },
    {
      "id": "nl_entity_001",
      "question": "Waar is het Rijksmuseum gevestigd?",
      "language": "nl",
      "expected_intent": "entity_lookup",
      "expected_entities": ["rijksmuseum"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "location_query",
      "difficulty": "easy"
    }
  ]
}
```
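
Before committing a new file, each record can be sanity-checked for the required fields. A minimal stdlib-only sketch (field names taken from the example file above; `check_example` is a hypothetical helper, not part of the project):

```python
import json

# Required per-record fields, matching the example dataset file above.
REQUIRED_FIELDS = {
    "id", "question", "language", "expected_intent",
    "expected_entities", "expected_sources", "category", "difficulty",
}

def check_example(ex: dict) -> list[str]:
    """Return the required fields missing from one example record."""
    return sorted(REQUIRED_FIELDS - ex.keys())

record = json.loads("""
{
  "id": "nl_statistical_001",
  "question": "Hoeveel musea zijn er in Amsterdam?",
  "language": "nl",
  "expected_intent": "statistical",
  "expected_entities": ["amsterdam", "musea"],
  "expected_sources": ["oxigraph"],
  "category": "count_query",
  "difficulty": "easy"
}
""")
print(check_example(record))  # []
```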
## Dataset Categories
### By Intent Type
| Intent | Count | Description |
|--------|-------|-------------|
| statistical | 15 | Count/aggregate queries |
| entity_lookup | 20 | Find a specific institution |
| temporal | 10 | Time-based queries (mergers, founding) |
| geographic | 15 | Location-based queries |
| exploration | 10 | Open-ended discovery |
| comparative | 5 | Compare multiple institutions |
| relational | 10 | Network/relationship queries |
### By Language
| Language | Count | Description |
|----------|-------|-------------|
| nl | 50 | Dutch queries |
| en | 35 | English queries |
### By Difficulty
| Level | Count | Description |
|-------|-------|-------------|
| easy | 30 | Single-hop, clear intent |
| medium | 35 | Multi-entity, some ambiguity |
| hard | 20 | Multi-hop, complex temporal |
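
The three breakdowns above should partition the same pool of examples (85 in total). A quick consistency check on the counts:

```python
# Counts copied from the three category tables above.
intent = {"statistical": 15, "entity_lookup": 20, "temporal": 10,
          "geographic": 15, "exploration": 10, "comparative": 5,
          "relational": 10}
language = {"nl": 50, "en": 35}
difficulty = {"easy": 30, "medium": 35, "hard": 20}

# Every breakdown should sum to the same total number of examples.
totals = {sum(d.values()) for d in (intent, language, difficulty)}
print(totals)  # {85}
```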
## Golden Test Cases
High-priority test cases that must always pass:
```yaml
# datasets/golden_queries.yaml
golden_tests:
  - id: "golden_amsterdam_museums"
    question: "Hoeveel musea zijn er in Amsterdam?"
    language: nl
    expected_intent: statistical
    min_answer_contains: ["127", "musea", "Amsterdam"]
    max_latency_ms: 5000

  - id: "golden_rijksmuseum_location"
    question: "Waar is het Rijksmuseum gevestigd?"
    language: nl
    expected_intent: entity_lookup
    expected_answer_contains: ["Amsterdam", "Museumstraat"]

  - id: "golden_nationaal_archief_staff"
    question: "Who works at the Nationaal Archief?"
    language: en
    expected_intent: entity_lookup
    expected_entity_type: person
    expected_sources: ["oxigraph"]
```
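
One way a runner could evaluate such cases (a sketch only; `check_golden` is a hypothetical helper, the case dict mirrors the YAML above, and a real runner would obtain `answer` and `latency_ms` from the system under test):

```python
def check_golden(case: dict, answer: str, latency_ms: float) -> list[str]:
    """Compare a system answer against one golden case; return failure messages."""
    failures = []
    # Cover both substring-check keys that appear in the YAML above.
    for key in ("min_answer_contains", "expected_answer_contains"):
        for needle in case.get(key, []):
            if needle not in answer:
                failures.append(f"answer missing {needle!r}")
    if latency_ms > case.get("max_latency_ms", float("inf")):
        failures.append(f"latency {latency_ms}ms over budget")
    return failures

case = {
    "id": "golden_amsterdam_museums",
    "min_answer_contains": ["127", "musea", "Amsterdam"],
    "max_latency_ms": 5000,
}
print(check_golden(case, "Er zijn 127 musea in Amsterdam.", 1200))  # []
```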
## Dataset Versioning
```
datasets/
├── v1.0/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
├── v1.1/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
└── current -> v1.1/   # Symlink to latest
```
## Loading Datasets
```python
# tests/dspy_gitops/conftest.py
import json
from pathlib import Path

import dspy
import pytest

DATASETS_DIR = Path(__file__).parent / "datasets"


def load_dspy_examples(filename: str) -> list[dspy.Example]:
    """Load examples from a JSON file into DSPy format."""
    with open(DATASETS_DIR / filename) as f:
        data = json.load(f)

    examples = []
    for ex in data["examples"]:
        example = dspy.Example(
            question=ex["question"],
            language=ex["language"],
            expected_intent=ex["expected_intent"],
            expected_entities=ex["expected_entities"],
            expected_entity_type=ex.get("expected_entity_type", "institution"),
            expected_sources=ex["expected_sources"],
        ).with_inputs("question", "language")
        examples.append(example)

    return examples


@pytest.fixture
def dev_set() -> list[dspy.Example]:
    """Load the development set for evaluation."""
    return load_dspy_examples("heritage_rag_dev.json")


@pytest.fixture
def test_set() -> list[dspy.Example]:
    """Load the test set for final evaluation."""
    return load_dspy_examples("heritage_rag_test.json")


@pytest.fixture
def golden_tests() -> list[dict]:
    """Load golden test cases."""
    import yaml

    with open(DATASETS_DIR / "golden_queries.yaml") as f:
        data = yaml.safe_load(f)
    return data["golden_tests"]
```
## Dataset Maintenance
### Adding New Examples
1. Identify a gap in coverage (new intent type, edge case, etc.)
2. Create the example with all required fields
3. Validate it against the schema
4. Run the full evaluation to establish a baseline
5. Commit with a description of the addition
### Updating Expected Outputs
Only update expected outputs when:

1. Ground truth changes (data update)
2. Intent classification rules change
3. New sources become available
**Never** update expected outputs just to make tests pass.
### Dataset Quality Checks
```python
from collections import Counter

import dspy

# Intent labels from the "By Intent Type" table above.
VALID_INTENTS = {
    "statistical", "entity_lookup", "temporal", "geographic",
    "exploration", "comparative", "relational",
}


def validate_dataset(examples: list[dspy.Example]) -> list[str]:
    """Validate dataset quality; return a list of error messages."""
    errors = []

    # Check required fields
    for ex in examples:
        if not ex.question:
            errors.append(f"Missing question: {ex}")
        if ex.language not in ["nl", "en"]:
            errors.append(f"Invalid language: {ex.language}")
        if ex.expected_intent not in VALID_INTENTS:
            errors.append(f"Invalid intent: {ex.expected_intent}")

    # Check for duplicates
    questions = [ex.question for ex in examples]
    if len(questions) != len(set(questions)):
        errors.append("Duplicate questions found")

    # Check balance
    intent_counts = Counter(ex.expected_intent for ex in examples)
    if min(intent_counts.values()) < 3:
        errors.append(f"Underrepresented intents: {intent_counts}")

    return errors
```