# DSPy Test Datasets

## Overview

Test datasets for DSPy evaluation follow the `dspy.Example` format and are versioned alongside code. This enables reproducible evaluations and regression detection.

## Dataset Structure

### Example Format

```python
dspy.Example(
    # Inputs (used by the module)
    question="Hoeveel musea zijn er in Amsterdam?",
    language="nl",
    # Expected outputs (used by metrics)
    expected_intent="statistical",
    expected_entities=["amsterdam", "musea"],
    expected_entity_type="institution",
    expected_sources=["oxigraph", "sparql"],
    gold_answer="Er zijn 127 musea in Amsterdam.",  # Optional
).with_inputs("question", "language")
```

### Dataset Files

```yaml
# datasets/heritage_rag_dev.json
{
  "version": "1.0.0",
  "created_at": "2026-01-11",
  "description": "Development set for Heritage RAG evaluation",
  "examples": [
    {
      "id": "nl_statistical_001",
      "question": "Hoeveel musea zijn er in Amsterdam?",
      "language": "nl",
      "expected_intent": "statistical",
      "expected_entities": ["amsterdam", "musea"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "count_query",
      "difficulty": "easy"
    },
    {
      "id": "nl_entity_001",
      "question": "Waar is het Rijksmuseum gevestigd?",
      "language": "nl",
      "expected_intent": "entity_lookup",
      "expected_entities": ["rijksmuseum"],
      "expected_entity_type": "institution",
      "expected_sources": ["oxigraph"],
      "category": "location_query",
      "difficulty": "easy"
    }
  ]
}
```

## Dataset Categories

### By Intent Type

| Intent | Count | Description |
|--------|-------|-------------|
| statistical | 15 | Count/aggregate queries |
| entity_lookup | 20 | Find specific institution |
| temporal | 10 | Time-based queries (mergers, founding) |
| geographic | 15 | Location-based queries |
| exploration | 10 | Open-ended discovery |
| comparative | 5 | Compare multiple institutions |
| relational | 10 | Network/relationship queries |

### By Language

| Language | Count | Description |
|----------|-------|-------------|
| nl | 50 | Dutch queries |
| en | 35 | English queries |

### By Difficulty

| Level | Count | Description |
|-------|-------|-------------|
| easy | 30 | Single-hop, clear intent |
| medium | 35 | Multi-entity, some ambiguity |
| hard | 20 | Multi-hop, complex temporal |

## Golden Test Cases

High-priority test cases that must always pass:

```yaml
# datasets/golden_queries.yaml
golden_tests:
  - id: "golden_amsterdam_museums"
    question: "Hoeveel musea zijn er in Amsterdam?"
    language: nl
    expected_intent: statistical
    min_answer_contains: ["127", "musea", "Amsterdam"]
    max_latency_ms: 5000

  - id: "golden_rijksmuseum_location"
    question: "Waar is het Rijksmuseum gevestigd?"
    language: nl
    expected_intent: entity_lookup
    expected_answer_contains: ["Amsterdam", "Museumstraat"]

  - id: "golden_nationaal_archief_staff"
    question: "Who works at the Nationaal Archief?"
    language: en
    expected_intent: entity_lookup
    expected_entity_type: person
    expected_sources: ["oxigraph"]
```

## Dataset Versioning

```
datasets/
├── v1.0/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
├── v1.1/
│   ├── heritage_rag_dev.json
│   ├── heritage_rag_test.json
│   └── CHANGELOG.md
└── current -> v1.1/   # Symlink to latest
```

## Loading Datasets

```python
# tests/dspy_gitops/conftest.py
import json
from pathlib import Path

import dspy
import pytest

DATASETS_DIR = Path(__file__).parent / "datasets"


def load_dspy_examples(filename: str) -> list[dspy.Example]:
    """Load examples from a JSON file into DSPy format."""
    with open(DATASETS_DIR / filename) as f:
        data = json.load(f)

    examples = []
    for ex in data["examples"]:
        example = dspy.Example(
            question=ex["question"],
            language=ex["language"],
            expected_intent=ex["expected_intent"],
            expected_entities=ex["expected_entities"],
            expected_entity_type=ex.get("expected_entity_type", "institution"),
            expected_sources=ex["expected_sources"],
        ).with_inputs("question", "language")
        examples.append(example)
    return examples


@pytest.fixture
def dev_set() -> list[dspy.Example]:
    """Load the development set for evaluation."""
    return load_dspy_examples("heritage_rag_dev.json")


@pytest.fixture
def test_set() -> list[dspy.Example]:
    """Load the test set for final evaluation."""
    return load_dspy_examples("heritage_rag_test.json")


@pytest.fixture
def golden_tests() -> list[dict]:
    """Load golden test cases."""
    import yaml

    with open(DATASETS_DIR / "golden_queries.yaml") as f:
        data = yaml.safe_load(f)
    return data["golden_tests"]
```

## Dataset Maintenance

### Adding New Examples

1. Identify a gap in coverage (new intent type, edge case, etc.)
2. Create the example with all required fields
3. Validate against the schema
4. Run the full evaluation to establish a baseline
5. Commit with a description of the addition

### Updating Expected Outputs

Only update expected outputs when:

1. Ground truth changes (data update)
2. Intent classification rules change
3. New sources become available

**Never** update expected outputs just to make tests pass.

### Dataset Quality Checks

```python
from collections import Counter

import dspy

# The seven intent types from the "By Intent Type" table above.
VALID_INTENTS = {
    "statistical", "entity_lookup", "temporal", "geographic",
    "exploration", "comparative", "relational",
}


def validate_dataset(examples: list[dspy.Example]) -> list[str]:
    """Validate dataset quality; returns a list of error messages."""
    errors = []

    # Check required fields
    for ex in examples:
        if not ex.question:
            errors.append(f"Missing question: {ex}")
        if ex.language not in ["nl", "en"]:
            errors.append(f"Invalid language: {ex.language}")
        if ex.expected_intent not in VALID_INTENTS:
            errors.append(f"Invalid intent: {ex.expected_intent}")

    # Check for duplicates
    questions = [ex.question for ex in examples]
    if len(questions) != len(set(questions)):
        errors.append("Duplicate questions found")

    # Check balance across intent types
    intent_counts = Counter(ex.expected_intent for ex in examples)
    if min(intent_counts.values()) < 3:
        errors.append(f"Underrepresented intents: {intent_counts}")

    return errors
```
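The golden test cases pair each query with assertion fields (`min_answer_contains`, `expected_answer_contains`, `max_latency_ms`). A minimal sketch of a checker that applies those assertions to a pipeline response; the `answer`/`latency_ms` response shape and the `check_golden_case` helper are assumptions, not part of the pipeline's API:

```python
def check_golden_case(case: dict, answer: str, latency_ms: float) -> list[str]:
    """Return failure messages for one golden test case.

    `case` is one entry from golden_queries.yaml; `answer` and
    `latency_ms` come from the pipeline under test (assumed shape).
    """
    failures = []

    # Every required substring must appear in the answer.
    for key in ("min_answer_contains", "expected_answer_contains"):
        for needle in case.get(key, []):
            if needle not in answer:
                failures.append(f"{case['id']}: answer missing {needle!r}")

    # Enforce the latency budget, if the case specifies one.
    budget = case.get("max_latency_ms")
    if budget is not None and latency_ms > budget:
        failures.append(f"{case['id']}: {latency_ms:.0f}ms exceeds {budget}ms")

    return failures


case = {
    "id": "golden_amsterdam_museums",
    "min_answer_contains": ["127", "musea", "Amsterdam"],
    "max_latency_ms": 5000,
}
print(check_golden_case(case, "Er zijn 127 musea in Amsterdam.", 1200.0))  # → []
```

Returning a list of failures rather than raising on the first mismatch lets a test report every broken golden case in one run.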