UNESCO Data Extraction - Test-Driven Development Strategy
Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 04 - TDD Strategy
Version: 1.1
Date: 2025-11-10
Status: Updated for OpenDataSoft API v2.0
Executive Summary
This document outlines the test-driven development (TDD) strategy for the UNESCO World Heritage Sites extraction project. Following TDD principles ensures high code quality, maintainability, and confidence in data integrity. The strategy emphasizes writing tests BEFORE implementation, using diverse testing approaches (unit, integration, property-based, golden dataset), and maintaining 90%+ code coverage.
Key Principle: Red → Green → Refactor
- Red: Write failing test first
- Green: Write minimal code to pass test
- Refactor: Improve code while keeping tests green
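A deliberately tiny illustration of the cycle (the `slug_for` helper is hypothetical, not part of the project code):

```python
# RED: written first; this test fails because slug_for does not exist yet.
def test_slug_for_lowercases_and_joins_words():
    assert slug_for("Musée du Louvre") == "musée_du_louvre"

# GREEN: the minimal implementation that makes the test pass.
def slug_for(name: str) -> str:
    return "_".join(name.lower().split())

# REFACTOR: improve internals (e.g. strip punctuation) while the test stays green.
```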
Testing Philosophy
Core Principles
- Tests as Specification: Tests document expected behavior and serve as executable specifications
- Fast Feedback: Unit tests run in milliseconds, integration tests in seconds
- Isolation: Each test is independent, can run in any order
- Determinism: Same input always produces same output (no flaky tests)
- Coverage as Quality Gate: 90%+ coverage required before merge
Testing Pyramid
        ┌─────────────────┐
        │    E2E Tests    │  (10% of tests)
        │  Full pipeline  │
        └─────────────────┘
     ┌───────────────────────┐
     │   Integration Tests   │  (20% of tests)
     │   API + Parser + DB   │
     └───────────────────────┘
┌─────────────────────────────────┐
│           Unit Tests            │  (70% of tests)
│ Functions, Classes, Validators  │
└─────────────────────────────────┘
Distribution:
- 70% Unit Tests: Fast, isolated, test individual functions/classes
- 20% Integration Tests: Test component interactions (API client + parser)
- 10% E2E Tests: Full pipeline from API fetch to LinkML validation
Test Infrastructure
Testing Framework Stack
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = [
    "--cov=src/glam_extractor",
    "--cov-report=html",
    "--cov-report=term-missing",
    "--cov-fail-under=90",
    "--verbose",
    "--strict-markers",
    "--tb=short",
]
markers = [
    "unit: Unit tests (fast, isolated)",
    "integration: Integration tests (requires external resources)",
    "e2e: End-to-end tests (full pipeline)",
    "slow: Slow tests (> 1 second)",
    "api: Tests requiring UNESCO API access",
    "wikidata: Tests requiring Wikidata SPARQL endpoint",
]
Dependencies
[project.optional-dependencies]
test = [
    "pytest >= 8.0.0",
    "pytest-cov >= 4.1.0",
    "pytest-mock >= 3.12.0",
    "pytest-xdist >= 3.5.0",  # Parallel test execution
    "hypothesis >= 6.92.0",   # Property-based testing
    "responses >= 0.24.0",    # Mock HTTP requests
    "freezegun >= 1.4.0",     # Mock datetime
    "faker >= 22.0.0",        # Generate fake data
    "deepdiff >= 6.7.0",      # Deep object comparison
]
Directory Structure
tests/
├── unit/                          # Unit tests (fast, isolated)
│   ├── test_unesco_api_client.py
│   ├── test_unesco_parser.py
│   ├── test_institution_classifier.py
│   ├── test_ghcid_generator.py
│   └── test_linkml_map.py
├── integration/                   # Integration tests
│   ├── test_api_to_parser_pipeline.py
│   ├── test_parser_to_validator_pipeline.py
│   ├── test_wikidata_enrichment.py
│   └── test_dataset_merge.py
├── e2e/                           # End-to-end tests
│   ├── test_full_unesco_extraction.py
│   └── test_export_all_formats.py
├── fixtures/                      # Test data
│   ├── unesco_api_responses/      # Sample UNESCO JSON responses
│   │   ├── site_600_bnf.json
│   │   ├── site_list_sample.json
│   │   └── ...
│   ├── expected_outputs/          # Golden dataset expected results
│   │   ├── unesco_bnf_expected.yaml
│   │   └── ...
│   └── mock_responses/            # Pre-recorded API responses
├── conftest.py                    # Pytest configuration and shared fixtures
└── README.md                      # Testing guide
Phase-by-Phase TDD Strategy
Phase 1: API Exploration & Schema Design (Days 1-5)
Day 1: UNESCO API Client Tests
Test Plan:
- Test API connectivity (integration test)
- Test response parsing (unit test)
- Test error handling (unit test)
- Test caching (unit test)
- Test rate limiting (integration test)
Example TDD Cycle:
# tests/unit/test_unesco_api_client.py
import pytest
import responses

from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient


# RED: Write failing test first
def test_fetch_site_list_returns_list_of_sites():
    """Test that fetch_site_list returns a list of UNESCO sites."""
    client = UNESCOAPIClient()
    sites = client.fetch_site_list()
    assert isinstance(sites, list)
    assert len(sites) > 0
    assert 'id_number' in sites[0]
    assert 'site' in sites[0]

# This test will FAIL because UNESCOAPIClient doesn't exist yet.
# Now implement minimal code to make it pass (GREEN).
Mock API Responses:
@responses.activate
def test_fetch_site_details_with_valid_id():
    """Test fetching site details with mocked API response."""
    # Arrange: Set up mock response (OpenDataSoft API v2.0 format)
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France"
                }
            }
        },
        status=200
    )
    # Act: Call the method
    client = UNESCOAPIClient()
    site = client.fetch_site_details(600)
    # Assert: Verify results (nested structure)
    assert site['record']['fields']['unique_number'] == 600
    assert site['record']['fields']['name_en'] == "Paris, Banks of the Seine"
    assert site['record']['fields']['category'] == "Cultural"
Error Handling Tests:
@responses.activate
def test_fetch_site_details_handles_404():
    """Test that 404 errors are handled gracefully."""
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/99999",
        status=404
    )
    client = UNESCOAPIClient()
    with pytest.raises(UNESCOSiteNotFoundError) as exc_info:
        client.fetch_site_details(99999)
    assert "Site 99999 not found" in str(exc_info.value)


@responses.activate
def test_fetch_site_details_retries_on_network_error():
    """Test that network errors trigger retry logic."""
    # First call fails, second call succeeds
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        body=ConnectionError("Network error")  # responses raises this on the first call
    )
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine"
                }
            }
        },
        status=200
    )
    client = UNESCOAPIClient(max_retries=3)
    site = client.fetch_site_details(600)
    assert site['record']['fields']['unique_number'] == 600
    assert len(responses.calls) == 2  # Verify retry happened
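The retry behavior exercised above can be sketched independently of the real client. This is a minimal stand-alone helper, not the project's actual implementation; `max_retries` and `base_delay` are illustrative parameter names:

```python
import time

def fetch_with_retries(fetch, max_retries=3, base_delay=0.5):
    """Call fetch(), retrying with exponential backoff on connection errors."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

The test above would then assert that a request failing once and succeeding on the second attempt results in exactly two recorded calls.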
Caching Tests:
def test_api_client_caches_responses(tmp_path):
    """Test that API responses are cached to avoid redundant requests."""
    cache_db = tmp_path / "test_cache.db"
    client = UNESCOAPIClient(cache_path=cache_db)
    # First call: cache miss, fetches from API
    with responses.RequestsMock() as rsps:
        rsps.add(
            responses.GET,
            "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
            json={"record": {"id": "rec_600", "fields": {"unique_number": 600}}},
            status=200
        )
        site1 = client.fetch_site_details(600)
        assert len(rsps.calls) == 1
    # Second call: cache hit, no API request
    with responses.RequestsMock() as rsps:
        site2 = client.fetch_site_details(600)
        assert len(rsps.calls) == 0  # No new API calls
    assert site1 == site2
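A cache backing `cache_path` could be as simple as a sqlite table keyed by site ID. The sketch below is illustrative only (the real client's cache schema and class name are not specified here):

```python
import json
import sqlite3

class ResponseCache:
    """Minimal sqlite-backed cache for JSON API responses (illustrative sketch)."""

    def __init__(self, cache_path):
        self.conn = sqlite3.connect(str(cache_path))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (site_id INTEGER PRIMARY KEY, body TEXT)"
        )

    def get(self, site_id):
        # Return the cached JSON payload, or None on a cache miss.
        row = self.conn.execute(
            "SELECT body FROM responses WHERE site_id = ?", (site_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, site_id, payload):
        self.conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?)",
            (site_id, json.dumps(payload))
        )
        self.conn.commit()
```

The client would consult `get()` before issuing a request and call `put()` after a successful fetch, which is exactly what the caching test above verifies from the outside.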
Success Criteria:
- 15+ unit tests for API client
- 100% coverage for unesco_api_client.py
- All tests run in < 1 second (mocked responses)
Day 2-3: LinkML Map Schema Tests
Test Plan:
- Test basic field extraction (name, location, UNESCO WHC ID)
- Test multi-language name handling
- Test conditional institution type mapping
- Test missing field handling
- Test regex validators
Golden Dataset Approach:
# tests/integration/test_linkml_map_golden_dataset.py
import json
from pathlib import Path

import pytest
import yaml
from deepdiff import DeepDiff
from linkml_map import Mapper

from glam_extractor.models import HeritageCustodian

FIXTURES_DIR = Path(__file__).parent.parent / "fixtures"


@pytest.mark.parametrize("site_id", [
    600,  # Paris, Banks of the Seine (Library/Museum)
    252,  # Taj Mahal (Monument with museum)
    936,  # Kew Gardens (Botanical garden)
    # Add 17 more...
])
def test_linkml_map_golden_dataset(site_id):
    """Test LinkML Map transformation against golden dataset."""
    # Load input JSON
    input_json = FIXTURES_DIR / f"unesco_api_responses/site_{site_id}.json"
    with open(input_json) as f:
        api_response = json.load(f)
    # Load expected output YAML
    expected_yaml = FIXTURES_DIR / f"expected_outputs/site_{site_id}_expected.yaml"
    with open(expected_yaml) as f:
        expected = yaml.safe_load(f)
    # Apply LinkML Map transformation
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    # Compare result with expected output
    assert result['name'] == expected['name']
    assert result['institution_type'] == expected['institution_type']
    assert result['locations'][0]['city'] == expected['locations'][0]['city']
    # Deep comparison for complex nested structures
    diff = DeepDiff(result, expected, ignore_order=True)
    assert not diff, f"Unexpected differences: {diff}"
Conditional Mapping Tests:
def test_institution_type_mapping_for_museum():
    """Test that sites with 'museum' in description map to MUSEUM type."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "short_description_en": "Includes the Louvre Museum and other heritage institutions",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "MUSEUM"


def test_institution_type_mapping_for_botanical_garden():
    """Test that botanical gardens map correctly."""
    api_response = {
        "record": {
            "id": "rec_936",
            "fields": {
                "unique_number": 936,
                "name_en": "Royal Botanic Gardens, Kew",
                "short_description_en": "Botanical garden and herbarium",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "BOTANICAL_ZOO"


def test_institution_type_defaults_to_mixed_when_ambiguous():
    """Test that ambiguous sites default to MIXED type."""
    api_response = {
        "record": {
            "id": "rec_123",
            "fields": {
                "unique_number": 123,
                "name_en": "Historic District",
                "short_description_en": "Historic area with no specific institution mentioned",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "MIXED"
Multi-language Name Tests:
def test_multilingual_name_extraction():
    """Test extraction of names in multiple languages."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "name_fr": "Paris, rives de la Seine",
                "name_es": "París, riberas del Sena"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['name'] == "Paris, Banks of the Seine"
    assert "Paris, rives de la Seine@fr" in result['alternative_names']
    assert "París, riberas del Sena@es" in result['alternative_names']
Success Criteria:
- 20 golden dataset test cases (100% passing)
- 10+ conditional mapping tests
- 5+ multi-language tests
- LinkML Map schema validates with linkml-validate
Day 4: Test Fixture Creation
Fixture Quality Checklist:
# tests/conftest.py - Shared fixtures
from pathlib import Path

import pytest

from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient


@pytest.fixture
def fixtures_dir():
    """Return path to test fixtures directory."""
    return Path(__file__).parent / "fixtures"


@pytest.fixture
def sample_unesco_site_600():
    """Fixture for UNESCO site 600 (Paris, Banks of the Seine)."""
    return {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "category": "Cultural",
                "states_name_en": "France",
                "region_en": "Europe and North America",
                "latitude": 48.8566,
                "longitude": 2.3522,
                "short_description_en": "From the Louvre to the Eiffel Tower...",
                "justification_en": "Contains numerous heritage institutions including museums and libraries",
                "date_inscribed": 1991,
                "http_url": "https://whc.unesco.org/en/list/600"
            }
        }
    }


@pytest.fixture
def expected_heritage_custodian_600():
    """Expected HeritageCustodian output for site 600."""
    return {
        "id": "https://w3id.org/heritage/custodian/fr/louvre",
        "name": "Paris, Banks of the Seine",
        "institution_type": "MIXED",  # Multiple institutions at this site
        "locations": [
            {
                "city": "Paris",
                "country": "FR",
                "coordinates": [48.8566, 2.3522],
                "geonames_id": "2988507"
            }
        ],
        "identifiers": [
            {
                "identifier_scheme": "UNESCO_WHC",
                "identifier_value": "600",
                "identifier_url": "https://whc.unesco.org/en/list/600"
            },
            {
                "identifier_scheme": "Wikidata",
                "identifier_value": "Q90",
                "identifier_url": "https://www.wikidata.org/wiki/Q90"
            }
        ],
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z",
            "confidence_score": 1.0
        }
    }


@pytest.fixture
def mock_unesco_api_client(mocker):
    """Mock UNESCO API client for testing without network calls."""
    mock_client = mocker.Mock(spec=UNESCOAPIClient)
    mock_client.fetch_site_details.return_value = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine"
            }
        }
    }
    return mock_client
Fixture Coverage Matrix:
| Site ID | Location | Institution Type | Special Case |
|---|---|---|---|
| 600 | France, Paris | MIXED | Multiple institutions |
| 252 | India, Agra | MUSEUM | Archaeological museum |
| 936 | UK, London | BOTANICAL_ZOO | Kew Gardens |
| 148 | Brazil, Brasília | ARCHIVE | National Archive |
| 274 | Egypt, Cairo | MUSEUM | Egyptian Museum |
| 890 | Japan, Kyoto | HOLY_SITES | Temple with collection |
| 1110 | Vietnam, Hanoi | LIBRARY | National Library |
| 723 | Mexico, Mexico City | MUSEUM | Anthropology Museum |
| ... | ... | ... | ... |
Success Criteria:
- 20 diverse test fixtures covering all continents
- 20 expected output YAML files (manually verified)
- Fixtures cover all institution types in InstitutionTypeEnum
- Edge cases documented (serial nominations, transboundary sites)
Day 5: Institution Type Classifier Tests
Test Plan:
- Test keyword-based classification
- Test multi-keyword scoring
- Test confidence scoring
- Test multilingual keywords
- Test ambiguous case handling
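The keyword-scoring approach listed above can be sketched in a few lines. This is a simplified stand-in for the real `classify_institution_type` (the keyword table and the confidence formula here are illustrative assumptions, not the project's actual vocabulary or scoring):

```python
# Illustrative keyword table; the real classifier's vocabulary is larger and multilingual.
KEYWORDS = {
    "MUSEUM": ["museum", "musée", "museo", "museu"],
    "LIBRARY": ["library", "bibliothèque", "biblioteca"],
    "ARCHIVE": ["archive", "archief"],
    "BOTANICAL_ZOO": ["botanic", "herbarium", "jardin botanique"],
}

def classify_institution_type(description: str) -> dict:
    """Return the best-matching type plus a crude keyword-count confidence."""
    text = description.lower()
    scores = {t: sum(kw in text for kw in kws) for t, kws in KEYWORDS.items()}
    best_type, hits = max(scores.items(), key=lambda item: item[1])
    if hits == 0:
        # No keyword evidence: fall back to MIXED with low confidence.
        return {"institution_type": "MIXED", "confidence_score": 0.5}
    return {"institution_type": best_type, "confidence_score": min(1.0, 0.7 + 0.2 * hits)}
```

More keyword hits raise confidence; zero hits fall through to the MIXED default, which is the behavior the ambiguous-case tests check.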
Property-Based Testing with Hypothesis:
# tests/unit/test_institution_classifier_properties.py
from hypothesis import given, strategies as st

from glam_extractor.classifiers.unesco_institution_type import classify_institution_type


@given(description=st.text(min_size=10, max_size=500))
def test_classifier_always_returns_valid_enum(description):
    """Property: Classifier must always return a valid InstitutionTypeEnum."""
    result = classify_institution_type(description)
    assert result['institution_type'] in [
        "MUSEUM", "LIBRARY", "ARCHIVE", "GALLERY",
        "BOTANICAL_ZOO", "HOLY_SITES", "FEATURES", "MIXED"
    ]
    assert 0.0 <= result['confidence_score'] <= 1.0


@given(
    description=st.text(),
    keyword=st.sampled_from(["museum", "library", "archive", "botanical garden"])
)
def test_classifier_detects_explicit_keywords(description, keyword):
    """Property: Explicit keywords should be detected reliably."""
    # Force keyword into description
    modified_desc = f"{description} This site contains a {keyword}."
    result = classify_institution_type(modified_desc)
    # Should detect the keyword with high confidence
    assert result['confidence_score'] >= 0.7
Keyword Detection Tests:
@pytest.mark.parametrize("description,expected_type,min_confidence", [
    # Museum keywords
    ("Site includes the National Museum of Anthropology", "MUSEUM", 0.9),
    ("Archaeological museum with 10,000 artifacts", "MUSEUM", 0.9),
    ("Le Musée du Louvre est situé à Paris", "MUSEUM", 0.9),  # French
    # Library keywords
    ("The National Library holds rare manuscripts", "LIBRARY", 0.9),
    ("Biblioteca Nacional do Brasil", "LIBRARY", 0.9),  # Portuguese
    # Archive keywords
    ("Historic archive with colonial documents", "ARCHIVE", 0.9),
    ("Rijksarchief in Noord-Holland", "ARCHIVE", 0.9),  # Dutch
    # Botanical garden keywords
    ("Royal Botanic Gardens with herbarium collection", "BOTANICAL_ZOO", 0.9),
    ("Jardin botanique with 50,000 plant species", "BOTANICAL_ZOO", 0.9),
    # Holy sites
    ("Cathedral with liturgical manuscripts collection", "HOLY_SITES", 0.85),
    ("Monastery library with medieval codices", "HOLY_SITES", 0.85),
    # Ambiguous cases (should default to MIXED)
    ("Historic district with various buildings", "MIXED", 0.5),
    ("Cultural landscape", "MIXED", 0.5),
])
def test_keyword_based_classification(description, expected_type, min_confidence):
    """Test that classifier detects keywords correctly."""
    result = classify_institution_type(description)
    assert result['institution_type'] == expected_type
    assert result['confidence_score'] >= min_confidence
Multi-language Support Tests:
def test_classifier_handles_multilingual_descriptions():
    """Test that classifier works with non-English descriptions."""
    test_cases = [
        # French
        ("Le site comprend un musée d'art moderne", "MUSEUM"),
        ("Bibliothèque nationale avec des manuscrits rares", "LIBRARY"),
        # Spanish
        ("Museo Nacional de Antropología con colecciones", "MUSEUM"),
        ("Biblioteca pública con archivos históricos", "LIBRARY"),
        # Portuguese
        ("Museu Nacional com exposições permanentes", "MUSEUM"),
        # German
        ("Staatliches Museum für Völkerkunde", "MUSEUM"),
        # Dutch
        ("Rijksmuseum met kunstcollecties", "MUSEUM"),
    ]
    for description, expected_type in test_cases:
        result = classify_institution_type(description)
        assert result['institution_type'] == expected_type, \
            f"Failed for: {description}"
Confidence Scoring Tests:
def test_confidence_high_for_explicit_mentions():
    """Test that explicit mentions produce high confidence scores."""
    result = classify_institution_type("Site contains the National Museum")
    assert result['confidence_score'] >= 0.9


def test_confidence_medium_for_inferred_types():
    """Test that inferred types produce medium confidence scores."""
    result = classify_institution_type("Archaeological site with exhibition hall")
    assert 0.6 <= result['confidence_score'] < 0.9


def test_confidence_low_for_ambiguous_cases():
    """Test that ambiguous cases produce low confidence scores."""
    result = classify_institution_type("Historic area")
    assert result['confidence_score'] < 0.6
Success Criteria:
- 50+ unit tests for classifier
- 100% coverage for unesco_institution_type.py
- Property-based tests cover edge cases
- 90%+ accuracy on golden dataset (20 fixtures)
Phase 2: Extractor Implementation (Days 6-13)
Day 6-7: Parser Tests
Test Plan:
- Test parsing valid UNESCO JSON to HeritageCustodian
- Test handling missing optional fields
- Test handling invalid/malformed JSON
- Test LinkML validation integration
- Test provenance metadata generation
Parser Unit Tests:
# tests/unit/test_unesco_parser.py
from glam_extractor.models import HeritageCustodian, InstitutionTypeEnum
from glam_extractor.parsers.unesco_parser import parse_unesco_site


def test_parse_unesco_site_creates_valid_heritage_custodian(sample_unesco_site_600):
    """Test that parser creates valid HeritageCustodian from UNESCO JSON."""
    custodian = parse_unesco_site(sample_unesco_site_600)
    assert isinstance(custodian, HeritageCustodian)
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.institution_type in InstitutionTypeEnum
    assert len(custodian.identifiers) >= 1
    assert custodian.provenance.data_source == "UNESCO_WORLD_HERITAGE"


def test_parse_unesco_site_handles_missing_coordinates():
    """Test that parser handles missing latitude/longitude gracefully."""
    incomplete_site = {
        "record": {
            "id": "rec_999",
            "fields": {
                "unique_number": 999,
                "name_en": "Test Site",
                "category": "Cultural"
                # Missing latitude, longitude
            }
        }
    }
    custodian = parse_unesco_site(incomplete_site)
    # Should still parse, but location may lack coordinates
    assert custodian.name == "Test Site"
    if custodian.locations:
        assert custodian.locations[0].latitude is None
        assert custodian.locations[0].longitude is None


def test_parse_unesco_site_extracts_all_identifiers():
    """Test that parser extracts UNESCO WHC ID and Wikidata Q-number."""
    site_with_links = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "http_url": "https://whc.unesco.org/en/list/600",
                "wdpaid": "Q90"  # Wikidata ID field
            }
        }
    }
    custodian = parse_unesco_site(site_with_links)
    # Check UNESCO WHC ID
    whc_ids = [i for i in custodian.identifiers if i.identifier_scheme == "UNESCO_WHC"]
    assert len(whc_ids) == 1
    assert whc_ids[0].identifier_value == "600"
    # Check Wikidata Q-number
    wd_ids = [i for i in custodian.identifiers if i.identifier_scheme == "Wikidata"]
    assert len(wd_ids) == 1
    assert wd_ids[0].identifier_value == "Q90"
LinkML Validation Integration:
def test_parsed_custodian_validates_against_linkml_schema(sample_unesco_site_600):
    """Test that parsed HeritageCustodian passes LinkML validation."""
    from linkml.validators import JsonSchemaValidator

    custodian = parse_unesco_site(sample_unesco_site_600)
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    # Convert to dict for validation
    custodian_dict = custodian.dict()
    # Should not raise ValidationError
    report = validator.validate(custodian_dict)
    assert report.valid, f"Validation errors: {report.results}"
Success Criteria:
- 20+ unit tests for parser
- 100% coverage for unesco_parser.py
- All parsed instances validate against LinkML schema
Day 8: GHCID Generator Tests
Test Plan:
- Test GHCID format correctness
- Test deterministic generation (same input → same GHCID)
- Test UUID v5 generation
- Test collision detection
- Test native name suffix generation (NOT Wikidata Q-numbers)
Note: Collision resolution uses the native-language institution name in snake_case format (NOT Wikidata Q-numbers). See docs/plan/global_glam/07-ghcid-collision-resolution.md for authoritative documentation.
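The UUID v5 determinism that Day 8 tests rely on follows directly from name-based UUID construction; a self-contained sketch (the namespace URL below is illustrative, not the project's fixed namespace):

```python
import uuid

# Illustrative namespace; the project would pin one namespace UUID for all GHCIDs.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/ghcid")

def generate_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a stable, version-5 UUID from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Because UUID v5 is a hash of namespace plus name, the same GHCID always yields the same UUID, which is exactly the property the determinism tests assert.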
GHCID Format Tests:
# tests/unit/test_ghcid_unesco.py
import re


def test_ghcid_format_for_french_museum():
    """Test GHCID generation for French museum follows format."""
    custodian = HeritageCustodian(
        name="Musée du Louvre",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Paris", country="FR", region="Île-de-France")]
    )
    ghcid = generate_ghcid(custodian)
    # Format: FR-IDF-PAR-M-LOU
    assert ghcid.startswith("FR-")
    assert "-M-" in ghcid  # M = Museum
    assert re.match(r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}$', ghcid)


def test_ghcid_with_name_suffix_collision():
    """Test GHCID includes name suffix when collision detected."""
    custodian = HeritageCustodian(
        name="Stedelijk Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )
    # Simulate collision detection
    ghcid = generate_ghcid(custodian, collision_detected=True)
    # Format: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
    assert ghcid.endswith("-stedelijk_museum_amsterdam")
Determinism Tests (Property-Based):
import uuid

from hypothesis import given, strategies as st


@given(
    country=st.sampled_from(["FR", "NL", "BR", "IN", "JP"]),
    city=st.text(min_size=3, max_size=20, alphabet=st.characters(whitelist_categories=('Lu', 'Ll'))),
    inst_type=st.sampled_from(list(InstitutionTypeEnum))
)
def test_ghcid_generation_is_deterministic(country, city, inst_type):
    """Property: Same input always produces same GHCID."""
    custodian = HeritageCustodian(
        name="Test Institution",
        institution_type=inst_type,
        locations=[Location(city=city, country=country)]
    )
    ghcid1 = generate_ghcid(custodian)
    ghcid2 = generate_ghcid(custodian)
    assert ghcid1 == ghcid2


@given(st.text(min_size=5, max_size=100))
def test_uuid_v5_determinism_for_ghcid(ghcid_string):
    """Property: UUID v5 generation is deterministic."""
    uuid1 = generate_uuid_v5(ghcid_string)
    uuid2 = generate_uuid_v5(ghcid_string)
    assert uuid1 == uuid2
    assert isinstance(uuid1, uuid.UUID)
    assert uuid1.version == 5
Collision Detection Tests:
def test_collision_detection_with_same_base_ghcid():
    """Test that collision is detected when base GHCID matches existing record."""
    existing_dataset = [
        HeritageCustodian(
            ghcid="NL-NH-AMS-M-HM",
            name="Hermitage Amsterdam"
        )
    ]
    new_custodian = HeritageCustodian(
        name="Historical Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )
    collision = detect_ghcid_collision(new_custodian, existing_dataset)
    assert collision is True
    # Generate GHCID with name suffix to resolve collision
    ghcid = generate_ghcid(new_custodian, collision_detected=True)
    assert ghcid == "NL-NH-AMS-M-HM-historical_museum_amsterdam"
Success Criteria:
- 25+ unit tests for GHCID generator
- 100% coverage for GHCID logic
- Property-based tests verify determinism
- Collision detection 100% accurate on test cases
Day 9-12: Integration Tests
Test Plan:
- Test API → Parser pipeline
- Test Parser → Validator pipeline
- Test full extraction pipeline
- Test batch processing
- Test error recovery
Pipeline Integration Tests:
# tests/integration/test_api_to_parser_pipeline.py
@responses.activate
def test_full_pipeline_from_api_to_linkml_instance():
    """Test complete pipeline: API fetch → Parse → Validate."""
    # Arrange: Mock UNESCO API response
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France",
                    "latitude": 48.8566,
                    "longitude": 2.3522
                }
            }
        },
        status=200
    )
    # Act: Run pipeline
    api_client = UNESCOAPIClient()
    site_data = api_client.fetch_site_details(600)
    custodian = parse_unesco_site(site_data)
    custodian.ghcid = generate_ghcid(custodian)
    # Assert: Verify result
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.ghcid.startswith("FR-")
    # Validate against schema
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian.dict())
    assert report.valid


@pytest.mark.slow
def test_batch_processing_handles_failures_gracefully():
    """Test that batch processing continues when individual sites fail."""
    api_client = UNESCOAPIClient()
    site_ids = [600, 99999, 252, 88888, 936]  # Mix of valid and invalid IDs
    results = []
    errors = []
    for site_id in site_ids:
        try:
            site_data = api_client.fetch_site_details(site_id)
            custodian = parse_unesco_site(site_data)
            results.append(custodian)
        except UNESCOSiteNotFoundError as e:
            errors.append((site_id, str(e)))
    # Should have processed valid sites
    assert len(results) == 3  # 600, 252, 936
    # Should have logged errors for invalid sites
    assert len(errors) == 2  # 99999, 88888
Parallel Processing Tests:
@pytest.mark.slow
def test_parallel_extraction_produces_same_results_as_sequential():
    """Test that parallel processing produces identical results to sequential."""
    api_client = UNESCOAPIClient()
    site_ids = [600, 252, 936, 148, 274]
    # Sequential extraction
    sequential_results = []
    for site_id in site_ids:
        site_data = api_client.fetch_site_details(site_id)
        custodian = parse_unesco_site(site_data)
        sequential_results.append(custodian)

    # Parallel extraction
    from concurrent.futures import ThreadPoolExecutor

    def process_site(site_id):
        site_data = api_client.fetch_site_details(site_id)
        return parse_unesco_site(site_data)

    with ThreadPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(process_site, site_ids))
    # Results should be identical (order may differ)
    sequential_ghcids = sorted([c.ghcid for c in sequential_results])
    parallel_ghcids = sorted([c.ghcid for c in parallel_results])
    assert sequential_ghcids == parallel_ghcids
Success Criteria:
- 15+ integration tests covering pipeline components
- End-to-end test processes 20 golden dataset sites
- Parallel processing test confirms determinism
- All integration tests pass consistently
Phase 3: Data Quality & Validation (Days 14-19)
Day 14-16: Cross-Referencing and Conflict Detection Tests
Test Plan:
- Test matching by Wikidata Q-number
- Test matching by ISIL code
- Test fuzzy name matching
- Test conflict detection
- Test conflict resolution
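The fuzzy name matching in the plan above can be prototyped with the standard library. The production matcher may well use a dedicated library (token-based similarity, for example); this `difflib` sketch only demonstrates the similarity-threshold idea, and `name_similarity` is a hypothetical helper name:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two institution names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With a 0.85 threshold, "Rijksmuseum Amsterdam" and "Rijks Museum Amsterdam" match (they differ by one inserted space), while unrelated names fall well below the cutoff.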
Matching Tests:
# tests/unit/test_crosslinking.py
def test_match_by_wikidata_qnumber():
    """Test that institutions match by Wikidata Q-number."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    existing_record = HeritageCustodian(
        name="BnF",  # Different name
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    match = find_match(unesco_record, [existing_record])
    assert match is not None
    assert match.name == "BnF"


def test_fuzzy_name_matching_with_threshold():
    """Test fuzzy name matching with similarity threshold."""
    unesco_record = HeritageCustodian(
        name="Rijksmuseum Amsterdam",
        locations=[Location(city="Amsterdam", country="NL")]
    )
    existing_record = HeritageCustodian(
        name="Rijks Museum Amsterdam",  # Slightly different spelling
        locations=[Location(city="Amsterdam", country="NL")]
    )
    match, score = fuzzy_match_by_name_and_location(
        unesco_record, [existing_record], threshold=0.85
    )
    assert match is not None
    assert score >= 0.85
Conflict Detection Tests:
def test_detect_name_conflict():
    """Test detection of name conflicts between UNESCO and existing data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    existing_record = HeritageCustodian(
        name="BnF",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    conflicts = detect_conflicts(unesco_record, existing_record)
    assert len(conflicts) == 1
    assert conflicts[0].field == "name"
    assert conflicts[0].unesco_value == "Bibliothèque nationale de France"
    assert conflicts[0].existing_value == "BnF"


def test_detect_institution_type_conflict():
    """Test detection of institution type mismatches."""
    unesco_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.MUSEUM,
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )
    existing_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.LIBRARY,  # Conflict!
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )
    conflicts = detect_conflicts(unesco_record, existing_record)
    type_conflicts = [c for c in conflicts if c.field == "institution_type"]
    assert len(type_conflicts) == 1
Conflict Resolution Tests:
def test_unesco_tier1_wins_over_conversation_tier4():
    """Test that UNESCO (TIER_1) data takes priority over conversation (TIER_4) data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        institution_type=InstitutionTypeEnum.LIBRARY,
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )
    conversation_record = HeritageCustodian(
        name="BnF",
        institution_type=InstitutionTypeEnum.MIXED,  # Less accurate
        provenance=Provenance(data_tier=DataTier.TIER_4_INFERRED)
    )
    merged = resolve_conflicts(unesco_record, conversation_record)
    # UNESCO data should win
    assert merged.name == "Bibliothèque nationale de France"
    assert merged.institution_type == InstitutionTypeEnum.LIBRARY
    # But preserve alternative names
    assert "BnF" in merged.alternative_names
Success Criteria:
- 20+ tests for matching logic
- 15+ tests for conflict detection
- 10+ tests for conflict resolution
- 100% coverage for crosslinking module
Day 17-18: Validation and Confidence Scoring Tests
Test Plan:
- Test LinkML schema validation
- Test custom validators (UNESCO WHC ID, GHCID format)
- Test confidence score calculation
- Test quality metrics
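The confidence-score calculation listed above can be sketched as a field-completeness weighting. The weights below are illustrative assumptions, not the project's actual formula:

```python
# Illustrative field weights; the real scorer's weighting is project-specific.
FIELD_WEIGHTS = {
    "name": 0.3,
    "institution_type": 0.2,
    "locations": 0.2,
    "identifiers": 0.2,
    "description": 0.1,
}

def confidence_score(record: dict) -> float:
    """Sum the weights of fields that are present and non-empty."""
    return round(sum(w for field, w in FIELD_WEIGHTS.items() if record.get(field)), 2)
```

A fully populated record scores 1.0, while a record with only a name scores 0.3, mirroring the high/medium/low bands the confidence tests assert.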
Validation Tests:
# tests/unit/test_unesco_validators.py
def test_validate_unesco_whc_id_format():
    """Test UNESCO WHC ID validator accepts valid IDs."""
    assert validate_unesco_whc_id("600") is True
    assert validate_unesco_whc_id("1234") is True
    assert validate_unesco_whc_id("52") is False     # Too short
    assert validate_unesco_whc_id("12345") is False  # Too long
    assert validate_unesco_whc_id("ABC") is False    # Not numeric


def test_validate_ghcid_format_for_unesco():
    """Test GHCID format validator.

    Note: Collision suffix uses native language name in snake_case (NOT Wikidata Q-numbers).
    See: docs/plan/global_glam/07-ghcid-collision-resolution.md
    """
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU") is True
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU-musee_du_louvre") is True  # Native name suffix
    assert validate_ghcid_format("INVALID") is False
    assert validate_ghcid_format("FR-PAR-M") is False  # Missing components


@pytest.mark.parametrize("custodian_dict,should_pass", [
    # Valid record
    ({
        "name": "Test Museum",
        "institution_type": "MUSEUM",
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z"
        }
    }, True),
    # Missing required field
    ({
        "institution_type": "MUSEUM",  # Missing name!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),
    # Invalid enum value
    ({
        "name": "Test",
        "institution_type": "INVALID_TYPE",  # Not in enum!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),
])
def test_linkml_validation(custodian_dict, should_pass):
    """Test LinkML schema validation with valid and invalid records."""
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian_dict)
    if should_pass:
        assert report.valid, f"Unexpected validation error: {report.results}"
    else:
        assert not report.valid
Confidence Scoring Tests:
def test_confidence_score_high_for_complete_record():
"""Test that complete records with rich metadata score high."""
custodian = HeritageCustodian(
name="Musée du Louvre",
institution_type=InstitutionTypeEnum.MUSEUM,
locations=[Location(city="Paris", country="FR", latitude=48.86, longitude=2.34)],
identifiers=[
Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600"),
Identifier(identifier_scheme="Wikidata", identifier_value="Q19675"),
Identifier(identifier_scheme="VIAF", identifier_value="139708098")
],
digital_platforms=[DigitalPlatform(platform_name="Louvre Collections", platform_url="https://collections.louvre.fr")],
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
assert score >= 0.95
def test_confidence_score_low_for_minimal_record():
"""Test that minimal records score lower."""
custodian = HeritageCustodian(
name="Unknown Site",
institution_type=InstitutionTypeEnum.MIXED, # Ambiguous
# Missing: locations, identifiers, digital_platforms
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
assert score < 0.7
@given(
num_identifiers=st.integers(min_value=0, max_value=5),
has_location=st.booleans(),
has_platform=st.booleans()
)
def test_confidence_score_within_bounds_for_any_completeness(num_identifiers, has_location, has_platform):
"""Property: Confidence score stays within [0.0, 1.0] regardless of record completeness."""
custodian = HeritageCustodian(
name="Test",
institution_type=InstitutionTypeEnum.MUSEUM,
identifiers=[Identifier(identifier_scheme="Test", identifier_value=f"{i}") for i in range(num_identifiers)],
locations=[Location(city="City", country="CC")] if has_location else [],
digital_platforms=[DigitalPlatform(platform_name="Test", platform_url="https://test.com")] if has_platform else [],
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
# Score should be in valid range
assert 0.0 <= score <= 1.0
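A weighted-completeness scorer that satisfies the three tests above could be sketched as follows; all weights, the base score, and the dict-based record shape are illustrative assumptions chosen to meet the asserted thresholds:

```python
def calculate_confidence_score(record: dict) -> float:
    """Hypothetical scorer: more metadata raises the score, clamped to [0, 1]."""
    score = 0.5  # base credit for having a name and provenance at all
    # Up to +0.3 for external identifiers (capped at three)
    score += min(len(record.get("identifiers", [])), 3) * 0.1
    if record.get("locations"):
        score += 0.1
    if record.get("digital_platforms"):
        score += 0.1
    if record.get("institution_type") == "MIXED":
        score -= 0.1  # ambiguous typing lowers confidence
    return max(0.0, min(1.0, score))
```

With these weights a record carrying three identifiers, a location, and a platform scores 1.0, while a bare `MIXED` record scores 0.4, matching the high/low assertions above.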
Success Criteria:
- 20+ validation tests
- 15+ confidence scoring tests
- Property-based tests verify score consistency
- 100% coverage for validator modules
Phase 4: Integration & Enrichment (Days 20-24)
Wikidata Enrichment Tests
Test Plan:
- Test SPARQL query generation
- Test Q-number extraction
- Test fallback to fuzzy matching
- Test enrichment error handling
# tests/integration/test_wikidata_enrichment.py
@responses.activate
def test_wikidata_query_by_unesco_whc_id():
"""Test querying Wikidata for Q-number using UNESCO WHC ID."""
# Mock SPARQL endpoint response
responses.add(
responses.POST,
"https://query.wikidata.org/sparql",
json={
"results": {
"bindings": [
{"item": {"value": "http://www.wikidata.org/entity/Q90"}}
]
}
},
status=200
)
q_number = query_wikidata_for_unesco_site(whc_id="600")
assert q_number == "Q90"
def test_wikidata_enrichment_adds_identifiers():
"""Test that Wikidata enrichment adds Q-number to custodian."""
custodian = HeritageCustodian(
name="Paris, Banks of the Seine",
identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600")]
)
enriched = enrich_with_wikidata(custodian)
# Should have added Wikidata identifier
wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
assert len(wd_ids) == 1
assert wd_ids[0].identifier_value == "Q90"
def test_wikidata_enrichment_handles_not_found():
"""Test that enrichment handles missing Wikidata entries gracefully."""
custodian = HeritageCustodian(
name="Obscure Site",
identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="99999")]
)
enriched = enrich_with_wikidata(custodian)
# Should not crash, just log warning
wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
assert len(wd_ids) == 0
Success Criteria:
- 10+ Wikidata enrichment tests
- Mock SPARQL responses for offline testing
- Error handling tests for API failures
Phase 5: Export & Documentation (Days 25-30)
Export Format Tests
Test Plan:
- Test RDF/Turtle serialization
- Test JSON-LD export
- Test CSV flattening
- Test Parquet export
- Test round-trip data integrity
# tests/integration/test_exports.py
import json
def test_rdf_export_produces_valid_turtle(sample_dataset, tmp_path):
"""Test that RDF export produces valid Turtle syntax."""
from glam_extractor.exporters.rdf_exporter import export_to_rdf
output_path = tmp_path / "test_export.ttl"
export_to_rdf(sample_dataset, output_path)
# Verify file was created
assert output_path.exists()
# Parse with rdflib to verify syntax
from rdflib import Graph
graph = Graph()
graph.parse(output_path, format="turtle")
# Check that records were serialized
assert len(graph) > 0
def test_jsonld_export_includes_context(sample_dataset, tmp_path):
"""Test that JSON-LD export includes @context."""
from glam_extractor.exporters.jsonld_exporter import export_to_jsonld
output_path = tmp_path / "test_export.jsonld"
export_to_jsonld(sample_dataset, output_path)
with open(output_path) as f:
data = json.load(f)
assert "@context" in data
assert "@graph" in data or isinstance(data, list)
def test_csv_export_flattens_nested_structures(sample_dataset, tmp_path):
"""Test that CSV export flattens nested structures correctly."""
from glam_extractor.exporters.csv_exporter import export_to_csv
output_path = tmp_path / "test_export.csv"
export_to_csv(sample_dataset, output_path)
import pandas as pd
df = pd.read_csv(output_path)
# Check required columns
assert "ghcid" in df.columns
assert "name" in df.columns
assert "institution_type" in df.columns
assert "country" in df.columns
def test_parquet_export_preserves_data_types(sample_dataset, tmp_path):
"""Test that Parquet export preserves data types."""
from glam_extractor.exporters.parquet_exporter import export_to_parquet
output_path = tmp_path / "test_export.parquet"
export_to_parquet(sample_dataset, output_path)
import pandas as pd
df = pd.read_parquet(output_path)
# Verify data types
assert df['ghcid'].dtype == 'object' # String
assert df['confidence_score'].dtype == 'float64'
def test_round_trip_data_integrity(sample_dataset, tmp_path):
"""Test that data survives export → import round trip."""
# Export to JSON-LD
from glam_extractor.exporters.jsonld_exporter import export_to_jsonld
output_path = tmp_path / "test_export.jsonld"
export_to_jsonld(sample_dataset, output_path)
# Import back
from glam_extractor.importers.jsonld_importer import import_from_jsonld
imported_dataset = import_from_jsonld(output_path)
# Compare
assert len(imported_dataset) == len(sample_dataset)
for original, imported in zip(sample_dataset, imported_dataset):
assert original.ghcid == imported.ghcid
assert original.name == imported.name
Success Criteria:
- 15+ export format tests
- Round-trip tests verify data integrity
- All exports validate with external parsers
Continuous Integration (CI) Strategy
GitHub Actions Workflow
# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
pip install -e ".[test]"
- name: Run unit tests
run: |
pytest tests/unit -v --cov --cov-report=xml -m "not slow"
- name: Run integration tests
run: |
pytest tests/integration -v -m "not api and not wikidata"
- name: Check code coverage
run: |
pytest --cov-report=term-missing --cov-fail-under=90
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
files: ./coverage.xml
Pre-commit Hooks
# .pre-commit-config.yaml
repos:
- repo: https://github.com/psf/black
rev: 24.1.0
hooks:
- id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.1.11
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.8.0
hooks:
- id: mypy
additional_dependencies: [types-all]
- repo: local
hooks:
- id: pytest-quick
name: pytest-quick
entry: pytest tests/unit -m "not slow"
language: system
pass_filenames: false
always_run: true
Test Coverage Goals
Coverage Targets by Module
| Module | Target Coverage | Priority |
|---|---|---|
| `unesco_api_client.py` | 100% | Critical |
| `unesco_parser.py` | 100% | Critical |
| `unesco_institution_type.py` | 100% | Critical |
| `ghcid_generator.py` | 95% | Critical |
| `linkml_map.py` | 90% | High |
| `conflict_resolver.py` | 95% | High |
| `wikidata_enrichment.py` | 85% | Medium |
| `exporters/*.py` | 85% | Medium |
Coverage Enforcement
# Fail CI if coverage drops below 90%
pytest --cov-fail-under=90
# Generate HTML coverage report
pytest --cov --cov-report=html
# Open coverage report
open htmlcov/index.html
Performance Testing
Benchmark Tests
# tests/performance/test_benchmarks.py
import pytest
import time
@pytest.mark.slow
def test_parse_1000_sites_completes_in_reasonable_time():
"""Benchmark: Parse 1,000 UNESCO sites in < 60 seconds."""
start = time.time()
results = []
for i in range(1000):
site_data = generate_fake_unesco_site()
custodian = parse_unesco_site(site_data)
results.append(custodian)
elapsed = time.time() - start
assert elapsed < 60, f"Parsing took {elapsed:.2f}s, expected < 60s"
@pytest.mark.slow
def test_ghcid_generation_performance():
"""Benchmark: Generate 10,000 GHCIDs in < 5 seconds."""
custodians = [generate_fake_custodian() for _ in range(10000)]
start = time.time()
ghcids = [generate_ghcid(c) for c in custodians]
elapsed = time.time() - start
assert elapsed < 5, f"GHCID generation took {elapsed:.2f}s, expected < 5s"
Test Data Management
Fixture Organization
# tests/conftest.py
import pytest
from pathlib import Path
import json
import yaml
@pytest.fixture(scope="session")
def fixtures_dir():
"""Return path to fixtures directory."""
return Path(__file__).parent / "fixtures"
@pytest.fixture(scope="session")
def golden_dataset(fixtures_dir):
"""Load all golden dataset fixtures."""
expected_dir = fixtures_dir / "expected_outputs"
golden_data = {}
for yaml_file in expected_dir.glob("*.yaml"):
site_id = yaml_file.stem.split("_")[1] # Extract site ID
with open(yaml_file) as f:
golden_data[site_id] = yaml.safe_load(f)
return golden_data
@pytest.fixture
def mock_api_responses(fixtures_dir):
"""Load mock UNESCO API responses."""
api_dir = fixtures_dir / "unesco_api_responses"
responses = {}
for json_file in api_dir.glob("*.json"):
site_id = json_file.stem.split("_")[1]
with open(json_file) as f:
responses[site_id] = json.load(f)
return responses
Summary of Test Counts
Total Tests by Phase
| Phase | Unit Tests | Integration Tests | E2E Tests | Property-Based | Total |
|---|---|---|---|---|---|
| Phase 1 | 50 | 5 | 0 | 10 | 65 |
| Phase 2 | 75 | 20 | 3 | 15 | 113 |
| Phase 3 | 45 | 15 | 0 | 8 | 68 |
| Phase 4 | 25 | 10 | 2 | 5 | 42 |
| Phase 5 | 30 | 10 | 2 | 0 | 42 |
| Total | 225 | 60 | 7 | 38 | 330 |
Success Criteria
- 330+ total tests implemented
- 90%+ code coverage across all modules
- 100% passing tests on CI
- Zero flaky tests (deterministic execution)
- All integration tests use mocked external dependencies
- Property-based tests cover edge cases
- Performance benchmarks pass on CI
Version History
Version 1.1 (2025-11-10)
Migration to OpenDataSoft Explore API v2.0
Breaking Changes:
- API Endpoint Format: Changed from `https://whc.unesco.org/en/list/{id}/json` to `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/{id}`
- JSON Structure: Migrated from flat JSON to nested OpenDataSoft format
  - Old: `{"id_number": 600, "site": "Paris, Banks of the Seine"}`
  - New: `{"record": {"id": "rec_600", "fields": {"unique_number": 600, "name_en": "Paris, Banks of the Seine"}}}`
- Field Mappings Updated:
  - `id_number` → `unique_number`
  - `site` → `name_en`
  - `states` → `states_name_en`
  - `region` → `region_en`
  - `short_description` → `short_description_en`
  - `justification` → `justification_en`
  - `date_inscribed` → numeric year (was string)
  - `links` → `http_url` (single URL field)
  - Coordinates now in nested `fields` object
Updated Sections:
- Day 1: API Client Tests (Lines 175-269)
  - Updated all API endpoint URLs to OpenDataSoft format
  - Updated mock JSON responses to nested structure
  - Updated field access patterns in assertions
  - Updated error handling tests (404 responses)
  - Updated retry logic tests with new endpoints
  - Updated caching tests with new response format
- Day 2-3: LinkML Map Schema Tests (Lines 342-407)
  - Updated institution type mapping test fixtures
  - Updated botanical garden test fixture
  - Updated multilingual name extraction (now uses `name_en`, `name_fr`, `name_es` fields)
  - Updated field access in assertions
- Day 4: Test Fixture Creation (Lines 450-510)
  - Updated `sample_unesco_site_600` fixture with nested structure
  - Updated `expected_heritage_custodian_600` fixture
  - Updated mock API client return values
- Day 6-7: Parser Tests (Lines 709-747)
  - Updated incomplete site test with nested structure
  - Updated identifier extraction test
  - Changed Wikidata extraction from `links` array to `wdpaid` field
  - Updated all field references
- Day 9-12: Integration Tests (Lines 886-916)
  - Updated full pipeline integration test mock response
  - Updated all field names in nested structure
  - Updated assertions to match new API format
Field Access Pattern Changes:
# Old (v1.0)
site_id = response['id_number']
site_name = response['site']
country = response['states']
# New (v1.1)
site_id = response['record']['fields']['unique_number']
site_name = response['record']['fields']['name_en']
country = response['record']['fields']['states_name_en']
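Chained `[]` lookups on the new nested shape raise `KeyError` on incomplete records, so a small tolerant accessor may be preferable; this helper is an illustrative sketch, not part of the documented API client:

```python
def get_field(response: dict, name: str, default=None):
    """Read a field from the nested OpenDataSoft record shape.

    Returns `default` instead of raising when any level of the
    record/fields nesting is missing.
    """
    return response.get("record", {}).get("fields", {}).get(name, default)
```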
Testing Impact:
- All API client tests require updated endpoint URLs
- All mock responses updated to nested JSON structure
- Parser tests updated to extract from `record.fields.*` paths
- LinkML Map schema must handle nested path extraction
- Golden dataset fixtures require regeneration with new format
- Integration tests updated with new field access patterns
Rationale for Migration:
- Legacy UNESCO JSON API (`whc.unesco.org/en/list/{id}/json`) deprecated or unstable
- OpenDataSoft Explore API v2.0 is the official UNESCO data platform
- Provides standardized REST API with pagination, filtering, and versioned datasets
- Better maintained and documented than legacy endpoint
- Supports structured metadata with type information
Migration Guide:
- Update all API client code to use new endpoint format
- Update JSON parsing logic to handle `record.fields.*` nested structure
- Update LinkML Map schema transformation paths
- Regenerate all test fixtures with new API responses
- Run full test suite to verify all tests pass with new format
References:
- OpenDataSoft API Documentation: https://help.opendatasoft.com/apis/ods-explore-v2/
- UNESCO World Heritage Dataset: https://data.unesco.org/datasets/whc-sites-2021
- API Migration Tracking Issue: TBD
Version 1.0 (2025-11-08)
Initial Release
- Complete TDD strategy for UNESCO World Heritage Sites extraction
- 13-day implementation plan with daily test goals
- 175+ total tests across all phases:
- 15+ API client unit tests
- 20+ golden dataset tests
- 20+ parser unit tests
- 15+ validator tests
- 25+ GHCID generator tests
- 30+ integration tests
- 50+ property-based tests
- Coverage targets: 90%+ overall, 100% for critical modules
- Test types: Unit, Integration, E2E, Property-Based, Golden Dataset
- Fixtures: 20+ golden dataset YAML files, 20+ mock API responses
- Performance benchmarks: Parse 1,000 sites in <60s, Generate 10,000 GHCIDs in <5s
- CI/CD integration with pytest and coverage reporting
Next Steps
Completed:
- ✅ `01-dependencies.md` - Technical dependencies
- ✅ `02-consumers.md` - Use cases and data consumers
- ✅ `03-implementation-phases.md` - 6-week timeline
- ✅ `04-tdd-strategy.md` - THIS DOCUMENT - Testing strategy
Next to Create:
- `05-design-patterns.md` - UNESCO-specific architectural patterns
- `06-linkml-map-schema.md` - CRITICAL - Complete LinkML Map transformation rules
- `07-master-checklist.md` - Implementation checklist
Document Status: Complete
Next Document: 05-design-patterns.md - Architectural patterns
Version: 1.1