UNESCO Data Extraction - Test-Driven Development Strategy
Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 04 - TDD Strategy
Version: 1.1
Date: 2025-11-10
Status: Updated for OpenDataSoft API v2.0
Executive Summary
This document outlines the test-driven development (TDD) strategy for the UNESCO World Heritage Sites extraction project. Following TDD principles ensures high code quality, maintainability, and confidence in data integrity. The strategy emphasizes writing tests BEFORE implementation, using diverse testing approaches (unit, integration, property-based, golden dataset), and maintaining 90%+ code coverage.
Key Principle: Red → Green → Refactor
- Red: Write failing test first
- Green: Write minimal code to pass test
- Refactor: Improve code while keeping tests green
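A deliberately tiny illustration of the cycle (the `slug_for` helper is hypothetical, not part of the project code):

```python
# RED: written first; this test fails because slug_for does not exist yet.
def test_slug_for_lowercases_and_joins_words():
    assert slug_for("Musée du Louvre") == "musée_du_louvre"

# GREEN: the minimal implementation that makes the test pass.
def slug_for(name: str) -> str:
    return "_".join(name.lower().split())

# REFACTOR: improve internals (e.g. strip punctuation) while the test stays green.
```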
Testing Philosophy
Core Principles
- Tests as Specification: Tests document expected behavior and serve as executable specifications
- Fast Feedback: Unit tests run in milliseconds, integration tests in seconds
- Isolation: Each test is independent, can run in any order
- Determinism: Same input always produces same output (no flaky tests)
- Coverage as Quality Gate: 90%+ coverage required before merge
Testing Pyramid
        ┌─────────────────┐
        │    E2E Tests    │  (10% of tests)
        │  Full pipeline  │
        └─────────────────┘
     ┌───────────────────────┐
     │   Integration Tests   │  (20% of tests)
     │   API + Parser + DB   │
     └───────────────────────┘
┌─────────────────────────────────┐
│           Unit Tests            │  (70% of tests)
│ Functions, Classes, Validators  │
└─────────────────────────────────┘
Distribution:
- 70% Unit Tests: Fast, isolated, test individual functions/classes
- 20% Integration Tests: Test component interactions (API client + parser)
- 10% E2E Tests: Full pipeline from API fetch to LinkML validation
Test Infrastructure
Testing Framework Stack
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = [
    "--cov=src/glam_extractor",
    "--cov-report=html",
    "--cov-report=term-missing",
    "--cov-fail-under=90",
    "--verbose",
    "--strict-markers",
    "--tb=short",
]
markers = [
    "unit: Unit tests (fast, isolated)",
    "integration: Integration tests (requires external resources)",
    "e2e: End-to-end tests (full pipeline)",
    "slow: Slow tests (> 1 second)",
    "api: Tests requiring UNESCO API access",
    "wikidata: Tests requiring Wikidata SPARQL endpoint",
]
Dependencies
[project.optional-dependencies]
test = [
    "pytest >= 8.0.0",
    "pytest-cov >= 4.1.0",
    "pytest-mock >= 3.12.0",
    "pytest-xdist >= 3.5.0",  # Parallel test execution
    "hypothesis >= 6.92.0",   # Property-based testing
    "responses >= 0.24.0",    # Mock HTTP requests
    "freezegun >= 1.4.0",     # Mock datetime
    "faker >= 22.0.0",        # Generate fake data
    "deepdiff >= 6.7.0",      # Deep object comparison
]
Directory Structure
tests/
├── unit/                          # Unit tests (fast, isolated)
│   ├── test_unesco_api_client.py
│   ├── test_unesco_parser.py
│   ├── test_institution_classifier.py
│   ├── test_ghcid_generator.py
│   └── test_linkml_map.py
├── integration/                   # Integration tests
│   ├── test_api_to_parser_pipeline.py
│   ├── test_parser_to_validator_pipeline.py
│   ├── test_wikidata_enrichment.py
│   └── test_dataset_merge.py
├── e2e/                           # End-to-end tests
│   ├── test_full_unesco_extraction.py
│   └── test_export_all_formats.py
├── fixtures/                      # Test data
│   ├── unesco_api_responses/      # Sample UNESCO JSON responses
│   │   ├── site_600_bnf.json
│   │   ├── site_list_sample.json
│   │   └── ...
│   ├── expected_outputs/          # Golden dataset expected results
│   │   ├── unesco_bnf_expected.yaml
│   │   └── ...
│   └── mock_responses/            # Pre-recorded API responses
├── conftest.py                    # Pytest configuration and shared fixtures
└── README.md                      # Testing guide
Phase-by-Phase TDD Strategy
Phase 1: API Exploration & Schema Design (Days 1-5)
Day 1: UNESCO API Client Tests
Test Plan:
- Test API connectivity (integration test)
- Test response parsing (unit test)
- Test error handling (unit test)
- Test caching (unit test)
- Test rate limiting (integration test)
Example TDD Cycle:
# tests/unit/test_unesco_api_client.py
import pytest
import responses

from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient


# RED: Write failing test first
def test_fetch_site_list_returns_list_of_sites():
    """Test that fetch_site_list returns a list of UNESCO sites."""
    client = UNESCOAPIClient()
    sites = client.fetch_site_list()
    assert isinstance(sites, list)
    assert len(sites) > 0
    assert 'id_number' in sites[0]
    assert 'site' in sites[0]

# This test will FAIL because UNESCOAPIClient doesn't exist yet.
# Now implement minimal code to make it pass (GREEN).
Mock API Responses:
@responses.activate
def test_fetch_site_details_with_valid_id():
    """Test fetching site details with mocked API response."""
    # Arrange: Set up mock response (OpenDataSoft API v2.0 format)
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France"
                }
            }
        },
        status=200
    )
    # Act: Call the method
    client = UNESCOAPIClient()
    site = client.fetch_site_details(600)
    # Assert: Verify results (nested structure)
    assert site['record']['fields']['unique_number'] == 600
    assert site['record']['fields']['name_en'] == "Paris, Banks of the Seine"
    assert site['record']['fields']['category'] == "Cultural"
Error Handling Tests:
@responses.activate
def test_fetch_site_details_handles_404():
    """Test that 404 errors are handled gracefully."""
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/99999",
        status=404
    )
    client = UNESCOAPIClient()
    with pytest.raises(UNESCOSiteNotFoundError) as exc_info:
        client.fetch_site_details(99999)
    assert "Site 99999 not found" in str(exc_info.value)


@responses.activate
def test_fetch_site_details_retries_on_network_error():
    """Test that network errors trigger retry logic."""
    # First call fails, second call succeeds
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        body=ConnectionError("Network error")  # responses raises this on the first call
    )
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine"
                }
            }
        },
        status=200
    )
    client = UNESCOAPIClient(max_retries=3)
    site = client.fetch_site_details(600)
    assert site['record']['fields']['unique_number'] == 600
    assert len(responses.calls) == 2  # Verify retry happened
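The retry behavior exercised above can be sketched independently of the real client. This is a minimal stand-alone helper, not the project's actual implementation; `max_retries` and `base_delay` are illustrative parameter names:

```python
import time

def fetch_with_retries(fetch, max_retries=3, base_delay=0.5):
    """Call fetch(), retrying with exponential backoff on connection errors."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

The test above would then assert that a request failing once and succeeding on the second attempt results in exactly two recorded calls.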
Caching Tests:
def test_api_client_caches_responses(tmp_path):
    """Test that API responses are cached to avoid redundant requests."""
    cache_db = tmp_path / "test_cache.db"
    client = UNESCOAPIClient(cache_path=cache_db)
    # First call: cache miss, fetches from API
    with responses.RequestsMock() as rsps:
        rsps.add(
            responses.GET,
            "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
            json={"record": {"id": "rec_600", "fields": {"unique_number": 600}}},
            status=200
        )
        site1 = client.fetch_site_details(600)
        assert len(rsps.calls) == 1
    # Second call: cache hit, no API request
    with responses.RequestsMock() as rsps:
        site2 = client.fetch_site_details(600)
        assert len(rsps.calls) == 0  # No new API calls
    assert site1 == site2
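A cache backing `cache_path` could be as simple as a sqlite table keyed by site ID. The sketch below is illustrative only (the real client's cache schema and class name are not specified here):

```python
import json
import sqlite3

class ResponseCache:
    """Minimal sqlite-backed cache for JSON API responses (illustrative sketch)."""

    def __init__(self, cache_path):
        self.conn = sqlite3.connect(str(cache_path))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (site_id INTEGER PRIMARY KEY, body TEXT)"
        )

    def get(self, site_id):
        # Return the cached JSON payload, or None on a cache miss.
        row = self.conn.execute(
            "SELECT body FROM responses WHERE site_id = ?", (site_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, site_id, payload):
        self.conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?)",
            (site_id, json.dumps(payload))
        )
        self.conn.commit()
```

The client would consult `get()` before issuing a request and call `put()` after a successful fetch, which is exactly what the caching test above verifies from the outside.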
Success Criteria:
- 15+ unit tests for API client
- 100% coverage for unesco_api_client.py
- All tests run in < 1 second (mocked responses)
Day 2-3: LinkML Map Schema Tests
Test Plan:
- Test basic field extraction (name, location, UNESCO WHC ID)
- Test multi-language name handling
- Test conditional institution type mapping
- Test missing field handling
- Test regex validators
Golden Dataset Approach:
# tests/integration/test_linkml_map_golden_dataset.py
import json
from pathlib import Path

import pytest
import yaml
from deepdiff import DeepDiff
from linkml_map import Mapper

from glam_extractor.models import HeritageCustodian

FIXTURES_DIR = Path(__file__).parent.parent / "fixtures"


@pytest.mark.parametrize("site_id", [
    600,  # Paris, Banks of the Seine (Library/Museum)
    252,  # Taj Mahal (Monument with museum)
    936,  # Kew Gardens (Botanical garden)
    # Add 17 more...
])
def test_linkml_map_golden_dataset(site_id):
    """Test LinkML Map transformation against golden dataset."""
    # Load input JSON
    input_json = FIXTURES_DIR / f"unesco_api_responses/site_{site_id}.json"
    with open(input_json) as f:
        api_response = json.load(f)
    # Load expected output YAML
    expected_yaml = FIXTURES_DIR / f"expected_outputs/site_{site_id}_expected.yaml"
    with open(expected_yaml) as f:
        expected = yaml.safe_load(f)
    # Apply LinkML Map transformation
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    # Compare result with expected output
    assert result['name'] == expected['name']
    assert result['institution_type'] == expected['institution_type']
    assert result['locations'][0]['city'] == expected['locations'][0]['city']
    # Deep comparison for complex nested structures
    diff = DeepDiff(result, expected, ignore_order=True)
    assert not diff, f"Unexpected differences: {diff}"
Conditional Mapping Tests:
def test_institution_type_mapping_for_museum():
    """Test that sites with 'museum' in description map to MUSEUM type."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "short_description_en": "Includes the Louvre Museum and other heritage institutions",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "MUSEUM"


def test_institution_type_mapping_for_botanical_garden():
    """Test that botanical gardens map correctly."""
    api_response = {
        "record": {
            "id": "rec_936",
            "fields": {
                "unique_number": 936,
                "name_en": "Royal Botanic Gardens, Kew",
                "short_description_en": "Botanical garden and herbarium",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "BOTANICAL_ZOO"


def test_institution_type_defaults_to_mixed_when_ambiguous():
    """Test that ambiguous sites default to MIXED type."""
    api_response = {
        "record": {
            "id": "rec_123",
            "fields": {
                "unique_number": 123,
                "name_en": "Historic District",
                "short_description_en": "Historic area with no specific institution mentioned",
                "category": "Cultural"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['institution_type'] == "MIXED"
Multi-language Name Tests:
def test_multilingual_name_extraction():
    """Test extraction of names in multiple languages."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "name_fr": "Paris, rives de la Seine",
                "name_es": "París, riberas del Sena"
            }
        }
    }
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)
    assert result['name'] == "Paris, Banks of the Seine"
    assert "Paris, rives de la Seine@fr" in result['alternative_names']
    assert "París, riberas del Sena@es" in result['alternative_names']
Success Criteria:
- 20 golden dataset test cases (100% passing)
- 10+ conditional mapping tests
- 5+ multi-language tests
- LinkML Map schema validates with linkml-validate
Day 4: Test Fixture Creation
Fixture Quality Checklist:
# tests/conftest.py - Shared fixtures
from pathlib import Path

import pytest

from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient


@pytest.fixture
def fixtures_dir():
    """Return path to test fixtures directory."""
    return Path(__file__).parent / "fixtures"


@pytest.fixture
def sample_unesco_site_600():
    """Fixture for UNESCO site 600 (Paris, Banks of the Seine)."""
    return {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "category": "Cultural",
                "states_name_en": "France",
                "region_en": "Europe and North America",
                "latitude": 48.8566,
                "longitude": 2.3522,
                "short_description_en": "From the Louvre to the Eiffel Tower...",
                "justification_en": "Contains numerous heritage institutions including museums and libraries",
                "date_inscribed": 1991,
                "http_url": "https://whc.unesco.org/en/list/600"
            }
        }
    }


@pytest.fixture
def expected_heritage_custodian_600():
    """Expected HeritageCustodian output for site 600."""
    return {
        "id": "https://w3id.org/heritage/custodian/fr/louvre",
        "name": "Paris, Banks of the Seine",
        "institution_type": "MIXED",  # Multiple institutions at this site
        "locations": [
            {
                "city": "Paris",
                "country": "FR",
                "coordinates": [48.8566, 2.3522],
                "geonames_id": "2988507"
            }
        ],
        "identifiers": [
            {
                "identifier_scheme": "UNESCO_WHC",
                "identifier_value": "600",
                "identifier_url": "https://whc.unesco.org/en/list/600"
            },
            {
                "identifier_scheme": "Wikidata",
                "identifier_value": "Q90",
                "identifier_url": "https://www.wikidata.org/wiki/Q90"
            }
        ],
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z",
            "confidence_score": 1.0
        }
    }


@pytest.fixture
def mock_unesco_api_client(mocker):
    """Mock UNESCO API client for testing without network calls."""
    mock_client = mocker.Mock(spec=UNESCOAPIClient)
    mock_client.fetch_site_details.return_value = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine"
            }
        }
    }
    return mock_client
Fixture Coverage Matrix:
| Site ID | Location | Institution Type | Special Case |
|---|---|---|---|
| 600 | France, Paris | MIXED | Multiple institutions |
| 252 | India, Agra | MUSEUM | Archaeological museum |
| 936 | UK, London | BOTANICAL_ZOO | Kew Gardens |
| 148 | Brazil, Brasília | ARCHIVE | National Archive |
| 274 | Egypt, Cairo | MUSEUM | Egyptian Museum |
| 890 | Japan, Kyoto | HOLY_SITES | Temple with collection |
| 1110 | Vietnam, Hanoi | LIBRARY | National Library |
| 723 | Mexico, Mexico City | MUSEUM | Anthropology Museum |
| ... | ... | ... | ... |
Success Criteria:
- 20 diverse test fixtures covering all continents
- 20 expected output YAML files (manually verified)
- Fixtures cover all institution types in InstitutionTypeEnum
- Edge cases documented (serial nominations, transboundary sites)
Day 5: Institution Type Classifier Tests
Test Plan:
- Test keyword-based classification
- Test multi-keyword scoring
- Test confidence scoring
- Test multilingual keywords
- Test ambiguous case handling
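The keyword-scoring approach listed above can be sketched in a few lines. This is a simplified stand-in for the real `classify_institution_type` (the keyword table and the confidence formula here are illustrative assumptions, not the project's actual vocabulary or scoring):

```python
# Illustrative keyword table; the real classifier's vocabulary is larger and multilingual.
KEYWORDS = {
    "MUSEUM": ["museum", "musée", "museo", "museu"],
    "LIBRARY": ["library", "bibliothèque", "biblioteca"],
    "ARCHIVE": ["archive", "archief"],
    "BOTANICAL_ZOO": ["botanic", "herbarium", "jardin botanique"],
}

def classify_institution_type(description: str) -> dict:
    """Return the best-matching type plus a crude keyword-count confidence."""
    text = description.lower()
    scores = {t: sum(kw in text for kw in kws) for t, kws in KEYWORDS.items()}
    best_type, hits = max(scores.items(), key=lambda item: item[1])
    if hits == 0:
        # No keyword evidence: fall back to MIXED with low confidence.
        return {"institution_type": "MIXED", "confidence_score": 0.5}
    return {"institution_type": best_type, "confidence_score": min(1.0, 0.7 + 0.2 * hits)}
```

More keyword hits raise confidence; zero hits fall through to the MIXED default, which is the behavior the ambiguous-case tests check.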
Property-Based Testing with Hypothesis:
# tests/unit/test_institution_classifier_properties.py
from hypothesis import given, strategies as st

from glam_extractor.classifiers.unesco_institution_type import classify_institution_type


@given(description=st.text(min_size=10, max_size=500))
def test_classifier_always_returns_valid_enum(description):
    """Property: Classifier must always return a valid InstitutionTypeEnum."""
    result = classify_institution_type(description)
    assert result['institution_type'] in [
        "MUSEUM", "LIBRARY", "ARCHIVE", "GALLERY",
        "BOTANICAL_ZOO", "HOLY_SITES", "FEATURES", "MIXED"
    ]
    assert 0.0 <= result['confidence_score'] <= 1.0


@given(
    description=st.text(),
    keyword=st.sampled_from(["museum", "library", "archive", "botanical garden"])
)
def test_classifier_detects_explicit_keywords(description, keyword):
    """Property: Explicit keywords should be detected reliably."""
    # Force keyword into description
    modified_desc = f"{description} This site contains a {keyword}."
    result = classify_institution_type(modified_desc)
    # Should detect the keyword with high confidence
    assert result['confidence_score'] >= 0.7
Keyword Detection Tests:
@pytest.mark.parametrize("description,expected_type,min_confidence", [
    # Museum keywords
    ("Site includes the National Museum of Anthropology", "MUSEUM", 0.9),
    ("Archaeological museum with 10,000 artifacts", "MUSEUM", 0.9),
    ("Le Musée du Louvre est situé à Paris", "MUSEUM", 0.9),  # French
    # Library keywords
    ("The National Library holds rare manuscripts", "LIBRARY", 0.9),
    ("Biblioteca Nacional do Brasil", "LIBRARY", 0.9),  # Portuguese
    # Archive keywords
    ("Historic archive with colonial documents", "ARCHIVE", 0.9),
    ("Rijksarchief in Noord-Holland", "ARCHIVE", 0.9),  # Dutch
    # Botanical garden keywords
    ("Royal Botanic Gardens with herbarium collection", "BOTANICAL_ZOO", 0.9),
    ("Jardin botanique with 50,000 plant species", "BOTANICAL_ZOO", 0.9),
    # Holy sites
    ("Cathedral with liturgical manuscripts collection", "HOLY_SITES", 0.85),
    ("Monastery library with medieval codices", "HOLY_SITES", 0.85),
    # Ambiguous cases (should default to MIXED)
    ("Historic district with various buildings", "MIXED", 0.5),
    ("Cultural landscape", "MIXED", 0.5),
])
def test_keyword_based_classification(description, expected_type, min_confidence):
    """Test that classifier detects keywords correctly."""
    result = classify_institution_type(description)
    assert result['institution_type'] == expected_type
    assert result['confidence_score'] >= min_confidence
Multi-language Support Tests:
def test_classifier_handles_multilingual_descriptions():
    """Test that classifier works with non-English descriptions."""
    test_cases = [
        # French
        ("Le site comprend un musée d'art moderne", "MUSEUM"),
        ("Bibliothèque nationale avec des manuscrits rares", "LIBRARY"),
        # Spanish
        ("Museo Nacional de Antropología con colecciones", "MUSEUM"),
        ("Biblioteca pública con archivos históricos", "LIBRARY"),
        # Portuguese
        ("Museu Nacional com exposições permanentes", "MUSEUM"),
        # German
        ("Staatliches Museum für Völkerkunde", "MUSEUM"),
        # Dutch
        ("Rijksmuseum met kunstcollecties", "MUSEUM"),
    ]
    for description, expected_type in test_cases:
        result = classify_institution_type(description)
        assert result['institution_type'] == expected_type, \
            f"Failed for: {description}"
Confidence Scoring Tests:
def test_confidence_high_for_explicit_mentions():
    """Test that explicit mentions produce high confidence scores."""
    result = classify_institution_type("Site contains the National Museum")
    assert result['confidence_score'] >= 0.9


def test_confidence_medium_for_inferred_types():
    """Test that inferred types produce medium confidence scores."""
    result = classify_institution_type("Archaeological site with exhibition hall")
    assert 0.6 <= result['confidence_score'] < 0.9


def test_confidence_low_for_ambiguous_cases():
    """Test that ambiguous cases produce low confidence scores."""
    result = classify_institution_type("Historic area")
    assert result['confidence_score'] < 0.6
Success Criteria:
- 50+ unit tests for classifier
- 100% coverage for unesco_institution_type.py
- Property-based tests cover edge cases
- 90%+ accuracy on golden dataset (20 fixtures)
Phase 2: Extractor Implementation (Days 6-13)
Day 6-7: Parser Tests
Test Plan:
- Test parsing valid UNESCO JSON to HeritageCustodian
- Test handling missing optional fields
- Test handling invalid/malformed JSON
- Test LinkML validation integration
- Test provenance metadata generation
Parser Unit Tests:
# tests/unit/test_unesco_parser.py
from glam_extractor.models import HeritageCustodian, InstitutionTypeEnum
from glam_extractor.parsers.unesco_parser import parse_unesco_site


def test_parse_unesco_site_creates_valid_heritage_custodian(sample_unesco_site_600):
    """Test that parser creates valid HeritageCustodian from UNESCO JSON."""
    custodian = parse_unesco_site(sample_unesco_site_600)
    assert isinstance(custodian, HeritageCustodian)
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.institution_type in InstitutionTypeEnum
    assert len(custodian.identifiers) >= 1
    assert custodian.provenance.data_source == "UNESCO_WORLD_HERITAGE"


def test_parse_unesco_site_handles_missing_coordinates():
    """Test that parser handles missing latitude/longitude gracefully."""
    incomplete_site = {
        "record": {
            "id": "rec_999",
            "fields": {
                "unique_number": 999,
                "name_en": "Test Site",
                "category": "Cultural"
                # Missing latitude, longitude
            }
        }
    }
    custodian = parse_unesco_site(incomplete_site)
    # Should still parse, but location may lack coordinates
    assert custodian.name == "Test Site"
    if custodian.locations:
        assert custodian.locations[0].latitude is None
        assert custodian.locations[0].longitude is None


def test_parse_unesco_site_extracts_all_identifiers():
    """Test that parser extracts UNESCO WHC ID and Wikidata Q-number."""
    site_with_links = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "http_url": "https://whc.unesco.org/en/list/600",
                "wdpaid": "Q90"  # Wikidata ID field
            }
        }
    }
    custodian = parse_unesco_site(site_with_links)
    # Check UNESCO WHC ID
    whc_ids = [i for i in custodian.identifiers if i.identifier_scheme == "UNESCO_WHC"]
    assert len(whc_ids) == 1
    assert whc_ids[0].identifier_value == "600"
    # Check Wikidata Q-number
    wd_ids = [i for i in custodian.identifiers if i.identifier_scheme == "Wikidata"]
    assert len(wd_ids) == 1
    assert wd_ids[0].identifier_value == "Q90"
LinkML Validation Integration:
def test_parsed_custodian_validates_against_linkml_schema(sample_unesco_site_600):
    """Test that parsed HeritageCustodian passes LinkML validation."""
    from linkml.validators import JsonSchemaValidator

    custodian = parse_unesco_site(sample_unesco_site_600)
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    # Convert to dict for validation
    custodian_dict = custodian.dict()
    # Should not raise ValidationError
    report = validator.validate(custodian_dict)
    assert report.valid, f"Validation errors: {report.results}"
Success Criteria:
- 20+ unit tests for parser
- 100% coverage for unesco_parser.py
- All parsed instances validate against LinkML schema
Day 8: GHCID Generator Tests
Test Plan:
- Test GHCID format correctness
- Test deterministic generation (same input → same GHCID)
- Test UUID v5 generation
- Test collision detection
- Test native name suffix generation (NOT Wikidata Q-numbers)
Note: Collision resolution uses the native-language institution name in snake_case format (NOT Wikidata Q-numbers). See docs/plan/global_glam/07-ghcid-collision-resolution.md for authoritative documentation.
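The UUID v5 determinism that Day 8 tests rely on follows directly from name-based UUID construction; a self-contained sketch (the namespace URL below is illustrative, not the project's fixed namespace):

```python
import uuid

# Illustrative namespace; the project would pin one namespace UUID for all GHCIDs.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/ghcid")

def generate_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a stable, version-5 UUID from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Because UUID v5 is a hash of namespace plus name, the same GHCID always yields the same UUID, which is exactly the property the determinism tests assert.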
GHCID Format Tests:
# tests/unit/test_ghcid_unesco.py
import re


def test_ghcid_format_for_french_museum():
    """Test GHCID generation for French museum follows format."""
    custodian = HeritageCustodian(
        name="Musée du Louvre",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Paris", country="FR", region="Île-de-France")]
    )
    ghcid = generate_ghcid(custodian)
    # Format: FR-IDF-PAR-M-LOU
    assert ghcid.startswith("FR-")
    assert "-M-" in ghcid  # M = Museum
    assert re.match(r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}$', ghcid)


def test_ghcid_with_name_suffix_collision():
    """Test GHCID includes name suffix when collision detected."""
    custodian = HeritageCustodian(
        name="Stedelijk Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )
    # Simulate collision detection
    ghcid = generate_ghcid(custodian, collision_detected=True)
    # Format: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
    assert ghcid.endswith("-stedelijk_museum_amsterdam")
Determinism Tests (Property-Based):
import uuid

from hypothesis import given, strategies as st


@given(
    country=st.sampled_from(["FR", "NL", "BR", "IN", "JP"]),
    city=st.text(min_size=3, max_size=20, alphabet=st.characters(whitelist_categories=('Lu', 'Ll'))),
    inst_type=st.sampled_from(list(InstitutionTypeEnum))
)
def test_ghcid_generation_is_deterministic(country, city, inst_type):
    """Property: Same input always produces same GHCID."""
    custodian = HeritageCustodian(
        name="Test Institution",
        institution_type=inst_type,
        locations=[Location(city=city, country=country)]
    )
    ghcid1 = generate_ghcid(custodian)
    ghcid2 = generate_ghcid(custodian)
    assert ghcid1 == ghcid2


@given(st.text(min_size=5, max_size=100))
def test_uuid_v5_determinism_for_ghcid(ghcid_string):
    """Property: UUID v5 generation is deterministic."""
    uuid1 = generate_uuid_v5(ghcid_string)
    uuid2 = generate_uuid_v5(ghcid_string)
    assert uuid1 == uuid2
    assert isinstance(uuid1, uuid.UUID)
    assert uuid1.version == 5
Collision Detection Tests:
def test_collision_detection_with_same_base_ghcid():
    """Test that collision is detected when base GHCID matches existing record."""
    existing_dataset = [
        HeritageCustodian(
            ghcid="NL-NH-AMS-M-HM",
            name="Hermitage Amsterdam"
        )
    ]
    new_custodian = HeritageCustodian(
        name="Historical Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )
    collision = detect_ghcid_collision(new_custodian, existing_dataset)
    assert collision is True
    # Generate GHCID with name suffix to resolve collision
    ghcid = generate_ghcid(new_custodian, collision_detected=True)
    assert ghcid == "NL-NH-AMS-M-HM-historical_museum_amsterdam"
Success Criteria:
- 25+ unit tests for GHCID generator
- 100% coverage for GHCID logic
- Property-based tests verify determinism
- Collision detection 100% accurate on test cases
Day 9-12: Integration Tests
Test Plan:
- Test API → Parser pipeline
- Test Parser → Validator pipeline
- Test full extraction pipeline
- Test batch processing
- Test error recovery
Pipeline Integration Tests:
# tests/integration/test_api_to_parser_pipeline.py
@responses.activate
def test_full_pipeline_from_api_to_linkml_instance():
    """Test complete pipeline: API fetch → Parse → Validate."""
    # Arrange: Mock UNESCO API response
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France",
                    "latitude": 48.8566,
                    "longitude": 2.3522
                }
            }
        },
        status=200
    )
    # Act: Run pipeline
    api_client = UNESCOAPIClient()
    site_data = api_client.fetch_site_details(600)
    custodian = parse_unesco_site(site_data)
    custodian.ghcid = generate_ghcid(custodian)
    # Assert: Verify result
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.ghcid.startswith("FR-")
    # Validate against schema
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian.dict())
    assert report.valid


@pytest.mark.slow
def test_batch_processing_handles_failures_gracefully():
    """Test that batch processing continues when individual sites fail."""
    api_client = UNESCOAPIClient()
    site_ids = [600, 99999, 252, 88888, 936]  # Mix of valid and invalid IDs
    results = []
    errors = []
    for site_id in site_ids:
        try:
            site_data = api_client.fetch_site_details(site_id)
            custodian = parse_unesco_site(site_data)
            results.append(custodian)
        except UNESCOSiteNotFoundError as e:
            errors.append((site_id, str(e)))
    # Should have processed valid sites
    assert len(results) == 3  # 600, 252, 936
    # Should have logged errors for invalid sites
    assert len(errors) == 2  # 99999, 88888
Parallel Processing Tests:
@pytest.mark.slow
def test_parallel_extraction_produces_same_results_as_sequential():
    """Test that parallel processing produces identical results to sequential."""
    api_client = UNESCOAPIClient()
    site_ids = [600, 252, 936, 148, 274]
    # Sequential extraction
    sequential_results = []
    for site_id in site_ids:
        site_data = api_client.fetch_site_details(site_id)
        custodian = parse_unesco_site(site_data)
        sequential_results.append(custodian)

    # Parallel extraction
    from concurrent.futures import ThreadPoolExecutor

    def process_site(site_id):
        site_data = api_client.fetch_site_details(site_id)
        return parse_unesco_site(site_data)

    with ThreadPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(process_site, site_ids))
    # Results should be identical (order may differ)
    sequential_ghcids = sorted([c.ghcid for c in sequential_results])
    parallel_ghcids = sorted([c.ghcid for c in parallel_results])
    assert sequential_ghcids == parallel_ghcids
Success Criteria:
- 15+ integration tests covering pipeline components
- End-to-end test processes 20 golden dataset sites
- Parallel processing test confirms determinism
- All integration tests pass consistently
Phase 3: Data Quality & Validation (Days 14-19)
Day 14-16: Cross-Referencing and Conflict Detection Tests
Test Plan:
- Test matching by Wikidata Q-number
- Test matching by ISIL code
- Test fuzzy name matching
- Test conflict detection
- Test conflict resolution
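The fuzzy name matching in the plan above can be prototyped with the standard library. The production matcher may well use a dedicated library (token-based similarity, for example); this `difflib` sketch only demonstrates the similarity-threshold idea, and `name_similarity` is a hypothetical helper name:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two institution names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With a 0.85 threshold, "Rijksmuseum Amsterdam" and "Rijks Museum Amsterdam" match (they differ by one inserted space), while unrelated names fall well below the cutoff.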
Matching Tests:
# tests/unit/test_crosslinking.py
def test_match_by_wikidata_qnumber():
    """Test that institutions match by Wikidata Q-number."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    existing_record = HeritageCustodian(
        name="BnF",  # Different name
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    match = find_match(unesco_record, [existing_record])
    assert match is not None
    assert match.name == "BnF"


def test_fuzzy_name_matching_with_threshold():
    """Test fuzzy name matching with similarity threshold."""
    unesco_record = HeritageCustodian(
        name="Rijksmuseum Amsterdam",
        locations=[Location(city="Amsterdam", country="NL")]
    )
    existing_record = HeritageCustodian(
        name="Rijks Museum Amsterdam",  # Slightly different spelling
        locations=[Location(city="Amsterdam", country="NL")]
    )
    match, score = fuzzy_match_by_name_and_location(
        unesco_record, [existing_record], threshold=0.85
    )
    assert match is not None
    assert score >= 0.85
Conflict Detection Tests:
def test_detect_name_conflict():
    """Test detection of name conflicts between UNESCO and existing data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    existing_record = HeritageCustodian(
        name="BnF",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )
    conflicts = detect_conflicts(unesco_record, existing_record)
    assert len(conflicts) == 1
    assert conflicts[0].field == "name"
    assert conflicts[0].unesco_value == "Bibliothèque nationale de France"
    assert conflicts[0].existing_value == "BnF"


def test_detect_institution_type_conflict():
    """Test detection of institution type mismatches."""
    unesco_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.MUSEUM,
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )
    existing_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.LIBRARY,  # Conflict!
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )
    conflicts = detect_conflicts(unesco_record, existing_record)
    type_conflicts = [c for c in conflicts if c.field == "institution_type"]
    assert len(type_conflicts) == 1
Conflict Resolution Tests:
def test_unesco_tier1_wins_over_conversation_tier4():
    """Test that UNESCO (TIER_1) data takes priority over conversation (TIER_4) data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        institution_type=InstitutionTypeEnum.LIBRARY,
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )
    conversation_record = HeritageCustodian(
        name="BnF",
        institution_type=InstitutionTypeEnum.MIXED,  # Less accurate
        provenance=Provenance(data_tier=DataTier.TIER_4_INFERRED)
    )
    merged = resolve_conflicts(unesco_record, conversation_record)
    # UNESCO data should win
    assert merged.name == "Bibliothèque nationale de France"
    assert merged.institution_type == InstitutionTypeEnum.LIBRARY
    # But preserve alternative names
    assert "BnF" in merged.alternative_names
Success Criteria:
- 20+ tests for matching logic
- 15+ tests for conflict detection
- 10+ tests for conflict resolution
- 100% coverage for crosslinking module
Day 17-18: Validation and Confidence Scoring Tests
Test Plan:
- Test LinkML schema validation
- Test custom validators (UNESCO WHC ID, GHCID format)
- Test confidence score calculation
- Test quality metrics
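The confidence-score calculation listed above can be sketched as a field-completeness weighting. The weights below are illustrative assumptions, not the project's actual formula:

```python
# Illustrative field weights; the real scorer's weighting is project-specific.
FIELD_WEIGHTS = {
    "name": 0.3,
    "institution_type": 0.2,
    "locations": 0.2,
    "identifiers": 0.2,
    "description": 0.1,
}

def confidence_score(record: dict) -> float:
    """Sum the weights of fields that are present and non-empty."""
    return round(sum(w for field, w in FIELD_WEIGHTS.items() if record.get(field)), 2)
```

A fully populated record scores 1.0, while a record with only a name scores 0.3, mirroring the high/medium/low bands the confidence tests assert.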
Validation Tests:
# tests/unit/test_unesco_validators.py
def test_validate_unesco_whc_id_format():
    """Test UNESCO WHC ID validator accepts valid IDs."""
    assert validate_unesco_whc_id("600") is True
    assert validate_unesco_whc_id("1234") is True
    assert validate_unesco_whc_id("52") is False     # Too short
    assert validate_unesco_whc_id("12345") is False  # Too long
    assert validate_unesco_whc_id("ABC") is False    # Not numeric


def test_validate_ghcid_format_for_unesco():
    """Test GHCID format validator.

    Note: Collision suffix uses native language name in snake_case (NOT Wikidata Q-numbers).
    See: docs/plan/global_glam/07-ghcid-collision-resolution.md
    """
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU") is True
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU-musee_du_louvre") is True  # Native name suffix
    assert validate_ghcid_format("INVALID") is False
    assert validate_ghcid_format("FR-PAR-M") is False  # Missing components


@pytest.mark.parametrize("custodian_dict,should_pass", [
    # Valid record
    ({
        "name": "Test Museum",
        "institution_type": "MUSEUM",
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z"
        }
    }, True),
    # Missing required field
    ({
        "institution_type": "MUSEUM",  # Missing name!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),
    # Invalid enum value
    ({
        "name": "Test",
        "institution_type": "INVALID_TYPE",  # Not in enum!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),
])
def test_linkml_validation(custodian_dict, should_pass):
    """Test LinkML schema validation with valid and invalid records."""
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian_dict)
    if should_pass:
        assert report.valid, f"Unexpected validation error: {report.results}"
    else:
        assert not report.valid
Confidence Scoring Tests:
def test_confidence_score_high_for_complete_record():
"""Test that complete records with rich metadata score high."""
custodian = HeritageCustodian(
name="Musée du Louvre",
institution_type=InstitutionTypeEnum.MUSEUM,
locations=[Location(city="Paris", country="FR", latitude=48.86, longitude=2.34)],
identifiers=[
Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600"),
Identifier(identifier_scheme="Wikidata", identifier_value="Q19675"),
Identifier(identifier_scheme="VIAF", identifier_value="139708098")
],
digital_platforms=[DigitalPlatform(platform_name="Louvre Collections", platform_url="https://collections.louvre.fr")],
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
assert score >= 0.95
def test_confidence_score_low_for_minimal_record():
"""Test that minimal records score lower."""
custodian = HeritageCustodian(
name="Unknown Site",
institution_type=InstitutionTypeEnum.MIXED, # Ambiguous
# Missing: locations, identifiers, digital_platforms
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
assert score < 0.7
@given(
num_identifiers=st.integers(min_value=0, max_value=5),
has_location=st.booleans(),
has_platform=st.booleans()
)
def test_confidence_score_within_bounds_for_any_completeness(num_identifiers, has_location, has_platform):
"""Property: Confidence score stays within [0.0, 1.0] regardless of record completeness."""
custodian = HeritageCustodian(
name="Test",
institution_type=InstitutionTypeEnum.MUSEUM,
identifiers=[Identifier(identifier_scheme="Test", identifier_value=f"{i}") for i in range(num_identifiers)],
locations=[Location(city="City", country="CC")] if has_location else [],
digital_platforms=[DigitalPlatform(platform_name="Test", platform_url="https://test.com")] if has_platform else [],
provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
)
score = calculate_confidence_score(custodian)
# Score should be in valid range
assert 0.0 <= score <= 1.0
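A weighted-completeness scorer that satisfies the three tests above could be sketched as follows; all weights, the base score, and the dict-based record shape are illustrative assumptions chosen to meet the asserted thresholds:

```python
def calculate_confidence_score(record: dict) -> float:
    """Hypothetical scorer: more metadata raises the score, clamped to [0, 1]."""
    score = 0.5  # base credit for having a name and provenance at all
    # Up to +0.3 for external identifiers (capped at three)
    score += min(len(record.get("identifiers", [])), 3) * 0.1
    if record.get("locations"):
        score += 0.1
    if record.get("digital_platforms"):
        score += 0.1
    if record.get("institution_type") == "MIXED":
        score -= 0.1  # ambiguous typing lowers confidence
    return max(0.0, min(1.0, score))
```

With these weights a record carrying three identifiers, a location, and a platform scores 1.0, while a bare `MIXED` record scores 0.4, matching the high/low assertions above.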
Success Criteria:
- 20+ validation tests
- 15+ confidence scoring tests
- Property-based tests verify score consistency
- 100% coverage for validator modules
Phase 4: Integration & Enrichment (Days 20-24)
Wikidata Enrichment Tests
Test Plan:
- Test SPARQL query generation
- Test Q-number extraction
- Test fallback to fuzzy matching
- Test enrichment error handling
# tests/integration/test_wikidata_enrichment.py
@responses.activate
def test_wikidata_query_by_unesco_whc_id():
"""Test querying Wikidata for Q-number using UNESCO WHC ID."""
# Mock SPARQL endpoint response
responses.add(
responses.POST,
"https://query.wikidata.org/sparql",
json={
"results": {
"bindings": [
{"item": {"value": "http://www.wikidata.org/entity/Q90"}}
]
}
},
status=200
)
q_number = query_wikidata_for_unesco_site(whc_id="600")
assert q_number == "Q90"
def test_wikidata_enrichment_adds_identifiers():
"""Test that Wikidata enrichment adds Q-number to custodian."""
custodian = HeritageCustodian(
name="Paris, Banks of the Seine",
identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600")]
)
enriched = enrich_with_wikidata(custodian)
# Should have added Wikidata identifier
wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
assert len(wd_ids) == 1
assert wd_ids[0].identifier_value == "Q90"
def test_wikidata_enrichment_handles_not_found():
"""Test that enrichment handles missing Wikidata entries gracefully."""
custodian = HeritageCustodian(
name="Obscure Site",
identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="99999")]
)
enriched = enrich_with_wikidata(custodian)
# Should not crash, just log warning
wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
assert len(wd_ids) == 0
Success Criteria:
- 10+ Wikidata enrichment tests
- Mock SPARQL responses for offline testing
- Error handling tests for API failures
Phase 5: Export & Documentation (Days 25-30)
Export Format Tests
Test Plan:
- Test RDF/Turtle serialization
- Test JSON-LD export
- Test CSV flattening
- Test Parquet export
- Test round-trip data integrity
# tests/integration/test_exports.py
import json
def test_rdf_export_produces_valid_turtle(sample_dataset, tmp_path):
"""Test that RDF export produces valid Turtle syntax."""
from glam_extractor.exporters.rdf_exporter import export_to_rdf
output_path = tmp_path / "test_export.ttl"
export_to_rdf(sample_dataset, output_path)
# Verify file was created
assert output_path.exists()
# Parse with rdflib to verify syntax
from rdflib import Graph
graph = Graph()
graph.parse(output_path, format="turtle")
# Check that records were serialized
assert len(graph) > 0
def test_jsonld_export_includes_context(sample_dataset, tmp_path):
"""Test that JSON-LD export includes @context."""
from glam_extractor.exporters.jsonld_exporter import export_to_jsonld
output_path = tmp_path / "test_export.jsonld"
export_to_jsonld(sample_dataset, output_path)
with open(output_path) as f:
data = json.load(f)
assert "@context" in data
assert "@graph" in data or isinstance(data, list)
def test_csv_export_flattens_nested_structures(sample_dataset, tmp_path):
"""Test that CSV export flattens nested structures correctly."""
from glam_extractor.exporters.csv_exporter import export_to_csv
output_path = tmp_path / "test_export.csv"
export_to_csv(sample_dataset, output_path)
import pandas as pd
df = pd.read_csv(output_path)
# Check required columns
assert "ghcid" in df.columns
assert "name" in df.columns
assert "institution_type" in df.columns
assert "country" in df.columns
def test_parquet_export_preserves_data_types(sample_dataset, tmp_path):
"""Test that Parquet export preserves data types."""
from glam_extractor.exporters.parquet_exporter import export_to_parquet
output_path = tmp_path / "test_export.parquet"
export_to_parquet(sample_dataset, output_path)
import pandas as pd
df = pd.read_parquet(output_path)
# Verify data types
assert df['ghcid'].dtype == 'object' # String
assert df['confidence_score'].dtype == 'float64'
def test_round_trip_data_integrity(sample_dataset, tmp_path):
"""Test that data survives export → import round trip."""
# Export to JSON-LD
from glam_extractor.exporters.jsonld_exporter import export_to_jsonld
output_path = tmp_path / "test_export.jsonld"
export_to_jsonld(sample_dataset, output_path)
# Import back
from glam_extractor.importers.jsonld_importer import import_from_jsonld
imported_dataset = import_from_jsonld(output_path)
# Compare
assert len(imported_dataset) == len(sample_dataset)
for original, imported in zip(sample_dataset, imported_dataset):
assert original.ghcid == imported.ghcid
assert original.name == imported.name
Success Criteria:
- 15+ export format tests
- Round-trip tests verify data integrity
- All exports validate with external parsers
Continuous Integration (CI) Strategy
GitHub Actions Workflow
# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
pip install -e ".[test]"
- name: Run unit tests
run: |
pytest tests/unit -v --cov --cov-report=xml -m "not slow"
- name: Run integration tests
run: |
pytest tests/integration -v -m "not api and not wikidata"
- name: Check code coverage
run: |
pytest --cov-report=term-missing --cov-fail-under=90
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
files: ./coverage.xml
Pre-commit Hooks
# .pre-commit-config.yaml
repos:
- repo: https://github.com/psf/black
rev: 24.1.0
hooks:
- id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.1.11
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.8.0
hooks:
- id: mypy
additional_dependencies: [types-all]
- repo: local
hooks:
- id: pytest-quick
name: pytest-quick
entry: pytest tests/unit -m "not slow"
language: system
pass_filenames: false
always_run: true
Test Coverage Goals
Coverage Targets by Module
| Module | Target Coverage | Priority |
|---|---|---|
| `unesco_api_client.py` | 100% | Critical |
| `unesco_parser.py` | 100% | Critical |
| `unesco_institution_type.py` | 100% | Critical |
| `ghcid_generator.py` | 95% | Critical |
| `linkml_map.py` | 90% | High |
| `conflict_resolver.py` | 95% | High |
| `wikidata_enrichment.py` | 85% | Medium |
| `exporters/*.py` | 85% | Medium |
Coverage Enforcement
# Fail CI if coverage drops below 90%
pytest --cov-fail-under=90
# Generate HTML coverage report
pytest --cov --cov-report=html
# Open coverage report
open htmlcov/index.html
Performance Testing
Benchmark Tests
# tests/performance/test_benchmarks.py
import pytest
import time
@pytest.mark.slow
def test_parse_1000_sites_completes_in_reasonable_time():
"""Benchmark: Parse 1,000 UNESCO sites in < 60 seconds."""
start = time.time()
results = []
for i in range(1000):
site_data = generate_fake_unesco_site()
custodian = parse_unesco_site(site_data)
results.append(custodian)
elapsed = time.time() - start
assert elapsed < 60, f"Parsing took {elapsed:.2f}s, expected < 60s"
@pytest.mark.slow
def test_ghcid_generation_performance():
"""Benchmark: Generate 10,000 GHCIDs in < 5 seconds."""
custodians = [generate_fake_custodian() for _ in range(10000)]
start = time.time()
ghcids = [generate_ghcid(c) for c in custodians]
elapsed = time.time() - start
assert elapsed < 5, f"GHCID generation took {elapsed:.2f}s, expected < 5s"
Test Data Management
Fixture Organization
# tests/conftest.py
import pytest
from pathlib import Path
import json
import yaml
@pytest.fixture(scope="session")
def fixtures_dir():
"""Return path to fixtures directory."""
return Path(__file__).parent / "fixtures"
@pytest.fixture(scope="session")
def golden_dataset(fixtures_dir):
"""Load all golden dataset fixtures."""
expected_dir = fixtures_dir / "expected_outputs"
golden_data = {}
for yaml_file in expected_dir.glob("*.yaml"):
site_id = yaml_file.stem.split("_")[1] # Extract site ID
with open(yaml_file) as f:
golden_data[site_id] = yaml.safe_load(f)
return golden_data
@pytest.fixture
def mock_api_responses(fixtures_dir):
"""Load mock UNESCO API responses."""
api_dir = fixtures_dir / "unesco_api_responses"
responses = {}
for json_file in api_dir.glob("*.json"):
site_id = json_file.stem.split("_")[1]
with open(json_file) as f:
responses[site_id] = json.load(f)
return responses
Summary of Test Counts
Total Tests by Phase
| Phase | Unit Tests | Integration Tests | E2E Tests | Property-Based | Total |
|---|---|---|---|---|---|
| Phase 1 | 50 | 5 | 0 | 10 | 65 |
| Phase 2 | 75 | 20 | 3 | 15 | 113 |
| Phase 3 | 45 | 15 | 0 | 8 | 68 |
| Phase 4 | 25 | 10 | 2 | 5 | 42 |
| Phase 5 | 30 | 10 | 2 | 0 | 42 |
| Total | 225 | 60 | 7 | 38 | 330 |
Success Criteria
- 330+ total tests implemented
- 90%+ code coverage across all modules
- 100% passing tests on CI
- Zero flaky tests (deterministic execution)
- All integration tests use mocked external dependencies
- Property-based tests cover edge cases
- Performance benchmarks pass on CI
Version History
Version 1.1 (2025-11-10)
Migration to OpenDataSoft Explore API v2.0
Breaking Changes:
- API Endpoint Format: Changed from `https://whc.unesco.org/en/list/{id}/json` to `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/{id}`
- JSON Structure: Migrated from flat JSON to nested OpenDataSoft format
  - Old: `{"id_number": 600, "site": "Paris, Banks of the Seine"}`
  - New: `{"record": {"id": "rec_600", "fields": {"unique_number": 600, "name_en": "Paris, Banks of the Seine"}}}`
- Field Mappings Updated:
  - `id_number` → `unique_number`
  - `site` → `name_en`
  - `states` → `states_name_en`
  - `region` → `region_en`
  - `short_description` → `short_description_en`
  - `justification` → `justification_en`
  - `date_inscribed` → numeric year (was string)
  - `links` → `http_url` (single URL field)
  - Coordinates now in nested `fields` object
Updated Sections:
- Day 1: API Client Tests (Lines 175-269)
  - Updated all API endpoint URLs to OpenDataSoft format
  - Updated mock JSON responses to nested structure
  - Updated field access patterns in assertions
  - Updated error handling tests (404 responses)
  - Updated retry logic tests with new endpoints
  - Updated caching tests with new response format
- Day 2-3: LinkML Map Schema Tests (Lines 342-407)
  - Updated institution type mapping test fixtures
  - Updated botanical garden test fixture
  - Updated multilingual name extraction (now uses `name_en`, `name_fr`, `name_es` fields)
  - Updated field access in assertions
- Day 4: Test Fixture Creation (Lines 450-510)
  - Updated `sample_unesco_site_600` fixture with nested structure
  - Updated `expected_heritage_custodian_600` fixture
  - Updated mock API client return values
- Day 6-7: Parser Tests (Lines 709-747)
  - Updated incomplete site test with nested structure
  - Updated identifier extraction test
  - Changed Wikidata extraction from `links` array to `wdpaid` field
  - Updated all field references
- Day 9-12: Integration Tests (Lines 886-916)
  - Updated full pipeline integration test mock response
  - Updated all field names in nested structure
  - Updated assertions to match new API format
Field Access Pattern Changes:
# Old (v1.0)
site_id = response['id_number']
site_name = response['site']
country = response['states']
# New (v1.1)
site_id = response['record']['fields']['unique_number']
site_name = response['record']['fields']['name_en']
country = response['record']['fields']['states_name_en']
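Chained `[]` lookups on the new nested shape raise `KeyError` on incomplete records, so a small tolerant accessor may be preferable; this helper is an illustrative sketch, not part of the documented API client:

```python
def get_field(response: dict, name: str, default=None):
    """Read a field from the nested OpenDataSoft record shape.

    Returns `default` instead of raising when any level of the
    record/fields nesting is missing.
    """
    return response.get("record", {}).get("fields", {}).get(name, default)
```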
Testing Impact:
- All API client tests require updated endpoint URLs
- All mock responses updated to nested JSON structure
- Parser tests updated to extract from `record.fields.*` paths
- LinkML Map schema must handle nested path extraction
- Golden dataset fixtures require regeneration with new format
- Integration tests updated with new field access patterns
Rationale for Migration:
- Legacy UNESCO JSON API (`whc.unesco.org/en/list/{id}/json`) deprecated or unstable
- OpenDataSoft Explore API v2.0 is the official UNESCO data platform
- Provides standardized REST API with pagination, filtering, and versioned datasets
- Better maintained and documented than legacy endpoint
- Supports structured metadata with type information
Migration Guide:
- Update all API client code to use new endpoint format
- Update JSON parsing logic to handle `record.fields.*` nested structure
- Update LinkML Map schema transformation paths
- Regenerate all test fixtures with new API responses
- Run full test suite to verify all tests pass with new format
References:
- OpenDataSoft API Documentation: https://help.opendatasoft.com/apis/ods-explore-v2/
- UNESCO World Heritage Dataset: https://data.unesco.org/datasets/whc-sites-2021
- API Migration Tracking Issue: TBD
Version 1.0 (2025-11-08)
Initial Release
- Complete TDD strategy for UNESCO World Heritage Sites extraction
- 13-day implementation plan with daily test goals
- 175+ total tests across all phases:
- 15+ API client unit tests
- 20+ golden dataset tests
- 20+ parser unit tests
- 15+ validator tests
- 25+ GHCID generator tests
- 30+ integration tests
- 50+ property-based tests
- Coverage targets: 90%+ overall, 100% for critical modules
- Test types: Unit, Integration, E2E, Property-Based, Golden Dataset
- Fixtures: 20+ golden dataset YAML files, 20+ mock API responses
- Performance benchmarks: Parse 1,000 sites in <60s, Generate 10,000 GHCIDs in <5s
- CI/CD integration with pytest and coverage reporting
Next Steps
Completed:
- ✅ `01-dependencies.md` - Technical dependencies
- ✅ `02-consumers.md` - Use cases and data consumers
- ✅ `03-implementation-phases.md` - 6-week timeline
- ✅ `04-tdd-strategy.md` - THIS DOCUMENT - Testing strategy
Next to Create:
- `05-design-patterns.md` - UNESCO-specific architectural patterns
- `06-linkml-map-schema.md` - CRITICAL - Complete LinkML Map transformation rules
- `07-master-checklist.md` - Implementation checklist
Document Status: Complete
Next Document: 05-design-patterns.md - Architectural patterns
Version: 1.1