# UNESCO Data Extraction - Test-Driven Development Strategy

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction

**Document**: 04 - TDD Strategy

**Version**: 1.1

**Date**: 2025-11-10

**Status**: Updated for OpenDataSoft API v2.0

---

## Executive Summary

This document outlines the test-driven development (TDD) strategy for the UNESCO World Heritage Sites extraction project. Following TDD principles ensures high code quality, maintainability, and confidence in data integrity. The strategy emphasizes writing tests BEFORE implementation, using diverse testing approaches (unit, integration, property-based, golden dataset), and maintaining 90%+ code coverage.

**Key Principle**: **Red → Green → Refactor**

1. **Red**: Write a failing test first
2. **Green**: Write the minimal code needed to pass the test
3. **Refactor**: Improve the code while keeping tests green

---

## Testing Philosophy

### Core Principles

1. **Tests as Specification**: Tests document expected behavior and serve as executable specifications
2. **Fast Feedback**: Unit tests run in milliseconds, integration tests in seconds
3. **Isolation**: Each test is independent and can run in any order
4. **Determinism**: Same input always produces same output (no flaky tests)
5. **Coverage as Quality Gate**: 90%+ coverage required before merge

### Testing Pyramid

```
        ┌─────────────────┐
        │    E2E Tests    │  (10% of tests)
        │  Full pipeline  │
        └─────────────────┘
     ┌───────────────────────┐
     │   Integration Tests   │  (20% of tests)
     │   API + Parser + DB   │
     └───────────────────────┘
┌─────────────────────────────────┐
│           Unit Tests            │  (70% of tests)
│ Functions, Classes, Validators  │
└─────────────────────────────────┘
```

**Distribution**:

- **70% Unit Tests**: Fast, isolated, test individual functions/classes
- **20% Integration Tests**: Test component interactions (API client + parser)
- **10% E2E Tests**: Full pipeline from API fetch to LinkML validation

---
## Test Infrastructure

### Testing Framework Stack

```toml
# pyproject.toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
addopts = [
    "--cov=src/glam_extractor",
    "--cov-report=html",
    "--cov-report=term-missing",
    "--cov-fail-under=90",
    "--verbose",
    "--strict-markers",
    "--tb=short"
]
markers = [
    "unit: Unit tests (fast, isolated)",
    "integration: Integration tests (requires external resources)",
    "e2e: End-to-end tests (full pipeline)",
    "slow: Slow tests (> 1 second)",
    "api: Tests requiring UNESCO API access",
    "wikidata: Tests requiring Wikidata SPARQL endpoint"
]
```

### Dependencies

```toml
[project.optional-dependencies]
test = [
    "pytest >= 8.0.0",
    "pytest-cov >= 4.1.0",
    "pytest-mock >= 3.12.0",
    "pytest-xdist >= 3.5.0",  # Parallel test execution
    "hypothesis >= 6.92.0",   # Property-based testing
    "responses >= 0.24.0",    # Mock HTTP requests
    "freezegun >= 1.4.0",     # Mock datetime
    "faker >= 22.0.0",        # Generate fake data
    "deepdiff >= 6.7.0",      # Deep object comparison
]
```
### Directory Structure

```
tests/
├── unit/                          # Unit tests (fast, isolated)
│   ├── test_unesco_api_client.py
│   ├── test_unesco_parser.py
│   ├── test_institution_classifier.py
│   ├── test_ghcid_generator.py
│   └── test_linkml_map.py
├── integration/                   # Integration tests
│   ├── test_api_to_parser_pipeline.py
│   ├── test_parser_to_validator_pipeline.py
│   ├── test_wikidata_enrichment.py
│   └── test_dataset_merge.py
├── e2e/                           # End-to-end tests
│   ├── test_full_unesco_extraction.py
│   └── test_export_all_formats.py
├── fixtures/                      # Test data
│   ├── unesco_api_responses/      # Sample UNESCO JSON responses
│   │   ├── site_600_bnf.json
│   │   ├── site_list_sample.json
│   │   └── ...
│   ├── expected_outputs/          # Golden dataset expected results
│   │   ├── unesco_bnf_expected.yaml
│   │   └── ...
│   └── mock_responses/            # Pre-recorded API responses
├── conftest.py                    # Pytest configuration and shared fixtures
└── README.md                      # Testing guide
```

---

## Phase-by-Phase TDD Strategy

### Phase 1: API Exploration & Schema Design (Days 1-5)

#### Day 1: UNESCO API Client Tests

**Test Plan**:

1. **Test API connectivity** (integration test)
2. **Test response parsing** (unit test)
3. **Test error handling** (unit test)
4. **Test caching** (unit test)
5. **Test rate limiting** (integration test)

**Example TDD Cycle**:
```python
# tests/unit/test_unesco_api_client.py

import pytest
import responses
from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient

# RED: Write failing test first
def test_fetch_site_list_returns_list_of_sites():
    """Test that fetch_site_list returns a list of UNESCO sites."""
    client = UNESCOAPIClient()
    sites = client.fetch_site_list()

    assert isinstance(sites, list)
    assert len(sites) > 0
    assert 'id_number' in sites[0]
    assert 'site' in sites[0]

# This test will FAIL because UNESCOAPIClient doesn't exist yet
# Now implement minimal code to make it pass (GREEN)
```

**Mock API Responses**:

```python
@responses.activate
def test_fetch_site_details_with_valid_id():
    """Test fetching site details with mocked API response."""
    # Arrange: Set up mock response (OpenDataSoft API v2.0 format)
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France"
                }
            }
        },
        status=200
    )

    # Act: Call the method
    client = UNESCOAPIClient()
    site = client.fetch_site_details(600)

    # Assert: Verify results (nested structure)
    assert site['record']['fields']['unique_number'] == 600
    assert site['record']['fields']['name_en'] == "Paris, Banks of the Seine"
    assert site['record']['fields']['category'] == "Cultural"
```

**Error Handling Tests**:

```python
import requests

from glam_extractor.parsers.unesco_api_client import UNESCOSiteNotFoundError

@responses.activate
def test_fetch_site_details_handles_404():
    """Test that 404 errors are handled gracefully."""
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/99999",
        status=404
    )

    client = UNESCOAPIClient()

    with pytest.raises(UNESCOSiteNotFoundError) as exc_info:
        client.fetch_site_details(99999)

    assert "Site 99999 not found" in str(exc_info.value)


@responses.activate
def test_fetch_site_details_retries_on_network_error():
    """Test that network errors trigger retry logic."""
    # First call fails, second call succeeds
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        body=requests.exceptions.ConnectionError("Network error")
    )
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine"
                }
            }
        },
        status=200
    )

    client = UNESCOAPIClient(max_retries=3)
    site = client.fetch_site_details(600)

    assert site['record']['fields']['unique_number'] == 600
    assert len(responses.calls) == 2  # Verify retry happened
```
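The retry behavior exercised by the test above reduces to a small loop. The sketch below is illustrative, not the client's actual implementation: `with_retries` and its parameters are hypothetical names, and the real client may use a library such as `tenacity` instead.

```python
import time


def with_retries(fetch, max_retries=3, base_delay=0.0, retryable=(ConnectionError,)):
    """Call `fetch` until it succeeds or retries are exhausted (sketch)."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except retryable as exc:
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

With `max_retries=3`, the test's expectation of exactly two HTTP calls corresponds to one failure followed by one success.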
**Caching Tests**:

```python
def test_api_client_caches_responses(tmp_path):
    """Test that API responses are cached to avoid redundant requests."""
    cache_db = tmp_path / "test_cache.db"
    client = UNESCOAPIClient(cache_path=cache_db)

    # First call: cache miss, fetches from API
    with responses.RequestsMock() as rsps:
        rsps.add(
            responses.GET,
            "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
            json={"record": {"id": "rec_600", "fields": {"unique_number": 600}}},
            status=200,
        )
        site1 = client.fetch_site_details(600)
        assert len(rsps.calls) == 1

    # Second call: cache hit, no API request
    with responses.RequestsMock() as rsps:
        site2 = client.fetch_site_details(600)
        assert len(rsps.calls) == 0  # No new API calls

    assert site1 == site2
```
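Plan item 5 (rate limiting) has no example above. One way to keep such tests fast and deterministic is to inject the clock and sleep function instead of using real time. The sketch below is illustrative: `RateLimiter` and `FakeClock` are hypothetical names, not part of the codebase.

```python
class RateLimiter:
    """Enforce a minimum interval between API requests.

    `clock` and `sleep` are injected so tests stay fast and deterministic.
    """

    def __init__(self, min_interval, clock, sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def acquire(self):
        now = self._clock()
        if self._last_request is not None:
            elapsed = now - self._last_request
            if elapsed < self.min_interval:
                self._sleep(self.min_interval - elapsed)
        self._last_request = self._clock()


class FakeClock:
    """Manually advanced clock; `sleep` advances it instead of blocking."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def test_rate_limiter_spaces_out_requests():
    clock = FakeClock()
    limiter = RateLimiter(min_interval=0.5, clock=clock, sleep=clock.sleep)

    limiter.acquire()  # first call: no wait
    assert clock.now == 0.0
    limiter.acquire()  # immediate second call: waits 0.5s of fake time
    assert clock.now == 0.5
```

The same injection pattern lets the integration test against the live API use real `time.monotonic` and `time.sleep` without changing the limiter.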
**Success Criteria**:

- [ ] 15+ unit tests for the API client
- [ ] 100% coverage for `unesco_api_client.py`
- [ ] All tests run in < 1 second (mocked responses)

---

#### Day 2-3: LinkML Map Schema Tests

**Test Plan**:

1. **Test basic field extraction** (name, location, UNESCO WHC ID)
2. **Test multi-language name handling**
3. **Test conditional institution type mapping**
4. **Test missing field handling**
5. **Test regex validators**

**Golden Dataset Approach**:
```python
# tests/integration/test_linkml_map_golden_dataset.py

import json
from pathlib import Path

import pytest
import yaml
from linkml_map import Mapper

from glam_extractor.models import HeritageCustodian

FIXTURES_DIR = Path(__file__).parent.parent / "fixtures"

@pytest.mark.parametrize("site_id", [
    600,  # Paris, Banks of the Seine (Library/Museum)
    252,  # Taj Mahal (Monument with museum)
    936,  # Kew Gardens (Botanical garden)
    # Add 17 more...
])
def test_linkml_map_golden_dataset(site_id):
    """Test LinkML Map transformation against golden dataset."""
    # Load input JSON
    input_json = FIXTURES_DIR / f"unesco_api_responses/site_{site_id}.json"
    with open(input_json) as f:
        api_response = json.load(f)

    # Load expected output YAML
    expected_yaml = FIXTURES_DIR / f"expected_outputs/site_{site_id}_expected.yaml"
    with open(expected_yaml) as f:
        expected = yaml.safe_load(f)

    # Apply LinkML Map transformation
    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)

    # Compare result with expected output
    assert result['name'] == expected['name']
    assert result['institution_type'] == expected['institution_type']
    assert result['locations'][0]['city'] == expected['locations'][0]['city']

    # Deep comparison for complex nested structures
    from deepdiff import DeepDiff
    diff = DeepDiff(result, expected, ignore_order=True)
    assert not diff, f"Unexpected differences: {diff}"
```
**Conditional Mapping Tests**:

```python
def test_institution_type_mapping_for_museum():
    """Test that sites with 'museum' in description map to MUSEUM type."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "short_description_en": "Includes the Louvre Museum and other heritage institutions",
                "category": "Cultural"
            }
        }
    }

    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)

    assert result['institution_type'] == "MUSEUM"


def test_institution_type_mapping_for_botanical_garden():
    """Test that botanical gardens map correctly."""
    api_response = {
        "record": {
            "id": "rec_936",
            "fields": {
                "unique_number": 936,
                "name_en": "Royal Botanic Gardens, Kew",
                "short_description_en": "Botanical garden and herbarium",
                "category": "Cultural"
            }
        }
    }

    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)

    assert result['institution_type'] == "BOTANICAL_ZOO"


def test_institution_type_defaults_to_mixed_when_ambiguous():
    """Test that ambiguous sites default to MIXED type."""
    api_response = {
        "record": {
            "id": "rec_123",
            "fields": {
                "unique_number": 123,
                "name_en": "Historic District",
                "short_description_en": "Historic area with no specific institution mentioned",
                "category": "Cultural"
            }
        }
    }

    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)

    assert result['institution_type'] == "MIXED"
```

**Multi-language Name Tests**:

```python
def test_multilingual_name_extraction():
    """Test extraction of names in multiple languages."""
    api_response = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "name_fr": "Paris, rives de la Seine",
                "name_es": "París, riberas del Sena"
            }
        }
    }

    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    result = mapper.transform(api_response)

    assert result['name'] == "Paris, Banks of the Seine"
    assert "Paris, rives de la Seine@fr" in result['alternative_names']
    assert "París, riberas del Sena@es" in result['alternative_names']
```
**Success Criteria**:

- [ ] 20 golden dataset test cases (100% passing)
- [ ] 10+ conditional mapping tests
- [ ] 5+ multi-language tests
- [ ] LinkML Map schema validates with `linkml-validate`

---

#### Day 4: Test Fixture Creation

**Fixture Quality Checklist**:
```python
# tests/conftest.py - Shared fixtures

from pathlib import Path

import pytest

from glam_extractor.parsers.unesco_api_client import UNESCOAPIClient

@pytest.fixture
def fixtures_dir():
    """Return path to test fixtures directory."""
    return Path(__file__).parent / "fixtures"

@pytest.fixture
def sample_unesco_site_600():
    """Fixture for UNESCO site 600 (Paris, Banks of the Seine)."""
    return {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "category": "Cultural",
                "states_name_en": "France",
                "region_en": "Europe and North America",
                "latitude": 48.8566,
                "longitude": 2.3522,
                "short_description_en": "From the Louvre to the Eiffel Tower...",
                "justification_en": "Contains numerous heritage institutions including museums and libraries",
                "date_inscribed": 1991,
                "http_url": "https://whc.unesco.org/en/list/600"
            }
        }
    }

@pytest.fixture
def expected_heritage_custodian_600():
    """Expected HeritageCustodian output for site 600."""
    return {
        "id": "https://w3id.org/heritage/custodian/fr/louvre",
        "name": "Paris, Banks of the Seine",
        "institution_type": "MIXED",  # Multiple institutions at this site
        "locations": [
            {
                "city": "Paris",
                "country": "FR",
                "coordinates": [48.8566, 2.3522],
                "geonames_id": "2988507"
            }
        ],
        "identifiers": [
            {
                "identifier_scheme": "UNESCO_WHC",
                "identifier_value": "600",
                "identifier_url": "https://whc.unesco.org/en/list/600"
            },
            {
                "identifier_scheme": "Wikidata",
                "identifier_value": "Q90",
                "identifier_url": "https://www.wikidata.org/wiki/Q90"
            }
        ],
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z",
            "confidence_score": 1.0
        }
    }

@pytest.fixture
def mock_unesco_api_client(mocker):
    """Mock UNESCO API client for testing without network calls."""
    mock_client = mocker.Mock(spec=UNESCOAPIClient)
    mock_client.fetch_site_details.return_value = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine"
            }
        }
    }
    return mock_client
```
**Fixture Coverage Matrix**:

| Site ID | Location | Institution Type | Special Case |
|---------|----------|------------------|--------------|
| 600 | France, Paris | MIXED | Multiple institutions |
| 252 | India, Agra | MUSEUM | Archaeological museum |
| 936 | UK, London | BOTANICAL_ZOO | Kew Gardens |
| 148 | Brazil, Brasília | ARCHIVE | National Archive |
| 274 | Egypt, Cairo | MUSEUM | Egyptian Museum |
| 890 | Japan, Kyoto | HOLY_SITES | Temple with collection |
| 1110 | Vietnam, Hanoi | LIBRARY | National Library |
| 723 | Mexico, Mexico City | MUSEUM | Anthropology Museum |
| ... | ... | ... | ... |
**Success Criteria**:

- [ ] 20 diverse test fixtures covering all continents
- [ ] 20 expected output YAML files (manually verified)
- [ ] Fixtures cover all institution types in InstitutionTypeEnum
- [ ] Edge cases documented (serial nominations, transboundary sites)

---

#### Day 5: Institution Type Classifier Tests

**Test Plan**:

1. **Test keyword-based classification**
2. **Test multi-keyword scoring**
3. **Test confidence scoring**
4. **Test multilingual keywords**
5. **Test ambiguous case handling**

**Property-Based Testing with Hypothesis**:
```python
# tests/unit/test_institution_classifier_properties.py

import pytest
from hypothesis import given, strategies as st

from glam_extractor.classifiers.unesco_institution_type import classify_institution_type

@given(description=st.text(min_size=10, max_size=500))
def test_classifier_always_returns_valid_enum(description):
    """Property: Classifier must always return a valid InstitutionTypeEnum."""
    result = classify_institution_type(description)

    assert result['institution_type'] in [
        "MUSEUM", "LIBRARY", "ARCHIVE", "GALLERY",
        "BOTANICAL_ZOO", "HOLY_SITES", "FEATURES", "MIXED"
    ]
    assert 0.0 <= result['confidence_score'] <= 1.0

@given(
    description=st.text(),
    keyword=st.sampled_from(["museum", "library", "archive", "botanical garden"])
)
def test_classifier_detects_explicit_keywords(description, keyword):
    """Property: Explicit keywords should be detected reliably."""
    # Force keyword into description
    modified_desc = f"{description} This site contains a {keyword}."

    result = classify_institution_type(modified_desc)

    # Should detect the keyword with high confidence
    assert result['confidence_score'] >= 0.7
```

**Keyword Detection Tests**:

```python
@pytest.mark.parametrize("description,expected_type,min_confidence", [
    # Museum keywords
    ("Site includes the National Museum of Anthropology", "MUSEUM", 0.9),
    ("Archaeological museum with 10,000 artifacts", "MUSEUM", 0.9),
    ("Le Musée du Louvre est situé à Paris", "MUSEUM", 0.9),  # French

    # Library keywords
    ("The National Library holds rare manuscripts", "LIBRARY", 0.9),
    ("Biblioteca Nacional do Brasil", "LIBRARY", 0.9),  # Portuguese

    # Archive keywords
    ("Historic archive with colonial documents", "ARCHIVE", 0.9),
    ("Rijksarchief in Noord-Holland", "ARCHIVE", 0.9),  # Dutch

    # Botanical garden keywords
    ("Royal Botanic Gardens with herbarium collection", "BOTANICAL_ZOO", 0.9),
    ("Jardin botanique with 50,000 plant species", "BOTANICAL_ZOO", 0.9),

    # Holy sites
    ("Cathedral with liturgical manuscripts collection", "HOLY_SITES", 0.85),
    ("Monastery library with medieval codices", "HOLY_SITES", 0.85),

    # Ambiguous cases (should default to MIXED)
    ("Historic district with various buildings", "MIXED", 0.5),
    ("Cultural landscape", "MIXED", 0.5),
])
def test_keyword_based_classification(description, expected_type, min_confidence):
    """Test that classifier detects keywords correctly."""
    result = classify_institution_type(description)

    assert result['institution_type'] == expected_type
    assert result['confidence_score'] >= min_confidence
```

**Multi-language Support Tests**:

```python
def test_classifier_handles_multilingual_descriptions():
    """Test that classifier works with non-English descriptions."""
    test_cases = [
        # French
        ("Le site comprend un musée d'art moderne", "MUSEUM"),
        ("Bibliothèque nationale avec des manuscrits rares", "LIBRARY"),

        # Spanish
        ("Museo Nacional de Antropología con colecciones", "MUSEUM"),
        ("Biblioteca pública con archivos históricos", "LIBRARY"),

        # Portuguese
        ("Museu Nacional com exposições permanentes", "MUSEUM"),

        # German
        ("Staatliches Museum für Völkerkunde", "MUSEUM"),

        # Dutch
        ("Rijksmuseum met kunstcollecties", "MUSEUM"),
    ]

    for description, expected_type in test_cases:
        result = classify_institution_type(description)
        assert result['institution_type'] == expected_type, \
            f"Failed for: {description}"
```

**Confidence Scoring Tests**:

```python
def test_confidence_high_for_explicit_mentions():
    """Test that explicit mentions produce high confidence scores."""
    result = classify_institution_type("Site contains the National Museum")
    assert result['confidence_score'] >= 0.9


def test_confidence_medium_for_inferred_types():
    """Test that inferred types produce medium confidence scores."""
    result = classify_institution_type("Archaeological site with exhibition hall")
    assert 0.6 <= result['confidence_score'] < 0.9


def test_confidence_low_for_ambiguous_cases():
    """Test that ambiguous cases produce low confidence scores."""
    result = classify_institution_type("Historic area")
    assert result['confidence_score'] < 0.6
```
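Plan item 2 (multi-keyword scoring) has no example above. A minimal sketch of how summed keyword weights can drive both the type decision and the confidence score follows; the `KEYWORD_WEIGHTS` table, the threshold, and the confidence formula are illustrative assumptions, not the real classifier's lexicon or tuning.

```python
# Hypothetical keyword weights; the real lexicon lives in
# unesco_institution_type.py and covers more languages and types.
KEYWORD_WEIGHTS = {
    "MUSEUM": {"museum": 1.0, "musée": 1.0, "museo": 1.0, "exhibition": 0.5},
    "LIBRARY": {"library": 1.0, "bibliothèque": 1.0, "manuscripts": 0.4},
    "ARCHIVE": {"archive": 1.0, "documents": 0.3},
}


def score_institution_types(description):
    """Sum keyword weights per type over a lowercased description."""
    text = description.lower()
    return {
        inst_type: sum(w for kw, w in kws.items() if kw in text)
        for inst_type, kws in KEYWORD_WEIGHTS.items()
    }


def classify(description, threshold=1.0):
    """Pick the highest-scoring type; fall back to MIXED below threshold."""
    scores = score_institution_types(description)
    best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        return "MIXED", 0.5
    # Map the raw score into a capped confidence value
    return best_type, min(1.0, 0.6 + 0.3 * best_score)
```

Multiple matching keywords ("museum" plus "exhibition") raise the score above what a single weak keyword produces, which is exactly what the multi-keyword scoring tests assert.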
**Success Criteria**:

- [ ] 50+ unit tests for classifier
- [ ] 100% coverage for `unesco_institution_type.py`
- [ ] Property-based tests cover edge cases
- [ ] 90%+ accuracy on golden dataset (20 fixtures)

---

### Phase 2: Extractor Implementation (Days 6-13)

#### Day 6-7: Parser Tests

**Test Plan**:

1. **Test parsing valid UNESCO JSON to HeritageCustodian**
2. **Test handling missing optional fields**
3. **Test handling invalid/malformed JSON**
4. **Test LinkML validation integration**
5. **Test provenance metadata generation**

**Parser Unit Tests**:
```python
# tests/unit/test_unesco_parser.py

from glam_extractor.models import HeritageCustodian, InstitutionTypeEnum
from glam_extractor.parsers.unesco_parser import parse_unesco_site

def test_parse_unesco_site_creates_valid_heritage_custodian(sample_unesco_site_600):
    """Test that parser creates valid HeritageCustodian from UNESCO JSON."""
    custodian = parse_unesco_site(sample_unesco_site_600)

    assert isinstance(custodian, HeritageCustodian)
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.institution_type in InstitutionTypeEnum
    assert len(custodian.identifiers) >= 1
    assert custodian.provenance.data_source == "UNESCO_WORLD_HERITAGE"


def test_parse_unesco_site_handles_missing_coordinates():
    """Test that parser handles missing latitude/longitude gracefully."""
    incomplete_site = {
        "record": {
            "id": "rec_999",
            "fields": {
                "unique_number": 999,
                "name_en": "Test Site",
                "category": "Cultural"
                # Missing latitude, longitude
            }
        }
    }

    custodian = parse_unesco_site(incomplete_site)

    # Should still parse, but location may lack coordinates
    assert custodian.name == "Test Site"
    if custodian.locations:
        assert custodian.locations[0].latitude is None
        assert custodian.locations[0].longitude is None


def test_parse_unesco_site_extracts_all_identifiers():
    """Test that parser extracts UNESCO WHC ID and Wikidata Q-number."""
    site_with_links = {
        "record": {
            "id": "rec_600",
            "fields": {
                "unique_number": 600,
                "name_en": "Paris, Banks of the Seine",
                "http_url": "https://whc.unesco.org/en/list/600",
                "wdpaid": "Q90"  # Wikidata ID field
            }
        }
    }

    custodian = parse_unesco_site(site_with_links)

    # Check UNESCO WHC ID
    whc_ids = [i for i in custodian.identifiers if i.identifier_scheme == "UNESCO_WHC"]
    assert len(whc_ids) == 1
    assert whc_ids[0].identifier_value == "600"

    # Check Wikidata Q-number
    wd_ids = [i for i in custodian.identifiers if i.identifier_scheme == "Wikidata"]
    assert len(wd_ids) == 1
    assert wd_ids[0].identifier_value == "Q90"
```

**LinkML Validation Integration**:

```python
def test_parsed_custodian_validates_against_linkml_schema(sample_unesco_site_600):
    """Test that parsed HeritageCustodian passes LinkML validation."""
    from linkml.validators import JsonSchemaValidator

    custodian = parse_unesco_site(sample_unesco_site_600)
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")

    # Convert to dict for validation
    custodian_dict = custodian.dict()

    # Should not raise ValidationError
    report = validator.validate(custodian_dict)
    assert report.valid, f"Validation errors: {report.results}"
```
**Success Criteria**:

- [ ] 20+ unit tests for parser
- [ ] 100% coverage for `unesco_parser.py`
- [ ] All parsed instances validate against LinkML schema

---

#### Day 8: GHCID Generator Tests

**Test Plan**:

1. **Test GHCID format correctness**
2. **Test deterministic generation (same input → same GHCID)**
3. **Test UUID v5 generation**
4. **Test collision detection**
5. **Test native name suffix generation** (NOT Wikidata Q-numbers)

> **Note**: Collision resolution uses the native-language institution name in snake_case format (NOT Wikidata Q-numbers). See `docs/plan/global_glam/07-ghcid-collision-resolution.md` for the authoritative documentation.

**GHCID Format Tests**:
```python
# tests/unit/test_ghcid_unesco.py

import re

def test_ghcid_format_for_french_museum():
    """Test GHCID generation for French museum follows format."""
    custodian = HeritageCustodian(
        name="Musée du Louvre",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Paris", country="FR", region="Île-de-France")]
    )

    ghcid = generate_ghcid(custodian)

    # Format: FR-IDF-PAR-M-LOU
    assert ghcid.startswith("FR-")
    assert "-M-" in ghcid  # M = Museum
    assert re.match(r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}$', ghcid)


def test_ghcid_with_name_suffix_collision():
    """Test GHCID includes name suffix when collision detected."""
    custodian = HeritageCustodian(
        name="Stedelijk Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )

    # Simulate collision detection
    ghcid = generate_ghcid(custodian, collision_detected=True)

    # Format: NL-NH-AMS-M-SM-stedelijk_museum_amsterdam
    assert ghcid.endswith("-stedelijk_museum_amsterdam")
```

**Determinism Tests (Property-Based)**:

```python
import uuid

from hypothesis import given, strategies as st

@given(
    country=st.sampled_from(["FR", "NL", "BR", "IN", "JP"]),
    city=st.text(min_size=3, max_size=20, alphabet=st.characters(whitelist_categories=('Lu', 'Ll'))),
    inst_type=st.sampled_from(list(InstitutionTypeEnum))
)
def test_ghcid_generation_is_deterministic(country, city, inst_type):
    """Property: Same input always produces same GHCID."""
    custodian = HeritageCustodian(
        name="Test Institution",
        institution_type=inst_type,
        locations=[Location(city=city, country=country)]
    )

    ghcid1 = generate_ghcid(custodian)
    ghcid2 = generate_ghcid(custodian)

    assert ghcid1 == ghcid2


@given(st.text(min_size=5, max_size=100))
def test_uuid_v5_determinism_for_ghcid(ghcid_string):
    """Property: UUID v5 generation is deterministic."""
    uuid1 = generate_uuid_v5(ghcid_string)
    uuid2 = generate_uuid_v5(ghcid_string)

    assert uuid1 == uuid2
    assert isinstance(uuid1, uuid.UUID)
    assert uuid1.version == 5
```
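The `generate_uuid_v5` helper exercised above can be a thin wrapper over the standard library, since `uuid.uuid5` is deterministic by construction. In this sketch the namespace UUID is a hypothetical value; the real namespace is defined by the GHCID generator module.

```python
import uuid

# Hypothetical namespace; the project defines its own namespace UUID.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/ghcid")


def generate_uuid_v5(ghcid):
    """Same GHCID string always yields the same version-5 UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Because the UUID is derived purely from the namespace and the GHCID string, re-running an extraction never changes previously assigned UUIDs.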
**Collision Detection Tests**:

```python
def test_collision_detection_with_same_base_ghcid():
    """Test that collision is detected when base GHCID matches existing record."""
    existing_dataset = [
        HeritageCustodian(
            ghcid="NL-NH-AMS-M-HM",
            name="Hermitage Amsterdam"
        )
    ]

    new_custodian = HeritageCustodian(
        name="Historical Museum Amsterdam",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Amsterdam", country="NL", region="Noord-Holland")]
    )

    collision = detect_ghcid_collision(new_custodian, existing_dataset)

    assert collision is True

    # Generate GHCID with name suffix to resolve collision
    ghcid = generate_ghcid(new_custodian, collision_detected=True)

    assert ghcid == "NL-NH-AMS-M-HM-historical_museum_amsterdam"
```
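The suffix expected by the collision test can be produced by folding the native-language name to ASCII snake_case. The helper below is an illustrative sketch; the authoritative rules live in `07-ghcid-collision-resolution.md`.

```python
import re
import unicodedata


def name_suffix(name):
    """Fold a native-language name to an ASCII snake_case collision suffix."""
    # Decompose accented characters, then drop the non-ASCII combining marks
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Collapse every run of non-alphanumeric characters into one underscore
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
```

For example, `name_suffix("Musée du Louvre")` yields `"musee_du_louvre"`, so the suffix stays URL- and filename-safe while remaining readable.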
**Success Criteria**:

- [ ] 25+ unit tests for GHCID generator
- [ ] 100% coverage for GHCID logic
- [ ] Property-based tests verify determinism
- [ ] Collision detection 100% accurate on test cases

---

#### Day 9-12: Integration Tests

**Test Plan**:

1. **Test API → Parser pipeline**
2. **Test Parser → Validator pipeline**
3. **Test full extraction pipeline**
4. **Test batch processing**
5. **Test error recovery**

**Pipeline Integration Tests**:
```python
# tests/integration/test_api_to_parser_pipeline.py

@responses.activate
def test_full_pipeline_from_api_to_linkml_instance():
    """Test complete pipeline: API fetch → Parse → Validate."""
    # Arrange: Mock UNESCO API response
    responses.add(
        responses.GET,
        "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/600",
        json={
            "record": {
                "id": "rec_600",
                "fields": {
                    "unique_number": 600,
                    "name_en": "Paris, Banks of the Seine",
                    "category": "Cultural",
                    "states_name_en": "France",
                    "latitude": 48.8566,
                    "longitude": 2.3522
                }
            }
        },
        status=200
    )

    # Act: Run pipeline
    api_client = UNESCOAPIClient()
    site_data = api_client.fetch_site_details(600)
    custodian = parse_unesco_site(site_data)
    custodian.ghcid = generate_ghcid(custodian)

    # Assert: Verify result
    assert custodian.name == "Paris, Banks of the Seine"
    assert custodian.ghcid.startswith("FR-")

    # Validate against schema
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian.dict())
    assert report.valid


@pytest.mark.slow
def test_batch_processing_handles_failures_gracefully():
    """Test that batch processing continues when individual sites fail."""
    api_client = UNESCOAPIClient()
    site_ids = [600, 99999, 252, 88888, 936]  # Mix of valid and invalid IDs

    results = []
    errors = []

    for site_id in site_ids:
        try:
            site_data = api_client.fetch_site_details(site_id)
            custodian = parse_unesco_site(site_data)
            results.append(custodian)
        except UNESCOSiteNotFoundError as e:
            errors.append((site_id, str(e)))

    # Should have processed valid sites
    assert len(results) == 3  # 600, 252, 936

    # Should have logged errors for invalid sites
    assert len(errors) == 2  # 99999, 88888
```
**Parallel Processing Tests**:
|
|
|
|
```python
|
|
@pytest.mark.slow
|
|
def test_parallel_extraction_produces_same_results_as_sequential():
|
|
"""Test that parallel processing produces identical results to sequential."""
|
|
site_ids = [600, 252, 936, 148, 274]
|
|
|
|
# Sequential extraction
|
|
sequential_results = []
|
|
for site_id in site_ids:
|
|
site_data = api_client.fetch_site_details(site_id)
|
|
custodian = parse_unesco_site(site_data)
|
|
sequential_results.append(custodian)
|
|
|
|
# Parallel extraction
|
|
from concurrent.futures import ThreadPoolExecutor
|
|
|
|
def process_site(site_id):
|
|
site_data = api_client.fetch_site_details(site_id)
|
|
return parse_unesco_site(site_data)
|
|
|
|
with ThreadPoolExecutor(max_workers=4) as executor:
|
|
parallel_results = list(executor.map(process_site, site_ids))
|
|
|
|
# Results should be identical (order may differ)
|
|
sequential_ghcids = sorted([c.ghcid for c in sequential_results])
|
|
parallel_ghcids = sorted([c.ghcid for c in parallel_results])
|
|
|
|
assert sequential_ghcids == parallel_ghcids
|
|
```
**Success Criteria**:
- [ ] 15+ integration tests covering pipeline components
- [ ] End-to-end test processes 20 golden dataset sites
- [ ] Parallel processing test confirms determinism
- [ ] All integration tests pass consistently

---

### Phase 3: Data Quality & Validation (Days 14-19)

#### Day 14-16: Cross-Referencing and Conflict Detection Tests

**Test Plan**:
1. **Test matching by Wikidata Q-number**
2. **Test matching by ISIL code**
3. **Test fuzzy name matching**
4. **Test conflict detection**
5. **Test conflict resolution**

**Matching Tests**:

```python
# tests/unit/test_crosslinking.py

def test_match_by_wikidata_qnumber():
    """Test that institutions match by Wikidata Q-number."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )

    existing_record = HeritageCustodian(
        name="BnF",  # Different name
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )

    match = find_match(unesco_record, [existing_record])

    assert match is not None
    assert match.name == "BnF"

def test_fuzzy_name_matching_with_threshold():
    """Test fuzzy name matching with similarity threshold."""
    unesco_record = HeritageCustodian(
        name="Rijksmuseum Amsterdam",
        locations=[Location(city="Amsterdam", country="NL")]
    )

    existing_record = HeritageCustodian(
        name="Rijks Museum Amsterdam",  # Slightly different spelling
        locations=[Location(city="Amsterdam", country="NL")]
    )

    match, score = fuzzy_match_by_name_and_location(
        unesco_record, [existing_record], threshold=0.85
    )

    assert match is not None
    assert score >= 0.85
```
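
The fuzzy matcher exercised above is not defined in this document; as a hedged sketch, the similarity score behind `fuzzy_match_by_name_and_location` could be approximated with the standard library's `difflib` (the real implementation and its signature may differ):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two institution names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(name: str, candidates: list[str], threshold: float = 0.85):
    """Return (best_candidate, score), or (None, score) below the threshold."""
    scored = [(c, name_similarity(name, c)) for c in candidates]
    best, score = max(scored, key=lambda pair: pair[1])
    return (best, score) if score >= threshold else (None, score)

match, score = fuzzy_match(
    "Rijksmuseum Amsterdam", ["Rijks Museum Amsterdam", "Louvre"]
)
```

With the two spellings from the test above the ratio is well over 0.85, so the variant spelling is returned as the match.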
**Conflict Detection Tests**:

```python
def test_detect_name_conflict():
    """Test detection of name conflicts between UNESCO and existing data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )

    existing_record = HeritageCustodian(
        name="BnF",
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q193563")]
    )

    conflicts = detect_conflicts(unesco_record, existing_record)

    assert len(conflicts) == 1
    assert conflicts[0].field == "name"
    assert conflicts[0].unesco_value == "Bibliothèque nationale de France"
    assert conflicts[0].existing_value == "BnF"

def test_detect_institution_type_conflict():
    """Test detection of institution type mismatches."""
    unesco_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.MUSEUM,
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )

    existing_record = HeritageCustodian(
        name="Site Name",
        institution_type=InstitutionTypeEnum.LIBRARY,  # Conflict!
        identifiers=[Identifier(identifier_scheme="Wikidata", identifier_value="Q12345")]
    )

    conflicts = detect_conflicts(unesco_record, existing_record)

    type_conflicts = [c for c in conflicts if c.field == "institution_type"]
    assert len(type_conflicts) == 1
```

**Conflict Resolution Tests**:

```python
def test_unesco_tier1_wins_over_conversation_tier4():
    """Test that UNESCO (TIER_1) data takes priority over conversation (TIER_4) data."""
    unesco_record = HeritageCustodian(
        name="Bibliothèque nationale de France",
        institution_type=InstitutionTypeEnum.LIBRARY,
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )

    conversation_record = HeritageCustodian(
        name="BnF",
        institution_type=InstitutionTypeEnum.MIXED,  # Less accurate
        provenance=Provenance(data_tier=DataTier.TIER_4_INFERRED)
    )

    merged = resolve_conflicts(unesco_record, conversation_record)

    # UNESCO data should win
    assert merged.name == "Bibliothèque nationale de France"
    assert merged.institution_type == InstitutionTypeEnum.LIBRARY

    # But preserve alternative names
    assert "BnF" in merged.alternative_names
```
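
The precedence rule this test encodes can be sketched independently of the model classes. This is an illustrative dict-based version; the TIER_2/TIER_3 names and the merge policy beyond name/alternative_names are assumptions:

```python
TIER_PRIORITY = {
    "TIER_1_AUTHORITATIVE": 1,
    "TIER_2_CURATED": 2,      # assumed intermediate tier names
    "TIER_3_AGGREGATED": 3,
    "TIER_4_INFERRED": 4,
}

def resolve(record_a: dict, record_b: dict) -> dict:
    """Merge two records; the higher-priority tier wins, but the losing
    record's name is preserved in alternative_names."""
    winner, loser = sorted(
        (record_a, record_b), key=lambda r: TIER_PRIORITY[r["data_tier"]]
    )
    merged = dict(winner)
    alternatives = set(merged.get("alternative_names", []))
    if loser["name"] != winner["name"]:
        alternatives.add(loser["name"])
    merged["alternative_names"] = sorted(alternatives)
    return merged

merged = resolve(
    {"name": "Bibliothèque nationale de France", "data_tier": "TIER_1_AUTHORITATIVE"},
    {"name": "BnF", "data_tier": "TIER_4_INFERRED"},
)
```
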
**Success Criteria**:
- [ ] 20+ tests for matching logic
- [ ] 15+ tests for conflict detection
- [ ] 10+ tests for conflict resolution
- [ ] 100% coverage for crosslinking module

---

#### Day 17-18: Validation and Confidence Scoring Tests

**Test Plan**:
1. **Test LinkML schema validation**
2. **Test custom validators (UNESCO WHC ID, GHCID format)**
3. **Test confidence score calculation**
4. **Test quality metrics**

**Validation Tests**:

```python
# tests/unit/test_unesco_validators.py

def test_validate_unesco_whc_id_format():
    """Test UNESCO WHC ID validator accepts valid IDs."""
    assert validate_unesco_whc_id("600") is True
    assert validate_unesco_whc_id("1234") is True
    assert validate_unesco_whc_id("52") is False  # Too short
    assert validate_unesco_whc_id("12345") is False  # Too long
    assert validate_unesco_whc_id("ABC") is False  # Not numeric

def test_validate_ghcid_format_for_unesco():
    """Test GHCID format validator.

    Note: Collision suffix uses native language name in snake_case (NOT Wikidata Q-numbers).
    See: docs/plan/global_glam/07-ghcid-collision-resolution.md
    """
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU") is True
    assert validate_ghcid_format("FR-IDF-PAR-M-LOU-musee_du_louvre") is True  # Native name suffix
    assert validate_ghcid_format("INVALID") is False
    assert validate_ghcid_format("FR-PAR-M") is False  # Missing components

@pytest.mark.parametrize("custodian_dict,should_pass", [
    # Valid record
    ({
        "name": "Test Museum",
        "institution_type": "MUSEUM",
        "provenance": {
            "data_source": "UNESCO_WORLD_HERITAGE",
            "data_tier": "TIER_1_AUTHORITATIVE",
            "extraction_date": "2025-11-09T10:00:00Z"
        }
    }, True),

    # Missing required field
    ({
        "institution_type": "MUSEUM",  # Missing name!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),

    # Invalid enum value
    ({
        "name": "Test",
        "institution_type": "INVALID_TYPE",  # Not in enum!
        "provenance": {"data_source": "UNESCO_WORLD_HERITAGE"}
    }, False),
])
def test_linkml_validation(custodian_dict, should_pass):
    """Test LinkML schema validation with valid and invalid records."""
    validator = JsonSchemaValidator(schema="schemas/heritage_custodian.yaml")
    report = validator.validate(custodian_dict)

    if should_pass:
        assert report.valid, f"Unexpected validation error: {report.results}"
    else:
        assert not report.valid
```
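
Plain regular expressions are enough to satisfy the validator cases above. The GHCID pattern below is an assumption that covers only the asserted examples; the authoritative grammar lives in `07-ghcid-collision-resolution.md`:

```python
import re

WHC_ID_RE = re.compile(r"^\d{3,4}$")  # 3-4 digit numeric WHC IDs
GHCID_RE = re.compile(
    # country-region-city-type-code, optional snake_case native-name suffix
    r"^[A-Z]{2}-[A-Z]{2,3}-[A-Z]{2,3}-[A-Z]-[A-Z]{2,4}(-[a-z_]+)?$"
)

def validate_unesco_whc_id(value: str) -> bool:
    """True for 3-4 digit numeric UNESCO WHC IDs."""
    return bool(WHC_ID_RE.match(value))

def validate_ghcid_format(value: str) -> bool:
    """True for GHCIDs matching the assumed five-component grammar."""
    return bool(GHCID_RE.match(value))
```
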
**Confidence Scoring Tests**:

```python
from hypothesis import given, strategies as st

def test_confidence_score_high_for_complete_record():
    """Test that complete records with rich metadata score high."""
    custodian = HeritageCustodian(
        name="Musée du Louvre",
        institution_type=InstitutionTypeEnum.MUSEUM,
        locations=[Location(city="Paris", country="FR", latitude=48.86, longitude=2.34)],
        identifiers=[
            Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600"),
            Identifier(identifier_scheme="Wikidata", identifier_value="Q19675"),
            Identifier(identifier_scheme="VIAF", identifier_value="139708098")
        ],
        digital_platforms=[DigitalPlatform(platform_name="Louvre Collections", platform_url="https://collections.louvre.fr")],
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )

    score = calculate_confidence_score(custodian)

    assert score >= 0.95

def test_confidence_score_low_for_minimal_record():
    """Test that minimal records score lower."""
    custodian = HeritageCustodian(
        name="Unknown Site",
        institution_type=InstitutionTypeEnum.MIXED,  # Ambiguous
        # Missing: locations, identifiers, digital_platforms
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )

    score = calculate_confidence_score(custodian)

    assert score < 0.7

@given(
    num_identifiers=st.integers(min_value=0, max_value=5),
    has_location=st.booleans(),
    has_platform=st.booleans()
)
def test_confidence_score_stays_in_valid_range(num_identifiers, has_location, has_platform):
    """Property: confidence score stays within [0.0, 1.0] at every level of completeness."""
    custodian = HeritageCustodian(
        name="Test",
        institution_type=InstitutionTypeEnum.MUSEUM,
        identifiers=[Identifier(identifier_scheme="Test", identifier_value=f"{i}") for i in range(num_identifiers)],
        locations=[Location(city="City", country="CC")] if has_location else [],
        digital_platforms=[DigitalPlatform(platform_name="Test", platform_url="https://test.com")] if has_platform else [],
        provenance=Provenance(data_tier=DataTier.TIER_1_AUTHORITATIVE)
    )

    score = calculate_confidence_score(custodian)

    # Score should be in valid range
    assert 0.0 <= score <= 1.0
```
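
A weighted-completeness heuristic consistent with the bounds asserted above might look like this; the weights and the dict-based record shape are illustrative assumptions, not the project's actual formula:

```python
def calculate_confidence_score(record: dict) -> float:
    """Score 0.0-1.0 from metadata completeness; richer records score higher."""
    score = 0.5  # base: name and institution_type are required fields
    score += 0.1 * min(len(record.get("identifiers", [])), 3)  # up to +0.3
    score += 0.1 if record.get("locations") else 0.0
    score += 0.1 if record.get("digital_platforms") else 0.0
    return min(score, 1.0)

rich = calculate_confidence_score({
    "identifiers": ["UNESCO_WHC:600", "Wikidata:Q19675", "VIAF:139708098"],
    "locations": [{"city": "Paris"}],
    "digital_platforms": [{"platform_url": "https://collections.louvre.fr"}],
})
minimal = calculate_confidence_score({})
```

Capping with `min(score, 1.0)` keeps the property-based range assertion satisfiable no matter how many bonus terms accumulate.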
**Success Criteria**:
- [ ] 20+ validation tests
- [ ] 15+ confidence scoring tests
- [ ] Property-based tests verify score consistency
- [ ] 100% coverage for validator modules

---

### Phase 4: Integration & Enrichment (Days 20-24)

#### Wikidata Enrichment Tests

**Test Plan**:
1. **Test SPARQL query generation**
2. **Test Q-number extraction**
3. **Test fallback to fuzzy matching**
4. **Test enrichment error handling**

```python
# tests/integration/test_wikidata_enrichment.py

@responses.activate
def test_wikidata_query_by_unesco_whc_id():
    """Test querying Wikidata for Q-number using UNESCO WHC ID."""
    # Mock SPARQL endpoint response
    responses.add(
        responses.POST,
        "https://query.wikidata.org/sparql",
        json={
            "results": {
                "bindings": [
                    {"item": {"value": "http://www.wikidata.org/entity/Q90"}}
                ]
            }
        },
        status=200
    )

    q_number = query_wikidata_for_unesco_site(whc_id="600")

    assert q_number == "Q90"

def test_wikidata_enrichment_adds_identifiers():
    """Test that Wikidata enrichment adds Q-number to custodian."""
    custodian = HeritageCustodian(
        name="Paris, Banks of the Seine",
        identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="600")]
    )

    enriched = enrich_with_wikidata(custodian)

    # Should have added Wikidata identifier
    wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
    assert len(wd_ids) == 1
    assert wd_ids[0].identifier_value == "Q90"

def test_wikidata_enrichment_handles_not_found():
    """Test that enrichment handles missing Wikidata entries gracefully."""
    custodian = HeritageCustodian(
        name="Obscure Site",
        identifiers=[Identifier(identifier_scheme="UNESCO_WHC", identifier_value="99999")]
    )

    enriched = enrich_with_wikidata(custodian)

    # Should not crash, just log a warning
    wd_ids = [i for i in enriched.identifiers if i.identifier_scheme == "Wikidata"]
    assert len(wd_ids) == 0
```
**Success Criteria**:
- [ ] 10+ Wikidata enrichment tests
- [ ] Mock SPARQL responses for offline testing
- [ ] Error handling tests for API failures

---

### Phase 5: Export & Documentation (Days 25-30)

#### Export Format Tests

**Test Plan**:
1. **Test RDF/Turtle serialization**
2. **Test JSON-LD export**
3. **Test CSV flattening**
4. **Test Parquet export**
5. **Test round-trip data integrity**

```python
# tests/integration/test_exports.py

import json

def test_rdf_export_produces_valid_turtle(sample_dataset, tmp_path):
    """Test that RDF export produces valid Turtle syntax."""
    from glam_extractor.exporters.rdf_exporter import export_to_rdf

    output_path = tmp_path / "test_export.ttl"
    export_to_rdf(sample_dataset, output_path)

    # Verify file was created
    assert output_path.exists()

    # Parse with rdflib to verify syntax
    from rdflib import Graph
    graph = Graph()
    graph.parse(output_path, format="turtle")

    # Check that records were serialized
    assert len(graph) > 0

def test_jsonld_export_includes_context(sample_dataset, tmp_path):
    """Test that JSON-LD export includes @context."""
    from glam_extractor.exporters.jsonld_exporter import export_to_jsonld

    output_path = tmp_path / "test_export.jsonld"
    export_to_jsonld(sample_dataset, output_path)

    with open(output_path) as f:
        data = json.load(f)

    assert "@context" in data
    assert "@graph" in data or isinstance(data, list)

def test_csv_export_flattens_nested_structures(sample_dataset, tmp_path):
    """Test that CSV export flattens nested structures correctly."""
    from glam_extractor.exporters.csv_exporter import export_to_csv

    output_path = tmp_path / "test_export.csv"
    export_to_csv(sample_dataset, output_path)

    import pandas as pd
    df = pd.read_csv(output_path)

    # Check required columns
    assert "ghcid" in df.columns
    assert "name" in df.columns
    assert "institution_type" in df.columns
    assert "country" in df.columns

def test_parquet_export_preserves_data_types(sample_dataset, tmp_path):
    """Test that Parquet export preserves data types."""
    from glam_extractor.exporters.parquet_exporter import export_to_parquet

    output_path = tmp_path / "test_export.parquet"
    export_to_parquet(sample_dataset, output_path)

    import pandas as pd
    df = pd.read_parquet(output_path)

    # Verify data types
    assert df['ghcid'].dtype == 'object'  # String
    assert df['confidence_score'].dtype == 'float64'

def test_round_trip_data_integrity(sample_dataset, tmp_path):
    """Test that data survives export → import round trip."""
    from glam_extractor.exporters.jsonld_exporter import export_to_jsonld
    from glam_extractor.importers.jsonld_importer import import_from_jsonld

    # Export to JSON-LD
    output_path = tmp_path / "test_export.jsonld"
    export_to_jsonld(sample_dataset, output_path)

    # Import back
    imported_dataset = import_from_jsonld(output_path)

    # Compare
    assert len(imported_dataset) == len(sample_dataset)

    for original, imported in zip(sample_dataset, imported_dataset):
        assert original.ghcid == imported.ghcid
        assert original.name == imported.name
```
**Success Criteria**:
- [ ] 15+ export format tests
- [ ] Round-trip tests verify data integrity
- [ ] All exports validate with external parsers

---

## Continuous Integration (CI) Strategy

### GitHub Actions Workflow

```yaml
# .github/workflows/test.yml

name: Test Suite

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install -e ".[test]"

      - name: Run unit tests
        run: |
          pytest tests/unit -v --cov --cov-report=xml -m "not slow"

      - name: Run integration tests
        run: |
          pytest tests/integration -v -m "not api and not wikidata"

      - name: Check code coverage
        run: |
          pytest --cov-report=term-missing --cov-fail-under=90

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml
```
### Pre-commit Hooks

```yaml
# .pre-commit-config.yaml

repos:
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.11
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
        additional_dependencies: [types-all]

  - repo: local
    hooks:
      - id: pytest-quick
        name: pytest-quick
        entry: pytest tests/unit -m "not slow"
        language: system
        pass_filenames: false
        always_run: true
```
---

## Test Coverage Goals

### Coverage Targets by Module

| Module | Target Coverage | Priority |
|--------|-----------------|----------|
| `unesco_api_client.py` | 100% | Critical |
| `unesco_parser.py` | 100% | Critical |
| `unesco_institution_type.py` | 100% | Critical |
| `ghcid_generator.py` | 95% | Critical |
| `linkml_map.py` | 90% | High |
| `conflict_resolver.py` | 95% | High |
| `wikidata_enrichment.py` | 85% | Medium |
| `exporters/*.py` | 85% | Medium |

### Coverage Enforcement

```bash
# Fail CI if coverage drops below 90%
pytest --cov-fail-under=90

# Generate HTML coverage report
pytest --cov --cov-report=html

# Open coverage report
open htmlcov/index.html
```
---

## Performance Testing

### Benchmark Tests

```python
# tests/performance/test_benchmarks.py

import pytest
import time

@pytest.mark.slow
def test_parse_1000_sites_completes_in_reasonable_time():
    """Benchmark: Parse 1,000 UNESCO sites in < 60 seconds."""
    start = time.time()

    results = []
    for i in range(1000):
        site_data = generate_fake_unesco_site()
        custodian = parse_unesco_site(site_data)
        results.append(custodian)

    elapsed = time.time() - start

    assert elapsed < 60, f"Parsing took {elapsed:.2f}s, expected < 60s"

@pytest.mark.slow
def test_ghcid_generation_performance():
    """Benchmark: Generate 10,000 GHCIDs in < 5 seconds."""
    custodians = [generate_fake_custodian() for _ in range(10000)]

    start = time.time()
    ghcids = [generate_ghcid(c) for c in custodians]
    elapsed = time.time() - start

    assert elapsed < 5, f"GHCID generation took {elapsed:.2f}s, expected < 5s"
```
---

## Test Data Management

### Fixture Organization

```python
# tests/conftest.py

import pytest
from pathlib import Path
import json
import yaml

@pytest.fixture(scope="session")
def fixtures_dir():
    """Return path to fixtures directory."""
    return Path(__file__).parent / "fixtures"

@pytest.fixture(scope="session")
def golden_dataset(fixtures_dir):
    """Load all golden dataset fixtures."""
    expected_dir = fixtures_dir / "expected_outputs"
    golden_data = {}

    for yaml_file in expected_dir.glob("*.yaml"):
        site_id = yaml_file.stem.split("_")[1]  # Extract site ID
        with open(yaml_file) as f:
            golden_data[site_id] = yaml.safe_load(f)

    return golden_data

@pytest.fixture
def mock_api_responses(fixtures_dir):
    """Load mock UNESCO API responses."""
    api_dir = fixtures_dir / "unesco_api_responses"
    responses = {}

    for json_file in api_dir.glob("*.json"):
        site_id = json_file.stem.split("_")[1]
        with open(json_file) as f:
            responses[site_id] = json.load(f)

    return responses
```
---

## Summary of Test Counts

### Total Tests by Phase

| Phase | Unit Tests | Integration Tests | E2E Tests | Property-Based | Total |
|-------|------------|-------------------|-----------|----------------|-------|
| Phase 1 | 50 | 5 | 0 | 10 | 65 |
| Phase 2 | 75 | 20 | 3 | 15 | 113 |
| Phase 3 | 45 | 15 | 0 | 8 | 68 |
| Phase 4 | 25 | 10 | 2 | 5 | 42 |
| Phase 5 | 30 | 10 | 2 | 0 | 42 |
| **Total** | **225** | **60** | **7** | **38** | **330** |

### Success Criteria

- [ ] **330+ total tests** implemented
- [ ] **90%+ code coverage** across all modules
- [ ] **100% passing tests** on CI
- [ ] **Zero flaky tests** (deterministic execution)
- [ ] **All integration tests** use mocked external dependencies
- [ ] **Property-based tests** cover edge cases
- [ ] **Performance benchmarks** pass on CI

---

## Version History

### Version 1.1 (2025-11-10)
**Migration to OpenDataSoft Explore API v2.0**

**Breaking Changes**:
- **API Endpoint Format**: Changed from `https://whc.unesco.org/en/list/{id}/json` to `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc-sites-2021/records/{id}`
- **JSON Structure**: Migrated from flat JSON to nested OpenDataSoft format
  - Old: `{"id_number": 600, "site": "Paris, Banks of the Seine"}`
  - New: `{"record": {"id": "rec_600", "fields": {"unique_number": 600, "name_en": "Paris, Banks of the Seine"}}}`
- **Field Mappings Updated**:
  - `id_number` → `unique_number`
  - `site` → `name_en`
  - `states` → `states_name_en`
  - `region` → `region_en`
  - `short_description` → `short_description_en`
  - `justification` → `justification_en`
  - `date_inscribed` → numeric year (was string)
  - `links` → `http_url` (single URL field)
  - Coordinates now in nested `fields` object
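
The renames above can be kept in a single mapping dict, so legacy fixtures or cached responses can be translated mechanically. A sketch covering only the straight renames (`date_inscribed`'s type change and the coordinate nesting need separate handling):

```python
FIELD_MAP_V1_TO_V2 = {
    "id_number": "unique_number",
    "site": "name_en",
    "states": "states_name_en",
    "region": "region_en",
    "short_description": "short_description_en",
    "justification": "justification_en",
    "links": "http_url",
}

def translate_legacy_keys(legacy: dict) -> dict:
    """Rename legacy v1.0 keys to their OpenDataSoft v2.0 equivalents."""
    return {FIELD_MAP_V1_TO_V2.get(key, key): value for key, value in legacy.items()}

fields = translate_legacy_keys({"id_number": 600, "site": "Paris, Banks of the Seine"})
```
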
**Updated Sections**:
1. **Day 1: API Client Tests** (Lines 175-269)
   - Updated all API endpoint URLs to OpenDataSoft format
   - Updated mock JSON responses to nested structure
   - Updated field access patterns in assertions
   - Updated error handling tests (404 responses)
   - Updated retry logic tests with new endpoints
   - Updated caching tests with new response format

2. **Day 2-3: LinkML Map Schema Tests** (Lines 342-407)
   - Updated institution type mapping test fixtures
   - Updated botanical garden test fixture
   - Updated multilingual name extraction (now uses `name_en`, `name_fr`, `name_es` fields)
   - Updated field access in assertions

3. **Day 4: Test Fixture Creation** (Lines 450-510)
   - Updated `sample_unesco_site_600` fixture with nested structure
   - Updated `expected_heritage_custodian_600` fixture
   - Updated mock API client return values

4. **Day 6-7: Parser Tests** (Lines 709-747)
   - Updated incomplete site test with nested structure
   - Updated identifier extraction test
   - Changed Wikidata extraction from `links` array to `wdpaid` field
   - Updated all field references

5. **Day 9-12: Integration Tests** (Lines 886-916)
   - Updated full pipeline integration test mock response
   - Updated all field names in nested structure
   - Updated assertions to match new API format

**Field Access Pattern Changes**:
```python
# Old (v1.0)
site_id = response['id_number']
site_name = response['site']
country = response['states']

# New (v1.1)
site_id = response['record']['fields']['unique_number']
site_name = response['record']['fields']['name_en']
country = response['record']['fields']['states_name_en']
```
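
Since every field now sits two levels deep, a small accessor avoids repeating the `['record']['fields']` chain and a `KeyError` on partial records. This is a sketch, assuming the nested shape shown above:

```python
def get_field(response: dict, name: str, default=None):
    """Read one value from an OpenDataSoft record's nested fields dict."""
    return response.get("record", {}).get("fields", {}).get(name, default)

response = {"record": {"fields": {
    "unique_number": 600,
    "name_en": "Paris, Banks of the Seine",
}}}

site_id = get_field(response, "unique_number")
missing = get_field(response, "name_fr", default="")
```
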
**Testing Impact**:
- All API client tests require updated endpoint URLs
- All mock responses updated to nested JSON structure
- Parser tests updated to extract from `record.fields.*` paths
- LinkML Map schema must handle nested path extraction
- Golden dataset fixtures require regeneration with new format
- Integration tests updated with new field access patterns

**Rationale for Migration**:
- Legacy UNESCO JSON API (`whc.unesco.org/en/list/{id}/json`) deprecated or unstable
- OpenDataSoft Explore API v2.0 is the official UNESCO data platform
- Provides a standardized REST API with pagination, filtering, and versioned datasets
- Better maintained and documented than the legacy endpoint
- Supports structured metadata with type information

**Migration Guide**:
1. Update all API client code to use the new endpoint format
2. Update JSON parsing logic to handle the `record.fields.*` nested structure
3. Update LinkML Map schema transformation paths
4. Regenerate all test fixtures with new API responses
5. Run the full test suite to verify all tests pass with the new format

**References**:
- OpenDataSoft API Documentation: https://help.opendatasoft.com/apis/ods-explore-v2/
- UNESCO World Heritage Dataset: https://data.unesco.org/datasets/whc-sites-2021
- API Migration Tracking Issue: TBD

### Version 1.0 (2025-11-08)
**Initial Release**

- Complete TDD strategy for UNESCO World Heritage Sites extraction
- 13-day implementation plan with daily test goals
- 175+ total tests across all phases:
  - 15+ API client unit tests
  - 20+ golden dataset tests
  - 20+ parser unit tests
  - 15+ validator tests
  - 25+ GHCID generator tests
  - 30+ integration tests
  - 50+ property-based tests
- Coverage targets: 90%+ overall, 100% for critical modules
- Test types: Unit, Integration, E2E, Property-Based, Golden Dataset
- Fixtures: 20+ golden dataset YAML files, 20+ mock API responses
- Performance benchmarks: Parse 1,000 sites in <60s, Generate 10,000 GHCIDs in <5s
- CI/CD integration with pytest and coverage reporting

---

## Next Steps

**Completed**:
- ✅ `01-dependencies.md` - Technical dependencies
- ✅ `02-consumers.md` - Use cases and data consumers
- ✅ `03-implementation-phases.md` - 6-week timeline
- ✅ `04-tdd-strategy.md` - **THIS DOCUMENT** - Testing strategy

**Next to Create**:
1. `05-design-patterns.md` - UNESCO-specific architectural patterns
2. `06-linkml-map-schema.md` - **CRITICAL** - Complete LinkML Map transformation rules
3. `07-master-checklist.md` - Implementation checklist

---

**Document Status**: Complete
**Next Document**: `05-design-patterns.md` - Architectural patterns
**Version**: 1.1