# UNESCO Data Extraction - Implementation Phases
**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 03 - Implementation Phases
**Version**: 1.1
**Date**: 2025-11-10
**Status**: Draft
---
## Executive Summary
This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.
**Total Effort**: 30 working days (6 weeks)
**Team Size**: 1-2 developers + AI agents
**Target Output**: 1,000+ heritage custodian records from UNESCO sites
---
## Phase Overview
| Phase | Duration | Focus | Key Deliverables |
|-------|----------|-------|------------------|
| **Phase 1** | 5 days | API Exploration & Schema Design | UNESCO API parser, LinkML Map schema |
| **Phase 2** | 8 days | Extractor Implementation | Institution type classifier, GHCID generator |
| **Phase 3** | 6 days | Data Quality & Validation | LinkML validator, conflict resolver |
| **Phase 4** | 5 days | Integration & Enrichment | Wikidata enrichment, dataset merge |
| **Phase 5** | 6 days | Export & Documentation | RDF/JSON-LD exporters, user docs |
---
## Phase 1: API Exploration & Schema Design (Days 1-5)
### Objectives
- Understand UNESCO DataHub API structure and data quality
- Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
- Create test fixtures from real UNESCO API responses
- Establish baseline for institution type classification
### Day 1: UNESCO API Reconnaissance
**Tasks**:
1. **API Documentation Review**
- Study UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
- Identify available endpoints (dataset `whc001` - World Heritage List)
- Document authentication requirements (none - public dataset)
- Document pagination limits (max 100 records per request, use `offset` parameter)
- Test API responses for sample sites
2. **Data Structure Analysis**
```bash
# Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
> samples/unesco_site_list.json
# Fetch specific site by unique_number using ODSQL
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
> samples/unesco_angkor_detail.json
# Response structure: nested record.fields with coordinates object
# {
# "total_count": 1248,
# "records": [{
# "record": {
# "fields": {
# "name_en": "Angkor",
# "unique_number": 668,
# "coordinates": {"lat": 13.4333, "lon": 103.8333}
# }
# }
# }]
# }
```
3. **Schema Mapping**
- Map OpenDataSoft `record.fields` to LinkML `HeritageCustodian` slots
- Identify missing fields (require inference or external enrichment)
- Document ambiguities (e.g., when is a site also a museum?)
- Handle nested response structure (`response['records'][i]['record']['fields']`)
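To make the nested structure concrete, a minimal sketch of walking a captured response (the field names `name_en`, `unique_number`, and `coordinates` come from the Day 1 sample above and may differ in the live dataset):
```python
# Minimal sketch: extract core fields from a captured v2.0 response
import json

with open("samples/unesco_site_list.json") as f:
    response = json.load(f)

for entry in response.get("records", []):
    fields = entry["record"]["fields"]        # nested record -> fields
    name = fields.get("name_en")              # English site name (field name from the sample)
    whc_id = fields.get("unique_number")      # UNESCO WHC identifier
    coords = fields.get("coordinates") or {}  # {"lat": ..., "lon": ...}
    print(whc_id, name, coords.get("lat"), coords.get("lon"))
```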
**Deliverables**:
- `docs/unesco-api-analysis.md` - API structure documentation
- `tests/fixtures/unesco_api_responses/` - 10+ sample JSON files
- `docs/unesco-to-linkml-mapping.md` - Field mapping table
**Success Criteria**:
- [ ] Successfully fetch data for 50 UNESCO sites via API
- [ ] Document all relevant JSON fields for extraction
- [ ] Identify 3+ institution type classification patterns
---
### Day 2: LinkML Map Schema Design (Part 1)
**Tasks**:
1. **Install LinkML Map Extension**
```bash
pip install linkml-map
# OR implement custom extension: src/glam_extractor/mappers/extended_map.py
```
2. **Design Transformation Rules**
- Create `schemas/maps/unesco_to_heritage_custodian.yaml`
- Define JSONPath expressions for field extraction
- Handle multilingual names (UNESCO provides English, French, and often a local-language form)
- Map UNESCO categories to InstitutionTypeEnum
3. **Conditional Extraction Logic**
```yaml
# Example LinkML Map rule
mappings:
- source_path: $.category
target_path: institution_type
transform:
type: conditional
rules:
- condition: "contains(description, 'museum')"
value: MUSEUM
- condition: "contains(description, 'library')"
value: LIBRARY
- condition: "contains(description, 'archive')"
value: ARCHIVE
- default: MIXED
```
**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (initial version)
- `tests/test_unesco_linkml_map.py` - Unit tests for mapping rules
**Success Criteria**:
- [ ] LinkML Map schema validates against sample UNESCO JSON
- [ ] Successfully extract name, location, UNESCO WHC ID from 10 fixtures
- [ ] Handle multilingual names without data loss
---
### Day 3: LinkML Map Schema Design (Part 2)
**Tasks**:
1. **Advanced Transformation Rules**
- Regex extraction for identifiers (UNESCO WHC ID format: `^\d{3,4}$`)
- GeoNames ID lookup from UNESCO location strings
- Wikidata Q-number extraction from UNESCO external links (see the sketch after this task list)
2. **Multi-value Array Handling**
```yaml
# Extract all languages from UNESCO site names
mappings:
- source_path: $.names[*]
target_path: alternative_names
transform:
type: array
element_transform:
type: template
template: "{name}@{lang}"
```
3. **Error Handling Patterns**
- Missing required fields → skip record with warning
- Invalid coordinates → flag for manual geocoding
- Unknown institution type → default to MIXED, log for review
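The regex extraction from item 1 and the fallback defaults from item 3 can be prototyped as plain functions before being encoded as Map transformation rules. A minimal sketch (the helper names are illustrative, not part of the extractor API):
```python
import re
from typing import List, Optional

WHC_ID_PATTERN = re.compile(r"^\d{3,4}$")       # UNESCO WHC ID: 3-4 digits
WIKIDATA_QID_PATTERN = re.compile(r"(Q\d+)")    # Q-number inside an external link

def extract_whc_id(value: str) -> Optional[str]:
    """Return the WHC ID if the value matches the expected format, else None."""
    value = value.strip()
    return value if WHC_ID_PATTERN.match(value) else None

def extract_wikidata_qid(url: str) -> Optional[str]:
    """Pull a Q-number out of a Wikidata URL, e.g. https://www.wikidata.org/wiki/Q9259."""
    match = WIKIDATA_QID_PATTERN.search(url)
    return match.group(1) if match else None

def institution_type_or_default(candidates: List[str]) -> str:
    """Unknown institution type falls back to MIXED (logged for review in the real pipeline)."""
    return candidates[0] if candidates else "MIXED"
```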
**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (complete)
- `docs/linkml-map-extension-spec.md` - Custom extension specification
**Success Criteria**:
- [ ] Extract ALL relevant fields from 10 diverse UNESCO sites
- [ ] Handle edge cases (missing data, malformed coordinates)
- [ ] Generate valid LinkML instances from real API responses
---
### Day 4: Test Fixture Creation
**Tasks**:
1. **Curate Representative Samples**
- Select 20 UNESCO sites covering:
- All continents (Europe, Asia, Africa, Americas, Oceania)
- Multiple institution types (museums, libraries, archives, botanical gardens)
- Edge cases (serial nominations, transboundary sites)
2. **Create Expected Outputs**
```yaml
# tests/fixtures/expected_outputs/unesco_louvre.yaml
- id: https://w3id.org/heritage/custodian/fr/louvre
name: Musée du Louvre
institution_type: MUSEUM
locations:
- city: Paris
country: FR
coordinates: [48.8606, 2.3376]
identifiers:
- identifier_scheme: UNESCO_WHC
identifier_value: "600"
identifier_url: https://whc.unesco.org/en/list/600
provenance:
data_source: UNESCO_WORLD_HERITAGE
data_tier: TIER_1_AUTHORITATIVE
```
3. **Golden Dataset Construction**
- Manually verify 20 expected outputs against authoritative sources
- Document any assumptions or inferences made
**Deliverables**:
- `tests/fixtures/unesco_api_responses/` - 20 JSON files
- `tests/fixtures/expected_outputs/` - 20 YAML files
- `tests/test_unesco_golden_dataset.py` - Integration tests
**Success Criteria**:
- [ ] 20 golden dataset pairs (input JSON + expected YAML)
- [ ] 100% passing tests for golden dataset
- [ ] Documented edge cases and classification rules
---
### Day 5: Institution Type Classifier Design
**Tasks**:
1. **Pattern Analysis**
- Analyze UNESCO descriptions for GLAM-related keywords
- Create decision tree for institution type classification
- Handle ambiguous cases (e.g., "archaeological park with museum")
2. **Keyword Extraction**
```python
# src/glam_extractor/classifiers/unesco_institution_type.py
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
```
3. **Confidence Scoring**
- High confidence (0.9+): Explicit mentions of "museum" or "library"
- Medium confidence (0.7-0.9): Inferred from UNESCO category + keywords
- Low confidence (0.5-0.7): Ambiguous, default to MIXED, flag for review
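A minimal sketch of how the keyword lists above and the confidence bands could combine (the exact scores are illustrative and should be calibrated against the golden dataset):
```python
from typing import Tuple

def classify_with_confidence(description: str) -> Tuple[str, float]:
    """Return (institution_type, confidence) for a free-text UNESCO description."""
    text = description.lower()
    hits = [
        inst_type
        for inst_type, keywords in (
            ("MUSEUM", MUSEUM_KEYWORDS),
            ("LIBRARY", LIBRARY_KEYWORDS),
            ("ARCHIVE", ARCHIVE_KEYWORDS),
        )
        if any(keyword in text for keyword in keywords)
    ]
    if len(hits) == 1:
        return hits[0], 0.9    # high confidence: one explicit keyword family
    if len(hits) > 1:
        return "MIXED", 0.75   # medium confidence: several GLAM functions mentioned
    return "MIXED", 0.5        # low confidence: no keyword, flag for manual review
```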
**Deliverables**:
- `src/glam_extractor/classifiers/unesco_institution_type.py`
- `tests/test_unesco_classifier.py` - 50+ test cases
- `docs/unesco-classification-rules.md` - Decision tree documentation
**Success Criteria**:
- [ ] Classifier achieves 90%+ accuracy on golden dataset
- [ ] Low-confidence classifications flagged for manual review
- [ ] Handle multilingual descriptions (English, French, Spanish, etc.)
---
## Phase 2: Extractor Implementation (Days 6-13)
### Objectives
- Implement UNESCO API client with caching and rate limiting
- Build LinkML instance generator using Map schema
- Create GHCID generator for UNESCO institutions
- Achieve 100% test coverage for core extraction logic
### Day 6: UNESCO API Client
**Tasks**:
1. **HTTP Client Implementation**
```python
# src/glam_extractor/parsers/unesco_api_client.py
class UNESCOAPIClient:
def __init__(self, cache_ttl: int = 86400):
self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
self.cache = Cache(ttl=cache_ttl)
self.rate_limiter = RateLimiter(requests_per_second=2)
def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
"""
Fetch paginated list of UNESCO World Heritage Sites.
Returns OpenDataSoft API response with structure:
{
"total_count": int,
"records": [{"record": {"id": str, "fields": {...}}}, ...]
}
"""
...
def fetch_site_details(self, whc_id: int) -> Dict:
"""
Fetch detailed information for a specific site using ODSQL query.
Returns single record with structure:
{"record": {"id": str, "fields": {field_name: value, ...}}}
"""
...
```
2. **Caching Strategy**
- Cache API responses for 24 hours (UNESCO updates infrequently)
- Store in SQLite database: `cache/unesco_api_cache.db`
- Invalidate cache on demand for data refreshes
3. **Error Handling**
- Network errors → retry with exponential backoff
- 404 Not Found → skip site, log warning
- Rate limit exceeded → pause and retry
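A minimal sketch of the retry behaviour described in item 3, using `requests` (the backoff parameters and status-code handling are assumptions to be tuned during implementation):
```python
import time
import requests

def fetch_with_retry(url: str, params: dict, max_retries: int = 5) -> dict:
    """GET with exponential backoff; raises after max_retries failed attempts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            if response.status_code == 404:
                return {}                 # site not found: caller skips and logs a warning
            if response.status_code == 429:
                time.sleep(2 ** attempt)  # rate limit exceeded: pause and retry
                continue
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout):
            time.sleep(2 ** attempt)      # network error: exponential backoff (1s, 2s, 4s, ...)
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")
```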
**Deliverables**:
- `src/glam_extractor/parsers/unesco_api_client.py`
- `tests/test_unesco_api_client.py` - Mock API tests
- `cache/unesco_api_cache.db` - SQLite cache database
**Success Criteria**:
- [ ] Successfully fetch all UNESCO sites (1,000+ sites)
- [ ] Handle API errors gracefully (no crashes)
- [ ] Cache reduces API calls by 95% on repeat runs
---
### Day 7: LinkML Instance Generator
**Tasks**:
1. **Apply LinkML Map Transformations**
```python
# src/glam_extractor/parsers/unesco_parser.py
from linkml_map import Mapper
def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
"""
Parse OpenDataSoft API response to HeritageCustodian instance.
Args:
api_response: OpenDataSoft record with nested structure:
{"record": {"id": str, "fields": {field_name: value, ...}}}
Returns:
HeritageCustodian: Validated LinkML instance
"""
# Extract fields from nested structure
site_data = api_response['record']['fields']
mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
instance = mapper.transform(site_data)
return HeritageCustodian(**instance)
```
2. **Validation Pipeline**
- Apply LinkML schema validation after transformation
- Catch validation errors, log details
- Skip invalid records, continue processing (see the sketch after this task list)
3. **Multi-language Name Handling**
```python
def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
"""
Extract primary name and alternative names in multiple languages.
Args:
site_data: Extracted fields from OpenDataSoft record['fields']
"""
primary_name = site_data.get('site', '')
alternative_names = []
# OpenDataSoft may provide language variants in separate fields
# or as structured data - adjust based on actual API response
for lang_data in site_data.get('names', []):
name = lang_data.get('name', '')
lang = lang_data.get('lang', 'en')
if name and name != primary_name:
alternative_names.append(f"{name}@{lang}")
return primary_name, alternative_names
```
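Item 2's validation pipeline could wrap the parser roughly as sketched below; it reuses the `SchemaValidator` interface assumed later in this plan, and the skip-and-log behaviour follows the rules above:
```python
import logging
from typing import Iterable, List

log = logging.getLogger("unesco_extraction")

def parse_and_validate(api_responses: Iterable[dict]) -> List["HeritageCustodian"]:
    """Parse each record and keep only instances that pass schema validation."""
    validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
    valid_records = []
    for response in api_responses:
        try:
            custodian = parse_unesco_site(response)
        except (KeyError, ValueError) as exc:
            log.warning("Skipping malformed record: %s", exc)
            continue
        result = validator.validate(custodian)
        if result.is_valid:
            valid_records.append(custodian)
        else:
            # Log details and continue processing, per the error-handling rules above
            log.warning("Validation failed for %s: %s", custodian.name, result)
    return valid_records
```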
**Deliverables**:
- `src/glam_extractor/parsers/unesco_parser.py`
- `tests/test_unesco_parser.py` - 20 golden dataset tests
**Success Criteria**:
- [ ] Parse 20 golden dataset fixtures with 100% accuracy
- [ ] Extract multilingual names without data loss
- [ ] Generate valid LinkML instances (pass schema validation)
---
### Day 8: GHCID Generator for UNESCO Sites
**Tasks**:
1. **Extend GHCID Logic for UNESCO**
```python
# src/glam_extractor/identifiers/ghcid_generator.py
def generate_ghcid_for_unesco_site(
country_code: str,
region_code: str,
city_code: str,
institution_type: InstitutionTypeEnum,
institution_name: str,
has_collision: bool = False
) -> str:
"""
Generate GHCID for UNESCO World Heritage Site institution.
Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)
Note: Collision suffix uses native language institution name in snake_case,
NOT Wikidata Q-numbers. See docs/plan/global_glam/07-ghcid-collision-resolution.md
"""
...
```
2. **City Code Lookup**
- Use GeoNames API to convert city names to UN/LOCODE
- Fallback to 3-letter abbreviation if UN/LOCODE not found
- Cache lookups to minimize API calls (see the sketch after this task list)
3. **Collision Detection**
- Check existing GHCID dataset for collisions
- Apply temporal priority rules (first batch vs. historical addition)
- Append native language name suffix if collision detected
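Item 2's city-code lookup could look roughly like the sketch below; `geonames_unlocode()` is a hypothetical helper standing in for the GeoNames call, and the in-memory cache and 3-letter fallback mirror the rules above:
```python
from functools import lru_cache
from typing import Optional

@lru_cache(maxsize=None)
def lookup_city_code(city_name: str, country_code: str) -> str:
    """Return a UN/LOCODE-style city code, falling back to a 3-letter abbreviation."""
    unlocode: Optional[str] = geonames_unlocode(city_name, country_code)  # hypothetical GeoNames helper
    if unlocode:
        # UN/LOCODE is "<country><city>", e.g. "FRPAR"; keep only the 3-letter city part
        return unlocode[-3:]
    # Fallback when no UN/LOCODE is found: first three letters of the city name
    return city_name[:3].upper()
```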
**Deliverables**:
- Extended `src/glam_extractor/identifiers/ghcid_generator.py`
- `tests/test_ghcid_unesco.py` - GHCID generation tests
**Success Criteria**:
- [ ] Generate valid GHCIDs for 20 golden dataset institutions
- [ ] No collisions with existing Dutch ISIL dataset
- [ ] Handle missing Wikidata Q-numbers gracefully
---
### Day 9-10: Batch Processing Pipeline
**Tasks**:
1. **Parallel Processing**
```python
# scripts/extract_unesco_sites.py
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional

api_client = UNESCOAPIClient()  # shared client (cached and rate-limited)

def extract_all_unesco_sites(max_workers: int = 4):
    # Fetch paginated site list from OpenDataSoft API
    all_sites = []
    offset = 0
    limit = 100
    while True:
        response = api_client.fetch_site_list(limit=limit, offset=offset)
        page = response['records']
        all_sites.extend(page)
        if not page or len(all_sites) >= response['total_count']:
            break
        offset += limit
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(process_unesco_site, all_sites)
    return list(results)

def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
    """
    Process single OpenDataSoft record.
    Args:
        site_record: {"record": {"id": str, "fields": {...}}}
    """
    whc_id = site_record.get('record', {}).get('fields', {}).get('unique_number')
    try:
        # Fetch full details if needed (or use site_record directly)
        details = api_client.fetch_site_details(whc_id)
        institution_type = classify_institution_type(details['record']['fields'])
        custodian = parse_unesco_site(details)
        custodian.institution_type = institution_type  # apply classifier result
        custodian.ghcid = generate_ghcid(custodian)
        return custodian
    except Exception as e:
        log.error(f"Failed to process site {whc_id}: {e}")
        return None
```
2. **Progress Tracking**
- Use `tqdm` for progress bars
- Log successful extractions to `logs/unesco_extraction.log`
- Save intermediate results every 100 sites
3. **Error Recovery**
- Resume from last checkpoint if script crashes
- Separate successful extractions from failed ones
- Generate error report with failed site IDs
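A minimal sketch of the checkpointing described in item 3 (the checkpoint file name and granularity are assumptions):
```python
import json
from pathlib import Path

CHECKPOINT_FILE = Path("data/unesco_extracted/checkpoint.json")  # assumed location

def load_checkpoint() -> set:
    """Return the set of WHC IDs already processed, if a checkpoint exists."""
    if CHECKPOINT_FILE.exists():
        return set(json.loads(CHECKPOINT_FILE.read_text()))
    return set()

def save_checkpoint(processed_ids: set) -> None:
    """Persist processed IDs so a crashed run can resume where it left off."""
    CHECKPOINT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT_FILE.write_text(json.dumps(sorted(processed_ids)))

# In the batch loop: skip sites already in load_checkpoint() and call
# save_checkpoint() after every 100 successful extractions.
```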
**Deliverables**:
- `scripts/extract_unesco_sites.py` - Batch extraction script
- `data/unesco_extracted/` - Output directory for YAML instances
- `logs/unesco_extraction.log` - Extraction log
**Success Criteria**:
- [ ] Process all 1,000+ UNESCO sites in < 2 hours
- [ ] < 5% failure rate (API errors, missing data)
- [ ] Successful extractions saved as valid LinkML YAML files
---
### Day 11-12: Integration Testing
**Tasks**:
1. **End-to-End Tests**
```python
# tests/integration/test_unesco_pipeline.py
def test_full_unesco_extraction_pipeline():
"""Test complete pipeline from OpenDataSoft API fetch to LinkML instance."""
# 1. Fetch API data from OpenDataSoft
api_client = UNESCOAPIClient()
response = api_client.fetch_site_details(600) # Paris, Banks of the Seine
site_data = response['record']['fields'] # Extract from nested structure
# 2. Classify institution type
inst_type = classify_institution_type(site_data)
assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]
# 3. Parse to LinkML instance
custodian = parse_unesco_site(response)
assert custodian.name is not None
# 4. Generate GHCID
custodian.ghcid = generate_ghcid(custodian)
assert custodian.ghcid.startswith("FR-")
# 5. Validate against schema
validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
result = validator.validate(custodian)
assert result.is_valid
```
2. **Property-Based Testing**
```python
from hypothesis import given, strategies as st
@given(st.integers(min_value=1, max_value=1500))
def test_ghcid_determinism(whc_id: int):
"""GHCID generation is deterministic for same input."""
site1 = generate_ghcid_for_site(whc_id)
site2 = generate_ghcid_for_site(whc_id)
assert site1 == site2
```
3. **Performance Testing**
- Benchmark extraction speed (sites per second)
- Memory profiling (ensure no memory leaks)
- Cache hit rate analysis
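The throughput benchmark can start as simple wall-clock timing over a sample batch (a sketch; sites per second is the metric named above):
```python
import time

def benchmark_extraction(sample_records: list) -> float:
    """Return extraction throughput in sites per second for a sample batch."""
    start = time.perf_counter()
    for record in sample_records:
        process_unesco_site(record)
    elapsed = time.perf_counter() - start
    return len(sample_records) / elapsed if elapsed else 0.0
```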
**Deliverables**:
- `tests/integration/test_unesco_pipeline.py` - End-to-end tests
- `tests/test_unesco_property_based.py` - Property-based tests
- `docs/performance-benchmarks.md` - Performance results
**Success Criteria**:
- [ ] 100% passing integration tests
- [ ] Extract 1,000 sites in < 2 hours (with cache)
- [ ] Memory usage < 500MB for full extraction
---
### Day 13: Code Review & Refactoring
**Tasks**:
1. **Code Quality Review**
- Run `ruff` linter, fix all warnings
- Run `mypy` type checker, resolve type errors
- Ensure 90%+ test coverage
2. **Documentation Review**
- Add docstrings to all public functions
- Update README with UNESCO extraction instructions
- Create developer guide for extending classifiers
3. **Performance Optimization**
- Profile slow functions, optimize bottlenecks
- Reduce redundant API calls
- Optimize GHCID generation (cache city code lookups)
**Deliverables**:
- Refactored codebase with 90%+ test coverage
- Updated documentation
- Performance optimizations applied
**Success Criteria**:
- [ ] Zero linter warnings
- [ ] Zero type errors
- [ ] Test coverage > 90%
---
## Phase 3: Data Quality & Validation (Days 14-19)
### Objectives
- Cross-reference UNESCO data with existing GLAM dataset (Dutch ISIL, conversations)
- Detect and resolve conflicts
- Implement confidence scoring system
- Generate data quality report
### Day 14-15: Cross-Referencing with Existing Data
**Tasks**:
1. **Load Existing Dataset**
```python
# scripts/crosslink_unesco_with_glam.py
def load_existing_glam_dataset():
"""Load Dutch ISIL + conversation extractions."""
dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
conversations = load_conversation_extractions("data/instances/")
return merge_datasets([dutch_isil, dutch_orgs, conversations])
```
2. **Match UNESCO Sites to Existing Records**
- Match by Wikidata Q-number (highest confidence)
- Match by ISIL code (for Dutch institutions)
- Match by name + location (fuzzy matching, score > 0.85; see the sketch after this task list)
3. **Conflict Detection**
```python
def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
"""Detect field-level conflicts between UNESCO and existing data."""
conflicts = []
if unesco_record.name != existing_record.name:
conflicts.append(Conflict(
field="name",
unesco_value=unesco_record.name,
existing_value=existing_record.name,
resolution="MANUAL_REVIEW"
))
# Check institution type mismatch
if unesco_record.institution_type != existing_record.institution_type:
conflicts.append(Conflict(field="institution_type", ...))
return conflicts
```
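Item 2's fuzzy name matching can be prototyped with the standard library before committing to a dedicated matching library; the 0.85 threshold comes from the task above, and restricting candidates by location is left to the caller:
```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized similarity between two institution names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_fuzzy_match(unesco_record, existing_records, threshold: float = 0.85):
    """Return the best-scoring existing record above the threshold, else None."""
    best, best_score = None, 0.0
    for candidate in existing_records:
        score = name_similarity(unesco_record.name, candidate.name)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```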
**Deliverables**:
- `scripts/crosslink_unesco_with_glam.py` - Cross-linking script
- `data/unesco_conflicts.csv` - Detected conflicts report
- `tests/test_crosslinking.py` - Unit tests for matching logic
**Success Criteria**:
- [ ] Identify 50+ matches between UNESCO and existing dataset
- [ ] Detect conflicts (name mismatches, type discrepancies)
- [ ] Generate conflict report for manual review
---
### Day 16: Conflict Resolution
**Tasks**:
1. **Tier-Based Priority**
- TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
- When conflict: UNESCO data wins, existing data flagged
2. **Merge Strategy**
```python
def merge_unesco_with_existing(
unesco_record: HeritageCustodian,
existing_record: HeritageCustodian
) -> HeritageCustodian:
"""Merge UNESCO data with existing record, UNESCO takes priority."""
merged = existing_record.copy()
# UNESCO name becomes primary
merged.name = unesco_record.name
# Preserve alternative names from both sources
merged.alternative_names = list(set(
unesco_record.alternative_names + existing_record.alternative_names
))
# Add UNESCO identifier
merged.identifiers.append({
'identifier_scheme': 'UNESCO_WHC',
'identifier_value': unesco_record.identifiers[0].identifier_value
})
# Track provenance of merge
merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"
return merged
```
3. **Manual Review Queue**
- Flag high-impact conflicts (institution type change, location change)
- Generate review spreadsheet for human validation
- Provide evidence for each conflict (source URLs, descriptions)
**Deliverables**:
- `src/glam_extractor/validators/conflict_resolver.py`
- `data/manual_review_queue.csv` - Conflicts requiring human review
- `tests/test_conflict_resolution.py`
**Success Criteria**:
- [ ] Resolve 80% of conflicts automatically (tier-based priority)
- [ ] Flag 20% for manual review (complex cases)
- [ ] Zero data loss (preserve all alternative names, identifiers)
---
### Day 17: Confidence Scoring System
**Tasks**:
1. **Score Calculation**
```python
def calculate_confidence_score(custodian: HeritageCustodian) -> float:
"""Calculate confidence score based on data completeness and source quality."""
score = 1.0 # Start at maximum (TIER_1 authoritative)
# Deduct for missing fields
if not custodian.identifiers:
score -= 0.1
if not custodian.locations:
score -= 0.15
if custodian.institution_type == InstitutionTypeEnum.MIXED:
score -= 0.2 # Ambiguous classification
# Boost for rich metadata
if len(custodian.identifiers) > 2:
score += 0.05
if custodian.digital_platforms:
score += 0.05
return max(0.0, min(1.0, score)) # Clamp to [0.0, 1.0]
```
2. **Quality Metrics**
- Completeness: % of optional fields populated (see the sketch after this task list)
- Accuracy: Agreement with authoritative sources (Wikidata, official websites)
- Freshness: Days since extraction
3. **Tier Validation**
- Ensure all UNESCO records have `data_tier: TIER_1_AUTHORITATIVE`
- Downgrade tier if conflicts detected (TIER_1 → TIER_2)
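Item 2's completeness metric could be computed per record as sketched below (the list of optional fields is an assumption based on the schema slots used elsewhere in this plan):
```python
def completeness(custodian) -> float:
    """Fraction of selected optional fields that are populated (0.0-1.0)."""
    optional_fields = [  # assumed slot names from the HeritageCustodian schema
        "alternative_names",
        "identifiers",
        "locations",
        "digital_platforms",
    ]
    populated = sum(1 for field in optional_fields if getattr(custodian, field, None))
    return populated / len(optional_fields)
```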
**Deliverables**:
- `src/glam_extractor/validators/confidence_scorer.py`
- `tests/test_confidence_scoring.py`
- `data/unesco_quality_metrics.json` - Aggregate statistics
**Success Criteria**:
- [ ] 90%+ of UNESCO records score > 0.85 confidence
- [ ] Flag < 5% as low confidence (require review)
- [ ] Document scoring methodology in provenance metadata
---
### Day 18: LinkML Schema Validation
**Tasks**:
1. **Batch Validation**
```bash
# Validate all UNESCO extractions against LinkML schema
for file in data/unesco_extracted/*.yaml; do
linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
done
```
2. **Custom Validators**
```python
# src/glam_extractor/validators/unesco_validators.py
def validate_unesco_whc_id(whc_id: str) -> bool:
"""UNESCO WHC IDs are 3-4 digit integers."""
return bool(re.match(r'^\d{3,4}$', whc_id))
def validate_ghcid_format(ghcid: str) -> bool:
"""Validate GHCID format for UNESCO institutions."""
pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-Q\d+)?$'
return bool(re.match(pattern, ghcid))
```
3. **Error Reporting**
- Generate validation report with line numbers and error messages
- Categorize errors (required field missing, invalid enum value, format error)
- Prioritize fixes (blocking errors vs. warnings)
**Deliverables**:
- `scripts/validate_unesco_dataset.py` - Batch validation script
- `src/glam_extractor/validators/unesco_validators.py` - Custom validators
- `data/unesco_validation_report.json` - Validation results
**Success Criteria**:
- [ ] 100% of extracted records pass LinkML validation
- [ ] Zero blocking errors
- [ ] Document any warnings in provenance notes
---
### Day 19: Data Quality Report
**Tasks**:
1. **Generate Statistics**
```python
# scripts/generate_unesco_quality_report.py
def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
return {
'total_institutions': len(unesco_dataset),
'by_country': count_by_country(unesco_dataset),
'by_institution_type': count_by_type(unesco_dataset),
'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
'completeness_metrics': {
'with_wikidata_id': count_with_wikidata(unesco_dataset),
'with_digital_platform': count_with_platforms(unesco_dataset),
'with_geocoded_location': count_with_geocoding(unesco_dataset)
},
'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
}
```
2. **Visualization**
- Generate maps showing UNESCO site distribution
- Bar charts: institutions by country, by type
- Heatmap: data completeness by field
3. **Documentation**
- Write executive summary of data quality
- Document known issues and limitations
- Provide recommendations for improvement
**Deliverables**:
- `scripts/generate_unesco_quality_report.py`
- `data/unesco_quality_report.json` - Statistics
- `docs/unesco-data-quality.md` - Quality report document
- `data/visualizations/` - Maps and charts
**Success Criteria**:
- [ ] Quality report shows 90%+ completeness for core fields
- [ ] < 5% of records require manual review
- [ ] Geographic coverage across all inhabited continents
---
## Phase 4: Integration & Enrichment (Days 20-24)
### Objectives
- Enrich UNESCO data with Wikidata identifiers
- Merge UNESCO dataset with existing GLAM dataset
- Resolve GHCID collisions
- Update GHCID history for modified records
### Day 20-21: Wikidata Enrichment
**Tasks**:
1. **SPARQL Query for UNESCO Sites**
```python
# scripts/enrich_unesco_with_wikidata.py
def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
"""Find Wikidata Q-number for UNESCO World Heritage Site."""
query = f"""
SELECT ?item WHERE {{
?item wdt:P757 "{whc_id}" . # P757 = UNESCO World Heritage Site ID
}}
LIMIT 1
"""
results = sparql_query(query)
if results:
return extract_qid(results[0]['item']['value'])
return None
```
2. **Batch Enrichment**
- Query Wikidata for all 1,000+ UNESCO sites
- Extract Q-numbers, VIAF IDs, ISIL codes (if available)
- Add to `identifiers` array in LinkML instances
3. **Fuzzy Matching Fallback**
- If WHC ID not found in Wikidata, try name + location matching
- Use same fuzzy matching logic from existing enrichment scripts
- Threshold: 0.85 similarity score
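The `sparql_query()` and `extract_qid()` helpers used in the snippet above are left undefined; a minimal sketch against the public Wikidata Query Service (the User-Agent string is a placeholder and should identify the project per Wikimedia policy):
```python
import requests

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def sparql_query(query: str) -> list:
    """Run a SPARQL query against Wikidata and return the result bindings."""
    response = requests.get(
        WIKIDATA_SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "glam-extractor/0.1 (contact: example@example.org)"},  # placeholder contact
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

def extract_qid(item_uri: str) -> str:
    """Turn http://www.wikidata.org/entity/Q123 into Q123."""
    return item_uri.rsplit("/", 1)[-1]
```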
**Deliverables**:
- `scripts/enrich_unesco_with_wikidata.py` - Enrichment script
- `data/unesco_enriched/` - Enriched YAML instances
- `logs/wikidata_enrichment.log` - Enrichment log
**Success Criteria**:
- [ ] Find Wikidata Q-numbers for 80%+ of UNESCO sites
- [ ] Add VIAF/ISIL identifiers where available
- [ ] Document enrichment in provenance metadata
---
### Day 22: Dataset Merge
**Tasks**:
1. **Merge Strategy**
```python
# scripts/merge_unesco_into_glam_dataset.py
def merge_datasets(
unesco_data: List[HeritageCustodian],
existing_data: List[HeritageCustodian]
) -> List[HeritageCustodian]:
"""Merge UNESCO data into existing GLAM dataset."""
merged = existing_data.copy()
for unesco_record in unesco_data:
# Check if institution already exists
match = find_match(unesco_record, existing_data)
if match:
# Merge records
merged_record = merge_unesco_with_existing(unesco_record, match)
merged[merged.index(match)] = merged_record
else:
# Add new institution
merged.append(unesco_record)
return merged
```
2. **Deduplication**
- Detect duplicates by GHCID, Wikidata Q-number, ISIL code
- Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
- Preserve alternative names and identifiers from both sources
3. **Provenance Tracking**
- Update `provenance.notes` for merged records
- Record merge timestamp
- Link back to original extraction sources
**Deliverables**:
- `scripts/merge_unesco_into_glam_dataset.py` - Merge script
- `data/merged_glam_dataset/` - Merged dataset (YAML files)
- `data/merge_report.json` - Merge statistics
**Success Criteria**:
- [ ] Merge 1,000+ UNESCO records with existing GLAM dataset
- [ ] Deduplicate matches (no duplicate GHCIDs)
- [ ] Preserve data from all sources (no information loss)
---
### Day 23: GHCID Collision Resolution
**Tasks**:
1. **Detect Collisions**
```python
# scripts/resolve_ghcid_collisions.py
def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
"""Find institutions with identical base GHCIDs."""
ghcid_map = defaultdict(list)
for custodian in dataset:
base_ghcid = remove_q_number(custodian.ghcid)
ghcid_map[base_ghcid].append(custodian)
collisions = [
Collision(base_ghcid=k, institutions=v)
for k, v in ghcid_map.items()
if len(v) > 1
]
return collisions
```
2. **Apply Temporal Priority Rules** (see the sketch after this task list)
- Compare `provenance.extraction_date` for colliding institutions
- First batch (same date): ALL colliding institutions receive a name suffix
- Historical addition (later date): ONLY the newly added institution receives a name suffix
3. **Update GHCID History**
```python
def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
"""Record GHCID change in history."""
custodian.ghcid_history.append(GHCIDHistoryEntry(
ghcid=new_ghcid,
ghcid_numeric=generate_numeric_id(new_ghcid),
valid_from=datetime.now(timezone.utc).isoformat(),
valid_to=None,
reason=f"Name suffix added to resolve collision with {old_ghcid}"
))
```
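Item 2's temporal priority rule, sketched against the `Collision` objects produced by the detection step above (the attribute names follow the provenance slots used elsewhere in this plan; the type of suffix applied is decided downstream):
```python
from typing import List

def institutions_needing_suffix(collision) -> List:
    """Apply the temporal priority rule to one collision group.

    Assumes each institution carries provenance.extraction_date as an ISO date string.
    Returns the institutions whose GHCID must receive a disambiguating suffix.
    """
    dates = {inst.provenance.extraction_date for inst in collision.institutions}
    if len(dates) == 1:
        # First batch: the colliding institutions were extracted together,
        # so all of them receive a suffix.
        return list(collision.institutions)
    # Historical addition: only institutions added after the earliest batch get a
    # suffix; earlier GHCIDs stay unchanged (PID stability).
    earliest = min(dates)
    return [
        inst for inst in collision.institutions
        if inst.provenance.extraction_date > earliest  # ISO strings compare chronologically
    ]
```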
**Deliverables**:
- `scripts/resolve_ghcid_collisions.py` - Collision resolution script
- `data/ghcid_collision_report.json` - Detected collisions
- Updated YAML instances with `ghcid_history` entries
**Success Criteria**:
- [ ] Resolve all GHCID collisions (zero duplicates)
- [ ] Update GHCID history for affected records
- [ ] Preserve PID stability (no changes to published GHCIDs)
---
### Day 24: Final Data Validation
**Tasks**:
1. **Full Dataset Validation**
- Run LinkML validation on merged dataset
- Check for orphaned references (invalid foreign keys)
- Verify all GHCIDs are unique
2. **Integrity Checks**
```python
# tests/integration/test_merged_dataset_integrity.py
def test_no_duplicate_ghcids():
dataset = load_merged_dataset()
ghcids = [c.ghcid for c in dataset]
assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"
def test_all_unesco_sites_have_whc_id():
dataset = load_merged_dataset()
unesco_records = [c for c in dataset if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]
for record in unesco_records:
whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
```
3. **Coverage Analysis**
- Verify UNESCO sites across all continents
- Check institution type distribution (not all MUSEUM)
- Ensure Dutch institutions properly merged with ISIL registry
**Deliverables**:
- `tests/integration/test_merged_dataset_integrity.py` - Integrity tests
- `data/final_validation_report.json` - Validation results
- `docs/dataset-coverage.md` - Coverage analysis
**Success Criteria**:
- [ ] 100% passing integrity tests
- [ ] Zero duplicate GHCIDs
- [ ] UNESCO sites cover 100+ countries
---
## Phase 5: Export & Documentation (Days 25-30)
### Objectives
- Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
- Generate user documentation and API docs
- Create example queries and use case tutorials
- Publish dataset with persistent identifiers
### Day 25-26: RDF/JSON-LD Export
**Tasks**:
1. **RDF Serialization**
```python
# src/glam_extractor/exporters/rdf_exporter.py
from typing import List
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

GLAM = Namespace("https://w3id.org/heritage/custodian/")
SCHEMA = Namespace("http://schema.org/")

def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to RDF/Turtle format."""
    graph = Graph()
    # Define namespaces
    graph.bind("glam", GLAM)
    graph.bind("schema", SCHEMA)
    graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))
    for custodian in dataset:
        uri = URIRef(custodian.id)
        # Type assertions
        graph.add((uri, RDF.type, GLAM.HeritageCustodian))
        if custodian.institution_type == InstitutionTypeEnum.MUSEUM:
            graph.add((uri, RDF.type, SCHEMA.Museum))
        # Literals
        graph.add((uri, SCHEMA.name, Literal(custodian.name)))
        graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))
        # Identifiers (owl:sameAs)
        for identifier in custodian.identifiers:
            if identifier.identifier_scheme == "Wikidata":
                graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))
    graph.serialize(destination=output_path, format="turtle")
```
2. **JSON-LD Context**
```json
// data/context/heritage_custodian_context.jsonld
{
"@context": {
"@vocab": "https://w3id.org/heritage/custodian/",
"schema": "http://schema.org/",
"name": "schema:name",
"location": "schema:location",
"identifiers": "schema:identifier",
"institution_type": "institutionType",
"data_source": "dataSource"
}
}
```
3. **Content Negotiation Setup**
- Configure w3id.org redirects (if hosting on GitHub Pages)
- Test URI resolution for sample institutions
- Ensure Accept header routing (text/turtle, application/ld+json)
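Item 3's content-negotiation check can be scripted as below (a sketch; the example URI comes from the Day 4 fixture, and the behaviour depends on the eventual w3id.org configuration):
```python
import requests

def check_content_negotiation(uri: str) -> None:
    """Request the same URI with different Accept headers and report what comes back."""
    for accept in ("text/turtle", "application/ld+json"):
        response = requests.get(uri, headers={"Accept": accept}, allow_redirects=True, timeout=30)
        print(accept, "->", response.status_code, response.headers.get("Content-Type"))

# Example, using the institution URI from the Day 4 fixture:
# check_content_negotiation("https://w3id.org/heritage/custodian/fr/louvre")
```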
**Deliverables**:
- `src/glam_extractor/exporters/rdf_exporter.py` - RDF exporter
- `data/exports/glam_dataset.ttl` - RDF/Turtle export
- `data/exports/glam_dataset.jsonld` - JSON-LD export
- `data/context/heritage_custodian_context.jsonld` - JSON-LD context
**Success Criteria**:
- [ ] RDF validates with Turtle parser
- [ ] JSON-LD validates with JSON-LD Playground
- [ ] Sample URIs resolve correctly
---
### Day 27: CSV/Parquet Export
**Tasks**:
1. **Flatten Schema for CSV**
```python
# src/glam_extractor/exporters/csv_exporter.py
def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
"""Export dataset to CSV with flattened structure."""
rows = []
for custodian in dataset:
row = {
'ghcid': custodian.ghcid,
'ghcid_uuid': str(custodian.ghcid_uuid),
'name': custodian.name,
'institution_type': custodian.institution_type,
'country': custodian.locations[0].country if custodian.locations else None,
'city': custodian.locations[0].city if custodian.locations else None,
'wikidata_id': get_identifier(custodian, 'Wikidata'),
'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
'data_source': custodian.provenance.data_source,
'data_tier': custodian.provenance.data_tier,
'confidence_score': custodian.provenance.confidence_score
}
rows.append(row)
df = pd.DataFrame(rows)
df.to_csv(output_path, index=False, encoding='utf-8-sig')
```
2. **Parquet Export (Columnar)**
```python
def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
"""Export dataset to Parquet for efficient querying."""
df = pd.DataFrame([custodian.dict() for custodian in dataset])
df.to_parquet(output_path, engine='pyarrow', compression='snappy')
```
3. **SQLite Export**
```python
def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
"""Export dataset to SQLite database."""
conn = sqlite3.connect(db_path)
# Create tables
conn.execute("""
CREATE TABLE heritage_custodians (
ghcid TEXT PRIMARY KEY,
ghcid_uuid TEXT UNIQUE,
name TEXT NOT NULL,
institution_type TEXT,
data_source TEXT,
...
)
""")
# Insert records
for custodian in dataset:
conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)
conn.commit()
```
**Deliverables**:
- `src/glam_extractor/exporters/csv_exporter.py` - CSV exporter
- `data/exports/glam_dataset.csv` - CSV export
- `data/exports/glam_dataset.parquet` - Parquet export
- `data/exports/glam_dataset.db` - SQLite database
**Success Criteria**:
- [ ] CSV opens correctly in Excel, Google Sheets
- [ ] Parquet loads in pandas, DuckDB
- [ ] SQLite database queryable with SQL
---
### Day 28: Documentation - User Guide
**Tasks**:
1. **Getting Started Guide**
````markdown
# docs/user-guide/getting-started.md
## Installation
pip install glam-extractor
## Quick Start
```python
from glam_extractor import load_dataset
# Load the GLAM dataset
dataset = load_dataset("data/exports/glam_dataset.parquet")
# Filter UNESCO museums in France
museums = dataset.filter(
    institution_type="MUSEUM",
    data_source="UNESCO_WORLD_HERITAGE",
    country="FR"
)
```
````
2. **Example Queries**
- SPARQL examples (find institutions by type, country)
- Pandas examples (data analysis, statistics)
- SQL examples (SQLite queries)
3. **API Reference**
- Document all public classes and methods
- Provide code examples for each function
- Link to LinkML schema documentation
**Deliverables**:
- `docs/user-guide/getting-started.md` - Quick start guide
- `docs/user-guide/example-queries.md` - Query examples
- `docs/api-reference.md` - API documentation
**Success Criteria**:
- [ ] Complete documentation for all public APIs
- [ ] 10+ example queries covering common use cases
- [ ] Step-by-step tutorials for data consumers
---
### Day 29: Documentation - Developer Guide
**Tasks**:
1. **Architecture Overview**
- Diagram of extraction pipeline (API → Parser → Validator → Exporter)
- Explanation of LinkML Map transformation
- GHCID generation algorithm
2. **Contributing Guide**
- How to add new institution type classifiers
- How to extend LinkML schema with new fields
- How to add new export formats
3. **Testing Guide**
- Running unit tests, integration tests
- Creating new test fixtures
- Using property-based testing
**Deliverables**:
- `docs/developer-guide/architecture.md` - Architecture docs
- `docs/developer-guide/contributing.md` - Contribution guide
- `docs/developer-guide/testing.md` - Testing guide
**Success Criteria**:
- [ ] Complete architecture documentation with diagrams
- [ ] Clear instructions for extending the system
- [ ] Comprehensive testing guide
---
### Day 30: Release & Publication
**Tasks**:
1. **Dataset Release**
- Tag repository with version number (e.g., v1.0.0-unesco)
- Create GitHub Release with exports attached
- Publish to Zenodo for DOI (persistent citation)
2. **Announcement**
- Write blog post announcing UNESCO data release
- Share on social media (Twitter, Mastodon, LinkedIn)
- Notify stakeholders (Europeana, DPLA, heritage researchers)
3. **Data Portal Update**
- Update w3id.org redirects for new institutions
- Deploy SPARQL endpoint (if applicable)
- Update REST API to include UNESCO data
**Deliverables**:
- GitHub Release with dataset exports
- Zenodo DOI for citation
- Blog post and announcement
**Success Criteria**:
- [ ] Dataset published with persistent DOI
- [ ] Documentation live and accessible
- [ ] Stakeholders notified of release
---
## Risk Mitigation
### Technical Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |
### Resource Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |
### Data Quality Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |
---
## Success Metrics
### Quantitative Metrics
- **Coverage**: Extract 1,000+ UNESCO site institutions
- **Quality**: 90%+ confidence score average
- **Completeness**: 80%+ have Wikidata Q-numbers
- **Performance**: Process all sites in < 2 hours
- **Test Coverage**: 90%+ code coverage
### Qualitative Metrics
- **Usability**: Positive feedback from 3+ data consumers
- **Documentation**: Complete user guide and API docs
- **Maintainability**: Code passes linter, type checker
- **Reproducibility**: Dataset generation fully automated
---
## Appendix: Day-by-Day Checklist
### Phase 1 (Days 1-5)
- [ ] Day 1: UNESCO API documentation reviewed, 50 sites fetched
- [ ] Day 2: LinkML Map schema (part 1) - basic transformations
- [ ] Day 3: LinkML Map schema (part 2) - advanced patterns
- [ ] Day 4: Golden dataset created (20 test fixtures)
- [ ] Day 5: Institution type classifier designed
### Phase 2 (Days 6-13)
- [ ] Day 6: UNESCO API client implemented
- [ ] Day 7: LinkML instance generator implemented
- [ ] Day 8: GHCID generator extended for UNESCO
- [ ] Day 9-10: Batch processing pipeline
- [ ] Day 11-12: Integration testing
- [ ] Day 13: Code review and refactoring
### Phase 3 (Days 14-19)
- [ ] Day 14-15: Cross-referencing with existing data
- [ ] Day 16: Conflict resolution
- [ ] Day 17: Confidence scoring system
- [ ] Day 18: LinkML schema validation
- [ ] Day 19: Data quality report
### Phase 4 (Days 20-24)
- [ ] Day 20-21: Wikidata enrichment
- [ ] Day 22: Dataset merge
- [ ] Day 23: GHCID collision resolution
- [ ] Day 24: Final data validation
### Phase 5 (Days 25-30)
- [ ] Day 25-26: RDF/JSON-LD export
- [ ] Day 27: CSV/Parquet/SQLite export
- [ ] Day 28: User guide documentation
- [ ] Day 29: Developer guide documentation
- [ ] Day 30: Release and publication
---
**Document Status**: Complete
**Next Document**: `04-tdd-strategy.md` - Test-driven development plan
**Version**: 1.1
---
## Version History
### Version 1.1 (2025-11-10)
**Changes**: Updated for OpenDataSoft Explore API v2.0 migration
- **Day 1 API Reconnaissance** (lines 44-62): Updated API endpoint from legacy `whc.unesco.org/en/list/json` to OpenDataSoft `data.unesco.org/api/explore/v2.0`
- **Day 6 UNESCO API Client** (lines 265-305):
- Updated `base_url` to OpenDataSoft API endpoint
- Removed `api_key` parameter (public dataset, no authentication)
- Added pagination parameters to `fetch_site_list()`: `limit`, `offset`
- Updated method documentation to reflect OpenDataSoft response structure: `{"record": {"fields": {...}}}`
- **Day 7 LinkML Parser** (lines 309-352):
- Updated `parse_unesco_site()` to extract from nested `api_response['record']['fields']`
- Added documentation clarifying OpenDataSoft structure
- Updated `extract_multilingual_names()` parameter name from `unesco_data` to `site_data`
- **Day 10 Batch Processing** (lines 406-450):
- Updated `extract_all_unesco_sites()` with pagination loop for OpenDataSoft API
- Updated `process_unesco_site()` to handle nested record structure
- Changed field access from `site_data['id_number']` to `site_record['record']['fields']['unique_number']`
- **Day 11-12 Integration Tests** (lines 454-483):
- Updated `test_full_unesco_extraction_pipeline()` to extract `site_data` from `response['record']['fields']`
- Added explicit documentation of OpenDataSoft API structure
**Rationale**: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.
### Version 1.0 (2025-11-09)
**Initial Release**
- Comprehensive 30-day implementation plan for UNESCO data extraction
- Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
- TDD approach with golden dataset and integration tests
- GHCID generation strategy for UNESCO heritage sites
- Wikidata enrichment and cross-referencing plan