UNESCO Data Extraction - Implementation Phases
Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 03 - Implementation Phases
Version: 1.1
Date: 2025-11-10
Status: Draft
Executive Summary
This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.
Total Effort: 30 working days (6 weeks)
Team Size: 1-2 developers + AI agents
Target Output: 1,000+ heritage custodian records from UNESCO sites
Phase Overview
| Phase | Duration | Focus | Key Deliverables |
|---|---|---|---|
| Phase 1 | 5 days | API Exploration & Schema Design | UNESCO API parser, LinkML Map schema |
| Phase 2 | 8 days | Extractor Implementation | Institution type classifier, GHCID generator |
| Phase 3 | 6 days | Data Quality & Validation | LinkML validator, conflict resolver |
| Phase 4 | 5 days | Integration & Enrichment | Wikidata enrichment, dataset merge |
| Phase 5 | 6 days | Export & Documentation | RDF/JSON-LD exporters, user docs |
Phase 1: API Exploration & Schema Design (Days 1-5)
Objectives
- Understand UNESCO DataHub API structure and data quality
- Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
- Create test fixtures from real UNESCO API responses
- Establish baseline for institution type classification
Day 1: UNESCO API Reconnaissance
Tasks:
- **API Documentation Review**
  - Study UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
  - Identify available endpoints (dataset `whc001` - World Heritage List)
  - Document authentication requirements (none - public dataset)
  - Document pagination limits (max 100 records per request, use `offset` parameter)
  - Test API responses for sample sites
- **Data Structure Analysis**

  ```bash
  # Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
  curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
    > samples/unesco_site_list.json

  # Fetch specific site by unique_number using ODSQL
  curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
    > samples/unesco_angkor_detail.json

  # Response structure: nested record.fields with coordinates object
  # {
  #   "total_count": 1248,
  #   "records": [{
  #     "record": {
  #       "fields": {
  #         "name_en": "Angkor",
  #         "unique_number": 668,
  #         "coordinates": {"lat": 13.4333, "lon": 103.8333}
  #       }
  #     }
  #   }]
  # }
  ```
- **Schema Mapping**
  - Map OpenDataSoft `record.fields` to LinkML `HeritageCustodian` slots
  - Identify missing fields (require inference or external enrichment)
  - Document ambiguities (e.g., when is a site also a museum?)
  - Handle nested response structure (`response['records'][i]['record']['fields']`)
Deliverables:
- `docs/unesco-api-analysis.md` - API structure documentation
- `tests/fixtures/unesco_api_responses/` - 10+ sample JSON files
- `docs/unesco-to-linkml-mapping.md` - Field mapping table
Success Criteria:
- Successfully fetch data for 50 UNESCO sites via API
- Document all relevant JSON fields for extraction
- Identify 3+ institution type classification patterns
Day 2: LinkML Map Schema Design (Part 1)
Tasks:
- **Install LinkML Map Extension**

  ```bash
  pip install linkml-map
  # OR implement custom extension: src/glam_extractor/mappers/extended_map.py
  ```
- **Design Transformation Rules**
  - Create `schemas/maps/unesco_to_heritage_custodian.yaml`
  - Define JSONPath expressions for field extraction
  - Handle multi-language names (UNESCO provides English, French, often local language)
  - Map UNESCO categories to InstitutionTypeEnum
- **Conditional Extraction Logic**

  ```yaml
  # Example LinkML Map rule
  mappings:
    - source_path: $.category
      target_path: institution_type
      transform:
        type: conditional
        rules:
          - condition: "contains(description, 'museum')"
            value: MUSEUM
          - condition: "contains(description, 'library')"
            value: LIBRARY
          - condition: "contains(description, 'archive')"
            value: ARCHIVE
          - default: MIXED
  ```
Deliverables:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (initial version)
- `tests/test_unesco_linkml_map.py` - Unit tests for mapping rules
Success Criteria:
- LinkML Map schema validates against sample UNESCO JSON
- Successfully extract name, location, UNESCO WHC ID from 10 fixtures
- Handle multilingual names without data loss
Day 3: LinkML Map Schema Design (Part 2)
Tasks:
- **Advanced Transformation Rules**
  - Regex extraction for identifiers (UNESCO WHC ID format: `^\d{3,4}$`)
  - GeoNames ID lookup from UNESCO location strings
  - Wikidata Q-number extraction from UNESCO external links
- **Multi-value Array Handling**

  ```yaml
  # Extract all languages from UNESCO site names
  mappings:
    - source_path: $.names[*]
      target_path: alternative_names
      transform:
        type: array
        element_transform:
          type: template
          template: "{name}@{lang}"
  ```
- **Error Handling Patterns**
  - Missing required fields → skip record with warning
  - Invalid coordinates → flag for manual geocoding
  - Unknown institution type → default to MIXED, log for review
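The three policies above can be sketched as a single guard function. This is a minimal sketch, assuming the `name_en`, `category`, and `coordinates` field names from the sample API response; `ParsedSite` and the flag names are hypothetical, not the extractor's actual types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParsedSite:
    name: str
    institution_type: str
    coordinates: Optional[tuple] = None
    flags: list = field(default_factory=list)

def apply_error_policy(fields: dict) -> Optional[ParsedSite]:
    # Missing required field -> skip the record (caller logs a warning)
    name = fields.get("name_en")
    if not name:
        return None
    site = ParsedSite(name=name, institution_type=fields.get("category", ""))
    # Invalid coordinates -> keep the record, flag for manual geocoding
    coords = fields.get("coordinates") or {}
    lat, lon = coords.get("lat"), coords.get("lon")
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        site.flags.append("NEEDS_GEOCODING")
    else:
        site.coordinates = (lat, lon)
    # Unknown institution type -> default to MIXED, flag for review
    if site.institution_type not in {"MUSEUM", "LIBRARY", "ARCHIVE"}:
        site.institution_type = "MIXED"
        site.flags.append("TYPE_REVIEW")
    return site
```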
Deliverables:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (complete)
- `docs/linkml-map-extension-spec.md` - Custom extension specification
Success Criteria:
- Extract ALL relevant fields from 10 diverse UNESCO sites
- Handle edge cases (missing data, malformed coordinates)
- Generate valid LinkML instances from real API responses
Day 4: Test Fixture Creation
Tasks:
- **Curate Representative Samples**
  - Select 20 UNESCO sites covering:
    - All continents (Europe, Asia, Africa, Americas, Oceania)
    - Multiple institution types (museums, libraries, archives, botanical gardens)
    - Edge cases (serial nominations, transboundary sites)
- **Create Expected Outputs**

  ```yaml
  # tests/fixtures/expected_outputs/unesco_louvre.yaml
  - id: https://w3id.org/heritage/custodian/fr/louvre
    name: Musée du Louvre
    institution_type: MUSEUM
    locations:
      - city: Paris
        country: FR
        coordinates: [48.8606, 2.3376]
    identifiers:
      - identifier_scheme: UNESCO_WHC
        identifier_value: "600"
        identifier_url: https://whc.unesco.org/en/list/600
    provenance:
      data_source: UNESCO_WORLD_HERITAGE
      data_tier: TIER_1_AUTHORITATIVE
  ```
- **Golden Dataset Construction**
  - Manually verify 20 expected outputs against authoritative sources
  - Document any assumptions or inferences made
Deliverables:
- `tests/fixtures/unesco_api_responses/` - 20 JSON files
- `tests/fixtures/expected_outputs/` - 20 YAML files
- `tests/test_unesco_golden_dataset.py` - Integration tests
Success Criteria:
- 20 golden dataset pairs (input JSON + expected YAML)
- 100% passing tests for golden dataset
- Documented edge cases and classification rules
Day 5: Institution Type Classifier Design
Tasks:
- **Pattern Analysis**
  - Analyze UNESCO descriptions for GLAM-related keywords
  - Create decision tree for institution type classification
  - Handle ambiguous cases (e.g., "archaeological park with museum")
- **Keyword Extraction**

  ```python
  # src/glam_extractor/classifiers/unesco_institution_type.py
  MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
  LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
  ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
  BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
  HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
  FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
  ```
- **Confidence Scoring**
  - High confidence (0.9+): Explicit mentions of "museum" or "library"
  - Medium confidence (0.7-0.9): Inferred from UNESCO category + keywords
  - Low confidence (0.5-0.7): Ambiguous, default to MIXED, flag for review
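The banding above can be sketched as follows. The keyword table is an abbreviated stand-in for the full lists in `unesco_institution_type.py`, and the category heuristic is illustrative, not the classifier's actual decision tree.

```python
# Explicit keyword -> type mapping (abbreviated stand-in for the full lists)
EXPLICIT = {"museum": "MUSEUM", "library": "LIBRARY", "archive": "ARCHIVE"}

def classify_with_confidence(description: str, unesco_category: str) -> tuple:
    text = description.lower()
    # High confidence: explicit mention of an institution keyword
    for keyword, inst_type in EXPLICIT.items():
        if keyword in text:
            return inst_type, 0.9
    # Medium confidence: inferred from the UNESCO category alone
    if unesco_category == "Cultural":
        return "MIXED", 0.7
    # Low confidence: ambiguous, default to MIXED, flag for review
    return "MIXED", 0.5
```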
Deliverables:
- `src/glam_extractor/classifiers/unesco_institution_type.py`
- `tests/test_unesco_classifier.py` - 50+ test cases
- `docs/unesco-classification-rules.md` - Decision tree documentation
Success Criteria:
- Classifier achieves 90%+ accuracy on golden dataset
- Low-confidence classifications flagged for manual review
- Handle multilingual descriptions (English, French, Spanish, etc.)
Phase 2: Extractor Implementation (Days 6-13)
Objectives
- Implement UNESCO API client with caching and rate limiting
- Build LinkML instance generator using Map schema
- Create GHCID generator for UNESCO institutions
- Achieve 100% test coverage for core extraction logic
Day 6: UNESCO API Client
Tasks:
- **HTTP Client Implementation**

  ```python
  # src/glam_extractor/parsers/unesco_api_client.py
  class UNESCOAPIClient:
      def __init__(self, cache_ttl: int = 86400):
          self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
          self.cache = Cache(ttl=cache_ttl)
          self.rate_limiter = RateLimiter(requests_per_second=2)

      def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
          """
          Fetch paginated list of UNESCO World Heritage Sites.

          Returns OpenDataSoft API response with structure:
          {
              "total_count": int,
              "results": [{"record": {"id": str, "fields": {...}}}, ...]
          }
          """
          ...

      def fetch_site_details(self, whc_id: int) -> Dict:
          """
          Fetch detailed information for a specific site using ODSQL query.

          Returns single record with structure:
          {"record": {"id": str, "fields": {field_name: value, ...}}}
          """
          ...
  ```
- **Caching Strategy**
  - Cache API responses for 24 hours (UNESCO updates infrequently)
  - Store in SQLite database: `cache/unesco_api_cache.db`
  - Invalidate cache on demand for data refreshes
- **Error Handling**
  - Network errors → retry with exponential backoff
  - 404 Not Found → skip site, log warning
  - Rate limit exceeded → pause and retry
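The backoff rule above can be sketched as a small wrapper; `fetch` is a placeholder for the client's HTTP call, and the retry count and delays are illustrative defaults.

```python
import time

def fetch_with_retry(fetch, url: str, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a fetch callable on network errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # exhausted retries: propagate the error
            # Sleep 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```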
Deliverables:
- `src/glam_extractor/parsers/unesco_api_client.py`
- `tests/test_unesco_api_client.py` - Mock API tests
- `cache/unesco_api_cache.db` - SQLite cache database
Success Criteria:
- Successfully fetch all UNESCO sites (1,000+ sites)
- Handle API errors gracefully (no crashes)
- Cache reduces API calls by 95% on repeat runs
Day 7: LinkML Instance Generator
Tasks:
- **Apply LinkML Map Transformations**

  ```python
  # src/glam_extractor/parsers/unesco_parser.py
  from linkml_map import Mapper

  def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
      """
      Parse OpenDataSoft API response to HeritageCustodian instance.

      Args:
          api_response: OpenDataSoft record with nested structure:
              {"record": {"id": str, "fields": {field_name: value, ...}}}

      Returns:
          HeritageCustodian: Validated LinkML instance
      """
      # Extract fields from nested structure
      site_data = api_response['record']['fields']

      mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
      instance = mapper.transform(site_data)
      return HeritageCustodian(**instance)
  ```
- **Validation Pipeline**
  - Apply LinkML schema validation after transformation
  - Catch validation errors, log details
  - Skip invalid records, continue processing
- **Multi-language Name Handling**

  ```python
  def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
      """
      Extract primary name and alternative names in multiple languages.

      Args:
          site_data: Extracted fields from OpenDataSoft record['fields']
      """
      primary_name = site_data.get('site', '')
      alternative_names = []

      # OpenDataSoft may provide language variants in separate fields
      # or as structured data - adjust based on actual API response
      for lang_data in site_data.get('names', []):
          name = lang_data.get('name', '')
          lang = lang_data.get('lang', 'en')
          if name and name != primary_name:
              alternative_names.append(f"{name}@{lang}")

      return primary_name, alternative_names
  ```
Deliverables:
- `src/glam_extractor/parsers/unesco_parser.py`
- `tests/test_unesco_parser.py` - 20 golden dataset tests
Success Criteria:
- Parse 20 golden dataset fixtures with 100% accuracy
- Extract multilingual names without data loss
- Generate valid LinkML instances (pass schema validation)
Day 8: GHCID Generator for UNESCO Sites
Tasks:
- **Extend GHCID Logic for UNESCO**

  ```python
  # src/glam_extractor/identifiers/ghcid_generator.py
  def generate_ghcid_for_unesco_site(
      country_code: str,
      region_code: str,
      city_code: str,
      institution_type: InstitutionTypeEnum,
      institution_name: str,
      has_collision: bool = False
  ) -> str:
      """
      Generate GHCID for UNESCO World Heritage Site institution.

      Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
      Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)

      Note: Collision suffix uses native language institution name in
      snake_case, NOT Wikidata Q-numbers.
      See docs/plan/global_glam/07-ghcid-collision-resolution.md
      """
      ...
  ```
- **City Code Lookup**
  - Use GeoNames API to convert city names to UN/LOCODE
  - Fallback to 3-letter abbreviation if UN/LOCODE not found
  - Cache lookups to minimize API calls
- **Collision Detection**
  - Check existing GHCID dataset for collisions
  - Apply temporal priority rules (first batch vs. historical addition)
  - Append native language name suffix if collision detected
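The UN/LOCODE fallback described above can be sketched as below; the cache dict stands in for cached GeoNames lookups, and the 3-letter abbreviation rule is a simple interpretation of the fallback.

```python
# Stand-in for cached GeoNames -> UN/LOCODE lookups (hypothetical contents)
UNLOCODE_CACHE = {"Paris": "PAR", "Amsterdam": "AMS"}

def city_code(city_name: str) -> str:
    """Return a UN/LOCODE city code, or a 3-letter abbreviation fallback."""
    code = UNLOCODE_CACHE.get(city_name)
    if code:
        return code
    # Fallback: first three letters, uppercased, non-alphabetic chars stripped
    letters = [c for c in city_name.upper() if c.isalpha()]
    return "".join(letters[:3])
```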
Deliverables:
- Extended `src/glam_extractor/identifiers/ghcid_generator.py`
- `tests/test_ghcid_unesco.py` - GHCID generation tests
Success Criteria:
- Generate valid GHCIDs for 20 golden dataset institutions
- No collisions with existing Dutch ISIL dataset
- Handle missing Wikidata Q-numbers gracefully
Day 9-10: Batch Processing Pipeline
Tasks:
- **Parallel Processing**

  ```python
  # scripts/extract_unesco_sites.py
  from concurrent.futures import ThreadPoolExecutor

  api_client = UNESCOAPIClient()  # shared module-level client, used by workers

  def extract_all_unesco_sites(max_workers: int = 4):
      # Fetch paginated site list from OpenDataSoft API
      all_sites = []
      offset = 0
      limit = 100
      while True:
          response = api_client.fetch_site_list(limit=limit, offset=offset)
          all_sites.extend(response['results'])
          if len(all_sites) >= response['total_count']:
              break
          offset += limit

      with ThreadPoolExecutor(max_workers=max_workers) as executor:
          results = executor.map(process_unesco_site, all_sites)

      return list(results)

  def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
      """
      Process single OpenDataSoft record.

      Args:
          site_record: {"record": {"id": str, "fields": {...}}}
      """
      # Extract unique_number outside the try block so it is available for logging
      whc_id = site_record['record']['fields'].get('unique_number')
      try:
          # Fetch full details if needed (or use site_record directly)
          details = api_client.fetch_site_details(whc_id)
          institution_type = classify_institution_type(details['record']['fields'])
          custodian = parse_unesco_site(details)
          custodian.ghcid = generate_ghcid(custodian)
          return custodian
      except Exception as e:
          log.error(f"Failed to process site {whc_id}: {e}")
          return None
  ```
- **Progress Tracking**
  - Use `tqdm` for progress bars
  - Log successful extractions to `logs/unesco_extraction.log`
  - Save intermediate results every 100 sites
- **Error Recovery**
  - Resume from last checkpoint if script crashes
  - Separate successful extractions from failed ones
  - Generate error report with failed site IDs
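A minimal sketch of checkpoint-based resume, assuming processed WHC IDs are persisted as a JSON list (the actual checkpoint format is not specified in this plan):

```python
import json
import os
import tempfile

def load_checkpoint(path: str) -> set:
    """Load the set of already-processed site IDs, if a checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(path: str, done: set) -> None:
    with open(path, "w") as f:
        json.dump(sorted(done), f)

def run_with_resume(site_ids, process, checkpoint_path):
    """Process site_ids, skipping any completed in an earlier (crashed) run."""
    done = load_checkpoint(checkpoint_path)
    for site_id in site_ids:
        if site_id in done:
            continue  # already processed in an earlier run
        process(site_id)
        done.add(site_id)
        save_checkpoint(checkpoint_path, done)
    return done
```

A real implementation would save the checkpoint in batches (e.g., every 100 sites) rather than per record.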
Deliverables:
- `scripts/extract_unesco_sites.py` - Batch extraction script
- `data/unesco_extracted/` - Output directory for YAML instances
- `logs/unesco_extraction.log` - Extraction log
Success Criteria:
- Process all 1,000+ UNESCO sites in < 2 hours
- < 5% failure rate (API errors, missing data)
- Successful extractions saved as valid LinkML YAML files
Day 11-12: Integration Testing
Tasks:
- **End-to-End Tests**

  ```python
  # tests/integration/test_unesco_pipeline.py
  def test_full_unesco_extraction_pipeline():
      """Test complete pipeline from OpenDataSoft API fetch to LinkML instance."""
      # 1. Fetch API data from OpenDataSoft
      api_client = UNESCOAPIClient()
      response = api_client.fetch_site_details(600)  # Paris, Banks of the Seine
      site_data = response['record']['fields']  # Extract from nested structure

      # 2. Classify institution type
      inst_type = classify_institution_type(site_data)
      assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]

      # 3. Parse to LinkML instance
      custodian = parse_unesco_site(response)
      assert custodian.name is not None

      # 4. Generate GHCID
      custodian.ghcid = generate_ghcid(custodian)
      assert custodian.ghcid.startswith("FR-")

      # 5. Validate against schema
      validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
      result = validator.validate(custodian)
      assert result.is_valid
  ```
- **Property-Based Testing**

  ```python
  from hypothesis import given, strategies as st

  @given(st.integers(min_value=1, max_value=1500))
  def test_ghcid_determinism(whc_id: int):
      """GHCID generation is deterministic for same input."""
      site1 = generate_ghcid_for_site(whc_id)
      site2 = generate_ghcid_for_site(whc_id)
      assert site1 == site2
  ```
- **Performance Testing**
  - Benchmark extraction speed (sites per second)
  - Memory profiling (ensure no memory leaks)
  - Cache hit rate analysis
Deliverables:
- `tests/integration/test_unesco_pipeline.py` - End-to-end tests
- `tests/test_unesco_property_based.py` - Property-based tests
- `docs/performance-benchmarks.md` - Performance results
Success Criteria:
- 100% passing integration tests
- Extract 1,000 sites in < 2 hours (with cache)
- Memory usage < 500MB for full extraction
Day 13: Code Review & Refactoring
Tasks:
- **Code Quality Review**
  - Run `ruff` linter, fix all warnings
  - Run `mypy` type checker, resolve type errors
  - Ensure 90%+ test coverage
- **Documentation Review**
  - Add docstrings to all public functions
  - Update README with UNESCO extraction instructions
  - Create developer guide for extending classifiers
- **Performance Optimization**
  - Profile slow functions, optimize bottlenecks
  - Reduce redundant API calls
  - Optimize GHCID generation (cache city code lookups)
Deliverables:
- Refactored codebase with 90%+ test coverage
- Updated documentation
- Performance optimizations applied
Success Criteria:
- Zero linter warnings
- Zero type errors
- Test coverage > 90%
Phase 3: Data Quality & Validation (Days 14-19)
Objectives
- Cross-reference UNESCO data with existing GLAM dataset (Dutch ISIL, conversations)
- Detect and resolve conflicts
- Implement confidence scoring system
- Generate data quality report
Day 14-15: Cross-Referencing with Existing Data
Tasks:
- **Load Existing Dataset**

  ```python
  # scripts/crosslink_unesco_with_glam.py
  def load_existing_glam_dataset():
      """Load Dutch ISIL + conversation extractions."""
      dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
      dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
      conversations = load_conversation_extractions("data/instances/")
      return merge_datasets([dutch_isil, dutch_orgs, conversations])
  ```
- **Match UNESCO Sites to Existing Records**
  - Match by Wikidata Q-number (highest confidence)
  - Match by ISIL code (for Dutch institutions)
  - Match by name + location (fuzzy matching, score > 0.85)
- **Conflict Detection**

  ```python
  def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
      """Detect field-level conflicts between UNESCO and existing data."""
      conflicts = []

      if unesco_record.name != existing_record.name:
          conflicts.append(Conflict(
              field="name",
              unesco_value=unesco_record.name,
              existing_value=existing_record.name,
              resolution="MANUAL_REVIEW"
          ))

      # Check institution type mismatch
      if unesco_record.institution_type != existing_record.institution_type:
          conflicts.append(Conflict(field="institution_type", ...))

      return conflicts
  ```
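The name + location matching strategy (score > 0.85) can be sketched with stdlib fuzzy matching; the real matcher in `crosslink_unesco_with_glam.py` may weight fields differently, and the record shape here is illustrative.

```python
from difflib import SequenceMatcher

def name_location_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Fuzzy match two records by name similarity plus country agreement."""
    # Names must be highly similar (case-insensitive)
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # And the records must agree on country
    return score > threshold and a.get("country") == b.get("country")
```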
Deliverables:
- `scripts/crosslink_unesco_with_glam.py` - Cross-linking script
- `data/unesco_conflicts.csv` - Detected conflicts report
- `tests/test_crosslinking.py` - Unit tests for matching logic
Success Criteria:
- Identify 50+ matches between UNESCO and existing dataset
- Detect conflicts (name mismatches, type discrepancies)
- Generate conflict report for manual review
Day 16: Conflict Resolution
Tasks:
- **Tier-Based Priority**
  - TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
  - When conflict: UNESCO data wins, existing data flagged
- **Merge Strategy**

  ```python
  def merge_unesco_with_existing(
      unesco_record: HeritageCustodian,
      existing_record: HeritageCustodian
  ) -> HeritageCustodian:
      """Merge UNESCO data with existing record, UNESCO takes priority."""
      merged = existing_record.copy()

      # UNESCO name becomes primary
      merged.name = unesco_record.name

      # Preserve alternative names from both sources
      merged.alternative_names = list(set(
          unesco_record.alternative_names + existing_record.alternative_names
      ))

      # Add UNESCO identifier
      merged.identifiers.append({
          'identifier_scheme': 'UNESCO_WHC',
          'identifier_value': unesco_record.identifiers[0].identifier_value
      })

      # Track provenance of merge
      merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"

      return merged
  ```
- **Manual Review Queue**
  - Flag high-impact conflicts (institution type change, location change)
  - Generate review spreadsheet for human validation
  - Provide evidence for each conflict (source URLs, descriptions)
Deliverables:
- `src/glam_extractor/validators/conflict_resolver.py`
- `data/manual_review_queue.csv` - Conflicts requiring human review
- `tests/test_conflict_resolution.py`
Success Criteria:
- Resolve 80% of conflicts automatically (tier-based priority)
- Flag 20% for manual review (complex cases)
- Zero data loss (preserve all alternative names, identifiers)
Day 17: Confidence Scoring System
Tasks:
- **Score Calculation**

  ```python
  def calculate_confidence_score(custodian: HeritageCustodian) -> float:
      """Calculate confidence score based on data completeness and source quality."""
      score = 1.0  # Start at maximum (TIER_1 authoritative)

      # Deduct for missing fields
      if not custodian.identifiers:
          score -= 0.1
      if not custodian.locations:
          score -= 0.15
      if custodian.institution_type == InstitutionTypeEnum.MIXED:
          score -= 0.2  # Ambiguous classification

      # Boost for rich metadata
      if len(custodian.identifiers) > 2:
          score += 0.05
      if custodian.digital_platforms:
          score += 0.05

      return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
  ```
- **Quality Metrics**
  - Completeness: % of optional fields populated
  - Accuracy: Agreement with authoritative sources (Wikidata, official websites)
  - Freshness: Days since extraction
- **Tier Validation**
  - Ensure all UNESCO records have `data_tier: TIER_1_AUTHORITATIVE`
  - Downgrade tier if conflicts detected (TIER_1 → TIER_2)
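The completeness metric (% of optional fields populated) can be sketched as a simple ratio; the optional-field list below is illustrative, not the schema's actual slot list.

```python
# Hypothetical subset of the schema's optional slots
OPTIONAL_FIELDS = ["alternative_names", "digital_platforms", "coordinates", "wikidata_id"]

def completeness(record: dict) -> float:
    """Fraction of optional fields that are populated (non-empty)."""
    filled = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return filled / len(OPTIONAL_FIELDS)
```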
Deliverables:
- `src/glam_extractor/validators/confidence_scorer.py`
- `tests/test_confidence_scoring.py`
- `data/unesco_quality_metrics.json` - Aggregate statistics
Success Criteria:
- 90%+ of UNESCO records score > 0.85 confidence
- Flag < 5% as low confidence (require review)
- Document scoring methodology in provenance metadata
Day 18: LinkML Schema Validation
Tasks:
- **Batch Validation**

  ```bash
  # Validate all UNESCO extractions against LinkML schema
  for file in data/unesco_extracted/*.yaml; do
      linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
  done
  ```
- **Custom Validators**

  ```python
  # src/glam_extractor/validators/unesco_validators.py
  def validate_unesco_whc_id(whc_id: str) -> bool:
      """UNESCO WHC IDs are 3-4 digit integers."""
      return bool(re.match(r'^\d{3,4}$', whc_id))

  def validate_ghcid_format(ghcid: str) -> bool:
      """Validate GHCID format for UNESCO institutions."""
      pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-Q\d+)?$'
      return bool(re.match(pattern, ghcid))
  ```
- **Error Reporting**
  - Generate validation report with line numbers and error messages
  - Categorize errors (required field missing, invalid enum value, format error)
  - Prioritize fixes (blocking errors vs. warnings)
Deliverables:
- `scripts/validate_unesco_dataset.py` - Batch validation script
- `src/glam_extractor/validators/unesco_validators.py` - Custom validators
- `data/unesco_validation_report.json` - Validation results
Success Criteria:
- 100% of extracted records pass LinkML validation
- Zero blocking errors
- Document any warnings in provenance notes
Day 19: Data Quality Report
Tasks:
- **Generate Statistics**

  ```python
  # scripts/generate_unesco_quality_report.py
  def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
      return {
          'total_institutions': len(unesco_dataset),
          'by_country': count_by_country(unesco_dataset),
          'by_institution_type': count_by_type(unesco_dataset),
          'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
          'completeness_metrics': {
              'with_wikidata_id': count_with_wikidata(unesco_dataset),
              'with_digital_platform': count_with_platforms(unesco_dataset),
              'with_geocoded_location': count_with_geocoding(unesco_dataset)
          },
          'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
          'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
      }
  ```
- **Visualization**
  - Generate maps showing UNESCO site distribution
  - Bar charts: institutions by country, by type
  - Heatmap: data completeness by field
- **Documentation**
  - Write executive summary of data quality
  - Document known issues and limitations
  - Provide recommendations for improvement
Deliverables:
- `scripts/generate_unesco_quality_report.py`
- `data/unesco_quality_report.json` - Statistics
- `docs/unesco-data-quality.md` - Quality report document
- `data/visualizations/` - Maps and charts
Success Criteria:
- Quality report shows 90%+ completeness for core fields
- < 5% of records require manual review
- Geographic coverage across all inhabited continents
Phase 4: Integration & Enrichment (Days 20-24)
Objectives
- Enrich UNESCO data with Wikidata identifiers
- Merge UNESCO dataset with existing GLAM dataset
- Resolve GHCID collisions
- Update GHCID history for modified records
Day 20-21: Wikidata Enrichment
Tasks:
- **SPARQL Query for UNESCO Sites**

  ```python
  # scripts/enrich_unesco_with_wikidata.py
  def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
      """Find Wikidata Q-number for UNESCO World Heritage Site."""
      query = f"""
      SELECT ?item WHERE {{
          ?item wdt:P757 "{whc_id}" .  # P757 = UNESCO World Heritage Site ID
      }}
      LIMIT 1
      """
      results = sparql_query(query)
      if results:
          return extract_qid(results[0]['item']['value'])
      return None
  ```
- **Batch Enrichment**
  - Query Wikidata for all 1,000+ UNESCO sites
  - Extract Q-numbers, VIAF IDs, ISIL codes (if available)
  - Add to `identifiers` array in LinkML instances
- **Fuzzy Matching Fallback**
  - If WHC ID not found in Wikidata, try name + location matching
  - Use same fuzzy matching logic from existing enrichment scripts
  - Threshold: 0.85 similarity score
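The SPARQL query for WHC IDs relies on an `extract_qid` helper to pull the Q-number out of the entity URI that Wikidata returns; a minimal sketch of that helper (the regex approach is an assumption):

```python
import re

def extract_qid(entity_uri: str):
    """Extract the Q-number from a Wikidata entity URI, or None if absent.

    Wikidata SPARQL results return full URIs such as
    http://www.wikidata.org/entity/Q19675; the Q-number is the final segment.
    """
    match = re.search(r'(Q\d+)$', entity_uri)
    return match.group(1) if match else None
```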
Deliverables:
- `scripts/enrich_unesco_with_wikidata.py` - Enrichment script
- `data/unesco_enriched/` - Enriched YAML instances
- `logs/wikidata_enrichment.log` - Enrichment log
Success Criteria:
- Find Wikidata Q-numbers for 80%+ of UNESCO sites
- Add VIAF/ISIL identifiers where available
- Document enrichment in provenance metadata
Day 22: Dataset Merge
Tasks:
- **Merge Strategy**

  ```python
  # scripts/merge_unesco_into_glam_dataset.py
  def merge_datasets(
      unesco_data: List[HeritageCustodian],
      existing_data: List[HeritageCustodian]
  ) -> List[HeritageCustodian]:
      """Merge UNESCO data into existing GLAM dataset."""
      merged = existing_data.copy()

      for unesco_record in unesco_data:
          # Check if institution already exists
          match = find_match(unesco_record, existing_data)

          if match:
              # Merge records
              merged_record = merge_unesco_with_existing(unesco_record, match)
              merged[merged.index(match)] = merged_record
          else:
              # Add new institution
              merged.append(unesco_record)

      return merged
  ```
- **Deduplication**
  - Detect duplicates by GHCID, Wikidata Q-number, ISIL code
  - Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
  - Preserve alternative names and identifiers from both sources
- **Provenance Tracking**
  - Update `provenance.notes` for merged records
  - Record merge timestamp
  - Link back to original extraction sources
Deliverables:
- `scripts/merge_unesco_into_glam_dataset.py` - Merge script
- `data/merged_glam_dataset/` - Merged dataset (YAML files)
- `data/merge_report.json` - Merge statistics
Success Criteria:
- Merge 1,000+ UNESCO records with existing GLAM dataset
- Deduplicate matches (no duplicate GHCIDs)
- Preserve data from all sources (no information loss)
Day 23: GHCID Collision Resolution
Tasks:
- **Detect Collisions**

  ```python
  # scripts/resolve_ghcid_collisions.py
  def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
      """Find institutions with identical base GHCIDs."""
      ghcid_map = defaultdict(list)

      for custodian in dataset:
          base_ghcid = remove_q_number(custodian.ghcid)
          ghcid_map[base_ghcid].append(custodian)

      collisions = [
          Collision(base_ghcid=k, institutions=v)
          for k, v in ghcid_map.items()
          if len(v) > 1
      ]
      return collisions
  ```
- **Apply Temporal Priority Rules**
  - Compare `provenance.extraction_date` for colliding institutions
  - First batch (same date): ALL get Q-numbers
  - Historical addition (later date): ONLY new gets name suffix
- **Update GHCID History**

  ```python
  def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
      """Record GHCID change in history."""
      custodian.ghcid_history.append(GHCIDHistoryEntry(
          ghcid=new_ghcid,
          ghcid_numeric=generate_numeric_id(new_ghcid),
          valid_from=datetime.now(timezone.utc).isoformat(),
          valid_to=None,
          reason=f"Name suffix added to resolve collision with {old_ghcid}"
      ))
  ```
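The temporal priority rule can be sketched as a function that decides which colliding records need a disambiguating suffix; the record shape and date field are illustrative, and the actual rules live in the collision-resolution spec.

```python
def assign_suffix_targets(colliding: list) -> list:
    """Return the subset of colliding records whose GHCIDs must change.

    First batch (all same extraction date): every record is disambiguated.
    Historical addition (later date): only the newcomers are renamed.
    """
    dates = [rec["extraction_date"] for rec in colliding]
    earliest = min(dates)  # ISO date strings compare chronologically
    if all(d == earliest for d in dates):
        return list(colliding)  # first batch: rename ALL
    return [r for r in colliding if r["extraction_date"] > earliest]
```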
Deliverables:
- `scripts/resolve_ghcid_collisions.py` - Collision resolution script
- `data/ghcid_collision_report.json` - Detected collisions
- Updated YAML instances with `ghcid_history` entries
Success Criteria:
- Resolve all GHCID collisions (zero duplicates)
- Update GHCID history for affected records
- Preserve PID stability (no changes to published GHCIDs)
Day 24: Final Data Validation
Tasks:
- **Full Dataset Validation**
  - Run LinkML validation on merged dataset
  - Check for orphaned references (invalid foreign keys)
  - Verify all GHCIDs are unique
- **Integrity Checks**

  ```python
  # tests/integration/test_merged_dataset_integrity.py
  def test_no_duplicate_ghcids():
      dataset = load_merged_dataset()
      ghcids = [c.ghcid for c in dataset]
      assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"

  def test_all_unesco_sites_have_whc_id():
      dataset = load_merged_dataset()
      unesco_records = [
          c for c in dataset
          if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"
      ]
      for record in unesco_records:
          whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
          assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
  ```
- **Coverage Analysis**
  - Verify UNESCO sites across all continents
  - Check institution type distribution (not all MUSEUM)
  - Ensure Dutch institutions properly merged with ISIL registry
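The coverage checks can be sketched as a small summary over country/type values; the flat record shape here is illustrative (the real checks read LinkML instances).

```python
from collections import Counter

def coverage_summary(records: list) -> dict:
    """Summarize country coverage and institution-type distribution."""
    countries = Counter(r["country"] for r in records)
    types = Counter(r["institution_type"] for r in records)
    return {
        "n_countries": len(countries),
        "type_distribution": dict(types),
        "all_museum": set(types) == {"MUSEUM"},  # flag if typing collapsed
    }
```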
Deliverables:
- `tests/integration/test_merged_dataset_integrity.py` - Integrity tests
- `data/final_validation_report.json` - Validation results
- `docs/dataset-coverage.md` - Coverage analysis
Success Criteria:
- 100% passing integrity tests
- Zero duplicate GHCIDs
- UNESCO sites cover 100+ countries
Phase 5: Export & Documentation (Days 25-30)
Objectives
- Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
- Generate user documentation and API docs
- Create example queries and use case tutorials
- Publish dataset with persistent identifiers
Day 25-26: RDF/JSON-LD Export
Tasks:
- **RDF Serialization**

  ```python
  # src/glam_extractor/exporters/rdf_exporter.py
  def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
      """Export dataset to RDF/Turtle format."""
      graph = Graph()

      # Define namespaces
      GLAM = Namespace("https://w3id.org/heritage/custodian/")
      graph.bind("glam", GLAM)
      graph.bind("schema", SCHEMA)
      graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))

      for custodian in dataset:
          uri = URIRef(custodian.id)

          # Type assertions
          graph.add((uri, RDF.type, GLAM.HeritageCustodian))
          graph.add((uri, RDF.type, SCHEMA.Museum))  # If institution_type == MUSEUM

          # Literals
          graph.add((uri, SCHEMA.name, Literal(custodian.name)))
          graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))

          # Identifiers (owl:sameAs)
          for identifier in custodian.identifiers:
              if identifier.identifier_scheme == "Wikidata":
                  graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))

      graph.serialize(destination=output_path, format="turtle")
  ```
- **JSON-LD Context**

  ```jsonc
  // data/context/heritage_custodian_context.jsonld
  {
    "@context": {
      "@vocab": "https://w3id.org/heritage/custodian/",
      "schema": "http://schema.org/",
      "name": "schema:name",
      "location": "schema:location",
      "identifiers": "schema:identifier",
      "institution_type": "institutionType",
      "data_source": "dataSource"
    }
  }
  ```
- **Content Negotiation Setup**
  - Configure w3id.org redirects (if hosting on GitHub Pages)
  - Test URI resolution for sample institutions
  - Ensure Accept header routing (text/turtle, application/ld+json)
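Accept-header routing can be sketched as a lookup from media type to export file. This is an illustration of the mapping only, not the actual w3id.org or server configuration; the filenames follow the export deliverables.

```python
# Supported media types -> export files (filenames follow the export plan)
FORMAT_MAP = {
    "text/turtle": "glam_dataset.ttl",
    "application/ld+json": "glam_dataset.jsonld",
}

def negotiate(accept_header: str, default: str = "glam_dataset.jsonld") -> str:
    """Pick the first supported media type listed by the client."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()  # drop quality params like ;q=0.9
        if media_type in FORMAT_MAP:
            return FORMAT_MAP[media_type]
    return default
```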
Deliverables:
- `src/glam_extractor/exporters/rdf_exporter.py` - RDF exporter
- `data/exports/glam_dataset.ttl` - RDF/Turtle export
- `data/exports/glam_dataset.jsonld` - JSON-LD export
- `data/context/heritage_custodian_context.jsonld` - JSON-LD context
Success Criteria:
- RDF validates with Turtle parser
- JSON-LD validates with JSON-LD Playground
- Sample URIs resolve correctly
Day 27: CSV/Parquet Export
Tasks:
- **Flatten Schema for CSV**

  ```python
  # src/glam_extractor/exporters/csv_exporter.py
  from typing import List

  import pandas as pd

  def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
      """Export dataset to CSV with flattened structure."""
      rows = []
      for custodian in dataset:
          row = {
              'ghcid': custodian.ghcid,
              'ghcid_uuid': str(custodian.ghcid_uuid),
              'name': custodian.name,
              'institution_type': custodian.institution_type,
              'country': custodian.locations[0].country if custodian.locations else None,
              'city': custodian.locations[0].city if custodian.locations else None,
              'wikidata_id': get_identifier(custodian, 'Wikidata'),
              'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
              'data_source': custodian.provenance.data_source,
              'data_tier': custodian.provenance.data_tier,
              'confidence_score': custodian.provenance.confidence_score
          }
          rows.append(row)

      df = pd.DataFrame(rows)
      df.to_csv(output_path, index=False, encoding='utf-8-sig')
  ```

- **Parquet Export (Columnar)**

  ```python
  def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
      """Export dataset to Parquet for efficient querying."""
      df = pd.DataFrame([custodian.dict() for custodian in dataset])
      df.to_parquet(output_path, engine='pyarrow', compression='snappy')
  ```

- **SQLite Export**

  ```python
  import sqlite3

  def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
      """Export dataset to SQLite database."""
      conn = sqlite3.connect(db_path)

      # Create tables
      conn.execute("""
          CREATE TABLE heritage_custodians (
              ghcid TEXT PRIMARY KEY,
              ghcid_uuid TEXT UNIQUE,
              name TEXT NOT NULL,
              institution_type TEXT,
              data_source TEXT,
              ...
          )
      """)

      # Insert records
      for custodian in dataset:
          conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)

      conn.commit()
  ```
Deliverables:
- `src/glam_extractor/exporters/csv_exporter.py` - CSV exporter
- `data/exports/glam_dataset.csv` - CSV export
- `data/exports/glam_dataset.parquet` - Parquet export
- `data/exports/glam_dataset.db` - SQLite database
Success Criteria:
- CSV opens correctly in Excel, Google Sheets
- Parquet loads in pandas, DuckDB
- SQLite database queryable with SQL
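The "queryable with SQL" criterion can be verified with a quick round-trip against an in-memory database. A minimal sketch using a trimmed-down version of the table (columns reduced for illustration; the sample row values are placeholders, not real dataset records):

```python
import sqlite3

# In-memory round-trip: create a trimmed table, insert one placeholder row, query it back.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE heritage_custodians (
        ghcid TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        institution_type TEXT,
        data_source TEXT
    )
""")
conn.execute(
    "INSERT INTO heritage_custodians VALUES (?, ?, ?, ?)",
    ("GHC-UNESCO-668", "Angkor", "HERITAGE_SITE", "UNESCO_WORLD_HERITAGE"),
)
conn.commit()

count, = conn.execute(
    "SELECT COUNT(*) FROM heritage_custodians WHERE data_source = 'UNESCO_WORLD_HERITAGE'"
).fetchone()
print(count)  # → 1
```

The same check scales to the full export by pointing `sqlite3.connect()` at `data/exports/glam_dataset.db`.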
Day 28: Documentation - User Guide
Tasks:
- **Getting Started Guide**

  ````markdown
  # docs/user-guide/getting-started.md

  ## Installation

  pip install glam-extractor

  ## Quick Start

  ```python
  from glam_extractor import load_dataset

  # Load the GLAM dataset
  dataset = load_dataset("data/exports/glam_dataset.parquet")

  # Filter UNESCO museums in France
  museums = dataset.filter(
      institution_type="MUSEUM",
      data_source="UNESCO_WORLD_HERITAGE",
      country="FR"
  )
  ```
  ````

- **Example Queries**
  - SPARQL examples (find institutions by type, country)
  - Pandas examples (data analysis, statistics)
  - SQL examples (SQLite queries)

- **API Reference**
  - Document all public classes and methods
  - Provide code examples for each function
  - Link to LinkML schema documentation
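One of the example queries could be demonstrated against the CSV export with nothing but the standard library. A minimal sketch counting institutions per country (the inline rows are fabricated placeholders standing in for `data/exports/glam_dataset.csv`):

```python
import csv
import io
from collections import Counter

# Placeholder stand-in for data/exports/glam_dataset.csv (values are illustrative).
sample_csv = """ghcid,name,institution_type,country
GHC-UNESCO-668,Angkor,HERITAGE_SITE,KH
GHC-UNESCO-83,Palace of Versailles,MUSEUM,FR
GHC-UNESCO-80,Mont-Saint-Michel,HERITAGE_SITE,FR
"""

# Example query: how many institutions per country?
reader = csv.DictReader(io.StringIO(sample_csv))
by_country = Counter(row["country"] for row in reader)
print(by_country["FR"])  # → 2
```

Equivalent SPARQL (over the Turtle export) and SQL (over the SQLite export) versions of this query would belong in `docs/user-guide/example-queries.md`.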
Deliverables:
- `docs/user-guide/getting-started.md` - Quick start guide
- `docs/user-guide/example-queries.md` - Query examples
- `docs/api-reference.md` - API documentation
Success Criteria:
- Complete documentation for all public APIs
- 10+ example queries covering common use cases
- Step-by-step tutorials for data consumers
Day 29: Documentation - Developer Guide
Tasks:
- **Architecture Overview**
  - Diagram of extraction pipeline (API → Parser → Validator → Exporter)
  - Explanation of LinkML Map transformation
  - GHCID generation algorithm
- **Contributing Guide**
  - How to add new institution type classifiers
  - How to extend LinkML schema with new fields
  - How to add new export formats
- **Testing Guide**
  - Running unit tests, integration tests
  - Creating new test fixtures
  - Using property-based testing
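The GHCID generation algorithm to be documented in the architecture guide is specified elsewhere in the plan; purely as a hedged illustration, a deterministic scheme could look like the following (the namespace constant, ID string format, and Q-number suffix are assumptions for this sketch, not the project's actual algorithm):

```python
import uuid
from typing import Optional, Tuple

# Illustrative namespace; a real implementation would pin its own constant.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")

def make_ghcid(data_source: str, source_id: str,
               wikidata_qid: Optional[str] = None) -> Tuple[str, uuid.UUID]:
    """Derive a deterministic GHCID string and UUID from source identifiers.

    Appending the Wikidata Q-number disambiguates collisions (hypothetical
    strategy, mirroring the plan's Q-number resolution note).
    """
    base = f"{data_source}:{source_id}"
    if wikidata_qid:
        base += f":{wikidata_qid}"
    ghcid_uuid = uuid.uuid5(GHCID_NAMESPACE, base)
    return f"GHC-{data_source}-{source_id}", ghcid_uuid

ghcid, ghcid_uuid = make_ghcid("UNESCO", "668")
# Deterministic: the same inputs always yield the same UUID.
assert make_ghcid("UNESCO", "668")[1] == ghcid_uuid
```

Using `uuid5` (name-based, SHA-1) rather than `uuid4` keeps dataset regeneration reproducible, which matters for the "fully automated" success metric.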
Deliverables:
- `docs/developer-guide/architecture.md` - Architecture docs
- `docs/developer-guide/contributing.md` - Contribution guide
- `docs/developer-guide/testing.md` - Testing guide
Success Criteria:
- Complete architecture documentation with diagrams
- Clear instructions for extending the system
- Comprehensive testing guide
Day 30: Release & Publication
Tasks:
- **Dataset Release**
  - Tag repository with version number (e.g., v1.0.0-unesco)
  - Create GitHub Release with exports attached
  - Publish to Zenodo for DOI (persistent citation)
- **Announcement**
  - Write blog post announcing UNESCO data release
  - Share on social media (Twitter, Mastodon, LinkedIn)
  - Notify stakeholders (Europeana, DPLA, heritage researchers)
- **Data Portal Update**
  - Update w3id.org redirects for new institutions
  - Deploy SPARQL endpoint (if applicable)
  - Update REST API to include UNESCO data
Deliverables:
- GitHub Release with dataset exports
- Zenodo DOI for citation
- Blog post and announcement
Success Criteria:
- Dataset published with persistent DOI
- Documentation live and accessible
- Stakeholders notified of release
Risk Mitigation
Technical Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions between sources | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |
Resource Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |
Data Quality Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |
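The "tier-based priority, UNESCO wins" mitigation can be made concrete. A minimal sketch, assuming a hypothetical tier ordering in which lower numbers are more authoritative (the tier values and the plain-dict record shape are assumptions for illustration):

```python
# Hypothetical tier ranking: lower value = more authoritative source.
TIER_PRIORITY = {"UNESCO_WORLD_HERITAGE": 1, "WIKIDATA": 2, "COMMUNITY": 3}

def resolve_conflict(existing: dict, incoming: dict) -> dict:
    """Keep the record whose data_source has the higher-priority tier.

    Records are plain dicts with a 'data_source' key (illustrative shape).
    Ties keep the existing record, so repeated merges stay idempotent.
    """
    old = TIER_PRIORITY.get(existing["data_source"], 99)
    new = TIER_PRIORITY.get(incoming["data_source"], 99)
    return incoming if new < old else existing

a = {"data_source": "WIKIDATA", "name": "Angkor Archaeological Park"}
b = {"data_source": "UNESCO_WORLD_HERITAGE", "name": "Angkor"}
print(resolve_conflict(a, b)["name"])  # → Angkor
```

Because the comparison is symmetric apart from the tie-break, the UNESCO record wins regardless of which source was ingested first.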
Success Metrics
Quantitative Metrics
- Coverage: Extract 1,000+ UNESCO site institutions
- Quality: 90%+ confidence score average
- Completeness: 80%+ have Wikidata Q-numbers
- Performance: Process all sites in < 2 hours
- Test Coverage: 90%+ code coverage
Qualitative Metrics
- Usability: Positive feedback from 3+ data consumers
- Documentation: Complete user guide and API docs
- Maintainability: Code passes linter, type checker
- Reproducibility: Dataset generation fully automated
Appendix: Day-by-Day Checklist
Phase 1 (Days 1-5)
- Day 1: UNESCO API documentation reviewed, 50 sites fetched
- Day 2: LinkML Map schema (part 1) - basic transformations
- Day 3: LinkML Map schema (part 2) - advanced patterns
- Day 4: Golden dataset created (20 test fixtures)
- Day 5: Institution type classifier designed
Phase 2 (Days 6-13)
- Day 6: UNESCO API client implemented
- Day 7: LinkML instance generator implemented
- Day 8: GHCID generator extended for UNESCO
- Day 9-10: Batch processing pipeline
- Day 11-12: Integration testing
- Day 13: Code review and refactoring
Phase 3 (Days 14-19)
- Day 14-15: Cross-referencing with existing data
- Day 16: Conflict resolution
- Day 17: Confidence scoring system
- Day 18: LinkML schema validation
- Day 19: Data quality report
Phase 4 (Days 20-24)
- Day 20-21: Wikidata enrichment
- Day 22: Dataset merge
- Day 23: GHCID collision resolution
- Day 24: Final data validation
Phase 5 (Days 25-30)
- Day 25-26: RDF/JSON-LD export
- Day 27: CSV/Parquet/SQLite export
- Day 28: User guide documentation
- Day 29: Developer guide documentation
- Day 30: Release and publication
Document Status: Complete
Next Document: 04-tdd-strategy.md - Test-driven development plan
Version: 1.1
Version History
Version 1.1 (2025-11-10)
Changes: Updated for OpenDataSoft Explore API v2.0 migration
- **Day 1 API Reconnaissance** (lines 44-62): Updated API endpoint from legacy `whc.unesco.org/en/list/json` to OpenDataSoft `data.unesco.org/api/explore/v2.0`
- **Day 6 UNESCO API Client** (lines 265-305):
  - Updated `base_url` to OpenDataSoft API endpoint
  - Removed `api_key` parameter (public dataset, no authentication)
  - Added pagination parameters to `fetch_site_list()`: `limit`, `offset`
  - Updated method documentation to reflect OpenDataSoft response structure: `{"record": {"fields": {...}}}`
- **Day 7 LinkML Parser** (lines 309-352):
  - Updated `parse_unesco_site()` to extract from nested `api_response['record']['fields']`
  - Added documentation clarifying OpenDataSoft structure
  - Updated `extract_multilingual_names()` parameter name from `unesco_data` to `site_data`
- **Day 10 Batch Processing** (lines 406-450):
  - Updated `extract_all_unesco_sites()` with pagination loop for OpenDataSoft API
  - Updated `process_unesco_site()` to handle nested record structure
  - Changed field access from `site_data['id_number']` to `site_record['record']['fields']['unique_number']`
- **Day 11-12 Integration Tests** (lines 454-483):
  - Updated `test_full_unesco_extraction_pipeline()` to extract `site_data` from `response['record']['fields']`
  - Added explicit documentation of OpenDataSoft API structure
Rationale: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.
Version 1.0 (2025-11-09)
Initial Release
- Comprehensive 30-day implementation plan for UNESCO data extraction
- Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
- TDD approach with golden dataset and integration tests
- GHCID generation strategy for UNESCO heritage sites
- Wikidata enrichment and cross-referencing plan