# UNESCO Data Extraction - Implementation Phases

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 03 - Implementation Phases
**Version**: 1.1
**Date**: 2025-11-10
**Status**: Draft

---

## Executive Summary

This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.

**Total Effort**: 30 working days (6 weeks)
**Team Size**: 1-2 developers + AI agents
**Target Output**: 1,000+ heritage custodian records from UNESCO sites

---

## Phase Overview

| Phase | Duration | Focus | Key Deliverables |
|-------|----------|-------|------------------|
| **Phase 1** | 5 days | API Exploration & Schema Design | UNESCO API parser, LinkML Map schema |
| **Phase 2** | 8 days | Extractor Implementation | Institution type classifier, GHCID generator |
| **Phase 3** | 6 days | Data Quality & Validation | LinkML validator, conflict resolver |
| **Phase 4** | 5 days | Integration & Enrichment | Wikidata enrichment, dataset merge |
| **Phase 5** | 6 days | Export & Documentation | RDF/JSON-LD exporters, user docs |

---

## Phase 1: API Exploration & Schema Design (Days 1-5)

### Objectives

- Understand UNESCO DataHub API structure and data quality
- Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
- Create test fixtures from real UNESCO API responses
- Establish a baseline for institution type classification

### Day 1: UNESCO API Reconnaissance

**Tasks**:
1. **API Documentation Review**
   - Study the UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
   - Identify available endpoints (dataset `whc001` - World Heritage List)
   - Document authentication requirements (none - public dataset)
   - Document pagination limits (max 100 records per request, use the `offset` parameter)
   - Test API responses for sample sites

2. **Data Structure Analysis**

```bash
# Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
  > samples/unesco_site_list.json

# Fetch a specific site by unique_number using ODSQL
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
  > samples/unesco_angkor_detail.json

# Response structure: nested record.fields with coordinates object
# {
#   "total_count": 1248,
#   "records": [{
#     "record": {
#       "fields": {
#         "name_en": "Angkor",
#         "unique_number": 668,
#         "coordinates": {"lat": 13.4333, "lon": 103.8333}
#       }
#     }
#   }]
# }
```

3. **Schema Mapping**
   - Map OpenDataSoft `record.fields` to LinkML `HeritageCustodian` slots
   - Identify missing fields (require inference or external enrichment)
   - Document ambiguities (e.g., when is a site also a museum?)
   - Handle the nested response structure (`response['records'][i]['record']['fields']`)

**Deliverables**:
- `docs/unesco-api-analysis.md` - API structure documentation
- `tests/fixtures/unesco_api_responses/` - 10+ sample JSON files
- `docs/unesco-to-linkml-mapping.md` - Field mapping table

**Success Criteria**:
- [ ] Successfully fetch data for 50 UNESCO sites via API
- [ ] Document all relevant JSON fields for extraction
- [ ] Identify 3+ institution type classification patterns

---

### Day 2: LinkML Map Schema Design (Part 1)

**Tasks**:

1. **Install LinkML Map Extension**

```bash
pip install linkml-map
# OR implement custom extension: src/glam_extractor/mappers/extended_map.py
```
2. **Design Transformation Rules**
   - Create `schemas/maps/unesco_to_heritage_custodian.yaml`
   - Define JSONPath expressions for field extraction
   - Handle multi-language names (UNESCO provides English, French, and often a local language)
   - Map UNESCO categories to InstitutionTypeEnum

3. **Conditional Extraction Logic**

```yaml
# Example LinkML Map rule
mappings:
  - source_path: $.category
    target_path: institution_type
    transform:
      type: conditional
      rules:
        - condition: "contains(description, 'museum')"
          value: MUSEUM
        - condition: "contains(description, 'library')"
          value: LIBRARY
        - condition: "contains(description, 'archive')"
          value: ARCHIVE
        - default: MIXED
```

**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (initial version)
- `tests/test_unesco_linkml_map.py` - Unit tests for mapping rules

**Success Criteria**:
- [ ] LinkML Map schema validates against sample UNESCO JSON
- [ ] Successfully extract name, location, and UNESCO WHC ID from 10 fixtures
- [ ] Handle multilingual names without data loss

---

### Day 3: LinkML Map Schema Design (Part 2)

**Tasks**:

1. **Advanced Transformation Rules**
   - Regex extraction for identifiers (UNESCO WHC ID format: `^\d{3,4}$`)
   - GeoNames ID lookup from UNESCO location strings
   - Wikidata Q-number extraction from UNESCO external links

2. **Multi-value Array Handling**

```yaml
# Extract all languages from UNESCO site names
mappings:
  - source_path: $.names[*]
    target_path: alternative_names
    transform:
      type: array
      element_transform:
        type: template
        template: "{name}@{lang}"
```
3. **Error Handling Patterns**
   - Missing required fields → skip record with a warning
   - Invalid coordinates → flag for manual geocoding
   - Unknown institution type → default to MIXED, log for review

**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (complete)
- `docs/linkml-map-extension-spec.md` - Custom extension specification

**Success Criteria**:
- [ ] Extract ALL relevant fields from 10 diverse UNESCO sites
- [ ] Handle edge cases (missing data, malformed coordinates)
- [ ] Generate valid LinkML instances from real API responses

---

### Day 4: Test Fixture Creation

**Tasks**:

1. **Curate Representative Samples**
   - Select 20 UNESCO sites covering:
     - All continents (Europe, Asia, Africa, Americas, Oceania)
     - Multiple institution types (museums, libraries, archives, botanical gardens)
     - Edge cases (serial nominations, transboundary sites)

2. **Create Expected Outputs**

```yaml
# tests/fixtures/expected_outputs/unesco_louvre.yaml
- id: https://w3id.org/heritage/custodian/fr/louvre
  name: Musée du Louvre
  institution_type: MUSEUM
  locations:
    - city: Paris
      country: FR
      coordinates: [48.8606, 2.3376]
  identifiers:
    - identifier_scheme: UNESCO_WHC
      identifier_value: "600"
      identifier_url: https://whc.unesco.org/en/list/600
  provenance:
    data_source: UNESCO_WORLD_HERITAGE
    data_tier: TIER_1_AUTHORITATIVE
```

3. **Golden Dataset Construction**
   - Manually verify the 20 expected outputs against authoritative sources
   - Document any assumptions or inferences made

**Deliverables**:
- `tests/fixtures/unesco_api_responses/` - 20 JSON files
- `tests/fixtures/expected_outputs/` - 20 YAML files
- `tests/test_unesco_golden_dataset.py` - Integration tests

**Success Criteria**:
- [ ] 20 golden dataset pairs (input JSON + expected YAML)
- [ ] 100% passing tests for the golden dataset
- [ ] Documented edge cases and classification rules

---

### Day 5: Institution Type Classifier Design

**Tasks**:
1. **Pattern Analysis**
   - Analyze UNESCO descriptions for GLAM-related keywords
   - Create a decision tree for institution type classification
   - Handle ambiguous cases (e.g., "archaeological park with museum")

2. **Keyword Extraction**

```python
# src/glam_extractor/classifiers/unesco_institution_type.py
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
```

3. **Confidence Scoring**
   - High confidence (0.9+): explicit mentions of "museum" or "library"
   - Medium confidence (0.7-0.9): inferred from UNESCO category + keywords
   - Low confidence (0.5-0.7): ambiguous; default to MIXED, flag for review

**Deliverables**:
- `src/glam_extractor/classifiers/unesco_institution_type.py`
- `tests/test_unesco_classifier.py` - 50+ test cases
- `docs/unesco-classification-rules.md` - Decision tree documentation

**Success Criteria**:
- [ ] Classifier achieves 90%+ accuracy on the golden dataset
- [ ] Low-confidence classifications flagged for manual review
- [ ] Handle multilingual descriptions (English, French, Spanish, etc.)

---

## Phase 2: Extractor Implementation (Days 6-13)

### Objectives

- Implement a UNESCO API client with caching and rate limiting
- Build a LinkML instance generator using the Map schema
- Create a GHCID generator for UNESCO institutions
- Achieve 100% test coverage for core extraction logic

### Day 6: UNESCO API Client

**Tasks**:
1. **HTTP Client Implementation**

```python
# src/glam_extractor/parsers/unesco_api_client.py
from typing import Dict

class UNESCOAPIClient:
    def __init__(self, cache_ttl: int = 86400):
        self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
        self.cache = Cache(ttl=cache_ttl)
        self.rate_limiter = RateLimiter(requests_per_second=2)

    def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
        """
        Fetch a paginated list of UNESCO World Heritage Sites.

        Returns the OpenDataSoft API response with structure:
        {
          "total_count": int,
          "results": [{"record": {"id": str, "fields": {...}}}, ...]
        }
        """
        ...

    def fetch_site_details(self, whc_id: int) -> Dict:
        """
        Fetch detailed information for a specific site using an ODSQL query.

        Returns a single record with structure:
        {"record": {"id": str, "fields": {field_name: value, ...}}}
        """
        ...
```

2. **Caching Strategy**
   - Cache API responses for 24 hours (UNESCO updates infrequently)
   - Store in a SQLite database: `cache/unesco_api_cache.db`
   - Invalidate the cache on demand for data refreshes

3. **Error Handling**
   - Network errors → retry with exponential backoff
   - 404 Not Found → skip site, log a warning
   - Rate limit exceeded → pause and retry

**Deliverables**:
- `src/glam_extractor/parsers/unesco_api_client.py`
- `tests/test_unesco_api_client.py` - Mock API tests
- `cache/unesco_api_cache.db` - SQLite cache database

**Success Criteria**:
- [ ] Successfully fetch all UNESCO sites (1,000+ sites)
- [ ] Handle API errors gracefully (no crashes)
- [ ] Cache reduces API calls by 95% on repeat runs

---

### Day 7: LinkML Instance Generator

**Tasks**:

1. **Apply LinkML Map Transformations**

```python
# src/glam_extractor/parsers/unesco_parser.py
from typing import Dict, List, Tuple

from linkml_map import Mapper

def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
    """
    Parse an OpenDataSoft API response to a HeritageCustodian instance.

    Args:
        api_response: OpenDataSoft record with nested structure:
            {"record": {"id": str, "fields": {field_name: value, ...}}}

    Returns:
        HeritageCustodian: Validated LinkML instance
    """
    # Extract fields from the nested structure
    site_data = api_response['record']['fields']

    mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
    instance = mapper.transform(site_data)
    return HeritageCustodian(**instance)
```

2. **Validation Pipeline**
   - Apply LinkML schema validation after transformation
   - Catch validation errors, log details
   - Skip invalid records, continue processing

3. **Multi-language Name Handling**

```python
def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
    """
    Extract the primary name and alternative names in multiple languages.

    Args:
        site_data: Extracted fields from OpenDataSoft record['fields']
    """
    primary_name = site_data.get('site', '')
    alternative_names = []

    # OpenDataSoft may provide language variants in separate fields
    # or as structured data - adjust based on the actual API response
    for lang_data in site_data.get('names', []):
        name = lang_data.get('name', '')
        lang = lang_data.get('lang', 'en')
        if name and name != primary_name:
            alternative_names.append(f"{name}@{lang}")

    return primary_name, alternative_names
```

**Deliverables**:
- `src/glam_extractor/parsers/unesco_parser.py`
- `tests/test_unesco_parser.py` - 20 golden dataset tests

**Success Criteria**:
- [ ] Parse the 20 golden dataset fixtures with 100% accuracy
- [ ] Extract multilingual names without data loss
- [ ] Generate valid LinkML instances (pass schema validation)

---

### Day 8: GHCID Generator for UNESCO Sites

**Tasks**:
1. **Extend GHCID Logic for UNESCO**

```python
# src/glam_extractor/identifiers/ghcid_generator.py
def generate_ghcid_for_unesco_site(
    country_code: str,
    region_code: str,
    city_code: str,
    institution_type: InstitutionTypeEnum,
    institution_name: str,
    has_collision: bool = False
) -> str:
    """
    Generate a GHCID for a UNESCO World Heritage Site institution.

    Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
    Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)

    Note: The collision suffix uses the native-language institution name in
    snake_case, NOT Wikidata Q-numbers.
    See docs/plan/global_glam/07-ghcid-collision-resolution.md
    """
    ...
```

2. **City Code Lookup**
   - Use the GeoNames API to convert city names to UN/LOCODE
   - Fall back to a 3-letter abbreviation if no UN/LOCODE is found
   - Cache lookups to minimize API calls

3. **Collision Detection**
   - Check the existing GHCID dataset for collisions
   - Apply temporal priority rules (first batch vs. historical addition)
   - Append a native-language name suffix if a collision is detected

**Deliverables**:
- Extended `src/glam_extractor/identifiers/ghcid_generator.py`
- `tests/test_ghcid_unesco.py` - GHCID generation tests

**Success Criteria**:
- [ ] Generate valid GHCIDs for the 20 golden dataset institutions
- [ ] No collisions with the existing Dutch ISIL dataset
- [ ] Handle missing Wikidata Q-numbers gracefully

---

### Day 9-10: Batch Processing Pipeline

**Tasks**:
1. **Parallel Processing**

```python
# scripts/extract_unesco_sites.py
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional

def extract_all_unesco_sites(max_workers: int = 4):
    api_client = UNESCOAPIClient()

    # Fetch the paginated site list from the OpenDataSoft API
    all_sites = []
    offset = 0
    limit = 100
    while True:
        response = api_client.fetch_site_list(limit=limit, offset=offset)
        all_sites.extend(response['results'])
        if len(all_sites) >= response['total_count']:
            break
        offset += limit

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(process_unesco_site, all_sites)

    return list(results)

def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
    """
    Process a single OpenDataSoft record.

    Args:
        site_record: {"record": {"id": str, "fields": {...}}}
    """
    api_client = UNESCOAPIClient()  # per-worker client; shares the SQLite cache
    whc_id = None  # initialized so the except block can always log it
    try:
        # Extract unique_number from the nested fields
        whc_id = site_record['record']['fields']['unique_number']
        # Fetch full details if needed (or use site_record directly)
        details = api_client.fetch_site_details(whc_id)
        institution_type = classify_institution_type(details['record']['fields'])
        custodian = parse_unesco_site(details)
        custodian.ghcid = generate_ghcid(custodian)
        return custodian
    except Exception as e:
        log.error(f"Failed to process site {whc_id}: {e}")
        return None
```

2. **Progress Tracking**
   - Use `tqdm` for progress bars
   - Log successful extractions to `logs/unesco_extraction.log`
   - Save intermediate results every 100 sites
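The progress-tracking and intermediate-save bullets above can be sketched as follows. This is a minimal illustration: `extract_with_checkpoints`, `CHECKPOINT_EVERY`, and the JSON checkpoint layout are assumed names, and the real pipeline would emit LinkML YAML and wrap `sites` in `tqdm()` for the progress bar.

```python
# Illustrative sketch only: checkpoint layout and helper names are assumptions.
import json
from pathlib import Path

CHECKPOINT_EVERY = 100  # save intermediate results every 100 sites

def extract_with_checkpoints(sites, process, out_dir):
    """Process sites sequentially, writing a checkpoint every 100 records so a
    crashed run can resume from the last saved offset."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    for i, site in enumerate(sites, start=1):
        result = process(site)  # e.g. process_unesco_site
        if result is not None:
            results.append(result)
        if i % CHECKPOINT_EVERY == 0:
            (out / f"checkpoint_{i}.json").write_text(json.dumps(results))
    (out / "final.json").write_text(json.dumps(results))
    return results
```

On restart, the highest-numbered `checkpoint_*.json` tells the script which offset to resume from, which is what the error-recovery step below relies on.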
3. **Error Recovery**
   - Resume from the last checkpoint if the script crashes
   - Separate successful extractions from failed ones
   - Generate an error report with failed site IDs

**Deliverables**:
- `scripts/extract_unesco_sites.py` - Batch extraction script
- `data/unesco_extracted/` - Output directory for YAML instances
- `logs/unesco_extraction.log` - Extraction log

**Success Criteria**:
- [ ] Process all 1,000+ UNESCO sites in < 2 hours
- [ ] < 5% failure rate (API errors, missing data)
- [ ] Successful extractions saved as valid LinkML YAML files

---

### Day 11-12: Integration Testing

**Tasks**:

1. **End-to-End Tests**

```python
# tests/integration/test_unesco_pipeline.py
def test_full_unesco_extraction_pipeline():
    """Test the complete pipeline from OpenDataSoft API fetch to LinkML instance."""
    # 1. Fetch API data from OpenDataSoft
    api_client = UNESCOAPIClient()
    response = api_client.fetch_site_details(600)  # Paris, Banks of the Seine
    site_data = response['record']['fields']  # Extract from the nested structure

    # 2. Classify institution type
    inst_type = classify_institution_type(site_data)
    assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]

    # 3. Parse to a LinkML instance
    custodian = parse_unesco_site(response)
    assert custodian.name is not None

    # 4. Generate GHCID
    custodian.ghcid = generate_ghcid(custodian)
    assert custodian.ghcid.startswith("FR-")

    # 5. Validate against the schema
    validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
    result = validator.validate(custodian)
    assert result.is_valid
```

2. **Property-Based Testing**

```python
from hypothesis import given, strategies as st

@given(st.integers(min_value=1, max_value=1500))
def test_ghcid_determinism(whc_id: int):
    """GHCID generation is deterministic for the same input."""
    site1 = generate_ghcid_for_site(whc_id)
    site2 = generate_ghcid_for_site(whc_id)
    assert site1 == site2
```
3. **Performance Testing**
   - Benchmark extraction speed (sites per second)
   - Memory profiling (ensure no memory leaks)
   - Cache hit rate analysis

**Deliverables**:
- `tests/integration/test_unesco_pipeline.py` - End-to-end tests
- `tests/test_unesco_property_based.py` - Property-based tests
- `docs/performance-benchmarks.md` - Performance results

**Success Criteria**:
- [ ] 100% passing integration tests
- [ ] Extract 1,000 sites in < 2 hours (with cache)
- [ ] Memory usage < 500MB for a full extraction

---

### Day 13: Code Review & Refactoring

**Tasks**:

1. **Code Quality Review**
   - Run the `ruff` linter, fix all warnings
   - Run the `mypy` type checker, resolve type errors
   - Ensure 90%+ test coverage

2. **Documentation Review**
   - Add docstrings to all public functions
   - Update the README with UNESCO extraction instructions
   - Create a developer guide for extending classifiers

3. **Performance Optimization**
   - Profile slow functions, optimize bottlenecks
   - Reduce redundant API calls
   - Optimize GHCID generation (cache city code lookups)

**Deliverables**:
- Refactored codebase with 90%+ test coverage
- Updated documentation
- Performance optimizations applied

**Success Criteria**:
- [ ] Zero linter warnings
- [ ] Zero type errors
- [ ] Test coverage > 90%

---

## Phase 3: Data Quality & Validation (Days 14-19)

### Objectives

- Cross-reference UNESCO data with the existing GLAM dataset (Dutch ISIL, conversations)
- Detect and resolve conflicts
- Implement a confidence scoring system
- Generate a data quality report

### Day 14-15: Cross-Referencing with Existing Data

**Tasks**:
1. **Load Existing Dataset**

```python
# scripts/crosslink_unesco_with_glam.py
def load_existing_glam_dataset():
    """Load Dutch ISIL + conversation extractions."""
    dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
    dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
    conversations = load_conversation_extractions("data/instances/")
    return merge_datasets([dutch_isil, dutch_orgs, conversations])
```

2. **Match UNESCO Sites to Existing Records**
   - Match by Wikidata Q-number (highest confidence)
   - Match by ISIL code (for Dutch institutions)
   - Match by name + location (fuzzy matching, score > 0.85)

3. **Conflict Detection**

```python
def detect_conflicts(unesco_record: HeritageCustodian,
                     existing_record: HeritageCustodian) -> List[Conflict]:
    """Detect field-level conflicts between UNESCO and existing data."""
    conflicts = []

    if unesco_record.name != existing_record.name:
        conflicts.append(Conflict(
            field="name",
            unesco_value=unesco_record.name,
            existing_value=existing_record.name,
            resolution="MANUAL_REVIEW"
        ))

    # Check for institution type mismatch
    if unesco_record.institution_type != existing_record.institution_type:
        conflicts.append(Conflict(field="institution_type", ...))

    return conflicts
```

**Deliverables**:
- `scripts/crosslink_unesco_with_glam.py` - Cross-linking script
- `data/unesco_conflicts.csv` - Detected conflicts report
- `tests/test_crosslinking.py` - Unit tests for matching logic

**Success Criteria**:
- [ ] Identify 50+ matches between UNESCO and the existing dataset
- [ ] Detect conflicts (name mismatches, type discrepancies)
- [ ] Generate a conflict report for manual review

---

### Day 16: Conflict Resolution

**Tasks**:

1. **Tier-Based Priority**
   - TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
   - On conflict: UNESCO data wins, existing data is flagged
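The tier-priority rule above can be sketched as a small helper. Only `TIER_1_AUTHORITATIVE` appears verbatim in this plan; the `TIER_4_CONVERSATION` name and `resolve_by_tier` itself are illustrative assumptions, not the project's canonical API.

```python
# Sketch of tier-based priority; tier names other than TIER_1_AUTHORITATIVE
# are assumed for illustration.
TIER_PRIORITY = {
    "TIER_1_AUTHORITATIVE": 1,  # UNESCO, Dutch ISIL
    "TIER_4_CONVERSATION": 4,   # conversation NLP extractions (assumed name)
}

def resolve_by_tier(field, unesco_value, unesco_tier, existing_value, existing_tier):
    """Return the winning value and a note describing what was flagged."""
    if TIER_PRIORITY[unesco_tier] < TIER_PRIORITY[existing_tier]:
        return unesco_value, f"{field}: existing value flagged (lower tier)"
    if TIER_PRIORITY[unesco_tier] > TIER_PRIORITY[existing_tier]:
        return existing_value, f"{field}: UNESCO value flagged (lower tier)"
    return unesco_value, f"{field}: equal tier, queued for manual review"
```

Equal-tier conflicts fall through to the manual review queue described below rather than being decided automatically.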
2. **Merge Strategy**

```python
def merge_unesco_with_existing(
    unesco_record: HeritageCustodian,
    existing_record: HeritageCustodian
) -> HeritageCustodian:
    """Merge UNESCO data with an existing record; UNESCO takes priority."""
    merged = existing_record.copy()

    # The UNESCO name becomes primary
    merged.name = unesco_record.name

    # Preserve alternative names from both sources
    merged.alternative_names = list(set(
        unesco_record.alternative_names + existing_record.alternative_names
    ))

    # Add the UNESCO identifier
    merged.identifiers.append({
        'identifier_scheme': 'UNESCO_WHC',
        'identifier_value': unesco_record.identifiers[0].identifier_value
    })

    # Track the provenance of the merge
    merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"

    return merged
```

3. **Manual Review Queue**
   - Flag high-impact conflicts (institution type change, location change)
   - Generate a review spreadsheet for human validation
   - Provide evidence for each conflict (source URLs, descriptions)

**Deliverables**:
- `src/glam_extractor/validators/conflict_resolver.py`
- `data/manual_review_queue.csv` - Conflicts requiring human review
- `tests/test_conflict_resolution.py`

**Success Criteria**:
- [ ] Resolve 80% of conflicts automatically (tier-based priority)
- [ ] Flag 20% for manual review (complex cases)
- [ ] Zero data loss (preserve all alternative names and identifiers)

---

### Day 17: Confidence Scoring System

**Tasks**:
1. **Score Calculation**

```python
def calculate_confidence_score(custodian: HeritageCustodian) -> float:
    """Calculate a confidence score based on data completeness and source quality."""
    score = 1.0  # Start at maximum (TIER_1 authoritative)

    # Deduct for missing fields
    if not custodian.identifiers:
        score -= 0.1
    if not custodian.locations:
        score -= 0.15
    if custodian.institution_type == InstitutionTypeEnum.MIXED:
        score -= 0.2  # Ambiguous classification

    # Boost for rich metadata
    if len(custodian.identifiers) > 2:
        score += 0.05
    if custodian.digital_platforms:
        score += 0.05

    return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
```

2. **Quality Metrics**
   - Completeness: % of optional fields populated
   - Accuracy: agreement with authoritative sources (Wikidata, official websites)
   - Freshness: days since extraction

3. **Tier Validation**
   - Ensure all UNESCO records have `data_tier: TIER_1_AUTHORITATIVE`
   - Downgrade the tier if conflicts are detected (TIER_1 → TIER_2)

**Deliverables**:
- `src/glam_extractor/validators/confidence_scorer.py`
- `tests/test_confidence_scoring.py`
- `data/unesco_quality_metrics.json` - Aggregate statistics

**Success Criteria**:
- [ ] 90%+ of UNESCO records score > 0.85 confidence
- [ ] Flag < 5% as low confidence (requiring review)
- [ ] Document the scoring methodology in provenance metadata

---

### Day 18: LinkML Schema Validation

**Tasks**:

1. **Batch Validation**

```bash
# Validate all UNESCO extractions against the LinkML schema
for file in data/unesco_extracted/*.yaml; do
  linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
done
```
2. **Custom Validators**

```python
# src/glam_extractor/validators/unesco_validators.py
import re

def validate_unesco_whc_id(whc_id: str) -> bool:
    """UNESCO WHC IDs are 3-4 digit integers."""
    return bool(re.match(r'^\d{3,4}$', whc_id))

def validate_ghcid_format(ghcid: str) -> bool:
    """Validate the GHCID format for UNESCO institutions."""
    pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-Q\d+)?$'
    return bool(re.match(pattern, ghcid))
```

3. **Error Reporting**
   - Generate a validation report with line numbers and error messages
   - Categorize errors (required field missing, invalid enum value, format error)
   - Prioritize fixes (blocking errors vs. warnings)

**Deliverables**:
- `scripts/validate_unesco_dataset.py` - Batch validation script
- `src/glam_extractor/validators/unesco_validators.py` - Custom validators
- `data/unesco_validation_report.json` - Validation results

**Success Criteria**:
- [ ] 100% of extracted records pass LinkML validation
- [ ] Zero blocking errors
- [ ] Document any warnings in provenance notes

---

### Day 19: Data Quality Report

**Tasks**:

1. **Generate Statistics**

```python
# scripts/generate_unesco_quality_report.py
def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
    return {
        'total_institutions': len(unesco_dataset),
        'by_country': count_by_country(unesco_dataset),
        'by_institution_type': count_by_type(unesco_dataset),
        'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
        'completeness_metrics': {
            'with_wikidata_id': count_with_wikidata(unesco_dataset),
            'with_digital_platform': count_with_platforms(unesco_dataset),
            'with_geocoded_location': count_with_geocoding(unesco_dataset)
        },
        'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
        'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
    }
```

2. **Visualization**
   - Generate maps showing the UNESCO site distribution
   - Bar charts: institutions by country, by type
   - Heatmap: data completeness by field
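The chart inputs for the visualization step can be derived with a small aggregation helper. This is a sketch: `count_by_type` and `plot_by_type` are illustrative names, and the plotting call assumes `matplotlib` is available.

```python
# Illustrative helpers for the "institutions by type" bar chart.
from collections import Counter

def count_by_type(records):
    """Aggregate institution_type values for a bar chart."""
    return Counter(r["institution_type"] for r in records)

def plot_by_type(records, out_path="data/visualizations/by_type.png"):
    import matplotlib.pyplot as plt  # optional dependency, imported lazily
    counts = count_by_type(records)
    plt.bar(list(counts), list(counts.values()))
    plt.title("Institutions by type")
    plt.savefig(out_path)
```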
3. **Documentation**
   - Write an executive summary of data quality
   - Document known issues and limitations
   - Provide recommendations for improvement

**Deliverables**:
- `scripts/generate_unesco_quality_report.py`
- `data/unesco_quality_report.json` - Statistics
- `docs/unesco-data-quality.md` - Quality report document
- `data/visualizations/` - Maps and charts

**Success Criteria**:
- [ ] Quality report shows 90%+ completeness for core fields
- [ ] < 5% of records require manual review
- [ ] Geographic coverage across all inhabited continents

---

## Phase 4: Integration & Enrichment (Days 20-24)

### Objectives

- Enrich UNESCO data with Wikidata identifiers
- Merge the UNESCO dataset with the existing GLAM dataset
- Resolve GHCID collisions
- Update GHCID history for modified records

### Day 20-21: Wikidata Enrichment

**Tasks**:

1. **SPARQL Query for UNESCO Sites**

```python
# scripts/enrich_unesco_with_wikidata.py
def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
    """Find the Wikidata Q-number for a UNESCO World Heritage Site."""
    query = f"""
    SELECT ?item WHERE {{
        ?item wdt:P757 "{whc_id}" .  # P757 = UNESCO World Heritage Site ID
    }}
    LIMIT 1
    """
    results = sparql_query(query)
    if results:
        return extract_qid(results[0]['item']['value'])
    return None
```

2. **Batch Enrichment**
   - Query Wikidata for all 1,000+ UNESCO sites
   - Extract Q-numbers, VIAF IDs, and ISIL codes (where available)
   - Add them to the `identifiers` array in the LinkML instances
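The batch loop implied by the enrichment step might look like the sketch below. `enrich_batch`, the dict-shaped records, and the one-second courtesy pause between SPARQL calls are assumptions; `lookup` stands in for `query_wikidata_for_unesco_site`.

```python
# Sketch of the batch-enrichment loop; names and pacing are assumptions.
import time

def enrich_batch(custodians, lookup, pause_seconds=1.0):
    """Attach Wikidata Q-numbers to records that carry a UNESCO WHC ID."""
    enriched = 0
    for custodian in custodians:
        whc_id = custodian.get("unesco_whc_id")
        if not whc_id:
            continue  # no WHC ID: left for the fuzzy-matching fallback
        qid = lookup(whc_id)
        if qid:
            custodian.setdefault("identifiers", []).append(
                {"identifier_scheme": "Wikidata", "identifier_value": qid}
            )
            enriched += 1
        time.sleep(pause_seconds)
    return enriched
```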
3. **Fuzzy Matching Fallback**
   - If the WHC ID is not found in Wikidata, try name + location matching
   - Use the same fuzzy matching logic as the existing enrichment scripts
   - Threshold: 0.85 similarity score

**Deliverables**:
- `scripts/enrich_unesco_with_wikidata.py` - Enrichment script
- `data/unesco_enriched/` - Enriched YAML instances
- `logs/wikidata_enrichment.log` - Enrichment log

**Success Criteria**:
- [ ] Find Wikidata Q-numbers for 80%+ of UNESCO sites
- [ ] Add VIAF/ISIL identifiers where available
- [ ] Document enrichment in provenance metadata

---

### Day 22: Dataset Merge

**Tasks**:

1. **Merge Strategy**

```python
# scripts/merge_unesco_into_glam_dataset.py
def merge_datasets(
    unesco_data: List[HeritageCustodian],
    existing_data: List[HeritageCustodian]
) -> List[HeritageCustodian]:
    """Merge UNESCO data into the existing GLAM dataset."""
    merged = existing_data.copy()

    for unesco_record in unesco_data:
        # Check whether the institution already exists
        match = find_match(unesco_record, existing_data)

        if match:
            # Merge the records
            merged_record = merge_unesco_with_existing(unesco_record, match)
            merged[merged.index(match)] = merged_record
        else:
            # Add as a new institution
            merged.append(unesco_record)

    return merged
```

2. **Deduplication**
   - Detect duplicates by GHCID, Wikidata Q-number, and ISIL code
   - Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
   - Preserve alternative names and identifiers from both sources

3. **Provenance Tracking**
   - Update `provenance.notes` for merged records
   - Record the merge timestamp
   - Link back to the original extraction sources

**Deliverables**:
- `scripts/merge_unesco_into_glam_dataset.py` - Merge script
- `data/merged_glam_dataset/` - Merged dataset (YAML files)
- `data/merge_report.json` - Merge statistics

**Success Criteria**:
- [ ] Merge 1,000+ UNESCO records with the existing GLAM dataset
- [ ] Deduplicate matches (no duplicate GHCIDs)
- [ ] Preserve data from all sources (no information loss)

---

### Day 23: GHCID Collision Resolution

**Tasks**:
1. **Detect Collisions**

```python
# scripts/resolve_ghcid_collisions.py
from collections import defaultdict

def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
    """Find institutions with identical base GHCIDs."""
    ghcid_map = defaultdict(list)

    for custodian in dataset:
        base_ghcid = remove_q_number(custodian.ghcid)
        ghcid_map[base_ghcid].append(custodian)

    collisions = [
        Collision(base_ghcid=k, institutions=v)
        for k, v in ghcid_map.items()
        if len(v) > 1
    ]
    return collisions
```

2. **Apply Temporal Priority Rules**
   - Compare `provenance.extraction_date` for colliding institutions
   - First batch (same date): ALL get Q-numbers
   - Historical addition (later date): ONLY the new record gets a name suffix

3. **Update GHCID History**

```python
from datetime import datetime, timezone

def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
    """Record a GHCID change in the history."""
    custodian.ghcid_history.append(GHCIDHistoryEntry(
        ghcid=new_ghcid,
        ghcid_numeric=generate_numeric_id(new_ghcid),
        valid_from=datetime.now(timezone.utc).isoformat(),
        valid_to=None,
        reason=f"Name suffix added to resolve collision with {old_ghcid}"
    ))
```

**Deliverables**:
- `scripts/resolve_ghcid_collisions.py` - Collision resolution script
- `data/ghcid_collision_report.json` - Detected collisions
- Updated YAML instances with `ghcid_history` entries

**Success Criteria**:
- [ ] Resolve all GHCID collisions (zero duplicates)
- [ ] Update GHCID history for affected records
- [ ] Preserve PID stability (no changes to published GHCIDs)

---

### Day 24: Final Data Validation

**Tasks**:

1. **Full Dataset Validation**
   - Run LinkML validation on the merged dataset
   - Check for orphaned references (invalid foreign keys)
   - Verify that all GHCIDs are unique

2. **Integrity Checks**

```python
# tests/integration/test_merged_dataset_integrity.py
def test_no_duplicate_ghcids():
    dataset = load_merged_dataset()
    ghcids = [c.ghcid for c in dataset]
    assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"
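# An additional check in the same style could guard the GHCID shape itself.
# The regex below is an assumed simplification of the Day 8 format, not the
# project's canonical validator.
def test_all_ghcids_have_country_prefix():
    import re
    ghcid_pattern = re.compile(r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[A-Z]-')
    dataset = load_merged_dataset()
    for custodian in dataset:
        assert ghcid_pattern.match(custodian.ghcid), f"Malformed GHCID: {custodian.ghcid}"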
def test_all_unesco_sites_have_whc_id():
    dataset = load_merged_dataset()
    unesco_records = [c for c in dataset
                      if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]
    for record in unesco_records:
        whc_ids = [i for i in record.identifiers
                   if i.identifier_scheme == "UNESCO_WHC"]
        assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
```

3. **Coverage Analysis**
   - Verify UNESCO sites across all continents
   - Check the institution type distribution (not all MUSEUM)
   - Ensure Dutch institutions are properly merged with the ISIL registry

**Deliverables**:
- `tests/integration/test_merged_dataset_integrity.py` - Integrity tests
- `data/final_validation_report.json` - Validation results
- `docs/dataset-coverage.md` - Coverage analysis

**Success Criteria**:
- [ ] 100% passing integrity tests
- [ ] Zero duplicate GHCIDs
- [ ] UNESCO sites cover 100+ countries

---

## Phase 5: Export & Documentation (Days 25-30)

### Objectives

- Export the merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
- Generate user documentation and API docs
- Create example queries and use-case tutorials
- Publish the dataset with persistent identifiers

### Day 25-26: RDF/JSON-LD Export

**Tasks**:
**RDF Serialization**

```python
# src/glam_extractor/exporters/rdf_exporter.py
from typing import List

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to RDF/Turtle format."""
    graph = Graph()

    # Define namespaces
    GLAM = Namespace("https://w3id.org/heritage/custodian/")
    SCHEMA = Namespace("http://schema.org/")
    graph.bind("glam", GLAM)
    graph.bind("schema", SCHEMA)
    graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))

    for custodian in dataset:
        uri = URIRef(custodian.id)

        # Type assertions
        graph.add((uri, RDF.type, GLAM.HeritageCustodian))
        if custodian.institution_type == "MUSEUM":
            graph.add((uri, RDF.type, SCHEMA.Museum))

        # Literals
        graph.add((uri, SCHEMA.name, Literal(custodian.name)))
        graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))

        # Identifiers (owl:sameAs)
        for identifier in custodian.identifiers:
            if identifier.identifier_scheme == "Wikidata":
                graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))

    graph.serialize(destination=output_path, format="turtle")
```

2. **JSON-LD Context**

   Saved as `data/context/heritage_custodian_context.jsonld` (JSON does not allow inline comments, so the path lives here rather than in the file):

```json
{
  "@context": {
    "@vocab": "https://w3id.org/heritage/custodian/",
    "schema": "http://schema.org/",
    "name": "schema:name",
    "location": "schema:location",
    "identifiers": "schema:identifier",
    "institution_type": "institutionType",
    "data_source": "dataSource"
  }
}
```

3. **Content Negotiation Setup**
   - Configure w3id.org redirects (if hosting on GitHub Pages)
   - Test URI resolution for sample institutions
   - Ensure Accept header routing (`text/turtle`, `application/ld+json`)

**Deliverables**:
- `src/glam_extractor/exporters/rdf_exporter.py` - RDF exporter
- `data/exports/glam_dataset.ttl` - RDF/Turtle export
- `data/exports/glam_dataset.jsonld` - JSON-LD export
- `data/context/heritage_custodian_context.jsonld` - JSON-LD context

**Success Criteria**:
- [ ] RDF validates with Turtle parser
- [ ] JSON-LD validates with JSON-LD Playground
- [ ] Sample URIs resolve correctly

---

### Day 27: CSV/Parquet Export

**Tasks**:

1.
**Flatten Schema for CSV**

```python
# src/glam_extractor/exporters/csv_exporter.py
from typing import List

import pandas as pd

def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to CSV with flattened structure."""
    rows = []
    for custodian in dataset:
        row = {
            'ghcid': custodian.ghcid,
            'ghcid_uuid': str(custodian.ghcid_uuid),
            'name': custodian.name,
            'institution_type': custodian.institution_type,
            'country': custodian.locations[0].country if custodian.locations else None,
            'city': custodian.locations[0].city if custodian.locations else None,
            'wikidata_id': get_identifier(custodian, 'Wikidata'),
            'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
            'data_source': custodian.provenance.data_source,
            'data_tier': custodian.provenance.data_tier,
            'confidence_score': custodian.provenance.confidence_score
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    # utf-8-sig writes a BOM so Excel detects the encoding correctly
    df.to_csv(output_path, index=False, encoding='utf-8-sig')
```

2. **Parquet Export (Columnar)**

```python
def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to Parquet for efficient querying."""
    df = pd.DataFrame([custodian.dict() for custodian in dataset])
    df.to_parquet(output_path, engine='pyarrow', compression='snappy')
```

3. **SQLite Export**

```python
import sqlite3

def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
    """Export dataset to SQLite database."""
    conn = sqlite3.connect(db_path)

    # Create tables (remaining columns elided)
    conn.execute("""
        CREATE TABLE heritage_custodians (
            ghcid TEXT PRIMARY KEY,
            ghcid_uuid TEXT UNIQUE,
            name TEXT NOT NULL,
            institution_type TEXT,
            data_source TEXT,
            ...
        )
    """)

    # Insert records (placeholder list elided)
    for custodian in dataset:
        conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)
    conn.commit()
    conn.close()
```

**Deliverables**:
- `src/glam_extractor/exporters/csv_exporter.py` - CSV exporter
- `data/exports/glam_dataset.csv` - CSV export
- `data/exports/glam_dataset.parquet` - Parquet export
- `data/exports/glam_dataset.db` - SQLite database

**Success Criteria**:
- [ ] CSV opens correctly in Excel and Google Sheets
- [ ] Parquet loads in pandas and DuckDB
- [ ] SQLite database queryable with SQL

---

### Day 28: Documentation - User Guide

**Tasks**:

1. **Getting Started Guide**

````markdown
# docs/user-guide/getting-started.md

## Installation

pip install glam-extractor

## Quick Start

```python
from glam_extractor import load_dataset

# Load the GLAM dataset
dataset = load_dataset("data/exports/glam_dataset.parquet")

# Filter UNESCO museums in France
museums = dataset.filter(
    institution_type="MUSEUM",
    data_source="UNESCO_WORLD_HERITAGE",
    country="FR"
)
```
````

2. **Example Queries**
   - SPARQL examples (find institutions by type, country)
   - Pandas examples (data analysis, statistics)
   - SQL examples (SQLite queries)

3. **API Reference**
   - Document all public classes and methods
   - Provide code examples for each function
   - Link to LinkML schema documentation

**Deliverables**:
- `docs/user-guide/getting-started.md` - Quick start guide
- `docs/user-guide/example-queries.md` - Query examples
- `docs/api-reference.md` - API documentation

**Success Criteria**:
- [ ] Complete documentation for all public APIs
- [ ] 10+ example queries covering common use cases
- [ ] Step-by-step tutorials for data consumers

---

### Day 29: Documentation - Developer Guide

**Tasks**:

1. **Architecture Overview**
   - Diagram of extraction pipeline (API → Parser → Validator → Exporter)
   - Explanation of LinkML Map transformation
   - GHCID generation algorithm

2. **Contributing Guide**
   - How to add new institution type classifiers
   - How to extend LinkML schema with new fields
   - How to add new export formats

3.
**Testing Guide** - Running unit tests, integration tests - Creating new test fixtures - Using property-based testing **Deliverables**: - `docs/developer-guide/architecture.md` - Architecture docs - `docs/developer-guide/contributing.md` - Contribution guide - `docs/developer-guide/testing.md` - Testing guide **Success Criteria**: - [ ] Complete architecture documentation with diagrams - [ ] Clear instructions for extending the system - [ ] Comprehensive testing guide --- ### Day 30: Release & Publication **Tasks**: 1. **Dataset Release** - Tag repository with version number (e.g., v1.0.0-unesco) - Create GitHub Release with exports attached - Publish to Zenodo for DOI (persistent citation) 2. **Announcement** - Write blog post announcing UNESCO data release - Share on social media (Twitter, Mastodon, LinkedIn) - Notify stakeholders (Europeana, DPLA, heritage researchers) 3. **Data Portal Update** - Update w3id.org redirects for new institutions - Deploy SPARQL endpoint (if applicable) - Update REST API to include UNESCO data **Deliverables**: - GitHub Release with dataset exports - Zenodo DOI for citation - Blog post and announcement **Success Criteria**: - [ ] Dataset published with persistent DOI - [ ] Documentation live and accessible - [ ] Stakeholders notified of release --- ## Risk Mitigation ### Technical Risks | Risk | Probability | Impact | Mitigation | |------|-------------|--------|------------| | UNESCO API changes format | Low | High | Cache all responses, version API client | | LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) | | GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented | | Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching | ### Resource Risks | Risk | Probability | Impact | Mitigation | |------|-------------|--------|------------| | Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports | | 
Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 | | Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) | ### Data Quality Risks | Risk | Probability | Impact | Mitigation | |------|-------------|--------|------------| | Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases | | Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later | | Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins | --- ## Success Metrics ### Quantitative Metrics - **Coverage**: Extract 1,000+ UNESCO site institutions - **Quality**: 90%+ confidence score average - **Completeness**: 80%+ have Wikidata Q-numbers - **Performance**: Process all sites in < 2 hours - **Test Coverage**: 90%+ code coverage ### Qualitative Metrics - **Usability**: Positive feedback from 3+ data consumers - **Documentation**: Complete user guide and API docs - **Maintainability**: Code passes linter, type checker - **Reproducibility**: Dataset generation fully automated --- ## Appendix: Day-by-Day Checklist ### Phase 1 (Days 1-5) - [ ] Day 1: UNESCO API documentation reviewed, 50 sites fetched - [ ] Day 2: LinkML Map schema (part 1) - basic transformations - [ ] Day 3: LinkML Map schema (part 2) - advanced patterns - [ ] Day 4: Golden dataset created (20 test fixtures) - [ ] Day 5: Institution type classifier designed ### Phase 2 (Days 6-13) - [ ] Day 6: UNESCO API client implemented - [ ] Day 7: LinkML instance generator implemented - [ ] Day 8: GHCID generator extended for UNESCO - [ ] Day 9-10: Batch processing pipeline - [ ] Day 11-12: Integration testing - [ ] Day 13: Code review and refactoring ### Phase 3 (Days 14-19) - [ ] Day 14-15: Cross-referencing with existing data - [ ] Day 16: Conflict resolution - [ ] Day 17: Confidence scoring system - [ ] Day 18: LinkML schema validation - [ ] Day 19: Data quality report ### Phase 4 (Days 20-24) - [ ] 
Day 20-21: Wikidata enrichment - [ ] Day 22: Dataset merge - [ ] Day 23: GHCID collision resolution - [ ] Day 24: Final data validation ### Phase 5 (Days 25-30) - [ ] Day 25-26: RDF/JSON-LD export - [ ] Day 27: CSV/Parquet/SQLite export - [ ] Day 28: User guide documentation - [ ] Day 29: Developer guide documentation - [ ] Day 30: Release and publication --- **Document Status**: Complete **Next Document**: `04-tdd-strategy.md` - Test-driven development plan **Version**: 1.1 --- ## Version History ### Version 1.1 (2025-11-10) **Changes**: Updated for OpenDataSoft Explore API v2.0 migration - **Day 1 API Reconnaissance** (lines 44-62): Updated API endpoint from legacy `whc.unesco.org/en/list/json` to OpenDataSoft `data.unesco.org/api/explore/v2.0` - **Day 6 UNESCO API Client** (lines 265-305): - Updated `base_url` to OpenDataSoft API endpoint - Removed `api_key` parameter (public dataset, no authentication) - Added pagination parameters to `fetch_site_list()`: `limit`, `offset` - Updated method documentation to reflect OpenDataSoft response structure: `{"record": {"fields": {...}}}` - **Day 7 LinkML Parser** (lines 309-352): - Updated `parse_unesco_site()` to extract from nested `api_response['record']['fields']` - Added documentation clarifying OpenDataSoft structure - Updated `extract_multilingual_names()` parameter name from `unesco_data` to `site_data` - **Day 10 Batch Processing** (lines 406-450): - Updated `extract_all_unesco_sites()` with pagination loop for OpenDataSoft API - Updated `process_unesco_site()` to handle nested record structure - Changed field access from `site_data['id_number']` to `site_record['record']['fields']['unique_number']` - **Day 11-12 Integration Tests** (lines 454-483): - Updated `test_full_unesco_extraction_pipeline()` to extract `site_data` from `response['record']['fields']` - Added explicit documentation of OpenDataSoft API structure **Rationale**: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API 
with pagination, ODSQL filtering, and better data quality. ### Version 1.0 (2025-11-09) **Initial Release** - Comprehensive 30-day implementation plan for UNESCO data extraction - Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export - TDD approach with golden dataset and integration tests - GHCID generation strategy for UNESCO heritage sites - Wikidata enrichment and cross-referencing plan