glam/docs/plan/unesco/03-implementation-phases.md

UNESCO Data Extraction - Implementation Phases

Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 03 - Implementation Phases
Version: 1.1
Date: 2025-11-10
Status: Draft


Executive Summary

This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.

Total Effort: 30 working days (6 weeks)
Team Size: 1-2 developers + AI agents
Target Output: 1,000+ heritage custodian records from UNESCO sites


Phase Overview

Phase     Duration   Focus                             Key Deliverables
Phase 1   5 days     API Exploration & Schema Design   UNESCO API parser, LinkML Map schema
Phase 2   8 days     Extractor Implementation          Institution type classifier, GHCID generator
Phase 3   6 days     Data Quality & Validation         LinkML validator, conflict resolver
Phase 4   5 days     Integration & Enrichment          Wikidata enrichment, dataset merge
Phase 5   6 days     Export & Documentation            RDF/JSON-LD exporters, user docs

Phase 1: API Exploration & Schema Design (Days 1-5)

Objectives

  • Understand UNESCO DataHub API structure and data quality
  • Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
  • Create test fixtures from real UNESCO API responses
  • Establish baseline for institution type classification

Day 1: UNESCO API Reconnaissance

Tasks:

  1. API Documentation Review

    • Study UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
    • Identify available endpoints (dataset whc001 - World Heritage List)
    • Document authentication requirements (none - public dataset)
    • Document pagination limits (max 100 records per request, use offset parameter)
    • Test API responses for sample sites
  2. Data Structure Analysis

    # Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
    curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
      > samples/unesco_site_list.json
    
    # Fetch specific site by unique_number using ODSQL
    curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
      > samples/unesco_angkor_detail.json
    
    # Response structure: nested record.fields with coordinates object
    # {
    #   "total_count": 1248,
    #   "records": [{
    #     "record": {
    #       "fields": {
    #         "name_en": "Angkor",
    #         "unique_number": 668,
    #         "coordinates": {"lat": 13.4333, "lon": 103.8333}
    #       }
    #     }
    #   }]
    # }
    
  3. Schema Mapping

    • Map OpenDataSoft record.fields to LinkML HeritageCustodian slots
    • Identify missing fields (require inference or external enrichment)
    • Document ambiguities (e.g., when is a site also a museum?)
    • Handle nested response structure (response['records'][i]['record']['fields'])

Deliverables:

  • docs/unesco-api-analysis.md - API structure documentation
  • tests/fixtures/unesco_api_responses/ - 10+ sample JSON files
  • docs/unesco-to-linkml-mapping.md - Field mapping table

Success Criteria:

  • Successfully fetch data for 50 UNESCO sites via API
  • Document all relevant JSON fields for extraction
  • Identify 3+ institution type classification patterns

Day 2: LinkML Map Schema Design (Part 1)

Tasks:

  1. Install LinkML Map Extension

    pip install linkml-map
    # OR implement custom extension: src/glam_extractor/mappers/extended_map.py
    
  2. Design Transformation Rules

    • Create schemas/maps/unesco_to_heritage_custodian.yaml
    • Define JSONPath expressions for field extraction
    • Handle multi-language names (UNESCO provides English, French, often local language)
    • Map UNESCO categories to InstitutionTypeEnum
  3. Conditional Extraction Logic

    # Example LinkML Map rule
    mappings:
      - source_path: $.category
        target_path: institution_type
        transform:
          type: conditional
          rules:
            - condition: "contains(description, 'museum')"
              value: MUSEUM
            - condition: "contains(description, 'library')"
              value: LIBRARY
            - condition: "contains(description, 'archive')"
              value: ARCHIVE
            - default: MIXED
    

Deliverables:

  • schemas/maps/unesco_to_heritage_custodian.yaml (initial version)
  • tests/test_unesco_linkml_map.py - Unit tests for mapping rules

Success Criteria:

  • LinkML Map schema validates against sample UNESCO JSON
  • Successfully extract name, location, UNESCO WHC ID from 10 fixtures
  • Handle multilingual names without data loss

Day 3: LinkML Map Schema Design (Part 2)

Tasks:

  1. Advanced Transformation Rules

    • Regex extraction for identifiers (UNESCO WHC ID format: ^\d{3,4}$)
    • GeoNames ID lookup from UNESCO location strings
    • Wikidata Q-number extraction from UNESCO external links
  2. Multi-value Array Handling

    # Extract all languages from UNESCO site names
    mappings:
      - source_path: $.names[*]
        target_path: alternative_names
        transform:
          type: array
          element_transform:
            type: template
            template: "{name}@{lang}"
    
  3. Error Handling Patterns

    • Missing required fields → skip record with warning
    • Invalid coordinates → flag for manual geocoding
    • Unknown institution type → default to MIXED, log for review
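
The three patterns above can be sketched as a single pre-validation step. This is an illustrative sketch only: `apply_error_policy` and the `needs_geocoding` flag are hypothetical names, and the field names follow the Day 1 sample response.

```python
import logging
from typing import Optional

log = logging.getLogger("unesco")

def apply_error_policy(fields: dict) -> Optional[dict]:
    """Apply the three error-handling patterns to one raw record's fields.

    Returns a cleaned field dict, or None when the record must be skipped.
    """
    # Missing required fields -> skip record with warning
    if not fields.get("name_en") or fields.get("unique_number") is None:
        log.warning("Skipping record, missing name_en/unique_number: %r", fields)
        return None

    cleaned = dict(fields)

    # Invalid coordinates -> flag for manual geocoding
    coords = fields.get("coordinates") or {}
    lat, lon = coords.get("lat"), coords.get("lon")
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        cleaned["coordinates"] = None
        cleaned["needs_geocoding"] = True

    # Unknown institution type -> default to MIXED, log for review
    if fields.get("institution_type") not in {"MUSEUM", "LIBRARY", "ARCHIVE"}:
        cleaned["institution_type"] = "MIXED"
        log.info("Defaulted type to MIXED for %s", fields.get("unique_number"))

    return cleaned
```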

Deliverables:

  • schemas/maps/unesco_to_heritage_custodian.yaml (complete)
  • docs/linkml-map-extension-spec.md - Custom extension specification

Success Criteria:

  • Extract ALL relevant fields from 10 diverse UNESCO sites
  • Handle edge cases (missing data, malformed coordinates)
  • Generate valid LinkML instances from real API responses

Day 4: Test Fixture Creation

Tasks:

  1. Curate Representative Samples

    • Select 20 UNESCO sites covering:
      • All continents (Europe, Asia, Africa, Americas, Oceania)
      • Multiple institution types (museums, libraries, archives, botanical gardens)
      • Edge cases (serial nominations, transboundary sites)
  2. Create Expected Outputs

    # tests/fixtures/expected_outputs/unesco_louvre.yaml
    - id: https://w3id.org/heritage/custodian/fr/louvre
      name: Musée du Louvre
      institution_type: MUSEUM
      locations:
        - city: Paris
          country: FR
          coordinates: [48.8606, 2.3376]
      identifiers:
        - identifier_scheme: UNESCO_WHC
          identifier_value: "600"
          identifier_url: https://whc.unesco.org/en/list/600
      provenance:
        data_source: UNESCO_WORLD_HERITAGE
        data_tier: TIER_1_AUTHORITATIVE
    
  3. Golden Dataset Construction

    • Manually verify 20 expected outputs against authoritative sources
    • Document any assumptions or inferences made

Deliverables:

  • tests/fixtures/unesco_api_responses/ - 20 JSON files
  • tests/fixtures/expected_outputs/ - 20 YAML files
  • tests/test_unesco_golden_dataset.py - Integration tests

Success Criteria:

  • 20 golden dataset pairs (input JSON + expected YAML)
  • 100% passing tests for golden dataset
  • Documented edge cases and classification rules

Day 5: Institution Type Classifier Design

Tasks:

  1. Pattern Analysis

    • Analyze UNESCO descriptions for GLAM-related keywords
    • Create decision tree for institution type classification
    • Handle ambiguous cases (e.g., "archaeological park with museum")
  2. Keyword Extraction

    # src/glam_extractor/classifiers/unesco_institution_type.py
    
    MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
    LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
    ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
    BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
    HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
    FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
    
  3. Confidence Scoring

    • High confidence (0.9+): Explicit mentions of "museum" or "library"
    • Medium confidence (0.7-0.9): Inferred from UNESCO category + keywords
    • Low confidence (0.5-0.7): Ambiguous, default to MIXED, flag for review
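
A minimal version of the keyword classifier with the confidence bands above; a sketch only, with trimmed keyword lists — the real classifier in unesco_institution_type.py would use the full lists plus UNESCO category data.

```python
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'gallery']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca']
ARCHIVE_KEYWORDS = ['archive', 'archivo', 'documentary heritage']

def classify_with_confidence(description: str) -> tuple:
    """Return (institution_type, confidence) for a site description."""
    text = description.lower()
    hits = {
        'MUSEUM': sum(kw in text for kw in MUSEUM_KEYWORDS),
        'LIBRARY': sum(kw in text for kw in LIBRARY_KEYWORDS),
        'ARCHIVE': sum(kw in text for kw in ARCHIVE_KEYWORDS),
    }
    best_type, best_hits = max(hits.items(), key=lambda kv: kv[1])
    total = sum(hits.values())
    if best_hits == 0:
        return 'MIXED', 0.5                  # ambiguous: flag for review
    if total > best_hits:
        return best_type, 0.7                # competing keywords: medium confidence
    return best_type, min(0.9 + 0.02 * best_hits, 1.0)  # explicit mention: high
```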

Deliverables:

  • src/glam_extractor/classifiers/unesco_institution_type.py
  • tests/test_unesco_classifier.py - 50+ test cases
  • docs/unesco-classification-rules.md - Decision tree documentation

Success Criteria:

  • Classifier achieves 90%+ accuracy on golden dataset
  • Low-confidence classifications flagged for manual review
  • Handle multilingual descriptions (English, French, Spanish, etc.)

Phase 2: Extractor Implementation (Days 6-13)

Objectives

  • Implement UNESCO API client with caching and rate limiting
  • Build LinkML instance generator using Map schema
  • Create GHCID generator for UNESCO institutions
  • Achieve 100% test coverage for core extraction logic

Day 6: UNESCO API Client

Tasks:

  1. HTTP Client Implementation

    # src/glam_extractor/parsers/unesco_api_client.py
    
    class UNESCOAPIClient:
        def __init__(self, cache_ttl: int = 86400):
            self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
            self.cache = Cache(ttl=cache_ttl)
            self.rate_limiter = RateLimiter(requests_per_second=2)
    
        def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
            """
            Fetch paginated list of UNESCO World Heritage Sites.
    
            Returns OpenDataSoft API response with structure:
            {
                "total_count": int,
                "records": [{"record": {"id": str, "fields": {...}}}, ...]
            }
            """
            ...
    
        def fetch_site_details(self, whc_id: int) -> Dict:
            """
            Fetch detailed information for a specific site using ODSQL query.
    
            Returns single record with structure:
            {"record": {"id": str, "fields": {field_name: value, ...}}}
            """
            ...
    
  2. Caching Strategy

    • Cache API responses for 24 hours (UNESCO updates infrequently)
    • Store in SQLite database: cache/unesco_api_cache.db
    • Invalidate cache on demand for data refreshes
  3. Error Handling

    • Network errors → retry with exponential backoff
    • 404 Not Found → skip site, log warning
    • Rate limit exceeded → pause and retry
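
Both the 24-hour cache and the backoff policy can be prototyped in a few lines. This is a sketch: `Cache` is a stand-in for whatever backs `cache/unesco_api_cache.db`, and `fetch_with_backoff` wraps an arbitrary fetch callable.

```python
import json
import sqlite3
import time

class Cache:
    """Minimal SQLite-backed response cache with a TTL."""
    def __init__(self, path=":memory:", ttl=86400):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)")

    def get(self, url):
        row = self.db.execute(
            "SELECT body, fetched_at FROM responses WHERE url = ?", (url,)).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])
        return None  # miss or expired

    def put(self, url, body):
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, json.dumps(body), time.time()))
        self.db.commit()

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Retry a fetch callable with exponential backoff on network errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```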

Deliverables:

  • src/glam_extractor/parsers/unesco_api_client.py
  • tests/test_unesco_api_client.py - Mock API tests
  • cache/unesco_api_cache.db - SQLite cache database

Success Criteria:

  • Successfully fetch all UNESCO sites (1,000+ sites)
  • Handle API errors gracefully (no crashes)
  • Cache reduces API calls by 95% on repeat runs

Day 7: LinkML Instance Generator

Tasks:

  1. Apply LinkML Map Transformations

    # src/glam_extractor/parsers/unesco_parser.py
    
    from linkml_map import Mapper
    
    def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
        """
        Parse OpenDataSoft API response to HeritageCustodian instance.
    
        Args:
            api_response: OpenDataSoft record with nested structure:
                {"record": {"id": str, "fields": {field_name: value, ...}}}
    
        Returns:
            HeritageCustodian: Validated LinkML instance
        """
        # Extract fields from nested structure
        site_data = api_response['record']['fields']
    
        mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
        instance = mapper.transform(site_data)
        return HeritageCustodian(**instance)
    
  2. Validation Pipeline

    • Apply LinkML schema validation after transformation
    • Catch validation errors, log details
    • Skip invalid records, continue processing
  3. Multi-language Name Handling

    def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
        """
        Extract primary name and alternative names in multiple languages.
    
        Args:
            site_data: Extracted fields from OpenDataSoft record['fields']
        """
        primary_name = site_data.get('site', '')
        alternative_names = []
    
        # OpenDataSoft may provide language variants in separate fields
        # or as structured data - adjust based on actual API response
        for lang_data in site_data.get('names', []):
            name = lang_data.get('name', '')
            lang = lang_data.get('lang', 'en')
            if name and name != primary_name:
                alternative_names.append(f"{name}@{lang}")
    
        return primary_name, alternative_names
    

Deliverables:

  • src/glam_extractor/parsers/unesco_parser.py
  • tests/test_unesco_parser.py - 20 golden dataset tests

Success Criteria:

  • Parse 20 golden dataset fixtures with 100% accuracy
  • Extract multilingual names without data loss
  • Generate valid LinkML instances (pass schema validation)

Day 8: GHCID Generator for UNESCO Sites

Tasks:

  1. Extend GHCID Logic for UNESCO

    # src/glam_extractor/identifiers/ghcid_generator.py
    
    def generate_ghcid_for_unesco_site(
        country_code: str,
        region_code: str,
        city_code: str,
        institution_type: InstitutionTypeEnum,
        institution_name: str,
        has_collision: bool = False
    ) -> str:
        """
        Generate GHCID for UNESCO World Heritage Site institution.
    
        Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
        Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)
    
        Note: Collision suffix uses native language institution name in snake_case,
        NOT Wikidata Q-numbers. See docs/plan/global_glam/07-ghcid-collision-resolution.md
        """
        ...
    
  2. City Code Lookup

    • Use GeoNames API to convert city names to UN/LOCODE
    • Fallback to 3-letter abbreviation if UN/LOCODE not found
    • Cache lookups to minimize API calls
  3. Collision Detection

    • Check existing GHCID dataset for collisions
    • Apply temporal priority rules (first batch vs. historical addition)
    • Append native language name suffix if collision detected
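
The lookup-then-fallback rule for the city segment might look like this (a sketch; `city_code` is a hypothetical helper, and the real implementation would first call GeoNames to obtain the UN/LOCODE):

```python
import re

def city_code(city_name: str, unlocode: str = None) -> str:
    """Derive the 3-letter city segment of a GHCID.

    Prefers the city part of a UN/LOCODE (e.g. "FR PAR" -> "PAR");
    falls back to a 3-letter abbreviation of the city name.
    """
    if unlocode:
        # UN/LOCODE is "<country> <place>"; keep the 3-char place part
        # (place codes use A-Z and the digits 2-9)
        m = re.fullmatch(r"[A-Z]{2}\s?([A-Z2-9]{3})", unlocode.strip())
        if m:
            return m.group(1)
    # Fallback: first three letters of the city name, padded if too short
    letters = re.sub(r"[^A-Za-z]", "", city_name)
    return letters[:3].upper().ljust(3, "X")
```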

Deliverables:

  • Extended src/glam_extractor/identifiers/ghcid_generator.py
  • tests/test_ghcid_unesco.py - GHCID generation tests

Success Criteria:

  • Generate valid GHCIDs for 20 golden dataset institutions
  • No collisions with existing Dutch ISIL dataset
  • Handle missing Wikidata Q-numbers gracefully

Day 9-10: Batch Processing Pipeline

Tasks:

  1. Parallel Processing

    # scripts/extract_unesco_sites.py

    from concurrent.futures import ThreadPoolExecutor

    # Module-level client so the worker threads in process_unesco_site
    # share one cache and one rate limiter
    api_client = UNESCOAPIClient()

    def extract_all_unesco_sites(max_workers: int = 4):
        # Fetch paginated site list from OpenDataSoft API
        all_sites = []
        offset = 0
        limit = 100

        while True:
            response = api_client.fetch_site_list(limit=limit, offset=offset)
            all_sites.extend(response['records'])

            if len(all_sites) >= response['total_count']:
                break
            offset += limit

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = executor.map(process_unesco_site, all_sites)

        return list(results)

    def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
        """
        Process single OpenDataSoft record.

        Args:
            site_record: {"record": {"id": str, "fields": {...}}}
        """
        # Extract unique_number before the try block so the except handler
        # can always reference it in the log message
        whc_id = site_record.get('record', {}).get('fields', {}).get('unique_number')
        try:
            if whc_id is None:
                raise ValueError("record has no unique_number field")

            # Fetch full details if needed (or use site_record directly)
            details = api_client.fetch_site_details(whc_id)
            institution_type = classify_institution_type(details['record']['fields'])
            custodian = parse_unesco_site(details)
            custodian.ghcid = generate_ghcid(custodian)
            return custodian
        except Exception as e:
            log.error(f"Failed to process site {whc_id}: {e}")
            return None
    
  2. Progress Tracking

    • Use tqdm for progress bars
    • Log successful extractions to logs/unesco_extraction.log
    • Save intermediate results every 100 sites
  3. Error Recovery

    • Resume from last checkpoint if script crashes
    • Separate successful extractions from failed ones
    • Generate error report with failed site IDs
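
The checkpoint/resume behaviour above can be reduced to three small helpers (a sketch; the helper names and JSON checkpoint format are illustrative):

```python
import json
from pathlib import Path

def load_done_ids(path: Path) -> set:
    """IDs of sites already written out by a previous (possibly crashed) run."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def save_done_ids(done: set, path: Path) -> None:
    """Persist the checkpoint; called every 100 sites by the batch script."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(done)))

def resume_filter(all_ids, done):
    """Sites still to process after resuming from the checkpoint."""
    return [i for i in all_ids if i not in done]
```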

Deliverables:

  • scripts/extract_unesco_sites.py - Batch extraction script
  • data/unesco_extracted/ - Output directory for YAML instances
  • logs/unesco_extraction.log - Extraction log

Success Criteria:

  • Process all 1,000+ UNESCO sites in < 2 hours
  • < 5% failure rate (API errors, missing data)
  • Successful extractions saved as valid LinkML YAML files

Day 11-12: Integration Testing

Tasks:

  1. End-to-End Tests

    # tests/integration/test_unesco_pipeline.py
    
    def test_full_unesco_extraction_pipeline():
        """Test complete pipeline from OpenDataSoft API fetch to LinkML instance."""
        # 1. Fetch API data from OpenDataSoft
        api_client = UNESCOAPIClient()
        response = api_client.fetch_site_details(600)  # Paris, Banks of the Seine
        site_data = response['record']['fields']  # Extract from nested structure
    
        # 2. Classify institution type
        inst_type = classify_institution_type(site_data)
        assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]
    
        # 3. Parse to LinkML instance
        custodian = parse_unesco_site(response)
        assert custodian.name is not None
    
        # 4. Generate GHCID
        custodian.ghcid = generate_ghcid(custodian)
        assert custodian.ghcid.startswith("FR-")
    
        # 5. Validate against schema
        validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
        result = validator.validate(custodian)
        assert result.is_valid
    
  2. Property-Based Testing

    from hypothesis import given, strategies as st
    
    @given(st.integers(min_value=1, max_value=1500))
    def test_ghcid_determinism(whc_id: int):
        """GHCID generation is deterministic for same input."""
        site1 = generate_ghcid_for_site(whc_id)
        site2 = generate_ghcid_for_site(whc_id)
        assert site1 == site2
    
  3. Performance Testing

    • Benchmark extraction speed (sites per second)
    • Memory profiling (ensure no memory leaks)
    • Cache hit rate analysis

Deliverables:

  • tests/integration/test_unesco_pipeline.py - End-to-end tests
  • tests/test_unesco_property_based.py - Property-based tests
  • docs/performance-benchmarks.md - Performance results

Success Criteria:

  • 100% passing integration tests
  • Extract 1,000 sites in < 2 hours (with cache)
  • Memory usage < 500MB for full extraction

Day 13: Code Review & Refactoring

Tasks:

  1. Code Quality Review

    • Run ruff linter, fix all warnings
    • Run mypy type checker, resolve type errors
    • Ensure 90%+ test coverage
  2. Documentation Review

    • Add docstrings to all public functions
    • Update README with UNESCO extraction instructions
    • Create developer guide for extending classifiers
  3. Performance Optimization

    • Profile slow functions, optimize bottlenecks
    • Reduce redundant API calls
    • Optimize GHCID generation (cache city code lookups)

Deliverables:

  • Refactored codebase with 90%+ test coverage
  • Updated documentation
  • Performance optimizations applied

Success Criteria:

  • Zero linter warnings
  • Zero type errors
  • Test coverage > 90%

Phase 3: Data Quality & Validation (Days 14-19)

Objectives

  • Cross-reference UNESCO data with existing GLAM dataset (Dutch ISIL, conversations)
  • Detect and resolve conflicts
  • Implement confidence scoring system
  • Generate data quality report

Day 14-15: Cross-Referencing with Existing Data

Tasks:

  1. Load Existing Dataset

    # scripts/crosslink_unesco_with_glam.py
    
    def load_existing_glam_dataset():
        """Load Dutch ISIL + conversation extractions."""
        dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
        dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
        conversations = load_conversation_extractions("data/instances/")
        return merge_datasets([dutch_isil, dutch_orgs, conversations])
    
  2. Match UNESCO Sites to Existing Records

    • Match by Wikidata Q-number (highest confidence)
    • Match by ISIL code (for Dutch institutions)
    • Match by name + location (fuzzy matching, score > 0.85)
  3. Conflict Detection

    def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
        """Detect field-level conflicts between UNESCO and existing data."""
        conflicts = []
    
        if unesco_record.name != existing_record.name:
            conflicts.append(Conflict(
                field="name",
                unesco_value=unesco_record.name,
                existing_value=existing_record.name,
                resolution="MANUAL_REVIEW"
            ))
    
        # Check institution type mismatch
        if unesco_record.institution_type != existing_record.institution_type:
            conflicts.append(Conflict(field="institution_type", ...))
    
        return conflicts
    

Deliverables:

  • scripts/crosslink_unesco_with_glam.py - Cross-linking script
  • data/unesco_conflicts.csv - Detected conflicts report
  • tests/test_crosslinking.py - Unit tests for matching logic

Success Criteria:

  • Identify 50+ matches between UNESCO and existing dataset
  • Detect conflicts (name mismatches, type discrepancies)
  • Generate conflict report for manual review

Day 16: Conflict Resolution

Tasks:

  1. Tier-Based Priority

    • TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
    • When conflict: UNESCO data wins, existing data flagged
  2. Merge Strategy

    def merge_unesco_with_existing(
        unesco_record: HeritageCustodian,
        existing_record: HeritageCustodian
    ) -> HeritageCustodian:
        """Merge UNESCO data with existing record, UNESCO takes priority."""
        merged = existing_record.copy()
    
        # UNESCO name becomes primary
        merged.name = unesco_record.name
    
        # Preserve alternative names from both sources
        merged.alternative_names = list(set(
            unesco_record.alternative_names + existing_record.alternative_names
        ))
    
        # Add UNESCO identifier (guard: some records may lack identifiers)
        if unesco_record.identifiers:
            merged.identifiers.append({
                'identifier_scheme': 'UNESCO_WHC',
                'identifier_value': unesco_record.identifiers[0].identifier_value
            })
    
        # Track provenance of merge
        merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"
    
        return merged
    
  3. Manual Review Queue

    • Flag high-impact conflicts (institution type change, location change)
    • Generate review spreadsheet for human validation
    • Provide evidence for each conflict (source URLs, descriptions)

Deliverables:

  • src/glam_extractor/validators/conflict_resolver.py
  • data/manual_review_queue.csv - Conflicts requiring human review
  • tests/test_conflict_resolution.py

Success Criteria:

  • Resolve 80% of conflicts automatically (tier-based priority)
  • Flag 20% for manual review (complex cases)
  • Zero data loss (preserve all alternative names, identifiers)

Day 17: Confidence Scoring System

Tasks:

  1. Score Calculation

    def calculate_confidence_score(custodian: HeritageCustodian) -> float:
        """Calculate confidence score based on data completeness and source quality."""
        score = 1.0  # Start at maximum (TIER_1 authoritative)
    
        # Deduct for missing fields
        if not custodian.identifiers:
            score -= 0.1
        if not custodian.locations:
            score -= 0.15
        if custodian.institution_type == InstitutionTypeEnum.MIXED:
            score -= 0.2  # Ambiguous classification
    
        # Boost for rich metadata
        if len(custodian.identifiers) > 2:
            score += 0.05
        if custodian.digital_platforms:
            score += 0.05
    
        return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
    
  2. Quality Metrics

    • Completeness: % of optional fields populated
    • Accuracy: Agreement with authoritative sources (Wikidata, official websites)
    • Freshness: Days since extraction
  3. Tier Validation

    • Ensure all UNESCO records have data_tier: TIER_1_AUTHORITATIVE
    • Downgrade tier if conflicts detected (TIER_1 → TIER_2)
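
The completeness and freshness metrics above can be computed directly from a record (a sketch; `OPTIONAL_FIELDS` is an assumed subset of the schema's optional slots):

```python
from datetime import date

OPTIONAL_FIELDS = ["identifiers", "locations", "alternative_names", "digital_platforms"]

def completeness(record: dict) -> float:
    """Fraction of optional fields that are populated (non-empty)."""
    filled = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return filled / len(OPTIONAL_FIELDS)

def freshness_days(extracted_iso: str, today_iso: str) -> int:
    """Days since extraction, from two ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(today_iso) - date.fromisoformat(extracted_iso)).days
```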

Deliverables:

  • src/glam_extractor/validators/confidence_scorer.py
  • tests/test_confidence_scoring.py
  • data/unesco_quality_metrics.json - Aggregate statistics

Success Criteria:

  • 90%+ of UNESCO records score > 0.85 confidence
  • Flag < 5% as low confidence (require review)
  • Document scoring methodology in provenance metadata

Day 18: LinkML Schema Validation

Tasks:

  1. Batch Validation

    # Validate all UNESCO extractions against LinkML schema
    for file in data/unesco_extracted/*.yaml; do
        linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
    done
    
  2. Custom Validators

    # src/glam_extractor/validators/unesco_validators.py
    
    def validate_unesco_whc_id(whc_id: str) -> bool:
        """UNESCO WHC IDs are 3-4 digit integers."""
        return bool(re.match(r'^\d{3,4}$', whc_id))
    
    def validate_ghcid_format(ghcid: str) -> bool:
        """Validate GHCID format for UNESCO institutions.

        The optional trailing segment is the snake_case native-name
        collision suffix (see Day 8), not a Wikidata Q-number.
        """
        pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-[a-z0-9_]+)?$'
        return bool(re.match(pattern, ghcid))
    
  3. Error Reporting

    • Generate validation report with line numbers and error messages
    • Categorize errors (required field missing, invalid enum value, format error)
    • Prioritize fixes (blocking errors vs. warnings)

Deliverables:

  • scripts/validate_unesco_dataset.py - Batch validation script
  • src/glam_extractor/validators/unesco_validators.py - Custom validators
  • data/unesco_validation_report.json - Validation results

Success Criteria:

  • 100% of extracted records pass LinkML validation
  • Zero blocking errors
  • Document any warnings in provenance notes

Day 19: Data Quality Report

Tasks:

  1. Generate Statistics

    # scripts/generate_unesco_quality_report.py
    
    def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
        return {
            'total_institutions': len(unesco_dataset),
            'by_country': count_by_country(unesco_dataset),
            'by_institution_type': count_by_type(unesco_dataset),
            'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
            'completeness_metrics': {
                'with_wikidata_id': count_with_wikidata(unesco_dataset),
                'with_digital_platform': count_with_platforms(unesco_dataset),
                'with_geocoded_location': count_with_geocoding(unesco_dataset)
            },
            'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
            'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
        }
    
  2. Visualization

    • Generate maps showing UNESCO site distribution
    • Bar charts: institutions by country, by type
    • Heatmap: data completeness by field
  3. Documentation

    • Write executive summary of data quality
    • Document known issues and limitations
    • Provide recommendations for improvement

Deliverables:

  • scripts/generate_unesco_quality_report.py
  • data/unesco_quality_report.json - Statistics
  • docs/unesco-data-quality.md - Quality report document
  • data/visualizations/ - Maps and charts

Success Criteria:

  • Quality report shows 90%+ completeness for core fields
  • < 5% of records require manual review
  • Geographic coverage across all inhabited continents

Phase 4: Integration & Enrichment (Days 20-24)

Objectives

  • Enrich UNESCO data with Wikidata identifiers
  • Merge UNESCO dataset with existing GLAM dataset
  • Resolve GHCID collisions
  • Update GHCID history for modified records

Day 20-21: Wikidata Enrichment

Tasks:

  1. SPARQL Query for UNESCO Sites

    # scripts/enrich_unesco_with_wikidata.py
    
    def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
        """Find Wikidata Q-number for UNESCO World Heritage Site."""
        query = f"""
        SELECT ?item WHERE {{
          ?item wdt:P757 "{whc_id}" .  # P757 = UNESCO World Heritage Site ID
        }}
        LIMIT 1
        """
    
        results = sparql_query(query)
        if results:
            return extract_qid(results[0]['item']['value'])
        return None
    
  2. Batch Enrichment

    • Query Wikidata for all 1,000+ UNESCO sites
    • Extract Q-numbers, VIAF IDs, ISIL codes (if available)
    • Add to identifiers array in LinkML instances
  3. Fuzzy Matching Fallback

    • If WHC ID not found in Wikidata, try name + location matching
    • Use same fuzzy matching logic from existing enrichment scripts
    • Threshold: 0.85 similarity score
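
For the batch step, one `VALUES` query per chunk of IDs avoids 1,000+ round trips to the endpoint. A sketch (`build_batch_query` is a hypothetical helper; P757 is the property used in the query above):

```python
def build_batch_query(whc_ids: list) -> str:
    """One SPARQL query covering many WHC IDs via a VALUES clause."""
    values = " ".join(f'"{i}"' for i in whc_ids)
    return (
        "SELECT ?whc ?item WHERE {\n"
        f"  VALUES ?whc {{ {values} }}\n"
        "  ?item wdt:P757 ?whc .\n"  # P757 = UNESCO World Heritage Site ID
        "}"
    )
```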

Deliverables:

  • scripts/enrich_unesco_with_wikidata.py - Enrichment script
  • data/unesco_enriched/ - Enriched YAML instances
  • logs/wikidata_enrichment.log - Enrichment log

Success Criteria:

  • Find Wikidata Q-numbers for 80%+ of UNESCO sites
  • Add VIAF/ISIL identifiers where available
  • Document enrichment in provenance metadata

Day 22: Dataset Merge

Tasks:

  1. Merge Strategy

    # scripts/merge_unesco_into_glam_dataset.py
    
    def merge_datasets(
        unesco_data: List[HeritageCustodian],
        existing_data: List[HeritageCustodian]
    ) -> List[HeritageCustodian]:
        """Merge UNESCO data into existing GLAM dataset."""
        merged = existing_data.copy()
    
        for unesco_record in unesco_data:
            # Check if institution already exists
            match = find_match(unesco_record, existing_data)
    
            if match:
                # Merge records
                merged_record = merge_unesco_with_existing(unesco_record, match)
                merged[merged.index(match)] = merged_record
            else:
                # Add new institution
                merged.append(unesco_record)
    
        return merged
    
  2. Deduplication

    • Detect duplicates by GHCID, Wikidata Q-number, ISIL code
    • Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
    • Preserve alternative names and identifiers from both sources
  3. Provenance Tracking

    • Update provenance.notes for merged records
    • Record merge timestamp
    • Link back to original extraction sources
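
The identifier-priority deduplication can be sketched as follows (plain dicts for illustration, with identifiers flattened to a scheme→value map; the tier names mirror the dataset's data_tier values):

```python
TIER_RANK = {"TIER_1_AUTHORITATIVE": 1, "TIER_4_CONVERSATION": 4}  # lower wins

def dedup_key(record: dict) -> str:
    """Stable identity key: GHCID, else Wikidata Q-number, else ISIL, else name."""
    for scheme in ("GHCID", "WIKIDATA", "ISIL"):
        value = record.get("identifiers", {}).get(scheme)
        if value:
            return f"{scheme}:{value}"
    return "NAME:" + record.get("name", "").lower()

def dedup(records: list) -> list:
    """Keep one record per identity key, preferring the higher tier,
    while pooling alternative names from all duplicates."""
    by_key = {}
    for rec in records:
        key = dedup_key(rec)
        kept = by_key.get(key)
        if kept is None:
            by_key[key] = dict(rec)
            continue
        names = set(kept.get("alternative_names", [])) | set(rec.get("alternative_names", []))
        winner = min(kept, rec, key=lambda r: TIER_RANK.get(r.get("data_tier"), 9))
        merged = dict(winner)
        merged["alternative_names"] = sorted(names)
        by_key[key] = merged
    return list(by_key.values())
```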

Deliverables:

  • scripts/merge_unesco_into_glam_dataset.py - Merge script
  • data/merged_glam_dataset/ - Merged dataset (YAML files)
  • data/merge_report.json - Merge statistics

Success Criteria:

  • Merge 1,000+ UNESCO records with existing GLAM dataset
  • Deduplicate matches (no duplicate GHCIDs)
  • Preserve data from all sources (no information loss)

Day 23: GHCID Collision Resolution

Tasks:

  1. Detect Collisions

    # scripts/resolve_ghcid_collisions.py
    
    def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
        """Find institutions with identical base GHCIDs."""
        ghcid_map = defaultdict(list)
    
        for custodian in dataset:
            base_ghcid = strip_collision_suffix(custodian.ghcid)  # drop any native-name suffix
            ghcid_map[base_ghcid].append(custodian)
    
        collisions = [
            Collision(base_ghcid=k, institutions=v)
            for k, v in ghcid_map.items()
            if len(v) > 1
        ]
    
        return collisions
    
  2. Apply Temporal Priority Rules

    • Compare provenance.extraction_date for colliding institutions
    • Same extraction date (first batch): all colliding records get Q-number suffixes
    • Later extraction date (historical addition): only the new record gets a name suffix
  3. Update GHCID History

    from datetime import datetime, timezone
    
    def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
        """Record GHCID change in history."""
        custodian.ghcid_history.append(GHCIDHistoryEntry(
            ghcid=new_ghcid,
            ghcid_numeric=generate_numeric_id(new_ghcid),
            valid_from=datetime.now(timezone.utc).isoformat(),
            valid_to=None,
            reason=f"Name suffix added to resolve collision with {old_ghcid}"
        ))
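The temporal priority rules above reduce to two pure helpers. This is a sketch under stated assumptions: `remove_q_number` assumes Q-number suffixes of the form `-Q12345`, and the example GHCID values are illustrative, not real identifiers from the dataset.

```python
import re
from typing import List

def remove_q_number(ghcid: str) -> str:
    """Strip a trailing Wikidata Q-number suffix (e.g. '-Q190804') if present."""
    return re.sub(r"-Q\d+$", "", ghcid)

def classify_collision(extraction_dates: List[str]) -> str:
    """Decide which resolution rule applies to a set of colliding records.

    Same-date collisions come from one batch, so all records get Q-number
    suffixes; a later date means a historical addition, so only the newest
    record gets a name suffix.
    """
    if len(set(extraction_dates)) == 1:
        return "q_number_all"
    return "name_suffix_newest"
```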
    

Deliverables:

  • scripts/resolve_ghcid_collisions.py - Collision resolution script
  • data/ghcid_collision_report.json - Detected collisions
  • Updated YAML instances with ghcid_history entries

Success Criteria:

  • Resolve all GHCID collisions (zero duplicates)
  • Update GHCID history for affected records
  • Preserve PID stability (no changes to published GHCIDs)

Day 24: Final Data Validation

Tasks:

  1. Full Dataset Validation

    • Run LinkML validation on merged dataset
    • Check for orphaned references (invalid foreign keys)
    • Verify all GHCIDs are unique
  2. Integrity Checks

    # tests/integration/test_merged_dataset_integrity.py
    
    def test_no_duplicate_ghcids():
        dataset = load_merged_dataset()
        ghcids = [c.ghcid for c in dataset]
        assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"
    
    def test_all_unesco_sites_have_whc_id():
        dataset = load_merged_dataset()
        unesco_records = [c for c in dataset if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]
    
        for record in unesco_records:
            whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
            assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
    
  3. Coverage Analysis

    • Verify UNESCO sites across all continents
    • Check institution type distribution (not all MUSEUM)
    • Ensure Dutch institutions properly merged with ISIL registry
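The coverage checks above amount to frequency counts over the merged dataset. A minimal sketch, using the flattened field names from the Day 27 CSV schema:

```python
from collections import Counter
from typing import Dict, List

def coverage_summary(records: List[dict]) -> Dict[str, Counter]:
    """Count institution types and countries to spot skewed distributions."""
    return {
        "institution_type": Counter(r.get("institution_type") for r in records),
        "country": Counter(r.get("country") for r in records),
    }

# A distribution where nearly every record is MUSEUM would suggest the
# classifier fell back to its default for most sites.
```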

Deliverables:

  • tests/integration/test_merged_dataset_integrity.py - Integrity tests
  • data/final_validation_report.json - Validation results
  • docs/dataset-coverage.md - Coverage analysis

Success Criteria:

  • 100% passing integrity tests
  • Zero duplicate GHCIDs
  • UNESCO sites cover 100+ countries

Phase 5: Export & Documentation (Days 25-30)

Objectives

  • Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
  • Generate user documentation and API docs
  • Create example queries and use case tutorials
  • Publish dataset with persistent identifiers

Day 25-26: RDF/JSON-LD Export

Tasks:

  1. RDF Serialization

    # src/glam_extractor/exporters/rdf_exporter.py
    
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF
    
    def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to RDF/Turtle format."""
        graph = Graph()
    
        # Define namespaces
        GLAM = Namespace("https://w3id.org/heritage/custodian/")
        SCHEMA = Namespace("http://schema.org/")
        graph.bind("glam", GLAM)
        graph.bind("schema", SCHEMA)
        graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))
    
        for custodian in dataset:
            uri = URIRef(custodian.id)
    
            # Type assertions
            graph.add((uri, RDF.type, GLAM.HeritageCustodian))
            if custodian.institution_type == "MUSEUM":
                graph.add((uri, RDF.type, SCHEMA.Museum))
    
            # Literals
            graph.add((uri, SCHEMA.name, Literal(custodian.name)))
            graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))
    
            # Identifiers (owl:sameAs)
            for identifier in custodian.identifiers:
                if identifier.identifier_scheme == "Wikidata":
                    graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))
    
        graph.serialize(destination=output_path, format="turtle")
    
  2. JSON-LD Context

    // data/context/heritage_custodian_context.jsonld
    {
      "@context": {
        "@vocab": "https://w3id.org/heritage/custodian/",
        "schema": "http://schema.org/",
        "name": "schema:name",
        "location": "schema:location",
        "identifiers": "schema:identifier",
        "institution_type": "institutionType",
        "data_source": "dataSource"
      }
    }
    
  3. Content Negotiation Setup

    • Configure w3id.org redirects (if hosting on GitHub Pages)
    • Test URI resolution for sample institutions
    • Ensure Accept header routing (text/turtle, application/ld+json)
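The Accept-header routing above can be reduced to a small media-type lookup. The sketch below is illustrative and deliberately ignores quality values (`q=`) and wildcards, which a production redirect rule would need to handle:

```python
# Map media types to export formats; first match in the Accept header wins.
FORMAT_BY_MEDIA_TYPE = {
    "text/turtle": "turtle",
    "application/ld+json": "jsonld",
    "text/html": "html",
}

def choose_format(accept_header: str, default: str = "html") -> str:
    """Pick an export format from an HTTP Accept header (no q-value handling)."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in FORMAT_BY_MEDIA_TYPE:
            return FORMAT_BY_MEDIA_TYPE[media_type]
    return default
```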

Deliverables:

  • src/glam_extractor/exporters/rdf_exporter.py - RDF exporter
  • data/exports/glam_dataset.ttl - RDF/Turtle export
  • data/exports/glam_dataset.jsonld - JSON-LD export
  • data/context/heritage_custodian_context.jsonld - JSON-LD context

Success Criteria:

  • RDF validates with Turtle parser
  • JSON-LD validates with JSON-LD Playground
  • Sample URIs resolve correctly

Day 27: CSV/Parquet Export

Tasks:

  1. Flatten Schema for CSV

    # src/glam_extractor/exporters/csv_exporter.py
    
    def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to CSV with flattened structure."""
        rows = []
    
        for custodian in dataset:
            row = {
                'ghcid': custodian.ghcid,
                'ghcid_uuid': str(custodian.ghcid_uuid),
                'name': custodian.name,
                'institution_type': custodian.institution_type,
                'country': custodian.locations[0].country if custodian.locations else None,
                'city': custodian.locations[0].city if custodian.locations else None,
                'wikidata_id': get_identifier(custodian, 'Wikidata'),
                'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
                'data_source': custodian.provenance.data_source,
                'data_tier': custodian.provenance.data_tier,
                'confidence_score': custodian.provenance.confidence_score
            }
            rows.append(row)
    
        df = pd.DataFrame(rows)
        df.to_csv(output_path, index=False, encoding='utf-8-sig')
    
  2. Parquet Export (Columnar)

    def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to Parquet for efficient querying."""
        df = pd.DataFrame([custodian.dict() for custodian in dataset])
        df.to_parquet(output_path, engine='pyarrow', compression='snappy')
    
  3. SQLite Export

    def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
        """Export dataset to SQLite database."""
        conn = sqlite3.connect(db_path)
    
        # Create tables
        conn.execute("""
            CREATE TABLE heritage_custodians (
                ghcid TEXT PRIMARY KEY,
                ghcid_uuid TEXT UNIQUE,
                name TEXT NOT NULL,
                institution_type TEXT,
                data_source TEXT,
                ...
            )
        """)
    
        # Insert records
        for custodian in dataset:
            conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)
    
        conn.commit()
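The `get_identifier()` helper used by the CSV exporter above is not defined elsewhere in this plan. A plausible sketch, assuming identifiers carry `identifier_scheme` and `identifier_value` slots as in the LinkML schema (the `Identifier` class and demo record here are reduced stand-ins):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Identifier:  # reduced stand-in for the LinkML class
    identifier_scheme: str
    identifier_value: str

def get_identifier(custodian, scheme: str) -> Optional[str]:
    """Return the first identifier value for the given scheme, or None."""
    for identifier in getattr(custodian, "identifiers", None) or []:
        if identifier.identifier_scheme == scheme:
            return identifier.identifier_value
    return None

# Demo record (hypothetical values)
class _Demo:
    identifiers = [Identifier("Wikidata", "Q190804")]

demo = _Demo()
```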
    

Deliverables:

  • src/glam_extractor/exporters/csv_exporter.py - CSV exporter
  • data/exports/glam_dataset.csv - CSV export
  • data/exports/glam_dataset.parquet - Parquet export
  • data/exports/glam_dataset.db - SQLite database

Success Criteria:

  • CSV opens correctly in Excel, Google Sheets
  • Parquet loads in pandas, DuckDB
  • SQLite database queryable with SQL

Day 28: Documentation - User Guide

Tasks:

  1. Getting Started Guide

    # docs/user-guide/getting-started.md
    
    ## Installation
    
    pip install glam-extractor
    
    ## Quick Start
    
    ```python
    from glam_extractor import load_dataset
    
    # Load the GLAM dataset
    dataset = load_dataset("data/exports/glam_dataset.parquet")
    
    # Filter UNESCO museums in France
    museums = dataset.filter(
        institution_type="MUSEUM",
        data_source="UNESCO_WORLD_HERITAGE",
        country="FR"
    )
    ```
    
  2. Example Queries

    • SPARQL examples (find institutions by type, country)
    • Pandas examples (data analysis, statistics)
    • SQL examples (SQLite queries)
  3. API Reference

    • Document all public classes and methods
    • Provide code examples for each function
    • Link to LinkML schema documentation
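One of the planned pandas examples might look like this, using the flattened column names from the Day 27 CSV/Parquet export. The DataFrame is constructed inline so the snippet is self-contained; the `ISIL_REGISTRY` source value is hypothetical.

```python
import pandas as pd

# In practice: df = pd.read_parquet("data/exports/glam_dataset.parquet")
df = pd.DataFrame([
    {"name": "Site A", "institution_type": "MUSEUM", "country": "FR",
     "data_source": "UNESCO_WORLD_HERITAGE"},
    {"name": "Site B", "institution_type": "ARCHIVE", "country": "NL",
     "data_source": "UNESCO_WORLD_HERITAGE"},
    {"name": "Site C", "institution_type": "MUSEUM", "country": "FR",
     "data_source": "ISIL_REGISTRY"},  # hypothetical source value
])

# UNESCO-sourced museums in France
museums_fr = df[
    (df["institution_type"] == "MUSEUM")
    & (df["country"] == "FR")
    & (df["data_source"] == "UNESCO_WORLD_HERITAGE")
]
```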

Deliverables:

  • docs/user-guide/getting-started.md - Quick start guide
  • docs/user-guide/example-queries.md - Query examples
  • docs/api-reference.md - API documentation

Success Criteria:

  • Complete documentation for all public APIs
  • 10+ example queries covering common use cases
  • Step-by-step tutorials for data consumers

Day 29: Documentation - Developer Guide

Tasks:

  1. Architecture Overview

    • Diagram of extraction pipeline (API → Parser → Validator → Exporter)
    • Explanation of LinkML Map transformation
    • GHCID generation algorithm
  2. Contributing Guide

    • How to add new institution type classifiers
    • How to extend LinkML schema with new fields
    • How to add new export formats
  3. Testing Guide

    • Running unit tests, integration tests
    • Creating new test fixtures
    • Using property-based testing
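The fixture-loading pattern for the testing guide can be sketched as a small helper. The on-disk layout and field names below are assumptions based on the OpenDataSoft record structure described in earlier phases:

```python
import json
import tempfile
from pathlib import Path

def load_golden_fixture(path: Path) -> dict:
    """Load a golden UNESCO API response and return its nested fields dict.

    Golden fixtures store the raw OpenDataSoft record shape
    {"record": {"fields": {...}}} so tests exercise the same parsing
    path as the live API client.
    """
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload["record"]["fields"]

# Demo: round-trip a minimal fixture through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "site_0024.json"
    p.write_text(json.dumps({"record": {"fields": {"unique_number": 24}}}),
                 encoding="utf-8")
    fields = load_golden_fixture(p)
```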

Deliverables:

  • docs/developer-guide/architecture.md - Architecture docs
  • docs/developer-guide/contributing.md - Contribution guide
  • docs/developer-guide/testing.md - Testing guide

Success Criteria:

  • Complete architecture documentation with diagrams
  • Clear instructions for extending the system
  • Comprehensive testing guide

Day 30: Release & Publication

Tasks:

  1. Dataset Release

    • Tag repository with version number (e.g., v1.0.0-unesco)
    • Create GitHub Release with exports attached
    • Publish to Zenodo for DOI (persistent citation)
  2. Announcement

    • Write blog post announcing UNESCO data release
    • Share on social media (Twitter, Mastodon, LinkedIn)
    • Notify stakeholders (Europeana, DPLA, heritage researchers)
  3. Data Portal Update

    • Update w3id.org redirects for new institutions
    • Deploy SPARQL endpoint (if applicable)
    • Update REST API to include UNESCO data

Deliverables:

  • GitHub Release with dataset exports
  • Zenodo DOI for citation
  • Blog post and announcement

Success Criteria:

  • Dataset published with persistent DOI
  • Documentation live and accessible
  • Stakeholders notified of release

Risk Mitigation

Technical Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |

Resource Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |

Data Quality Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |

Success Metrics

Quantitative Metrics

  • Coverage: Extract 1,000+ UNESCO site institutions
  • Quality: 90%+ confidence score average
  • Completeness: 80%+ have Wikidata Q-numbers
  • Performance: Process all sites in < 2 hours
  • Test Coverage: 90%+ code coverage

Qualitative Metrics

  • Usability: Positive feedback from 3+ data consumers
  • Documentation: Complete user guide and API docs
  • Maintainability: Code passes linter, type checker
  • Reproducibility: Dataset generation fully automated

Appendix: Day-by-Day Checklist

Phase 1 (Days 1-5)

  • Day 1: UNESCO API documentation reviewed, 50 sites fetched
  • Day 2: LinkML Map schema (part 1) - basic transformations
  • Day 3: LinkML Map schema (part 2) - advanced patterns
  • Day 4: Golden dataset created (20 test fixtures)
  • Day 5: Institution type classifier designed

Phase 2 (Days 6-13)

  • Day 6: UNESCO API client implemented
  • Day 7: LinkML instance generator implemented
  • Day 8: GHCID generator extended for UNESCO
  • Day 9-10: Batch processing pipeline
  • Day 11-12: Integration testing
  • Day 13: Code review and refactoring

Phase 3 (Days 14-19)

  • Day 14-15: Cross-referencing with existing data
  • Day 16: Conflict resolution
  • Day 17: Confidence scoring system
  • Day 18: LinkML schema validation
  • Day 19: Data quality report

Phase 4 (Days 20-24)

  • Day 20-21: Wikidata enrichment
  • Day 22: Dataset merge
  • Day 23: GHCID collision resolution
  • Day 24: Final data validation

Phase 5 (Days 25-30)

  • Day 25-26: RDF/JSON-LD export
  • Day 27: CSV/Parquet/SQLite export
  • Day 28: User guide documentation
  • Day 29: Developer guide documentation
  • Day 30: Release and publication

Document Status: Complete
Next Document: 04-tdd-strategy.md - Test-driven development plan
Version: 1.1


Version History

Version 1.1 (2025-11-10)

Changes: Updated for OpenDataSoft Explore API v2.0 migration

  • Day 1 API Reconnaissance (lines 44-62): Updated API endpoint from legacy whc.unesco.org/en/list/json to OpenDataSoft data.unesco.org/api/explore/v2.0
  • Day 6 UNESCO API Client (lines 265-305):
    • Updated base_url to OpenDataSoft API endpoint
    • Removed api_key parameter (public dataset, no authentication)
    • Added pagination parameters to fetch_site_list(): limit, offset
    • Updated method documentation to reflect OpenDataSoft response structure: {"record": {"fields": {...}}}
  • Day 7 LinkML Parser (lines 309-352):
    • Updated parse_unesco_site() to extract from nested api_response['record']['fields']
    • Added documentation clarifying OpenDataSoft structure
    • Updated extract_multilingual_names() parameter name from unesco_data to site_data
  • Day 10 Batch Processing (lines 406-450):
    • Updated extract_all_unesco_sites() with pagination loop for OpenDataSoft API
    • Updated process_unesco_site() to handle nested record structure
    • Changed field access from site_data['id_number'] to site_record['record']['fields']['unique_number']
  • Day 11-12 Integration Tests (lines 454-483):
    • Updated test_full_unesco_extraction_pipeline() to extract site_data from response['record']['fields']
    • Added explicit documentation of OpenDataSoft API structure

Rationale: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.

Version 1.0 (2025-11-09)

Initial Release

  • Comprehensive 30-day implementation plan for UNESCO data extraction
  • Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
  • TDD approach with golden dataset and integration tests
  • GHCID generation strategy for UNESCO heritage sites
  • Wikidata enrichment and cross-referencing plan