
UNESCO Data Extraction - Master Implementation Checklist

Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 07 - Master Implementation Checklist
Version: 1.1
Date: 2025-11-10
Status: Planning


Executive Summary

This document provides a comprehensive 30-day implementation checklist for the UNESCO World Heritage Sites extraction project. It consolidates all planning documents (00-06) into a structured timeline with actionable tasks, testing requirements, and quality gates.

Timeline: 30 calendar days (estimated 120 person-hours)
Approach: Test-Driven Development (TDD) with iterative validation
Output: 1,200+ LinkML-compliant heritage custodian records


How to Use This Checklist

Notation

  • [D] = Development task
  • [T] = Testing task
  • [V] = Validation/Quality gate
  • [D+T] = TDD pair (test first, then implementation)
  • [x] = Completed
  • [~] = In progress
  • [!] = Blocked/Issue

Task Priority Levels

  • P0 = Critical path, must complete
  • P1 = High priority, complete if time permits
  • P2 = Nice to have, future iteration

Estimated Time

Each task includes an estimated duration in hours (e.g., [2h]).


Phase 0: Pre-Implementation Setup (Days 1-2, 8 hours)

Day 1: Environment & Dependencies

Morning (4 hours)

  • [D] Verify Python 3.11+ installation [0.5h] (P0)

    • Run: python --version
    • Create virtual environment: python -m venv .venv
    • Activate: source .venv/bin/activate
  • [D] Install core dependencies from requirements.txt [1h] (P0)

    pip install linkml==1.7.10
    pip install linkml-runtime==1.7.5
    pip install pydantic==1.10.13
    pip install requests==2.31.0
    pip install python-dotenv==1.0.0
    pip install pytest==7.4.3
    pip install pytest-cov==4.1.0
    
  • [D] Install optional dependencies [0.5h] (P1)

    pip install rapidfuzz==3.5.2  # Fuzzy matching
    pip install SPARQLWrapper==2.0.0  # Wikidata queries
    pip install click==8.1.7  # CLI interface
    
  • [V] Verify LinkML installation [0.5h] (P0)

    linkml-validate --version
    gen-python --version
    gen-json-schema --version
    
  • [D] Clone/update project repository [0.5h] (P0)

    cd /Users/kempersc/apps/glam
    git status
    git pull origin main
    
  • [D] Create directory structure [1h] (P0)

    mkdir -p schemas/maps
    mkdir -p src/glam_extractor/extractors/unesco
    mkdir -p tests/fixtures/unesco
    mkdir -p tests/unit/unesco
    mkdir -p tests/integration/unesco
    mkdir -p data/raw/unesco
    mkdir -p data/processed/unesco
    mkdir -p data/instances/unesco
    

Afternoon (4 hours)

  • [D] Set up UNESCO API access [1h] (P0)

    • Read API docs: https://data.unesco.org/api/explore/v2.0/console
    • Test API endpoint: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records
    • Note: Legacy endpoint https://whc.unesco.org/en/list/json is BLOCKED (Cloudflare 403)
    • UNESCO uses OpenDataSoft Explore API v2 (not OAI-PMH)
    • No authentication required for public datasets
    • Document ODSQL query language for filtering
    • Create .env file with configuration
  • [D] Download sample UNESCO data [1h] (P0)

    # OpenDataSoft API with pagination (max 100 records per request)
    curl -o data/raw/unesco/whc-sites-2025.json \
      "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=100&offset=0"
    
    # For full dataset (1,200+ sites), implement pagination in Python
    # CSV export fallback (if API unavailable):
    # https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv
    
  • [D] Create project README for UNESCO extraction [1h] (P1)

    • Document scope: 1,200+ WHS extraction
    • Reference planning documents (00-06)
    • List dependencies and prerequisites
  • [V] Review existing schema modules [1h] (P0)

    • Read schemas/core.yaml (HeritageCustodian, Location, Identifier)
    • Read schemas/enums.yaml (InstitutionTypeEnum, DataSource)
    • Read schemas/provenance.yaml (Provenance, ChangeEvent)
    • Read schemas/collections.yaml (Collection metadata)

Phase 1: Schema & Type System (Days 3-5, 16 hours)

Day 3: LinkML Schema Extension

Morning (4 hours)

  • [T] Create test for UNESCO-specific enums [1h] (P0)

    • File: tests/unit/unesco/test_unesco_enums.py
    • Test cases:
      • UNESCO category values (Cultural, Natural, Mixed)
      • UNESCO inscription status (Inscribed, Danger, Delisted)
      • UNESCO serial nomination flag
      • UNESCO transboundary flag
  • [D+T] Add UNESCO enums to schemas/enums.yaml [2h] (P0)

    UNESCOCategoryEnum:
      permissible_values:
        CULTURAL:
          description: Cultural heritage site
        NATURAL:
          description: Natural heritage site
        MIXED:
          description: Mixed cultural and natural site
    
    UNESCOInscriptionStatusEnum:
      permissible_values:
        INSCRIBED:
          description: Currently inscribed on World Heritage List
        IN_DANGER:
          description: On List of World Heritage in Danger
        DELISTED:
          description: Removed from World Heritage List
    
  • [V] Validate enum schema [0.5h] (P0)

    linkml-validate -s schemas/heritage_custodian.yaml \
      --target-class UNESCOCategoryEnum
    
  • [D] Generate Python classes from schema [0.5h] (P0)

    gen-python schemas/heritage_custodian.yaml > \
      src/glam_extractor/models/generated.py
    

Afternoon (4 hours)

  • [T] Create test for UNESCO extension class [1.5h] (P0)

    • File: tests/unit/unesco/test_unesco_custodian.py
    • Test cases:
      • UNESCOHeritageCustodian inherits from HeritageCustodian
      • Additional fields: unesco_id, inscription_year, category, criteria
      • Validation rules for UNESCO WHC ID format
      • Multi-language name handling
  • [D+T] Create UNESCO extension schema [2h] (P0)

    • File: schemas/unesco.yaml
    • Extends: HeritageCustodian from schemas/core.yaml
    • New slots:
      • unesco_id: Integer (UNESCO WHC site ID)
      • inscription_year: Integer (year inscribed)
      • unesco_category: UNESCOCategoryEnum
      • unesco_criteria: List of strings (e.g., ["i", "ii", "iv"])
      • in_danger: Boolean
      • serial_nomination: Boolean
      • transboundary: Boolean
  • [D] Update main schema to import UNESCO module [0.5h] (P0)

    • Edit schemas/heritage_custodian.yaml
    • Add import: imports: [core, enums, provenance, collections, unesco]

Day 4: Type Classification System

Morning (4 hours)

  • [T] Create test for institution type classifier [2h] (P0)

    • File: tests/unit/unesco/test_institution_classifier.py
    • Test cases:
      • Museums: "National Museum of...", "Museum Complex"
      • Libraries: "Royal Library", "Library of..."
      • Archives: "National Archives", "Documentary Heritage"
      • Holy Sites: "Cathedral", "Temple", "Monastery"
      • Mixed: Multiple institution types at one site
      • Unknown: Defaults to MIXED when uncertain
  • [D+T] Implement institution type classifier [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/institution_classifier.py
    • Function: classify_unesco_site(site_data: dict) -> InstitutionTypeEnum
    • Logic:
      1. Extract short_description_en, justification_en, name_en
      2. Apply keyword matching patterns (see doc 05, design pattern #2)
      3. Return InstitutionTypeEnum value
      4. Assign confidence score (0.0-1.0)
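
A minimal keyword-scoring sketch of classify_unesco_site, assuming a plain-dict rule set and a crude confidence normalisation; the rule weights, helper names, and the MIXED fallback are illustrative, not the final design from doc 05:

    # Hypothetical subset of the rules that will live in classification_rules.yaml.
    RULES = {
        "MUSEUM": {"keywords": ["museum", "gallery"], "context_boost": ["collection", "exhibit"]},
        "LIBRARY": {"keywords": ["library", "biblioteca"], "context_boost": ["manuscript", "codex"]},
    }

    def classify_unesco_site(site_data: dict) -> tuple[str, float]:
        """Return (institution_type, confidence) from simple keyword counts."""
        text = " ".join(
            str(site_data.get(field, ""))
            for field in ("name_en", "short_description_en", "justification_en")
        ).lower()
        scores = {}
        for inst_type, rule in RULES.items():
            hits = sum(text.count(kw) for kw in rule["keywords"])
            boost = sum(text.count(kw) for kw in rule["context_boost"])
            scores[inst_type] = hits * 2 + boost  # keywords weigh more than context terms
        best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score == 0:
            return "MIXED", 0.0  # default when nothing matches
        return best_type, min(1.0, best_score / 5)  # crude 0.0-1.0 normalisation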

Afternoon (4 hours)

  • [D] Create classification keyword database [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/classification_rules.yaml
    • Structure:
      MUSEUM:
        keywords:
          - museum
          - gallery
          - kunsthalle
          - musée
        context_boost:
          - collection
          - exhibit
          - artifact
      
      LIBRARY:
        keywords:
          - library
          - biblioteca
          - bibliothèque
        context_boost:
          - manuscript
          - codex
          - archival
      
      ARCHIVE:
        keywords:
          - archive
          - archivo
          - archiv
        context_boost:
          - document
          - record
          - registry
      
      HOLY_SITES:
        keywords:
          - cathedral
          - church
          - temple
          - monastery
          - abbey
          - mosque
          - synagogue
        context_boost:
          - pilgrimage
          - religious
          - sacred
      
  • [T] Add edge case tests for classifier [1h] (P1)

    • Test multilingual descriptions (French, Spanish, Arabic)
    • Test sites with ambiguous descriptions
    • Test serial nominations with multiple institution types
  • [V] Benchmark classifier accuracy [1h] (P1)

    • Manually classify 50 sample UNESCO sites
    • Compare with algorithm output
    • Target accuracy: >85%

Day 5: GHCID Generation for UNESCO

Morning (4 hours)

  • [T] Create test for UNESCO GHCID generation [2h] (P0)

    • File: tests/unit/unesco/test_ghcid_unesco.py
    • Test cases:
      • Single-country site: FR-IDF-PAR-M-WHC600 (Paris, Banks of the Seine)
      • Transboundary site: Multi-country code handling
      • Serial nomination: Append serial component ID
      • UUID v5 generation from GHCID
      • GHCID history tracking
  • [D+T] Implement UNESCO GHCID generator [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/ghcid_generator.py
    • Function: generate_unesco_ghcid(site_data: dict) -> str
    • Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}
    • Logic:
      1. Extract country from states_iso_code
      2. Extract city from name_en or geocode from lat/lon
      3. Classify institution type → single-letter code
      4. Append WHC{unesco_id} suffix
      5. Handle transboundary: Use primary country or multi-country code
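
An illustrative sketch of the GHCID assembly; the single-letter type codes, the region/city arguments, and the transboundary handling are assumptions to be replaced by the classifier, city extractor, and GeoNames region lookup:

    TYPE_CODES = {"MUSEUM": "M", "LIBRARY": "L", "ARCHIVE": "A", "HOLY_SITES": "H"}  # assumed mapping

    def generate_unesco_ghcid(site_data: dict, region: str, city: str, inst_type: str) -> str:
        # Primary country only; transboundary sites need the multi-country rule.
        country = site_data["states_iso_code"].split(",")[0].strip().upper()
        type_code = TYPE_CODES.get(inst_type, "X")
        return f"{country}-{region}-{city[:3].upper()}-{type_code}-WHC{site_data['unique_number']}"

    # generate_unesco_ghcid({"states_iso_code": "fr", "unique_number": 600}, "IDF", "Paris", "MUSEUM")
    # -> "FR-IDF-PAR-M-WHC600"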

Afternoon (4 hours)

  • [D] Implement city extraction logic [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/city_extractor.py
    • Function: extract_city_from_name(name: str, country: str) -> str
    • Patterns:
      • "Historic Centre of {City}"
      • "Old Town of {City}"
      • "{Site Name} in {City}"
      • Fallback to geocoding API if pattern match fails
  • [T] Test city extraction edge cases [1h] (P1)

    • Test site names without city references
    • Test sites spanning multiple cities
    • Test non-English site names
  • [V] Validate GHCID uniqueness [1h] (P0)

    • Generate GHCIDs for all 1,200+ UNESCO sites
    • Check for collisions (expect <5%)
    • Document collision resolution strategy
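
A pattern-matching sketch for the city extraction step above, assuming a small regex list; the country argument is kept for the geocoding fallback but unused here, and returning None signals that fallback:

    import re
    from typing import Optional

    CITY_PATTERNS = [
        r"Historic Centre of (?P<city>[\w\s'-]+)",
        r"Old Town of (?P<city>[\w\s'-]+)",
        r".+ in (?P<city>[\w\s'-]+)$",
    ]

    def extract_city_from_name(name: str, country: str) -> Optional[str]:
        """Return the first pattern-matched city, or None to trigger geocoding."""
        for pattern in CITY_PATTERNS:
            match = re.search(pattern, name)
            if match:
                return match.group("city").strip()
        return None  # caller falls back to reverse geocoding by lat/lon

    # extract_city_from_name("Historic Centre of Vienna", "AT") -> "Vienna"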

Phase 2: Data Extraction Pipeline (Days 6-12, 32 hours)

Day 6: UNESCO API Client

Morning (4 hours)

  • [T] Create test for UNESCO API client [2h] (P0)

    • File: tests/unit/unesco/test_unesco_api_client.py
    • Test cases:
      • Fetch all sites with pagination (OpenDataSoft limit=100, offset mechanism)
      • Fetch single site by ID using ODSQL where clause
      • Handle HTTP errors (404, 500, timeout)
      • Parse OpenDataSoft response structure (nested record.fields)
      • Cache responses (avoid repeated API calls)
      • Handle pagination links from response
  • [D+T] Implement UNESCO API client [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/api_client.py
    • Class: UNESCOAPIClient
    • Base URL: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records
    • Methods:
      • fetch_all_sites() -> List[dict] - Paginate with limit=100, offset increment
      • fetch_site_by_id(site_id: int) -> dict - Use ODSQL where="unique_number={id}"
      • paginate_records(limit: int, offset: int) -> dict - Handle OpenDataSoft pagination
      • download_and_cache(url: str, cache_path: Path) -> dict
    • Response parsing: Extract records array, navigate record.fields for each item
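
A hedged pagination sketch for fetch_all_sites; the endpoint and limit/offset parameters follow the OpenDataSoft usage described above, and the top-level "records" key matches the response shape assumed in this plan, but caching and error handling are omitted:

    import requests

    BASE_URL = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"

    def fetch_all_sites(limit: int = 100) -> list[dict]:
        """Page through the whc001 dataset, 100 records per request."""
        records: list[dict] = []
        offset = 0
        while True:
            response = requests.get(BASE_URL, params={"limit": limit, "offset": offset}, timeout=30)
            response.raise_for_status()
            batch = response.json().get("records", [])
            if not batch:
                break
            records.extend(batch)
            offset += limit
        return records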

Afternoon (4 hours)

  • [D] Implement response parser [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/response_parser.py
    • Function: parse_unesco_site(raw_data: dict) -> dict
    • Input: OpenDataSoft record structure: raw_data['record']['fields']
    • Transformations:
      • Extract from nested fields dict: raw_data['record']['fields']
      • Normalize field names (snake_case)
      • Parse date_inscribed to datetime (format: "YYYY-MM-DD")
      • Parse criteria_txt to list: "(i)(ii)(iv)" → ["i", "ii", "iv"]
      • Extract coordinates from coordinates object: {"lat": ..., "lon": ...}
      • Handle missing/null fields (many optional fields in OpenDataSoft)
    • Example input:
      {
        "record": {
          "id": "abc123...",
          "fields": {
            "name_en": "Angkor",
            "unique_number": 668,
            "date_inscribed": "1992-01-01",
            "criteria_txt": "(i)(ii)(iii)(iv)",
            "coordinates": {"lat": 13.4333, "lon": 103.8333},
            "category": "Cultural",
            "states_name_en": "Cambodia"
          }
        }
      }
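    • Criteria parsing sketch (illustrative only; the regex and helper name are assumptions):

      import re

      def parse_criteria(criteria_txt: str) -> list[str]:
          """'(i)(ii)(iv)' -> ['i', 'ii', 'iv']"""
          return re.findall(r"\(([ivx]+)\)", criteria_txt or "")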
      
  • [T] Create parser tests with fixtures [1.5h] (P0)

    • Fixtures: tests/fixtures/unesco/sample_sites.json (OpenDataSoft format)
    • Test parsing of 10 diverse UNESCO sites:
      • Cultural site (e.g., Angkor Wat - unique_number: 668)
      • Natural site (e.g., Galápagos Islands - unique_number: 1)
      • Mixed site (e.g., Mount Taishan - unique_number: 437)
      • Transboundary site (e.g., Struve Geodetic Arc - unique_number: 1187)
      • Serial nomination (e.g., Le Corbusier's architectural works - unique_number: 1321)
    • Test nested field extraction: Verify parser correctly navigates record.fields
    • Test coordinate parsing: Verify coordinates object → separate lat/lon floats
    • Test null handling: Verify sites with missing coordinates, criteria_txt, etc.
  • [V] Test API client with live data [0.5h] (P0)

    python -m pytest tests/unit/unesco/test_unesco_api_client.py -v
    

Day 7: Location & Geographic Data

Morning (4 hours)

  • [T] Create test for Location object builder [1.5h] (P0)

    • File: tests/unit/unesco/test_location_builder.py
    • Test cases:
      • Build Location from UNESCO lat/lon + country
      • Reverse geocode coordinates to city/region
      • Handle missing coordinates (geocode from name)
      • Handle transboundary sites (multiple locations)
  • [D+T] Implement Location builder [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/location_builder.py
    • Function: build_location(site_data: dict) -> List[Location]
    • Logic:
      1. Extract latitude, longitude, states_iso_code
      2. Reverse geocode using Nominatim API (cached)
      3. Create Location object with city, region, country
      4. For transboundary sites, split by country and create multiple Locations
  • [D] Implement Nominatim client wrapper [0.5h] (P1)

    • File: src/glam_extractor/geocoding/nominatim_client.py
    • Function: reverse_geocode(lat: float, lon: float) -> dict
    • Caching: Use 15-minute TTL cache (per AGENTS.md)
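
A minimal sketch of reverse_geocode with the 15-minute TTL cache; the Nominatim endpoint is the public one and the User-Agent contact is a placeholder (the production client should also respect the 1 request/second policy):

    import time
    import requests

    _CACHE: dict[tuple[float, float], tuple[float, dict]] = {}
    _TTL_SECONDS = 15 * 60

    def reverse_geocode(lat: float, lon: float) -> dict:
        key = (round(lat, 4), round(lon, 4))
        cached = _CACHE.get(key)
        if cached and time.time() - cached[0] < _TTL_SECONDS:
            return cached[1]
        response = requests.get(
            "https://nominatim.openstreetmap.org/reverse",
            params={"lat": lat, "lon": lon, "format": "jsonv2"},
            headers={"User-Agent": "glam-extractor/0.1 (placeholder contact)"},
            timeout=30,
        )
        response.raise_for_status()
        result = response.json()
        _CACHE[key] = (time.time(), result)
        return result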

Afternoon (4 hours)

  • [D] Integrate with GeoNames (from global_glam project) [2h] (P1)

    • Reference: docs/plan/global_glam/08-geonames-integration.md
    • Use existing GeoNames client if available
    • Extract region codes (ADM1, ADM2)
  • [T] Test geographic edge cases [1h] (P1)

    • UNESCO sites in disputed territories
    • Sites with coordinates in ocean (e.g., Galápagos)
    • Sites spanning multiple administrative regions
  • [V] Validate geographic accuracy [1h] (P0)

    • Manually verify 50 geocoded UNESCO sites
    • Check city names match expected values
    • Target accuracy: >90%

Day 8: Identifier Extraction

Morning (4 hours)

  • [T] Create test for Identifier extraction [2h] (P0)

    • File: tests/unit/unesco/test_identifier_extractor.py
    • Test cases:
      • UNESCO WHC ID → Identifier(scheme="UNESCO_WHC", value="600")
      • UNESCO URL → Identifier(scheme="Website", value="https://...")
      • Wikidata enrichment (if available)
      • ISIL codes (if site has institutional archive/library)
  • [D+T] Implement Identifier extractor [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/identifier_extractor.py
    • Function: extract_identifiers(site_data: dict) -> List[Identifier]
    • Always create:
      • UNESCO WHC ID: unesco_id or unique_number
      • Website: http_url
    • Conditionally add:
      • Wikidata Q-number (via SPARQL query)
      • VIAF ID (if available in future enrichment)

Afternoon (4 hours)

  • [D] Implement Wikidata enrichment [3h] (P1)

    • File: src/glam_extractor/extractors/unesco/wikidata_enricher.py
    • Function: find_wikidata_for_unesco_site(site_data: dict) -> Optional[str]
    • SPARQL Query:
      SELECT ?item WHERE {
        ?item wdt:P757 "600" .  # UNESCO WHC ID property
      }
      
    • Fuzzy fallback: If no P757 match, search by name + country
  • [T] Test Wikidata enrichment [1h] (P1)

    • Test with known UNESCO sites that have Wikidata entries
    • Test handling of sites without Wikidata entries
    • Test rate limiting (1 request/second per Wikidata policy)
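
A sketch of the P757 lookup in find_wikidata_for_unesco_site using SPARQLWrapper against the public Wikidata Query Service; the user agent string is a placeholder and the fuzzy fallback is left to the caller:

    from SPARQLWrapper import SPARQLWrapper, JSON

    def find_wikidata_for_unesco_site(whc_id: str) -> str | None:
        sparql = SPARQLWrapper(
            "https://query.wikidata.org/sparql",
            agent="glam-extractor/0.1 (placeholder contact)",
        )
        sparql.setQuery(f'SELECT ?item WHERE {{ ?item wdt:P757 "{whc_id}" . }} LIMIT 1')
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        if not bindings:
            return None  # caller may fall back to name + country fuzzy matching
        # Item URI looks like http://www.wikidata.org/entity/Q243; keep the Q-number.
        return bindings[0]["item"]["value"].rsplit("/", 1)[-1]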

Day 9-10: Provenance & Metadata

Day 9 Morning (4 hours)

  • [T] Create test for Provenance builder [1.5h] (P0)

    • File: tests/unit/unesco/test_provenance_builder.py
    • Test cases:
      • Provenance with data_source=UNESCO_DATAHUB
      • Provenance with data_tier=TIER_2_VERIFIED (per doc 04)
      • Extraction timestamp in UTC
      • Extraction method documentation
      • Confidence scores for institution type classification
  • [D+T] Implement Provenance builder [2h] (P0)

    • File: src/glam_extractor/extractors/unesco/provenance_builder.py
    • Function: build_provenance(site_data: dict, confidence: float) -> Provenance
    • Fields:
      Provenance(
          data_source=DataSource.UNESCO_DATAHUB,
          data_tier=DataTier.TIER_2_VERIFIED,
          extraction_date=datetime.now(timezone.utc),
          extraction_method="UNESCO DataHub API extraction",
          confidence_score=confidence,
          source_url=site_data.get('http_url'),
          verified_date=datetime.fromisoformat(site_data['date_inscribed']),
          verified_by="UNESCO World Heritage Committee"
      )
      
  • [D] Add DataSource enum value [0.5h] (P0)

    • Edit schemas/enums.yaml
    • Add: UNESCO_DATAHUB: {description: "UNESCO World Heritage DataHub API"}

Day 9 Afternoon (4 hours)

  • [T] Create test for ChangeEvent extraction [2h] (P1)

    • File: tests/unit/unesco/test_change_event_extractor.py
    • Test cases:
      • Founding event from date_inscribed
      • Danger listing event (when danger: "1")
      • Delisting event (when date_end is set)
      • Extension event (when extension: "1")
  • [D+T] Implement ChangeEvent extractor [2h] (P1)

    • File: src/glam_extractor/extractors/unesco/change_event_extractor.py
    • Function: extract_change_events(site_data: dict) -> List[ChangeEvent]
    • Events to create:
      • FOUNDING: date_inscribed → UNESCO inscription (not actual founding!)
      • STATUS_CHANGE: danger: "1" → added to Danger List
      • CLOSURE: date_end → delisted from World Heritage List

Day 10 Morning (4 hours)

  • [D] Create Collection metadata extractor [2h] (P1)

    • File: src/glam_extractor/extractors/unesco/collection_builder.py
    • Function: build_collection(site_data: dict) -> Optional[Collection]
    • Logic:
      • Extract short_description_en → collection description
      • Parse criteria_txt → subject areas
      • Extract area_hectares → extent
      • If site is MUSEUM/ARCHIVE/LIBRARY, create Collection object
  • [T] Test Collection extraction [1.5h] (P1)

    • Test museum sites with clear collections
    • Test non-institutional sites (landscapes, monuments) → return None
    • Test serial nominations → multiple collections
  • [V] Review extraction pipeline completeness [0.5h] (P0)

    • Checklist:
      • ✓ Institution name extracted
      • ✓ Institution type classified
      • ✓ Location(s) extracted
      • ✓ Identifiers extracted
      • ✓ Provenance metadata complete
      • ✓ Change events extracted (optional)
      • ✓ Collection metadata extracted (optional)

Day 10 Afternoon (4 hours)

  • [D+T] Implement main orchestrator [3h] (P0)

    • File: src/glam_extractor/extractors/unesco/extractor.py
    • Class: UNESCOExtractor
    • Method: extract_site(site_data: dict) -> HeritageCustodian
    • Orchestration logic:
      1. Parse raw UNESCO JSON
      2. Classify institution type → confidence score
      3. Extract name (prefer English, fallback to French)
      4. Build Location(s)
      5. Extract Identifiers
      6. Build Provenance
      7. Generate GHCID
      8. Generate UUIDs (v5, v8, numeric)
      9. Extract ChangeEvents
      10. Build Collection (if applicable)
      11. Return HeritageCustodian instance
  • [T] Integration test for full extraction [1h] (P0)

    • File: tests/integration/unesco/test_end_to_end.py
    • Test: Extract 10 diverse UNESCO sites end-to-end
    • Assertions:
      • All required fields populated
      • GHCID format valid
      • UUIDs generated correctly
      • Provenance complete
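
A small sketch of the UUID step (step 8 in the orchestration above): derive a stable UUID v5 from the GHCID so re-extractions produce the same identifier. The namespace constant here is a stand-in, not a project-defined value:

    import uuid

    GHCID_NAMESPACE = uuid.NAMESPACE_URL  # placeholder; the project may define its own namespace UUID

    def ghcid_to_uuid5(ghcid: str) -> uuid.UUID:
        """Same GHCID always yields the same UUID v5."""
        return uuid.uuid5(GHCID_NAMESPACE, ghcid)

    # ghcid_to_uuid5("FR-IDF-PAR-M-WHC600") is deterministic across runs.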

Day 11-12: Batch Processing

Day 11 Morning (4 hours)

  • [D] Implement batch processor [2.5h] (P0)

    • File: src/glam_extractor/extractors/unesco/batch_processor.py
    • Class: UNESCOBatchProcessor
    • Method: process_all_sites() -> List[HeritageCustodian]
    • Features:
      • Progress bar (tqdm)
      • Error handling (skip invalid sites, log errors)
      • Parallel processing (multiprocessing, 4 workers)
      • Incremental checkpointing (save every 100 sites)
  • [T] Test batch processor [1.5h] (P0)

    • Test processing 100 sites
    • Test error recovery (invalid site data)
    • Test checkpoint/resume functionality
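
One possible shape for the incremental checkpointing feature above: persist the processed site IDs every 100 records so an interrupted run can resume. The file location and JSON layout are assumptions:

    import json
    from pathlib import Path

    CHECKPOINT = Path("data/processed/unesco/checkpoint.json")

    def save_checkpoint(processed_ids: list[int]) -> None:
        CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
        CHECKPOINT.write_text(json.dumps({"processed_ids": processed_ids}))

    def load_checkpoint() -> set[int]:
        """Return the set of already-processed UNESCO IDs (empty on first run)."""
        if not CHECKPOINT.exists():
            return set()
        return set(json.loads(CHECKPOINT.read_text())["processed_ids"])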

Day 11 Afternoon (4 hours)

  • [D] Add CLI interface [2h] (P1)

    • File: src/glam_extractor/cli/unesco.py
    • Commands:
      glam-extract unesco fetch-all  # Download UNESCO data
      glam-extract unesco process    # Process all sites
      glam-extract unesco validate   # Validate output
      glam-extract unesco export     # Export to formats
      
  • [D] Implement logging system [1h] (P1)

    • File: src/glam_extractor/extractors/unesco/logger.py
    • Log levels:
      • INFO: Progress milestones
      • WARNING: Missing data, fallback logic triggered
      • ERROR: Unhandled exceptions, invalid data
    • Output: logs/unesco_extraction_{timestamp}.log
  • [V] Dry run on full dataset [1h] (P0)

    • Process all 1,200+ UNESCO sites
    • Log errors and warnings
    • Review classification accuracy
    • Target success rate: >95%

Day 12 Full Day (8 hours)

  • [D] Implement error recovery patterns [3h] (P0)

    • File: src/glam_extractor/extractors/unesco/error_handling.py
    • Patterns:
      • Retry on transient errors (HTTP 429, 503)
      • Skip on permanent errors (invalid data structure)
      • Log detailed error context
      • Continue processing remaining sites
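    • Possible retry shape for the transient-error pattern (backoff values are assumptions):

      import time
      import requests

      def get_with_retry(url: str, max_retries: int = 5) -> requests.Response:
          """Retry on HTTP 429/503 with exponential backoff; raise on other errors."""
          for attempt in range(max_retries):
              response = requests.get(url, timeout=30)
              if response.status_code not in (429, 503):
                  response.raise_for_status()
                  return response
              time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
          raise RuntimeError(f"Gave up after {max_retries} retries: {url}")
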
  • [T] Stress test batch processor [2h] (P0)

    • Test with 1,200+ sites
    • Measure processing time (target: <10 minutes)
    • Monitor memory usage
    • Test interrupt/resume (Ctrl+C handling)
  • [D] Optimize performance [2h] (P1)

    • Profile slow operations
    • Cache geocoding results
    • Batch Wikidata queries
    • Optimize JSON serialization
  • [V] Quality review of 50 random sites [1h] (P0)

    • Manual inspection of extracted records
    • Verify institution type classification
    • Verify GHCID generation
    • Verify Location accuracy
    • Document edge cases for refinement

Phase 3: Validation & Testing (Days 13-16, 16 hours)

Day 13: LinkML Schema Validation

Morning (4 hours)

  • [D] Create comprehensive validation script [2h] (P0)

    • File: scripts/validate_unesco_output.py
    • Validations:
      • LinkML schema compliance
      • Required field presence
      • Enum value validity
      • GHCID format regex
      • UUID format validation
      • URL reachability (sample check)
  • [T] Create validation test suite [2h] (P0)

    • File: tests/validation/test_unesco_validation.py
    • Test cases:
      • Valid records pass validation
      • Invalid records fail with clear error messages
      • Edge cases handled gracefully

Afternoon (4 hours)

  • [V] Run LinkML validator on all records [2h] (P0)

    for file in data/instances/unesco/*.yaml; do
      linkml-validate -s schemas/heritage_custodian.yaml "$file"
    done
    
  • [V] Review validation errors [2h] (P0)

    • Categorize errors (schema, data quality, edge cases)
    • Fix schema issues
    • Fix extraction logic bugs
    • Document known limitations

Day 14: Data Quality Checks

Morning (4 hours)

  • [D] Implement data quality checks [2.5h] (P0)

    • File: scripts/quality_checks_unesco.py
    • Checks:
      • Completeness: % of fields populated per record
      • Consistency: Same UNESCO ID → same GHCID
      • Accuracy: Geocoding spot checks
      • Uniqueness: No duplicate GHCIDs
      • Referential integrity: All identifiers valid
  • [V] Run quality checks on full dataset [1.5h] (P0)

    • Generate quality report: reports/unesco_quality_report.md
    • Target metrics:
      • Completeness: >90% for required fields
      • Accuracy: >85% for institution type classification
      • Uniqueness: 100% (no duplicate GHCIDs)

Afternoon (4 hours)

  • [D] Implement deduplication logic [2h] (P1)

    • File: src/glam_extractor/extractors/unesco/deduplicator.py
    • Logic:
      • Compare by UNESCO WHC ID (primary key)
      • Fuzzy match by name + country (detect data errors)
      • Merge records if duplicates found
  • [T] Test deduplication [1h] (P1)

    • Create synthetic duplicates
    • Test merging logic
    • Verify data integrity after merge
  • [V] Final quality gate [1h] (P0)

    • Criteria:
      • All records pass LinkML validation
      • Quality metrics meet targets
      • No critical errors in logs
      • Sample manual review passes

Day 15: Integration Testing

Full Day (8 hours)

  • [T] Create integration test suite [4h] (P0)

    • File: tests/integration/unesco/test_full_pipeline.py
    • Tests:
      • Fetch UNESCO data from live API
      • Process 100 random sites
      • Validate output against schema
      • Export to JSON-LD, RDF, CSV
      • Verify exported files are valid
  • [T] Create cross-reference tests [2h] (P1)

    • File: tests/integration/unesco/test_cross_reference.py
    • Tests:
      • Cross-check UNESCO WHC IDs with official list
      • Verify Wikidata Q-numbers resolve
      • Verify website URLs are reachable (sample)
  • [V] Performance benchmarks [1h] (P0)

    • Measure extraction time per site (target: <0.5s)
    • Measure total processing time (target: <10 minutes)
    • Measure memory usage (target: <2GB)
  • [V] User acceptance testing [1h] (P0)

    • Load sample data into visualization tool
    • Verify data is human-readable
    • Verify data structure supports use cases (doc 02)

Day 16: Test Coverage & Documentation

Morning (4 hours)

  • [V] Measure test coverage [1h] (P0)

    pytest --cov=src/glam_extractor/extractors/unesco \
           --cov-report=html \
           tests/unit/unesco/
    
    • Target coverage: >80%
  • [T] Add missing tests to reach coverage target [2h] (P0)

    • Focus on edge cases
    • Focus on error handling paths
  • [D] Document testing strategy [1h] (P1)

    • File: docs/unesco/testing.md
    • Document test fixtures
    • Document test data sources
    • Document known test gaps

Afternoon (4 hours)

  • [D] Create developer guide [2h] (P1)

    • File: docs/unesco/developer_guide.md
    • How to run extraction locally
    • How to add new classifiers
    • How to extend schema
    • How to debug extraction errors
  • [D] Create operational runbook [2h] (P1)

    • File: docs/unesco/operations.md
    • How to schedule periodic extractions
    • How to monitor data quality
    • How to handle API changes
    • How to troubleshoot common issues

Phase 4: Export & Serialization (Days 17-20, 16 hours)

Day 17: JSON-LD Exporter

Morning (4 hours)

  • [T] Create test for JSON-LD exporter [1.5h] (P0)

    • File: tests/unit/unesco/test_jsonld_exporter.py
    • Test cases:
      • Export single HeritageCustodian to JSON-LD
      • Verify @context is valid
      • Verify @type is correct
      • Verify linked data properties
  • [D+T] Implement JSON-LD exporter [2.5h] (P0)

    • File: src/glam_extractor/exporters/jsonld_unesco.py
    • Function: export_to_jsonld(custodian: HeritageCustodian) -> dict
    • JSON-LD context:
      {
        "@context": {
          "@vocab": "https://w3id.org/heritage/custodian/",
          "name": "http://schema.org/name",
          "location": "http://schema.org/location",
          "unesco": "https://whc.unesco.org/en/list/"
        }
      }
      

Afternoon (4 hours)

  • [D] Create JSON-LD context file [1h] (P0)

    • File: contexts/unesco_heritage_custodian.jsonld
    • Reference existing context from global_glam project
    • Add UNESCO-specific terms
  • [T] Validate JSON-LD output [1h] (P0)

  • [D] Batch export to JSON-LD [2h] (P0)

    • Export all 1,200+ records to data/instances/unesco/jsonld/
    • One file per record: whc-{unesco_id}.jsonld

Day 18: RDF/Turtle Exporter

Morning (4 hours)

  • [T] Create test for RDF exporter [1.5h] (P0)

    • File: tests/unit/unesco/test_rdf_exporter.py
    • Test cases:
      • Export to Turtle (.ttl)
      • Export to N-Triples (.nt)
      • Verify RDF triples are valid
      • Test with rdflib
  • [D+T] Implement RDF exporter [2.5h] (P0)

    • File: src/glam_extractor/exporters/rdf_unesco.py
    • Function: export_to_rdf(custodian: HeritageCustodian, format: str) -> str
    • Dependencies: pip install rdflib==7.0.0
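
A minimal rdflib sketch of export_to_rdf; the namespace reuses the JSON-LD @vocab from Day 17, while the predicate names and simplified arguments are illustrative assumptions:

    from rdflib import Graph, Literal, Namespace, RDF

    HC = Namespace("https://w3id.org/heritage/custodian/")

    def export_to_rdf(ghcid: str, name: str, fmt: str = "turtle") -> str:
        graph = Graph()
        graph.bind("hc", HC)
        subject = HC[ghcid]
        graph.add((subject, RDF.type, HC["HeritageCustodian"]))
        graph.add((subject, HC["name"], Literal(name)))
        return graph.serialize(format=fmt)  # "turtle" or "nt"

    # print(export_to_rdf("FR-IDF-PAR-M-WHC600", "Paris, Banks of the Seine"))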

Afternoon (4 hours)

Day 19: CSV & Parquet Exporters

Morning (4 hours)

  • [T] Create test for CSV exporter [1h] (P0)

    • File: tests/unit/unesco/test_csv_exporter.py
    • Test cases:
      • Export flat CSV with essential fields
      • Handle multi-valued fields (locations, identifiers)
      • UTF-8 encoding
  • [D+T] Implement CSV exporter [2h] (P0)

    • File: src/glam_extractor/exporters/csv_unesco.py
    • Function: export_to_csv(custodians: List[HeritageCustodian], path: Path)
    • Columns:
      • ghcid, name, institution_type, city, country, latitude, longitude
      • unesco_id, inscription_year, category, criteria
      • wikidata_id, website_url
      • extraction_date, confidence_score
  • [D] Batch export to CSV [1h] (P0)

    • Export to data/instances/unesco/unesco_heritage_custodians.csv
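
A sketch of the flat CSV export showing one way to flatten multi-valued fields into a single cell; the record shape here is a plain dict rather than the generated LinkML class, and the column subset is abbreviated:

    import csv
    from pathlib import Path

    def export_to_csv(records: list[dict], path: Path) -> None:
        fieldnames = ["ghcid", "name", "institution_type", "country", "unesco_id", "identifiers"]
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            for record in records:
                row = dict(record)
                # Multi-valued fields (identifiers, locations) are joined into one cell.
                row["identifiers"] = "; ".join(record.get("identifiers", []))
                writer.writerow(row)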

Afternoon (4 hours)

  • [T] Create test for Parquet exporter [1h] (P0)

    • File: tests/unit/unesco/test_parquet_exporter.py
    • Test cases:
      • Export to Parquet format
      • Verify schema preservation
      • Test with DuckDB or Pandas
  • [D+T] Implement Parquet exporter [2h] (P0)

    • File: src/glam_extractor/exporters/parquet_unesco.py
    • Function: export_to_parquet(custodians: List[HeritageCustodian], path: Path)
    • Dependencies: pip install pyarrow==14.0.0
  • [D] Batch export to Parquet [1h] (P0)

    • Export to data/instances/unesco/unesco_heritage_custodians.parquet

Day 20: SQLite Database

Morning (4 hours)

  • [D] Design SQLite schema [1.5h] (P0)

    • File: scripts/create_unesco_db_schema.sql
    • Tables:
      • heritage_custodians (main table)
      • locations (1:N relationship)
      • identifiers (1:N relationship)
      • change_events (1:N relationship)
      • collections (1:N relationship)
  • [D] Implement SQLite exporter [2.5h] (P0)

    • File: src/glam_extractor/exporters/sqlite_unesco.py
    • Function: export_to_sqlite(custodians: List[HeritageCustodian], db_path: Path)
    • Dependencies: Built-in sqlite3

Afternoon (4 hours)

  • [D] Create database indexes [1h] (P0)

    • Index on ghcid (primary key)
    • Index on unesco_id (unique)
    • Index on country
    • Index on institution_type
    • Full-text search on name and description
  • [T] Test SQLite export and queries [2h] (P0)

    • Test data integrity
    • Test foreign key constraints
    • Test query performance
  • [D] Create sample SQL queries [1h] (P1)

    • File: docs/unesco/sample_queries.sql
    • Example queries for common use cases

Phase 5: Integration & Enrichment (Days 21-24, 16 hours)

Day 21: Wikidata Enrichment Pipeline

Morning (4 hours)

  • [D] Implement batch Wikidata enrichment [3h] (P1)

    • File: scripts/enrich_unesco_with_wikidata.py
    • Process:
      1. Load all UNESCO records without Wikidata IDs
      2. Query Wikidata SPARQL endpoint (batches of 100)
      3. Match by UNESCO WHC ID (property P757)
      4. Update records with Q-numbers
      5. Extract additional metadata from Wikidata (VIAF, founding date)
  • [T] Test Wikidata enrichment [1h] (P1)

    • Test with 50 sample sites
    • Verify match accuracy
    • Handle rate limiting

Afternoon (4 hours)

  • [D] Implement ISIL code lookup [2h] (P1)

    • File: scripts/enrich_unesco_with_isil.py
    • Data source: Cross-reference with existing ISIL registry
    • Logic:
      • Match by institution name + country
      • Fuzzy matching threshold: 0.85
      • Add ISIL identifier to records
  • [V] Review enrichment results [2h] (P1)

    • Count enriched records
    • Review match quality
    • Document enrichment coverage
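
An illustrative rapidfuzz helper for the name + country match above, using the 0.85 threshold; the registry structure and key names are assumptions:

    from rapidfuzz import fuzz

    def match_isil(site_name: str, country: str, isil_registry: list[dict], threshold: float = 0.85) -> str | None:
        """Return the ISIL code of the best same-country name match above the threshold."""
        best_code, best_score = None, 0.0
        for entry in isil_registry:
            if entry.get("country") != country:
                continue
            score = fuzz.token_sort_ratio(site_name, entry["name"]) / 100.0  # rapidfuzz scores 0-100
            if score > best_score:
                best_code, best_score = entry["isil"], score
        return best_code if best_score >= threshold else None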

Day 22: Cross-dataset Linking

Morning (4 hours)

  • [D] Link UNESCO records to existing GLAM dataset [3h] (P1)

    • File: scripts/link_unesco_to_global_glam.py
    • Matching strategies:
      1. GHCID match (direct)
      2. Wikidata ID match
      3. Name + country fuzzy match
    • Output: Crosswalk table linking UNESCO WHC ID to existing GHCIDs
  • [V] Validate cross-dataset links [1h] (P1)

    • Manual review of 50 matched pairs
    • Check for false positives
    • Document ambiguous cases

Afternoon (4 hours)

  • [D] Create dataset merge logic [2h] (P1)

    • File: scripts/merge_unesco_into_global_glam.py
    • Merge rules:
      • If GHCID exists: Update with UNESCO metadata
      • If no match: Add as new record
      • Preserve provenance trail
  • [T] Test merge logic [1h] (P1)

    • Test with sample dataset
    • Verify no data loss
    • Verify GHCID uniqueness maintained
  • [V] Final merge quality check [1h] (P0)

    • Review merged dataset
    • Verify provenance for all records
    • Document merge statistics

Day 23: Data Quality Refinement

Full Day (8 hours)

  • [D] Implement data quality improvements [4h] (P0)

    • Fix classification errors from manual review
    • Refine keyword patterns for edge cases
    • Improve geocoding accuracy
    • Handle special characters in names
  • [T] Regression testing [2h] (P0)

    • Re-run full test suite
    • Verify improvements didn't break existing functionality
  • [V] Final data quality audit [2h] (P0)

    • Re-run quality checks
    • Compare metrics to Day 14 baseline
    • Target improvement: +5% accuracy

Day 24: Documentation & Handoff

Morning (4 hours)

  • [D] Create extraction report [2h] (P0)

    • File: reports/unesco_extraction_report.md
    • Contents:
      • Total records extracted: 1,200+
      • Classification accuracy: XX%
      • Coverage by institution type
      • Geographic distribution
      • Data quality metrics
      • Known limitations
  • [D] Document known issues [1h] (P0)

    • File: docs/unesco/known_issues.md
    • List edge cases not handled
    • List sites with ambiguous classification
    • List sites missing key metadata
  • [D] Create future enhancements roadmap [1h] (P1)

    • File: docs/unesco/roadmap.md
    • Planned improvements
    • Additional data sources to integrate
    • New features to implement

Afternoon (4 hours)

  • [D] Create API documentation [2h] (P1)

    • File: docs/unesco/api.md
    • Document all public functions
    • Document CLI commands
    • Provide usage examples
  • [D] Create maintenance guide [2h] (P1)

    • File: docs/unesco/maintenance.md
    • How to update UNESCO data (periodic refresh)
    • How to handle API changes
    • How to re-run enrichment pipelines

Phase 6: Deployment & Monitoring (Days 25-28, 16 hours)

Day 25: Automation Scripts

Morning (4 hours)

  • [D] Create automated extraction pipeline [2.5h] (P0)

    • File: scripts/run_unesco_extraction.sh
    • Steps:
      1. Fetch latest UNESCO data
      2. Run extraction
      3. Validate output
      4. Export to all formats
      5. Run quality checks
      6. Generate report
  • [T] Test automation script [1.5h] (P0)

    • Test end-to-end execution
    • Test error handling
    • Test with scheduled execution (cron)

Afternoon (4 hours)

  • [D] Create incremental update script [2.5h] (P1)

    • File: scripts/update_unesco_incremental.py
    • Logic:
      • Compare current UNESCO data with previous extraction
      • Detect new sites, modified sites, delisted sites
      • Process only changes
      • Update database incrementally
  • [T] Test incremental updates [1.5h] (P1)

    • Simulate API changes
    • Verify only deltas processed
    • Verify data consistency

Day 26: Monitoring & Alerts

Morning (4 hours)

  • [D] Implement extraction monitoring [2h] (P1)

    • File: src/glam_extractor/monitoring/unesco_monitor.py
    • Metrics to track:
      • Extraction success rate
      • Average processing time per site
      • Classification confidence distribution
      • Data quality scores
  • [D] Create alerting logic [2h] (P1)

    • File: src/glam_extractor/monitoring/alerts.py
    • Alert conditions:
      • Extraction failure rate >5%
      • Classification confidence <0.7 for >10% of sites
      • API downtime
      • Schema validation failure rate >1%

Afternoon (4 hours)

  • [D] Create monitoring dashboard (optional) [3h] (P2)

    • Tool: Jupyter notebook or Streamlit
    • Visualizations:
      • Extraction pipeline status
      • Data quality trends over time
      • Geographic coverage map
      • Institution type distribution
  • [V] Test monitoring system [1h] (P1)

    • Simulate failures
    • Verify alerts trigger
    • Test dashboard functionality

Day 27: Production Deployment

Morning (4 hours)

  • [D] Package extraction module [2h] (P0)

    • Create setup.py or pyproject.toml
    • Document installation instructions
    • Test installation in clean environment
  • [D] Create Docker container (optional) [2h] (P2)

    • File: docker/Dockerfile.unesco
    • Containerize extraction pipeline
    • Document Docker deployment

Afternoon (4 hours)

  • [D] Deploy to production environment [2h] (P0)

    • Upload code to production server
    • Install dependencies
    • Configure environment variables
    • Test production execution
  • [V] Production smoke test [1h] (P0)

    • Run extraction on production
    • Verify output files created
    • Verify database updated
    • Verify logs generated
  • [D] Configure scheduled execution [1h] (P0)

    • Set up cron job for weekly updates
    • Configure log rotation
    • Set up error notifications

Day 28: Handoff & Training

Full Day (8 hours)

  • [D] Create training materials [3h] (P1)

    • Presentation: Project overview and architecture
    • Demo: Live extraction walkthrough
    • Exercises: Hands-on practice for maintainers
  • [D] Conduct knowledge transfer session [3h] (P1)

    • Present to stakeholders
    • Walk through codebase
    • Answer questions
    • Document Q&A
  • [D] Create support plan [2h] (P1)

    • File: docs/unesco/support.md
    • Contact information for maintainers
    • Escalation procedures
    • Known issues and workarounds

Phase 7: Finalization (Days 29-30, 8 hours)

Day 29: Final Review

Morning (4 hours)

  • [V] Code review [2h] (P0)

    • Review all implementation files
    • Check code quality (linting, type hints)
    • Verify documentation completeness
  • [V] Test coverage review [1h] (P0)

    • Verify coverage >80%
    • Review untested code paths
    • Document rationale for any skipped tests
  • [V] Security review [1h] (P0)

    • Check for hardcoded credentials
    • Verify API keys in environment variables
    • Check for SQL injection vulnerabilities
    • Verify input validation

Afternoon (4 hours)

  • [D] Final documentation pass [2h] (P0)

    • Proofread all documentation
    • Check for broken links
    • Verify code examples are correct
    • Update README with final statistics
  • [V] Stakeholder demo [2h] (P0)

    • Present final results
    • Demo visualization of extracted data
    • Discuss next steps
    • Gather feedback

Day 30: Release & Archive

Morning (4 hours)

  • [D] Create release package [2h] (P0)

    • Tag release version: v1.0.0-unesco
    • Create GitHub release (if applicable)
    • Archive source code
    • Archive extracted datasets
  • [D] Publish datasets [1h] (P0)

    • Upload to project repository
    • Generate dataset DOI (Zenodo or similar)
    • Update dataset catalog
  • [D] Create project retrospective [1h] (P1)

    • File: reports/unesco_retrospective.md
    • What went well
    • What could be improved
    • Lessons learned
    • Recommendations for future extractions

Afternoon (4 hours)

  • [D] Final cleanup [2h] (P0)

    • Remove temporary files
    • Archive logs
    • Clean up test data
    • Optimize database indexes
  • [V] Final quality gate [1h] (P0)

    • Checklist:
      • ✓ All tests passing
      • ✓ All data validated
      • ✓ All documentation complete
      • ✓ Production deployment successful
      • ✓ Monitoring in place
      • ✓ Stakeholders trained
  • [D] Project close-out [1h] (P0)

    • Send completion notification
    • Archive project artifacts
    • Update project status: COMPLETE

Appendix A: Task Dependencies

Critical Path (Must Complete in Order)

  1. Day 1-2: Environment setup → Blocks all development
  2. Day 3-5: Schema & types → Blocks extraction logic
  3. Day 6-8: Data extraction → Blocks validation
  4. Day 9-12: Batch processing → Blocks export
  5. Day 13-16: Validation → Blocks production deployment
  6. Day 17-20: Export → Blocks integration
  7. Day 25-28: Deployment → Blocks project completion

Parallel Work Opportunities

  • Days 7-10: Location extraction + Identifier extraction can be done in parallel
  • Days 17-20: All exporters (JSON-LD, RDF, CSV, Parquet, SQLite) can be developed in parallel
  • Days 21-22: Wikidata enrichment + ISIL lookup can be done in parallel

Appendix B: Quality Gates

Gate 1: Schema Complete (End of Day 5)

  • All LinkML schemas validated
  • Python classes generated successfully
  • Enum values documented
  • GHCID format specified

Gate 2: Extraction Pipeline Complete (End of Day 12)

  • API client tested with live data
  • All extractors implemented and tested
  • Batch processor handles 1,200+ sites
  • Error handling robust

Gate 3: Validation Complete (End of Day 16)

  • All records pass LinkML validation
  • Quality metrics meet targets
  • Test coverage >80%
  • Sample manual review passes

Gate 4: Export Complete (End of Day 20)

  • All export formats tested
  • Output files validated
  • Database indexes optimized
  • Sample queries work

Gate 5: Production Ready (End of Day 28)

  • Production deployment successful
  • Monitoring in place
  • Documentation complete
  • Team trained

Gate 6: Project Complete (End of Day 30)

  • All tests passing
  • All datasets published
  • Stakeholders notified
  • Project archived

Appendix C: Risk Management

High-Priority Risks

Risk 1: UNESCO API Changes

  • Impact: High (extraction fails)
  • Probability: Low
  • Mitigation:
    • Monitor API version
    • Cache raw API responses
    • Implement schema validation on API responses
    • Create fallback to manual data files

Risk 2: Institution Type Classification Accuracy <85%

  • Impact: Medium (data quality issues)
  • Probability: Medium
  • Mitigation:
    • Iterative refinement of keywords
    • Manual review of 100+ samples
    • Add confidence threshold (skip uncertain classifications)
    • Document known edge cases

Risk 3: GHCID Collisions

  • Impact: Medium (duplicate identifiers)
  • Probability: Low (<5% expected)
  • Mitigation:
    • Implement collision detection
    • Use Wikidata Q-numbers for disambiguation
    • Document collision resolution rules
    • Manual review of collisions

Risk 4: Performance Issues (>10 min processing time)

  • Impact: Low (convenience only)
  • Probability: Medium
  • Mitigation:
    • Implement parallel processing
    • Cache geocoding results
    • Batch Wikidata queries
    • Profile and optimize slow operations

Risk 5: Wikidata Enrichment Rate Limiting

  • Impact: Low (enrichment incomplete)
  • Probability: High
  • Mitigation:
    • Respect rate limits (1 req/sec)
    • Implement exponential backoff
    • Make enrichment optional
    • Run enrichment as separate batch job

Appendix D: Resource Requirements

Software Dependencies

Required:

  • Python 3.11+
  • linkml 1.7.10+
  • pydantic 1.10.13
  • requests 2.31.0

Optional:

  • rapidfuzz 3.5.2 (fuzzy matching)
  • SPARQLWrapper 2.0.0 (Wikidata)
  • rdflib 7.0.0 (RDF export)
  • pyarrow 14.0.0 (Parquet export)

Infrastructure

Development:

  • Laptop/workstation with 8GB+ RAM
  • Internet connection (API access)
  • ~2GB disk space for data

Production:

  • Server with 4GB+ RAM
  • Cron scheduling capability
  • 5GB disk space (with logs)

Team Resources

Estimated effort: 120 person-hours over 30 days

Roles:

  • Software Engineer (primary developer)
  • Data Scientist (classification logic)
  • QA Engineer (testing)
  • Documentation Writer (optional)

Appendix E: Success Metrics

Quantitative Metrics

  • Extraction completeness: >95% of UNESCO sites processed successfully
  • Classification accuracy: >85% correct institution type assignments
  • Validation pass rate: 100% of records pass LinkML schema validation
  • Test coverage: >80% code coverage
  • Processing time: <10 minutes for full dataset
  • Data quality score: >90% completeness on required fields

Qualitative Metrics

  • Code maintainability: Clear documentation, modular design
  • Data usability: Stakeholders can query and visualize data
  • Reproducibility: Pipeline can be re-run by new team members
  • Extensibility: Easy to add new data sources or fields

Version History

  • 1.1 (2025-11-10), AI Agent (OpenCode): Updated Day 6 API client and parser sections to reflect OpenDataSoft Explore API v2 response structure (nested record.fields, pagination links, coordinate objects)
  • 1.0 (2025-11-09), GLAM Extraction Team: Initial version

Document Status: COMPLETE
Next Steps: Begin Phase 0 - Pre-Implementation Setup

Questions? Refer to planning documents 00-06 for detailed specifications.