# UNESCO Data Extraction - Master Implementation Checklist

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 07 - Master Implementation Checklist
**Version**: 1.1
**Date**: 2025-11-10
**Status**: Planning

---

## Executive Summary

This document provides a **comprehensive 30-day implementation checklist** for the UNESCO World Heritage Sites extraction project. It consolidates all planning documents (00-06) into a structured timeline with actionable tasks, testing requirements, and quality gates.

**Timeline**: 30 calendar days (estimated 120 person-hours)
**Approach**: Test-Driven Development (TDD) with iterative validation
**Output**: 1,200+ LinkML-compliant heritage custodian records

---

## How to Use This Checklist

### Notation

- **[D]** = Development task
- **[T]** = Testing task
- **[V]** = Validation/Quality gate
- **[D+T]** = TDD pair (test first, then implementation)
- **✓** = Completed
- **⚠** = In progress
- **⊗** = Blocked/Issue

### Task Priority Levels

- **P0** = Critical path, must complete
- **P1** = High priority, complete if time permits
- **P2** = Nice to have, future iteration

### Estimated Time

Each task includes an estimated duration in hours (e.g., `[2h]`).
---

## Phase 0: Pre-Implementation Setup (Days 1-2, 8 hours)

### Day 1: Environment & Dependencies

#### Morning (4 hours)

- [ ] **[D]** Verify Python 3.11+ installation `[0.5h]` (P0)
  - Run: `python --version`
  - Create virtual environment: `python -m venv .venv`
  - Activate: `source .venv/bin/activate`
- [ ] **[D]** Install core dependencies from `requirements.txt` `[1h]` (P0)

  ```bash
  pip install linkml==1.7.10
  pip install linkml-runtime==1.7.5
  pip install pydantic==1.10.13
  pip install requests==2.31.0
  pip install python-dotenv==1.0.0
  pip install pytest==7.4.3
  pip install pytest-cov==4.1.0
  ```

- [ ] **[D]** Install optional dependencies `[0.5h]` (P1)

  ```bash
  pip install rapidfuzz==3.5.2      # Fuzzy matching
  pip install SPARQLWrapper==2.0.0  # Wikidata queries
  pip install click==8.1.7          # CLI interface
  ```

- [ ] **[V]** Verify LinkML installation `[0.5h]` (P0)

  ```bash
  linkml-validate --version
  gen-python --version
  gen-json-schema --version
  ```

- [ ] **[D]** Clone/update project repository `[0.5h]` (P0)

  ```bash
  cd /Users/kempersc/apps/glam
  git status
  git pull origin main
  ```

- [ ] **[D]** Create directory structure `[1h]` (P0)

  ```bash
  mkdir -p schemas/maps
  mkdir -p src/glam_extractor/extractors/unesco
  mkdir -p tests/fixtures/unesco
  mkdir -p tests/unit/unesco
  mkdir -p tests/integration/unesco
  mkdir -p data/raw/unesco
  mkdir -p data/processed/unesco
  mkdir -p data/instances/unesco
  ```

#### Afternoon (4 hours)

- [ ] **[D]** Set up UNESCO API access `[1h]` (P0)
  - Read API docs: https://data.unesco.org/api/explore/v2.0/console
  - Test API endpoint: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - Note: Legacy endpoint `https://whc.unesco.org/en/list/json` is **BLOCKED** (Cloudflare 403)
  - UNESCO uses the **OpenDataSoft Explore API v2** (not OAI-PMH)
  - No authentication required for public datasets
  - Document the ODSQL query language for filtering
  - Create `.env` file with configuration
- [ ] **[D]** Download sample UNESCO data `[1h]` (P0)

  ```bash
  # OpenDataSoft API with pagination (max 100 records per request)
  curl -o data/raw/unesco/whc-sites-2025.json \
    "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=100&offset=0"
  # For full dataset (1,200+ sites), implement pagination in Python
  # CSV export fallback (if API unavailable):
  # https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv
  ```

- [ ] **[D]** Create project README for UNESCO extraction `[1h]` (P1)
  - Document scope: 1,200+ WHS extraction
  - Reference planning documents (00-06)
  - List dependencies and prerequisites
- [ ] **[V]** Review existing schema modules `[1h]` (P0)
  - Read `schemas/core.yaml` (HeritageCustodian, Location, Identifier)
  - Read `schemas/enums.yaml` (InstitutionTypeEnum, DataSource)
  - Read `schemas/provenance.yaml` (Provenance, ChangeEvent)
  - Read `schemas/collections.yaml` (Collection metadata)

---

## Phase 1: Schema & Type System (Days 3-5, 16 hours)

### Day 3: LinkML Schema Extension

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO-specific enums `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_enums.py`
  - **Test cases**:
    - UNESCO category values (Cultural, Natural, Mixed)
    - UNESCO inscription status (Inscribed, Danger, Delisted)
    - UNESCO serial nomination flag
    - UNESCO transboundary flag
- [ ] **[D+T]** Add UNESCO enums to `schemas/enums.yaml` `[2h]` (P0)

  ```yaml
  UNESCOCategoryEnum:
    permissible_values:
      CULTURAL:
        description: Cultural heritage site
      NATURAL:
        description: Natural heritage site
      MIXED:
        description: Mixed cultural and natural site
  UNESCOInscriptionStatusEnum:
    permissible_values:
      INSCRIBED:
        description: Currently inscribed on World Heritage List
      IN_DANGER:
        description: On List of World Heritage in Danger
      DELISTED:
        description: Removed from World Heritage List
  ```

- [ ] **[V]** Validate enum schema `[0.5h]` (P0)

  ```bash
  linkml-validate -s schemas/heritage_custodian.yaml \
    --target-class UNESCOCategoryEnum
  ```

- [ ] **[D]** Generate Python classes
  from schema `[0.5h]` (P0)

  ```bash
  gen-python schemas/heritage_custodian.yaml > \
    src/glam_extractor/models/generated.py
  ```

#### Afternoon (4 hours)

- [ ] **[T]** Create test for UNESCO extension class `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_custodian.py`
  - **Test cases**:
    - UNESCOHeritageCustodian inherits from HeritageCustodian
    - Additional fields: unesco_id, inscription_year, category, criteria
    - Validation rules for the UNESCO WHC ID format
    - Multi-language name handling
- [ ] **[D+T]** Create UNESCO extension schema `[2h]` (P0)
  - **File**: `schemas/unesco.yaml`
  - **Extends**: HeritageCustodian from `schemas/core.yaml`
  - **New slots**:
    - `unesco_id`: Integer (UNESCO WHC site ID)
    - `inscription_year`: Integer (year inscribed)
    - `unesco_category`: UNESCOCategoryEnum
    - `unesco_criteria`: List of strings (e.g., ["i", "ii", "iv"])
    - `in_danger`: Boolean
    - `serial_nomination`: Boolean
    - `transboundary`: Boolean
- [ ] **[D]** Update main schema to import UNESCO module `[0.5h]` (P0)
  - Edit `schemas/heritage_custodian.yaml`
  - Add import: `imports: [core, enums, provenance, collections, unesco]`

### Day 4: Type Classification System

#### Morning (4 hours)

- [ ] **[T]** Create test for institution type classifier `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_institution_classifier.py`
  - **Test cases**:
    - Museums: "National Museum of...", "Museum Complex"
    - Libraries: "Royal Library", "Library of..."
    - Archives: "National Archives", "Documentary Heritage"
    - Holy Sites: "Cathedral", "Temple", "Monastery"
    - Mixed: Multiple institution types at one site
    - Unknown: Defaults to MIXED when uncertain
- [ ] **[D+T]** Implement institution type classifier `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/institution_classifier.py`
  - **Function**: `classify_unesco_site(site_data: dict) -> InstitutionTypeEnum`
  - **Logic**:
    1. Extract `short_description_en`, `justification_en`, `name_en`
    2. Apply keyword matching patterns (see doc 05, design pattern #2)
    3. Return InstitutionTypeEnum value
    4. Assign confidence score (0.0-1.0)

#### Afternoon (4 hours)

- [ ] **[D]** Create classification keyword database `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/classification_rules.yaml`
  - **Structure**:

    ```yaml
    MUSEUM:
      keywords:
        - museum
        - gallery
        - kunsthalle
        - musée
      context_boost:
        - collection
        - exhibit
        - artifact
    LIBRARY:
      keywords:
        - library
        - biblioteca
        - bibliothèque
      context_boost:
        - manuscript
        - codex
        - archival
    ARCHIVE:
      keywords:
        - archive
        - archivo
        - archiv
      context_boost:
        - document
        - record
        - registry
    HOLY_SITES:
      keywords:
        - cathedral
        - church
        - temple
        - monastery
        - abbey
        - mosque
        - synagogue
      context_boost:
        - pilgrimage
        - religious
        - sacred
    ```

- [ ] **[T]** Add edge case tests for classifier `[1h]` (P1)
  - Test multilingual descriptions (French, Spanish, Arabic)
  - Test sites with ambiguous descriptions
  - Test serial nominations with multiple institution types
- [ ] **[V]** Benchmark classifier accuracy `[1h]` (P1)
  - Manually classify 50 sample UNESCO sites
  - Compare with algorithm output
  - Target accuracy: >85%

### Day 5: GHCID Generation for UNESCO

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO GHCID generation `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_ghcid_unesco.py`
  - **Test cases**:
    - Single-country site: `FR-IDF-PAR-M-WHC600` (Banks of the Seine)
    - Transboundary site: Multi-country code handling
    - Serial nomination: Append serial component ID
    - UUID v5 generation from GHCID
    - GHCID history tracking
- [ ] **[D+T]** Implement UNESCO GHCID generator `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/ghcid_generator.py`
  - **Function**: `generate_unesco_ghcid(site_data: dict) -> str`
  - **Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}`
  - **Logic**:
    1. Extract country from `states_iso_code`
    2. Extract city from `name_en` or geocode from lat/lon
    3. Classify institution type → single-letter code
    4. Append `WHC{unesco_id}` suffix
    5. Handle transboundary sites: use the primary country or a multi-country code

#### Afternoon (4 hours)

- [ ] **[D]** Implement city extraction logic `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/city_extractor.py`
  - **Function**: `extract_city_from_name(name: str, country: str) -> str`
  - **Patterns**:
    - "Historic Centre of {City}"
    - "Old Town of {City}"
    - "{Site Name} in {City}"
    - Fallback to geocoding API if pattern match fails
- [ ] **[T]** Test city extraction edge cases `[1h]` (P1)
  - Test site names without city references
  - Test sites spanning multiple cities
  - Test non-English site names
- [ ] **[V]** Validate GHCID uniqueness `[1h]` (P0)
  - Generate GHCIDs for all 1,200+ UNESCO sites
  - Check for collisions (expect <5%)
  - Document collision resolution strategy

---

## Phase 2: Data Extraction Pipeline (Days 6-12, 32 hours)

### Day 6: UNESCO API Client

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO API client `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_api_client.py`
  - **Test cases**:
    - Fetch all sites with pagination (OpenDataSoft limit=100, offset mechanism)
    - Fetch a single site by ID using an ODSQL `where` clause
    - Handle HTTP errors (404, 500, timeout)
    - Parse the OpenDataSoft response structure (nested `record.fields`)
    - Cache responses (avoid repeated API calls)
    - Handle pagination links from the response
- [ ] **[D+T]** Implement UNESCO API client `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/api_client.py`
  - **Class**: `UNESCOAPIClient`
  - **Base URL**: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - **Methods**:
    - `fetch_all_sites() -> List[dict]` - Paginate with limit=100, incrementing offset
    - `fetch_site_by_id(site_id: int) -> dict` - Use ODSQL `where="unique_number={id}"`
    - `paginate_records(limit: int, offset: int) -> dict` - Handle OpenDataSoft pagination
    - `download_and_cache(url: str, cache_path: Path) -> dict`
  - **Response parsing**: Extract the `records` array, navigating `record.fields` for each item

#### Afternoon (4 hours)

- [ ] **[D]** Implement response parser `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/response_parser.py`
  - **Function**: `parse_unesco_site(raw_data: dict) -> dict`
  - **Input**: OpenDataSoft record structure: `raw_data['record']['fields']`
  - **Transformations**:
    - Extract from the nested `fields` dict: `raw_data['record']['fields']`
    - Normalize field names (snake_case)
    - Parse `date_inscribed` to datetime (format: "YYYY-MM-DD")
    - Parse `criteria_txt` to a list: "(i)(ii)(iv)" → ["i", "ii", "iv"]
    - Extract coordinates from the `coordinates` object: `{"lat": ..., "lon": ...}`
    - Handle missing/null fields (many fields are optional in OpenDataSoft)
  - **Example input**:

    ```json
    {
      "record": {
        "id": "abc123...",
        "fields": {
          "name_en": "Angkor",
          "unique_number": 668,
          "date_inscribed": "1992-01-01",
          "criteria_txt": "(i)(ii)(iii)(iv)",
          "coordinates": {"lat": 13.4333, "lon": 103.8333},
          "category": "Cultural",
          "states_name_en": "Cambodia"
        }
      }
    }
    ```

- [ ] **[T]** Create parser tests with fixtures `[1.5h]` (P0)
  - **Fixtures**: `tests/fixtures/unesco/sample_sites.json` (OpenDataSoft format)
  - Test parsing of 10 diverse UNESCO sites:
    - Cultural site (e.g., Angkor - unique_number: 668)
    - Natural site (e.g., Galápagos Islands - unique_number: 1)
    - Mixed site (e.g., Mount Taishan - unique_number: 437)
    - Transboundary site (e.g., Struve Geodetic Arc - unique_number: 1187)
    - Serial nomination (e.g., Le Corbusier's architectural works - unique_number: 1321)
  - **Test nested field extraction**: Verify the parser correctly navigates `record.fields`
  - **Test coordinate parsing**: Verify the `coordinates` object → separate lat/lon floats
  - **Test null handling**: Verify sites with missing `coordinates`, `criteria_txt`, etc.
- [ ] **[V]** Test API client with live data `[0.5h]` (P0)

  ```bash
  python -m pytest tests/unit/unesco/test_unesco_api_client.py -v
  ```

### Day 7: Location & Geographic Data

#### Morning (4 hours)

- [ ] **[T]** Create test for Location object builder `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_location_builder.py`
  - **Test cases**:
    - Build Location from UNESCO lat/lon + country
    - Reverse geocode coordinates to city/region
    - Handle missing coordinates (geocode from name)
    - Handle transboundary sites (multiple locations)
- [ ] **[D+T]** Implement Location builder `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/location_builder.py`
  - **Function**: `build_location(site_data: dict) -> List[Location]`
  - **Logic**:
    1. Extract `latitude`, `longitude`, `states_iso_code`
    2. Reverse geocode using the Nominatim API (cached)
    3. Create a Location object with city, region, country
    4. For transboundary sites, split by country and create multiple Locations
- [ ] **[D]** Implement Nominatim client wrapper `[0.5h]` (P1)
  - **File**: `src/glam_extractor/geocoding/nominatim_client.py`
  - **Function**: `reverse_geocode(lat: float, lon: float) -> dict`
  - **Caching**: Use a 15-minute TTL cache (per AGENTS.md)

#### Afternoon (4 hours)

- [ ] **[D]** Integrate with GeoNames (from the global_glam project) `[2h]` (P1)
  - Reference: `docs/plan/global_glam/08-geonames-integration.md`
  - Use the existing GeoNames client if available
  - Extract region codes (ADM1, ADM2)
- [ ] **[T]** Test geographic edge cases `[1h]` (P1)
  - UNESCO sites in disputed territories
  - Sites with coordinates in the ocean (e.g., Galápagos)
  - Sites spanning multiple administrative regions
- [ ] **[V]** Validate geographic accuracy `[1h]` (P0)
  - Manually verify 50 geocoded UNESCO sites
  - Check that city names match expected values
  - Target accuracy: >90%

### Day 8: Identifier Extraction

#### Morning (4 hours)

- [ ] **[T]** Create test for Identifier extraction `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_identifier_extractor.py`
  - **Test cases**:
    - UNESCO WHC ID → Identifier(scheme="UNESCO_WHC", value="600")
    - UNESCO URL → Identifier(scheme="Website", value="https://...")
    - Wikidata enrichment (if available)
    - ISIL codes (if the site has an institutional archive/library)
- [ ] **[D+T]** Implement Identifier extractor `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/identifier_extractor.py`
  - **Function**: `extract_identifiers(site_data: dict) -> List[Identifier]`
  - **Always create**:
    - UNESCO WHC ID: `unesco_id` or `unique_number`
    - Website: `http_url`
  - **Conditionally add**:
    - Wikidata Q-number (via SPARQL query)
    - VIAF ID (if available in future enrichment)

#### Afternoon (4 hours)

- [ ] **[D]** Implement Wikidata enrichment `[3h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/wikidata_enricher.py`
  - **Function**: `find_wikidata_for_unesco_site(site_data: dict) -> Optional[str]`
  - **SPARQL query**:

    ```sparql
    SELECT ?item WHERE {
      ?item wdt:P757 "600" .  # UNESCO WHC ID property
    }
    ```

  - **Fuzzy fallback**: If no P757 match, search by name + country
- [ ] **[T]** Test Wikidata enrichment `[1h]` (P1)
  - Test with known UNESCO sites that have Wikidata entries
  - Test handling of sites without Wikidata entries
  - Test rate limiting (1 request/second per Wikidata policy)

### Day 9-10: Provenance & Metadata

#### Day 9 Morning (4 hours)

- [ ] **[T]** Create test for Provenance builder `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_provenance_builder.py`
  - **Test cases**:
    - Provenance with data_source=UNESCO_DATAHUB
    - Provenance with data_tier=TIER_2_VERIFIED (per doc 04)
    - Extraction timestamp in UTC
    - Extraction method documentation
    - Confidence scores for institution type classification
- [ ] **[D+T]** Implement Provenance builder `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/provenance_builder.py`
  - **Function**: `build_provenance(site_data: dict, confidence: float) -> Provenance`
  - **Fields**:

    ```python
    Provenance(
        data_source=DataSource.UNESCO_DATAHUB,
        data_tier=DataTier.TIER_2_VERIFIED,
        extraction_date=datetime.now(timezone.utc),
        extraction_method="UNESCO DataHub API extraction",
        confidence_score=confidence,
        source_url=site_data.get('http_url'),
        verified_date=datetime.fromisoformat(site_data['date_inscribed']),
        verified_by="UNESCO World Heritage Committee"
    )
    ```

- [ ] **[D]** Add DataSource enum value `[0.5h]` (P0)
  - Edit `schemas/enums.yaml`
  - Add: `UNESCO_DATAHUB: {description: "UNESCO World Heritage DataHub API"}`

#### Day 9 Afternoon (4 hours)

- [ ] **[T]** Create test for ChangeEvent extraction `[2h]` (P1)
  - **File**: `tests/unit/unesco/test_change_event_extractor.py`
  - **Test cases**:
    - Founding event from `date_inscribed`
    - Danger listing event (when `danger: "1"`)
    - Delisting event (when `date_end` is set)
    - Extension event (when `extension: "1"`)
- [ ] **[D+T]** Implement ChangeEvent extractor `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/change_event_extractor.py`
  - **Function**: `extract_change_events(site_data: dict) -> List[ChangeEvent]`
  - **Events to create**:
    - **FOUNDING**: `date_inscribed` → UNESCO inscription (not the actual founding!)
    - **STATUS_CHANGE**: `danger: "1"` → added to the Danger List
    - **CLOSURE**: `date_end` → delisted from the World Heritage List

#### Day 10 Morning (4 hours)

- [ ] **[D]** Create Collection metadata extractor `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/collection_builder.py`
  - **Function**: `build_collection(site_data: dict) -> Optional[Collection]`
  - **Logic**:
    - Extract `short_description_en` → collection description
    - Parse `criteria_txt` → subject areas
    - Extract `area_hectares` → extent
    - If the site is MUSEUM/ARCHIVE/LIBRARY, create a Collection object
- [ ] **[T]** Test Collection extraction `[1.5h]` (P1)
  - Test museum sites with clear collections
  - Test non-institutional sites (landscapes, monuments) → return None
  - Test serial nominations → multiple collections
- [ ] **[V]** Review extraction pipeline completeness `[0.5h]` (P0)
  - Checklist:
    - ✓ Institution name extracted
    - ✓ Institution type classified
    - ✓ Location(s) extracted
    - ✓ Identifiers extracted
    - ✓ Provenance metadata complete
    - ✓ Change events extracted (optional)
    - ✓ Collection metadata extracted (optional)

#### Day 10 Afternoon (4 hours)

- [ ] **[D+T]** Implement main orchestrator `[3h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/extractor.py`
  - **Class**: `UNESCOExtractor`
  - **Method**: `extract_site(site_data: dict) -> HeritageCustodian`
  - **Orchestration logic**:
    1. Parse raw UNESCO JSON
    2. Classify institution type → confidence score
    3. Extract name (prefer English, fall back to French)
    4. Build Location(s)
    5. Extract Identifiers
    6. Build Provenance
    7. Generate GHCID
    8. Generate UUIDs (v5, v8, numeric)
    9. Extract ChangeEvents
    10. Build Collection (if applicable)
    11. Return a HeritageCustodian instance
- [ ] **[T]** Integration test for full extraction `[1h]` (P0)
  - **File**: `tests/integration/unesco/test_end_to_end.py`
  - **Test**: Extract 10 diverse UNESCO sites end-to-end
  - **Assertions**:
    - All required fields populated
    - GHCID format valid
    - UUIDs generated correctly
    - Provenance complete

### Day 11-12: Batch Processing

#### Day 11 Morning (4 hours)

- [ ] **[D]** Implement batch processor `[2.5h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/batch_processor.py`
  - **Class**: `UNESCOBatchProcessor`
  - **Method**: `process_all_sites() -> List[HeritageCustodian]`
  - **Features**:
    - Progress bar (tqdm)
    - Error handling (skip invalid sites, log errors)
    - Parallel processing (multiprocessing, 4 workers)
    - Incremental checkpointing (save every 100 sites)
- [ ] **[T]** Test batch processor `[1.5h]` (P0)
  - Test processing 100 sites
  - Test error recovery (invalid site data)
  - Test checkpoint/resume functionality

#### Day 11 Afternoon (4 hours)

- [ ] **[D]** Add CLI interface `[2h]` (P1)
  - **File**: `src/glam_extractor/cli/unesco.py`
  - **Commands**:

    ```bash
    glam-extract unesco fetch-all  # Download UNESCO data
    glam-extract unesco process    # Process all sites
    glam-extract unesco validate   # Validate output
    glam-extract unesco export     # Export to formats
    ```

- [ ] **[D]** Implement logging system `[1h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/logger.py`
  - **Log levels**:
    - INFO: Progress milestones
    - WARNING: Missing data, fallback logic triggered
    - ERROR: Unhandled exceptions, invalid data
  - **Output**: `logs/unesco_extraction_{timestamp}.log`
- [ ] **[V]** Dry run on the full dataset `[1h]` (P0)
  - Process all 1,200+ UNESCO sites
  - Log errors and warnings
  - Review classification accuracy
  - Target success rate: >95%

#### Day 12 Full Day (8 hours)

- [ ] **[D]** Implement error recovery patterns `[3h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/error_handling.py`
  - **Patterns**:
    - Retry on transient errors (HTTP 429, 503)
    - Skip on permanent errors (invalid data structure)
    - Log detailed error context
    - Continue processing remaining sites
- [ ] **[T]** Stress test batch processor `[2h]` (P0)
  - Test with 1,200+ sites
  - Measure processing time (target: <10 minutes)
  - Monitor memory usage
  - Test interrupt/resume (Ctrl+C handling)
- [ ] **[D]** Optimize performance `[2h]` (P1)
  - Profile slow operations
  - Cache geocoding results
  - Batch Wikidata queries
  - Optimize JSON serialization
- [ ] **[V]** Quality review of 50 random sites `[1h]` (P0)
  - Manual inspection of extracted records
  - Verify institution type classification
  - Verify GHCID generation
  - Verify Location accuracy
  - Document edge cases for refinement

---

## Phase 3: Validation & Testing (Days 13-16, 16 hours)

### Day 13: LinkML Schema Validation

#### Morning (4 hours)

- [ ] **[D]** Create comprehensive validation script `[2h]` (P0)
  - **File**: `scripts/validate_unesco_output.py`
  - **Validations**:
    - LinkML schema compliance
    - Required field presence
    - Enum value validity
    - GHCID format regex
    - UUID format validation
    - URL reachability (sample check)
- [ ] **[T]** Create validation test suite `[2h]` (P0)
  - **File**: `tests/validation/test_unesco_validation.py`
  - **Test cases**:
    - Valid records pass validation
    - Invalid records fail with clear error messages
    - Edge cases handled gracefully

#### Afternoon (4 hours)

- [ ] **[V]** Run the LinkML validator on all records `[2h]` (P0)

  ```bash
  for file in data/instances/unesco/*.yaml; do
    linkml-validate -s schemas/heritage_custodian.yaml "$file"
  done
  ```

- [ ] **[V]** Review validation errors `[2h]` (P0)
  - Categorize errors (schema, data quality, edge cases)
  - Fix schema issues
  - Fix extraction logic bugs
  - Document known limitations

### Day 14: Data Quality Checks

#### Morning (4 hours)

- [ ] **[D]** Implement data quality checks `[2.5h]` (P0)
  - **File**: `scripts/quality_checks_unesco.py`
  - **Checks**:
    - Completeness: % of fields populated per record
    - Consistency: Same UNESCO ID → same GHCID
    - Accuracy: Geocoding spot checks
    - Uniqueness: No duplicate GHCIDs
    - Referential integrity: All identifiers valid
- [ ] **[V]** Run quality checks on the full dataset `[1.5h]` (P0)
  - Generate quality report: `reports/unesco_quality_report.md`
  - Target metrics:
    - Completeness: >90% for required fields
    - Accuracy: >85% for institution type classification
    - Uniqueness: 100% (no duplicate GHCIDs)

#### Afternoon (4 hours)

- [ ] **[D]** Implement deduplication logic `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/deduplicator.py`
  - **Logic**:
    - Compare by UNESCO WHC ID (primary key)
    - Fuzzy match by name + country (detect data errors)
    - Merge records if duplicates are found
- [ ] **[T]** Test deduplication `[1h]` (P1)
  - Create synthetic duplicates
  - Test merging logic
  - Verify data integrity after merge
- [ ] **[V]** Final quality gate `[1h]` (P0)
  - **Criteria**:
    - All records pass LinkML validation
    - Quality metrics meet targets
    - No critical errors in logs
    - Sample manual review passes

### Day 15: Integration Testing

#### Full Day (8 hours)

- [ ] **[T]** Create integration test suite `[4h]` (P0)
  - **File**: `tests/integration/unesco/test_full_pipeline.py`
  - **Tests**:
    - Fetch UNESCO data from the live API
    - Process 100 random sites
    - Validate output against the schema
    - Export to JSON-LD, RDF, CSV
    - Verify exported files are valid
- [ ] **[T]** Create cross-reference tests `[2h]` (P1)
  - **File**: `tests/integration/unesco/test_cross_reference.py`
  - **Tests**:
    - Cross-check UNESCO WHC IDs with the official list
    - Verify Wikidata Q-numbers resolve
    - Verify website URLs are reachable (sample)
- [ ] **[V]** Performance benchmarks `[1h]` (P0)
  - Measure extraction time per site (target: <0.5s)
  - Measure total processing time (target: <10 minutes)
  - Measure memory usage (target: <2GB)
- [ ] **[V]** User acceptance testing `[1h]` (P0)
  - Load sample data into the visualization tool
  - Verify data is human-readable
  - Verify the data structure supports the use cases (doc 02)

### Day 16: Test Coverage & Documentation

#### Morning (4 hours)

- [ ] **[V]** Measure test coverage `[1h]` (P0)

  ```bash
  pytest --cov=src/glam_extractor/extractors/unesco \
    --cov-report=html \
    tests/unit/unesco/
  ```

  - Target coverage: >80%
- [ ] **[T]** Add missing tests to reach the coverage target `[2h]` (P0)
  - Focus on edge cases
  - Focus on error handling paths
- [ ] **[D]** Document testing strategy `[1h]` (P1)
  - **File**: `docs/unesco/testing.md`
  - Document test fixtures
  - Document test data sources
  - Document known test gaps

#### Afternoon (4 hours)

- [ ] **[D]** Create developer guide `[2h]` (P1)
  - **File**: `docs/unesco/developer_guide.md`
  - How to run the extraction locally
  - How to add new classifiers
  - How to extend the schema
  - How to debug extraction errors
- [ ] **[D]** Create operational runbook `[2h]` (P1)
  - **File**: `docs/unesco/operations.md`
  - How to schedule periodic extractions
  - How to monitor data quality
  - How to handle API changes
  - How to troubleshoot common issues

---

## Phase 4: Export & Serialization (Days 17-20, 16 hours)

### Day 17: JSON-LD Exporter

#### Morning (4 hours)

- [ ] **[T]** Create test for JSON-LD exporter `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_jsonld_exporter.py`
  - **Test cases**:
    - Export a single HeritageCustodian to JSON-LD
    - Verify @context is valid
    - Verify @type is correct
    - Verify linked data properties
- [ ] **[D+T]** Implement JSON-LD exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/jsonld_unesco.py`
  - **Function**: `export_to_jsonld(custodian: HeritageCustodian) -> dict`
  - **JSON-LD context**:

    ```json
    {
      "@context": {
        "@vocab": "https://w3id.org/heritage/custodian/",
        "name": "http://schema.org/name",
        "location": "http://schema.org/location",
        "unesco": "https://whc.unesco.org/en/list/"
      }
    }
    ```

#### Afternoon (4 hours)

- [ ] **[D]** Create JSON-LD context file `[1h]` (P0)
  - **File**: `contexts/unesco_heritage_custodian.jsonld`
  - Reference the existing context from the global_glam project
  - Add UNESCO-specific terms
- [ ] **[T]** Validate JSON-LD output `[1h]` (P0)
  - Use the JSON-LD Playground: https://json-ld.org/playground/
  - Verify RDF triples are correct
  - Test with sample records
- [ ] **[D]** Batch export to JSON-LD `[2h]` (P0)
  - Export all 1,200+ records to `data/instances/unesco/jsonld/`
  - One file per record: `whc-{unesco_id}.jsonld`

### Day 18: RDF/Turtle Exporter

#### Morning (4 hours)

- [ ] **[T]** Create test for RDF exporter `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_rdf_exporter.py`
  - **Test cases**:
    - Export to Turtle (.ttl)
    - Export to N-Triples (.nt)
    - Verify RDF triples are valid
    - Test with rdflib
- [ ] **[D+T]** Implement RDF exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/rdf_unesco.py`
  - **Function**: `export_to_rdf(custodian: HeritageCustodian, format: str) -> str`
  - **Dependencies**: `pip install rdflib==7.0.0`

#### Afternoon (4 hours)

- [ ] **[D]** Create RDF namespace definitions `[1h]` (P0)
  - **File**: `src/glam_extractor/exporters/namespaces.py`
  - **Namespaces**:
    - heritage: https://w3id.org/heritage/custodian/
    - unesco: https://whc.unesco.org/en/list/
    - schema: http://schema.org/
    - geo: http://www.w3.org/2003/01/geo/wgs84_pos#
    - prov: http://www.w3.org/ns/prov#
- [ ] **[D]** Batch export to Turtle `[2h]` (P0)
  - Export all records to `data/instances/unesco/rdf/unesco_sites.ttl`
  - Create one large Turtle file with all 1,200+ sites
- [ ] **[V]** Validate RDF output `[1h]` (P0)
  - Use the rapper validator: `rapper -i turtle unesco_sites.ttl`
  - Test SPARQL queries on the output

### Day 19: CSV & Parquet Exporters

#### Morning (4 hours)

- [ ] **[T]** Create test for CSV exporter `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_csv_exporter.py`
  - **Test cases**:
    - Export a flat CSV with essential fields
    - Handle multi-valued fields (locations, identifiers)
    - UTF-8 encoding
- [ ] **[D+T]** Implement CSV exporter `[2h]` (P0)
  - **File**: `src/glam_extractor/exporters/csv_unesco.py`
  - **Function**: `export_to_csv(custodians: List[HeritageCustodian], path: Path)`
  - **Columns**:
    - ghcid, name, institution_type, city, country, latitude, longitude
    - unesco_id, inscription_year, category, criteria
    - wikidata_id, website_url
    - extraction_date, confidence_score
- [ ] **[D]** Batch export to CSV `[1h]` (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.csv`

#### Afternoon (4 hours)

- [ ] **[T]** Create test for Parquet exporter `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_parquet_exporter.py`
  - **Test cases**:
    - Export to Parquet format
    - Verify schema preservation
    - Test with DuckDB or Pandas
- [ ] **[D+T]** Implement Parquet exporter `[2h]` (P0)
  - **File**: `src/glam_extractor/exporters/parquet_unesco.py`
  - **Function**: `export_to_parquet(custodians: List[HeritageCustodian], path: Path)`
  - **Dependencies**: `pip install pyarrow==14.0.0`
- [ ] **[D]** Batch export to Parquet `[1h]` (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.parquet`

### Day 20: SQLite Database

#### Morning (4 hours)

- [ ] **[D]** Design SQLite schema `[1.5h]` (P0)
  - **File**: `scripts/create_unesco_db_schema.sql`
  - **Tables**:
    - heritage_custodians (main table)
    - locations (1:N relationship)
    - identifiers (1:N relationship)
    - change_events (1:N relationship)
    - collections (1:N relationship)
- [ ] **[D]** Implement SQLite exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/sqlite_unesco.py`
  - **Function**: `export_to_sqlite(custodians: List[HeritageCustodian], db_path: Path)`
  - **Dependencies**: Built-in `sqlite3`

#### Afternoon (4 hours)

- [ ] **[D]** Create database indexes `[1h]` (P0)
  - Index on ghcid (primary key)
  - Index on unesco_id (unique)
  - Index on country
  - Index on institution_type
  - Full-text search on name and description
- [ ] **[T]** Test SQLite export and queries `[2h]` (P0)
  - Test data integrity
  - Test foreign key constraints
  - Test query performance
- [ ] **[D]** Create sample SQL queries `[1h]` (P1)
  - **File**: `docs/unesco/sample_queries.sql`
Example queries for common use cases --- ## Phase 5: Integration & Enrichment (Days 21-24, 16 hours) ### Day 21: Wikidata Enrichment Pipeline #### Morning (4 hours) - [ ] **[D]** Implement batch Wikidata enrichment `[3h]` (P1) - **File**: `scripts/enrich_unesco_with_wikidata.py` - **Process**: 1. Load all UNESCO records without Wikidata IDs 2. Query Wikidata SPARQL endpoint (batches of 100) 3. Match by UNESCO WHC ID (property P757) 4. Update records with Q-numbers 5. Extract additional metadata from Wikidata (VIAF, founding date) - [ ] **[T]** Test Wikidata enrichment `[1h]` (P1) - Test with 50 sample sites - Verify match accuracy - Handle rate limiting #### Afternoon (4 hours) - [ ] **[D]** Implement ISIL code lookup `[2h]` (P1) - **File**: `scripts/enrich_unesco_with_isil.py` - **Data source**: Cross-reference with existing ISIL registry - **Logic**: - Match by institution name + country - Fuzzy matching threshold: 0.85 - Add ISIL identifier to records - [ ] **[V]** Review enrichment results `[2h]` (P1) - Count enriched records - Review match quality - Document enrichment coverage ### Day 22: Cross-dataset Linking #### Morning (4 hours) - [ ] **[D]** Link UNESCO records to existing GLAM dataset `[3h]` (P1) - **File**: `scripts/link_unesco_to_global_glam.py` - **Matching strategies**: 1. GHCID match (direct) 2. Wikidata ID match 3. 
- [ ] **[V]** Validate cross-dataset links `[1h]` (P1)
  - Manual review of 50 matched pairs
  - Check for false positives
  - Document ambiguous cases

#### Afternoon (4 hours)

- [ ] **[D]** Create dataset merge logic `[2h]` (P1)
  - **File**: `scripts/merge_unesco_into_global_glam.py`
  - **Merge rules**:
    - If GHCID exists: Update with UNESCO metadata
    - If no match: Add as new record
    - Preserve provenance trail
- [ ] **[T]** Test merge logic `[1h]` (P1)
  - Test with sample dataset
  - Verify no data loss
  - Verify GHCID uniqueness maintained
- [ ] **[V]** Final merge quality check `[1h]` (P0)
  - Review merged dataset
  - Verify provenance for all records
  - Document merge statistics

### Day 23: Data Quality Refinement

#### Full Day (8 hours)

- [ ] **[D]** Implement data quality improvements `[4h]` (P0)
  - Fix classification errors from manual review
  - Refine keyword patterns for edge cases
  - Improve geocoding accuracy
  - Handle special characters in names
- [ ] **[T]** Regression testing `[2h]` (P0)
  - Re-run full test suite
  - Verify improvements didn't break existing functionality
- [ ] **[V]** Final data quality audit `[2h]` (P0)
  - Re-run quality checks
  - Compare metrics to Day 14 baseline
  - Target improvement: +5% accuracy

### Day 24: Documentation & Handoff

#### Morning (4 hours)

- [ ] **[D]** Create extraction report `[2h]` (P0)
  - **File**: `reports/unesco_extraction_report.md`
  - **Contents**:
    - Total records extracted: 1,200+
    - Classification accuracy: XX%
    - Coverage by institution type
    - Geographic distribution
    - Data quality metrics
    - Known limitations
- [ ] **[D]** Document known issues `[1h]` (P0)
  - **File**: `docs/unesco/known_issues.md`
  - List edge cases not handled
  - List sites with ambiguous classification
  - List sites missing key metadata
- [ ] **[D]** Create future enhancements roadmap `[1h]` (P1)
  - **File**: `docs/unesco/roadmap.md`
  - Planned improvements
  - Additional data sources to integrate
  - New features to implement
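Stepping back to the Day 22 merge task, the merge rules could be sketched as below — assuming dict-based records keyed by `ghcid`, with a hypothetical `sources` list standing in for the provenance trail (the real records are LinkML `HeritageCustodian` instances):

```python
from typing import Dict, List


def merge_unesco_records(existing: Dict[str, dict], unesco: List[dict]) -> Dict[str, dict]:
    """Merge UNESCO records into the GLAM dataset without mutating inputs."""
    merged = {ghcid: dict(rec) for ghcid, rec in existing.items()}
    for rec in unesco:
        ghcid = rec["ghcid"]
        if ghcid in merged:
            # GHCID exists: update with UNESCO metadata, keep other fields.
            updated = {**merged[ghcid], **rec}
            # Preserve provenance: append rather than overwrite sources.
            updated["sources"] = list(merged[ghcid].get("sources", [])) + ["unesco_whc"]
            merged[ghcid] = updated
        else:
            # No match: add as a new record with its own provenance entry.
            new = dict(rec)
            new["sources"] = ["unesco_whc"]
            merged[ghcid] = new
    return merged
```

Returning a fresh mapping (rather than updating in place) makes the "verify no data loss" test straightforward: the pre-merge dataset is still available for comparison.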
#### Afternoon (4 hours)

- [ ] **[D]** Create API documentation `[2h]` (P1)
  - **File**: `docs/unesco/api.md`
  - Document all public functions
  - Document CLI commands
  - Provide usage examples
- [ ] **[D]** Create maintenance guide `[2h]` (P1)
  - **File**: `docs/unesco/maintenance.md`
  - How to update UNESCO data (periodic refresh)
  - How to handle API changes
  - How to re-run enrichment pipelines

---

## Phase 6: Deployment & Monitoring (Days 25-28, 16 hours)

### Day 25: Automation Scripts

#### Morning (4 hours)

- [ ] **[D]** Create automated extraction pipeline `[2.5h]` (P0)
  - **File**: `scripts/run_unesco_extraction.sh`
  - **Steps**:
    1. Fetch latest UNESCO data
    2. Run extraction
    3. Validate output
    4. Export to all formats
    5. Run quality checks
    6. Generate report
- [ ] **[T]** Test automation script `[1.5h]` (P0)
  - Test end-to-end execution
  - Test error handling
  - Test with scheduled execution (cron)

#### Afternoon (4 hours)

- [ ] **[D]** Create incremental update script `[2.5h]` (P1)
  - **File**: `scripts/update_unesco_incremental.py`
  - **Logic**:
    - Compare current UNESCO data with previous extraction
    - Detect new sites, modified sites, delisted sites
    - Process only changes
    - Update database incrementally
- [ ] **[T]** Test incremental updates `[1.5h]` (P1)
  - Simulate API changes
  - Verify only deltas processed
  - Verify data consistency

### Day 26: Monitoring & Alerts

#### Morning (4 hours)

- [ ] **[D]** Implement extraction monitoring `[2h]` (P1)
  - **File**: `src/glam_extractor/monitoring/unesco_monitor.py`
  - **Metrics to track**:
    - Extraction success rate
    - Average processing time per site
    - Classification confidence distribution
    - Data quality scores
- [ ] **[D]** Create alerting logic `[2h]` (P1)
  - **File**: `src/glam_extractor/monitoring/alerts.py`
  - **Alert conditions**:
    - Extraction failure rate >5%
    - Classification confidence <0.7 for >10% of sites
    - API downtime
    - Schema validation failure rate >1%
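The threshold-based alert conditions above reduce to a small pure function, sketched here with hypothetical metric names so each threshold can be unit-tested (API downtime would be detected upstream by the extraction pipeline rather than from run metrics):

```python
from typing import Dict, List

# Thresholds from the Day 26 alert conditions; keys are illustrative.
THRESHOLDS = {
    "failure_rate": 0.05,             # extraction failure rate >5%
    "low_confidence_share": 0.10,     # >10% of sites below 0.7 confidence
    "validation_failure_rate": 0.01,  # schema validation failures >1%
}


def evaluate_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the names of alert conditions breached by this run's metrics.

    Missing metrics default to 0.0, i.e. no alert.
    """
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]
```

Keeping the thresholds in one module-level mapping means `alerts.py` can later load them from configuration without touching the evaluation logic.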
#### Afternoon (4 hours)

- [ ] **[D]** Create monitoring dashboard (optional) `[3h]` (P2)
  - **Tool**: Jupyter notebook or Streamlit
  - **Visualizations**:
    - Extraction pipeline status
    - Data quality trends over time
    - Geographic coverage map
    - Institution type distribution
- [ ] **[V]** Test monitoring system `[1h]` (P1)
  - Simulate failures
  - Verify alerts trigger
  - Test dashboard functionality

### Day 27: Production Deployment

#### Morning (4 hours)

- [ ] **[D]** Package extraction module `[2h]` (P0)
  - Create `setup.py` or `pyproject.toml`
  - Document installation instructions
  - Test installation in clean environment
- [ ] **[D]** Create Docker container (optional) `[2h]` (P2)
  - **File**: `docker/Dockerfile.unesco`
  - Containerize extraction pipeline
  - Document Docker deployment

#### Afternoon (4 hours)

- [ ] **[D]** Deploy to production environment `[2h]` (P0)
  - Upload code to production server
  - Install dependencies
  - Configure environment variables
  - Test production execution
- [ ] **[V]** Production smoke test `[1h]` (P0)
  - Run extraction on production
  - Verify output files created
  - Verify database updated
  - Verify logs generated
- [ ] **[D]** Configure scheduled execution `[1h]` (P0)
  - Set up cron job for weekly updates
  - Configure log rotation
  - Set up error notifications

### Day 28: Handoff & Training

#### Full Day (8 hours)

- [ ] **[D]** Create training materials `[3h]` (P1)
  - **Presentation**: Project overview and architecture
  - **Demo**: Live extraction walkthrough
  - **Exercises**: Hands-on practice for maintainers
- [ ] **[D]** Conduct knowledge transfer session `[3h]` (P1)
  - Present to stakeholders
  - Walk through codebase
  - Answer questions
  - Document Q&A
- [ ] **[D]** Create support plan `[2h]` (P1)
  - **File**: `docs/unesco/support.md`
  - Contact information for maintainers
  - Escalation procedures
  - Known issues and workarounds

---

## Phase 7: Finalization (Days 29-30, 8 hours)

### Day 29: Final Review

#### Morning (4 hours)

- [ ] **[V]** Code review `[2h]` (P0)
  - Review all implementation files
  - Check code quality (linting, type hints)
  - Verify documentation completeness
- [ ] **[V]** Test coverage review `[1h]` (P0)
  - Verify coverage >80%
  - Review untested code paths
  - Document rationale for any skipped tests
- [ ] **[V]** Security review `[1h]` (P0)
  - Check for hardcoded credentials
  - Verify API keys in environment variables
  - Check for SQL injection vulnerabilities
  - Verify input validation

#### Afternoon (4 hours)

- [ ] **[D]** Final documentation pass `[2h]` (P0)
  - Proofread all documentation
  - Check for broken links
  - Verify code examples are correct
  - Update README with final statistics
- [ ] **[V]** Stakeholder demo `[2h]` (P0)
  - Present final results
  - Demo visualization of extracted data
  - Discuss next steps
  - Gather feedback

### Day 30: Release & Archive

#### Morning (4 hours)

- [ ] **[D]** Create release package `[2h]` (P0)
  - Tag release version: `v1.0.0-unesco`
  - Create GitHub release (if applicable)
  - Archive source code
  - Archive extracted datasets
- [ ] **[D]** Publish datasets `[1h]` (P0)
  - Upload to project repository
  - Generate dataset DOI (Zenodo or similar)
  - Update dataset catalog
- [ ] **[D]** Create project retrospective `[1h]` (P1)
  - **File**: `reports/unesco_retrospective.md`
  - What went well
  - What could be improved
  - Lessons learned
  - Recommendations for future extractions

#### Afternoon (4 hours)

- [ ] **[D]** Final cleanup `[2h]` (P0)
  - Remove temporary files
  - Archive logs
  - Clean up test data
  - Optimize database indexes
- [ ] **[V]** Final quality gate `[1h]` (P0)
  - **Checklist**:
    - ✓ All tests passing
    - ✓ All data validated
    - ✓ All documentation complete
    - ✓ Production deployment successful
    - ✓ Monitoring in place
    - ✓ Stakeholders trained
- [ ] **[D]** Project close-out `[1h]` (P0)
  - Send completion notification
  - Archive project artifacts
  - Update project status: COMPLETE

---

## Appendix A: Task Dependencies

### Critical Path (Must Complete in Order)
1. **Day 1-2**: Environment setup → **Blocks all development**
2. **Day 3-5**: Schema & types → **Blocks extraction logic**
3. **Day 6-8**: Data extraction → **Blocks validation**
4. **Day 9-12**: Batch processing → **Blocks export**
5. **Day 13-16**: Validation → **Blocks production deployment**
6. **Day 17-20**: Export → **Blocks integration**
7. **Day 25-28**: Deployment → **Blocks project completion**

### Parallel Work Opportunities

- **Days 7-10**: Location extraction + Identifier extraction can be done in parallel
- **Days 17-20**: All exporters (JSON-LD, RDF, CSV, Parquet, SQLite) can be developed in parallel
- **Days 21-22**: Wikidata enrichment + ISIL lookup can be done in parallel

---

## Appendix B: Quality Gates

### Gate 1: Schema Complete (End of Day 5)
- [ ] All LinkML schemas validated
- [ ] Python classes generated successfully
- [ ] Enum values documented
- [ ] GHCID format specified

### Gate 2: Extraction Pipeline Complete (End of Day 12)
- [ ] API client tested with live data
- [ ] All extractors implemented and tested
- [ ] Batch processor handles 1,200+ sites
- [ ] Error handling robust

### Gate 3: Validation Complete (End of Day 16)
- [ ] All records pass LinkML validation
- [ ] Quality metrics meet targets
- [ ] Test coverage >80%
- [ ] Sample manual review passes

### Gate 4: Export Complete (End of Day 20)
- [ ] All export formats tested
- [ ] Output files validated
- [ ] Database indexes optimized
- [ ] Sample queries work

### Gate 5: Production Ready (End of Day 28)
- [ ] Production deployment successful
- [ ] Monitoring in place
- [ ] Documentation complete
- [ ] Team trained

### Gate 6: Project Complete (End of Day 30)
- [ ] All tests passing
- [ ] All datasets published
- [ ] Stakeholders notified
- [ ] Project archived

---

## Appendix C: Risk Management

### High-Priority Risks

#### Risk 1: UNESCO API Changes
- **Impact**: High (extraction fails)
- **Probability**: Low
- **Mitigation**:
  - Monitor API version
  - Cache raw API responses
  - Implement schema validation on API responses
  - Create fallback to manual data files
#### Risk 2: Institution Type Classification Accuracy <85%
- **Impact**: Medium (data quality issues)
- **Probability**: Medium
- **Mitigation**:
  - Iterative refinement of keywords
  - Manual review of 100+ samples
  - Add confidence threshold (skip uncertain classifications)
  - Document known edge cases

#### Risk 3: GHCID Collisions
- **Impact**: Medium (duplicate identifiers)
- **Probability**: Low (<5% expected)
- **Mitigation**:
  - Implement collision detection
  - Use Wikidata Q-numbers for disambiguation
  - Document collision resolution rules
  - Manual review of collisions

#### Risk 4: Performance Issues (>10 min processing time)
- **Impact**: Low (convenience only)
- **Probability**: Medium
- **Mitigation**:
  - Implement parallel processing
  - Cache geocoding results
  - Batch Wikidata queries
  - Profile and optimize slow operations

#### Risk 5: Wikidata Enrichment Rate Limiting
- **Impact**: Low (enrichment incomplete)
- **Probability**: High
- **Mitigation**:
  - Respect rate limits (1 req/sec)
  - Implement exponential backoff
  - Make enrichment optional
  - Run enrichment as separate batch job

---

## Appendix D: Resource Requirements

### Software Dependencies

**Required**:
- Python 3.11+
- linkml 1.7.10+
- pydantic 1.10.13
- requests 2.31.0

**Optional**:
- rapidfuzz 3.5.2 (fuzzy matching)
- SPARQLWrapper 2.0.0 (Wikidata)
- rdflib 7.0.0 (RDF export)
- pyarrow 14.0.0 (Parquet export)

### Infrastructure

**Development**:
- Laptop/workstation with 8GB+ RAM
- Internet connection (API access)
- ~2GB disk space for data

**Production**:
- Server with 4GB+ RAM
- Cron scheduling capability
- 5GB disk space (with logs)

### Team Resources

**Estimated effort**: 120 person-hours over 30 days

**Roles**:
- Software Engineer (primary developer)
- Data Scientist (classification logic)
- QA Engineer (testing)
- Documentation Writer (optional)

---

## Appendix E: Success Metrics

### Quantitative Metrics
- **Extraction completeness**: >95% of UNESCO sites processed successfully
- **Classification accuracy**: >85% correct institution type assignments
- **Validation pass rate**: 100% of records pass LinkML schema validation
- **Test coverage**: >80% code coverage
- **Processing time**: <10 minutes for full dataset
- **Data quality score**: >90% completeness on required fields

### Qualitative Metrics

- **Code maintainability**: Clear documentation, modular design
- **Data usability**: Stakeholders can query and visualize data
- **Reproducibility**: Pipeline can be re-run by new team members
- **Extensibility**: Easy to add new data sources or fields

---

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.1 | 2025-11-10 | AI Agent (OpenCode) | Updated Day 6 API client and parser sections to reflect OpenDataSoft Explore API v2 response structure (nested `record.fields`, pagination links, coordinate objects) |
| 1.0 | 2025-11-09 | GLAM Extraction Team | Initial version |

---

**Document Status**: ✅ COMPLETE
**Next Steps**: Begin Phase 0 - Pre-Implementation Setup

**Questions?** Refer to planning documents 00-06 for detailed specifications.