# UNESCO Data Extraction - Implementation Phases
**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 03 - Implementation Phases
**Version**: 1.1
**Date**: 2025-11-10
**Status**: Draft
---
## Executive Summary
This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.
**Total Effort**: 30 working days (6 weeks)
**Team Size**: 1-2 developers + AI agents
**Target Output**: 1,000+ heritage custodian records from UNESCO sites
---
## Phase Overview
| Phase | Duration | Focus | Key Deliverables |
|-------|----------|-------|------------------|
| **Phase 1** | 5 days | API Exploration & Schema Design | UNESCO API parser, LinkML Map schema |
| **Phase 2** | 8 days | Extractor Implementation | Institution type classifier, GHCID generator |
| **Phase 3** | 6 days | Data Quality & Validation | LinkML validator, conflict resolver |
| **Phase 4** | 5 days | Integration & Enrichment | Wikidata enrichment, dataset merge |
| **Phase 5** | 6 days | Export & Documentation | RDF/JSON-LD exporters, user docs |
---
## Phase 1: API Exploration & Schema Design (Days 1-5)
### Objectives
- Understand UNESCO DataHub API structure and data quality
- Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
- Create test fixtures from real UNESCO API responses
- Establish baseline for institution type classification
### Day 1: UNESCO API Reconnaissance
**Tasks**:
1. **API Documentation Review**
- Study UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
- Identify available endpoints (dataset `whc001` - World Heritage List)
- Document authentication requirements (none - public dataset)
- Document pagination limits (max 100 records per request, use `offset` parameter)
- Test API responses for sample sites
2. **Data Structure Analysis**
```bash
# Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
> samples/unesco_site_list.json
# Fetch specific site by unique_number using ODSQL
curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
> samples/unesco_angkor_detail.json
# Response structure: nested record.fields with coordinates object
# {
# "total_count": 1248,
# "records": [{
# "record": {
# "fields": {
# "name_en": "Angkor",
# "unique_number": 668,
# "coordinates": {"lat": 13.4333, "lon": 103.8333}
# }
# }
# }]
# }
```
3. **Schema Mapping**
- Map OpenDataSoft `record.fields` to LinkML `HeritageCustodian` slots
- Identify missing fields (require inference or external enrichment)
- Document ambiguities (e.g., when is a site also a museum?)
- Handle nested response structure (`response['records'][i]['record']['fields']`)
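To make the nested structure concrete, a minimal sketch of walking a captured response (the field names `name_en`, `unique_number`, and `coordinates` come from the Day 1 sample above and may differ in the live dataset):
```python
# Minimal sketch: extract core fields from a captured v2.0 response
import json

with open("samples/unesco_site_list.json") as f:
    response = json.load(f)

for entry in response.get("records", []):
    fields = entry["record"]["fields"]        # nested record -> fields
    name = fields.get("name_en")              # English site name (field name from the sample)
    whc_id = fields.get("unique_number")      # UNESCO WHC identifier
    coords = fields.get("coordinates") or {}  # {"lat": ..., "lon": ...}
    print(whc_id, name, coords.get("lat"), coords.get("lon"))
```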
**Deliverables**:
- `docs/unesco-api-analysis.md` - API structure documentation
- `tests/fixtures/unesco_api_responses/` - 10+ sample JSON files
- `docs/unesco-to-linkml-mapping.md` - Field mapping table
**Success Criteria**:
- [ ] Successfully fetch data for 50 UNESCO sites via API
- [ ] Document all relevant JSON fields for extraction
- [ ] Identify 3+ institution type classification patterns
---
### Day 2: LinkML Map Schema Design (Part 1)
**Tasks**:
1. **Install LinkML Map Extension**
```bash
pip install linkml-map
# OR implement custom extension: src/glam_extractor/mappers/extended_map.py
```
2. **Design Transformation Rules**
- Create `schemas/maps/unesco_to_heritage_custodian.yaml`
- Define JSONPath expressions for field extraction
- Handle multilingual names (UNESCO provides English, French, and often a local-language form)
- Map UNESCO categories to InstitutionTypeEnum
3. **Conditional Extraction Logic**
```yaml
# Example LinkML Map rule
mappings:
- source_path: $.category
target_path: institution_type
transform:
type: conditional
rules:
- condition: "contains(description, 'museum')"
value: MUSEUM
- condition: "contains(description, 'library')"
value: LIBRARY
- condition: "contains(description, 'archive')"
value: ARCHIVE
- default: MIXED
```
**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (initial version)
- `tests/test_unesco_linkml_map.py` - Unit tests for mapping rules
**Success Criteria**:
- [ ] LinkML Map schema validates against sample UNESCO JSON
- [ ] Successfully extract name, location, UNESCO WHC ID from 10 fixtures
- [ ] Handle multilingual names without data loss
---
### Day 3: LinkML Map Schema Design (Part 2)
**Tasks**:
1. **Advanced Transformation Rules**
- Regex extraction for identifiers (UNESCO WHC ID format: `^\d{3,4}$`)
- GeoNames ID lookup from UNESCO location strings
- Wikidata Q-number extraction from UNESCO external links (see the sketch after this task list)
2. **Multi-value Array Handling**
```yaml
# Extract all languages from UNESCO site names
mappings:
- source_path: $.names[*]
target_path: alternative_names
transform:
type: array
element_transform:
type: template
template: "{name}@{lang}"
```
3. **Error Handling Patterns**
- Missing required fields → skip record with warning
- Invalid coordinates → flag for manual geocoding
- Unknown institution type → default to MIXED, log for review
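The regex extraction from item 1 and the fallback defaults from item 3 can be prototyped as plain functions before being encoded as Map transformation rules. A minimal sketch (the helper names are illustrative, not part of the extractor API):
```python
import re
from typing import List, Optional

WHC_ID_PATTERN = re.compile(r"^\d{3,4}$")       # UNESCO WHC ID: 3-4 digits
WIKIDATA_QID_PATTERN = re.compile(r"(Q\d+)")    # Q-number inside an external link

def extract_whc_id(value: str) -> Optional[str]:
    """Return the WHC ID if the value matches the expected format, else None."""
    value = value.strip()
    return value if WHC_ID_PATTERN.match(value) else None

def extract_wikidata_qid(url: str) -> Optional[str]:
    """Pull a Q-number out of a Wikidata URL, e.g. https://www.wikidata.org/wiki/Q9259."""
    match = WIKIDATA_QID_PATTERN.search(url)
    return match.group(1) if match else None

def institution_type_or_default(candidates: List[str]) -> str:
    """Unknown institution type falls back to MIXED (logged for review in the real pipeline)."""
    return candidates[0] if candidates else "MIXED"
```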
**Deliverables**:
- `schemas/maps/unesco_to_heritage_custodian.yaml` (complete)
- `docs/linkml-map-extension-spec.md` - Custom extension specification
**Success Criteria**:
- [ ] Extract ALL relevant fields from 10 diverse UNESCO sites
- [ ] Handle edge cases (missing data, malformed coordinates)
- [ ] Generate valid LinkML instances from real API responses
---
### Day 4: Test Fixture Creation
**Tasks**:
1. **Curate Representative Samples**
- Select 20 UNESCO sites covering:
- All continents (Europe, Asia, Africa, Americas, Oceania)
- Multiple institution types (museums, libraries, archives, botanical gardens)
- Edge cases (serial nominations, transboundary sites)
2. **Create Expected Outputs**
```yaml
# tests/fixtures/expected_outputs/unesco_louvre.yaml
- id: https://w3id.org/heritage/custodian/fr/louvre
name: Musée du Louvre
institution_type: MUSEUM
locations:
- city: Paris
country: FR
coordinates: [48.8606, 2.3376]
identifiers:
- identifier_scheme: UNESCO_WHC
identifier_value: "600"
identifier_url: https://whc.unesco.org/en/list/600
provenance:
data_source: UNESCO_WORLD_HERITAGE
data_tier: TIER_1_AUTHORITATIVE
```
3. **Golden Dataset Construction**
- Manually verify 20 expected outputs against authoritative sources
- Document any assumptions or inferences made
**Deliverables**:
- `tests/fixtures/unesco_api_responses/` - 20 JSON files
- `tests/fixtures/expected_outputs/` - 20 YAML files
- `tests/test_unesco_golden_dataset.py` - Integration tests
**Success Criteria**:
- [ ] 20 golden dataset pairs (input JSON + expected YAML)
- [ ] 100% passing tests for golden dataset
- [ ] Documented edge cases and classification rules
---
### Day 5: Institution Type Classifier Design
**Tasks**:
1. **Pattern Analysis**
- Analyze UNESCO descriptions for GLAM-related keywords
- Create decision tree for institution type classification
- Handle ambiguous cases (e.g., "archaeological park with museum")
2. **Keyword Extraction**
```python
# src/glam_extractor/classifiers/unesco_institution_type.py
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
```
3. **Confidence Scoring**
- High confidence (0.9+): Explicit mentions of "museum" or "library"
- Medium confidence (0.7-0.9): Inferred from UNESCO category + keywords
- Low confidence (0.5-0.7): Ambiguous, default to MIXED, flag for review
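A minimal sketch of how the keyword lists above and the confidence bands could combine (the exact scores are illustrative and should be calibrated against the golden dataset):
```python
from typing import Tuple

def classify_with_confidence(description: str) -> Tuple[str, float]:
    """Return (institution_type, confidence) for a free-text UNESCO description."""
    text = description.lower()
    hits = [
        inst_type
        for inst_type, keywords in (
            ("MUSEUM", MUSEUM_KEYWORDS),
            ("LIBRARY", LIBRARY_KEYWORDS),
            ("ARCHIVE", ARCHIVE_KEYWORDS),
        )
        if any(keyword in text for keyword in keywords)
    ]
    if len(hits) == 1:
        return hits[0], 0.9    # high confidence: one explicit keyword family
    if len(hits) > 1:
        return "MIXED", 0.75   # medium confidence: several GLAM functions mentioned
    return "MIXED", 0.5        # low confidence: no keyword, flag for manual review
```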
**Deliverables**:
- `src/glam_extractor/classifiers/unesco_institution_type.py`
- `tests/test_unesco_classifier.py` - 50+ test cases
- `docs/unesco-classification-rules.md` - Decision tree documentation
**Success Criteria**:
- [ ] Classifier achieves 90%+ accuracy on golden dataset
- [ ] Low-confidence classifications flagged for manual review
- [ ] Handle multilingual descriptions (English, French, Spanish, etc.)
---
## Phase 2: Extractor Implementation (Days 6-13)
### Objectives
- Implement UNESCO API client with caching and rate limiting
- Build LinkML instance generator using Map schema
- Create GHCID generator for UNESCO institutions
- Achieve 100% test coverage for core extraction logic
### Day 6: UNESCO API Client
**Tasks**:
1. **HTTP Client Implementation**
```python
# src/glam_extractor/parsers/unesco_api_client.py
class UNESCOAPIClient:
def __init__(self, cache_ttl: int = 86400):
self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
self.cache = Cache(ttl=cache_ttl)
self.rate_limiter = RateLimiter(requests_per_second=2)
def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
"""
Fetch paginated list of UNESCO World Heritage Sites.
Returns OpenDataSoft API response with structure:
{
"total_count": int,
"records": [{"record": {"id": str, "fields": {...}}}, ...]
}
"""
...
def fetch_site_details(self, whc_id: int) -> Dict:
"""
Fetch detailed information for a specific site using ODSQL query.
Returns single record with structure:
{"record": {"id": str, "fields": {field_name: value, ...}}}
"""
...
```
2. **Caching Strategy**
- Cache API responses for 24 hours (UNESCO updates infrequently)
- Store in SQLite database: `cache/unesco_api_cache.db`
- Invalidate cache on demand for data refreshes
3. **Error Handling**
- Network errors → retry with exponential backoff
- 404 Not Found → skip site, log warning
- Rate limit exceeded → pause and retry
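A minimal sketch of the retry behaviour described in item 3, using `requests` (the backoff parameters and status-code handling are assumptions to be tuned during implementation):
```python
import time
import requests

def fetch_with_retry(url: str, params: dict, max_retries: int = 5) -> dict:
    """GET with exponential backoff; raises after max_retries failed attempts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            if response.status_code == 404:
                return {}                 # site not found: caller skips and logs a warning
            if response.status_code == 429:
                time.sleep(2 ** attempt)  # rate limit exceeded: pause and retry
                continue
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout):
            time.sleep(2 ** attempt)      # network error: exponential backoff (1s, 2s, 4s, ...)
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")
```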
**Deliverables**:
- `src/glam_extractor/parsers/unesco_api_client.py`
- `tests/test_unesco_api_client.py` - Mock API tests
- `cache/unesco_api_cache.db` - SQLite cache database
**Success Criteria**:
- [ ] Successfully fetch all UNESCO sites (1,000+ sites)
- [ ] Handle API errors gracefully (no crashes)
- [ ] Cache reduces API calls by 95% on repeat runs
---
### Day 7: LinkML Instance Generator
**Tasks**:
1. **Apply LinkML Map Transformations**
```python
# src/glam_extractor/parsers/unesco_parser.py
from linkml_map import Mapper
def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
"""
Parse OpenDataSoft API response to HeritageCustodian instance.
Args:
api_response: OpenDataSoft record with nested structure:
{"record": {"id": str, "fields": {field_name: value, ...}}}
Returns:
HeritageCustodian: Validated LinkML instance
"""
# Extract fields from nested structure
site_data = api_response['record']['fields']
mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
instance = mapper.transform(site_data)
return HeritageCustodian(**instance)
```
2. **Validation Pipeline**
- Apply LinkML schema validation after transformation
- Catch validation errors, log details
- Skip invalid records, continue processing (see the sketch after this task list)
3. **Multi-language Name Handling**
```python
def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
"""
Extract primary name and alternative names in multiple languages.
Args:
site_data: Extracted fields from OpenDataSoft record['fields']
"""
primary_name = site_data.get('site', '')
alternative_names = []
# OpenDataSoft may provide language variants in separate fields
# or as structured data - adjust based on actual API response
for lang_data in site_data.get('names', []):
name = lang_data.get('name', '')
lang = lang_data.get('lang', 'en')
if name and name != primary_name:
alternative_names.append(f"{name}@{lang}")
return primary_name, alternative_names
```
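Item 2's validation pipeline could wrap the parser roughly as sketched below; it reuses the `SchemaValidator` interface assumed later in this plan, and the skip-and-log behaviour follows the rules above:
```python
import logging
from typing import Iterable, List

log = logging.getLogger("unesco_extraction")

def parse_and_validate(api_responses: Iterable[dict]) -> List["HeritageCustodian"]:
    """Parse each record and keep only instances that pass schema validation."""
    validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
    valid_records = []
    for response in api_responses:
        try:
            custodian = parse_unesco_site(response)
        except (KeyError, ValueError) as exc:
            log.warning("Skipping malformed record: %s", exc)
            continue
        result = validator.validate(custodian)
        if result.is_valid:
            valid_records.append(custodian)
        else:
            # Log details and continue processing, per the error-handling rules above
            log.warning("Validation failed for %s: %s", custodian.name, result)
    return valid_records
```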
**Deliverables**:
- `src/glam_extractor/parsers/unesco_parser.py`
- `tests/test_unesco_parser.py` - 20 golden dataset tests
**Success Criteria**:
- [ ] Parse 20 golden dataset fixtures with 100% accuracy
- [ ] Extract multilingual names without data loss
- [ ] Generate valid LinkML instances (pass schema validation)
---
### Day 8: GHCID Generator for UNESCO Sites
**Tasks**:
1. **Extend GHCID Logic for UNESCO**
```python
# src/glam_extractor/identifiers/ghcid_generator.py
def generate_ghcid_for_unesco_site(
country_code: str,
region_code: str,
city_code: str,
institution_type: InstitutionTypeEnum,
institution_name: str,
has_collision: bool = False
) -> str:
"""
Generate GHCID for UNESCO World Heritage Site institution.
Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)
Note: Collision suffix uses native language institution name in snake_case,
NOT Wikidata Q-numbers. See docs/plan/global_glam/07-ghcid-collision-resolution.md
"""
...
```
2. **City Code Lookup**
- Use GeoNames API to convert city names to UN/LOCODE
- Fallback to 3-letter abbreviation if UN/LOCODE not found
- Cache lookups to minimize API calls (see the sketch after this task list)
3. **Collision Detection**
- Check existing GHCID dataset for collisions
- Apply temporal priority rules (first batch vs. historical addition)
- Append native language name suffix if collision detected
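Item 2's city-code lookup could look roughly like the sketch below; `geonames_unlocode()` is a hypothetical helper standing in for the GeoNames call, and the in-memory cache and 3-letter fallback mirror the rules above:
```python
from functools import lru_cache
from typing import Optional

@lru_cache(maxsize=None)
def lookup_city_code(city_name: str, country_code: str) -> str:
    """Return a UN/LOCODE-style city code, falling back to a 3-letter abbreviation."""
    unlocode: Optional[str] = geonames_unlocode(city_name, country_code)  # hypothetical GeoNames helper
    if unlocode:
        # UN/LOCODE is "<country><city>", e.g. "FRPAR"; keep only the 3-letter city part
        return unlocode[-3:]
    # Fallback when no UN/LOCODE is found: first three letters of the city name
    return city_name[:3].upper()
```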
**Deliverables**:
- Extended `src/glam_extractor/identifiers/ghcid_generator.py`
- `tests/test_ghcid_unesco.py` - GHCID generation tests
**Success Criteria**:
- [ ] Generate valid GHCIDs for 20 golden dataset institutions
- [ ] No collisions with existing Dutch ISIL dataset
- [ ] Handle missing Wikidata Q-numbers gracefully
---
### Day 9-10: Batch Processing Pipeline
**Tasks**:
1. **Parallel Processing**
```python
# scripts/extract_unesco_sites.py
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional

api_client = UNESCOAPIClient()  # shared client (cached and rate-limited)

def extract_all_unesco_sites(max_workers: int = 4):
    # Fetch paginated site list from OpenDataSoft API
    all_sites = []
    offset = 0
    limit = 100
    while True:
        response = api_client.fetch_site_list(limit=limit, offset=offset)
        page = response['records']
        all_sites.extend(page)
        if not page or len(all_sites) >= response['total_count']:
            break
        offset += limit
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(process_unesco_site, all_sites)
    return list(results)

def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
    """
    Process single OpenDataSoft record.
    Args:
        site_record: {"record": {"id": str, "fields": {...}}}
    """
    whc_id = site_record.get('record', {}).get('fields', {}).get('unique_number')
    try:
        # Fetch full details if needed (or use site_record directly)
        details = api_client.fetch_site_details(whc_id)
        institution_type = classify_institution_type(details['record']['fields'])
        custodian = parse_unesco_site(details)
        custodian.institution_type = institution_type  # apply classifier result
        custodian.ghcid = generate_ghcid(custodian)
        return custodian
    except Exception as e:
        log.error(f"Failed to process site {whc_id}: {e}")
        return None
```
2. **Progress Tracking**
- Use `tqdm` for progress bars
- Log successful extractions to `logs/unesco_extraction.log`
- Save intermediate results every 100 sites
3. **Error Recovery**
- Resume from last checkpoint if script crashes
- Separate successful extractions from failed ones
- Generate error report with failed site IDs
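A minimal sketch of the checkpointing described in item 3 (the checkpoint file name and granularity are assumptions):
```python
import json
from pathlib import Path

CHECKPOINT_FILE = Path("data/unesco_extracted/checkpoint.json")  # assumed location

def load_checkpoint() -> set:
    """Return the set of WHC IDs already processed, if a checkpoint exists."""
    if CHECKPOINT_FILE.exists():
        return set(json.loads(CHECKPOINT_FILE.read_text()))
    return set()

def save_checkpoint(processed_ids: set) -> None:
    """Persist processed IDs so a crashed run can resume where it left off."""
    CHECKPOINT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT_FILE.write_text(json.dumps(sorted(processed_ids)))

# In the batch loop: skip sites already in load_checkpoint() and call
# save_checkpoint() after every 100 successful extractions.
```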
**Deliverables**:
- `scripts/extract_unesco_sites.py` - Batch extraction script
- `data/unesco_extracted/` - Output directory for YAML instances
- `logs/unesco_extraction.log` - Extraction log
**Success Criteria**:
- [ ] Process all 1,000+ UNESCO sites in < 2 hours
- [ ] < 5% failure rate (API errors, missing data)
- [ ] Successful extractions saved as valid LinkML YAML files
---
### Day 11-12: Integration Testing
**Tasks**:
1. **End-to-End Tests**
```python
# tests/integration/test_unesco_pipeline.py
def test_full_unesco_extraction_pipeline():
"""Test complete pipeline from OpenDataSoft API fetch to LinkML instance."""
# 1. Fetch API data from OpenDataSoft
api_client = UNESCOAPIClient()
response = api_client.fetch_site_details(600) # Paris, Banks of the Seine
site_data = response['record']['fields'] # Extract from nested structure
# 2. Classify institution type
inst_type = classify_institution_type(site_data)
assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]
# 3. Parse to LinkML instance
custodian = parse_unesco_site(response)
assert custodian.name is not None
# 4. Generate GHCID
custodian.ghcid = generate_ghcid(custodian)
assert custodian.ghcid.startswith("FR-")
# 5. Validate against schema
validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
result = validator.validate(custodian)
assert result.is_valid
```
2. **Property-Based Testing**
```python
from hypothesis import given, strategies as st
@given(st.integers(min_value=1, max_value=1500))
def test_ghcid_determinism(whc_id: int):
"""GHCID generation is deterministic for same input."""
site1 = generate_ghcid_for_site(whc_id)
site2 = generate_ghcid_for_site(whc_id)
assert site1 == site2
```
3. **Performance Testing**
- Benchmark extraction speed (sites per second)
- Memory profiling (ensure no memory leaks)
- Cache hit rate analysis
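The throughput benchmark can start as simple wall-clock timing over a sample batch (a sketch; sites per second is the metric named above):
```python
import time

def benchmark_extraction(sample_records: list) -> float:
    """Return extraction throughput in sites per second for a sample batch."""
    start = time.perf_counter()
    for record in sample_records:
        process_unesco_site(record)
    elapsed = time.perf_counter() - start
    return len(sample_records) / elapsed if elapsed else 0.0
```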
**Deliverables**:
- `tests/integration/test_unesco_pipeline.py` - End-to-end tests
- `tests/test_unesco_property_based.py` - Property-based tests
- `docs/performance-benchmarks.md` - Performance results
**Success Criteria**:
- [ ] 100% passing integration tests
- [ ] Extract 1,000 sites in < 2 hours (with cache)
- [ ] Memory usage < 500MB for full extraction
---
### Day 13: Code Review & Refactoring
**Tasks**:
1. **Code Quality Review**
- Run `ruff` linter, fix all warnings
- Run `mypy` type checker, resolve type errors
- Ensure 90%+ test coverage
2. **Documentation Review**
- Add docstrings to all public functions
- Update README with UNESCO extraction instructions
- Create developer guide for extending classifiers
3. **Performance Optimization**
- Profile slow functions, optimize bottlenecks
- Reduce redundant API calls
- Optimize GHCID generation (cache city code lookups)
**Deliverables**:
- Refactored codebase with 90%+ test coverage
- Updated documentation
- Performance optimizations applied
**Success Criteria**:
- [ ] Zero linter warnings
- [ ] Zero type errors
- [ ] Test coverage > 90%
---
## Phase 3: Data Quality & Validation (Days 14-19)
### Objectives
- Cross-reference UNESCO data with existing GLAM dataset (Dutch ISIL, conversations)
- Detect and resolve conflicts
- Implement confidence scoring system
- Generate data quality report
### Day 14-15: Cross-Referencing with Existing Data
**Tasks**:
1. **Load Existing Dataset**
```python
# scripts/crosslink_unesco_with_glam.py
def load_existing_glam_dataset():
"""Load Dutch ISIL + conversation extractions."""
dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
conversations = load_conversation_extractions("data/instances/")
return merge_datasets([dutch_isil, dutch_orgs, conversations])
```
2. **Match UNESCO Sites to Existing Records**
- Match by Wikidata Q-number (highest confidence)
- Match by ISIL code (for Dutch institutions)
- Match by name + location (fuzzy matching, score > 0.85; see the sketch after this task list)
3. **Conflict Detection**
```python
def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
"""Detect field-level conflicts between UNESCO and existing data."""
conflicts = []
if unesco_record.name != existing_record.name:
conflicts.append(Conflict(
field="name",
unesco_value=unesco_record.name,
existing_value=existing_record.name,
resolution="MANUAL_REVIEW"
))
# Check institution type mismatch
if unesco_record.institution_type != existing_record.institution_type:
conflicts.append(Conflict(field="institution_type", ...))
return conflicts
```
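Item 2's fuzzy name matching can be prototyped with the standard library before committing to a dedicated matching library; the 0.85 threshold comes from the task above, and restricting candidates by location is left to the caller:
```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Normalized similarity between two institution names (0.0-1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_fuzzy_match(unesco_record, existing_records, threshold: float = 0.85):
    """Return the best-scoring existing record above the threshold, else None."""
    best, best_score = None, 0.0
    for candidate in existing_records:
        score = name_similarity(unesco_record.name, candidate.name)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```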
**Deliverables**:
- `scripts/crosslink_unesco_with_glam.py` - Cross-linking script
- `data/unesco_conflicts.csv` - Detected conflicts report
- `tests/test_crosslinking.py` - Unit tests for matching logic
**Success Criteria**:
- [ ] Identify 50+ matches between UNESCO and existing dataset
- [ ] Detect conflicts (name mismatches, type discrepancies)
- [ ] Generate conflict report for manual review
---
### Day 16: Conflict Resolution
**Tasks**:
1. **Tier-Based Priority**
- TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
- When conflict: UNESCO data wins, existing data flagged
2. **Merge Strategy**
```python
def merge_unesco_with_existing(
unesco_record: HeritageCustodian,
existing_record: HeritageCustodian
) -> HeritageCustodian:
"""Merge UNESCO data with existing record, UNESCO takes priority."""
merged = existing_record.copy()
# UNESCO name becomes primary
merged.name = unesco_record.name
# Preserve alternative names from both sources
merged.alternative_names = list(set(
unesco_record.alternative_names + existing_record.alternative_names
))
# Add UNESCO identifier
merged.identifiers.append({
'identifier_scheme': 'UNESCO_WHC',
'identifier_value': unesco_record.identifiers[0].identifier_value
})
# Track provenance of merge
merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"
return merged
```
3. **Manual Review Queue**
- Flag high-impact conflicts (institution type change, location change)
- Generate review spreadsheet for human validation
- Provide evidence for each conflict (source URLs, descriptions)
**Deliverables**:
- `src/glam_extractor/validators/conflict_resolver.py`
- `data/manual_review_queue.csv` - Conflicts requiring human review
- `tests/test_conflict_resolution.py`
**Success Criteria**:
- [ ] Resolve 80% of conflicts automatically (tier-based priority)
- [ ] Flag 20% for manual review (complex cases)
- [ ] Zero data loss (preserve all alternative names, identifiers)
---
### Day 17: Confidence Scoring System
**Tasks**:
1. **Score Calculation**
```python
def calculate_confidence_score(custodian: HeritageCustodian) -> float:
"""Calculate confidence score based on data completeness and source quality."""
score = 1.0 # Start at maximum (TIER_1 authoritative)
# Deduct for missing fields
if not custodian.identifiers:
score -= 0.1
if not custodian.locations:
score -= 0.15
if custodian.institution_type == InstitutionTypeEnum.MIXED:
score -= 0.2 # Ambiguous classification
# Boost for rich metadata
if len(custodian.identifiers) > 2:
score += 0.05
if custodian.digital_platforms:
score += 0.05
return max(0.0, min(1.0, score)) # Clamp to [0.0, 1.0]
```
2. **Quality Metrics**
- Completeness: % of optional fields populated (see the sketch after this task list)
- Accuracy: Agreement with authoritative sources (Wikidata, official websites)
- Freshness: Days since extraction
3. **Tier Validation**
- Ensure all UNESCO records have `data_tier: TIER_1_AUTHORITATIVE`
- Downgrade tier if conflicts detected (TIER_1 → TIER_2)
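Item 2's completeness metric could be computed per record as sketched below (the list of optional fields is an assumption based on the schema slots used elsewhere in this plan):
```python
def completeness(custodian) -> float:
    """Fraction of selected optional fields that are populated (0.0-1.0)."""
    optional_fields = [  # assumed slot names from the HeritageCustodian schema
        "alternative_names",
        "identifiers",
        "locations",
        "digital_platforms",
    ]
    populated = sum(1 for field in optional_fields if getattr(custodian, field, None))
    return populated / len(optional_fields)
```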
**Deliverables**:
- `src/glam_extractor/validators/confidence_scorer.py`
- `tests/test_confidence_scoring.py`
- `data/unesco_quality_metrics.json` - Aggregate statistics
**Success Criteria**:
- [ ] 90%+ of UNESCO records score > 0.85 confidence
- [ ] Flag < 5% as low confidence (require review)
- [ ] Document scoring methodology in provenance metadata
---
### Day 18: LinkML Schema Validation
**Tasks**:
1. **Batch Validation**
```bash
# Validate all UNESCO extractions against LinkML schema
for file in data/unesco_extracted/*.yaml; do
linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
done
```
2. **Custom Validators**
```python
# src/glam_extractor/validators/unesco_validators.py
def validate_unesco_whc_id(whc_id: str) -> bool:
"""UNESCO WHC IDs are 3-4 digit integers."""
return bool(re.match(r'^\d{3,4}$', whc_id))
def validate_ghcid_format(ghcid: str) -> bool:
"""Validate GHCID format for UNESCO institutions."""
pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-Q\d+)?$'
return bool(re.match(pattern, ghcid))
```
3. **Error Reporting**
- Generate validation report with line numbers and error messages
- Categorize errors (required field missing, invalid enum value, format error)
- Prioritize fixes (blocking errors vs. warnings)
**Deliverables**:
- `scripts/validate_unesco_dataset.py` - Batch validation script
- `src/glam_extractor/validators/unesco_validators.py` - Custom validators
- `data/unesco_validation_report.json` - Validation results
**Success Criteria**:
- [ ] 100% of extracted records pass LinkML validation
- [ ] Zero blocking errors
- [ ] Document any warnings in provenance notes
---
### Day 19: Data Quality Report
**Tasks**:
1. **Generate Statistics**
```python
# scripts/generate_unesco_quality_report.py
def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
return {
'total_institutions': len(unesco_dataset),
'by_country': count_by_country(unesco_dataset),
'by_institution_type': count_by_type(unesco_dataset),
'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
'completeness_metrics': {
'with_wikidata_id': count_with_wikidata(unesco_dataset),
'with_digital_platform': count_with_platforms(unesco_dataset),
'with_geocoded_location': count_with_geocoding(unesco_dataset)
},
'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
}
```
2. **Visualization**
- Generate maps showing UNESCO site distribution
- Bar charts: institutions by country, by type
- Heatmap: data completeness by field
3. **Documentation**
- Write executive summary of data quality
- Document known issues and limitations
- Provide recommendations for improvement
**Deliverables**:
- `scripts/generate_unesco_quality_report.py`
- `data/unesco_quality_report.json` - Statistics
- `docs/unesco-data-quality.md` - Quality report document
- `data/visualizations/` - Maps and charts
**Success Criteria**:
- [ ] Quality report shows 90%+ completeness for core fields
- [ ] < 5% of records require manual review
- [ ] Geographic coverage across all inhabited continents
---
## Phase 4: Integration & Enrichment (Days 20-24)
### Objectives
- Enrich UNESCO data with Wikidata identifiers
- Merge UNESCO dataset with existing GLAM dataset
- Resolve GHCID collisions
- Update GHCID history for modified records
### Day 20-21: Wikidata Enrichment
**Tasks**:
1. **SPARQL Query for UNESCO Sites**
```python
# scripts/enrich_unesco_with_wikidata.py
def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
"""Find Wikidata Q-number for UNESCO World Heritage Site."""
query = f"""
SELECT ?item WHERE {{
?item wdt:P757 "{whc_id}" . # P757 = UNESCO World Heritage Site ID
}}
LIMIT 1
"""
results = sparql_query(query)
if results:
return extract_qid(results[0]['item']['value'])
return None
```
2. **Batch Enrichment**
- Query Wikidata for all 1,000+ UNESCO sites
- Extract Q-numbers, VIAF IDs, ISIL codes (if available)
- Add to `identifiers` array in LinkML instances
3. **Fuzzy Matching Fallback**
- If WHC ID not found in Wikidata, try name + location matching
- Use same fuzzy matching logic from existing enrichment scripts
- Threshold: 0.85 similarity score
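The `sparql_query()` and `extract_qid()` helpers used in the snippet above are left undefined; a minimal sketch against the public Wikidata Query Service (the User-Agent string is a placeholder and should identify the project per Wikimedia policy):
```python
import requests

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def sparql_query(query: str) -> list:
    """Run a SPARQL query against Wikidata and return the result bindings."""
    response = requests.get(
        WIKIDATA_SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "glam-extractor/0.1 (contact: example@example.org)"},  # placeholder contact
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

def extract_qid(item_uri: str) -> str:
    """Turn http://www.wikidata.org/entity/Q123 into Q123."""
    return item_uri.rsplit("/", 1)[-1]
```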
**Deliverables**:
- `scripts/enrich_unesco_with_wikidata.py` - Enrichment script
- `data/unesco_enriched/` - Enriched YAML instances
- `logs/wikidata_enrichment.log` - Enrichment log
**Success Criteria**:
- [ ] Find Wikidata Q-numbers for 80%+ of UNESCO sites
- [ ] Add VIAF/ISIL identifiers where available
- [ ] Document enrichment in provenance metadata
---
### Day 22: Dataset Merge
**Tasks**:
1. **Merge Strategy**
```python
# scripts/merge_unesco_into_glam_dataset.py
def merge_datasets(
unesco_data: List[HeritageCustodian],
existing_data: List[HeritageCustodian]
) -> List[HeritageCustodian]:
"""Merge UNESCO data into existing GLAM dataset."""
merged = existing_data.copy()
for unesco_record in unesco_data:
# Check if institution already exists
match = find_match(unesco_record, existing_data)
if match:
# Merge records
merged_record = merge_unesco_with_existing(unesco_record, match)
merged[merged.index(match)] = merged_record
else:
# Add new institution
merged.append(unesco_record)
return merged
```
2. **Deduplication**
- Detect duplicates by GHCID, Wikidata Q-number, ISIL code
- Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
- Preserve alternative names and identifiers from both sources
3. **Provenance Tracking**
- Update `provenance.notes` for merged records
- Record merge timestamp
- Link back to original extraction sources
**Deliverables**:
- `scripts/merge_unesco_into_glam_dataset.py` - Merge script
- `data/merged_glam_dataset/` - Merged dataset (YAML files)
- `data/merge_report.json` - Merge statistics
**Success Criteria**:
- [ ] Merge 1,000+ UNESCO records with existing GLAM dataset
- [ ] Deduplicate matches (no duplicate GHCIDs)
- [ ] Preserve data from all sources (no information loss)
---
### Day 23: GHCID Collision Resolution
**Tasks**:
1. **Detect Collisions**
```python
# scripts/resolve_ghcid_collisions.py
def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
"""Find institutions with identical base GHCIDs."""
ghcid_map = defaultdict(list)
for custodian in dataset:
base_ghcid = remove_q_number(custodian.ghcid)
ghcid_map[base_ghcid].append(custodian)
collisions = [
Collision(base_ghcid=k, institutions=v)
for k, v in ghcid_map.items()
if len(v) > 1
]
return collisions
```
2. **Apply Temporal Priority Rules** (see the sketch after this task list)
- Compare `provenance.extraction_date` for colliding institutions
- First batch (same date): ALL colliding institutions receive a name suffix
- Historical addition (later date): ONLY the newly added institution receives a name suffix
3. **Update GHCID History**
```python
def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
"""Record GHCID change in history."""
custodian.ghcid_history.append(GHCIDHistoryEntry(
ghcid=new_ghcid,
ghcid_numeric=generate_numeric_id(new_ghcid),
valid_from=datetime.now(timezone.utc).isoformat(),
valid_to=None,
reason=f"Name suffix added to resolve collision with {old_ghcid}"
))
```
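Item 2's temporal priority rule, sketched against the `Collision` objects produced by the detection step above (the attribute names follow the provenance slots used elsewhere in this plan; the type of suffix applied is decided downstream):
```python
from typing import List

def institutions_needing_suffix(collision) -> List:
    """Apply the temporal priority rule to one collision group.

    Assumes each institution carries provenance.extraction_date as an ISO date string.
    Returns the institutions whose GHCID must receive a disambiguating suffix.
    """
    dates = {inst.provenance.extraction_date for inst in collision.institutions}
    if len(dates) == 1:
        # First batch: the colliding institutions were extracted together,
        # so all of them receive a suffix.
        return list(collision.institutions)
    # Historical addition: only institutions added after the earliest batch get a
    # suffix; earlier GHCIDs stay unchanged (PID stability).
    earliest = min(dates)
    return [
        inst for inst in collision.institutions
        if inst.provenance.extraction_date > earliest  # ISO strings compare chronologically
    ]
```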
**Deliverables**:
- `scripts/resolve_ghcid_collisions.py` - Collision resolution script
- `data/ghcid_collision_report.json` - Detected collisions
- Updated YAML instances with `ghcid_history` entries
**Success Criteria**:
- [ ] Resolve all GHCID collisions (zero duplicates)
- [ ] Update GHCID history for affected records
- [ ] Preserve PID stability (no changes to published GHCIDs)
---
### Day 24: Final Data Validation
**Tasks**:
1. **Full Dataset Validation**
- Run LinkML validation on merged dataset
- Check for orphaned references (invalid foreign keys)
- Verify all GHCIDs are unique
2. **Integrity Checks**
```python
# tests/integration/test_merged_dataset_integrity.py
def test_no_duplicate_ghcids():
dataset = load_merged_dataset()
ghcids = [c.ghcid for c in dataset]
assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"
def test_all_unesco_sites_have_whc_id():
dataset = load_merged_dataset()
unesco_records = [c for c in dataset if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]
for record in unesco_records:
whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
```
3. **Coverage Analysis**
- Verify UNESCO sites across all continents
- Check institution type distribution (not all MUSEUM)
- Ensure Dutch institutions properly merged with ISIL registry
**Deliverables**:
- `tests/integration/test_merged_dataset_integrity.py` - Integrity tests
- `data/final_validation_report.json` - Validation results
- `docs/dataset-coverage.md` - Coverage analysis
**Success Criteria**:
- [ ] 100% passing integrity tests
- [ ] Zero duplicate GHCIDs
- [ ] UNESCO sites cover 100+ countries
---
## Phase 5: Export & Documentation (Days 25-30)
### Objectives
- Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
- Generate user documentation and API docs
- Create example queries and use case tutorials
- Publish dataset with persistent identifiers
### Day 25-26: RDF/JSON-LD Export
**Tasks**:
1. **RDF Serialization**
```python
# src/glam_extractor/exporters/rdf_exporter.py
from typing import List
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF

GLAM = Namespace("https://w3id.org/heritage/custodian/")
SCHEMA = Namespace("http://schema.org/")

def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to RDF/Turtle format."""
    graph = Graph()
    # Define namespaces
    graph.bind("glam", GLAM)
    graph.bind("schema", SCHEMA)
    graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))
    for custodian in dataset:
        uri = URIRef(custodian.id)
        # Type assertions
        graph.add((uri, RDF.type, GLAM.HeritageCustodian))
        if custodian.institution_type == InstitutionTypeEnum.MUSEUM:
            graph.add((uri, RDF.type, SCHEMA.Museum))
        # Literals
        graph.add((uri, SCHEMA.name, Literal(custodian.name)))
        graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))
        # Identifiers (owl:sameAs)
        for identifier in custodian.identifiers:
            if identifier.identifier_scheme == "Wikidata":
                graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))
    graph.serialize(destination=output_path, format="turtle")
```
2. **JSON-LD Context**
```json
// data/context/heritage_custodian_context.jsonld
{
"@context": {
"@vocab": "https://w3id.org/heritage/custodian/",
"schema": "http://schema.org/",
"name": "schema:name",
"location": "schema:location",
"identifiers": "schema:identifier",
"institution_type": "institutionType",
"data_source": "dataSource"
}
}
```
3. **Content Negotiation Setup**
- Configure w3id.org redirects (if hosting on GitHub Pages)
- Test URI resolution for sample institutions
- Ensure Accept header routing (text/turtle, application/ld+json)
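Item 3's content-negotiation check can be scripted as below (a sketch; the example URI comes from the Day 4 fixture, and the behaviour depends on the eventual w3id.org configuration):
```python
import requests

def check_content_negotiation(uri: str) -> None:
    """Request the same URI with different Accept headers and report what comes back."""
    for accept in ("text/turtle", "application/ld+json"):
        response = requests.get(uri, headers={"Accept": accept}, allow_redirects=True, timeout=30)
        print(accept, "->", response.status_code, response.headers.get("Content-Type"))

# Example, using the institution URI from the Day 4 fixture:
# check_content_negotiation("https://w3id.org/heritage/custodian/fr/louvre")
```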
**Deliverables**:
- `src/glam_extractor/exporters/rdf_exporter.py` - RDF exporter
- `data/exports/glam_dataset.ttl` - RDF/Turtle export
- `data/exports/glam_dataset.jsonld` - JSON-LD export
- `data/context/heritage_custodian_context.jsonld` - JSON-LD context
**Success Criteria**:
- [ ] RDF validates with Turtle parser
- [ ] JSON-LD validates with JSON-LD Playground
- [ ] Sample URIs resolve correctly
---
### Day 27: CSV/Parquet Export
**Tasks**:
1. **Flatten Schema for CSV**
```python
# src/glam_extractor/exporters/csv_exporter.py
def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
"""Export dataset to CSV with flattened structure."""
rows = []
for custodian in dataset:
row = {
'ghcid': custodian.ghcid,
'ghcid_uuid': str(custodian.ghcid_uuid),
'name': custodian.name,
'institution_type': custodian.institution_type,
'country': custodian.locations[0].country if custodian.locations else None,
'city': custodian.locations[0].city if custodian.locations else None,
'wikidata_id': get_identifier(custodian, 'Wikidata'),
'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
'data_source': custodian.provenance.data_source,
'data_tier': custodian.provenance.data_tier,
'confidence_score': custodian.provenance.confidence_score
}
rows.append(row)
df = pd.DataFrame(rows)
df.to_csv(output_path, index=False, encoding='utf-8-sig')
```
2. **Parquet Export (Columnar)**
```python
def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
"""Export dataset to Parquet for efficient querying."""
df = pd.DataFrame([custodian.dict() for custodian in dataset])
df.to_parquet(output_path, engine='pyarrow', compression='snappy')
```
3. **SQLite Export**
```python
def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
"""Export dataset to SQLite database."""
conn = sqlite3.connect(db_path)
# Create tables
conn.execute("""
CREATE TABLE heritage_custodians (
ghcid TEXT PRIMARY KEY,
ghcid_uuid TEXT UNIQUE,
name TEXT NOT NULL,
institution_type TEXT,
data_source TEXT,
...
)
""")
# Insert records
for custodian in dataset:
conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)
conn.commit()
```
**Deliverables**:
- `src/glam_extractor/exporters/csv_exporter.py` - CSV exporter
- `data/exports/glam_dataset.csv` - CSV export
- `data/exports/glam_dataset.parquet` - Parquet export
- `data/exports/glam_dataset.db` - SQLite database
**Success Criteria**:
- [ ] CSV opens correctly in Excel, Google Sheets
- [ ] Parquet loads in pandas, DuckDB
- [ ] SQLite database queryable with SQL
---
### Day 28: Documentation - User Guide
**Tasks**:
1. **Getting Started Guide**
````markdown
# docs/user-guide/getting-started.md
## Installation
pip install glam-extractor
## Quick Start
```python
from glam_extractor import load_dataset
# Load the GLAM dataset
dataset = load_dataset("data/exports/glam_dataset.parquet")
# Filter UNESCO museums in France
museums = dataset.filter(
    institution_type="MUSEUM",
    data_source="UNESCO_WORLD_HERITAGE",
    country="FR"
)
```
````
2. **Example Queries**
- SPARQL examples (find institutions by type, country)
- Pandas examples (data analysis, statistics)
- SQL examples (SQLite queries)
3. **API Reference**
- Document all public classes and methods
- Provide code examples for each function
- Link to LinkML schema documentation
**Deliverables**:
- `docs/user-guide/getting-started.md` - Quick start guide
- `docs/user-guide/example-queries.md` - Query examples
- `docs/api-reference.md` - API documentation
**Success Criteria**:
- [ ] Complete documentation for all public APIs
- [ ] 10+ example queries covering common use cases
- [ ] Step-by-step tutorials for data consumers
---
### Day 29: Documentation - Developer Guide
**Tasks**:
1. **Architecture Overview**
- Diagram of extraction pipeline (API → Parser → Validator → Exporter)
- Explanation of LinkML Map transformation
- GHCID generation algorithm
2. **Contributing Guide**
- How to add new institution type classifiers
- How to extend LinkML schema with new fields
- How to add new export formats
3. **Testing Guide**
- Running unit tests, integration tests
- Creating new test fixtures
- Using property-based testing
**Deliverables**:
- `docs/developer-guide/architecture.md` - Architecture docs
- `docs/developer-guide/contributing.md` - Contribution guide
- `docs/developer-guide/testing.md` - Testing guide
**Success Criteria**:
- [ ] Complete architecture documentation with diagrams
- [ ] Clear instructions for extending the system
- [ ] Comprehensive testing guide
---
### Day 30: Release & Publication
**Tasks**:
1. **Dataset Release**
- Tag repository with version number (e.g., v1.0.0-unesco)
- Create GitHub Release with exports attached
- Publish to Zenodo for DOI (persistent citation)
2. **Announcement**
- Write blog post announcing UNESCO data release
- Share on social media (Twitter, Mastodon, LinkedIn)
- Notify stakeholders (Europeana, DPLA, heritage researchers)
3. **Data Portal Update**
- Update w3id.org redirects for new institutions
- Deploy SPARQL endpoint (if applicable)
- Update REST API to include UNESCO data
**Deliverables**:
- GitHub Release with dataset exports
- Zenodo DOI for citation
- Blog post and announcement
**Success Criteria**:
- [ ] Dataset published with persistent DOI
- [ ] Documentation live and accessible
- [ ] Stakeholders notified of release
---
## Risk Mitigation
### Technical Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |
### Resource Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |
### Data Quality Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |
---
## Success Metrics
### Quantitative Metrics
- **Coverage**: Extract 1,000+ UNESCO site institutions
- **Quality**: 90%+ confidence score average
- **Completeness**: 80%+ have Wikidata Q-numbers
- **Performance**: Process all sites in < 2 hours
- **Test Coverage**: 90%+ code coverage
### Qualitative Metrics
- **Usability**: Positive feedback from 3+ data consumers
- **Documentation**: Complete user guide and API docs
- **Maintainability**: Code passes linter, type checker
- **Reproducibility**: Dataset generation fully automated
---
## Appendix: Day-by-Day Checklist
### Phase 1 (Days 1-5)
- [ ] Day 1: UNESCO API documentation reviewed, 50 sites fetched
- [ ] Day 2: LinkML Map schema (part 1) - basic transformations
- [ ] Day 3: LinkML Map schema (part 2) - advanced patterns
- [ ] Day 4: Golden dataset created (20 test fixtures)
- [ ] Day 5: Institution type classifier designed
### Phase 2 (Days 6-13)
- [ ] Day 6: UNESCO API client implemented
- [ ] Day 7: LinkML instance generator implemented
- [ ] Day 8: GHCID generator extended for UNESCO
- [ ] Day 9-10: Batch processing pipeline
- [ ] Day 11-12: Integration testing
- [ ] Day 13: Code review and refactoring
### Phase 3 (Days 14-19)
- [ ] Day 14-15: Cross-referencing with existing data
- [ ] Day 16: Conflict resolution
- [ ] Day 17: Confidence scoring system
- [ ] Day 18: LinkML schema validation
- [ ] Day 19: Data quality report
### Phase 4 (Days 20-24)
- [ ] Day 20-21: Wikidata enrichment
- [ ] Day 22: Dataset merge
- [ ] Day 23: GHCID collision resolution
- [ ] Day 24: Final data validation
### Phase 5 (Days 25-30)
- [ ] Day 25-26: RDF/JSON-LD export
- [ ] Day 27: CSV/Parquet/SQLite export
- [ ] Day 28: User guide documentation
- [ ] Day 29: Developer guide documentation
- [ ] Day 30: Release and publication
---
**Document Status**: Complete
**Next Document**: `04-tdd-strategy.md` - Test-driven development plan
**Version**: 1.1
---
## Version History
### Version 1.1 (2025-11-10)
**Changes**: Updated for OpenDataSoft Explore API v2.0 migration
- **Day 1 API Reconnaissance** (lines 44-62): Updated API endpoint from legacy `whc.unesco.org/en/list/json` to OpenDataSoft `data.unesco.org/api/explore/v2.0`
- **Day 6 UNESCO API Client** (lines 265-305):
- Updated `base_url` to OpenDataSoft API endpoint
- Removed `api_key` parameter (public dataset, no authentication)
- Added pagination parameters to `fetch_site_list()`: `limit`, `offset`
- Updated method documentation to reflect OpenDataSoft response structure: `{"record": {"fields": {...}}}`
- **Day 7 LinkML Parser** (lines 309-352):
- Updated `parse_unesco_site()` to extract from nested `api_response['record']['fields']`
- Added documentation clarifying OpenDataSoft structure
- Updated `extract_multilingual_names()` parameter name from `unesco_data` to `site_data`
- **Day 10 Batch Processing** (lines 406-450):
- Updated `extract_all_unesco_sites()` with pagination loop for OpenDataSoft API
- Updated `process_unesco_site()` to handle nested record structure
- Changed field access from `site_data['id_number']` to `site_record['record']['fields']['unique_number']`
- **Day 11-12 Integration Tests** (lines 454-483):
- Updated `test_full_unesco_extraction_pipeline()` to extract `site_data` from `response['record']['fields']`
- Added explicit documentation of OpenDataSoft API structure
**Rationale**: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.
### Version 1.0 (2025-11-09)
**Initial Release**
- Comprehensive 30-day implementation plan for UNESCO data extraction
- Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
- TDD approach with golden dataset and integration tests
- GHCID generation strategy for UNESCO heritage sites
- Wikidata enrichment and cross-referencing plan