glam/docs/plan/unesco/03-implementation-phases.md

UNESCO Data Extraction - Implementation Phases

Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 03 - Implementation Phases
Version: 1.1
Date: 2025-11-10
Status: Draft


Executive Summary

This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.

Total Effort: 30 working days (6 weeks)
Team Size: 1-2 developers + AI agents
Target Output: 1,000+ heritage custodian records from UNESCO sites


Phase Overview

Phase     Duration   Focus                             Key Deliverables
Phase 1   5 days     API Exploration & Schema Design   UNESCO API parser, LinkML Map schema
Phase 2   8 days     Extractor Implementation          Institution type classifier, GHCID generator
Phase 3   6 days     Data Quality & Validation         LinkML validator, conflict resolver
Phase 4   5 days     Integration & Enrichment          Wikidata enrichment, dataset merge
Phase 5   6 days     Export & Documentation            RDF/JSON-LD exporters, user docs

Phase 1: API Exploration & Schema Design (Days 1-5)

Objectives

  • Understand UNESCO DataHub API structure and data quality
  • Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
  • Create test fixtures from real UNESCO API responses
  • Establish baseline for institution type classification

Day 1: UNESCO API Reconnaissance

Tasks:

  1. API Documentation Review

    • Study UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
    • Identify available endpoints (dataset whc001 - World Heritage List)
    • Document authentication requirements (none - public dataset)
    • Document pagination limits (max 100 records per request, use offset parameter)
    • Test API responses for sample sites
  2. Data Structure Analysis

    # Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
    curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
      > samples/unesco_site_list.json
    
    # Fetch specific site by unique_number using ODSQL
    curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
      > samples/unesco_angkor_detail.json
    
    # Response structure: nested record.fields with coordinates object
    # {
    #   "total_count": 1248,
    #   "records": [{
    #     "record": {
    #       "fields": {
    #         "name_en": "Angkor",
    #         "unique_number": 668,
    #         "coordinates": {"lat": 13.4333, "lon": 103.8333}
    #       }
    #     }
    #   }]
    # }
    
  3. Schema Mapping

    • Map OpenDataSoft record.fields to LinkML HeritageCustodian slots
    • Identify missing fields (require inference or external enrichment)
    • Document ambiguities (e.g., when is a site also a museum?)
    • Handle nested response structure (response['records'][i]['record']['fields'])

Deliverables:

  • docs/unesco-api-analysis.md - API structure documentation
  • tests/fixtures/unesco_api_responses/ - 10+ sample JSON files
  • docs/unesco-to-linkml-mapping.md - Field mapping table

Success Criteria:

  • Successfully fetch data for 50 UNESCO sites via API
  • Document all relevant JSON fields for extraction
  • Identify 3+ institution type classification patterns

Day 2: LinkML Map Schema Design (Part 1)

Tasks:

  1. Install LinkML Map Extension

    pip install linkml-map
    # OR implement custom extension: src/glam_extractor/mappers/extended_map.py
    
  2. Design Transformation Rules

    • Create schemas/maps/unesco_to_heritage_custodian.yaml
    • Define JSONPath expressions for field extraction
    • Handle multi-language names (UNESCO provides English, French, often local language)
    • Map UNESCO categories to InstitutionTypeEnum
  3. Conditional Extraction Logic

    # Example LinkML Map rule
    mappings:
      - source_path: $.category
        target_path: institution_type
        transform:
          type: conditional
          rules:
            - condition: "contains(description, 'museum')"
              value: MUSEUM
            - condition: "contains(description, 'library')"
              value: LIBRARY
            - condition: "contains(description, 'archive')"
              value: ARCHIVE
            - default: MIXED
    

Deliverables:

  • schemas/maps/unesco_to_heritage_custodian.yaml (initial version)
  • tests/test_unesco_linkml_map.py - Unit tests for mapping rules

Success Criteria:

  • LinkML Map schema validates against sample UNESCO JSON
  • Successfully extract name, location, UNESCO WHC ID from 10 fixtures
  • Handle multilingual names without data loss

Day 3: LinkML Map Schema Design (Part 2)

Tasks:

  1. Advanced Transformation Rules

    • Regex extraction for identifiers (UNESCO WHC ID format: ^\d{3,4}$)
    • GeoNames ID lookup from UNESCO location strings
    • Wikidata Q-number extraction from UNESCO external links
  2. Multi-value Array Handling

    # Extract all languages from UNESCO site names
    mappings:
      - source_path: $.names[*]
        target_path: alternative_names
        transform:
          type: array
          element_transform:
            type: template
            template: "{name}@{lang}"
    
  3. Error Handling Patterns

    • Missing required fields → skip record with warning
    • Invalid coordinates → flag for manual geocoding
    • Unknown institution type → default to MIXED, log for review
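
The three patterns above can be sketched as a single pre-validation step. This is an illustrative sketch only: `apply_error_policy` and the `needs_geocoding` flag are hypothetical names, and the field names follow the Day 1 sample response.

```python
import logging
from typing import Optional

log = logging.getLogger("unesco")

def apply_error_policy(fields: dict) -> Optional[dict]:
    """Apply the three error-handling patterns to one raw record's fields.

    Returns a cleaned field dict, or None when the record must be skipped.
    """
    # Missing required fields -> skip record with warning
    if not fields.get("name_en") or fields.get("unique_number") is None:
        log.warning("Skipping record, missing name_en/unique_number: %r", fields)
        return None

    cleaned = dict(fields)

    # Invalid coordinates -> flag for manual geocoding
    coords = fields.get("coordinates") or {}
    lat, lon = coords.get("lat"), coords.get("lon")
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        cleaned["coordinates"] = None
        cleaned["needs_geocoding"] = True

    # Unknown institution type -> default to MIXED, log for review
    if fields.get("institution_type") not in {"MUSEUM", "LIBRARY", "ARCHIVE"}:
        cleaned["institution_type"] = "MIXED"
        log.info("Defaulted type to MIXED for %s", fields.get("unique_number"))

    return cleaned
```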

Deliverables:

  • schemas/maps/unesco_to_heritage_custodian.yaml (complete)
  • docs/linkml-map-extension-spec.md - Custom extension specification

Success Criteria:

  • Extract ALL relevant fields from 10 diverse UNESCO sites
  • Handle edge cases (missing data, malformed coordinates)
  • Generate valid LinkML instances from real API responses

Day 4: Test Fixture Creation

Tasks:

  1. Curate Representative Samples

    • Select 20 UNESCO sites covering:
      • All continents (Europe, Asia, Africa, Americas, Oceania)
      • Multiple institution types (museums, libraries, archives, botanical gardens)
      • Edge cases (serial nominations, transboundary sites)
  2. Create Expected Outputs

    # tests/fixtures/expected_outputs/unesco_louvre.yaml
    - id: https://w3id.org/heritage/custodian/fr/louvre
      name: Musée du Louvre
      institution_type: MUSEUM
      locations:
        - city: Paris
          country: FR
          coordinates: [48.8606, 2.3376]
      identifiers:
        - identifier_scheme: UNESCO_WHC
          identifier_value: "600"
          identifier_url: https://whc.unesco.org/en/list/600
      provenance:
        data_source: UNESCO_WORLD_HERITAGE
        data_tier: TIER_1_AUTHORITATIVE
    
  3. Golden Dataset Construction

    • Manually verify 20 expected outputs against authoritative sources
    • Document any assumptions or inferences made

Deliverables:

  • tests/fixtures/unesco_api_responses/ - 20 JSON files
  • tests/fixtures/expected_outputs/ - 20 YAML files
  • tests/test_unesco_golden_dataset.py - Integration tests

Success Criteria:

  • 20 golden dataset pairs (input JSON + expected YAML)
  • 100% passing tests for golden dataset
  • Documented edge cases and classification rules

Day 5: Institution Type Classifier Design

Tasks:

  1. Pattern Analysis

    • Analyze UNESCO descriptions for GLAM-related keywords
    • Create decision tree for institution type classification
    • Handle ambiguous cases (e.g., "archaeological park with museum")
  2. Keyword Extraction

    # src/glam_extractor/classifiers/unesco_institution_type.py
    
    MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
    LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
    ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
    BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
    HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
    FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
    
  3. Confidence Scoring

    • High confidence (0.9+): Explicit mentions of "museum" or "library"
    • Medium confidence (0.7-0.9): Inferred from UNESCO category + keywords
    • Low confidence (0.5-0.7): Ambiguous, default to MIXED, flag for review
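
A minimal version of the keyword classifier with the confidence bands above; a sketch only, with trimmed keyword lists — the real classifier in unesco_institution_type.py would use the full lists plus UNESCO category data.

```python
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'gallery']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca']
ARCHIVE_KEYWORDS = ['archive', 'archivo', 'documentary heritage']

def classify_with_confidence(description: str) -> tuple:
    """Return (institution_type, confidence) for a site description."""
    text = description.lower()
    hits = {
        'MUSEUM': sum(kw in text for kw in MUSEUM_KEYWORDS),
        'LIBRARY': sum(kw in text for kw in LIBRARY_KEYWORDS),
        'ARCHIVE': sum(kw in text for kw in ARCHIVE_KEYWORDS),
    }
    best_type, best_hits = max(hits.items(), key=lambda kv: kv[1])
    total = sum(hits.values())
    if best_hits == 0:
        return 'MIXED', 0.5                  # ambiguous: flag for review
    if total > best_hits:
        return best_type, 0.7                # competing keywords: medium confidence
    return best_type, min(0.9 + 0.02 * best_hits, 1.0)  # explicit mention: high
```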

Deliverables:

  • src/glam_extractor/classifiers/unesco_institution_type.py
  • tests/test_unesco_classifier.py - 50+ test cases
  • docs/unesco-classification-rules.md - Decision tree documentation

Success Criteria:

  • Classifier achieves 90%+ accuracy on golden dataset
  • Low-confidence classifications flagged for manual review
  • Handle multilingual descriptions (English, French, Spanish, etc.)

Phase 2: Extractor Implementation (Days 6-13)

Objectives

  • Implement UNESCO API client with caching and rate limiting
  • Build LinkML instance generator using Map schema
  • Create GHCID generator for UNESCO institutions
  • Achieve 100% test coverage for core extraction logic

Day 6: UNESCO API Client

Tasks:

  1. HTTP Client Implementation

    # src/glam_extractor/parsers/unesco_api_client.py
    
    class UNESCOAPIClient:
        def __init__(self, cache_ttl: int = 86400):
            self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
            self.cache = Cache(ttl=cache_ttl)
            self.rate_limiter = RateLimiter(requests_per_second=2)
    
        def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
            """
            Fetch paginated list of UNESCO World Heritage Sites.
    
            Returns OpenDataSoft API response with structure:
            {
                "total_count": int,
                "records": [{"record": {"id": str, "fields": {...}}}, ...]
            }
            """
            ...
    
        def fetch_site_details(self, whc_id: int) -> Dict:
            """
            Fetch detailed information for a specific site using ODSQL query.
    
            Returns single record with structure:
            {"record": {"id": str, "fields": {field_name: value, ...}}}
            """
            ...
    
  2. Caching Strategy

    • Cache API responses for 24 hours (UNESCO updates infrequently)
    • Store in SQLite database: cache/unesco_api_cache.db
    • Invalidate cache on demand for data refreshes
  3. Error Handling

    • Network errors → retry with exponential backoff
    • 404 Not Found → skip site, log warning
    • Rate limit exceeded → pause and retry
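
Both the 24-hour cache and the backoff policy can be prototyped in a few lines. This is a sketch: `Cache` is a stand-in for whatever backs `cache/unesco_api_cache.db`, and `fetch_with_backoff` wraps an arbitrary fetch callable.

```python
import json
import sqlite3
import time

class Cache:
    """Minimal SQLite-backed response cache with a TTL."""
    def __init__(self, path=":memory:", ttl=86400):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)")

    def get(self, url):
        row = self.db.execute(
            "SELECT body, fetched_at FROM responses WHERE url = ?", (url,)).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])
        return None  # miss or expired

    def put(self, url, body):
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, json.dumps(body), time.time()))
        self.db.commit()

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Retry a fetch callable with exponential backoff on network errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```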

Deliverables:

  • src/glam_extractor/parsers/unesco_api_client.py
  • tests/test_unesco_api_client.py - Mock API tests
  • cache/unesco_api_cache.db - SQLite cache database

Success Criteria:

  • Successfully fetch all UNESCO sites (1,000+ sites)
  • Handle API errors gracefully (no crashes)
  • Cache reduces API calls by 95% on repeat runs

Day 7: LinkML Instance Generator

Tasks:

  1. Apply LinkML Map Transformations

    # src/glam_extractor/parsers/unesco_parser.py
    
    from linkml_map import Mapper
    
    def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
        """
        Parse OpenDataSoft API response to HeritageCustodian instance.
    
        Args:
            api_response: OpenDataSoft record with nested structure:
                {"record": {"id": str, "fields": {field_name: value, ...}}}
    
        Returns:
            HeritageCustodian: Validated LinkML instance
        """
        # Extract fields from nested structure
        site_data = api_response['record']['fields']
    
        mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
        instance = mapper.transform(site_data)
        return HeritageCustodian(**instance)
    
  2. Validation Pipeline

    • Apply LinkML schema validation after transformation
    • Catch validation errors, log details
    • Skip invalid records, continue processing
  3. Multi-language Name Handling

    def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
        """
        Extract primary name and alternative names in multiple languages.
    
        Args:
            site_data: Extracted fields from OpenDataSoft record['fields']
        """
        primary_name = site_data.get('site', '')
        alternative_names = []
    
        # OpenDataSoft may provide language variants in separate fields
        # or as structured data - adjust based on actual API response
        for lang_data in site_data.get('names', []):
            name = lang_data.get('name', '')
            lang = lang_data.get('lang', 'en')
            if name and name != primary_name:
                alternative_names.append(f"{name}@{lang}")
    
        return primary_name, alternative_names
    

Deliverables:

  • src/glam_extractor/parsers/unesco_parser.py
  • tests/test_unesco_parser.py - 20 golden dataset tests

Success Criteria:

  • Parse 20 golden dataset fixtures with 100% accuracy
  • Extract multilingual names without data loss
  • Generate valid LinkML instances (pass schema validation)

Day 8: GHCID Generator for UNESCO Sites

Tasks:

  1. Extend GHCID Logic for UNESCO

    # src/glam_extractor/identifiers/ghcid_generator.py
    
    def generate_ghcid_for_unesco_site(
        country_code: str,
        region_code: str,
        city_code: str,
        institution_type: InstitutionTypeEnum,
        institution_name: str,
        has_collision: bool = False
    ) -> str:
        """
        Generate GHCID for UNESCO World Heritage Site institution.
    
        Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
        Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)
    
        Note: Collision suffix uses native language institution name in snake_case,
        NOT Wikidata Q-numbers. See docs/plan/global_glam/07-ghcid-collision-resolution.md
        """
        ...
    
  2. City Code Lookup

    • Use GeoNames API to convert city names to UN/LOCODE
    • Fallback to 3-letter abbreviation if UN/LOCODE not found
    • Cache lookups to minimize API calls
  3. Collision Detection

    • Check existing GHCID dataset for collisions
    • Apply temporal priority rules (first batch vs. historical addition)
    • Append native language name suffix if collision detected
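
The lookup-then-fallback rule for the city segment might look like this (a sketch; `city_code` is a hypothetical helper, and the real implementation would first call GeoNames to obtain the UN/LOCODE):

```python
import re

def city_code(city_name: str, unlocode: str = None) -> str:
    """Derive the 3-letter city segment of a GHCID.

    Prefers the city part of a UN/LOCODE (e.g. "FR PAR" -> "PAR");
    falls back to a 3-letter abbreviation of the city name.
    """
    if unlocode:
        # UN/LOCODE is "<country> <place>"; keep the 3-char place part
        # (place codes use A-Z and the digits 2-9)
        m = re.fullmatch(r"[A-Z]{2}\s?([A-Z2-9]{3})", unlocode.strip())
        if m:
            return m.group(1)
    # Fallback: first three letters of the city name, padded if too short
    letters = re.sub(r"[^A-Za-z]", "", city_name)
    return letters[:3].upper().ljust(3, "X")
```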

Deliverables:

  • Extended src/glam_extractor/identifiers/ghcid_generator.py
  • tests/test_ghcid_unesco.py - GHCID generation tests

Success Criteria:

  • Generate valid GHCIDs for 20 golden dataset institutions
  • No collisions with existing Dutch ISIL dataset
  • Handle missing Wikidata Q-numbers gracefully

Day 9-10: Batch Processing Pipeline

Tasks:

  1. Parallel Processing

    # scripts/extract_unesco_sites.py

    from concurrent.futures import ThreadPoolExecutor

    # Module-level client so the worker threads in process_unesco_site
    # share one cache and one rate limiter
    api_client = UNESCOAPIClient()

    def extract_all_unesco_sites(max_workers: int = 4):
        # Fetch paginated site list from OpenDataSoft API
        all_sites = []
        offset = 0
        limit = 100

        while True:
            response = api_client.fetch_site_list(limit=limit, offset=offset)
            all_sites.extend(response['records'])

            if len(all_sites) >= response['total_count']:
                break
            offset += limit

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = executor.map(process_unesco_site, all_sites)

        return list(results)

    def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
        """
        Process single OpenDataSoft record.

        Args:
            site_record: {"record": {"id": str, "fields": {...}}}
        """
        # Extract unique_number before the try block so the except handler
        # can always reference it in the log message
        whc_id = site_record.get('record', {}).get('fields', {}).get('unique_number')
        try:
            if whc_id is None:
                raise ValueError("record has no unique_number field")

            # Fetch full details if needed (or use site_record directly)
            details = api_client.fetch_site_details(whc_id)
            institution_type = classify_institution_type(details['record']['fields'])
            custodian = parse_unesco_site(details)
            custodian.ghcid = generate_ghcid(custodian)
            return custodian
        except Exception as e:
            log.error(f"Failed to process site {whc_id}: {e}")
            return None
    
  2. Progress Tracking

    • Use tqdm for progress bars
    • Log successful extractions to logs/unesco_extraction.log
    • Save intermediate results every 100 sites
  3. Error Recovery

    • Resume from last checkpoint if script crashes
    • Separate successful extractions from failed ones
    • Generate error report with failed site IDs
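
The checkpoint/resume behaviour above can be reduced to three small helpers (a sketch; the helper names and JSON checkpoint format are illustrative):

```python
import json
from pathlib import Path

def load_done_ids(path: Path) -> set:
    """IDs of sites already written out by a previous (possibly crashed) run."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def save_done_ids(done: set, path: Path) -> None:
    """Persist the checkpoint; called every 100 sites by the batch script."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(done)))

def resume_filter(all_ids, done):
    """Sites still to process after resuming from the checkpoint."""
    return [i for i in all_ids if i not in done]
```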

Deliverables:

  • scripts/extract_unesco_sites.py - Batch extraction script
  • data/unesco_extracted/ - Output directory for YAML instances
  • logs/unesco_extraction.log - Extraction log

Success Criteria:

  • Process all 1,000+ UNESCO sites in < 2 hours
  • < 5% failure rate (API errors, missing data)
  • Successful extractions saved as valid LinkML YAML files

Day 11-12: Integration Testing

Tasks:

  1. End-to-End Tests

    # tests/integration/test_unesco_pipeline.py
    
    def test_full_unesco_extraction_pipeline():
        """Test complete pipeline from OpenDataSoft API fetch to LinkML instance."""
        # 1. Fetch API data from OpenDataSoft
        api_client = UNESCOAPIClient()
        response = api_client.fetch_site_details(600)  # Paris, Banks of the Seine
        site_data = response['record']['fields']  # Extract from nested structure
    
        # 2. Classify institution type
        inst_type = classify_institution_type(site_data)
        assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]
    
        # 3. Parse to LinkML instance
        custodian = parse_unesco_site(response)
        assert custodian.name is not None
    
        # 4. Generate GHCID
        custodian.ghcid = generate_ghcid(custodian)
        assert custodian.ghcid.startswith("FR-")
    
        # 5. Validate against schema
        validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
        result = validator.validate(custodian)
        assert result.is_valid
    
  2. Property-Based Testing

    from hypothesis import given, strategies as st
    
    @given(st.integers(min_value=1, max_value=1500))
    def test_ghcid_determinism(whc_id: int):
        """GHCID generation is deterministic for same input."""
        site1 = generate_ghcid_for_site(whc_id)
        site2 = generate_ghcid_for_site(whc_id)
        assert site1 == site2
    
  3. Performance Testing

    • Benchmark extraction speed (sites per second)
    • Memory profiling (ensure no memory leaks)
    • Cache hit rate analysis

Deliverables:

  • tests/integration/test_unesco_pipeline.py - End-to-end tests
  • tests/test_unesco_property_based.py - Property-based tests
  • docs/performance-benchmarks.md - Performance results

Success Criteria:

  • 100% passing integration tests
  • Extract 1,000 sites in < 2 hours (with cache)
  • Memory usage < 500MB for full extraction

Day 13: Code Review & Refactoring

Tasks:

  1. Code Quality Review

    • Run ruff linter, fix all warnings
    • Run mypy type checker, resolve type errors
    • Ensure 90%+ test coverage
  2. Documentation Review

    • Add docstrings to all public functions
    • Update README with UNESCO extraction instructions
    • Create developer guide for extending classifiers
  3. Performance Optimization

    • Profile slow functions, optimize bottlenecks
    • Reduce redundant API calls
    • Optimize GHCID generation (cache city code lookups)

Deliverables:

  • Refactored codebase with 90%+ test coverage
  • Updated documentation
  • Performance optimizations applied

Success Criteria:

  • Zero linter warnings
  • Zero type errors
  • Test coverage > 90%

Phase 3: Data Quality & Validation (Days 14-19)

Objectives

  • Cross-reference UNESCO data with existing GLAM dataset (Dutch ISIL, conversations)
  • Detect and resolve conflicts
  • Implement confidence scoring system
  • Generate data quality report

Day 14-15: Cross-Referencing with Existing Data

Tasks:

  1. Load Existing Dataset

    # scripts/crosslink_unesco_with_glam.py
    
    def load_existing_glam_dataset():
        """Load Dutch ISIL + conversation extractions."""
        dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
        dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
        conversations = load_conversation_extractions("data/instances/")
        return merge_datasets([dutch_isil, dutch_orgs, conversations])
    
  2. Match UNESCO Sites to Existing Records

    • Match by Wikidata Q-number (highest confidence)
    • Match by ISIL code (for Dutch institutions)
    • Match by name + location (fuzzy matching, score > 0.85)
  3. Conflict Detection

    def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
        """Detect field-level conflicts between UNESCO and existing data."""
        conflicts = []
    
        if unesco_record.name != existing_record.name:
            conflicts.append(Conflict(
                field="name",
                unesco_value=unesco_record.name,
                existing_value=existing_record.name,
                resolution="MANUAL_REVIEW"
            ))
    
        # Check institution type mismatch
        if unesco_record.institution_type != existing_record.institution_type:
            conflicts.append(Conflict(field="institution_type", ...))
    
        return conflicts
    

Deliverables:

  • scripts/crosslink_unesco_with_glam.py - Cross-linking script
  • data/unesco_conflicts.csv - Detected conflicts report
  • tests/test_crosslinking.py - Unit tests for matching logic

Success Criteria:

  • Identify 50+ matches between UNESCO and existing dataset
  • Detect conflicts (name mismatches, type discrepancies)
  • Generate conflict report for manual review

Day 16: Conflict Resolution

Tasks:

  1. Tier-Based Priority

    • TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
    • When conflict: UNESCO data wins, existing data flagged
  2. Merge Strategy

    def merge_unesco_with_existing(
        unesco_record: HeritageCustodian,
        existing_record: HeritageCustodian
    ) -> HeritageCustodian:
        """Merge UNESCO data with existing record, UNESCO takes priority."""
        merged = existing_record.copy()
    
        # UNESCO name becomes primary
        merged.name = unesco_record.name
    
        # Preserve alternative names from both sources
        merged.alternative_names = list(set(
            unesco_record.alternative_names + existing_record.alternative_names
        ))
    
        # Add UNESCO identifier (guard: some records may lack identifiers)
        if unesco_record.identifiers:
            merged.identifiers.append({
                'identifier_scheme': 'UNESCO_WHC',
                'identifier_value': unesco_record.identifiers[0].identifier_value
            })
    
        # Track provenance of merge
        merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"
    
        return merged
    
  3. Manual Review Queue

    • Flag high-impact conflicts (institution type change, location change)
    • Generate review spreadsheet for human validation
    • Provide evidence for each conflict (source URLs, descriptions)

Deliverables:

  • src/glam_extractor/validators/conflict_resolver.py
  • data/manual_review_queue.csv - Conflicts requiring human review
  • tests/test_conflict_resolution.py

Success Criteria:

  • Resolve 80% of conflicts automatically (tier-based priority)
  • Flag 20% for manual review (complex cases)
  • Zero data loss (preserve all alternative names, identifiers)

Day 17: Confidence Scoring System

Tasks:

  1. Score Calculation

    def calculate_confidence_score(custodian: HeritageCustodian) -> float:
        """Calculate confidence score based on data completeness and source quality."""
        score = 1.0  # Start at maximum (TIER_1 authoritative)
    
        # Deduct for missing fields
        if not custodian.identifiers:
            score -= 0.1
        if not custodian.locations:
            score -= 0.15
        if custodian.institution_type == InstitutionTypeEnum.MIXED:
            score -= 0.2  # Ambiguous classification
    
        # Boost for rich metadata
        if len(custodian.identifiers) > 2:
            score += 0.05
        if custodian.digital_platforms:
            score += 0.05
    
        return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
    
  2. Quality Metrics

    • Completeness: % of optional fields populated
    • Accuracy: Agreement with authoritative sources (Wikidata, official websites)
    • Freshness: Days since extraction
  3. Tier Validation

    • Ensure all UNESCO records have data_tier: TIER_1_AUTHORITATIVE
    • Downgrade tier if conflicts detected (TIER_1 → TIER_2)
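
The completeness and freshness metrics above can be computed directly from a record (a sketch; `OPTIONAL_FIELDS` is an assumed subset of the schema's optional slots):

```python
from datetime import date

OPTIONAL_FIELDS = ["identifiers", "locations", "alternative_names", "digital_platforms"]

def completeness(record: dict) -> float:
    """Fraction of optional fields that are populated (non-empty)."""
    filled = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return filled / len(OPTIONAL_FIELDS)

def freshness_days(extracted_iso: str, today_iso: str) -> int:
    """Days since extraction, from two ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(today_iso) - date.fromisoformat(extracted_iso)).days
```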

Deliverables:

  • src/glam_extractor/validators/confidence_scorer.py
  • tests/test_confidence_scoring.py
  • data/unesco_quality_metrics.json - Aggregate statistics

Success Criteria:

  • 90%+ of UNESCO records score > 0.85 confidence
  • Flag < 5% as low confidence (require review)
  • Document scoring methodology in provenance metadata

Day 18: LinkML Schema Validation

Tasks:

  1. Batch Validation

    # Validate all UNESCO extractions against LinkML schema
    for file in data/unesco_extracted/*.yaml; do
        linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
    done
    
  2. Custom Validators

    # src/glam_extractor/validators/unesco_validators.py
    
    def validate_unesco_whc_id(whc_id: str) -> bool:
        """UNESCO WHC IDs are 3-4 digit integers."""
        return bool(re.match(r'^\d{3,4}$', whc_id))
    
    def validate_ghcid_format(ghcid: str) -> bool:
        """Validate GHCID format for UNESCO institutions.

        The optional trailing segment is the snake_case native-name
        collision suffix (see Day 8), not a Wikidata Q-number.
        """
        pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-[a-z0-9_]+)?$'
        return bool(re.match(pattern, ghcid))
    
  3. Error Reporting

    • Generate validation report with line numbers and error messages
    • Categorize errors (required field missing, invalid enum value, format error)
    • Prioritize fixes (blocking errors vs. warnings)

Deliverables:

  • scripts/validate_unesco_dataset.py - Batch validation script
  • src/glam_extractor/validators/unesco_validators.py - Custom validators
  • data/unesco_validation_report.json - Validation results

Success Criteria:

  • 100% of extracted records pass LinkML validation
  • Zero blocking errors
  • Document any warnings in provenance notes

Day 19: Data Quality Report

Tasks:

  1. Generate Statistics

    # scripts/generate_unesco_quality_report.py
    
    def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
        return {
            'total_institutions': len(unesco_dataset),
            'by_country': count_by_country(unesco_dataset),
            'by_institution_type': count_by_type(unesco_dataset),
            'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
            'completeness_metrics': {
                'with_wikidata_id': count_with_wikidata(unesco_dataset),
                'with_digital_platform': count_with_platforms(unesco_dataset),
                'with_geocoded_location': count_with_geocoding(unesco_dataset)
            },
            'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
            'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
        }
    
  2. Visualization

    • Generate maps showing UNESCO site distribution
    • Bar charts: institutions by country, by type
    • Heatmap: data completeness by field
  3. Documentation

    • Write executive summary of data quality
    • Document known issues and limitations
    • Provide recommendations for improvement

Deliverables:

  • scripts/generate_unesco_quality_report.py
  • data/unesco_quality_report.json - Statistics
  • docs/unesco-data-quality.md - Quality report document
  • data/visualizations/ - Maps and charts

Success Criteria:

  • Quality report shows 90%+ completeness for core fields
  • < 5% of records require manual review
  • Geographic coverage across all inhabited continents

Phase 4: Integration & Enrichment (Days 20-24)

Objectives

  • Enrich UNESCO data with Wikidata identifiers
  • Merge UNESCO dataset with existing GLAM dataset
  • Resolve GHCID collisions
  • Update GHCID history for modified records

Day 20-21: Wikidata Enrichment

Tasks:

  1. SPARQL Query for UNESCO Sites

    # scripts/enrich_unesco_with_wikidata.py
    
    def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
        """Find Wikidata Q-number for UNESCO World Heritage Site."""
        query = f"""
        SELECT ?item WHERE {{
          ?item wdt:P757 "{whc_id}" .  # P757 = UNESCO World Heritage Site ID
        }}
        LIMIT 1
        """
    
        results = sparql_query(query)
        if results:
            return extract_qid(results[0]['item']['value'])
        return None
    
  2. Batch Enrichment

    • Query Wikidata for all 1,000+ UNESCO sites
    • Extract Q-numbers, VIAF IDs, ISIL codes (if available)
    • Add to identifiers array in LinkML instances
  3. Fuzzy Matching Fallback

    • If WHC ID not found in Wikidata, try name + location matching
    • Use same fuzzy matching logic from existing enrichment scripts
    • Threshold: 0.85 similarity score
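
For the batch step, one `VALUES` query per chunk of IDs avoids 1,000+ round trips to the endpoint. A sketch (`build_batch_query` is a hypothetical helper; P757 is the property used in the query above):

```python
def build_batch_query(whc_ids: list) -> str:
    """One SPARQL query covering many WHC IDs via a VALUES clause."""
    values = " ".join(f'"{i}"' for i in whc_ids)
    return (
        "SELECT ?whc ?item WHERE {\n"
        f"  VALUES ?whc {{ {values} }}\n"
        "  ?item wdt:P757 ?whc .\n"  # P757 = UNESCO World Heritage Site ID
        "}"
    )
```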

Deliverables:

  • scripts/enrich_unesco_with_wikidata.py - Enrichment script
  • data/unesco_enriched/ - Enriched YAML instances
  • logs/wikidata_enrichment.log - Enrichment log

Success Criteria:

  • Find Wikidata Q-numbers for 80%+ of UNESCO sites
  • Add VIAF/ISIL identifiers where available
  • Document enrichment in provenance metadata

Day 22: Dataset Merge

Tasks:

  1. Merge Strategy

    # scripts/merge_unesco_into_glam_dataset.py
    
    def merge_datasets(
        unesco_data: List[HeritageCustodian],
        existing_data: List[HeritageCustodian]
    ) -> List[HeritageCustodian]:
        """Merge UNESCO data into existing GLAM dataset."""
        merged = existing_data.copy()
    
        for unesco_record in unesco_data:
            # Check if institution already exists
            match = find_match(unesco_record, existing_data)
    
            if match:
                # Merge records
                merged_record = merge_unesco_with_existing(unesco_record, match)
                merged[merged.index(match)] = merged_record
            else:
                # Add new institution
                merged.append(unesco_record)
    
        return merged
    
  2. Deduplication

    • Detect duplicates by GHCID, Wikidata Q-number, ISIL code
    • Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
    • Preserve alternative names and identifiers from both sources
  3. Provenance Tracking

    • Update provenance.notes for merged records
    • Record merge timestamp
    • Link back to original extraction sources
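
The identifier-priority deduplication can be sketched as follows (plain dicts for illustration, with identifiers flattened to a scheme→value map; the tier names mirror the dataset's data_tier values):

```python
TIER_RANK = {"TIER_1_AUTHORITATIVE": 1, "TIER_4_CONVERSATION": 4}  # lower wins

def dedup_key(record: dict) -> str:
    """Stable identity key: GHCID, else Wikidata Q-number, else ISIL, else name."""
    for scheme in ("GHCID", "WIKIDATA", "ISIL"):
        value = record.get("identifiers", {}).get(scheme)
        if value:
            return f"{scheme}:{value}"
    return "NAME:" + record.get("name", "").lower()

def dedup(records: list) -> list:
    """Keep one record per identity key, preferring the higher tier,
    while pooling alternative names from all duplicates."""
    by_key = {}
    for rec in records:
        key = dedup_key(rec)
        kept = by_key.get(key)
        if kept is None:
            by_key[key] = dict(rec)
            continue
        names = set(kept.get("alternative_names", [])) | set(rec.get("alternative_names", []))
        winner = min(kept, rec, key=lambda r: TIER_RANK.get(r.get("data_tier"), 9))
        merged = dict(winner)
        merged["alternative_names"] = sorted(names)
        by_key[key] = merged
    return list(by_key.values())
```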

Deliverables:

  • scripts/merge_unesco_into_glam_dataset.py - Merge script
  • data/merged_glam_dataset/ - Merged dataset (YAML files)
  • data/merge_report.json - Merge statistics

Success Criteria:

  • Merge 1,000+ UNESCO records with existing GLAM dataset
  • Deduplicate matches (no duplicate GHCIDs)
  • Preserve data from all sources (no information loss)

Day 23: GHCID Collision Resolution

Tasks:

  1. Detect Collisions

    # scripts/resolve_ghcid_collisions.py
    
    def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
        """Find institutions with identical base GHCIDs."""
        ghcid_map = defaultdict(list)
    
        for custodian in dataset:
            base_ghcid = strip_collision_suffix(custodian.ghcid)  # drop any native-name suffix
            ghcid_map[base_ghcid].append(custodian)
    
        collisions = [
            Collision(base_ghcid=k, institutions=v)
            for k, v in ghcid_map.items()
            if len(v) > 1
        ]
    
        return collisions
    
  2. Apply Temporal Priority Rules

    • Compare provenance.extraction_date for colliding institutions
    • Same extraction date (first batch): all colliding records get Q-number suffixes
    • Later extraction date (historical addition): only the new record gets a name suffix
  3. Update GHCID History

    from datetime import datetime, timezone
    
    def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
        """Record GHCID change in history."""
        custodian.ghcid_history.append(GHCIDHistoryEntry(
            ghcid=new_ghcid,
            ghcid_numeric=generate_numeric_id(new_ghcid),
            valid_from=datetime.now(timezone.utc).isoformat(),
            valid_to=None,
            reason=f"Name suffix added to resolve collision with {old_ghcid}"
        ))
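The temporal priority rules above reduce to two pure helpers. This is a sketch under stated assumptions: `remove_q_number` assumes Q-number suffixes of the form `-Q12345`, and the example GHCID values are illustrative, not real identifiers from the dataset.

```python
import re
from typing import List

def remove_q_number(ghcid: str) -> str:
    """Strip a trailing Wikidata Q-number suffix (e.g. '-Q190804') if present."""
    return re.sub(r"-Q\d+$", "", ghcid)

def classify_collision(extraction_dates: List[str]) -> str:
    """Decide which resolution rule applies to a set of colliding records.

    Same-date collisions come from one batch, so all records get Q-number
    suffixes; a later date means a historical addition, so only the newest
    record gets a name suffix.
    """
    if len(set(extraction_dates)) == 1:
        return "q_number_all"
    return "name_suffix_newest"
```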
    

Deliverables:

  • scripts/resolve_ghcid_collisions.py - Collision resolution script
  • data/ghcid_collision_report.json - Detected collisions
  • Updated YAML instances with ghcid_history entries

Success Criteria:

  • Resolve all GHCID collisions (zero duplicates)
  • Update GHCID history for affected records
  • Preserve PID stability (no changes to published GHCIDs)

Day 24: Final Data Validation

Tasks:

  1. Full Dataset Validation

    • Run LinkML validation on merged dataset
    • Check for orphaned references (invalid foreign keys)
    • Verify all GHCIDs are unique
  2. Integrity Checks

    # tests/integration/test_merged_dataset_integrity.py
    
    def test_no_duplicate_ghcids():
        dataset = load_merged_dataset()
        ghcids = [c.ghcid for c in dataset]
        assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"
    
    def test_all_unesco_sites_have_whc_id():
        dataset = load_merged_dataset()
        unesco_records = [c for c in dataset if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]
    
        for record in unesco_records:
            whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
            assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
    
  3. Coverage Analysis

    • Verify UNESCO sites across all continents
    • Check institution type distribution (not all MUSEUM)
    • Ensure Dutch institutions properly merged with ISIL registry
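The coverage checks above amount to frequency counts over the merged dataset. A minimal sketch, using the flattened field names from the Day 27 CSV schema:

```python
from collections import Counter
from typing import Dict, List

def coverage_summary(records: List[dict]) -> Dict[str, Counter]:
    """Count institution types and countries to spot skewed distributions."""
    return {
        "institution_type": Counter(r.get("institution_type") for r in records),
        "country": Counter(r.get("country") for r in records),
    }

# A distribution where nearly every record is MUSEUM would suggest the
# classifier fell back to its default for most sites.
```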

Deliverables:

  • tests/integration/test_merged_dataset_integrity.py - Integrity tests
  • data/final_validation_report.json - Validation results
  • docs/dataset-coverage.md - Coverage analysis

Success Criteria:

  • 100% passing integrity tests
  • Zero duplicate GHCIDs
  • UNESCO sites cover 100+ countries

Phase 5: Export & Documentation (Days 25-30)

Objectives

  • Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
  • Generate user documentation and API docs
  • Create example queries and use case tutorials
  • Publish dataset with persistent identifiers

Day 25-26: RDF/JSON-LD Export

Tasks:

  1. RDF Serialization

    # src/glam_extractor/exporters/rdf_exporter.py
    
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF
    
    def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to RDF/Turtle format."""
        graph = Graph()
    
        # Define namespaces
        GLAM = Namespace("https://w3id.org/heritage/custodian/")
        SCHEMA = Namespace("http://schema.org/")
        graph.bind("glam", GLAM)
        graph.bind("schema", SCHEMA)
        graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))
    
        for custodian in dataset:
            uri = URIRef(custodian.id)
    
            # Type assertions
            graph.add((uri, RDF.type, GLAM.HeritageCustodian))
            if custodian.institution_type == "MUSEUM":
                graph.add((uri, RDF.type, SCHEMA.Museum))
    
            # Literals
            graph.add((uri, SCHEMA.name, Literal(custodian.name)))
            graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))
    
            # Identifiers (owl:sameAs)
            for identifier in custodian.identifiers:
                if identifier.identifier_scheme == "Wikidata":
                    graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))
    
        graph.serialize(destination=output_path, format="turtle")
    
  2. JSON-LD Context

    // data/context/heritage_custodian_context.jsonld
    {
      "@context": {
        "@vocab": "https://w3id.org/heritage/custodian/",
        "schema": "http://schema.org/",
        "name": "schema:name",
        "location": "schema:location",
        "identifiers": "schema:identifier",
        "institution_type": "institutionType",
        "data_source": "dataSource"
      }
    }
    
  3. Content Negotiation Setup

    • Configure w3id.org redirects (if hosting on GitHub Pages)
    • Test URI resolution for sample institutions
    • Ensure Accept header routing (text/turtle, application/ld+json)
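The Accept-header routing above can be reduced to a small media-type lookup. The sketch below is illustrative and deliberately ignores quality values (`q=`) and wildcards, which a production redirect rule would need to handle:

```python
# Map media types to export formats; first match in the Accept header wins.
FORMAT_BY_MEDIA_TYPE = {
    "text/turtle": "turtle",
    "application/ld+json": "jsonld",
    "text/html": "html",
}

def choose_format(accept_header: str, default: str = "html") -> str:
    """Pick an export format from an HTTP Accept header (no q-value handling)."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in FORMAT_BY_MEDIA_TYPE:
            return FORMAT_BY_MEDIA_TYPE[media_type]
    return default
```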

Deliverables:

  • src/glam_extractor/exporters/rdf_exporter.py - RDF exporter
  • data/exports/glam_dataset.ttl - RDF/Turtle export
  • data/exports/glam_dataset.jsonld - JSON-LD export
  • data/context/heritage_custodian_context.jsonld - JSON-LD context

Success Criteria:

  • RDF validates with Turtle parser
  • JSON-LD validates with JSON-LD Playground
  • Sample URIs resolve correctly

Day 27: CSV/Parquet Export

Tasks:

  1. Flatten Schema for CSV

    # src/glam_extractor/exporters/csv_exporter.py
    
    def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to CSV with flattened structure."""
        rows = []
    
        for custodian in dataset:
            row = {
                'ghcid': custodian.ghcid,
                'ghcid_uuid': str(custodian.ghcid_uuid),
                'name': custodian.name,
                'institution_type': custodian.institution_type,
                'country': custodian.locations[0].country if custodian.locations else None,
                'city': custodian.locations[0].city if custodian.locations else None,
                'wikidata_id': get_identifier(custodian, 'Wikidata'),
                'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
                'data_source': custodian.provenance.data_source,
                'data_tier': custodian.provenance.data_tier,
                'confidence_score': custodian.provenance.confidence_score
            }
            rows.append(row)
    
        df = pd.DataFrame(rows)
        df.to_csv(output_path, index=False, encoding='utf-8-sig')
    
  2. Parquet Export (Columnar)

    def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
        """Export dataset to Parquet for efficient querying."""
        df = pd.DataFrame([custodian.dict() for custodian in dataset])
        df.to_parquet(output_path, engine='pyarrow', compression='snappy')
    
  3. SQLite Export

    def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
        """Export dataset to SQLite database."""
        conn = sqlite3.connect(db_path)
    
        # Create tables
        conn.execute("""
            CREATE TABLE heritage_custodians (
                ghcid TEXT PRIMARY KEY,
                ghcid_uuid TEXT UNIQUE,
                name TEXT NOT NULL,
                institution_type TEXT,
                data_source TEXT,
                ...
            )
        """)
    
        # Insert records
        for custodian in dataset:
            conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)
    
        conn.commit()
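The `get_identifier()` helper used by the CSV exporter above is not defined elsewhere in this plan. A plausible sketch, assuming identifiers carry `identifier_scheme` and `identifier_value` slots as in the LinkML schema (the `Identifier` class and demo record here are reduced stand-ins):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Identifier:  # reduced stand-in for the LinkML class
    identifier_scheme: str
    identifier_value: str

def get_identifier(custodian, scheme: str) -> Optional[str]:
    """Return the first identifier value for the given scheme, or None."""
    for identifier in getattr(custodian, "identifiers", None) or []:
        if identifier.identifier_scheme == scheme:
            return identifier.identifier_value
    return None

# Demo record (hypothetical values)
class _Demo:
    identifiers = [Identifier("Wikidata", "Q190804")]

demo = _Demo()
```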
    

Deliverables:

  • src/glam_extractor/exporters/csv_exporter.py - CSV exporter
  • data/exports/glam_dataset.csv - CSV export
  • data/exports/glam_dataset.parquet - Parquet export
  • data/exports/glam_dataset.db - SQLite database

Success Criteria:

  • CSV opens correctly in Excel, Google Sheets
  • Parquet loads in pandas, DuckDB
  • SQLite database queryable with SQL

Day 28: Documentation - User Guide

Tasks:

  1. Getting Started Guide

    # docs/user-guide/getting-started.md
    
    ## Installation
    
    pip install glam-extractor
    
    ## Quick Start
    
    ```python
    from glam_extractor import load_dataset
    
    # Load the GLAM dataset
    dataset = load_dataset("data/exports/glam_dataset.parquet")
    
    # Filter UNESCO museums in France
    museums = dataset.filter(
        institution_type="MUSEUM",
        data_source="UNESCO_WORLD_HERITAGE",
        country="FR"
    )
    ```
    
  2. Example Queries

    • SPARQL examples (find institutions by type, country)
    • Pandas examples (data analysis, statistics)
    • SQL examples (SQLite queries)
  3. API Reference

    • Document all public classes and methods
    • Provide code examples for each function
    • Link to LinkML schema documentation
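One of the planned pandas examples might look like this, using the flattened column names from the Day 27 CSV/Parquet export. The DataFrame is constructed inline so the snippet is self-contained; the `ISIL_REGISTRY` source value is hypothetical.

```python
import pandas as pd

# In practice: df = pd.read_parquet("data/exports/glam_dataset.parquet")
df = pd.DataFrame([
    {"name": "Site A", "institution_type": "MUSEUM", "country": "FR",
     "data_source": "UNESCO_WORLD_HERITAGE"},
    {"name": "Site B", "institution_type": "ARCHIVE", "country": "NL",
     "data_source": "UNESCO_WORLD_HERITAGE"},
    {"name": "Site C", "institution_type": "MUSEUM", "country": "FR",
     "data_source": "ISIL_REGISTRY"},  # hypothetical source value
])

# UNESCO-sourced museums in France
museums_fr = df[
    (df["institution_type"] == "MUSEUM")
    & (df["country"] == "FR")
    & (df["data_source"] == "UNESCO_WORLD_HERITAGE")
]
```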

Deliverables:

  • docs/user-guide/getting-started.md - Quick start guide
  • docs/user-guide/example-queries.md - Query examples
  • docs/api-reference.md - API documentation

Success Criteria:

  • Complete documentation for all public APIs
  • 10+ example queries covering common use cases
  • Step-by-step tutorials for data consumers

Day 29: Documentation - Developer Guide

Tasks:

  1. Architecture Overview

    • Diagram of extraction pipeline (API → Parser → Validator → Exporter)
    • Explanation of LinkML Map transformation
    • GHCID generation algorithm
  2. Contributing Guide

    • How to add new institution type classifiers
    • How to extend LinkML schema with new fields
    • How to add new export formats
  3. Testing Guide

    • Running unit tests, integration tests
    • Creating new test fixtures
    • Using property-based testing
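The fixture-loading pattern for the testing guide can be sketched as a small helper. The on-disk layout and field names below are assumptions based on the OpenDataSoft record structure described in earlier phases:

```python
import json
import tempfile
from pathlib import Path

def load_golden_fixture(path: Path) -> dict:
    """Load a golden UNESCO API response and return its nested fields dict.

    Golden fixtures store the raw OpenDataSoft record shape
    {"record": {"fields": {...}}} so tests exercise the same parsing
    path as the live API client.
    """
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload["record"]["fields"]

# Demo: round-trip a minimal fixture through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "site_0024.json"
    p.write_text(json.dumps({"record": {"fields": {"unique_number": 24}}}),
                 encoding="utf-8")
    fields = load_golden_fixture(p)
```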

Deliverables:

  • docs/developer-guide/architecture.md - Architecture docs
  • docs/developer-guide/contributing.md - Contribution guide
  • docs/developer-guide/testing.md - Testing guide

Success Criteria:

  • Complete architecture documentation with diagrams
  • Clear instructions for extending the system
  • Comprehensive testing guide

Day 30: Release & Publication

Tasks:

  1. Dataset Release

    • Tag repository with version number (e.g., v1.0.0-unesco)
    • Create GitHub Release with exports attached
    • Publish to Zenodo for DOI (persistent citation)
  2. Announcement

    • Write blog post announcing UNESCO data release
    • Share on social media (Twitter, Mastodon, LinkedIn)
    • Notify stakeholders (Europeana, DPLA, heritage researchers)
  3. Data Portal Update

    • Update w3id.org redirects for new institutions
    • Deploy SPARQL endpoint (if applicable)
    • Update REST API to include UNESCO data

Deliverables:

  • GitHub Release with dataset exports
  • Zenodo DOI for citation
  • Blog post and announcement

Success Criteria:

  • Dataset published with persistent DOI
  • Documentation live and accessible
  • Stakeholders notified of release

Risk Mitigation

Technical Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |

Resource Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |

Data Quality Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |

Success Metrics

Quantitative Metrics

  • Coverage: Extract 1,000+ UNESCO site institutions
  • Quality: 90%+ confidence score average
  • Completeness: 80%+ have Wikidata Q-numbers
  • Performance: Process all sites in < 2 hours
  • Test Coverage: 90%+ code coverage

Qualitative Metrics

  • Usability: Positive feedback from 3+ data consumers
  • Documentation: Complete user guide and API docs
  • Maintainability: Code passes linter, type checker
  • Reproducibility: Dataset generation fully automated

Appendix: Day-by-Day Checklist

Phase 1 (Days 1-5)

  • Day 1: UNESCO API documentation reviewed, 50 sites fetched
  • Day 2: LinkML Map schema (part 1) - basic transformations
  • Day 3: LinkML Map schema (part 2) - advanced patterns
  • Day 4: Golden dataset created (20 test fixtures)
  • Day 5: Institution type classifier designed

Phase 2 (Days 6-13)

  • Day 6: UNESCO API client implemented
  • Day 7: LinkML instance generator implemented
  • Day 8: GHCID generator extended for UNESCO
  • Day 9-10: Batch processing pipeline
  • Day 11-12: Integration testing
  • Day 13: Code review and refactoring

Phase 3 (Days 14-19)

  • Day 14-15: Cross-referencing with existing data
  • Day 16: Conflict resolution
  • Day 17: Confidence scoring system
  • Day 18: LinkML schema validation
  • Day 19: Data quality report

Phase 4 (Days 20-24)

  • Day 20-21: Wikidata enrichment
  • Day 22: Dataset merge
  • Day 23: GHCID collision resolution
  • Day 24: Final data validation

Phase 5 (Days 25-30)

  • Day 25-26: RDF/JSON-LD export
  • Day 27: CSV/Parquet/SQLite export
  • Day 28: User guide documentation
  • Day 29: Developer guide documentation
  • Day 30: Release and publication

Document Status: Complete
Next Document: 04-tdd-strategy.md - Test-driven development plan
Version: 1.1


Version History

Version 1.1 (2025-11-10)

Changes: Updated for OpenDataSoft Explore API v2.0 migration

  • Day 1 API Reconnaissance (lines 44-62): Updated API endpoint from legacy whc.unesco.org/en/list/json to OpenDataSoft data.unesco.org/api/explore/v2.0
  • Day 6 UNESCO API Client (lines 265-305):
    • Updated base_url to OpenDataSoft API endpoint
    • Removed api_key parameter (public dataset, no authentication)
    • Added pagination parameters to fetch_site_list(): limit, offset
    • Updated method documentation to reflect OpenDataSoft response structure: {"record": {"fields": {...}}}
  • Day 7 LinkML Parser (lines 309-352):
    • Updated parse_unesco_site() to extract from nested api_response['record']['fields']
    • Added documentation clarifying OpenDataSoft structure
    • Updated extract_multilingual_names() parameter name from unesco_data to site_data
  • Day 10 Batch Processing (lines 406-450):
    • Updated extract_all_unesco_sites() with pagination loop for OpenDataSoft API
    • Updated process_unesco_site() to handle nested record structure
    • Changed field access from site_data['id_number'] to site_record['record']['fields']['unique_number']
  • Day 11-12 Integration Tests (lines 454-483):
    • Updated test_full_unesco_extraction_pipeline() to extract site_data from response['record']['fields']
    • Added explicit documentation of OpenDataSoft API structure

Rationale: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.

Version 1.0 (2025-11-09)

Initial Release

  • Comprehensive 30-day implementation plan for UNESCO data extraction
  • Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
  • TDD approach with golden dataset and integration tests
  • GHCID generation strategy for UNESCO heritage sites
  • Wikidata enrichment and cross-referencing plan