glam/docs/plan/unesco/00-overview.md
2025-11-19 23:25:22 +01:00

UNESCO World Heritage Data Extraction Plan

Executive Summary

This plan outlines the extraction, mapping, and integration of UNESCO World Heritage Site data into the Global GLAM Dataset. UNESCO provides structured linked data access through its DataHub, making it an authoritative TIER_1 source for heritage custodian information worldwide.

Project Goal: Extract and map all UNESCO World Heritage Sites managing GLAMORCUBEPSXHF-type collections to the LinkML Global Heritage Custodian schema, with full provenance tracking and ontology alignment.

Key Innovation: Use LinkML Map with a bespoke conditional XPath/JSONPath/regex extension to handle complex UNESCO data structures and map them to the modular GLAM ontology.

Data Source: UNESCO DataHub

URL: https://whc.unesco.org/en/list/
Linked Data Portal: UNESCO DataHub (data.unesco.org)

Available Formats

UNESCO provides World Heritage data in several linked data formats, alongside an interactive API console:

  1. DCAT-AP - DCAT Application Profile for data portals in Europe (EU standard)
  2. RDF/XML - Resource Description Framework
  3. Notation3 (N3) - Compact RDF serialization
  4. Turtle (.ttl) - Terse RDF Triple Language
  5. JSON-LD - JSON with Linked Data context
  6. API Console - Programmatic query interface

API Access

UNESCO World Heritage data is accessible through the OpenDataSoft Explore API v2.0:

  • Base URL: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001
  • Authentication: Public dataset (no API key required)
  • Pagination: Supported via limit and offset parameters
  • Filtering: ODSQL query language (e.g., where=unique_number=668)
  • Response Format: JSON with nested structure: {"record": {"fields": {...}}}
  • Rate Limiting: Public-access quotas apply (exact limits still to be confirmed during API exploration)
  • SPARQL Endpoint: Not available (OpenDataSoft REST API only)
  • Bulk Downloads: Dataset snapshots available in multiple formats
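
The pagination and response structure described above can be sketched in Python as follows. This is a minimal illustration, not a tested client: the `/records` path suffix and the top-level `records` envelope key are assumptions about the Explore API v2.0 response, and the function names are hypothetical. Only the base URL, the `limit`/`offset`/`where` parameters, and the `record.fields` nesting come from this plan.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"


def build_records_url(limit, offset, where=""):
    """Build a paginated query URL. The '/records' suffix is an assumption."""
    params = {"limit": limit, "offset": offset}
    if where:
        params["where"] = where  # ODSQL filter, e.g. "unique_number=668"
    return f"{BASE_URL}/records?{urlencode(params)}"


def http_fetch_json(url):
    """Fetch one page over HTTP and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_site_fields(fetch_json=http_fetch_json, page_size=100):
    """Yield the 'fields' dict of every record, one page at a time."""
    offset = 0
    while True:
        page = fetch_json(build_records_url(page_size, offset))
        items = page.get("records", [])  # assumed envelope key
        for item in items:
            # Nested structure from the plan: {"record": {"fields": {...}}}
            yield item["record"]["fields"]
        if len(items) < page_size:
            break  # a short or empty page means the last page was reached
        offset += page_size
```

Passing the fetcher as a parameter keeps the paging logic testable offline: a stub that returns canned pages can stand in for `http_fetch_json`.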

Heritage Sites as GLAMORCUBEPSXHF Custodians

UNESCO World Heritage Sites often function as heritage custodians managing collections:

| Site Type | GLAM Type | Examples |
|---|---|---|
| Cultural sites with museums | MUSEUM | Museums at archaeological sites |
| Historic libraries | LIBRARY | Historic monastic libraries, national libraries |
| Archive repositories | ARCHIVE | Historic document centers |
| Religious sites | HOLY_SITES | Churches, temples, monasteries with collections |
| Botanical/Zoological parks | BOTANICAL_ZOO | Historic gardens, zoos with heritage value |
| Universities | UNIVERSITY | Historic universities with collections |
| Research centers | RESEARCH_CENTER | Archaeological research sites |
| Monuments and landmarks | FEATURES | Statues, memorials, sculptures with heritage significance |
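
As a rough illustration, the site-type mapping above could be held in a lookup table. The enum values are taken from the table; the `GlamType` class, the `SITE_TYPE_TO_GLAM` dictionary, and the `glam_type_for` helper are hypothetical names, and real classification would need more than an exact label match.

```python
from enum import Enum
from typing import Optional


class GlamType(Enum):
    MUSEUM = "MUSEUM"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"
    HOLY_SITES = "HOLY_SITES"
    BOTANICAL_ZOO = "BOTANICAL_ZOO"
    UNIVERSITY = "UNIVERSITY"
    RESEARCH_CENTER = "RESEARCH_CENTER"
    FEATURES = "FEATURES"


# Keys are the lower-cased "Site Type" labels from the table above.
SITE_TYPE_TO_GLAM = {
    "cultural sites with museums": GlamType.MUSEUM,
    "historic libraries": GlamType.LIBRARY,
    "archive repositories": GlamType.ARCHIVE,
    "religious sites": GlamType.HOLY_SITES,
    "botanical/zoological parks": GlamType.BOTANICAL_ZOO,
    "universities": GlamType.UNIVERSITY,
    "research centers": GlamType.RESEARCH_CENTER,
    "monuments and landmarks": GlamType.FEATURES,
}


def glam_type_for(site_type: str) -> Optional[GlamType]:
    """Return the GLAM type for a site-type label, or None if unmapped."""
    return SITE_TYPE_TO_GLAM.get(site_type.strip().lower())
```

Returning `None` for unmapped labels leaves room to route those sites to manual review rather than guessing a type.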

Extraction Criteria: Not all UNESCO sites are GLAM custodians. We extract only sites that:

  • Actively manage collections (artifacts, documents, specimens)
  • Provide public or scholarly access
  • Maintain preservation/conservation programs
  • Have museum, archive, library, or research facilities
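
The extraction criteria can be expressed as a simple predicate. This sketch assumes that all four criteria must hold at once, which the list above does not state explicitly, and the `SiteProfile` fields are hypothetical flags that an upstream step would have to populate.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    """Hypothetical per-site flags, one for each extraction criterion."""
    manages_collections: bool       # artifacts, documents, specimens
    offers_access: bool             # public or scholarly access
    has_conservation_program: bool  # preservation/conservation programs
    has_glam_facility: bool         # museum, archive, library, or research facility


def is_glam_custodian(profile: SiteProfile) -> bool:
    """Assumption: a site qualifies only if every criterion is met."""
    return all((
        profile.manages_collections,
        profile.offers_access,
        profile.has_conservation_program,
        profile.has_glam_facility,
    ))
```

If the intended rule turns out to be "any criterion" rather than "all criteria", swapping `all` for `any` is the only change needed.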

Scope

Geographic Coverage

  • Global: 1,157+ World Heritage Sites across 167 countries (as of 2024)
  • Priority Regions:
    • Europe (high GLAM density)
    • Asia-Pacific (temple libraries, museum complexes)
    • Latin America (archaeological site museums)
    • Africa (heritage research centers)

Data Elements to Extract

From UNESCO data, extract:

  1. Core Institution Data

    • Site name (official, alternative names)
    • Institution type (map to GLAMORCUBEPSXHF taxonomy)
    • Description and significance
    • Parent organization (national heritage agency)
  2. Location & Geography

    • Country (ISO 3166-1 alpha-2)
    • Region/Province
    • Coordinates (WGS84)
    • GeoNames ID
    • UN/LOCODE (for GHCID generation)
  3. Identifiers

    • UNESCO WHC ID (primary)
    • Wikidata Q-number
    • National heritage IDs
    • Website URLs
  4. Temporal Data

    • Inscription date (date the site was added to the World Heritage List)
    • Establishment date (original founding of the site or institution)
    • Change history (extensions, name changes)
  5. Collections & Significance

    • Cultural criteria (i-vi)
    • Natural criteria (vii-x)
    • Collection types (archaeological, architectural, natural specimens)
    • Significance statements
  6. Digital Platforms

    • Official websites
    • Virtual tours
    • Online collections
    • API endpoints
  7. Organizational Relationships

    • Managing authority (government agency, foundation)
    • International networks (UNESCO, ICOMOS, IUCN)
    • Partnerships
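
For the criteria listed under "Collections & Significance", a small parser can separate cultural criteria (i-vi) from natural criteria (vii-x). This is a sketch that assumes criteria arrive as a string such as `(ii)(iv)`; that input format, the function name, and the returned category labels are illustrative choices, not something confirmed by the UNESCO dataset.

```python
import re

CULTURAL = {"i", "ii", "iii", "iv", "v", "vi"}
NATURAL = {"vii", "viii", "ix", "x"}


def classify_criteria(text: str) -> str:
    """Classify a criteria string as cultural, natural, mixed, or unknown."""
    found = set(re.findall(r"\(([ivx]+)\)", text.lower()))
    unrecognised = found - CULTURAL - NATURAL
    if unrecognised:
        raise ValueError(f"Unrecognised criteria: {sorted(unrecognised)}")
    has_cultural = bool(found & CULTURAL)
    has_natural = bool(found & NATURAL)
    if has_cultural and has_natural:
        return "mixed"
    if has_cultural:
        return "cultural"
    if has_natural:
        return "natural"
    return "unknown"  # no criteria found in the input
```

Raising on unrecognised numerals, instead of silently ignoring them, surfaces malformed source data early.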

Integration with Global GLAM Schema

Schema Alignment

Map UNESCO data to modular LinkML schema (v0.2.1):

| UNESCO Field | LinkML Class | Schema Module | Ontology Mapping |
|---|---|---|---|
| Site name | HeritageCustodian.name | core.yaml | schema:name, cpov:prefLabel |
| Alternative names | alternative_names | core.yaml | schema:alternateName |
| Country | Location.country | core.yaml | schema:addressCountry |
| Coordinates | Location.latitude/longitude | core.yaml | schema:geo |
| WHC ID | Identifier (scheme: UNESCO_WHC) | core.yaml | dcterms:identifier |
| Inscription date | founded_date | core.yaml | schema:foundingDate |
| Criteria | Collection.subject_areas | collections.yaml | dcterms:subject |
| Description | description | core.yaml | dcterms:description |
| Change events | ChangeEvent | provenance.yaml | prov:Activity |
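
To make the alignment concrete, the sketch below maps one UNESCO `fields` dictionary onto a plain dictionary shaped like the target classes. Apart from `unique_number`, every source field name (`name_en`, `iso_code`, `coordinates`, `date_inscribed`, `short_description_en`) is a hypothetical placeholder to be replaced once the dataset schema has been inspected, and the nested key names on the target side are likewise assumptions about the LinkML schema.

```python
from typing import Any, Dict


def map_site(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Map one UNESCO 'fields' dict to a HeritageCustodian-shaped dict."""
    coords = fields.get("coordinates") or {}  # hypothetical {"lat": .., "lon": ..}
    return {
        "name": fields.get("name_en"),                       # Site name
        "description": fields.get("short_description_en"),   # Description
        "founded_date": fields.get("date_inscribed"),        # Inscription date
        "locations": [{
            "country": fields.get("iso_code"),               # Country
            "latitude": coords.get("lat"),                   # Coordinates
            "longitude": coords.get("lon"),
        }],
        "identifiers": [{
            "identifier_scheme": "UNESCO_WHC",               # WHC ID
            "identifier_value": str(fields["unique_number"]),
        }],
    }
```

The WHC ID is read with `fields["unique_number"]` rather than `.get` so that a record without its primary identifier fails loudly instead of producing an empty identifier.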

Ontology Mapping

UNESCO data aligns with:

  1. CPOV (Core Public Organisation Vocabulary) - Primary

    • Most sites managed by public sector organizations
    • Use cpov:PublicOrganisation as base class
  2. Schema.org - Web semantics

    • schema:Museum, schema:TouristAttraction
    • schema:Place for geographic data
  3. CIDOC-CRM - Cultural heritage domain

    • crm:E53_Place for heritage sites
    • crm:E74_Group for managing organizations
  4. UNESCO Thesaurus - Domain vocabulary

    • Map UNESCO criteria to subject headings
    • Link to UNESCO concept scheme

Success Metrics

Quantitative

  • Extract ≥1,000 UNESCO sites with GLAM custodian functions
  • ≥95% data completeness for core fields (name, location, identifier)
  • ≥80% enrichment with Wikidata Q-numbers
  • GHCIDs generated for 100% of extracted sites
  • Zero validation errors against LinkML schema
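
The quantitative targets lend themselves to an automated check. In this sketch, "completeness" is interpreted as the share of records in which every core field is populated; that interpretation, the record keys (`name`, `location`, `identifier`, `wikidata_id`, `ghcid`), and the function names are assumptions. Schema validation is out of scope here.

```python
from typing import Any, Dict, Iterable, List

CORE_FIELDS = ("name", "location", "identifier")


def share_with(records: List[Dict[str, Any]], fields: Iterable[str]) -> float:
    """Fraction of records in which every named field is non-empty."""
    if not records:
        return 0.0
    fields = tuple(fields)
    complete = sum(all(r.get(f) for f in fields) for r in records)
    return complete / len(records)


def meets_targets(records: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Evaluate the quantitative success metrics for a batch of records."""
    return {
        "site_count": len(records) >= 1000,
        "core_completeness": share_with(records, CORE_FIELDS) >= 0.95,
        "wikidata_coverage": share_with(records, ("wikidata_id",)) >= 0.80,
        "ghcid_coverage": share_with(records, ("ghcid",)) == 1.0,
    }
```

Returning one boolean per metric makes it easy to report which target failed rather than a single pass/fail result.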

Qualitative

  • Accurate mapping of UNESCO criteria to GLAM collection types
  • Proper institution type classification (MUSEUM, ARCHIVE, etc.)
  • Complete provenance tracking (UNESCO as TIER_1 source)
  • Ontology-aligned RDF export compatible with Europeana and DPLA

Timeline

  • Phase 1 (Week 1): Dependencies & API exploration - 5 days
  • Phase 2 (Week 2): LinkML Map schema design - 5 days
  • Phase 3 (Weeks 3-4): Extractor implementation & TDD - 10 days
  • Phase 4 (Week 5): Data quality validation & enrichment - 5 days
  • Phase 5 (Week 6): Export & integration with GLAM dataset - 5 days

Total: 6 weeks (30 working days)

Risk Mitigation

Technical Risks

  • API rate limits: Implement exponential backoff, response caching
  • Data structure changes: Version control mappings, monitor OpenDataSoft schema updates
  • Complex nested JSON: Parse record.fields structure carefully, validate field paths
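
Two of these mitigations, exponential backoff and careful traversal of nested JSON, can be sketched with the standard library alone. The function names, the retry count, and the choice to retry on `OSError` (the base class of `urllib.error.URLError`) are illustrative assumptions; response caching is omitted for brevity.

```python
import time
from typing import Any, Callable


def with_backoff(call: Callable[[], Any], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run call(), retrying on OSError with delays of 1x, 2x, 4x, ... base_delay."""
    for attempt in range(max_attempts):
        try:
            return call()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def get_path(data: Any, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'record.fields.unique_number' through dicts."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

For example, `get_path(item, "record.fields.unique_number")` returns `None` instead of raising when an intermediate key is missing, which keeps one malformed record from aborting a whole extraction run.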

Data Quality Risks

  • Incomplete site data: Mark fields as null, document in provenance
  • Institution type ambiguity: Use confidence scores, flag for manual review
  • Identifier conflicts: Use UNESCO WHC ID as authoritative, cross-reference with Wikidata
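
The first two data quality mitigations could look roughly like this. The 0.7 review threshold, the function names, and the idea of returning the list of missing field names for the provenance record are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

REVIEW_THRESHOLD = 0.7  # hypothetical cut-off for manual review


def fill_missing(record: Dict[str, Any],
                 expected_fields: Iterable[str]) -> Tuple[Dict[str, Any], List[str]]:
    """Set absent expected fields to None and report which ones were missing."""
    completed = dict(record)  # copy so the caller's record is not mutated
    missing = []
    for name in expected_fields:
        if completed.get(name) is None:
            completed[name] = None
            missing.append(name)
    return completed, missing


def needs_review(confidence: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag an institution-type classification whose confidence is too low."""
    return confidence < threshold
```

The list returned by `fill_missing` can be written into the provenance notes so that null values are distinguishable from fields that were never attempted.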

Integration Risks

  • Schema evolution: Design LinkML Map rules to be version-agnostic
  • GHCID collisions: Append UNESCO WHC ID to GHCID if location-based collision occurs
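
The GHCID collision rule can be illustrated as follows. The GHCID strings used here are made-up placeholders, and the `-WHC<id>` suffix format is an assumption, since this plan does not specify how the WHC ID is appended.

```python
from typing import Set


def resolve_ghcid(base_ghcid: str, whc_id: int, taken: Set[str]) -> str:
    """Return a unique GHCID, appending the WHC ID only when a collision occurs."""
    candidate = base_ghcid
    if candidate in taken:
        candidate = f"{base_ghcid}-WHC{whc_id}"  # assumed suffix format
    if candidate in taken:
        raise ValueError(f"GHCID still collides after suffixing: {candidate}")
    taken.add(candidate)
    return candidate
```

Because WHC IDs are unique per site, the suffixed form should only collide if the same site is processed twice, which the final check reports as an error.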

Related Documentation

  • Dependencies: docs/plan/unesco/01-dependencies.md
  • Consumers & Use Cases: docs/plan/unesco/02-consumers.md
  • Implementation Phases: docs/plan/unesco/03-implementation-phases.md
  • TDD Strategy: docs/plan/unesco/04-tdd-strategy.md
  • Design Patterns: docs/plan/unesco/05-design-patterns.md
  • LinkML Map Schema: docs/plan/unesco/06-linkml-map-schema.md
  • Master Checklist: docs/plan/unesco/07-master-checklist.md

Version: 1.1
Date: 2025-11-10
Status: Planning Phase
Owner: GLAM Data Extraction Project


Version History

Version 1.1 (2025-11-10)

Changes: Updated for OpenDataSoft Explore API v2.0 migration

  • API Access Section (lines 27-37):

    • Clarified that UNESCO data is accessed via OpenDataSoft Explore API v2.0
    • Updated base URL to https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001
    • Noted no authentication required (public dataset)
    • Added pagination details (limit, offset parameters)
    • Added ODSQL filtering examples
    • Documented nested JSON structure: {"record": {"fields": {...}}}
    • Confirmed SPARQL endpoint not available (OpenDataSoft REST API only)
  • Technical Risks (lines 173-176):

    • Updated risk description to mention OpenDataSoft schema monitoring
    • Removed SPARQL fallback (not applicable to OpenDataSoft)
    • Added risk for complex nested JSON parsing

Rationale: Legacy UNESCO JSON API deprecated in favor of OpenDataSoft's standardized REST API with better pagination and query capabilities.

Version 1.0 (2025-11-09)

Initial Release

  • Comprehensive overview of UNESCO World Heritage data extraction plan
  • Defined scope: 1,157+ sites across 167 countries
  • Schema alignment with LinkML v0.2.1 modular architecture
  • Ontology mapping to CPOV, Schema.org, CIDOC-CRM, UNESCO Thesaurus
  • 6-week implementation timeline with 5 phases
  • Risk mitigation strategies for technical, data quality, and integration challenges