UNESCO World Heritage Data Extraction Plan
Executive Summary
This plan outlines the comprehensive extraction, mapping, and integration of UNESCO World Heritage Site data into the Global GLAM Dataset. UNESCO provides structured linked data access through their DataHub, making it an authoritative TIER_1 source for heritage custodian information worldwide.
Project Goal: Extract and map all UNESCO World Heritage Sites managing GLAMORCUBEPSXHF-type collections to the LinkML Global Heritage Custodian schema, with full provenance tracking and ontology alignment.
Key Innovation: Use LinkML Map with a bespoke conditional XPath/JSONPath/regex extension to handle complex UNESCO data structures and map them to the modular GLAM ontology.
Data Source: UNESCO DataHub
URL: https://whc.unesco.org/en/list/
Linked Data Portal: UNESCO DataHub (data.unesco.org)
Available Formats
UNESCO provides World Heritage data in multiple linked data formats:
- DCAT-AP - DCAT Application Profile for data portals in Europe (EU standard)
- RDF/XML - XML serialization of the Resource Description Framework
- Notation3 (N3) - Compact RDF serialization
- Turtle (.ttl) - Terse RDF Triple Language
- JSON-LD - JSON with Linked Data context
- API Console - Programmatic query interface
API Access
UNESCO World Heritage data is accessible through the OpenDataSoft Explore API v2.0:
- Base URL: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001`
- Authentication: Public dataset (no API key required)
- Pagination: Supported via `limit` and `offset` parameters
- Filtering: ODSQL query language (e.g., `where=unique_number=668`)
- Response Format: JSON with nested structure: `{"record": {"fields": {...}}}`
- Rate Limiting: Reasonable limits for public access
- SPARQL Endpoint: Not available (OpenDataSoft REST API only)
- Bulk Downloads: Dataset snapshots available in multiple formats
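The pagination and nested-response points above could be exercised with a small iterator like the one below. This is a minimal sketch, not the project's extractor: the `/records` path suffix, the top-level `records` key, and the helper names `build_records_url`, `http_get_json`, and `iter_site_fields` are assumptions made for illustration. Only the base URL, the `limit`/`offset`/`where` parameters, and the `record.fields` nesting come from this plan.

```python
import json
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"


def build_records_url(limit: int, offset: int, where: Optional[str] = None) -> str:
    """Build a paginated query URL (the `/records` suffix is an assumption)."""
    params = {"limit": limit, "offset": offset}
    if where:
        params["where"] = where  # ODSQL filter, e.g. "unique_number=668"
    return f"{BASE_URL}/records?{urlencode(params)}"


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_site_fields(
    fetch: Callable[[str], Dict] = http_get_json,
    page_size: int = 100,
    where: Optional[str] = None,
) -> Iterator[Dict]:
    """Yield the `fields` dict of every record, walking pages via limit/offset."""
    offset = 0
    while True:
        page = fetch(build_records_url(page_size, offset, where))
        records = page.get("records", [])  # assumed top-level key
        for item in records:
            yield item["record"]["fields"]
        if len(records) < page_size:
            return  # a short (or empty) page means we reached the end
        offset += len(records)
```

Passing `fetch` as a parameter keeps the pagination logic testable without network access, for example by supplying a function that returns canned pages.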
Heritage Sites as GLAMORCUBEPSXHF Custodians
UNESCO World Heritage Sites often function as heritage custodians managing collections:
| Site Type | GLAM Type | Examples |
|---|---|---|
| Cultural sites with museums | MUSEUM | Museums at archaeological sites |
| Historic libraries | LIBRARY | Historic monastic libraries, national libraries |
| Archive repositories | ARCHIVE | Historic document centers |
| Religious sites | HOLY_SITES | Churches, temples, monasteries with collections |
| Botanical/Zoological parks | BOTANICAL_ZOO | Historic gardens, zoos with heritage value |
| Universities | UNIVERSITY | Historic universities with collections |
| Research centers | RESEARCH_CENTER | Archaeological research sites |
| Monuments and landmarks | FEATURES | Statues, memorials, sculptures with heritage significance |
Extraction Criteria: Not all UNESCO sites are GLAM custodians. We extract only sites that:
- Actively manage collections (artifacts, documents, specimens)
- Provide public or scholarly access
- Maintain preservation/conservation programs
- Have museum, archive, library, or research facilities
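One way to make these criteria checkable is a simple predicate over a per-site assessment. The sketch below assumes that all four criteria must hold, which the list above does not state explicitly. The `SiteAssessment` class and its flags are hypothetical: they would have to be derived during enrichment, since nothing in this plan says the UNESCO dataset carries them.

```python
from dataclasses import dataclass


@dataclass
class SiteAssessment:
    """Hypothetical per-site flags; not fields of the UNESCO dataset."""
    whc_id: int
    manages_collections: bool
    provides_access: bool
    has_conservation_program: bool
    has_glam_facility: bool


def is_glam_custodian(site: SiteAssessment) -> bool:
    """Return True only when every extraction criterion is met (assumed AND)."""
    return all(
        (
            site.manages_collections,
            site.provides_access,
            site.has_conservation_program,
            site.has_glam_facility,
        )
    )
```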
Scope
Geographic Coverage
- Global: 1,157+ World Heritage Sites across 167 countries (as of 2024)
- Priority Regions:
  - Europe (high GLAM density)
  - Asia-Pacific (temple libraries, museum complexes)
  - Latin America (archaeological site museums)
  - Africa (heritage research centers)
Data Elements to Extract
From UNESCO data, extract:
- Core Institution Data
  - Site name (official, alternative names)
  - Institution type (map to GLAMORCUBEPSXHF taxonomy)
  - Description and significance
  - Parent organization (national heritage agency)
- Location & Geography
  - Country (ISO 3166-1 alpha-2)
  - Region/Province
  - Coordinates (WGS84)
  - GeoNames ID
  - UN/LOCODE (for GHCID generation)
- Identifiers
  - UNESCO WHC ID (primary)
  - Wikidata Q-number
  - National heritage IDs
  - Website URLs
- Temporal Data
  - Inscription date (founding as UNESCO site)
  - Establishment date (original founding)
  - Change history (extensions, name changes)
- Collections & Significance
  - Cultural criteria (i-vi)
  - Natural criteria (vii-x)
  - Collection types (archaeological, architectural, natural specimens)
  - Significance statements
- Digital Platforms
  - Official websites
  - Virtual tours
  - Online collections
  - API endpoints
- Organizational Relationships
  - Managing authority (government agency, foundation)
  - International networks (UNESCO, ICOMOS, IUCN)
  - Partnerships
Integration with Global GLAM Schema
Schema Alignment
Map UNESCO data to modular LinkML schema (v0.2.1):
| UNESCO Field | LinkML Class | Schema Module | Ontology Mapping |
|---|---|---|---|
| Site name | HeritageCustodian.name | `core.yaml` | `schema:name`, `cpov:prefLabel` |
| Alternative names | alternative_names | `core.yaml` | `schema:alternateName` |
| Country | Location.country | `core.yaml` | `schema:addressCountry` |
| Coordinates | Location.latitude/longitude | `core.yaml` | `schema:geo` |
| WHC ID | Identifier (scheme: UNESCO_WHC) | `core.yaml` | `dcterms:identifier` |
| Inscription date | founded_date | `core.yaml` | `schema:foundingDate` |
| Criteria | Collection.subject_areas | `collections.yaml` | `dcterms:subject` |
| Description | description | `core.yaml` | `dcterms:description` |
| Change events | ChangeEvent | `provenance.yaml` | `prov:Activity` |
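The table can be read as a field-by-field transformation. The sketch below shows one possible shape for it in plain Python, ahead of any LinkML Map rules. The source keys `name_en`, `short_description_en`, `date_inscribed`, `iso_code`, `latitude`, `longitude`, and `criteria_txt` are assumed names (only `unique_number` appears in this plan), the `(i)(ii)` criteria format is assumed, and the target container keys `locations`, `identifiers`, and `collections` are illustrative rather than taken from the schema.

```python
import re
from typing import Any, Dict, List, Optional


def parse_criteria(text: Optional[str]) -> List[str]:
    """Split a criteria string such as "(i)(ii)(vi)" into ["i", "ii", "vi"]."""
    return re.findall(r"\(([ivx]+)\)", text or "")


def map_site(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Map one UNESCO `fields` dict to a custodian-shaped dict.

    Source key names other than `unique_number` are assumptions.
    """
    return {
        "name": fields.get("name_en"),
        "description": fields.get("short_description_en"),
        "founded_date": fields.get("date_inscribed"),
        "locations": [
            {
                "country": fields.get("iso_code"),
                "latitude": fields.get("latitude"),
                "longitude": fields.get("longitude"),
            }
        ],
        "identifiers": [
            {
                "identifier_scheme": "UNESCO_WHC",
                "identifier_value": str(fields["unique_number"]),
            }
        ],
        "collections": [{"subject_areas": parse_criteria(fields.get("criteria_txt"))}],
    }
```

Missing optional fields map to `None` rather than raising, while a missing `unique_number` raises `KeyError`, reflecting its role as the primary identifier.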
Ontology Mapping
UNESCO data aligns with:
- CPOV (Core Public Organisation Vocabulary) - Primary
  - Most sites managed by public sector organizations
  - Use `cpov:PublicOrganisation` as base class
- Schema.org - Web semantics
  - `schema:Museum`, `schema:TouristAttraction`
  - `schema:Place` for geographic data
- CIDOC-CRM - Cultural heritage domain
  - `crm:E53_Place` for heritage sites
  - `crm:E74_Group` for managing organizations
- UNESCO Thesaurus - Domain vocabulary
  - Map UNESCO criteria to subject headings
  - Link to UNESCO concept scheme
Success Metrics
Quantitative
- Extract ≥1,000 UNESCO sites with GLAM custodian functions
- ≥95% data completeness for core fields (name, location, identifier)
- ≥80% enrichment with Wikidata Q-numbers
- 100% GHCID generation for all sites
- Zero validation errors against LinkML schema
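The completeness target could be measured with a helper along these lines. It is a sketch under one interpretation: a record counts as complete only when all three core fields are populated, whereas a field-level ratio would be an equally valid reading of the metric. The flat keys `name`, `location`, and `identifier` are placeholders for however the mapped records expose those fields.

```python
from typing import Any, Dict, Sequence

CORE_FIELDS = ("name", "location", "identifier")  # placeholder key names


def core_completeness(
    records: Sequence[Dict[str, Any]],
    core_fields: Sequence[str] = CORE_FIELDS,
) -> float:
    """Return the share of records whose core fields are all populated."""
    if not records:
        return 0.0
    complete = sum(
        1
        for record in records
        # None, empty strings, and empty lists count as missing
        if all(record.get(field) not in (None, "", []) for field in core_fields)
    )
    return complete / len(records)
```

A run would then pass the quantitative check when `core_completeness(records) >= 0.95`.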
Qualitative
- Accurate mapping of UNESCO criteria to GLAM collection types
- Proper institution type classification (MUSEUM, ARCHIVE, etc.)
- Complete provenance tracking (UNESCO as TIER_1 source)
- Ontology-aligned RDF export compatible with Europeana, DPLA
Timeline
- Phase 1 (Week 1): Dependencies & API exploration - 5 days
- Phase 2 (Week 2): LinkML Map schema design - 5 days
- Phase 3 (Week 3-4): Extractor implementation & TDD - 10 days
- Phase 4 (Week 5): Data quality validation & enrichment - 5 days
- Phase 5 (Week 6): Export & integration with GLAM dataset - 5 days
Total: 6 weeks (30 working days)
Risk Mitigation
Technical Risks
- API rate limits: Implement exponential backoff, response caching
- Data structure changes: Version control mappings, monitor OpenDataSoft schema updates
- Complex nested JSON: Parse `record.fields` structure carefully, validate field paths
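The rate-limit mitigation above (exponential backoff plus response caching) might look like the following wrapper. It is a sketch only: the function name, the retry count, the one-second base delay, and the choice to retry on `OSError` (the base class of `urllib.error.URLError`) are assumptions, not settings documented by UNESCO or OpenDataSoft.

```python
import time
from typing import Any, Callable, Dict, Optional


def fetch_with_backoff(
    fetch: Callable[[str], Any],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    cache: Optional[Dict[str, Any]] = None,
) -> Any:
    """Call fetch(url), retrying with doubling delays and optional caching."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if cache is not None and url in cache:
        return cache[url]  # serve repeated requests from the cache
    for attempt in range(max_attempts):
        try:
            result = fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            if cache is not None:
                cache[url] = result
            return result
```

Injecting `sleep` makes the retry schedule observable in tests without waiting in real time.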
Data Quality Risks
- Incomplete site data: Mark fields as `null`, document in provenance
- Institution type ambiguity: Use confidence scores, flag for manual review
- Identifier conflicts: Use UNESCO WHC ID as authoritative, cross-reference with Wikidata
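To illustrate the confidence-score idea for ambiguous institution types, the sketch below scores a site description by keyword hits and flags low-confidence results for manual review. The keyword lists, the 0.6 threshold, and the scoring rule (share of hits belonging to the winning type) are all illustrative assumptions, and a real classifier would likely draw on more signals than the description text.

```python
from typing import Any, Dict

# Illustrative keyword lists; only a few taxonomy types are shown.
KEYWORDS = {
    "MUSEUM": ("museum",),
    "LIBRARY": ("library",),
    "ARCHIVE": ("archive",),
}


def classify_institution(description: str, threshold: float = 0.6) -> Dict[str, Any]:
    """Guess an institution type and flag uncertain guesses for review."""
    text = description.lower()
    hits = {
        glam_type: sum(text.count(word) for word in words)
        for glam_type, words in KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {"institution_type": None, "confidence": 0.0, "needs_review": True}
    best = max(hits, key=hits.get)
    confidence = hits[best] / total
    return {
        "institution_type": best,
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }
```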
Integration Risks
- Schema evolution: Design LinkML Map rules to be version-agnostic
- GHCID collisions: Append UNESCO WHC ID to GHCID if location-based collision occurs
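The collision rule can be sketched as a post-processing pass over candidate identifiers. The `-WHC<id>` suffix format and the decision to suffix every member of a colliding group (rather than only later arrivals) are assumptions chosen so the result does not depend on input order. The GHCID strings in the usage example are placeholders, not real identifiers.

```python
from collections import Counter
from typing import List, Tuple


def resolve_ghcids(candidates: List[Tuple[str, int]]) -> List[str]:
    """Append the WHC ID to any base GHCID shared by more than one site.

    Each candidate is (base_ghcid, whc_id); output order matches input order.
    """
    counts = Counter(base for base, _ in candidates)
    return [
        f"{base}-WHC{whc_id}" if counts[base] > 1 else base
        for base, whc_id in candidates
    ]


# Usage with placeholder identifiers:
resolved = resolve_ghcids(
    [("AA-BB-CCC-M", 10), ("AA-BB-CCC-M", 20), ("AA-BB-DDD-L", 30)]
)
```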
Related Documentation
- Dependencies: `docs/plan/unesco/01-dependencies.md`
- Consumers & Use Cases: `docs/plan/unesco/02-consumers.md`
- Implementation Phases: `docs/plan/unesco/03-implementation-phases.md`
- TDD Strategy: `docs/plan/unesco/04-tdd-strategy.md`
- Design Patterns: `docs/plan/unesco/05-design-patterns.md`
- LinkML Map Schema: `docs/plan/unesco/06-linkml-map-schema.md`
- Master Checklist: `docs/plan/unesco/07-master-checklist.md`
Version: 1.1
Date: 2025-11-10
Status: Planning Phase
Owner: GLAM Data Extraction Project
Version History
Version 1.1 (2025-11-10)
Changes: Updated for OpenDataSoft Explore API v2.0 migration
- API Access Section (lines 27-37):
  - Clarified that UNESCO data is accessed via OpenDataSoft Explore API v2.0
  - Updated base URL to `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001`
  - Noted no authentication required (public dataset)
  - Added pagination details (`limit`, `offset` parameters)
  - Added ODSQL filtering examples
  - Documented nested JSON structure: `{"record": {"fields": {...}}}`
  - Confirmed SPARQL endpoint not available (OpenDataSoft REST API only)
- Technical Risks (lines 173-176):
  - Updated risk description to mention OpenDataSoft schema monitoring
  - Removed SPARQL fallback (not applicable to OpenDataSoft)
  - Added risk for complex nested JSON parsing
Rationale: Legacy UNESCO JSON API deprecated in favor of OpenDataSoft's standardized REST API with better pagination and query capabilities.
Version 1.0 (2025-11-09)
Initial Release
- Comprehensive overview of UNESCO World Heritage data extraction plan
- Defined scope: 1,157+ sites across 167 countries
- Schema alignment with LinkML v0.2.1 modular architecture
- Ontology mapping to CPOV, Schema.org, CIDOC-CRM, UNESCO Thesaurus
- 6-week implementation timeline with 5 phases
- Risk mitigation strategies for technical, data quality, and integration challenges