UNESCO Data Extraction: Dependencies
Overview
This document catalogs all dependencies for extracting UNESCO World Heritage Site data and integrating it with the Global GLAM Dataset.
External Data Dependencies
1. UNESCO DataHub
Primary Source:
- URL: https://data.unesco.org/
- Platform: OpenDataSoft (commercial open data platform)
- World Heritage Dataset: https://data.unesco.org/explore/dataset/whc001/
- API Documentation: https://help.opendatasoft.com/apis/ods-explore-v2/
⚠️ CRITICAL FINDING: UNESCO DataHub uses OpenDataSoft's proprietary Explore API v2 (NOT OAI-PMH or standard REST)
API Architecture:
- API Version: Explore API v2.0 and v2.1
- Base Endpoints: `https://data.unesco.org/api/explore/v2.0/` and `https://data.unesco.org/api/explore/v2.1/`
- Query Language: ODSQL (OpenDataSoft SQL-like query language, NOT standard SQL)
- Response Format: JSON
- Authentication: None required for public datasets (optional for private data)
Available Formats:
- JSON API via OpenDataSoft Explore API (primary extraction method)
- CSV export via portal
- DCAT-AP metadata
- RDF/XML (via OpenDataSoft serialization)
- GeoJSON (for geographic datasets)
Rate Limits:
- Public API: ~100 requests/minute (OpenDataSoft default, to be verified)
- Bulk download: Available via CSV/JSON export from portal
- SPARQL: NOT available (OpenDataSoft uses ODSQL instead)
Authentication: None required for public data
Legacy JSON Endpoint (BLOCKED):
- URL: https://whc.unesco.org/en/list/json/
- Status: Returns 403 Forbidden (Cloudflare protection)
- Mitigation: Use OpenDataSoft API instead
2. Wikidata Integration
Purpose: Enrich UNESCO sites with Wikidata Q-numbers
Endpoint: https://query.wikidata.org/sparql
Query Pattern:
```sparql
SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .  # UNESCO World Heritage Site ID (P757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
```
Usage:
- Cross-reference UNESCO WHC IDs with Wikidata
- Enrich with additional identifiers (VIAF, GND, etc.)
- Extract alternative names in multiple languages
3. GeoNames
Purpose: Geocode locations, generate UN/LOCODE for GHCID
Endpoint: http://api.geonames.org/
API Key: Required (free tier: 20k requests/day)
Usage:
- Validate coordinates
- Look up UN/LOCODE for cities
- Resolve place names to GeoNames IDs
4. ISO Standards Data
ISO 3166-1 alpha-2 (Country Codes):
- Source: `pycountry` library
- Purpose: Standardize country identifiers for GHCID
ISO 3166-2 (Region Codes):
- Source: `pycountry` library
- Purpose: Region codes for GHCID generation
Python Library Dependencies
Core Libraries
```toml
[tool.poetry.dependencies]
python = "^3.11"

# HTTP & API clients
httpx = "^0.25.0"          # Async HTTP for UNESCO API
aiohttp = "^3.9.0"         # Alternative async HTTP
requests-cache = "^1.1.0"  # Cache API responses
# NOTE: OpenDataSoft API uses standard REST, no special client library needed
# ODSQL queries passed as URL parameters to Explore API v2

# Data processing
pandas = "^2.1.0"       # Tabular data manipulation
pydantic = "^2.4.0"     # Data validation
jsonschema = "^4.19.0"  # JSON schema validation

# LinkML ecosystem
linkml = "^1.6.0"          # Schema tools
linkml-runtime = "^1.6.0"  # Runtime validation
linkml-map = "^0.1.0"      # Mapping transformations ⚠️ CRITICAL

# RDF & Semantic Web
rdflib = "^7.0.0"         # RDF graph manipulation
sparqlwrapper = "^2.0.0"  # SPARQL queries
jsonld = "^1.8.0"         # JSON-LD processing

# Geographic data
geopy = "^2.4.0"       # Geocoding
pycountry = "^22.3.5"  # Country/region codes
shapely = "^2.0.0"     # Geometric operations (optional)

# Text processing
unidecode = "^1.3.7"  # Unicode normalization
ftfy = "^6.1.1"       # Fix text encoding
rapidfuzz = "^3.5.0"  # Fuzzy matching

# Utilities
tqdm = "^4.66.0"   # Progress bars
rich = "^13.7.0"   # Beautiful terminal output
loguru = "^0.7.0"  # Structured logging
```
Development Dependencies
```toml
[tool.poetry.group.dev.dependencies]
# Testing
pytest = "^7.4.0"
pytest-asyncio = "^0.21.0"  # Async test support
pytest-cov = "^4.1.0"       # Coverage
pytest-mock = "^3.12.0"     # Mocking
hypothesis = "^6.92.0"      # Property-based testing
responses = "^0.24.0"       # Mock HTTP responses

# Code quality
ruff = "^0.1.0"        # Fast linter
mypy = "^1.7.0"        # Type checking
pre-commit = "^3.5.0"  # Git hooks

# Documentation
mkdocs = "^1.5.0"
mkdocs-material = "^9.5.0"
```
OpenDataSoft API Integration
⚠️ CRITICAL: UNESCO uses OpenDataSoft's proprietary Explore API, NOT a standard protocol like OAI-PMH.
ODSQL Query Language:
- Type: SQL-like query language (NOT standard SQL)
- Purpose: Filter, aggregate, and query OpenDataSoft datasets
- Syntax: Similar to SQL SELECT/WHERE/GROUP BY but with custom functions
- Documentation: https://help.opendatasoft.com/apis/ods-explore-v2/#odsql
Example ODSQL Query:
```python
# Fetch UNESCO World Heritage Sites inscribed after 2000 in Europe
import requests

url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
params = {
    "select": "site_name, date_inscribed, latitude, longitude",
    "where": "date_inscribed > 2000 AND region = 'Europe'",
    "limit": 100,
    "offset": 0,
    "order_by": "date_inscribed DESC",
}
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
data = response.json()
```
Key Differences from Standard REST APIs:
- Queries passed as URL parameters (not POST body)
- ODSQL syntax in `where`, `select`, `group_by` parameters
- Pagination via `limit` and `offset` (not standard cursors)
- Response structure: `{"total_count": N, "results": [...]}`
Implementation Strategy:
- Use `requests` or `httpx` for HTTP calls (no special client library)
- Build ODSQL queries programmatically as strings
- Handle pagination with `limit`/`offset` loops
- Cache responses with `requests-cache`
Critical: LinkML Map Extension
⚠️ IMPORTANT: This project requires LinkML Map with conditional XPath/JSONPath/regex support.
Status: To be implemented as bespoke extension
Required Features:
- Conditional extraction - Extract based on field value conditions
- XPath support - Navigate XML/HTML structures from UNESCO
- JSONPath support - Query complex JSON API responses
- Regex patterns - Extract and validate identifier formats
- Multi-value handling - Map arrays to multivalued LinkML slots
Implementation Path:
- Extend the `linkml-map` package, OR
- Create a custom `linkml_map_extended` module in the project
Example Usage:
```yaml
# LinkML Map with conditional XPath
mappings:
  - source: unesco_site
    target: HeritageCustodian
    conditions:
      - path: "$.category"
        pattern: "^(Cultural|Mixed)$"  # Only cultural/mixed sites
    transforms:
      - source_path: "$.site_name"
        target_slot: name
      - source_path: "$.date_inscribed"
        target_slot: founded_date
        transform: "parse_year"
      - source_xpath: "//description[@lang='en']/text()"
        target_slot: description
```
Schema Dependencies
LinkML Global GLAM Schema (v0.2.1)
Modules Required:
- `schemas/core.yaml` - HeritageCustodian, Location, Identifier
- `schemas/enums.yaml` - InstitutionTypeEnum, DataSource
- `schemas/provenance.yaml` - Provenance, ChangeEvent
- `schemas/collections.yaml` - Collection metadata
New Enums to Add:
```yaml
# schemas/enums.yaml - Add to DataSource
DataSource:
  permissible_values:
    UNESCO_WORLD_HERITAGE:
      description: UNESCO World Heritage Centre API
      meaning: heritage:UNESCODataHub
```
Ontology Dependencies
Required Ontologies:
- CPOV (Core Public Organisation Vocabulary)
  - Path: `/data/ontology/core-public-organisation-ap.ttl`
  - Usage: Map UNESCO sites to public organizations
- Schema.org
  - Path: `/data/ontology/schemaorg.owl`
  - Usage: `schema:Museum`, `schema:TouristAttraction`, `schema:Place`
- CIDOC-CRM (Cultural Heritage)
  - Path: `/data/ontology/CIDOC_CRM_v7.1.3.rdf`
  - Usage: `crm:E53_Place`, `crm:E74_Group` for heritage sites
- UNESCO Thesaurus
  - URL: http://vocabularies.unesco.org/thesaurus/
  - Usage: Map UNESCO criteria to controlled vocabulary
  - Format: SKOS (RDF)
To Download:
```bash
# UNESCO Thesaurus
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Verify CIDOC-CRM is present
ls data/ontology/CIDOC_CRM_v7.1.3.rdf
```
External Service Dependencies
1. UNESCO OpenDataSoft API Service
Platform: OpenDataSoft Explore API v2.0/v2.1
Primary Endpoint: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records
Legacy Endpoint (BLOCKED):
- URL: `https://whc.unesco.org/en/list/json/`
- Status: Returns 403 Forbidden (Cloudflare protection)
- Action: Do NOT use - use OpenDataSoft API instead
Availability: 99.5% uptime (estimated, OpenDataSoft SLA)
API Response Structure:
```json
{
  "total_count": 1199,
  "results": [
    {
      "site_name": "Historic Centre of Rome",
      "date_inscribed": 1980,
      "category": "Cultural",
      "latitude": 41.9,
      "longitude": 12.5,
      "criteria_txt": "(i)(ii)(iii)(iv)(vi)",
      "states_name_en": "Italy"
      // ... additional fields
    }
  ]
}
```
Pagination:
- Use `limit` (max 100 per request) and `offset` parameters
- Total record count available in the `total_count` field
- No cursor-based pagination
Fallback Strategy:
- Cache API responses locally (requests-cache with 24-hour TTL)
- Implement exponential backoff on 5xx errors
- Fall back to CSV export download if the API is unavailable for >1 hour
- CSV export URL: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv`
Monitoring:
- Log all API response times
- Alert if >5% of requests fail or response time >2s
- Track `total_count` for dataset size changes
- Monitor for schema changes via response validation
2. Wikidata SPARQL Endpoint
Endpoint: https://query.wikidata.org/sparql
Rate Limits:
- 60 requests/minute (public)
- User-Agent header required
Fallback Strategy:
- Cache SPARQL results (24-hour TTL)
- Batch queries to minimize requests
- Skip enrichment if endpoint unavailable
3. GeoNames API
Endpoint: http://api.geonames.org/
Rate Limits:
- Free tier: 20,000 requests/day
- Premium: 200,000 requests/day
API Key Storage: Environment variable GEONAMES_API_KEY
Fallback Strategy:
- Use UNESCO-provided coordinates as primary
- GeoNames lookup only for UN/LOCODE generation
- Cache lookups indefinitely (place names don't change often)
File System Dependencies
Required Directories
```
glam-extractor/
├── data/
│   ├── unesco/
│   │   ├── raw/                     # Raw API responses
│   │   ├── cache/                   # HTTP cache
│   │   ├── instances/               # LinkML instances
│   │   └── mappings/                # LinkML Map schemas
│   ├── ontology/
│   │   ├── unesco_thesaurus.ttl     # UNESCO vocabulary
│   │   └── [existing ontologies]
│   └── reference/
│       └── unesco_criteria.yaml     # Criteria definitions
├── src/glam_extractor/
│   ├── extractors/
│   │   └── unesco.py                # UNESCO extractor
│   ├── mappers/
│   │   └── unesco_mapper.py         # LinkML Map integration
│   └── validators/
│       └── unesco_validator.py      # UNESCO-specific validation
└── tests/
    └── unesco/
        ├── fixtures/                # Test data
        └── test_unesco_*.py         # Test modules
```
Environment Variables
```bash
# .env file
GEONAMES_API_KEY=your_key_here
UNESCO_API_CACHE_TTL=86400  # 24 hours
WIKIDATA_RATE_LIMIT=60      # requests/minute
LOG_LEVEL=INFO
```
Database Dependencies
DuckDB (Intermediate Storage)
Purpose: Store extracted UNESCO data during processing
Schema:
```sql
CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    country TEXT,
    category TEXT,             -- Cultural, Natural, Mixed
    date_inscribed INTEGER,
    latitude DOUBLE,
    longitude DOUBLE,
    raw_json TEXT,             -- Full API response
    extracted_at TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE unesco_criteria (
    whc_id INTEGER,
    criterion_code TEXT,       -- (i), (ii), ..., (x)
    criterion_text TEXT,
    FOREIGN KEY (whc_id) REFERENCES unesco_sites(whc_id)
);
SQLite (Cache)
Purpose: HTTP response caching (via requests-cache)
Auto-managed: Library handles schema
Network Dependencies
Required Network Access
- `data.unesco.org` (port 443) - UNESCO OpenDataSoft API (primary)
- `whc.unesco.org` (port 443) - UNESCO WHC website (legacy, blocked by Cloudflare)
- `query.wikidata.org` (port 443) - Wikidata SPARQL
- `api.geonames.org` (port 80/443) - GeoNames API
- `vocabularies.unesco.org` (port 443) - UNESCO Thesaurus
Note: The legacy whc.unesco.org/en/list/json/ endpoint is protected by Cloudflare and returns 403. Use data.unesco.org OpenDataSoft API instead.
Offline Mode
Supported: Yes, with cached data
Requirements:
- Pre-downloaded UNESCO dataset snapshot
- Cached Wikidata SPARQL results
- GeoNames offline database (optional)
Activate:
```bash
export GLAM_OFFLINE_MODE=true
```
Version Constraints
Python Version
- Minimum: Python 3.11
- Recommended: Python 3.11 or 3.12
- Reason: Modern type hints, performance improvements
LinkML Version
- Minimum: linkml 1.6.0
- Reason: Required for latest schema features
API Versions
- UNESCO OpenDataSoft API: Explore API v2.0 and v2.1 (both supported)
- Wikidata SPARQL: SPARQL 1.1 standard
- GeoNames API: v3.0
Breaking Change Detection:
- Monitor UNESCO API responses for schema changes
- Test suite includes response validation against expected structure
- Alert if field names or types change unexpectedly
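The response-validation idea above can be sketched with `pydantic` (already a core dependency). The field set mirrors the sample API response documented earlier; `SiteRecord` and `validate_results` are hypothetical names.

```python
# Sketch: detecting breaking schema changes by validating API records.
from pydantic import BaseModel, ValidationError

class SiteRecord(BaseModel):
    site_name: str
    date_inscribed: int
    category: str
    latitude: float
    longitude: float

def validate_results(results: list[dict]) -> list[str]:
    """Return validation error summaries; an empty list means the schema is intact."""
    errors = []
    for i, record in enumerate(results):
        try:
            SiteRecord.model_validate(record)
        except ValidationError as exc:
            errors.append(f"record {i}: {exc.error_count()} field error(s)")
    return errors

good = {"site_name": "Historic Centre of Rome", "date_inscribed": 1980,
        "category": "Cultural", "latitude": 41.9, "longitude": 12.5}
bad = {"site_name": "X", "category": "Cultural"}  # missing fields -> alert
print(validate_results([good, bad]))  # ['record 1: 3 field error(s)']
```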
Dependency Installation
Full Installation
```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset

# Install with Poetry (includes UNESCO dependencies)
poetry install

# Download UNESCO Thesaurus
mkdir -p data/ontology
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Set up environment
cp .env.example .env
# Edit .env with your GeoNames API key
```
Minimal Installation (Testing Only)
```bash
poetry install --without dev
# Use cached fixtures, no API access
```
Dependency Risk Assessment
High Risk
- LinkML Map Extension ⚠️
  - Risk: Custom extension not yet implemented
  - Mitigation: Prioritize in Phase 1; create fallback pure-Python mapper
  - Timeline: Weeks 1-2
- UNESCO OpenDataSoft API Stability
  - Risk: Schema changes in the OpenDataSoft platform or dataset fields
  - Mitigation:
    - Version API responses in cache filenames
    - Automated schema validation tests
    - Monitor `total_count` for unexpected changes
  - Detection: Weekly automated tests; compare field names/types
  - Fallback: CSV export download if breaking API changes detected
- ODSQL Query Language Changes
  - Risk: OpenDataSoft updates ODSQL syntax in breaking ways
  - Mitigation:
    - Pin to Explore API v2.0 (stable version)
    - Test queries against live API weekly
    - Document ODSQL syntax version in code comments
  - Impact: Query failures; ODSQL expressions must be rewritten
Medium Risk
- GeoNames Rate Limits
  - Risk: Exceed free tier (20k/day)
  - Mitigation: Aggressive caching, batch processing
  - Fallback: Manual UN/LOCODE lookup table
- Wikidata Endpoint Load
  - Risk: SPARQL endpoint slow or unavailable
  - Mitigation: Timeout handling; skip enrichment on failure
  - Impact: Reduced data quality, but no blocking errors
Low Risk
- Python Library Updates
  - Risk: Breaking changes in dependencies
  - Mitigation: Pin versions in poetry.lock
  - Testing: CI/CD catches incompatibilities
Monitoring Dependencies
Health Checks
```python
# tests/test_dependencies.py
import os

import httpx


def test_unesco_opendatasoft_api_available():
    """Verify UNESCO OpenDataSoft API is accessible."""
    url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
    response = httpx.get(url, params={"limit": 1}, timeout=10)
    assert response.status_code == 200
    data = response.json()
    assert "total_count" in data
    assert "results" in data
    assert data["total_count"] > 1000  # Expect 1000+ UNESCO sites


def test_unesco_legacy_api_blocked():
    """Verify the legacy JSON endpoint is indeed blocked."""
    url = "https://whc.unesco.org/en/list/json/"
    response = httpx.get(url, timeout=10, follow_redirects=True)
    # Expect 403 Forbidden due to Cloudflare protection
    assert response.status_code == 403


def test_wikidata_sparql_available():
    """Verify the Wikidata SPARQL endpoint."""
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    assert len(results["results"]["bindings"]) > 0


def test_geonames_api_key():
    """Verify the GeoNames API key (account username) is valid."""
    api_key = os.getenv("GEONAMES_API_KEY")
    assert api_key, "GEONAMES_API_KEY not set in environment"
    url = "http://api.geonames.org/getJSON"
    # GeoNames authenticates via the `username` parameter
    params = {"geonameId": 6295630, "username": api_key}  # 6295630 = Earth
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    assert "geonameId" in response.json()
```
Dependency Update Schedule
- Weekly: Security patches (automated via Dependabot)
- Monthly: Minor version updates (manual review)
- Quarterly: Major version updates (full regression testing)
Related Documentation
- Implementation Phases: `docs/plan/unesco/03-implementation-phases.md`
- LinkML Map Schema: `docs/plan/unesco/06-linkml-map-schema.md`
- TDD Strategy: `docs/plan/unesco/04-tdd-strategy.md`
Version: 1.1
Date: 2025-11-09
Status: Updated - OpenDataSoft API architecture documented
Changelog:
- v1.1 (2025-11-09): Updated to reflect OpenDataSoft Explore API v2 (not OAI-PMH)
- Added ODSQL query language documentation
- Updated API endpoints and response structures
- Added health checks for OpenDataSoft API
- Documented legacy endpoint blocking (Cloudflare 403)
- v1.0 (2025-11-09): Initial version