glam/docs/plan/unesco/01-dependencies.md
2025-11-19 23:25:22 +01:00


UNESCO Data Extraction: Dependencies

Overview

This document catalogs all dependencies for extracting UNESCO World Heritage Site data and integrating it with the Global GLAM Dataset.

External Data Dependencies

1. UNESCO DataHub

Primary Source: UNESCO DataHub (https://data.unesco.org/)

⚠️ CRITICAL FINDING: UNESCO DataHub uses OpenDataSoft's proprietary Explore API v2 (NOT OAI-PMH or standard REST)

API Architecture:

  • API Version: Explore API v2.0 and v2.1
  • Base Endpoint: https://data.unesco.org/api/explore/v2.0/ and https://data.unesco.org/api/explore/v2.1/
  • Query Language: ODSQL (OpenDataSoft SQL-like query language, NOT standard SQL)
  • Response Format: JSON
  • Authentication: None required for public datasets (optional for private data)

Available Formats:

  • JSON API via OpenDataSoft Explore API (primary extraction method)
  • CSV export via portal
  • DCAT-AP metadata
  • RDF/XML (via OpenDataSoft serialization)
  • GeoJSON (for geographic datasets)

Rate Limits:

  • Public API: ~100 requests/minute (OpenDataSoft default, to be verified)
  • Bulk download: Available via CSV/JSON export from portal
  • SPARQL: NOT available (OpenDataSoft uses ODSQL instead)

Authentication: None required for public data

Legacy JSON Endpoint (BLOCKED):

  • https://whc.unesco.org/en/list/json/ returns 403 Forbidden (Cloudflare protection); do NOT use it. Details under External Service Dependencies below.

2. Wikidata Integration

Purpose: Enrich UNESCO sites with Wikidata Q-numbers

Endpoint: https://query.wikidata.org/sparql

Query Pattern:

SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .  # UNESCO World Heritage Site ID (P757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
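
The query above can be prepared for submission with the standard library alone; no SPARQL client is strictly required. The sketch below builds the GET request URL and headers (Wikidata requires a User-Agent; the contact address shown is a placeholder, not from this plan):

```python
# Sketch: prepare the P757 query above for the Wikidata SPARQL endpoint.
# Standard library only; a descriptive User-Agent header is required by Wikidata.
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

WHC_ID_QUERY = """\
SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}"""

def build_sparql_request(query: str, fmt: str = "json") -> tuple[str, dict]:
    """Return (url, headers) for a GET request against the Wikidata endpoint."""
    url = f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': fmt})}"
    # Placeholder User-Agent; replace with a real project name and contact.
    headers = {"User-Agent": "glam-extractor/0.1 (contact: example@example.org)"}
    return url, headers

url, headers = build_sparql_request(WHC_ID_QUERY)
```

The resulting URL can then be fetched with httpx or sparqlwrapper from the dependency list.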

Usage:

  • Cross-reference UNESCO WHC IDs with Wikidata
  • Enrich with additional identifiers (VIAF, GND, etc.)
  • Extract alternative names in multiple languages

3. GeoNames

Purpose: Geocode locations, generate UN/LOCODE for GHCID

Endpoint: http://api.geonames.org/

API Key: Required (free tier: 20k requests/day)

Usage:

  • Validate coordinates
  • Look up UN/LOCODE for cities
  • Resolve place names to GeoNames IDs
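
A minimal sketch of the lookup step, assuming the standard GeoNames getJSON endpoint and response shape (lat/lng come back as strings). The sample record is illustrative data, not a live response:

```python
# Sketch: build a GeoNames getJSON request URL and extract the fields used
# for coordinate validation. GeoNames authenticates via a 'username' query
# parameter (stored in GEONAMES_API_KEY elsewhere in this plan).
from urllib.parse import urlencode

def geonames_lookup_url(geoname_id: int, username: str) -> str:
    """Return the getJSON request URL for a single GeoNames ID."""
    return "http://api.geonames.org/getJSON?" + urlencode(
        {"geonameId": geoname_id, "username": username}
    )

def parse_geoname(record: dict) -> dict:
    """Extract the subset of getJSON fields this pipeline needs.
    GeoNames returns lat/lng as strings, so they are cast to float here."""
    return {
        "geoname_id": record["geonameId"],
        "name": record["name"],
        "lat": float(record["lat"]),
        "lng": float(record["lng"]),
    }

# Illustrative sample record (shape mirrors a getJSON response)
sample = {"geonameId": 3169070, "name": "Rome", "lat": "41.89193", "lng": "12.51133"}
place = parse_geoname(sample)
```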

4. ISO Standards Data

ISO 3166-1 alpha-2 (Country Codes):

  • Source: pycountry library
  • Purpose: Standardize country identifiers for GHCID

ISO 3166-2 (Region Codes):

  • Source: pycountry library
  • Purpose: Region codes for GHCID generation
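
How the ISO codes feed GHCID generation can be sketched as below. The GHCID layout shown (COUNTRY-REGION-SEQUENCE) is purely illustrative, since this document does not specify the format; in the real pipeline the codes would come from pycountry rather than the literals used here:

```python
# Sketch: assemble a hypothetical GHCID from ISO 3166 codes.
# The format below is an assumption for illustration only.
def make_ghcid(alpha_2: str, region_part: str, sequence: int) -> str:
    """Compose an identifier from an ISO 3166-1 alpha-2 country code, the
    region part of an ISO 3166-2 code, and a per-region sequence number."""
    if len(alpha_2) != 2 or not alpha_2.isalpha():
        raise ValueError(f"not an ISO 3166-1 alpha-2 code: {alpha_2!r}")
    return f"{alpha_2.upper()}-{region_part.upper()}-{sequence:05d}"

# e.g. Lazio, Italy: ISO 3166-2 code IT-62, region part "62"
ghcid = make_ghcid("IT", "62", 1)
```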

Python Library Dependencies

Core Libraries

[tool.poetry.dependencies]
python = "^3.11"

# HTTP & API clients
httpx = "^0.25.0"              # Async HTTP for UNESCO API
aiohttp = "^3.9.0"             # Alternative async HTTP
requests-cache = "^1.1.0"      # Cache API responses
# NOTE: OpenDataSoft API uses standard REST, no special client library needed
#       ODSQL queries passed as URL parameters to Explore API v2

# Data processing
pandas = "^2.1.0"              # Tabular data manipulation
pydantic = "^2.4.0"            # Data validation
jsonschema = "^4.19.0"         # JSON schema validation

# LinkML ecosystem
linkml = "^1.6.0"              # Schema tools
linkml-runtime = "^1.6.0"      # Runtime validation
linkml-map = "^0.1.0"          # Mapping transformations ⚠️ CRITICAL

# RDF & Semantic Web
rdflib = "^7.0.0"              # RDF graph manipulation
sparqlwrapper = "^2.0.0"       # SPARQL queries
pyld = "^2.0.0"                # JSON-LD processing (PyPI package is pyld, not jsonld)

# Geographic data
geopy = "^2.4.0"               # Geocoding
pycountry = "^22.3.5"          # Country/region codes
shapely = "^2.0.0"             # Geometric operations (optional)

# Text processing
unidecode = "^1.3.7"           # Unicode normalization
ftfy = "^6.1.1"                # Fix text encoding
rapidfuzz = "^3.5.0"           # Fuzzy matching

# Utilities
tqdm = "^4.66.0"               # Progress bars
rich = "^13.7.0"               # Beautiful terminal output
loguru = "^0.7.0"              # Structured logging

Development Dependencies

[tool.poetry.group.dev.dependencies]
# Testing
pytest = "^7.4.0"
pytest-asyncio = "^0.21.0"     # Async test support
pytest-cov = "^4.1.0"          # Coverage
pytest-mock = "^3.12.0"        # Mocking
hypothesis = "^6.92.0"         # Property-based testing
responses = "^0.24.0"          # Mock HTTP responses

# Code quality
ruff = "^0.1.0"                # Fast linter
mypy = "^1.7.0"                # Type checking
pre-commit = "^3.5.0"          # Git hooks

# Documentation
mkdocs = "^1.5.0"
mkdocs-material = "^9.5.0"

OpenDataSoft API Integration

⚠️ CRITICAL: UNESCO uses OpenDataSoft's proprietary Explore API, NOT a standard protocol like OAI-PMH.

ODSQL Query Language: filters, projections, and ordering are written in ODSQL and passed as URL parameters (select, where, order_by) rather than as a request body.

Example ODSQL Query:

# Fetch UNESCO World Heritage Sites inscribed after 2000 in Europe
import requests

url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
params = {
    "select": "site_name, date_inscribed, latitude, longitude",
    "where": "date_inscribed > 2000 AND region = 'Europe'",
    "limit": 100,
    "offset": 0,
    "order_by": "date_inscribed DESC"
}
response = requests.get(url, params=params)
data = response.json()

Key Differences from Standard REST APIs:

  • Queries passed as URL parameters (not POST body)
  • ODSQL syntax in where, select, group_by parameters
  • Pagination via limit and offset (not standard cursors)
  • Response structure: {"total_count": N, "results": [...]}

Implementation Strategy:

  • Use requests or httpx for HTTP calls (no special client library)
  • Build ODSQL queries programmatically as strings
  • Handle pagination with limit/offset loops
  • Cache responses with requests-cache
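
The pagination strategy above can be sketched as a small loop. The HTTP call is injected as a callable so it can be served by httpx or requests in production and by a stub here; the response shape {"total_count": N, "results": [...]} matches the Explore API v2 structure documented later in this file:

```python
# Sketch of the limit/offset pagination loop for an OpenDataSoft-style endpoint.
from typing import Callable

def fetch_all_records(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 100,  # Explore API caps limit at 100 per request
) -> list[dict]:
    """Page through an endpoint until total_count records have been collected."""
    records: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        records.extend(page["results"])
        offset += page_size
        if offset >= page["total_count"] or not page["results"]:
            break
    return records

# Stubbed example: 250 records served in pages of 100.
def fake_page(limit: int, offset: int) -> dict:
    data = [{"whc_id": i} for i in range(250)]
    return {"total_count": 250, "results": data[offset:offset + limit]}

all_records = fetch_all_records(fake_page)
```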

Critical: LinkML Map Extension

⚠️ IMPORTANT: This project requires LinkML Map with conditional XPath/JSONPath/regex support.

Status: To be implemented as bespoke extension

Required Features:

  1. Conditional extraction - Extract based on field value conditions
  2. XPath support - Navigate XML/HTML structures from UNESCO
  3. JSONPath support - Query complex JSON API responses
  4. Regex patterns - Extract and validate identifier formats
  5. Multi-value handling - Map arrays to multivalued LinkML slots

Implementation Path:

  • Extend linkml-map package OR
  • Create custom linkml_map_extended module in project

Example Usage:

# LinkML Map with conditional XPath
mappings:
  - source: unesco_site
    target: HeritageCustodian
    conditions:
      - path: "$.category"
        pattern: "^(Cultural|Mixed)$"  # Only cultural/mixed sites
    transforms:
      - source_path: "$.site_name"
        target_slot: name
      - source_path: "$.date_inscribed"
        target_slot: founded_date
        transform: "parse_year"
      - source_xpath: "//description[@lang='en']/text()"
        target_slot: description
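
Until the extension exists, the conditional logic in the mapping above can be approximated by a pure-Python fallback (the mitigation named under Dependency Risk Assessment). The sketch below handles only simple "$.field" paths via a minimal dot-path helper, not full JSONPath or XPath:

```python
# Sketch of a pure-Python fallback mapper: a regex condition on one field
# gates a set of source-path -> target-slot transforms.
import re
from typing import Any, Optional

def resolve_path(record: dict, path: str) -> Any:
    """Resolve a '$.a.b' style path against a nested dict (minimal subset)."""
    value: Any = record
    for part in path.lstrip("$.").split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value

def map_record(record: dict, conditions: list, transforms: list) -> Optional[dict]:
    """Return mapped slots, or None if any condition pattern fails to match."""
    for cond in conditions:
        value = resolve_path(record, cond["path"])
        if value is None or not re.match(cond["pattern"], str(value)):
            return None
    return {t["target_slot"]: resolve_path(record, t["source_path"]) for t in transforms}

site = {"category": "Cultural", "site_name": "Historic Centre of Rome", "date_inscribed": 1980}
mapped = map_record(
    site,
    conditions=[{"path": "$.category", "pattern": r"^(Cultural|Mixed)$"}],
    transforms=[
        {"source_path": "$.site_name", "target_slot": "name"},
        {"source_path": "$.date_inscribed", "target_slot": "founded_date"},
    ],
)
```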

Schema Dependencies

LinkML Global GLAM Schema (v0.2.1)

Modules Required:

  • schemas/core.yaml - HeritageCustodian, Location, Identifier
  • schemas/enums.yaml - InstitutionTypeEnum, DataSource
  • schemas/provenance.yaml - Provenance, ChangeEvent
  • schemas/collections.yaml - Collection metadata

New Enums to Add:

# schemas/enums.yaml - Add to DataSource
DataSource:
  permissible_values:
    UNESCO_WORLD_HERITAGE:
      description: UNESCO World Heritage Centre API
      meaning: heritage:UNESCODataHub

Ontology Dependencies

Required Ontologies:

  1. CPOV (Core Public Organisation Vocabulary)

    • Path: /data/ontology/core-public-organisation-ap.ttl
    • Usage: Map UNESCO sites to public organizations
  2. Schema.org

    • Path: /data/ontology/schemaorg.owl
    • Usage: schema:Museum, schema:TouristAttraction, schema:Place
  3. CIDOC-CRM (Cultural Heritage)

    • Path: /data/ontology/CIDOC_CRM_v7.1.3.rdf
    • Usage: crm:E53_Place, crm:E74_Group for heritage sites
  4. UNESCO Thesaurus

To Download:

# UNESCO Thesaurus
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Verify CIDOC-CRM is present
ls data/ontology/CIDOC_CRM_v7.1.3.rdf

External Service Dependencies

1. UNESCO OpenDataSoft API Service

Platform: OpenDataSoft Explore API v2.0/v2.1

Primary Endpoint: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records

Legacy Endpoint (BLOCKED):

  • https://whc.unesco.org/en/list/json/
  • Status: Returns 403 Forbidden (Cloudflare protection)
  • Action: Do NOT use - use OpenDataSoft API instead

Availability: 99.5% uptime (estimated, OpenDataSoft SLA)

API Response Structure:

{
  "total_count": 1199,
  "results": [
    {
      "site_name": "Historic Centre of Rome",
      "date_inscribed": 1980,
      "category": "Cultural",
      "latitude": 41.9,
      "longitude": 12.5,
      "criteria_txt": "(i)(ii)(iii)(iv)(vi)",
      "states_name_en": "Italy",
      // ... additional fields
    }
  ]
}

Pagination:

  • Use limit (max 100 per request) and offset parameters
  • Total records available in total_count field
  • No cursor-based pagination

Fallback Strategy:

  • Cache API responses locally (requests-cache with 24-hour TTL)
  • Implement exponential backoff on 5xx errors
  • Fallback to CSV export download if API unavailable for >1 hour
  • CSV export URL: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv
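
The exponential backoff in the fallback strategy can be sketched as a small wrapper. The request is injected as a callable so any HTTP client (httpx, requests) can be wrapped; RetryableError is a placeholder the caller would raise on 5xx responses:

```python
# Sketch: retry a callable with exponentially increasing delays.
import time

class RetryableError(Exception):
    """Raised by the caller for 5xx responses that should be retried."""

def with_backoff(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `call()` on RetryableError, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Stubbed example: fails twice with a 5xx, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError("HTTP 503")
    return {"total_count": 1199}

result = with_backoff(flaky_call, base_delay=0.01)
```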

Monitoring:

  • Log all API response times
  • Alert if >5% requests fail or response time >2s
  • Track total_count for dataset size changes
  • Monitor for schema changes via response validation
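
The response-time logging and 2-second alert threshold above can be sketched as a timing wrapper. The logging backend (loguru in this plan) is replaced by a plain list so the sketch stays self-contained:

```python
# Sketch: time each API call and flag anything slower than the alert threshold.
import time

SLOW_THRESHOLD_S = 2.0  # alert threshold from the monitoring checklist
slow_calls: list = []

def timed_call(label: str, call):
    """Run `call()`, record its duration, and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD_S:
        slow_calls.append((label, elapsed))  # in production: loguru warning + alert
    return result, elapsed

result, elapsed = timed_call("whc001/records", lambda: {"total_count": 1199})
```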

2. Wikidata SPARQL Endpoint

Endpoint: https://query.wikidata.org/sparql

Rate Limits:

  • 60 requests/minute (public)
  • User-Agent header required

Fallback Strategy:

  • Cache SPARQL results (24-hour TTL)
  • Batch queries to minimize requests
  • Skip enrichment if endpoint unavailable

3. GeoNames API

Endpoint: http://api.geonames.org/

Rate Limits:

  • Free tier: 20,000 requests/day
  • Premium: 200,000 requests/day

API Key Storage: Environment variable GEONAMES_API_KEY

Fallback Strategy:

  • Use UNESCO-provided coordinates as primary
  • GeoNames lookup only for UN/LOCODE generation
  • Cache lookups indefinitely (place names don't change often)

File System Dependencies

Required Directories

glam-extractor/
├── data/
│   ├── unesco/
│   │   ├── raw/                 # Raw API responses
│   │   ├── cache/               # HTTP cache
│   │   ├── instances/           # LinkML instances
│   │   └── mappings/            # LinkML Map schemas
│   ├── ontology/
│   │   ├── unesco_thesaurus.ttl # UNESCO vocabulary
│   │   └── [existing ontologies]
│   └── reference/
│       └── unesco_criteria.yaml # Criteria definitions
├── src/glam_extractor/
│   ├── extractors/
│   │   └── unesco.py            # UNESCO extractor
│   ├── mappers/
│   │   └── unesco_mapper.py     # LinkML Map integration
│   └── validators/
│       └── unesco_validator.py  # UNESCO-specific validation
└── tests/
    └── unesco/
        ├── fixtures/            # Test data
        └── test_unesco_*.py     # Test modules

Environment Variables

# .env file
GEONAMES_API_KEY=your_key_here
UNESCO_API_CACHE_TTL=86400       # 24 hours
WIKIDATA_RATE_LIMIT=60           # requests/minute
LOG_LEVEL=INFO

Database Dependencies

DuckDB (Intermediate Storage)

Purpose: Store extracted UNESCO data during processing

Schema:

CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    country TEXT,
    category TEXT,  -- Cultural, Natural, Mixed
    date_inscribed INTEGER,
    latitude DOUBLE,
    longitude DOUBLE,
    raw_json TEXT,  -- Full API response
    extracted_at TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE unesco_criteria (
    whc_id INTEGER,
    criterion_code TEXT,  -- (i), (ii), ..., (x)
    criterion_text TEXT,
    FOREIGN KEY (whc_id) REFERENCES unesco_sites(whc_id)
);
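
A quick smoke test of the sites DDL above: production uses DuckDB (via the duckdb package), but the statements are plain SQL, so the stdlib sqlite3 module is used here to keep the sketch dependency-free. The inserted row is illustrative data:

```python
# Sketch: exercise the unesco_sites DDL with an in-memory stdlib database.
import sqlite3

DDL = """
CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    country TEXT,
    category TEXT,
    date_inscribed INTEGER,
    latitude DOUBLE,
    longitude DOUBLE,
    raw_json TEXT,
    extracted_at TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);
"""

con = sqlite3.connect(":memory:")
con.executescript(DDL)
con.execute(
    "INSERT INTO unesco_sites (whc_id, site_name, country, category, date_inscribed)"
    " VALUES (?, ?, ?, ?, ?)",
    (91, "Historic Centre of Rome", "Italy", "Cultural", 1980),
)
row = con.execute(
    "SELECT site_name, processed FROM unesco_sites WHERE whc_id = 91"
).fetchone()
```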

SQLite (Cache)

Purpose: HTTP response caching (via requests-cache)

Auto-managed: Library handles schema

Network Dependencies

Required Network Access

  • data.unesco.org (port 443) - UNESCO OpenDataSoft API (primary)
  • whc.unesco.org (port 443) - UNESCO WHC website (legacy, blocked by Cloudflare)
  • query.wikidata.org (port 443) - Wikidata SPARQL
  • api.geonames.org (port 80/443) - GeoNames API
  • vocabularies.unesco.org (port 443) - UNESCO Thesaurus

Note: The legacy whc.unesco.org/en/list/json/ endpoint is protected by Cloudflare and returns 403. Use data.unesco.org OpenDataSoft API instead.

Offline Mode

Supported: Yes, with cached data

Requirements:

  • Pre-downloaded UNESCO dataset snapshot
  • Cached Wikidata SPARQL results
  • GeoNames offline database (optional)

Activate:

export GLAM_OFFLINE_MODE=true

Version Constraints

Python Version

  • Minimum: Python 3.11
  • Recommended: Python 3.11 or 3.12
  • Reason: Modern type hints, performance improvements

LinkML Version

  • Minimum: linkml 1.6.0
  • Reason: Required for latest schema features

API Versions

  • UNESCO OpenDataSoft API: Explore API v2.0 and v2.1 (both supported)
  • Wikidata SPARQL: SPARQL 1.1 standard
  • GeoNames API: v3.0

Breaking Change Detection:

  • Monitor UNESCO API responses for schema changes
  • Test suite includes response validation against expected structure
  • Alert if field names or types change unexpectedly
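
The field-name/type check above can be sketched as a comparison against an expected baseline, so a breaking dataset change surfaces as a test failure. The baseline lists the fields used elsewhere in this document:

```python
# Sketch: detect schema drift in an API record by comparing field names and
# types against an expected baseline.
EXPECTED_FIELDS = {
    "site_name": str,
    "date_inscribed": int,
    "category": str,
    "latitude": float,
    "longitude": float,
}

def find_schema_drift(record: dict) -> list:
    """Return human-readable problems; an empty list means no drift detected."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

ok_record = {"site_name": "Historic Centre of Rome", "date_inscribed": 1980,
             "category": "Cultural", "latitude": 41.9, "longitude": 12.5}
drift = find_schema_drift(ok_record)
```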

Dependency Installation

Full Installation

# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset

# Install with Poetry (includes UNESCO dependencies)
poetry install

# Download UNESCO Thesaurus
mkdir -p data/ontology
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Set up environment
cp .env.example .env
# Edit .env with your GeoNames API key

Minimal Installation (Runtime Only)

poetry install --without dev
# Runtime dependencies only (test tooling lives in the dev group);
# offline runs use cached fixtures, no API access

Dependency Risk Assessment

High Risk

  1. LinkML Map Extension ⚠️

    • Risk: Custom extension not yet implemented
    • Mitigation: Prioritize in Phase 1, create fallback pure-Python mapper
    • Timeline: Week 1-2
  2. UNESCO OpenDataSoft API Stability

    • Risk: Schema changes in OpenDataSoft platform or dataset fields
    • Mitigation:
      • Version API responses in cache filenames
      • Automated schema validation tests
      • Monitor total_count for unexpected changes
    • Detection: Weekly automated tests, compare field names/types
    • Fallback: CSV export download if API breaking changes detected
  3. ODSQL Query Language Changes

    • Risk: OpenDataSoft updates ODSQL syntax in breaking ways
    • Mitigation:
      • Pin to Explore API v2.0 (stable version)
      • Test queries against live API weekly
      • Document ODSQL syntax version in code comments
    • Impact: Query failures, need to rewrite ODSQL expressions

Medium Risk

  1. GeoNames Rate Limits

    • Risk: Exceed free tier (20k/day)
    • Mitigation: Aggressive caching, batch processing
    • Fallback: Manual UN/LOCODE lookup table
  2. Wikidata Endpoint Load

    • Risk: SPARQL endpoint slow or unavailable
    • Mitigation: Timeout handling, skip enrichment on failure
    • Impact: Reduced data quality, no blocking errors

Low Risk

  1. Python Library Updates
    • Risk: Breaking changes in dependencies
    • Mitigation: Pin versions in poetry.lock
    • Testing: CI/CD catches incompatibilities

Monitoring Dependencies

Health Checks

# tests/test_dependencies.py
import httpx

def test_unesco_opendatasoft_api_available():
    """Verify UNESCO OpenDataSoft API is accessible"""
    url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
    params = {"limit": 1}
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    
    data = response.json()
    assert "total_count" in data
    assert "results" in data
    assert data["total_count"] > 1000  # Expect 1000+ UNESCO sites

def test_unesco_legacy_api_blocked():
    """Verify legacy JSON endpoint is indeed blocked"""
    url = "https://whc.unesco.org/en/list/json/"
    response = httpx.get(url, timeout=10, follow_redirects=True)
    # Expect 403 Forbidden due to Cloudflare protection
    assert response.status_code == 403

def test_wikidata_sparql_available():
    """Verify Wikidata SPARQL endpoint"""
    from SPARQLWrapper import SPARQLWrapper, JSON
    
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    sparql.setReturnFormat(JSON)
    
    results = sparql.query().convert()
    assert len(results["results"]["bindings"]) > 0

def test_geonames_api_key():
    """Verify GeoNames API key is valid"""
    import os
    
    api_key = os.getenv("GEONAMES_API_KEY")
    assert api_key, "GEONAMES_API_KEY not set in environment"
    
    url = "http://api.geonames.org/getJSON"
    params = {"geonameId": 6295630, "username": api_key}  # 6295630 = Earth; GeoNames auth uses the username parameter
    response = httpx.get(url, params=params, timeout=10)
    
    assert response.status_code == 200
    data = response.json()
    assert "geonameId" in data

Dependency Update Schedule

  • Weekly: Security patches (automated via Dependabot)
  • Monthly: Minor version updates (manual review)
  • Quarterly: Major version updates (full regression testing)

Related Documents

  • Implementation Phases: docs/plan/unesco/03-implementation-phases.md
  • LinkML Map Schema: docs/plan/unesco/06-linkml-map-schema.md
  • TDD Strategy: docs/plan/unesco/04-tdd-strategy.md

Version: 1.1
Date: 2025-11-09
Status: Updated - OpenDataSoft API architecture documented
Changelog:

  • v1.1 (2025-11-09): Updated to reflect OpenDataSoft Explore API v2 (not OAI-PMH)
    • Added ODSQL query language documentation
    • Updated API endpoints and response structures
    • Added health checks for OpenDataSoft API
    • Documented legacy endpoint blocking (Cloudflare 403)
  • v1.0 (2025-11-09): Initial version