# UNESCO Data Extraction: Dependencies

## Overview

This document catalogs all dependencies for extracting UNESCO World Heritage Site data and integrating it with the Global GLAM Dataset.

## External Data Dependencies

### 1. UNESCO DataHub

**Primary Source**:

- **URL**: https://data.unesco.org/
- **Platform**: OpenDataSoft (commercial open data platform)
- **World Heritage Dataset**: https://data.unesco.org/explore/dataset/whc001/
- **API Documentation**: https://help.opendatasoft.com/apis/ods-explore-v2/

**⚠️ CRITICAL FINDING**: UNESCO DataHub uses **OpenDataSoft's proprietary Explore API v2** (NOT OAI-PMH or standard REST).

**API Architecture**:

- **API Version**: Explore API v2.0 and v2.1
- **Base Endpoints**: `https://data.unesco.org/api/explore/v2.0/` and `https://data.unesco.org/api/explore/v2.1/`
- **Query Language**: ODSQL (OpenDataSoft's SQL-like query language, NOT standard SQL)
- **Response Format**: JSON
- **Authentication**: None required for public datasets (optional for private data)

**Available Formats**:

- JSON via the OpenDataSoft Explore API (primary extraction method)
- CSV export via the portal
- DCAT-AP metadata
- RDF/XML (via OpenDataSoft serialization)
- GeoJSON (for geographic datasets)

**Rate Limits**:

- Public API: ~100 requests/minute (OpenDataSoft default, to be verified)
- Bulk download: available via CSV/JSON export from the portal
- SPARQL: NOT available (OpenDataSoft uses ODSQL instead)

**Authentication**: None required for public data

**Legacy JSON Endpoint** (BLOCKED):

- URL: https://whc.unesco.org/en/list/json/
- Status: returns 403 Forbidden (Cloudflare protection)
- Mitigation: use the OpenDataSoft API instead

### 2. Wikidata Integration

**Purpose**: Enrich UNESCO sites with Wikidata Q-numbers

**Endpoint**: https://query.wikidata.org/sparql

**Query Pattern**:

```sparql
SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .  # UNESCO World Heritage Site ID (P757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
```
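As a minimal sketch of consuming this query, the SPARQL 1.1 JSON results can be reduced to a WHC-ID → Q-number lookup. The parsing function below assumes the standard SPARQL JSON results format; the network call itself (via `SPARQLWrapper`, listed in the dependencies) is shown commented out, and the `agent` string is a placeholder.

```python
# Sketch: reduce SPARQL 1.1 JSON results for the query above into a
# WHC-ID -> Wikidata Q-number lookup. The HTTP call (via SPARQLWrapper,
# listed under dependencies) is commented out; the agent string is a
# placeholder, not a project value.
def bindings_to_lookup(sparql_json: dict) -> dict:
    """Map each WHC ID to the trailing Q-number of its Wikidata entity URI."""
    return {
        b["whcID"]["value"]: b["item"]["value"].rsplit("/", 1)[-1]
        for b in sparql_json["results"]["bindings"]
    }

# from SPARQLWrapper import SPARQLWrapper, JSON
# sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
#                        agent="glam-extractor/0.1 (contact@example.org)")
# sparql.setQuery(QUERY_ABOVE)
# sparql.setReturnFormat(JSON)
# lookup = bindings_to_lookup(sparql.query().convert())

sample = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q123"},
     "whcID": {"value": "91"}}]}}
print(bindings_to_lookup(sample))  # {'91': 'Q123'}
```

Keeping the parsing separate from the HTTP call makes the enrichment step testable against canned fixtures, without hitting the rate-limited endpoint.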
**Usage**:

- Cross-reference UNESCO WHC IDs with Wikidata
- Enrich with additional identifiers (VIAF, GND, etc.)
- Extract alternative names in multiple languages

### 3. GeoNames

**Purpose**: Geocode locations, generate UN/LOCODE for GHCID

**Endpoint**: http://api.geonames.org/

**API Key**: Required (free tier: 20k requests/day)

**Usage**:

- Validate coordinates
- Look up UN/LOCODE for cities
- Resolve place names to GeoNames IDs

### 4. ISO Standards Data

**ISO 3166-1 alpha-2** (Country Codes):

- Source: `pycountry` library
- Purpose: Standardize country identifiers for GHCID

**ISO 3166-2** (Region Codes):

- Source: `pycountry` library
- Purpose: Region codes for GHCID generation

## Python Library Dependencies

### Core Libraries

```toml
[tool.poetry.dependencies]
python = "^3.11"

# HTTP & API clients
httpx = "^0.25.0"            # Async HTTP for UNESCO API
aiohttp = "^3.9.0"           # Alternative async HTTP
requests-cache = "^1.1.0"    # Cache API responses
# NOTE: OpenDataSoft API uses standard REST, no special client library needed
# ODSQL queries passed as URL parameters to Explore API v2

# Data processing
pandas = "^2.1.0"            # Tabular data manipulation
pydantic = "^2.4.0"          # Data validation
jsonschema = "^4.19.0"       # JSON schema validation

# LinkML ecosystem
linkml = "^1.6.0"            # Schema tools
linkml-runtime = "^1.6.0"    # Runtime validation
linkml-map = "^0.1.0"        # Mapping transformations ⚠️ CRITICAL

# RDF & Semantic Web
rdflib = "^7.0.0"            # RDF graph manipulation
sparqlwrapper = "^2.0.0"     # SPARQL queries
jsonld = "^1.8.0"            # JSON-LD processing

# Geographic data
geopy = "^2.4.0"             # Geocoding
pycountry = "^22.3.5"        # Country/region codes
shapely = "^2.0.0"           # Geometric operations (optional)

# Text processing
unidecode = "^1.3.7"         # Unicode normalization
ftfy = "^6.1.1"              # Fix text encoding
rapidfuzz = "^3.5.0"         # Fuzzy matching

# Utilities
tqdm = "^4.66.0"             # Progress bars
rich = "^13.7.0"             # Beautiful terminal output
loguru = "^0.7.0"            # Structured logging
```

### Development Dependencies

```toml
[tool.poetry.group.dev.dependencies]
# Testing
pytest = "^7.4.0"
pytest-asyncio = "^0.21.0"   # Async test support
pytest-cov = "^4.1.0"        # Coverage
pytest-mock = "^3.12.0"      # Mocking
hypothesis = "^6.92.0"       # Property-based testing
responses = "^0.24.0"        # Mock HTTP responses

# Code quality
ruff = "^0.1.0"              # Fast linter
mypy = "^1.7.0"              # Type checking
pre-commit = "^3.5.0"        # Git hooks

# Documentation
mkdocs = "^1.5.0"
mkdocs-material = "^9.5.0"
```

### OpenDataSoft API Integration

**⚠️ CRITICAL**: UNESCO uses OpenDataSoft's proprietary Explore API, NOT a standard protocol like OAI-PMH.

**ODSQL Query Language**:

- **Type**: SQL-like query language (NOT standard SQL)
- **Purpose**: Filter, aggregate, and query OpenDataSoft datasets
- **Syntax**: Similar to SQL SELECT/WHERE/GROUP BY but with custom functions
- **Documentation**: https://help.opendatasoft.com/apis/ods-explore-v2/#odsql

**Example ODSQL Query**:

```python
# Fetch UNESCO World Heritage Sites inscribed after 2000 in Europe
import requests

url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
params = {
    "select": "site_name, date_inscribed, latitude, longitude",
    "where": "date_inscribed > 2000 AND region = 'Europe'",
    "limit": 100,
    "offset": 0,
    "order_by": "date_inscribed DESC",
}
response = requests.get(url, params=params)
data = response.json()
```

**Key Differences from Standard REST APIs**:

- Queries passed as URL parameters (not POST body)
- ODSQL syntax in `where`, `select`, `group_by` parameters
- Pagination via `limit` and `offset` (not standard cursors)
- Response structure: `{"total_count": N, "results": [...]}`

**Implementation Strategy**:

- Use `requests` or `httpx` for HTTP calls (no special client library)
- Build ODSQL queries programmatically as strings
- Handle pagination with `limit`/`offset` loops
- Cache responses with `requests-cache`
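The `limit`/`offset` loop from the strategy above can be sketched as follows. The page-fetching callable is injected so the paging logic is independent of the HTTP client (an `httpx` version is shown in a comment); the endpoint URL and the `{"total_count", "results"}` response shape come from this document, and the 100-record cap follows the documented pagination limit.

```python
# Sketch: limit/offset pagination against the Explore API. fetch_page is
# injected so the loop is testable without network access; BASE_URL and
# the response shape are taken from this document.
from typing import Callable, Iterator

BASE_URL = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
PAGE_SIZE = 100  # OpenDataSoft caps `limit` at 100 per request

def iter_records(fetch_page: Callable[[int, int], dict]) -> Iterator[dict]:
    """Yield every record, advancing `offset` until `total_count` is reached."""
    offset = 0
    while True:
        page = fetch_page(PAGE_SIZE, offset)
        results = page.get("results", [])
        yield from results
        offset += len(results)
        if not results or offset >= page.get("total_count", 0):
            break

# An httpx-based fetch_page might look like:
# def fetch_page(limit: int, offset: int) -> dict:
#     r = httpx.get(BASE_URL, params={"limit": limit, "offset": offset}, timeout=10)
#     r.raise_for_status()
#     return r.json()
```

Because `iter_records` is a generator, records can be streamed straight into validation or DuckDB inserts without holding the full dataset in memory.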
### Critical: LinkML Map Extension

**⚠️ IMPORTANT**: This project requires **LinkML Map with conditional XPath/JSONPath/regex support**.

**Status**: To be implemented as a bespoke extension

**Required Features**:

1. **Conditional extraction** - Extract based on field value conditions
2. **XPath support** - Navigate XML/HTML structures from UNESCO
3. **JSONPath support** - Query complex JSON API responses
4. **Regex patterns** - Extract and validate identifier formats
5. **Multi-value handling** - Map arrays to multivalued LinkML slots

**Implementation Path**:

- Extend the `linkml-map` package, OR
- Create a custom `linkml_map_extended` module in the project

**Example Usage**:

```yaml
# LinkML Map with conditional XPath
mappings:
  - source: unesco_site
    target: HeritageCustodian
    conditions:
      - path: "$.category"
        pattern: "^(Cultural|Mixed)$"   # Only cultural/mixed sites
    transforms:
      - source_path: "$.site_name"
        target_slot: name
      - source_path: "$.date_inscribed"
        target_slot: founded_date
        transform: "parse_year"
      - source_xpath: "//description[@lang='en']/text()"
        target_slot: description
```

## Schema Dependencies

### LinkML Global GLAM Schema (v0.2.1)

**Modules Required**:

- `schemas/core.yaml` - HeritageCustodian, Location, Identifier
- `schemas/enums.yaml` - InstitutionTypeEnum, DataSource
- `schemas/provenance.yaml` - Provenance, ChangeEvent
- `schemas/collections.yaml` - Collection metadata

**New Enums to Add**:

```yaml
# schemas/enums.yaml - Add to DataSource
DataSource:
  permissible_values:
    UNESCO_WORLD_HERITAGE:
      description: UNESCO World Heritage Centre API
      meaning: heritage:UNESCODataHub
```

### Ontology Dependencies

**Required Ontologies**:

1. **CPOV** (Core Public Organisation Vocabulary)
   - Path: `/data/ontology/core-public-organisation-ap.ttl`
   - Usage: Map UNESCO sites to public organizations
2. **Schema.org**
   - Path: `/data/ontology/schemaorg.owl`
   - Usage: `schema:Museum`, `schema:TouristAttraction`, `schema:Place`
3. **CIDOC-CRM** (Cultural Heritage)
   - Path: `/data/ontology/CIDOC_CRM_v7.1.3.rdf`
   - Usage: `crm:E53_Place`, `crm:E74_Group` for heritage sites
4. **UNESCO Thesaurus**
   - URL: http://vocabularies.unesco.org/thesaurus/
   - Usage: Map UNESCO criteria to controlled vocabulary
   - Format: SKOS (RDF)

**To Download**:

```bash
# UNESCO Thesaurus
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Verify CIDOC-CRM is present
ls data/ontology/CIDOC_CRM_v7.1.3.rdf
```

## External Service Dependencies

### 1. UNESCO OpenDataSoft API Service

**Platform**: OpenDataSoft Explore API v2.0/v2.1

**Primary Endpoint**: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records

**Legacy Endpoint** (BLOCKED):

- `https://whc.unesco.org/en/list/json/`
- Status: returns 403 Forbidden (Cloudflare protection)
- Action: do NOT use - use the OpenDataSoft API instead

**Availability**: 99.5% uptime (estimated, OpenDataSoft SLA)

**API Response Structure**:

```json
{
  "total_count": 1199,
  "results": [
    {
      "site_name": "Historic Centre of Rome",
      "date_inscribed": 1980,
      "category": "Cultural",
      "latitude": 41.9,
      "longitude": 12.5,
      "criteria_txt": "(i)(ii)(iii)(iv)(vi)",
      "states_name_en": "Italy"
    }
  ]
}
```

(Each record carries additional fields omitted here.)

**Pagination**:

- Use `limit` (max 100 per request) and `offset` parameters
- Total records available in the `total_count` field
- No cursor-based pagination

**Fallback Strategy**:

- Cache API responses locally (requests-cache with 24-hour TTL)
- Implement exponential backoff on 5xx errors
- Fall back to CSV export download if the API is unavailable for >1 hour
- CSV export URL: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv`

**Monitoring**:

- Log all API response times
- Alert if >5% of requests fail or response time >2s
- Track `total_count` for dataset size changes
- Monitor for schema changes via response validation
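The backoff-then-fallback policy above can be sketched as below. The fetch and sleep callables are injected for testability, and `ServerError` is a stand-in for an HTTP 5xx response (a real client would inspect `response.status_code` instead); the CSV export URL is the one documented in the fallback strategy.

```python
# Sketch: exponential backoff on transient (5xx-style) failures, with the
# CSV export as the documented last resort. `fetch` and `sleep` are
# injected; ServerError is an assumed stand-in for "HTTP 5xx".
import random
import time
from typing import Callable

CSV_EXPORT_URL = ("https://data.unesco.org/api/explore/v2.0/"
                  "catalog/datasets/whc001/exports/csv")

class ServerError(Exception):
    """Stand-in for an HTTP 5xx response."""

def fetch_with_backoff(fetch: Callable[[], dict], retries: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `fetch` with 1s, 2s, 4s, ... delays plus jitter; re-raise when exhausted."""
    for attempt in range(retries):
        try:
            return fetch()
        except ServerError:
            if attempt == retries - 1:
                raise  # caller then falls back to downloading CSV_EXPORT_URL
            # Exponential delay with jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Injecting `sleep` keeps the retry schedule assertable in unit tests, matching the mocked-HTTP testing approach (`responses`, `pytest-mock`) listed in the development dependencies.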
### 2. Wikidata SPARQL Endpoint

**Endpoint**: https://query.wikidata.org/sparql

**Rate Limits**:

- 60 requests/minute (public)
- User-Agent header required

**Fallback Strategy**:

- Cache SPARQL results (24-hour TTL)
- Batch queries to minimize requests
- Skip enrichment if the endpoint is unavailable

### 3. GeoNames API

**Endpoint**: http://api.geonames.org/

**Rate Limits**:

- Free tier: 20,000 requests/day
- Premium: 200,000 requests/day

**API Key Storage**: Environment variable `GEONAMES_API_KEY`

**Fallback Strategy**:

- Use UNESCO-provided coordinates as primary
- GeoNames lookup only for UN/LOCODE generation
- Cache lookups indefinitely (place names don't change often)

## File System Dependencies

### Required Directories

```
glam-extractor/
├── data/
│   ├── unesco/
│   │   ├── raw/                 # Raw API responses
│   │   ├── cache/               # HTTP cache
│   │   ├── instances/           # LinkML instances
│   │   └── mappings/            # LinkML Map schemas
│   ├── ontology/
│   │   ├── unesco_thesaurus.ttl # UNESCO vocabulary
│   │   └── [existing ontologies]
│   └── reference/
│       └── unesco_criteria.yaml # Criteria definitions
├── src/glam_extractor/
│   ├── extractors/
│   │   └── unesco.py            # UNESCO extractor
│   ├── mappers/
│   │   └── unesco_mapper.py     # LinkML Map integration
│   └── validators/
│       └── unesco_validator.py  # UNESCO-specific validation
└── tests/
    └── unesco/
        ├── fixtures/            # Test data
        └── test_unesco_*.py     # Test modules
```

## Environment Variables

```bash
# .env file
GEONAMES_API_KEY=your_key_here
UNESCO_API_CACHE_TTL=86400   # 24 hours
WIKIDATA_RATE_LIMIT=60       # requests/minute
LOG_LEVEL=INFO
```

## Database Dependencies

### DuckDB (Intermediate Storage)

**Purpose**: Store extracted UNESCO data during processing

**Schema**:

```sql
CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    country TEXT,
    category TEXT,           -- Cultural, Natural, Mixed
    date_inscribed INTEGER,
    latitude DOUBLE,
    longitude DOUBLE,
    raw_json TEXT,           -- Full API response
    extracted_at TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE
unesco_criteria (
    whc_id INTEGER,
    criterion_code TEXT,     -- (i), (ii), ..., (x)
    criterion_text TEXT,
    FOREIGN KEY (whc_id) REFERENCES unesco_sites(whc_id)
);
```

### SQLite (Cache)

**Purpose**: HTTP response caching (via requests-cache)

**Auto-managed**: The library handles the schema

## Network Dependencies

### Required Network Access

- `data.unesco.org` (port 443) - UNESCO OpenDataSoft API (primary)
- `whc.unesco.org` (port 443) - UNESCO WHC website (legacy, blocked by Cloudflare)
- `query.wikidata.org` (port 443) - Wikidata SPARQL
- `api.geonames.org` (port 80/443) - GeoNames API
- `vocabularies.unesco.org` (port 443) - UNESCO Thesaurus

**Note**: The legacy `whc.unesco.org/en/list/json/` endpoint is protected by Cloudflare and returns 403. Use the `data.unesco.org` OpenDataSoft API instead.

### Offline Mode

**Supported**: Yes, with cached data

**Requirements**:

- Pre-downloaded UNESCO dataset snapshot
- Cached Wikidata SPARQL results
- GeoNames offline database (optional)

**Activate**:

```bash
export GLAM_OFFLINE_MODE=true
```

## Version Constraints

### Python Version

- **Minimum**: Python 3.11
- **Recommended**: Python 3.11 or 3.12
- **Reason**: Modern type hints, performance improvements

### LinkML Version

- **Minimum**: linkml 1.6.0
- **Reason**: Required for latest schema features

### API Versions

- **UNESCO OpenDataSoft API**: Explore API v2.0 and v2.1 (both supported)
- **Wikidata SPARQL**: SPARQL 1.1 standard
- **GeoNames API**: v3.0

**Breaking Change Detection**:

- Monitor UNESCO API responses for schema changes
- Test suite includes response validation against the expected structure
- Alert if field names or types change unexpectedly

## Dependency Installation

### Full Installation

```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset

# Install with Poetry (includes UNESCO dependencies)
poetry install

# Download UNESCO Thesaurus
mkdir -p data/ontology
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl

# Set up environment
cp .env.example .env
# Edit .env with your GeoNames API key
```

### Minimal Installation (Testing Only)

```bash
poetry install --without dev
# Use cached fixtures, no API access
```

## Dependency Risk Assessment

### High Risk

1. **LinkML Map Extension** ⚠️
   - **Risk**: Custom extension not yet implemented
   - **Mitigation**: Prioritize in Phase 1, create a fallback pure-Python mapper
   - **Timeline**: Weeks 1-2
2. **UNESCO OpenDataSoft API Stability**
   - **Risk**: Schema changes in the OpenDataSoft platform or dataset fields
   - **Mitigation**:
     - Version API responses in cache filenames
     - Automated schema validation tests
     - Monitor `total_count` for unexpected changes
   - **Detection**: Weekly automated tests, compare field names/types
   - **Fallback**: CSV export download if breaking API changes are detected
3. **ODSQL Query Language Changes**
   - **Risk**: OpenDataSoft updates ODSQL syntax in breaking ways
   - **Mitigation**:
     - Pin to Explore API v2.0 (stable version)
     - Test queries against the live API weekly
     - Document the ODSQL syntax version in code comments
   - **Impact**: Query failures, need to rewrite ODSQL expressions

### Medium Risk

1. **GeoNames Rate Limits**
   - **Risk**: Exceeding the free tier (20k/day)
   - **Mitigation**: Aggressive caching, batch processing
   - **Fallback**: Manual UN/LOCODE lookup table
2. **Wikidata Endpoint Load**
   - **Risk**: SPARQL endpoint slow or unavailable
   - **Mitigation**: Timeout handling, skip enrichment on failure
   - **Impact**: Reduced data quality, no blocking errors

### Low Risk
1. **Python Library Updates**
   - **Risk**: Breaking changes in dependencies
   - **Mitigation**: Pin versions in `poetry.lock`
   - **Testing**: CI/CD catches incompatibilities

## Monitoring Dependencies

### Health Checks

```python
# tests/test_dependencies.py
import os

import httpx


def test_unesco_opendatasoft_api_available():
    """Verify the UNESCO OpenDataSoft API is accessible."""
    url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
    params = {"limit": 1}
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    data = response.json()
    assert "total_count" in data
    assert "results" in data
    assert data["total_count"] > 1000  # Expect 1000+ UNESCO sites


def test_unesco_legacy_api_blocked():
    """Verify the legacy JSON endpoint is indeed blocked."""
    url = "https://whc.unesco.org/en/list/json/"
    response = httpx.get(url, timeout=10, follow_redirects=True)
    # Expect 403 Forbidden due to Cloudflare protection
    assert response.status_code == 403


def test_wikidata_sparql_available():
    """Verify the Wikidata SPARQL endpoint."""
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    assert len(results["results"]["bindings"]) > 0


def test_geonames_api_key():
    """Verify the GeoNames API key is valid."""
    api_key = os.getenv("GEONAMES_API_KEY")
    assert api_key, "GEONAMES_API_KEY not set in environment"
    url = "http://api.geonames.org/getJSON"
    # GeoNames passes the account name as the `username` parameter
    params = {"geonameId": 6295630, "username": api_key}  # Test with Earth
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    data = response.json()
    assert "geonameId" in data
```

### Dependency Update Schedule

- **Weekly**: Security patches (automated via Dependabot)
- **Monthly**: Minor version updates (manual review)
- **Quarterly**: Major version updates (full regression testing)

## Related Documentation
- **Implementation Phases**: `docs/plan/unesco/03-implementation-phases.md`
- **LinkML Map Schema**: `docs/plan/unesco/06-linkml-map-schema.md`
- **TDD Strategy**: `docs/plan/unesco/04-tdd-strategy.md`

---

**Version**: 1.1
**Date**: 2025-11-09
**Status**: Updated - OpenDataSoft API architecture documented

**Changelog**:

- v1.1 (2025-11-09):
  - Updated to reflect OpenDataSoft Explore API v2 (not OAI-PMH)
  - Added ODSQL query language documentation
  - Updated API endpoints and response structures
  - Added health checks for the OpenDataSoft API
  - Documented legacy endpoint blocking (Cloudflare 403)
- v1.0 (2025-11-09): Initial version