# UNESCO Data Extraction: Dependencies
## Overview
This document catalogs all dependencies for extracting UNESCO World Heritage Site data and integrating it with the Global GLAM Dataset.
## External Data Dependencies
### 1. UNESCO DataHub
**Primary Source**:
- **URL**: https://data.unesco.org/
- **Platform**: OpenDataSoft (commercial open data platform)
- **World Heritage Dataset**: https://data.unesco.org/explore/dataset/whc001/
- **API Documentation**: https://help.opendatasoft.com/apis/ods-explore-v2/
**⚠️ CRITICAL FINDING**: UNESCO DataHub uses **OpenDataSoft's proprietary Explore API v2** (NOT OAI-PMH or standard REST)
**API Architecture**:
- **API Version**: Explore API v2.0 and v2.1
- **Base Endpoint**: `https://data.unesco.org/api/explore/v2.0/` and `https://data.unesco.org/api/explore/v2.1/`
- **Query Language**: ODSQL (OpenDataSoft SQL-like query language, NOT standard SQL)
- **Response Format**: JSON
- **Authentication**: None required for public datasets (optional for private data)
**Available Formats**:
- JSON API via OpenDataSoft Explore API (primary extraction method)
- CSV export via portal
- DCAT-AP metadata
- RDF/XML (via OpenDataSoft serialization)
- GeoJSON (for geographic datasets)
**Rate Limits**:
- Public API: ~100 requests/minute (OpenDataSoft default, to be verified)
- Bulk download: Available via CSV/JSON export from portal
- SPARQL: NOT available (OpenDataSoft uses ODSQL instead)
**Authentication**: None required for public data
**Legacy JSON Endpoint** (BLOCKED):
- URL: https://whc.unesco.org/en/list/json/
- Status: Returns 403 Forbidden (Cloudflare protection)
- Mitigation: Use OpenDataSoft API instead
### 2. Wikidata Integration
**Purpose**: Enrich UNESCO sites with Wikidata Q-numbers
**Endpoint**: https://query.wikidata.org/sparql
**Query Pattern**:
```sparql
SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .  # UNESCO World Heritage Site ID (P757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
```
**Usage**:
- Cross-reference UNESCO WHC IDs with Wikidata
- Enrich with additional identifiers (VIAF, GND, etc.)
- Extract alternative names in multiple languages
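The cross-referencing step above can be sketched as a small parser over the SPARQL 1.1 JSON results format. The query string matches the pattern documented earlier; `parse_bindings` and the mocked sample response are illustrative, not project code:

```python
# Cross-reference UNESCO WHC IDs with Wikidata Q-numbers (hypothetical sketch).

WHC_QUERY = """
SELECT ?item ?itemLabel ?whcID WHERE {
  ?item wdt:P757 ?whcID .  # UNESCO World Heritage Site ID (P757)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""

def parse_bindings(response_json: dict) -> dict:
    """Map WHC ID -> Wikidata Q-number from a SPARQL 1.1 JSON results payload."""
    mapping = {}
    for b in response_json["results"]["bindings"]:
        qid = b["item"]["value"].rsplit("/", 1)[-1]  # e.g. Q220 from the entity URI
        mapping[b["whcID"]["value"]] = qid
    return mapping

# Minimal mocked response, structured per the SPARQL 1.1 JSON results format:
sample = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q220"},
     "whcID": {"value": "91"}}]}}
print(parse_bindings(sample))  # {'91': 'Q220'}
```

In production the query would be sent via `SPARQLWrapper` (see the health checks below for the call pattern) and the JSON response fed to the same parser.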
### 3. GeoNames
**Purpose**: Geocode locations, generate UN/LOCODE for GHCID
**Endpoint**: http://api.geonames.org/
**API Key**: Required; GeoNames uses a registered account username (passed as the `username` parameter) as its key (free tier: 20,000 requests/day)
**Usage**:
- Validate coordinates
- Look up UN/LOCODE for cities
- Resolve place names to GeoNames IDs
### 4. ISO Standards Data
**ISO 3166-1 alpha-2** (Country Codes):
- Source: `pycountry` library
- Purpose: Standardize country identifiers for GHCID
**ISO 3166-2** (Region Codes):
- Source: `pycountry` library
- Purpose: Region codes for GHCID generation
## Python Library Dependencies
### Core Libraries
```toml
[tool.poetry.dependencies]
python = "^3.11"
# HTTP & API clients
httpx = "^0.25.0" # Async HTTP for UNESCO API
aiohttp = "^3.9.0" # Alternative async HTTP
requests-cache = "^1.1.0" # Cache API responses
# NOTE: OpenDataSoft API uses standard REST, no special client library needed
# ODSQL queries passed as URL parameters to Explore API v2
# Data processing
pandas = "^2.1.0" # Tabular data manipulation
pydantic = "^2.4.0" # Data validation
jsonschema = "^4.19.0" # JSON schema validation
# LinkML ecosystem
linkml = "^1.6.0" # Schema tools
linkml-runtime = "^1.6.0" # Runtime validation
linkml-map = "^0.1.0" # Mapping transformations ⚠️ CRITICAL
# RDF & Semantic Web
rdflib = "^7.0.0" # RDF graph manipulation
sparqlwrapper = "^2.0.0" # SPARQL queries
pyld = "^2.0.0" # JSON-LD processing (PyPI package is pyld, not jsonld)
# Geographic data
geopy = "^2.4.0" # Geocoding
pycountry = "^22.3.5" # Country/region codes
shapely = "^2.0.0" # Geometric operations (optional)
# Text processing
unidecode = "^1.3.7" # Unicode normalization
ftfy = "^6.1.1" # Fix text encoding
rapidfuzz = "^3.5.0" # Fuzzy matching
# Utilities
tqdm = "^4.66.0" # Progress bars
rich = "^13.7.0" # Beautiful terminal output
loguru = "^0.7.0" # Structured logging
```
### Development Dependencies
```toml
[tool.poetry.group.dev.dependencies]
# Testing
pytest = "^7.4.0"
pytest-asyncio = "^0.21.0" # Async test support
pytest-cov = "^4.1.0" # Coverage
pytest-mock = "^3.12.0" # Mocking
hypothesis = "^6.92.0" # Property-based testing
responses = "^0.24.0" # Mock HTTP responses
# Code quality
ruff = "^0.1.0" # Fast linter
mypy = "^1.7.0" # Type checking
pre-commit = "^3.5.0" # Git hooks
# Documentation
mkdocs = "^1.5.0"
mkdocs-material = "^9.5.0"
```
### OpenDataSoft API Integration
**⚠️ CRITICAL**: UNESCO uses OpenDataSoft's proprietary Explore API, NOT a standard protocol like OAI-PMH.
**ODSQL Query Language**:
- **Type**: SQL-like query language (NOT standard SQL)
- **Purpose**: Filter, aggregate, and query OpenDataSoft datasets
- **Syntax**: Similar to SQL SELECT/WHERE/GROUP BY but with custom functions
- **Documentation**: https://help.opendatasoft.com/apis/ods-explore-v2/#odsql
**Example ODSQL Query**:
```python
# Fetch UNESCO World Heritage Sites inscribed after 2000 in Europe
import requests
url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
params = {
    "select": "site_name, date_inscribed, latitude, longitude",
    "where": "date_inscribed > 2000 AND region = 'Europe'",
    "limit": 100,
    "offset": 0,
    "order_by": "date_inscribed DESC",
}
response = requests.get(url, params=params)
data = response.json()
```
**Key Differences from Standard REST APIs**:
- Queries passed as URL parameters (not POST body)
- ODSQL syntax in `where`, `select`, `group_by` parameters
- Pagination via `limit` and `offset` (not standard cursors)
- Response structure: `{"total_count": N, "results": [...]}`
**Implementation Strategy**:
- Use `requests` or `httpx` for HTTP calls (no special client library)
- Build ODSQL queries programmatically as strings
- Handle pagination with `limit`/`offset` loops
- Cache responses with `requests-cache`
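The `limit`/`offset` loop from the strategy above can be sketched as a generator. The `fetch_page` callable is injected so the pagination logic can be exercised against a stub instead of the live API; all names here are illustrative:

```python
# Sketch of limit/offset pagination over an Explore-API-style endpoint.
from typing import Callable, Iterator

def iter_records(fetch_page: Callable[[int, int], dict],
                 page_size: int = 100) -> Iterator[dict]:
    """Yield all records by walking limit/offset pages until total_count is reached."""
    offset = 0
    while True:
        # fetch_page returns {"total_count": N, "results": [...]} per the API docs
        page = fetch_page(page_size, offset)
        yield from page["results"]
        offset += len(page["results"])
        if offset >= page["total_count"] or not page["results"]:
            break

# Usage with a fake in-memory dataset split across pages of 2:
DATA = [{"whc_id": i} for i in range(5)]
fake = lambda limit, offset: {"total_count": len(DATA),
                              "results": DATA[offset:offset + limit]}
records = list(iter_records(fake, page_size=2))
print(len(records))  # 5
```

With the live API, `fetch_page` would wrap an `httpx.get` call passing `limit` and `offset` as query parameters, capped at the documented maximum of 100 per request.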
### Critical: LinkML Map Extension
**⚠️ IMPORTANT**: This project requires **LinkML Map with conditional XPath/JSONPath/regex support**.
**Status**: To be implemented as bespoke extension
**Required Features**:
1. **Conditional extraction** - Extract based on field value conditions
2. **XPath support** - Navigate XML/HTML structures from UNESCO
3. **JSONPath support** - Query complex JSON API responses
4. **Regex patterns** - Extract and validate identifier formats
5. **Multi-value handling** - Map arrays to multivalued LinkML slots
**Implementation Path**:
- Extend `linkml-map` package OR
- Create custom `linkml_map_extended` module in project
**Example Usage**:
```yaml
# LinkML Map with conditional XPath
mappings:
  - source: unesco_site
    target: HeritageCustodian
    conditions:
      - path: "$.category"
        pattern: "^(Cultural|Mixed)$"  # Only cultural/mixed sites
    transforms:
      - source_path: "$.site_name"
        target_slot: name
      - source_path: "$.date_inscribed"
        target_slot: founded_date
        transform: "parse_year"
      - source_xpath: "//description[@lang='en']/text()"
        target_slot: description
```
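Until the extension exists, the conditional mapping above can be approximated by the fallback pure-Python mapper mentioned in the risk assessment. This is a minimal sketch supporting only regex conditions and dotted JSONPath-style paths (no XPath); `get_path` and `apply_mapping` are illustrative names:

```python
# Fallback pure-Python mapper sketch: regex conditions + dotted-path extraction.
import re

def get_path(obj: dict, path: str):
    """Resolve a JSONPath-like dotted path such as '$.site_name'."""
    cur = obj
    for part in path.lstrip("$.").split("."):
        cur = cur[part]
    return cur

def apply_mapping(record: dict, mapping: dict):
    """Return mapped slots, or None if any condition fails."""
    for cond in mapping.get("conditions", []):
        if not re.match(cond["pattern"], str(get_path(record, cond["path"]))):
            return None  # condition failed: skip this record
    return {t["target_slot"]: get_path(record, t["source_path"])
            for t in mapping["transforms"]}

mapping = {"conditions": [{"path": "$.category", "pattern": r"^(Cultural|Mixed)$"}],
           "transforms": [{"source_path": "$.site_name", "target_slot": "name"}]}
site = {"category": "Cultural", "site_name": "Historic Centre of Rome"}
print(apply_mapping(site, mapping))  # {'name': 'Historic Centre of Rome'}
```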
## Schema Dependencies
### LinkML Global GLAM Schema (v0.2.1)
**Modules Required**:
- `schemas/core.yaml` - HeritageCustodian, Location, Identifier
- `schemas/enums.yaml` - InstitutionTypeEnum, DataSource
- `schemas/provenance.yaml` - Provenance, ChangeEvent
- `schemas/collections.yaml` - Collection metadata
**New Enums to Add**:
```yaml
# schemas/enums.yaml - Add to DataSource
DataSource:
  permissible_values:
    UNESCO_WORLD_HERITAGE:
      description: UNESCO World Heritage Centre API
      meaning: heritage:UNESCODataHub
```
### Ontology Dependencies
**Required Ontologies**:
1. **CPOV** (Core Public Organisation Vocabulary)
- Path: `/data/ontology/core-public-organisation-ap.ttl`
- Usage: Map UNESCO sites to public organizations
2. **Schema.org**
- Path: `/data/ontology/schemaorg.owl`
- Usage: `schema:Museum`, `schema:TouristAttraction`, `schema:Place`
3. **CIDOC-CRM** (Cultural Heritage)
- Path: `/data/ontology/CIDOC_CRM_v7.1.3.rdf`
- Usage: `crm:E53_Place`, `crm:E74_Group` for heritage sites
4. **UNESCO Thesaurus**
- URL: http://vocabularies.unesco.org/thesaurus/
- Usage: Map UNESCO criteria to controlled vocabulary
- Format: SKOS (RDF)
**To Download**:
```bash
# UNESCO Thesaurus
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl
# Verify CIDOC-CRM is present
ls data/ontology/CIDOC_CRM_v7.1.3.rdf
```
## External Service Dependencies
### 1. UNESCO OpenDataSoft API Service
**Platform**: OpenDataSoft Explore API v2.0/v2.1
**Primary Endpoint**: https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records
**Legacy Endpoint** (BLOCKED):
- `https://whc.unesco.org/en/list/json/`
- Status: Returns 403 Forbidden (Cloudflare protection)
- Action: Do NOT use - use OpenDataSoft API instead
**Availability**: 99.5% uptime (estimated, OpenDataSoft SLA)
**API Response Structure**:
```json
{
  "total_count": 1199,
  "results": [
    {
      "site_name": "Historic Centre of Rome",
      "date_inscribed": 1980,
      "category": "Cultural",
      "latitude": 41.9,
      "longitude": 12.5,
      "criteria_txt": "(i)(ii)(iii)(iv)(vi)",
      "states_name_en": "Italy"
      // ... additional fields
    }
  ]
}
```
**Pagination**:
- Use `limit` (max 100 per request) and `offset` parameters
- Total records available in `total_count` field
- No cursor-based pagination
**Fallback Strategy**:
- Cache API responses locally (requests-cache with 24-hour TTL)
- Implement exponential backoff on 5xx errors
- Fallback to CSV export download if API unavailable for >1 hour
- CSV export URL: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv`
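The exponential-backoff step in the fallback strategy can be sketched as a small wrapper. The retry counts and delays are illustrative defaults, and the HTTP call and sleep function are injected so the logic can be exercised with stubs:

```python
# Sketch of exponential backoff on 5xx responses.
import time

def get_with_backoff(get, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call get(url); retry on 5xx status with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code < 500:
            return response
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s ...
    return response  # exhausted retries; caller falls back to CSV export

# Usage with a stub that fails twice, then succeeds:
class FakeResponse:
    def __init__(self, code):
        self.status_code = code

calls = iter([FakeResponse(503), FakeResponse(502), FakeResponse(200)])
resp = get_with_backoff(lambda url: next(calls),
                        "https://data.unesco.org/api/explore/v2.0/...",
                        sleep=lambda s: None)
print(resp.status_code)  # 200
```

With `httpx`, the real `get` argument would be `lambda url: httpx.get(url, timeout=10)`.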
**Monitoring**:
- Log all API response times
- Alert if >5% requests fail or response time >2s
- Track `total_count` for dataset size changes
- Monitor for schema changes via response validation
### 2. Wikidata SPARQL Endpoint
**Endpoint**: https://query.wikidata.org/sparql
**Rate Limits**:
- 60 requests/minute (public)
- User-Agent header required
**Fallback Strategy**:
- Cache SPARQL results (24-hour TTL)
- Batch queries to minimize requests
- Skip enrichment if endpoint unavailable
### 3. GeoNames API
**Endpoint**: http://api.geonames.org/
**Rate Limits**:
- Free tier: 20,000 requests/day
- Premium: 200,000 requests/day
**API Key Storage**: Environment variable `GEONAMES_API_KEY`
**Fallback Strategy**:
- Use UNESCO-provided coordinates as primary
- GeoNames lookup only for UN/LOCODE generation
- Cache lookups indefinitely (place names don't change often)
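The "cache indefinitely" strategy can be sketched with `functools.lru_cache` wrapped around the lookup function. The GeoNames call itself is stubbed here (the returned payload is a placeholder, not a real API response); in the pipeline the cache would additionally be persisted to disk:

```python
# Sketch of indefinite in-process caching of GeoNames lookups.
from functools import lru_cache

calls = {"n": 0}  # counts lookups that actually hit the (stubbed) backend

@lru_cache(maxsize=None)
def geonames_lookup(place: str) -> dict:
    calls["n"] += 1
    # Stand-in for an HTTP call to api.geonames.org; payload is illustrative.
    return {"name": place, "geonameId": 3169070}

geonames_lookup("Rome")
geonames_lookup("Rome")  # second call served from cache, no backend hit
print(calls["n"])  # 1
```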
## File System Dependencies
### Required Directories
```
glam-extractor/
├── data/
│   ├── unesco/
│   │   ├── raw/                     # Raw API responses
│   │   ├── cache/                   # HTTP cache
│   │   ├── instances/               # LinkML instances
│   │   └── mappings/                # LinkML Map schemas
│   ├── ontology/
│   │   ├── unesco_thesaurus.ttl     # UNESCO vocabulary
│   │   └── [existing ontologies]
│   └── reference/
│       └── unesco_criteria.yaml     # Criteria definitions
├── src/glam_extractor/
│   ├── extractors/
│   │   └── unesco.py                # UNESCO extractor
│   ├── mappers/
│   │   └── unesco_mapper.py         # LinkML Map integration
│   └── validators/
│       └── unesco_validator.py      # UNESCO-specific validation
└── tests/
    └── unesco/
        ├── fixtures/                # Test data
        └── test_unesco_*.py         # Test modules
```
## Environment Variables
```bash
# .env file
GEONAMES_API_KEY=your_key_here
UNESCO_API_CACHE_TTL=86400 # 24 hours
WIKIDATA_RATE_LIMIT=60 # requests/minute
LOG_LEVEL=INFO
```
## Database Dependencies
### DuckDB (Intermediate Storage)
**Purpose**: Store extracted UNESCO data during processing
**Schema**:
```sql
CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    country TEXT,
    category TEXT,               -- Cultural, Natural, Mixed
    date_inscribed INTEGER,
    latitude DOUBLE,
    longitude DOUBLE,
    raw_json TEXT,               -- Full API response
    extracted_at TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

CREATE TABLE unesco_criteria (
    whc_id INTEGER,
    criterion_code TEXT,         -- (i), (ii), ..., (x)
    criterion_text TEXT,
    FOREIGN KEY (whc_id) REFERENCES unesco_sites(whc_id)
);
```
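Loading an API record into the staging table can be sketched as follows. The snippet uses the stdlib `sqlite3` module as a runnable stand-in, since DuckDB's Python API follows the same DB-API pattern; in the pipeline the connection would come from `duckdb.connect()`. The `id_no` field name in the sample record is an assumption about the API payload:

```python
# Sketch: stage one API record into unesco_sites (sqlite3 standing in for duckdb).
import datetime
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE unesco_sites (
    whc_id INTEGER PRIMARY KEY, site_name TEXT NOT NULL, country TEXT,
    category TEXT, date_inscribed INTEGER, latitude DOUBLE, longitude DOUBLE,
    raw_json TEXT, extracted_at TIMESTAMP, processed BOOLEAN DEFAULT FALSE)""")

record = {"id_no": 91, "site_name": "Historic Centre of Rome",  # id_no is assumed
          "states_name_en": "Italy", "category": "Cultural",
          "date_inscribed": 1980, "latitude": 41.9, "longitude": 12.5}

conn.execute(
    "INSERT INTO unesco_sites VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, FALSE)",
    (record["id_no"], record["site_name"], record["states_name_en"],
     record["category"], record["date_inscribed"], record["latitude"],
     record["longitude"], json.dumps(record),  # keep full payload in raw_json
     datetime.datetime.now(datetime.timezone.utc).isoformat()))

row = conn.execute("SELECT site_name, country FROM unesco_sites").fetchone()
print(row)  # ('Historic Centre of Rome', 'Italy')
```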
### SQLite (Cache)
**Purpose**: HTTP response caching (via requests-cache)
**Auto-managed**: Library handles schema
## Network Dependencies
### Required Network Access
- `data.unesco.org` (port 443) - UNESCO OpenDataSoft API (primary)
- `whc.unesco.org` (port 443) - UNESCO WHC website (legacy, blocked by Cloudflare)
- `query.wikidata.org` (port 443) - Wikidata SPARQL
- `api.geonames.org` (port 80/443) - GeoNames API
- `vocabularies.unesco.org` (port 443) - UNESCO Thesaurus
**Note**: The legacy `whc.unesco.org/en/list/json/` endpoint is protected by Cloudflare and returns 403. Use `data.unesco.org` OpenDataSoft API instead.
### Offline Mode
**Supported**: Yes, with cached data
**Requirements**:
- Pre-downloaded UNESCO dataset snapshot
- Cached Wikidata SPARQL results
- GeoNames offline database (optional)
**Activate**:
```bash
export GLAM_OFFLINE_MODE=true
```
## Version Constraints
### Python Version
- **Minimum**: Python 3.11
- **Recommended**: Python 3.11 or 3.12
- **Reason**: Modern type hints, performance improvements
### LinkML Version
- **Minimum**: linkml 1.6.0
- **Reason**: Required for latest schema features
### API Versions
- **UNESCO OpenDataSoft API**: Explore API v2.0 and v2.1 (both supported)
- **Wikidata SPARQL**: SPARQL 1.1 standard
- **GeoNames API**: v3.0
**Breaking Change Detection**:
- Monitor UNESCO API responses for schema changes
- Test suite includes response validation against expected structure
- Alert if field names or types change unexpectedly
## Dependency Installation
### Full Installation
```bash
# Clone repository
git clone https://github.com/user/glam-dataset.git
cd glam-dataset
# Install with Poetry (includes UNESCO dependencies)
poetry install
# Download UNESCO Thesaurus
mkdir -p data/ontology
curl -o data/ontology/unesco_thesaurus.ttl \
  http://vocabularies.unesco.org/thesaurus/concept.ttl
# Set up environment
cp .env.example .env
# Edit .env with your GeoNames API key
```
### Minimal Installation (Runtime Only)
```bash
# Runtime dependencies only; excludes the dev group (pytest, linters)
poetry install --without dev
# Runs against cached fixtures, no live API access required
```
## Dependency Risk Assessment
### High Risk
1. **LinkML Map Extension** ⚠️
- **Risk**: Custom extension not yet implemented
- **Mitigation**: Prioritize in Phase 1, create fallback pure-Python mapper
- **Timeline**: Week 1-2
2. **UNESCO OpenDataSoft API Stability**
- **Risk**: Schema changes in OpenDataSoft platform or dataset fields
- **Mitigation**:
- Version API responses in cache filenames
- Automated schema validation tests
- Monitor `total_count` for unexpected changes
- **Detection**: Weekly automated tests, compare field names/types
- **Fallback**: CSV export download if API breaking changes detected
3. **ODSQL Query Language Changes**
- **Risk**: OpenDataSoft updates ODSQL syntax in breaking ways
- **Mitigation**:
- Pin to Explore API v2.0 (stable version)
- Test queries against live API weekly
- Document ODSQL syntax version in code comments
- **Impact**: Query failures, need to rewrite ODSQL expressions
### Medium Risk
1. **GeoNames Rate Limits**
- **Risk**: Exceed free tier (20k/day)
- **Mitigation**: Aggressive caching, batch processing
- **Fallback**: Manual UN/LOCODE lookup table
2. **Wikidata Endpoint Load**
- **Risk**: SPARQL endpoint slow or unavailable
- **Mitigation**: Timeout handling, skip enrichment on failure
- **Impact**: Reduced data quality, no blocking errors
### Low Risk
1. **Python Library Updates**
- **Risk**: Breaking changes in dependencies
- **Mitigation**: Pin versions in poetry.lock
- **Testing**: CI/CD catches incompatibilities
## Monitoring Dependencies
### Health Checks
```python
# tests/test_dependencies.py
import os

import httpx


def test_unesco_opendatasoft_api_available():
    """Verify UNESCO OpenDataSoft API is accessible"""
    url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records"
    params = {"limit": 1}
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    data = response.json()
    assert "total_count" in data
    assert "results" in data
    assert data["total_count"] > 1000  # Expect 1000+ UNESCO sites


def test_unesco_legacy_api_blocked():
    """Verify the legacy JSON endpoint is indeed blocked"""
    url = "https://whc.unesco.org/en/list/json/"
    response = httpx.get(url, timeout=10, follow_redirects=True)
    # Expect 403 Forbidden due to Cloudflare protection
    assert response.status_code == 403


def test_wikidata_sparql_available():
    """Verify Wikidata SPARQL endpoint"""
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    assert len(results["results"]["bindings"]) > 0


def test_geonames_api_key():
    """Verify the GeoNames credential is valid"""
    api_key = os.getenv("GEONAMES_API_KEY")
    assert api_key, "GEONAMES_API_KEY not set in environment"
    url = "http://api.geonames.org/getJSON"
    params = {"geonameId": 6295630, "username": api_key}  # 6295630 = Earth
    response = httpx.get(url, params=params, timeout=10)
    assert response.status_code == 200
    data = response.json()
    assert "geonameId" in data
```
### Dependency Update Schedule
- **Weekly**: Security patches (automated via Dependabot)
- **Monthly**: Minor version updates (manual review)
- **Quarterly**: Major version updates (full regression testing)
## Related Documentation
- **Implementation Phases**: `docs/plan/unesco/03-implementation-phases.md`
- **LinkML Map Schema**: `docs/plan/unesco/06-linkml-map-schema.md`
- **TDD Strategy**: `docs/plan/unesco/04-tdd-strategy.md`
---
**Version**: 1.1
**Date**: 2025-11-09
**Status**: Updated - OpenDataSoft API architecture documented
**Changelog**:
- v1.1 (2025-11-09): Updated to reflect OpenDataSoft Explore API v2 (not OAI-PMH)
- Added ODSQL query language documentation
- Updated API endpoints and response structures
- Added health checks for OpenDataSoft API
- Documented legacy endpoint blocking (Cloudflare 403)
- v1.0 (2025-11-09): Initial version