# UNESCO Data Extraction - LinkML Map Schema

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction

**Document**: 06 - LinkML Map Transformation Rules

**Version**: 1.0

**Date**: 2025-11-09

**Status**: Draft

---
## Executive Summary

This document specifies the **LinkML Map transformation schema** for converting UNESCO World Heritage Site data from the UNESCO DataHub API into LinkML-compliant `HeritageCustodian` instances. The mapping handles complex data structures, multi-language fields, conditional institution type classification, and identifier extraction.

**Key Innovation**: Extends LinkML Map with conditional extraction, regex validators, and multi-language field handling to manage UNESCO's rich semantic data.

**Schema Location**: `schemas/maps/unesco_to_heritage_custodian.yaml`

---
## UNESCO API Data Structure

### Sample UNESCO Site JSON Structure

Based on the UNESCO DataHub API response format:

```json
{
  "id": 600,
  "unique_number": 600,
  "id_number": 600,
  "category": "Cultural",
  "name_en": "Banks of the Seine in Paris",
  "name_fr": "Paris, rives de la Seine",
  "short_description_en": "From the Louvre to the Eiffel Tower, from the Place de la Concorde to the Grand and Petit Palais, the evolution of Paris and its history can be seen from the River Seine...",
  "short_description_fr": "Du Louvre à la Tour Eiffel, de la place de la Concorde aux Grand et Petit Palais...",
  "justification_en": "The site demonstrates exceptional urban landscape reflecting key periods in the history of Paris...",
  "states": "France",
  "states_iso_code": "FR",
  "region": "Europe and North America",
  "iso_region": "EUR",
  "transboundary": "0",
  "date_inscribed": "1991",
  "secondary_dates": "",
  "danger": "0",
  "date_end": null,
  "longitude": 2.3376,
  "latitude": 48.8606,
  "area_hectares": 365.0,
  "criteria_txt": "(i)(ii)(iv)",
  "http_url": "https://whc.unesco.org/en/list/600",
  "image_url": "https://whc.unesco.org/uploads/thumbs/site_0600_0001-750-0-20151104113424.jpg",
  "unesco_region": "Europe",
  "serial": "0",
  "extension": "0",
  "revision": "0"
}
```
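
For downstream code, the response above can be modeled as a typed record. The sketch below is illustrative only — the `UnescoSiteRecord` name and the field subset are assumptions for this document, not part of the UNESCO API or the mapping schema:

```python
from typing import Optional, TypedDict

class UnescoSiteRecord(TypedDict, total=False):
    """Subset of UNESCO DataHub response fields used by the mapping (illustrative)."""
    id: int
    category: str             # "Cultural" | "Natural" | "Mixed"
    name_en: str
    name_fr: str
    states_iso_code: str      # ISO 3166-1 alpha-2
    date_inscribed: str       # year as a string, e.g. "1991"
    date_end: Optional[str]   # non-null only for delisted sites
    latitude: float
    longitude: float
    criteria_txt: str         # e.g. "(i)(ii)(iv)"
    serial: str               # "1" marks serial nominations
    transboundary: str        # "1" marks transboundary sites
    http_url: str

# Example instance built from the sample response above
site: UnescoSiteRecord = {
    "id": 600,
    "category": "Cultural",
    "name_en": "Banks of the Seine in Paris",
    "states_iso_code": "FR",
    "criteria_txt": "(i)(ii)(iv)",
}
```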

### Key Observations

1. **Multi-language Fields**: `name_en`, `name_fr`, `short_description_en`, `short_description_fr`
2. **Institution Type Inference**: Must infer from `category`, `short_description`, and `justification`
3. **Geographic Data**: Provides `latitude`, `longitude`, `states_iso_code`, `region`
4. **UNESCO WHC ID**: Unique identifier in `id`, `unique_number`, `id_number` fields
5. **Temporal Data**: `date_inscribed` (inscription date, not always founding date)
6. **Criteria**: Cultural criteria `(i)-(vi)` and Natural criteria `(vii)-(x)`
7. **Serial Nominations**: Multiple sites under one inscription (indicated by `serial: "1"`)
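
Observation 6 (the cultural/natural criteria ranges) translates directly into code. A minimal sketch, assuming the `criteria_txt` format shown in the sample above (the function names are hypothetical):

```python
import re

# Criteria (i)-(vi) are cultural; (vii)-(x) are natural.
CULTURAL_CRITERIA = {"i", "ii", "iii", "iv", "v", "vi"}

def parse_criteria(criteria_txt: str) -> list[str]:
    """Split a string like "(i)(ii)(iv)" into its roman-numeral codes."""
    return re.findall(r"\(([ivx]+)\)", criteria_txt)

def is_mixed(criteria_txt: str) -> bool:
    """True when a site cites both cultural and natural criteria."""
    codes = set(parse_criteria(criteria_txt))
    return bool(codes & CULTURAL_CRITERIA) and bool(codes - CULTURAL_CRITERIA)

print(parse_criteria("(i)(ii)(iv)"))  # ['i', 'ii', 'iv']
```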

---

## LinkML Map Schema Design

### Schema Structure

The LinkML Map schema will use a **hybrid approach**:

1. **Direct field mappings** for straightforward transformations (name, coordinates, dates)
2. **Conditional transformations** for institution type classification
3. **Custom Python functions** for complex logic (GHCID generation, Wikidata enrichment)
4. **Multi-step transformations** for nested objects (Location, Identifier, Provenance)
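
The first approach — direct mappings with a fallback path — reduces to a small helper. A sketch under the assumption that `source_path` values are the `$.`-prefixed top-level keys used throughout this document (the `direct_map` helper name is hypothetical):

```python
from typing import Any, Optional

def direct_map(record: dict, source_path: str,
               fallback_path: Optional[str] = None) -> Any:
    """Resolve a direct field mapping, trying the fallback path when the
    primary value is missing or empty. Only top-level keys are handled."""
    value = record.get(source_path.removeprefix("$."))
    if value in (None, "") and fallback_path:
        value = record.get(fallback_path.removeprefix("$."))
    return value

# Multi-language name extraction: prefer name_en, fall back to name_fr.
name = direct_map({"name_en": "", "name_fr": "Paris, rives de la Seine"},
                  "$.name_en", "$.name_fr")
print(name)  # Paris, rives de la Seine
```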

### Schema File: `schemas/maps/unesco_to_heritage_custodian.yaml`

```yaml
id: https://w3id.org/heritage/custodian/maps/unesco
name: unesco-to-heritage-custodian-map
title: UNESCO World Heritage Site to HeritageCustodian Mapping
description: >-
  LinkML Map transformation rules for converting UNESCO DataHub API responses
  to LinkML HeritageCustodian instances. Handles multi-language fields,
  institution type classification, and geographic data normalization.

version: 1.0.0
license: https://creativecommons.org/publicdomain/zero/1.0/

prefixes:
  linkml: https://w3id.org/linkml/
  heritage: https://w3id.org/heritage/custodian/
  unesco: https://whc.unesco.org/en/list/

imports:
  - linkml:types
  - ../../core
  - ../../enums
  - ../../provenance

# =============================================================================
# MAPPING RULES
# =============================================================================

mappings:
  unesco_site_to_heritage_custodian:
    description: >-
      Transform UNESCO World Heritage Site JSON to HeritageCustodian instance.
      Applies institution type classification, multi-language name extraction,
      and geographic data normalization.

    source_schema: unesco_api_response
    target_schema: heritage_custodian

    # -------------------------------------------------------------------------
    # CORE IDENTIFICATION FIELDS
    # -------------------------------------------------------------------------

    id:
      source_path: $.id
      target_path: id
      transform:
        function: generate_heritage_custodian_id
        description: >-
          Generate persistent URI identifier using UNESCO WHC ID and country code.
          Format: https://w3id.org/heritage/custodian/{country_code}/unesco-{whc_id}
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
      example: "https://w3id.org/heritage/custodian/fr/unesco-600"

    # GHCID fields (generated via custom function)
    ghcid_current:
      source_path: $.id
      target_path: ghcid_current
      transform:
        function: generate_ghcid_for_unesco_site
        description: >-
          Generate Global Heritage Custodian Identifier (GHCID) using UNESCO site data.
          Format: {ISO-3166-1}-{ISO-3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}[-Q{WikidataID}]
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
          site_name: $.name_en
          latitude: $.latitude
          longitude: $.longitude
          institution_type: "@computed:institution_type"  # Reference computed value
      fallback:
        # If GHCID cannot be generated, use UNESCO WHC ID
        value: "UNESCO-{whc_id}"
        note: "Fallback GHCID - requires manual review"

    ghcid_numeric:
      source_path: ghcid_current
      target_path: ghcid_numeric
      transform:
        function: ghcid_to_numeric_hash
        description: "Generate 64-bit numeric hash from GHCID string (SHA-256 based)"

    ghcid_uuid:
      source_path: ghcid_current
      target_path: ghcid_uuid
      transform:
        function: ghcid_to_uuid_v5
        description: "Generate deterministic UUID v5 from GHCID string (SHA-1 based)"

    ghcid_uuid_sha256:
      source_path: ghcid_current
      target_path: ghcid_uuid_sha256
      transform:
        function: ghcid_to_uuid_v8
        description: "Generate deterministic UUID v8 from GHCID string (SHA-256 based)"
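
    # Reference sketch for the derived-identifier functions above (illustrative
    # Python only, not normative; actual implementations live under src/glam_extractor/):
    #
    #   import hashlib, uuid
    #
    #   def ghcid_to_numeric_hash(ghcid: str) -> int:
    #       # First 8 bytes of SHA-256 as a 64-bit unsigned integer
    #       return int.from_bytes(hashlib.sha256(ghcid.encode()).digest()[:8], "big")
    #
    #   def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    #       # Deterministic: the same GHCID always yields the same UUID (SHA-1 based)
    #       return uuid.uuid5(uuid.NAMESPACE_URL, ghcid)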

    # -------------------------------------------------------------------------
    # NAME FIELDS (Multi-language handling)
    # -------------------------------------------------------------------------

    name:
      source_path: $.name_en
      target_path: name
      transform:
        function: extract_primary_name
        description: >-
          Extract primary name for the institution. Prefers English name,
          falls back to French or first available language.
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
      fallback:
        source_path: $.name_fr
      required: true

    alternative_names:
      target_path: alternative_names
      transform:
        function: extract_alternative_names
        description: >-
          Extract all language variants of the site name. Collects names from
          name_en, name_fr, and other language fields if present.
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
        output: list
      example:
        - "Banks of the Seine in Paris"
        - "Paris, rives de la Seine"

    # -------------------------------------------------------------------------
    # INSTITUTION TYPE CLASSIFICATION (Conditional)
    # -------------------------------------------------------------------------

    institution_type:
      target_path: institution_type
      transform:
        function: classify_unesco_institution_type
        description: >-
          Classify institution type based on UNESCO category, description, and keywords.
          Uses a multi-strategy approach: keyword matching, rule-based inference, and
          confidence scoring. Defaults to MIXED for ambiguous cases.
        parameters:
          category: $.category
          description_en: $.short_description_en
          description_fr: $.short_description_fr
          justification_en: $.justification_en
          criteria: $.criteria_txt
          site_name: $.name_en
          strategy: composite  # Use composite classification strategy
          confidence_threshold: 0.7
        default: MIXED
      required: true
      examples:
        - input: {description: "Museum complex with archaeological collections"}
          output: MUSEUM
          confidence: 0.95
        - input: {description: "Historic library and archive"}
          output: LIBRARY
          confidence: 0.85
        - input: {description: "Historic city center"}
          output: MIXED
          confidence: 0.5

    # -------------------------------------------------------------------------
    # DESCRIPTION FIELDS
    # -------------------------------------------------------------------------

    description:
      source_path: $.short_description_en
      target_path: description
      transform:
        function: merge_descriptions
        description: >-
          Merge short description and justification into a single comprehensive
          description. Prefers English, includes French if English unavailable.
        parameters:
          short_description_en: $.short_description_en
          short_description_fr: $.short_description_fr
          justification_en: $.justification_en
          justification_fr: $.justification_fr
        template: |
          {short_description_en}

          UNESCO Justification: {justification_en}
      fallback:
        source_path: $.short_description_fr

    # -------------------------------------------------------------------------
    # LOCATION (Nested Object)
    # -------------------------------------------------------------------------

    locations:
      target_path: locations
      transform:
        function: create_location_from_unesco
        description: >-
          Create Location object from UNESCO geographic data. Extracts country,
          region, coordinates, and attempts GeoNames lookup for city name.
        parameters:
          country: $.states_iso_code
          region: $.region
          latitude: $.latitude
          longitude: $.longitude
          unesco_region: $.unesco_region
        output: list  # Always return a list (even for a single location)

      nested_mappings:
        - target_class: Location
          fields:
            country:
              source_path: $.states_iso_code
              required: true
              validation:
                pattern: "^[A-Z]{2}$"
                description: "ISO 3166-1 alpha-2 country code"

            region:
              source_path: $.region
              transform:
                function: map_unesco_region_to_iso3166_2
                description: "Convert UNESCO region name to ISO 3166-2 code if possible"
              fallback:
                value: null
                note: "UNESCO regions do not map 1:1 to ISO 3166-2"

            city:
              source_path: null  # Not provided by UNESCO
              transform:
                function: reverse_geocode_coordinates
                description: >-
                  Reverse geocode coordinates to find city name using the GeoNames API.
                  Caches results to minimize API calls.
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                  cache_ttl: 2592000  # 30 days
              fallback:
                value: null
                note: "Geocoding failed or coordinates invalid"

            latitude:
              source_path: $.latitude
              required: true
              validation:
                type: float
                range: [-90, 90]

            longitude:
              source_path: $.longitude
              required: true
              validation:
                type: float
                range: [-180, 180]

            geonames_id:
              source_path: null
              transform:
                function: lookup_geonames_id
                description: "Query GeoNames API for location ID"
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                  cache_ttl: 2592000  # 30 days

            is_primary:
              source_path: null
              default: true
              description: "UNESCO data provides a single location per site"

    # -------------------------------------------------------------------------
    # IDENTIFIERS (Nested Objects - Array)
    # -------------------------------------------------------------------------

    identifiers:
      target_path: identifiers
      transform:
        function: create_identifiers_from_unesco
        description: >-
          Create Identifier objects for UNESCO WHC ID, website URL, and
          Wikidata Q-number (if available). Queries the Wikidata SPARQL endpoint
          to find the Q-number via the UNESCO WHC ID property (P757).
        output: list

      nested_mappings:
        # UNESCO WHC ID
        - target_class: Identifier
          condition:
            field: $.id
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "UNESCO_WHC"
            identifier_value:
              source_path: $.id
              transform:
                function: format_unesco_whc_id
                description: "Format WHC ID as zero-padded 4-digit string"
                example: "0600"
            identifier_url:
              source_path: $.http_url
              fallback:
                template: "https://whc.unesco.org/en/list/{id}"
            assigned_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "Convert year string to ISO 8601 date (YYYY-01-01)"

        # Website URL
        - target_class: Identifier
          condition:
            field: $.http_url
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "Website"
            identifier_value:
              source_path: $.http_url
            identifier_url:
              source_path: $.http_url

        # Wikidata Q-number (enrichment)
        - target_class: Identifier
          transform:
            function: enrich_with_wikidata_qnumber
            description: >-
              Query the Wikidata SPARQL endpoint for a Q-number using the UNESCO
              WHC ID property (P757). Caches results to minimize SPARQL queries.
            parameters:
              whc_id: $.id
              cache_ttl: 604800  # 7 days
            fallback:
              skip: true
              note: "Wikidata Q-number not found or enrichment disabled"
          fields:
            identifier_scheme:
              default: "Wikidata"
            identifier_value:
              source_path: "@enrichment:wikidata_qnumber"
            identifier_url:
              template: "https://www.wikidata.org/wiki/{identifier_value}"

    # -------------------------------------------------------------------------
    # TEMPORAL FIELDS
    # -------------------------------------------------------------------------

    founded_date:
      source_path: $.date_inscribed
      target_path: founded_date
      transform:
        function: parse_unesco_date
        description: >-
          Parse UNESCO inscription date. Note: this is the UNESCO inscription date,
          NOT the original founding date of the institution. Use with caution.
        validation:
          pattern: "^\\d{4}(-\\d{2}-\\d{2})?$"
        output_format: "ISO 8601 date (YYYY-MM-DD)"
      note: >-
        UNESCO inscription date != founding date. Actual founding dates
        require external enrichment from Wikidata or institutional sources.

    closed_date:
      source_path: $.date_end
      target_path: closed_date
      transform:
        function: parse_unesco_date
        description: "Parse UNESCO delisting date (if site was removed from list)"
      fallback:
        value: null
        note: "Most sites remain on the UNESCO list indefinitely"

    # -------------------------------------------------------------------------
    # COLLECTIONS (Nested Objects)
    # -------------------------------------------------------------------------

    collections:
      target_path: collections
      transform:
        function: infer_collections_from_unesco
        description: >-
          Infer collection metadata from UNESCO criteria and category.
          Creates Collection objects based on cultural/natural criteria.
        parameters:
          category: $.category
          criteria_txt: $.criteria_txt
          description_en: $.short_description_en
        output: list

      nested_mappings:
        - target_class: Collection
          fields:
            collection_name:
              transform:
                function: generate_collection_name
                description: "Generate collection name from UNESCO category and criteria"
                template: "UNESCO {category} Heritage Collection - {site_name}"

            collection_type:
              source_path: $.category
              transform:
                function: map_unesco_category_to_collection_type
                mappings:
                  Cultural: "cultural"
                  Natural: "natural"
                  Mixed: "mixed"

            subject_areas:
              source_path: $.criteria_txt
              transform:
                function: parse_unesco_criteria
                description: >-
                  Parse UNESCO criteria codes (i-x) and map to subject areas.
                  Cultural: (i) masterpieces, (ii) exchange, (iii) testimony,
                  (iv) architecture, (v) settlement, (vi) associations.
                  Natural: (vii) natural phenomena, (viii) earth's history,
                  (ix) ecosystems, (x) biodiversity.
                output: list
              example:
                input: "(i)(ii)(iv)"
                output: ["Masterpieces of human creative genius", "Exchange of human values", "Architecture"]

            temporal_coverage:
              source_path: null
              transform:
                function: infer_temporal_coverage_from_description
                description: >-
                  Attempt to extract time periods from UNESCO description text.
                  Uses NER and regex patterns to identify dates/periods.
                parameters:
                  description: $.short_description_en
                  justification: $.justification_en
              fallback:
                value: null
                note: "Temporal coverage could not be inferred"

            extent:
              source_path: $.area_hectares
              transform:
                function: format_area_extent
                description: "Format area in hectares as collection extent"
                template: "{area_hectares} hectares"

    # -------------------------------------------------------------------------
    # DIGITAL PLATFORMS (Nested Objects)
    # -------------------------------------------------------------------------

    digital_platforms:
      target_path: digital_platforms
      transform:
        function: create_digital_platforms_from_unesco
        description: >-
          Create DigitalPlatform objects for the UNESCO website and any linked
          institutional websites discovered via URL parsing.
        parameters:
          http_url: $.http_url
          image_url: $.image_url
        output: list

      nested_mappings:
        # UNESCO Website
        - target_class: DigitalPlatform
          fields:
            platform_name:
              default: "UNESCO World Heritage Centre"
            platform_url:
              source_path: $.http_url
            platform_type:
              default: WEBSITE
            metadata_standards:
              default: ["DUBLIN_CORE", "SCHEMA_ORG"]
              description: "UNESCO uses Dublin Core and Schema.org metadata"

    # -------------------------------------------------------------------------
    # PROVENANCE (Required)
    # -------------------------------------------------------------------------

    provenance:
      target_path: provenance
      transform:
        function: create_unesco_provenance
        description: "Create Provenance object for UNESCO-sourced data"
        output: object
      required: true

      nested_mappings:
        - target_class: Provenance
          fields:
            data_source:
              default: UNESCO_WORLD_HERITAGE
              validation:
                enum: DataSourceEnum

            data_tier:
              default: TIER_1_AUTHORITATIVE
              description: "UNESCO DataHub is an authoritative source"
              validation:
                enum: DataTierEnum

            extraction_date:
              source_path: null
              transform:
                function: current_timestamp_utc
                description: "Timestamp when data was extracted from the UNESCO API"
                output_format: "ISO 8601 datetime with timezone"

            extraction_method:
              default: "UNESCO DataHub API extraction via LinkML Map transformation"

            confidence_score:
              source_path: "@computed:institution_type_confidence"
              description: "Confidence score from institution type classification"
              validation:
                type: float
                range: [0.0, 1.0]
              fallback:
                value: 0.85
                note: "Default confidence for UNESCO data with inferred institution type"

            source_url:
              source_path: $.http_url
              description: "URL to original UNESCO World Heritage Site page"

            verified_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "UNESCO inscription date serves as verification date"

            verified_by:
              default: "UNESCO World Heritage Committee"
              description: "UNESCO WHC verifies all inscriptions"

# =============================================================================
# CUSTOM TRANSFORMATION FUNCTIONS (Python Implementation Required)
# =============================================================================

custom_functions:
  generate_heritage_custodian_id:
    description: "Generate persistent URI for heritage custodian"
    implementation: "src/glam_extractor/mappers/unesco_transformers.py:generate_heritage_custodian_id"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
      returns: str
    example:
      input: {whc_id: 600, country_code: "FR"}
      output: "https://w3id.org/heritage/custodian/fr/unesco-600"

  generate_ghcid_for_unesco_site:
    description: "Generate GHCID from UNESCO site data"
    implementation: "src/glam_extractor/ghcid/unesco_ghcid_generator.py:generate_ghcid"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
        - name: site_name
          type: str
        - name: latitude
          type: float
        - name: longitude
          type: float
        - name: institution_type
          type: InstitutionTypeEnum
      returns: str
    notes:
      - "Requires GeoNames lookup for UN/LOCODE (city code)"
      - "Handles collision resolution via Wikidata Q-number suffix"
      - "Caches GeoNames lookups to minimize API calls"

  classify_unesco_institution_type:
    description: "Classify institution type using composite strategy"
    implementation: "src/glam_extractor/classifiers/unesco_institution_type.py:classify"
    signature:
      parameters:
        - name: category
          type: str
        - name: description_en
          type: str
        - name: description_fr
          type: str
        - name: justification_en
          type: str
        - name: criteria
          type: str
        - name: site_name
          type: str
      returns:
        type: dict
        fields:
          - institution_type: InstitutionTypeEnum
          - confidence_score: float
          - reasoning: str
    notes:
      - "Uses KeywordClassificationStrategy, RuleBasedStrategy, and CompositeStrategy"
      - "Confidence threshold: 0.7 (below this, defaults to MIXED)"
      - "See design-patterns.md Pattern 2 for implementation details"

  enrich_with_wikidata_qnumber:
    description: "Query Wikidata for Q-number via UNESCO WHC ID property (P757)"
    implementation: "src/glam_extractor/enrichment/wikidata_enricher.py:get_qnumber_by_whc_id"
    signature:
      parameters:
        - name: whc_id
          type: int
      returns:
        type: Optional[str]
        description: "Wikidata Q-number or None if not found"
    sparql_query: |
      SELECT ?item WHERE {
        ?item wdt:P757 "{whc_id}" .
      }
    cache_ttl: 604800  # 7 days
    notes:
      - "Rate limit: 60 requests/minute to Wikidata SPARQL endpoint"
      - "Implement exponential backoff on 429 Too Many Requests"
      - "Skip enrichment if offline mode enabled"
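
    # Sketch of the SPARQL call (illustrative Python, assuming the SPARQLWrapper
    # library; not the normative implementation):
    #   from SPARQLWrapper import SPARQLWrapper, JSON
    #   client = SPARQLWrapper("https://query.wikidata.org/sparql")
    #   client.setQuery('SELECT ?item WHERE { ?item wdt:P757 "600" . }')
    #   client.setReturnFormat(JSON)
    #   rows = client.query().convert()["results"]["bindings"]
    #   qnumber = rows[0]["item"]["value"].rsplit("/", 1)[-1] if rows else None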

  reverse_geocode_coordinates:
    description: "Reverse geocode coordinates to city name via GeoNames"
    implementation: "src/glam_extractor/geocoding/geonames_client.py:reverse_geocode"
    signature:
      parameters:
        - name: latitude
          type: float
        - name: longitude
          type: float
      returns:
        type: Optional[str]
        description: "City name or None if lookup failed"
    api_endpoint: "http://api.geonames.org/findNearbyPlaceNameJSON"
    cache_ttl: 2592000  # 30 days
    rate_limit: "1 request/second (GeoNames free tier)"
    notes:
      - "Requires GEONAMES_API_KEY environment variable"
      - "Free tier: 20,000 requests/day"
      - "Aggressive caching essential to stay within limits"

  parse_unesco_criteria:
    description: "Parse UNESCO criteria codes (i-x) to subject area descriptions"
    implementation: "src/glam_extractor/parsers/unesco_criteria_parser.py:parse_criteria"
    signature:
      parameters:
        - name: criteria_txt
          type: str
          example: "(i)(ii)(iv)"
      returns:
        type: List[str]
        description: "List of criteria descriptions"
    criteria_mapping:
      "(i)": "Masterpieces of human creative genius"
      "(ii)": "Exchange of human values"
      "(iii)": "Testimony to cultural tradition"
      "(iv)": "Outstanding example of architecture"
      "(v)": "Traditional human settlement or land-use"
      "(vi)": "Associated with events, traditions, or ideas"
      "(vii)": "Superlative natural phenomena or natural beauty"
      "(viii)": "Outstanding examples of Earth's history"
      "(ix)": "Significant ongoing ecological and biological processes"
      "(x)": "Habitats for biodiversity conservation"
    reference: "https://whc.unesco.org/en/criteria/"

# =============================================================================
# VALIDATION RULES
# =============================================================================

validation:
  required_fields:
    - id
    - name
    - institution_type
    - locations
    - identifiers
    - provenance

  post_transformation:
    - rule: validate_ghcid_format
      description: "Ensure GHCID matches specification pattern"
      pattern: "^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
      applies_to: ghcid_current

    - rule: validate_country_code
      description: "Ensure country code is valid ISO 3166-1 alpha-2"
      validation_function: "pycountry.countries.get(alpha_2=value)"
      applies_to: locations[*].country

    - rule: validate_coordinates
      description: "Ensure coordinates are within valid ranges"
      validation_function: "validate_coordinates_in_range"
      applies_to:
        - locations[*].latitude
        - locations[*].longitude

    - rule: validate_confidence_score
      description: "Ensure confidence score is between 0.0 and 1.0"
      validation_function: "0.0 <= value <= 1.0"
      applies_to: provenance.confidence_score

    - rule: warn_low_confidence
      description: "Warn if institution type confidence < 0.7"
      severity: warning
      condition: "provenance.confidence_score < 0.7"
      action: "log_warning"
      message: "Low confidence institution type classification: {institution_type} ({confidence_score})"

# =============================================================================
# EDGE CASES AND SPECIAL HANDLING
# =============================================================================

edge_cases:
  serial_nominations:
    description: >-
      UNESCO serial nominations represent multiple sites under one WHC ID.
      Example: "Primeval Beech Forests of the Carpathians" spans 12 countries.
    handling:
      - "Detect via $.serial == '1' flag"
      - "Query UNESCO API for component sites"
      - "Create separate HeritageCustodian record for each component"
      - "Link via parent_organization or partnership relationship"
    implementation: "src/glam_extractor/parsers/serial_nomination_handler.py"

  transboundary_sites:
    description: >-
      Sites spanning multiple countries (e.g., Mont Blanc, Wadden Sea).
    handling:
      - "Detect via $.transboundary == '1' flag"
      - "Parse $.states field for multiple country codes (comma-separated)"
      - "Create Location objects for each country"
      - "GHCID generation: use first country alphabetically, note transboundary in description"
    example:
      site: "Wadden Sea"
      countries: ["NL", "DE", "DK"]
      ghcid: "DE-NS-CUX-M-WS"  # Uses Germany (alphabetically first among DE/DK/NL)
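
    # Country-splitting sketch for the handling steps above (illustrative Python;
    # assumes the comma-separated codes arrive in a single string field):
    #   countries = sorted(code.strip().upper() for code in states_field.split(","))
    #   ghcid_country = countries[0]  # alphabetically first, per the rule above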

  sites_with_multiple_institutions:
    description: >-
      Large UNESCO sites may contain multiple GLAM institutions (e.g., Vatican City).
    handling:
      - "Primary extraction creates one HeritageCustodian for the UNESCO site itself"
      - "Flag for manual enrichment: add sub_organizations for individual museums/archives within site"
      - "Use sub_organizations slot to link related institutions"
    example:
      site: "Vatican City"
      main_record: "Vatican Museums"
      sub_organizations:
        - "Vatican Apostolic Archive"
        - "Vatican Library"
        - "Sistine Chapel"

  sites_removed_from_list:
    description: >-
      Sites delisted from the UNESCO World Heritage List (rare but possible).
    handling:
      - "Detect via $.date_end field (not null)"
      - "Set organization_status: INACTIVE"
      - "Set closed_date: $.date_end"
      - "Note delisting in change_history as ChangeEvent(CLOSURE)"
    example:
      site: "Dresden Elbe Valley (Germany)"
      date_end: "2009"
      reason: "Construction of Waldschlösschen Bridge"

  missing_coordinates:
    description: >-
      Some UNESCO sites lack precise coordinates (especially large serial nominations).
    handling:
      - "Attempt geocoding via city/country fallback"
      - "If geocoding fails, leave latitude/longitude as null"
      - "Mark with provenance.notes: 'Coordinates unavailable from UNESCO, requires manual geocoding'"
      - "Flag for manual review"

  ambiguous_institution_types:
    description: >-
      UNESCO sites that are clearly heritage sites but not obviously GLAM custodians.
      Example: historic city centers without specific museum/archive mention.
    handling:
      - "Default to MIXED with confidence_score < 0.7"
      - "Add provenance.notes: 'Institution type inferred, requires verification'"
      - "Generate GHCID with 'X' type code (MIXED)"
      - "Flag for manual classification review"

# =============================================================================
# TESTING STRATEGY
# =============================================================================

testing:
  unit_tests:
    - test_direct_field_mapping:
        description: "Test simple field mappings (name, country, WHC ID)"
        fixtures: 20
        coverage: "All direct mappings without custom functions"

    - test_custom_transformation_functions:
        description: "Test each custom function in isolation"
        fixtures: 50
        functions:
          - generate_heritage_custodian_id
          - generate_ghcid_for_unesco_site
          - classify_unesco_institution_type
          - parse_unesco_criteria

    - test_nested_object_creation:
        description: "Test Location, Identifier, Collection creation"
        fixtures: 30
        focus: "Ensure nested objects validate against LinkML schema"

    - test_multi_language_handling:
        description: "Test extraction of English and French names"
        fixtures: 20
        edge_cases:
          - "Only English name available"
          - "Only French name available"
          - "Both names identical"
          - "Names significantly different"

  integration_tests:
    - test_full_transformation_pipeline:
        description: "End-to-end test: UNESCO JSON → HeritageCustodian YAML"
        fixtures: 20  # golden dataset
        validation:
          - "LinkML schema validation"
          - "GHCID format validation"
          - "Provenance completeness"
          - "No required fields missing"

    - test_edge_cases:
        description: "Test special cases (serial, transboundary, delisted)"
        fixtures:
          - "3 serial nominations"
          - "3 transboundary sites"
          - "1 delisted site"
          - "5 ambiguous institution types"

    - test_external_api_integration:
        description: "Test GeoNames and Wikidata enrichment"
        approach: "Mock external APIs with cached responses"
        fixtures: 10
        scenarios:
          - "Successful enrichment"
          - "API rate limit exceeded"
          - "API unavailable (offline mode)"
          - "No results found"

  property_based_tests:
    - test_ghcid_uniqueness:
        description: "Ensure GHCIDs are unique within dataset"
        strategy: "Generate 1000 UNESCO site transformations, check for collisions"
        property: "All GHCIDs unique OR collisions resolved via Q-number suffix"

    - test_coordinate_validity:
        description: "Ensure all coordinates within valid ranges"
        strategy: "Test with random valid/invalid coordinates"
        property: "latitude ∈ [-90, 90], longitude ∈ [-180, 180]"

    - test_confidence_score_range:
        description: "Ensure confidence scores always in [0.0, 1.0]"
        strategy: "Test with diverse UNESCO descriptions"
        property: "0.0 ≤ confidence_score ≤ 1.0"
# =============================================================================
# PERFORMANCE CONSIDERATIONS
# =============================================================================

performance:
  caching_strategy:
    geonames_lookups:
      ttl: 2592000  # 30 days
      rationale: "Place names rarely change"
      cache_backend: "SQLite (cache/geonames_cache.db)"

    wikidata_enrichment:
      ttl: 604800  # 7 days
      rationale: "Wikidata updates occasionally, refresh weekly"
      cache_backend: "SQLite (cache/wikidata_cache.db)"

    unesco_api_responses:
      ttl: 86400  # 24 hours
      rationale: "UNESCO data changes infrequently"
      cache_backend: "requests-cache with SQLite"

  rate_limiting:
    geonames:
      limit: "1 request/second"
      fallback: "Skip geocoding if rate limit exceeded, continue processing"

    wikidata:
      limit: "60 requests/minute"
      fallback: "Queue requests, process in batches"

  parallelization:
    approach: "Process UNESCO sites in parallel using multiprocessing"
    workers: 4
    batch_size: 50
    considerations:
      - "Share cache across workers (SQLite supports concurrent reads)"
      - "Coordinate API rate limiting across workers (use Redis or file lock)"
      - "Collect results and merge into single dataset"

  estimated_processing_time:
    total_sites: 1200
    processing_per_site: "2-5 seconds (with external API calls)"
    total_time_serial: "40-100 minutes"
    total_time_parallel_4_workers: "10-25 minutes"
# =============================================================================
# EXTENSION POINTS
# =============================================================================

extensions:
  future_enhancements:
    - name: "Machine Learning Classification"
      description: "Train ML model on manually classified UNESCO sites"
      implementation: "Add MLClassificationStrategy to composite classifier"
      benefits: "Improve classification accuracy beyond keyword matching"

    - name: "OpenStreetMap Integration"
      description: "Enrich location data with OSM polygons and detailed addresses"
      implementation: "Query Overpass API for UNESCO site boundaries"
      benefits: "More precise geographic data, street addresses"

    - name: "Multi-language NLP"
      description: "Use spaCy multi-language models for better description parsing"
      implementation: "Integrate spaCy NER for extracting temporal coverage and subjects"
      benefits: "Better collection metadata inference"

    - name: "UNESCO Thesaurus Integration"
      description: "Map UNESCO criteria to UNESCO Thesaurus SKOS concepts"
      implementation: "SPARQL queries against vocabularies.unesco.org"
      benefits: "Richer semantic linking, LOD compatibility"
```
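The caching and rate-limiting budgets in the performance section above can be prototyped with the standard library alone. The sketch below is illustrative, not the planned implementation: `SQLiteTTLCache`, `RateLimiter`, and the cache-key format are names invented for this example, and the real pipeline would use `requests-cache` for the UNESCO responses as specified.

```python
import json
import sqlite3
import time


class SQLiteTTLCache:
    """Minimal TTL cache backed by SQLite, mirroring the caching_strategy spec."""

    def __init__(self, path: str = ":memory:", ttl: int = 86400):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT, stored_at REAL)"
        )

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or time.time() - row[1] > self.ttl:
            return None  # missing or expired
        return json.loads(row[0])

    def set(self, key: str, value) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, json.dumps(value), time.time()),
        )
        self.db.commit()


class RateLimiter:
    """Blocks so calls are spaced at least min_interval seconds apart
    (1.0 would match the GeoNames budget of 1 request/second)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()


# Illustrative use: a 30-day GeoNames cache, per the TTL table above.
cache = SQLiteTTLCache(ttl=2592000)
cache.set("geonames:FR:paris", {"lat": 48.8566, "lng": 2.3522})
print(cache.get("geonames:FR:paris"))  # → {'lat': 48.8566, 'lng': 2.3522}
```

Sharing one SQLite file across the four workers is safe for concurrent reads, but the rate limiter above is per-process; as the considerations list notes, a cross-process lock (file lock or Redis) is needed to keep the combined request rate within budget.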
---

## Implementation Checklist

- [ ] **Day 2-3**: Create `schemas/maps/unesco_to_heritage_custodian.yaml` (this file)
- [ ] **Day 4**: Implement custom transformation functions in `src/glam_extractor/mappers/unesco_transformers.py`
- [ ] **Day 5**: Write unit tests for all custom functions (50+ tests)
- [ ] **Day 6**: Test full transformation pipeline with golden dataset (20 tests)
- [ ] **Day 7**: Handle edge cases (serial, transboundary, delisted sites)
- [ ] **Day 8**: Optimize performance (caching, rate limiting, parallelization)
- [ ] **Day 9-10**: Integration with main GLAM dataset and validation
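The GHCID uniqueness property from the test plan (all IDs unique, with collisions resolved via a Q-number suffix) is one of the Day 5 tests that needs no external services. In this sketch the `ghc:{iso}:{slug}` identifier shape, the `wikidata_qid` field name, and the sample Q-numbers are placeholders, not the real GHCID grammar.

```python
import re


def slugify(name: str) -> str:
    """Lowercase a site name and collapse non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def assign_ghcids(sites: list[dict]) -> dict:
    """Assign a placeholder GHCID per site; on collision, disambiguate
    with the site's Wikidata Q-number, as the uniqueness property requires.
    (A full implementation would also guard against suffixed collisions.)"""
    seen: set[str] = set()
    out: dict[int, str] = {}
    for site in sites:
        ghcid = f"ghc:{site['states_iso_code'].lower()}:{slugify(site['name_en'])}"
        if ghcid in seen:
            ghcid = f"{ghcid}:{site['wikidata_qid'].lower()}"
        seen.add(ghcid)
        out[site["id"]] = ghcid
    return out


# Synthetic fixture: two sites that slug to the same name.
sites = [
    {"id": 600, "states_iso_code": "FR",
     "name_en": "Banks of the Seine in Paris", "wikidata_qid": "Q11111"},
    {"id": 601, "states_iso_code": "FR",
     "name_en": "Banks of the Seine in Paris", "wikidata_qid": "Q22222"},
]
ids = assign_ghcids(sites)
print(ids[600])  # → ghc:fr:banks-of-the-seine-in-paris
print(ids[601])  # → ghc:fr:banks-of-the-seine-in-paris:q22222
```

A property-based version (e.g. with Hypothesis) would generate random site lists and assert that `len(set(ids.values())) == len(ids)` always holds.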
---

## Related Documentation

- **Dependencies**: `01-dependencies.md` - LinkML Map extension requirements
- **Implementation Phases**: `03-implementation-phases.md` - Day-by-day timeline
- **TDD Strategy**: `04-tdd-strategy.md` - Testing approach
- **Design Patterns**: `05-design-patterns.md` - Classification strategies
- **Master Checklist**: `07-master-checklist.md` - Overall progress tracker

---

**Version**: 1.0
**Date**: 2025-11-09
**Status**: Draft - Ready for Implementation
**Next Steps**: Begin Day 2 implementation (create initial LinkML Map schema file)