# UNESCO Data Extraction - LinkML Map Schema
**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 06 - LinkML Map Transformation Rules
**Version**: 1.0
**Date**: 2025-11-09
**Status**: Draft
## Executive Summary
This document specifies the LinkML Map transformation schema for converting UNESCO World Heritage Site data from the UNESCO DataHub API into LinkML-compliant HeritageCustodian instances. The mapping handles complex data structures, multi-language fields, conditional institution type classification, and identifier extraction.
**Key Innovation**: Extends LinkML Map with conditional extraction, regex validators, and multi-language field handling to manage UNESCO's rich semantic data.
**Schema Location**: `schemas/maps/unesco_to_heritage_custodian.yaml`
## UNESCO API Data Structure
### Sample UNESCO Site JSON Structure
Based on UNESCO DataHub API response format:
```json
{
  "id": 600,
  "unique_number": 600,
  "id_number": 600,
  "category": "Cultural",
  "name_en": "Banks of the Seine in Paris",
  "name_fr": "Paris, rives de la Seine",
  "short_description_en": "From the Louvre to the Eiffel Tower, from the Place de la Concorde to the Grand and Petit Palais, the evolution of Paris and its history can be seen from the River Seine...",
  "short_description_fr": "Du Louvre à la Tour Eiffel, de la place de la Concorde aux Grand et Petit Palais...",
  "justification_en": "The site demonstrates exceptional urban landscape reflecting key periods in the history of Paris...",
  "states": "France",
  "states_iso_code": "FR",
  "region": "Europe and North America",
  "iso_region": "EUR",
  "transboundary": "0",
  "date_inscribed": "1991",
  "secondary_dates": "",
  "danger": "0",
  "date_end": null,
  "longitude": 2.3376,
  "latitude": 48.8606,
  "area_hectares": 365.0,
  "criteria_txt": "(i)(ii)(iv)",
  "http_url": "https://whc.unesco.org/en/list/600",
  "image_url": "https://whc.unesco.org/uploads/thumbs/site_0600_0001-750-0-20151104113424.jpg",
  "unesco_region": "Europe",
  "serial": "0",
  "extension": "0",
  "revision": "0"
}
```
### Key Observations
- **Multi-language Fields**: `name_en`, `name_fr`, `short_description_en`, `short_description_fr`
- **Institution Type Inference**: Must infer from `category`, `short_description`, and `justification`
- **Geographic Data**: Provides `latitude`, `longitude`, `states_iso_code`, `region`
- **UNESCO WHC ID**: Unique identifier in `id`, `unique_number`, `id_number` fields
- **Temporal Data**: `date_inscribed` (inscription date, not always founding date)
- **Criteria**: Cultural criteria `(i)-(vi)` and Natural criteria `(vii)-(x)`
- **Serial Nominations**: Multiple sites under one inscription (indicated by `serial: "1"`)
## LinkML Map Schema Design
### Schema Structure
The LinkML Map schema will use a hybrid approach:
- Direct field mappings for straightforward transformations (name, coordinates, dates)
- Conditional transformations for institution type classification
- Custom Python functions for complex logic (GHCID generation, Wikidata enrichment)
- Multi-step transformations for nested objects (Location, Identifier, Provenance)
**Schema File**: `schemas/maps/unesco_to_heritage_custodian.yaml`
id: https://w3id.org/heritage/custodian/maps/unesco
name: unesco-to-heritage-custodian-map
title: UNESCO World Heritage Site to HeritageCustodian Mapping
description: >-
LinkML Map transformation rules for converting UNESCO DataHub API responses
to LinkML HeritageCustodian instances. Handles multi-language fields,
institution type classification, and geographic data normalization.
version: 1.0.0
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
linkml: https://w3id.org/linkml/
heritage: https://w3id.org/heritage/custodian/
unesco: https://whc.unesco.org/en/list/
imports:
- linkml:types
- ../../core
- ../../enums
- ../../provenance
# =============================================================================
# MAPPING RULES
# =============================================================================
mappings:
unesco_site_to_heritage_custodian:
description: >-
Transform UNESCO World Heritage Site JSON to HeritageCustodian instance.
Applies institution type classification, multi-language name extraction,
and geographic data normalization.
source_schema: unesco_api_response
target_schema: heritage_custodian
# -------------------------------------------------------------------------
# CORE IDENTIFICATION FIELDS
# -------------------------------------------------------------------------
id:
source_path: $.id
target_path: id
transform:
function: generate_heritage_custodian_id
description: >-
Generate persistent URI identifier using UNESCO WHC ID and country code.
Format: https://w3id.org/heritage/custodian/{country_code}/unesco-{whc_id}
parameters:
whc_id: $.id
country_code: $.states_iso_code
example: "https://w3id.org/heritage/custodian/fr/unesco-600"
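The URI-generation rule above can be sketched in Python. The function name and URI format come from this spec; the input validation and the lowercasing of the country code (inferred from the spec's example) are assumptions:

```python
def generate_heritage_custodian_id(whc_id: int, country_code: str) -> str:
    """Build the persistent URI for a UNESCO-derived heritage custodian.

    Format (from this mapping spec):
    https://w3id.org/heritage/custodian/{country_code}/unesco-{whc_id}
    """
    if not (isinstance(whc_id, int) and whc_id > 0):
        raise ValueError(f"Invalid UNESCO WHC ID: {whc_id!r}")
    if len(country_code) != 2 or not country_code.isalpha():
        raise ValueError(f"Invalid ISO 3166-1 alpha-2 code: {country_code!r}")
    # The example in the spec lowercases the country code in the URI path.
    return f"https://w3id.org/heritage/custodian/{country_code.lower()}/unesco-{whc_id}"
```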
# GHCID fields (generated via custom function)
ghcid_current:
source_path: $.id
target_path: ghcid_current
transform:
function: generate_ghcid_for_unesco_site
description: >-
Generate Global Heritage Custodian Identifier (GHCID) using UNESCO site data.
Format: {ISO-3166-1}-{ISO-3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}[-{native_name_snake_case}]
Note: Collision suffix uses native language institution name in snake_case (NOT Wikidata Q-numbers).
See docs/plan/global_glam/07-ghcid-collision-resolution.md for details.
parameters:
whc_id: $.id
country_code: $.states_iso_code
site_name: $.name_en
latitude: $.latitude
longitude: $.longitude
institution_type: "@computed:institution_type" # Reference computed value
fallback:
# If GHCID cannot be generated, use UNESCO WHC ID
value: "UNESCO-{whc_id}"
note: "Fallback GHCID - requires manual review"
ghcid_numeric:
source_path: ghcid_current
target_path: ghcid_numeric
transform:
function: ghcid_to_numeric_hash
description: "Generate 64-bit numeric hash from GHCID string (SHA-256 based)"
ghcid_uuid:
source_path: ghcid_current
target_path: ghcid_uuid
transform:
function: ghcid_to_uuid_v5
description: "Generate deterministic UUID v5 from GHCID string (SHA-1 based)"
ghcid_uuid_sha256:
source_path: ghcid_current
target_path: ghcid_uuid_sha256
transform:
function: ghcid_to_uuid_v8
description: "Generate deterministic UUID v8 from GHCID string (SHA-256 based)"
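The three derived identifiers can be sketched as below. The hash and UUID-version choices follow the descriptions above; the namespace UUID is an illustrative choice (reusing `uuid.NAMESPACE_DNS`), and the UUIDv8 byte layout is an assumption, since RFC 9562 leaves v8 content application-defined:

```python
import hashlib
import uuid

# Namespace for GHCID-derived UUIDs -- an illustrative choice, not specified
# by this document.
GHCID_NAMESPACE = uuid.NAMESPACE_DNS

def ghcid_to_numeric_hash(ghcid: str) -> int:
    """64-bit numeric hash: first 8 bytes of the SHA-256 of the GHCID string."""
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Deterministic UUID v5 (SHA-1 based, per RFC 4122/9562)."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)

def ghcid_to_uuid_v8(ghcid: str) -> uuid.UUID:
    """Deterministic UUID with version nibble 8, payload from SHA-256.

    RFC 9562 leaves the UUIDv8 layout application-defined; this sketch packs
    the first 16 bytes of the digest and then fixes the version/variant bits.
    """
    raw = bytearray(hashlib.sha256(ghcid.encode("utf-8")).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80  # set version nibble to 8
    raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC variant bits
    return uuid.UUID(bytes=bytes(raw))
```

All three are pure functions of `ghcid_current`, so re-running the pipeline yields identical identifiers.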
# -------------------------------------------------------------------------
# NAME FIELDS (Multi-language handling)
# -------------------------------------------------------------------------
name:
source_path: $.name_en
target_path: name
transform:
function: extract_primary_name
description: >-
Extract primary name for the institution. Prefers English name,
falls back to French or first available language.
parameters:
name_en: $.name_en
name_fr: $.name_fr
fallback:
source_path: $.name_fr
required: true
alternative_names:
target_path: alternative_names
transform:
function: extract_alternative_names
description: >-
Extract all language variants of the site name. Collects names from
name_en, name_fr, and other language fields if present.
parameters:
name_en: $.name_en
name_fr: $.name_fr
output: list
example:
- "Banks of the Seine in Paris"
- "Paris, rives de la Seine"
# -------------------------------------------------------------------------
# INSTITUTION TYPE CLASSIFICATION (Conditional)
# -------------------------------------------------------------------------
institution_type:
target_path: institution_type
transform:
function: classify_unesco_institution_type
description: >-
Classify institution type based on UNESCO category, description, and keywords.
Uses multi-strategy approach: keyword matching, rule-based inference, and
confidence scoring. Defaults to MIXED for ambiguous cases.
parameters:
category: $.category
description_en: $.short_description_en
description_fr: $.short_description_fr
justification_en: $.justification_en
criteria: $.criteria_txt
site_name: $.name_en
strategy: composite # Use composite classification strategy
confidence_threshold: 0.7
default: MIXED
required: true
examples:
- input: {description: "Museum complex with archaeological collections"}
output: MUSEUM
confidence: 0.95
- input: {description: "Historic library and archive"}
output: LIBRARY
confidence: 0.85
- input: {description: "Historic city center"}
output: MIXED
confidence: 0.5
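A toy version of the keyword strategy is sketched below; the real classifier is the composite strategy described in design-patterns.md, and the keyword table and scoring formula here are assumptions for illustration only:

```python
import re

# Illustrative keyword table -- the production classifier combines keyword,
# rule-based, and composite strategies; these terms are assumptions.
KEYWORDS = {
    "MUSEUM": ["museum", "gallery", "collection", "musée"],
    "LIBRARY": ["library", "bibliothèque", "manuscripts"],
    "ARCHIVE": ["archive", "archives", "records"],
}

def classify_unesco_institution_type(text: str, threshold: float = 0.7) -> dict:
    """Keyword sketch: score = matched terms / total terms for that type."""
    text_lower = text.lower()
    best_type, best_score = "MIXED", 0.0
    for inst_type, terms in KEYWORDS.items():
        hits = sum(1 for t in terms if re.search(rf"\b{re.escape(t)}\b", text_lower))
        score = hits / len(terms)
        if score > best_score:
            best_type, best_score = inst_type, score
    if best_score < threshold:
        # Below threshold: fall back to MIXED; 0.5 mirrors the ambiguous
        # "historic city center" example above when nothing matched at all.
        return {"institution_type": "MIXED", "confidence_score": best_score or 0.5}
    return {"institution_type": best_type, "confidence_score": best_score}
```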
# -------------------------------------------------------------------------
# DESCRIPTION FIELDS
# -------------------------------------------------------------------------
description:
source_path: $.short_description_en
target_path: description
transform:
function: merge_descriptions
description: >-
Merge short description and justification into single comprehensive
description. Prefers English, includes French if English unavailable.
parameters:
short_description_en: $.short_description_en
short_description_fr: $.short_description_fr
justification_en: $.justification_en
justification_fr: $.justification_fr
template: |
{short_description_en}
UNESCO Justification: {justification_en}
fallback:
source_path: $.short_description_fr
# -------------------------------------------------------------------------
# LOCATION (Nested Object)
# -------------------------------------------------------------------------
locations:
target_path: locations
transform:
function: create_location_from_unesco
description: >-
Create Location object from UNESCO geographic data. Extracts country,
region, coordinates, and attempts GeoNames lookup for city name.
parameters:
country: $.states_iso_code
region: $.region
latitude: $.latitude
longitude: $.longitude
unesco_region: $.unesco_region
output: list # Always return list (even single location)
nested_mappings:
- target_class: Location
fields:
country:
source_path: $.states_iso_code
required: true
validation:
pattern: "^[A-Z]{2}$"
description: "ISO 3166-1 alpha-2 country code"
region:
source_path: $.region
transform:
function: map_unesco_region_to_iso3166_2
description: "Convert UNESCO region name to ISO 3166-2 code if possible"
fallback:
value: null
note: "UNESCO regions do not map 1:1 to ISO 3166-2"
city:
source_path: null # Not provided by UNESCO
transform:
function: reverse_geocode_coordinates
description: >-
Reverse geocode coordinates to find city name using GeoNames API.
Caches results to minimize API calls.
parameters:
latitude: $.latitude
longitude: $.longitude
cache_ttl: 2592000 # 30 days
fallback:
value: null
note: "Geocoding failed or coordinates invalid"
latitude:
source_path: $.latitude
required: true
validation:
type: float
range: [-90, 90]
longitude:
source_path: $.longitude
required: true
validation:
type: float
range: [-180, 180]
geonames_id:
source_path: null
transform:
function: lookup_geonames_id
description: "Query GeoNames API for location ID"
parameters:
latitude: $.latitude
longitude: $.longitude
cache_ttl: 2592000 # 30 days
is_primary:
source_path: null
default: true
description: "UNESCO data provides single location per site"
# -------------------------------------------------------------------------
# IDENTIFIERS (Nested Objects - Array)
# -------------------------------------------------------------------------
identifiers:
target_path: identifiers
transform:
function: create_identifiers_from_unesco
description: >-
Create Identifier objects for UNESCO WHC ID, website URL, and
Wikidata Q-number (if available). Queries Wikidata SPARQL endpoint
to find Q-number via UNESCO WHC ID property (P757).
output: list
nested_mappings:
# UNESCO WHC ID
- target_class: Identifier
condition:
field: $.id
operator: is_not_null
fields:
identifier_scheme:
default: "UNESCO_WHC"
identifier_value:
source_path: $.id
transform:
function: format_unesco_whc_id
description: "Format WHC ID as zero-padded 4-digit string"
example: "0600"
identifier_url:
source_path: $.http_url
fallback:
template: "https://whc.unesco.org/en/list/{id}"
assigned_date:
source_path: $.date_inscribed
transform:
function: parse_year_to_date
description: "Convert year string to ISO 8601 date (YYYY-01-01)"
# Website URL
- target_class: Identifier
condition:
field: $.http_url
operator: is_not_null
fields:
identifier_scheme:
default: "Website"
identifier_value:
source_path: $.http_url
identifier_url:
source_path: $.http_url
# Wikidata Q-number (enrichment)
- target_class: Identifier
transform:
function: enrich_with_wikidata_qnumber
description: >-
Query Wikidata SPARQL endpoint for Q-number using UNESCO WHC ID property (P757).
Caches results to minimize SPARQL queries.
parameters:
whc_id: $.id
cache_ttl: 604800 # 7 days
fallback:
skip: true
note: "Wikidata Q-number not found or enrichment disabled"
fields:
identifier_scheme:
default: "Wikidata"
identifier_value:
source_path: "@enrichment:wikidata_qnumber"
identifier_url:
template: "https://www.wikidata.org/wiki/{identifier_value}"
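The small formatting helpers used by the identifier mappings above (zero-padded WHC ID, year-to-date conversion, and the URL fallback template) can be sketched directly from the spec:

```python
from datetime import date

def format_unesco_whc_id(whc_id: int) -> str:
    """Zero-pad the WHC ID to four digits, e.g. 600 -> '0600'."""
    return f"{whc_id:04d}"

def parse_year_to_date(year: str) -> date:
    """Convert a UNESCO year string ('1991') to an ISO 8601 date (YYYY-01-01)."""
    year = year.strip()
    if not (len(year) == 4 and year.isdigit()):
        raise ValueError(f"Expected a 4-digit year, got {year!r}")
    return date(int(year), 1, 1)

def identifier_url_fallback(whc_id: int) -> str:
    """Fallback template from the mapping when $.http_url is absent."""
    return f"https://whc.unesco.org/en/list/{whc_id}"
```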
# -------------------------------------------------------------------------
# TEMPORAL FIELDS
# -------------------------------------------------------------------------
founded_date:
source_path: $.date_inscribed
target_path: founded_date
transform:
function: parse_unesco_date
description: >-
Parse UNESCO inscription date. Note: This is the UNESCO inscription date,
NOT the original founding date of the institution. Use with caution.
validation:
pattern: "^\\d{4}(-\\d{2}-\\d{2})?$"
output_format: "ISO 8601 date (YYYY-MM-DD)"
note: >-
UNESCO inscription date != founding date. For actual founding dates,
requires external enrichment from Wikidata or institutional sources.
closed_date:
source_path: $.date_end
target_path: closed_date
transform:
function: parse_unesco_date
description: "Parse UNESCO delisting date (if site was removed from list)"
fallback:
value: null
note: "Most sites remain on UNESCO list indefinitely"
# -------------------------------------------------------------------------
# COLLECTIONS (Nested Objects)
# -------------------------------------------------------------------------
collections:
target_path: collections
transform:
function: infer_collections_from_unesco
description: >-
Infer collection metadata from UNESCO criteria and category.
Creates Collection objects based on cultural/natural criteria.
parameters:
category: $.category
criteria_txt: $.criteria_txt
description_en: $.short_description_en
output: list
nested_mappings:
- target_class: Collection
fields:
collection_name:
transform:
function: generate_collection_name
description: "Generate collection name from UNESCO category and criteria"
template: "UNESCO {category} Heritage Collection - {site_name}"
collection_type:
source_path: $.category
transform:
function: map_unesco_category_to_collection_type
mappings:
Cultural: "cultural"
Natural: "natural"
Mixed: "mixed"
subject_areas:
source_path: $.criteria_txt
transform:
function: parse_unesco_criteria
description: >-
Parse UNESCO criteria codes (i-x) and map to subject areas.
Cultural: (i) masterpieces, (ii) exchange, (iii) testimony,
(iv) architecture, (v) settlement, (vi) associations.
Natural: (vii) natural phenomena, (viii) Earth's history,
(ix) ecosystems, (x) biodiversity.
output: list
example:
input: "(i)(ii)(iv)"
output: ["Masterpieces of human creative genius", "Exchange of human values", "Architecture"]
temporal_coverage:
source_path: null
transform:
function: infer_temporal_coverage_from_description
description: >-
Attempt to extract time periods from UNESCO description text.
Uses NER and regex patterns to identify dates/periods.
parameters:
description: $.short_description_en
justification: $.justification_en
fallback:
value: null
note: "Temporal coverage could not be inferred"
extent:
source_path: $.area_hectares
transform:
function: format_area_extent
description: "Format area in hectares as collection extent"
template: "{area_hectares} hectares"
# -------------------------------------------------------------------------
# DIGITAL PLATFORMS (Nested Objects)
# -------------------------------------------------------------------------
digital_platforms:
target_path: digital_platforms
transform:
function: create_digital_platforms_from_unesco
description: >-
Create DigitalPlatform objects for UNESCO website and any linked
institutional websites discovered via URL parsing.
parameters:
http_url: $.http_url
image_url: $.image_url
output: list
nested_mappings:
# UNESCO Website
- target_class: DigitalPlatform
fields:
platform_name:
default: "UNESCO World Heritage Centre"
platform_url:
source_path: $.http_url
platform_type:
default: WEBSITE
metadata_standards:
default: ["DUBLIN_CORE", "SCHEMA_ORG"]
description: "UNESCO uses Dublin Core and Schema.org metadata"
# -------------------------------------------------------------------------
# PROVENANCE (Required)
# -------------------------------------------------------------------------
provenance:
target_path: provenance
transform:
function: create_unesco_provenance
description: "Create Provenance object for UNESCO-sourced data"
output: object
required: true
nested_mappings:
- target_class: Provenance
fields:
data_source:
default: UNESCO_WORLD_HERITAGE
validation:
enum: DataSourceEnum
data_tier:
default: TIER_1_AUTHORITATIVE
description: "UNESCO DataHub is authoritative source"
validation:
enum: DataTierEnum
extraction_date:
source_path: null
transform:
function: current_timestamp_utc
description: "Timestamp when data was extracted from UNESCO API"
output_format: "ISO 8601 datetime with timezone"
extraction_method:
default: "UNESCO DataHub API extraction via LinkML Map transformation"
confidence_score:
source_path: "@computed:institution_type_confidence"
description: "Confidence score from institution type classification"
validation:
type: float
range: [0.0, 1.0]
fallback:
value: 0.85
note: "Default confidence for UNESCO data with inferred institution type"
source_url:
source_path: $.http_url
description: "URL to original UNESCO World Heritage Site page"
verified_date:
source_path: $.date_inscribed
transform:
function: parse_year_to_date
description: "UNESCO inscription date serves as verification date"
verified_by:
default: "UNESCO World Heritage Committee"
description: "UNESCO WHC verifies all inscriptions"
# =============================================================================
# CUSTOM TRANSFORMATION FUNCTIONS (Python Implementation Required)
# =============================================================================
custom_functions:
generate_heritage_custodian_id:
description: "Generate persistent URI for heritage custodian"
implementation: "src/glam_extractor/mappers/unesco_transformers.py:generate_heritage_custodian_id"
signature:
parameters:
- name: whc_id
type: int
- name: country_code
type: str
returns: str
example:
input: {whc_id: 600, country_code: "FR"}
output: "https://w3id.org/heritage/custodian/fr/unesco-600"
generate_ghcid_for_unesco_site:
description: "Generate GHCID from UNESCO site data"
implementation: "src/glam_extractor/ghcid/unesco_ghcid_generator.py:generate_ghcid"
signature:
parameters:
- name: whc_id
type: int
- name: country_code
type: str
- name: site_name
type: str
- name: latitude
type: float
- name: longitude
type: float
- name: institution_type
type: InstitutionTypeEnum
returns: str
notes:
- "Requires GeoNames lookup for UN/LOCODE (city code)"
- "Handles collision resolution via native language name suffix"
- "Caches GeoNames lookups to minimize API calls"
classify_unesco_institution_type:
description: "Classify institution type using composite strategy"
implementation: "src/glam_extractor/classifiers/unesco_institution_type.py:classify"
signature:
parameters:
- name: category
type: str
- name: description_en
type: str
- name: description_fr
type: str
- name: justification_en
type: str
- name: criteria
type: str
- name: site_name
type: str
returns:
type: dict
fields:
- institution_type: InstitutionTypeEnum
- confidence_score: float
- reasoning: str
notes:
- "Uses KeywordClassificationStrategy, RuleBasedStrategy, and CompositeStrategy"
- "Confidence threshold: 0.7 (below this, defaults to MIXED)"
- "See design-patterns.md Pattern 2 for implementation details"
enrich_with_wikidata_qnumber:
description: "Query Wikidata for Q-number via UNESCO WHC ID property (P757)"
implementation: "src/glam_extractor/enrichment/wikidata_enricher.py:get_qnumber_by_whc_id"
signature:
parameters:
- name: whc_id
type: int
returns:
type: Optional[str]
description: "Wikidata Q-number or None if not found"
sparql_query: |
SELECT ?item WHERE {
?item wdt:P757 "{whc_id}" .
}
cache_ttl: 604800 # 7 days
notes:
- "Rate limit: 60 requests/minute to Wikidata SPARQL endpoint"
- "Implement exponential backoff on 429 Too Many Requests"
- "Skip enrichment if offline mode enabled"
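The P757 lookup can be split so its pure parts are testable offline; the SPARQL shape follows the query given above, while the HTTP call itself (with caching and exponential backoff on 429) is deliberately omitted from this sketch:

```python
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_p757_query(whc_id: int) -> str:
    """SPARQL query matching the one given in this spec.

    UNESCO WHC IDs are stored on Wikidata as strings, so the ID is quoted.
    """
    return f'SELECT ?item WHERE {{ ?item wdt:P757 "{whc_id}" . }}'

def extract_qnumber(sparql_json: dict) -> "str | None":
    """Pull the Q-number out of a SPARQL JSON result, or None if absent."""
    bindings = sparql_json.get("results", {}).get("bindings", [])
    if not bindings:
        return None
    item_uri = bindings[0]["item"]["value"]  # e.g. .../entity/Q1440
    return item_uri.rsplit("/", 1)[-1]
```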
reverse_geocode_coordinates:
description: "Reverse geocode coordinates to city name via GeoNames"
implementation: "src/glam_extractor/geocoding/geonames_client.py:reverse_geocode"
signature:
parameters:
- name: latitude
type: float
- name: longitude
type: float
returns:
type: Optional[str]
description: "City name or None if lookup failed"
api_endpoint: "http://api.geonames.org/findNearbyPlaceNameJSON"
cache_ttl: 2592000 # 30 days
rate_limit: "1 request/second (GeoNames free tier)"
notes:
- "Requires GEONAMES_API_KEY environment variable"
- "Free tier: 20,000 requests/day"
- "Aggressive caching essential to stay within limits"
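The GeoNames call can likewise be split into testable pieces. Note that the GeoNames web service authenticates with a `username` query parameter; this sketch reads it from the `GEONAMES_API_KEY` variable named above, which is an assumption about how that variable is used:

```python
import os

GEONAMES_ENDPOINT = "http://api.geonames.org/findNearbyPlaceNameJSON"

def build_geonames_params(latitude: float, longitude: float) -> dict:
    """Query parameters for the findNearbyPlaceNameJSON endpoint."""
    return {
        "lat": latitude,
        "lng": longitude,
        # GeoNames authenticates via username; sourcing it from
        # GEONAMES_API_KEY is an assumption based on this spec.
        "username": os.environ.get("GEONAMES_API_KEY", ""),
    }

def extract_city_name(geonames_json: dict) -> "str | None":
    """First place name from a GeoNames response, or None on empty/error."""
    places = geonames_json.get("geonames", [])
    return places[0].get("name") if places else None
```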
parse_unesco_criteria:
description: "Parse UNESCO criteria codes (i-x) to subject area descriptions"
implementation: "src/glam_extractor/parsers/unesco_criteria_parser.py:parse_criteria"
signature:
parameters:
- name: criteria_txt
type: str
example: "(i)(ii)(iv)"
returns:
type: List[str]
description: "List of criteria descriptions"
criteria_mapping:
"(i)": "Masterpieces of human creative genius"
"(ii)": "Exchange of human values"
"(iii)": "Testimony to cultural tradition"
"(iv)": "Outstanding example of architecture"
"(v)": "Traditional human settlement or land-use"
"(vi)": "Associated with events, traditions, or ideas"
"(vii)": "Superlative natural phenomena or natural beauty"
"(viii)": "Outstanding examples of Earth's history"
"(ix)": "Significant ongoing ecological and biological processes"
"(x)": "Habitats for biodiversity conservation"
reference: "https://whc.unesco.org/en/criteria/"
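The criteria parser is a direct transcription of the mapping table above; the only added behavior (an assumption) is that unknown codes are skipped rather than raising, so a malformed criteria string does not abort the transformation:

```python
import re

CRITERIA_MAPPING = {
    "i": "Masterpieces of human creative genius",
    "ii": "Exchange of human values",
    "iii": "Testimony to cultural tradition",
    "iv": "Outstanding example of architecture",
    "v": "Traditional human settlement or land-use",
    "vi": "Associated with events, traditions, or ideas",
    "vii": "Superlative natural phenomena or natural beauty",
    "viii": "Outstanding examples of Earth's history",
    "ix": "Significant ongoing ecological and biological processes",
    "x": "Habitats for biodiversity conservation",
}

def parse_unesco_criteria(criteria_txt: str) -> list:
    """Parse '(i)(ii)(iv)' into the corresponding criterion descriptions."""
    codes = re.findall(r"\(([ivx]+)\)", criteria_txt.lower())
    # Unknown codes are dropped silently (assumed fail-soft behavior).
    return [CRITERIA_MAPPING[c] for c in codes if c in CRITERIA_MAPPING]
```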
# =============================================================================
# VALIDATION RULES
# =============================================================================
validation:
required_fields:
- id
- name
- institution_type
- locations
- identifiers
- provenance
post_transformation:
- rule: validate_ghcid_format
description: "Ensure GHCID matches specification pattern (collision suffix is the snake_case native-name form, not a Wikidata Q-number)"
pattern: "^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$"
applies_to: ghcid_current
- rule: validate_country_code
description: "Ensure country code is valid ISO 3166-1 alpha-2"
validation_function: "pycountry.countries.get(alpha_2=value)"
applies_to: locations[*].country
- rule: validate_coordinates
description: "Ensure coordinates are within valid ranges"
validation_function: "validate_coordinates_in_range"
applies_to:
- locations[*].latitude
- locations[*].longitude
- rule: validate_confidence_score
description: "Ensure confidence score is between 0.0 and 1.0"
validation_function: "0.0 <= value <= 1.0"
applies_to: provenance.confidence_score
- rule: warn_low_confidence
description: "Warn if institution type confidence < 0.7"
severity: warning
condition: "provenance.confidence_score < 0.7"
action: "log_warning"
message: "Low confidence institution type classification: {institution_type} ({confidence_score})"
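The two pattern/range checks can be sketched as plain predicates. The GHCID regex below uses the snake_case native-name collision suffix described in the GHCID mapping notes earlier in this document; the sample GHCIDs in the tests are hypothetical:

```python
import re

# Collision suffix is the snake_case native-name form described in the GHCID
# notes of this document (not a Wikidata Q-number).
GHCID_PATTERN = re.compile(
    r"^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-[a-z0-9_]+)?$"
)

def validate_ghcid_format(ghcid: str) -> bool:
    """Post-transformation check that a GHCID matches the spec pattern."""
    return GHCID_PATTERN.fullmatch(ghcid) is not None

def validate_coordinates_in_range(latitude: float, longitude: float) -> bool:
    """Check latitude in [-90, 90] and longitude in [-180, 180]."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0
```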
# =============================================================================
# EDGE CASES AND SPECIAL HANDLING
# =============================================================================
edge_cases:
serial_nominations:
description: >-
UNESCO serial nominations represent multiple sites under one WHC ID.
Example: "Primeval Beech Forests of the Carpathians" spans 12 countries.
handling:
- "Detect via $.serial == '1' flag"
- "Query UNESCO API for component sites"
- "Create separate HeritageCustodian record for each component"
- "Link via parent_organization or partnership relationship"
implementation: "src/glam_extractor/parsers/serial_nomination_handler.py"
transboundary_sites:
description: >-
Sites spanning multiple countries (e.g., Mont Blanc, Wadden Sea).
handling:
- "Detect via $.transboundary == '1' flag"
- "Parse $.states field for multiple country codes (comma-separated)"
- "Create Location objects for each country"
- "GHCID generation: Use first country alphabetically, note transboundary in description"
example:
site: "Wadden Sea"
countries: ["NL", "DE", "DK"]
ghcid: "DE-NS-CUX-M-WS" # Uses Germany (alphabetically first among DE/DK/NL)
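The "alphabetically first country" rule for transboundary GHCIDs can be sketched as below; the comma-separated format of `states_iso_code` for transboundary sites is taken from the handling notes above:

```python
def parse_transboundary_countries(states_iso_code: str) -> list:
    """Split a comma-separated ISO code field, e.g. 'NL,DE,DK' -> ['DE','DK','NL'].

    Codes are upper-cased, de-duplicated, and sorted so downstream steps
    are deterministic.
    """
    codes = [c.strip().upper() for c in states_iso_code.split(",") if c.strip()]
    return sorted(set(codes))

def ghcid_country_for_site(states_iso_code: str) -> str:
    """Country code used as the GHCID prefix: alphabetically first country."""
    countries = parse_transboundary_countries(states_iso_code)
    if not countries:
        raise ValueError("No country codes present")
    return countries[0]
```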
sites_with_multiple_institutions:
description: >-
Large UNESCO sites may contain multiple GLAM institutions (e.g., Vatican City).
handling:
- "Primary extraction creates one HeritageCustodian for the UNESCO site itself"
- "Flag for manual enrichment: Add sub_organizations for individual museums/archives within site"
- "Use sub_organizations slot to link related institutions"
example:
site: "Vatican City"
main_record: "Vatican Museums"
sub_organizations:
- "Vatican Apostolic Archive"
- "Vatican Library"
- "Sistine Chapel"
sites_removed_from_list:
description: >-
Sites delisted from UNESCO World Heritage List (rare but possible).
handling:
- "Detect via $.date_end field (not null)"
- "Set organization_status: INACTIVE"
- "Set closed_date: $.date_end"
- "Note delisting in change_history as ChangeEvent(CLOSURE)"
example:
site: "Dresden Elbe Valley (Germany)"
date_end: "2009"
reason: "Construction of Waldschlösschen Bridge"
missing_coordinates:
description: >-
Some UNESCO sites lack precise coordinates (especially large serial nominations).
handling:
- "Attempt geocoding via city/country fallback"
- "If geocoding fails, leave latitude/longitude as null"
- "Mark with provenance.notes: 'Coordinates unavailable from UNESCO, requires manual geocoding'"
- "Flag for manual review"
ambiguous_institution_types:
description: >-
UNESCO sites that are clearly heritage sites but not obviously GLAM custodians.
Example: Historic city centers without specific museum/archive mention.
handling:
- "Default to MIXED with confidence_score < 0.7"
- "Add provenance.notes: 'Institution type inferred, requires verification'"
- "Generate GHCID with 'X' type code (MIXED)"
- "Flag for manual classification review"
# =============================================================================
# TESTING STRATEGY
# =============================================================================
testing:
unit_tests:
- test_direct_field_mapping:
description: "Test simple field mappings (name, country, WHC ID)"
fixtures: 20
coverage: "All direct mappings without custom functions"
- test_custom_transformation_functions:
description: "Test each custom function in isolation"
fixtures: 50
functions:
- generate_heritage_custodian_id
- generate_ghcid_for_unesco_site
- classify_unesco_institution_type
- parse_unesco_criteria
- test_nested_object_creation:
description: "Test Location, Identifier, Collection creation"
fixtures: 30
focus: "Ensure nested objects validate against LinkML schema"
- test_multi_language_handling:
description: "Test extraction of English and French names"
fixtures: 20
edge_cases:
- Only English name available
- Only French name available
- Both names identical
- Names significantly different
integration_tests:
- test_full_transformation_pipeline:
description: "End-to-end test: UNESCO JSON → HeritageCustodian YAML"
fixtures: 20  # golden dataset
validation:
- LinkML schema validation
- GHCID format validation
- Provenance completeness
- No required fields missing
- test_edge_cases:
description: "Test special cases (serial, transboundary, delisted)"
fixtures:
- 3 serial nominations
- 3 transboundary sites
- 1 delisted site
- 5 ambiguous institution types
- test_external_api_integration:
description: "Test GeoNames and Wikidata enrichment"
approach: "Mock external APIs with cached responses"
fixtures: 10
scenarios:
- Successful enrichment
- API rate limit exceeded
- API unavailable (offline mode)
- No results found
property_based_tests:
- test_ghcid_uniqueness:
description: "Ensure GHCIDs are unique within dataset"
strategy: "Generate 1000 UNESCO site transformations, check for collisions"
property: "All GHCIDs unique OR collisions resolved via name suffix"
- test_coordinate_validity:
description: "Ensure all coordinates within valid ranges"
strategy: "Test with random valid/invalid coordinates"
property: "latitude ∈ [-90, 90], longitude ∈ [-180, 180]"
- test_confidence_score_range:
description: "Ensure confidence scores always in [0.0, 1.0]"
strategy: "Test with diverse UNESCO descriptions"
property: "0.0 ≤ confidence_score ≤ 1.0"
# =============================================================================
# PERFORMANCE CONSIDERATIONS
# =============================================================================
performance:
caching_strategy:
geonames_lookups:
ttl: 2592000 # 30 days
rationale: "Place names rarely change"
cache_backend: "SQLite (cache/geonames_cache.db)"
wikidata_enrichment:
ttl: 604800 # 7 days
rationale: "Wikidata updates occasionally, refresh weekly"
cache_backend: "SQLite (cache/wikidata_cache.db)"
unesco_api_responses:
ttl: 86400 # 24 hours
rationale: "UNESCO data changes infrequently"
cache_backend: "requests-cache with SQLite"
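The SQLite-backed caches above can be sketched with a minimal TTL wrapper; the table schema and class name here are illustrative assumptions, not the project's actual cache implementation:

```python
import json
import sqlite3
import time

class TTLCache:
    """Minimal SQLite-backed cache with per-entry TTL (illustrative sketch).

    A shared file such as cache/geonames_cache.db can be opened by multiple
    worker processes; SQLite supports concurrent readers.
    """

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT, expires REAL)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value, expires FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[1] < time.time():
            return None  # missing or expired
        return json.loads(row[0])

    def set(self, key: str, value, ttl: int) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, json.dumps(value), time.time() + ttl),
        )
        self.conn.commit()
```

Usage mirrors the TTLs above, e.g. `cache.set(f"geonames:{lat},{lng}", result, ttl=2592000)` for a 30-day GeoNames entry.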
rate_limiting:
geonames:
limit: "1 request/second"
fallback: "Skip geocoding if rate limit exceeded, continue processing"
wikidata:
limit: "60 requests/minute"
fallback: "Queue requests, process in batches"
parallelization:
approach: "Process UNESCO sites in parallel using multiprocessing"
workers: 4
batch_size: 50
considerations:
- "Share cache across workers (SQLite supports concurrent reads)"
- "Coordinate API rate limiting across workers (use Redis or file lock)"
- "Collect results and merge into single dataset"
estimated_processing_time:
total_sites: 1200
processing_per_site: "2-5 seconds (with external API calls)"
total_time_serial: "40-100 minutes"
total_time_parallel_4_workers: "10-25 minutes"
# =============================================================================
# EXTENSION POINTS
# =============================================================================
extensions:
future_enhancements:
- name: "Machine Learning Classification"
description: "Train ML model on manually classified UNESCO sites"
implementation: "Add MLClassificationStrategy to composite classifier"
benefits: "Improve classification accuracy beyond keyword matching"
- name: "OpenStreetMap Integration"
description: "Enrich location data with OSM polygons and detailed addresses"
implementation: "Query Overpass API for UNESCO site boundaries"
benefits: "More precise geographic data, street addresses"
- name: "Multi-language NLP"
description: "Use spaCy multi-language models for better description parsing"
implementation: "Integrate spaCy NER for extracting temporal coverage and subjects"
benefits: "Better collection metadata inference"
- name: "UNESCO Thesaurus Integration"
description: "Map UNESCO criteria to UNESCO Thesaurus SKOS concepts"
implementation: "SPARQL queries against vocabularies.unesco.org"
benefits: "Richer semantic linking, LOD compatibility"
---
## Implementation Checklist
- [ ] **Day 2-3**: Create `schemas/maps/unesco_to_heritage_custodian.yaml` (this file)
- [ ] **Day 4**: Implement custom transformation functions in `src/glam_extractor/mappers/unesco_transformers.py`
- [ ] **Day 5**: Write unit tests for all custom functions (50+ tests)
- [ ] **Day 6**: Test full transformation pipeline with golden dataset (20 tests)
- [ ] **Day 7**: Handle edge cases (serial, transboundary, delisted sites)
- [ ] **Day 8**: Optimize performance (caching, rate limiting, parallelization)
- [ ] **Day 9-10**: Integration with main GLAM dataset and validation
---
## Related Documentation
- **Dependencies**: `01-dependencies.md` - LinkML Map extension requirements
- **Implementation Phases**: `03-implementation-phases.md` - Day-by-day timeline
- **TDD Strategy**: `04-tdd-strategy.md` - Testing approach
- **Design Patterns**: `05-design-patterns.md` - Classification strategies
- **Master Checklist**: `07-master-checklist.md` - Overall progress tracker
---
**Version**: 1.0
**Date**: 2025-11-09
**Status**: Draft - Ready for Implementation
**Next Steps**: Begin Day 2 implementation (create initial LinkML Map schema file)