glam/docs/plan/unesco/06-linkml-map-schema.md
2025-11-30 23:30:29 +01:00

39 KiB

UNESCO Data Extraction - LinkML Map Schema

Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 06 - LinkML Map Transformation Rules
Version: 1.0
Date: 2025-11-09
Status: Draft


Executive Summary

This document specifies the LinkML Map transformation schema for converting UNESCO World Heritage Site data from the UNESCO DataHub API into LinkML-compliant HeritageCustodian instances. The mapping handles complex data structures, multi-language fields, conditional institution type classification, and identifier extraction.

Key Innovation: Extends LinkML Map with conditional extraction, regex validators, and multi-language field handling to manage UNESCO's rich semantic data.

Schema Location: schemas/maps/unesco_to_heritage_custodian.yaml


UNESCO API Data Structure

Sample UNESCO Site JSON Structure

Based on UNESCO DataHub API response format:

{
  "id": 600,
  "unique_number": 600,
  "id_number": 600,
  "category": "Cultural",
  "name_en": "Banks of the Seine in Paris",
  "name_fr": "Paris, rives de la Seine",
  "short_description_en": "From the Louvre to the Eiffel Tower, from the Place de la Concorde to the Grand and Petit Palais, the evolution of Paris and its history can be seen from the River Seine...",
  "short_description_fr": "Du Louvre à la Tour Eiffel, de la place de la Concorde aux Grand et Petit Palais...",
  "justification_en": "The site demonstrates exceptional urban landscape reflecting key periods in the history of Paris...",
  "states": "France",
  "states_iso_code": "FR",
  "region": "Europe and North America",
  "iso_region": "EUR",
  "transboundary": "0",
  "date_inscribed": "1991",
  "secondary_dates": "",
  "danger": "0",
  "date_end": null,
  "longitude": 2.3376,
  "latitude": 48.8606,
  "area_hectares": 365.0,
  "criteria_txt": "(i)(ii)(iv)",
  "http_url": "https://whc.unesco.org/en/list/600",
  "image_url": "https://whc.unesco.org/uploads/thumbs/site_0600_0001-750-0-20151104113424.jpg",
  "unesco_region": "Europe",
  "serial": "0",
  "extension": "0",
  "revision": "0"
}

Key Observations

  1. Multi-language Fields: name_en, name_fr, short_description_en, short_description_fr
  2. Institution Type Inference: Must infer from category, short_description, and justification
  3. Geographic Data: Provides latitude, longitude, states_iso_code, region
  4. UNESCO WHC ID: Unique identifier in id, unique_number, id_number fields
  5. Temporal Data: date_inscribed (inscription date, not always founding date)
  6. Criteria: Cultural criteria (i)-(vi) and Natural criteria (vii)-(x)
  7. Serial Nominations: Multiple sites under one inscription (indicated by serial: "1")

LinkML Map Schema Design

Schema Structure

The LinkML Map schema will use a hybrid approach:

  1. Direct field mappings for straightforward transformations (name, coordinates, dates)
  2. Conditional transformations for institution type classification
  3. Custom Python functions for complex logic (GHCID generation, Wikidata enrichment)
  4. Multi-step transformations for nested objects (Location, Identifier, Provenance)

Schema File: schemas/maps/unesco_to_heritage_custodian.yaml

id: https://w3id.org/heritage/custodian/maps/unesco
name: unesco-to-heritage-custodian-map
title: UNESCO World Heritage Site to HeritageCustodian Mapping
description: >-
  LinkML Map transformation rules for converting UNESCO DataHub API responses
  to LinkML HeritageCustodian instances. Handles multi-language fields,
  institution type classification, and geographic data normalization.  

version: 1.0.0
license: https://creativecommons.org/publicdomain/zero/1.0/

prefixes:
  linkml: https://w3id.org/linkml/
  heritage: https://w3id.org/heritage/custodian/
  unesco: https://whc.unesco.org/en/list/

imports:
  - linkml:types
  - ../../core
  - ../../enums
  - ../../provenance

# =============================================================================
# MAPPING RULES
# =============================================================================

mappings:
  unesco_site_to_heritage_custodian:
    description: >-
      Transform UNESCO World Heritage Site JSON to HeritageCustodian instance.
      Applies institution type classification, multi-language name extraction,
      and geographic data normalization.      
    
    source_schema: unesco_api_response
    target_schema: heritage_custodian
    
    # -------------------------------------------------------------------------
    # CORE IDENTIFICATION FIELDS
    # -------------------------------------------------------------------------
    
    id:
      source_path: $.id
      target_path: id
      transform:
        function: generate_heritage_custodian_id
        description: >-
          Generate persistent URI identifier using UNESCO WHC ID and country code.
          Format: https://w3id.org/heritage/custodian/{country_code}/unesco-{whc_id}          
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
        example: "https://w3id.org/heritage/custodian/fr/unesco-600"
    
    # GHCID fields (generated via custom function)
    ghcid_current:
      source_path: $.id
      target_path: ghcid_current
      transform:
        function: generate_ghcid_for_unesco_site
        description: >-
          Generate Global Heritage Custodian Identifier (GHCID) using UNESCO site data.
          Format: {ISO-3166-1}-{ISO-3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}[-{native_name_snake_case}]
          Note: Collision suffix uses native language institution name in snake_case (NOT Wikidata Q-numbers).
          See docs/plan/global_glam/07-ghcid-collision-resolution.md for details.          
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
          site_name: $.name_en
          latitude: $.latitude
          longitude: $.longitude
          institution_type: "@computed:institution_type"  # Reference computed value
        fallback:
          # If GHCID cannot be generated, use UNESCO WHC ID
          value: "UNESCO-{whc_id}"
          note: "Fallback GHCID - requires manual review"
    
    ghcid_numeric:
      source_path: ghcid_current
      target_path: ghcid_numeric
      transform:
        function: ghcid_to_numeric_hash
        description: "Generate 64-bit numeric hash from GHCID string (SHA-256 based)"
    
    ghcid_uuid:
      source_path: ghcid_current
      target_path: ghcid_uuid
      transform:
        function: ghcid_to_uuid_v5
        description: "Generate deterministic UUID v5 from GHCID string (SHA-1 based)"
    
    ghcid_uuid_sha256:
      source_path: ghcid_current
      target_path: ghcid_uuid_sha256
      transform:
        function: ghcid_to_uuid_v8
        description: "Generate deterministic UUID v8 from GHCID string (SHA-256 based)"
    
    # -------------------------------------------------------------------------
    # NAME FIELDS (Multi-language handling)
    # -------------------------------------------------------------------------
    
    name:
      source_path: $.name_en
      target_path: name
      transform:
        function: extract_primary_name
        description: >-
          Extract primary name for the institution. Prefers English name,
          falls back to French or first available language.          
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
        fallback:
          source_path: $.name_fr
      required: true
    
    alternative_names:
      target_path: alternative_names
      transform:
        function: extract_alternative_names
        description: >-
          Extract all language variants of the site name. Collects names from
          name_en, name_fr, and other language fields if present.          
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
        output: list
      example:
        - "Banks of the Seine in Paris"
        - "Paris, rives de la Seine"
    
    # -------------------------------------------------------------------------
    # INSTITUTION TYPE CLASSIFICATION (Conditional)
    # -------------------------------------------------------------------------
    
    institution_type:
      target_path: institution_type
      transform:
        function: classify_unesco_institution_type
        description: >-
          Classify institution type based on UNESCO category, description, and keywords.
          Uses multi-strategy approach: keyword matching, rule-based inference, and
          confidence scoring. Defaults to MIXED for ambiguous cases.          
        parameters:
          category: $.category
          description_en: $.short_description_en
          description_fr: $.short_description_fr
          justification_en: $.justification_en
          criteria: $.criteria_txt
          site_name: $.name_en
        strategy: composite  # Use composite classification strategy
        confidence_threshold: 0.7
        default: MIXED
      required: true
      examples:
        - input: {description: "Museum complex with archaeological collections"}
          output: MUSEUM
          confidence: 0.95
        - input: {description: "Historic library and archive"}
          output: LIBRARY
          confidence: 0.85
        - input: {description: "Historic city center"}
          output: MIXED
          confidence: 0.5
    
    # -------------------------------------------------------------------------
    # DESCRIPTION FIELDS
    # -------------------------------------------------------------------------
    
    description:
      source_path: $.short_description_en
      target_path: description
      transform:
        function: merge_descriptions
        description: >-
          Merge short description and justification into single comprehensive
          description. Prefers English, includes French if English unavailable.          
        parameters:
          short_description_en: $.short_description_en
          short_description_fr: $.short_description_fr
          justification_en: $.justification_en
          justification_fr: $.justification_fr
        template: |
          {short_description_en}
          
          UNESCO Justification: {justification_en}          
      fallback:
        source_path: $.short_description_fr
    
    # -------------------------------------------------------------------------
    # LOCATION (Nested Object)
    # -------------------------------------------------------------------------
    
    locations:
      target_path: locations
      transform:
        function: create_location_from_unesco
        description: >-
          Create Location object from UNESCO geographic data. Extracts country,
          region, coordinates, and attempts GeoNames lookup for city name.          
        parameters:
          country: $.states_iso_code
          region: $.region
          latitude: $.latitude
          longitude: $.longitude
          unesco_region: $.unesco_region
        output: list  # Always return list (even single location)
      
      nested_mappings:
        - target_class: Location
          fields:
            country:
              source_path: $.states_iso_code
              required: true
              validation:
                pattern: "^[A-Z]{2}$"
                description: "ISO 3166-1 alpha-2 country code"
            
            region:
              source_path: $.region
              transform:
                function: map_unesco_region_to_iso3166_2
                description: "Convert UNESCO region name to ISO 3166-2 code if possible"
                fallback:
                  value: null
                  note: "UNESCO regions do not map 1:1 to ISO 3166-2"
            
            city:
              source_path: null  # Not provided by UNESCO
              transform:
                function: reverse_geocode_coordinates
                description: >-
                  Reverse geocode coordinates to find city name using GeoNames API.
                  Caches results to minimize API calls.                  
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                cache_ttl: 2592000  # 30 days
                fallback:
                  value: null
                  note: "Geocoding failed or coordinates invalid"
            
            latitude:
              source_path: $.latitude
              required: true
              validation:
                type: float
                range: [-90, 90]
            
            longitude:
              source_path: $.longitude
              required: true
              validation:
                type: float
                range: [-180, 180]
            
            geonames_id:
              source_path: null
              transform:
                function: lookup_geonames_id
                description: "Query GeoNames API for location ID"
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                cache_ttl: 2592000  # 30 days
            
            is_primary:
              source_path: null
              default: true
              description: "UNESCO data provides single location per site"
    
    # -------------------------------------------------------------------------
    # IDENTIFIERS (Nested Objects - Array)
    # -------------------------------------------------------------------------
    
    identifiers:
      target_path: identifiers
      transform:
        function: create_identifiers_from_unesco
        description: >-
          Create Identifier objects for UNESCO WHC ID, website URL, and
          Wikidata Q-number (if available). Queries Wikidata SPARQL endpoint
          to find Q-number via UNESCO WHC ID property (P757).          
        output: list
      
      nested_mappings:
        # UNESCO WHC ID
        - target_class: Identifier
          condition:
            field: $.id
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "UNESCO_WHC"
            identifier_value:
              source_path: $.id
              transform:
                function: format_unesco_whc_id
                description: "Format WHC ID as zero-padded 4-digit string"
                example: "0600"
            identifier_url:
              source_path: $.http_url
              fallback:
                template: "https://whc.unesco.org/en/list/{id}"
            assigned_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "Convert year string to ISO 8601 date (YYYY-01-01)"
        
        # Website URL
        - target_class: Identifier
          condition:
            field: $.http_url
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "Website"
            identifier_value:
              source_path: $.http_url
            identifier_url:
              source_path: $.http_url
        
        # Wikidata Q-number (enrichment)
        - target_class: Identifier
          transform:
            function: enrich_with_wikidata_qnumber
            description: >-
              Query Wikidata SPARQL endpoint for Q-number using UNESCO WHC ID property (P757).
              Caches results to minimize SPARQL queries.              
            parameters:
              whc_id: $.id
            cache_ttl: 604800  # 7 days
            fallback:
              skip: true
              note: "Wikidata Q-number not found or enrichment disabled"
          fields:
            identifier_scheme:
              default: "Wikidata"
            identifier_value:
              source_path: "@enrichment:wikidata_qnumber"
            identifier_url:
              template: "https://www.wikidata.org/wiki/{identifier_value}"
    
    # -------------------------------------------------------------------------
    # TEMPORAL FIELDS
    # -------------------------------------------------------------------------
    
    founded_date:
      source_path: $.date_inscribed
      target_path: founded_date
      transform:
        function: parse_unesco_date
        description: >-
          Parse UNESCO inscription date. Note: This is the UNESCO inscription date,
          NOT the original founding date of the institution. Use with caution.          
        validation:
          pattern: "^\\d{4}(-\\d{2}-\\d{2})?$"
        output_format: "ISO 8601 date (YYYY-MM-DD)"
      note: >-
        UNESCO inscription date != founding date. For actual founding dates,
        requires external enrichment from Wikidata or institutional sources.        
    
    closed_date:
      source_path: $.date_end
      target_path: closed_date
      transform:
        function: parse_unesco_date
        description: "Parse UNESCO delisting date (if site was removed from list)"
      fallback:
        value: null
        note: "Most sites remain on UNESCO list indefinitely"
    
    # -------------------------------------------------------------------------
    # COLLECTIONS (Nested Objects)
    # -------------------------------------------------------------------------
    
    collections:
      target_path: collections
      transform:
        function: infer_collections_from_unesco
        description: >-
          Infer collection metadata from UNESCO criteria and category.
          Creates Collection objects based on cultural/natural criteria.          
        parameters:
          category: $.category
          criteria_txt: $.criteria_txt
          description_en: $.short_description_en
        output: list
      
      nested_mappings:
        - target_class: Collection
          fields:
            collection_name:
              transform:
                function: generate_collection_name
                description: "Generate collection name from UNESCO category and criteria"
                template: "UNESCO {category} Heritage Collection - {site_name}"
            
            collection_type:
              source_path: $.category
              transform:
                function: map_unesco_category_to_collection_type
                mappings:
                  Cultural: "cultural"
                  Natural: "natural"
                  Mixed: "mixed"
            
            subject_areas:
              source_path: $.criteria_txt
              transform:
                function: parse_unesco_criteria
                description: >-
                  Parse UNESCO criteria codes (i-x) and map to subject areas.
                  Cultural: (i) masterpieces, (ii) exchange, (iii) testimony,
                  (iv) architecture, (v) settlement, (vi) associations.
                  Natural: (vii) natural phenomena, (viii) earth's history,
                  (ix) ecosystems, (x) biodiversity.                  
                output: list
              example:
                input: "(i)(ii)(iv)"
                output: ["Masterpieces of human creative genius", "Exchange of human values", "Architecture"]
            
            temporal_coverage:
              source_path: null
              transform:
                function: infer_temporal_coverage_from_description
                description: >-
                  Attempt to extract time periods from UNESCO description text.
                  Uses NER and regex patterns to identify dates/periods.                  
                parameters:
                  description: $.short_description_en
                  justification: $.justification_en
                fallback:
                  value: null
                  note: "Temporal coverage could not be inferred"
            
            extent:
              source_path: $.area_hectares
              transform:
                function: format_area_extent
                description: "Format area in hectares as collection extent"
                template: "{area_hectares} hectares"
    
    # -------------------------------------------------------------------------
    # DIGITAL PLATFORMS (Nested Objects)
    # -------------------------------------------------------------------------
    
    digital_platforms:
      target_path: digital_platforms
      transform:
        function: create_digital_platforms_from_unesco
        description: >-
          Create DigitalPlatform objects for UNESCO website and any linked
          institutional websites discovered via URL parsing.          
        parameters:
          http_url: $.http_url
          image_url: $.image_url
        output: list
      
      nested_mappings:
        # UNESCO Website
        - target_class: DigitalPlatform
          fields:
            platform_name:
              default: "UNESCO World Heritage Centre"
            platform_url:
              source_path: $.http_url
            platform_type:
              default: WEBSITE
            metadata_standards:
              default: ["DUBLIN_CORE", "SCHEMA_ORG"]
              description: "UNESCO uses Dublin Core and Schema.org metadata"
    
    # -------------------------------------------------------------------------
    # PROVENANCE (Required)
    # -------------------------------------------------------------------------
    
    provenance:
      target_path: provenance
      transform:
        function: create_unesco_provenance
        description: "Create Provenance object for UNESCO-sourced data"
        output: object
      required: true
      
      nested_mappings:
        - target_class: Provenance
          fields:
            data_source:
              default: UNESCO_WORLD_HERITAGE
              validation:
                enum: DataSourceEnum
            
            data_tier:
              default: TIER_1_AUTHORITATIVE
              description: "UNESCO DataHub is authoritative source"
              validation:
                enum: DataTierEnum
            
            extraction_date:
              source_path: null
              transform:
                function: current_timestamp_utc
                description: "Timestamp when data was extracted from UNESCO API"
                output_format: "ISO 8601 datetime with timezone"
            
            extraction_method:
              default: "UNESCO DataHub API extraction via LinkML Map transformation"
            
            confidence_score:
              source_path: "@computed:institution_type_confidence"
              description: "Confidence score from institution type classification"
              validation:
                type: float
                range: [0.0, 1.0]
              fallback:
                value: 0.85
                note: "Default confidence for UNESCO data with inferred institution type"
            
            source_url:
              source_path: $.http_url
              description: "URL to original UNESCO World Heritage Site page"
            
            verified_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "UNESCO inscription date serves as verification date"
            
            verified_by:
              default: "UNESCO World Heritage Committee"
              description: "UNESCO WHC verifies all inscriptions"

# =============================================================================
# CUSTOM TRANSFORMATION FUNCTIONS (Python Implementation Required)
# =============================================================================

custom_functions:
  generate_heritage_custodian_id:
    description: "Generate persistent URI for heritage custodian"
    implementation: "src/glam_extractor/mappers/unesco_transformers.py:generate_heritage_custodian_id"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
      returns: str
    example:
      input: {whc_id: 600, country_code: "FR"}
      output: "https://w3id.org/heritage/custodian/fr/unesco-600"
  
  generate_ghcid_for_unesco_site:
    description: "Generate GHCID from UNESCO site data"
    implementation: "src/glam_extractor/ghcid/unesco_ghcid_generator.py:generate_ghcid"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
        - name: site_name
          type: str
        - name: latitude
          type: float
        - name: longitude
          type: float
        - name: institution_type
          type: InstitutionTypeEnum
      returns: str
    notes:
      - "Requires GeoNames lookup for UN/LOCODE (city code)"
      - "Handles collision resolution via native language name suffix"
      - "Caches GeoNames lookups to minimize API calls"
  
  classify_unesco_institution_type:
    description: "Classify institution type using composite strategy"
    implementation: "src/glam_extractor/classifiers/unesco_institution_type.py:classify"
    signature:
      parameters:
        - name: category
          type: str
        - name: description_en
          type: str
        - name: description_fr
          type: str
        - name: justification_en
          type: str
        - name: criteria
          type: str
        - name: site_name
          type: str
      returns:
        type: dict
        fields:
          - institution_type: InstitutionTypeEnum
          - confidence_score: float
          - reasoning: str
    notes:
      - "Uses KeywordClassificationStrategy, RuleBasedStrategy, and CompositeStrategy"
      - "Confidence threshold: 0.7 (below this, defaults to MIXED)"
      - "See design-patterns.md Pattern 2 for implementation details"
  
  enrich_with_wikidata_qnumber:
    description: "Query Wikidata for Q-number via UNESCO WHC ID property (P757)"
    implementation: "src/glam_extractor/enrichment/wikidata_enricher.py:get_qnumber_by_whc_id"
    signature:
      parameters:
        - name: whc_id
          type: int
      returns:
        type: Optional[str]
        description: "Wikidata Q-number or None if not found"
    sparql_query: |
      SELECT ?item WHERE {
        ?item wdt:P757 "{whc_id}" .
      }      
    cache_ttl: 604800  # 7 days
    notes:
      - "Rate limit: 60 requests/minute to Wikidata SPARQL endpoint"
      - "Implement exponential backoff on 429 Too Many Requests"
      - "Skip enrichment if offline mode enabled"
  
  reverse_geocode_coordinates:
    description: "Reverse geocode coordinates to city name via GeoNames"
    implementation: "src/glam_extractor/geocoding/geonames_client.py:reverse_geocode"
    signature:
      parameters:
        - name: latitude
          type: float
        - name: longitude
          type: float
      returns:
        type: Optional[str]
        description: "City name or None if lookup failed"
    api_endpoint: "http://api.geonames.org/findNearbyPlaceNameJSON"
    cache_ttl: 2592000  # 30 days
    rate_limit: "1 request/second (GeoNames free tier)"
    notes:
      - "Requires GEONAMES_API_KEY environment variable"
      - "Free tier: 20,000 requests/day"
      - "Aggressive caching essential to stay within limits"
  
  parse_unesco_criteria:
    description: "Parse UNESCO criteria codes (i-x) to subject area descriptions"
    implementation: "src/glam_extractor/parsers/unesco_criteria_parser.py:parse_criteria"
    signature:
      parameters:
        - name: criteria_txt
          type: str
          example: "(i)(ii)(iv)"
      returns:
        type: List[str]
        description: "List of criteria descriptions"
    criteria_mapping:
      "(i)": "Masterpieces of human creative genius"
      "(ii)": "Exchange of human values"
      "(iii)": "Testimony to cultural tradition"
      "(iv)": "Outstanding example of architecture"
      "(v)": "Traditional human settlement or land-use"
      "(vi)": "Associated with events, traditions, or ideas"
      "(vii)": "Superlative natural phenomena or natural beauty"
      "(viii)": "Outstanding examples of Earth's history"
      "(ix)": "Significant ongoing ecological and biological processes"
      "(x)": "Habitats for biodiversity conservation"
    reference: "https://whc.unesco.org/en/criteria/"

# =============================================================================
# VALIDATION RULES
# =============================================================================

validation:
  required_fields:
    - id
    - name
    - institution_type
    - locations
    - identifiers
    - provenance
  
  post_transformation:
    - rule: validate_ghcid_format
      description: "Ensure GHCID matches specification pattern"
      pattern: "^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
      applies_to: ghcid_current
    
    - rule: validate_country_code
      description: "Ensure country code is valid ISO 3166-1 alpha-2"
      validation_function: "pycountry.countries.get(alpha_2=value)"
      applies_to: locations[*].country
    
    - rule: validate_coordinates
      description: "Ensure coordinates are within valid ranges"
      validation_function: "validate_coordinates_in_range"
      applies_to:
        - locations[*].latitude
        - locations[*].longitude
    
    - rule: validate_confidence_score
      description: "Ensure confidence score is between 0.0 and 1.0"
      validation_function: "0.0 <= value <= 1.0"
      applies_to: provenance.confidence_score
    
    - rule: warn_low_confidence
      description: "Warn if institution type confidence < 0.7"
      severity: warning
      condition: "provenance.confidence_score < 0.7"
      action: "log_warning"
      message: "Low confidence institution type classification: {institution_type} ({confidence_score})"

# =============================================================================
# EDGE CASES AND SPECIAL HANDLING
# =============================================================================

edge_cases:
  serial_nominations:
    description: >-
      UNESCO serial nominations represent multiple sites under one WHC ID.
      Example: "Primeval Beech Forests of the Carpathians" spans 12 countries.      
    handling:
      - "Detect via $.serial == '1' flag"
      - "Query UNESCO API for component sites"
      - "Create separate HeritageCustodian record for each component"
      - "Link via parent_organization or partnership relationship"
    implementation: "src/glam_extractor/parsers/serial_nomination_handler.py"
  
  transboundary_sites:
    description: >-
      Sites spanning multiple countries (e.g., Mont Blanc, Wadden Sea).      
    handling:
      - "Detect via $.transboundary == '1' flag"
      - "Parse $.states field for multiple country codes (comma-separated)"
      - "Create Location objects for each country"
      - "GHCID generation: Use first country alphabetically, note transboundary in description"
    example:
      site: "Wadden Sea"
      countries: ["NL", "DE", "DK"]
      ghcid: "DE-NS-CUX-M-WS"  # Uses Germany (alphabetically first among DE/DK/NL)
  
  sites_with_multiple_institutions:
    description: >-
      Large UNESCO sites may contain multiple GLAM institutions (e.g., Vatican City).      
    handling:
      - "Primary extraction creates one HeritageCustodian for the UNESCO site itself"
      - "Flag for manual enrichment: Add sub_organizations for individual museums/archives within site"
      - "Use sub_organizations slot to link related institutions"
    example:
      site: "Vatican City"
      main_record: "Vatican Museums"
      sub_organizations:
        - "Vatican Apostolic Archive"
        - "Vatican Library"
        - "Sistine Chapel"
  
  sites_removed_from_list:
    description: >-
      Sites delisted from UNESCO World Heritage List (rare but possible).      
    handling:
      - "Detect via $.date_end field (not null)"
      - "Set organization_status: INACTIVE"
      - "Set closed_date: $.date_end"
      - "Note delisting in change_history as ChangeEvent(CLOSURE)"
    example:
      site: "Dresden Elbe Valley (Germany)"
      date_end: "2009"
      reason: "Construction of Waldschlösschen Bridge"
  
  missing_coordinates:
    description: >-
      Some UNESCO sites lack precise coordinates (especially large serial nominations).      
    handling:
      - "Attempt geocoding via city/country fallback"
      - "If geocoding fails, leave latitude/longitude as null"
      - "Mark with provenance.notes: 'Coordinates unavailable from UNESCO, requires manual geocoding'"
      - "Flag for manual review"
  
  ambiguous_institution_types:
    description: >-
      UNESCO sites that are clearly heritage sites but not obviously GLAM custodians.
      Example: Historic city centers without specific museum/archive mention.      
    handling:
      - "Default to MIXED with confidence_score < 0.7"
      - "Add provenance.notes: 'Institution type inferred, requires verification'"
      - "Generate GHCID with 'X' type code (MIXED)"
      - "Flag for manual classification review"

# =============================================================================
# TESTING STRATEGY
# =============================================================================

testing:
  unit_tests:
    - test_direct_field_mapping:
        description: "Test simple field mappings (name, country, WHC ID)"
        fixtures: 20
        coverage: "All direct mappings without custom functions"
    
    - test_custom_transformation_functions:
        description: "Test each custom function in isolation"
        fixtures: 50
        functions:
          - generate_heritage_custodian_id
          - generate_ghcid_for_unesco_site
          - classify_unesco_institution_type
          - parse_unesco_criteria
    
    - test_nested_object_creation:
        description: "Test Location, Identifier, Collection creation"
        fixtures: 30
        focus: "Ensure nested objects validate against LinkML schema"
    
    - test_multi_language_handling:
        description: "Test extraction of English and French names"
        fixtures: 20
        edge_cases:
          - Only English name available
          - Only French name available
          - Both names identical
          - Names significantly different
  
  integration_tests:
    - test_full_transformation_pipeline:
        description: "End-to-end test: UNESCO JSON → HeritageCustodian YAML"
        fixtures: 20 (golden dataset)
        validation:
          - LinkML schema validation
          - GHCID format validation
          - Provenance completeness
          - No required fields missing
    
    - test_edge_cases:
        description: "Test special cases (serial, transboundary, delisted)"
        fixtures:
          - 3 serial nominations
          - 3 transboundary sites
          - 1 delisted site
          - 5 ambiguous institution types
    
    - test_external_api_integration:
        description: "Test GeoNames and Wikidata enrichment"
        approach: "Mock external APIs with cached responses"
        fixtures: 10
        scenarios:
          - Successful enrichment
          - API rate limit exceeded
          - API unavailable (offline mode)
          - No results found
  
  property_based_tests:
    - test_ghcid_uniqueness:
        description: "Ensure GHCIDs are unique within dataset"
        strategy: "Generate 1000 UNESCO site transformations, check for collisions"
        property: "All GHCIDs unique OR collisions resolved via name suffix"
    
    - test_coordinate_validity:
        description: "Ensure all coordinates within valid ranges"
        strategy: "Test with random valid/invalid coordinates"
        property: "latitude ∈ [-90, 90], longitude ∈ [-180, 180]"
    
    - test_confidence_score_range:
        description: "Ensure confidence scores always in [0.0, 1.0]"
        strategy: "Test with diverse UNESCO descriptions"
        property: "0.0 ≤ confidence_score ≤ 1.0"

# =============================================================================
# PERFORMANCE CONSIDERATIONS
# =============================================================================

performance:
  caching_strategy:
    geonames_lookups:
      ttl: 2592000  # 30 days
      rationale: "Place names rarely change"
      cache_backend: "SQLite (cache/geonames_cache.db)"
    
    wikidata_enrichment:
      ttl: 604800  # 7 days
      rationale: "Wikidata updates occasionally, refresh weekly"
      cache_backend: "SQLite (cache/wikidata_cache.db)"
    
    unesco_api_responses:
      ttl: 86400  # 24 hours
      rationale: "UNESCO data changes infrequently"
      cache_backend: "requests-cache with SQLite"
  
  rate_limiting:
    geonames:
      limit: "1 request/second"
      fallback: "Skip geocoding if rate limit exceeded, continue processing"
    
    wikidata:
      limit: "60 requests/minute"
      fallback: "Queue requests, process in batches"
  
  parallelization:
    approach: "Process UNESCO sites in parallel using multiprocessing"
    workers: 4
    batch_size: 50
    considerations:
      - "Share cache across workers (SQLite supports concurrent reads)"
      - "Coordinate API rate limiting across workers (use Redis or file lock)"
      - "Collect results and merge into single dataset"
  
  estimated_processing_time:
    total_sites: 1200
    processing_per_site: "2-5 seconds (with external API calls)"
    total_time_serial: "40-100 minutes"
    total_time_parallel_4_workers: "10-25 minutes"

# =============================================================================
# EXTENSION POINTS
# =============================================================================

extensions:
  future_enhancements:
    - name: "Machine Learning Classification"
      description: "Train ML model on manually classified UNESCO sites"
      implementation: "Add MLClassificationStrategy to composite classifier"
      benefits: "Improve classification accuracy beyond keyword matching"
    
    - name: "OpenStreetMap Integration"
      description: "Enrich location data with OSM polygons and detailed addresses"
      implementation: "Query Overpass API for UNESCO site boundaries"
      benefits: "More precise geographic data, street addresses"
    
    - name: "Multi-language NLP"
      description: "Use spaCy multi-language models for better description parsing"
      implementation: "Integrate spaCy NER for extracting temporal coverage and subjects"
      benefits: "Better collection metadata inference"
    
    - name: "UNESCO Thesaurus Integration"
      description: "Map UNESCO criteria to UNESCO Thesaurus SKOS concepts"
      implementation: "SPARQL queries against vocabularies.unesco.org"
      benefits: "Richer semantic linking, LOD compatibility"

---

## Implementation Checklist

- [ ] **Day 2-3**: Create `schemas/maps/unesco_to_heritage_custodian.yaml` (this file)
- [ ] **Day 4**: Implement custom transformation functions in `src/glam_extractor/mappers/unesco_transformers.py`
- [ ] **Day 5**: Write unit tests for all custom functions (50+ tests)
- [ ] **Day 6**: Test full transformation pipeline with golden dataset (20 tests)
- [ ] **Day 7**: Handle edge cases (serial, transboundary, delisted sites)
- [ ] **Day 8**: Optimize performance (caching, rate limiting, parallelization)
- [ ] **Day 9-10**: Integration with main GLAM dataset and validation

---

## Related Documentation

- **Dependencies**: `01-dependencies.md` - LinkML Map extension requirements
- **Implementation Phases**: `03-implementation-phases.md` - Day-by-day timeline
- **TDD Strategy**: `04-tdd-strategy.md` - Testing approach
- **Design Patterns**: `05-design-patterns.md` - Classification strategies
- **Master Checklist**: `07-master-checklist.md` - Overall progress tracker

---

**Version**: 1.0  
**Date**: 2025-11-09  
**Status**: Draft - Ready for Implementation  
**Next Steps**: Begin Day 2 implementation (create initial LinkML Map schema file)