# UNESCO Data Extraction - LinkML Map Schema

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
**Document**: 06 - LinkML Map Transformation Rules
**Version**: 1.0
**Date**: 2025-11-09
**Status**: Draft

---

## Executive Summary

This document specifies the **LinkML Map transformation schema** for converting UNESCO World Heritage Site data from the UNESCO DataHub API into LinkML-compliant `HeritageCustodian` instances. The mapping handles complex data structures, multi-language fields, conditional institution type classification, and identifier extraction.

**Key Innovation**: Extends LinkML Map with conditional extraction, regex validators, and multi-language field handling to manage UNESCO's rich semantic data.

**Schema Location**: `schemas/maps/unesco_to_heritage_custodian.yaml`

---

## UNESCO API Data Structure

### Sample UNESCO Site JSON Structure

Based on the UNESCO DataHub API response format:

```json
{
  "id": 600,
  "unique_number": 600,
  "id_number": 600,
  "category": "Cultural",
  "name_en": "Banks of the Seine in Paris",
  "name_fr": "Paris, rives de la Seine",
  "short_description_en": "From the Louvre to the Eiffel Tower, from the Place de la Concorde to the Grand and Petit Palais, the evolution of Paris and its history can be seen from the River Seine...",
  "short_description_fr": "Du Louvre à la Tour Eiffel, de la place de la Concorde aux Grand et Petit Palais...",
  "justification_en": "The site demonstrates exceptional urban landscape reflecting key periods in the history of Paris...",
  "states": "France",
  "states_iso_code": "FR",
  "region": "Europe and North America",
  "iso_region": "EUR",
  "transboundary": "0",
  "date_inscribed": "1991",
  "secondary_dates": "",
  "danger": "0",
  "date_end": null,
  "longitude": 2.3376,
  "latitude": 48.8606,
  "area_hectares": 365.0,
  "criteria_txt": "(i)(ii)(iv)",
  "http_url": "https://whc.unesco.org/en/list/600",
  "image_url": "https://whc.unesco.org/uploads/thumbs/site_0600_0001-750-0-20151104113424.jpg",
  "unesco_region": "Europe",
  "serial": "0",
  "extension": "0",
  "revision": "0"
}
```

### Key Observations

1. **Multi-language fields**: `name_en`, `name_fr`, `short_description_en`, `short_description_fr`
2. **Institution type inference**: must be inferred from `category`, `short_description`, and `justification`
3. **Geographic data**: provides `latitude`, `longitude`, `states_iso_code`, `region`
4. **UNESCO WHC ID**: unique identifier duplicated across the `id`, `unique_number`, and `id_number` fields
5. **Temporal data**: `date_inscribed` is the inscription date, not necessarily the founding date
6. **Criteria**: cultural criteria `(i)-(vi)` and natural criteria `(vii)-(x)`
7. **Serial nominations**: multiple component sites under one inscription, indicated by `serial: "1"`

---

## LinkML Map Schema Design

### Schema Structure

The LinkML Map schema uses a **hybrid approach**:

1. **Direct field mappings** for straightforward transformations (name, coordinates, dates)
2. **Conditional transformations** for institution type classification
3. **Custom Python functions** for complex logic (GHCID generation, Wikidata enrichment)
4. **Multi-step transformations** for nested objects (Location, Identifier, Provenance)

### Schema File: `schemas/maps/unesco_to_heritage_custodian.yaml`

```yaml
id: https://w3id.org/heritage/custodian/maps/unesco
name: unesco-to-heritage-custodian-map
title: UNESCO World Heritage Site to HeritageCustodian Mapping
description: >-
  LinkML Map transformation rules for converting UNESCO DataHub API responses
  to LinkML HeritageCustodian instances. Handles multi-language fields,
  institution type classification, and geographic data normalization.
version: 1.0.0
license: https://creativecommons.org/publicdomain/zero/1.0/

prefixes:
  linkml: https://w3id.org/linkml/
  heritage: https://w3id.org/heritage/custodian/
  unesco: https://whc.unesco.org/en/list/

imports:
  - linkml:types
  - ../../core
  - ../../enums
  - ../../provenance

# =============================================================================
# MAPPING RULES
# =============================================================================

mappings:
  unesco_site_to_heritage_custodian:
    description: >-
      Transform UNESCO World Heritage Site JSON to HeritageCustodian instance.
      Applies institution type classification, multi-language name extraction,
      and geographic data normalization.
    source_schema: unesco_api_response
    target_schema: heritage_custodian

    # -------------------------------------------------------------------------
    # CORE IDENTIFICATION FIELDS
    # -------------------------------------------------------------------------

    id:
      source_path: $.id
      target_path: id
      transform:
        function: generate_heritage_custodian_id
        description: >-
          Generate persistent URI identifier using UNESCO WHC ID and country code.
          Format: https://w3id.org/heritage/custodian/{country_code}/unesco-{whc_id}
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
        example: "https://w3id.org/heritage/custodian/fr/unesco-600"

    # GHCID fields (generated via custom function)
    ghcid_current:
      source_path: $.id
      target_path: ghcid_current
      transform:
        function: generate_ghcid_for_unesco_site
        description: >-
          Generate Global Heritage Custodian Identifier (GHCID) using UNESCO
          site data.
          Format: {ISO-3166-1}-{ISO-3166-2}-{UN/LOCODE}-{Type}-{Abbreviation}[-Q{WikidataID}]
        parameters:
          whc_id: $.id
          country_code: $.states_iso_code
          site_name: $.name_en
          latitude: $.latitude
          longitude: $.longitude
          institution_type: "@computed:institution_type"  # Reference computed value
        fallback:
          # If GHCID cannot be generated, use UNESCO WHC ID
          value: "UNESCO-{whc_id}"
          note: "Fallback GHCID - requires manual review"

    ghcid_numeric:
      source_path: ghcid_current
      target_path: ghcid_numeric
      transform:
        function: ghcid_to_numeric_hash
        description: "Generate 64-bit numeric hash from GHCID string (SHA-256 based)"

    ghcid_uuid:
      source_path: ghcid_current
      target_path: ghcid_uuid
      transform:
        function: ghcid_to_uuid_v5
        description: "Generate deterministic UUID v5 from GHCID string (SHA-1 based)"

    ghcid_uuid_sha256:
      source_path: ghcid_current
      target_path: ghcid_uuid_sha256
      transform:
        function: ghcid_to_uuid_v8
        description: "Generate deterministic UUID v8 from GHCID string (SHA-256 based)"

    # -------------------------------------------------------------------------
    # NAME FIELDS (Multi-language handling)
    # -------------------------------------------------------------------------

    name:
      source_path: $.name_en
      target_path: name
      transform:
        function: extract_primary_name
        description: >-
          Extract primary name for the institution. Prefers English name,
          falls back to French or first available language.
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
      fallback:
        source_path: $.name_fr
      required: true

    alternative_names:
      target_path: alternative_names
      transform:
        function: extract_alternative_names
        description: >-
          Extract all language variants of the site name. Collects names from
          name_en, name_fr, and other language fields if present.
        parameters:
          name_en: $.name_en
          name_fr: $.name_fr
      output: list
      example:
        - "Banks of the Seine in Paris"
        - "Paris, rives de la Seine"

    # -------------------------------------------------------------------------
    # INSTITUTION TYPE CLASSIFICATION (Conditional)
    # -------------------------------------------------------------------------

    institution_type:
      target_path: institution_type
      transform:
        function: classify_unesco_institution_type
        description: >-
          Classify institution type based on UNESCO category, description, and
          keywords. Uses multi-strategy approach: keyword matching, rule-based
          inference, and confidence scoring. Defaults to MIXED for ambiguous cases.
        parameters:
          category: $.category
          description_en: $.short_description_en
          description_fr: $.short_description_fr
          justification_en: $.justification_en
          criteria: $.criteria_txt
          site_name: $.name_en
        strategy: composite  # Use composite classification strategy
        confidence_threshold: 0.7
        default: MIXED
      required: true
      examples:
        - input: {description: "Museum complex with archaeological collections"}
          output: MUSEUM
          confidence: 0.95
        - input: {description: "Historic library and archive"}
          output: LIBRARY
          confidence: 0.85
        - input: {description: "Historic city center"}
          output: MIXED
          confidence: 0.5

    # -------------------------------------------------------------------------
    # DESCRIPTION FIELDS
    # -------------------------------------------------------------------------

    description:
      source_path: $.short_description_en
      target_path: description
      transform:
        function: merge_descriptions
        description: >-
          Merge short description and justification into single comprehensive
          description. Prefers English, includes French if English unavailable.
        parameters:
          short_description_en: $.short_description_en
          short_description_fr: $.short_description_fr
          justification_en: $.justification_en
          justification_fr: $.justification_fr
        template: |
          {short_description_en}

          UNESCO Justification: {justification_en}
      fallback:
        source_path: $.short_description_fr

    # -------------------------------------------------------------------------
    # LOCATION (Nested Object)
    # -------------------------------------------------------------------------

    locations:
      target_path: locations
      transform:
        function: create_location_from_unesco
        description: >-
          Create Location object from UNESCO geographic data. Extracts country,
          region, coordinates, and attempts GeoNames lookup for city name.
        parameters:
          country: $.states_iso_code
          region: $.region
          latitude: $.latitude
          longitude: $.longitude
          unesco_region: $.unesco_region
      output: list  # Always return list (even single location)
      nested_mappings:
        - target_class: Location
          fields:
            country:
              source_path: $.states_iso_code
              required: true
              validation:
                pattern: "^[A-Z]{2}$"
                description: "ISO 3166-1 alpha-2 country code"
            region:
              source_path: $.region
              transform:
                function: map_unesco_region_to_iso3166_2
                description: "Convert UNESCO region name to ISO 3166-2 code if possible"
              fallback:
                value: null
                note: "UNESCO regions do not map 1:1 to ISO 3166-2"
            city:
              source_path: null  # Not provided by UNESCO
              transform:
                function: reverse_geocode_coordinates
                description: >-
                  Reverse geocode coordinates to find city name using GeoNames
                  API. Caches results to minimize API calls.
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                  cache_ttl: 2592000  # 30 days
              fallback:
                value: null
                note: "Geocoding failed or coordinates invalid"
            latitude:
              source_path: $.latitude
              required: true
              validation:
                type: float
                range: [-90, 90]
            longitude:
              source_path: $.longitude
              required: true
              validation:
                type: float
                range: [-180, 180]
            geonames_id:
              source_path: null
              transform:
                function: lookup_geonames_id
                description: "Query GeoNames API for location ID"
                parameters:
                  latitude: $.latitude
                  longitude: $.longitude
                  cache_ttl: 2592000  # 30 days
            is_primary:
              source_path: null
              default: true
              description: "UNESCO data provides single location per site"

    # -------------------------------------------------------------------------
    # IDENTIFIERS (Nested Objects - Array)
    # -------------------------------------------------------------------------

    identifiers:
      target_path: identifiers
      transform:
        function: create_identifiers_from_unesco
        description: >-
          Create Identifier objects for UNESCO WHC ID, website URL, and
          Wikidata Q-number (if available). Queries Wikidata SPARQL endpoint
          to find Q-number via UNESCO WHC ID property (P757).
      output: list
      nested_mappings:
        # UNESCO WHC ID
        - target_class: Identifier
          condition:
            field: $.id
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "UNESCO_WHC"
            identifier_value:
              source_path: $.id
              transform:
                function: format_unesco_whc_id
                description: "Format WHC ID as zero-padded 4-digit string"
                example: "0600"
            identifier_url:
              source_path: $.http_url
              fallback:
                template: "https://whc.unesco.org/en/list/{id}"
            assigned_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "Convert year string to ISO 8601 date (YYYY-01-01)"

        # Website URL
        - target_class: Identifier
          condition:
            field: $.http_url
            operator: is_not_null
          fields:
            identifier_scheme:
              default: "Website"
            identifier_value:
              source_path: $.http_url
            identifier_url:
              source_path: $.http_url

        # Wikidata Q-number (enrichment)
        - target_class: Identifier
          transform:
            function: enrich_with_wikidata_qnumber
            description: >-
              Query Wikidata SPARQL endpoint for Q-number using UNESCO WHC ID
              property (P757). Caches results to minimize SPARQL queries.
            parameters:
              whc_id: $.id
              cache_ttl: 604800  # 7 days
            fallback:
              skip: true
              note: "Wikidata Q-number not found or enrichment disabled"
          fields:
            identifier_scheme:
              default: "Wikidata"
            identifier_value:
              source_path: "@enrichment:wikidata_qnumber"
            identifier_url:
              template: "https://www.wikidata.org/wiki/{identifier_value}"

    # -------------------------------------------------------------------------
    # TEMPORAL FIELDS
    # -------------------------------------------------------------------------

    founded_date:
      source_path: $.date_inscribed
      target_path: founded_date
      transform:
        function: parse_unesco_date
        description: >-
          Parse UNESCO inscription date. Note: This is the UNESCO inscription
          date, NOT the original founding date of the institution. Use with caution.
        validation:
          pattern: "^\\d{4}(-\\d{2}-\\d{2})?$"
        output_format: "ISO 8601 date (YYYY-MM-DD)"
      note: >-
        UNESCO inscription date != founding date. For actual founding dates,
        requires external enrichment from Wikidata or institutional sources.

    closed_date:
      source_path: $.date_end
      target_path: closed_date
      transform:
        function: parse_unesco_date
        description: "Parse UNESCO delisting date (if site was removed from list)"
      fallback:
        value: null
        note: "Most sites remain on UNESCO list indefinitely"

    # -------------------------------------------------------------------------
    # COLLECTIONS (Nested Objects)
    # -------------------------------------------------------------------------

    collections:
      target_path: collections
      transform:
        function: infer_collections_from_unesco
        description: >-
          Infer collection metadata from UNESCO criteria and category.
          Creates Collection objects based on cultural/natural criteria.
        parameters:
          category: $.category
          criteria_txt: $.criteria_txt
          description_en: $.short_description_en
      output: list
      nested_mappings:
        - target_class: Collection
          fields:
            collection_name:
              transform:
                function: generate_collection_name
                description: "Generate collection name from UNESCO category and criteria"
                template: "UNESCO {category} Heritage Collection - {site_name}"
            collection_type:
              source_path: $.category
              transform:
                function: map_unesco_category_to_collection_type
                mappings:
                  Cultural: "cultural"
                  Natural: "natural"
                  Mixed: "mixed"
            subject_areas:
              source_path: $.criteria_txt
              transform:
                function: parse_unesco_criteria
                description: >-
                  Parse UNESCO criteria codes (i-x) and map to subject areas.
                  Cultural: (i) masterpieces, (ii) exchange, (iii) testimony,
                  (iv) architecture, (v) settlement, (vi) associations.
                  Natural: (vii) natural phenomena, (viii) earth's history,
                  (ix) ecosystems, (x) biodiversity.
              output: list
              example:
                input: "(i)(ii)(iv)"
                output:
                  - "Masterpieces of human creative genius"
                  - "Exchange of human values"
                  - "Architecture"
            temporal_coverage:
              source_path: null
              transform:
                function: infer_temporal_coverage_from_description
                description: >-
                  Attempt to extract time periods from UNESCO description text.
                  Uses NER and regex patterns to identify dates/periods.
                parameters:
                  description: $.short_description_en
                  justification: $.justification_en
              fallback:
                value: null
                note: "Temporal coverage could not be inferred"
            extent:
              source_path: $.area_hectares
              transform:
                function: format_area_extent
                description: "Format area in hectares as collection extent"
                template: "{area_hectares} hectares"

    # -------------------------------------------------------------------------
    # DIGITAL PLATFORMS (Nested Objects)
    # -------------------------------------------------------------------------

    digital_platforms:
      target_path: digital_platforms
      transform:
        function: create_digital_platforms_from_unesco
        description: >-
          Create DigitalPlatform objects for UNESCO website and any linked
          institutional websites discovered via URL parsing.
        parameters:
          http_url: $.http_url
          image_url: $.image_url
      output: list
      nested_mappings:
        # UNESCO Website
        - target_class: DigitalPlatform
          fields:
            platform_name:
              default: "UNESCO World Heritage Centre"
            platform_url:
              source_path: $.http_url
            platform_type:
              default: WEBSITE
            metadata_standards:
              default: ["DUBLIN_CORE", "SCHEMA_ORG"]
              description: "UNESCO uses Dublin Core and Schema.org metadata"

    # -------------------------------------------------------------------------
    # PROVENANCE (Required)
    # -------------------------------------------------------------------------

    provenance:
      target_path: provenance
      transform:
        function: create_unesco_provenance
        description: "Create Provenance object for UNESCO-sourced data"
      output: object
      required: true
      nested_mappings:
        - target_class: Provenance
          fields:
            data_source:
              default: UNESCO_WORLD_HERITAGE
              validation:
                enum: DataSourceEnum
            data_tier:
              default: TIER_1_AUTHORITATIVE
              description: "UNESCO DataHub is authoritative source"
              validation:
                enum: DataTierEnum
            extraction_date:
              source_path: null
              transform:
                function: current_timestamp_utc
                description: "Timestamp when data was extracted from UNESCO API"
                output_format: "ISO 8601 datetime with timezone"
            extraction_method:
              default: "UNESCO DataHub API extraction via LinkML Map transformation"
            confidence_score:
              source_path: "@computed:institution_type_confidence"
              description: "Confidence score from institution type classification"
              validation:
                type: float
                range: [0.0, 1.0]
              fallback:
                value: 0.85
                note: "Default confidence for UNESCO data with inferred institution type"
            source_url:
              source_path: $.http_url
              description: "URL to original UNESCO World Heritage Site page"
            verified_date:
              source_path: $.date_inscribed
              transform:
                function: parse_year_to_date
                description: "UNESCO inscription date serves as verification date"
            verified_by:
              default: "UNESCO World Heritage Committee"
              description: "UNESCO WHC verifies all inscriptions"

# =============================================================================
# CUSTOM TRANSFORMATION FUNCTIONS (Python Implementation Required)
# =============================================================================

custom_functions:
  generate_heritage_custodian_id:
    description: "Generate persistent URI for heritage custodian"
    implementation: "src/glam_extractor/mappers/unesco_transformers.py:generate_heritage_custodian_id"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
      returns: str
    example:
      input: {whc_id: 600, country_code: "FR"}
      output: "https://w3id.org/heritage/custodian/fr/unesco-600"

  generate_ghcid_for_unesco_site:
    description: "Generate GHCID from UNESCO site data"
    implementation: "src/glam_extractor/ghcid/unesco_ghcid_generator.py:generate_ghcid"
    signature:
      parameters:
        - name: whc_id
          type: int
        - name: country_code
          type: str
        - name: site_name
          type: str
        - name: latitude
          type: float
        - name: longitude
          type: float
        - name: institution_type
          type: InstitutionTypeEnum
      returns: str
    notes:
      - "Requires GeoNames lookup for UN/LOCODE (city code)"
      - "Handles collision resolution via Wikidata Q-number suffix"
      - "Caches GeoNames lookups to minimize API calls"

  classify_unesco_institution_type:
    description: "Classify institution type using composite strategy"
    implementation: "src/glam_extractor/classifiers/unesco_institution_type.py:classify"
    signature:
      parameters:
        - name: category
          type: str
        - name: description_en
          type: str
        - name: description_fr
          type: str
        - name: justification_en
          type: str
        - name: criteria
          type: str
        - name: site_name
          type: str
      returns:
        type: dict
        fields:
          - institution_type: InstitutionTypeEnum
          - confidence_score: float
          - reasoning: str
    notes:
      - "Uses KeywordClassificationStrategy, RuleBasedStrategy, and CompositeStrategy"
      - "Confidence threshold: 0.7 (below this, defaults to MIXED)"
      - "See design-patterns.md Pattern 2 for implementation details"

  enrich_with_wikidata_qnumber:
    description: "Query Wikidata for Q-number via UNESCO WHC ID property (P757)"
    implementation: "src/glam_extractor/enrichment/wikidata_enricher.py:get_qnumber_by_whc_id"
    signature:
      parameters:
        - name: whc_id
          type: int
      returns:
        type: Optional[str]
        description: "Wikidata Q-number or None if not found"
    sparql_query: |
      SELECT ?item WHERE {
        ?item wdt:P757 "{whc_id}" .
      }
    cache_ttl: 604800  # 7 days
    notes:
      - "Rate limit: 60 requests/minute to Wikidata SPARQL endpoint"
      - "Implement exponential backoff on 429 Too Many Requests"
      - "Skip enrichment if offline mode enabled"

  reverse_geocode_coordinates:
    description: "Reverse geocode coordinates to city name via GeoNames"
    implementation: "src/glam_extractor/geocoding/geonames_client.py:reverse_geocode"
    signature:
      parameters:
        - name: latitude
          type: float
        - name: longitude
          type: float
      returns:
        type: Optional[str]
        description: "City name or None if lookup failed"
    api_endpoint: "http://api.geonames.org/findNearbyPlaceNameJSON"
    cache_ttl: 2592000  # 30 days
    rate_limit: "1 request/second (GeoNames free tier)"
    notes:
      - "Requires GEONAMES_API_KEY environment variable"
      - "Free tier: 20,000 requests/day"
      - "Aggressive caching essential to stay within limits"

  parse_unesco_criteria:
    description: "Parse UNESCO criteria codes (i-x) to subject area descriptions"
    implementation: "src/glam_extractor/parsers/unesco_criteria_parser.py:parse_criteria"
    signature:
      parameters:
        - name: criteria_txt
          type: str
          example: "(i)(ii)(iv)"
      returns:
        type: List[str]
        description: "List of criteria descriptions"
    criteria_mapping:
      "(i)": "Masterpieces of human creative genius"
      "(ii)": "Exchange of human values"
      "(iii)": "Testimony to cultural tradition"
      "(iv)": "Outstanding example of architecture"
      "(v)": "Traditional human settlement or land-use"
      "(vi)": "Associated with events, traditions, or ideas"
      "(vii)": "Superlative natural phenomena or natural beauty"
      "(viii)": "Outstanding examples of Earth's history"
      "(ix)": "Significant ongoing ecological and biological processes"
      "(x)": "Habitats for biodiversity conservation"
    reference: "https://whc.unesco.org/en/criteria/"

# =============================================================================
# VALIDATION RULES
# =============================================================================

validation:
  required_fields:
    - id
    - name
    - institution_type
    - locations
    - identifiers
    - provenance

  post_transformation:
    - rule: validate_ghcid_format
      description: "Ensure GHCID matches specification pattern"
      pattern: "^[A-Z]{2}-[A-Z0-9]{1,3}-[A-Z]{3}-[A-Z]-[A-Z0-9]{1,10}(-Q[0-9]+)?$"
      applies_to: ghcid_current

    - rule: validate_country_code
      description: "Ensure country code is valid ISO 3166-1 alpha-2"
      validation_function: "pycountry.countries.get(alpha_2=value)"
      applies_to: locations[*].country

    - rule: validate_coordinates
      description: "Ensure coordinates are within valid ranges"
      validation_function: "validate_coordinates_in_range"
      applies_to:
        - locations[*].latitude
        - locations[*].longitude

    - rule: validate_confidence_score
      description: "Ensure confidence score is between 0.0 and 1.0"
      validation_function: "0.0 <= value <= 1.0"
      applies_to: provenance.confidence_score

    - rule: warn_low_confidence
      description: "Warn if institution type confidence < 0.7"
      severity: warning
      condition: "provenance.confidence_score < 0.7"
      action: "log_warning"
      message: "Low confidence institution type classification: {institution_type} ({confidence_score})"

# =============================================================================
# EDGE CASES AND SPECIAL HANDLING
# =============================================================================

edge_cases:
  serial_nominations:
    description: >-
      UNESCO serial nominations represent multiple sites under one WHC ID.
      Example: "Primeval Beech Forests of the Carpathians" spans 12 countries.
    handling:
      - "Detect via $.serial == '1' flag"
      - "Query UNESCO API for component sites"
      - "Create separate HeritageCustodian record for each component"
      - "Link via parent_organization or partnership relationship"
    implementation: "src/glam_extractor/parsers/serial_nomination_handler.py"

  transboundary_sites:
    description: >-
      Sites spanning multiple countries (e.g., Mont Blanc, Wadden Sea).
    handling:
      - "Detect via $.transboundary == '1' flag"
      - "Parse $.states field for multiple country codes (comma-separated)"
      - "Create Location objects for each country"
      - "GHCID generation: Use first country alphabetically, note transboundary in description"
    example:
      site: "Wadden Sea"
      countries: ["NL", "DE", "DK"]
      ghcid: "DE-NS-CUX-M-WS"  # Uses Germany (alphabetically first among DE/DK/NL)

  sites_with_multiple_institutions:
    description: >-
      Large UNESCO sites may contain multiple GLAM institutions (e.g., Vatican City).
    handling:
      - "Primary extraction creates one HeritageCustodian for the UNESCO site itself"
      - "Flag for manual enrichment: Add sub_organizations for individual museums/archives within site"
      - "Use sub_organizations slot to link related institutions"
    example:
      site: "Vatican City"
      main_record: "Vatican Museums"
      sub_organizations:
        - "Vatican Apostolic Archive"
        - "Vatican Library"
        - "Sistine Chapel"

  sites_removed_from_list:
    description: >-
      Sites delisted from UNESCO World Heritage List (rare but possible).
    handling:
      - "Detect via $.date_end field (not null)"
      - "Set organization_status: INACTIVE"
      - "Set closed_date: $.date_end"
      - "Note delisting in change_history as ChangeEvent(CLOSURE)"
    example:
      site: "Dresden Elbe Valley (Germany)"
      date_end: "2009"
      reason: "Construction of Waldschlösschen Bridge"

  missing_coordinates:
    description: >-
      Some UNESCO sites lack precise coordinates (especially large serial nominations).
    handling:
      - "Attempt geocoding via city/country fallback"
      - "If geocoding fails, leave latitude/longitude as null"
      - "Mark with provenance.notes: 'Coordinates unavailable from UNESCO, requires manual geocoding'"
      - "Flag for manual review"

  ambiguous_institution_types:
    description: >-
      UNESCO sites that are clearly heritage sites but not obviously GLAM
      custodians. Example: Historic city centers without specific museum/archive mention.
    handling:
      - "Default to MIXED with confidence_score < 0.7"
      - "Add provenance.notes: 'Institution type inferred, requires verification'"
      - "Generate GHCID with 'X' type code (MIXED)"
      - "Flag for manual classification review"

# =============================================================================
# TESTING STRATEGY
# =============================================================================

testing:
  unit_tests:
    - test_direct_field_mapping:
        description: "Test simple field mappings (name, country, WHC ID)"
        fixtures: 20
        coverage: "All direct mappings without custom functions"

    - test_custom_transformation_functions:
        description: "Test each custom function in isolation"
        fixtures: 50
        functions:
          - generate_heritage_custodian_id
          - generate_ghcid_for_unesco_site
          - classify_unesco_institution_type
          - parse_unesco_criteria

    - test_nested_object_creation:
        description: "Test Location, Identifier, Collection creation"
        fixtures: 30
        focus: "Ensure nested objects validate against LinkML schema"

    - test_multi_language_handling:
        description: "Test extraction of English and French names"
        fixtures: 20
        edge_cases:
          - Only English name available
          - Only French name available
          - Both names identical
          - Names significantly different

  integration_tests:
    - test_full_transformation_pipeline:
        description: "End-to-end test: UNESCO JSON → HeritageCustodian YAML"
        fixtures: 20  # golden dataset
        validation:
          - LinkML schema validation
          - GHCID format validation
          - Provenance completeness
          - No required fields missing

    - test_edge_cases:
        description: "Test special cases (serial, transboundary, delisted)"
        fixtures:
          - 3 serial nominations
          - 3 transboundary sites
          - 1 delisted site
          - 5 ambiguous institution types

    - test_external_api_integration:
        description: "Test GeoNames and Wikidata enrichment"
        approach: "Mock external APIs with cached responses"
        fixtures: 10
        scenarios:
          - Successful enrichment
          - API rate limit exceeded
          - API unavailable (offline mode)
          - No results found

  property_based_tests:
    - test_ghcid_uniqueness:
        description: "Ensure GHCIDs are unique within dataset"
        strategy: "Generate 1000 UNESCO site transformations, check for collisions"
        property: "All GHCIDs unique OR collisions resolved via Q-number suffix"

    - test_coordinate_validity:
        description: "Ensure all coordinates within valid ranges"
        strategy: "Test with random valid/invalid coordinates"
        property: "latitude ∈ [-90, 90], longitude ∈ [-180, 180]"

    - test_confidence_score_range:
        description: "Ensure confidence scores always in [0.0, 1.0]"
        strategy: "Test with diverse UNESCO descriptions"
        property: "0.0 ≤ confidence_score ≤ 1.0"

# =============================================================================
# PERFORMANCE CONSIDERATIONS
# =============================================================================

performance:
  caching_strategy:
    geonames_lookups:
      ttl: 2592000  # 30 days
      rationale: "Place names rarely change"
      cache_backend: "SQLite (cache/geonames_cache.db)"
    wikidata_enrichment:
      ttl: 604800  # 7 days
      rationale: "Wikidata updates occasionally, refresh weekly"
      cache_backend: "SQLite (cache/wikidata_cache.db)"
    unesco_api_responses:
      ttl: 86400  # 24 hours
      rationale: "UNESCO data changes infrequently"
      cache_backend: "requests-cache with SQLite"

  rate_limiting:
    geonames:
      limit: "1 request/second"
      fallback: "Skip geocoding if rate limit exceeded, continue processing"
    wikidata:
      limit: "60 requests/minute"
      fallback: "Queue requests, process in batches"

  parallelization:
    approach: "Process UNESCO sites in parallel using multiprocessing"
    workers: 4
    batch_size: 50
    considerations:
      - "Share cache across workers (SQLite supports concurrent reads)"
      - "Coordinate API rate limiting across workers (use Redis or file lock)"
      - "Collect results and merge into single dataset"

  estimated_processing_time:
    total_sites: 1200
    processing_per_site: "2-5 seconds (with external API calls)"
    total_time_serial: "40-100 minutes"
    total_time_parallel_4_workers: "10-25 minutes"
# =============================================================================
# EXTENSION POINTS
# =============================================================================

extensions:
  future_enhancements:
    - name: "Machine Learning Classification"
      description: "Train ML model on manually classified UNESCO sites"
      implementation: "Add MLClassificationStrategy to composite classifier"
      benefits: "Improve classification accuracy beyond keyword matching"

    - name: "OpenStreetMap Integration"
      description: "Enrich location data with OSM polygons and detailed addresses"
      implementation: "Query Overpass API for UNESCO site boundaries"
      benefits: "More precise geographic data, street addresses"

    - name: "Multi-language NLP"
      description: "Use spaCy multi-language models for better description parsing"
      implementation: "Integrate spaCy NER for extracting temporal coverage and subjects"
      benefits: "Better collection metadata inference"

    - name: "UNESCO Thesaurus Integration"
      description: "Map UNESCO criteria to UNESCO Thesaurus SKOS concepts"
      implementation: "SPARQL queries against vocabularies.unesco.org"
      benefits: "Richer semantic linking, LOD compatibility"
```

---

## Implementation Checklist

- [ ] **Day 2-3**: Create `schemas/maps/unesco_to_heritage_custodian.yaml` (this file)
- [ ] **Day 4**: Implement custom transformation functions in `src/glam_extractor/mappers/unesco_transformers.py`
- [ ] **Day 5**: Write unit tests for all custom functions (50+ tests)
- [ ] **Day 6**: Test full transformation pipeline with golden dataset (20 tests)
- [ ] **Day 7**: Handle edge cases (serial, transboundary, delisted sites)
- [ ] **Day 8**: Optimize performance (caching, rate limiting, parallelization)
- [ ] **Day 9-10**: Integration with main GLAM dataset and validation

---

## Related Documentation

- **Dependencies**: `01-dependencies.md` - LinkML Map extension requirements
- **Implementation Phases**: `03-implementation-phases.md` - Day-by-day timeline
- **TDD Strategy**: `04-tdd-strategy.md` - Testing approach
- **Design Patterns**: `05-design-patterns.md` - Classification strategies
- **Master Checklist**: `07-master-checklist.md` - Overall progress tracker

---

**Version**: 1.0
**Date**: 2025-11-09
**Status**: Draft - Ready for Implementation
**Next Steps**: Begin Day 2 implementation (create initial LinkML Map schema file)
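---

## Appendix: Criteria Parsing Sketch

The schema names `parse_unesco_criteria` and specifies its criteria mapping, but leaves the implementation to `src/glam_extractor/parsers/unesco_criteria_parser.py`. A minimal, non-normative Python sketch might look like the following; the regex-based tokenization and the skip-unknown-token policy are assumptions, while the mapping table is copied from the schema's `criteria_mapping`:

```python
import re
from typing import List

# Criteria descriptions copied from the schema's criteria_mapping table.
CRITERIA_MAPPING = {
    "(i)": "Masterpieces of human creative genius",
    "(ii)": "Exchange of human values",
    "(iii)": "Testimony to cultural tradition",
    "(iv)": "Outstanding example of architecture",
    "(v)": "Traditional human settlement or land-use",
    "(vi)": "Associated with events, traditions, or ideas",
    "(vii)": "Superlative natural phenomena or natural beauty",
    "(viii)": "Outstanding examples of Earth's history",
    "(ix)": "Significant ongoing ecological and biological processes",
    "(x)": "Habitats for biodiversity conservation",
}


def parse_unesco_criteria(criteria_txt: str) -> List[str]:
    """Parse a UNESCO criteria string like "(i)(ii)(iv)" into descriptions.

    Unknown or malformed tokens are skipped rather than raising, so a single
    bad record does not abort a batch transformation (assumed policy).
    """
    tokens = re.findall(r"\((?:i{1,3}|iv|v|vi{1,3}|ix|x)\)", criteria_txt or "")
    return [CRITERIA_MAPPING[t] for t in tokens if t in CRITERIA_MAPPING]
```

For the sample site above, `parse_unesco_criteria("(i)(ii)(iv)")` yields the three cultural-criteria descriptions in input order.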
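---

## Appendix: GHCID Derived-Identifier Sketch

The schema derives `ghcid_numeric` (64-bit, SHA-256 based) and `ghcid_uuid` (UUID v5, SHA-1 based) from `ghcid_current`. A minimal sketch of those two derivations follows; the namespace UUID, the first-8-bytes truncation, and the example GHCID string are illustrative assumptions, not part of the specification:

```python
import hashlib
import uuid

# Hypothetical namespace -- the real project would pin a fixed namespace UUID.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://w3id.org/heritage/custodian/")


def ghcid_to_numeric_hash(ghcid: str) -> int:
    """Derive a 64-bit numeric hash from a GHCID string (SHA-256 based).

    Uses the first 8 bytes of the SHA-256 digest, so the result is
    deterministic across runs, unlike Python's built-in hash().
    """
    digest = hashlib.sha256(ghcid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def ghcid_to_uuid_v5(ghcid: str) -> uuid.UUID:
    """Derive a deterministic UUID v5 (SHA-1 based) from a GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Both functions are pure, so the same GHCID always maps to the same numeric hash and UUID, which is what the post-transformation uniqueness checks rely on. (The SHA-256-based UUID v8 variant is omitted here, since RFC 9562 leaves the v8 layout implementation-defined.)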
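---

## Appendix: Keyword Classification Sketch

The schema's `classify_unesco_institution_type` combines keyword, rule-based, and composite strategies with a 0.7 confidence threshold, defaulting to MIXED. The sketch below illustrates only the keyword strategy; the keyword lists and the confidence formula are illustrative assumptions, and only the threshold and MIXED fallback come from the schema:

```python
from typing import Dict, Tuple

# Illustrative keyword table -- the real classifier combines several strategies.
TYPE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "MUSEUM": ("museum", "gallery", "collection"),
    "LIBRARY": ("library", "manuscript"),
    "ARCHIVE": ("archive", "records"),
}

CONFIDENCE_THRESHOLD = 0.7  # From the schema: below this, default to MIXED.


def classify_by_keywords(description: str) -> Tuple[str, float]:
    """Return (institution_type, confidence) from simple keyword matching.

    Confidence is a naive proxy: 0.5 base plus 0.25 per matched keyword for
    the best-scoring type, capped at 1.0. Empty or ambiguous text falls
    below the threshold and is reported as MIXED.
    """
    text = (description or "").lower()
    best_type, best_hits = "MIXED", 0
    for inst_type, keywords in TYPE_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in text)
        if hits > best_hits:
            best_type, best_hits = inst_type, hits
    confidence = min(1.0, 0.5 + 0.25 * best_hits) if best_hits else 0.5
    if confidence < CONFIDENCE_THRESHOLD:
        return "MIXED", confidence
    return best_type, confidence
```

With these keyword lists, "Museum complex with archaeological collections" classifies as MUSEUM above the threshold, while "Historic city center" matches nothing and falls back to MIXED, mirroring the behavior the schema's examples describe.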