UNESCO Data Extraction - Master Implementation Checklist
Project: Global GLAM Dataset - UNESCO World Heritage Sites Extraction
Document: 07 - Master Implementation Checklist
Version: 1.1
Date: 2025-11-10
Status: Planning
Executive Summary
This document provides a comprehensive 30-day implementation checklist for the UNESCO World Heritage Sites extraction project. It consolidates all planning documents (00-06) into a structured timeline with actionable tasks, testing requirements, and quality gates.
Timeline: 30 calendar days (estimated 120 person-hours)
Approach: Test-Driven Development (TDD) with iterative validation
Output: 1,200+ LinkML-compliant heritage custodian records
How to Use This Checklist
Notation
- [D] = Development task
- [T] = Testing task
- [V] = Validation/Quality gate
- [D+T] = TDD pair (test first, then implementation)
- ✓ = Completed
- ⚠ = In progress
- ⊗ = Blocked/Issue
Task Priority Levels
- P0 = Critical path, must complete
- P1 = High priority, complete if time permits
- P2 = Nice to have, future iteration
Estimated Time
Each task includes an estimated duration in hours (e.g., [2h]).
Phase 0: Pre-Implementation Setup (Days 1-2, 8 hours)
Day 1: Environment & Dependencies
Morning (4 hours)
- [D] Verify Python 3.11+ installation [0.5h] (P0)
  - Run: `python --version`
  - Create virtual environment: `python -m venv .venv`
  - Activate: `source .venv/bin/activate`
- [D] Install core dependencies from `requirements.txt` [1h] (P0)
  ```
  pip install linkml==1.7.10
  pip install linkml-runtime==1.7.5
  pip install pydantic==1.10.13
  pip install requests==2.31.0
  pip install python-dotenv==1.0.0
  pip install pytest==7.4.3
  pip install pytest-cov==4.1.0
  ```
- [D] Install optional dependencies [0.5h] (P1)
  ```
  pip install rapidfuzz==3.5.2      # Fuzzy matching
  pip install SPARQLWrapper==2.0.0  # Wikidata queries
  pip install click==8.1.7          # CLI interface
  ```
- [V] Verify LinkML installation [0.5h] (P0)
  ```
  linkml-validate --version
  gen-python --version
  gen-json-schema --version
  ```
- [D] Clone/update project repository [0.5h] (P0)
  ```
  cd /Users/kempersc/apps/glam
  git status
  git pull origin main
  ```
- [D] Create directory structure [1h] (P0)
  ```
  mkdir -p schemas/maps
  mkdir -p src/glam_extractor/extractors/unesco
  mkdir -p tests/fixtures/unesco
  mkdir -p tests/unit/unesco
  mkdir -p tests/integration/unesco
  mkdir -p data/raw/unesco
  mkdir -p data/processed/unesco
  mkdir -p data/instances/unesco
  ```
Afternoon (4 hours)
- [D] Set up UNESCO API access [1h] (P0)
  - Read API docs: https://data.unesco.org/api/explore/v2.0/console
  - Test API endpoint: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - Note: Legacy endpoint `https://whc.unesco.org/en/list/json` is BLOCKED (Cloudflare 403)
  - UNESCO uses OpenDataSoft Explore API v2 (not OAI-PMH)
  - No authentication required for public datasets
  - Document ODSQL query language for filtering
  - Create `.env` file with configuration
- [D] Download sample UNESCO data [1h] (P0)
  ```
  # OpenDataSoft API with pagination (max 100 records per request)
  curl -o data/raw/unesco/whc-sites-2025.json \
    "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=100&offset=0"

  # For full dataset (1,200+ sites), implement pagination in Python
  # CSV export fallback (if API unavailable):
  # https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv
  ```
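The pagination loop mentioned above can be sketched as follows. `fetch_page` is injected (e.g. a thin wrapper around `requests.get`) so the loop itself is testable without network access; the `records`/`total_count` field names follow the OpenDataSoft Explore v2 response shape described in this plan.

```python
from typing import Callable, Iterator

PAGE_LIMIT = 100  # OpenDataSoft caps the records endpoint at 100 per request

def iter_records(fetch_page: Callable[[int, int], dict]) -> Iterator[dict]:
    """Yield every record from the dataset, one page at a time.

    fetch_page(limit, offset) should return the decoded JSON for one page.
    """
    offset = 0
    while True:
        page = fetch_page(PAGE_LIMIT, offset)
        records = page.get("records", [])
        if not records:
            break
        yield from records
        offset += PAGE_LIMIT
        if offset >= page.get("total_count", 0):
            break
```

Injecting the fetch function also makes the cached-download variant trivial: the cache wrapper and the live HTTP wrapper share this one loop.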
[D] Create project README for UNESCO extraction
[1h](P1)- Document scope: 1,200+ WHS extraction
- Reference planning documents (00-06)
- List dependencies and prerequisites
-
[V] Review existing schema modules
[1h](P0)- Read
schemas/core.yaml(HeritageCustodian, Location, Identifier) - Read
schemas/enums.yaml(InstitutionTypeEnum, DataSource) - Read
schemas/provenance.yaml(Provenance, ChangeEvent) - Read
schemas/collections.yaml(Collection metadata)
- Read
Phase 1: Schema & Type System (Days 3-5, 16 hours)
Day 3: LinkML Schema Extension
Morning (4 hours)
- [T] Create test for UNESCO-specific enums [1h] (P0)
  - File: `tests/unit/unesco/test_unesco_enums.py`
  - Test cases:
    - UNESCO category values (Cultural, Natural, Mixed)
    - UNESCO inscription status (Inscribed, Danger, Delisted)
    - UNESCO serial nomination flag
    - UNESCO transboundary flag
- [D+T] Add UNESCO enums to `schemas/enums.yaml` [2h] (P0)
  ```
  UNESCOCategoryEnum:
    permissible_values:
      CULTURAL:
        description: Cultural heritage site
      NATURAL:
        description: Natural heritage site
      MIXED:
        description: Mixed cultural and natural site

  UNESCOInscriptionStatusEnum:
    permissible_values:
      INSCRIBED:
        description: Currently inscribed on World Heritage List
      IN_DANGER:
        description: On List of World Heritage in Danger
      DELISTED:
        description: Removed from World Heritage List
  ```
- [V] Validate enum schema [0.5h] (P0)
  ```
  linkml-validate -s schemas/heritage_custodian.yaml \
    -C UNESCOCategoryEnum \
    --target-class UNESCOCategoryEnum
  ```
- [D] Generate Python classes from schema [0.5h] (P0)
  ```
  gen-python schemas/heritage_custodian.yaml > \
    src/glam_extractor/models/generated.py
  ```
Afternoon (4 hours)
- [T] Create test for UNESCO extension class [1.5h] (P0)
  - File: `tests/unit/unesco/test_unesco_custodian.py`
  - Test cases:
    - UNESCOHeritageCustodian inherits from HeritageCustodian
    - Additional fields: unesco_id, inscription_year, category, criteria
    - Validation rules for UNESCO WHC ID format
    - Multi-language name handling
- [D+T] Create UNESCO extension schema [2h] (P0)
  - File: `schemas/unesco.yaml`
  - Extends: HeritageCustodian from `schemas/core.yaml`
  - New slots:
    - `unesco_id`: Integer (UNESCO WHC site ID)
    - `inscription_year`: Integer (year inscribed)
    - `unesco_category`: UNESCOCategoryEnum
    - `unesco_criteria`: List of strings (e.g., ["i", "ii", "iv"])
    - `in_danger`: Boolean
    - `serial_nomination`: Boolean
    - `transboundary`: Boolean
- [D] Update main schema to import UNESCO module [0.5h] (P0)
  - Edit `schemas/heritage_custodian.yaml`
  - Add import: `imports: [core, enums, provenance, collections, unesco]`
Day 4: Type Classification System
Morning (4 hours)
- [T] Create test for institution type classifier [2h] (P0)
  - File: `tests/unit/unesco/test_institution_classifier.py`
  - Test cases:
    - Museums: "National Museum of...", "Museum Complex"
    - Libraries: "Royal Library", "Library of..."
    - Archives: "National Archives", "Documentary Heritage"
    - Holy Sites: "Cathedral", "Temple", "Monastery"
    - Mixed: Multiple institution types at one site
    - Unknown: Defaults to MIXED when uncertain
- [D+T] Implement institution type classifier [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/institution_classifier.py`
  - Function: `classify_unesco_site(site_data: dict) -> InstitutionTypeEnum`
  - Logic:
    - Extract `short_description_en`, `justification_en`, `name_en`
    - Apply keyword matching patterns (see doc 05, design pattern #2)
    - Return InstitutionTypeEnum value
    - Assign confidence score (0.0-1.0)
Afternoon (4 hours)
- [D] Create classification keyword database [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/classification_rules.yaml`
  - Structure:
    ```
    MUSEUM:
      keywords: [museum, gallery, kunsthalle, musée]
      context_boost: [collection, exhibit, artifact]
    LIBRARY:
      keywords: [library, biblioteca, bibliothèque]
      context_boost: [manuscript, codex, archival]
    ARCHIVE:
      keywords: [archive, archivo, archiv]
      context_boost: [document, record, registry]
    HOLY_SITES:
      keywords: [cathedral, church, temple, monastery, abbey, mosque, synagogue]
      context_boost: [pilgrimage, religious, sacred]
    ```
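A minimal sketch of how the keyword rules above could drive classification with a confidence score. The inline `RULES` table is a cut-down stand-in for `classification_rules.yaml`, and the hit-plus-half-boost scoring is an illustrative assumption, not the documented design pattern from doc 05.

```python
# RULES is an inline stand-in for classification_rules.yaml.
RULES = {
    "MUSEUM": {"keywords": ["museum", "gallery"], "context_boost": ["collection", "exhibit"]},
    "LIBRARY": {"keywords": ["library"], "context_boost": ["manuscript", "codex"]},
    "HOLY_SITES": {"keywords": ["cathedral", "temple", "monastery"], "context_boost": ["sacred"]},
}

def classify_text(text: str) -> tuple[str, float]:
    """Return (label, confidence in 0.0-1.0); MIXED when nothing matches."""
    lowered = text.lower()
    scores: dict[str, float] = {}
    for label, rule in RULES.items():
        hits = sum(kw in lowered for kw in rule["keywords"])
        if hits:
            # Context words count half a keyword hit (assumed weighting).
            boost = sum(kw in lowered for kw in rule["context_boost"])
            scores[label] = hits + 0.5 * boost
    if not scores:
        return "MIXED", 0.0  # the "defaults to MIXED when uncertain" case
    best = max(scores, key=scores.__getitem__)
    return best, scores[best] / sum(scores.values())
```

Normalizing the winner's score against the total gives a natural confidence value for the Provenance record built later in the pipeline.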
- [T] Add edge case tests for classifier [1h] (P1)
  - Test multilingual descriptions (French, Spanish, Arabic)
  - Test sites with ambiguous descriptions
  - Test serial nominations with multiple institution types
- [V] Benchmark classifier accuracy [1h] (P1)
  - Manually classify 50 sample UNESCO sites
  - Compare with algorithm output
  - Target accuracy: >85%
Day 5: GHCID Generation for UNESCO
Morning (4 hours)
- [T] Create test for UNESCO GHCID generation [2h] (P0)
  - File: `tests/unit/unesco/test_ghcid_unesco.py`
  - Test cases:
    - Single-country site: `FR-IDF-PAR-M-WHC600` (Banks of Seine)
    - Transboundary site: Multi-country code handling
    - Serial nomination: Append serial component ID
    - UUID v5 generation from GHCID
    - GHCID history tracking
- [D+T] Implement UNESCO GHCID generator [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/ghcid_generator.py`
  - Function: `generate_unesco_ghcid(site_data: dict) -> str`
  - Format: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}`
  - Logic:
    - Extract country from `states_iso_code`
    - Extract city from `name_en` or geocode from lat/lon
    - Classify institution type → single-letter code
    - Append `WHC{unesco_id}` suffix
    - Handle transboundary: Use primary country or multi-country code
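The assembly step (and the UUID v5 derivation that the test list mentions) can be sketched as below. The three-letter city code here is simply the first three letters upper-cased, and the namespace UUID is a placeholder — both are assumptions, not project constants.

```python
import uuid

# Placeholder namespace for UUID v5 derivation, NOT a project constant.
GHCID_NAMESPACE = uuid.NAMESPACE_DNS

def build_ghcid(country: str, region: str, city: str, type_code: str, unesco_id: int) -> str:
    """Assemble {COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}.

    The city component is the first three letters upper-cased, which is an
    illustrative simplification of the real city-code derivation.
    """
    return "-".join([country.upper(), region.upper(), city[:3].upper(),
                     type_code.upper(), f"WHC{unesco_id}"])

def ghcid_uuid(ghcid: str) -> uuid.UUID:
    """Deterministic UUID v5 derived from the GHCID string."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Because UUID v5 is name-based, re-running the pipeline over unchanged sites yields identical UUIDs — which is what makes GHCID history tracking workable.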
Afternoon (4 hours)
- [D] Implement city extraction logic [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/city_extractor.py`
  - Function: `extract_city_from_name(name: str, country: str) -> str`
  - Patterns:
    - "Historic Centre of {City}"
    - "Old Town of {City}"
    - "{Site Name} in {City}"
  - Fallback to geocoding API if pattern match fails
- [T] Test city extraction edge cases [1h] (P1)
  - Test site names without city references
  - Test sites spanning multiple cities
  - Test non-English site names
- [V] Validate GHCID uniqueness [1h] (P0)
  - Generate GHCIDs for all 1,200+ UNESCO sites
  - Check for collisions (expect <5%)
  - Document collision resolution strategy
Phase 2: Data Extraction Pipeline (Days 6-12, 32 hours)
Day 6: UNESCO API Client
Morning (4 hours)
- [T] Create test for UNESCO API client [2h] (P0)
  - File: `tests/unit/unesco/test_unesco_api_client.py`
  - Test cases:
    - Fetch all sites with pagination (OpenDataSoft limit=100, offset mechanism)
    - Fetch single site by ID using ODSQL `where` clause
    - Handle HTTP errors (404, 500, timeout)
    - Parse OpenDataSoft response structure (nested `record.fields`)
    - Cache responses (avoid repeated API calls)
    - Handle pagination links from response
- [D+T] Implement UNESCO API client [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/api_client.py`
  - Class: `UNESCOAPIClient`
  - Base URL: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - Methods:
    - `fetch_all_sites() -> List[dict]` - Paginate with limit=100, offset increment
    - `fetch_site_by_id(site_id: int) -> dict` - Use ODSQL `where="unique_number={id}"`
    - `paginate_records(limit: int, offset: int) -> dict` - Handle OpenDataSoft pagination
    - `download_and_cache(url: str, cache_path: Path) -> dict`
  - Response parsing: Extract `records` array, navigate `record.fields` for each item
Afternoon (4 hours)
- [D] Implement response parser [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/response_parser.py`
  - Function: `parse_unesco_site(raw_data: dict) -> dict`
  - Input: OpenDataSoft record structure: `raw_data['record']['fields']`
  - Transformations:
    - Extract from nested `fields` dict: `raw_data['record']['fields']`
    - Normalize field names (snake_case)
    - Parse `date_inscribed` to datetime (format: "YYYY-MM-DD")
    - Parse `criteria_txt` to list: "(i)(ii)(iv)" → ["i", "ii", "iv"]
    - Extract coordinates from `coordinates` object: `{"lat": ..., "lon": ...}`
    - Handle missing/null fields (many optional fields in OpenDataSoft)
  - Example input:
    ```
    {
      "record": {
        "id": "abc123...",
        "fields": {
          "name_en": "Angkor",
          "unique_number": 668,
          "date_inscribed": "1992-01-01",
          "criteria_txt": "(i)(ii)(iii)(iv)",
          "coordinates": {"lat": 13.4333, "lon": 103.8333},
          "category": "Cultural",
          "states_name_en": "Cambodia"
        }
      }
    }
    ```
- [T] Create parser tests with fixtures [1.5h] (P0)
  - Fixtures: `tests/fixtures/unesco/sample_sites.json` (OpenDataSoft format)
  - Test parsing of 10 diverse UNESCO sites:
    - Cultural site (e.g., Angkor Wat - unique_number: 668)
    - Natural site (e.g., Galápagos Islands - unique_number: 1)
    - Mixed site (e.g., Mount Taishan - unique_number: 437)
    - Transboundary site (e.g., Struve Geodetic Arc - unique_number: 1187)
    - Serial nomination (e.g., Le Corbusier's architectural works - unique_number: 1321)
  - Test nested field extraction: Verify parser correctly navigates `record.fields`
  - Test coordinate parsing: Verify `coordinates` object → separate lat/lon floats
  - Test null handling: Verify sites with missing `coordinates`, `criteria_txt`, etc.
- [V] Test API client with live data [0.5h] (P0)
  ```
  python -m pytest tests/unit/unesco/test_unesco_api_client.py -v
  ```
Day 7: Location & Geographic Data
Morning (4 hours)
- [T] Create test for Location object builder [1.5h] (P0)
  - File: `tests/unit/unesco/test_location_builder.py`
  - Test cases:
    - Build Location from UNESCO lat/lon + country
    - Reverse geocode coordinates to city/region
    - Handle missing coordinates (geocode from name)
    - Handle transboundary sites (multiple locations)
- [D+T] Implement Location builder [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/location_builder.py`
  - Function: `build_location(site_data: dict) -> List[Location]`
  - Logic:
    - Extract `latitude`, `longitude`, `states_iso_code`
    - Reverse geocode using Nominatim API (cached)
    - Create Location object with city, region, country
    - For transboundary sites, split by country and create multiple Locations
- [D] Implement Nominatim client wrapper [0.5h] (P1)
  - File: `src/glam_extractor/geocoding/nominatim_client.py`
  - Function: `reverse_geocode(lat: float, lon: float) -> dict`
  - Caching: Use 15-minute TTL cache (per AGENTS.md)
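The 15-minute TTL cache for geocode results can be sketched as a small class; the injectable clock is an addition for testability and not part of the documented design.

```python
import time

class TTLCache:
    """TTL cache for reverse-geocode lookups (default 15 minutes).

    The clock is injectable so expiry can be tested without real waiting.
    """

    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            stored_at, value = entry
            if self.clock() - stored_at < self.ttl:
                return value
            del self._store[key]  # expired: drop and miss
        return None

    def put(self, key, value):
        self._store[key] = (self.clock(), value)
```

Keying the cache on rounded (lat, lon) tuples would also coalesce near-identical coordinates into one Nominatim call, which helps stay under rate limits.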
Afternoon (4 hours)
- [D] Integrate with GeoNames (from global_glam project) [2h] (P1)
  - Reference: `docs/plan/global_glam/08-geonames-integration.md`
  - Use existing GeoNames client if available
  - Extract region codes (ADM1, ADM2)
- [T] Test geographic edge cases [1h] (P1)
  - UNESCO sites in disputed territories
  - Sites with coordinates in the ocean (e.g., Galápagos)
  - Sites spanning multiple administrative regions
- [V] Validate geographic accuracy [1h] (P0)
  - Manually verify 50 geocoded UNESCO sites
  - Check city names match expected values
  - Target accuracy: >90%
Day 8: Identifier Extraction
Morning (4 hours)
- [T] Create test for Identifier extraction [2h] (P0)
  - File: `tests/unit/unesco/test_identifier_extractor.py`
  - Test cases:
    - UNESCO WHC ID → Identifier(scheme="UNESCO_WHC", value="600")
    - UNESCO URL → Identifier(scheme="Website", value="https://...")
    - Wikidata enrichment (if available)
    - ISIL codes (if site has institutional archive/library)
- [D+T] Implement Identifier extractor [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/identifier_extractor.py`
  - Function: `extract_identifiers(site_data: dict) -> List[Identifier]`
  - Always create:
    - UNESCO WHC ID: `unesco_id` or `unique_number`
    - Website: `http_url`
  - Conditionally add:
    - Wikidata Q-number (via SPARQL query)
    - VIAF ID (if available in future enrichment)
Afternoon (4 hours)
- [D] Implement Wikidata enrichment [3h] (P1)
  - File: `src/glam_extractor/extractors/unesco/wikidata_enricher.py`
  - Function: `find_wikidata_for_unesco_site(site_data: dict) -> Optional[str]`
  - SPARQL query:
    ```
    SELECT ?item WHERE {
      ?item wdt:P757 "600" .  # UNESCO WHC ID property
    }
    ```
  - Fuzzy fallback: If no P757 match, search by name + country
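A sketch of the P757 lookup. `run_query` is injected (e.g. a SPARQLWrapper call against the Wikidata endpoint) so the Q-number parsing is testable offline; the binding shape follows the standard SPARQL JSON results format, and the entity URI in the test is only an example value.

```python
from typing import Callable, Optional

def find_wikidata_qid(whc_id: str, run_query: Callable[[str], list]) -> Optional[str]:
    """Look up a Wikidata Q-number by UNESCO WHC ID (property P757)."""
    query = f'SELECT ?item WHERE {{ ?item wdt:P757 "{whc_id}" . }} LIMIT 1'
    bindings = run_query(query)
    if not bindings:
        return None  # no P757 match: caller tries the name+country fallback
    # e.g. "http://www.wikidata.org/entity/Q12345" -> "Q12345"
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]
```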
- [T] Test Wikidata enrichment [1h] (P1)
  - Test with known UNESCO sites that have Wikidata entries
  - Test handling of sites without Wikidata entries
  - Test rate limiting (1 request/second per Wikidata policy)
Day 9-10: Provenance & Metadata
Day 9 Morning (4 hours)
- [T] Create test for Provenance builder [1.5h] (P0)
  - File: `tests/unit/unesco/test_provenance_builder.py`
  - Test cases:
    - Provenance with data_source=UNESCO_DATAHUB
    - Provenance with data_tier=TIER_2_VERIFIED (per doc 04)
    - Extraction timestamp in UTC
    - Extraction method documentation
    - Confidence scores for institution type classification
- [D+T] Implement Provenance builder [2h] (P0)
  - File: `src/glam_extractor/extractors/unesco/provenance_builder.py`
  - Function: `build_provenance(site_data: dict, confidence: float) -> Provenance`
  - Fields:
    ```
    Provenance(
        data_source=DataSource.UNESCO_DATAHUB,
        data_tier=DataTier.TIER_2_VERIFIED,
        extraction_date=datetime.now(timezone.utc),
        extraction_method="UNESCO DataHub API extraction",
        confidence_score=confidence,
        source_url=site_data.get('http_url'),
        verified_date=datetime.fromisoformat(site_data['date_inscribed']),
        verified_by="UNESCO World Heritage Committee",
    )
    ```
- [D] Add DataSource enum value [0.5h] (P0)
  - Edit `schemas/enums.yaml`
  - Add: `UNESCO_DATAHUB: {description: "UNESCO World Heritage DataHub API"}`
Day 9 Afternoon (4 hours)
- [T] Create test for ChangeEvent extraction [2h] (P1)
  - File: `tests/unit/unesco/test_change_event_extractor.py`
  - Test cases:
    - Founding event from `date_inscribed`
    - Danger listing event (when `danger: "1"`)
    - Delisting event (when `date_end` is set)
    - Extension event (when `extension: "1"`)
- [D+T] Implement ChangeEvent extractor [2h] (P1)
  - File: `src/glam_extractor/extractors/unesco/change_event_extractor.py`
  - Function: `extract_change_events(site_data: dict) -> List[ChangeEvent]`
  - Events to create:
    - FOUNDING: `date_inscribed` → UNESCO inscription (not actual founding!)
    - STATUS_CHANGE: `danger: "1"` → added to Danger List
    - CLOSURE: `date_end` → delisted from World Heritage List
Day 10 Morning (4 hours)
- [D] Create Collection metadata extractor [2h] (P1)
  - File: `src/glam_extractor/extractors/unesco/collection_builder.py`
  - Function: `build_collection(site_data: dict) -> Optional[Collection]`
  - Logic:
    - Extract `short_description_en` → collection description
    - Parse `criteria_txt` → subject areas
    - Extract `area_hectares` → extent
    - If site is MUSEUM/ARCHIVE/LIBRARY, create Collection object
- [T] Test Collection extraction [1.5h] (P1)
  - Test museum sites with clear collections
  - Test non-institutional sites (landscapes, monuments) → return None
  - Test serial nominations → multiple collections
- [V] Review extraction pipeline completeness [0.5h] (P0)
  - Checklist:
    - ✓ Institution name extracted
    - ✓ Institution type classified
    - ✓ Location(s) extracted
    - ✓ Identifiers extracted
    - ✓ Provenance metadata complete
    - ✓ Change events extracted (optional)
    - ✓ Collection metadata extracted (optional)
Day 10 Afternoon (4 hours)
- [D+T] Implement main orchestrator [3h] (P0)
  - File: `src/glam_extractor/extractors/unesco/extractor.py`
  - Class: `UNESCOExtractor`
  - Method: `extract_site(site_data: dict) -> HeritageCustodian`
  - Orchestration logic:
    - Parse raw UNESCO JSON
    - Classify institution type → confidence score
    - Extract name (prefer English, fall back to French)
    - Build Location(s)
    - Extract Identifiers
    - Build Provenance
    - Generate GHCID
    - Generate UUIDs (v5, v8, numeric)
    - Extract ChangeEvents
    - Build Collection (if applicable)
    - Return HeritageCustodian instance
- [T] Integration test for full extraction [1h] (P0)
  - File: `tests/integration/unesco/test_end_to_end.py`
  - Test: Extract 10 diverse UNESCO sites end-to-end
  - Assertions:
    - All required fields populated
    - GHCID format valid
    - UUIDs generated correctly
    - Provenance complete
Day 11-12: Batch Processing
Day 11 Morning (4 hours)
- [D] Implement batch processor [2.5h] (P0)
  - File: `src/glam_extractor/extractors/unesco/batch_processor.py`
  - Class: `UNESCOBatchProcessor`
  - Method: `process_all_sites() -> List[HeritageCustodian]`
  - Features:
    - Progress bar (tqdm)
    - Error handling (skip invalid sites, log errors)
    - Parallel processing (multiprocessing, 4 workers)
    - Incremental checkpointing (save every 100 sites)
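The checkpoint/resume feature reduces to persisting the set of processed site IDs; a minimal stdlib sketch (the JSON file format is an assumption):

```python
import json
from pathlib import Path

def save_checkpoint(path: Path, processed_ids: set[int]) -> None:
    """Persist the set of processed unesco_ids (e.g. after every 100 sites)."""
    path.write_text(json.dumps(sorted(processed_ids)))

def load_checkpoint(path: Path) -> set[int]:
    """Resume support: return previously processed ids, or an empty set."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))
```

On resume, the batch processor filters its work queue against `load_checkpoint(...)` before fanning out to workers.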
- [T] Test batch processor [1.5h] (P0)
  - Test processing 100 sites
  - Test error recovery (invalid site data)
  - Test checkpoint/resume functionality
Day 11 Afternoon (4 hours)
- [D] Add CLI interface [2h] (P1)
  - File: `src/glam_extractor/cli/unesco.py`
  - Commands:
    ```
    glam-extract unesco fetch-all  # Download UNESCO data
    glam-extract unesco process    # Process all sites
    glam-extract unesco validate   # Validate output
    glam-extract unesco export     # Export to formats
    ```
- [D] Implement logging system [1h] (P1)
  - File: `src/glam_extractor/extractors/unesco/logger.py`
  - Log levels:
    - INFO: Progress milestones
    - WARNING: Missing data, fallback logic triggered
    - ERROR: Unhandled exceptions, invalid data
  - Output: `logs/unesco_extraction_{timestamp}.log`
- [V] Dry run on full dataset [1h] (P0)
  - Process all 1,200+ UNESCO sites
  - Log errors and warnings
  - Review classification accuracy
  - Target success rate: >95%
Day 12 Full Day (8 hours)
- [D] Implement error recovery patterns [3h] (P0)
  - File: `src/glam_extractor/extractors/unesco/error_handling.py`
  - Patterns:
    - Retry on transient errors (HTTP 429, 503)
    - Skip on permanent errors (invalid data structure)
    - Log detailed error context
    - Continue processing remaining sites
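The retry-on-transient / skip-on-permanent split can be sketched as follows; `TransientError` stands in for an HTTP 429/503 response, and the injectable `sleep` is an addition for testability.

```python
import time

class TransientError(Exception):
    """Stand-in for an HTTP 429/503 response."""

def with_retry(fn, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn on TransientError with exponential backoff.

    Permanent exceptions propagate immediately so the caller can skip the
    site, log the error context, and continue with the remaining sites.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # exhausted retries: surface as a hard failure
            sleep(base_delay * 2 ** attempt)
```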
- [T] Stress test batch processor [2h] (P0)
  - Test with 1,200+ sites
  - Measure processing time (target: <10 minutes)
  - Monitor memory usage
  - Test interrupt/resume (Ctrl+C handling)
- [D] Optimize performance [2h] (P1)
  - Profile slow operations
  - Cache geocoding results
  - Batch Wikidata queries
  - Optimize JSON serialization
- [V] Quality review of 50 random sites [1h] (P0)
  - Manual inspection of extracted records
  - Verify institution type classification
  - Verify GHCID generation
  - Verify Location accuracy
  - Document edge cases for refinement
Phase 3: Validation & Testing (Days 13-16, 16 hours)
Day 13: LinkML Schema Validation
Morning (4 hours)
- [D] Create comprehensive validation script [2h] (P0)
  - File: `scripts/validate_unesco_output.py`
  - Validations:
    - LinkML schema compliance
    - Required field presence
    - Enum value validity
    - GHCID format regex
    - UUID format validation
    - URL reachability (sample check)
- [T] Create validation test suite [2h] (P0)
  - File: `tests/validation/test_unesco_validation.py`
  - Test cases:
    - Valid records pass validation
    - Invalid records fail with clear error messages
    - Edge cases handled gracefully
Afternoon (4 hours)
- [V] Run LinkML validator on all records [2h] (P0)
  ```
  for file in data/instances/unesco/*.yaml; do
    linkml-validate -s schemas/heritage_custodian.yaml "$file"
  done
  ```
- [V] Review validation errors [2h] (P0)
  - Categorize errors (schema, data quality, edge cases)
  - Fix schema issues
  - Fix extraction logic bugs
  - Document known limitations
Day 14: Data Quality Checks
Morning (4 hours)
- [D] Implement data quality checks [2.5h] (P0)
  - File: `scripts/quality_checks_unesco.py`
  - Checks:
    - Completeness: % of fields populated per record
    - Consistency: Same UNESCO ID → same GHCID
    - Accuracy: Geocoding spot checks
    - Uniqueness: No duplicate GHCIDs
    - Referential integrity: All identifiers valid
- [V] Run quality checks on full dataset [1.5h] (P0)
  - Generate quality report: `reports/unesco_quality_report.md`
  - Target metrics:
    - Completeness: >90% for required fields
    - Accuracy: >85% for institution type classification
    - Uniqueness: 100% (no duplicate GHCIDs)
Afternoon (4 hours)
- [D] Implement deduplication logic [2h] (P1)
  - File: `src/glam_extractor/extractors/unesco/deduplicator.py`
  - Logic:
    - Compare by UNESCO WHC ID (primary key)
    - Fuzzy match by name + country (detect data errors)
    - Merge records if duplicates found
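A dependency-free sketch of the duplicate check, using stdlib `difflib` instead of rapidfuzz. Reusing the 0.85 threshold that the ISIL lookup elsewhere in this plan specifies is an assumption here; the dedup task itself does not name a threshold.

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """UNESCO WHC ID is the primary key; otherwise fall back to a
    name + country fuzzy match (threshold is an assumed value)."""
    if a.get("unesco_id") is not None and a.get("unesco_id") == b.get("unesco_id"):
        return True
    if a.get("country") != b.get("country"):
        return False  # never merge across countries
    ratio = SequenceMatcher(None, a.get("name", "").lower(),
                            b.get("name", "").lower()).ratio()
    return ratio >= threshold
```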
- [T] Test deduplication [1h] (P1)
  - Create synthetic duplicates
  - Test merging logic
  - Verify data integrity after merge
- [V] Final quality gate [1h] (P0)
  - Criteria:
    - All records pass LinkML validation
    - Quality metrics meet targets
    - No critical errors in logs
    - Sample manual review passes
Day 15: Integration Testing
Full Day (8 hours)
- [T] Create integration test suite [4h] (P0)
  - File: `tests/integration/unesco/test_full_pipeline.py`
  - Tests:
    - Fetch UNESCO data from live API
    - Process 100 random sites
    - Validate output against schema
    - Export to JSON-LD, RDF, CSV
    - Verify exported files are valid
- [T] Create cross-reference tests [2h] (P1)
  - File: `tests/integration/unesco/test_cross_reference.py`
  - Tests:
    - Cross-check UNESCO WHC IDs with official list
    - Verify Wikidata Q-numbers resolve
    - Verify website URLs are reachable (sample)
- [V] Performance benchmarks [1h] (P0)
  - Measure extraction time per site (target: <0.5s)
  - Measure total processing time (target: <10 minutes)
  - Measure memory usage (target: <2GB)
- [V] User acceptance testing [1h] (P0)
  - Load sample data into visualization tool
  - Verify data is human-readable
  - Verify data structure supports use cases (doc 02)
Day 16: Test Coverage & Documentation
Morning (4 hours)
- [V] Measure test coverage [1h] (P0)
  ```
  pytest --cov=src/glam_extractor/extractors/unesco \
    --cov-report=html \
    tests/unit/unesco/
  ```
  - Target coverage: >80%
- [T] Add missing tests to reach coverage target [2h] (P0)
  - Focus on edge cases
  - Focus on error handling paths
- [D] Document testing strategy [1h] (P1)
  - File: `docs/unesco/testing.md`
  - Document test fixtures
  - Document test data sources
  - Document known test gaps
Afternoon (4 hours)
- [D] Create developer guide [2h] (P1)
  - File: `docs/unesco/developer_guide.md`
  - How to run extraction locally
  - How to add new classifiers
  - How to extend schema
  - How to debug extraction errors
- [D] Create operational runbook [2h] (P1)
  - File: `docs/unesco/operations.md`
  - How to schedule periodic extractions
  - How to monitor data quality
  - How to handle API changes
  - How to troubleshoot common issues
Phase 4: Export & Serialization (Days 17-20, 16 hours)
Day 17: JSON-LD Exporter
Morning (4 hours)
- [T] Create test for JSON-LD exporter [1.5h] (P0)
  - File: `tests/unit/unesco/test_jsonld_exporter.py`
  - Test cases:
    - Export single HeritageCustodian to JSON-LD
    - Verify @context is valid
    - Verify @type is correct
    - Verify linked data properties
- [D+T] Implement JSON-LD exporter [2.5h] (P0)
  - File: `src/glam_extractor/exporters/jsonld_unesco.py`
  - Function: `export_to_jsonld(custodian: HeritageCustodian) -> dict`
  - JSON-LD context:
    ```
    {
      "@context": {
        "@vocab": "https://w3id.org/heritage/custodian/",
        "name": "http://schema.org/name",
        "location": "http://schema.org/location",
        "unesco": "https://whc.unesco.org/en/list/"
      }
    }
    ```
Afternoon (4 hours)
- [D] Create JSON-LD context file [1h] (P0)
  - File: `contexts/unesco_heritage_custodian.jsonld`
  - Reference existing context from global_glam project
  - Add UNESCO-specific terms
- [T] Validate JSON-LD output [1h] (P0)
  - Use JSON-LD Playground: https://json-ld.org/playground/
  - Verify RDF triples are correct
  - Test with sample records
- [D] Batch export to JSON-LD [2h] (P0)
  - Export all 1,200+ records to `data/instances/unesco/jsonld/`
  - One file per record: `whc-{unesco_id}.jsonld`
Day 18: RDF/Turtle Exporter
Morning (4 hours)
- [T] Create test for RDF exporter [1.5h] (P0)
  - File: `tests/unit/unesco/test_rdf_exporter.py`
  - Test cases:
    - Export to Turtle (.ttl)
    - Export to N-Triples (.nt)
    - Verify RDF triples are valid
    - Test with rdflib
- [D+T] Implement RDF exporter [2.5h] (P0)
  - File: `src/glam_extractor/exporters/rdf_unesco.py`
  - Function: `export_to_rdf(custodian: HeritageCustodian, format: str) -> str`
  - Dependencies: `pip install rdflib==7.0.0`
Afternoon (4 hours)
- [D] Create RDF namespace definitions [1h] (P0)
  - File: `src/glam_extractor/exporters/namespaces.py`
  - Namespaces:
    - heritage: https://w3id.org/heritage/custodian/
    - unesco: https://whc.unesco.org/en/list/
    - schema: http://schema.org/
    - geo: http://www.w3.org/2003/01/geo/wgs84_pos#
    - prov: http://www.w3.org/ns/prov#
- [D] Batch export to Turtle [2h] (P0)
  - Export all records to `data/instances/unesco/rdf/unesco_sites.ttl`
  - Create one large Turtle file with all 1,200+ sites
- [V] Validate RDF output [1h] (P0)
  - Use rapper validator: `rapper -i turtle unesco_sites.ttl`
  - Test SPARQL queries on output
Day 19: CSV & Parquet Exporters
Morning (4 hours)
- [T] Create test for CSV exporter [1h] (P0)
  - File: `tests/unit/unesco/test_csv_exporter.py`
  - Test cases:
    - Export flat CSV with essential fields
    - Handle multi-valued fields (locations, identifiers)
    - UTF-8 encoding
- [D+T] Implement CSV exporter [2h] (P0)
  - File: `src/glam_extractor/exporters/csv_unesco.py`
  - Function: `export_to_csv(custodians: List[HeritageCustodian], path: Path)`
  - Columns:
    - ghcid, name, institution_type, city, country, latitude, longitude
    - unesco_id, inscription_year, category, criteria
    - wikidata_id, website_url
    - extraction_date, confidence_score
- [D] Batch export to CSV [1h] (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.csv`
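The "handle multi-valued fields" requirement mostly comes down to flattening list-valued slots into a delimited string; joining with `|` is an assumed convention, not one specified in this plan.

```python
import csv
import io

def flatten(value):
    """Join list-valued fields with '|' (an assumed export convention)."""
    return "|".join(map(str, value)) if isinstance(value, list) else value

def export_csv_string(records: list[dict], columns: list[str]) -> str:
    """Render records as a flat UTF-8 CSV with the given column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow({col: flatten(record.get(col, "")) for col in columns})
    return buf.getvalue()
```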
Afternoon (4 hours)
- [T] Create test for Parquet exporter [1h] (P0)
  - File: `tests/unit/unesco/test_parquet_exporter.py`
  - Test cases:
    - Export to Parquet format
    - Verify schema preservation
    - Test with DuckDB or Pandas
- [D+T] Implement Parquet exporter [2h] (P0)
  - File: `src/glam_extractor/exporters/parquet_unesco.py`
  - Function: `export_to_parquet(custodians: List[HeritageCustodian], path: Path)`
  - Dependencies: `pip install pyarrow==14.0.0`
- [D] Batch export to Parquet [1h] (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.parquet`
Day 20: SQLite Database
Morning (4 hours)
- [D] Design SQLite schema [1.5h] (P0)
  - File: `scripts/create_unesco_db_schema.sql`
  - Tables:
    - heritage_custodians (main table)
    - locations (1:N relationship)
    - identifiers (1:N relationship)
    - change_events (1:N relationship)
    - collections (1:N relationship)
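A cut-down sketch of the table design: the main table plus one of the 1:N child tables, with the foreign-key pragma enabled. Column names mirror the CSV export fields; the exact columns are assumptions pending the real `create_unesco_db_schema.sql`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE heritage_custodians (
    ghcid            TEXT PRIMARY KEY,
    unesco_id        INTEGER UNIQUE NOT NULL,
    name             TEXT NOT NULL,
    institution_type TEXT
);
CREATE TABLE locations (
    id        INTEGER PRIMARY KEY,
    ghcid     TEXT NOT NULL REFERENCES heritage_custodians(ghcid),
    city      TEXT,
    country   TEXT,
    latitude  REAL,
    longitude REAL
);
CREATE INDEX idx_locations_ghcid ON locations(ghcid);
"""

def create_unesco_db(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the 1:N references
    conn.executescript(SCHEMA)
    return conn
```

The identifiers, change_events, and collections tables would follow the same `ghcid`-keyed 1:N pattern as `locations`.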
- [D] Implement SQLite exporter [2.5h] (P0)
  - File: `src/glam_extractor/exporters/sqlite_unesco.py`
  - Function: `export_to_sqlite(custodians: List[HeritageCustodian], db_path: Path)`
  - Dependencies: Built-in `sqlite3`
Afternoon (4 hours)
- [D] Create database indexes [1h] (P0)
  - Index on ghcid (primary key)
  - Index on unesco_id (unique)
  - Index on country
  - Index on institution_type
  - Full-text search on name and description
- [T] Test SQLite export and queries [2h] (P0)
  - Test data integrity
  - Test foreign key constraints
  - Test query performance
- [D] Create sample SQL queries [1h] (P1)
  - File: `docs/unesco/sample_queries.sql`
  - Example queries for common use cases
Phase 5: Integration & Enrichment (Days 21-24, 16 hours)
Day 21: Wikidata Enrichment Pipeline
Morning (4 hours)
- [D] Implement batch Wikidata enrichment [3h] (P1)
  - File: `scripts/enrich_unesco_with_wikidata.py`
  - Process:
    - Load all UNESCO records without Wikidata IDs
    - Query Wikidata SPARQL endpoint (batches of 100)
    - Match by UNESCO WHC ID (property P757)
    - Update records with Q-numbers
    - Extract additional metadata from Wikidata (VIAF, founding date)
- [T] Test Wikidata enrichment [1h] (P1)
  - Test with 50 sample sites
  - Verify match accuracy
  - Handle rate limiting
Afternoon (4 hours)
- [D] Implement ISIL code lookup [2h] (P1)
  - File: `scripts/enrich_unesco_with_isil.py`
  - Data source: Cross-reference with existing ISIL registry
  - Logic:
    - Match by institution name + country
    - Fuzzy matching threshold: 0.85
    - Add ISIL identifier to records
- [V] Review enrichment results [2h] (P1)
  - Count enriched records
  - Review match quality
  - Document enrichment coverage
Day 22: Cross-dataset Linking
Morning (4 hours)
- [D] Link UNESCO records to existing GLAM dataset [3h] (P1)
  - File: `scripts/link_unesco_to_global_glam.py`
  - Matching strategies:
    - GHCID match (direct)
    - Wikidata ID match
    - Name + country fuzzy match
  - Output: Crosswalk table linking UNESCO WHC ID to existing GHCIDs
- [V] Validate cross-dataset links [1h] (P1)
  - Manual review of 50 matched pairs
  - Check for false positives
  - Document ambiguous cases
Afternoon (4 hours)
- [D] Create dataset merge logic [2h] (P1)
  - File: `scripts/merge_unesco_into_global_glam.py`
  - Merge rules:
    - If GHCID exists: Update with UNESCO metadata
    - If no match: Add as new record
    - Preserve provenance trail
- [T] Test merge logic [1h] (P1)
  - Test with sample dataset
  - Verify no data loss
  - Verify GHCID uniqueness maintained
- [V] Final merge quality check [1h] (P0)
  - Review merged dataset
  - Verify provenance for all records
  - Document merge statistics
Day 23: Data Quality Refinement
Full Day (8 hours)
- [D] Implement data quality improvements [4h] (P0)
  - Fix classification errors from manual review
  - Refine keyword patterns for edge cases
  - Improve geocoding accuracy
  - Handle special characters in names
- [T] Regression testing [2h] (P0)
  - Re-run full test suite
  - Verify improvements didn't break existing functionality
- [V] Final data quality audit [2h] (P0)
  - Re-run quality checks
  - Compare metrics to Day 14 baseline
  - Target improvement: +5% accuracy
### Day 24: Documentation & Handoff

#### Morning (4 hours)

- [D] Create extraction report [2h] (P0)
  - File: `reports/unesco_extraction_report.md`
  - Contents:
    - Total records extracted: 1,200+
    - Classification accuracy: XX%
    - Coverage by institution type
    - Geographic distribution
    - Data quality metrics
    - Known limitations
- [D] Document known issues [1h] (P0)
  - File: `docs/unesco/known_issues.md`
  - List edge cases not handled
  - List sites with ambiguous classification
  - List sites missing key metadata
- [D] Create future enhancements roadmap [1h] (P1)
  - File: `docs/unesco/roadmap.md`
  - Planned improvements
  - Additional data sources to integrate
  - New features to implement

#### Afternoon (4 hours)

- [D] Create API documentation [2h] (P1)
  - File: `docs/unesco/api.md`
  - Document all public functions
  - Document CLI commands
  - Provide usage examples
- [D] Create maintenance guide [2h] (P1)
  - File: `docs/unesco/maintenance.md`
  - How to update UNESCO data (periodic refresh)
  - How to handle API changes
  - How to re-run enrichment pipelines
## Phase 6: Deployment & Monitoring (Days 25-28, 16 hours)

### Day 25: Automation Scripts

#### Morning (4 hours)

- [D] Create automated extraction pipeline [2.5h] (P0)
  - File: `scripts/run_unesco_extraction.sh`
  - Steps:
    - Fetch latest UNESCO data
    - Run extraction
    - Validate output
    - Export to all formats
    - Run quality checks
    - Generate report
- [T] Test automation script [1.5h] (P0)
  - Test end-to-end execution
  - Test error handling
  - Test with scheduled execution (cron)
#### Afternoon (4 hours)

- [D] Create incremental update script [2.5h] (P1)
  - File: `scripts/update_unesco_incremental.py`
  - Logic:
    - Compare current UNESCO data with previous extraction
    - Detect new sites, modified sites, delisted sites
    - Process only changes
    - Update database incrementally
- [T] Test incremental updates [1.5h] (P1)
  - Simulate API changes
  - Verify only deltas processed
  - Verify data consistency
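The change-detection step reduces to a three-way diff of two snapshots. A minimal sketch, assuming each snapshot maps WHC ID to a record dict:

```python
def diff_extractions(previous, current):
    """Detect new, modified, and delisted sites between two extractions.

    Both arguments map WHC ID -> record dict; only the returned deltas
    need further processing.
    """
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "new": sorted(curr_ids - prev_ids),           # in current only
        "delisted": sorted(prev_ids - curr_ids),      # in previous only
        "modified": sorted(                           # present in both, changed
            i for i in prev_ids & curr_ids if previous[i] != current[i]
        ),
    }
```

Simulating API changes (the test task above) is then just a matter of editing one snapshot and asserting on the returned deltas.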
### Day 26: Monitoring & Alerts

#### Morning (4 hours)

- [D] Implement extraction monitoring [2h] (P1)
  - File: `src/glam_extractor/monitoring/unesco_monitor.py`
  - Metrics to track:
    - Extraction success rate
    - Average processing time per site
    - Classification confidence distribution
    - Data quality scores
- [D] Create alerting logic [2h] (P1)
  - File: `src/glam_extractor/monitoring/alerts.py`
  - Alert conditions:
    - Extraction failure rate >5%
    - Classification confidence <0.7 for >10% of sites
    - API downtime
    - Schema validation failure rate >1%
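The four alert conditions translate directly into threshold checks. A sketch, assuming a metrics dict with the illustrative keys below (the real monitor's field names may differ):

```python
def check_alerts(metrics):
    """Evaluate the alert conditions above; return a list of fired alerts."""
    alerts = []
    if metrics["extraction_failure_rate"] > 0.05:
        alerts.append("extraction failure rate >5%")
    if metrics["low_confidence_fraction"] > 0.10:
        # fraction of sites whose classification confidence is <0.7
        alerts.append("classification confidence <0.7 for >10% of sites")
    if metrics["api_down"]:
        alerts.append("API downtime")
    if metrics["schema_failure_rate"] > 0.01:
        alerts.append("schema validation failure rate >1%")
    return alerts
```

An empty return means no notification; anything else is handed to the notification channel configured on Day 27.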
#### Afternoon (4 hours)

- [D] Create monitoring dashboard (optional) [3h] (P2)
  - Tool: Jupyter notebook or Streamlit
  - Visualizations:
    - Extraction pipeline status
    - Data quality trends over time
    - Geographic coverage map
    - Institution type distribution
- [V] Test monitoring system [1h] (P1)
  - Simulate failures
  - Verify alerts trigger
  - Test dashboard functionality
### Day 27: Production Deployment

#### Morning (4 hours)

- [D] Package extraction module [2h] (P0)
  - Create `setup.py` or `pyproject.toml`
  - Document installation instructions
  - Test installation in clean environment
- [D] Create Docker container (optional) [2h] (P2)
  - File: `docker/Dockerfile.unesco`
  - Containerize extraction pipeline
  - Document Docker deployment

#### Afternoon (4 hours)

- [D] Deploy to production environment [2h] (P0)
  - Upload code to production server
  - Install dependencies
  - Configure environment variables
  - Test production execution
- [V] Production smoke test [1h] (P0)
  - Run extraction on production
  - Verify output files created
  - Verify database updated
  - Verify logs generated
- [D] Configure scheduled execution [1h] (P0)
  - Set up cron job for weekly updates
  - Configure log rotation
  - Set up error notifications
### Day 28: Handoff & Training

#### Full Day (8 hours)

- [D] Create training materials [3h] (P1)
  - Presentation: project overview and architecture
  - Demo: live extraction walkthrough
  - Exercises: hands-on practice for maintainers
- [D] Conduct knowledge transfer session [3h] (P1)
  - Present to stakeholders
  - Walk through codebase
  - Answer questions
  - Document Q&A
- [D] Create support plan [2h] (P1)
  - File: `docs/unesco/support.md`
  - Contact information for maintainers
  - Escalation procedures
  - Known issues and workarounds
## Phase 7: Finalization (Days 29-30, 8 hours)

### Day 29: Final Review

#### Morning (4 hours)

- [V] Code review [2h] (P0)
  - Review all implementation files
  - Check code quality (linting, type hints)
  - Verify documentation completeness
- [V] Test coverage review [1h] (P0)
  - Verify coverage >80%
  - Review untested code paths
  - Document rationale for any skipped tests
- [V] Security review [1h] (P0)
  - Check for hardcoded credentials
  - Verify API keys are kept in environment variables
  - Check for SQL injection vulnerabilities
  - Verify input validation

#### Afternoon (4 hours)

- [D] Final documentation pass [2h] (P0)
  - Proofread all documentation
  - Check for broken links
  - Verify code examples are correct
  - Update README with final statistics
- [V] Stakeholder demo [2h] (P0)
  - Present final results
  - Demo visualization of extracted data
  - Discuss next steps
  - Gather feedback
### Day 30: Release & Archive

#### Morning (4 hours)

- [D] Create release package [2h] (P0)
  - Tag release version: `v1.0.0-unesco`
  - Create GitHub release (if applicable)
  - Archive source code
  - Archive extracted datasets
- [D] Publish datasets [1h] (P0)
  - Upload to project repository
  - Generate dataset DOI (Zenodo or similar)
  - Update dataset catalog
- [D] Create project retrospective [1h] (P1)
  - File: `reports/unesco_retrospective.md`
  - What went well
  - What could be improved
  - Lessons learned
  - Recommendations for future extractions

#### Afternoon (4 hours)

- [D] Final cleanup [2h] (P0)
  - Remove temporary files
  - Archive logs
  - Clean up test data
  - Optimize database indexes
- [V] Final quality gate [1h] (P0)
  - Checklist:
    - ✓ All tests passing
    - ✓ All data validated
    - ✓ All documentation complete
    - ✓ Production deployment successful
    - ✓ Monitoring in place
    - ✓ Stakeholders trained
- [D] Project close-out [1h] (P0)
  - Send completion notification
  - Archive project artifacts
  - Update project status: COMPLETE
## Appendix A: Task Dependencies

### Critical Path (Must Complete in Order)

- Day 1-2: Environment setup → blocks all development
- Day 3-5: Schema & types → blocks extraction logic
- Day 6-8: Data extraction → blocks validation
- Day 9-12: Batch processing → blocks export
- Day 13-16: Validation → blocks production deployment
- Day 17-20: Export → blocks integration
- Day 25-28: Deployment → blocks project completion

### Parallel Work Opportunities

- Days 7-10: Location extraction + identifier extraction can be done in parallel
- Days 17-20: All exporters (JSON-LD, RDF, CSV, Parquet, SQLite) can be developed in parallel
- Days 21-22: Wikidata enrichment + ISIL lookup can be done in parallel
## Appendix B: Quality Gates

### Gate 1: Schema Complete (End of Day 5)

- All LinkML schemas validated
- Python classes generated successfully
- Enum values documented
- GHCID format specified

### Gate 2: Extraction Pipeline Complete (End of Day 12)

- API client tested with live data
- All extractors implemented and tested
- Batch processor handles 1,200+ sites
- Error handling robust

### Gate 3: Validation Complete (End of Day 16)

- All records pass LinkML validation
- Quality metrics meet targets
- Test coverage >80%
- Sample manual review passes

### Gate 4: Export Complete (End of Day 20)

- All export formats tested
- Output files validated
- Database indexes optimized
- Sample queries work

### Gate 5: Production Ready (End of Day 28)

- Production deployment successful
- Monitoring in place
- Documentation complete
- Team trained

### Gate 6: Project Complete (End of Day 30)

- All tests passing
- All datasets published
- Stakeholders notified
- Project archived
## Appendix C: Risk Management

### High-Priority Risks

#### Risk 1: UNESCO API Changes

- Impact: High (extraction fails)
- Probability: Low
- Mitigation:
  - Monitor API version
  - Cache raw API responses
  - Implement schema validation on API responses
  - Create fallback to manual data files

#### Risk 2: Institution Type Classification Accuracy <85%

- Impact: Medium (data quality issues)
- Probability: Medium
- Mitigation:
  - Iterative refinement of keywords
  - Manual review of 100+ samples
  - Add confidence threshold (skip uncertain classifications)
  - Document known edge cases

#### Risk 3: GHCID Collisions

- Impact: Medium (duplicate identifiers)
- Probability: Low (<5% expected)
- Mitigation:
  - Implement collision detection
  - Use Wikidata Q-numbers for disambiguation
  - Document collision resolution rules
  - Manual review of collisions
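Collision detection (the first mitigation for Risk 3) is a simple grouping pass; a sketch, assuming each record exposes a `ghcid` field:

```python
from collections import defaultdict


def find_ghcid_collisions(records):
    """Group records sharing a GHCID so collisions can be reviewed and
    disambiguated (e.g., via Wikidata Q-numbers).
    """
    by_id = defaultdict(list)
    for rec in records:
        by_id[rec["ghcid"]].append(rec)
    # Keep only identifiers claimed by more than one record.
    return {ghcid: recs for ghcid, recs in by_id.items() if len(recs) > 1}
```

Run over the full batch, this yields exactly the set of records that needs the documented resolution rules and manual review.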
#### Risk 4: Performance Issues (>10 min processing time)

- Impact: Low (convenience only)
- Probability: Medium
- Mitigation:
  - Implement parallel processing
  - Cache geocoding results
  - Batch Wikidata queries
  - Profile and optimize slow operations

#### Risk 5: Wikidata Enrichment Rate Limiting

- Impact: Low (enrichment incomplete)
- Probability: High
- Mitigation:
  - Respect rate limits (1 req/sec)
  - Implement exponential backoff
  - Make enrichment optional
  - Run enrichment as separate batch job
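The exponential-backoff mitigation can be sketched generically; the retry count and delays below are illustrative defaults, not project settings.

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Wait base * 2^attempt, with jitter to de-synchronize clients.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

Wrapping each Wikidata batch call in `with_backoff` (with `base_delay` tuned to the 1 req/sec budget) keeps transient throttling from aborting a whole enrichment run.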
## Appendix D: Resource Requirements

### Software Dependencies

Required:

- Python 3.11+
- linkml 1.7.10+
- pydantic 1.10.13
- requests 2.31.0

Optional:

- rapidfuzz 3.5.2 (fuzzy matching)
- SPARQLWrapper 2.0.0 (Wikidata)
- rdflib 7.0.0 (RDF export)
- pyarrow 14.0.0 (Parquet export)

### Infrastructure

Development:

- Laptop/workstation with 8GB+ RAM
- Internet connection (API access)
- ~2GB disk space for data

Production:

- Server with 4GB+ RAM
- Cron scheduling capability
- 5GB disk space (with logs)

### Team Resources

Estimated effort: 120 person-hours over 30 days

Roles:

- Software Engineer (primary developer)
- Data Scientist (classification logic)
- QA Engineer (testing)
- Documentation Writer (optional)
## Appendix E: Success Metrics

### Quantitative Metrics

- Extraction completeness: >95% of UNESCO sites processed successfully
- Classification accuracy: >85% correct institution type assignments
- Validation pass rate: 100% of records pass LinkML schema validation
- Test coverage: >80% code coverage
- Processing time: <10 minutes for full dataset
- Data quality score: >90% completeness on required fields
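The completeness target (>90% on required fields) can be measured with a small helper; the required-field list below is illustrative only, as the authoritative list comes from the LinkML schema.

```python
REQUIRED_FIELDS = ("ghcid", "name", "country", "institution_type")  # illustrative


def completeness_score(records, required=REQUIRED_FIELDS):
    """Fraction of required field slots populated across all records."""
    total = len(records) * len(required)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for rec in records
        for field in required
        if rec.get(field) not in (None, "")  # treat None and "" as missing
    )
    return filled / total
```

A run over the full dataset passing `completeness_score(records) > 0.90` satisfies the data quality metric above.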
### Qualitative Metrics

- Code maintainability: clear documentation, modular design
- Data usability: stakeholders can query and visualize data
- Reproducibility: pipeline can be re-run by new team members
- Extensibility: easy to add new data sources or fields
## Version History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.1 | 2025-11-10 | AI Agent (OpenCode) | Updated Day 6 API client and parser sections to reflect OpenDataSoft Explore API v2 response structure (nested record.fields, pagination links, coordinate objects) |
| 1.0 | 2025-11-09 | GLAM Extraction Team | Initial version |
Document Status: ✅ COMPLETE
Next Steps: Begin Phase 0 - Pre-Implementation Setup
Questions? Refer to planning documents 00-06 for detailed specifications.