# UNESCO Data Extraction - Master Implementation Checklist

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction

**Document**: 07 - Master Implementation Checklist

**Version**: 1.1

**Date**: 2025-11-10

**Status**: Planning

---

## Executive Summary

This document provides a **comprehensive 30-day implementation checklist** for the UNESCO World Heritage Sites extraction project. It consolidates all planning documents (00-06) into a structured timeline with actionable tasks, testing requirements, and quality gates.

**Timeline**: 30 calendar days (estimated 120 person-hours)

**Approach**: Test-Driven Development (TDD) with iterative validation

**Output**: 1,200+ LinkML-compliant heritage custodian records

---

## How to Use This Checklist

### Notation

- **[D]** = Development task
- **[T]** = Testing task
- **[V]** = Validation/Quality gate
- **[D+T]** = TDD pair (test first, then implementation)
- **✓** = Completed
- **⚠** = In progress
- **⊗** = Blocked/Issue

### Task Priority Levels

- **P0** = Critical path, must complete
- **P1** = High priority, complete if time permits
- **P2** = Nice to have, future iteration

### Estimated Time

Each task includes an estimated duration in hours (e.g., `[2h]`).

---

## Phase 0: Pre-Implementation Setup (Days 1-2, 8 hours)

### Day 1: Environment & Dependencies

#### Morning (4 hours)

- [ ] **[D]** Verify Python 3.11+ installation `[0.5h]` (P0)
  - Run: `python --version`
  - Create virtual environment: `python -m venv .venv`
  - Activate: `source .venv/bin/activate`

- [ ] **[D]** Install core dependencies from `requirements.txt` `[1h]` (P0)

  ```bash
  pip install linkml==1.7.10
  pip install linkml-runtime==1.7.5
  pip install pydantic==1.10.13
  pip install requests==2.31.0
  pip install python-dotenv==1.0.0
  pip install pytest==7.4.3
  pip install pytest-cov==4.1.0
  ```

- [ ] **[D]** Install optional dependencies `[0.5h]` (P1)

  ```bash
  pip install rapidfuzz==3.5.2      # Fuzzy matching
  pip install SPARQLWrapper==2.0.0  # Wikidata queries
  pip install click==8.1.7          # CLI interface
  ```

- [ ] **[V]** Verify LinkML installation `[0.5h]` (P0)

  ```bash
  linkml-validate --version
  gen-python --version
  gen-json-schema --version
  ```

- [ ] **[D]** Clone/update project repository `[0.5h]` (P0)

  ```bash
  cd /Users/kempersc/apps/glam
  git status
  git pull origin main
  ```

- [ ] **[D]** Create directory structure `[1h]` (P0)

  ```bash
  mkdir -p schemas/maps
  mkdir -p src/glam_extractor/extractors/unesco
  mkdir -p tests/fixtures/unesco
  mkdir -p tests/unit/unesco
  mkdir -p tests/integration/unesco
  mkdir -p data/raw/unesco
  mkdir -p data/processed/unesco
  mkdir -p data/instances/unesco
  ```

#### Afternoon (4 hours)

- [ ] **[D]** Set up UNESCO API access `[1h]` (P0)
  - Read API docs: https://data.unesco.org/api/explore/v2.0/console
  - Test API endpoint: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - Note: Legacy endpoint `https://whc.unesco.org/en/list/json` is **BLOCKED** (Cloudflare 403)
  - UNESCO uses the **OpenDataSoft Explore API v2** (not OAI-PMH)
  - No authentication required for public datasets
  - Document the ODSQL query language for filtering
  - Create `.env` file with configuration

- [ ] **[D]** Download sample UNESCO data `[1h]` (P0)

  ```bash
  # OpenDataSoft API with pagination (max 100 records per request)
  curl -o data/raw/unesco/whc-sites-2025.json \
    "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=100&offset=0"

  # For the full dataset (1,200+ sites), implement pagination in Python.
  # CSV export fallback (if the API is unavailable):
  # https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/exports/csv
  ```
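  The limit/offset loop mentioned in the comment above can be sketched as follows. The endpoint URL comes from the checklist; the injectable `fetch` parameter and the `results`/`records` key fallback are assumptions added so the loop is testable without network access (Explore API responses vary slightly between versions).

  ```python
  import json
  from typing import Callable, Iterator
  from urllib.request import urlopen

  BASE_URL = ("https://data.unesco.org/api/explore/v2.0"
              "/catalog/datasets/whc001/records")

  def _http_fetch(url: str) -> dict:
      # Real fetcher: one GET per page (no auth needed for public datasets).
      with urlopen(url) as resp:
          return json.load(resp)

  def iter_all_sites(fetch: Callable[[str], dict] = _http_fetch,
                     page_size: int = 100) -> Iterator[dict]:
      """Yield every record by walking limit/offset pages until a short page."""
      offset = 0
      while True:
          page = fetch(f"{BASE_URL}?limit={page_size}&offset={offset}")
          records = page.get("results") or page.get("records") or []
          yield from records
          if len(records) < page_size:
              break
          offset += page_size
  ```

  Injecting a fake `fetch` also keeps the unit tests planned for Day 6 free of live HTTP calls.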

- [ ] **[D]** Create project README for UNESCO extraction `[1h]` (P1)
  - Document scope: 1,200+ WHS extraction
  - Reference planning documents (00-06)
  - List dependencies and prerequisites

- [ ] **[V]** Review existing schema modules `[1h]` (P0)
  - Read `schemas/core.yaml` (HeritageCustodian, Location, Identifier)
  - Read `schemas/enums.yaml` (InstitutionTypeEnum, DataSource)
  - Read `schemas/provenance.yaml` (Provenance, ChangeEvent)
  - Read `schemas/collections.yaml` (Collection metadata)

---

## Phase 1: Schema & Type System (Days 3-5, 16 hours)

### Day 3: LinkML Schema Extension

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO-specific enums `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_enums.py`
  - **Test cases**:
    - UNESCO category values (Cultural, Natural, Mixed)
    - UNESCO inscription status (Inscribed, Danger, Delisted)
    - UNESCO serial nomination flag
    - UNESCO transboundary flag

- [ ] **[D+T]** Add UNESCO enums to `schemas/enums.yaml` `[2h]` (P0)

  ```yaml
  UNESCOCategoryEnum:
    permissible_values:
      CULTURAL:
        description: Cultural heritage site
      NATURAL:
        description: Natural heritage site
      MIXED:
        description: Mixed cultural and natural site

  UNESCOInscriptionStatusEnum:
    permissible_values:
      INSCRIBED:
        description: Currently inscribed on World Heritage List
      IN_DANGER:
        description: On List of World Heritage in Danger
      DELISTED:
        description: Removed from World Heritage List
  ```

- [ ] **[V]** Validate enum schema `[0.5h]` (P0)

  ```bash
  linkml-validate -s schemas/heritage_custodian.yaml \
    --target-class UNESCOCategoryEnum
  ```

  (`-C` is shorthand for `--target-class`, so only one of the two flags is needed.)

- [ ] **[D]** Generate Python classes from schema `[0.5h]` (P0)

  ```bash
  gen-python schemas/heritage_custodian.yaml > \
    src/glam_extractor/models/generated.py
  ```
#### Afternoon (4 hours)

- [ ] **[T]** Create test for UNESCO extension class `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_custodian.py`
  - **Test cases**:
    - UNESCOHeritageCustodian inherits from HeritageCustodian
    - Additional fields: unesco_id, inscription_year, category, criteria
    - Validation rules for the UNESCO WHC ID format
    - Multi-language name handling

- [ ] **[D+T]** Create UNESCO extension schema `[2h]` (P0)
  - **File**: `schemas/unesco.yaml`
  - **Extends**: HeritageCustodian from `schemas/core.yaml`
  - **New slots**:
    - `unesco_id`: Integer (UNESCO WHC site ID)
    - `inscription_year`: Integer (year inscribed)
    - `unesco_category`: UNESCOCategoryEnum
    - `unesco_criteria`: List of strings (e.g., ["i", "ii", "iv"])
    - `in_danger`: Boolean
    - `serial_nomination`: Boolean
    - `transboundary`: Boolean

- [ ] **[D]** Update main schema to import UNESCO module `[0.5h]` (P0)
  - Edit `schemas/heritage_custodian.yaml`
  - Add import: `imports: [core, enums, provenance, collections, unesco]`
### Day 4: Type Classification System

#### Morning (4 hours)

- [ ] **[T]** Create test for institution type classifier `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_institution_classifier.py`
  - **Test cases**:
    - Museums: "National Museum of...", "Museum Complex"
    - Libraries: "Royal Library", "Library of..."
    - Archives: "National Archives", "Documentary Heritage"
    - Holy sites: "Cathedral", "Temple", "Monastery"
    - Mixed: multiple institution types at one site
    - Unknown: defaults to MIXED when uncertain

- [ ] **[D+T]** Implement institution type classifier `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/institution_classifier.py`
  - **Function**: `classify_unesco_site(site_data: dict) -> InstitutionTypeEnum`
  - **Logic**:
    1. Extract `short_description_en`, `justification_en`, `name_en`
    2. Apply keyword matching patterns (see doc 05, design pattern #2)
    3. Return an InstitutionTypeEnum value
    4. Assign a confidence score (0.0-1.0)

#### Afternoon (4 hours)

- [ ] **[D]** Create classification keyword database `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/classification_rules.yaml`
  - **Structure**:

  ```yaml
  MUSEUM:
    keywords:
      - museum
      - gallery
      - kunsthalle
      - musée
    context_boost:
      - collection
      - exhibit
      - artifact

  LIBRARY:
    keywords:
      - library
      - biblioteca
      - bibliothèque
    context_boost:
      - manuscript
      - codex
      - archival

  ARCHIVE:
    keywords:
      - archive
      - archivo
      - archiv
    context_boost:
      - document
      - record
      - registry

  HOLY_SITES:
    keywords:
      - cathedral
      - church
      - temple
      - monastery
      - abbey
      - mosque
      - synagogue
    context_boost:
      - pilgrimage
      - religious
      - sacred
  ```

- [ ] **[T]** Add edge case tests for classifier `[1h]` (P1)
  - Test multilingual descriptions (French, Spanish, Arabic)
  - Test sites with ambiguous descriptions
  - Test serial nominations with multiple institution types

- [ ] **[V]** Benchmark classifier accuracy `[1h]` (P1)
  - Manually classify 50 sample UNESCO sites
  - Compare with algorithm output
  - Target accuracy: >85%

### Day 5: GHCID Generation for UNESCO

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO GHCID generation `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_ghcid_unesco.py`
  - **Test cases**:
    - Single-country site: `FR-IDF-PAR-M-WHC600` (Paris, Banks of the Seine)
    - Transboundary site: multi-country code handling
    - Serial nomination: append serial component ID
    - UUID v5 generation from GHCID
    - GHCID history tracking

- [ ] **[D+T]** Implement UNESCO GHCID generator `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/ghcid_generator.py`
  - **Function**: `generate_unesco_ghcid(site_data: dict) -> str`
  - **Format**: `{COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}`
  - **Logic**:
    1. Extract country from `states_iso_code`
    2. Extract city from `name_en` or geocode from lat/lon
    3. Classify institution type → single-letter code
    4. Append `WHC{unesco_id}` suffix
    5. Handle transboundary sites: use the primary country or a multi-country code
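
The format string and the UUID v5 step from the test cases above can be sketched together. The namespace UUID below is an illustrative assumption; the real project namespace would be defined once in the GLAM codebase so derived UUIDs stay stable.

```python
import uuid

# Illustrative namespace for deriving UUID v5 values from GHCIDs; the actual
# project namespace is an assumption here.
GHCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL,
                             "https://w3id.org/heritage/custodian/")

def generate_unesco_ghcid(country: str, region: str, city: str,
                          type_code: str, unesco_id: int) -> str:
    """Assemble {COUNTRY}-{REGION}-{CITY}-{TYPE}-WHC{UNESCO_ID}."""
    return (f"{country.upper()}-{region.upper()}-{city.upper()}"
            f"-{type_code.upper()}-WHC{unesco_id}")

def ghcid_to_uuid(ghcid: str) -> uuid.UUID:
    """Deterministic UUID v5, so the same GHCID always maps to the same UUID."""
    return uuid.uuid5(GHCID_NAMESPACE, ghcid)
```

Because v5 UUIDs are name-based, re-running the extraction never changes a site's UUID unless its GHCID changes — which is what makes GHCID history tracking worthwhile.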

#### Afternoon (4 hours)

- [ ] **[D]** Implement city extraction logic `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/city_extractor.py`
  - **Function**: `extract_city_from_name(name: str, country: str) -> str`
  - **Patterns**:
    - "Historic Centre of {City}"
    - "Old Town of {City}"
    - "{Site Name} in {City}"
    - Fallback to a geocoding API if pattern matching fails
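
The three name patterns above can be expressed as ordered regexes. In this sketch a `None` return signals "fall back to geocoding"; the exact regexes are assumptions derived from the pattern bullets, and `country` is accepted but unused at the pattern stage (the geocoding fallback would use it).

```python
import re
from typing import Optional

# Patterns tried in order, most specific first; each captures the city name.
_CITY_PATTERNS = [
    re.compile(r"^Historic Centre of (?P<city>.+)$", re.IGNORECASE),
    re.compile(r"^Old Town of (?P<city>.+)$", re.IGNORECASE),
    re.compile(r"^.+ in (?P<city>[^,]+)$"),
]

def extract_city_from_name(name: str, country: str = "") -> Optional[str]:
    """Try each pattern in order; None means 'use the geocoding fallback'.
    `country` is unused here but kept for the planned signature."""
    for pattern in _CITY_PATTERNS:
        match = pattern.match(name.strip())
        if match:
            return match.group("city").strip()
    return None
```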

- [ ] **[T]** Test city extraction edge cases `[1h]` (P1)
  - Test site names without city references
  - Test sites spanning multiple cities
  - Test non-English site names

- [ ] **[V]** Validate GHCID uniqueness `[1h]` (P0)
  - Generate GHCIDs for all 1,200+ UNESCO sites
  - Check for collisions (expect <5%)
  - Document the collision resolution strategy

---

## Phase 2: Data Extraction Pipeline (Days 6-12, 32 hours)

### Day 6: UNESCO API Client

#### Morning (4 hours)

- [ ] **[T]** Create test for UNESCO API client `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_unesco_api_client.py`
  - **Test cases**:
    - Fetch all sites with pagination (OpenDataSoft limit=100, offset mechanism)
    - Fetch a single site by ID using an ODSQL `where` clause
    - Handle HTTP errors (404, 500, timeout)
    - Parse the OpenDataSoft response structure (nested `record.fields`)
    - Cache responses (avoid repeated API calls)
    - Handle pagination links from the response

- [ ] **[D+T]** Implement UNESCO API client `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/api_client.py`
  - **Class**: `UNESCOAPIClient`
  - **Base URL**: `https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records`
  - **Methods**:
    - `fetch_all_sites() -> List[dict]` - paginate with limit=100, incrementing offset
    - `fetch_site_by_id(site_id: int) -> dict` - use ODSQL `where="unique_number={id}"`
    - `paginate_records(limit: int, offset: int) -> dict` - handle OpenDataSoft pagination
    - `download_and_cache(url: str, cache_path: Path) -> dict`
  - **Response parsing**: Extract the `records` array; navigate `record.fields` for each item

#### Afternoon (4 hours)

- [ ] **[D]** Implement response parser `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/response_parser.py`
  - **Function**: `parse_unesco_site(raw_data: dict) -> dict`
  - **Input**: OpenDataSoft record structure: `raw_data['record']['fields']`
  - **Transformations**:
    - Extract from the nested `fields` dict: `raw_data['record']['fields']`
    - Normalize field names (snake_case)
    - Parse `date_inscribed` to datetime (format: "YYYY-MM-DD")
    - Parse `criteria_txt` to a list: "(i)(ii)(iv)" → ["i", "ii", "iv"]
    - Extract coordinates from the `coordinates` object: `{"lat": ..., "lon": ...}`
    - Handle missing/null fields (many OpenDataSoft fields are optional)
  - **Example input**:

  ```json
  {
    "record": {
      "id": "abc123...",
      "fields": {
        "name_en": "Angkor",
        "unique_number": 668,
        "date_inscribed": "1992-01-01",
        "criteria_txt": "(i)(ii)(iii)(iv)",
        "coordinates": {"lat": 13.4333, "lon": 103.8333},
        "category": "Cultural",
        "states_name_en": "Cambodia"
      }
    }
  }
  ```
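
  A sketch of the transformations above, applied to the example record. The output field names beyond those shown in the example input are assumptions.

  ```python
  import re
  from datetime import date
  from typing import Optional

  def _parse_criteria(criteria_txt: Optional[str]) -> list:
      """'(i)(ii)(iv)' -> ['i', 'ii', 'iv']; tolerate None."""
      return re.findall(r"\(([ivx]+)\)", criteria_txt or "")

  def parse_unesco_site(raw_data: dict) -> dict:
      # Navigate the nested OpenDataSoft structure defensively: every level
      # may be missing, and many fields are optional.
      fields = raw_data.get("record", {}).get("fields", {})
      coords = fields.get("coordinates") or {}
      inscribed = fields.get("date_inscribed")
      return {
          "name_en": fields.get("name_en"),
          "unique_number": fields.get("unique_number"),
          "date_inscribed": date.fromisoformat(inscribed) if inscribed else None,
          "criteria": _parse_criteria(fields.get("criteria_txt")),
          "latitude": coords.get("lat"),
          "longitude": coords.get("lon"),
          "category": fields.get("category"),
          "country": fields.get("states_name_en"),
      }
  ```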

- [ ] **[T]** Create parser tests with fixtures `[1.5h]` (P0)
  - **Fixtures**: `tests/fixtures/unesco/sample_sites.json` (OpenDataSoft format)
  - Test parsing of 10 diverse UNESCO sites:
    - Cultural site (e.g., Angkor - unique_number: 668)
    - Natural site (e.g., Galápagos Islands - unique_number: 1)
    - Mixed site (e.g., Mount Taishan - unique_number: 437)
    - Transboundary site (e.g., Struve Geodetic Arc - unique_number: 1187)
    - Serial nomination (e.g., Le Corbusier's architectural works - unique_number: 1321)
  - **Test nested field extraction**: verify the parser correctly navigates `record.fields`
  - **Test coordinate parsing**: verify the `coordinates` object → separate lat/lon floats
  - **Test null handling**: verify sites with missing `coordinates`, `criteria_txt`, etc.

- [ ] **[V]** Test API client with live data `[0.5h]` (P0)

  ```bash
  python -m pytest tests/unit/unesco/test_unesco_api_client.py -v
  ```
### Day 7: Location & Geographic Data

#### Morning (4 hours)

- [ ] **[T]** Create test for Location object builder `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_location_builder.py`
  - **Test cases**:
    - Build a Location from UNESCO lat/lon + country
    - Reverse geocode coordinates to city/region
    - Handle missing coordinates (geocode from name)
    - Handle transboundary sites (multiple locations)

- [ ] **[D+T]** Implement Location builder `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/location_builder.py`
  - **Function**: `build_location(site_data: dict) -> List[Location]`
  - **Logic**:
    1. Extract `latitude`, `longitude`, `states_iso_code`
    2. Reverse geocode using the Nominatim API (cached)
    3. Create a Location object with city, region, country
    4. For transboundary sites, split by country and create multiple Locations

- [ ] **[D]** Implement Nominatim client wrapper `[0.5h]` (P1)
  - **File**: `src/glam_extractor/geocoding/nominatim_client.py`
  - **Function**: `reverse_geocode(lat: float, lon: float) -> dict`
  - **Caching**: Use a 15-minute TTL cache (per AGENTS.md)
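
The 15-minute TTL cache can be a small decorator around the geocoding call. This stdlib-only sketch leaves the actual Nominatim HTTP request (and its User-Agent/rate-limit requirements) as a placeholder; the injectable `clock` is an addition so expiry is testable without sleeping.

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float = 900.0, clock=time.monotonic):
    """Memoize by positional args, expiring entries after ttl_seconds."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=900.0)
def reverse_geocode(lat: float, lon: float) -> dict:
    # Placeholder: a real implementation would GET Nominatim's /reverse
    # endpoint with a proper User-Agent and respect its usage policy.
    return {"lat": lat, "lon": lon}
```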

#### Afternoon (4 hours)

- [ ] **[D]** Integrate with GeoNames (from the global_glam project) `[2h]` (P1)
  - Reference: `docs/plan/global_glam/08-geonames-integration.md`
  - Use the existing GeoNames client if available
  - Extract region codes (ADM1, ADM2)

- [ ] **[T]** Test geographic edge cases `[1h]` (P1)
  - UNESCO sites in disputed territories
  - Sites with coordinates in the ocean (e.g., Galápagos)
  - Sites spanning multiple administrative regions

- [ ] **[V]** Validate geographic accuracy `[1h]` (P0)
  - Manually verify 50 geocoded UNESCO sites
  - Check that city names match expected values
  - Target accuracy: >90%
### Day 8: Identifier Extraction

#### Morning (4 hours)

- [ ] **[T]** Create test for Identifier extraction `[2h]` (P0)
  - **File**: `tests/unit/unesco/test_identifier_extractor.py`
  - **Test cases**:
    - UNESCO WHC ID → Identifier(scheme="UNESCO_WHC", value="600")
    - UNESCO URL → Identifier(scheme="Website", value="https://...")
    - Wikidata enrichment (if available)
    - ISIL codes (if the site has an institutional archive/library)

- [ ] **[D+T]** Implement Identifier extractor `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/identifier_extractor.py`
  - **Function**: `extract_identifiers(site_data: dict) -> List[Identifier]`
  - **Always create**:
    - UNESCO WHC ID: `unesco_id` or `unique_number`
    - Website: `http_url`
  - **Conditionally add**:
    - Wikidata Q-number (via SPARQL query)
    - VIAF ID (if available in future enrichment)

#### Afternoon (4 hours)

- [ ] **[D]** Implement Wikidata enrichment `[3h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/wikidata_enricher.py`
  - **Function**: `find_wikidata_for_unesco_site(site_data: dict) -> Optional[str]`
  - **SPARQL query**:

  ```sparql
  SELECT ?item WHERE {
    ?item wdt:P757 "600" .  # UNESCO WHC ID property
  }
  ```

  - **Fuzzy fallback**: if there is no P757 match, search by name + country
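
The P757 lookup can be split into a query builder and a result parser so both are testable offline; actually sending the query (e.g., with SPARQLWrapper against `https://query.wikidata.org/sparql`, at the 1 req/s pace required below) is left to the caller. The example entity URI in the test is canned data, not a claimed mapping.

```python
import re
from typing import Optional

def build_whc_query(whc_id: str) -> str:
    """SPARQL query matching items whose P757 (World Heritage Site ID)
    equals the given WHC ID string."""
    return (
        "SELECT ?item WHERE { "
        f'?item wdt:P757 "{whc_id}" . '
        "}"
    )

def parse_qid(sparql_json: dict) -> Optional[str]:
    """Pull the first Q-number out of a standard SPARQL JSON result."""
    bindings = sparql_json.get("results", {}).get("bindings", [])
    if not bindings:
        return None
    uri = bindings[0]["item"]["value"]  # e.g. http://www.wikidata.org/entity/Q...
    match = re.search(r"(Q\d+)$", uri)
    return match.group(1) if match else None
```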

- [ ] **[T]** Test Wikidata enrichment `[1h]` (P1)
  - Test with known UNESCO sites that have Wikidata entries
  - Test handling of sites without Wikidata entries
  - Test rate limiting (1 request/second per Wikidata policy)
### Day 9-10: Provenance & Metadata

#### Day 9 Morning (4 hours)

- [ ] **[T]** Create test for Provenance builder `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_provenance_builder.py`
  - **Test cases**:
    - Provenance with data_source=UNESCO_DATAHUB
    - Provenance with data_tier=TIER_2_VERIFIED (per doc 04)
    - Extraction timestamp in UTC
    - Extraction method documentation
    - Confidence scores for institution type classification

- [ ] **[D+T]** Implement Provenance builder `[2h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/provenance_builder.py`
  - **Function**: `build_provenance(site_data: dict, confidence: float) -> Provenance`
  - **Fields**:

  ```python
  Provenance(
      data_source=DataSource.UNESCO_DATAHUB,
      data_tier=DataTier.TIER_2_VERIFIED,
      extraction_date=datetime.now(timezone.utc),
      extraction_method="UNESCO DataHub API extraction",
      confidence_score=confidence,
      source_url=site_data.get('http_url'),
      verified_date=datetime.fromisoformat(site_data['date_inscribed']),
      verified_by="UNESCO World Heritage Committee"
  )
  ```

- [ ] **[D]** Add DataSource enum value `[0.5h]` (P0)
  - Edit `schemas/enums.yaml`
  - Add: `UNESCO_DATAHUB: {description: "UNESCO World Heritage DataHub API"}`

#### Day 9 Afternoon (4 hours)

- [ ] **[T]** Create test for ChangeEvent extraction `[2h]` (P1)
  - **File**: `tests/unit/unesco/test_change_event_extractor.py`
  - **Test cases**:
    - Founding event from `date_inscribed`
    - Danger listing event (when `danger: "1"`)
    - Delisting event (when `date_end` is set)
    - Extension event (when `extension: "1"`)

- [ ] **[D+T]** Implement ChangeEvent extractor `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/change_event_extractor.py`
  - **Function**: `extract_change_events(site_data: dict) -> List[ChangeEvent]`
  - **Events to create**:
    - **FOUNDING**: `date_inscribed` → UNESCO inscription (not the actual founding!)
    - **STATUS_CHANGE**: `danger: "1"` → added to the Danger List
    - **CLOSURE**: `date_end` → delisted from the World Heritage List
#### Day 10 Morning (4 hours)

- [ ] **[D]** Create Collection metadata extractor `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/collection_builder.py`
  - **Function**: `build_collection(site_data: dict) -> Optional[Collection]`
  - **Logic**:
    - Extract `short_description_en` → collection description
    - Parse `criteria_txt` → subject areas
    - Extract `area_hectares` → extent
    - If the site is a MUSEUM/ARCHIVE/LIBRARY, create a Collection object

- [ ] **[T]** Test Collection extraction `[1.5h]` (P1)
  - Test museum sites with clear collections
  - Test non-institutional sites (landscapes, monuments) → return None
  - Test serial nominations → multiple collections

- [ ] **[V]** Review extraction pipeline completeness `[0.5h]` (P0)
  - Checklist:
    - ✓ Institution name extracted
    - ✓ Institution type classified
    - ✓ Location(s) extracted
    - ✓ Identifiers extracted
    - ✓ Provenance metadata complete
    - ✓ Change events extracted (optional)
    - ✓ Collection metadata extracted (optional)

#### Day 10 Afternoon (4 hours)

- [ ] **[D+T]** Implement main orchestrator `[3h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/extractor.py`
  - **Class**: `UNESCOExtractor`
  - **Method**: `extract_site(site_data: dict) -> HeritageCustodian`
  - **Orchestration logic**:
    1. Parse raw UNESCO JSON
    2. Classify institution type → confidence score
    3. Extract name (prefer English, fall back to French)
    4. Build Location(s)
    5. Extract Identifiers
    6. Build Provenance
    7. Generate GHCID
    8. Generate UUIDs (v5, v8, numeric)
    9. Extract ChangeEvents
    10. Build Collection (if applicable)
    11. Return a HeritageCustodian instance

- [ ] **[T]** Integration test for full extraction `[1h]` (P0)
  - **File**: `tests/integration/unesco/test_end_to_end.py`
  - **Test**: Extract 10 diverse UNESCO sites end-to-end
  - **Assertions**:
    - All required fields populated
    - GHCID format valid
    - UUIDs generated correctly
    - Provenance complete

### Day 11-12: Batch Processing

#### Day 11 Morning (4 hours)

- [ ] **[D]** Implement batch processor `[2.5h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/batch_processor.py`
  - **Class**: `UNESCOBatchProcessor`
  - **Method**: `process_all_sites() -> List[HeritageCustodian]`
  - **Features**:
    - Progress bar (tqdm)
    - Error handling (skip invalid sites, log errors)
    - Parallel processing (multiprocessing, 4 workers)
    - Incremental checkpointing (save every 100 sites)
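
The incremental checkpointing feature can be sketched as: record processed site IDs to a JSON file every N sites, and skip those IDs after a restart. The file layout (`{"done": [...]}`) is an assumption.

```python
import json
from pathlib import Path

class Checkpoint:
    """Tracks processed site IDs so an interrupted run can resume."""

    def __init__(self, path: Path, flush_every: int = 100):
        self.path = path
        self.flush_every = flush_every
        self.done = set()
        if path.exists():
            self.done = set(json.loads(path.read_text())["done"])

    def mark(self, site_id) -> None:
        self.done.add(site_id)
        # Persist every flush_every sites ("save every 100 sites").
        if len(self.done) % self.flush_every == 0:
            self.flush()

    def flush(self) -> None:
        self.path.write_text(json.dumps({"done": sorted(self.done)}))

    def should_skip(self, site_id) -> bool:
        return site_id in self.done
```

In the batch loop, `should_skip` gates extraction and `mark` runs after each successful site; a final `flush` on exit (including Ctrl+C handling) closes the gap since the last periodic save.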

- [ ] **[T]** Test batch processor `[1.5h]` (P0)
  - Test processing 100 sites
  - Test error recovery (invalid site data)
  - Test checkpoint/resume functionality

#### Day 11 Afternoon (4 hours)

- [ ] **[D]** Add CLI interface `[2h]` (P1)
  - **File**: `src/glam_extractor/cli/unesco.py`
  - **Commands**:

  ```bash
  glam-extract unesco fetch-all  # Download UNESCO data
  glam-extract unesco process    # Process all sites
  glam-extract unesco validate   # Validate output
  glam-extract unesco export     # Export to formats
  ```

- [ ] **[D]** Implement logging system `[1h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/logger.py`
  - **Log levels**:
    - INFO: progress milestones
    - WARNING: missing data, fallback logic triggered
    - ERROR: unhandled exceptions, invalid data
  - **Output**: `logs/unesco_extraction_{timestamp}.log`

- [ ] **[V]** Dry run on the full dataset `[1h]` (P0)
  - Process all 1,200+ UNESCO sites
  - Log errors and warnings
  - Review classification accuracy
  - Target success rate: >95%
#### Day 12 Full Day (8 hours)

- [ ] **[D]** Implement error recovery patterns `[3h]` (P0)
  - **File**: `src/glam_extractor/extractors/unesco/error_handling.py`
  - **Patterns**:
    - Retry on transient errors (HTTP 429, 503)
    - Skip on permanent errors (invalid data structure)
    - Log detailed error context
    - Continue processing remaining sites
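
The retry-on-transient pattern can be a small decorator. The 429/503 statuses come from the bullet above; the exponential backoff schedule, the `TransientError` wrapper, and the injectable `sleep` are assumptions added for the sketch.

```python
import time
from functools import wraps

TRANSIENT_STATUSES = {429, 503}

class TransientError(Exception):
    """Raised by the fetch layer for retryable HTTP statuses."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def retry_transient(max_attempts: int = 3, base_delay: float = 1.0,
                    sleep=time.sleep):
    """Retry on TransientError with exponential backoff; anything else is
    treated as permanent and propagates so the caller can skip the site."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientError:
                    if attempt == max_attempts:
                        raise
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```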

- [ ] **[T]** Stress test batch processor `[2h]` (P0)
  - Test with 1,200+ sites
  - Measure processing time (target: <10 minutes)
  - Monitor memory usage
  - Test interrupt/resume (Ctrl+C handling)

- [ ] **[D]** Optimize performance `[2h]` (P1)
  - Profile slow operations
  - Cache geocoding results
  - Batch Wikidata queries
  - Optimize JSON serialization

- [ ] **[V]** Quality review of 50 random sites `[1h]` (P0)
  - Manual inspection of extracted records
  - Verify institution type classification
  - Verify GHCID generation
  - Verify Location accuracy
  - Document edge cases for refinement

---

## Phase 3: Validation & Testing (Days 13-16, 16 hours)

### Day 13: LinkML Schema Validation

#### Morning (4 hours)

- [ ] **[D]** Create comprehensive validation script `[2h]` (P0)
  - **File**: `scripts/validate_unesco_output.py`
  - **Validations**:
    - LinkML schema compliance
    - Required field presence
    - Enum value validity
    - GHCID format regex
    - UUID format validation
    - URL reachability (sample check)

- [ ] **[T]** Create validation test suite `[2h]` (P0)
  - **File**: `tests/validation/test_unesco_validation.py`
  - **Test cases**:
    - Valid records pass validation
    - Invalid records fail with clear error messages
    - Edge cases handled gracefully

#### Afternoon (4 hours)

- [ ] **[V]** Run LinkML validator on all records `[2h]` (P0)

  ```bash
  for file in data/instances/unesco/*.yaml; do
    linkml-validate -s schemas/heritage_custodian.yaml "$file"
  done
  ```

- [ ] **[V]** Review validation errors `[2h]` (P0)
  - Categorize errors (schema, data quality, edge cases)
  - Fix schema issues
  - Fix extraction logic bugs
  - Document known limitations
### Day 14: Data Quality Checks

#### Morning (4 hours)

- [ ] **[D]** Implement data quality checks `[2.5h]` (P0)
  - **File**: `scripts/quality_checks_unesco.py`
  - **Checks**:
    - Completeness: % of fields populated per record
    - Consistency: same UNESCO ID → same GHCID
    - Accuracy: geocoding spot checks
    - Uniqueness: no duplicate GHCIDs
    - Referential integrity: all identifiers valid

- [ ] **[V]** Run quality checks on the full dataset `[1.5h]` (P0)
  - Generate quality report: `reports/unesco_quality_report.md`
  - Target metrics:
    - Completeness: >90% for required fields
    - Accuracy: >85% for institution type classification
    - Uniqueness: 100% (no duplicate GHCIDs)

#### Afternoon (4 hours)

- [ ] **[D]** Implement deduplication logic `[2h]` (P1)
  - **File**: `src/glam_extractor/extractors/unesco/deduplicator.py`
  - **Logic**:
    - Compare by UNESCO WHC ID (primary key)
    - Fuzzy match by name + country (detect data errors)
    - Merge records if duplicates found
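
A sketch of the two-stage duplicate check using stdlib `difflib` (the optional rapidfuzz dependency offers the same idea with better scoring); the 0.9 similarity threshold and the flat record fields are assumptions.

```python
from difflib import SequenceMatcher

def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Stage 1: identical UNESCO WHC IDs are always duplicates.
    Stage 2: same country plus a near-identical name flags a likely
    data error for merge review."""
    if a.get("unesco_id") is not None and a.get("unesco_id") == b.get("unesco_id"):
        return True
    if a.get("country") != b.get("country"):
        return False
    ratio = SequenceMatcher(
        None, (a.get("name") or "").lower(), (b.get("name") or "").lower()
    ).ratio()
    return ratio >= threshold
```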

- [ ] **[T]** Test deduplication `[1h]` (P1)
  - Create synthetic duplicates
  - Test merging logic
  - Verify data integrity after merge

- [ ] **[V]** Final quality gate `[1h]` (P0)
  - **Criteria**:
    - All records pass LinkML validation
    - Quality metrics meet targets
    - No critical errors in logs
    - Sample manual review passes
### Day 15: Integration Testing

#### Full Day (8 hours)

- [ ] **[T]** Create integration test suite `[4h]` (P0)
  - **File**: `tests/integration/unesco/test_full_pipeline.py`
  - **Tests**:
    - Fetch UNESCO data from the live API
    - Process 100 random sites
    - Validate output against the schema
    - Export to JSON-LD, RDF, CSV
    - Verify exported files are valid

- [ ] **[T]** Create cross-reference tests `[2h]` (P1)
  - **File**: `tests/integration/unesco/test_cross_reference.py`
  - **Tests**:
    - Cross-check UNESCO WHC IDs with the official list
    - Verify Wikidata Q-numbers resolve
    - Verify website URLs are reachable (sample)

- [ ] **[V]** Performance benchmarks `[1h]` (P0)
  - Measure extraction time per site (target: <0.5s)
  - Measure total processing time (target: <10 minutes)
  - Measure memory usage (target: <2GB)

- [ ] **[V]** User acceptance testing `[1h]` (P0)
  - Load sample data into a visualization tool
  - Verify the data is human-readable
  - Verify the data structure supports the use cases (doc 02)

### Day 16: Test Coverage & Documentation

#### Morning (4 hours)

- [ ] **[V]** Measure test coverage `[1h]` (P0)

  ```bash
  pytest --cov=src/glam_extractor/extractors/unesco \
    --cov-report=html \
    tests/unit/unesco/
  ```

  - Target coverage: >80%

- [ ] **[T]** Add missing tests to reach the coverage target `[2h]` (P0)
  - Focus on edge cases
  - Focus on error handling paths

- [ ] **[D]** Document testing strategy `[1h]` (P1)
  - **File**: `docs/unesco/testing.md`
  - Document test fixtures
  - Document test data sources
  - Document known test gaps

#### Afternoon (4 hours)

- [ ] **[D]** Create developer guide `[2h]` (P1)
  - **File**: `docs/unesco/developer_guide.md`
  - How to run extraction locally
  - How to add new classifiers
  - How to extend the schema
  - How to debug extraction errors

- [ ] **[D]** Create operational runbook `[2h]` (P1)
  - **File**: `docs/unesco/operations.md`
  - How to schedule periodic extractions
  - How to monitor data quality
  - How to handle API changes
  - How to troubleshoot common issues

---

## Phase 4: Export & Serialization (Days 17-20, 16 hours)

### Day 17: JSON-LD Exporter

#### Morning (4 hours)

- [ ] **[T]** Create test for JSON-LD exporter `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_jsonld_exporter.py`
  - **Test cases**:
    - Export a single HeritageCustodian to JSON-LD
    - Verify `@context` is valid
    - Verify `@type` is correct
    - Verify linked data properties

- [ ] **[D+T]** Implement JSON-LD exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/jsonld_unesco.py`
  - **Function**: `export_to_jsonld(custodian: HeritageCustodian) -> dict`
  - **JSON-LD context**:

  ```json
  {
    "@context": {
      "@vocab": "https://w3id.org/heritage/custodian/",
      "name": "http://schema.org/name",
      "location": "http://schema.org/location",
      "unesco": "https://whc.unesco.org/en/list/"
    }
  }
  ```
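
  A sketch of the exporter over a plain dict (in the real code the generated HeritageCustodian class replaces the dict access). The context is taken from the checklist; the `@type` value, the GHCID-based `@id` IRI, and the `unesco` link pattern are assumptions.

  ```python
  CONTEXT = {
      "@vocab": "https://w3id.org/heritage/custodian/",
      "name": "http://schema.org/name",
      "location": "http://schema.org/location",
      "unesco": "https://whc.unesco.org/en/list/",
  }

  def export_to_jsonld(custodian: dict) -> dict:
      """Wrap a custodian record in a JSON-LD envelope: the @id is a
      GHCID-based IRI and unesco links back to the WHC list page."""
      doc = {
          "@context": CONTEXT,
          "@type": "HeritageCustodian",
          "@id": f"https://w3id.org/heritage/custodian/{custodian['ghcid']}",
          "name": custodian.get("name"),
      }
      if custodian.get("unesco_id") is not None:
          doc["unesco"] = f"https://whc.unesco.org/en/list/{custodian['unesco_id']}"
      return doc
  ```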

#### Afternoon (4 hours)

- [ ] **[D]** Create JSON-LD context file `[1h]` (P0)
  - **File**: `contexts/unesco_heritage_custodian.jsonld`
  - Reference the existing context from the global_glam project
  - Add UNESCO-specific terms

- [ ] **[T]** Validate JSON-LD output `[1h]` (P0)
  - Use the JSON-LD Playground: https://json-ld.org/playground/
  - Verify RDF triples are correct
  - Test with sample records

- [ ] **[D]** Batch export to JSON-LD `[2h]` (P0)
  - Export all 1,200+ records to `data/instances/unesco/jsonld/`
  - One file per record: `whc-{unesco_id}.jsonld`
### Day 18: RDF/Turtle Exporter

#### Morning (4 hours)

- [ ] **[T]** Create test for RDF exporter `[1.5h]` (P0)
  - **File**: `tests/unit/unesco/test_rdf_exporter.py`
  - **Test cases**:
    - Export to Turtle (.ttl)
    - Export to N-Triples (.nt)
    - Verify RDF triples are valid
    - Test with rdflib

- [ ] **[D+T]** Implement RDF exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/rdf_unesco.py`
  - **Function**: `export_to_rdf(custodian: HeritageCustodian, format: str) -> str`
  - **Dependencies**: `pip install rdflib==7.0.0`
|
|
|
|
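Before wiring in rdflib, the target triple shape can be sketched without dependencies. The `unescoId` predicate below is a hypothetical placeholder, and the sample values are illustrative; the real exporter would build an `rdflib.Graph` and call `serialize()` instead of formatting strings by hand:

```python
# Dependency-free sketch of the intended Turtle output shape.
HERITAGE = "https://w3id.org/heritage/custodian/"
SCHEMA = "http://schema.org/"

def to_turtle(ghcid: str, name: str, unesco_id: str) -> str:
    """Format one custodian as three absolute-IRI Turtle triples."""
    subject = f"<{HERITAGE}{ghcid}>"
    triples = [
        f"{subject} a <{HERITAGE}HeritageCustodian> .",
        f'{subject} <{SCHEMA}name> "{name}" .',
        # "unescoId" is a placeholder predicate, not a term from the schema.
        f'{subject} <{HERITAGE}unescoId> "{unesco_id}" .',
    ]
    return "\n".join(triples)

ttl = to_turtle("ghc-000001", "Historic Centre of Rome", "91")
print(ttl)
```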
#### Afternoon (4 hours)

- [ ] **[D]** Create RDF namespace definitions `[1h]` (P0)
  - **File**: `src/glam_extractor/exporters/namespaces.py`
  - **Namespaces**:
    - heritage: https://w3id.org/heritage/custodian/
    - unesco: https://whc.unesco.org/en/list/
    - schema: http://schema.org/
    - geo: http://www.w3.org/2003/01/geo/wgs84_pos#
    - prov: http://www.w3.org/ns/prov#

- [ ] **[D]** Batch export to Turtle `[2h]` (P0)
  - Export all records to `data/instances/unesco/rdf/unesco_sites.ttl`
  - Create one large Turtle file with all 1,200+ sites

- [ ] **[V]** Validate RDF output `[1h]` (P0)
  - Use rapper validator: `rapper -i turtle unesco_sites.ttl`
  - Test SPARQL queries on output
### Day 19: CSV & Parquet Exporters

#### Morning (4 hours)

- [ ] **[T]** Create test for CSV exporter `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_csv_exporter.py`
  - **Test cases**:
    - Export flat CSV with essential fields
    - Handle multi-valued fields (locations, identifiers)
    - UTF-8 encoding

- [ ] **[D+T]** Implement CSV exporter `[2h]` (P0)
  - **File**: `src/glam_extractor/exporters/csv_unesco.py`
  - **Function**: `export_to_csv(custodians: List[HeritageCustodian], path: Path)`
  - **Columns**:
    - ghcid, name, institution_type, city, country, latitude, longitude
    - unesco_id, inscription_year, category, criteria
    - wikidata_id, website_url
    - extraction_date, confidence_score

- [ ] **[D]** Batch export to CSV `[1h]` (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.csv`
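One way to handle the multi-valued fields is to join them into a single cell; joining with `|` is an assumption here, not a decided convention, and the column list is trimmed for illustration:

```python
import csv
import io

# Trimmed column set for illustration; the full export uses the columns above.
FIELDS = ["ghcid", "name", "country", "identifiers"]

def export_to_csv(rows: list[dict], fileobj) -> None:
    """Write custodian dicts to CSV; list-valued fields are joined with '|'."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        flat = {
            key: "|".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)

buf = io.StringIO()
export_to_csv(
    [{"ghcid": "ghc-000001", "name": "Historic Centre of Rome",
      "country": "Italy", "identifiers": ["whc:91", "wikidata:Q220"]}],
    buf,
)
print(buf.getvalue())
```

In the real exporter the file would be opened with `encoding="utf-8"` and `newline=""`, per the UTF-8 test case above.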
#### Afternoon (4 hours)

- [ ] **[T]** Create test for Parquet exporter `[1h]` (P0)
  - **File**: `tests/unit/unesco/test_parquet_exporter.py`
  - **Test cases**:
    - Export to Parquet format
    - Verify schema preservation
    - Test with DuckDB or Pandas

- [ ] **[D+T]** Implement Parquet exporter `[2h]` (P0)
  - **File**: `src/glam_extractor/exporters/parquet_unesco.py`
  - **Function**: `export_to_parquet(custodians: List[HeritageCustodian], path: Path)`
  - **Dependencies**: `pip install pyarrow==14.0.0`

- [ ] **[D]** Batch export to Parquet `[1h]` (P0)
  - Export to `data/instances/unesco/unesco_heritage_custodians.parquet`
### Day 20: SQLite Database

#### Morning (4 hours)

- [ ] **[D]** Design SQLite schema `[1.5h]` (P0)
  - **File**: `scripts/create_unesco_db_schema.sql`
  - **Tables**:
    - heritage_custodians (main table)
    - locations (1:N relationship)
    - identifiers (1:N relationship)
    - change_events (1:N relationship)
    - collections (1:N relationship)

- [ ] **[D]** Implement SQLite exporter `[2.5h]` (P0)
  - **File**: `src/glam_extractor/exporters/sqlite_unesco.py`
  - **Function**: `export_to_sqlite(custodians: List[HeritageCustodian], db_path: Path)`
  - **Dependencies**: Built-in `sqlite3`
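A minimal sketch of the schema and exporter wiring using the built-in `sqlite3` module; the columns shown are an illustrative subset of the planned tables, with one 1:N child table to demonstrate the foreign-key pattern:

```python
import sqlite3

# Illustrative subset of the planned schema: the main table plus one
# 1:N child table with a foreign key back to heritage_custodians.
SCHEMA = """
CREATE TABLE heritage_custodians (
    ghcid TEXT PRIMARY KEY,
    unesco_id TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    institution_type TEXT,
    country TEXT
);
CREATE TABLE identifiers (
    id INTEGER PRIMARY KEY,
    ghcid TEXT NOT NULL REFERENCES heritage_custodians(ghcid),
    scheme TEXT NOT NULL,
    value TEXT NOT NULL
);
CREATE INDEX idx_custodians_country ON heritage_custodians(country);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO heritage_custodians VALUES (?, ?, ?, ?, ?)",
    ("ghc-000001", "91", "Historic Centre of Rome", "heritage_site", "Italy"),
)
conn.execute(
    "INSERT INTO identifiers (ghcid, scheme, value) VALUES (?, ?, ?)",
    ("ghc-000001", "wikidata", "Q220"),
)
count = conn.execute("SELECT COUNT(*) FROM identifiers").fetchone()[0]
print(count)
```

Note the `PRAGMA foreign_keys = ON`: SQLite ships with foreign-key enforcement disabled per connection, which matters for the Day 20 constraint tests.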
#### Afternoon (4 hours)

- [ ] **[D]** Create database indexes `[1h]` (P0)
  - Index on ghcid (primary key)
  - Index on unesco_id (unique)
  - Index on country
  - Index on institution_type
  - Full-text search on name and description

- [ ] **[T]** Test SQLite export and queries `[2h]` (P0)
  - Test data integrity
  - Test foreign key constraints
  - Test query performance

- [ ] **[D]** Create sample SQL queries `[1h]` (P1)
  - **File**: `docs/unesco/sample_queries.sql`
  - Example queries for common use cases
---

## Phase 5: Integration & Enrichment (Days 21-24, 16 hours)

### Day 21: Wikidata Enrichment Pipeline

#### Morning (4 hours)

- [ ] **[D]** Implement batch Wikidata enrichment `[3h]` (P1)
  - **File**: `scripts/enrich_unesco_with_wikidata.py`
  - **Process**:
    1. Load all UNESCO records without Wikidata IDs
    2. Query Wikidata SPARQL endpoint (batches of 100)
    3. Match by UNESCO WHC ID (property P757)
    4. Update records with Q-numbers
    5. Extract additional metadata from Wikidata (VIAF, founding date)

- [ ] **[T]** Test Wikidata enrichment `[1h]` (P1)
  - Test with 50 sample sites
  - Verify match accuracy
  - Handle rate limiting
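The batching step above can be sketched as a query builder (no network call shown). `wdt:P757` is Wikidata's World Heritage Site ID property, as named in the plan; the real script would POST the result to the SPARQL endpoint in chunks of 100:

```python
def build_sparql_batch(whc_ids: list[str]) -> str:
    """Build one SPARQL query matching a batch of WHC ids via wdt:P757."""
    values = " ".join(f'"{whc_id}"' for whc_id in whc_ids)
    return (
        "SELECT ?item ?whcId WHERE { "
        f"VALUES ?whcId {{ {values} }} "  # one VALUES block per batch
        "?item wdt:P757 ?whcId . }"
    )

query = build_sparql_batch(["91", "174", "252"])
print(query)
```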
#### Afternoon (4 hours)

- [ ] **[D]** Implement ISIL code lookup `[2h]` (P1)
  - **File**: `scripts/enrich_unesco_with_isil.py`
  - **Data source**: Cross-reference with existing ISIL registry
  - **Logic**:
    - Match by institution name + country
    - Fuzzy matching threshold: 0.85
    - Add ISIL identifier to records

- [ ] **[V]** Review enrichment results `[2h]` (P1)
  - Count enriched records
  - Review match quality
  - Document enrichment coverage
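Appendix D lists rapidfuzz as an optional dependency; the sketch below uses stdlib `difflib` instead so it stays self-contained. Its ratio scale is similar but not identical to rapidfuzz scores, so the 0.85 threshold would need re-tuning, and the registry entries shown are hypothetical:

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # fuzzy-match cutoff from the plan

def best_isil_match(name: str, country: str, registry: list[dict]):
    """Return the best same-country registry entry, or None below threshold."""
    candidates = [entry for entry in registry if entry["country"] == country]
    scored = [
        (SequenceMatcher(None, name.lower(), entry["name"].lower()).ratio(), entry)
        for entry in candidates
    ]
    if not scored:
        return None
    score, entry = max(scored, key=lambda pair: pair[0])
    return entry if score >= THRESHOLD else None

# Hypothetical registry entries and ISIL codes, for illustration only.
registry = [
    {"name": "National Museum of Denmark", "country": "DK", "isil": "DK-NM"},
    {"name": "Musée du Louvre", "country": "FR", "isil": "FR-LOUVRE"},
]
match = best_isil_match("National Museum of  Denmark", "DK", registry)
print(match["isil"] if match else None)
```

Filtering by country first keeps the fuzzy comparison cheap and cuts the main source of false positives (similarly named institutions in different countries).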
### Day 22: Cross-dataset Linking

#### Morning (4 hours)

- [ ] **[D]** Link UNESCO records to existing GLAM dataset `[3h]` (P1)
  - **File**: `scripts/link_unesco_to_global_glam.py`
  - **Matching strategies**:
    1. GHCID match (direct)
    2. Wikidata ID match
    3. Name + country fuzzy match
  - **Output**: Crosswalk table linking UNESCO WHC ID to existing GHCIDs

- [ ] **[V]** Validate cross-dataset links `[1h]` (P1)
  - Manual review of 50 matched pairs
  - Check for false positives
  - Document ambiguous cases
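The three-stage cascade might look like this sketch: each stage only runs when the previous one failed to link. The record shapes are illustrative, and the fuzzy stage is reduced to simple containment as a placeholder for the real scorer:

```python
def link_records(unesco: dict, existing: list[dict]):
    """Try each matching strategy in order; return (record, strategy)."""
    # 1. Direct GHCID match
    for rec in existing:
        if rec.get("ghcid") == unesco.get("ghcid"):
            return rec, "ghcid"
    # 2. Wikidata Q-number match
    for rec in existing:
        if unesco.get("wikidata_id") and rec.get("wikidata_id") == unesco["wikidata_id"]:
            return rec, "wikidata"
    # 3. Name + country fuzzy match (containment stands in for fuzzy scoring)
    for rec in existing:
        if (rec.get("country") == unesco.get("country")
                and unesco["name"].lower() in rec["name"].lower()):
            return rec, "name+country"
    return None, "unmatched"

existing = [{"ghcid": "ghc-000042", "wikidata_id": "Q220",
             "name": "Historic Centre of Rome", "country": "IT"}]
rec, strategy = link_records(
    {"ghcid": "ghc-999999", "wikidata_id": "Q220", "name": "Rome", "country": "IT"},
    existing,
)
print(strategy)
```

Recording which strategy produced each link (the second return value) makes the crosswalk table auditable during the 50-pair manual review.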
#### Afternoon (4 hours)

- [ ] **[D]** Create dataset merge logic `[2h]` (P1)
  - **File**: `scripts/merge_unesco_into_global_glam.py`
  - **Merge rules**:
    - If GHCID exists: Update with UNESCO metadata
    - If no match: Add as new record
    - Preserve provenance trail

- [ ] **[T]** Test merge logic `[1h]` (P1)
  - Test with sample dataset
  - Verify no data loss
  - Verify GHCID uniqueness maintained

- [ ] **[V]** Final merge quality check `[1h]` (P0)
  - Review merged dataset
  - Verify provenance for all records
  - Document merge statistics
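A sketch of the merge rules, assuming records are keyed by GHCID; the `provenance` values and field names are illustrative placeholders for the real provenance trail:

```python
def merge(existing: dict[str, dict], unesco_records: list[dict]) -> dict[str, dict]:
    """Apply the merge rules: update on GHCID match, append otherwise."""
    merged = {ghcid: dict(rec) for ghcid, rec in existing.items()}  # no mutation
    for rec in unesco_records:
        ghcid = rec["ghcid"]
        if ghcid in merged:
            merged[ghcid].update(rec)  # enrich the existing record
            merged[ghcid]["provenance"] = "glam+unesco"
        else:
            merged[ghcid] = {**rec, "provenance": "unesco"}
    # GHCID uniqueness holds by construction: merged is keyed on ghcid.
    return merged

existing = {"ghc-1": {"ghcid": "ghc-1", "name": "Old name"}}
result = merge(existing, [
    {"ghcid": "ghc-1", "name": "Historic Centre of Rome", "unesco_id": "91"},
    {"ghcid": "ghc-2", "name": "New site", "unesco_id": "174"},
])
print(len(result))
```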
### Day 23: Data Quality Refinement

#### Full Day (8 hours)

- [ ] **[D]** Implement data quality improvements `[4h]` (P0)
  - Fix classification errors from manual review
  - Refine keyword patterns for edge cases
  - Improve geocoding accuracy
  - Handle special characters in names

- [ ] **[T]** Regression testing `[2h]` (P0)
  - Re-run full test suite
  - Verify improvements didn't break existing functionality

- [ ] **[V]** Final data quality audit `[2h]` (P0)
  - Re-run quality checks
  - Compare metrics to Day 14 baseline
  - Target improvement: +5% accuracy
### Day 24: Documentation & Handoff

#### Morning (4 hours)

- [ ] **[D]** Create extraction report `[2h]` (P0)
  - **File**: `reports/unesco_extraction_report.md`
  - **Contents**:
    - Total records extracted: 1,200+
    - Classification accuracy: XX%
    - Coverage by institution type
    - Geographic distribution
    - Data quality metrics
    - Known limitations

- [ ] **[D]** Document known issues `[1h]` (P0)
  - **File**: `docs/unesco/known_issues.md`
  - List edge cases not handled
  - List sites with ambiguous classification
  - List sites missing key metadata

- [ ] **[D]** Create future enhancements roadmap `[1h]` (P1)
  - **File**: `docs/unesco/roadmap.md`
  - Planned improvements
  - Additional data sources to integrate
  - New features to implement

#### Afternoon (4 hours)

- [ ] **[D]** Create API documentation `[2h]` (P1)
  - **File**: `docs/unesco/api.md`
  - Document all public functions
  - Document CLI commands
  - Provide usage examples

- [ ] **[D]** Create maintenance guide `[2h]` (P1)
  - **File**: `docs/unesco/maintenance.md`
  - How to update UNESCO data (periodic refresh)
  - How to handle API changes
  - How to re-run enrichment pipelines
---

## Phase 6: Deployment & Monitoring (Days 25-28, 16 hours)

### Day 25: Automation Scripts

#### Morning (4 hours)

- [ ] **[D]** Create automated extraction pipeline `[2.5h]` (P0)
  - **File**: `scripts/run_unesco_extraction.sh`
  - **Steps**:
    1. Fetch latest UNESCO data
    2. Run extraction
    3. Validate output
    4. Export to all formats
    5. Run quality checks
    6. Generate report

- [ ] **[T]** Test automation script `[1.5h]` (P0)
  - Test end-to-end execution
  - Test error handling
  - Test with scheduled execution (cron)
#### Afternoon (4 hours)

- [ ] **[D]** Create incremental update script `[2.5h]` (P1)
  - **File**: `scripts/update_unesco_incremental.py`
  - **Logic**:
    - Compare current UNESCO data with previous extraction
    - Detect new sites, modified sites, delisted sites
    - Process only changes
    - Update database incrementally

- [ ] **[T]** Test incremental updates `[1.5h]` (P1)
  - Simulate API changes
  - Verify only deltas processed
  - Verify data consistency
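The delta-detection step can be sketched as a comparison of two snapshots keyed by UNESCO id; the record contents shown are illustrative:

```python
def detect_changes(previous: dict[str, dict], current: dict[str, dict]) -> dict:
    """Bucket UNESCO ids into new / modified / delisted by snapshot diff."""
    new = [uid for uid in current if uid not in previous]
    delisted = [uid for uid in previous if uid not in current]
    modified = [
        uid for uid in current
        if uid in previous and current[uid] != previous[uid]
    ]
    return {"new": new, "modified": modified, "delisted": delisted}

previous = {"91": {"name": "Historic Centre of Rome"}, "252": {"name": "Old"}}
current = {"91": {"name": "Historic Centre of Rome, renamed"}, "174": {"name": "Added"}}
changes = detect_changes(previous, current)
print(changes)
```

Only the ids in these three buckets then need reprocessing, which is what keeps the weekly refresh cheap.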
### Day 26: Monitoring & Alerts

#### Morning (4 hours)

- [ ] **[D]** Implement extraction monitoring `[2h]` (P1)
  - **File**: `src/glam_extractor/monitoring/unesco_monitor.py`
  - **Metrics to track**:
    - Extraction success rate
    - Average processing time per site
    - Classification confidence distribution
    - Data quality scores

- [ ] **[D]** Create alerting logic `[2h]` (P1)
  - **File**: `src/glam_extractor/monitoring/alerts.py`
  - **Alert conditions**:
    - Extraction failure rate >5%
    - Classification confidence <0.7 for >10% of sites
    - API downtime
    - Schema validation failure rate >1%
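The threshold-based alert conditions reduce to simple checks over per-run metrics; the metric names below are illustrative placeholders for whatever the monitor actually records:

```python
# Thresholds mirroring the alert conditions above (API downtime would be
# detected separately, at request time, rather than from run metrics).
THRESHOLDS = {
    "failure_rate": 0.05,           # extraction failure rate >5%
    "low_confidence_share": 0.10,   # >10% of sites below 0.7 confidence
    "validation_failure_rate": 0.01,
}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of all thresholds exceeded by this run's metrics."""
    return [
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]

alerts = check_alerts({
    "failure_rate": 0.02,
    "low_confidence_share": 0.14,
    "validation_failure_rate": 0.0,
})
print(alerts)
```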
#### Afternoon (4 hours)

- [ ] **[D]** Create monitoring dashboard (optional) `[3h]` (P2)
  - **Tool**: Jupyter notebook or Streamlit
  - **Visualizations**:
    - Extraction pipeline status
    - Data quality trends over time
    - Geographic coverage map
    - Institution type distribution

- [ ] **[V]** Test monitoring system `[1h]` (P1)
  - Simulate failures
  - Verify alerts trigger
  - Test dashboard functionality
### Day 27: Production Deployment

#### Morning (4 hours)

- [ ] **[D]** Package extraction module `[2h]` (P0)
  - Create `setup.py` or `pyproject.toml`
  - Document installation instructions
  - Test installation in clean environment

- [ ] **[D]** Create Docker container (optional) `[2h]` (P2)
  - **File**: `docker/Dockerfile.unesco`
  - Containerize extraction pipeline
  - Document Docker deployment

#### Afternoon (4 hours)

- [ ] **[D]** Deploy to production environment `[2h]` (P0)
  - Upload code to production server
  - Install dependencies
  - Configure environment variables
  - Test production execution

- [ ] **[V]** Production smoke test `[1h]` (P0)
  - Run extraction on production
  - Verify output files created
  - Verify database updated
  - Verify logs generated

- [ ] **[D]** Configure scheduled execution `[1h]` (P0)
  - Set up cron job for weekly updates
  - Configure log rotation
  - Set up error notifications
### Day 28: Handoff & Training

#### Full Day (8 hours)

- [ ] **[D]** Create training materials `[3h]` (P1)
  - **Presentation**: Project overview and architecture
  - **Demo**: Live extraction walkthrough
  - **Exercises**: Hands-on practice for maintainers

- [ ] **[D]** Conduct knowledge transfer session `[3h]` (P1)
  - Present to stakeholders
  - Walk through codebase
  - Answer questions
  - Document Q&A

- [ ] **[D]** Create support plan `[2h]` (P1)
  - **File**: `docs/unesco/support.md`
  - Contact information for maintainers
  - Escalation procedures
  - Known issues and workarounds
---

## Phase 7: Finalization (Days 29-30, 8 hours)

### Day 29: Final Review

#### Morning (4 hours)

- [ ] **[V]** Code review `[2h]` (P0)
  - Review all implementation files
  - Check code quality (linting, type hints)
  - Verify documentation completeness

- [ ] **[V]** Test coverage review `[1h]` (P0)
  - Verify coverage >80%
  - Review untested code paths
  - Document rationale for any skipped tests

- [ ] **[V]** Security review `[1h]` (P0)
  - Check for hardcoded credentials
  - Verify API keys in environment variables
  - Check for SQL injection vulnerabilities
  - Verify input validation

#### Afternoon (4 hours)

- [ ] **[D]** Final documentation pass `[2h]` (P0)
  - Proofread all documentation
  - Check for broken links
  - Verify code examples are correct
  - Update README with final statistics

- [ ] **[V]** Stakeholder demo `[2h]` (P0)
  - Present final results
  - Demo visualization of extracted data
  - Discuss next steps
  - Gather feedback
### Day 30: Release & Archive

#### Morning (4 hours)

- [ ] **[D]** Create release package `[2h]` (P0)
  - Tag release version: `v1.0.0-unesco`
  - Create GitHub release (if applicable)
  - Archive source code
  - Archive extracted datasets

- [ ] **[D]** Publish datasets `[1h]` (P0)
  - Upload to project repository
  - Generate dataset DOI (Zenodo or similar)
  - Update dataset catalog

- [ ] **[D]** Create project retrospective `[1h]` (P1)
  - **File**: `reports/unesco_retrospective.md`
  - What went well
  - What could be improved
  - Lessons learned
  - Recommendations for future extractions

#### Afternoon (4 hours)

- [ ] **[D]** Final cleanup `[2h]` (P0)
  - Remove temporary files
  - Archive logs
  - Clean up test data
  - Optimize database indexes

- [ ] **[V]** Final quality gate `[1h]` (P0)
  - **Checklist**:
    - ✓ All tests passing
    - ✓ All data validated
    - ✓ All documentation complete
    - ✓ Production deployment successful
    - ✓ Monitoring in place
    - ✓ Stakeholders trained

- [ ] **[D]** Project close-out `[1h]` (P0)
  - Send completion notification
  - Archive project artifacts
  - Update project status: COMPLETE

---
## Appendix A: Task Dependencies

### Critical Path (Must Complete in Order)

1. **Day 1-2**: Environment setup → **Blocks all development**
2. **Day 3-5**: Schema & types → **Blocks extraction logic**
3. **Day 6-8**: Data extraction → **Blocks validation**
4. **Day 9-12**: Batch processing → **Blocks export**
5. **Day 13-16**: Validation → **Blocks production deployment**
6. **Day 17-20**: Export → **Blocks integration**
7. **Day 25-28**: Deployment → **Blocks project completion**

### Parallel Work Opportunities

- **Days 7-10**: Location extraction + Identifier extraction can be done in parallel
- **Days 17-20**: All exporters (JSON-LD, RDF, CSV, Parquet, SQLite) can be developed in parallel
- **Days 21-22**: Wikidata enrichment + ISIL lookup can be done in parallel

---
## Appendix B: Quality Gates

### Gate 1: Schema Complete (End of Day 5)

- [ ] All LinkML schemas validated
- [ ] Python classes generated successfully
- [ ] Enum values documented
- [ ] GHCID format specified

### Gate 2: Extraction Pipeline Complete (End of Day 12)

- [ ] API client tested with live data
- [ ] All extractors implemented and tested
- [ ] Batch processor handles 1,200+ sites
- [ ] Error handling robust

### Gate 3: Validation Complete (End of Day 16)

- [ ] All records pass LinkML validation
- [ ] Quality metrics meet targets
- [ ] Test coverage >80%
- [ ] Sample manual review passes

### Gate 4: Export Complete (End of Day 20)

- [ ] All export formats tested
- [ ] Output files validated
- [ ] Database indexes optimized
- [ ] Sample queries work

### Gate 5: Production Ready (End of Day 28)

- [ ] Production deployment successful
- [ ] Monitoring in place
- [ ] Documentation complete
- [ ] Team trained

### Gate 6: Project Complete (End of Day 30)

- [ ] All tests passing
- [ ] All datasets published
- [ ] Stakeholders notified
- [ ] Project archived

---
## Appendix C: Risk Management

### High-Priority Risks

#### Risk 1: UNESCO API Changes

- **Impact**: High (extraction fails)
- **Probability**: Low
- **Mitigation**:
  - Monitor API version
  - Cache raw API responses
  - Implement schema validation on API responses
  - Create fallback to manual data files

#### Risk 2: Institution Type Classification Accuracy <85%

- **Impact**: Medium (data quality issues)
- **Probability**: Medium
- **Mitigation**:
  - Iterative refinement of keywords
  - Manual review of 100+ samples
  - Add confidence threshold (skip uncertain classifications)
  - Document known edge cases

#### Risk 3: GHCID Collisions

- **Impact**: Medium (duplicate identifiers)
- **Probability**: Low (<5% expected)
- **Mitigation**:
  - Implement collision detection
  - Use Wikidata Q-numbers for disambiguation
  - Document collision resolution rules
  - Manual review of collisions

#### Risk 4: Performance Issues (>10 min processing time)

- **Impact**: Low (convenience only)
- **Probability**: Medium
- **Mitigation**:
  - Implement parallel processing
  - Cache geocoding results
  - Batch Wikidata queries
  - Profile and optimize slow operations

#### Risk 5: Wikidata Enrichment Rate Limiting

- **Impact**: Low (enrichment incomplete)
- **Probability**: High
- **Mitigation**:
  - Respect rate limits (1 req/sec)
  - Implement exponential backoff
  - Make enrichment optional
  - Run enrichment as separate batch job
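The exponential-backoff mitigation for Risk 5 can be sketched as a retry wrapper; `fetch` stands in for any SPARQL call, and the delays are shortened here for illustration (a real client would start around 1 second, matching the rate limit above):

```python
import time

def with_backoff(fetch, attempts: int = 4, base_delay: float = 0.01):
    """Call fetch(), retrying with exponentially growing sleeps on failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky endpoint: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = with_backoff(flaky)
print(result)
```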
---
## Appendix D: Resource Requirements

### Software Dependencies

**Required**:

- Python 3.11+
- linkml 1.7.10+
- pydantic 1.10.13
- requests 2.31.0

**Optional**:

- rapidfuzz 3.5.2 (fuzzy matching)
- SPARQLWrapper 2.0.0 (Wikidata)
- rdflib 7.0.0 (RDF export)
- pyarrow 14.0.0 (Parquet export)

### Infrastructure

**Development**:

- Laptop/workstation with 8GB+ RAM
- Internet connection (API access)
- ~2GB disk space for data

**Production**:

- Server with 4GB+ RAM
- Cron scheduling capability
- 5GB disk space (with logs)

### Team Resources

**Estimated effort**: 120 person-hours over 30 days

**Roles**:

- Software Engineer (primary developer)
- Data Scientist (classification logic)
- QA Engineer (testing)
- Documentation Writer (optional)

---
## Appendix E: Success Metrics

### Quantitative Metrics

- **Extraction completeness**: >95% of UNESCO sites processed successfully
- **Classification accuracy**: >85% correct institution type assignments
- **Validation pass rate**: 100% of records pass LinkML schema validation
- **Test coverage**: >80% code coverage
- **Processing time**: <10 minutes for full dataset
- **Data quality score**: >90% completeness on required fields

### Qualitative Metrics

- **Code maintainability**: Clear documentation, modular design
- **Data usability**: Stakeholders can query and visualize data
- **Reproducibility**: Pipeline can be re-run by new team members
- **Extensibility**: Easy to add new data sources or fields

---
## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.1 | 2025-11-10 | AI Agent (OpenCode) | Updated Day 6 API client and parser sections to reflect OpenDataSoft Explore API v2 response structure (nested `record.fields`, pagination links, coordinate objects) |
| 1.0 | 2025-11-09 | GLAM Extraction Team | Initial version |

---

**Document Status**: ✅ COMPLETE

**Next Steps**: Begin Phase 0 - Pre-Implementation Setup

**Questions?** Refer to planning documents 00-06 for detailed specifications.