# UNESCO Data Extraction - Implementation Phases

**Project**: Global GLAM Dataset - UNESCO World Heritage Sites Extraction

**Document**: 03 - Implementation Phases

**Version**: 1.1

**Date**: 2025-11-10

**Status**: Draft

---
## Executive Summary

This document outlines the 6-week (30 working days) implementation plan for extracting UNESCO World Heritage Site data into the Global GLAM Dataset. The plan follows a test-driven development (TDD) approach with five distinct phases, each building on the previous phase's deliverables.

**Total Effort**: 30 working days (6 weeks)

**Team Size**: 1-2 developers + AI agents

**Target Output**: 1,000+ heritage custodian records from UNESCO sites

---
## Phase Overview

| Phase | Duration | Focus | Key Deliverables |
|-------|----------|-------|------------------|
| **Phase 1** | 5 days | API Exploration & Schema Design | UNESCO API parser, LinkML Map schema |
| **Phase 2** | 8 days | Extractor Implementation | Institution type classifier, GHCID generator |
| **Phase 3** | 6 days | Data Quality & Validation | LinkML validator, conflict resolver |
| **Phase 4** | 5 days | Integration & Enrichment | Wikidata enrichment, dataset merge |
| **Phase 5** | 6 days | Export & Documentation | RDF/JSON-LD exporters, user docs |

---
## Phase 1: API Exploration & Schema Design (Days 1-5)

### Objectives

- Understand UNESCO DataHub API structure and data quality
- Design LinkML Map transformation rules for UNESCO JSON → HeritageCustodian
- Create test fixtures from real UNESCO API responses
- Establish a baseline for institution type classification
### Day 1: UNESCO API Reconnaissance

**Tasks**:

1. **API Documentation Review**
   - Study the UNESCO DataHub OpenDataSoft API docs (https://data.unesco.org/api/explore/v2.0/console)
   - Identify available endpoints (dataset `whc001` - World Heritage List)
   - Document authentication requirements (none - public dataset)
   - Document pagination limits (max 100 records per request, use the `offset` parameter)
   - Test API responses for sample sites

2. **Data Structure Analysis**

   ```bash
   # Fetch sample UNESCO site data (OpenDataSoft Explore API v2)
   curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?limit=10" \
     > samples/unesco_site_list.json

   # Fetch a specific site by unique_number using ODSQL
   curl "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001/records?where=unique_number%3D668" \
     > samples/unesco_angkor_detail.json

   # Response structure: nested record.fields with a coordinates object
   # {
   #   "total_count": 1248,
   #   "results": [{
   #     "record": {
   #       "fields": {
   #         "name_en": "Angkor",
   #         "unique_number": 668,
   #         "coordinates": {"lat": 13.4333, "lon": 103.8333}
   #       }
   #     }
   #   }]
   # }
   ```

3. **Schema Mapping**
   - Map OpenDataSoft `record.fields` to LinkML `HeritageCustodian` slots
   - Identify missing fields (those requiring inference or external enrichment)
   - Document ambiguities (e.g., when is a site also a museum?)
   - Handle the nested response structure (`response['results'][i]['record']['fields']`)

**Deliverables**:

- `docs/unesco-api-analysis.md` - API structure documentation
- `tests/fixtures/unesco_api_responses/` - 10+ sample JSON files
- `docs/unesco-to-linkml-mapping.md` - Field mapping table

**Success Criteria**:

- [ ] Successfully fetch data for 50 UNESCO sites via the API
- [ ] Document all relevant JSON fields for extraction
- [ ] Identify 3+ institution type classification patterns

---
### Day 2: LinkML Map Schema Design (Part 1)

**Tasks**:

1. **Install LinkML Map Extension**

   ```bash
   pip install linkml-map
   # OR implement a custom extension: src/glam_extractor/mappers/extended_map.py
   ```

2. **Design Transformation Rules**
   - Create `schemas/maps/unesco_to_heritage_custodian.yaml`
   - Define JSONPath expressions for field extraction
   - Handle multi-language names (UNESCO provides English, French, and often the local language)
   - Map UNESCO categories to InstitutionTypeEnum

3. **Conditional Extraction Logic**

   ```yaml
   # Example LinkML Map rule
   mappings:
     - source_path: $.category
       target_path: institution_type
       transform:
         type: conditional
         rules:
           - condition: "contains(description, 'museum')"
             value: MUSEUM
           - condition: "contains(description, 'library')"
             value: LIBRARY
           - condition: "contains(description, 'archive')"
             value: ARCHIVE
           - default: MIXED
   ```

**Deliverables**:

- `schemas/maps/unesco_to_heritage_custodian.yaml` (initial version)
- `tests/test_unesco_linkml_map.py` - Unit tests for mapping rules

**Success Criteria**:

- [ ] LinkML Map schema validates against sample UNESCO JSON
- [ ] Successfully extract name, location, and UNESCO WHC ID from 10 fixtures
- [ ] Handle multilingual names without data loss

---
### Day 3: LinkML Map Schema Design (Part 2)

**Tasks**:

1. **Advanced Transformation Rules**
   - Regex extraction for identifiers (UNESCO WHC ID format: `^\d{3,4}$`)
   - GeoNames ID lookup from UNESCO location strings
   - Wikidata Q-number extraction from UNESCO external links

2. **Multi-value Array Handling**

   ```yaml
   # Extract all languages from UNESCO site names
   mappings:
     - source_path: $.names[*]
       target_path: alternative_names
       transform:
         type: array
         element_transform:
           type: template
           template: "{name}@{lang}"
   ```

3. **Error Handling Patterns**
   - Missing required fields → skip the record with a warning
   - Invalid coordinates → flag for manual geocoding
   - Unknown institution type → default to MIXED, log for review
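
A minimal sketch of these three error-handling rules as a pre-validation guard. The required-field set, the `needs_geocoding` flag, and the plain-dict record shape are illustrative assumptions, not the final parser API:

```python
import logging
from typing import Optional

log = logging.getLogger("unesco.extract")

REQUIRED_FIELDS = ("name_en", "unique_number")  # assumed required set

def apply_error_policy(fields: dict) -> Optional[dict]:
    """Apply the three rules above to a raw record's fields.

    Returns an annotated copy of the fields, or None when the record
    must be skipped entirely.
    """
    # Rule 1: missing required fields -> skip the record with a warning
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        log.warning("Skipping record, missing required fields: %s", missing)
        return None

    out = dict(fields)

    # Rule 2: invalid coordinates -> flag for manual geocoding
    coords = out.get("coordinates") or {}
    lat, lon = coords.get("lat"), coords.get("lon")
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        out["needs_geocoding"] = True
        out.pop("coordinates", None)

    # Rule 3: unknown institution type -> default to MIXED, log for review
    if not out.get("institution_type"):
        out["institution_type"] = "MIXED"
        log.info("Defaulted institution_type to MIXED for %s", out["unique_number"])

    return out
```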
**Deliverables**:

- `schemas/maps/unesco_to_heritage_custodian.yaml` (complete)
- `docs/linkml-map-extension-spec.md` - Custom extension specification

**Success Criteria**:

- [ ] Extract ALL relevant fields from 10 diverse UNESCO sites
- [ ] Handle edge cases (missing data, malformed coordinates)
- [ ] Generate valid LinkML instances from real API responses

---
### Day 4: Test Fixture Creation

**Tasks**:

1. **Curate Representative Samples**
   - Select 20 UNESCO sites covering:
     - All continents (Europe, Asia, Africa, Americas, Oceania)
     - Multiple institution types (museums, libraries, archives, botanical gardens)
     - Edge cases (serial nominations, transboundary sites)

2. **Create Expected Outputs**

   ```yaml
   # tests/fixtures/expected_outputs/unesco_louvre.yaml
   - id: https://w3id.org/heritage/custodian/fr/louvre
     name: Musée du Louvre
     institution_type: MUSEUM
     locations:
       - city: Paris
         country: FR
         coordinates: [48.8606, 2.3376]
     identifiers:
       - identifier_scheme: UNESCO_WHC
         identifier_value: "600"
         identifier_url: https://whc.unesco.org/en/list/600
     provenance:
       data_source: UNESCO_WORLD_HERITAGE
       data_tier: TIER_1_AUTHORITATIVE
   ```

3. **Golden Dataset Construction**
   - Manually verify the 20 expected outputs against authoritative sources
   - Document any assumptions or inferences made

**Deliverables**:

- `tests/fixtures/unesco_api_responses/` - 20 JSON files
- `tests/fixtures/expected_outputs/` - 20 YAML files
- `tests/test_unesco_golden_dataset.py` - Integration tests

**Success Criteria**:

- [ ] 20 golden dataset pairs (input JSON + expected YAML)
- [ ] 100% passing tests for the golden dataset
- [ ] Documented edge cases and classification rules

---
### Day 5: Institution Type Classifier Design

**Tasks**:

1. **Pattern Analysis**
   - Analyze UNESCO descriptions for GLAM-related keywords
   - Create a decision tree for institution type classification
   - Handle ambiguous cases (e.g., "archaeological park with museum")

2. **Keyword Extraction**

   ```python
   # src/glam_extractor/classifiers/unesco_institution_type.py

   MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'muzeum', 'gallery', 'exhibition']
   LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca', 'bibliotheek']
   ARCHIVE_KEYWORDS = ['archive', 'archiv', 'archivo', 'archief', 'documentary heritage']
   BOTANICAL_KEYWORDS = ['botanical garden', 'jardin botanique', 'arboretum']
   HOLY_SITE_KEYWORDS = ['cathedral', 'church', 'monastery', 'abbey', 'temple', 'mosque', 'synagogue']
   FEATURES_KEYWORDS = ['monument', 'statue', 'sculpture', 'memorial', 'landmark', 'cemetery', 'obelisk', 'fountain', 'arch', 'gate']
   ```

3. **Confidence Scoring**
   - High confidence (0.9+): explicit mentions of "museum" or "library"
   - Medium confidence (0.7-0.9): inferred from UNESCO category + keywords
   - Low confidence (0.5-0.7): ambiguous; default to MIXED and flag for review
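
The confidence bands above can be sketched as a small keyword classifier. Keyword lists are shortened here, and the exact scoring thresholds are illustrative, not the final decision tree:

```python
# Shortened keyword lists for illustration
MUSEUM_KEYWORDS = ['museum', 'musée', 'museo', 'gallery']
LIBRARY_KEYWORDS = ['library', 'bibliothèque', 'biblioteca']

def classify_with_confidence(description: str) -> tuple:
    """Return (label, confidence) following the bands above."""
    text = description.lower()
    hits = {
        'MUSEUM': sum(kw in text for kw in MUSEUM_KEYWORDS),
        'LIBRARY': sum(kw in text for kw in LIBRARY_KEYWORDS),
    }
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        return 'MIXED', 0.5          # ambiguous: default, flag for review
    if hits[best] >= 2 or best.lower() in text:
        return best, 0.9             # explicit mention: high confidence
    return best, 0.7                 # single keyword hit: medium confidence
```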
**Deliverables**:

- `src/glam_extractor/classifiers/unesco_institution_type.py`
- `tests/test_unesco_classifier.py` - 50+ test cases
- `docs/unesco-classification-rules.md` - Decision tree documentation

**Success Criteria**:

- [ ] Classifier achieves 90%+ accuracy on the golden dataset
- [ ] Low-confidence classifications flagged for manual review
- [ ] Handle multilingual descriptions (English, French, Spanish, etc.)

---
## Phase 2: Extractor Implementation (Days 6-13)

### Objectives

- Implement a UNESCO API client with caching and rate limiting
- Build a LinkML instance generator using the Map schema
- Create a GHCID generator for UNESCO institutions
- Achieve 100% test coverage for core extraction logic
### Day 6: UNESCO API Client

**Tasks**:

1. **HTTP Client Implementation**

   ```python
   # src/glam_extractor/parsers/unesco_api_client.py

   class UNESCOAPIClient:
       def __init__(self, cache_ttl: int = 86400):
           self.base_url = "https://data.unesco.org/api/explore/v2.0/catalog/datasets/whc001"
           self.cache = Cache(ttl=cache_ttl)
           self.rate_limiter = RateLimiter(requests_per_second=2)

       def fetch_site_list(self, limit: int = 100, offset: int = 0) -> Dict:
           """
           Fetch a paginated list of UNESCO World Heritage Sites.

           Returns an OpenDataSoft API response with the structure:
           {
               "total_count": int,
               "results": [{"record": {"id": str, "fields": {...}}}, ...]
           }
           """
           ...

       def fetch_site_details(self, whc_id: int) -> Dict:
           """
           Fetch detailed information for a specific site using an ODSQL query.

           Returns a single record with the structure:
           {"record": {"id": str, "fields": {field_name: value, ...}}}
           """
           ...
   ```

2. **Caching Strategy**
   - Cache API responses for 24 hours (UNESCO updates infrequently)
   - Store in an SQLite database: `cache/unesco_api_cache.db`
   - Invalidate the cache on demand for data refreshes
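
The caching strategy could be backed by a single SQLite table; a sketch under stated assumptions (the table layout and method names are illustrative, not the final `Cache` API):

```python
import json
import sqlite3
import time

class Cache:
    """TTL cache for API responses backed by SQLite (sketch)."""

    def __init__(self, path: str = "cache/unesco_api_cache.db", ttl: int = 86400):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses"
            " (url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
        )

    def get(self, url: str):
        row = self.db.execute(
            "SELECT body, fetched_at FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])  # cache hit, still fresh
        return None                    # miss or expired

    def put(self, url: str, body: dict):
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, json.dumps(body), time.time()),
        )
        self.db.commit()

    def invalidate(self, url: str):
        """On-demand invalidation for data refreshes."""
        self.db.execute("DELETE FROM responses WHERE url = ?", (url,))
        self.db.commit()
```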
3. **Error Handling**
   - Network errors → retry with exponential backoff
   - 404 Not Found → skip the site, log a warning
   - Rate limit exceeded → pause and retry
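
The backoff rule can be sketched as a wrapper around any fetch callable. The `fetch` signature and the 404-returns-None convention are assumptions for illustration:

```python
import logging
import time

log = logging.getLogger("unesco.client")

def fetch_with_retry(fetch, url: str, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a fetch callable with exponential backoff (sketch).

    `fetch` performs the HTTP request; transient errors raise, and a
    404 is assumed to return None (skip the site, log a warning).
    """
    for attempt in range(max_attempts):
        try:
            result = fetch(url)
            if result is None:  # 404: skip site, log a warning
                log.warning("Site not found: %s", url)
            return result
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            log.info("Retrying %s in %.3fs after error: %s", url, delay, exc)
            time.sleep(delay)
```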
**Deliverables**:

- `src/glam_extractor/parsers/unesco_api_client.py`
- `tests/test_unesco_api_client.py` - Mock API tests
- `cache/unesco_api_cache.db` - SQLite cache database

**Success Criteria**:

- [ ] Successfully fetch all UNESCO sites (1,000+ sites)
- [ ] Handle API errors gracefully (no crashes)
- [ ] Cache reduces API calls by 95% on repeat runs

---
### Day 7: LinkML Instance Generator

**Tasks**:

1. **Apply LinkML Map Transformations**

   ```python
   # src/glam_extractor/parsers/unesco_parser.py

   from linkml_map import Mapper

   def parse_unesco_site(api_response: Dict) -> HeritageCustodian:
       """
       Parse an OpenDataSoft API response into a HeritageCustodian instance.

       Args:
           api_response: OpenDataSoft record with the nested structure:
               {"record": {"id": str, "fields": {field_name: value, ...}}}

       Returns:
           HeritageCustodian: Validated LinkML instance
       """
       # Extract fields from the nested structure
       site_data = api_response['record']['fields']

       mapper = Mapper(schema_path="schemas/maps/unesco_to_heritage_custodian.yaml")
       instance = mapper.transform(site_data)
       return HeritageCustodian(**instance)
   ```

2. **Validation Pipeline**
   - Apply LinkML schema validation after transformation
   - Catch validation errors, log the details
   - Skip invalid records, continue processing

3. **Multi-language Name Handling**

   ```python
   def extract_multilingual_names(site_data: Dict) -> Tuple[str, List[str]]:
       """
       Extract the primary name and alternative names in multiple languages.

       Args:
           site_data: Extracted fields from the OpenDataSoft record['fields']
       """
       primary_name = site_data.get('site', '')
       alternative_names = []

       # OpenDataSoft may provide language variants in separate fields
       # or as structured data - adjust based on the actual API response
       for lang_data in site_data.get('names', []):
           name = lang_data.get('name', '')
           lang = lang_data.get('lang', 'en')
           if name and name != primary_name:
               alternative_names.append(f"{name}@{lang}")

       return primary_name, alternative_names
   ```

**Deliverables**:

- `src/glam_extractor/parsers/unesco_parser.py`
- `tests/test_unesco_parser.py` - 20 golden dataset tests

**Success Criteria**:

- [ ] Parse the 20 golden dataset fixtures with 100% accuracy
- [ ] Extract multilingual names without data loss
- [ ] Generate valid LinkML instances (pass schema validation)

---
### Day 8: GHCID Generator for UNESCO Sites

**Tasks**:

1. **Extend GHCID Logic for UNESCO**

   ```python
   # src/glam_extractor/identifiers/ghcid_generator.py

   def generate_ghcid_for_unesco_site(
       country_code: str,
       region_code: str,
       city_code: str,
       institution_type: InstitutionTypeEnum,
       institution_name: str,
       has_collision: bool = False
   ) -> str:
       """
       Generate a GHCID for a UNESCO World Heritage Site institution.

       Format: {COUNTRY}-{REGION}-{CITY}-{TYPE}-{ABBREV}[-{native_name_snake_case}]
       Example: FR-IDF-PAR-M-SM-stedelijk_museum_paris (with collision suffix)

       Note: The collision suffix uses the native-language institution name
       in snake_case, NOT Wikidata Q-numbers.
       See docs/plan/global_glam/07-ghcid-collision-resolution.md
       """
       ...
   ```

2. **City Code Lookup**
   - Use the GeoNames API to convert city names to UN/LOCODE
   - Fall back to a 3-letter abbreviation if no UN/LOCODE is found
   - Cache lookups to minimize API calls
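
A possible shape for the 3-letter fallback when GeoNames has no UN/LOCODE entry; the normalization rules here are purely illustrative (the real lookup would consult GeoNames first):

```python
import re
import unicodedata

def fallback_city_code(city_name: str) -> str:
    """Derive a 3-letter city code when no UN/LOCODE entry is found (sketch).

    Strips diacritics, drops non-letters, and takes the first three
    letters, e.g. "Mérida" -> "MER".
    """
    # Remove diacritics by decomposing and dropping combining marks
    ascii_name = (
        unicodedata.normalize("NFKD", city_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    letters = re.sub(r"[^A-Za-z]", "", ascii_name).upper()
    return letters[:3].ljust(3, "X")  # pad very short names with X
```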
3. **Collision Detection**
   - Check the existing GHCID dataset for collisions
   - Apply temporal priority rules (first batch vs. historical addition)
   - Append a native-language name suffix if a collision is detected

**Deliverables**:

- Extended `src/glam_extractor/identifiers/ghcid_generator.py`
- `tests/test_ghcid_unesco.py` - GHCID generation tests

**Success Criteria**:

- [ ] Generate valid GHCIDs for the 20 golden dataset institutions
- [ ] No collisions with the existing Dutch ISIL dataset
- [ ] Handle missing Wikidata Q-numbers gracefully

---
### Day 9-10: Batch Processing Pipeline

**Tasks**:

1. **Parallel Processing**

   ```python
   # scripts/extract_unesco_sites.py

   from concurrent.futures import ThreadPoolExecutor

   # Module-level client so that worker threads can reach it
   api_client = UNESCOAPIClient()

   def extract_all_unesco_sites(max_workers: int = 4):
       # Fetch the paginated site list from the OpenDataSoft API
       all_sites = []
       offset = 0
       limit = 100

       while True:
           response = api_client.fetch_site_list(limit=limit, offset=offset)
           all_sites.extend(response['results'])

           if len(all_sites) >= response['total_count']:
               break
           offset += limit

       with ThreadPoolExecutor(max_workers=max_workers) as executor:
           results = executor.map(process_unesco_site, all_sites)

       return list(results)

   def process_unesco_site(site_record: Dict) -> Optional[HeritageCustodian]:
       """
       Process a single OpenDataSoft record.

       Args:
           site_record: {"record": {"id": str, "fields": {...}}}
       """
       # Extract unique_number before the try block so the except
       # clause can always reference it
       whc_id = site_record['record']['fields'].get('unique_number')
       try:
           # Fetch full details if needed (or use site_record directly)
           details = api_client.fetch_site_details(whc_id)
           institution_type = classify_institution_type(details['record']['fields'])
           custodian = parse_unesco_site(details)
           custodian.ghcid = generate_ghcid(custodian)
           return custodian
       except Exception as e:
           log.error(f"Failed to process site {whc_id}: {e}")
           return None
   ```

2. **Progress Tracking**
   - Use `tqdm` for progress bars
   - Log successful extractions to `logs/unesco_extraction.log`
   - Save intermediate results every 100 sites

3. **Error Recovery**
   - Resume from the last checkpoint if the script crashes
   - Separate successful extractions from failed ones
   - Generate an error report with the failed site IDs
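
The checkpoint-based resume could be as simple as a JSON file holding the set of processed `unique_number`s; a sketch in which the checkpoint location and flush interval are assumptions:

```python
import json
from pathlib import Path

def load_done_ids(checkpoint: Path) -> set:
    """Return the unique_numbers already processed, for crash-resume."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()

def mark_done(checkpoint: Path, done: set, whc_id: int, every: int = 100):
    """Record a finished site; flush the checkpoint every `every` sites."""
    done.add(whc_id)
    if len(done) % every == 0:
        checkpoint.write_text(json.dumps(sorted(done)))
```

On startup, the batch script would call `load_done_ids` and filter already-processed sites out of `all_sites` before submitting work to the executor.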
**Deliverables**:

- `scripts/extract_unesco_sites.py` - Batch extraction script
- `data/unesco_extracted/` - Output directory for YAML instances
- `logs/unesco_extraction.log` - Extraction log

**Success Criteria**:

- [ ] Process all 1,000+ UNESCO sites in < 2 hours
- [ ] < 5% failure rate (API errors, missing data)
- [ ] Successful extractions saved as valid LinkML YAML files

---
### Day 11-12: Integration Testing

**Tasks**:

1. **End-to-End Tests**

   ```python
   # tests/integration/test_unesco_pipeline.py

   def test_full_unesco_extraction_pipeline():
       """Test the complete pipeline from OpenDataSoft API fetch to LinkML instance."""
       # 1. Fetch API data from OpenDataSoft
       api_client = UNESCOAPIClient()
       response = api_client.fetch_site_details(600)  # Paris, Banks of the Seine
       site_data = response['record']['fields']  # Extract from the nested structure

       # 2. Classify institution type
       inst_type = classify_institution_type(site_data)
       assert inst_type in [InstitutionTypeEnum.MUSEUM, InstitutionTypeEnum.MIXED]

       # 3. Parse to a LinkML instance
       custodian = parse_unesco_site(response)
       assert custodian.name is not None

       # 4. Generate the GHCID
       custodian.ghcid = generate_ghcid(custodian)
       assert custodian.ghcid.startswith("FR-")

       # 5. Validate against the schema
       validator = SchemaValidator(schema="schemas/heritage_custodian.yaml")
       result = validator.validate(custodian)
       assert result.is_valid
   ```

2. **Property-Based Testing**

   ```python
   from hypothesis import given, strategies as st

   @given(st.integers(min_value=1, max_value=1500))
   def test_ghcid_determinism(whc_id: int):
       """GHCID generation is deterministic for the same input."""
       site1 = generate_ghcid_for_site(whc_id)
       site2 = generate_ghcid_for_site(whc_id)
       assert site1 == site2
   ```

3. **Performance Testing**
   - Benchmark extraction speed (sites per second)
   - Memory profiling (ensure no memory leaks)
   - Cache hit rate analysis

**Deliverables**:

- `tests/integration/test_unesco_pipeline.py` - End-to-end tests
- `tests/test_unesco_property_based.py` - Property-based tests
- `docs/performance-benchmarks.md` - Performance results

**Success Criteria**:

- [ ] 100% passing integration tests
- [ ] Extract 1,000 sites in < 2 hours (with cache)
- [ ] Memory usage < 500MB for a full extraction

---
### Day 13: Code Review & Refactoring

**Tasks**:

1. **Code Quality Review**
   - Run the `ruff` linter, fix all warnings
   - Run the `mypy` type checker, resolve type errors
   - Ensure 90%+ test coverage

2. **Documentation Review**
   - Add docstrings to all public functions
   - Update the README with UNESCO extraction instructions
   - Create a developer guide for extending classifiers

3. **Performance Optimization**
   - Profile slow functions, optimize bottlenecks
   - Reduce redundant API calls
   - Optimize GHCID generation (cache city code lookups)

**Deliverables**:

- Refactored codebase with 90%+ test coverage
- Updated documentation
- Performance optimizations applied

**Success Criteria**:

- [ ] Zero linter warnings
- [ ] Zero type errors
- [ ] Test coverage > 90%

---
## Phase 3: Data Quality & Validation (Days 14-19)

### Objectives

- Cross-reference UNESCO data with the existing GLAM dataset (Dutch ISIL, conversations)
- Detect and resolve conflicts
- Implement a confidence scoring system
- Generate a data quality report
### Day 14-15: Cross-Referencing with Existing Data

**Tasks**:

1. **Load Existing Dataset**

   ```python
   # scripts/crosslink_unesco_with_glam.py

   def load_existing_glam_dataset():
       """Load Dutch ISIL + conversation extractions."""
       dutch_isil = load_isil_registry("data/ISIL-codes_2025-08-01.csv")
       dutch_orgs = load_dutch_orgs("data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv")
       conversations = load_conversation_extractions("data/instances/")
       return merge_datasets([dutch_isil, dutch_orgs, conversations])
   ```

2. **Match UNESCO Sites to Existing Records**
   - Match by Wikidata Q-number (highest confidence)
   - Match by ISIL code (for Dutch institutions)
   - Match by name + location (fuzzy matching, score > 0.85)
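
The name + location rule might be sketched with the stdlib `difflib`; the 0.8/0.2 weighting is illustrative, and production code may prefer a token-based similarity library:

```python
from difflib import SequenceMatcher

def name_location_score(a_name: str, a_city: str, b_name: str, b_city: str) -> float:
    """Blend name and city similarity into one fuzzy-match score."""
    name_sim = SequenceMatcher(None, a_name.lower(), b_name.lower()).ratio()
    city_sim = SequenceMatcher(None, a_city.lower(), b_city.lower()).ratio()
    return 0.8 * name_sim + 0.2 * city_sim  # weights are illustrative

def is_match(score: float, threshold: float = 0.85) -> bool:
    """Apply the > 0.85 acceptance threshold from the rules above."""
    return score >= threshold
```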
3. **Conflict Detection**

   ```python
   def detect_conflicts(unesco_record: HeritageCustodian, existing_record: HeritageCustodian) -> List[Conflict]:
       """Detect field-level conflicts between UNESCO and existing data."""
       conflicts = []

       if unesco_record.name != existing_record.name:
           conflicts.append(Conflict(
               field="name",
               unesco_value=unesco_record.name,
               existing_value=existing_record.name,
               resolution="MANUAL_REVIEW"
           ))

       # Check for an institution type mismatch
       if unesco_record.institution_type != existing_record.institution_type:
           conflicts.append(Conflict(field="institution_type", ...))

       return conflicts
   ```

**Deliverables**:

- `scripts/crosslink_unesco_with_glam.py` - Cross-linking script
- `data/unesco_conflicts.csv` - Detected conflicts report
- `tests/test_crosslinking.py` - Unit tests for matching logic

**Success Criteria**:

- [ ] Identify 50+ matches between UNESCO and the existing dataset
- [ ] Detect conflicts (name mismatches, type discrepancies)
- [ ] Generate a conflict report for manual review

---
### Day 16: Conflict Resolution

**Tasks**:

1. **Tier-Based Priority**
   - TIER_1 (UNESCO, Dutch ISIL) > TIER_4 (conversation NLP)
   - On conflict: UNESCO data wins, existing data is flagged

2. **Merge Strategy**

   ```python
   def merge_unesco_with_existing(
       unesco_record: HeritageCustodian,
       existing_record: HeritageCustodian
   ) -> HeritageCustodian:
       """Merge UNESCO data with an existing record; UNESCO takes priority."""
       merged = existing_record.copy()

       # The UNESCO name becomes primary
       merged.name = unesco_record.name

       # Preserve alternative names from both sources
       merged.alternative_names = list(set(
           unesco_record.alternative_names + existing_record.alternative_names
       ))

       # Add the UNESCO identifier
       merged.identifiers.append({
           'identifier_scheme': 'UNESCO_WHC',
           'identifier_value': unesco_record.identifiers[0].identifier_value
       })

       # Track the provenance of the merge
       merged.provenance.notes = "Merged with UNESCO TIER_1 data on 2025-11-XX"

       return merged
   ```

3. **Manual Review Queue**
   - Flag high-impact conflicts (institution type change, location change)
   - Generate a review spreadsheet for human validation
   - Provide evidence for each conflict (source URLs, descriptions)

**Deliverables**:

- `src/glam_extractor/validators/conflict_resolver.py`
- `data/manual_review_queue.csv` - Conflicts requiring human review
- `tests/test_conflict_resolution.py`

**Success Criteria**:

- [ ] Resolve 80% of conflicts automatically (tier-based priority)
- [ ] Flag 20% for manual review (complex cases)
- [ ] Zero data loss (preserve all alternative names and identifiers)

---
### Day 17: Confidence Scoring System

**Tasks**:

1. **Score Calculation**

   ```python
   def calculate_confidence_score(custodian: HeritageCustodian) -> float:
       """Calculate a confidence score based on data completeness and source quality."""
       score = 1.0  # Start at the maximum (TIER_1 authoritative)

       # Deduct for missing fields
       if not custodian.identifiers:
           score -= 0.1
       if not custodian.locations:
           score -= 0.15
       if custodian.institution_type == InstitutionTypeEnum.MIXED:
           score -= 0.2  # Ambiguous classification

       # Boost for rich metadata
       if len(custodian.identifiers) > 2:
           score += 0.05
       if custodian.digital_platforms:
           score += 0.05

       return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
   ```

2. **Quality Metrics**
   - Completeness: % of optional fields populated
   - Accuracy: agreement with authoritative sources (Wikidata, official websites)
   - Freshness: days since extraction
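
The completeness metric could be computed per record as a simple filled-field ratio; the optional-field list here is illustrative, and the record is treated as a plain dict:

```python
# Illustrative set of optional fields tracked for completeness
OPTIONAL_FIELDS = ("identifiers", "locations", "digital_platforms", "alternative_names")

def completeness(record: dict, fields=OPTIONAL_FIELDS) -> float:
    """Fraction of optional fields populated (the completeness metric above)."""
    filled = sum(1 for f in fields if record.get(f))
    return filled / len(fields)
```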
3. **Tier Validation**
   - Ensure all UNESCO records have `data_tier: TIER_1_AUTHORITATIVE`
   - Downgrade the tier if conflicts are detected (TIER_1 → TIER_2)

**Deliverables**:

- `src/glam_extractor/validators/confidence_scorer.py`
- `tests/test_confidence_scoring.py`
- `data/unesco_quality_metrics.json` - Aggregate statistics

**Success Criteria**:

- [ ] 90%+ of UNESCO records score > 0.85 confidence
- [ ] Flag < 5% as low confidence (requiring review)
- [ ] Document the scoring methodology in provenance metadata

---
### Day 18: LinkML Schema Validation

**Tasks**:

1. **Batch Validation**

   ```bash
   # Validate all UNESCO extractions against the LinkML schema
   for file in data/unesco_extracted/*.yaml; do
     linkml-validate -s schemas/heritage_custodian.yaml "$file" || echo "FAILED: $file"
   done
   ```

2. **Custom Validators**

   ```python
   # src/glam_extractor/validators/unesco_validators.py

   def validate_unesco_whc_id(whc_id: str) -> bool:
       """UNESCO WHC IDs are 3-4 digit integers."""
       return bool(re.match(r'^\d{3,4}$', whc_id))

   def validate_ghcid_format(ghcid: str) -> bool:
       """Validate the GHCID format for UNESCO institutions.

       The optional collision suffix is the native-language name in
       snake_case (see the Day 8 note), not a Wikidata Q-number.
       """
       pattern = r'^[A-Z]{2}-[A-Z0-9]+-[A-Z]{3}-[GLAMORCUBEPSXHF]-[A-Z]{2,5}(-[a-z0-9_]+)?$'
       return bool(re.match(pattern, ghcid))
   ```

3. **Error Reporting**
   - Generate a validation report with line numbers and error messages
   - Categorize errors (required field missing, invalid enum value, format error)
   - Prioritize fixes (blocking errors vs. warnings)
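
Categorizing and prioritizing could look like the following; the error-dict shape and category names are assumptions for illustration:

```python
from collections import Counter

# Categories treated as blocking; the rest are warnings (illustrative)
BLOCKING = {"required_field_missing", "invalid_enum_value"}

def summarize_errors(errors: list) -> dict:
    """Group validation errors by category and split blocking vs. warnings.

    Each error is assumed to be a dict with 'file', 'line', 'category'.
    """
    by_category = Counter(e["category"] for e in errors)
    blocking = [e for e in errors if e["category"] in BLOCKING]
    warnings = [e for e in errors if e["category"] not in BLOCKING]
    return {
        "by_category": dict(by_category),
        "blocking_count": len(blocking),
        "warning_count": len(warnings),
    }
```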
**Deliverables**:

- `scripts/validate_unesco_dataset.py` - Batch validation script
- `src/glam_extractor/validators/unesco_validators.py` - Custom validators
- `data/unesco_validation_report.json` - Validation results

**Success Criteria**:

- [ ] 100% of extracted records pass LinkML validation
- [ ] Zero blocking errors
- [ ] Document any warnings in provenance notes

---
### Day 19: Data Quality Report

**Tasks**:

1. **Generate Statistics**

   ```python
   # scripts/generate_unesco_quality_report.py

   def generate_quality_report(unesco_dataset: List[HeritageCustodian]) -> Dict:
       return {
           'total_institutions': len(unesco_dataset),
           'by_country': count_by_country(unesco_dataset),
           'by_institution_type': count_by_type(unesco_dataset),
           'avg_confidence_score': calculate_avg_confidence(unesco_dataset),
           'completeness_metrics': {
               'with_wikidata_id': count_with_wikidata(unesco_dataset),
               'with_digital_platform': count_with_platforms(unesco_dataset),
               'with_geocoded_location': count_with_geocoding(unesco_dataset)
           },
           'conflicts_detected': len(load_conflicts('data/unesco_conflicts.csv')),
           'manual_review_pending': len(load_review_queue('data/manual_review_queue.csv'))
       }
   ```

2. **Visualization**
   - Generate maps showing the UNESCO site distribution
   - Bar charts: institutions by country and by type
   - Heatmap: data completeness by field

3. **Documentation**
   - Write an executive summary of data quality
   - Document known issues and limitations
   - Provide recommendations for improvement

**Deliverables**:

- `scripts/generate_unesco_quality_report.py`
- `data/unesco_quality_report.json` - Statistics
- `docs/unesco-data-quality.md` - Quality report document
- `data/visualizations/` - Maps and charts

**Success Criteria**:

- [ ] Quality report shows 90%+ completeness for core fields
- [ ] < 5% of records require manual review
- [ ] Geographic coverage across all inhabited continents

---
## Phase 4: Integration & Enrichment (Days 20-24)

### Objectives

- Enrich UNESCO data with Wikidata identifiers
- Merge the UNESCO dataset with the existing GLAM dataset
- Resolve GHCID collisions
- Update the GHCID history for modified records
### Day 20-21: Wikidata Enrichment

**Tasks**:

1. **SPARQL Query for UNESCO Sites**

   ```python
   # scripts/enrich_unesco_with_wikidata.py

   def query_wikidata_for_unesco_site(whc_id: str) -> Optional[str]:
       """Find the Wikidata Q-number for a UNESCO World Heritage Site."""
       query = f"""
       SELECT ?item WHERE {{
         ?item wdt:P757 "{whc_id}" .  # P757 = UNESCO World Heritage Site ID
       }}
       LIMIT 1
       """

       results = sparql_query(query)
       if results:
           return extract_qid(results[0]['item']['value'])
       return None
   ```

2. **Batch Enrichment**
   - Query Wikidata for all 1,000+ UNESCO sites
   - Extract Q-numbers, VIAF IDs, and ISIL codes (where available)
   - Add them to the `identifiers` array in the LinkML instances

3. **Fuzzy Matching Fallback**
   - If the WHC ID is not found in Wikidata, try name + location matching
   - Use the same fuzzy matching logic as the existing enrichment scripts
   - Threshold: 0.85 similarity score

**Deliverables**:

- `scripts/enrich_unesco_with_wikidata.py` - Enrichment script
- `data/unesco_enriched/` - Enriched YAML instances
- `logs/wikidata_enrichment.log` - Enrichment log

**Success Criteria**:

- [ ] Find Wikidata Q-numbers for 80%+ of UNESCO sites
- [ ] Add VIAF/ISIL identifiers where available
- [ ] Document the enrichment in provenance metadata

---
### Day 22: Dataset Merge

**Tasks**:

1. **Merge Strategy**

   ```python
   # scripts/merge_unesco_into_glam_dataset.py

   def merge_datasets(
       unesco_data: List[HeritageCustodian],
       existing_data: List[HeritageCustodian]
   ) -> List[HeritageCustodian]:
       """Merge UNESCO data into the existing GLAM dataset."""
       merged = existing_data.copy()

       for unesco_record in unesco_data:
           # Check whether the institution already exists
           match = find_match(unesco_record, existing_data)

           if match:
               # Merge the records
               merged_record = merge_unesco_with_existing(unesco_record, match)
               merged[merged.index(match)] = merged_record
           else:
               # Add a new institution
               merged.append(unesco_record)

       return merged
   ```

2. **Deduplication**
   - Detect duplicates by GHCID, Wikidata Q-number, or ISIL code
   - Prefer UNESCO data (TIER_1) over conversation data (TIER_4)
   - Preserve alternative names and identifiers from both sources
3. **Provenance Tracking**
|
|
- Update `provenance.notes` for merged records
|
|
- Record merge timestamp
|
|
- Link back to original extraction sources
|
|
|
|
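The `find_match` call in the merge strategy is left undefined in this plan. One way to sketch it, checking the same identifier schemes the deduplication step lists (GHCID, Wikidata Q-number, ISIL); the dataclasses here are simplified stand-ins for the real LinkML-generated classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Identifier:
    identifier_scheme: str   # e.g. "Wikidata", "ISIL"
    identifier_value: str

@dataclass
class Custodian:             # stand-in for HeritageCustodian
    ghcid: str
    identifiers: List[Identifier] = field(default_factory=list)

def find_match(record: Custodian, existing: List[Custodian]) -> Optional[Custodian]:
    """Return the first existing record sharing a GHCID or a
    Wikidata/ISIL identifier with `record`, else None."""
    keys = {("GHCID", record.ghcid)} | {
        (i.identifier_scheme, i.identifier_value)
        for i in record.identifiers
        if i.identifier_scheme in ("Wikidata", "ISIL")
    }
    for candidate in existing:
        cand_keys = {("GHCID", candidate.ghcid)} | {
            (i.identifier_scheme, i.identifier_value)
            for i in candidate.identifiers
        }
        if keys & cand_keys:
            return candidate
    return None
```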
**Deliverables**:
- `scripts/merge_unesco_into_glam_dataset.py` - Merge script
- `data/merged_glam_dataset/` - Merged dataset (YAML files)
- `data/merge_report.json` - Merge statistics

**Success Criteria**:
- [ ] Merge 1,000+ UNESCO records with existing GLAM dataset
- [ ] Deduplicate matches (no duplicate GHCIDs)
- [ ] Preserve data from all sources (no information loss)

---

### Day 23: GHCID Collision Resolution

**Tasks**:
1. **Detect Collisions**
```python
# scripts/resolve_ghcid_collisions.py

def detect_ghcid_collisions(dataset: List[HeritageCustodian]) -> List[Collision]:
    """Find institutions with identical base GHCIDs."""
    ghcid_map = defaultdict(list)

    for custodian in dataset:
        base_ghcid = remove_q_number(custodian.ghcid)
        ghcid_map[base_ghcid].append(custodian)

    collisions = [
        Collision(base_ghcid=k, institutions=v)
        for k, v in ghcid_map.items()
        if len(v) > 1
    ]

    return collisions
```

2. **Apply Temporal Priority Rules**
   - Compare `provenance.extraction_date` for colliding institutions
   - First batch (same date): ALL get Q-numbers
   - Historical addition (later date): ONLY new gets name suffix

3. **Update GHCID History**
```python
def update_ghcid_history(custodian: HeritageCustodian, old_ghcid: str, new_ghcid: str):
    """Record GHCID change in history."""
    custodian.ghcid_history.append(GHCIDHistoryEntry(
        ghcid=new_ghcid,
        ghcid_numeric=generate_numeric_id(new_ghcid),
        valid_from=datetime.now(timezone.utc).isoformat(),
        valid_to=None,
        reason=f"Name suffix added to resolve collision with {old_ghcid}"
    ))
```

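The temporal priority rules in step 2 can be sketched as follows. This is an illustration of the decision only: the `Colliding` dataclass and its fields are placeholders (the real records carry `provenance.extraction_date`), and generating the actual name suffix is left to the resolution script:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Colliding:
    ghcid: str
    extraction_date: str   # ISO date string, e.g. "2025-11-10"
    needs_suffix: bool = False

def apply_temporal_priority(institutions: List[Colliding]) -> List[Colliding]:
    """Records from the earliest shared extraction date form the first
    batch and keep their GHCIDs (Q-number disambiguation); any later
    addition is flagged to receive a name suffix instead."""
    first_date = min(inst.extraction_date for inst in institutions)
    for inst in institutions:
        # ISO date strings order correctly under plain string comparison
        inst.needs_suffix = inst.extraction_date > first_date
    return institutions
```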
**Deliverables**:
- `scripts/resolve_ghcid_collisions.py` - Collision resolution script
- `data/ghcid_collision_report.json` - Detected collisions
- Updated YAML instances with `ghcid_history` entries

**Success Criteria**:
- [ ] Resolve all GHCID collisions (zero duplicates)
- [ ] Update GHCID history for affected records
- [ ] Preserve PID stability (no changes to published GHCIDs)

---

### Day 24: Final Data Validation

**Tasks**:
1. **Full Dataset Validation**
   - Run LinkML validation on merged dataset
   - Check for orphaned references (invalid foreign keys)
   - Verify all GHCIDs are unique

2. **Integrity Checks**
```python
# tests/integration/test_merged_dataset_integrity.py

def test_no_duplicate_ghcids():
    dataset = load_merged_dataset()
    ghcids = [c.ghcid for c in dataset]
    assert len(ghcids) == len(set(ghcids)), "Duplicate GHCIDs detected!"

def test_all_unesco_sites_have_whc_id():
    dataset = load_merged_dataset()
    unesco_records = [c for c in dataset if c.provenance.data_source == "UNESCO_WORLD_HERITAGE"]

    for record in unesco_records:
        whc_ids = [i for i in record.identifiers if i.identifier_scheme == "UNESCO_WHC"]
        assert len(whc_ids) > 0, f"{record.name} missing UNESCO WHC ID"
```

3. **Coverage Analysis**
   - Verify UNESCO sites across all continents
   - Check institution type distribution (not all MUSEUM)
   - Ensure Dutch institutions properly merged with ISIL registry

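A quick version of the coverage analysis above can be built on `collections.Counter` over the merged records; the `(institution_type, country)` pair input is an illustrative simplification of the real record fields:

```python
from collections import Counter
from typing import Iterable, Tuple

def coverage_stats(records: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Given (institution_type, country) pairs, return both distributions.

    Sanity checks per the plan: types should not collapse to MUSEUM only,
    and the country distribution should span 100+ values."""
    pairs = list(records)
    type_counts = Counter(t for t, _ in pairs)
    country_counts = Counter(c for _, c in pairs)
    return type_counts, country_counts
```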
**Deliverables**:
- `tests/integration/test_merged_dataset_integrity.py` - Integrity tests
- `data/final_validation_report.json` - Validation results
- `docs/dataset-coverage.md` - Coverage analysis

**Success Criteria**:
- [ ] 100% passing integrity tests
- [ ] Zero duplicate GHCIDs
- [ ] UNESCO sites cover 100+ countries

---

## Phase 5: Export & Documentation (Days 25-30)

### Objectives

- Export merged dataset in multiple formats (RDF, JSON-LD, CSV, Parquet)
- Generate user documentation and API docs
- Create example queries and use case tutorials
- Publish dataset with persistent identifiers

### Day 25-26: RDF/JSON-LD Export

**Tasks**:
1. **RDF Serialization**
```python
# src/glam_extractor/exporters/rdf_exporter.py

def export_to_rdf(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to RDF/Turtle format."""
    graph = Graph()

    # Define namespaces
    GLAM = Namespace("https://w3id.org/heritage/custodian/")
    graph.bind("glam", GLAM)
    graph.bind("schema", SCHEMA)
    graph.bind("cpov", Namespace("http://data.europa.eu/m8g/"))

    for custodian in dataset:
        uri = URIRef(custodian.id)

        # Type assertions
        graph.add((uri, RDF.type, GLAM.HeritageCustodian))
        if custodian.institution_type == "MUSEUM":
            graph.add((uri, RDF.type, SCHEMA.Museum))

        # Literals
        graph.add((uri, SCHEMA.name, Literal(custodian.name)))
        graph.add((uri, GLAM.institution_type, Literal(custodian.institution_type)))

        # Identifiers (owl:sameAs)
        for identifier in custodian.identifiers:
            if identifier.identifier_scheme == "Wikidata":
                graph.add((uri, OWL.sameAs, URIRef(identifier.identifier_url)))

    graph.serialize(destination=output_path, format="turtle")
```

2. **JSON-LD Context**
```json
// data/context/heritage_custodian_context.jsonld
{
  "@context": {
    "@vocab": "https://w3id.org/heritage/custodian/",
    "schema": "http://schema.org/",
    "name": "schema:name",
    "location": "schema:location",
    "identifiers": "schema:identifier",
    "institution_type": "institutionType",
    "data_source": "dataSource"
  }
}
```

3. **Content Negotiation Setup**
   - Configure w3id.org redirects (if hosting on GitHub Pages)
   - Test URI resolution for sample institutions
   - Ensure Accept header routing (text/turtle, application/ld+json)

**Deliverables**:
- `src/glam_extractor/exporters/rdf_exporter.py` - RDF exporter
- `data/exports/glam_dataset.ttl` - RDF/Turtle export
- `data/exports/glam_dataset.jsonld` - JSON-LD export
- `data/context/heritage_custodian_context.jsonld` - JSON-LD context

**Success Criteria**:
- [ ] RDF validates with Turtle parser
- [ ] JSON-LD validates with JSON-LD Playground
- [ ] Sample URIs resolve correctly

---

### Day 27: CSV/Parquet Export

**Tasks**:
1. **Flatten Schema for CSV**
```python
# src/glam_extractor/exporters/csv_exporter.py

def export_to_csv(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to CSV with flattened structure."""
    rows = []

    for custodian in dataset:
        row = {
            'ghcid': custodian.ghcid,
            'ghcid_uuid': str(custodian.ghcid_uuid),
            'name': custodian.name,
            'institution_type': custodian.institution_type,
            'country': custodian.locations[0].country if custodian.locations else None,
            'city': custodian.locations[0].city if custodian.locations else None,
            'wikidata_id': get_identifier(custodian, 'Wikidata'),
            'unesco_whc_id': get_identifier(custodian, 'UNESCO_WHC'),
            'data_source': custodian.provenance.data_source,
            'data_tier': custodian.provenance.data_tier,
            'confidence_score': custodian.provenance.confidence_score
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(output_path, index=False, encoding='utf-8-sig')
```

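The `get_identifier` helper used by the CSV exporter is assumed rather than defined in this plan. A plausible sketch, using the same `identifier_scheme` field as the Day 24 integrity tests (the `identifier_value` field and the stand-in dataclasses are assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Identifier:
    identifier_scheme: str
    identifier_value: str

@dataclass
class Custodian:             # stand-in for HeritageCustodian
    identifiers: List[Identifier] = field(default_factory=list)

def get_identifier(custodian: Custodian, scheme: str) -> Optional[str]:
    """Return the first identifier value for `scheme`, or None so the
    CSV cell stays empty when a record lacks that identifier."""
    for ident in custodian.identifiers:
        if ident.identifier_scheme == scheme:
            return ident.identifier_value
    return None
```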
2. **Parquet Export (Columnar)**
```python
def export_to_parquet(dataset: List[HeritageCustodian], output_path: str):
    """Export dataset to Parquet for efficient querying."""
    df = pd.DataFrame([custodian.dict() for custodian in dataset])
    df.to_parquet(output_path, engine='pyarrow', compression='snappy')
```

3. **SQLite Export**
```python
def export_to_sqlite(dataset: List[HeritageCustodian], db_path: str):
    """Export dataset to SQLite database."""
    conn = sqlite3.connect(db_path)

    # Create tables
    conn.execute("""
        CREATE TABLE heritage_custodians (
            ghcid TEXT PRIMARY KEY,
            ghcid_uuid TEXT UNIQUE,
            name TEXT NOT NULL,
            institution_type TEXT,
            data_source TEXT,
            ...
        )
    """)

    # Insert records
    for custodian in dataset:
        conn.execute("INSERT INTO heritage_custodians VALUES (?, ?, ...)", ...)

    conn.commit()
```

**Deliverables**:
- `src/glam_extractor/exporters/csv_exporter.py` - CSV exporter
- `data/exports/glam_dataset.csv` - CSV export
- `data/exports/glam_dataset.parquet` - Parquet export
- `data/exports/glam_dataset.db` - SQLite database

**Success Criteria**:
- [ ] CSV opens correctly in Excel, Google Sheets
- [ ] Parquet loads in pandas, DuckDB
- [ ] SQLite database queryable with SQL

---

### Day 28: Documentation - User Guide

**Tasks**:
1. **Getting Started Guide**
````markdown
# docs/user-guide/getting-started.md

## Installation

    pip install glam-extractor

## Quick Start

```python
from glam_extractor import load_dataset

# Load the GLAM dataset
dataset = load_dataset("data/exports/glam_dataset.parquet")

# Filter UNESCO museums in France
museums = dataset.filter(
    institution_type="MUSEUM",
    data_source="UNESCO_WORLD_HERITAGE",
    country="FR"
)
```
````

2. **Example Queries**
   - SPARQL examples (find institutions by type, country)
   - Pandas examples (data analysis, statistics)
   - SQL examples (SQLite queries)

3. **API Reference**
   - Document all public classes and methods
   - Provide code examples for each function
   - Link to LinkML schema documentation

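For the SQL examples in task 2, a self-contained sketch against the Day 27 SQLite export might look like this. The table and column names follow the `export_to_sqlite` schema; the sample rows are purely illustrative:

```python
import sqlite3

# Build a tiny in-memory copy of the Day 27 schema for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE heritage_custodians (
        ghcid TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        institution_type TEXT,
        country TEXT,
        data_source TEXT
    )
""")
conn.executemany(
    "INSERT INTO heritage_custodians VALUES (?, ?, ?, ?, ?)",
    [
        ("GHCID-FR-0001", "Louvre Museum", "MUSEUM", "FR", "UNESCO_WORLD_HERITAGE"),
        ("GHCID-NL-0001", "Rijksmuseum", "MUSEUM", "NL", "ISIL_REGISTRY"),
        ("GHCID-FR-0002", "Archives Nationales", "ARCHIVE", "FR", "UNESCO_WORLD_HERITAGE"),
    ],
)

# Example query: UNESCO museums per country
rows = conn.execute("""
    SELECT country, COUNT(*) AS n
    FROM heritage_custodians
    WHERE institution_type = 'MUSEUM'
      AND data_source = 'UNESCO_WORLD_HERITAGE'
    GROUP BY country
    ORDER BY n DESC
""").fetchall()
```

The same query runs unchanged against the published `glam_dataset.db` by swapping `":memory:"` for the database path.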
**Deliverables**:
- `docs/user-guide/getting-started.md` - Quick start guide
- `docs/user-guide/example-queries.md` - Query examples
- `docs/api-reference.md` - API documentation

**Success Criteria**:
- [ ] Complete documentation for all public APIs
- [ ] 10+ example queries covering common use cases
- [ ] Step-by-step tutorials for data consumers

---

### Day 29: Documentation - Developer Guide

**Tasks**:
1. **Architecture Overview**
   - Diagram of extraction pipeline (API → Parser → Validator → Exporter)
   - Explanation of LinkML Map transformation
   - GHCID generation algorithm

2. **Contributing Guide**
   - How to add new institution type classifiers
   - How to extend LinkML schema with new fields
   - How to add new export formats

3. **Testing Guide**
   - Running unit tests, integration tests
   - Creating new test fixtures
   - Using property-based testing

**Deliverables**:
- `docs/developer-guide/architecture.md` - Architecture docs
- `docs/developer-guide/contributing.md` - Contribution guide
- `docs/developer-guide/testing.md` - Testing guide

**Success Criteria**:
- [ ] Complete architecture documentation with diagrams
- [ ] Clear instructions for extending the system
- [ ] Comprehensive testing guide

---

### Day 30: Release & Publication

**Tasks**:
1. **Dataset Release**
   - Tag repository with version number (e.g., v1.0.0-unesco)
   - Create GitHub Release with exports attached
   - Publish to Zenodo for DOI (persistent citation)

2. **Announcement**
   - Write blog post announcing UNESCO data release
   - Share on social media (Twitter, Mastodon, LinkedIn)
   - Notify stakeholders (Europeana, DPLA, heritage researchers)

3. **Data Portal Update**
   - Update w3id.org redirects for new institutions
   - Deploy SPARQL endpoint (if applicable)
   - Update REST API to include UNESCO data

**Deliverables**:
- GitHub Release with dataset exports
- Zenodo DOI for citation
- Blog post and announcement

**Success Criteria**:
- [ ] Dataset published with persistent DOI
- [ ] Documentation live and accessible
- [ ] Stakeholders notified of release

---

## Risk Mitigation

### Technical Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| UNESCO API changes format | Low | High | Cache all responses, version API client |
| LinkML Map lacks features | Medium | High | Implement custom extension early (Day 2-3) |
| GHCID collisions exceed capacity | Low | Medium | Q-number resolution strategy documented |
| Wikidata enrichment fails | Medium | Medium | Fallback to fuzzy name matching |

### Resource Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Timeline slips past 6 weeks | Medium | Medium | Prioritize core features, defer non-critical exports |
| Test coverage falls below 90% | Low | High | TDD approach enforced from Day 1 |
| Documentation incomplete | Medium | High | Reserve full week for docs (Phase 5) |

### Data Quality Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Institution type misclassification | Medium | Medium | Manual review queue for low-confidence cases |
| Missing Wikidata Q-numbers | Medium | Low | Accept base GHCID without Q-number, enrich later |
| Conflicts with existing data | Low | Medium | Tier-based priority, UNESCO wins |

---

## Success Metrics

### Quantitative Metrics

- **Coverage**: Extract 1,000+ UNESCO site institutions
- **Quality**: 90%+ confidence score average
- **Completeness**: 80%+ have Wikidata Q-numbers
- **Performance**: Process all sites in < 2 hours
- **Test Coverage**: 90%+ code coverage

### Qualitative Metrics

- **Usability**: Positive feedback from 3+ data consumers
- **Documentation**: Complete user guide and API docs
- **Maintainability**: Code passes linter, type checker
- **Reproducibility**: Dataset generation fully automated

---

## Appendix: Day-by-Day Checklist

### Phase 1 (Days 1-5)
- [ ] Day 1: UNESCO API documentation reviewed, 50 sites fetched
- [ ] Day 2: LinkML Map schema (part 1) - basic transformations
- [ ] Day 3: LinkML Map schema (part 2) - advanced patterns
- [ ] Day 4: Golden dataset created (20 test fixtures)
- [ ] Day 5: Institution type classifier designed

### Phase 2 (Days 6-13)
- [ ] Day 6: UNESCO API client implemented
- [ ] Day 7: LinkML instance generator implemented
- [ ] Day 8: GHCID generator extended for UNESCO
- [ ] Day 9-10: Batch processing pipeline
- [ ] Day 11-12: Integration testing
- [ ] Day 13: Code review and refactoring

### Phase 3 (Days 14-19)
- [ ] Day 14-15: Cross-referencing with existing data
- [ ] Day 16: Conflict resolution
- [ ] Day 17: Confidence scoring system
- [ ] Day 18: LinkML schema validation
- [ ] Day 19: Data quality report

### Phase 4 (Days 20-24)
- [ ] Day 20-21: Wikidata enrichment
- [ ] Day 22: Dataset merge
- [ ] Day 23: GHCID collision resolution
- [ ] Day 24: Final data validation

### Phase 5 (Days 25-30)
- [ ] Day 25-26: RDF/JSON-LD export
- [ ] Day 27: CSV/Parquet/SQLite export
- [ ] Day 28: User guide documentation
- [ ] Day 29: Developer guide documentation
- [ ] Day 30: Release and publication

---

**Document Status**: Complete
**Next Document**: `04-tdd-strategy.md` - Test-driven development plan
**Version**: 1.1

---

## Version History

### Version 1.1 (2025-11-10)
**Changes**: Updated for OpenDataSoft Explore API v2.0 migration

- **Day 1 API Reconnaissance** (lines 44-62): Updated API endpoint from legacy `whc.unesco.org/en/list/json` to OpenDataSoft `data.unesco.org/api/explore/v2.0`
- **Day 6 UNESCO API Client** (lines 265-305):
  - Updated `base_url` to OpenDataSoft API endpoint
  - Removed `api_key` parameter (public dataset, no authentication)
  - Added pagination parameters to `fetch_site_list()`: `limit`, `offset`
  - Updated method documentation to reflect OpenDataSoft response structure: `{"record": {"fields": {...}}}`
- **Day 7 LinkML Parser** (lines 309-352):
  - Updated `parse_unesco_site()` to extract from nested `api_response['record']['fields']`
  - Added documentation clarifying OpenDataSoft structure
  - Updated `extract_multilingual_names()` parameter name from `unesco_data` to `site_data`
- **Day 10 Batch Processing** (lines 406-450):
  - Updated `extract_all_unesco_sites()` with pagination loop for OpenDataSoft API
  - Updated `process_unesco_site()` to handle nested record structure
  - Changed field access from `site_data['id_number']` to `site_record['record']['fields']['unique_number']`
- **Day 11-12 Integration Tests** (lines 454-483):
  - Updated `test_full_unesco_extraction_pipeline()` to extract `site_data` from `response['record']['fields']`
  - Added explicit documentation of OpenDataSoft API structure

**Rationale**: Legacy UNESCO JSON API deprecated; OpenDataSoft provides standardized REST API with pagination, ODSQL filtering, and better data quality.

### Version 1.0 (2025-11-09)
**Initial Release**

- Comprehensive 30-day implementation plan for UNESCO data extraction
- Five phases: API Exploration, Extractor Implementation, Data Quality, Integration, Export
- TDD approach with golden dataset and integration tests
- GHCID generation strategy for UNESCO heritage sites
- Wikidata enrichment and cross-referencing plan