# YAML Enrichment Provenance Schema

**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md (for web_claims in JSON files)

## Purpose

This schema defines provenance requirements for **custodian YAML files** that contain enrichment data from various sources (Wikidata, Google Maps, YouTube, web scraping, etc.). Unlike web claims in person entity JSON files, YAML enrichment sections represent API-sourced data that requires different provenance tracking.

## Key Differences from WEB_CLAIM_PROVENANCE_SCHEMA

| Aspect | Web Claims (JSON) | YAML Enrichment |
|--------|------------------|-----------------|
| **Data Source** | Web scraping, news articles | APIs (Wikidata, Google Maps, YouTube) |
| **Selector Types** | CSS, XPath, TextQuoteSelector | Not applicable (API responses) |
| **Archival** | Wayback Machine memento URIs | API response caching, local archives |
| **Content Hash** | SHA-256 of extracted_text | SHA-256 of entire enrichment section |
| **Verification** | Re-fetch webpage and compare | Re-query API and compare |

## Enrichment Types in Custodian YAML Files

Based on an audit of 29,073 files:

| Enrichment Type | Files | Provenance Elements Needed |
|-----------------|-------|----------------------------|
| `wikidata_enrichment` | 17,900 | entity_id, api_endpoint, fetch_timestamp, content_hash |
| `google_maps_enrichment` | 3,564 | place_id, api_endpoint, fetch_timestamp, content_hash |
| `youtube_enrichment` | varies | channel_id, api_endpoint, fetch_timestamp, content_hash |
| `web_enrichment` | 1,708 | source_url, archive_path, fetch_timestamp, content_hash |
| `zcbs_enrichment` | 142 | source_url, fetch_timestamp, content_hash |
| `linkup_timespan` | varies | search_query, source_urls, archive_path |

## Provenance Schema Structure

### Per-Enrichment Section Provenance

Each enrichment section should have a `_provenance` sub-key containing:

```yaml
wikidata_enrichment:
  # ... existing enrichment data ...
  _provenance:
    # MANDATORY: Content integrity
    content_hash:
      algorithm: sha256
      value: "sha256-"
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"

    # MANDATORY: Source identification (PROV-O alignment)
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "api_fetch"
        used: "wikidata_rest_api"

    # MANDATORY: Verification status
    verification:
      status: "verified"  # verified|stale|failed|pending
      last_verified: "2025-12-28T00:00:00Z"
      content_hash_at_verification: "sha256-"

    # RECOMMENDED: Standards compliance
    standards_compliance:
      - "W3C PROV-O"
      - "W3C SRI (content hashes)"
```

### Existing Provenance Elements to Preserve

The existing `provenance` section at root level should be PRESERVED and ENHANCED, not replaced:

```yaml
provenance:
  schema_version: "2.0.0"  # Increment version
  generated_at: "2025-12-28T00:00:00Z"
  sources:
    # ... existing source tracking ...
  data_tier_summary:
    # ... existing tier summary ...

  # NEW: Enrichment-level provenance summary
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    # ... other enrichment sections ...

  # NEW: Schema compliance
  provenance_schema_version: "2.0"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"
```

## Content Hash Generation

### For API-Sourced Enrichment Sections

Hash the **entire enrichment section** (excluding the `_provenance` sub-key) as canonical JSON:

```python
import hashlib
import base64
import json
from datetime import datetime, timezone


def generate_enrichment_hash(enrichment_data: dict) -> dict:
    """
    Generate SHA-256 hash for enrichment section integrity.

    Excludes '_provenance' key to avoid circular dependency.
""" # Remove _provenance to avoid hashing the hash data_to_hash = {k: v for k, v in enrichment_data.items() if k != '_provenance'} # Canonical JSON (sorted keys, no extra whitespace) canonical = json.dumps(data_to_hash, sort_keys=True, separators=(',', ':')) # SHA-256 hash hash_bytes = hashlib.sha256(canonical.encode('utf-8')).digest() hash_b64 = base64.b64encode(hash_bytes).decode('ascii') return { "algorithm": "sha256", "value": f"sha256-{hash_b64}", "scope": "enrichment_section", "computed_at": datetime.now(timezone.utc).isoformat() } ``` ### Hash Stability Considerations - **Floating point precision**: Normalize coordinates to 6 decimal places before hashing - **Timestamp normalization**: Use ISO 8601 with UTC timezone - **Unicode normalization**: Apply NFD normalization before hashing - **Key ordering**: Always sort keys alphabetically for canonical JSON ## Enrichment-Specific Provenance Requirements ### 1. Wikidata Enrichment ```yaml wikidata_enrichment: entity_id: Q2710899 labels: en: "National War and Resistance Museum" nl: "Nationaal Onderduikmuseum" # ... other data ... _provenance: content_hash: algorithm: sha256 value: "sha256-abc123..." scope: enrichment_section prov: wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899" generatedAtTime: "2025-12-28T00:00:00Z" wasGeneratedBy: "@type": "prov:Activity" name: "wikidata_api_fetch" used: "https://www.wikidata.org/w/rest.php/wikibase/v1" verification: status: "verified" last_verified: "2025-12-28T00:00:00Z" ``` ### 2. Google Maps Enrichment ```yaml google_maps_enrichment: place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk name: "Nationaal Onderduikmuseum" coordinates: latitude: 51.927699 longitude: 6.5815864 # ... other data ... _provenance: content_hash: algorithm: sha256 value: "sha256-def456..." 
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 3. YouTube Enrichment

```yaml
youtube_enrichment:
  channel:
    channel_id: UC9jeOgD_4thSAgfPPDAGJDQ
    title: "Nationaal Onderduikmuseum"
    subscriber_count: 40
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-ghi789..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.googleapis.com/youtube/v3/channels?id=UC9jeOgD_4thSAgfPPDAGJDQ"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 4. Web Enrichment (Archived Sites)

For web archives, reference the local archive path:

```yaml
web_enrichment:
  web_archives:
    - url: https://nationaalonderduikmuseum.nl
      directory: web/0234/nationaalonderduikmuseum.nl
      pages_archived: 200
      warc_file: archive.warc.gz

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-jkl012..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://nationaalonderduikmuseum.nl"
      generatedAtTime: "2025-11-29T15:40:23Z"
    archive:
      local_path: "web/0234/nationaalonderduikmuseum.nl/archive.warc.gz"
      format: "ISO 28500 WARC"
      size_bytes: 4910963
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

## Implementation Priority

### Phase 1: Content Hashes (Critical)

Add `content_hash` to all enrichment sections. This is deterministic and requires no external API calls.

**Target**: All 29,073 files

**Elements Added**:
- `_provenance.content_hash` for each enrichment section
- `provenance.enrichment_provenance` summary at root level

### Phase 2: PROV-O Alignment

Add `prov.wasDerivedFrom` and related PROV-O elements.
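The Phase 2 mapping from existing fetch metadata onto PROV-O fields can be sketched as below. This is a minimal sketch: the `fetch_timestamp` and `api_metadata.endpoint` field names and the `entity_url` parameter are illustrative assumptions about what the existing sections carry, not a fixed interface.

```python
def build_prov_block(section: dict, entity_url: str) -> dict:
    """Derive a PROV-O `prov` block from metadata already present in a section.

    Assumes the section may carry `fetch_timestamp` (ISO 8601 string) and
    `api_metadata.endpoint`; both names are hypothetical placeholders.
    """
    prov = {"wasDerivedFrom": entity_url}
    if "fetch_timestamp" in section:
        # Reuse the original fetch time rather than inventing a new one
        prov["generatedAtTime"] = section["fetch_timestamp"]
    endpoint = section.get("api_metadata", {}).get("endpoint")
    if endpoint:
        prov["wasGeneratedBy"] = {
            "@type": "prov:Activity",
            "name": "api_fetch",
            "used": endpoint,
        }
    return prov
```

Sections lacking the optional metadata simply get the smaller `prov` block, which keeps the pass safe to run on partially enriched files.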
**Target**: Files with existing `api_metadata` or `fetch_timestamp`

**Elements Added**:
- `_provenance.prov.wasDerivedFrom`
- `_provenance.prov.generatedAtTime`
- `_provenance.prov.wasGeneratedBy`

### Phase 3: Verification Status

Add verification tracking for future re-verification workflows.

**Target**: All enriched files

**Elements Added**:
- `_provenance.verification.status`
- `_provenance.verification.last_verified`

## Processing Rules

### 1. Preserve Existing Data (DATA_PRESERVATION_RULES)

Never delete existing enrichment content. Only ADD provenance metadata.

```yaml
# CORRECT: Add _provenance, preserve everything else
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk  # PRESERVED
  name: "Nationaal Onderduikmuseum"      # PRESERVED
  rating: 4.5                            # PRESERVED
  reviews: [...]                         # PRESERVED
  _provenance:                           # ADDED
    content_hash: ...

# WRONG: Do not restructure or delete existing data
```

### 2. Use ruamel.yaml for YAML Processing

Preserve comments, formatting, and key ordering:

```python
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
yaml.indent(mapping=2, sequence=4, offset=2)

with open(filepath, 'r') as f:
    data = yaml.load(f)

# ... add provenance ...

with open(filepath, 'w') as f:
    yaml.dump(data, f)
```

### 3. Idempotent Processing

Skip files that already have provenance:

```python
def needs_provenance(data: dict) -> bool:
    """Check if file needs provenance enhancement."""
    # Check for per-section provenance
    for key in ENRICHMENT_SECTIONS:
        if key in data and '_provenance' not in data[key]:
            return True
    return False
```

### 4. Batch Processing with Progress

Process in batches with progress reporting:

```python
# Process by enrichment type priority
PROCESSING_ORDER = [
    'web_enrichment',          # 1,708 files - highest provenance need
    'wikidata_enrichment',     # 17,900 files - API data
    'google_maps_enrichment',  # 3,564 files
    'youtube_enrichment',      # varies
    'zcbs_enrichment',         # 142 files
]
```

## Validation

### Validate Provenance Completeness

```python
def validate_provenance(data: dict) -> list[str]:
    """Validate provenance completeness."""
    errors = []
    for section_name in ENRICHMENT_SECTIONS:
        if section_name not in data:
            continue
        section = data[section_name]
        prov = section.get('_provenance', {})

        # Check mandatory elements
        if 'content_hash' not in prov:
            errors.append(f"{section_name}: missing content_hash")
        if 'verification' not in prov:
            errors.append(f"{section_name}: missing verification")
        if 'prov' not in prov or 'wasDerivedFrom' not in prov.get('prov', {}):
            errors.append(f"{section_name}: missing prov.wasDerivedFrom")
    return errors
```

## Related Documentation

- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Provenance for web claims in JSON files
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - All data must be real
- `AGENTS.md` - Project rules and conventions

## Example: Complete Enriched File with Provenance

```yaml
original_entry:
  organisatie: Nationaal Onderduikmuseum
  # ... other original data ...

provenance:
  schema_version: "2.0.0"
  generated_at: "2025-12-28T00:00:00Z"
  provenance_schema_version: "2.0"
  sources:
    # ... existing source tracking ...
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-abc123..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-def456..."
      verified_at: "2025-12-28T00:00:00Z"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"

wikidata_enrichment:
  entity_id: Q2710899
  labels:
    nl: "Nationaal Onderduikmuseum"
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-11-27T15:17:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"

google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  # ... enrichment data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-11-28T09:50:57Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
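The Phase 3 re-verification step (recompute the canonical-JSON SHA-256 and compare it with the stored `_provenance.content_hash.value`) could look roughly like this. A minimal sketch: the hashing is inlined so the snippet stands alone, and the returned strings reuse the `verified|stale|pending` vocabulary from the verification status field above.

```python
import base64
import hashlib
import json


def verify_enrichment_hash(enrichment_data: dict) -> str:
    """Return 'verified' if the stored content hash still matches, else 'stale'.

    Returns 'pending' when no hash has been recorded yet. Recomputes the
    canonical-JSON SHA-256 over the section minus its `_provenance` sub-key.
    """
    stored = (enrichment_data.get("_provenance", {})
              .get("content_hash", {})
              .get("value"))
    if stored is None:
        return "pending"  # no hash recorded yet

    # Same canonicalization as generate_enrichment_hash: drop _provenance,
    # sort keys, no extra whitespace
    data_to_hash = {k: v for k, v in enrichment_data.items()
                    if k != "_provenance"}
    canonical = json.dumps(data_to_hash, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    current = f"sha256-{base64.b64encode(digest).decode('ascii')}"
    return "verified" if current == stored else "stale"
```

A batch job could write the result into `_provenance.verification.status` and refresh `last_verified`, flagging `stale` sections for re-fetch rather than editing them in place, per DATA_PRESERVATION_RULES.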