# YAML Enrichment Provenance Schema

Created: 2025-12-28 | Updated: 2025-12-28 | Status: Active Rule
Related: `WEB_CLAIM_PROVENANCE_SCHEMA.md` (for `web_claims` in JSON files)
## Purpose

This schema defines provenance requirements for custodian YAML files that contain enrichment data from various sources (Wikidata, Google Maps, YouTube, web scraping, etc.). Unlike web claims in person entity JSON files, YAML enrichment sections represent API-sourced data that requires different provenance tracking.
## Key Differences from WEB_CLAIM_PROVENANCE_SCHEMA
| Aspect | Web Claims (JSON) | YAML Enrichment |
|---|---|---|
| Data Source | Web scraping, news articles | APIs (Wikidata, Google Maps, YouTube) |
| Selector Types | CSS, XPath, TextQuoteSelector | Not applicable (API responses) |
| Archival | Wayback Machine memento URIs | API response caching, local archives |
| Content Hash | SHA-256 of extracted_text | SHA-256 of entire enrichment section |
| Verification | Re-fetch webpage and compare | Re-query API and compare |
## Enrichment Types in Custodian YAML Files

Based on an audit of 29,073 files:
| Enrichment Type | Files | Provenance Elements Needed |
|---|---|---|
| wikidata_enrichment | 17,900 | entity_id, api_endpoint, fetch_timestamp, content_hash |
| google_maps_enrichment | 3,564 | place_id, api_endpoint, fetch_timestamp, content_hash |
| youtube_enrichment | varies | channel_id, api_endpoint, fetch_timestamp, content_hash |
| web_enrichment | 1,708 | source_url, archive_path, fetch_timestamp, content_hash |
| zcbs_enrichment | 142 | source_url, fetch_timestamp, content_hash |
| linkup_timespan | varies | search_query, source_urls, archive_path |
## Provenance Schema Structure

### Per-Enrichment Section Provenance

Each enrichment section should have a `_provenance` sub-key containing:
```yaml
wikidata_enrichment:
  # ... existing enrichment data ...
  _provenance:
    # MANDATORY: Content integrity
    content_hash:
      algorithm: sha256
      value: "sha256-<base64_hash>"
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    # MANDATORY: Source identification (PROV-O alignment)
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "api_fetch"
        used: "wikidata_rest_api"
    # MANDATORY: Verification status
    verification:
      status: "verified"  # verified|stale|failed|pending
      last_verified: "2025-12-28T00:00:00Z"
      content_hash_at_verification: "sha256-<hash>"
    # RECOMMENDED: Standards compliance
    standards_compliance:
      - "W3C PROV-O"
      - "W3C SRI (content hashes)"
```
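A helper that assembles this `_provenance` block might look as follows. This is a sketch: the function name, parameters, and the `pending` initial status are illustrative assumptions, and the hash value is expected to come from the content-hash routine described later in this document.

```python
from datetime import datetime, timezone


def build_provenance(source_url: str, hash_value: str,
                     activity_name: str = "api_fetch",
                     used: str = "") -> dict:
    """Assemble a _provenance block for one enrichment section.

    hash_value should be a 'sha256-<base64>' string produced by the
    content-hash routine in this schema.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "content_hash": {
            "algorithm": "sha256",
            "value": hash_value,
            "scope": "enrichment_section",
            "computed_at": now,
        },
        "prov": {
            "wasDerivedFrom": source_url,
            "generatedAtTime": now,
            "wasGeneratedBy": {
                "@type": "prov:Activity",
                "name": activity_name,
                "used": used,
            },
        },
        "verification": {
            # Starts as pending; a later verification pass promotes it.
            "status": "pending",
            "last_verified": None,
        },
    }
```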
### Existing Provenance Elements to Preserve

The existing `provenance` section at root level should be PRESERVED and ENHANCED, not replaced:
```yaml
provenance:
  schema_version: "2.0.0"  # Increment version
  generated_at: "2025-12-28T00:00:00Z"
  sources:
    # ... existing source tracking ...
  data_tier_summary:
    # ... existing tier summary ...
  # NEW: Enrichment-level provenance summary
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    # ... other enrichment sections ...
  # NEW: Schema compliance
  provenance_schema_version: "2.0"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"
```
## Content Hash Generation

### For API-Sourced Enrichment Sections

Hash the entire enrichment section (excluding the `_provenance` sub-key) as canonical JSON:
```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def generate_enrichment_hash(enrichment_data: dict) -> dict:
    """
    Generate SHA-256 hash for enrichment section integrity.

    Excludes the '_provenance' key to avoid a circular dependency.
    """
    # Remove _provenance to avoid hashing the hash
    data_to_hash = {k: v for k, v in enrichment_data.items() if k != '_provenance'}
    # Canonical JSON (sorted keys, no extra whitespace)
    canonical = json.dumps(data_to_hash, sort_keys=True, separators=(',', ':'))
    # SHA-256 hash
    hash_bytes = hashlib.sha256(canonical.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "enrichment_section",
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }
```
### Hash Stability Considerations
- Floating point precision: Normalize coordinates to 6 decimal places before hashing
- Timestamp normalization: Use ISO 8601 with UTC timezone
- Unicode normalization: Apply NFD normalization before hashing
- Key ordering: Always sort keys alphabetically for canonical JSON
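One way to apply these normalization rules before hashing is a recursive pass over the section. This is a sketch; the function name and the recursion strategy are assumptions, not a mandated implementation (key sorting is already handled by `json.dumps(..., sort_keys=True)`).

```python
import unicodedata


def normalize_for_hashing(value):
    """Recursively normalize values so the canonical JSON is stable.

    - floats are rounded to 6 decimal places (coordinate precision)
    - strings get Unicode NFD normalization
    - dicts and lists are normalized element-wise
    """
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, str):
        return unicodedata.normalize("NFD", value)
    if isinstance(value, dict):
        return {k: normalize_for_hashing(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_for_hashing(v) for v in value]
    return value
```

Run this on the enrichment section before serializing to canonical JSON, so re-fetched API responses with equivalent but differently encoded values hash identically.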
## Enrichment-Specific Provenance Requirements

### 1. Wikidata Enrichment
```yaml
wikidata_enrichment:
  entity_id: Q2710899
  labels:
    en: "National War and Resistance Museum"
    nl: "Nationaal Onderduikmuseum"
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "wikidata_api_fetch"
        used: "https://www.wikidata.org/w/rest.php/wikibase/v1"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 2. Google Maps Enrichment

```yaml
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  name: "Nationaal Onderduikmuseum"
  coordinates:
    latitude: 51.927699
    longitude: 6.5815864
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 3. YouTube Enrichment

```yaml
youtube_enrichment:
  channel:
    channel_id: UC9jeOgD_4thSAgfPPDAGJDQ
    title: "Nationaal Onderduikmuseum"
    subscriber_count: 40
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-ghi789..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.googleapis.com/youtube/v3/channels?id=UC9jeOgD_4thSAgfPPDAGJDQ"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 4. Web Enrichment (Archived Sites)

For web archives, reference the local archive path:

```yaml
web_enrichment:
  web_archives:
    - url: https://nationaalonderduikmuseum.nl
      directory: web/0234/nationaalonderduikmuseum.nl
      pages_archived: 200
      warc_file: archive.warc.gz
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-jkl012..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://nationaalonderduikmuseum.nl"
      generatedAtTime: "2025-11-29T15:40:23Z"
    archive:
      local_path: "web/0234/nationaalonderduikmuseum.nl/archive.warc.gz"
      format: "ISO 28500 WARC"
      size_bytes: 4910963
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
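For web enrichment, the section hash alone does not prove the archive is still intact on disk. A re-verification pass could also check the WARC file against the recorded `size_bytes`. The sketch below is illustrative; the optional expected-digest parameter is an assumption, since the schema above records only a size.

```python
import hashlib
from pathlib import Path


def verify_local_archive(local_path, expected_size, expected_sha256=None):
    """Check that a locally archived WARC still matches its provenance.

    The size comparison is cheap and runs first; the SHA-256 digest is
    only computed when an expected digest has been recorded.
    """
    path = Path(local_path)
    if not path.is_file() or path.stat().st_size != expected_size:
        return False
    if expected_sha256 is not None:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return digest == expected_sha256
    return True
```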
## Implementation Priority

### Phase 1: Content Hashes (Critical)

Add content_hash to all enrichment sections. This is deterministic and requires no external API calls.

Target: All 29,073 files

Elements Added:
- `_provenance.content_hash` for each enrichment section
- `provenance.enrichment_provenance` summary at root level
### Phase 2: PROV-O Alignment

Add prov.wasDerivedFrom and related PROV-O elements.

Target: Files with existing api_metadata or fetch_timestamp

Elements Added:
- `_provenance.prov.wasDerivedFrom`
- `_provenance.prov.generatedAtTime`
- `_provenance.prov.wasGeneratedBy`
### Phase 3: Verification Status

Add verification tracking for future re-verification workflows.

Target: All enriched files

Elements Added:
- `_provenance.verification.status`
- `_provenance.verification.last_verified`
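A local re-verification pass (Phase 3 without any API call) can be sketched by recomputing the section hash with the same canonicalization described above and comparing it to the recorded value. The function name is illustrative; the canonical-JSON convention is taken from the hash-generation section.

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def reverify_section(section: dict) -> str:
    """Recompute the section hash and update the verification status.

    Returns 'verified' if the stored content hash still matches, or
    'stale' if the enrichment data changed since it was computed.
    """
    prov = section.get("_provenance", {})
    recorded = prov.get("content_hash", {}).get("value")
    # Same canonicalization as generate_enrichment_hash
    data = {k: v for k, v in section.items() if k != "_provenance"}
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    current = "sha256-" + base64.b64encode(digest).decode("ascii")
    status = "verified" if current == recorded else "stale"
    prov.setdefault("verification", {})
    prov["verification"]["status"] = status
    prov["verification"]["last_verified"] = datetime.now(timezone.utc).isoformat()
    return status
```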
## Processing Rules

### 1. Preserve Existing Data (DATA_PRESERVATION_RULES)

Never delete existing enrichment content. Only ADD provenance metadata.

```yaml
# CORRECT: Add _provenance, preserve everything else
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk  # PRESERVED
  name: "Nationaal Onderduikmuseum"      # PRESERVED
  rating: 4.5                            # PRESERVED
  reviews: [...]                         # PRESERVED
  _provenance:                           # ADDED
    content_hash: ...

# WRONG: Do not restructure or delete existing data
```
### 2. Use ruamel.yaml for YAML Processing

Preserve comments, formatting, and key ordering:

```python
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
yaml.indent(mapping=2, sequence=4, offset=2)

with open(filepath, 'r') as f:
    data = yaml.load(f)

# ... add provenance ...

with open(filepath, 'w') as f:
    yaml.dump(data, f)
```
### 3. Idempotent Processing

Skip files that already have provenance:

```python
def needs_provenance(data: dict) -> bool:
    """Check if the file still needs provenance enhancement."""
    # Check each known enrichment section for a _provenance sub-key
    for key in ENRICHMENT_SECTIONS:
        if key in data and '_provenance' not in data[key]:
            return True
    return False
```
### 4. Batch Processing with Progress

Process in batches with progress reporting:

```python
# Process by enrichment type priority
PROCESSING_ORDER = [
    'web_enrichment',          # 1,708 files - highest provenance need
    'wikidata_enrichment',     # 17,900 files - API data
    'google_maps_enrichment',  # 3,564 files
    'youtube_enrichment',      # varies
    'zcbs_enrichment',         # 142 files
]
```
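A minimal batching driver around this priority order might look like the following sketch. The `process_file` callback and the batch size are assumptions; the per-file handler would apply the hashing and provenance steps described above.

```python
def batches(items, size):
    """Yield successive fixed-size batches from a list of file paths."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run(files, process_file, size=500):
    """Process files in batches, reporting progress after each batch."""
    done = 0
    for batch in batches(files, size):
        for path in batch:
            process_file(path)  # hypothetical per-file provenance handler
        done += len(batch)
        print(f"{done}/{len(files)} files processed")
```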
## Validation

### Validate Provenance Completeness

```python
def validate_provenance(data: dict) -> list[str]:
    """Validate provenance completeness for every enrichment section."""
    errors = []
    for section_name in ENRICHMENT_SECTIONS:
        if section_name not in data:
            continue
        section = data[section_name]
        prov = section.get('_provenance', {})
        # Check mandatory elements
        if 'content_hash' not in prov:
            errors.append(f"{section_name}: missing content_hash")
        if 'verification' not in prov:
            errors.append(f"{section_name}: missing verification")
        if 'wasDerivedFrom' not in prov.get('prov', {}):
            errors.append(f"{section_name}: missing prov.wasDerivedFrom")
    return errors
```
## Related Documentation

- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Provenance for web claims in JSON files
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - All data must be real
- `AGENTS.md` - Project rules and conventions
## Example: Complete Enriched File with Provenance

```yaml
original_entry:
  organisatie: Nationaal Onderduikmuseum
  # ... other original data ...

provenance:
  schema_version: "2.0.0"
  generated_at: "2025-12-28T00:00:00Z"
  provenance_schema_version: "2.0"
  sources:
    # ... existing source tracking ...
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-abc123..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-def456..."
      verified_at: "2025-12-28T00:00:00Z"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"

wikidata_enrichment:
  entity_id: Q2710899
  labels:
    nl: "Nationaal Onderduikmuseum"
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-11-27T15:17:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"

google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-11-28T09:50:57Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```