glam/.opencode/YAML_PROVENANCE_SCHEMA.md
2025-12-30 03:43:31 +01:00


# YAML Enrichment Provenance Schema
**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md (for web_claims in JSON files)
## Purpose
This schema defines provenance requirements for **custodian YAML files** that contain enrichment data from various sources (Wikidata, Google Maps, YouTube, web scraping, etc.). Unlike web claims in person entity JSON files, YAML enrichment sections represent API-sourced data that requires different provenance tracking.
## Key Differences from WEB_CLAIM_PROVENANCE_SCHEMA
| Aspect | Web Claims (JSON) | YAML Enrichment |
|--------|------------------|-----------------|
| **Data Source** | Web scraping, news articles | APIs (Wikidata, Google Maps, YouTube) |
| **Selector Types** | CSS, XPath, TextQuoteSelector | Not applicable (API responses) |
| **Archival** | Wayback Machine memento URIs | API response caching, local archives |
| **Content Hash** | SHA-256 of extracted_text | SHA-256 of entire enrichment section |
| **Verification** | Re-fetch webpage and compare | Re-query API and compare |
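The re-query-and-compare model in the last row can be sketched as follows. This is a minimal illustration, not project code: `sri_hash` mirrors the SRI-style `content_hash` format used throughout this schema, and `verify_enrichment` assumes the caller has already re-fetched the enrichment data from the API.

```python
import base64
import hashlib
import json

def sri_hash(data: dict) -> str:
    """SRI-style sha256 over canonical JSON (sorted keys, no whitespace)."""
    canonical = json.dumps(data, sort_keys=True, separators=(',', ':'))
    digest = hashlib.sha256(canonical.encode('utf-8')).digest()
    return "sha256-" + base64.b64encode(digest).decode('ascii')

def verify_enrichment(stored_hash: str, fresh_data: dict) -> str:
    """Compare a stored content_hash against freshly re-queried API data."""
    return "verified" if sri_hash(fresh_data) == stored_hash else "stale"
```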
## Enrichment Types in Custodian YAML Files
Based on an audit of 29,073 files:
| Enrichment Type | Files | Provenance Elements Needed |
|-----------------|-------|---------------------------|
| `wikidata_enrichment` | 17,900 | entity_id, api_endpoint, fetch_timestamp, content_hash |
| `google_maps_enrichment` | 3,564 | place_id, api_endpoint, fetch_timestamp, content_hash |
| `youtube_enrichment` | varies | channel_id, api_endpoint, fetch_timestamp, content_hash |
| `web_enrichment` | 1,708 | source_url, archive_path, fetch_timestamp, content_hash |
| `zcbs_enrichment` | 142 | source_url, fetch_timestamp, content_hash |
| `linkup_timespan` | varies | search_query, source_urls, archive_path |
## Provenance Schema Structure
### Per-Enrichment Section Provenance
Each enrichment section should have a `_provenance` sub-key containing:
```yaml
wikidata_enrichment:
  # ... existing enrichment data ...
  _provenance:
    # MANDATORY: Content integrity
    content_hash:
      algorithm: sha256
      value: "sha256-<base64_hash>"
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    # MANDATORY: Source identification (PROV-O alignment)
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "api_fetch"
        used: "wikidata_rest_api"
    # MANDATORY: Verification status
    verification:
      status: "verified"  # verified|stale|failed|pending
      last_verified: "2025-12-28T00:00:00Z"
      content_hash_at_verification: "sha256-<hash>"
    # RECOMMENDED: Standards compliance
    standards_compliance:
      - "W3C PROV-O"
      - "W3C SRI (content hashes)"
```
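A helper that assembles this `_provenance` structure programmatically might look like the sketch below. Field names follow the schema above; `source_uri`, `activity_name`, and `api_id` are caller-supplied, and new sections start as `pending` until a verification pass runs.

```python
import base64
import hashlib
import json
from datetime import datetime, timezone

def build_provenance(section: dict, source_uri: str,
                     activity_name: str, api_id: str) -> dict:
    """Assemble a _provenance block for one enrichment section."""
    # Exclude any existing _provenance so the hash covers only the data
    data = {k: v for k, v in section.items() if k != '_provenance'}
    canonical = json.dumps(data, sort_keys=True, separators=(',', ':'))
    digest = hashlib.sha256(canonical.encode('utf-8')).digest()
    now = datetime.now(timezone.utc).isoformat()
    return {
        "content_hash": {
            "algorithm": "sha256",
            "value": "sha256-" + base64.b64encode(digest).decode('ascii'),
            "scope": "enrichment_section",
            "computed_at": now,
        },
        "prov": {
            "wasDerivedFrom": source_uri,
            "generatedAtTime": now,
            "wasGeneratedBy": {"@type": "prov:Activity",
                               "name": activity_name, "used": api_id},
        },
        "verification": {"status": "pending", "last_verified": now},
    }
```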
### Existing Provenance Elements to Preserve
The existing `provenance` section at root level should be PRESERVED and ENHANCED, not replaced:
```yaml
provenance:
  schema_version: "2.0.0"  # Increment version
  generated_at: "2025-12-28T00:00:00Z"
  sources:
    # ... existing source tracking ...
  data_tier_summary:
    # ... existing tier summary ...
  # NEW: Enrichment-level provenance summary
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    # ... other enrichment sections ...
  # NEW: Schema compliance
  provenance_schema_version: "2.0"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"
```
## Content Hash Generation
### For API-Sourced Enrichment Sections
Hash the **entire enrichment section** (excluding `_provenance` sub-key) as canonical JSON:
```python
import base64
import hashlib
import json
from datetime import datetime, timezone

def generate_enrichment_hash(enrichment_data: dict) -> dict:
    """
    Generate SHA-256 hash for enrichment section integrity.
    Excludes '_provenance' key to avoid circular dependency.
    """
    # Remove _provenance to avoid hashing the hash
    data_to_hash = {k: v for k, v in enrichment_data.items() if k != '_provenance'}
    # Canonical JSON (sorted keys, no extra whitespace)
    canonical = json.dumps(data_to_hash, sort_keys=True, separators=(',', ':'))
    # SHA-256 hash
    hash_bytes = hashlib.sha256(canonical.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')
    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "enrichment_section",
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }
```
### Hash Stability Considerations
- **Floating point precision**: Normalize coordinates to 6 decimal places before hashing
- **Timestamp normalization**: Use ISO 8601 with UTC timezone
- **Unicode normalization**: Apply NFD normalization before hashing
- **Key ordering**: Always sort keys alphabetically for canonical JSON
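The rules above can be applied in a single normalization pass before hashing. The recursive helper below is a sketch: it rounds every float to 6 decimal places, which assumes coordinates are the only float-valued fields; key ordering is handled separately by `sort_keys` at `json.dumps` time.

```python
import unicodedata

def normalize_for_hash(value):
    """Apply hash-stability rules recursively before canonical JSON:
    round floats to 6 decimal places, NFD-normalize strings."""
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, str):
        return unicodedata.normalize('NFD', value)
    if isinstance(value, dict):
        return {k: normalize_for_hash(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_for_hash(v) for v in value]
    return value
```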
## Enrichment-Specific Provenance Requirements
### 1. Wikidata Enrichment
```yaml
wikidata_enrichment:
  entity_id: Q2710899
  labels:
    en: "National War and Resistance Museum"
    nl: "Nationaal Onderduikmuseum"
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "wikidata_api_fetch"
        used: "https://www.wikidata.org/w/rest.php/wikibase/v1"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 2. Google Maps Enrichment
```yaml
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  name: "Nationaal Onderduikmuseum"
  coordinates:
    latitude: 51.927699
    longitude: 6.5815864
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 3. YouTube Enrichment
```yaml
youtube_enrichment:
  channel:
    channel_id: UC9jeOgD_4thSAgfPPDAGJDQ
    title: "Nationaal Onderduikmuseum"
    subscriber_count: 40
  # ... other data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-ghi789..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.googleapis.com/youtube/v3/channels?id=UC9jeOgD_4thSAgfPPDAGJDQ"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
### 4. Web Enrichment (Archived Sites)
For web archives, reference the local archive path:
```yaml
web_enrichment:
  web_archives:
    - url: https://nationaalonderduikmuseum.nl
      directory: web/0234/nationaalonderduikmuseum.nl
      pages_archived: 200
      warc_file: archive.warc.gz
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-jkl012..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://nationaalonderduikmuseum.nl"
      generatedAtTime: "2025-11-29T15:40:23Z"
    archive:
      local_path: "web/0234/nationaalonderduikmuseum.nl/archive.warc.gz"
      format: "ISO 28500 WARC"
      size_bytes: 4910963
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```
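Local archives can be integrity-checked without re-crawling the site. A minimal sketch against the `archive` fields above, checking only `size_bytes` (a file-level content hash could be compared as well once one is stored in `_provenance.archive`):

```python
from pathlib import Path

def check_archive(local_path: str, expected_size: int) -> bool:
    """Verify an archived WARC exists and matches the recorded size_bytes."""
    p = Path(local_path)
    return p.is_file() and p.stat().st_size == expected_size
```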
## Implementation Priority
### Phase 1: Content Hashes (Critical)
Add `content_hash` to all enrichment sections. This is deterministic and requires no external API calls.
**Target**: All 29,073 files
**Elements Added**:
- `_provenance.content_hash` for each enrichment section
- `provenance.enrichment_provenance` summary at root level
### Phase 2: PROV-O Alignment
Add `prov.wasDerivedFrom` and related PROV-O elements.
**Target**: Files with existing `api_metadata` or `fetch_timestamp`
**Elements Added**:
- `_provenance.prov.wasDerivedFrom`
- `_provenance.prov.generatedAtTime`
- `_provenance.prov.wasGeneratedBy`
### Phase 3: Verification Status
Add verification tracking for future re-verification workflows.
**Target**: All enriched files
**Elements Added**:
- `_provenance.verification.status`
- `_provenance.verification.last_verified`
## Processing Rules
### 1. Preserve Existing Data (DATA_PRESERVATION_RULES)
Never delete existing enrichment content. Only ADD provenance metadata.
```yaml
# CORRECT: Add _provenance, preserve everything else
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk  # PRESERVED
  name: "Nationaal Onderduikmuseum"      # PRESERVED
  rating: 4.5                            # PRESERVED
  reviews: [...]                         # PRESERVED
  _provenance:                           # ADDED
    content_hash: ...

# WRONG: Do not restructure or delete existing data
```
### 2. Use ruamel.yaml for YAML Processing
Preserve comments, formatting, and key ordering:
```python
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
yaml.indent(mapping=2, sequence=4, offset=2)

with open(filepath, 'r') as f:
    data = yaml.load(f)

# ... add provenance ...

with open(filepath, 'w') as f:
    yaml.dump(data, f)
```
### 3. Idempotent Processing
Skip files that already have provenance:
```python
def needs_provenance(data: dict) -> bool:
    """Check if file needs provenance enhancement."""
    # Check for per-section provenance
    for key in ENRICHMENT_SECTIONS:
        if key in data and '_provenance' not in data[key]:
            return True
    return False
```
### 4. Batch Processing with Progress
Process in batches with progress reporting:
```python
# Process by enrichment type priority
PROCESSING_ORDER = [
    'web_enrichment',          # 1,708 files - highest provenance need
    'wikidata_enrichment',     # 17,900 files - API data
    'google_maps_enrichment',  # 3,564 files
    'youtube_enrichment',      # varies
    'zcbs_enrichment',         # 142 files
]
```
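A driver for this ordering might look like the generic batch loop below. This is a sketch, not project code: `process_one` stands in for whatever per-file enrichment function is used, and the batch size of 500 is an arbitrary illustration.

```python
def process_in_batches(paths, process_one, batch_size=500):
    """Process files in fixed-size batches with simple progress reporting."""
    done, failed = 0, 0
    for i in range(0, len(paths), batch_size):
        for path in paths[i:i + batch_size]:
            try:
                process_one(path)
                done += 1
            except Exception as exc:
                # Never let one bad file abort the batch run
                failed += 1
                print(f"FAILED {path}: {exc}")
        print(f"progress: {done + failed}/{len(paths)} ({failed} failures)")
    return done, failed
```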
## Validation
### Validate Provenance Completeness
```python
def validate_provenance(data: dict) -> list[str]:
    """Validate provenance completeness."""
    errors = []
    for section_name in ENRICHMENT_SECTIONS:
        if section_name not in data:
            continue
        section = data[section_name]
        prov = section.get('_provenance', {})
        # Check mandatory elements
        if 'content_hash' not in prov:
            errors.append(f"{section_name}: missing content_hash")
        if 'verification' not in prov:
            errors.append(f"{section_name}: missing verification")
        if 'prov' not in prov or 'wasDerivedFrom' not in prov.get('prov', {}):
            errors.append(f"{section_name}: missing prov.wasDerivedFrom")
    return errors
```
## Related Documentation
- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Provenance for web claims in JSON files
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - All data must be real
- `AGENTS.md` - Project rules and conventions
## Example: Complete Enriched File with Provenance
```yaml
original_entry:
  organisatie: Nationaal Onderduikmuseum
  # ... other original data ...

provenance:
  schema_version: "2.0.0"
  generated_at: "2025-12-28T00:00:00Z"
  provenance_schema_version: "2.0"
  sources:
    # ... existing source tracking ...
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-abc123..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-def456..."
      verified_at: "2025-12-28T00:00:00Z"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"

wikidata_enrichment:
  entity_id: Q2710899
  labels:
    nl: "Nationaal Onderduikmuseum"
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-11-27T15:17:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"

google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-11-28T09:50:57Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```