# YAML Enrichment Provenance Schema

**Created**: 2025-12-28
**Updated**: 2025-12-28
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md (for web_claims in JSON files)

## Purpose

This schema defines provenance requirements for **custodian YAML files** that contain enrichment data from various sources (Wikidata, Google Maps, YouTube, web scraping, etc.). Unlike web claims in person entity JSON files, YAML enrichment sections hold mostly API-sourced data, which requires different provenance tracking.

## Key Differences from WEB_CLAIM_PROVENANCE_SCHEMA

| Aspect | Web Claims (JSON) | YAML Enrichment |
|--------|------------------|-----------------|
| **Data Source** | Web scraping, news articles | APIs (Wikidata, Google Maps, YouTube) |
| **Selector Types** | CSS, XPath, TextQuoteSelector | Not applicable (API responses) |
| **Archival** | Wayback Machine memento URIs | API response caching, local archives |
| **Content Hash** | SHA-256 of extracted_text | SHA-256 of entire enrichment section |
| **Verification** | Re-fetch webpage and compare | Re-query API and compare |

## Enrichment Types in Custodian YAML Files

Based on an audit of 29,073 files:

| Enrichment Type | Files | Provenance Elements Needed |
|-----------------|-------|---------------------------|
| `wikidata_enrichment` | 17,900 | entity_id, api_endpoint, fetch_timestamp, content_hash |
| `google_maps_enrichment` | 3,564 | place_id, api_endpoint, fetch_timestamp, content_hash |
| `youtube_enrichment` | varies | channel_id, api_endpoint, fetch_timestamp, content_hash |
| `web_enrichment` | 1,708 | source_url, archive_path, fetch_timestamp, content_hash |
| `zcbs_enrichment` | 142 | source_url, fetch_timestamp, content_hash |
| `linkup_timespan` | varies | search_query, source_urls, archive_path |

## Provenance Schema Structure

### Per-Enrichment Section Provenance

Each enrichment section should have a `_provenance` sub-key containing:

```yaml
wikidata_enrichment:
  # ... existing enrichment data ...
  _provenance:
    # MANDATORY: Content integrity
    content_hash:
      algorithm: sha256
      value: "sha256-<base64_hash>"
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"

    # MANDATORY: Source identification (PROV-O alignment)
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "api_fetch"
        used: "wikidata_rest_api"

    # MANDATORY: Verification status
    verification:
      status: "verified"  # verified|stale|failed|pending
      last_verified: "2025-12-28T00:00:00Z"
      content_hash_at_verification: "sha256-<hash>"

    # RECOMMENDED: Standards compliance
    standards_compliance:
      - "W3C PROV-O"
      - "W3C SRI (content hashes)"
```

### Existing Provenance Elements to Preserve

The existing `provenance` section at root level should be PRESERVED and ENHANCED, not replaced:

```yaml
provenance:
  schema_version: "2.0.0"  # Increment version
  generated_at: "2025-12-28T00:00:00Z"
  sources:
    # ... existing source tracking ...
  data_tier_summary:
    # ... existing tier summary ...

  # NEW: Enrichment-level provenance summary
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    # ... other enrichment sections ...

  # NEW: Schema compliance
  provenance_schema_version: "2.0"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"
```

## Content Hash Generation

### For API-Sourced Enrichment Sections

Hash the **entire enrichment section** (excluding the `_provenance` sub-key) as canonical JSON:

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def generate_enrichment_hash(enrichment_data: dict) -> dict:
    """
    Generate SHA-256 hash for enrichment section integrity.

    Excludes '_provenance' key to avoid circular dependency.
    """
    # Remove _provenance to avoid hashing the hash
    data_to_hash = {k: v for k, v in enrichment_data.items() if k != '_provenance'}

    # Canonical JSON (sorted keys, no extra whitespace)
    canonical = json.dumps(data_to_hash, sort_keys=True, separators=(',', ':'))

    # SHA-256 hash, base64-encoded in the SRI style
    hash_bytes = hashlib.sha256(canonical.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')

    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "enrichment_section",
        "computed_at": datetime.now(timezone.utc).isoformat()
    }
```

### Hash Stability Considerations

- **Floating point precision**: Normalize coordinates to 6 decimal places before hashing
- **Timestamp normalization**: Use ISO 8601 with UTC timezone
- **Unicode normalization**: Apply NFD normalization before hashing
- **Key ordering**: Always sort keys alphabetically for canonical JSON

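The considerations above can be folded into a single normalization pass that runs before serialization. The following is a minimal sketch, not the project's implementation: `normalize_for_hash` and `stable_section_hash` are hypothetical names, rounding every float (not only coordinates) is a simplification, and treating naive datetimes as UTC is an assumption.

```python
import base64
import hashlib
import json
import unicodedata
from datetime import datetime, timezone


def normalize_for_hash(value):
    """Recursively normalize a value so that equal data serializes identically."""
    if isinstance(value, bool):
        return value  # handled first so booleans are never treated as numbers
    if isinstance(value, float):
        return round(value, 6)  # simplification: applies to every float, not only coordinates
    if isinstance(value, str):
        return unicodedata.normalize("NFD", value)
    if isinstance(value, datetime):
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, dict):
        return {normalize_for_hash(k): normalize_for_hash(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_for_hash(v) for v in value]
    return value


def stable_section_hash(section: dict) -> str:
    """Hash a normalized enrichment section, skipping its '_provenance' key."""
    data = {k: v for k, v in section.items() if k != "_provenance"}
    canonical = json.dumps(normalize_for_hash(data), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


# Same data, different Unicode form, key order, and float noise
composed = {"name": "caf\u00e9", "coordinates": {"latitude": 51.9276990001}}
decomposed = {"coordinates": {"latitude": 51.927699}, "name": "cafe\u0301"}
same_hash = stable_section_hash(composed) == stable_section_hash(decomposed)
```

With this pass in place, `same_hash` is true: the two sections differ only in ways the stability rules are meant to neutralize.
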
## Enrichment-Specific Provenance Requirements

### 1. Wikidata Enrichment

```yaml
wikidata_enrichment:
  entity_id: Q2710899
  labels:
    en: "National War and Resistance Museum"
    nl: "Nationaal Onderduikmuseum"
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "wikidata_api_fetch"
        used: "https://www.wikidata.org/w/rest.php/wikibase/v1"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 2. Google Maps Enrichment

```yaml
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  name: "Nationaal Onderduikmuseum"
  coordinates:
    latitude: 51.927699
    longitude: 6.5815864
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 3. YouTube Enrichment

```yaml
youtube_enrichment:
  channel:
    channel_id: UC9jeOgD_4thSAgfPPDAGJDQ
    title: "Nationaal Onderduikmuseum"
    subscriber_count: 40
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-ghi789..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.googleapis.com/youtube/v3/channels?id=UC9jeOgD_4thSAgfPPDAGJDQ"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 4. Web Enrichment (Archived Sites)

For web archives, reference the local archive path:

```yaml
web_enrichment:
  web_archives:
    - url: https://nationaalonderduikmuseum.nl
      directory: web/0234/nationaalonderduikmuseum.nl
      pages_archived: 200
      warc_file: archive.warc.gz

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-jkl012..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://nationaalonderduikmuseum.nl"
      generatedAtTime: "2025-11-29T15:40:23Z"
    archive:
      local_path: "web/0234/nationaalonderduikmuseum.nl/archive.warc.gz"
      format: "ISO 28500 WARC"
      size_bytes: 4910963
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

## Implementation Priority

### Phase 1: Content Hashes (Critical)

Add `content_hash` to all enrichment sections. This is deterministic and requires no external API calls.

**Target**: All 29,073 files
**Elements Added**:
- `_provenance.content_hash` for each enrichment section
- `provenance.enrichment_provenance` summary at root level

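The root-level summary can be derived from the per-section `_provenance` blocks once they exist. The sketch below is one possible way to do that; `build_enrichment_summary` is a hypothetical helper, and it assumes each section stores its hash under `_provenance.content_hash.value` as in the schema above.

```python
def build_enrichment_summary(data: dict) -> dict:
    """Collect each section's hash and last verification time into one mapping."""
    summary = {}
    for name, section in data.items():
        if not isinstance(section, dict):
            continue
        section_prov = section.get("_provenance")
        if not isinstance(section_prov, dict) or "content_hash" not in section_prov:
            continue  # no per-section provenance yet, so nothing to summarize
        entry = {"content_hash": section_prov["content_hash"]["value"]}
        last_verified = section_prov.get("verification", {}).get("last_verified")
        if last_verified is not None:
            entry["verified_at"] = last_verified
        summary[name] = entry
    return summary


record = {
    "original_entry": {"organisatie": "Nationaal Onderduikmuseum"},
    "wikidata_enrichment": {
        "entity_id": "Q2710899",
        "_provenance": {
            "content_hash": {"algorithm": "sha256", "value": "sha256-abc123..."},
            "verification": {"status": "verified", "last_verified": "2025-12-28T00:00:00Z"},
        },
    },
    "google_maps_enrichment": {"place_id": "ChIJUw9Z1E15uEcRhlnlq4CCqNk"},
}

# Add the summary to the existing root-level provenance without replacing it
record.setdefault("provenance", {})["enrichment_provenance"] = build_enrichment_summary(record)
```

Sections without a `_provenance` block, such as `google_maps_enrichment` in this sample, are left out of the summary rather than given placeholder values.
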
### Phase 2: PROV-O Alignment

Add `prov.wasDerivedFrom` and related PROV-O elements.

**Target**: Files with existing `api_metadata` or `fetch_timestamp`
**Elements Added**:
- `_provenance.prov.wasDerivedFrom`
- `_provenance.prov.generatedAtTime`
- `_provenance.prov.wasGeneratedBy`

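As an illustration of this phase, the sketch below builds a `prov` block from identifiers already present in a section. It is a hedged example only: `build_prov_block` and `SOURCE_URL_TEMPLATES` are hypothetical names, the URL patterns are copied from the examples in this document, and reading the fetch time from a `fetch_timestamp` key inside the section is an assumption about where that value lives.

```python
from typing import Optional

# (path to the identifier inside the section, URL pattern from this document's examples)
SOURCE_URL_TEMPLATES = {
    "wikidata_enrichment": (("entity_id",), "https://www.wikidata.org/wiki/{}"),
    "google_maps_enrichment": (
        ("place_id",),
        "https://maps.googleapis.com/maps/api/place/details/json?place_id={}",
    ),
    "youtube_enrichment": (
        ("channel", "channel_id"),
        "https://www.googleapis.com/youtube/v3/channels?id={}",
    ),
}


def lookup(section: dict, path: tuple):
    """Follow a key path through nested dicts; return None if any key is missing."""
    current = section
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def build_prov_block(section_name: str, section: dict) -> Optional[dict]:
    """Return a prov block, or None when the source cannot be derived from real data."""
    template = SOURCE_URL_TEMPLATES.get(section_name)
    if template is None:
        return None
    id_path, url_pattern = template
    identifier = lookup(section, id_path)
    fetched_at = section.get("fetch_timestamp")  # assumed key name
    if not identifier or not fetched_at:
        return None  # never invent an identifier or a fetch time
    return {
        "wasDerivedFrom": url_pattern.format(identifier),
        "generatedAtTime": fetched_at,
    }
```

Returning `None` instead of filling in the current time keeps `generatedAtTime` tied to a recorded fetch, which matches the restriction of this phase to files that already carry fetch metadata.
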
### Phase 3: Verification Status

Add verification tracking for future re-verification workflows.

**Target**: All enriched files
**Elements Added**:
- `_provenance.verification.status`
- `_provenance.verification.last_verified`

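A local integrity check is one way such a workflow could populate the verification fields. The sketch below recomputes the section hash and compares it with the stored value; it does not re-query any API. `compute_section_hash` and `verify_section` are hypothetical names, and mapping a hash mismatch to `failed` (rather than `stale`) is an assumption, since this document does not define when each status applies.

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def compute_section_hash(section: dict) -> str:
    """SHA-256 over canonical JSON of the section, excluding '_provenance'."""
    data = {k: v for k, v in section.items() if k != "_provenance"}
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def verify_section(section: dict) -> str:
    """Compare stored and recomputed hashes and record the outcome."""
    provenance = section.get("_provenance")
    if not isinstance(provenance, dict) or "content_hash" not in provenance:
        return "pending"  # nothing stored yet, so nothing to compare against

    stored = provenance["content_hash"]["value"]
    current = compute_section_hash(section)
    status = "verified" if current == stored else "failed"  # assumed mapping

    provenance["verification"] = {
        "status": status,
        "last_verified": datetime.now(timezone.utc).isoformat(),
        "content_hash_at_verification": current,
    }
    return status
```

Because the hash excludes `_provenance`, writing the verification result back into the section does not change the value that the next check will compute.
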
## Processing Rules

### 1. Preserve Existing Data (DATA_PRESERVATION_RULES)

Never delete existing enrichment content. Only ADD provenance metadata.

```yaml
# CORRECT: Add _provenance, preserve everything else
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk  # PRESERVED
  name: "Nationaal Onderduikmuseum"      # PRESERVED
  rating: 4.5                            # PRESERVED
  reviews: [...]                         # PRESERVED
  _provenance:                           # ADDED
    content_hash: ...

# WRONG: Do not restructure or delete existing data
```

### 2. Use ruamel.yaml for YAML Processing

Preserve comments, formatting, and key ordering:

```python
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
yaml.indent(mapping=2, sequence=4, offset=2)

with open(filepath, 'r') as f:
    data = yaml.load(f)

# ... add provenance ...

with open(filepath, 'w') as f:
    yaml.dump(data, f)
```

### 3. Idempotent Processing

Skip files that already have provenance:

```python
# Assumed to mirror the enrichment types table above
ENRICHMENT_SECTIONS = (
    'wikidata_enrichment', 'google_maps_enrichment', 'youtube_enrichment',
    'web_enrichment', 'zcbs_enrichment', 'linkup_timespan',
)


def needs_provenance(data: dict) -> bool:
    """Check if file needs provenance enhancement."""
    # Check for per-section provenance
    for key in ENRICHMENT_SECTIONS:
        if key in data and '_provenance' not in data[key]:
            return True
    return False
```

### 4. Batch Processing with Progress

Process in batches with progress reporting:

```python
# Process by enrichment type priority
PROCESSING_ORDER = [
    'web_enrichment',          # 1,708 files - highest provenance need
    'wikidata_enrichment',     # 17,900 files - API data
    'google_maps_enrichment',  # 3,564 files
    'youtube_enrichment',      # varies
    'zcbs_enrichment',         # 142 files
]
```

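The batching and progress reporting themselves could look like the sketch below. It is illustrative only: `iter_batches` and `process_in_batches` are hypothetical names, the default batch size of 500 is arbitrary, and `process_one` stands in for whatever per-file function adds provenance and returns `True` when the file was changed.

```python
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    paths: Sequence[str],
    process_one: Callable[[str], bool],
    batch_size: int = 500,
    report: Callable[[str], None] = print,
) -> dict:
    """Run process_one over all paths and report progress after each batch."""
    stats = {"processed": 0, "updated": 0, "skipped": 0, "failed": 0}
    for batch in iter_batches(paths, batch_size):
        for path in batch:
            try:
                changed = process_one(path)
            except Exception:
                # A real run would also log the error; one bad file must not stop the batch
                stats["failed"] += 1
            else:
                stats["updated" if changed else "skipped"] += 1
            stats["processed"] += 1
        report(
            f"{stats['processed']}/{len(paths)} files processed "
            f"({stats['updated']} updated, {stats['skipped']} skipped, "
            f"{stats['failed']} failed)"
        )
    return stats
```

Counting skipped files separately makes a second run over the same files easy to check: with idempotent processing, it should report every file as skipped.
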
## Validation

### Validate Provenance Completeness

```python
def validate_provenance(data: dict) -> list[str]:
    """Validate provenance completeness."""
    errors = []

    for section_name in ENRICHMENT_SECTIONS:
        if section_name not in data:
            continue

        section = data[section_name]
        prov = section.get('_provenance', {})

        # Check mandatory elements
        if 'content_hash' not in prov:
            errors.append(f"{section_name}: missing content_hash")
        if 'verification' not in prov:
            errors.append(f"{section_name}: missing verification")
        if 'prov' not in prov or 'wasDerivedFrom' not in prov.get('prov', {}):
            errors.append(f"{section_name}: missing prov.wasDerivedFrom")

    return errors
```

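Completeness checks confirm that keys exist but not that their values are well formed. As a possible complement, the sketch below checks the shape of the hash value and the verification status. `check_provenance_values` is a hypothetical helper; the pattern assumes the `sha256-<base64>` form used in this schema, where a 32-byte digest encodes to 43 base64 characters plus one `=` of padding.

```python
import re

# "sha256-" followed by the base64 encoding of a 32-byte digest
HASH_VALUE_PATTERN = re.compile(r"sha256-[A-Za-z0-9+/]{43}=")

# Status values listed in the schema comment: verified|stale|failed|pending
ALLOWED_STATUSES = {"verified", "stale", "failed", "pending"}


def check_provenance_values(section_name: str, provenance: dict) -> list:
    """Return format errors for values that are present; absent keys are not reported here."""
    errors = []

    hash_value = provenance.get("content_hash", {}).get("value")
    if hash_value is not None and not HASH_VALUE_PATTERN.fullmatch(str(hash_value)):
        errors.append(f"{section_name}: content_hash.value is not sha256-<base64>")

    status = provenance.get("verification", {}).get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        errors.append(f"{section_name}: unknown verification status {status!r}")

    return errors
```

Placeholder values such as `"sha256-abc123..."` from the examples in this document would be reported by this check, which is the intended behavior for real files.
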
## Related Documentation

- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Provenance for web claims in JSON files
- `.opencode/DATA_PRESERVATION_RULES.md` - Never delete enriched data
- `.opencode/DATA_FABRICATION_PROHIBITION.md` - All data must be real
- `AGENTS.md` - Project rules and conventions

## Example: Complete Enriched File with Provenance

```yaml
original_entry:
  organisatie: Nationaal Onderduikmuseum
  # ... other original data ...

provenance:
  schema_version: "2.0.0"
  generated_at: "2025-12-28T00:00:00Z"
  provenance_schema_version: "2.0"
  sources:
    # ... existing source tracking ...
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-abc123..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-def456..."
      verified_at: "2025-12-28T00:00:00Z"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"

wikidata_enrichment:
  entity_id: Q2710899
  labels:
    nl: "Nationaal Onderduikmuseum"
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-11-27T15:17:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"

google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-11-28T09:50:57Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```