glam/.opencode/YAML_PROVENANCE_SCHEMA.md

# YAML Enrichment Provenance Schema

- **Created:** 2025-12-28
- **Updated:** 2025-12-28
- **Status:** Active Rule
- **Related:** WEB_CLAIM_PROVENANCE_SCHEMA.md (for web_claims in JSON files)

## Purpose

This schema defines provenance requirements for custodian YAML files that contain enrichment data from various sources (Wikidata, Google Maps, YouTube, web scraping, etc.). Unlike web claims in person entity JSON files, YAML enrichment sections represent API-sourced data that requires different provenance tracking.

## Key Differences from WEB_CLAIM_PROVENANCE_SCHEMA

| Aspect | Web Claims (JSON) | YAML Enrichment |
|---|---|---|
| Data Source | Web scraping, news articles | APIs (Wikidata, Google Maps, YouTube) |
| Selector Types | CSS, XPath, TextQuoteSelector | Not applicable (API responses) |
| Archival | Wayback Machine memento URIs | API response caching, local archives |
| Content Hash | SHA-256 of extracted_text | SHA-256 of entire enrichment section |
| Verification | Re-fetch webpage and compare | Re-query API and compare |

## Enrichment Types in Custodian YAML Files

Based on an audit of 29,073 files:

| Enrichment Type | Files | Provenance Elements Needed |
|---|---|---|
| wikidata_enrichment | 17,900 | entity_id, api_endpoint, fetch_timestamp, content_hash |
| google_maps_enrichment | 3,564 | place_id, api_endpoint, fetch_timestamp, content_hash |
| youtube_enrichment | varies | channel_id, api_endpoint, fetch_timestamp, content_hash |
| web_enrichment | 1,708 | source_url, archive_path, fetch_timestamp, content_hash |
| zcbs_enrichment | 142 | source_url, fetch_timestamp, content_hash |
| linkup_timespan | varies | search_query, source_urls, archive_path |

## Provenance Schema Structure

### Per-Enrichment Section Provenance

Each enrichment section should have a `_provenance` sub-key containing:

```yaml
wikidata_enrichment:
  # ... existing enrichment data ...
  _provenance:
    # MANDATORY: Content integrity
    content_hash:
      algorithm: sha256
      value: "sha256-<base64_hash>"
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"

    # MANDATORY: Source identification (PROV-O alignment)
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "api_fetch"
        used: "wikidata_rest_api"

    # MANDATORY: Verification status
    verification:
      status: "verified"  # verified|stale|failed|pending
      last_verified: "2025-12-28T00:00:00Z"
      content_hash_at_verification: "sha256-<hash>"

    # RECOMMENDED: Standards compliance
    standards_compliance:
      - "W3C PROV-O"
      - "W3C SRI (content hashes)"
```

## Existing Provenance Elements to Preserve

The existing `provenance` section at root level should be PRESERVED and ENHANCED, not replaced:

```yaml
provenance:
  schema_version: "2.0.0"  # Increment version
  generated_at: "2025-12-28T00:00:00Z"
  sources:
    # ... existing source tracking ...
  data_tier_summary:
    # ... existing tier summary ...

  # NEW: Enrichment-level provenance summary
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-..."
      verified_at: "2025-12-28T00:00:00Z"
    # ... other enrichment sections ...

  # NEW: Schema compliance
  provenance_schema_version: "2.0"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"
```

## Content Hash Generation

### For API-Sourced Enrichment Sections

Hash the entire enrichment section (excluding the `_provenance` sub-key) as canonical JSON:

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def generate_enrichment_hash(enrichment_data: dict) -> dict:
    """
    Generate SHA-256 hash for enrichment section integrity.

    Excludes '_provenance' key to avoid circular dependency.
    """
    # Remove _provenance to avoid hashing the hash
    data_to_hash = {k: v for k, v in enrichment_data.items() if k != '_provenance'}

    # Canonical JSON (sorted keys, no extra whitespace)
    canonical = json.dumps(data_to_hash, sort_keys=True, separators=(',', ':'))

    # SHA-256 hash
    hash_bytes = hashlib.sha256(canonical.encode('utf-8')).digest()
    hash_b64 = base64.b64encode(hash_bytes).decode('ascii')

    return {
        "algorithm": "sha256",
        "value": f"sha256-{hash_b64}",
        "scope": "enrichment_section",
        "computed_at": datetime.now(timezone.utc).isoformat()
    }
```

### Hash Stability Considerations

- **Floating point precision:** Normalize coordinates to 6 decimal places before hashing
- **Timestamp normalization:** Use ISO 8601 with UTC timezone
- **Unicode normalization:** Apply NFD normalization before hashing
- **Key ordering:** Always sort keys alphabetically for canonical JSON
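The normalization rules above can be sketched as a small recursive helper that runs over an enrichment section before it is serialized; `normalize_for_hashing` is an illustrative name, not part of the schema:

```python
import unicodedata


def normalize_for_hashing(value):
    """Recursively normalize enrichment values for stable hashing.

    Illustrative helper: floats are rounded to 6 decimal places and
    strings receive NFD Unicode normalization, per the rules above.
    """
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, str):
        return unicodedata.normalize('NFD', value)
    if isinstance(value, dict):
        # Sorted keys; json.dumps(sort_keys=True) also enforces this
        return {k: normalize_for_hashing(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [normalize_for_hashing(v) for v in value]
    return value
```

Applying this before the canonical `json.dumps` step keeps hashes stable across re-fetches that differ only in float precision or Unicode encoding form.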

## Enrichment-Specific Provenance Requirements

### 1. Wikidata Enrichment

```yaml
wikidata_enrichment:
  entity_id: Q2710899
  labels:
    en: "National War and Resistance Museum"
    nl: "Nationaal Onderduikmuseum"
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-12-28T00:00:00Z"
      wasGeneratedBy:
        "@type": "prov:Activity"
        name: "wikidata_api_fetch"
        used: "https://www.wikidata.org/w/rest.php/wikibase/v1"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 2. Google Maps Enrichment

```yaml
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  name: "Nationaal Onderduikmuseum"
  coordinates:
    latitude: 51.927699
    longitude: 6.5815864
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 3. YouTube Enrichment

```yaml
youtube_enrichment:
  channel:
    channel_id: UC9jeOgD_4thSAgfPPDAGJDQ
    title: "Nationaal Onderduikmuseum"
    subscriber_count: 40
  # ... other data ...

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-ghi789..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://www.googleapis.com/youtube/v3/channels?id=UC9jeOgD_4thSAgfPPDAGJDQ"
      generatedAtTime: "2025-12-28T00:00:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

### 4. Web Enrichment (Archived Sites)

For web archives, reference the local archive path:

```yaml
web_enrichment:
  web_archives:
    - url: https://nationaalonderduikmuseum.nl
      directory: web/0234/nationaalonderduikmuseum.nl
      pages_archived: 200
      warc_file: archive.warc.gz

  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-jkl012..."
      scope: enrichment_section
    prov:
      wasDerivedFrom: "https://nationaalonderduikmuseum.nl"
      generatedAtTime: "2025-11-29T15:40:23Z"
    archive:
      local_path: "web/0234/nationaalonderduikmuseum.nl/archive.warc.gz"
      format: "ISO 28500 WARC"
      size_bytes: 4910963
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```

## Implementation Priority

### Phase 1: Content Hashes (Critical)

Add `content_hash` to all enrichment sections. This is deterministic and requires no external API calls.

**Target:** All 29,073 files

**Elements Added:**

- `_provenance.content_hash` for each enrichment section
- `provenance.enrichment_provenance` summary at root level
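A minimal Phase 1 pass over one loaded file could look like the following sketch. `add_content_hashes` and the `ENRICHMENT_SECTIONS` list are assumed names (the list is taken from the audit table above), and the hashing mirrors the `generate_enrichment_hash` function shown earlier:

```python
import base64
import hashlib
import json

# Assumed section list, derived from the audit table above
ENRICHMENT_SECTIONS = [
    'wikidata_enrichment', 'google_maps_enrichment', 'youtube_enrichment',
    'web_enrichment', 'zcbs_enrichment',
]


def add_content_hashes(data: dict) -> dict:
    """Phase 1: attach a content_hash to each enrichment section in place."""
    summary = {}
    for key in ENRICHMENT_SECTIONS:
        if key not in data:
            continue
        section = data[key]
        # Exclude _provenance to avoid hashing the hash
        payload = {k: v for k, v in section.items() if k != '_provenance'}
        canonical = json.dumps(payload, sort_keys=True, separators=(',', ':'))
        digest = base64.b64encode(
            hashlib.sha256(canonical.encode('utf-8')).digest()
        ).decode('ascii')
        section.setdefault('_provenance', {})['content_hash'] = {
            'algorithm': 'sha256',
            'value': f'sha256-{digest}',
            'scope': 'enrichment_section',
        }
        summary[key] = {'content_hash': f'sha256-{digest}'}
    # Root-level summary, mirroring provenance.enrichment_provenance
    data.setdefault('provenance', {})['enrichment_provenance'] = summary
    return data
```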

### Phase 2: PROV-O Alignment

Add `prov.wasDerivedFrom` and related PROV-O elements.

**Target:** Files with existing api_metadata or fetch_timestamp

**Elements Added:**

- `_provenance.prov.wasDerivedFrom`
- `_provenance.prov.generatedAtTime`
- `_provenance.prov.wasGeneratedBy`

### Phase 3: Verification Status

Add verification tracking for future re-verification workflows.

**Target:** All enriched files

**Elements Added:**

- `_provenance.verification.status`
- `_provenance.verification.last_verified`
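As a sketch under stated assumptions, a local integrity check for Phase 3 might recompute the stored hash and record a verification status. `verify_section` is a hypothetical helper checking only local content integrity; a full re-verification would additionally re-query the source API, as the comparison table above describes:

```python
import base64
import hashlib
import json
from datetime import datetime, timezone


def verify_section(section: dict) -> str:
    """Recompute the content hash and update the verification status.

    Returns 'verified' if the stored hash still matches the section
    content, 'stale' if the content has drifted, and 'pending' when no
    hash has been recorded yet.
    """
    prov = section.get('_provenance', {})
    stored = prov.get('content_hash', {}).get('value')
    if stored is None:
        return 'pending'
    payload = {k: v for k, v in section.items() if k != '_provenance'}
    canonical = json.dumps(payload, sort_keys=True, separators=(',', ':'))
    digest = base64.b64encode(
        hashlib.sha256(canonical.encode('utf-8')).digest()
    ).decode('ascii')
    status = 'verified' if stored == f'sha256-{digest}' else 'stale'
    prov['verification'] = {
        'status': status,
        'last_verified': datetime.now(timezone.utc).isoformat(),
    }
    return status
```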

## Processing Rules

### 1. Preserve Existing Data (DATA_PRESERVATION_RULES)

Never delete existing enrichment content. Only ADD provenance metadata.

```yaml
# CORRECT: Add _provenance, preserve everything else
google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk  # PRESERVED
  name: "Nationaal Onderduikmuseum"        # PRESERVED
  rating: 4.5                               # PRESERVED
  reviews: [...]                            # PRESERVED
  _provenance:                              # ADDED
    content_hash: ...

# WRONG: Do not restructure or delete existing data
```

### 2. Use ruamel.yaml for YAML Processing

Preserve comments, formatting, and key ordering:

```python
from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
yaml.indent(mapping=2, sequence=4, offset=2)

with open(filepath, 'r') as f:
    data = yaml.load(f)

# ... add provenance ...

with open(filepath, 'w') as f:
    yaml.dump(data, f)
```

### 3. Idempotent Processing

Skip files that already have provenance:

```python
# Enrichment sections tracked by this schema (see the audit table above)
ENRICHMENT_SECTIONS = [
    'wikidata_enrichment', 'google_maps_enrichment', 'youtube_enrichment',
    'web_enrichment', 'zcbs_enrichment', 'linkup_timespan',
]


def needs_provenance(data: dict) -> bool:
    """Check if file needs provenance enhancement."""
    # Check for per-section provenance
    for key in ENRICHMENT_SECTIONS:
        if key in data and '_provenance' not in data[key]:
            return True
    return False
```

### 4. Batch Processing with Progress

Process in batches with progress reporting:

```python
# Process by enrichment type priority
PROCESSING_ORDER = [
    'web_enrichment',           # 1,708 files - highest provenance need
    'wikidata_enrichment',      # 17,900 files - API data
    'google_maps_enrichment',   # 3,564 files
    'youtube_enrichment',       # varies
    'zcbs_enrichment',          # 142 files
]
```

## Validation

### Validate Provenance Completeness

```python
def validate_provenance(data: dict) -> list[str]:
    """Validate provenance completeness."""
    errors = []

    for section_name in ENRICHMENT_SECTIONS:
        if section_name not in data:
            continue

        section = data[section_name]
        prov = section.get('_provenance', {})

        # Check mandatory elements
        if 'content_hash' not in prov:
            errors.append(f"{section_name}: missing content_hash")
        if 'verification' not in prov:
            errors.append(f"{section_name}: missing verification")
        if 'prov' not in prov or 'wasDerivedFrom' not in prov.get('prov', {}):
            errors.append(f"{section_name}: missing prov.wasDerivedFrom")

    return errors
```

## Related Documents

- .opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md - Provenance for web claims in JSON files
- .opencode/DATA_PRESERVATION_RULES.md - Never delete enriched data
- .opencode/DATA_FABRICATION_PROHIBITION.md - All data must be real
- AGENTS.md - Project rules and conventions

## Example: Complete Enriched File with Provenance

```yaml
original_entry:
  organisatie: Nationaal Onderduikmuseum
  # ... other original data ...

provenance:
  schema_version: "2.0.0"
  generated_at: "2025-12-28T00:00:00Z"
  provenance_schema_version: "2.0"
  sources:
    # ... existing source tracking ...
  enrichment_provenance:
    wikidata_enrichment:
      content_hash: "sha256-abc123..."
      verified_at: "2025-12-28T00:00:00Z"
    google_maps_enrichment:
      content_hash: "sha256-def456..."
      verified_at: "2025-12-28T00:00:00Z"
  standards_compliance:
    - "W3C PROV-O"
    - "W3C SRI (content hashes)"

wikidata_enrichment:
  entity_id: Q2710899
  labels:
    nl: "Nationaal Onderduikmuseum"
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-abc123..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://www.wikidata.org/wiki/Q2710899"
      generatedAtTime: "2025-11-27T15:17:00Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"

google_maps_enrichment:
  place_id: ChIJUw9Z1E15uEcRhlnlq4CCqNk
  # ... enrichment data ...
  _provenance:
    content_hash:
      algorithm: sha256
      value: "sha256-def456..."
      scope: enrichment_section
      computed_at: "2025-12-28T00:00:00Z"
    prov:
      wasDerivedFrom: "https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJUw9Z1E15uEcRhlnlq4CCqNk"
      generatedAtTime: "2025-11-28T09:50:57Z"
    verification:
      status: "verified"
      last_verified: "2025-12-28T00:00:00Z"
```