glam/.opencode/PROVENANCE_TIMESTAMP_RULES.md
2025-12-30 23:19:38 +01:00

Provenance Timestamp Rules

Created: 2025-12-30
Updated: 2025-12-30
Status: Active Rule
Related: WEB_CLAIM_PROVENANCE_SCHEMA.md, YAML_PROVENANCE_SCHEMA.md, WEB_OBSERVATION_PROVENANCE_RULES.md

Core Principle: Every Provenance Statement MUST Have At Least Two Timestamps

All provenance statements in custodian data MUST include at minimum two timestamps:

  1. statement_created_at - When the provenance statement/claim was created (extraction/annotation time)
  2. source_archived_at - When the source material was archived/captured

These two timestamps are MANDATORY. Additional temporal metadata is encouraged but optional.


Mandatory Timestamps

1. Statement Created Timestamp (statement_created_at)

Purpose: Records when the claim/statement was extracted, annotated, or created by the agent.

Format: ISO 8601 with timezone (UTC preferred)

Example:

statement_created_at: "2025-12-30T14:30:00Z"

Source: Generated by the extraction/annotation agent at processing time.
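
As a minimal sketch, an agent could produce this value at processing time with the Python standard library alone. The helper name `utc_now_iso` is hypothetical and not part of the existing pipeline; dropping microseconds is an assumption made to match the examples in this document.

```python
from datetime import datetime, timezone

def utc_now_iso() -> str:
    """Return the current UTC time as ISO 8601 with a 'Z' suffix (hypothetical helper)."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    # isoformat() emits '+00:00' for UTC; the examples in this document use 'Z'
    return now.isoformat().replace("+00:00", "Z")

statement_created_at = utc_now_iso()
```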

2. Source Archived Timestamp (source_archived_at)

Purpose: Records when the source material (webpage, document, API response) was archived/captured.

Format: ISO 8601 with timezone (UTC preferred)

Example:

source_archived_at: "2025-12-29T10:15:00Z"

Source:

  • For web sources: Playwright archival timestamp, Wayback Machine memento datetime
  • For API sources: API response fetch timestamp
  • For documents: Document capture/download timestamp

Optional Timestamps (Encouraged)

3. Source Created Timestamp (source_created_at)

Purpose: When the original source content was created/published.

Example:

source_created_at: "2022-07-15T14:15:00Z"  # Article publish date

Sources:

  • article:published_time meta tag
  • datePublished in JSON-LD
  • File creation date
  • API response created_at field
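
The first source in the list above can be read with the standard-library HTML parser. This is an illustrative sketch only: the class name `PublishedTimeParser` and the sample HTML are assumptions, and real pages may need a more tolerant parser.

```python
from html.parser import HTMLParser
from typing import Optional

class PublishedTimeParser(HTMLParser):
    """Capture the content of <meta property="article:published_time"> (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.published_time: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first matching meta tag
        if tag != "meta" or self.published_time is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "article:published_time":
            self.published_time = attr_map.get("content")

# Sample input invented for illustration
html = '<head><meta property="article:published_time" content="2022-07-15T14:15:00Z"></head>'
parser = PublishedTimeParser()
parser.feed(html)
source_created_at = parser.published_time
```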

4. Source Last Modified Timestamp (source_last_modified_at)

Purpose: When the source content was last updated.

Example:

source_last_modified_at: "2023-01-10T09:00:00Z"

Sources:

  • article:modified_time meta tag
  • dateModified in JSON-LD
  • HTTP Last-Modified header
  • File modification date
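
HTTP Last-Modified values use the RFC 7231 date format rather than ISO 8601, so they need converting before storage. A minimal sketch using `email.utils.parsedate_to_datetime` follows; the function name `last_modified_to_iso` is hypothetical, and treating a zone-less value as UTC is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def last_modified_to_iso(header_value: str) -> str:
    """Convert an HTTP Last-Modified header value to ISO 8601 UTC (hypothetical helper)."""
    parsed = parsedate_to_datetime(header_value)
    if parsed.tzinfo is None:
        # Assumption: a header without usable zone information is treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")

source_last_modified_at = last_modified_to_iso("Tue, 10 Jan 2023 09:00:00 GMT")
```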

5. Verification Timestamp (last_verified_at)

Purpose: When the claim was last re-verified against the source.

Example:

last_verified_at: "2025-12-30T14:30:00Z"

6. Next Verification Due (next_verification_due)

Purpose: When the claim should be re-verified (for staleness tracking).

Example:

next_verification_due: "2026-03-30T00:00:00Z"  # 90 days from last verification
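
The due date can be derived from last_verified_at rather than entered by hand. The sketch below assumes the 90-day interval mentioned in the example comment; the names `VERIFICATION_INTERVAL` and `next_due` are hypothetical. Unlike the example above, it keeps the time of day of the last verification instead of rounding to midnight.

```python
from datetime import datetime, timedelta

# Assumed interval, taken from the "90 days" comment in the example
VERIFICATION_INTERVAL = timedelta(days=90)

def next_due(last_verified_at: str) -> str:
    """Return last_verified_at plus the verification interval, as ISO 8601 UTC."""
    # Normalize 'Z' so fromisoformat() also works before Python 3.11
    verified = datetime.fromisoformat(last_verified_at.replace("Z", "+00:00"))
    due = verified + VERIFICATION_INTERVAL
    return due.isoformat().replace("+00:00", "Z")

next_verification_due = next_due("2025-12-30T14:30:00Z")
```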

Complete Provenance Timestamp Structure

For Web Claims

provenance:
  # MANDATORY (both required)
  statement_created_at: "2025-12-30T14:30:00Z"   # When we extracted this
  source_archived_at: "2025-12-29T10:15:00Z"    # When we archived the webpage
  
  # OPTIONAL (encouraged)
  source_created_at: "2022-07-15T14:15:00Z"     # When article was published
  source_last_modified_at: "2023-01-10T09:00:00Z"  # When article was updated
  last_verified_at: "2025-12-30T14:30:00Z"      # Last verification
  next_verification_due: "2026-03-30T00:00:00Z" # Re-verify in 90 days

For API-Sourced Data (Wikidata, Google Maps, etc.)

_provenance:
  # MANDATORY
  statement_created_at: "2025-12-30T14:30:00Z"   # When we processed API response
  source_archived_at: "2025-12-30T14:29:55Z"    # When API was queried (fetch_timestamp)
  
  # OPTIONAL
  source_last_modified_at: "2025-12-15T00:00:00Z"  # Wikidata entity last modified
  last_verified_at: "2025-12-30T14:30:00Z"

For CH-Annotator Extracted Claims

provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  
  # MANDATORY
  statement_created_at: "2025-12-06T21:13:56Z"   # When CH-Annotator processed this
  source_archived_at: "2025-11-06T08:02:44Z"    # When conversation was exported
  
  # Agent identification
  agent: opencode-claude-sonnet-4
  context_convention: ch_annotator-v1_7_0

Invalid Provenance: agent: claude-conversation

PROBLEM: 24,328 custodian files currently contain provenance statements like:

# INVALID - Missing timestamps and proper source identification
extraction_provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  timestamp: '2025-11-06T08:02:44.240037+00:00'  # Only ONE timestamp!
  agent: claude-conversation  # Vague agent identifier
  context_convention: ch_annotator-v1_7_0

ISSUES:

  1. claude-conversation is not a valid agent identifier (which Claude model? which session?)
  2. Only one timestamp - doesn't distinguish statement creation from source archival
  3. No explicit conversation UUID field (the UUID is only embedded in path)
  4. No archived source path

Valid Provenance Structure (Migration Target)

extraction_provenance:
  namespace: glam
  
  # Source identification
  source_type: claude_conversation_export
  source_path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  conversation_uuid: edc75d66-ee42-4199-8e22-65b0d2347922
  
  # MANDATORY timestamps
  statement_created_at: "2025-12-06T21:13:56.173868+00:00"  # Annotation time
  source_archived_at: "2025-11-06T08:02:44.240037+00:00"   # Conversation export time
  
  # Agent identification (proper format)
  agent:
    name: opencode-claude-sonnet-4
    model: claude-sonnet-4-20250514
    session_type: opencode_conversation
  
  # Context
  context_convention: ch_annotator-v1_7_0
  
  # Archive reference
  archive:
    format: claude_conversation_json
    local_path: data/conversations/edc75d66-ee42-4199-8e22-65b0d2347922.json

Timestamp Hierarchy and Derivation

When only one timestamp is available, derive the other:

| Available | Derive statement_created_at | Derive source_archived_at |
|---|---|---|
| Only timestamp | Use as statement_created_at | Set to same value (assume simultaneous) |
| Only extraction_date | Use as statement_created_at | Set to same value |
| Only fetch_timestamp | Set to same value | Use as source_archived_at |
| Only annotation_date | Use as statement_created_at | Look for timestamp in source |

Migration rule: If we cannot determine source_archived_at, use the earliest available timestamp from the source chain.
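
The derivation rules can be sketched as a single lookup. This is an illustration under assumptions, not the migration code: the function name `derive_timestamps` is hypothetical, the input is assumed to carry exactly one of the legacy fields, and `None` stands for "still has to be looked up in the source".

```python
from typing import Optional

def derive_timestamps(legacy: dict) -> tuple[Optional[str], Optional[str]]:
    """Return (statement_created_at, source_archived_at) for a record with one legacy field."""
    for key in ("timestamp", "extraction_date", "fetch_timestamp"):
        value = legacy.get(key)
        if value:
            # For these three fields both timestamps receive the same value
            return value, value
    annotation_date = legacy.get("annotation_date")
    if annotation_date:
        # The archival time must be found in the source; None marks it unresolved
        return annotation_date, None
    return None, None

created, archived = derive_timestamps({"timestamp": "2025-11-06T08:02:44.240037+00:00"})
```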


Agent Identification Standards

Invalid Agent Identifiers

# INVALID - Too vague
agent: claude-conversation
agent: claude
agent: ai
agent: llm
agent: opencode

Valid Agent Identifiers

# Format: {tool}-{model}-{version}
agent: opencode-claude-sonnet-4
agent: opencode-claude-opus-4
agent: batch-script-python-3.11
agent: manual-human-curator

# Or structured format
agent:
  name: opencode-claude-sonnet-4
  model: claude-sonnet-4-20250514
  tool: opencode
  version: "1.0.0"
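
A rough check for these standards might look like the sketch below. The function name `is_valid_agent` and the "at least three hyphen-separated parts" heuristic are assumptions inferred from the examples above; the rule itself does not define a formal grammar.

```python
# Identifiers listed as invalid in this document
INVALID_AGENTS = {"claude-conversation", "claude", "ai", "llm", "opencode"}

def is_valid_agent(agent) -> bool:
    """Accept a '{tool}-{model}-{version}' string or a structured mapping with such a name."""
    if isinstance(agent, dict):
        # Structured format: validate the 'name' field
        agent = agent.get("name")
    if not isinstance(agent, str) or agent.lower() in INVALID_AGENTS:
        return False
    # Assumed heuristic: at least three non-empty hyphen-separated parts
    parts = agent.split("-")
    return len(parts) >= 3 and all(parts)
```

For instance, `is_valid_agent("opencode-claude-sonnet-4")` passes, while `is_valid_agent("claude-conversation")` is rejected by the blocklist.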

PROV-O Alignment

These timestamps align with W3C PROV-O:

| Our Field | PROV-O Property | Description |
|---|---|---|
| statement_created_at | prov:generatedAtTime | When entity was generated |
| source_archived_at | prov:atTime (on Activity) | When archival activity occurred |
| source_created_at | dcterms:created | Original creation date |
| source_last_modified_at | dcterms:modified | Last modification date |

prov:
  generatedAtTime: "2025-12-30T14:30:00Z"  # = statement_created_at
  wasGeneratedBy:
    "@type": "prov:Activity"
    name: "web_extraction"
    atTime: "2025-12-29T10:15:00Z"  # = source_archived_at

Validation Rules

Rule 1: Both Mandatory Timestamps Required

def validate_provenance_timestamps(provenance: dict) -> list[str]:
    errors = []
    
    # Check for mandatory timestamps
    if 'statement_created_at' not in provenance:
        errors.append("Missing mandatory 'statement_created_at' timestamp")
    if 'source_archived_at' not in provenance:
        errors.append("Missing mandatory 'source_archived_at' timestamp")
    
    return errors

Rule 2: Timestamps Must Be Valid ISO 8601

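from datetime import datetime

def validate_timestamp_format(timestamp: str) -> bool:
    """Return True if the value is ISO 8601 and carries a timezone."""
    try:
        # fromisoformat() rejects a trailing 'Z' before Python 3.11
        parsed = datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
    except (AttributeError, TypeError, ValueError):
        return False
    # The format rule requires a timezone, so naive datetimes are rejected
    return parsed.tzinfo is not None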

Rule 3: source_archived_at <= statement_created_at

The source must be archived BEFORE or AT the same time as the statement is created.

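from datetime import datetime

def validate_timestamp_order(provenance: dict) -> bool:
    # Normalize 'Z' to '+00:00'; fromisoformat() rejects 'Z' before Python 3.11
    archived = datetime.fromisoformat(provenance['source_archived_at'].replace('Z', '+00:00'))
    created = datetime.fromisoformat(provenance['statement_created_at'].replace('Z', '+00:00'))
    return archived <= created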
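
The three rules can also be applied in one pass so that a single call reports every problem. The wrapper below is a self-contained sketch; the name `check_provenance` and the error wording are assumptions, not part of the existing pipeline.

```python
from datetime import datetime

MANDATORY = ("statement_created_at", "source_archived_at")

def check_provenance(provenance: dict) -> list[str]:
    """Apply Rules 1-3 and return all problems found (hypothetical wrapper)."""
    # Rule 1: both mandatory timestamps must be present
    errors = [f"Missing mandatory '{key}' timestamp" for key in MANDATORY if key not in provenance]
    parsed = {}
    for key in MANDATORY:
        if key not in provenance:
            continue
        # Rule 2: valid ISO 8601 with a timezone
        try:
            value = datetime.fromisoformat(provenance[key].replace("Z", "+00:00"))
        except (AttributeError, TypeError, ValueError):
            errors.append(f"'{key}' is not valid ISO 8601")
            continue
        if value.tzinfo is None:
            errors.append(f"'{key}' has no timezone")
            continue
        parsed[key] = value
    # Rule 3: only checked when both values parsed cleanly
    if len(parsed) == 2 and parsed["source_archived_at"] > parsed["statement_created_at"]:
        errors.append("source_archived_at is later than statement_created_at")
    return errors
```

Called on the legacy structure that carries only a single timestamp field, this returns two "Missing mandatory" errors; called on the web-claim example it returns an empty list.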

Migration Strategy for Existing Files

Phase 1: Identify Files Needing Migration

# Count affected files
find data/custodian -name "*.yaml" -exec grep -l "agent: claude-conversation" {} \; | wc -l
# Result: 24,328 files

Phase 2: Parse and Transform

For each file with agent: claude-conversation:

  1. Extract existing timestamp field
  2. Set source_archived_at = existing timestamp
  3. Set statement_created_at = annotation_date if present, else reuse the existing timestamp (assume simultaneous)
  4. Replace agent: claude-conversation with proper agent identifier
  5. Add conversation UUID from path

Phase 3: Validate and Write

def migrate_provenance(data: dict) -> dict:
    """Migrate old claude-conversation provenance to new format."""
    
    if 'ch_annotator' in data:
        ch = data['ch_annotator']
        
        if ch.get('extraction_provenance', {}).get('agent') == 'claude-conversation':
            old_prov = ch['extraction_provenance']
            
            # Extract conversation UUID from path
            path = old_prov.get('path', '')
            conv_uuid = path.split('/')[-1] if '/conversations/' in path else None
            
            # Get timestamps
            source_archived_at = old_prov.get('timestamp')
            statement_created_at = ch.get('annotation_provenance', {}).get('annotation_date', source_archived_at)
            
            # Build new provenance
            ch['extraction_provenance'] = {
                'namespace': old_prov.get('namespace', 'glam'),
                'source_type': 'claude_conversation_export',
                'source_path': old_prov.get('path'),
                'conversation_uuid': conv_uuid,
                'statement_created_at': statement_created_at,
                'source_archived_at': source_archived_at,
                'agent': 'opencode-claude-sonnet-4',  # Default migration value
                'context_convention': old_prov.get('context_convention'),
                'migration_note': 'Migrated from agent:claude-conversation on 2025-12-30'
            }
    
    return data
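
The migration takes the last path segment as the conversation UUID without checking it. A stricter variant could validate the segment with the standard `uuid` module, as sketched here; the function name `conversation_uuid_from_path` is hypothetical.

```python
import uuid
from typing import Optional

def conversation_uuid_from_path(path: str) -> Optional[str]:
    """Return the trailing path segment if it is a well-formed UUID, else None."""
    if "/conversations/" not in path:
        return None
    candidate = path.rstrip("/").split("/")[-1]
    try:
        # uuid.UUID raises ValueError for malformed input
        return str(uuid.UUID(candidate))
    except ValueError:
        return None

conv_uuid = conversation_uuid_from_path("/conversations/edc75d66-ee42-4199-8e22-65b0d2347922")
```

Returning `None` for a malformed segment lets the caller flag the file for manual review instead of writing a bogus conversation_uuid.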

Implementation Checklist

  • Add statement_created_at to all new provenance statements
  • Add source_archived_at to all new provenance statements
  • Replace agent: claude-conversation with proper agent identifiers
  • Add conversation UUIDs where applicable
  • Migrate existing 24,328 files with invalid provenance
  • Update LinkML schema to require dual timestamps
  • Add validation to data pipeline

Related Documentation

  • .opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md - Web claim provenance structure
  • .opencode/YAML_PROVENANCE_SCHEMA.md - YAML enrichment provenance
  • .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath provenance requirements
  • AGENTS.md - Rule 35: Provenance Timestamps