# Provenance Timestamp Rules

**Created**: 2025-12-30
**Updated**: 2025-12-30
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md, YAML_PROVENANCE_SCHEMA.md, WEB_OBSERVATION_PROVENANCE_RULES.md

## Core Principle: Every Provenance Statement MUST Have At Least Two Timestamps

**All provenance statements in custodian data MUST include at minimum two timestamps:**

1. **`statement_created_at`** - When the provenance statement/claim was created (extraction/annotation time)
2. **`source_archived_at`** - When the source material was archived/captured

These two timestamps are MANDATORY. Additional temporal metadata is encouraged but optional.

---

## Mandatory Timestamps

### 1. Statement Created Timestamp (`statement_created_at`)

**Purpose**: Records when the claim/statement was extracted, annotated, or created by the agent.

**Format**: ISO 8601 with timezone (UTC preferred)

**Example**:
```yaml
statement_created_at: "2025-12-30T14:30:00Z"
```

**Source**: Generated by the extraction/annotation agent at processing time.

### 2. Source Archived Timestamp (`source_archived_at`)

**Purpose**: Records when the source material (webpage, document, API response) was archived/captured.

**Format**: ISO 8601 with timezone (UTC preferred)

**Example**:
```yaml
source_archived_at: "2025-12-29T10:15:00Z"
```

**Source**:
- For web sources: Playwright archival timestamp, Wayback Machine memento datetime
- For API sources: API response fetch timestamp
- For documents: Document capture/download timestamp

---

## Optional Timestamps (Encouraged)

### 3. Source Created Timestamp (`source_created_at`)

**Purpose**: When the original source content was created/published.

**Example**:
```yaml
source_created_at: "2022-07-15T14:15:00Z"  # Article publish date
```

**Sources**:
- `article:published_time` meta tag
- `datePublished` in JSON-LD
- File creation date
- API response `created_at` field

### 4. Source Last Modified Timestamp (`source_last_modified_at`)

**Purpose**: When the source content was last updated.

**Example**:
```yaml
source_last_modified_at: "2023-01-10T09:00:00Z"
```

**Sources**:
- `article:modified_time` meta tag
- `dateModified` in JSON-LD
- HTTP `Last-Modified` header
- File modification date

### 5. Verification Timestamp (`last_verified_at`)

**Purpose**: When the claim was last re-verified against the source.

**Example**:
```yaml
last_verified_at: "2025-12-30T14:30:00Z"
```

### 6. Next Verification Due (`next_verification_due`)

**Purpose**: When the claim should be re-verified (for staleness tracking).

**Example**:
```yaml
next_verification_due: "2026-03-30T00:00:00Z"  # 90 days from last verification
```

---

## Complete Provenance Timestamp Structure

### For Web Claims

```yaml
provenance:
  # MANDATORY (both required)
  statement_created_at: "2025-12-30T14:30:00Z"     # When we extracted this
  source_archived_at: "2025-12-29T10:15:00Z"       # When we archived the webpage

  # OPTIONAL (encouraged)
  source_created_at: "2022-07-15T14:15:00Z"        # When article was published
  source_last_modified_at: "2023-01-10T09:00:00Z"  # When article was updated
  last_verified_at: "2025-12-30T14:30:00Z"         # Last verification
  next_verification_due: "2026-03-30T00:00:00Z"    # Re-verify in 90 days
```

### For API-Sourced Data (Wikidata, Google Maps, etc.)

```yaml
_provenance:
  # MANDATORY
  statement_created_at: "2025-12-30T14:30:00Z"     # When we processed API response
  source_archived_at: "2025-12-30T14:29:55Z"       # When API was queried (fetch_timestamp)

  # OPTIONAL
  source_last_modified_at: "2025-12-15T00:00:00Z"  # Wikidata entity last modified
  last_verified_at: "2025-12-30T14:30:00Z"
```

### For CH-Annotator Extracted Claims

```yaml
provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922

  # MANDATORY
  statement_created_at: "2025-12-06T21:13:56Z"  # When CH-Annotator processed this
  source_archived_at: "2025-11-06T08:02:44Z"    # When conversation was exported

  # Agent identification
  agent: opencode-claude-sonnet-4
  context_convention: ch_annotator-v1_7_0
```

---

## Invalid Provenance: `agent: claude-conversation`

**PROBLEM**: 24,328 custodian files currently contain provenance statements like:

```yaml
# INVALID - Missing timestamps and proper source identification
extraction_provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  timestamp: '2025-11-06T08:02:44.240037+00:00'  # Only ONE timestamp!
  agent: claude-conversation                     # Vague agent identifier
  context_convention: ch_annotator-v1_7_0
```

**ISSUES**:
1. `claude-conversation` is not a valid agent identifier (which Claude model? which session?)
2. Only one timestamp - doesn't distinguish statement creation from source archival
3. No UUID reference to the specific conversation
4. No archived source path

---

## Valid Provenance Structure (Migration Target)

```yaml
extraction_provenance:
  namespace: glam

  # Source identification
  source_type: claude_conversation_export
  source_path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  conversation_uuid: edc75d66-ee42-4199-8e22-65b0d2347922

  # MANDATORY timestamps
  statement_created_at: "2025-12-06T21:13:56.173868+00:00"  # Annotation time
  source_archived_at: "2025-11-06T08:02:44.240037+00:00"    # Conversation export time

  # Agent identification (proper format)
  agent:
    name: opencode-claude-sonnet-4
    model: claude-sonnet-4-20250514
    session_type: opencode_conversation

  # Context
  context_convention: ch_annotator-v1_7_0

  # Archive reference
  archive:
    format: claude_conversation_json
    local_path: data/conversations/edc75d66-ee42-4199-8e22-65b0d2347922.json
```

---

## Timestamp Hierarchy and Derivation

When only one timestamp is available, derive the other:

| Available | Derive `statement_created_at` | Derive `source_archived_at` |
|-----------|------------------------------|----------------------------|
| Only `timestamp` | Use as `statement_created_at` | Set to same value (assume simultaneous) |
| Only `extraction_date` | Use as `statement_created_at` | Set to same value |
| Only `fetch_timestamp` | Set to same value | Use as `source_archived_at` |
| Only `annotation_date` | Use as `statement_created_at` | Look for `timestamp` in source |

**Migration rule**: If we cannot determine `source_archived_at`, use the earliest available timestamp from the source chain.

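
The derivation table and migration rule above can be sketched as a small helper. This is a minimal illustration, not the project's migration code: the function name `derive_timestamps`, the lookup order of the legacy fields, and the assumption that all values are ISO 8601 strings with a timezone are all hypothetical choices made for this example.

```python
from datetime import datetime

# Legacy field names taken from the derivation table above.
# The lookup order within each tuple is an assumption of this sketch.
STATEMENT_FIELDS = ("annotation_date", "extraction_date", "timestamp")
ARCHIVE_FIELDS = ("fetch_timestamp", "timestamp")


def _parse(value: str) -> datetime:
    """Parse an ISO 8601 string, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def derive_timestamps(legacy: dict) -> dict:
    """Derive both mandatory timestamps from whatever legacy fields exist."""
    created = next((legacy[f] for f in STATEMENT_FIELDS if legacy.get(f)), None)
    archived = next((legacy[f] for f in ARCHIVE_FIELDS if legacy.get(f)), None)

    if created is None and archived is None:
        raise ValueError("No legacy timestamp available to derive from")

    if created is None:
        # Only a fetch timestamp exists: assume extraction was simultaneous.
        created = archived
    if archived is None:
        # Migration rule: fall back to the earliest available timestamp.
        # Assumes every value carries a timezone so they can be compared.
        present = [legacy[f] for f in STATEMENT_FIELDS if legacy.get(f)]
        archived = min(present, key=_parse)

    return {"statement_created_at": created, "source_archived_at": archived}
```

With only a legacy `timestamp`, both output fields receive that value; with both `annotation_date` and `timestamp`, the annotation date becomes `statement_created_at` and the older `timestamp` becomes `source_archived_at`.
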

---

## Agent Identification Standards

### Invalid Agent Identifiers

```yaml
# INVALID - Too vague
agent: claude-conversation
agent: claude
agent: ai
agent: llm
agent: opencode
```

### Valid Agent Identifiers

```yaml
# Format: {tool}-{model}-{version}
agent: opencode-claude-sonnet-4
agent: opencode-claude-opus-4
agent: batch-script-python-3.11
agent: manual-human-curator

# Or structured format
agent:
  name: opencode-claude-sonnet-4
  model: claude-sonnet-4-20250514
  tool: opencode
  version: "1.0.0"
```

---

## PROV-O Alignment

These timestamps align with W3C PROV-O:

| Our Field | PROV-O Property | Description |
|-----------|-----------------|-------------|
| `statement_created_at` | `prov:generatedAtTime` | When entity was generated |
| `source_archived_at` | `prov:atTime` (on Activity) | When archival activity occurred |
| `source_created_at` | `dcterms:created` | Original creation date |
| `source_last_modified_at` | `dcterms:modified` | Last modification date |

```yaml
prov:
  generatedAtTime: "2025-12-30T14:30:00Z"  # = statement_created_at
  wasGeneratedBy:
    "@type": "prov:Activity"
    name: "web_extraction"
    atTime: "2025-12-29T10:15:00Z"         # = source_archived_at
```

---

## Validation Rules

### Rule 1: Both Mandatory Timestamps Required

```python
def validate_provenance_timestamps(provenance: dict) -> list[str]:
    errors = []

    # Check for mandatory timestamps
    if 'statement_created_at' not in provenance:
        errors.append("Missing mandatory 'statement_created_at' timestamp")
    if 'source_archived_at' not in provenance:
        errors.append("Missing mandatory 'source_archived_at' timestamp")

    return errors
```

### Rule 2: Timestamps Must Be Valid ISO 8601

```python
from datetime import datetime

def validate_timestamp_format(timestamp: str) -> bool:
    try:
        datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
        return True
    except ValueError:
        return False
```

### Rule 3: source_archived_at <= statement_created_at

The source must be archived BEFORE or AT the same time as the statement is created.

```python
from datetime import datetime

def validate_timestamp_order(provenance: dict) -> bool:
    # Normalize a trailing 'Z' so fromisoformat() also works on Python < 3.11
    archived = datetime.fromisoformat(provenance['source_archived_at'].replace('Z', '+00:00'))
    created = datetime.fromisoformat(provenance['statement_created_at'].replace('Z', '+00:00'))
    return archived <= created
```

---

## Migration Strategy for Existing Files

### Phase 1: Identify Files Needing Migration

```bash
# Count affected files
find data/custodian -name "*.yaml" -exec grep -l "agent: claude-conversation" {} \; | wc -l
# Result: 24,328 files
```

### Phase 2: Parse and Transform

For each file with `agent: claude-conversation`:

1. Extract existing `timestamp` field
2. Set `source_archived_at` = existing `timestamp`
3. Set `statement_created_at` = `annotation_date` if present, else fall back to the existing `timestamp` (consistent with the derivation table)
4. Replace `agent: claude-conversation` with proper agent identifier
5. Add conversation UUID from path

### Phase 3: Validate and Write

```python
def migrate_provenance(data: dict) -> dict:
    """Migrate old claude-conversation provenance to new format."""
    if 'ch_annotator' in data:
        ch = data['ch_annotator']
        if ch.get('extraction_provenance', {}).get('agent') == 'claude-conversation':
            old_prov = ch['extraction_provenance']

            # Extract conversation UUID from path
            path = old_prov.get('path', '')
            conv_uuid = path.split('/')[-1] if '/conversations/' in path else None

            # Get timestamps; statement_created_at falls back to the old timestamp
            source_archived_at = old_prov.get('timestamp')
            statement_created_at = ch.get('annotation_provenance', {}).get(
                'annotation_date', source_archived_at)

            # Build new provenance
            ch['extraction_provenance'] = {
                'namespace': old_prov.get('namespace', 'glam'),
                'source_type': 'claude_conversation_export',
                'source_path': old_prov.get('path'),
                'conversation_uuid': conv_uuid,
                'statement_created_at': statement_created_at,
                'source_archived_at': source_archived_at,
                'agent': 'opencode-claude-sonnet-4',  # Default migration value
                'context_convention': old_prov.get('context_convention'),
                'migration_note': 'Migrated from agent:claude-conversation on 2025-12-30'
            }

    return data
```

---

## Implementation Checklist

- [ ] Add `statement_created_at` to all new provenance statements
- [ ] Add `source_archived_at` to all new provenance statements
- [ ] Replace `agent: claude-conversation` with proper agent identifiers
- [ ] Add conversation UUIDs where applicable
- [ ] Migrate existing 24,328 files with invalid provenance
- [ ] Update LinkML schema to require dual timestamps
- [ ] Add validation to data pipeline

---

## Related Documentation

- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Web claim provenance structure
- `.opencode/YAML_PROVENANCE_SCHEMA.md` - YAML enrichment provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath provenance requirements
- `AGENTS.md` - Rule 35: Provenance Timestamps