Provenance Timestamp Rules
Created: 2025-12-30
Updated: 2025-12-30
Status: Active Rule
Related: WEB_CLAIM_PROVENANCE_SCHEMA.md, YAML_PROVENANCE_SCHEMA.md, WEB_OBSERVATION_PROVENANCE_RULES.md
Core Principle: Every Provenance Statement MUST Have At Least Two Timestamps
All provenance statements in custodian data MUST include at minimum two timestamps:
- statement_created_at - When the provenance statement/claim was created (extraction/annotation time)
- source_archived_at - When the source material was archived/captured
These two timestamps are MANDATORY. Additional temporal metadata is encouraged but optional.
Mandatory Timestamps
1. Statement Created Timestamp (statement_created_at)
Purpose: Records when the claim/statement was extracted, annotated, or created by the agent.
Format: ISO 8601 with timezone (UTC preferred)
Example:
statement_created_at: "2025-12-30T14:30:00Z"
Source: Generated by the extraction/annotation agent at processing time.
2. Source Archived Timestamp (source_archived_at)
Purpose: Records when the source material (webpage, document, API response) was archived/captured.
Format: ISO 8601 with timezone (UTC preferred)
Example:
source_archived_at: "2025-12-29T10:15:00Z"
Source:
- For web sources: Playwright archival timestamp, Wayback Machine memento datetime
- For API sources: API response fetch timestamp
- For documents: Document capture/download timestamp
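A minimal sketch of how an agent might stamp both mandatory fields in the required format. The helper names `to_utc_iso` and `build_provenance` are hypothetical and not part of any existing pipeline; the sketch assumes timestamps are held as timezone-aware `datetime` objects and that whole-second precision is sufficient.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc_iso(moment: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 in UTC with a trailing 'Z'."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = moment.astimezone(timezone.utc).replace(microsecond=0)
    return utc.isoformat().replace("+00:00", "Z")


def build_provenance(source_archived: datetime,
                     statement_created: Optional[datetime] = None) -> dict:
    """Return a provenance dict carrying both mandatory timestamps.

    If no creation time is given, the current UTC time is used (assumption:
    the statement is being created at call time).
    """
    created = statement_created or datetime.now(timezone.utc)
    return {
        "statement_created_at": to_utc_iso(created),
        "source_archived_at": to_utc_iso(source_archived),
    }
```

Rejecting naive datetimes up front keeps ambiguous local times out of the data, which matters because the format rule asks for an explicit timezone.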
Optional Timestamps (Encouraged)
3. Source Created Timestamp (source_created_at)
Purpose: When the original source content was created/published.
Example:
source_created_at: "2022-07-15T14:15:00Z" # Article publish date
Sources:
- article:published_time meta tag
- datePublished in JSON-LD
- File creation date
- API response created_at field
4. Source Last Modified Timestamp (source_last_modified_at)
Purpose: When the source content was last updated.
Example:
source_last_modified_at: "2023-01-10T09:00:00Z"
Sources:
- article:modified_time meta tag
- dateModified in JSON-LD
- HTTP Last-Modified header
- File modification date
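As an illustration of the meta-tag sources listed for source_created_at and source_last_modified_at, the following sketch reads the article:published_time and article:modified_time tags from an archived HTML page using only the standard library. The class and function names are hypothetical, and the sketch assumes the tags use the `property` attribute; JSON-LD and HTTP headers are not covered.

```python
from html.parser import HTMLParser


class ArticleTimeParser(HTMLParser):
    """Collect article timestamps from <meta property=... content=...> tags."""

    # Meta property name -> provenance field it feeds
    FIELDS = {
        "article:published_time": "source_created_at",
        "article:modified_time": "source_last_modified_at",
    }

    def __init__(self) -> None:
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        field = self.FIELDS.get(attr_map.get("property") or "")
        content = attr_map.get("content")
        # Keep the first occurrence of each field and ignore later duplicates.
        if field and content and field not in self.found:
            self.found[field] = content


def extract_article_times(html: str) -> dict:
    """Return whichever optional source timestamps the page declares."""
    parser = ArticleTimeParser()
    parser.feed(html)
    return parser.found
```

The returned values are passed through unchanged, so they would still need the same ISO 8601 check as any other timestamp before being stored.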
5. Verification Timestamp (last_verified_at)
Purpose: When the claim was last re-verified against the source.
Example:
last_verified_at: "2025-12-30T14:30:00Z"
6. Next Verification Due (next_verification_due)
Purpose: When the claim should be re-verified (for staleness tracking).
Example:
next_verification_due: "2026-03-30T00:00:00Z" # 90 days from last verification
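The due date in the example can be derived from last_verified_at. The sketch below assumes a 90-day interval, as in the comment above, and truncates the result to midnight UTC to mirror the example value; both the function name and the truncation are assumptions rather than documented behaviour.

```python
from datetime import datetime, timedelta, timezone


def next_verification_due(last_verified_at: str, interval_days: int = 90) -> str:
    """Return the re-verification date as an ISO 8601 UTC string."""
    verified = datetime.fromisoformat(last_verified_at.replace("Z", "+00:00"))
    if verified.tzinfo is None:
        raise ValueError("last_verified_at must carry a timezone")
    due = verified.astimezone(timezone.utc) + timedelta(days=interval_days)
    # Assumption: due dates are tracked at day granularity (midnight UTC).
    due = due.replace(hour=0, minute=0, second=0, microsecond=0)
    return due.isoformat().replace("+00:00", "Z")
```

With the sample value, 2025-12-30 plus 90 days lands on 2026-03-30, which matches the example.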
Complete Provenance Timestamp Structure
For Web Claims
provenance:
  # MANDATORY (both required)
  statement_created_at: "2025-12-30T14:30:00Z"     # When we extracted this
  source_archived_at: "2025-12-29T10:15:00Z"       # When we archived the webpage
  # OPTIONAL (encouraged)
  source_created_at: "2022-07-15T14:15:00Z"        # When article was published
  source_last_modified_at: "2023-01-10T09:00:00Z"  # When article was updated
  last_verified_at: "2025-12-30T14:30:00Z"         # Last verification
  next_verification_due: "2026-03-30T00:00:00Z"    # Re-verify in 90 days
For API-Sourced Data (Wikidata, Google Maps, etc.)
_provenance:
  # MANDATORY
  statement_created_at: "2025-12-30T14:30:00Z"     # When we processed API response
  source_archived_at: "2025-12-30T14:29:55Z"       # When API was queried (fetch_timestamp)
  # OPTIONAL
  source_last_modified_at: "2025-12-15T00:00:00Z"  # Wikidata entity last modified
  last_verified_at: "2025-12-30T14:30:00Z"
For CH-Annotator Extracted Claims
provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  # MANDATORY
  statement_created_at: "2025-12-06T21:13:56Z"  # When CH-Annotator processed this
  source_archived_at: "2025-11-06T08:02:44Z"    # When conversation was exported
  # Agent identification
  agent: opencode-claude-sonnet-4
  context_convention: ch_annotator-v1_7_0
Invalid Provenance: agent: claude-conversation
PROBLEM: 24,328 custodian files currently contain provenance statements like:
# INVALID - Missing timestamps and proper source identification
extraction_provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  timestamp: '2025-11-06T08:02:44.240037+00:00'  # Only ONE timestamp!
  agent: claude-conversation                     # Vague agent identifier
  context_convention: ch_annotator-v1_7_0
ISSUES:
- claude-conversation is not a valid agent identifier (which Claude model? which session?)
- Only one timestamp, so statement creation cannot be distinguished from source archival
- No UUID reference to the specific conversation
- No archived source path
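The four issues above can be checked mechanically. This is a sketch only: `find_legacy_issues` is a hypothetical helper, and it assumes the target field names used elsewhere in this document (`conversation_uuid`, `archive.local_path`).

```python
def find_legacy_issues(prov: dict) -> list:
    """Return the problems found in one extraction_provenance block."""
    issues = []
    if prov.get("agent") == "claude-conversation":
        issues.append("vague agent identifier")
    if "statement_created_at" not in prov or "source_archived_at" not in prov:
        issues.append("missing mandatory timestamp pair")
    if not prov.get("conversation_uuid"):
        issues.append("no conversation UUID")
    archive = prov.get("archive")
    if not (isinstance(archive, dict) and archive.get("local_path")):
        issues.append("no archived source path")
    return issues
```

An empty list means the block shows none of the four legacy problems; it does not by itself prove the timestamps are well-formed or correctly ordered.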
Valid Provenance Structure (Migration Target)
extraction_provenance:
  namespace: glam
  # Source identification
  source_type: claude_conversation_export
  source_path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  conversation_uuid: edc75d66-ee42-4199-8e22-65b0d2347922
  # MANDATORY timestamps
  statement_created_at: "2025-12-06T21:13:56.173868+00:00"  # Annotation time
  source_archived_at: "2025-11-06T08:02:44.240037+00:00"    # Conversation export time
  # Agent identification (proper format)
  agent:
    name: opencode-claude-sonnet-4
    model: claude-sonnet-4-20250514
    session_type: opencode_conversation
  # Context
  context_convention: ch_annotator-v1_7_0
  # Archive reference
  archive:
    format: claude_conversation_json
    local_path: data/conversations/edc75d66-ee42-4199-8e22-65b0d2347922.json
Timestamp Hierarchy and Derivation
When only one timestamp is available, derive the other:
| Available | Derive statement_created_at | Derive source_archived_at |
|---|---|---|
| Only timestamp | Use as statement_created_at | Set to same value (assume simultaneous) |
| Only extraction_date | Use as statement_created_at | Set to same value |
| Only fetch_timestamp | Set to same value | Use as source_archived_at |
| Only annotation_date | Use as statement_created_at | Look for timestamp in source |
Migration rule: If we cannot determine source_archived_at, use the earliest available timestamp from the source chain.
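The derivation table and migration rule can be sketched as a single function. The name `derive_mandatory_timestamps` and the `source_chain` parameter (a list of timestamps found in the source material) are hypothetical. The order in which legacy fields are checked is also an assumption, since the table only covers records where exactly one is present.

```python
from datetime import datetime
from typing import Optional


def _parse(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def derive_mandatory_timestamps(record: dict,
                                source_chain: Optional[list] = None) -> dict:
    """Derive both mandatory timestamps from a single legacy field."""
    # Migration rule: fall back to the earliest timestamp in the source chain.
    earliest = min(source_chain, key=_parse) if source_chain else None

    if "timestamp" in record:
        created = archived = record["timestamp"]
    elif "extraction_date" in record:
        created = archived = record["extraction_date"]
    elif "fetch_timestamp" in record:
        created = archived = record["fetch_timestamp"]
    elif "annotation_date" in record:
        created = record["annotation_date"]
        archived = earliest  # None when the source offers no timestamp
    else:
        raise ValueError("record carries no usable legacy timestamp")

    return {"statement_created_at": created, "source_archived_at": archived}
```

When only annotation_date is known and the source chain is empty, the sketch leaves source_archived_at as None so that the record fails validation and is flagged for manual review instead of receiving an invented value.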
Agent Identification Standards
Invalid Agent Identifiers
# INVALID - Too vague
agent: claude-conversation
agent: claude
agent: ai
agent: llm
agent: opencode
Valid Agent Identifiers
# Format: {tool}-{model}-{version}
agent: opencode-claude-sonnet-4
agent: opencode-claude-opus-4
agent: batch-script-python-3.11
agent: manual-human-curator
# Or structured format
agent:
  name: opencode-claude-sonnet-4
  model: claude-sonnet-4-20250514
  tool: opencode
  version: "1.0.0"
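A rough check for the identifier format might look like the sketch below. The pattern is a heuristic inferred from the examples in this section (at least three hyphen-separated lowercase segments), not an official grammar, and `is_valid_agent` is a hypothetical name.

```python
import re

# Heuristic: three or more segments of lowercase letters, digits or dots,
# joined by hyphens, e.g. "batch-script-python-3.11".
AGENT_PATTERN = re.compile(r"[a-z0-9.]+(?:-[a-z0-9.]+){2,}")


def is_valid_agent(agent) -> bool:
    """Accept a flat identifier string or a structured agent mapping."""
    if isinstance(agent, dict):
        agent = agent.get("name")  # structured form: validate its name
    return isinstance(agent, str) and AGENT_PATTERN.fullmatch(agent) is not None
```

The segment count is what separates the listed valid identifiers from the vague ones: "claude-conversation" has only two segments and "opencode" has one.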
PROV-O Alignment
These timestamps align with W3C PROV-O:
| Our Field | PROV-O Property | Description |
|---|---|---|
| statement_created_at | prov:generatedAtTime | When entity was generated |
| source_archived_at | prov:atTime (on Activity) | When archival activity occurred |
| source_created_at | dcterms:created | Original creation date |
| source_last_modified_at | dcterms:modified | Last modification date |
prov:
  generatedAtTime: "2025-12-30T14:30:00Z"  # = statement_created_at
  wasGeneratedBy:
    "@type": "prov:Activity"
    name: "web_extraction"
    atTime: "2025-12-29T10:15:00Z"  # = source_archived_at
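The mapping can be applied programmatically. The sketch below reproduces the structure shown above from a provenance dict; the function name `to_prov` is hypothetical, and placing the optional Dublin Core terms under a separate `dcterms` key is an assumption, since this document does not show where they belong.

```python
def to_prov(provenance: dict, activity_name: str = "web_extraction") -> dict:
    """Map the mandatory (and, if present, optional) fields to PROV-O terms."""
    result = {
        "prov": {
            "generatedAtTime": provenance["statement_created_at"],
            "wasGeneratedBy": {
                "@type": "prov:Activity",
                "name": activity_name,
                "atTime": provenance["source_archived_at"],
            },
        }
    }
    # Optional fields map to Dublin Core terms (placement is an assumption).
    dcterms = {}
    if "source_created_at" in provenance:
        dcterms["created"] = provenance["source_created_at"]
    if "source_last_modified_at" in provenance:
        dcterms["modified"] = provenance["source_last_modified_at"]
    if dcterms:
        result["dcterms"] = dcterms
    return result
```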
Validation Rules
Rule 1: Both Mandatory Timestamps Required
def validate_provenance_timestamps(provenance: dict) -> list[str]:
    errors = []
    # Check for mandatory timestamps
    if 'statement_created_at' not in provenance:
        errors.append("Missing mandatory 'statement_created_at' timestamp")
    if 'source_archived_at' not in provenance:
        errors.append("Missing mandatory 'source_archived_at' timestamp")
    return errors
Rule 2: Timestamps Must Be Valid ISO 8601
from datetime import datetime

def validate_timestamp_format(timestamp: str) -> bool:
    try:
        # fromisoformat() rejects a trailing 'Z' before Python 3.11, so normalise it
        datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
        return True
    except ValueError:
        return False
Rule 3: source_archived_at <= statement_created_at
The source must be archived BEFORE or AT the same time as the statement is created.
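from datetime import datetime

def validate_timestamp_order(provenance: dict) -> bool:
    # Normalise a trailing 'Z' (as in Rule 2) so this also works before Python 3.11
    archived = datetime.fromisoformat(provenance['source_archived_at'].replace('Z', '+00:00'))
    created = datetime.fromisoformat(provenance['statement_created_at'].replace('Z', '+00:00'))
    return archived <= created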
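The three rules can also be run together so that a missing or malformed field produces an error message instead of an exception. The combined function below is a sketch, and `validate_provenance` is a hypothetical name. The timezone check is an added assumption based on the "ISO 8601 with timezone" format requirement.

```python
from datetime import datetime

MANDATORY_FIELDS = ("statement_created_at", "source_archived_at")


def validate_provenance(provenance: dict) -> list:
    """Apply Rules 1-3 and return a list of human-readable errors."""
    errors = []
    parsed = {}
    for field in MANDATORY_FIELDS:
        value = provenance.get(field)
        if value is None:  # Rule 1: field must be present
            errors.append(f"Missing mandatory '{field}' timestamp")
            continue
        try:  # Rule 2: value must parse as ISO 8601
            moment = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"'{field}' is not valid ISO 8601: {value!r}")
            continue
        if moment.tzinfo is None:
            errors.append(f"'{field}' has no timezone")
            continue
        parsed[field] = moment
    # Rule 3: only checked when both timestamps parsed cleanly
    if len(parsed) == 2 and parsed["source_archived_at"] > parsed["statement_created_at"]:
        errors.append("'source_archived_at' is later than 'statement_created_at'")
    return errors
```

Skipping the order check when either timestamp failed to parse avoids reporting a misleading ordering error on top of the real problem.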
Migration Strategy for Existing Files
Phase 1: Identify Files Needing Migration
# Count affected files
find data/custodian -name "*.yaml" -exec grep -l "agent: claude-conversation" {} \; | wc -l
# Result: 24,328 files
Phase 2: Parse and Transform
For each file with agent: claude-conversation:
- Extract existing timestamp field
- Set source_archived_at = existing timestamp
- Set statement_created_at = annotation_date if present, else use current time
- Replace agent: claude-conversation with proper agent identifier
- Add conversation UUID from path
Phase 3: Validate and Write
def migrate_provenance(data: dict) -> dict:
    """Migrate old claude-conversation provenance to new format."""
    if 'ch_annotator' in data:
        ch = data['ch_annotator']
        if ch.get('extraction_provenance', {}).get('agent') == 'claude-conversation':
            old_prov = ch['extraction_provenance']
            # Extract conversation UUID from path
            path = old_prov.get('path', '')
            conv_uuid = path.split('/')[-1] if '/conversations/' in path else None
            # Get timestamps
            source_archived_at = old_prov.get('timestamp')
            statement_created_at = ch.get('annotation_provenance', {}).get('annotation_date', source_archived_at)
            # Build new provenance
            ch['extraction_provenance'] = {
                'namespace': old_prov.get('namespace', 'glam'),
                'source_type': 'claude_conversation_export',
                'source_path': old_prov.get('path'),
                'conversation_uuid': conv_uuid,
                'statement_created_at': statement_created_at,
                'source_archived_at': source_archived_at,
                'agent': 'opencode-claude-sonnet-4',  # Default migration value
                'context_convention': old_prov.get('context_convention'),
                'migration_note': 'Migrated from agent:claude-conversation on 2025-12-30'
            }
    return data
Implementation Checklist
- Add statement_created_at to all new provenance statements
- Add source_archived_at to all new provenance statements
- Replace agent: claude-conversation with proper agent identifiers
- Add conversation UUIDs where applicable
- Migrate existing 24,328 files with invalid provenance
- Update LinkML schema to require dual timestamps
- Add validation to data pipeline
Related Documentation
- .opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md - Web claim provenance structure
- .opencode/YAML_PROVENANCE_SCHEMA.md - YAML enrichment provenance
- .opencode/WEB_OBSERVATION_PROVENANCE_RULES.md - XPath provenance requirements
- AGENTS.md - Rule 35: Provenance Timestamps