glam/.opencode/PROVENANCE_TIMESTAMP_RULES.md
2025-12-30 23:19:38 +01:00

# Provenance Timestamp Rules
**Created**: 2025-12-30
**Updated**: 2025-12-30
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md, YAML_PROVENANCE_SCHEMA.md, WEB_OBSERVATION_PROVENANCE_RULES.md
## Core Principle: Every Provenance Statement MUST Have At Least Two Timestamps
**All provenance statements in custodian data MUST include at minimum two timestamps:**
1. **`statement_created_at`** - When the provenance statement/claim was created (extraction/annotation time)
2. **`source_archived_at`** - When the source material was archived/captured
These two timestamps are MANDATORY. Additional temporal metadata is encouraged but optional.

---
## Mandatory Timestamps
### 1. Statement Created Timestamp (`statement_created_at`)
**Purpose**: Records when the claim/statement was extracted, annotated, or created by the agent.
**Format**: ISO 8601 with timezone (UTC preferred)
**Example**:
```yaml
statement_created_at: "2025-12-30T14:30:00Z"
```
**Source**: Generated by the extraction/annotation agent at processing time.
### 2. Source Archived Timestamp (`source_archived_at`)
**Purpose**: Records when the source material (webpage, document, API response) was archived/captured.
**Format**: ISO 8601 with timezone (UTC preferred)
**Example**:
```yaml
source_archived_at: "2025-12-29T10:15:00Z"
```
**Source**:
- For web sources: Playwright archival timestamp, Wayback Machine memento datetime
- For API sources: API response fetch timestamp
- For documents: Document capture/download timestamp
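
As a minimal sketch of how an agent might emit both mandatory fields, the snippet below uses only Python's standard library. The helper names `to_utc_iso` and `build_mandatory_timestamps` are hypothetical and not part of the existing pipeline, and the output is limited to whole-second precision for brevity.

```python
from datetime import datetime, timezone
from typing import Optional


def to_utc_iso(moment: datetime) -> str:
    """Render a timezone-aware datetime as ISO 8601 in UTC with a 'Z' suffix."""
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Assumption: whole-second precision is enough; microseconds are dropped.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_mandatory_timestamps(archived: datetime,
                               created: Optional[datetime] = None) -> dict:
    """Return both mandatory fields; `created` defaults to the current UTC time."""
    created = created or datetime.now(timezone.utc)
    return {
        "statement_created_at": to_utc_iso(created),
        "source_archived_at": to_utc_iso(archived),
    }


# Hypothetical usage: the page was archived a day before extraction.
archived = datetime(2025, 12, 29, 10, 15, tzinfo=timezone.utc)
created = datetime(2025, 12, 30, 14, 30, tzinfo=timezone.utc)
print(build_mandatory_timestamps(archived, created))
```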
---
## Optional Timestamps (Encouraged)
### 3. Source Created Timestamp (`source_created_at`)
**Purpose**: When the original source content was created/published.
**Example**:
```yaml
source_created_at: "2022-07-15T14:15:00Z" # Article publish date
```
**Sources**:
- `article:published_time` meta tag
- `datePublished` in JSON-LD
- File creation date
- API response `created_at` field
### 4. Source Last Modified Timestamp (`source_last_modified_at`)
**Purpose**: When the source content was last updated.
**Example**:
```yaml
source_last_modified_at: "2023-01-10T09:00:00Z"
```
**Sources**:
- `article:modified_time` meta tag
- `dateModified` in JSON-LD
- HTTP `Last-Modified` header
- File modification date
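
Of the sources listed above, the HTTP `Last-Modified` header is the one that does not arrive in ISO 8601 form. The following sketch shows one way to convert it, assuming the header value is a standard HTTP-date; the function name `last_modified_to_iso` is hypothetical.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def last_modified_to_iso(header_value: str) -> str:
    """Convert an HTTP-date such as 'Tue, 10 Jan 2023 09:00:00 GMT' to ISO 8601 UTC."""
    parsed = parsedate_to_datetime(header_value)
    if parsed.tzinfo is None:
        # Assumption: HTTP dates are always GMT, so a naive result is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(last_modified_to_iso("Tue, 10 Jan 2023 09:00:00 GMT"))
```

The result can be stored directly as `source_last_modified_at`.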
### 5. Verification Timestamp (`last_verified_at`)
**Purpose**: When the claim was last re-verified against the source.
**Example**:
```yaml
last_verified_at: "2025-12-30T14:30:00Z"
```
### 6. Next Verification Due (`next_verification_due`)
**Purpose**: When the claim should be re-verified (for staleness tracking).
**Example**:
```yaml
next_verification_due: "2026-03-30T00:00:00Z" # 90 days from last verification
```
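The due date in the example can be derived from `last_verified_at`. The sketch below assumes a fixed 90-day interval and truncation to midnight UTC, matching the example values; the function name and the `interval_days` parameter are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def next_verification_due(last_verified_at: str, interval_days: int = 90) -> str:
    """Return the re-verification date as ISO 8601 UTC, truncated to midnight."""
    # Normalize 'Z' so parsing also works before Python 3.11.
    # Assumption: the input carries a timezone, as the format rules require.
    verified = datetime.fromisoformat(last_verified_at.replace("Z", "+00:00"))
    due = verified.astimezone(timezone.utc) + timedelta(days=interval_days)
    due = due.replace(hour=0, minute=0, second=0, microsecond=0)
    return due.strftime("%Y-%m-%dT%H:%M:%SZ")


print(next_verification_due("2025-12-30T14:30:00Z"))
```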
---
## Complete Provenance Timestamp Structure
### For Web Claims
```yaml
provenance:
  # MANDATORY (both required)
  statement_created_at: "2025-12-30T14:30:00Z"  # When we extracted this
  source_archived_at: "2025-12-29T10:15:00Z"    # When we archived the webpage
  # OPTIONAL (encouraged)
  source_created_at: "2022-07-15T14:15:00Z"        # When article was published
  source_last_modified_at: "2023-01-10T09:00:00Z"  # When article was updated
  last_verified_at: "2025-12-30T14:30:00Z"         # Last verification
  next_verification_due: "2026-03-30T00:00:00Z"    # Re-verify in 90 days
```
### For API-Sourced Data (Wikidata, Google Maps, etc.)
```yaml
_provenance:
  # MANDATORY
  statement_created_at: "2025-12-30T14:30:00Z"  # When we processed API response
  source_archived_at: "2025-12-30T14:29:55Z"    # When API was queried (fetch_timestamp)
  # OPTIONAL
  source_last_modified_at: "2025-12-15T00:00:00Z"  # Wikidata entity last modified
  last_verified_at: "2025-12-30T14:30:00Z"
```
### For CH-Annotator Extracted Claims
```yaml
provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  # MANDATORY
  statement_created_at: "2025-12-06T21:13:56Z"  # When CH-Annotator processed this
  source_archived_at: "2025-11-06T08:02:44Z"    # When conversation was exported
  # Agent identification
  agent: opencode-claude-sonnet-4
  context_convention: ch_annotator-v1_7_0
```
---
## Invalid Provenance: `agent: claude-conversation`
**PROBLEM**: 24,328 custodian files currently contain provenance statements like:
```yaml
# INVALID - Missing timestamps and proper source identification
extraction_provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  timestamp: '2025-11-06T08:02:44.240037+00:00'  # Only ONE timestamp!
  agent: claude-conversation  # Vague agent identifier
  context_convention: ch_annotator-v1_7_0
```
**ISSUES**:
1. `claude-conversation` is not a valid agent identifier (which Claude model? which session?)
2. Only one timestamp - doesn't distinguish statement creation from source archival
3. No UUID reference to the specific conversation
4. No archived source path
---
## Valid Provenance Structure (Migration Target)
```yaml
extraction_provenance:
  namespace: glam
  # Source identification
  source_type: claude_conversation_export
  source_path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  conversation_uuid: edc75d66-ee42-4199-8e22-65b0d2347922
  # MANDATORY timestamps
  statement_created_at: "2025-12-06T21:13:56.173868+00:00"  # Annotation time
  source_archived_at: "2025-11-06T08:02:44.240037+00:00"    # Conversation export time
  # Agent identification (proper format)
  agent:
    name: opencode-claude-sonnet-4
    model: claude-sonnet-4-20250514
    session_type: opencode_conversation
  # Context
  context_convention: ch_annotator-v1_7_0
  # Archive reference
  archive:
    format: claude_conversation_json
    local_path: data/conversations/edc75d66-ee42-4199-8e22-65b0d2347922.json
```
---
## Timestamp Hierarchy and Derivation
When only one timestamp is available, derive the other:
| Available | Derive `statement_created_at` | Derive `source_archived_at` |
|-----------|------------------------------|----------------------------|
| Only `timestamp` | Use as `statement_created_at` | Set to same value (assume simultaneous) |
| Only `extraction_date` | Use as `statement_created_at` | Set to same value |
| Only `fetch_timestamp` | Set to same value | Use as `source_archived_at` |
| Only `annotation_date` | Use as `statement_created_at` | Look for `timestamp` in source |
**Migration rule**: If we cannot determine `source_archived_at`, use the earliest available timestamp from the source chain.
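
The derivation table can be sketched as a single function. This is an illustration under stated assumptions, not the pipeline's actual code: the name `derive_mandatory_timestamps`, the `source` argument, and the fallback used when no `timestamp` is found in the source are all hypothetical.

```python
from typing import Optional


def derive_mandatory_timestamps(legacy: dict, source: Optional[dict] = None) -> dict:
    """Derive both mandatory timestamps from a record carrying one legacy field.

    `legacy` is the old provenance mapping; `source` is the source record
    consulted for a `timestamp` when only `annotation_date` is present.
    """
    source = source or {}
    # Rows 1-3 of the table: the single known value fills both fields.
    for key in ("timestamp", "extraction_date", "fetch_timestamp"):
        if key in legacy:
            value = legacy[key]
            return {"statement_created_at": value, "source_archived_at": value}
    # Row 4: annotation time is the statement time; archival comes from the source.
    if "annotation_date" in legacy:
        created = legacy["annotation_date"]
        # Assumption: with no source timestamp, reuse the only value available.
        archived = source.get("timestamp", created)
        return {"statement_created_at": created, "source_archived_at": archived}
    raise ValueError("no usable timestamp field found")


print(derive_mandatory_timestamps(
    {"annotation_date": "2025-12-06T21:13:56Z"},
    {"timestamp": "2025-11-06T08:02:44Z"},
))
```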

---
## Agent Identification Standards
### Invalid Agent Identifiers
```yaml
# INVALID - Too vague
agent: claude-conversation
agent: claude
agent: ai
agent: llm
agent: opencode
```
### Valid Agent Identifiers
```yaml
# Format: {tool}-{model}-{version}
agent: opencode-claude-sonnet-4
agent: opencode-claude-opus-4
agent: batch-script-python-3.11
agent: manual-human-curator
# Or structured format
agent:
  name: opencode-claude-sonnet-4
  model: claude-sonnet-4-20250514
  tool: opencode
  version: "1.0.0"
```
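A rough check for these standards might look like the following. The deny-list mirrors the invalid identifiers above; the requirement of at least three hyphen-separated parts is an assumption inferred from the `{tool}-{model}-{version}` format, and the name `is_valid_agent` is hypothetical.

```python
# Identifiers listed above as too vague.
INVALID_AGENTS = {"claude-conversation", "claude", "ai", "llm", "opencode"}


def is_valid_agent(agent) -> bool:
    """Accept a specific string identifier or a structured mapping with a valid `name`."""
    if isinstance(agent, dict):
        agent = agent.get("name")
    if not isinstance(agent, str):
        return False
    # Assumption: a usable identifier has at least three hyphen-separated parts.
    return agent not in INVALID_AGENTS and len(agent.split("-")) >= 3


print(is_valid_agent("opencode-claude-sonnet-4"))  # specific identifier
print(is_valid_agent("claude-conversation"))       # vague identifier
```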
---
## PROV-O Alignment
These timestamps align with W3C PROV-O:
| Our Field | PROV-O Property | Description |
|-----------|-----------------|-------------|
| `statement_created_at` | `prov:generatedAtTime` | When entity was generated |
| `source_archived_at` | `prov:atTime` (on Activity) | When archival activity occurred |
| `source_created_at` | `dcterms:created` | Original creation date |
| `source_last_modified_at` | `dcterms:modified` | Last modification date |
```yaml
prov:
  generatedAtTime: "2025-12-30T14:30:00Z"  # = statement_created_at
  wasGeneratedBy:
    "@type": "prov:Activity"
    name: "web_extraction"
    atTime: "2025-12-29T10:15:00Z"  # = source_archived_at
```
---
## Validation Rules
### Rule 1: Both Mandatory Timestamps Required
```python
def validate_provenance_timestamps(provenance: dict) -> list[str]:
    errors = []
    # Check for mandatory timestamps
    if 'statement_created_at' not in provenance:
        errors.append("Missing mandatory 'statement_created_at' timestamp")
    if 'source_archived_at' not in provenance:
        errors.append("Missing mandatory 'source_archived_at' timestamp")
    return errors
```
### Rule 2: Timestamps Must Be Valid ISO 8601
```python
from datetime import datetime

def validate_timestamp_format(timestamp: str) -> bool:
    try:
        datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
        return True
    except ValueError:
        return False
```
### Rule 3: source_archived_at <= statement_created_at
The source must be archived before, or at the same time as, the statement is created.
```python
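from datetime import datetime

def validate_timestamp_order(provenance: dict) -> bool:
    # Normalize 'Z' (as in Rule 2) so parsing also works before Python 3.11
    archived = datetime.fromisoformat(provenance['source_archived_at'].replace('Z', '+00:00'))
    created = datetime.fromisoformat(provenance['statement_created_at'].replace('Z', '+00:00'))
    return archived <= created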
```
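The three rules can also be applied in one pass. The following is a self-contained sketch rather than the pipeline's validator: `check_provenance` and `parse_aware` are hypothetical names, and rejecting timestamps without a timezone is an assumption drawn from the "ISO 8601 with timezone" format requirement.

```python
from datetime import datetime

MANDATORY = ("statement_created_at", "source_archived_at")


def parse_aware(value: str) -> datetime:
    """Parse ISO 8601, accepting a trailing 'Z'; reject values without a timezone."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no timezone")
    return parsed


def check_provenance(provenance: dict) -> list[str]:
    """Apply Rules 1-3 and return a list of human-readable errors."""
    errors = []
    parsed = {}
    for field in MANDATORY:
        if field not in provenance:  # Rule 1: presence
            errors.append(f"Missing mandatory '{field}' timestamp")
            continue
        try:  # Rule 2: format
            parsed[field] = parse_aware(provenance[field])
        except ValueError:
            errors.append(f"'{field}' is not a valid ISO 8601 timestamp with timezone")
    # Rule 3: ordering, checked only when both values parsed cleanly
    if len(parsed) == 2 and parsed["source_archived_at"] > parsed["statement_created_at"]:
        errors.append("'source_archived_at' must not be later than 'statement_created_at'")
    return errors


print(check_provenance({
    "statement_created_at": "2025-12-30T14:30:00Z",
    "source_archived_at": "2025-12-29T10:15:00Z",
}))
```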
---
## Migration Strategy for Existing Files
### Phase 1: Identify Files Needing Migration
```bash
# Count affected files
find data/custodian -name "*.yaml" -exec grep -l "agent: claude-conversation" {} \; | wc -l
# Result: 24,328 files
```
### Phase 2: Parse and Transform
For each file with `agent: claude-conversation`:
1. Extract existing `timestamp` field
2. Set `source_archived_at` = existing `timestamp`
3. Set `statement_created_at` = `annotation_date` if present; otherwise fall back to the existing `timestamp` (assume simultaneous, as in the derivation table)
4. Replace `agent: claude-conversation` with proper agent identifier
5. Add conversation UUID from path
### Phase 3: Validate and Write
```python
def migrate_provenance(data: dict) -> dict:
    """Migrate old claude-conversation provenance to new format."""
    if 'ch_annotator' in data:
        ch = data['ch_annotator']
        if ch.get('extraction_provenance', {}).get('agent') == 'claude-conversation':
            old_prov = ch['extraction_provenance']
            # Extract conversation UUID from path
            path = old_prov.get('path', '')
            conv_uuid = path.split('/')[-1] if '/conversations/' in path else None
            # Get timestamps
            source_archived_at = old_prov.get('timestamp')
            statement_created_at = ch.get('annotation_provenance', {}).get('annotation_date', source_archived_at)
            # Build new provenance
            ch['extraction_provenance'] = {
                'namespace': old_prov.get('namespace', 'glam'),
                'source_type': 'claude_conversation_export',
                'source_path': old_prov.get('path'),
                'conversation_uuid': conv_uuid,
                'statement_created_at': statement_created_at,
                'source_archived_at': source_archived_at,
                'agent': 'opencode-claude-sonnet-4',  # Default migration value
                'context_convention': old_prov.get('context_convention'),
                'migration_note': 'Migrated from agent:claude-conversation on 2025-12-30'
            }
    return data
```
---
## Implementation Checklist
- [ ] Add `statement_created_at` to all new provenance statements
- [ ] Add `source_archived_at` to all new provenance statements
- [ ] Replace `agent: claude-conversation` with proper agent identifiers
- [ ] Add conversation UUIDs where applicable
- [ ] Migrate existing 24,328 files with invalid provenance
- [ ] Update LinkML schema to require dual timestamps
- [ ] Add validation to data pipeline
---
## Related Documentation
- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Web claim provenance structure
- `.opencode/YAML_PROVENANCE_SCHEMA.md` - YAML enrichment provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath provenance requirements
- `AGENTS.md` - Rule 35: Provenance Timestamps