# Provenance Timestamp Rules

**Created**: 2025-12-30
**Updated**: 2025-12-30
**Status**: Active Rule
**Related**: WEB_CLAIM_PROVENANCE_SCHEMA.md, YAML_PROVENANCE_SCHEMA.md, WEB_OBSERVATION_PROVENANCE_RULES.md

## Core Principle: Every Provenance Statement MUST Have At Least Two Timestamps

**All provenance statements in custodian data MUST include at minimum two timestamps:**

1. **`statement_created_at`** - When the provenance statement/claim was created (extraction/annotation time)
2. **`source_archived_at`** - When the source material was archived/captured

These two timestamps are MANDATORY. Additional temporal metadata is encouraged but optional.

---

## Mandatory Timestamps

### 1. Statement Created Timestamp (`statement_created_at`)

**Purpose**: Records when the claim/statement was extracted, annotated, or created by the agent.

**Format**: ISO 8601 with timezone (UTC preferred)

**Example**:
```yaml
statement_created_at: "2025-12-30T14:30:00Z"
```

**Source**: Generated by the extraction/annotation agent at processing time.
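
A minimal sketch of how an agent might generate this value at processing time, assuming Python and a UTC clock. The helper name `utc_now_iso` is hypothetical and not part of any existing pipeline code.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Return the current UTC time as an ISO 8601 string with a 'Z' suffix."""
    # Drop microseconds to match the second-level precision used in the examples.
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return now.isoformat().replace("+00:00", "Z")


# Hypothetical usage: stamp a provenance record at extraction time.
provenance = {"statement_created_at": utc_now_iso()}
print(provenance)
```

Using an aware `datetime` (rather than a naive local time) keeps the timezone explicit, which the format requirement above asks for.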

### 2. Source Archived Timestamp (`source_archived_at`)

**Purpose**: Records when the source material (webpage, document, API response) was archived/captured.

**Format**: ISO 8601 with timezone (UTC preferred)

**Example**:
```yaml
source_archived_at: "2025-12-29T10:15:00Z"
```

**Source**:
- For web sources: Playwright archival timestamp, Wayback Machine memento datetime
- For API sources: API response fetch timestamp
- For documents: Document capture/download timestamp
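
For web sources, the memento datetime typically arrives as an HTTP-style date (for example in a `Memento-Datetime` response header). The sketch below shows one way to normalize such a value to the ISO 8601 UTC form used here. The function name `http_date_to_iso` and the sample header value are assumptions for illustration only.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def http_date_to_iso(http_date: str) -> str:
    """Convert an RFC 1123 style HTTP date to an ISO 8601 UTC string."""
    parsed = parsedate_to_datetime(http_date)
    if parsed.tzinfo is None:
        # Assumption: treat dates without usable zone information as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


# Hypothetical header value, chosen to match the example above.
memento_header = "Mon, 29 Dec 2025 10:15:00 GMT"
print(http_date_to_iso(memento_header))  # 2025-12-29T10:15:00Z
```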

---

## Optional Timestamps (Encouraged)

### 3. Source Created Timestamp (`source_created_at`)

**Purpose**: When the original source content was created/published.

**Example**:
```yaml
source_created_at: "2022-07-15T14:15:00Z"  # Article publish date
```

**Sources**:
- `article:published_time` meta tag
- `datePublished` in JSON-LD
- File creation date
- API response `created_at` field
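
As an illustration of the JSON-LD case, the following sketch reads `datePublished` from an already-extracted JSON-LD string. It assumes the JSON-LD is a single object or a flat list of objects and does not walk `@graph` structures; the function name and sample document are hypothetical.

```python
import json
from typing import Optional


def date_published_from_json_ld(raw_json_ld: str) -> Optional[str]:
    """Return the `datePublished` value from a JSON-LD string, if present."""
    try:
        doc = json.loads(raw_json_ld)
    except json.JSONDecodeError:
        return None
    # JSON-LD may be a single object or a list of objects.
    candidates = doc if isinstance(doc, list) else [doc]
    for item in candidates:
        if isinstance(item, dict) and "datePublished" in item:
            return item["datePublished"]
    return None


# Hypothetical JSON-LD snippet, as it might appear in a <script> tag.
sample = '{"@type": "NewsArticle", "datePublished": "2022-07-15T14:15:00Z"}'
provenance = {"source_created_at": date_published_from_json_ld(sample)}
print(provenance)
```

Returning `None` rather than guessing keeps the field optional: if no publish date can be found, `source_created_at` is simply omitted.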

### 4. Source Last Modified Timestamp (`source_last_modified_at`)

**Purpose**: When the source content was last updated.

**Example**:
```yaml
source_last_modified_at: "2023-01-10T09:00:00Z"
```

**Sources**:
- `article:modified_time` meta tag
- `dateModified` in JSON-LD
- HTTP `Last-Modified` header
- File modification date

### 5. Verification Timestamp (`last_verified_at`)

**Purpose**: When the claim was last re-verified against the source.

**Example**:
```yaml
last_verified_at: "2025-12-30T14:30:00Z"
```

### 6. Next Verification Due (`next_verification_due`)

**Purpose**: When the claim should be re-verified (for staleness tracking).

**Example**:
```yaml
next_verification_due: "2026-03-30T00:00:00Z"  # 90 days from last verification
```
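
The due date can be computed from `last_verified_at` rather than entered by hand. The sketch below assumes a 90-day interval and truncation to midnight UTC, both inferred from the example above rather than from a stated rule; the function name doubles as the field name purely for readability.

```python
from datetime import datetime, timedelta, timezone


def next_verification_due(last_verified_at: str, interval_days: int = 90) -> str:
    """Return the re-verification date as an ISO 8601 UTC string."""
    # Normalize 'Z' so fromisoformat() also works before Python 3.11.
    verified = datetime.fromisoformat(last_verified_at.replace("Z", "+00:00"))
    due = verified.astimezone(timezone.utc) + timedelta(days=interval_days)
    # Assumption: due dates are expressed at midnight UTC, as in the example.
    due = due.replace(hour=0, minute=0, second=0, microsecond=0)
    return due.isoformat().replace("+00:00", "Z")


print(next_verification_due("2025-12-30T14:30:00Z"))  # 2026-03-30T00:00:00Z
```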

---

## Complete Provenance Timestamp Structure

### For Web Claims

```yaml
provenance:
  # MANDATORY (both required)
  statement_created_at: "2025-12-30T14:30:00Z"     # When we extracted this
  source_archived_at: "2025-12-29T10:15:00Z"       # When we archived the webpage

  # OPTIONAL (encouraged)
  source_created_at: "2022-07-15T14:15:00Z"        # When article was published
  source_last_modified_at: "2023-01-10T09:00:00Z"  # When article was updated
  last_verified_at: "2025-12-30T14:30:00Z"         # Last verification
  next_verification_due: "2026-03-30T00:00:00Z"    # Re-verify in 90 days
```

### For API-Sourced Data (Wikidata, Google Maps, etc.)

```yaml
_provenance:
  # MANDATORY
  statement_created_at: "2025-12-30T14:30:00Z"     # When we processed API response
  source_archived_at: "2025-12-30T14:29:55Z"       # When API was queried (fetch_timestamp)

  # OPTIONAL
  source_last_modified_at: "2025-12-15T00:00:00Z"  # Wikidata entity last modified
  last_verified_at: "2025-12-30T14:30:00Z"
```

### For CH-Annotator Extracted Claims

```yaml
provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922

  # MANDATORY
  statement_created_at: "2025-12-06T21:13:56Z"  # When CH-Annotator processed this
  source_archived_at: "2025-11-06T08:02:44Z"    # When conversation was exported

  # Agent identification
  agent: opencode-claude-sonnet-4
  context_convention: ch_annotator-v1_7_0
```

---

## Invalid Provenance: `agent: claude-conversation`

**PROBLEM**: 24,328 custodian files currently contain provenance statements like:

```yaml
# INVALID - Missing timestamps and proper source identification
extraction_provenance:
  namespace: glam
  path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  timestamp: '2025-11-06T08:02:44.240037+00:00'  # Only ONE timestamp!
  agent: claude-conversation                      # Vague agent identifier
  context_convention: ch_annotator-v1_7_0
```

**ISSUES**:
1. `claude-conversation` is not a valid agent identifier (which Claude model? which session?)
2. Only one timestamp - doesn't distinguish statement creation from source archival
3. No UUID reference to the specific conversation
4. No archived source path
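
The four issues above can be checked mechanically. The following is a rough sketch, not an existing pipeline function: `legacy_provenance_issues` is a hypothetical name, and the field names it looks for (`conversation_uuid`, `source_path`, `archive`) are taken from the migration target structure in this document.

```python
MANDATORY = ("statement_created_at", "source_archived_at")


def legacy_provenance_issues(prov: dict) -> list[str]:
    """List the problems found in one extraction_provenance mapping."""
    issues = []
    if prov.get("agent") == "claude-conversation":
        issues.append("vague agent identifier 'claude-conversation'")
    missing = [field for field in MANDATORY if field not in prov]
    if missing:
        issues.append("missing mandatory timestamps: " + ", ".join(missing))
    if "conversation_uuid" not in prov:
        issues.append("no conversation UUID reference")
    if "archive" not in prov and "source_path" not in prov:
        issues.append("no archived source path")
    return issues


# The invalid example from above, as a Python dict.
legacy = {
    "namespace": "glam",
    "path": "/conversations/edc75d66-ee42-4199-8e22-65b0d2347922",
    "timestamp": "2025-11-06T08:02:44.240037+00:00",
    "agent": "claude-conversation",
    "context_convention": "ch_annotator-v1_7_0",
}
for issue in legacy_provenance_issues(legacy):
    print(issue)
```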

---

## Valid Provenance Structure (Migration Target)

```yaml
extraction_provenance:
  namespace: glam

  # Source identification
  source_type: claude_conversation_export
  source_path: /conversations/edc75d66-ee42-4199-8e22-65b0d2347922
  conversation_uuid: edc75d66-ee42-4199-8e22-65b0d2347922

  # MANDATORY timestamps
  statement_created_at: "2025-12-06T21:13:56.173868+00:00"  # Annotation time
  source_archived_at: "2025-11-06T08:02:44.240037+00:00"    # Conversation export time

  # Agent identification (proper format)
  agent:
    name: opencode-claude-sonnet-4
    model: claude-sonnet-4-20250514
    session_type: opencode_conversation

  # Context
  context_convention: ch_annotator-v1_7_0

  # Archive reference
  archive:
    format: claude_conversation_json
    local_path: data/conversations/edc75d66-ee42-4199-8e22-65b0d2347922.json
```

---

## Timestamp Hierarchy and Derivation

When only one timestamp is available, derive the other:

| Available | Derive `statement_created_at` | Derive `source_archived_at` |
|-----------|------------------------------|----------------------------|
| Only `timestamp` | Use as `statement_created_at` | Set to same value (assume simultaneous) |
| Only `extraction_date` | Use as `statement_created_at` | Set to same value |
| Only `fetch_timestamp` | Set to same value | Use as `source_archived_at` |
| Only `annotation_date` | Use as `statement_created_at` | Look for `timestamp` in source |

**Migration rule**: If we cannot determine `source_archived_at`, use the earliest available timestamp from the source chain.
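
One possible reading of the table and the migration rule, sketched in Python. The table only covers records with a single legacy field, so the precedence used when several fields are present is an assumption of this sketch, as is the function name `derive_mandatory_timestamps`. All values are assumed to carry a timezone.

```python
from datetime import datetime

KNOWN_FIELDS = ("timestamp", "extraction_date", "fetch_timestamp", "annotation_date")


def _parse(ts: str) -> datetime:
    # Normalize 'Z' so fromisoformat() also works before Python 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def derive_mandatory_timestamps(legacy: dict) -> dict:
    """Derive both mandatory timestamps from legacy timestamp fields."""
    available = [legacy[key] for key in KNOWN_FIELDS if legacy.get(key)]
    if not available:
        raise ValueError("no legacy timestamp to derive from")

    # Assumed precedence for the statement: annotation/extraction time first.
    created = (legacy.get("annotation_date") or legacy.get("extraction_date")
               or legacy.get("timestamp") or legacy.get("fetch_timestamp"))
    # Assumed precedence for the source: fetch time, then the generic timestamp.
    archived = legacy.get("fetch_timestamp") or legacy.get("timestamp")
    if archived is None:
        # Migration rule: fall back to the earliest available timestamp.
        archived = min(available, key=_parse)
    return {"statement_created_at": created, "source_archived_at": archived}


print(derive_mandatory_timestamps({"timestamp": "2025-11-06T08:02:44Z"}))
```

The result should still be passed through the validation rules, since derived values are not guaranteed to satisfy the ordering constraint.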

---

## Agent Identification Standards

### Invalid Agent Identifiers

```yaml
# INVALID - Too vague
agent: claude-conversation
agent: claude
agent: ai
agent: llm
agent: opencode
```

### Valid Agent Identifiers

```yaml
# Format: {tool}-{model}-{version}
agent: opencode-claude-sonnet-4
agent: opencode-claude-opus-4
agent: batch-script-python-3.11
agent: manual-human-curator

# Or structured format
agent:
  name: opencode-claude-sonnet-4
  model: claude-sonnet-4-20250514
  tool: opencode
  version: "1.0.0"
```
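
A loose check for these standards could look like the sketch below. The deny-list mirrors the invalid examples above; the "at least three hyphen-separated parts" test is only a heuristic stand-in for the `{tool}-{model}-{version}` format and is an assumption of this sketch, as is the function name `agent_errors`.

```python
from typing import Union

INVALID_AGENTS = {"claude-conversation", "claude", "ai", "llm", "opencode"}


def agent_errors(agent: Union[str, dict, None]) -> list[str]:
    """Return problems with an agent identifier (string or structured form)."""
    # The structured format carries the identifier in its `name` key.
    name = agent.get("name") if isinstance(agent, dict) else agent
    if not isinstance(name, str) or not name:
        return ["agent identifier is missing"]
    if name.lower() in INVALID_AGENTS:
        return [f"agent identifier '{name}' is too vague"]
    # Heuristic: {tool}-{model}-{version} implies at least three parts.
    if len(name.split("-")) < 3:
        return [f"agent identifier '{name}' does not look like tool-model-version"]
    return []


print(agent_errors("claude-conversation"))
print(agent_errors({"name": "opencode-claude-sonnet-4", "tool": "opencode"}))
```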

---

## PROV-O Alignment

These timestamps align with W3C PROV-O:

| Our Field | PROV-O Property | Description |
|-----------|-----------------|-------------|
| `statement_created_at` | `prov:generatedAtTime` | When entity was generated |
| `source_archived_at` | `prov:atTime` (on Activity) | When archival activity occurred |
| `source_created_at` | `dcterms:created` | Original creation date |
| `source_last_modified_at` | `dcterms:modified` | Last modification date |

```yaml
prov:
  generatedAtTime: "2025-12-30T14:30:00Z"  # = statement_created_at
  wasGeneratedBy:
    "@type": "prov:Activity"
    name: "web_extraction"
    atTime: "2025-12-29T10:15:00Z"  # = source_archived_at
```

---

## Validation Rules

### Rule 1: Both Mandatory Timestamps Required

```python
def validate_provenance_timestamps(provenance: dict) -> list[str]:
    """Return one error message per missing mandatory timestamp."""
    errors = []

    # Check for mandatory timestamps
    if 'statement_created_at' not in provenance:
        errors.append("Missing mandatory 'statement_created_at' timestamp")
    if 'source_archived_at' not in provenance:
        errors.append("Missing mandatory 'source_archived_at' timestamp")

    return errors
```

### Rule 2: Timestamps Must Be Valid ISO 8601

```python
from datetime import datetime

def validate_timestamp_format(timestamp: str) -> bool:
    """Return True if the value parses as ISO 8601 (a trailing 'Z' is accepted)."""
    try:
        # fromisoformat() only accepts 'Z' from Python 3.11, so normalize it first
        datetime.fromisoformat(timestamp.replace('Z', '+00:00'))
        return True
    except ValueError:
        return False
```

### Rule 3: source_archived_at <= statement_created_at

The source must be archived BEFORE or AT the same time as the statement is created.

```python
from datetime import datetime

def validate_timestamp_order(provenance: dict) -> bool:
    # Normalize a trailing 'Z' so fromisoformat() also works before Python 3.11
    archived = datetime.fromisoformat(provenance['source_archived_at'].replace('Z', '+00:00'))
    created = datetime.fromisoformat(provenance['statement_created_at'].replace('Z', '+00:00'))
    return archived <= created
```
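
The three rules can also be applied in a single pass. The sketch below is one way to combine them; the function name `validate_provenance` and the error wording are hypothetical, and the extra check for a missing timezone is an assumption based on the "ISO 8601 with timezone" format requirement.

```python
from datetime import datetime

MANDATORY = ("statement_created_at", "source_archived_at")


def validate_provenance(provenance: dict) -> list[str]:
    """Apply rules 1-3 and return all error messages (empty list if valid)."""
    errors = []
    parsed = {}
    for field in MANDATORY:
        value = provenance.get(field)
        if value is None:  # Rule 1: field must be present
            errors.append(f"Missing mandatory '{field}' timestamp")
            continue
        try:  # Rule 2: value must parse as ISO 8601
            moment = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"'{field}' is not valid ISO 8601: {value!r}")
            continue
        if moment.tzinfo is None:  # Assumption: a timezone is required
            errors.append(f"'{field}' has no timezone: {value!r}")
            continue
        parsed[field] = moment
    # Rule 3: only comparable when both timestamps parsed cleanly
    if len(parsed) == 2 and parsed["source_archived_at"] > parsed["statement_created_at"]:
        errors.append("'source_archived_at' is later than 'statement_created_at'")
    return errors


print(validate_provenance({
    "statement_created_at": "2025-12-30T14:30:00Z",
    "source_archived_at": "2025-12-29T10:15:00Z",
}))  # []
```

Collecting messages instead of returning at the first failure lets a migration report list every problem in a file at once.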

---

## Migration Strategy for Existing Files

### Phase 1: Identify Files Needing Migration

```bash
# Count affected files
find data/custodian -name "*.yaml" -exec grep -l "agent: claude-conversation" {} \; | wc -l
# Result: 24,328 files
```

### Phase 2: Parse and Transform

For each file with `agent: claude-conversation`:

1. Extract existing `timestamp` field
2. Set `source_archived_at` = existing `timestamp`
3. Set `statement_created_at` = `annotation_date` if present, else fall back to the existing `timestamp` (consistent with the derivation table; the migration run time would misdate the statement)
4. Replace `agent: claude-conversation` with proper agent identifier
5. Add conversation UUID from path

### Phase 3: Validate and Write

```python
def migrate_provenance(data: dict) -> dict:
    """Migrate old claude-conversation provenance to new format."""

    if 'ch_annotator' in data:
        ch = data['ch_annotator']

        if ch.get('extraction_provenance', {}).get('agent') == 'claude-conversation':
            old_prov = ch['extraction_provenance']

            # Extract conversation UUID from path
            path = old_prov.get('path', '')
            conv_uuid = path.split('/')[-1] if '/conversations/' in path else None

            # Get timestamps
            source_archived_at = old_prov.get('timestamp')
            statement_created_at = ch.get('annotation_provenance', {}).get('annotation_date', source_archived_at)

            # Build new provenance
            ch['extraction_provenance'] = {
                'namespace': old_prov.get('namespace', 'glam'),
                'source_type': 'claude_conversation_export',
                'source_path': old_prov.get('path'),
                'conversation_uuid': conv_uuid,
                'statement_created_at': statement_created_at,
                'source_archived_at': source_archived_at,
                'agent': 'opencode-claude-sonnet-4',  # Default migration value
                'context_convention': old_prov.get('context_convention'),
                'migration_note': 'Migrated from agent:claude-conversation on 2025-12-30'
            }

    return data
```

---

## Implementation Checklist

- [ ] Add `statement_created_at` to all new provenance statements
- [ ] Add `source_archived_at` to all new provenance statements
- [ ] Replace `agent: claude-conversation` with proper agent identifiers
- [ ] Add conversation UUIDs where applicable
- [ ] Migrate existing 24,328 files with invalid provenance
- [ ] Update LinkML schema to require dual timestamps
- [ ] Add validation to data pipeline

---

## Related Documentation

- `.opencode/WEB_CLAIM_PROVENANCE_SCHEMA.md` - Web claim provenance structure
- `.opencode/YAML_PROVENANCE_SCHEMA.md` - YAML enrichment provenance
- `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md` - XPath provenance requirements
- `AGENTS.md` - Rule 35: Provenance Timestamps