# Migration Specification: `agent: claude-conversation` → Proper Provenance

**Created:** 2025-12-30 · **Status:** SPECIFICATION (Not Yet Implemented) · **Related:** PROVENANCE_TIMESTAMP_RULES.md, WEB_OBSERVATION_PROVENANCE_RULES.md
## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (a vague, non-specific agent identifier)
- A single `timestamp` field (violates Rule 35: the dual timestamp requirement)
- No distinction between statement creation and source archival
## Affected Files

All files matching:

```bash
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```
### Provenance Locations in Each File

- `ch_annotator.extraction_provenance.agent`: top-level extraction agent
- `ch_annotator.entity_claims[].provenance.agent`: per-claim provenance (multiple instances)
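Both locations can be walked with one small helper; a minimal sketch, operating on the dict that `yaml.safe_load` would produce for one custodian file (the helper name is hypothetical):

```python
def provenance_blocks(doc):
    """Yield every provenance dict that may carry the invalid agent."""
    ann = doc.get("ch_annotator", {})
    extraction = ann.get("extraction_provenance")
    if extraction is not None:
        yield extraction                      # top-level extraction agent
    for claim in ann.get("entity_claims", []):
        prov = claim.get("provenance")
        if prov is not None:
            yield prov                        # per-claim provenance

# Shape of one affected file, reduced to the relevant fields:
sample = {
    "ch_annotator": {
        "extraction_provenance": {"agent": "claude-conversation"},
        "entity_claims": [
            {"provenance": {"agent": "claude-conversation"}},
            {"provenance": {"agent": "claude-conversation"}},
        ],
    }
}
blocks = list(provenance_blocks(sample))
```

Every migration handler below can reuse this traversal so the extraction-level and claim-level blocks are always updated together.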
## Source Data Categories
The 24,328 files come from different original sources, requiring different migration strategies:
### Category 1: ISIL Registry / CSV Sources (~18,000 files)

Examples: Japanese, Austrian, Swiss, Czech, Bulgarian, and Belgian ISIL registries

Characteristics:

- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`
**Migration Strategy (Scripted):**

```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'   # When CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00' # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```
### Category 2: Conversation-Extracted Data (~4,000 files)

Examples: Palestinian heritage custodians, some Latin American institutions

Characteristics:

- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Need to trace back to the original sources mentioned IN the conversation
**Migration Strategy (Requires GLM4.7 + Manual Review):**

1. Load the conversation JSON file
2. Use GLM4.7 to identify the ACTUAL sources mentioned in the conversation
3. For each source type:
   - Web sources: use web-reader to archive + extract with XPath
   - Wikidata: add Wikidata entity provenance
   - Academic sources: add DOI/citation provenance
### Category 3: Web-Enriched Data (~2,000 files)

Examples: institutions with `web_enrichment`, `google_maps_enrichment`

Characteristics:

- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment
**Migration Strategy (Requires web-reader + Playwright):**

1. Re-archive source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from archival metadata
## Migration Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ MIGRATION PIPELINE                                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ ┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐  │
│ │ Categorizer │ ──▶  │ Source Resolver  │ ──▶  │ Provenance      │  │
│ │             │      │                  │      │ Generator       │  │
│ │ - Detect    │      │ - CSV Registry   │      │                 │  │
│ │   source    │      │ - Conversation   │      │ - Dual          │  │
│ │   type      │      │ - Web Archive    │      │   timestamps    │  │
│ │ - Route to  │      │ - Wikidata       │      │ - Valid agent   │  │
│ │   handler   │      │                  │      │ - Source refs   │  │
│ └─────────────┘      └──────────────────┘      └─────────────────┘  │
│        │                      │                        │            │
│        ▼                      ▼                        ▼            │
│ ┌─────────────────────────────────────────────────────────────┐     │
│ │                  Source-Specific Handlers                   │     │
│ ├─────────────────────────────────────────────────────────────┤     │
│ │                                                             │     │
│ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────┐     │     │
│ │ │ ISIL/CSV       │ │ Conversation   │ │ Web Archive   │     │     │
│ │ │ Handler        │ │ Handler        │ │ Handler       │     │     │
│ │ │                │ │                │ │               │     │     │
│ │ │ - Read CSV     │ │ - Parse JSON   │ │ - Playwright  │     │     │
│ │ │ - Map to       │ │ - GLM4.7       │ │ - web-reader  │     │     │
│ │ │   timestamps   │ │   analysis     │ │ - XPath       │     │     │
│ │ │ - Update       │ │ - Source       │ │   extraction  │     │     │
│ │ │   provenance   │ │   tracing      │ │               │     │     │
│ │ └────────────────┘ └────────────────┘ └───────────────┘     │     │
│ │                                                             │     │
│ └─────────────────────────────────────────────────────────────┘     │
│                                                                     │
│ ┌─────────────────────────────────────────────────────────────┐     │
│ │                      Validation Layer                       │     │
│ ├─────────────────────────────────────────────────────────────┤     │
│ │ - Dual timestamp check (Rule 35)                            │     │
│ │ - Agent identifier validation                               │     │
│ │ - source_archived_at <= statement_created_at                │     │
│ │ - XPath verification (where applicable)                     │     │
│ └─────────────────────────────────────────────────────────────┘     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
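The Categorizer stage can be sketched as a path-based router; a minimal sketch, assuming routing follows the three source categories defined above (the function name is hypothetical):

```python
def categorize(provenance_path: str, has_web_enrichment: bool = False) -> str:
    """Route one custodian file to its migration handler."""
    if has_web_enrichment:
        return "web_archive"            # Category 3: web-enriched data
    if provenance_path.startswith("/files/"):
        return "isil_registry_csv"      # Category 1: ISIL/CSV registries
    if provenance_path.startswith("/conversations/"):
        return "conversation"           # Category 2: conversation-extracted
    return "unknown"                    # flag for manual review
```

Files that fall through to `"unknown"` should be collected into a report rather than migrated automatically.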
## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

**Scope:** ~18,000 files · **Effort:** 1-2 days of scripting · **Tools:** Python script only
For files where `path` matches `/files/*.yaml` or `/files/*.csv`:

1. Parse `annotation_date` to get `statement_created_at`
2. Use the original file's processing timestamp for `source_archived_at`
3. Replace `agent: claude-conversation` with a source-specific agent
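A minimal sketch of the per-block rewrite, assuming `annotation_date` has already been read from the file (the helper name is hypothetical; field names follow the BEFORE/AFTER example above):

```python
def migrate_category1(prov: dict, annotation_date: str) -> dict:
    """Rewrite one Category 1 provenance block (returns a new dict)."""
    out = dict(prov)
    out["source_type"] = "isil_registry_csv"
    out["source_path"] = out.pop("path", None)
    out["source_archived_at"] = out.pop("timestamp")   # when the CSV was processed
    out["statement_created_at"] = annotation_date      # from annotation_date
    out["agent"] = "batch-script-create-custodian-from-ch-annotator"
    return out

before = {
    "path": "/files/japan_complete.yaml",
    "timestamp": "2025-11-18T14:46:40.580095+00:00",
    "agent": "claude-conversation",
}
after = migrate_category1(before, "2025-12-06T21:13:31.304940+00:00")
```

The same function applies unchanged to the per-claim blocks, since they inherit the parent's timestamps.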
### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

**Scope:** ~4,000 files · **Effort:** 3-5 days with LLM processing · **Tools:** GLM4.7 API, conversation JSON parser
For each file with `path: /conversations/{uuid}`:

1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis
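Steps 1 and 2 can be sketched as a pure preprocessing helper; a minimal sketch, assuming a conversation export with a `messages[].text` shape (the export schema and prompt wording are assumptions, and the actual GLM4.7 API call is omitted):

```python
import json

def build_source_prompt(conversation_json: str) -> str:
    """Load one conversation export and build the source-identification prompt."""
    convo = json.loads(conversation_json)
    text = "\n".join(m.get("text", "") for m in convo.get("messages", []))
    return (
        "List every external data source (URL, Wikidata QID, DOI) "
        "mentioned in the following conversation, one per line:\n\n" + text
    )
```

Keeping prompt construction separate from the API call makes the expensive LLM step easy to dry-run and review before the 4,000-file batch.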
### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

**Scope:** ~2,000 files · **Effort:** 5-10 days with web archival · **Tools:** Playwright, web-reader MCP, GLM4.7
For files with web-derived claims:

1. Archive source URLs using Playwright
2. Extract claims with XPath using web-reader
3. Generate dual timestamps from archival metadata
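Step 3 can be sketched independently of the browser work; a minimal sketch of building the dual-timestamped claim provenance, assuming the Playwright archival step already produced an `archived_at` timestamp (the `web-reader-playwright` agent identifier and field shape are assumptions):

```python
from datetime import datetime, timezone

def web_claim_provenance(url: str, xpath: str, archived_at: str) -> dict:
    """Build a dual-timestamp provenance block for one web-derived claim."""
    return {
        "source_type": "web_archive",
        "source_url": url,
        "xpath": xpath,                     # where the claim was extracted
        "source_archived_at": archived_at,  # from the Playwright archive metadata
        "statement_created_at": datetime.now(timezone.utc).isoformat(),
        "agent": "web-reader-playwright",   # hypothetical agent identifier
    }
```

Because `source_archived_at` comes from the archive and `statement_created_at` is stamped at write time, the Rule 35 ordering (`source_archived_at <= statement_created_at`) holds by construction.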
## File Updates

### Per-File Changes
For each of the 24,328 files:
1. Update `ch_annotator.extraction_provenance`:

   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0
     # NEW: Dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'
     # NEW: Valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator
     # NEW: Source classification
     source_type: isil_registry_csv
     # NEW: Migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. Update each `ch_annotator.entity_claims[].provenance`:

   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0
     # NEW: Dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'
     # NEW: Valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```
## Validation Criteria
After migration, every provenance block MUST pass:
- ✅ `statement_created_at` is present (ISO 8601)
- ✅ `source_archived_at` is present (ISO 8601)
- ✅ `source_archived_at <= statement_created_at`
- ✅ `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
- ✅ `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`
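These criteria translate directly into a small checker for the Validation Layer; a minimal sketch (the function name is hypothetical, and the agent-format check from the last criterion is omitted for brevity):

```python
from datetime import datetime

FORBIDDEN_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}

def validate_provenance(prov: dict) -> list:
    """Return the list of violations for one provenance block (empty = pass)."""
    errors = []
    for field in ("statement_created_at", "source_archived_at"):
        try:
            datetime.fromisoformat(prov[field])
        except (KeyError, ValueError):
            errors.append(f"{field} missing or not ISO 8601")
    if not errors:
        archived = datetime.fromisoformat(prov["source_archived_at"])
        created = datetime.fromisoformat(prov["statement_created_at"])
        if archived > created:
            errors.append("source_archived_at is after statement_created_at")
    if prov.get("agent") in FORBIDDEN_AGENTS:
        errors.append(f"forbidden agent identifier: {prov.get('agent')}")
    return errors
```

Running this over all 24,328 files after each phase gives a pass/fail count per category before the backups are discarded.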
## Rollback Strategy
Before migration:
1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review
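A minimal sketch of steps 1 and 2, assuming the paths above (`backup_custodian_tree` and `stash_original` are hypothetical helper names):

```python
import shutil

def backup_custodian_tree(src: str = "data/custodian",
                          stamp: str = "2025-12-30") -> str:
    """Step 1: copy the whole tree before touching any file."""
    dst = f"{src}.backup.{stamp}"
    shutil.copytree(src, dst)
    return dst

def stash_original(prov: dict) -> dict:
    """Step 2: keep the pre-migration block inside the file itself."""
    out = dict(prov)
    out["_migration_backup"] = dict(prov)
    return out
```

Rolling back a single file then means restoring `_migration_backup`; rolling back everything means replacing `data/custodian/` with the backup tree.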
## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`