# Migration Specification: `agent: claude-conversation` → Proper Provenance

**Created**: 2025-12-30
**Status**: SPECIFICATION (Not Yet Implemented)
**Related**: PROVENANCE_TIMESTAMP_RULES.md, WEB_OBSERVATION_PROVENANCE_RULES.md

## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (vague, non-specific agent identifier)
- A single `timestamp` field (violates Rule 35: dual timestamp requirement)
- No distinction between statement creation and source archival

## Affected Files

All files matching:

```bash
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```

### Provenance Locations in Each File

1. **`ch_annotator.extraction_provenance.agent`** - Top-level extraction agent
2. **`ch_annotator.entity_claims[].provenance.agent`** - Per-claim provenance (multiple instances)

## Source Data Categories

The 24,328 files come from different original sources, requiring different migration strategies:

### Category 1: ISIL Registry / CSV Sources (~18,000 files)

**Examples**: Japan, Austria, Switzerland, Czech, Bulgarian, Belgian ISIL registries

**Characteristics**:

- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`

**Migration Strategy** (Scripted):

```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'    # When the CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'  # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```

### Category 2: Conversation-Extracted Data (~4,000 files)

**Examples**: Palestinian heritage custodians, some Latin
American institutions

**Characteristics**:

- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Need to trace back to the original sources mentioned IN the conversation

**Migration Strategy** (Requires GLM4.7 + Manual Review):

1. Load the conversation JSON file
2. Use GLM4.7 to identify the ACTUAL sources mentioned in the conversation
3. For each source type:
   - **Web sources**: Use web-reader to archive + extract with XPath
   - **Wikidata**: Add Wikidata entity provenance
   - **Academic sources**: Add DOI/citation provenance

### Category 3: Web-Enriched Data (~2,000 files)

**Examples**: Institutions with `web_enrichment`, `google_maps_enrichment`

**Characteristics**:

- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment

**Migration Strategy** (Requires web-reader + Playwright):

1. Re-archive source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from archival metadata

## Migration Pipeline Architecture

```
┌──────────────┐    ┌──────────────────┐    ┌───────────────────┐
│ Categorizer  │ ─▶ │ Source Resolver  │ ─▶ │ Provenance        │
│              │    │                  │    │ Generator         │
│ - Detect     │    │ - CSV Registry   │    │                   │
│   source     │    │ - Conversation   │    │ - Dual            │
│   type       │    │ - Web Archive    │    │   timestamps      │
│ - Route to   │    │ - Wikidata       │    │ - Valid agent     │
│   handler    │    │                  │    │ - Source refs     │
└──────────────┘    └──────────────────┘    └───────────────────┘
       │                     │                        │
       ▼                     ▼                        ▼
┌──────────────────────────────────────────────────────────────┐
│                   Source-Specific Handlers                   │
├──────────────────────────────────────────────────────────────┤
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ ISIL/CSV       │  │ Conversation   │  │ Web Archive    │  │
│  │ Handler        │  │ Handler        │  │ Handler        │  │
│  │                │  │                │  │                │  │
│  │ - Read CSV     │  │ - Parse JSON   │  │ - Playwright   │  │
│  │ - Map to       │  │ - GLM4.7       │  │ - web-reader   │  │
│  │   timestamps   │  │   analysis     │  │ - XPath        │  │
│  │ - Update       │  │ - Source       │  │   extraction   │  │
│  │   provenance   │  │   tracing      │  │                │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                       Validation Layer                       │
├──────────────────────────────────────────────────────────────┤
│ - Dual timestamp check (Rule 35)                             │
│ - Agent identifier validation                                │
│ - source_archived_at <= statement_created_at                 │
│ - XPath verification (where applicable)                      │
└──────────────────────────────────────────────────────────────┘
```

## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

**Scope**: ~18,000 files
**Effort**: 1-2 days of scripting
**Tools**: Python script only

For files where `path` matches `/files/*.yaml` or `/files/*.csv`:

- Parse `annotation_date` to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with a source-specific agent

### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

**Scope**: ~4,000 files
**Effort**: 3-5 days with LLM processing
**Tools**: GLM4.7 API, conversation JSON parser

For each file with `path: /conversations/{uuid}`:

1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis

### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

**Scope**: ~2,000 files
**Effort**: 5-10 days with web archival
**Tools**: Playwright, web-reader MCP, GLM4.7

For files with web-derived claims:

1. Archive source URLs using Playwright
2.
   Extract claims with XPath using web-reader
3. Generate dual timestamps from archival metadata

## File Updates

### Per-File Changes

For each of the 24,328 files:

1. **Update `ch_annotator.extraction_provenance`**:

   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator

     # NEW: Source classification
     source_type: isil_registry_csv

     # NEW: Migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. **Update each `ch_annotator.entity_claims[].provenance`**:

   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```

## Validation Criteria

After migration, every provenance block MUST pass:

1. ✅ `statement_created_at` is present (ISO 8601)
2. ✅ `source_archived_at` is present (ISO 8601)
3. ✅ `source_archived_at <= statement_created_at`
4. ✅ `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. ✅ `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`

## Rollback Strategy

Before migration:

1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review

## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`
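## Appendix: Categorizer Sketch

The Categorizer stage of the pipeline above can be sketched as a small routing function. This is an illustrative sketch, not existing code: the function name `categorize` and the returned category labels are assumptions, and the heuristics simply mirror the three source categories described in this spec.

```python
def categorize(doc: dict) -> str:
    """Route one parsed custodian YAML document to a migration handler.

    Hypothetical sketch: real detection may need additional signals
    beyond the provenance path and enrichment keys.
    """
    path = (doc.get("ch_annotator", {})
               .get("extraction_provenance", {})
               .get("path", ""))
    # Category 3 is detected by enrichment keys, not by path
    if "web_enrichment" in doc or "google_maps_enrichment" in doc:
        return "web_enriched"            # Category 3: web-reader + Playwright
    if path.startswith("/conversations/"):
        return "conversation_extracted"  # Category 2: GLM4.7 + manual review
    if path.startswith("/files/"):
        return "isil_registry_csv"       # Category 1: scripted migration
    return "unknown"                     # flag for manual triage
```

Files that fall through to `"unknown"` should be queued for manual review rather than migrated automatically.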
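## Appendix: Validation Sketch

The validation criteria above can be checked mechanically once each provenance block is parsed into a plain dict. A minimal sketch of such a checker follows; the function name `validate_provenance` and the agent-format regex are assumptions and would need to match the project's actual naming rules.

```python
import re
from datetime import datetime

# Agent identifiers the migration must eliminate (Validation Criterion 4).
FORBIDDEN_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}

# Approximation of "{tool}-{model}-{version}" or "{script-name}";
# the exact pattern is an assumption, not project policy.
AGENT_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9._]+)+$")


def validate_provenance(block: dict) -> list[str]:
    """Return human-readable violations for one provenance block."""
    errors = []
    created = block.get("statement_created_at")
    archived = block.get("source_archived_at")
    agent = block.get("agent", "")

    # Criteria 1-2: both timestamps present and ISO 8601
    for field, value in (("statement_created_at", created),
                         ("source_archived_at", archived)):
        if value is None:
            errors.append(f"missing {field}")
        else:
            try:
                datetime.fromisoformat(value)
            except ValueError:
                errors.append(f"{field} is not ISO 8601: {value!r}")

    # Criterion 3: archival must not postdate statement creation
    if created and archived:
        try:
            if datetime.fromisoformat(archived) > datetime.fromisoformat(created):
                errors.append("source_archived_at > statement_created_at")
        except ValueError:
            pass  # already reported above

    # Criteria 4-5: agent identifier checks
    if agent in FORBIDDEN_AGENTS:
        errors.append(f"forbidden agent identifier: {agent!r}")
    elif not AGENT_PATTERN.match(agent):
        errors.append(f"agent does not match expected format: {agent!r}")

    return errors
```

Run over both `extraction_provenance` and every `entity_claims[].provenance` block; an empty list means the block passes all five criteria.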