# Migration Specification: `agent: claude-conversation` → Proper Provenance

**Created**: 2025-12-30
**Status**: SPECIFICATION (Not Yet Implemented)
**Related**: PROVENANCE_TIMESTAMP_RULES.md, WEB_OBSERVATION_PROVENANCE_RULES.md

## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (vague, non-specific agent identifier)
- A single `timestamp` field (violates Rule 35: dual timestamp requirement)
- No distinction between statement creation and source archival

## Affected Files

All files matching:

```bash
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```

### Provenance Locations in Each File

1. **`ch_annotator.extraction_provenance.agent`** - Top-level extraction agent
2. **`ch_annotator.entity_claims[].provenance.agent`** - Per-claim provenance (multiple instances)

## Source Data Categories

The 24,328 files come from different original sources, requiring different migration strategies:

### Category 1: ISIL Registry / CSV Sources (~18,000 files)

**Examples**: Japan, Austria, Switzerland, Czech, Bulgarian, Belgian ISIL registries

**Characteristics**:

- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`

**Migration Strategy** (Scripted):

```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'    # When the CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'  # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```

### Category 2: Conversation-Extracted Data (~4,000 files)

**Examples**: Palestinian heritage custodians, some Latin
American institutions

**Characteristics**:

- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Need to trace back to the original sources mentioned IN the conversation

**Migration Strategy** (Requires GLM4.7 + Manual Review):

1. Load the conversation JSON file
2. Use GLM4.7 to identify the ACTUAL sources mentioned in the conversation
3. For each source type:
   - **Web sources**: Use web-reader to archive + extract with XPath
   - **Wikidata**: Add Wikidata entity provenance
   - **Academic sources**: Add DOI/citation provenance

### Category 3: Web-Enriched Data (~2,000 files)

**Examples**: Institutions with `web_enrichment`, `google_maps_enrichment`

**Characteristics**:

- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment

**Migration Strategy** (Requires web-reader + Playwright):

1. Re-archive source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from archival metadata

## Migration Pipeline Architecture

```
┌──────────────┐    ┌──────────────────┐    ┌───────────────────┐
│ Categorizer  │ ─▶ │ Source Resolver  │ ─▶ │ Provenance        │
│              │    │                  │    │ Generator         │
│ - Detect     │    │ - CSV Registry   │    │                   │
│   source     │    │ - Conversation   │    │ - Dual            │
│   type       │    │ - Web Archive    │    │   timestamps      │
│ - Route to   │    │ - Wikidata       │    │ - Valid agent     │
│   handler    │    │                  │    │ - Source refs     │
└──────────────┘    └──────────────────┘    └───────────────────┘
       │                     │                        │
       ▼                     ▼                        ▼
┌──────────────────────────────────────────────────────────────┐
│                   Source-Specific Handlers                   │
├──────────────────────────────────────────────────────────────┤
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ ISIL/CSV       │  │ Conversation   │  │ Web Archive    │  │
│  │ Handler        │  │ Handler        │  │ Handler        │  │
│  │                │  │                │  │                │  │
│  │ - Read CSV     │  │ - Parse JSON   │  │ - Playwright   │  │
│  │ - Map to       │  │ - GLM4.7       │  │ - web-reader   │  │
│  │   timestamps   │  │   analysis     │  │ - XPath        │  │
│  │ - Update       │  │ - Source       │  │   extraction   │  │
│  │   provenance   │  │   tracing      │  │                │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                       Validation Layer                       │
├──────────────────────────────────────────────────────────────┤
│ - Dual timestamp check (Rule 35)                             │
│ - Agent identifier validation                                │
│ - source_archived_at <= statement_created_at                 │
│ - XPath verification (where applicable)                      │
└──────────────────────────────────────────────────────────────┘
```

## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

**Scope**: ~18,000 files
**Effort**: 1-2 days of scripting
**Tools**: Python script only

For files where `path` matches `/files/*.yaml` or `/files/*.csv`:

- Parse `annotation_date` to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with a source-specific agent

### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

**Scope**: ~4,000 files
**Effort**: 3-5 days with LLM processing
**Tools**: GLM4.7 API, conversation JSON parser

For each file with `path: /conversations/{uuid}`:

1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis

### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

**Scope**: ~2,000 files
**Effort**: 5-10 days with web archival
**Tools**: Playwright, web-reader MCP, GLM4.7

For files with web-derived claims:

1. Archive source URLs using Playwright
2.
   Extract claims with XPath using web-reader
3. Generate dual timestamps from archival metadata

## File Updates

### Per-File Changes

For each of the 24,328 files:

1. **Update `ch_annotator.extraction_provenance`**:

   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator

     # NEW: Source classification
     source_type: isil_registry_csv

     # NEW: Migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. **Update each `ch_annotator.entity_claims[].provenance`**:

   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```

## Validation Criteria

After migration, every provenance block MUST pass:

1. ✅ `statement_created_at` is present (ISO 8601)
2. ✅ `source_archived_at` is present (ISO 8601)
3. ✅ `source_archived_at <= statement_created_at`
4. ✅ `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. ✅ `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`

## Rollback Strategy

Before migration:

1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review

## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`
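## Appendix: Categorizer Sketch

The Categorizer stage of the pipeline above can be sketched as a small routing function. This is an illustrative sketch, not existing code: the function name `categorize` and the returned category labels are assumptions, and the heuristics simply mirror the three source categories described in this spec.

```python
def categorize(doc: dict) -> str:
    """Route one parsed custodian YAML document to a migration handler.

    Hypothetical sketch: real detection may need additional signals
    beyond the provenance path and enrichment keys.
    """
    path = (doc.get("ch_annotator", {})
               .get("extraction_provenance", {})
               .get("path", ""))
    # Category 3 is detected by enrichment keys, not by path
    if "web_enrichment" in doc or "google_maps_enrichment" in doc:
        return "web_enriched"            # Category 3: web-reader + Playwright
    if path.startswith("/conversations/"):
        return "conversation_extracted"  # Category 2: GLM4.7 + manual review
    if path.startswith("/files/"):
        return "isil_registry_csv"       # Category 1: scripted migration
    return "unknown"                     # flag for manual triage
```

Files that fall through to `"unknown"` should be queued for manual review rather than migrated automatically.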
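## Appendix: Validation Sketch

The validation criteria above can be checked mechanically once each provenance block is parsed into a plain dict. A minimal sketch of such a checker follows; the function name `validate_provenance` and the agent-format regex are assumptions and would need to match the project's actual naming rules.

```python
import re
from datetime import datetime

# Agent identifiers the migration must eliminate (Validation Criterion 4).
FORBIDDEN_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}

# Approximation of "{tool}-{model}-{version}" or "{script-name}";
# the exact pattern is an assumption, not project policy.
AGENT_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9._]+)+$")


def validate_provenance(block: dict) -> list[str]:
    """Return human-readable violations for one provenance block."""
    errors = []
    created = block.get("statement_created_at")
    archived = block.get("source_archived_at")
    agent = block.get("agent", "")

    # Criteria 1-2: both timestamps present and ISO 8601
    for field, value in (("statement_created_at", created),
                         ("source_archived_at", archived)):
        if value is None:
            errors.append(f"missing {field}")
        else:
            try:
                datetime.fromisoformat(value)
            except ValueError:
                errors.append(f"{field} is not ISO 8601: {value!r}")

    # Criterion 3: archival must not postdate statement creation
    if created and archived:
        try:
            if datetime.fromisoformat(archived) > datetime.fromisoformat(created):
                errors.append("source_archived_at > statement_created_at")
        except ValueError:
            pass  # already reported above

    # Criteria 4-5: agent identifier checks
    if agent in FORBIDDEN_AGENTS:
        errors.append(f"forbidden agent identifier: {agent!r}")
    elif not AGENT_PATTERN.match(agent):
        errors.append(f"agent does not match expected format: {agent!r}")

    return errors
```

Run over both `extraction_provenance` and every `entity_claims[].provenance` block; an empty list means the block passes all five criteria.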