glam/.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md

# Migration Specification: `agent: claude-conversation` → Proper Provenance

- **Created:** 2025-12-30
- **Status:** SPECIFICATION (Not Yet Implemented)
- **Related:** `PROVENANCE_TIMESTAMP_RULES.md`, `WEB_OBSERVATION_PROVENANCE_RULES.md`

## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (a vague, non-specific agent identifier)
- a single `timestamp` field (violating Rule 35, the dual-timestamp requirement)
- no distinction between statement creation and source archival

## Affected Files

All files matching:

```shell
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```

### Provenance Locations in Each File

1. `ch_annotator.extraction_provenance.agent` - top-level extraction agent
2. `ch_annotator.entity_claims[].provenance.agent` - per-claim provenance (multiple instances)
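The two locations above can be walked with a short helper; a minimal sketch, assuming the field names exactly as listed (real files may nest additional keys):

```python
def find_claude_conversation_blocks(doc: dict) -> list:
    """Collect every provenance dict in a loaded custodian document
    that still carries the vague agent identifier."""
    hits = []
    annotator = doc.get("ch_annotator", {})
    # Location 1: top-level extraction provenance
    extraction = annotator.get("extraction_provenance", {})
    if extraction.get("agent") == "claude-conversation":
        hits.append(extraction)
    # Location 2: per-claim provenance (multiple instances)
    for claim in annotator.get("entity_claims", []):
        provenance = claim.get("provenance", {})
        if provenance.get("agent") == "claude-conversation":
            hits.append(provenance)
    return hits

# A toy document with one stale block in each location:
doc = {"ch_annotator": {
    "extraction_provenance": {"agent": "claude-conversation"},
    "entity_claims": [
        {"provenance": {"agent": "claude-conversation"}},
        {"provenance": {"agent": "batch-script-create-custodian-from-ch-annotator"}},
    ],
}}
print(len(find_claude_conversation_blocks(doc)))  # prints 2
```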

## Source Data Categories

The 24,328 files come from different original sources, requiring different migration strategies:

### Category 1: ISIL Registry / CSV Sources (~18,000 files)

Examples: the Japanese, Austrian, Swiss, Czech, Bulgarian, and Belgian ISIL registries

Characteristics:

- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`

**Migration Strategy (Scripted):**

```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'  # When the CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'  # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```
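The BEFORE → AFTER rewrite above is mechanical and can be expressed as a pure function; a sketch under the assumption that the old block always carries exactly `path`, `timestamp`, and `agent` (the function name and signature are illustrative):

```python
def migrate_csv_provenance(old: dict, annotation_date: str) -> dict:
    """Rewrite a Category 1 provenance block: the old single timestamp
    becomes source_archived_at, and the file's annotation_date supplies
    statement_created_at (Rule 35 dual timestamps)."""
    return {
        "source_type": "isil_registry_csv",
        "source_path": old["path"],
        "source_archived_at": old["timestamp"],   # when the CSV was processed
        "statement_created_at": annotation_date,  # from annotation_date
        "agent": "batch-script-create-custodian-from-ch-annotator",
        "context_convention": "ch_annotator-v1_7_0",
    }

before = {
    "path": "/files/japan_complete.yaml",
    "timestamp": "2025-11-18T14:46:40.580095+00:00",
    "agent": "claude-conversation",
}
after = migrate_csv_provenance(before, "2025-12-06T21:13:31.304940+00:00")
```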

### Category 2: Conversation-Extracted Data (~4,000 files)

Examples: Palestinian heritage custodians, some Latin American institutions

Characteristics:

- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Needs tracing back to the original sources mentioned *in* the conversation

**Migration Strategy (Requires GLM4.7 + Manual Review):**

1. Load the conversation JSON file
2. Use GLM4.7 to identify the *actual* sources mentioned in the conversation
3. For each source type:
   - Web sources: use web-reader to archive the page and extract with XPath
   - Wikidata: add Wikidata entity provenance
   - Academic sources: add DOI/citation provenance
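Step 3 above is a dispatch on source type; a minimal sketch of that routing, where the `kind` values and handler names are assumptions for illustration, not a fixed schema:

```python
def route_conversation_source(source: dict) -> str:
    """Map a source identified by GLM4.7 inside a conversation export
    to the handler that will build its provenance."""
    handlers = {
        "web": "web-reader",            # archive + XPath extraction
        "wikidata": "wikidata-entity-provenance",
        "academic": "doi-citation-provenance",
    }
    # Anything the model could not classify falls back to manual review.
    return handlers.get(source.get("kind"), "manual-review")

print(route_conversation_source({"kind": "web"}))  # prints web-reader
```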

### Category 3: Web-Enriched Data (~2,000 files)

Examples: institutions with `web_enrichment` or `google_maps_enrichment` sections

Characteristics:

- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment

**Migration Strategy (Requires web-reader + Playwright):**

1. Re-archive the source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from the archival metadata

## Migration Pipeline Architecture

```text
┌─────────────────────────────────────────────────────────────────────┐
│                     MIGRATION PIPELINE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Categorizer │ ──▶ │ Source Resolver  │ ──▶ │ Provenance      │  │
│  │             │     │                  │     │ Generator       │  │
│  │ - Detect    │     │ - CSV Registry   │     │                 │  │
│  │   source    │     │ - Conversation   │     │ - Dual          │  │
│  │   type      │     │ - Web Archive    │     │   timestamps    │  │
│  │ - Route to  │     │ - Wikidata       │     │ - Valid agent   │  │
│  │   handler   │     │                  │     │ - Source refs   │  │
│  └─────────────┘     └──────────────────┘     └─────────────────┘  │
│        │                     │                        │             │
│        ▼                     ▼                        ▼             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    Source-Specific Handlers                  │   │
│  ├─────────────────────────────────────────────────────────────┤   │
│  │                                                              │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────┐  │   │
│  │  │ ISIL/CSV       │  │ Conversation   │  │ Web Archive   │  │   │
│  │  │ Handler        │  │ Handler        │  │ Handler       │  │   │
│  │  │                │  │                │  │               │  │   │
│  │  │ - Read CSV     │  │ - Parse JSON   │  │ - Playwright  │  │   │
│  │  │ - Map to       │  │ - GLM4.7       │  │ - web-reader  │  │   │
│  │  │   timestamps   │  │   analysis     │  │ - XPath       │  │   │
│  │  │ - Update       │  │ - Source       │  │   extraction  │  │   │
│  │  │   provenance   │  │   tracing      │  │               │  │   │
│  │  └────────────────┘  └────────────────┘  └───────────────┘  │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    Validation Layer                          │   │
│  ├─────────────────────────────────────────────────────────────┤   │
│  │ - Dual timestamp check (Rule 35)                             │   │
│  │ - Agent identifier validation                                │   │
│  │ - source_archived_at <= statement_created_at                 │   │
│  │ - XPath verification (where applicable)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
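The Categorizer stage in the diagram can be sketched as a path-based dispatch; a minimal sketch, assuming web-enriched files are detectable by the presence of a `web_enrichment` or `google_maps_enrichment` section (that detection rule is an assumption):

```python
def categorize(file_doc: dict) -> int:
    """Route one loaded custodian file to handler 1 (ISIL/CSV),
    2 (conversation), or 3 (web archive) based on its extraction
    provenance path and enrichment sections."""
    # Web enrichment wins first: such files need re-archival regardless
    # of where the base record originally came from.
    if "web_enrichment" in file_doc or "google_maps_enrichment" in file_doc:
        return 3  # Web Archive Handler
    path = (file_doc.get("ch_annotator", {})
                    .get("extraction_provenance", {})
                    .get("path", ""))
    if path.startswith("/conversations/"):
        return 2  # Conversation Handler
    if path.startswith("/files/"):
        return 1  # ISIL/CSV Handler
    raise ValueError(f"cannot categorize provenance path: {path!r}")
```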

## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

- **Scope:** ~18,000 files
- **Effort:** 1-2 days of scripting
- **Tools:** Python script only

For files where `path` matches `/files/*.yaml` or `/files/*.csv`:

- Parse `annotation_date` to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with a source-specific agent

### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

- **Scope:** ~4,000 files
- **Effort:** 3-5 days with LLM processing
- **Tools:** GLM4.7 API, conversation JSON parser

For each file with `path: /conversations/{uuid}`:

1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis

### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

- **Scope:** ~2,000 files
- **Effort:** 5-10 days with web archival
- **Tools:** Playwright, web-reader MCP, GLM4.7

For files with web-derived claims:

1. Archive the source URLs using Playwright
2. Extract claims with XPath using web-reader
3. Generate dual timestamps from the archival metadata
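Step 3 can be sketched as follows, assuming the Playwright archival step records an `archived_at` timestamp in its metadata (that key name is an assumption):

```python
from datetime import datetime, timezone

def dual_timestamps(archival_meta: dict) -> dict:
    """Build the Rule 35 timestamp pair for a freshly archived web claim:
    source_archived_at comes from the archival metadata, and
    statement_created_at is stamped now, at extraction time."""
    archived_at = archival_meta["archived_at"]
    created_at = datetime.now(timezone.utc).isoformat()
    # The archive must exist before a statement can be made from it.
    # (String comparison is safe here: both values are ISO 8601 with
    # the same +00:00 offset format.)
    assert archived_at <= created_at
    return {
        "source_archived_at": archived_at,
        "statement_created_at": created_at,
    }
```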

## File Updates

### Per-File Changes

For each of the 24,328 files:

1. Update `ch_annotator.extraction_provenance`:

   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator

     # NEW: source classification
     source_type: isil_registry_csv

     # NEW: migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. Update each `ch_annotator.entity_claims[].provenance`:

   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```

## Validation Criteria

After migration, every provenance block MUST pass:

1. `statement_created_at` is present (ISO 8601)
2. `source_archived_at` is present (ISO 8601)
3. `source_archived_at` <= `statement_created_at`
4. `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`
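The five criteria translate directly into a checker; a sketch, where the agent-format regex is an approximation of the `{tool}-{model}-{version}` / `{script-name}` convention rather than an official grammar:

```python
import re
from datetime import datetime

FORBIDDEN_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}

def validate_provenance(prov: dict) -> list:
    """Return a list of violations for one provenance block (empty = pass)."""
    errors = []
    # Criteria 1-2: both timestamps present and ISO 8601.
    for field in ("statement_created_at", "source_archived_at"):
        try:
            datetime.fromisoformat(prov.get(field, ""))
        except ValueError:
            errors.append(f"{field} missing or not ISO 8601")
    # Criterion 3: archive precedes statement (string comparison is safe
    # when both values share the same ISO 8601 offset format).
    if not errors and prov["source_archived_at"] > prov["statement_created_at"]:
        errors.append("source_archived_at must be <= statement_created_at")
    # Criteria 4-5: agent identifier is specific and well-formed.
    agent = prov.get("agent", "")
    if agent in FORBIDDEN_AGENTS:
        errors.append(f"forbidden agent identifier: {agent}")
    elif not re.fullmatch(r"[a-z0-9]+(-[a-z0-9_.]+)+", agent):
        errors.append(f"agent not in expected format: {agent!r}")
    return errors
```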

## Rollback Strategy

Before migration:

1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review

## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`