glam/.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md

# Migration Specification: `agent: claude-conversation` → Proper Provenance

- **Created:** 2025-12-30
- **Status:** SPECIFICATION (Not Yet Implemented)
- **Related:** `PROVENANCE_TIMESTAMP_RULES.md`, `WEB_OBSERVATION_PROVENANCE_RULES.md`

## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (a vague, non-specific agent identifier)
- a single `timestamp` field (violating Rule 35, the dual-timestamp requirement)
- no distinction between statement creation and source archival

## Affected Files

All files matching:

```shell
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```

### Provenance Locations in Each File

1. `ch_annotator.extraction_provenance.agent` - top-level extraction agent
2. `ch_annotator.entity_claims[].provenance.agent` - per-claim provenance (multiple instances)
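The two locations above can be walked with a short helper; a minimal sketch, assuming the field names exactly as listed (real files may nest additional keys):

```python
def find_claude_conversation_blocks(doc: dict) -> list:
    """Collect every provenance dict in a loaded custodian document
    that still carries the vague agent identifier."""
    hits = []
    annotator = doc.get("ch_annotator", {})
    # Location 1: top-level extraction provenance
    extraction = annotator.get("extraction_provenance", {})
    if extraction.get("agent") == "claude-conversation":
        hits.append(extraction)
    # Location 2: per-claim provenance (multiple instances)
    for claim in annotator.get("entity_claims", []):
        provenance = claim.get("provenance", {})
        if provenance.get("agent") == "claude-conversation":
            hits.append(provenance)
    return hits

# A toy document with one stale block in each location:
doc = {"ch_annotator": {
    "extraction_provenance": {"agent": "claude-conversation"},
    "entity_claims": [
        {"provenance": {"agent": "claude-conversation"}},
        {"provenance": {"agent": "batch-script-create-custodian-from-ch-annotator"}},
    ],
}}
print(len(find_claude_conversation_blocks(doc)))  # prints 2
```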

## Source Data Categories

The 24,328 files come from different original sources, requiring different migration strategies:

### Category 1: ISIL Registry / CSV Sources (~18,000 files)

Examples: the Japanese, Austrian, Swiss, Czech, Bulgarian, and Belgian ISIL registries

Characteristics:

- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`

**Migration Strategy (Scripted):**

```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'  # When the CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'  # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```
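The BEFORE → AFTER rewrite above is mechanical and can be expressed as a pure function; a sketch under the assumption that the old block always carries exactly `path`, `timestamp`, and `agent` (the function name and signature are illustrative):

```python
def migrate_csv_provenance(old: dict, annotation_date: str) -> dict:
    """Rewrite a Category 1 provenance block: the old single timestamp
    becomes source_archived_at, and the file's annotation_date supplies
    statement_created_at (Rule 35 dual timestamps)."""
    return {
        "source_type": "isil_registry_csv",
        "source_path": old["path"],
        "source_archived_at": old["timestamp"],   # when the CSV was processed
        "statement_created_at": annotation_date,  # from annotation_date
        "agent": "batch-script-create-custodian-from-ch-annotator",
        "context_convention": "ch_annotator-v1_7_0",
    }

before = {
    "path": "/files/japan_complete.yaml",
    "timestamp": "2025-11-18T14:46:40.580095+00:00",
    "agent": "claude-conversation",
}
after = migrate_csv_provenance(before, "2025-12-06T21:13:31.304940+00:00")
```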

### Category 2: Conversation-Extracted Data (~4,000 files)

Examples: Palestinian heritage custodians, some Latin American institutions

Characteristics:

- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Needs tracing back to the original sources mentioned *in* the conversation

**Migration Strategy (Requires GLM4.7 + Manual Review):**

1. Load the conversation JSON file
2. Use GLM4.7 to identify the *actual* sources mentioned in the conversation
3. For each source type:
   - Web sources: use web-reader to archive the page and extract with XPath
   - Wikidata: add Wikidata entity provenance
   - Academic sources: add DOI/citation provenance
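Step 3 above is a dispatch on source type; a minimal sketch of that routing, where the `kind` values and handler names are assumptions for illustration, not a fixed schema:

```python
def route_conversation_source(source: dict) -> str:
    """Map a source identified by GLM4.7 inside a conversation export
    to the handler that will build its provenance."""
    handlers = {
        "web": "web-reader",            # archive + XPath extraction
        "wikidata": "wikidata-entity-provenance",
        "academic": "doi-citation-provenance",
    }
    # Anything the model could not classify falls back to manual review.
    return handlers.get(source.get("kind"), "manual-review")

print(route_conversation_source({"kind": "web"}))  # prints web-reader
```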

### Category 3: Web-Enriched Data (~2,000 files)

Examples: institutions with `web_enrichment` or `google_maps_enrichment` sections

Characteristics:

- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment

**Migration Strategy (Requires web-reader + Playwright):**

1. Re-archive the source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from the archival metadata

## Migration Pipeline Architecture

```text
┌─────────────────────────────────────────────────────────────────────┐
│                     MIGRATION PIPELINE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Categorizer │ ──▶ │ Source Resolver  │ ──▶ │ Provenance      │  │
│  │             │     │                  │     │ Generator       │  │
│  │ - Detect    │     │ - CSV Registry   │     │                 │  │
│  │   source    │     │ - Conversation   │     │ - Dual          │  │
│  │   type      │     │ - Web Archive    │     │   timestamps    │  │
│  │ - Route to  │     │ - Wikidata       │     │ - Valid agent   │  │
│  │   handler   │     │                  │     │ - Source refs   │  │
│  └─────────────┘     └──────────────────┘     └─────────────────┘  │
│        │                     │                        │             │
│        ▼                     ▼                        ▼             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    Source-Specific Handlers                  │   │
│  ├─────────────────────────────────────────────────────────────┤   │
│  │                                                              │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────┐  │   │
│  │  │ ISIL/CSV       │  │ Conversation   │  │ Web Archive   │  │   │
│  │  │ Handler        │  │ Handler        │  │ Handler       │  │   │
│  │  │                │  │                │  │               │  │   │
│  │  │ - Read CSV     │  │ - Parse JSON   │  │ - Playwright  │  │   │
│  │  │ - Map to       │  │ - GLM4.7       │  │ - web-reader  │  │   │
│  │  │   timestamps   │  │   analysis     │  │ - XPath       │  │   │
│  │  │ - Update       │  │ - Source       │  │   extraction  │  │   │
│  │  │   provenance   │  │   tracing      │  │               │  │   │
│  │  └────────────────┘  └────────────────┘  └───────────────┘  │   │
│  │                                                              │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    Validation Layer                          │   │
│  ├─────────────────────────────────────────────────────────────┤   │
│  │ - Dual timestamp check (Rule 35)                             │   │
│  │ - Agent identifier validation                                │   │
│  │ - source_archived_at <= statement_created_at                 │   │
│  │ - XPath verification (where applicable)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
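The Categorizer stage in the diagram can be sketched as a path-based dispatch; a minimal sketch, assuming web-enriched files are detectable by the presence of a `web_enrichment` or `google_maps_enrichment` section (that detection rule is an assumption):

```python
def categorize(file_doc: dict) -> int:
    """Route one loaded custodian file to handler 1 (ISIL/CSV),
    2 (conversation), or 3 (web archive) based on its extraction
    provenance path and enrichment sections."""
    # Web enrichment wins first: such files need re-archival regardless
    # of where the base record originally came from.
    if "web_enrichment" in file_doc or "google_maps_enrichment" in file_doc:
        return 3  # Web Archive Handler
    path = (file_doc.get("ch_annotator", {})
                    .get("extraction_provenance", {})
                    .get("path", ""))
    if path.startswith("/conversations/"):
        return 2  # Conversation Handler
    if path.startswith("/files/"):
        return 1  # ISIL/CSV Handler
    raise ValueError(f"cannot categorize provenance path: {path!r}")
```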

## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

- **Scope:** ~18,000 files
- **Effort:** 1-2 days of scripting
- **Tools:** Python script only

For files where `path` matches `/files/*.yaml` or `/files/*.csv`:

- Parse `annotation_date` to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with a source-specific agent

### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

- **Scope:** ~4,000 files
- **Effort:** 3-5 days with LLM processing
- **Tools:** GLM4.7 API, conversation JSON parser

For each file with `path: /conversations/{uuid}`:

1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis

### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

- **Scope:** ~2,000 files
- **Effort:** 5-10 days with web archival
- **Tools:** Playwright, web-reader MCP, GLM4.7

For files with web-derived claims:

1. Archive the source URLs using Playwright
2. Extract claims with XPath using web-reader
3. Generate dual timestamps from the archival metadata
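Step 3 can be sketched as follows, assuming the Playwright archival step records an `archived_at` timestamp in its metadata (that key name is an assumption):

```python
from datetime import datetime, timezone

def dual_timestamps(archival_meta: dict) -> dict:
    """Build the Rule 35 timestamp pair for a freshly archived web claim:
    source_archived_at comes from the archival metadata, and
    statement_created_at is stamped now, at extraction time."""
    archived_at = archival_meta["archived_at"]
    created_at = datetime.now(timezone.utc).isoformat()
    # The archive must exist before a statement can be made from it.
    # (String comparison is safe here: both values are ISO 8601 with
    # the same +00:00 offset format.)
    assert archived_at <= created_at
    return {
        "source_archived_at": archived_at,
        "statement_created_at": created_at,
    }
```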

## File Updates

### Per-File Changes

For each of the 24,328 files:

1. Update `ch_annotator.extraction_provenance`:

   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator

     # NEW: source classification
     source_type: isil_registry_csv

     # NEW: migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. Update each `ch_annotator.entity_claims[].provenance`:

   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```

## Validation Criteria

After migration, every provenance block MUST pass:

1. `statement_created_at` is present (ISO 8601)
2. `source_archived_at` is present (ISO 8601)
3. `source_archived_at` <= `statement_created_at`
4. `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`
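The five criteria translate directly into a checker; a sketch, where the agent-format regex is an approximation of the `{tool}-{model}-{version}` / `{script-name}` convention rather than an official grammar:

```python
import re
from datetime import datetime

FORBIDDEN_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}

def validate_provenance(prov: dict) -> list:
    """Return a list of violations for one provenance block (empty = pass)."""
    errors = []
    # Criteria 1-2: both timestamps present and ISO 8601.
    for field in ("statement_created_at", "source_archived_at"):
        try:
            datetime.fromisoformat(prov.get(field, ""))
        except ValueError:
            errors.append(f"{field} missing or not ISO 8601")
    # Criterion 3: archive precedes statement (string comparison is safe
    # when both values share the same ISO 8601 offset format).
    if not errors and prov["source_archived_at"] > prov["statement_created_at"]:
        errors.append("source_archived_at must be <= statement_created_at")
    # Criteria 4-5: agent identifier is specific and well-formed.
    agent = prov.get("agent", "")
    if agent in FORBIDDEN_AGENTS:
        errors.append(f"forbidden agent identifier: {agent}")
    elif not re.fullmatch(r"[a-z0-9]+(-[a-z0-9_.]+)+", agent):
        errors.append(f"agent not in expected format: {agent!r}")
    return errors
```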

## Rollback Strategy

Before migration:

1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review

## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`