glam/.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md
2025-12-30 23:19:38 +01:00


# Migration Specification: `agent: claude-conversation` → Proper Provenance
**Created**: 2025-12-30
**Status**: SPECIFICATION (Not Yet Implemented)
**Related**: PROVENANCE_TIMESTAMP_RULES.md, WEB_OBSERVATION_PROVENANCE_RULES.md
## Problem Statement
24,328 custodian YAML files in `data/custodian/` have provenance statements with:
- `agent: claude-conversation` (vague, non-specific agent identifier)
- Single `timestamp` field (violates Rule 35: dual timestamp requirement)
- No distinction between statement creation and source archival
## Affected Files
All files matching:
```bash
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```
### Provenance Locations in Each File
1. **`ch_annotator.extraction_provenance.agent`** - Top-level extraction agent
2. **`ch_annotator.entity_claims[].provenance.agent`** - Per-claim provenance (multiple instances)
## Source Data Categories
The 24,328 files come from different original sources, requiring different migration strategies:
### Category 1: ISIL Registry / CSV Sources (~18,000 files)
**Examples**: Japan, Austria, Switzerland, Czech, Bulgarian, Belgian ISIL registries
**Characteristics**:
- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`
**Migration Strategy** (Scripted):
```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'    # When CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'  # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```
### Category 2: Conversation-Extracted Data (~4,000 files)
**Examples**: Palestinian heritage custodians, some Latin American institutions
**Characteristics**:
- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Need to trace back to original sources mentioned IN the conversation
**Migration Strategy** (Requires GLM4.7 + Manual Review):
1. Load the conversation JSON file
2. Use GLM4.7 to identify the ACTUAL sources mentioned in the conversation
3. For each source type:
- **Web sources**: Use web-reader to archive + extract with XPath
- **Wikidata**: Add Wikidata entity provenance
- **Academic sources**: Add DOI/citation provenance
### Category 3: Web-Enriched Data (~2,000 files)
**Examples**: Institutions with `web_enrichment`, `google_maps_enrichment`
**Characteristics**:
- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment
**Migration Strategy** (Requires web-reader + Playwright):
1. Re-archive source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from archival metadata
## Migration Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ MIGRATION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────────┐ ┌─────────────────┐ │
│ │ Categorizer │ ──▶ │ Source Resolver │ ──▶ │ Provenance │ │
│ │ │ │ │ │ Generator │ │
│ │ - Detect │ │ - CSV Registry │ │ │ │
│ │ source │ │ - Conversation │ │ - Dual │ │
│ │ type │ │ - Web Archive │ │ timestamps │ │
│ │ - Route to │ │ - Wikidata │ │ - Valid agent │ │
│ │ handler │ │ │ │ - Source refs │ │
│ └─────────────┘ └──────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Source-Specific Handlers │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────┐ │ │
│ │ │ ISIL/CSV │ │ Conversation │ │ Web Archive │ │ │
│ │ │ Handler │ │ Handler │ │ Handler │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - Read CSV │ │ - Parse JSON │ │ - Playwright │ │ │
│ │ │ - Map to │ │ - GLM4.7 │ │ - web-reader │ │ │
│ │ │ timestamps │ │ analysis │ │ - XPath │ │ │
│ │ │ - Update │ │ - Source │ │ extraction │ │ │
│ │ │ provenance │ │ tracing │ │ │ │ │
│ │ └────────────────┘ └────────────────┘ └───────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Validation Layer │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ - Dual timestamp check (Rule 35) │ │
│ │ - Agent identifier validation │ │
│ │ - source_archived_at <= statement_created_at │ │
│ │ - XPath verification (where applicable) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
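The Categorizer stage can be sketched as a simple classifier over each record. Key names follow the spec's examples; checking top-level enrichment keys to detect Category 3 is an assumption about the file layout, not confirmed behavior:

```python
def categorize(record: dict) -> str:
    """Route one custodian record to a migration handler."""
    prov = record.get("ch_annotator", {}).get("extraction_provenance", {})
    path = prov.get("path", "")
    if "web_enrichment" in record or "google_maps_enrichment" in record:
        return "web_archive"        # Category 3: re-archive + XPath extraction
    if path.startswith("/conversations/"):
        return "conversation"       # Category 2: GLM4.7 source tracing
    if path.startswith("/files/"):
        return "isil_registry_csv"  # Category 1: scripted migration
    return "unknown"                # flag for manual review
```

Records that match none of the rules are deliberately routed to `unknown` rather than guessed at, so they surface in the diff report.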
## Implementation Phases
### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)
**Scope**: ~18,000 files
**Effort**: 1-2 days scripting
**Tools**: Python script only
Files where `path` matches `/files/*.yaml` or `/files/*.csv`:
- Parse the annotation_date to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with source-specific agent
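The per-block rewrite for this phase can be sketched as a pure function, assuming (per the strategy above) that the old single `timestamp` recorded when the CSV was processed:

```python
NEW_AGENT = "batch-script-create-custodian-from-ch-annotator"

def migrate_category1(prov: dict, annotation_date: str) -> dict:
    """Rewrite one Category 1 extraction_provenance block."""
    migrated = dict(prov)
    migrated["_migration_backup"] = dict(prov)  # rollback copy (see Rollback Strategy)
    migrated["source_archived_at"] = migrated.pop("timestamp")
    migrated["statement_created_at"] = annotation_date
    migrated["agent"] = NEW_AGENT
    migrated["source_type"] = "isil_registry_csv"
    migrated["migration_note"] = "Migrated from agent:claude-conversation on 2025-12-30"
    return migrated
```

The driver script would apply this to `extraction_provenance` and to every `entity_claims[].provenance` block, passing the file's `annotation_date` as `statement_created_at`.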
### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)
**Scope**: ~4,000 files
**Effort**: 3-5 days with LLM processing
**Tools**: GLM4.7 API, conversation JSON parser
For each file with `path: /conversations/{uuid}`:
1. Load conversation JSON from archive (if available)
2. Send to GLM4.7 with prompt to identify actual data sources
3. Update provenance based on source analysis
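Steps 1-2 can be sketched as a request builder. The conversation export format (a JSON list of messages with a `text` field) and the payload shape are assumptions, not the project's actual GLM4.7 client code:

```python
import json
from pathlib import Path

SOURCE_PROMPT = (
    "List every external data source referenced in this conversation "
    "(URLs, Wikidata QIDs, DOIs) as a JSON array."
)

def build_source_analysis_request(conversation_path: Path) -> dict:
    """Assemble one source-identification request for a conversation export."""
    messages = json.loads(conversation_path.read_text(encoding="utf-8"))
    transcript = "\n".join(m.get("text", "") for m in messages)
    return {"system": SOURCE_PROMPT, "user": transcript}
```

The model's response would then drive step 3, with each identified source routed to the web, Wikidata, or DOI provenance path from Category 2 above.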
### Phase 3: Category 3 - Web Sources (web-reader + Playwright)
**Scope**: ~2,000 files
**Effort**: 5-10 days with web archival
**Tools**: Playwright, web-reader MCP, GLM4.7
For files with web-derived claims:
1. Archive source URLs using Playwright
2. Extract claims with XPath using web-reader
3. Generate dual timestamps from archival metadata
## File Updates
### Per-File Changes
For each of the 24,328 files:
1. **Update `ch_annotator.extraction_provenance`**:
```yaml
extraction_provenance:
  # Existing fields retained
  namespace: glam
  path: /files/japan_complete.yaml
  context_convention: ch_annotator-v1_7_0
  # NEW: Dual timestamps
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'
  # NEW: Valid agent identifier
  agent: batch-script-create-custodian-from-ch-annotator
  # NEW: Source classification
  source_type: isil_registry_csv
  # NEW: Migration tracking
  migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
```
2. **Update each `ch_annotator.entity_claims[].provenance`**:
```yaml
provenance:
  namespace: glam
  path: /files/japan_complete.yaml
  context_convention: ch_annotator-v1_7_0
  # NEW: Dual timestamps (inherited from parent)
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'
  statement_created_at: '2025-12-06T21:13:31.304940+00:00'
  # NEW: Valid agent
  agent: batch-script-create-custodian-from-ch-annotator
```
## Validation Criteria
After migration, every provenance block MUST pass:
1. `statement_created_at` is present (ISO 8601)
2. `source_archived_at` is present (ISO 8601)
3. `source_archived_at <= statement_created_at`
4. `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`
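The five criteria map directly to a validator. The agent-identifier regex below is an assumption about the allowed character set (lowercase tokens joined by hyphens, covering both `{tool}-{model}-{version}` and hyphenated script names), not a rule from the spec:

```python
import re
from datetime import datetime

BANNED_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}
AGENT_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9._]+)+$")  # assumed charset

def validate_provenance(prov: dict) -> list:
    """Return a list of violations for one provenance block (empty = pass)."""
    errors = []
    archived = created = None
    try:
        archived = datetime.fromisoformat(prov["source_archived_at"])
    except (KeyError, ValueError):
        errors.append("source_archived_at missing or not ISO 8601")
    try:
        created = datetime.fromisoformat(prov["statement_created_at"])
    except (KeyError, ValueError):
        errors.append("statement_created_at missing or not ISO 8601")
    if archived and created and archived > created:
        errors.append("source_archived_at is after statement_created_at")
    agent = prov.get("agent", "")
    if agent in BANNED_AGENTS:
        errors.append(f"banned agent identifier: {agent}")
    elif not AGENT_RE.match(agent):
        errors.append(f"agent does not match expected format: {agent}")
    return errors
```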
## Rollback Strategy
Before migration:
1. Create timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store original provenance in `_migration_backup` field
3. Generate diff report for manual review
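Step 1 above can be sketched as follows; the directory naming follows the spec, everything else is an assumption about the driver script:

```python
import shutil
from datetime import date
from pathlib import Path

def backup_custodian(data_dir: Path) -> Path:
    """Copy data/custodian to a timestamped backup before any migration."""
    src = data_dir / "custodian"
    dst = data_dir / f"custodian.backup.{date.today().isoformat()}"
    shutil.copytree(src, dst)  # fails fast if the backup already exists
    return dst
```

Step 2 (`_migration_backup`) additionally embeds the original provenance inside each migrated file, so a per-file rollback does not require the directory-level backup.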
## References
- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`