# Migration Specification: `agent: claude-conversation` → Proper Provenance

**Created**: 2025-12-30
**Status**: SPECIFICATION (Not Yet Implemented)
**Related**: PROVENANCE_TIMESTAMP_RULES.md, WEB_OBSERVATION_PROVENANCE_RULES.md

## Problem Statement

24,328 custodian YAML files in `data/custodian/` have provenance statements with:

- `agent: claude-conversation` (a vague, non-specific agent identifier)
- A single `timestamp` field (violates Rule 35: dual timestamp requirement)
- No distinction between statement creation and source archival

## Affected Files

All files matching:

```bash
grep -l "agent: claude-conversation" data/custodian/*.yaml
# Result: 24,328 files
```

### Provenance Locations in Each File

1. **`ch_annotator.extraction_provenance.agent`** - Top-level extraction agent
2. **`ch_annotator.entity_claims[].provenance.agent`** - Per-claim provenance (multiple instances)

## Source Data Categories

The 24,328 files come from different original sources, requiring different migration strategies:

### Category 1: ISIL Registry / CSV Sources (~18,000 files)

**Examples**: Japanese, Austrian, Swiss, Czech, Bulgarian, and Belgian ISIL registries

**Characteristics**:
- `path: /files/{country}_complete.yaml`
- Data originated from authoritative CSV registries
- The CSV files are already archived in `data/instances/`

**Migration Strategy** (Scripted):
```yaml
# BEFORE
extraction_provenance:
  path: /files/japan_complete.yaml
  timestamp: '2025-11-18T14:46:40.580095+00:00'
  agent: claude-conversation  # ← INVALID

# AFTER
extraction_provenance:
  source_type: isil_registry_csv
  source_path: /files/japan_complete.yaml
  source_archived_at: '2025-11-18T14:46:40.580095+00:00'   # When the CSV was processed
  statement_created_at: '2025-12-06T21:13:31.304940+00:00' # From annotation_date
  agent: batch-script-create-custodian-from-ch-annotator
  context_convention: ch_annotator-v1_7_0
```
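
The BEFORE → AFTER rewrite above can be sketched as a pure function over the loaded provenance dict (e.g. via PyYAML). This is a minimal sketch, assuming `annotation_date` is read from elsewhere in the same file; the function name is illustrative, not part of the existing tooling.

```python
def migrate_csv_provenance(old: dict, annotation_date: str) -> dict:
    """Rewrite a single-timestamp CSV-sourced provenance block (Category 1)."""
    if old.get("agent") != "claude-conversation":
        return old  # already migrated or out of scope
    return {
        "source_type": "isil_registry_csv",
        "source_path": old["path"],
        # The old single timestamp recorded when the CSV was processed.
        "source_archived_at": old["timestamp"],
        "statement_created_at": annotation_date,
        "agent": "batch-script-create-custodian-from-ch-annotator",
        "context_convention": "ch_annotator-v1_7_0",
    }

before = {
    "path": "/files/japan_complete.yaml",
    "timestamp": "2025-11-18T14:46:40.580095+00:00",
    "agent": "claude-conversation",
}
after = migrate_csv_provenance(before, "2025-12-06T21:13:31.304940+00:00")
```

Because the transform is deterministic and needs no LLM, it can run over all ~18,000 files in one batch pass.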
### Category 2: Conversation-Extracted Data (~4,000 files)

**Examples**: Palestinian heritage custodians, some Latin American institutions

**Characteristics**:
- `path: /conversations/{uuid}`
- Data extracted from Claude conversation exports
- Need to trace back to the original sources mentioned IN the conversation

**Migration Strategy** (Requires GLM4.7 + Manual Review):
1. Load the conversation JSON file
2. Use GLM4.7 to identify the ACTUAL sources mentioned in the conversation
3. For each source type:
   - **Web sources**: Use web-reader to archive + extract with XPath
   - **Wikidata**: Add Wikidata entity provenance
   - **Academic sources**: Add DOI/citation provenance
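
Steps 1-2 above might look like the sketch below. The `chat_messages` export shape, the prompt wording, and the injected `call_glm` callable are all assumptions — this is not a real GLM4.7 client, and its JSON output would still need the manual review the strategy calls for.

```python
import json

PROMPT = (
    "List every external data source (URL, Wikidata QID, DOI, registry file) "
    "actually cited in this conversation. Answer as a JSON array of objects "
    'with "kind" and "reference" keys.\n\n{transcript}'
)

def extract_sources(conversation_path: str, call_glm) -> list[dict]:
    """Load a conversation export and ask the model for its cited sources."""
    with open(conversation_path, encoding="utf-8") as fh:
        convo = json.load(fh)
    transcript = "\n".join(
        m.get("text", "") for m in convo.get("chat_messages", [])
    )
    raw = call_glm(PROMPT.format(transcript=transcript))
    return json.loads(raw)
```

Each returned `{"kind": ..., "reference": ...}` entry would then be routed to the matching handler (web-reader, Wikidata, DOI).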
### Category 3: Web-Enriched Data (~2,000 files)

**Examples**: Institutions with `web_enrichment`, `google_maps_enrichment`

**Characteristics**:
- Have web-scraped data that needs XPath provenance
- May have Google Maps or OSM enrichment

**Migration Strategy** (Requires web-reader + Playwright):
1. Re-archive source websites using Playwright
2. Use web-reader to extract claims with XPath provenance
3. Generate dual timestamps from archival metadata
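
Step 3 can be sketched as below. Deriving the capture time from the snapshot file's mtime is an assumption about how the Playwright archive stores its metadata; if the archive records an explicit capture timestamp, that should be preferred.

```python
from datetime import datetime, timezone
from pathlib import Path

def dual_timestamps(snapshot: Path) -> dict:
    """Build the Rule 35 dual-timestamp pair from an archived snapshot."""
    archived = datetime.fromtimestamp(snapshot.stat().st_mtime, tz=timezone.utc)
    created = datetime.now(tz=timezone.utc)
    # Rule 35: the source must be archived before the statement is created.
    assert archived <= created
    return {
        "source_archived_at": archived.isoformat(),
        "statement_created_at": created.isoformat(),
    }
```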
## Migration Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         MIGRATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐ │
│  │ Categorizer │ ──▶  │ Source Resolver  │ ──▶  │ Provenance      │ │
│  │             │      │                  │      │ Generator       │ │
│  │ - Detect    │      │ - CSV Registry   │      │                 │ │
│  │   source    │      │ - Conversation   │      │ - Dual          │ │
│  │   type      │      │ - Web Archive    │      │   timestamps    │ │
│  │ - Route to  │      │ - Wikidata       │      │ - Valid agent   │ │
│  │   handler   │      │                  │      │ - Source refs   │ │
│  └─────────────┘      └──────────────────┘      └─────────────────┘ │
│        │                       │                         │          │
│        ▼                       ▼                         ▼          │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                  Source-Specific Handlers                   │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │                                                             │    │
│  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────┐  │    │
│  │  │ ISIL/CSV       │  │ Conversation   │  │ Web Archive   │  │    │
│  │  │ Handler        │  │ Handler        │  │ Handler       │  │    │
│  │  │                │  │                │  │               │  │    │
│  │  │ - Read CSV     │  │ - Parse JSON   │  │ - Playwright  │  │    │
│  │  │ - Map to       │  │ - GLM4.7       │  │ - web-reader  │  │    │
│  │  │   timestamps   │  │   analysis     │  │ - XPath       │  │    │
│  │  │ - Update       │  │ - Source       │  │   extraction  │  │    │
│  │  │   provenance   │  │   tracing      │  │               │  │    │
│  │  └────────────────┘  └────────────────┘  └───────────────┘  │    │
│  │                                                             │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                      Validation Layer                       │    │
│  ├─────────────────────────────────────────────────────────────┤    │
│  │  - Dual timestamp check (Rule 35)                           │    │
│  │  - Agent identifier validation                              │    │
│  │  - source_archived_at <= statement_created_at               │    │
│  │  - XPath verification (where applicable)                    │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
```
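
The Categorizer stage above reduces to a router keyed on the provenance `path` plus the enrichment keys. A minimal sketch, assuming enrichment blocks live at the top level of each record and that the handler names are placeholders:

```python
def categorize(record: dict) -> str:
    """Route one custodian record to its migration handler."""
    path = (
        record.get("ch_annotator", {})
        .get("extraction_provenance", {})
        .get("path", "")
    )
    # Category 3 first: web-enriched files may also carry a /files/ path.
    if any(k in record for k in ("web_enrichment", "google_maps_enrichment")):
        return "web_archive_handler"    # Category 3
    if path.startswith("/conversations/"):
        return "conversation_handler"   # Category 2
    if path.startswith("/files/"):
        return "isil_csv_handler"       # Category 1
    return "manual_review"
```

Checking the enrichment keys before the path keeps Category 3 files from being mis-routed into the purely scripted Category 1 handler.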
## Implementation Phases

### Phase 1: Category 1 - ISIL/CSV Sources (Scripted, No LLM Required)

**Scope**: ~18,000 files
**Effort**: 1-2 days of scripting
**Tools**: Python script only

For files where `path` matches `/files/*.yaml` or `/files/*.csv`:
- Parse `annotation_date` to get `statement_created_at`
- Use the original file's processing timestamp for `source_archived_at`
- Replace `agent: claude-conversation` with a source-specific agent

### Phase 2: Category 2 - Conversation Sources (GLM4.7 Required)

**Scope**: ~4,000 files
**Effort**: 3-5 days with LLM processing
**Tools**: GLM4.7 API, conversation JSON parser

For each file with `path: /conversations/{uuid}`:
1. Load the conversation JSON from the archive (if available)
2. Send it to GLM4.7 with a prompt to identify the actual data sources
3. Update provenance based on the source analysis

### Phase 3: Category 3 - Web Sources (web-reader + Playwright)

**Scope**: ~2,000 files
**Effort**: 5-10 days with web archival
**Tools**: Playwright, web-reader MCP, GLM4.7

For files with web-derived claims:
1. Archive source URLs using Playwright
2. Extract claims with XPath using web-reader
3. Generate dual timestamps from archival metadata

## File Updates

### Per-File Changes

For each of the 24,328 files:

1. **Update `ch_annotator.extraction_provenance`**:
   ```yaml
   extraction_provenance:
     # Existing fields retained
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent identifier
     agent: batch-script-create-custodian-from-ch-annotator

     # NEW: Source classification
     source_type: isil_registry_csv

     # NEW: Migration tracking
     migration_note: 'Migrated from agent:claude-conversation on 2025-12-30'
   ```

2. **Update each `ch_annotator.entity_claims[].provenance`**:
   ```yaml
   provenance:
     namespace: glam
     path: /files/japan_complete.yaml
     context_convention: ch_annotator-v1_7_0

     # NEW: Dual timestamps (inherited from parent)
     source_archived_at: '2025-11-18T14:46:40.580095+00:00'
     statement_created_at: '2025-12-06T21:13:31.304940+00:00'

     # NEW: Valid agent
     agent: batch-script-create-custodian-from-ch-annotator
   ```

## Validation Criteria

After migration, every provenance block MUST pass:

1. ✅ `statement_created_at` is present (ISO 8601)
2. ✅ `source_archived_at` is present (ISO 8601)
3. ✅ `source_archived_at <= statement_created_at`
4. ✅ `agent` is NOT `claude-conversation`, `claude`, `ai`, `opencode`, or `llm`
5. ✅ `agent` follows the format `{tool}-{model}-{version}` or `{script-name}`
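
All five criteria are mechanically checkable. A minimal stdlib-only sketch — the regex for criterion 5 is an assumption about what counts as a well-formed agent identifier:

```python
from datetime import datetime
import re

BANNED_AGENTS = {"claude-conversation", "claude", "ai", "opencode", "llm"}
# Assumed shape for {tool}-{model}-{version} or hyphenated {script-name}.
AGENT_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9_.]+)+$")

def validate_provenance(p: dict) -> list[str]:
    """Return a list of violations; an empty list means the block passes."""
    errors = []
    try:
        archived = datetime.fromisoformat(p["source_archived_at"])
        created = datetime.fromisoformat(p["statement_created_at"])
        if archived > created:
            errors.append("source_archived_at > statement_created_at")
    except (KeyError, ValueError) as exc:
        errors.append(f"missing/invalid timestamp: {exc}")
    agent = p.get("agent", "")
    if agent in BANNED_AGENTS:
        errors.append(f"banned agent: {agent}")
    elif not AGENT_RE.match(agent):
        errors.append(f"agent format: {agent}")
    return errors
```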
## Rollback Strategy

Before migration:
1. Create a timestamped backup: `data/custodian.backup.2025-12-30/`
2. Store the original provenance in a `_migration_backup` field
3. Generate a diff report for manual review
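
Steps 1 and 2 might be sketched as below; the function names are illustrative, and the per-file stash assumes the rewritten block can simply embed a copy of its predecessor.

```python
import shutil
from pathlib import Path

def backup_tree(src: str = "data/custodian",
                dst: str = "data/custodian.backup.2025-12-30") -> Path:
    """Step 1: copy the whole tree before any file is rewritten."""
    return Path(shutil.copytree(src, dst, dirs_exist_ok=False))

def stash_original(prov: dict) -> dict:
    """Step 2: embed the pre-migration block so a file can be reverted in place."""
    migrated = dict(prov)
    migrated["_migration_backup"] = dict(prov)
    return migrated
```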
## References

- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Rule 6: `.opencode/WEB_OBSERVATION_PROVENANCE_RULES.md`
- CH-Annotator: `data/entity_annotation/ch_annotator-v1_7_0.yaml`
- web-reader script: `scripts/add_web_claim_provenance.py`