glam/.opencode/GLM4_CONVERSATION_ANALYSIS_PROMPTS.md

# GLM4.7 Prompts for Category 2: Conversation Source Analysis

**Created:** 2025-12-30 · **Status:** SPECIFICATION · **Related:** CLAUDE_CONVERSATION_MIGRATION_SPEC.md, PROVENANCE_TIMESTAMP_RULES.md

## Purpose

Category 2 files (~4,000) have provenance paths like `/conversations/{uuid}` that reference Claude conversation exports. The actual data sources (Wikidata, websites, registries, academic papers) are mentioned WITHIN the conversation text.

GLM4.7 is used to:

  1. Parse conversation JSON files
  2. Identify the REAL data sources mentioned
  3. Extract source metadata (URLs, timestamps, identifiers)
  4. Generate proper dual-timestamp provenance

## Workflow Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                 CATEGORY 2 MIGRATION WORKFLOW                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────────────┐  │
│  │ Custodian    │     │ Conversation │     │ GLM4.7          │  │
│  │ YAML File    │ ──▶ │ JSON Archive │ ──▶ │ Source Analysis │  │
│  │              │     │              │     │                 │  │
│  │ path: /conv/ │     │ Full text of │     │ Identify:       │  │
│  │ {uuid}       │     │ messages     │     │ - URLs          │  │
│  └──────────────┘     └──────────────┘     │ - Wikidata IDs  │  │
│                                            │ - Registry refs │  │
│                                            │ - API calls     │  │
│                                            └─────────────────┘  │
│                                                    │             │
│                                                    ▼             │
│                              ┌─────────────────────────────────┐ │
│                              │ Source-Specific Handlers        │ │
│                              ├─────────────────────────────────┤ │
│                              │                                 │ │
│                              │ ┌───────────┐ ┌───────────────┐ │ │
│                              │ │ Web URLs  │ │ Wikidata IDs  │ │ │
│                              │ │           │ │               │ │ │
│                              │ │ Playwright│ │ SPARQL query  │ │ │
│                              │ │ archive + │ │ to verify     │ │ │
│                              │ │ web-reader│ │ claims        │ │ │
│                              │ └───────────┘ └───────────────┘ │ │
│                              │                                 │ │
│                              │ ┌───────────┐ ┌───────────────┐ │ │
│                              │ │ Registry  │ │ Academic      │ │ │
│                              │ │ References│ │ Citations     │ │ │
│                              │ │           │ │               │ │ │
│                              │ │ Map to    │ │ DOI lookup    │ │ │
│                              │ │ CSV files │ │ CrossRef API  │ │ │
│                              │ └───────────┘ └───────────────┘ │ │
│                              └─────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## GLM4.7 Prompts

### Prompt 1: Source Identification

````markdown
# Task: Identify Data Sources in Heritage Custodian Conversation

You are analyzing a Claude conversation that was used to extract heritage institution data. Your task is to identify ALL data sources mentioned or used in this conversation.

## Conversation Content
{conversation_json}

## Institution Being Analyzed
- Name: {institution_name}
- GHCID: {ghcid}
- Current provenance path: /conversations/{conversation_uuid}

## Instructions

1. Read through the entire conversation carefully
2. Identify every data source mentioned or used, including:
   - **Web URLs**: Institution websites, registry portals, news articles
   - **Wikidata**: Entity IDs (Q-numbers) referenced or queried
   - **API Calls**: Any structured data fetches (SPARQL, REST APIs)
   - **CSV/Registry References**: ISIL registries, national databases
   - **Academic Sources**: Papers, reports, DOIs
   - **Government Sources**: Official publications, gazettes
3. For each source, extract:
   - Source type (web, wikidata, api, registry, academic, government)
   - Source identifier (URL, Q-number, DOI, etc.)
   - What data was extracted from it
   - Approximate timestamp of access (if mentioned)

## Output Format

Return a JSON object describing the identified sources:

```json
{
  "institution_name": "{institution_name}",
  "ghcid": "{ghcid}",
  "conversation_uuid": "{conversation_uuid}",
  "identified_sources": [
    {
      "source_type": "web",
      "source_url": "https://example.org/about",
      "source_identifier": null,
      "data_extracted": ["name", "address", "opening_hours"],
      "access_timestamp": "2025-09-22T14:40:00Z",
      "confidence": 0.95,
      "evidence_quote": "Looking at their website at example.org..."
    },
    {
      "source_type": "wikidata",
      "source_url": "https://www.wikidata.org/wiki/Q12345",
      "source_identifier": "Q12345",
      "data_extracted": ["instance_of", "country", "coordinates"],
      "access_timestamp": null,
      "confidence": 0.98,
      "evidence_quote": "According to Wikidata (Q12345)..."
    }
  ],
  "analysis_notes": "Any relevant observations about source quality or gaps"
}
```

## Important

- Only include sources that were ACTUALLY used to extract data
- Do not invent sources - if unsure, set the confidence lower
- Include the exact quote from the conversation that references each source
- If no sources can be identified, return an empty array with an explanation
````
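The output contract above can be checked mechanically before a response is trusted downstream. A minimal validator sketch (field names are taken from the output format above; the function name and error messages are illustrative):

```python
def validate_source_response(data: dict) -> list[str]:
    """Collect schema problems in a Prompt 1 response; an empty list means valid."""
    errors = []
    # Top-level fields required by the output format.
    for key in ("institution_name", "ghcid", "conversation_uuid", "identified_sources"):
        if key not in data:
            errors.append(f"missing top-level field: {key}")
    # Source types enumerated in the prompt instructions.
    allowed_types = {"web", "wikidata", "api", "registry", "academic", "government"}
    for i, src in enumerate(data.get("identified_sources", [])):
        if src.get("source_type") not in allowed_types:
            errors.append(f"source {i}: unknown source_type {src.get('source_type')!r}")
        conf = src.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            errors.append(f"source {i}: confidence must be a number in [0, 1]")
        if not src.get("evidence_quote"):
            errors.append(f"source {i}: evidence_quote is required")
    return errors
```

Running this before Prompt 2 catches malformed model output early instead of propagating it into claim attribution.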

### Prompt 2: Claim-Source Attribution

````markdown
# Task: Map Claims to Their Original Sources

You have identified the following sources used in a heritage custodian conversation:

## Identified Sources
{identified_sources_json}

## Entity Claims from Custodian File
{entity_claims_json}

## Institution
- Name: {institution_name}
- GHCID: {ghcid}

## Instructions

For each entity claim, determine which source(s) it was derived from.

1. Analyze each claim (full_name, institution_type, located_in_city, etc.)
2. Match it to the most likely source based on:
   - What data each source provides
   - The conversation context
   - Claim confidence scores
3. Generate proper provenance for each claim

## Output Format

Return updated provenance for each claim:

```json
{
  "claim_provenance_updates": [
    {
      "claim_type": "full_name",
      "claim_value": "Example Museum",
      "attributed_source": {
        "source_type": "web",
        "source_url": "https://example.org/about",
        "source_archived_at": "2025-09-22T14:40:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 0.92
      },
      "attribution_rationale": "Name found on official website header"
    },
    {
      "claim_type": "wikidata_id",
      "claim_value": "Q12345",
      "attributed_source": {
        "source_type": "wikidata",
        "source_url": "https://www.wikidata.org/wiki/Q12345",
        "source_archived_at": "2025-09-22T14:45:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 1.0
      },
      "attribution_rationale": "Directly queried from Wikidata"
    }
  ],
  "unattributed_claims": [
    {
      "claim_type": "opening_hours",
      "claim_value": "Mon-Fri 9-17",
      "reason": "Source could not be determined from conversation"
    }
  ]
}
```

## Rules

- If a claim cannot be attributed to any identified source, add it to `unattributed_claims`
- For unattributed claims, the migration script will flag them for manual review
- Use the conversation timestamp as the fallback `source_archived_at` if no per-source timestamp is available
- `statement_created_at` should use the `annotation_date` from CH-Annotator
````
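Folding the response above back into a custodian record amounts to indexing attributions by claim type and routing everything else to manual review. A sketch, under the assumption that claim types are unique per custodian (function name and the `provenance` key are illustrative):

```python
def apply_attributions(
    entity_claims: list[dict], response: dict
) -> tuple[list[dict], list[dict]]:
    """Attach attributed_source provenance to claims; return (updated, needs_review)."""
    # Index Prompt 2 updates by claim_type for O(1) lookup per claim.
    by_type = {u["claim_type"]: u for u in response.get("claim_provenance_updates", [])}
    updated, needs_review = [], []
    for claim in entity_claims:
        update = by_type.get(claim.get("claim_type"))
        if update:
            # Copy rather than mutate, so the source claims stay untouched.
            updated.append({**claim, "provenance": update["attributed_source"]})
        else:
            needs_review.append(claim)  # flagged for manual review per the rules above
    return updated, needs_review
```

Claims the model placed in `unattributed_claims` (or simply omitted) end up in `needs_review`, matching the manual-review rule above.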

### Prompt 3: Web Source Verification

````markdown
# Task: Verify Web Sources for Archival

Before we archive web sources with Playwright, verify that they are valid and relevant.

## Web Sources to Verify
{web_sources_json}

## Institution
- Name: {institution_name}
- GHCID: {ghcid}

## Instructions

For each web source, determine:

1. **URL Validity**: Is the URL well-formed and likely still accessible?
2. **Relevance**: Does this URL relate to the institution?
3. **Archive Priority**: Should we archive this with Playwright?
4. **Expected Content**: What data should we extract with web-reader?

## Output Format

```json
{
  "web_source_verification": [
    {
      "source_url": "https://example.org/about",
      "url_valid": true,
      "is_institution_website": true,
      "archive_priority": "high",
      "expected_claims": ["name", "address", "description", "contact"],
      "web_reader_selectors": {
        "name": "h1.institution-name",
        "address": ".contact-info address",
        "description": "main .about-text"
      },
      "notes": "Official institution website - primary source"
    },
    {
      "source_url": "https://twitter.com/example",
      "url_valid": true,
      "is_institution_website": false,
      "archive_priority": "low",
      "expected_claims": ["social_media_handle"],
      "web_reader_selectors": null,
      "notes": "Social media - only need URL, not content"
    }
  ],
  "sources_to_archive": ["https://example.org/about"],
  "sources_to_skip": ["https://twitter.com/example"]
}
```

## Priority Levels

- **high**: Institution's own website - archive immediately
- **medium**: Government registries, Wikipedia - archive if accessible
- **low**: Social media, aggregators - just store the URL
- **skip**: Dead links, paywalled content, dynamic apps
````
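The priority levels map naturally onto a small dispatch step that decides which URLs enter the Playwright queue. A sketch using the priority names from the list above (the function name and queue shape are assumptions):

```python
# Lower number = archived first; "low" and "skip" never enter the queue,
# matching the priority levels above (their URLs are merely stored).
ARCHIVE_PRIORITIES = {"high": 0, "medium": 1, "low": 2, "skip": 3}


def queue_for_archival(verification: dict) -> list[str]:
    """Return archive-worthy URLs from a Prompt 3 response, highest priority first."""
    candidates = [
        entry
        for entry in verification.get("web_source_verification", [])
        if entry.get("archive_priority") in ("high", "medium")
    ]
    candidates.sort(key=lambda e: ARCHIVE_PRIORITIES[e["archive_priority"]])
    return [e["source_url"] for e in candidates]
```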

## Implementation Script Outline

```python
#!/usr/bin/env python3
"""
Phase 2 Migration: Conversation Sources → Proper Provenance

Uses GLM4.7 to analyze conversation JSON files and identify
actual data sources for heritage custodian claims.
"""

import json
import os
from pathlib import Path
from datetime import datetime, timezone

import httpx
import yaml

# Z.AI GLM API configuration (per Rule 11)
ZAI_API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
ZAI_MODEL = "glm-4.5"  # or glm-4.6 for higher quality


def get_zai_token() -> str:
    """Get Z.AI API token from environment."""
    token = os.environ.get("ZAI_API_TOKEN")
    if not token:
        raise ValueError("ZAI_API_TOKEN environment variable not set")
    return token


def call_glm4(prompt: str, system_prompt: str | None = None) -> str:
    """Call GLM4 API with prompt."""
    headers = {
        "Authorization": f"Bearer {get_zai_token()}",
        "Content-Type": "application/json"
    }
    
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    
    payload = {
        "model": ZAI_MODEL,
        "messages": messages,
        "temperature": 0.1,  # Low temperature for consistent extraction
        "max_tokens": 4096
    }
    
    response = httpx.post(ZAI_API_URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()
    
    return response.json()["choices"][0]["message"]["content"]


def load_conversation_json(uuid: str) -> dict | None:
    """Load conversation JSON from archive; return None if no export is found."""
    # Conversation archives stored in data/conversations/
    conv_path = Path(f"data/conversations/{uuid}.json")
    if not conv_path.exists():
        # Try alternative locations
        alt_path = Path(f"~/Documents/claude/glam/{uuid}.json").expanduser()
        if alt_path.exists():
            conv_path = alt_path
        else:
            return None
    
    with open(conv_path, 'r') as f:
        return json.load(f)


def identify_sources_for_institution(custodian_file: Path) -> dict:
    """
    Analyze conversation to identify sources for a custodian.
    
    Returns dict with:
    - identified_sources: list of sources found
    - claim_attributions: mapping of claims to sources
    - web_sources_to_archive: URLs needing Playwright archival
    """
    # Load custodian YAML
    with open(custodian_file, 'r') as f:
        custodian = yaml.safe_load(f)
    
    # Extract conversation UUID from provenance path
    ch_annotator = custodian.get('ch_annotator', {})
    extraction_prov = ch_annotator.get('extraction_provenance', {})
    path = extraction_prov.get('path', '')
    
    if not path.startswith('/conversations/'):
        return {'error': 'Not a conversation source file'}
    
    conv_uuid = path.replace('/conversations/', '')
    
    # Load conversation JSON
    conversation = load_conversation_json(conv_uuid)
    if not conversation:
        return {'error': f'Conversation not found: {conv_uuid}'}
    
    # Extract relevant info
    institution_name = custodian.get('custodian_name', {}).get('claim_value', 'Unknown')
    ghcid = custodian.get('ghcid', {}).get('ghcid_current', 'Unknown')
    entity_claims = ch_annotator.get('entity_claims', [])
    
    # Step 1: Identify sources using GLM4
    source_prompt = PROMPT_1_SOURCE_IDENTIFICATION.format(
        conversation_json=json.dumps(conversation, indent=2)[:50000],  # Truncate if needed
        institution_name=institution_name,
        ghcid=ghcid,
        conversation_uuid=conv_uuid
    )
    
    sources_response = call_glm4(source_prompt)
    identified_sources = json.loads(sources_response)
    
    # Step 2: Attribute claims to sources
    attribution_prompt = PROMPT_2_CLAIM_ATTRIBUTION.format(
        identified_sources_json=json.dumps(identified_sources['identified_sources'], indent=2),
        entity_claims_json=json.dumps(entity_claims, indent=2),
        institution_name=institution_name,
        ghcid=ghcid
    )
    
    attributions_response = call_glm4(attribution_prompt)
    claim_attributions = json.loads(attributions_response)
    
    # Step 3: Verify web sources
    web_sources = [s for s in identified_sources['identified_sources'] if s['source_type'] == 'web']
    
    if web_sources:
        verification_prompt = PROMPT_3_WEB_VERIFICATION.format(
            web_sources_json=json.dumps(web_sources, indent=2),
            institution_name=institution_name,
            ghcid=ghcid
        )
        
        verification_response = call_glm4(verification_prompt)
        web_verification = json.loads(verification_response)
    else:
        web_verification = {'sources_to_archive': [], 'sources_to_skip': []}
    
    return {
        'custodian_file': str(custodian_file),
        'conversation_uuid': conv_uuid,
        'identified_sources': identified_sources,
        'claim_attributions': claim_attributions,
        'web_verification': web_verification
    }


# Prompt templates (loaded from this file or external)
PROMPT_1_SOURCE_IDENTIFICATION = """..."""  # From Prompt 1 above
PROMPT_2_CLAIM_ATTRIBUTION = """..."""  # From Prompt 2 above
PROMPT_3_WEB_VERIFICATION = """..."""  # From Prompt 3 above
```

## Conversation JSON Location

Conversation exports need to be located. Check these paths:

1. `~/Documents/claude/glam/*.json` - original Claude exports
2. `data/conversations/*.json` - project archive location
3. `data/instances/conversations/` - alternative archive

If conversations are not archived, they may need to be re-exported from Claude.

## Integration with Phase 1

Phase 2 runs AFTER Phase 1 completes:

1. Phase 1: Migrates ~18,000 Category 1 files (ISIL/CSV sources)
2. Phase 2: Processes ~4,000 Category 2 files (conversation sources)
3. Phase 3: Archives web sources with Playwright for Category 3

## Cost Estimation

GLM4 API calls (per Rule 11: FREE via Z.AI Coding Plan):

- ~4,000 files × 3 prompts = ~12,000 API calls
- Cost: $0 (Z.AI Coding Plan)
- Time: ~2-4 hours (rate limited)
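The time estimate follows directly from the call count once a rate limit is assumed; the spec does not state one, so the 60 requests/minute below is purely illustrative:

```python
files = 4_000
prompts_per_file = 3
total_calls = files * prompts_per_file            # 12,000 API calls

requests_per_minute = 60                          # assumed rate limit, not documented
hours = total_calls / requests_per_minute / 60    # 12,000 / 60 / 60 = ~3.3 hours
```

At that assumed rate the run lands inside the quoted 2-4 hour window; a stricter limit scales the duration proportionally.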

## Validation Criteria

After Phase 2 migration, every Category 2 file MUST pass:

1. `source_archived_at` is present (from an identified source or the conversation timestamp)
2. `statement_created_at` is present (from `annotation_date`)
3. `agent` is valid (`opencode-claude-sonnet-4` or similar)
4. At least one source is identified, OR the file is flagged for manual review
5. Web sources are queued for Playwright archival (Phase 3)
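The first three criteria translate into a per-claim check a migration script can run over every migrated provenance block. A minimal sketch (field names come from the provenance examples earlier; treating any `opencode-` prefixed agent as valid is an assumption standing in for "or similar"):

```python
def passes_phase2_validation(provenance: dict) -> bool:
    """Check criteria 1-3 for a single claim's migrated provenance block."""
    return bool(
        provenance.get("source_archived_at")                      # criterion 1
        and provenance.get("statement_created_at")                # criterion 2
        and str(provenance.get("agent", "")).startswith("opencode-")  # criterion 3
    )
```

Criteria 4 and 5 are file-level rather than claim-level, so they belong in a separate pass over the whole custodian record and the Phase 3 archive queue.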

## References

- Rule 11: `.opencode/ZAI_GLM_API_RULES.md`
- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Migration Spec: `.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md`