# GLM4.7 Prompts for Category 2: Conversation Source Analysis

**Created**: 2025-12-30
**Status**: SPECIFICATION
**Related**: CLAUDE_CONVERSATION_MIGRATION_SPEC.md, PROVENANCE_TIMESTAMP_RULES.md
## Purpose

Category 2 files (~4,000) have provenance paths like `/conversations/{uuid}`, which reference Claude conversation exports. The actual data sources (Wikidata, websites, registries, academic papers) are mentioned WITHIN the conversation text.
GLM4.7 is used to:
- Parse conversation JSON files
- Identify the REAL data sources mentioned
- Extract source metadata (URLs, timestamps, identifiers)
- Generate proper dual-timestamp provenance
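Concretely, the goal is to replace a bare `/conversations/{uuid}` path with a record carrying both timestamps. A minimal sketch of the target shape (field names follow this spec; the authoritative schema lives in PROVENANCE_TIMESTAMP_RULES.md):

```python
# Illustrative target provenance record for a single claim, using the
# dual-timestamp fields defined in this spec (values are examples).
provenance = {
    "source_type": "web",
    "source_url": "https://example.org/about",
    "source_archived_at": "2025-09-22T14:40:00Z",    # when the source was captured
    "statement_created_at": "2025-12-06T21:13:31Z",  # when the claim was written
    "agent": "opencode-claude-sonnet-4",
}

# Both timestamps must be present for the record to pass validation.
missing = [k for k in ("source_archived_at", "statement_created_at")
           if not provenance.get(k)]
print(missing)  # → []
```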
## Workflow Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                  CATEGORY 2 MIGRATION WORKFLOW                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────────────┐  │
│  │  Custodian   │     │ Conversation │     │     GLM4.7      │  │
│  │  YAML File   │ ──▶ │ JSON Archive │ ──▶ │ Source Analysis │  │
│  │              │     │              │     │                 │  │
│  │ path: /conv/ │     │ Full text of │     │ Identify:       │  │
│  │ {uuid}       │     │ messages     │     │ - URLs          │  │
│  └──────────────┘     └──────────────┘     │ - Wikidata IDs  │  │
│                                            │ - Registry refs │  │
│                                            │ - API calls     │  │
│                                            └─────────────────┘  │
│                                                     │           │
│                                                     ▼           │
│                       ┌─────────────────────────────────┐       │
│                       │    Source-Specific Handlers     │       │
│                       ├─────────────────────────────────┤       │
│                       │                                 │       │
│                       │ ┌───────────┐ ┌───────────────┐ │       │
│                       │ │ Web URLs  │ │ Wikidata IDs  │ │       │
│                       │ │           │ │               │ │       │
│                       │ │ Playwright│ │ SPARQL query  │ │       │
│                       │ │ archive + │ │ to verify     │ │       │
│                       │ │ web-reader│ │ claims        │ │       │
│                       │ └───────────┘ └───────────────┘ │       │
│                       │                                 │       │
│                       │ ┌───────────┐ ┌───────────────┐ │       │
│                       │ │ Registry  │ │ Academic      │ │       │
│                       │ │ References│ │ Citations     │ │       │
│                       │ │           │ │               │ │       │
│                       │ │ Map to    │ │ DOI lookup    │ │       │
│                       │ │ CSV files │ │ CrossRef API  │ │       │
│                       │ └───────────┘ └───────────────┘ │       │
│                       └─────────────────────────────────┘       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
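The fan-out into source-specific handlers can be sketched as a dispatch table keyed on `source_type`. The handler functions below are placeholders standing in for the real Playwright, SPARQL, CSV-mapping, and CrossRef handlers, not the actual implementation:

```python
# Illustrative dispatch of identified sources to the handlers shown
# in the diagram; handler bodies are placeholders.
def handle_web(source):       return f"playwright-archive:{source['source_url']}"
def handle_wikidata(source):  return f"sparql-verify:{source['source_identifier']}"
def handle_registry(source):  return f"csv-map:{source['source_identifier']}"
def handle_academic(source):  return f"crossref-lookup:{source['source_identifier']}"

HANDLERS = {
    "web": handle_web,
    "wikidata": handle_wikidata,
    "registry": handle_registry,
    "academic": handle_academic,
}

def dispatch(source: dict) -> str:
    """Route one identified source to its type-specific handler."""
    handler = HANDLERS.get(source["source_type"])
    if handler is None:
        raise ValueError(f"No handler for source type: {source['source_type']}")
    return handler(source)

print(dispatch({"source_type": "wikidata", "source_identifier": "Q12345"}))
# → sparql-verify:Q12345
```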
## GLM4.7 Prompts

### Prompt 1: Source Identification
# Task: Identify Data Sources in Heritage Custodian Conversation
You are analyzing a Claude conversation that was used to extract heritage institution data. Your task is to identify ALL data sources mentioned or used in this conversation.
## Conversation Content
{conversation_json}
## Institution Being Analyzed
- Name: {institution_name}
- GHCID: {ghcid}
- Current provenance path: /conversations/{conversation_uuid}
## Instructions

1. Read through the entire conversation carefully
2. Identify every data source mentioned or used, including:
   - **Web URLs**: Institution websites, registry portals, news articles
   - **Wikidata**: Entity IDs (Q-numbers) referenced or queried
   - **API Calls**: Any structured data fetches (SPARQL, REST APIs)
   - **CSV/Registry References**: ISIL registries, national databases
   - **Academic Sources**: Papers, reports, DOIs
   - **Government Sources**: Official publications, gazettes
3. For each source, extract:
   - Source type (web, wikidata, api, registry, academic, government)
   - Source identifier (URL, Q-number, DOI, etc.)
   - What data was extracted from it
   - Approximate timestamp of access (if mentioned)
## Output Format
Return a JSON object with an `identified_sources` array:
```json
{
  "institution_name": "{institution_name}",
  "ghcid": "{ghcid}",
  "conversation_uuid": "{conversation_uuid}",
  "identified_sources": [
    {
      "source_type": "web",
      "source_url": "https://example.org/about",
      "source_identifier": null,
      "data_extracted": ["name", "address", "opening_hours"],
      "access_timestamp": "2025-09-22T14:40:00Z",
      "confidence": 0.95,
      "evidence_quote": "Looking at their website at example.org..."
    },
    {
      "source_type": "wikidata",
      "source_url": "https://www.wikidata.org/wiki/Q12345",
      "source_identifier": "Q12345",
      "data_extracted": ["instance_of", "country", "coordinates"],
      "access_timestamp": null,
      "confidence": 0.98,
      "evidence_quote": "According to Wikidata (Q12345)..."
    }
  ],
  "analysis_notes": "Any relevant observations about source quality or gaps"
}
```

## Important

- Only include sources that were ACTUALLY used to extract data
- Do not invent sources - if unsure, set confidence lower
- Include the exact quote from the conversation that references each source
- If no sources can be identified, return an empty array with an explanation
### Prompt 2: Claim-Source Attribution
# Task: Map Claims to Their Original Sources
You have identified the following sources used in a heritage custodian conversation:
## Identified Sources
{identified_sources_json}
## Entity Claims from Custodian File
{entity_claims_json}
## Institution
- Name: {institution_name}
- GHCID: {ghcid}
## Instructions

For each entity claim, determine which source(s) it was derived from.

1. Analyze each claim (full_name, institution_type, located_in_city, etc.)
2. Match it to the most likely source based on:
   - What data each source provides
   - The conversation context
   - Claim confidence scores
3. Generate proper provenance for each claim
## Output Format
Return updated provenance for each claim:
```json
{
  "claim_provenance_updates": [
    {
      "claim_type": "full_name",
      "claim_value": "Example Museum",
      "attributed_source": {
        "source_type": "web",
        "source_url": "https://example.org/about",
        "source_archived_at": "2025-09-22T14:40:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 0.92
      },
      "attribution_rationale": "Name found on official website header"
    },
    {
      "claim_type": "wikidata_id",
      "claim_value": "Q12345",
      "attributed_source": {
        "source_type": "wikidata",
        "source_url": "https://www.wikidata.org/wiki/Q12345",
        "source_archived_at": "2025-09-22T14:45:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 1.0
      },
      "attribution_rationale": "Directly queried from Wikidata"
    }
  ],
  "unattributed_claims": [
    {
      "claim_type": "opening_hours",
      "claim_value": "Mon-Fri 9-17",
      "reason": "Source could not be determined from conversation"
    }
  ]
}
```

## Rules

- If a claim cannot be attributed to any identified source, add it to `unattributed_claims`
- For unattributed claims, the migration script will flag the file for manual review
- Use the conversation's timestamp as the fallback `source_archived_at` if no source timestamp is available
- `statement_created_at` should use the `annotation_date` from CH-Annotator
### Prompt 3: Web Source Verification
# Task: Verify Web Sources for Archival
Before we archive web sources with Playwright, verify they are valid and relevant.
## Web Sources to Verify
{web_sources_json}
## Institution
- Name: {institution_name}
- GHCID: {ghcid}
## Instructions
For each web source, determine:
1. **URL Validity**: Is the URL well-formed and likely still accessible?
2. **Relevance**: Does this URL relate to the institution?
3. **Archive Priority**: Should we archive this with Playwright?
4. **Expected Content**: What data should we extract with web-reader?
## Output Format
```json
{
  "web_source_verification": [
    {
      "source_url": "https://example.org/about",
      "url_valid": true,
      "is_institution_website": true,
      "archive_priority": "high",
      "expected_claims": ["name", "address", "description", "contact"],
      "web_reader_selectors": {
        "name": "h1.institution-name",
        "address": ".contact-info address",
        "description": "main .about-text"
      },
      "notes": "Official institution website - primary source"
    },
    {
      "source_url": "https://twitter.com/example",
      "url_valid": true,
      "is_institution_website": false,
      "archive_priority": "low",
      "expected_claims": ["social_media_handle"],
      "web_reader_selectors": null,
      "notes": "Social media - only need URL, not content"
    }
  ],
  "sources_to_archive": ["https://example.org/about"],
  "sources_to_skip": ["https://twitter.com/example"]
}
```

## Priority Levels

- `high`: Institution's own website - archive immediately
- `medium`: Government registries, Wikipedia - archive if accessible
- `low`: Social media, aggregators - just store URL
- `skip`: Dead links, paywalled content, dynamic apps
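These levels reduce to a simple archive decision. A sketch (the action names are illustrative; the mapping is implied by the level descriptions above):

```python
# Map archive_priority to a concrete action (illustrative names).
ACTIONS = {
    "high": "archive_now",
    "medium": "archive_if_accessible",
    "low": "store_url_only",
    "skip": "ignore",
}

def archive_action(priority: str) -> str:
    """Unknown levels are routed to review rather than silently dropped."""
    return ACTIONS.get(priority, "flag_for_review")

print(archive_action("medium"))  # → archive_if_accessible
```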
## Implementation Script Outline
```python
#!/usr/bin/env python3
"""
Phase 2 Migration: Conversation Sources → Proper Provenance

Uses GLM4.7 to analyze conversation JSON files and identify
actual data sources for heritage custodian claims.
"""
import json
import os
from pathlib import Path
from datetime import datetime, timezone

import httpx
import yaml

# Z.AI GLM API configuration (per Rule 11)
ZAI_API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
ZAI_MODEL = "glm-4.5"  # or glm-4.6 for higher quality


def get_zai_token() -> str:
    """Get Z.AI API token from environment."""
    token = os.environ.get("ZAI_API_TOKEN")
    if not token:
        raise ValueError("ZAI_API_TOKEN environment variable not set")
    return token


def call_glm4(prompt: str, system_prompt: str | None = None) -> str:
    """Call GLM4 API with prompt."""
    headers = {
        "Authorization": f"Bearer {get_zai_token()}",
        "Content-Type": "application/json"
    }
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    payload = {
        "model": ZAI_MODEL,
        "messages": messages,
        "temperature": 0.1,  # Low temperature for consistent extraction
        "max_tokens": 4096
    }
    response = httpx.post(ZAI_API_URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


def load_conversation_json(uuid: str) -> dict | None:
    """Load conversation JSON from archive."""
    # Conversation archives stored in data/conversations/
    conv_path = Path(f"data/conversations/{uuid}.json")
    if not conv_path.exists():
        # Try alternative locations
        alt_path = Path(f"~/Documents/claude/glam/{uuid}.json").expanduser()
        if alt_path.exists():
            conv_path = alt_path
        else:
            return None
    with open(conv_path, 'r') as f:
        return json.load(f)


def identify_sources_for_institution(custodian_file: Path) -> dict:
    """
    Analyze conversation to identify sources for a custodian.

    Returns dict with:
    - identified_sources: list of sources found
    - claim_attributions: mapping of claims to sources
    - web_verification: URLs needing Playwright archival
    """
    # Load custodian YAML
    with open(custodian_file, 'r') as f:
        custodian = yaml.safe_load(f)

    # Extract conversation UUID from provenance path
    ch_annotator = custodian.get('ch_annotator', {})
    extraction_prov = ch_annotator.get('extraction_provenance', {})
    path = extraction_prov.get('path', '')
    if not path.startswith('/conversations/'):
        return {'error': 'Not a conversation source file'}
    conv_uuid = path.replace('/conversations/', '')

    # Load conversation JSON
    conversation = load_conversation_json(conv_uuid)
    if not conversation:
        return {'error': f'Conversation not found: {conv_uuid}'}

    # Extract relevant info
    institution_name = custodian.get('custodian_name', {}).get('claim_value', 'Unknown')
    ghcid = custodian.get('ghcid', {}).get('ghcid_current', 'Unknown')
    entity_claims = ch_annotator.get('entity_claims', [])

    # Step 1: Identify sources using GLM4
    source_prompt = PROMPT_1_SOURCE_IDENTIFICATION.format(
        conversation_json=json.dumps(conversation, indent=2)[:50000],  # Truncate if needed
        institution_name=institution_name,
        ghcid=ghcid,
        conversation_uuid=conv_uuid
    )
    sources_response = call_glm4(source_prompt)
    identified_sources = json.loads(sources_response)

    # Step 2: Attribute claims to sources
    attribution_prompt = PROMPT_2_CLAIM_ATTRIBUTION.format(
        identified_sources_json=json.dumps(identified_sources['identified_sources'], indent=2),
        entity_claims_json=json.dumps(entity_claims, indent=2),
        institution_name=institution_name,
        ghcid=ghcid
    )
    attributions_response = call_glm4(attribution_prompt)
    claim_attributions = json.loads(attributions_response)

    # Step 3: Verify web sources
    web_sources = [s for s in identified_sources['identified_sources']
                   if s['source_type'] == 'web']
    if web_sources:
        verification_prompt = PROMPT_3_WEB_VERIFICATION.format(
            web_sources_json=json.dumps(web_sources, indent=2),
            institution_name=institution_name,
            ghcid=ghcid
        )
        verification_response = call_glm4(verification_prompt)
        web_verification = json.loads(verification_response)
    else:
        web_verification = {'sources_to_archive': [], 'sources_to_skip': []}

    return {
        'custodian_file': str(custodian_file),
        'conversation_uuid': conv_uuid,
        'identified_sources': identified_sources,
        'claim_attributions': claim_attributions,
        'web_verification': web_verification
    }


# Prompt templates (loaded from this file or external)
PROMPT_1_SOURCE_IDENTIFICATION = """..."""  # From Prompt 1 above
PROMPT_2_CLAIM_ATTRIBUTION = """..."""      # From Prompt 2 above
PROMPT_3_WEB_VERIFICATION = """..."""       # From Prompt 3 above
```

## Conversation JSON Location

Conversation exports need to be located. Check these paths:

- `~/Documents/claude/glam/*.json` - Original Claude exports
- `data/conversations/*.json` - Project archive location
- `data/instances/conversations/` - Alternative archive

If conversations are not archived, they may need to be re-exported from Claude.
## Integration with Phase 1
Phase 2 runs AFTER Phase 1 completes:
- Phase 1: Migrates ~18,000 Category 1 files (ISIL/CSV sources)
- Phase 2: Processes ~4,000 Category 2 files (conversation sources)
- Phase 3: Archives web sources with Playwright for Category 3
## Cost Estimation
GLM4 API calls (per Rule 11: FREE via Z.AI Coding Plan):
- ~4,000 files × 3 prompts = ~12,000 API calls
- Cost: $0 (Z.AI Coding Plan)
- Time: ~2-4 hours (rate limited)
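The 2-4 hour figure implies roughly one call per second across ~12,000 calls. A minimal client-side pacing wrapper (the one-second interval is an assumption, not a documented Z.AI limit):

```python
import time

def paced(calls, min_interval: float = 1.0):
    """Run zero-argument callables, sleeping so consecutive calls start
    at least min_interval seconds apart, and yield their results."""
    last_start = None
    for call in calls:
        if last_start is not None:
            wait = min_interval - (time.monotonic() - last_start)
            if wait > 0:
                time.sleep(wait)
        last_start = time.monotonic()
        yield call()

# 12,000 calls at 1 s spacing ≈ 3.3 hours, matching the estimate above.
results = list(paced([lambda: "ok"] * 3, min_interval=0.01))
print(results)  # → ['ok', 'ok', 'ok']
```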
## Validation Criteria

After Phase 2 migration, every Category 2 file MUST pass:

- ✅ `source_archived_at` is present (from identified source or conversation timestamp)
- ✅ `statement_created_at` is present (from annotation_date)
- ✅ `agent` is valid (`opencode-claude-sonnet-4` or similar)
- ✅ At least one source identified, OR flagged for manual review
- ✅ Web sources queued for Playwright archival (Phase 3)
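The timestamp and sourcing checks can be sketched as a single gate function (field names follow this spec; `manual_review` is a hypothetical flag name for the review queue):

```python
# Minimal sketch of the Phase 2 validation gate; the real validator
# lives in the migration tooling.
REQUIRED_FIELDS = ("source_archived_at", "statement_created_at", "agent")

def validate_category2(record: dict) -> list[str]:
    """Return a list of validation failures (empty list = pass)."""
    failures = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if not record.get("identified_sources") and not record.get("manual_review"):
        failures.append("no sources identified and not flagged for manual review")
    return failures

record = {
    "source_archived_at": "2025-09-22T14:40:00Z",
    "statement_created_at": "2025-12-06T21:13:31Z",
    "agent": "opencode-claude-sonnet-4",
    "identified_sources": [{"source_type": "web"}],
}
print(validate_category2(record))  # → []
```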
## References

- Rule 11: `.opencode/ZAI_GLM_API_RULES.md`
- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Migration Spec: `.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md`