# GLM4.7 Prompts for Category 2: Conversation Source Analysis

**Created**: 2025-12-30

**Status**: SPECIFICATION

**Related**: CLAUDE_CONVERSATION_MIGRATION_SPEC.md, PROVENANCE_TIMESTAMP_RULES.md

## Purpose

Category 2 files (~4,000) have provenance paths like `/conversations/{uuid}` which reference Claude conversation exports. The actual data sources (Wikidata, websites, registries, academic papers) are mentioned WITHIN the conversation text.

GLM4.7 is used to:

1. Parse conversation JSON files
2. Identify the REAL data sources mentioned
3. Extract source metadata (URLs, timestamps, identifiers)
4. Generate proper dual-timestamp provenance
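
Category 2 files can be detected mechanically from the provenance path alone. A minimal sketch (the helper name and the strict UUID pattern are our assumptions, not part of the spec):

```python
import re

# /conversations/{uuid} marks a Category 2 (conversation-sourced) file.
_CONV_PATH = re.compile(
    r"/conversations/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

def is_category2_path(path: str) -> bool:
    """True if the provenance path points at a Claude conversation export."""
    return _CONV_PATH.fullmatch(path) is not None

print(is_category2_path("/conversations/1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed"))  # True
print(is_category2_path("/registries/isil/some-registry.csv"))                    # False
```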

## Workflow Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                 CATEGORY 2 MIGRATION WORKFLOW                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ ┌──────────────┐     ┌──────────────┐     ┌─────────────────┐   │
│ │  Custodian   │     │ Conversation │     │     GLM4.7      │   │
│ │  YAML File   │ ──▶ │ JSON Archive │ ──▶ │ Source Analysis │   │
│ │              │     │              │     │                 │   │
│ │ path: /conv/ │     │ Full text of │     │ Identify:       │   │
│ │ {uuid}       │     │ messages     │     │ - URLs          │   │
│ └──────────────┘     └──────────────┘     │ - Wikidata IDs  │   │
│                                           │ - Registry refs │   │
│                                           │ - API calls     │   │
│                                           └─────────────────┘   │
│                                                    │            │
│                                                    ▼            │
│                              ┌─────────────────────────────────┐│
│                              │    Source-Specific Handlers     ││
│                              ├─────────────────────────────────┤│
│                              │                                 ││
│                              │ ┌───────────┐ ┌───────────────┐ ││
│                              │ │ Web URLs  │ │ Wikidata IDs  │ ││
│                              │ │           │ │               │ ││
│                              │ │ Playwright│ │ SPARQL query  │ ││
│                              │ │ archive + │ │ to verify     │ ││
│                              │ │ web-reader│ │ claims        │ ││
│                              │ └───────────┘ └───────────────┘ ││
│                              │                                 ││
│                              │ ┌───────────┐ ┌───────────────┐ ││
│                              │ │ Registry  │ │ Academic      │ ││
│                              │ │ References│ │ Citations     │ ││
│                              │ │           │ │               │ ││
│                              │ │ Map to    │ │ DOI lookup    │ ││
│                              │ │ CSV files │ │ CrossRef API  │ ││
│                              │ └───────────┘ └───────────────┘ ││
│                              └─────────────────────────────────┘│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## GLM4.7 Prompts

### Prompt 1: Source Identification

````markdown
# Task: Identify Data Sources in Heritage Custodian Conversation

You are analyzing a Claude conversation that was used to extract heritage institution data. Your task is to identify ALL data sources mentioned or used in this conversation.

## Conversation Content
{conversation_json}

## Institution Being Analyzed
- Name: {institution_name}
- GHCID: {ghcid}
- Current provenance path: /conversations/{conversation_uuid}

## Instructions

1. Read through the entire conversation carefully
2. Identify every data source mentioned or used, including:
   - **Web URLs**: Institution websites, registry portals, news articles
   - **Wikidata**: Entity IDs (Q-numbers) referenced or queried
   - **API Calls**: Any structured data fetches (SPARQL, REST APIs)
   - **CSV/Registry References**: ISIL registries, national databases
   - **Academic Sources**: Papers, reports, DOIs
   - **Government Sources**: Official publications, gazettes
3. For each source, extract:
   - Source type (web, wikidata, api, registry, academic, government)
   - Source identifier (URL, Q-number, DOI, etc.)
   - What data was extracted from it
   - Approximate timestamp of access (if mentioned)

## Output Format

Return a JSON object with the identified sources:

```json
{
  "institution_name": "{institution_name}",
  "ghcid": "{ghcid}",
  "conversation_uuid": "{conversation_uuid}",
  "identified_sources": [
    {
      "source_type": "web",
      "source_url": "https://example.org/about",
      "source_identifier": null,
      "data_extracted": ["name", "address", "opening_hours"],
      "access_timestamp": "2025-09-22T14:40:00Z",
      "confidence": 0.95,
      "evidence_quote": "Looking at their website at example.org..."
    },
    {
      "source_type": "wikidata",
      "source_url": "https://www.wikidata.org/wiki/Q12345",
      "source_identifier": "Q12345",
      "data_extracted": ["instance_of", "country", "coordinates"],
      "access_timestamp": null,
      "confidence": 0.98,
      "evidence_quote": "According to Wikidata (Q12345)..."
    }
  ],
  "analysis_notes": "Any relevant observations about source quality or gaps"
}
```

## Important

- Only include sources that were ACTUALLY used to extract data
- Do not invent sources - if unsure, set confidence lower
- Include the exact quote from the conversation that references each source
- If no sources can be identified, return an empty array with an explanation
````
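
The prompts ask for JSON output, but chat models often wrap their answer in a markdown code fence, so the migration script should parse responses tolerantly. A minimal sketch (the helper name is ours; the fence-stripping heuristic is an assumption about GLM's output style):

```python
import json
import re

def parse_glm_json(response_text: str):
    """Parse a JSON payload that may be wrapped in a ```json ... ``` fence."""
    text = response_text.strip()
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)
```

If parsing still fails (`json.JSONDecodeError`), the file should be queued for manual review rather than silently skipped.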

### Prompt 2: Claim-Source Attribution

````markdown
# Task: Map Claims to Their Original Sources

You have identified the following sources used in a heritage custodian conversation:

## Identified Sources
{identified_sources_json}

## Entity Claims from Custodian File
{entity_claims_json}

## Institution
- Name: {institution_name}
- GHCID: {ghcid}

## Instructions

For each entity claim, determine which source(s) it was derived from.

1. Analyze each claim (full_name, institution_type, located_in_city, etc.)
2. Match it to the most likely source based on:
   - What data each source provides
   - The conversation context
   - Claim confidence scores
3. Generate proper provenance for each claim

## Output Format

Return updated provenance for each claim:

```json
{
  "claim_provenance_updates": [
    {
      "claim_type": "full_name",
      "claim_value": "Example Museum",
      "attributed_source": {
        "source_type": "web",
        "source_url": "https://example.org/about",
        "source_archived_at": "2025-09-22T14:40:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 0.92
      },
      "attribution_rationale": "Name found on official website header"
    },
    {
      "claim_type": "wikidata_id",
      "claim_value": "Q12345",
      "attributed_source": {
        "source_type": "wikidata",
        "source_url": "https://www.wikidata.org/wiki/Q12345",
        "source_archived_at": "2025-09-22T14:45:00Z",
        "statement_created_at": "2025-12-06T21:13:31Z",
        "agent": "opencode-claude-sonnet-4",
        "attribution_confidence": 1.0
      },
      "attribution_rationale": "Directly queried from Wikidata"
    }
  ],
  "unattributed_claims": [
    {
      "claim_type": "opening_hours",
      "claim_value": "Mon-Fri 9-17",
      "reason": "Source could not be determined from conversation"
    }
  ]
}
```

## Rules

- If a claim cannot be attributed to any identified source, add it to unattributed_claims
- For unattributed claims, the migration script will flag the file for manual review
- Use the conversation's export timestamp as the fallback source_archived_at if no access timestamp is available
- statement_created_at should use the annotation_date from CH-Annotator
````

### Prompt 3: Web Source Verification

````markdown
# Task: Verify Web Sources for Archival

Before we archive web sources with Playwright, verify they are valid and relevant.

## Web Sources to Verify
{web_sources_json}

## Institution
- Name: {institution_name}
- GHCID: {ghcid}

## Instructions

For each web source, determine:

1. **URL Validity**: Is the URL well-formed and likely still accessible?
2. **Relevance**: Does this URL relate to the institution?
3. **Archive Priority**: Should we archive this with Playwright?
4. **Expected Content**: What data should we extract with web-reader?

## Output Format

```json
{
  "web_source_verification": [
    {
      "source_url": "https://example.org/about",
      "url_valid": true,
      "is_institution_website": true,
      "archive_priority": "high",
      "expected_claims": ["name", "address", "description", "contact"],
      "web_reader_selectors": {
        "name": "h1.institution-name",
        "address": ".contact-info address",
        "description": "main .about-text"
      },
      "notes": "Official institution website - primary source"
    },
    {
      "source_url": "https://twitter.com/example",
      "url_valid": true,
      "is_institution_website": false,
      "archive_priority": "low",
      "expected_claims": ["social_media_handle"],
      "web_reader_selectors": null,
      "notes": "Social media - only need URL, not content"
    }
  ],
  "sources_to_archive": ["https://example.org/about"],
  "sources_to_skip": ["https://twitter.com/example"]
}
```

## Priority Levels

- **high**: Institution's own website - archive immediately
- **medium**: Government registries, Wikipedia - archive if accessible
- **low**: Social media, aggregators - just store URL
- **skip**: Dead links, paywalled content, dynamic apps
````

## Implementation Script Outline

```python
#!/usr/bin/env python3
"""
Phase 2 Migration: Conversation Sources → Proper Provenance

Uses GLM4.7 to analyze conversation JSON files and identify
actual data sources for heritage custodian claims.
"""

import json
import os
from pathlib import Path

import httpx
import yaml

# Z.AI GLM API configuration (per Rule 11)
ZAI_API_URL = "https://api.z.ai/api/coding/paas/v4/chat/completions"
ZAI_MODEL = "glm-4.5"  # or glm-4.6 for higher quality


def get_zai_token() -> str:
    """Get Z.AI API token from environment."""
    token = os.environ.get("ZAI_API_TOKEN")
    if not token:
        raise ValueError("ZAI_API_TOKEN environment variable not set")
    return token


def call_glm4(prompt: str, system_prompt: str | None = None) -> str:
    """Call GLM4 API with prompt."""
    headers = {
        "Authorization": f"Bearer {get_zai_token()}",
        "Content-Type": "application/json"
    }

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    payload = {
        "model": ZAI_MODEL,
        "messages": messages,
        "temperature": 0.1,  # Low temperature for consistent extraction
        "max_tokens": 4096
    }

    response = httpx.post(ZAI_API_URL, json=payload, headers=headers, timeout=60)
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]


def load_conversation_json(uuid: str) -> dict | None:
    """Load conversation JSON from archive; None if not found."""
    # Conversation archives stored in data/conversations/
    conv_path = Path(f"data/conversations/{uuid}.json")
    if not conv_path.exists():
        # Try alternative locations
        alt_path = Path(f"~/Documents/claude/glam/{uuid}.json").expanduser()
        if alt_path.exists():
            conv_path = alt_path
        else:
            return None

    with open(conv_path, 'r') as f:
        return json.load(f)


def identify_sources_for_institution(custodian_file: Path) -> dict:
    """
    Analyze conversation to identify sources for a custodian.

    Returns dict with:
    - identified_sources: list of sources found
    - claim_attributions: mapping of claims to sources
    - web_verification: URLs needing Playwright archival (or to skip)
    """
    # Load custodian YAML
    with open(custodian_file, 'r') as f:
        custodian = yaml.safe_load(f)

    # Extract conversation UUID from provenance path
    ch_annotator = custodian.get('ch_annotator', {})
    extraction_prov = ch_annotator.get('extraction_provenance', {})
    path = extraction_prov.get('path', '')

    if not path.startswith('/conversations/'):
        return {'error': 'Not a conversation source file'}

    conv_uuid = path.replace('/conversations/', '')

    # Load conversation JSON
    conversation = load_conversation_json(conv_uuid)
    if not conversation:
        return {'error': f'Conversation not found: {conv_uuid}'}

    # Extract relevant info
    institution_name = custodian.get('custodian_name', {}).get('claim_value', 'Unknown')
    ghcid = custodian.get('ghcid', {}).get('ghcid_current', 'Unknown')
    entity_claims = ch_annotator.get('entity_claims', [])

    # Step 1: Identify sources using GLM4
    source_prompt = PROMPT_1_SOURCE_IDENTIFICATION.format(
        conversation_json=json.dumps(conversation, indent=2)[:50000],  # Truncate if needed
        institution_name=institution_name,
        ghcid=ghcid,
        conversation_uuid=conv_uuid
    )

    sources_response = call_glm4(source_prompt)
    identified_sources = json.loads(sources_response)  # assumes bare JSON in the response

    # Step 2: Attribute claims to sources
    attribution_prompt = PROMPT_2_CLAIM_ATTRIBUTION.format(
        identified_sources_json=json.dumps(identified_sources['identified_sources'], indent=2),
        entity_claims_json=json.dumps(entity_claims, indent=2),
        institution_name=institution_name,
        ghcid=ghcid
    )

    attributions_response = call_glm4(attribution_prompt)
    claim_attributions = json.loads(attributions_response)

    # Step 3: Verify web sources
    web_sources = [s for s in identified_sources['identified_sources'] if s['source_type'] == 'web']

    if web_sources:
        verification_prompt = PROMPT_3_WEB_VERIFICATION.format(
            web_sources_json=json.dumps(web_sources, indent=2),
            institution_name=institution_name,
            ghcid=ghcid
        )

        verification_response = call_glm4(verification_prompt)
        web_verification = json.loads(verification_response)
    else:
        web_verification = {'sources_to_archive': [], 'sources_to_skip': []}

    return {
        'custodian_file': str(custodian_file),
        'conversation_uuid': conv_uuid,
        'identified_sources': identified_sources,
        'claim_attributions': claim_attributions,
        'web_verification': web_verification
    }


# Prompt templates (loaded from this file or external).
# NOTE: the templates contain literal JSON braces; escape them as {{ }}
# (or use string.Template) before calling .format().
PROMPT_1_SOURCE_IDENTIFICATION = """..."""  # From Prompt 1 above
PROMPT_2_CLAIM_ATTRIBUTION = """..."""  # From Prompt 2 above
PROMPT_3_WEB_VERIFICATION = """..."""  # From Prompt 3 above
```
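
Since the batch makes ~12,000 API calls, transient failures and rate limits are inevitable; each `call_glm4` invocation is worth wrapping in a retry. A minimal sketch (the wrapper name and backoff parameters are our assumptions, not part of the Z.AI API):

```python
import time

def with_retries(call, attempts=3, base_delay=2.0):
    """Retry a flaky zero-argument call with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # in practice, narrow this to httpx.HTTPError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_retries(lambda: call_glm4(source_prompt))`; failures that survive all attempts should mark the custodian file for manual review.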

## Conversation JSON Location

Conversation exports need to be located. Check these paths:

1. `~/Documents/claude/glam/*.json` - Original Claude exports
2. `data/conversations/*.json` - Project archive location
3. `data/instances/conversations/` - Alternative archive

If conversations are not archived, they may need to be re-exported from Claude.
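
The path search above can be factored into one helper so the migration script and any inventory tooling agree on the search order. A sketch (the function name and the `search_dirs` parameter are ours, added for testability):

```python
from pathlib import Path

# Candidate archive locations, in the search order listed above.
SEARCH_DIRS = [
    Path("~/Documents/claude/glam").expanduser(),
    Path("data/conversations"),
    Path("data/instances/conversations"),
]

def locate_conversation(uuid: str, search_dirs=SEARCH_DIRS):
    """Return the first existing {uuid}.json, or None (re-export needed)."""
    for directory in search_dirs:
        candidate = directory / f"{uuid}.json"
        if candidate.exists():
            return candidate
    return None
```

Running this over every Category 2 UUID before Phase 2 starts gives an upfront list of conversations that must be re-exported.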

## Integration with Phase 1

Phase 2 runs AFTER Phase 1 completes:

1. **Phase 1**: Migrates ~18,000 Category 1 files (ISIL/CSV sources)
2. **Phase 2**: Processes ~4,000 Category 2 files (conversation sources)
3. **Phase 3**: Archives web sources with Playwright for Category 3

## Cost Estimation

GLM4 API calls (per Rule 11: FREE via Z.AI Coding Plan):

- ~4,000 files × 3 prompts = ~12,000 API calls
- Cost: $0 (Z.AI Coding Plan)
- Time: ~2-4 hours (rate limited)
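
As a sanity check on the time budget: 12,000 calls in 2-4 hours implies a sustained rate of roughly 0.8-1.7 calls per second, which is what any client-side rate limiter should be tuned to.

```python
TOTAL_CALLS = 4000 * 3  # files × prompts

def calls_per_second(total_calls: int, hours: float) -> float:
    """Sustained request rate needed to finish total_calls in the given hours."""
    return total_calls / (hours * 3600)

print(f"{calls_per_second(TOTAL_CALLS, 2):.2f}")  # 1.67 (2-hour end of the estimate)
print(f"{calls_per_second(TOTAL_CALLS, 4):.2f}")  # 0.83 (4-hour end of the estimate)
```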

## Validation Criteria

After Phase 2 migration, every Category 2 file MUST pass:

1. ✅ `source_archived_at` is present (from identified source or conversation timestamp)
2. ✅ `statement_created_at` is present (from annotation_date)
3. ✅ `agent` is valid (opencode-claude-sonnet-4 or similar)
4. ✅ At least one source identified, OR flagged for manual review
5. ✅ Web sources queued for Playwright archival (Phase 3)
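
Criteria 1-3 can be enforced mechanically on each claim's provenance block. A minimal sketch (the field names follow the examples above; the `opencode-` prefix check for a valid agent is our assumption about what "or similar" means):

```python
REQUIRED_FIELDS = ("source_archived_at", "statement_created_at", "agent")

def validate_provenance(prov: dict) -> list[str]:
    """Return the names of failed checks; an empty list means the block passes."""
    failures = [field for field in REQUIRED_FIELDS if not prov.get(field)]
    # Criterion 3: agent should look like a known annotator id (assumed convention)
    agent = prov.get("agent", "")
    if agent and not agent.startswith("opencode-"):
        failures.append("agent_invalid")
    return failures
```

Files with any non-empty failure list (and files failing criterion 4) go to the manual-review queue.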

## References

- Rule 11: `.opencode/ZAI_GLM_API_RULES.md`
- Rule 35: `.opencode/PROVENANCE_TIMESTAMP_RULES.md`
- Migration Spec: `.opencode/CLAUDE_CONVERSATION_MIGRATION_SPEC.md`