# OpenCode NLP Extraction Agents
This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.
## Schema Reference (v0.2.0)

All agents extract data conforming to the modular Heritage Custodian Schema v0.2.0:

- `/schemas/heritage_custodian.yaml` - Main schema (import-only structure)
- `/schemas/core.yaml` - Core classes (HeritageCustodian, Location, Identifier, DigitalPlatform, GHCID)
- `/schemas/enums.yaml` - Enumerations (InstitutionTypeEnum, ChangeTypeEnum, DataSource, DataTier, etc.)
- `/schemas/provenance.yaml` - Provenance tracking (Provenance, ChangeEvent, GHCIDHistoryEntry)
- `/schemas/collections.yaml` - Collection metadata (Collection, Accession, DigitalObject)
- `/schemas/dutch.yaml` - Dutch-specific extensions (DutchHeritageCustodian)
See `/docs/SCHEMA_MODULES.md` for detailed architecture and usage patterns.
## Available Agents

### 1. @institution-extractor

**Purpose:** Extract heritage institution names, types, and basic metadata

**Schema:** Uses the `HeritageCustodian` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of institutions with:

- Institution name
- Institution type (from `InstitutionTypeEnum` in `enums.yaml`)
- Alternative names
- Description
- Confidence score

**Example:**

```
@institution-extractor
Please extract all heritage institutions from the following text:

[paste conversation text]
```
### 2. @location-extractor

**Purpose:** Extract geographic locations (cities, addresses, regions, countries)

**Schema:** Uses the `Location` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of locations with:

- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score

**Example:**

```
@location-extractor
Please extract all locations mentioned for heritage institutions:

[paste conversation text]
```
### 3. @identifier-extractor

**Purpose:** Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)

**Schema:** Uses the `Identifier` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of identifiers with:

- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score

**Recognizes:**

- ISIL codes: `NL-AsdAM`, `US-DLC`, etc.
- Wikidata IDs: `Q190804`
- VIAF IDs: `147143282`
- KvK numbers (Dutch): `41231987`
- Website URLs
- Other standard identifiers

**Example:**

```
@identifier-extractor
Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:

[paste conversation text]
```
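For downstream sanity checks, the identifier shapes listed above can also be matched mechanically. The sketch below is illustrative only: the agent itself uses LLM extraction, not these regexes, and the patterns are loose approximations (real ISIL validation is stricter than this):

```python
import re

# Loose, illustrative patterns mirroring the example identifiers above.
# These are NOT the agent's extraction method and would over-match in practice.
PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]+\b"),
    "WIKIDATA": re.compile(r"\bQ\d+\b"),
    "KVK": re.compile(r"\b\d{8}\b"),
}

def find_identifiers(text: str) -> list[dict]:
    """Return {scheme, value} dicts for every pattern hit in the text."""
    found = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"scheme": scheme, "value": match.group()})
    return found

hits = find_identifiers("The Amsterdam Museum (NL-AsdAM, Q190804) has KvK number 41231987.")
```

Such a pass can flag identifiers the agent missed, or corroborate ones it found.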
### 4. @event-extractor

**Purpose:** Extract organizational change events (founding, mergers, relocations, etc.)

**Schema:** Uses the `ChangeEvent` class from `/schemas/provenance.yaml`

**Input:** Conversation text

**Output:** JSON array of change events with:

- Event ID
- Change type (from `ChangeTypeEnum` in `enums.yaml`: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score

**Detects:**

- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.

**Example:**

```
@event-extractor
Please extract all organizational change events:

[paste conversation text]
```
## Usage Workflow

### Option 1: Using the Orchestration Script

The orchestration script (`scripts/extract_with_agents.py`) prepares prompts for each agent:

```bash
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
```

This will print prompts for each agent. Copy/paste each prompt to invoke the corresponding agent via @mention.
### Option 2: Direct Agent Invocation

You can invoke agents directly in an OpenCode session:

1. Load the conversation text:

   ```python
   from glam_extractor.parsers.conversation import ConversationParser

   parser = ConversationParser()
   conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
   text = conv.extract_all_text()
   ```

2. Invoke agents (via @mention):

   ```
   @institution-extractor
   Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:

   [paste text from conv.extract_all_text()]
   ```

3. Collect responses and combine results using `AgentOrchestrator.create_heritage_custodian_record()`
### Option 3: Batch Processing

For processing multiple conversations:

```python
from pathlib import Path
from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)
    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
```
## Agent Configuration

All agents are configured with:

- **mode:** `subagent` (invokable by primary agents or @mention)
- **model:** `claude-sonnet-4` (high-quality extraction)
- **temperature:** `0.1`-`0.2` (focused, deterministic)
- **tools:** all disabled (read-only analysis)

This ensures consistent, high-quality extractions with minimal hallucination.
## Output Format

All agents return JSON-only responses with no additional commentary:

```
{
  "institutions": [...],    // from @institution-extractor
  "locations": [...],       // from @location-extractor
  "identifiers": [...],     // from @identifier-extractor
  "change_events": [...]    // from @event-extractor
}
```

These JSON responses can be directly parsed and validated against the LinkML schema.
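In practice a model occasionally wraps its JSON in a code fence despite instructions, so a small defensive parser is useful before schema validation. A minimal sketch (the helper name and fence-stripping heuristic are ours, not part of the toolkit):

```python
import json

def parse_agent_response(raw: str) -> dict:
    """Parse a JSON-only agent reply, tolerating stray markdown code fences."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Remove surrounding backticks, then an optional language tag line
        # such as "json" that some models emit after the opening fence.
        cleaned = cleaned.strip("`")
        first_newline = cleaned.find("\n")
        if first_newline != -1 and cleaned[:first_newline].strip().isalpha():
            cleaned = cleaned[first_newline + 1:]
    return json.loads(cleaned)

reply = '{"institutions": [{"name": "Museu Nacional", "confidence": 0.95}]}'
data = parse_agent_response(reply)
```

The resulting dict can then be handed to the LinkML validation step described under Next Steps.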
## Confidence Scoring

All agents assign confidence scores (0.0-1.0):

- **0.9-1.0:** Explicit, unambiguous mentions
- **0.7-0.9:** Clear mentions with some ambiguity
- **0.5-0.7:** Inferred from context
- **0.3-0.5:** Low confidence, likely needs verification
- **0.0-0.3:** Very uncertain, flag for manual review
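The bands above can be turned into a triage step when post-processing agent output. A small sketch; the label names are illustrative, not schema values:

```python
def review_priority(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to a triage label per the bands above.

    The labels are illustrative only; they are not values from the schema.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.9:
        return "accept"              # explicit, unambiguous mention
    if confidence >= 0.7:
        return "accept_with_note"    # clear mention, some ambiguity
    if confidence >= 0.5:
        return "verify"              # inferred from context
    if confidence >= 0.3:
        return "needs_verification"  # low confidence
    return "manual_review"           # very uncertain
```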
## Multilingual Support

Agents support the 60+ languages found in the conversation dataset, including:

- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- and many more

Agents preserve original-language names (no translation) and adapt pattern matching to the language context.
## Data Quality

Extracted data is marked as:

- **Data Source:** `CONVERSATION_NLP`
- **Data Tier:** `TIER_4_INFERRED`
- **Provenance:** includes conversation ID, extraction date, method, and confidence score

This ensures proper provenance tracking and quality assessment.
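Assembling that metadata for each extracted record might look like the sketch below. The field names follow this README's description; the exact slot names in `/schemas/provenance.yaml` may differ, so check the schema before validating:

```python
from datetime import date

def build_provenance(conversation_id: str, confidence: float) -> dict:
    """Build the source/tier/provenance metadata described above.

    Field names are assumed from the README, not read from the schema.
    """
    return {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "conversation_id": conversation_id,
        "extraction_date": date.today().isoformat(),
        "extraction_method": "opencode_subagent",
        "confidence": confidence,
    }

record = build_provenance("Brazilian_GLAM_collection_inventories", 0.92)
```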
## Next Steps

After extraction:

1. Validate with the LinkML schema:

   ```bash
   linkml-validate -s schemas/heritage_custodian.yaml data.json
   ```

2. Cross-link with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. Geocode locations using the GeoNames database
4. Generate GHCIDs for persistent identification
5. Export to JSON-LD, RDF/Turtle, CSV, or Parquet
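The cross-linking step can be sketched as a join on ISIL codes. This assumes the registry has already been loaded into a dict keyed by ISIL, and that extracted institutions carry the identifier arrays produced by @identifier-extractor; both shapes are assumptions for illustration:

```python
def link_by_isil(extracted: list[dict], registry: dict[str, dict]) -> list[dict]:
    """Attach a registry match to each extracted institution via its ISIL code.

    `registry` is assumed to be keyed by ISIL code; institutions without an
    ISIL identifier (or without a registry hit) get registry_match=None.
    """
    linked = []
    for inst in extracted:
        isil = next(
            (i["value"] for i in inst.get("identifiers", [])
             if i.get("scheme") == "ISIL"),
            None,
        )
        match = registry.get(isil) if isil else None
        linked.append({**inst, "registry_match": match})
    return linked
```

Unmatched records would then fall back to name matching, as the step above notes.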
See `/AGENTS.md` for detailed extraction guidelines and examples.
See `/docs/SCHEMA_MODULES.md` for schema architecture and usage patterns.
## Contributing

To add a new extraction agent:

1. Create `.opencode/agent/your-agent-name.md`
2. Configure it with `mode: subagent` and an appropriate model and temperature
3. Define the input/output format with examples
4. Document extraction patterns and confidence scoring
5. Add multilingual support and edge-case handling
6. Test with real conversation data
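A skeleton for such an agent file might look like the following. This is a hypothetical sketch based on the configuration keys listed under Agent Configuration above; check OpenCode's own agent documentation for the exact frontmatter keys it accepts:

```markdown
---
mode: subagent
model: claude-sonnet-4
temperature: 0.1
---

You are a specialized extractor for <your domain>.
Return JSON only, with no additional commentary.
Assign a confidence score (0.0-1.0) to every extracted item.
```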
---

**Schema Version:** v0.2.0 (modular)
**Last Updated:** 2025-11-05
**Agent Count:** 4
**Languages Supported:** 60+
**Conversations Ready:** 139