# OpenCode NLP Extraction Agents
This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.
## Schema Reference (v0.2.0)

All agents extract data conforming to the modular Heritage Custodian Schema v0.2.0:

- `/schemas/heritage_custodian.yaml` - Main schema (import-only structure)
- `/schemas/core.yaml` - Core classes (HeritageCustodian, Location, Identifier, DigitalPlatform, GHCID)
- `/schemas/enums.yaml` - Enumerations (InstitutionTypeEnum, ChangeTypeEnum, DataSource, DataTier, etc.)
- `/schemas/provenance.yaml` - Provenance tracking (Provenance, ChangeEvent, GHCIDHistoryEntry)
- `/schemas/collections.yaml` - Collection metadata (Collection, Accession, DigitalObject)
- `/schemas/dutch.yaml` - Dutch-specific extensions (DutchHeritageCustodian)
See `/docs/SCHEMA_MODULES.md` for detailed architecture and usage patterns.
## Available Agents

### 1. @institution-extractor

**Purpose:** Extract heritage institution names, types, and basic metadata

**Schema:** Uses the `HeritageCustodian` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of institutions with:

- Institution name
- Institution type (from `InstitutionTypeEnum` in `enums.yaml`)
- Alternative names
- Description
- Confidence score

**Example:**

```
@institution-extractor
Please extract all heritage institutions from the following text:

[paste conversation text]
```
### 2. @location-extractor

**Purpose:** Extract geographic locations (cities, addresses, regions, countries)

**Schema:** Uses the `Location` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of locations with:

- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score

**Example:**

```
@location-extractor
Please extract all locations mentioned for heritage institutions:

[paste conversation text]
```
### 3. @identifier-extractor

**Purpose:** Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)

**Schema:** Uses the `Identifier` class from `/schemas/core.yaml`

**Input:** Conversation text

**Output:** JSON array of identifiers with:

- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score

**Recognizes:**

- ISIL codes: `NL-AsdAM`, `US-DLC`, etc.
- Wikidata IDs: `Q190804`
- VIAF IDs: `147143282`
- KvK numbers (Dutch): `41231987`
- Website URLs
- Other standard identifiers

**Example:**

```
@identifier-extractor
Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:

[paste conversation text]
```
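For downstream sanity checks, the identifier shapes listed above can also be matched mechanically. The sketch below is illustrative only: the agent itself uses LLM extraction, not these regexes, and the patterns are loose approximations (real ISIL validation is stricter than this):

```python
import re

# Loose, illustrative patterns mirroring the example identifiers above.
# These are NOT the agent's extraction method and would over-match in practice.
PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]+\b"),
    "WIKIDATA": re.compile(r"\bQ\d+\b"),
    "KVK": re.compile(r"\b\d{8}\b"),
}

def find_identifiers(text: str) -> list[dict]:
    """Return {scheme, value} dicts for every pattern hit in the text."""
    found = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"scheme": scheme, "value": match.group()})
    return found

hits = find_identifiers("The Amsterdam Museum (NL-AsdAM, Q190804) has KvK number 41231987.")
```

Such a pass can flag identifiers the agent missed, or corroborate ones it found.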
### 4. @event-extractor

**Purpose:** Extract organizational change events (founding, mergers, relocations, etc.)

**Schema:** Uses the `ChangeEvent` class from `/schemas/provenance.yaml`

**Input:** Conversation text

**Output:** JSON array of change events with:

- Event ID
- Change type (from `ChangeTypeEnum` in `enums.yaml`: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score

**Detects:**

- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.

**Example:**

```
@event-extractor
Please extract all organizational change events:

[paste conversation text]
```
## Usage Workflow

### Option 1: Using the Orchestration Script

The orchestration script (`scripts/extract_with_agents.py`) prepares prompts for each agent:

```bash
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
```

This will print prompts for each agent. Copy/paste each prompt to invoke the corresponding agent via @mention.
### Option 2: Direct Agent Invocation

You can invoke agents directly in an OpenCode session:

1. Load the conversation text:

   ```python
   from glam_extractor.parsers.conversation import ConversationParser

   parser = ConversationParser()
   conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
   text = conv.extract_all_text()
   ```

2. Invoke agents (via @mention):

   ```
   @institution-extractor
   Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:

   [paste text from conv.extract_all_text()]
   ```

3. Collect responses and combine results using `AgentOrchestrator.create_heritage_custodian_record()`
### Option 3: Batch Processing

For processing multiple conversations:

```python
from pathlib import Path
from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)
    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
```
## Agent Configuration

All agents are configured with:

- **mode:** `subagent` (invokable by primary agents or @mention)
- **model:** `claude-sonnet-4` (high-quality extraction)
- **temperature:** `0.1`-`0.2` (focused, deterministic)
- **tools:** all disabled (read-only analysis)

This ensures consistent, high-quality extractions with minimal hallucination.
## Output Format

All agents return JSON-only responses with no additional commentary:

```
{
  "institutions": [...],    // from @institution-extractor
  "locations": [...],       // from @location-extractor
  "identifiers": [...],     // from @identifier-extractor
  "change_events": [...]    // from @event-extractor
}
```

These JSON responses can be directly parsed and validated against the LinkML schema.
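In practice a model occasionally wraps its JSON in a code fence despite instructions, so a small defensive parser is useful before schema validation. A minimal sketch (the helper name and fence-stripping heuristic are ours, not part of the toolkit):

```python
import json

def parse_agent_response(raw: str) -> dict:
    """Parse a JSON-only agent reply, tolerating stray markdown code fences."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Remove surrounding backticks, then an optional language tag line
        # such as "json" that some models emit after the opening fence.
        cleaned = cleaned.strip("`")
        first_newline = cleaned.find("\n")
        if first_newline != -1 and cleaned[:first_newline].strip().isalpha():
            cleaned = cleaned[first_newline + 1:]
    return json.loads(cleaned)

reply = '{"institutions": [{"name": "Museu Nacional", "confidence": 0.95}]}'
data = parse_agent_response(reply)
```

The resulting dict can then be handed to the LinkML validation step described under Next Steps.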
## Confidence Scoring

All agents assign confidence scores (0.0-1.0):

- **0.9-1.0:** Explicit, unambiguous mentions
- **0.7-0.9:** Clear mentions with some ambiguity
- **0.5-0.7:** Inferred from context
- **0.3-0.5:** Low confidence, likely needs verification
- **0.0-0.3:** Very uncertain, flag for manual review
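The bands above can be turned into a triage step when post-processing agent output. A small sketch; the label names are illustrative, not schema values:

```python
def review_priority(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to a triage label per the bands above.

    The labels are illustrative only; they are not values from the schema.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.9:
        return "accept"              # explicit, unambiguous mention
    if confidence >= 0.7:
        return "accept_with_note"    # clear mention, some ambiguity
    if confidence >= 0.5:
        return "verify"              # inferred from context
    if confidence >= 0.3:
        return "needs_verification"  # low confidence
    return "manual_review"           # very uncertain
```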
## Multilingual Support

Agents support the 60+ languages found in the conversation dataset, including:

- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- and many more

Agents preserve original-language names (no translation) and adapt pattern matching to the language context.
## Data Quality

Extracted data is marked as:

- **Data Source:** `CONVERSATION_NLP`
- **Data Tier:** `TIER_4_INFERRED`
- **Provenance:** includes conversation ID, extraction date, method, and confidence score

This ensures proper provenance tracking and quality assessment.
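Assembling that metadata for each extracted record might look like the sketch below. The field names follow this README's description; the exact slot names in `/schemas/provenance.yaml` may differ, so check the schema before validating:

```python
from datetime import date

def build_provenance(conversation_id: str, confidence: float) -> dict:
    """Build the source/tier/provenance metadata described above.

    Field names are assumed from the README, not read from the schema.
    """
    return {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "conversation_id": conversation_id,
        "extraction_date": date.today().isoformat(),
        "extraction_method": "opencode_subagent",
        "confidence": confidence,
    }

record = build_provenance("Brazilian_GLAM_collection_inventories", 0.92)
```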
## Next Steps

After extraction:

1. Validate with the LinkML schema:

   ```bash
   linkml-validate -s schemas/heritage_custodian.yaml data.json
   ```

2. Cross-link with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. Geocode locations using the GeoNames database
4. Generate GHCIDs for persistent identification
5. Export to JSON-LD, RDF/Turtle, CSV, or Parquet
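The cross-linking step can be sketched as a join on ISIL codes. This assumes the registry has already been loaded into a dict keyed by ISIL, and that extracted institutions carry the identifier arrays produced by @identifier-extractor; both shapes are assumptions for illustration:

```python
def link_by_isil(extracted: list[dict], registry: dict[str, dict]) -> list[dict]:
    """Attach a registry match to each extracted institution via its ISIL code.

    `registry` is assumed to be keyed by ISIL code; institutions without an
    ISIL identifier (or without a registry hit) get registry_match=None.
    """
    linked = []
    for inst in extracted:
        isil = next(
            (i["value"] for i in inst.get("identifiers", [])
             if i.get("scheme") == "ISIL"),
            None,
        )
        match = registry.get(isil) if isil else None
        linked.append({**inst, "registry_match": match})
    return linked
```

Unmatched records would then fall back to name matching, as the step above notes.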
See `/AGENTS.md` for detailed extraction guidelines and examples.
See `/docs/SCHEMA_MODULES.md` for schema architecture and usage patterns.
## Contributing

To add a new extraction agent:

1. Create `.opencode/agent/your-agent-name.md`
2. Configure it with `mode: subagent` and an appropriate model and temperature
3. Define the input/output format with examples
4. Document extraction patterns and confidence scoring
5. Add multilingual support and edge-case handling
6. Test with real conversation data
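A skeleton for such an agent file might look like the following. This is a hypothetical sketch based on the configuration keys listed under Agent Configuration above; check OpenCode's own agent documentation for the exact frontmatter keys it accepts:

```markdown
---
mode: subagent
model: claude-sonnet-4
temperature: 0.1
---

You are a specialized extractor for <your domain>.
Return JSON only, with no additional commentary.
Assign a confidence score (0.0-1.0) to every extracted item.
```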
---

**Schema Version:** v0.2.0 (modular)
**Last Updated:** 2025-11-05
**Agent Count:** 4
**Languages Supported:** 60+
**Conversations Ready:** 139