OpenCode NLP Extraction Agents
This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.
🚨 Schema Source of Truth
MASTER SCHEMA LOCATION: schemas/20251121/linkml/
The LinkML schema files are the authoritative, canonical definition of the Heritage Custodian Ontology:
Primary Schema File (SINGLE SOURCE OF TRUTH):
schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
- Defines CustodianObservation (source-based references to heritage keepers)
- Defines CustodianName (standardized emic names)
- Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
- Includes ISO 20275 legal form codes (for legal entities)
- PiCo-inspired observation/reconstruction pattern
- Based on CIDOC-CRM E39_Actor (broader than organization)
ALL OTHER FILES ARE DERIVED/GENERATED from these LinkML schemas:
❌ DO NOT edit these derived files directly:
- schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix} - GENERATED from LinkML via gen-owl + rdfpipe
- schemas/20251121/typedb/*.tql - DERIVED TypeDB schema (manual translation from LinkML)
- schemas/20251121/uml/mermaid/*.mmd - DERIVED UML diagrams (manual visualization of LinkML)
- schemas/20251121/examples/*.yaml - INSTANCES conforming to the LinkML schema
Workflow for Schema Changes:
1. EDIT LinkML schema (01_custodian_name.yaml)
↓
2. REGENERATE RDF formats:
$ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/rdf/01_custodian_name.owl.ttl
$ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name.nt
$ # ... repeat for all 8 formats (see RDF_GENERATION_SUMMARY.md)
↓
3. UPDATE TypeDB schema (manual translation)
↓
4. UPDATE UML/Mermaid diagrams (manual visualization)
↓
5. VALIDATE example instances:
$ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml
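The regeneration step above can be scripted so all derived RDF formats stay in sync. The sketch below only assembles the shell commands shown in the workflow (a dry-run by default); it assumes gen-owl and rdfpipe are on PATH, and the exact output filenames per format are illustrative.

```python
"""Sketch: automate the LinkML -> RDF regeneration workflow (step 2 above).

build_commands() assembles the commands documented in the workflow so the
plan can be inspected (dry-run) before anything is executed.
"""
import subprocess

SCHEMA = "schemas/20251121/linkml/01_custodian_name.yaml"
TTL = "schemas/20251121/rdf/01_custodian_name.owl.ttl"

def build_commands(formats=("nt", "jsonld", "n3")):
    # gen-owl produces Turtle; rdfpipe converts it to the other serializations
    cmds = [f"gen-owl -f ttl {SCHEMA} > {TTL}"]
    for fmt in formats:
        out = TTL.replace(".owl.ttl", f".{fmt}")
        cmds.append(f"rdfpipe {TTL} -o {fmt} > {out}")
    return cmds

def regenerate(dry_run=True):
    for cmd in build_commands():
        print(cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)

if __name__ == "__main__":
    regenerate(dry_run=True)
```

Extend the `formats` tuple to cover all eight serializations listed in RDF_GENERATION_SUMMARY.md.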
Why LinkML is the Master:
- ✅ Formal specification: Type-safe, validation rules, cardinality constraints
- ✅ Multi-format generation: Single source → RDF, JSON-LD, Python, SQL, GraphQL
- ✅ Version control: Clear diffs, semantic versioning, change tracking
- ✅ Ontology alignment: Explicit class_uri and slot_uri mappings to base ontologies
- ✅ Documentation: Rich inline documentation with examples
NEVER:
- ❌ Edit RDF files directly (they will be overwritten on next generation)
- ❌ Consider TypeDB schema as authoritative (it's a translation target)
- ❌ Treat UML diagrams as specification (they're visualizations)
ALWAYS:
- ✅ Refer to LinkML schemas for class definitions
- ✅ Update LinkML first, then regenerate derived formats
- ✅ Validate changes against LinkML metamodel
- ✅ Document schema changes in LinkML YAML comments
See also:
- schemas/20251121/RDF_GENERATION_SUMMARY.md - RDF generation process documentation
- docs/MIGRATION_GUIDE.md - Schema migration procedures
- LinkML documentation: https://linkml.io/
Schema Reference (v0.2.1 - ISO 20275 Migration)
All agents extract data conforming to the Heritage Custodian Ontology defined in LinkML:
Authoritative Schema File:
schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
- CustodianObservation: Source-based references (emic/etic perspectives)
- CustodianName: Standardized emic names (subclass of Observation)
- CustodianReconstruction: Formal entities (individuals, groups, organizations, governments, corporations)
- ReconstructionActivity: Entity resolution provenance
- Includes ISO 20275 legal form codes (for legal entities)
- Based on CIDOC-CRM E39_Actor
Key Features (as of v0.2.1):
- ✅ ISO 20275 legal form codes (4-character alphanumeric: ASBL, GOVT, PRIV, etc.)
- ✅ Multi-aspect modeling (place, custodian, legal form, collections, people aspects)
- ✅ Temporal event tracking (founding, mergers, relocations, custody transfers)
- ✅ Ontology integration (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
- ✅ Provenance tracking (data source, tier, extraction method, confidence scores)
See schemas/20251121/RDF_GENERATION_SUMMARY.md for schema architecture and recent updates.
Available Agents
1. @institution-extractor
Purpose: Extract heritage institution names, types, and basic metadata
Schema: Uses CustodianObservation and CustodianName classes from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of institutions with:
- Institution name
- Institution type (from InstitutionTypeEnum in enums.yaml)
- Alternative names
- Description
- Confidence score
Example:
@institution-extractor
Please extract all heritage institutions from the following text:
[paste conversation text]
2. @location-extractor
Purpose: Extract geographic locations (cities, addresses, regions, countries)
Schema: Uses PlaceAspect class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of locations with:
- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score
Example:
@location-extractor
Please extract all locations mentioned for heritage institutions:
[paste conversation text]
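A returned location record can be sanity-checked before it enters the pipeline. This is a minimal sketch, not the agent's implementation: the key names (city, country, confidence) are assumptions based on the output fields listed above, and the ISO 3166-1 alpha-2 check is purely syntactic (two uppercase letters) rather than a lookup against the real code list.

```python
"""Minimal validation sketch for a @location-extractor record.

Key names are assumed from the documented output fields; the country
check is syntactic only (a real pipeline would consult the ISO list).
"""
import re

REQUIRED_KEYS = {"city", "country", "confidence"}

def is_valid_location(loc: dict) -> bool:
    if not REQUIRED_KEYS <= loc.keys():
        return False
    # ISO 3166-1 alpha-2: exactly two uppercase ASCII letters
    if not re.fullmatch(r"[A-Z]{2}", loc["country"]):
        return False
    return 0.0 <= loc["confidence"] <= 1.0
```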
3. @identifier-extractor
Purpose: Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
Schema: Uses Identifier class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of identifiers with:
- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score
Recognizes:
- ISIL codes: NL-AsdAM, US-DLC, etc.
- Wikidata IDs: Q190804
- VIAF IDs: 147143282
- KvK numbers (Dutch): 41231987
- Website URLs
- Other standard identifiers
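To make the recognized patterns concrete, here is an illustrative sketch of the kinds of regular expressions involved. This is not the agent's actual implementation (the agent is LLM-based); real ISIL prefixes and bare-number schemes like VIAF have more edge cases than shown.

```python
"""Illustrative patterns for the identifier schemes listed above.

A sketch only: VIAF is omitted because a bare digit run is ambiguous
with KvK numbers without surrounding context.
"""
import re

PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]+\b"),  # e.g. NL-AsdAM, US-DLC
    "WIKIDATA": re.compile(r"\bQ\d+\b"),                      # e.g. Q190804
    "KVK": re.compile(r"\b\d{8}\b"),                          # Dutch KvK: 8 digits
    "URL": re.compile(r"https?://\S+"),
}

def extract_identifiers(text: str) -> list[dict]:
    found = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"scheme": scheme, "value": match.group(0)})
    return found
```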
Example:
@identifier-extractor
Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:
[paste conversation text]
4. @event-extractor
Purpose: Extract organizational change events (founding, mergers, relocations, etc.)
Schema: Uses TemporalEvent class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of change events with:
- Event ID
- Change type (from ChangeTypeEnum in enums.yaml: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score
Detects:
- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.
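The phrasings above can be sketched as simple patterns. This is a toy illustration only: the real agent relies on the LLM rather than regexes, and handles many languages beyond the English templates shown here.

```python
"""Toy pattern matcher for the change-event phrasings listed above.

Illustrative only; the actual agent is LLM-based and multilingual.
"""
import re

EVENT_PATTERNS = [
    ("FOUNDING", re.compile(r"[Ff]ounded in (\d{4})")),
    ("MERGER", re.compile(r"[Mm]erged with (.+?) in (\d{4})")),
    ("RELOCATION", re.compile(r"[Mm]oved to (.+?) in (\d{4})")),
    ("NAME_CHANGE", re.compile(r"[Rr]enamed from (.+?) to (\S+)")),
]

def detect_events(text: str) -> list[dict]:
    events = []
    for change_type, pattern in EVENT_PATTERNS:
        for m in pattern.finditer(text):
            events.append({"change_type": change_type, "groups": m.groups()})
    return events
```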
Example:
@event-extractor
Please extract all organizational change events:
[paste conversation text]
Usage Workflow
Option 1: Using the Orchestration Script
The orchestration script (scripts/extract_with_agents.py) prepares prompts for each agent:
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
This will print prompts for each agent. Copy/paste each prompt to invoke the corresponding agent via @mention.
Option 2: Direct Agent Invocation
You can invoke agents directly in an OpenCode session:
- Load conversation text:
from glam_extractor.parsers.conversation import ConversationParser
parser = ConversationParser()
conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
text = conv.extract_all_text()
- Invoke agents (via @mention):
@institution-extractor
Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:
[paste text from conv.extract_all_text()]
- Collect responses and combine results using AgentOrchestrator.create_heritage_custodian_record()
Option 3: Batch Processing
For processing multiple conversations:
from pathlib import Path
from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)
    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
Agent Configuration
All agents are configured with:
- mode: subagent (invokable by primary agents or @mention)
- model: claude-sonnet-4 (high-quality extraction)
- temperature: 0.1-0.2 (focused, deterministic)
- tools: all disabled (read-only analysis)
This ensures consistent, high-quality extractions with minimal hallucination.
Output Format
All agents return JSON-only responses with no additional commentary:
{
"institutions": [...], // from @institution-extractor
"locations": [...], // from @location-extractor
"identifiers": [...], // from @identifier-extractor
"change_events": [...] // from @event-extractor
}
These JSON responses can be directly parsed and validated against the LinkML schema.
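Combining the four agents' replies into the shape above can be sketched as follows. This is an assumption-laden illustration, not the AgentOrchestrator implementation: it assumes each reply is a bare JSON object keyed by one of the four documented keys.

```python
"""Sketch of merging the agents' JSON-only replies into one record.

Assumes each reply parses as a JSON object keyed as documented above;
unknown keys are passed through unchanged.
"""
import json

def combine_agent_responses(*raw_responses: str) -> dict:
    record = {"institutions": [], "locations": [],
              "identifiers": [], "change_events": []}
    for raw in raw_responses:
        payload = json.loads(raw)  # agents return JSON with no commentary
        for key, items in payload.items():
            record.setdefault(key, []).extend(items)
    return record
```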
Confidence Scoring
All agents assign confidence scores (0.0-1.0):
- 0.9-1.0: Explicit, unambiguous mentions
- 0.7-0.9: Clear mentions with some ambiguity
- 0.5-0.7: Inferred from context
- 0.3-0.5: Low confidence, likely needs verification
- 0.0-0.3: Very uncertain, flag for manual review
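The bands above map naturally onto review actions in a downstream pipeline. The helper below is a sketch; treating a boundary value (e.g. exactly 0.7) as belonging to the higher band is an assumption made here for illustration.

```python
"""Helper mapping the confidence bands above to a review action.

Boundary values are assigned to the higher band (an assumption).
"""
def review_action(score: float) -> str:
    if score >= 0.9:
        return "accept"            # explicit, unambiguous
    if score >= 0.7:
        return "accept_with_note"  # clear but somewhat ambiguous
    if score >= 0.5:
        return "verify"            # inferred from context
    if score >= 0.3:
        return "needs_verification"
    return "manual_review"         # very uncertain
```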
Multilingual Support
Agents support 60+ languages found in the conversation dataset, including:
- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- And many more...
Agents preserve original language names (no translation) and adapt pattern matching to language context.
Data Quality
Extracted data is marked as:
- Data Source: CONVERSATION_NLP
- Data Tier: TIER_4_INFERRED
- Provenance: Includes conversation ID, extraction date, method, and confidence score
This ensures proper provenance tracking and quality assessment.
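A provenance stamp with these fields could be assembled as below. Field names are illustrative assumptions; the authoritative slot names live in the LinkML schema (01_custodian_name.yaml).

```python
"""Sketch of the provenance stamp attached to each extracted record.

Field names are illustrative; consult the LinkML schema for the
authoritative slot names.
"""
from datetime import date

def make_provenance(conversation_id: str, confidence: float) -> dict:
    return {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "conversation_id": conversation_id,
        "extraction_date": date.today().isoformat(),
        "extraction_method": "opencode_subagent",  # hypothetical label
        "confidence_score": confidence,
    }
```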
Next Steps
After extraction:
1. Validate with the LinkML schema:
   linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml data.yaml
2. Cross-link with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. Geocode locations using the GeoNames database
4. Generate GHCIDs for persistent identification
5. Export to JSON-LD, RDF/Turtle, CSV, or Parquet
See /AGENTS.md for detailed extraction guidelines and examples.
See /docs/SCHEMA_MODULES.md for schema architecture and usage patterns.
Contributing
To add a new extraction agent:
- Create .opencode/agent/your-agent-name.md
- Configure with mode: subagent and an appropriate model and temperature
- Define input/output format with examples
- Document extraction patterns and confidence scoring
- Add multilingual support and edge case handling
- Test with real conversation data
Schema Version: v0.2.1 (ISO 20275 migration)
Schema Location: schemas/20251121/linkml/
Last Updated: 2025-11-21
Agent Count: 4
Languages Supported: 60+
Conversations Ready: 139