# OpenCode NLP Extraction Agents

This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.

## 🚨 Schema Source of Truth

**MASTER SCHEMA LOCATION**: `schemas/20251121/linkml/`

The LinkML schema files are the **authoritative, canonical definition** of the Heritage Custodian Ontology.

**Primary Schema File** (SINGLE SOURCE OF TRUTH):

- `schemas/20251121/linkml/01_custodian_name.yaml` - Complete Heritage Custodian Ontology
  - Defines CustodianObservation (source-based references to heritage keepers)
  - Defines CustodianName (standardized emic names)
  - Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
  - Includes ISO 20275 legal form codes (for legal entities)
  - PiCo-inspired observation/reconstruction pattern
  - Based on CIDOC-CRM E39_Actor (broader than organization)

**ALL OTHER FILES ARE DERIVED/GENERATED** from these LinkML schemas.

❌ **DO NOT** edit these derived files directly:

- `schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix}` - **GENERATED** from LinkML via `gen-owl` + `rdfpipe`
- `schemas/20251121/typedb/*.tql` - **DERIVED** TypeDB schema (manual translation from LinkML)
- `schemas/20251121/uml/mermaid/*.mmd` - **DERIVED** UML diagrams (manual visualization of LinkML)
- `schemas/20251121/examples/*.yaml` - **INSTANCES** conforming to the LinkML schema

**Workflow for Schema Changes**:

```
1. EDIT LinkML schema (01_custodian_name.yaml)
   ↓
2. REGENERATE RDF formats:
   $ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/rdf/01_custodian_name.owl.ttl
   $ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name.nt
   $ # ... repeat for all 8 formats (see RDF_GENERATION_SUMMARY.md)
   ↓
3. UPDATE TypeDB schema (manual translation)
   ↓
4. UPDATE UML/Mermaid diagrams (manual visualization)
   ↓
5. VALIDATE example instances:
   $ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml
```

**Why LinkML is the Master**:

- ✅ **Formal specification**: Type-safe, validation rules, cardinality constraints
- ✅ **Multi-format generation**: Single source → RDF, JSON-LD, Python, SQL, GraphQL
- ✅ **Version control**: Clear diffs, semantic versioning, change tracking
- ✅ **Ontology alignment**: Explicit `class_uri` and `slot_uri` mappings to base ontologies
- ✅ **Documentation**: Rich inline documentation with examples

**NEVER**:

- ❌ Edit RDF files directly (they will be overwritten on the next generation)
- ❌ Treat the TypeDB schema as authoritative (it is a translation target)
- ❌ Treat UML diagrams as the specification (they are visualizations)

**ALWAYS**:

- ✅ Refer to the LinkML schemas for class definitions
- ✅ Update LinkML first, then regenerate derived formats
- ✅ Validate changes against the LinkML metamodel
- ✅ Document schema changes in LinkML YAML comments

**See also**:

- `schemas/20251121/RDF_GENERATION_SUMMARY.md` - RDF generation process documentation
- `docs/MIGRATION_GUIDE.md` - Schema migration procedures
- LinkML documentation: https://linkml.io/

---

## Schema Reference (v0.2.1 - ISO 20275 Migration)

All agents extract data conforming to the **Heritage Custodian Ontology** defined in LinkML.

**Authoritative Schema File**:

- **`schemas/20251121/linkml/01_custodian_name.yaml`** - Complete Heritage Custodian Ontology
  - CustodianObservation: Source-based references (emic/etic perspectives)
  - CustodianName: Standardized emic names (subclass of Observation)
  - CustodianReconstruction: Formal entities (individuals, groups, organizations, governments, corporations)
  - ReconstructionActivity: Entity resolution provenance
  - Includes ISO 20275 legal form codes (for legal entities)
  - Based on CIDOC-CRM E39_Actor

**Key Features** (as of v0.2.1):

- ✅ ISO 20275 legal form codes (4-character alphanumeric: `ASBL`, `GOVT`, `PRIV`, etc.)
- ✅ Multi-aspect modeling (place, custodian, legal form, collections, people aspects)
- ✅ Temporal event tracking (founding, mergers, relocations, custody transfers)
- ✅ Ontology integration (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
- ✅ Provenance tracking (data source, tier, extraction method, confidence scores)

See `schemas/20251121/RDF_GENERATION_SUMMARY.md` for schema architecture and recent updates.

## Available Agents

### 1. @institution-extractor

**Purpose**: Extract heritage institution names, types, and basic metadata

**Schema**: Uses the `CustodianObservation` and `CustodianName` classes from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of institutions with:

- Institution name
- Institution type (from `InstitutionTypeEnum` in `enums.yaml`)
- Alternative names
- Description
- Confidence score

**Example**:

```
@institution-extractor Please extract all heritage institutions from the following text:

[paste conversation text]
```

### 2. @location-extractor

**Purpose**: Extract geographic locations (cities, addresses, regions, countries)

**Schema**: Uses the `PlaceAspect` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of locations with:

- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score

**Example**:

```
@location-extractor Please extract all locations mentioned for heritage institutions:

[paste conversation text]
```

### 3. @identifier-extractor

**Purpose**: Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)

**Schema**: Uses the `Identifier` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of identifiers with:

- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score

**Recognizes**:

- ISIL codes: `NL-AsdAM`, `US-DLC`, etc.
- Wikidata IDs: `Q190804`
- VIAF IDs: `147143282`
- KvK numbers (Dutch): `41231987`
- Website URLs
- Other standard identifiers

**Example**:

```
@identifier-extractor Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:

[paste conversation text]
```

### 4. @event-extractor

**Purpose**: Extract organizational change events (founding, mergers, relocations, etc.)

**Schema**: Uses the `TemporalEvent` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of change events with:

- Event ID
- Change type (from `ChangeTypeEnum` in `enums.yaml`: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score

**Detects**:

- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.

**Example**:

```
@event-extractor Please extract all organizational change events:

[paste conversation text]
```

## Usage Workflow

### Option 1: Using the Orchestration Script

The orchestration script (`scripts/extract_with_agents.py`) prepares prompts for each agent:

```bash
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
```

This prints a prompt for each agent. Copy and paste each prompt to invoke the corresponding agent via @mention.

### Option 2: Direct Agent Invocation

You can invoke agents directly in an OpenCode session:

1. **Load the conversation text**:

   ```python
   from glam_extractor.parsers.conversation import ConversationParser

   parser = ConversationParser()
   conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
   text = conv.extract_all_text()
   ```

2. **Invoke agents** (via @mention):

   ```
   @institution-extractor Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:

   [paste text from conv.extract_all_text()]
   ```

3. **Collect responses** and combine the results using `AgentOrchestrator.create_heritage_custodian_record()`

### Option 3: Batch Processing

For processing multiple conversations:

```python
from pathlib import Path

from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")

for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)

    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()

    # ... invoke agents and collect results ...
```

## Agent Configuration

All agents are configured with:

- **mode**: `subagent` (invokable by primary agents or @mention)
- **model**: `claude-sonnet-4` (high-quality extraction)
- **temperature**: `0.1-0.2` (focused, deterministic)
- **tools**: All disabled (read-only analysis)

This ensures consistent, high-quality extractions with minimal hallucination.

## Output Format

All agents return **JSON-only responses** with no additional commentary:

```json
{
  "institutions": [...],    // from @institution-extractor
  "locations": [...],       // from @location-extractor
  "identifiers": [...],     // from @identifier-extractor
  "change_events": [...]    // from @event-extractor
}
```

These JSON responses can be parsed directly and validated against the LinkML schema.
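As a minimal sketch of consuming such a response, the following Python snippet parses an agent's JSON output and partitions records by confidence score before validation. The sample payload and the `0.5` threshold are illustrative, not part of the repository:

```python
import json

# Example agent response (JSON-only, matching the output format above;
# the records themselves are invented for illustration).
raw = """
{
  "institutions": [
    {"name": "Museu Nacional", "institution_type": "MUSEUM", "confidence": 0.95},
    {"name": "Arquivo Historico", "institution_type": "ARCHIVE", "confidence": 0.45}
  ]
}
"""

data = json.loads(raw)

# Keep high-confidence extractions; flag the rest for manual review.
# The threshold is illustrative -- tune it to your quality requirements.
REVIEW_THRESHOLD = 0.5
accepted = [r for r in data["institutions"] if r["confidence"] >= REVIEW_THRESHOLD]
needs_review = [r for r in data["institutions"] if r["confidence"] < REVIEW_THRESHOLD]

print(len(accepted), len(needs_review))  # → 1 1
```

Records in `accepted` would then proceed to LinkML validation, while `needs_review` items go to a manual review queue.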
## Confidence Scoring

All agents assign confidence scores (0.0-1.0):

- **0.9-1.0**: Explicit, unambiguous mentions
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, likely needs verification
- **0.0-0.3**: Very uncertain; flag for manual review

## Multilingual Support

Agents support **60+ languages** found in the conversation dataset, including:

- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- And many more...

Agents preserve original-language names (no translation) and adapt pattern matching to the language context.

## Data Quality

Extracted data is marked as:

- **Data Source**: `CONVERSATION_NLP`
- **Data Tier**: `TIER_4_INFERRED`
- **Provenance**: Includes conversation ID, extraction date, method, and confidence score

This ensures proper provenance tracking and quality assessment.

## Next Steps

After extraction:

1. **Validate** with the LinkML schema:

   ```bash
   linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml data.yaml
   ```

2. **Cross-link** with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. **Geocode** locations using the GeoNames database
4. **Generate GHCIDs** for persistent identification
5. **Export** to JSON-LD, RDF/Turtle, CSV, or Parquet

See `/AGENTS.md` for detailed extraction guidelines and examples.
See `/docs/SCHEMA_MODULES.md` for schema architecture and usage patterns.

## Contributing

To add a new extraction agent:

1. Create `.opencode/agent/your-agent-name.md`
2. Configure with `mode: subagent` and an appropriate model and temperature
3. Define the input/output format with examples
4. Document extraction patterns and confidence scoring
5. Add multilingual support and edge case handling
6. Test with real conversation data

---

**Schema Version**: v0.2.1 (ISO 20275 migration)
**Schema Location**: `schemas/20251121/linkml/`
**Last Updated**: 2025-11-21
**Agent Count**: 4
**Languages Supported**: 60+
**Conversations Ready**: 139
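As a sketch of the Contributing steps above, a new agent definition file might look like the following. The frontmatter fields mirror the configuration listed under Agent Configuration; the description text, the per-tool keys, and the prompt body are hypothetical placeholders, not part of the repository:

```markdown
---
description: Extract <your data type> from conversation text
mode: subagent
model: claude-sonnet-4
temperature: 0.1
tools:
  write: false
  edit: false
  bash: false
---

You are a specialized extraction agent. Analyze the conversation text
provided and return a JSON-only response with no additional commentary.
Assign each extracted record a confidence score between 0.0 and 1.0,
following the confidence scoring bands documented for this project.
```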