# OpenCode NLP Extraction Agents

This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.

## 🚨 Schema Source of Truth

**MASTER SCHEMA LOCATION**: `schemas/20251121/linkml/`

The LinkML schema files are the **authoritative, canonical definition** of the Heritage Custodian Ontology:

**Primary Schema File** (SINGLE SOURCE OF TRUTH):

- `schemas/20251121/linkml/01_custodian_name.yaml` - Complete Heritage Custodian Ontology
  - Defines CustodianObservation (source-based references to heritage keepers)
  - Defines CustodianName (standardized emic names)
  - Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
  - Includes ISO 20275 legal form codes (for legal entities)
  - PiCo-inspired observation/reconstruction pattern
  - Based on CIDOC-CRM E39_Actor (broader than organization)

**ALL OTHER FILES ARE DERIVED/GENERATED** from these LinkML schemas:

❌ **DO NOT** edit these derived files directly:

- `schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix}` - **GENERATED** from LinkML via `gen-owl` + `rdfpipe`
- `schemas/20251121/typedb/*.tql` - **DERIVED** TypeDB schema (manual translation from LinkML)
- `schemas/20251121/uml/mermaid/*.mmd` - **DERIVED** UML diagrams (manual visualization of LinkML)
- `schemas/20251121/examples/*.yaml` - **INSTANCES** conforming to the LinkML schema

**Workflow for Schema Changes**:

```
1. EDIT LinkML schema (01_custodian_name.yaml)
   ↓
2. REGENERATE RDF formats:
   $ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/rdf/01_custodian_name.owl.ttl
   $ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name.nt
   $ # ... repeat for all 8 formats (see RDF_GENERATION_SUMMARY.md)
   ↓
3. UPDATE TypeDB schema (manual translation)
   ↓
4. UPDATE UML/Mermaid diagrams (manual visualization)
   ↓
5. VALIDATE example instances:
   $ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml
```
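
Step 2 repeats the same `rdfpipe` call once per serialization, so it can be scripted. The dry-run sketch below only prints the commands it would run; the format list is an assumption inferred from the file extensions named above, so check `RDF_GENERATION_SUMMARY.md` for the authoritative list:

```shell
# Dry-run sketch: print the regeneration command for each RDF serialization.
# Format names are assumptions inferred from the extensions listed above.
SCHEMA=schemas/20251121/linkml/01_custodian_name.yaml
OUT=schemas/20251121/rdf
BASE=01_custodian_name

echo "gen-owl -f ttl $SCHEMA > $OUT/$BASE.owl.ttl"
for fmt in nt json-ld xml n3 trig trix; do
  echo "rdfpipe $OUT/$BASE.owl.ttl -o $fmt > $OUT/$BASE.$fmt"
done
```

Remove the `echo` wrappers to execute the commands for real once the format names are confirmed.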

**Why LinkML is the Master**:

- ✅ **Formal specification**: Type safety, validation rules, cardinality constraints
- ✅ **Multi-format generation**: Single source → RDF, JSON-LD, Python, SQL, GraphQL
- ✅ **Version control**: Clear diffs, semantic versioning, change tracking
- ✅ **Ontology alignment**: Explicit `class_uri` and `slot_uri` mappings to base ontologies
- ✅ **Documentation**: Rich inline documentation with examples

**NEVER**:

- ❌ Edit RDF files directly (they will be overwritten on the next generation)
- ❌ Treat the TypeDB schema as authoritative (it is a translation target)
- ❌ Treat UML diagrams as the specification (they are visualizations)

**ALWAYS**:

- ✅ Refer to the LinkML schemas for class definitions
- ✅ Update LinkML first, then regenerate derived formats
- ✅ Validate changes against the LinkML metamodel
- ✅ Document schema changes in LinkML YAML comments

**See also**:

- `schemas/20251121/RDF_GENERATION_SUMMARY.md` - RDF generation process documentation
- `docs/MIGRATION_GUIDE.md` - Schema migration procedures
- LinkML documentation: https://linkml.io/

---

## Schema Reference (v0.2.1 - ISO 20275 Migration)

All agents extract data conforming to the **Heritage Custodian Ontology** defined in LinkML:

**Authoritative Schema File**:

- **`schemas/20251121/linkml/01_custodian_name.yaml`** - Complete Heritage Custodian Ontology
  - CustodianObservation: Source-based references (emic/etic perspectives)
  - CustodianName: Standardized emic names (subclass of Observation)
  - CustodianReconstruction: Formal entities (individuals, groups, organizations, governments, corporations)
  - ReconstructionActivity: Entity resolution provenance
  - Includes ISO 20275 legal form codes (for legal entities)
  - Based on CIDOC-CRM E39_Actor

**Key Features** (as of v0.2.1):

- ✅ ISO 20275 legal form codes (4-character alphanumeric: `ASBL`, `GOVT`, `PRIV`, etc.)
- ✅ Multi-aspect modeling (place, custodian, legal form, collections, people aspects)
- ✅ Temporal event tracking (founding, mergers, relocations, custody transfers)
- ✅ Ontology integration (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
- ✅ Provenance tracking (data source, tier, extraction method, confidence scores)

See `schemas/20251121/RDF_GENERATION_SUMMARY.md` for schema architecture and recent updates.

## Available Agents

### 1. @institution-extractor

**Purpose**: Extract heritage institution names, types, and basic metadata

**Schema**: Uses the `CustodianObservation` and `CustodianName` classes from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of institutions with:

- Institution name
- Institution type (from `InstitutionTypeEnum` in `enums.yaml`)
- Alternative names
- Description
- Confidence score

**Example**:

```
@institution-extractor

Please extract all heritage institutions from the following text:
[paste conversation text]
```

### 2. @location-extractor

**Purpose**: Extract geographic locations (cities, addresses, regions, countries)

**Schema**: Uses the `PlaceAspect` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of locations with:

- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score

**Example**:

```
@location-extractor

Please extract all locations mentioned for heritage institutions:
[paste conversation text]
```

### 3. @identifier-extractor

**Purpose**: Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)

**Schema**: Uses the `Identifier` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of identifiers with:

- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score

**Recognizes**:

- ISIL codes: `NL-AsdAM`, `US-DLC`, etc.
- Wikidata IDs: `Q190804`
- VIAF IDs: `147143282`
- KvK numbers (Dutch): `41231987`
- Website URLs
- Other standard identifiers

**Example**:

```
@identifier-extractor

Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:
[paste conversation text]
```

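The identifier shapes above are regular enough to pre-screen with regexes before (or alongside) agent extraction. The patterns below are illustrative approximations, not the agent's actual rules; VIAF is omitted because its bare numeric IDs are ambiguous without surrounding context:

```python
import re

# Hypothetical pre-screening patterns for the identifier shapes listed above.
# These are illustrative approximations, not the agent's actual rules.
PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9]+\b"),  # e.g. NL-AsdAM, US-DLC
    "WIKIDATA": re.compile(r"\bQ\d+\b"),                 # e.g. Q190804
    "KVK": re.compile(r"\b\d{8}\b"),                     # Dutch KvK: 8 digits
    "URL": re.compile(r"https?://\S+"),
}

def screen_identifiers(text: str) -> list[dict]:
    """Return candidate identifiers with their scheme for later agent review."""
    hits = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append({"scheme": scheme, "value": match.group(0)})
    return hits

print(screen_identifiers("The Amsterdam Museum (ISIL NL-AsdAM, Wikidata Q190804)."))
```

Candidates found this way still need the agent (or a human) to confirm the scheme, since, for example, any 8-digit number matches the KvK pattern.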
### 4. @event-extractor

**Purpose**: Extract organizational change events (founding, mergers, relocations, etc.)

**Schema**: Uses the `TemporalEvent` class from `schemas/20251121/linkml/01_custodian_name.yaml`

**Input**: Conversation text

**Output**: JSON array of change events with:

- Event ID
- Change type (from `ChangeTypeEnum` in `enums.yaml`: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score

**Detects**:

- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.

**Example**:

```
@event-extractor

Please extract all organizational change events:
[paste conversation text]
```

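The cue phrases listed under **Detects** map onto `ChangeTypeEnum` values. The agent itself uses the LLM rather than regexes, but the mapping can be sketched with illustrative patterns (the cue wordings below are assumptions drawn from the examples above):

```python
import re

# Illustrative cue patterns for the change types above. The real agent relies
# on the LLM, not regexes; this only sketches the phrase-to-enum mapping.
CUES = {
    "FOUNDING": re.compile(r"\b[Ff]ounded in (\d{4})"),
    "MERGER": re.compile(r"\b[Mm]erged with (.+?) in (\d{4})"),
    "RELOCATION": re.compile(r"\b[Mm]oved to (.+?) in (\d{4})"),
}

def detect_events(text: str) -> list[dict]:
    """Map cue phrases to ChangeTypeEnum-style change types with a year."""
    events = []
    for change_type, pattern in CUES.items():
        for m in pattern.finditer(text):
            # The final capture group is always the 4-digit year
            events.append({"change_type": change_type, "event_date": m.group(m.lastindex)})
    return events

print(detect_events("Founded in 1985, the archive merged with X in 2001."))
```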
## Usage Workflow

### Option 1: Using the Orchestration Script

The orchestration script (`scripts/extract_with_agents.py`) prepares prompts for each agent:

```bash
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
```

This prints a prompt for each agent. Copy and paste each prompt to invoke the corresponding agent via @mention.

### Option 2: Direct Agent Invocation

You can invoke agents directly in an OpenCode session:

1. **Load conversation text**:

   ```python
   from glam_extractor.parsers.conversation import ConversationParser

   parser = ConversationParser()
   conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
   text = conv.extract_all_text()
   ```

2. **Invoke agents** (via @mention):

   ```
   @institution-extractor

   Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:

   [paste text from conv.extract_all_text()]
   ```

3. **Collect responses** and combine the results using `AgentOrchestrator.create_heritage_custodian_record()`

### Option 3: Batch Processing

For processing multiple conversations:

```python
from pathlib import Path

from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)

    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
```

## Agent Configuration

All agents are configured with:

- **mode**: `subagent` (invokable by primary agents or @mention)
- **model**: `claude-sonnet-4` (high-quality extraction)
- **temperature**: `0.1-0.2` (focused, deterministic)
- **tools**: all disabled (read-only analysis)

This ensures consistent, high-quality extractions with minimal hallucination.

## Output Format

All agents return **JSON-only responses** with no additional commentary:

```json
{
  "institutions": [...],    // from @institution-extractor
  "locations": [...],       // from @location-extractor
  "identifiers": [...],     // from @identifier-extractor
  "change_events": [...]    // from @event-extractor
}
```

These JSON responses can be parsed directly and validated against the LinkML schema.

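Because the responses are JSON-only, they parse with `json.loads` and can be sanity-checked for the expected top-level keys before schema validation. A minimal sketch (the sample payload and key names follow the shape above and are illustrative, not the agents' exact contract):

```python
import json

# Hypothetical combined agent response, following the shape sketched above.
raw = """
{
  "institutions": [{"name": "Museu Paulista", "confidence": 0.95}],
  "locations": [],
  "identifiers": [],
  "change_events": []
}
"""

record = json.loads(raw)

# Fail fast if an agent added commentary around the JSON or dropped a key
expected_keys = {"institutions", "locations", "identifiers", "change_events"}
missing = expected_keys - record.keys()
if missing:
    raise ValueError(f"agent response missing keys: {missing}")

print(len(record["institutions"]))
```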
## Confidence Scoring

All agents assign confidence scores (0.0-1.0):

- **0.9-1.0**: Explicit, unambiguous mentions
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, likely needs verification
- **0.0-0.3**: Very uncertain, flag for manual review

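These bands suggest a simple triage step downstream. A sketch (the 0.5 cutoff mirrors the "likely needs verification" boundary above; the field names are assumptions):

```python
# Sketch: route extractions by confidence band, mirroring the bands above.
# The 0.5 default cutoff and the "confidence" field name are assumptions.
def triage(items: list[dict], review_below: float = 0.5) -> tuple[list, list]:
    """Split extracted items into accepted vs. flagged-for-manual-review."""
    accepted = [i for i in items if i["confidence"] >= review_below]
    flagged = [i for i in items if i["confidence"] < review_below]
    return accepted, flagged

items = [
    {"name": "Rijksmuseum", "confidence": 0.95},     # explicit mention
    {"name": "a city archive", "confidence": 0.35},  # needs verification
]
accepted, flagged = triage(items)
print(len(accepted), len(flagged))
```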
## Multilingual Support

Agents support **60+ languages** found in the conversation dataset, including:

- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- And many more...

Agents preserve original-language names (no translation) and adapt pattern matching to the language context.

## Data Quality

Extracted data is marked as:

- **Data Source**: `CONVERSATION_NLP`
- **Data Tier**: `TIER_4_INFERRED`
- **Provenance**: Includes conversation ID, extraction date, method, and confidence score

This ensures proper provenance tracking and quality assessment.

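A provenance stamp carrying these markers might look like the sketch below. The field names mirror the bullets above but are illustrative; check the LinkML schema for the actual slot names:

```python
from datetime import date

# Hypothetical provenance stamp attached to each extracted record.
# Field names mirror the Data Quality markers above but are illustrative only.
def provenance_stamp(conversation_id: str, confidence: float) -> dict:
    return {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "conversation_id": conversation_id,
        "extraction_date": date.today().isoformat(),
        "extraction_method": "opencode_subagent",  # assumed method label
        "confidence": confidence,
    }

stamp = provenance_stamp("Brazilian_GLAM_collection_inventories", 0.9)
print(stamp["data_tier"])
```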
## Next Steps

After extraction:

1. **Validate** with the LinkML schema:

   ```bash
   linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml data.yaml
   ```

2. **Cross-link** with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching

3. **Geocode** locations using the GeoNames database

4. **Generate GHCIDs** for persistent identification

5. **Export** to JSON-LD, RDF/Turtle, CSV, or Parquet

See `/AGENTS.md` for detailed extraction guidelines and examples.

See `/docs/SCHEMA_MODULES.md` for schema architecture and usage patterns.

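Step 2 (cross-linking) can be sketched as a two-key lookup that prefers an ISIL match and falls back to an exact name match. The registry rows and the `BR-SpMP` code below are made up for illustration:

```python
# Sketch of cross-linking: match extracted institutions against an
# authoritative registry by ISIL code, falling back to exact name match.
# The registry rows and the BR-SpMP code are invented for illustration.
registry = [
    {"isil": "NL-AsdAM", "name": "Amsterdam Museum"},
    {"isil": "BR-SpMP", "name": "Museu Paulista"},  # hypothetical ISIL
]

extracted = [
    {"name": "Museu Paulista", "isil": None},
    {"name": "Amsterdam Museum", "isil": "NL-AsdAM"},
]

by_isil = {row["isil"]: row for row in registry}
by_name = {row["name"]: row for row in registry}

for item in extracted:
    match = by_isil.get(item["isil"]) or by_name.get(item["name"])
    item["registry_match"] = match["isil"] if match else None

print([i["registry_match"] for i in extracted])
```

Real name matching would need normalization (case, diacritics, legal-form suffixes) rather than exact equality.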
## Contributing

To add a new extraction agent:

1. Create `.opencode/agent/your-agent-name.md`
2. Configure with `mode: subagent` and an appropriate model and temperature
3. Define the input/output format with examples
4. Document extraction patterns and confidence scoring
5. Add multilingual support and edge case handling
6. Test with real conversation data

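Steps 1-2 amount to a Markdown file with YAML front matter. A minimal sketch, assuming the same front-matter keys as the existing agents (verify the exact keys against the OpenCode agent documentation):

```markdown
---
description: Extracts <your data type> from conversation text
mode: subagent
model: claude-sonnet-4
temperature: 0.1
tools:
  write: false
  edit: false
  bash: false
---

You are a specialized extraction agent. Given conversation text, return a
JSON-only response containing <your data type> with confidence scores.
```
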
---

**Schema Version**: v0.2.1 (ISO 20275 migration)

**Schema Location**: `schemas/20251121/linkml/`

**Last Updated**: 2025-11-21

**Agent Count**: 4

**Languages Supported**: 60+

**Conversations Ready**: 139