# OpenCode NLP Extraction Agents
This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.
## Schema Reference (v0.2.0)
All agents extract data conforming to the **modular Heritage Custodian Schema v0.2.0**:
- **`/schemas/heritage_custodian.yaml`** - Main schema (import-only structure)
- **`/schemas/core.yaml`** - Core classes (HeritageCustodian, Location, Identifier, DigitalPlatform, GHCID)
- **`/schemas/enums.yaml`** - Enumerations (InstitutionTypeEnum, ChangeTypeEnum, DataSource, DataTier, etc.)
- **`/schemas/provenance.yaml`** - Provenance tracking (Provenance, ChangeEvent, GHCIDHistoryEntry)
- **`/schemas/collections.yaml`** - Collection metadata (Collection, Accession, DigitalObject)
- **`/schemas/dutch.yaml`** - Dutch-specific extensions (DutchHeritageCustodian)
See `/docs/SCHEMA_MODULES.md` for detailed architecture and usage patterns.
## Available Agents
### 1. @institution-extractor
**Purpose**: Extract heritage institution names, types, and basic metadata
**Schema**: Uses `HeritageCustodian` class from `/schemas/core.yaml`
**Input**: Conversation text
**Output**: JSON array of institutions with:
- Institution name
- Institution type (from `InstitutionTypeEnum` in `enums.yaml`)
- Alternative names
- Description
- Confidence score
**Example**:
```
@institution-extractor
Please extract all heritage institutions from the following text:
[paste conversation text]
```
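A response might look like the following (illustrative values; the field names here are assumptions based on the output list above, not guaranteed schema slot names):
```json
[
  {
    "name": "Museu Paulista",
    "institution_type": "MUSEUM",
    "alternative_names": ["Museu do Ipiranga"],
    "description": "History museum of the University of São Paulo",
    "confidence": 0.95
  }
]
```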
### 2. @location-extractor
**Purpose**: Extract geographic locations (cities, addresses, regions, countries)
**Schema**: Uses `Location` class from `/schemas/core.yaml`
**Input**: Conversation text
**Output**: JSON array of locations with:
- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score
**Example**:
```
@location-extractor
Please extract all locations mentioned for heritage institutions:
[paste conversation text]
```
### 3. @identifier-extractor
**Purpose**: Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
**Schema**: Uses `Identifier` class from `/schemas/core.yaml`
**Input**: Conversation text
**Output**: JSON array of identifiers with:
- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score
**Recognizes**:
- ISIL codes: `NL-AsdAM`, `US-DLC`, etc.
- Wikidata IDs: `Q190804`
- VIAF IDs: `147143282`
- KvK numbers (Dutch): `41231987`
- Website URLs
- Other standard identifiers
**Example**:
```
@identifier-extractor
Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:
[paste conversation text]
```
### 4. @event-extractor
**Purpose**: Extract organizational change events (founding, mergers, relocations, etc.)
**Schema**: Uses `ChangeEvent` class from `/schemas/provenance.yaml`
**Input**: Conversation text
**Output**: JSON array of change events with:
- Event ID
- Change type (from `ChangeTypeEnum` in `enums.yaml`: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score
**Detects**:
- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.
**Example**:
```
@event-extractor
Please extract all organizational change events:
[paste conversation text]
```
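For a sentence like "Merged with X in 2001", the response might look like this (illustrative values; field names are assumptions, not guaranteed schema slot names):
```json
[
  {
    "event_id": "evt-001",
    "change_type": "MERGER",
    "event_date": "2001",
    "event_description": "Merged with X in 2001",
    "affected_organization": "X",
    "resulting_organization": "Combined Institution",
    "confidence": 0.9
  }
]
```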
## Usage Workflow
### Option 1: Using the Orchestration Script
The orchestration script (`scripts/extract_with_agents.py`) prepares prompts for each agent:
```bash
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
```
This will print prompts for each agent. Copy/paste each prompt to invoke the corresponding agent via @mention.
### Option 2: Direct Agent Invocation
You can invoke agents directly in an OpenCode session:
1. **Load conversation text**:
```python
from glam_extractor.parsers.conversation import ConversationParser
parser = ConversationParser()
conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
text = conv.extract_all_text()
```
2. **Invoke agents** (via @mention):
```
@institution-extractor
Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:
[paste text from conv.extract_all_text()]
```
3. **Collect responses** and combine results using `AgentOrchestrator.create_heritage_custodian_record()`
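If you prefer not to use `AgentOrchestrator`, the four agent responses can also be merged by hand. A minimal sketch (the top-level keys mirror the combined output format shown below under "Output Format"; everything else is illustrative):

```python
import json

def combine_agent_responses(institutions_json, locations_json,
                            identifiers_json, events_json):
    """Merge the four agents' JSON-array responses into one record."""
    return {
        "institutions": json.loads(institutions_json),
        "locations": json.loads(locations_json),
        "identifiers": json.loads(identifiers_json),
        "change_events": json.loads(events_json),
    }

# Illustrative: combine one institution response with empty responses
record = combine_agent_responses('[{"name": "Museu Paulista"}]', '[]', '[]', '[]')
```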
### Option 3: Batch Processing
For processing multiple conversations:
```python
from pathlib import Path
from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)
    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
```
## Agent Configuration
All agents are configured with:
- **mode**: `subagent` (invokable by primary agents or @mention)
- **model**: `claude-sonnet-4` (high-quality extraction)
- **temperature**: `0.1-0.2` (focused, deterministic)
- **tools**: All disabled (read-only analysis)
This ensures consistent, high-quality extractions with minimal hallucination.
## Output Format
All agents return **JSON-only responses** with no additional commentary. Each agent returns a JSON array (see the sections above); combined, the results take this shape:
```json
{
  "institutions": [...],   // from @institution-extractor
  "locations": [...],      // from @location-extractor
  "identifiers": [...],    // from @identifier-extractor
  "change_events": [...]   // from @event-extractor
}
```
These JSON responses can be directly parsed and validated against the LinkML schema.
## Confidence Scoring
All agents assign confidence scores (0.0-1.0):
- **0.9-1.0**: Explicit, unambiguous mentions
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context
- **0.3-0.5**: Low confidence, likely needs verification
- **0.0-0.3**: Very uncertain, flag for manual review
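In practice this scale can drive a simple triage step. A minimal sketch (the 0.5 cutoff is an illustrative choice, not part of the agent spec):

```python
def triage(extractions, threshold=0.5):
    """Split extracted items into accepted vs. flagged-for-review lists."""
    accepted = [e for e in extractions if e.get("confidence", 0.0) >= threshold]
    flagged = [e for e in extractions if e.get("confidence", 0.0) < threshold]
    return accepted, flagged

# Illustrative extractions with high and low confidence
items = [
    {"name": "Rijksmuseum", "confidence": 0.95},
    {"name": "unnamed archive", "confidence": 0.3},
]
accepted, flagged = triage(items)
```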
## Multilingual Support
Agents support **60+ languages** found in the conversation dataset, including:
- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- And many more...
Agents preserve original language names (no translation) and adapt pattern matching to language context.
## Data Quality
Extracted data is marked as:
- **Data Source**: `CONVERSATION_NLP`
- **Data Tier**: `TIER_4_INFERRED`
- **Provenance**: Includes conversation ID, extraction date, method, and confidence score
This ensures proper provenance tracking and quality assessment.
## Next Steps
After extraction:
1. **Validate** with LinkML schema:
```bash
linkml-validate -s schemas/heritage_custodian.yaml data.json
```
2. **Cross-link** with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. **Geocode** locations using GeoNames database
4. **Generate GHCIDs** for persistent identification
5. **Export** to JSON-LD, RDF/Turtle, CSV, or Parquet
See `/AGENTS.md` for detailed extraction guidelines and examples.
See `/docs/SCHEMA_MODULES.md` for schema architecture and usage patterns.
## Contributing
To add a new extraction agent:
1. Create `.opencode/agent/your-agent-name.md`
2. Configure with `mode: subagent`, appropriate model and temperature
3. Define input/output format with examples
4. Document extraction patterns and confidence scoring
5. Add multilingual support and edge case handling
6. Test with real conversation data
---
**Schema Version**: v0.2.0 (modular)
**Last Updated**: 2025-11-05
**Agent Count**: 4
**Languages Supported**: 60+
**Conversations Ready**: 139