OpenCode NLP Extraction Agents
This directory contains specialized OpenCode subagents for extracting structured heritage institution data from conversation JSON files.
🚨 Schema Source of Truth
MASTER SCHEMA LOCATION: schemas/20251121/linkml/
The LinkML schema files are the authoritative, canonical definition of the Heritage Custodian Ontology:
Primary Schema File (SINGLE SOURCE OF TRUTH):
schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
- Defines CustodianObservation (source-based references to heritage keepers)
- Defines CustodianName (standardized emic names)
- Defines CustodianReconstruction (formal entities: individuals, groups, organizations, governments, corporations)
- Includes ISO 20275 legal form codes (for legal entities)
- PiCo-inspired observation/reconstruction pattern
- Based on CIDOC-CRM E39_Actor (broader than organization)
ALL OTHER FILES ARE DERIVED/GENERATED from these LinkML schemas:
❌ DO NOT edit these derived files directly:
- schemas/20251121/rdf/*.{ttl,nt,jsonld,rdf,n3,trig,trix} - GENERATED from LinkML via gen-owl + rdfpipe
- schemas/20251121/typedb/*.tql - DERIVED TypeDB schema (manual translation from LinkML)
- schemas/20251121/uml/mermaid/*.mmd - DERIVED UML diagrams (manual visualization of LinkML)
- schemas/20251121/examples/*.yaml - INSTANCES conforming to the LinkML schema
Workflow for Schema Changes:
1. EDIT LinkML schema (01_custodian_name.yaml)
↓
2. REGENERATE RDF formats:
$ gen-owl -f ttl schemas/20251121/linkml/01_custodian_name.yaml > schemas/20251121/rdf/01_custodian_name.owl.ttl
$ rdfpipe schemas/20251121/rdf/01_custodian_name.owl.ttl -o nt > schemas/20251121/rdf/01_custodian_name.nt
$ # ... repeat for all 8 formats (see RDF_GENERATION_SUMMARY.md)
↓
3. UPDATE TypeDB schema (manual translation)
↓
4. UPDATE UML/Mermaid diagrams (manual visualization)
↓
5. VALIDATE example instances:
$ linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml schemas/20251121/examples/example.yaml
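The regeneration step above can be scripted so all derived RDF formats stay in sync. The sketch below only assembles the shell commands shown in the workflow (a dry-run by default); it assumes gen-owl and rdfpipe are on PATH, and the exact output filenames per format are illustrative.

```python
"""Sketch: automate the LinkML -> RDF regeneration workflow (step 2 above).

build_commands() assembles the commands documented in the workflow so the
plan can be inspected (dry-run) before anything is executed.
"""
import subprocess

SCHEMA = "schemas/20251121/linkml/01_custodian_name.yaml"
TTL = "schemas/20251121/rdf/01_custodian_name.owl.ttl"

def build_commands(formats=("nt", "jsonld", "n3")):
    # gen-owl produces Turtle; rdfpipe converts it to the other serializations
    cmds = [f"gen-owl -f ttl {SCHEMA} > {TTL}"]
    for fmt in formats:
        out = TTL.replace(".owl.ttl", f".{fmt}")
        cmds.append(f"rdfpipe {TTL} -o {fmt} > {out}")
    return cmds

def regenerate(dry_run=True):
    for cmd in build_commands():
        print(cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)

if __name__ == "__main__":
    regenerate(dry_run=True)
```

Extend the `formats` tuple to cover all eight serializations listed in RDF_GENERATION_SUMMARY.md.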
Why LinkML is the Master:
- ✅ Formal specification: Type-safe, validation rules, cardinality constraints
- ✅ Multi-format generation: Single source → RDF, JSON-LD, Python, SQL, GraphQL
- ✅ Version control: Clear diffs, semantic versioning, change tracking
- ✅ Ontology alignment: Explicit class_uri and slot_uri mappings to base ontologies
- ✅ Documentation: Rich inline documentation with examples
NEVER:
- ❌ Edit RDF files directly (they will be overwritten on next generation)
- ❌ Consider TypeDB schema as authoritative (it's a translation target)
- ❌ Treat UML diagrams as specification (they're visualizations)
ALWAYS:
- ✅ Refer to LinkML schemas for class definitions
- ✅ Update LinkML first, then regenerate derived formats
- ✅ Validate changes against LinkML metamodel
- ✅ Document schema changes in LinkML YAML comments
See also:
- schemas/20251121/RDF_GENERATION_SUMMARY.md - RDF generation process documentation
- docs/MIGRATION_GUIDE.md - Schema migration procedures
- LinkML documentation: https://linkml.io/
Schema Reference (v0.2.1 - ISO 20275 Migration)
All agents extract data conforming to the Heritage Custodian Ontology defined in LinkML:
Authoritative Schema File:
schemas/20251121/linkml/01_custodian_name.yaml - Complete Heritage Custodian Ontology
- CustodianObservation: Source-based references (emic/etic perspectives)
- CustodianName: Standardized emic names (subclass of Observation)
- CustodianReconstruction: Formal entities (individuals, groups, organizations, governments, corporations)
- ReconstructionActivity: Entity resolution provenance
- Includes ISO 20275 legal form codes (for legal entities)
- Based on CIDOC-CRM E39_Actor
Key Features (as of v0.2.1):
- ✅ ISO 20275 legal form codes (4-character alphanumeric: ASBL, GOVT, PRIV, etc.)
- ✅ Multi-aspect modeling (place, custodian, legal form, collections, people aspects)
- ✅ Temporal event tracking (founding, mergers, relocations, custody transfers)
- ✅ Ontology integration (CPOV, TOOI, CIDOC-CRM, RiC-O, Schema.org, PiCo)
- ✅ Provenance tracking (data source, tier, extraction method, confidence scores)
See schemas/20251121/RDF_GENERATION_SUMMARY.md for schema architecture and recent updates.
Available Agents
1. @institution-extractor
Purpose: Extract heritage institution names, types, and basic metadata
Schema: Uses CustodianObservation and CustodianName classes from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of institutions with:
- Institution name
- Institution type (from InstitutionTypeEnum in enums.yaml)
- Alternative names
- Description
- Confidence score
Example:
@institution-extractor
Please extract all heritage institutions from the following text:
[paste conversation text]
2. @location-extractor
Purpose: Extract geographic locations (cities, addresses, regions, countries)
Schema: Uses PlaceAspect class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of locations with:
- City
- Street address
- Postal code
- Region/province
- Country (ISO 3166-1 alpha-2)
- Confidence score
Example:
@location-extractor
Please extract all locations mentioned for heritage institutions:
[paste conversation text]
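A returned location record can be sanity-checked before it enters the pipeline. This is a minimal sketch, not the agent's implementation: the key names (city, country, confidence) are assumptions based on the output fields listed above, and the ISO 3166-1 alpha-2 check is purely syntactic (two uppercase letters) rather than a lookup against the real code list.

```python
"""Minimal validation sketch for a @location-extractor record.

Key names are assumed from the documented output fields; the country
check is syntactic only (a real pipeline would consult the ISO list).
"""
import re

REQUIRED_KEYS = {"city", "country", "confidence"}

def is_valid_location(loc: dict) -> bool:
    if not REQUIRED_KEYS <= loc.keys():
        return False
    # ISO 3166-1 alpha-2: exactly two uppercase ASCII letters
    if not re.fullmatch(r"[A-Z]{2}", loc["country"]):
        return False
    return 0.0 <= loc["confidence"] <= 1.0
```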
3. @identifier-extractor
Purpose: Extract external identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
Schema: Uses Identifier class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of identifiers with:
- Identifier scheme (ISIL, WIKIDATA, VIAF, KVK, etc.)
- Identifier value
- Identifier URL
- Confidence score
Recognizes:
- ISIL codes: NL-AsdAM, US-DLC, etc.
- Wikidata IDs: Q190804
- VIAF IDs: 147143282
- KvK numbers (Dutch): 41231987
- Website URLs
- Other standard identifiers
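To make the recognized patterns concrete, here is an illustrative sketch of the kinds of regular expressions involved. This is not the agent's actual implementation (the agent is LLM-based); real ISIL prefixes and bare-number schemes like VIAF have more edge cases than shown.

```python
"""Illustrative patterns for the identifier schemes listed above.

A sketch only: VIAF is omitted because a bare digit run is ambiguous
with KvK numbers without surrounding context.
"""
import re

PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]+\b"),  # e.g. NL-AsdAM, US-DLC
    "WIKIDATA": re.compile(r"\bQ\d+\b"),                      # e.g. Q190804
    "KVK": re.compile(r"\b\d{8}\b"),                          # Dutch KvK: 8 digits
    "URL": re.compile(r"https?://\S+"),
}

def extract_identifiers(text: str) -> list[dict]:
    found = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"scheme": scheme, "value": match.group(0)})
    return found
```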
Example:
@identifier-extractor
Please extract all identifiers (ISIL, Wikidata, VIAF, URLs) from:
[paste conversation text]
4. @event-extractor
Purpose: Extract organizational change events (founding, mergers, relocations, etc.)
Schema: Uses TemporalEvent class from schemas/20251121/linkml/01_custodian_name.yaml
Input: Conversation text
Output: JSON array of change events with:
- Event ID
- Change type (from ChangeTypeEnum in enums.yaml: FOUNDING, MERGER, RELOCATION, NAME_CHANGE, etc.)
- Event date
- Event description
- Affected organization
- Resulting organization
- Confidence score
Detects:
- Founding events: "Founded in 1985"
- Mergers: "Merged with X in 2001"
- Relocations: "Moved to Y in 2010"
- Name changes: "Renamed from A to B"
- Closures, acquisitions, restructuring, etc.
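The phrasings above can be sketched as simple patterns. This is a toy illustration only: the real agent relies on the LLM rather than regexes, and handles many languages beyond the English templates shown here.

```python
"""Toy pattern matcher for the change-event phrasings listed above.

Illustrative only; the actual agent is LLM-based and multilingual.
"""
import re

EVENT_PATTERNS = [
    ("FOUNDING", re.compile(r"[Ff]ounded in (\d{4})")),
    ("MERGER", re.compile(r"[Mm]erged with (.+?) in (\d{4})")),
    ("RELOCATION", re.compile(r"[Mm]oved to (.+?) in (\d{4})")),
    ("NAME_CHANGE", re.compile(r"[Rr]enamed from (.+?) to (\S+)")),
]

def detect_events(text: str) -> list[dict]:
    events = []
    for change_type, pattern in EVENT_PATTERNS:
        for m in pattern.finditer(text):
            events.append({"change_type": change_type, "groups": m.groups()})
    return events
```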
Example:
@event-extractor
Please extract all organizational change events:
[paste conversation text]
Usage Workflow
Option 1: Using the Orchestration Script
The orchestration script (scripts/extract_with_agents.py) prepares prompts for each agent:
python scripts/extract_with_agents.py conversations/Brazilian_GLAM_collection_inventories.json
This will print prompts for each agent. Copy/paste each prompt to invoke the corresponding agent via @mention.
Option 2: Direct Agent Invocation
You can invoke agents directly in an OpenCode session:
- Load conversation text:
from glam_extractor.parsers.conversation import ConversationParser
parser = ConversationParser()
conv = parser.parse_file("conversations/Brazilian_GLAM_collection_inventories.json")
text = conv.extract_all_text()
- Invoke agents (via @mention):
@institution-extractor
Extract all heritage institutions from the following conversation about Brazilian GLAM institutions:
[paste text from conv.extract_all_text()]
- Collect responses and combine results using AgentOrchestrator.create_heritage_custodian_record()
Option 3: Batch Processing
For processing multiple conversations:
from pathlib import Path
from scripts.extract_with_agents import AgentOrchestrator

conversation_dir = Path("conversations")
for conv_file in conversation_dir.glob("*.json"):
    orchestrator = AgentOrchestrator(conv_file)
    # Generate prompts
    institution_prompt = orchestrator.prepare_institution_extraction_prompt()
    # ... invoke agents and collect results ...
Agent Configuration
All agents are configured with:
- mode: subagent (invokable by primary agents or @mention)
- model: claude-sonnet-4 (high-quality extraction)
- temperature: 0.1-0.2 (focused, deterministic)
- tools: all disabled (read-only analysis)
This ensures consistent, high-quality extractions with minimal hallucination.
Output Format
All agents return JSON-only responses with no additional commentary:
{
"institutions": [...], // from @institution-extractor
"locations": [...], // from @location-extractor
"identifiers": [...], // from @identifier-extractor
"change_events": [...] // from @event-extractor
}
These JSON responses can be directly parsed and validated against the LinkML schema.
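Combining the four agents' replies into the shape above can be sketched as follows. This is an assumption-laden illustration, not the AgentOrchestrator implementation: it assumes each reply is a bare JSON object keyed by one of the four documented keys.

```python
"""Sketch of merging the agents' JSON-only replies into one record.

Assumes each reply parses as a JSON object keyed as documented above;
unknown keys are passed through unchanged.
"""
import json

def combine_agent_responses(*raw_responses: str) -> dict:
    record = {"institutions": [], "locations": [],
              "identifiers": [], "change_events": []}
    for raw in raw_responses:
        payload = json.loads(raw)  # agents return JSON with no commentary
        for key, items in payload.items():
            record.setdefault(key, []).extend(items)
    return record
```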
Confidence Scoring
All agents assign confidence scores (0.0-1.0):
- 0.9-1.0: Explicit, unambiguous mentions
- 0.7-0.9: Clear mentions with some ambiguity
- 0.5-0.7: Inferred from context
- 0.3-0.5: Low confidence, likely needs verification
- 0.0-0.3: Very uncertain, flag for manual review
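The bands above map naturally onto review actions in a downstream pipeline. The helper below is a sketch; treating a boundary value (e.g. exactly 0.7) as belonging to the higher band is an assumption made here for illustration.

```python
"""Helper mapping the confidence bands above to a review action.

Boundary values are assigned to the higher band (an assumption).
"""
def review_action(score: float) -> str:
    if score >= 0.9:
        return "accept"            # explicit, unambiguous
    if score >= 0.7:
        return "accept_with_note"  # clear but somewhat ambiguous
    if score >= 0.5:
        return "verify"            # inferred from context
    if score >= 0.3:
        return "needs_verification"
    return "manual_review"         # very uncertain
```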
Multilingual Support
Agents support 60+ languages found in the conversation dataset, including:
- Dutch, Portuguese, Spanish, French, German
- Vietnamese, Japanese, Korean, Chinese, Thai
- Arabic, Persian, Turkish, Russian
- And many more...
Agents preserve original language names (no translation) and adapt pattern matching to language context.
Data Quality
Extracted data is marked as:
- Data Source: CONVERSATION_NLP
- Data Tier: TIER_4_INFERRED
- Provenance: Includes conversation ID, extraction date, method, and confidence score
This ensures proper provenance tracking and quality assessment.
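A provenance stamp with these fields could be assembled as below. Field names are illustrative assumptions; the authoritative slot names live in the LinkML schema (01_custodian_name.yaml).

```python
"""Sketch of the provenance stamp attached to each extracted record.

Field names are illustrative; consult the LinkML schema for the
authoritative slot names.
"""
from datetime import date

def make_provenance(conversation_id: str, confidence: float) -> dict:
    return {
        "data_source": "CONVERSATION_NLP",
        "data_tier": "TIER_4_INFERRED",
        "conversation_id": conversation_id,
        "extraction_date": date.today().isoformat(),
        "extraction_method": "opencode_subagent",  # hypothetical label
        "confidence_score": confidence,
    }
```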
Next Steps
After extraction:
1. Validate with the LinkML schema:
   linkml-validate -s schemas/20251121/linkml/01_custodian_name.yaml data.yaml
2. Cross-link with authoritative CSV data (ISIL registry, Dutch orgs) via ISIL code or name matching
3. Geocode locations using the GeoNames database
4. Generate GHCIDs for persistent identification
5. Export to JSON-LD, RDF/Turtle, CSV, or Parquet
See /AGENTS.md for detailed extraction guidelines and examples.
See /docs/SCHEMA_MODULES.md for schema architecture and usage patterns.
Contributing
To add a new extraction agent:
- Create .opencode/agent/your-agent-name.md
- Configure with mode: subagent and an appropriate model and temperature
- Define input/output format with examples
- Document extraction patterns and confidence scoring
- Add multilingual support and edge case handling
- Test with real conversation data
Schema Version: v0.2.1 (ISO 20275 migration)
Schema Location: schemas/20251121/linkml/
Last Updated: 2025-11-21
Agent Count: 4
Languages Supported: 60+
Conversations Ready: 139