# Subagent-Based NER Architecture

## Overview

This document describes the architectural decision to use **coding subagents** for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.

## Architecture Decision

### Decision

Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.

### Status

**Accepted** (2025-11-05)

### Context

The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.

Traditional approaches would involve:

- Installing spaCy, transformers, PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference

### Decision Drivers

1. **Separation of Concerns**: Keep extraction logic separate from data pipeline logic
2. **Flexibility**: Allow experimentation with different NER approaches without changing main code
3. **Resource Management**: Subagents can manage heavy NLP dependencies independently
4. **Maintainability**: Cleaner main codebase without NLP-specific code
5. **Modularity**: Subagents can be swapped or upgraded without affecting the pipeline

### Consequences

#### Positive

- ✅ **Clean Dependencies**: Main application has minimal dependencies (no PyTorch, spaCy, transformers)
- ✅ **Flexibility**: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ **Testability**: Easier to mock extraction results for testing
- ✅ **Scalability**: Subagents can run in parallel, distributed across workers
- ✅ **Maintainability**: Clear separation between "what to extract" and "how to extract"

#### Negative

- ⚠️ **Complexity**: Additional layer of abstraction
- ⚠️ **Debugging**: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ **Latency**: Subagent invocation adds overhead compared to direct function calls
- ⚠️ **Control**: Less fine-grained control over NER parameters

#### Neutral

- 🔄 **Testing Strategy**: Need integration tests that use real subagents
- 🔄 **Documentation**: Must document the subagent interface and expectations

## Implementation Pattern

### Subagent Invocation

```python
from glam_extractor.task import Task

# conversation_text: free text previously extracted from a conversation JSON file.
# Note the doubled braces around {{scheme, value}}: inside an f-string they
# produce literal braces instead of being evaluated as a Python expression.
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",
)

# Process subagent results
institutions = result.get("institutions", [])
```

### Data Flow

```
┌─────────────────────┐
│ Conversation JSON   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ ConversationParser  │  (Main code)
│  - Parse JSON       │
│  - Extract text     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ SUBAGENT BOUNDARY   │  ← Task tool invocation
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Coding Subagent     │  (Autonomous)
│  - Load spaCy model │
│  - Run NER          │
│  - Extract entities │
│  - Return JSON      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ InstitutionBuilder  │  (Main code)
│  - Validate results │
│  - Build models     │
│  - Add provenance   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ HeritageCustodian   │
│ (Pydantic models)   │
└─────────────────────┘
```
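
The flow above can be sketched as a thin orchestration function in the main code. This is a minimal sketch, not the project's actual implementation: `parse_conversation`, `run_pipeline`, and the `messages`/`text` JSON layout are illustrative stand-ins, and the stubbed `extract` callable marks where the real Task tool invocation would sit.

```python
import json


def parse_conversation(raw_json: str) -> str:
    """Main code: pull the free text out of a conversation JSON file.

    The messages/text layout is an assumption for illustration.
    """
    data = json.loads(raw_json)
    return " ".join(msg["text"] for msg in data.get("messages", []))


def run_pipeline(raw_json: str, extract) -> list[dict]:
    """Orchestrate the flow shown in the diagram above.

    `extract` is the subagent boundary: any callable that takes text and
    returns the agreed JSON payload (institutions/locations/identifiers).
    """
    text = parse_conversation(raw_json)   # ConversationParser step
    payload = extract(text)               # SUBAGENT BOUNDARY
    # InstitutionBuilder step: validate results and attach provenance
    return [
        {"name": inst["name"], "source": "conversation"}
        for inst in payload.get("institutions", [])
    ]


# Usage with a stubbed extractor (a real run would invoke the Task tool):
raw = json.dumps({"messages": [{"text": "Visited the Rijksmuseum."}]})
stub = lambda text: {"institutions": [{"name": "Rijksmuseum"}]}
print(run_pipeline(raw, stub))
# [{'name': 'Rijksmuseum', 'source': 'conversation'}]
```

Because the subagent is injected as a callable, the same pipeline runs unchanged against a mock in tests and the real Task tool in production.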

## Extraction Tasks Delegated to Subagents

### Task 1: Institution Name Extraction

**Input**: Conversation text
**Output**: List of institution names with types
**Subagent Approach**: NER for ORG entities + keyword filtering

### Task 2: Location Extraction

**Input**: Conversation text
**Output**: Locations with geocoding data
**Subagent Approach**: NER for GPE/LOC entities + geocoding API calls

### Task 3: Relationship Extraction

**Input**: Conversation text
**Output**: Relationships between institutions
**Subagent Approach**: Dependency parsing + relation extraction

### Task 4: Collection Metadata Extraction

**Input**: Conversation text
**Output**: Collection details (size, subject, dates)
**Subagent Approach**: Pattern matching + entity extraction
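
All four tasks hand a JSON payload back across the subagent boundary, so the main code should treat that payload as untrusted input. The sketch below shows one defensive normalization step; the `normalize_payload` helper and the exact field shapes are assumptions for illustration, not an existing API.

```python
def normalize_payload(payload: dict) -> dict:
    """Coerce a subagent result into the expected shape, dropping junk.

    Expected fields: institutions and locations (lists of strings or
    dicts with a "name" key), identifiers (list of {scheme, value} dicts).
    Anything malformed is silently discarded rather than propagated.
    """
    out = {"institutions": [], "locations": [], "identifiers": []}
    for key in ("institutions", "locations"):
        for item in payload.get(key, []):
            if isinstance(item, str):
                out[key].append({"name": item})       # bare string -> dict
            elif isinstance(item, dict) and "name" in item:
                out[key].append(item)
    for ident in payload.get("identifiers", []):
        if isinstance(ident, dict) and {"scheme", "value"} <= ident.keys():
            # keep only the agreed keys, drop any extras
            out["identifiers"].append(
                {"scheme": ident["scheme"], "value": ident["value"]}
            )
    return out
```

A step like this keeps a misbehaving subagent from crashing the downstream Pydantic validation with surprising types.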
## Main Code Responsibilities

What **stays in the main codebase** (NOT delegated to subagents):

1. **Pattern Matching for Identifiers**
   - ISIL code regex: `[A-Z]{2}-[A-Za-z0-9]+`
   - Wikidata ID regex: `Q[0-9]+`
   - VIAF ID extraction from URLs
   - KvK number validation

2. **CSV Parsing**
   - Read ISIL registry CSV
   - Parse Dutch organizations CSV
   - Map CSV columns to models

3. **Data Validation**
   - LinkML schema validation
   - Pydantic model validation
   - Cross-reference checking

4. **Data Integration**
   - Merge CSV and conversation data
   - Conflict resolution
   - Deduplication (using rapidfuzz)

5. **Export**
   - JSON-LD generation
   - RDF/Turtle serialization
   - Parquet export
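
The identifier patterns in item 1 need nothing beyond the standard `re` module, which is why they stay in the main code. A minimal sketch, with word boundaries added to avoid partial matches; the function name and the pattern set (only ISIL and Wikidata shown) are illustrative:

```python
import re

# Compiled once at module load; \b guards against matching inside words.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),  # e.g. NL-AsdRM
    "WIKIDATA": re.compile(r"\bQ[0-9]+\b"),            # e.g. Q190804
}


def extract_identifiers(text: str) -> list[dict]:
    """Pure pattern matching -- no subagent, no NLP model."""
    found = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for value in pattern.findall(text):
            found.append({"scheme": scheme, "value": value})
    return found


print(extract_identifiers("ISIL NL-AsdRM, Wikidata Q190804"))
# [{'scheme': 'ISIL', 'value': 'NL-AsdRM'}, {'scheme': 'WIKIDATA', 'value': 'Q190804'}]
```

This is the cheap first pass the hybrid approach relies on: only text that regexes cannot handle is worth a subagent call.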

## Testing Strategy

### Unit Tests (No Subagents)

```python
def test_identifier_extraction():
    """Test regex pattern matching without subagents"""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
```

### Integration Tests (With Mocked Subagents)

```python
def test_institution_extraction_with_mock_subagent(mocker):
    """Test with a mocked subagent response (no real subagent involved)"""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)

    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"
```

### End-to-End Tests (Real Subagents)

```python
@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with real subagent (slow, requires network)"""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0
```
## Performance Considerations

### Latency

- Subagent invocation: ~2-5 seconds overhead per call
- NER processing: ~1-10 seconds depending on text length
- **Total**: ~3-15 seconds per conversation file

### Optimization Strategies

1. **Batch Processing**: Process multiple conversations in parallel subagents
2. **Caching**: Cache subagent results keyed by conversation UUID
3. **Incremental Processing**: Only process new/updated conversations
4. **Selective Extraction**: Use cheap pattern matching first, subagents only when needed

### Resource Usage

- Main process: low memory (~100 MB), no GPU needed
- Subagent process: high memory (~2-4 GB for NLP models), optional GPU
- **Benefit**: The main application stays lightweight
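
Strategy 2 (caching keyed by conversation UUID) can be as simple as one JSON file per UUID. A sketch, assuming a hypothetical `cached_extract` wrapper around the subagent call:

```python
import json
from pathlib import Path


def cached_extract(uuid: str, text: str, extract, cache_dir: Path) -> dict:
    """Return a cached subagent result if present, else call and store.

    `extract` is the expensive subagent call, so the ~3-15 s cost is paid
    at most once per conversation UUID; re-runs of the pipeline read from
    disk instead.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{uuid}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = extract(text)                   # subagent boundary
    cache_file.write_text(json.dumps(result))
    return result
```

For incremental processing (strategy 3), the same cache directory doubles as a record of which conversations have already been handled.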
## Migration Path

If we later decide to bring NER into the main codebase:

1. Implement an `NERExtractor` class that replicates subagent behavior
2. Add spaCy/transformers dependencies to `pyproject.toml`
3. Update extraction code to call local NER instead of subagents
4. Keep the subagent interface as a fallback option

The architecture supports both approaches without major refactoring.
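
One way to keep this migration cheap is to code against a small extractor interface from the start. The sketch below is illustrative only: the `Extractor` protocol and both class names are hypothetical, and the spaCy usage assumes an injected, already-loaded pipeline.

```python
from typing import Protocol


class Extractor(Protocol):
    """Anything that turns text into the agreed extraction payload."""

    def extract(self, text: str) -> dict: ...


class SubagentExtractor:
    """Current path: delegate to the Task tool (invocation stubbed here)."""

    def extract(self, text: str) -> dict:
        raise NotImplementedError("invoke the Task tool here")


class LocalNERExtractor:
    """Future path: wrap a locally loaded spaCy pipeline."""

    def __init__(self, nlp):
        self.nlp = nlp  # e.g. a pipeline returned by spacy.load(...)

    def extract(self, text: str) -> dict:
        doc = self.nlp(text)
        return {"institutions": [e.text for e in doc.ents if e.label_ == "ORG"]}


def run(extractor: Extractor, text: str) -> dict:
    # Pipeline code depends only on the interface, so swapping
    # implementations is step 3 of the migration and nothing more.
    return extractor.extract(text)
```

Because `Protocol` uses structural typing, any object with a matching `extract` method satisfies the interface without inheriting from it.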
## Alternative Approaches Considered

### Alternative 1: Direct spaCy Integration

**Pros**: Lower latency, more control
**Cons**: Heavy dependencies, harder to swap implementations
**Decision**: Rejected due to complexity and maintenance burden

### Alternative 2: External API Service (e.g., GPT-4 API)

**Pros**: No local dependencies, very flexible
**Cons**: Cost per request, requires API keys, network dependency
**Decision**: Could be used by subagents, but not as the main architecture

### Alternative 3: Hybrid Approach

**Pros**: Regex for simple cases, subagents for complex ones
**Cons**: Two code paths to maintain
**Decision**: Partially adopted (regex for identifiers, subagents for NER)
## References

- **Task Tool Documentation**: OpenCode agent framework
- **spaCy Documentation**: https://spacy.io
- **LinkML Schema**: `schemas/heritage_custodian.yaml`
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md`
## Decision Log

| Date | Decision | Rationale |
|------|----------|-----------|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |

---

**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Active