# Subagent-Based NER Architecture
## Overview
This document describes the architectural decision to use **coding subagents** for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.
## Architecture Decision
### Decision
Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.
### Status
**Accepted** (2025-11-05)
### Context
The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.
Traditional approaches would involve:
- Installing spaCy, transformers, PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference
### Decision Drivers
1. **Separation of Concerns**: Keep extraction logic separate from data pipeline logic
2. **Flexibility**: Allow experimentation with different NER approaches without changing main code
3. **Resource Management**: Subagents can manage heavy NLP dependencies independently
4. **Maintainability**: Cleaner main codebase without NLP-specific code
5. **Modularity**: Subagents can be swapped or upgraded without affecting pipeline
### Consequences
#### Positive
- ✅ **Clean Dependencies**: Main application has minimal dependencies (no PyTorch, spaCy, transformers)
- ✅ **Flexibility**: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ **Testability**: Easier to mock extraction results for testing
- ✅ **Scalability**: Subagents can run in parallel, distributed across workers
- ✅ **Maintainability**: Clear separation between "what to extract" and "how to extract"
#### Negative
- ⚠️ **Complexity**: Additional layer of abstraction
- ⚠️ **Debugging**: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ **Latency**: Subagent invocation adds overhead compared to direct function calls
- ⚠️ **Control**: Less fine-grained control over NER parameters
#### Neutral
- 🔄 **Testing Strategy**: Need integration tests that use real subagents
- 🔄 **Documentation**: Must document subagent interface and expectations
## Implementation Pattern
### Subagent Invocation
```python
from glam_extractor.task import Task

# Use a subagent for NER extraction
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",  # braces doubled so the f-string does not evaluate them
)

# Process subagent results
institutions = result.get("institutions", [])
```
### Data Flow
```
┌─────────────────────┐
│  Conversation JSON  │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ ConversationParser  │  (Main code)
│ - Parse JSON        │
│ - Extract text      │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  SUBAGENT BOUNDARY  │  ← Task tool invocation
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Coding Subagent   │  (Autonomous)
│ - Load spaCy model  │
│ - Run NER           │
│ - Extract entities  │
│ - Return JSON       │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ InstitutionBuilder  │  (Main code)
│ - Validate results  │
│ - Build models      │
│ - Add provenance    │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  HeritageCustodian  │
│  (Pydantic models)  │
└─────────────────────┘
```
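The stages in the diagram can be sketched end to end. All function names below are illustrative stand-ins, not the project's actual API, and the subagent step is stubbed out:

```python
import json

def parse_conversation(raw_json: str) -> str:
    """Main code: parse the conversation JSON and concatenate message text."""
    data = json.loads(raw_json)
    return " ".join(msg["text"] for msg in data.get("messages", []))

def run_ner_subagent(text: str) -> dict:
    """Subagent boundary: the real pipeline invokes the Task tool here."""
    return {"institutions": [], "locations": [], "identifiers": []}

def build_institutions(ner_result: dict) -> list:
    """Main code: validate subagent output and attach provenance."""
    return [
        {"name": name, "source": "subagent-ner"}
        for name in ner_result.get("institutions", [])
    ]

raw = '{"messages": [{"text": "Visited the Rijksmuseum."}]}'
institutions = build_institutions(run_ner_subagent(parse_conversation(raw)))
```

The key property is that only `run_ner_subagent` crosses the subagent boundary; everything before and after it is plain, testable main-process code.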
## Extraction Tasks Delegated to Subagents
### Task 1: Institution Name Extraction
- **Input**: Conversation text
- **Output**: List of institution names with types
- **Subagent Approach**: NER for ORG entities + keyword filtering
### Task 2: Location Extraction
- **Input**: Conversation text
- **Output**: Locations with geocoding data
- **Subagent Approach**: NER for GPE/LOC entities + geocoding API calls
### Task 3: Relationship Extraction
- **Input**: Conversation text
- **Output**: Relationships between institutions
- **Subagent Approach**: Dependency parsing + relation extraction
### Task 4: Collection Metadata Extraction
- **Input**: Conversation text
- **Output**: Collection details (size, subject, dates)
- **Subagent Approach**: Pattern matching + entity extraction
## Main Code Responsibilities
What **stays in the main codebase** (NOT delegated to subagents):
1. **Pattern Matching for Identifiers**
- ISIL code regex: `[A-Z]{2}-[A-Za-z0-9]+`
- Wikidata ID regex: `Q[0-9]+`
- VIAF ID extraction from URLs
- KvK number validation
2. **CSV Parsing**
- Read ISIL registry CSV
- Parse Dutch organizations CSV
- Map CSV columns to models
3. **Data Validation**
- LinkML schema validation
- Pydantic model validation
- Cross-reference checking
4. **Data Integration**
- Merge CSV and conversation data
- Conflict resolution
- Deduplication (using rapidfuzz)
5. **Export**
- JSON-LD generation
- RDF/Turtle serialization
- Parquet export
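The identifier patterns in item 1 stay in the main process and need no subagent. A minimal sketch, assuming the regex conventions listed above (`extract_identifiers` is an illustrative name):

```python
import re

# Illustrative patterns following the conventions above; not exhaustive.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "Wikidata": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
}

def extract_identifiers(text: str) -> list:
    """Pure pattern matching -- runs in the main process, no subagent."""
    results = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            # VIAF captures the numeric ID from the URL; the others
            # use the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            results.append({"scheme": scheme, "value": value})
    return results
```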
## Testing Strategy
### Unit Tests (No Subagents)
```python
def test_identifier_extraction():
"""Test regex pattern matching without subagents"""
text = "The ISIL code is NL-AsdRM"
identifiers = extract_identifiers(text) # Pure regex, no subagent
assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
```
### Integration Tests (With Mocked Subagents)
```python
@pytest.mark.subagent
def test_institution_extraction_with_mock_subagent(mocker):
"""Test with mocked subagent response"""
mock_result = {
"institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
}
mocker.patch("glam_extractor.task.Task", return_value=mock_result)
result = extract_institutions(sample_text)
assert len(result) == 1
assert result[0].name == "Rijksmuseum"
```
### End-to-End Tests (Real Subagents)
```python
@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
"""Test with real subagent (slow, requires network)"""
conversation = load_sample_conversation()
institutions = extract_with_subagent(conversation) # Real subagent call
assert len(institutions) > 0
```
## Performance Considerations
### Latency
- Subagent invocation: ~2-5 seconds overhead per call
- NER processing: ~1-10 seconds depending on text length
- **Total**: ~3-15 seconds per conversation file
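A back-of-the-envelope estimate for the full corpus follows from these figures; the worker count below is a hypothetical example, not a measured configuration:

```python
# Rough throughput for the 139 conversation files at ~3-15 s each.
n_files = 139
per_file_low, per_file_high = 3, 15  # seconds per conversation

sequential = (n_files * per_file_low, n_files * per_file_high)
workers = 8                          # assumed number of parallel subagents
parallel = tuple(t / workers for t in sequential)

print(f"sequential: {sequential[0] / 60:.0f}-{sequential[1] / 60:.0f} min")
print(f"parallel x{workers}: {parallel[0] / 60:.1f}-{parallel[1] / 60:.1f} min")
```

Sequential processing lands in the tens of minutes, which is why the parallel and caching strategies below matter.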
### Optimization Strategies
1. **Batch Processing**: Process multiple conversations in parallel subagents
2. **Caching**: Cache subagent results keyed by conversation UUID
3. **Incremental Processing**: Only process new/updated conversations
4. **Selective Extraction**: Use cheap pattern matching first, subagents only when needed
### Resource Usage
- Main process: Low memory (~100MB), no GPU needed
- Subagent process: High memory (~2-4GB for NLP models), optional GPU
- **Benefit**: Main application stays lightweight
## Migration Path
If we later decide to bring NER into the main codebase:
1. Implement `NERExtractor` class that replicates subagent behavior
2. Add spaCy/transformers dependencies to `pyproject.toml`
3. Update extraction code to call local NER instead of subagents
4. Keep subagent interface as fallback option
The architecture supports both approaches without major refactoring.
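One way to keep both approaches interchangeable is to code the pipeline against a small extractor protocol; the names below are illustrative, not the project's actual API, and both implementations are stubbed:

```python
from typing import Protocol

class EntityExtractor(Protocol):
    def extract(self, text: str) -> dict: ...

class SubagentExtractor:
    """Current approach: delegate extraction to a coding subagent."""
    def extract(self, text: str) -> dict:
        return {"institutions": []}  # placeholder for the Task call

class LocalNERExtractor:
    """Future option: run spaCy/transformers in-process."""
    def extract(self, text: str) -> dict:
        return {"institutions": []}  # placeholder for local NER

def run_pipeline(extractor: EntityExtractor, text: str) -> dict:
    # Pipeline code depends only on the protocol, so extractors swap freely.
    return extractor.extract(text)
```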
## Alternative Approaches Considered
### Alternative 1: Direct spaCy Integration
- **Pros**: Lower latency, more control
- **Cons**: Heavy dependencies, harder to swap implementations
- **Decision**: Rejected due to complexity and maintenance burden
### Alternative 2: External API Service (e.g., GPT-4 API)
- **Pros**: No local dependencies, very flexible
- **Cons**: Cost per request, requires API keys, network dependency
- **Decision**: Could be used by subagents, but not as main architecture
### Alternative 3: Hybrid Approach
- **Pros**: Use regex for simple cases, subagents for complex
- **Cons**: Two code paths to maintain
- **Decision**: Partially adopted (regex for identifiers, subagents for NER)
## References
- **Task Tool Documentation**: OpenCode agent framework
- **spaCy Documentation**: https://spacy.io
- **LinkML Schema**: `schemas/heritage_custodian.yaml`
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md`
## Decision Log
| Date | Decision | Rationale |
|------|----------|-----------|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |
---
**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Active