Subagent-Based NER Architecture
Overview
This document describes the architectural decision to use coding subagents for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.
Architecture Decision
Decision
Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.
Status
Accepted (2025-11-05)
Context
The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.
Traditional approaches would involve:
- Installing spaCy, transformers, PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference
Decision Drivers
- Separation of Concerns: Keep extraction logic separate from data pipeline logic
- Flexibility: Allow experimentation with different NER approaches without changing main code
- Resource Management: Subagents can manage heavy NLP dependencies independently
- Maintainability: Cleaner main codebase without NLP-specific code
- Modularity: Subagents can be swapped or upgraded without affecting pipeline
Consequences
Positive
- ✅ Clean Dependencies: Main application has minimal dependencies (no PyTorch, spaCy, transformers)
- ✅ Flexibility: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ Testability: Easier to mock extraction results for testing
- ✅ Scalability: Subagents can run in parallel, distributed across workers
- ✅ Maintainability: Clear separation between "what to extract" and "how to extract"
Negative
- ⚠️ Complexity: Additional layer of abstraction
- ⚠️ Debugging: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ Latency: Subagent invocation adds overhead compared to direct function calls
- ⚠️ Control: Less fine-grained control over NER parameters
Neutral
- 🔄 Testing Strategy: Need integration tests that use real subagents
- 🔄 Documentation: Must document subagent interface and expectations
Implementation Pattern
Subagent Invocation
from glam_extractor.task import Task

# Use a subagent for NER extraction
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",
)

# Process subagent results
institutions = result.get("institutions", [])
Data Flow
┌─────────────────────┐
│ Conversation JSON │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ ConversationParser │ (Main code)
│ - Parse JSON │
│ - Extract text │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ SUBAGENT BOUNDARY │ ← Task tool invocation
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Coding Subagent │ (Autonomous)
│ - Load spaCy model │
│ - Run NER │
│ - Extract entities │
│ - Return JSON │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ InstitutionBuilder │ (Main code)
│ - Validate results │
│ - Build models │
│ - Add provenance │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ HeritageCustodian │
│ (Pydantic models) │
└─────────────────────┘
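The flow above can be sketched as a thin orchestration function in the main code. This is a sketch only: `parse_conversation`, `invoke_ner_subagent`, and `build_institutions` are hypothetical names standing in for the real ConversationParser, Task invocation, and InstitutionBuilder components, and the subagent call is stubbed so the example is self-contained.

```python
import json

def parse_conversation(raw_json: str) -> str:
    """ConversationParser step: pull the free text out of a conversation record."""
    record = json.loads(raw_json)
    return record.get("text", "")

def invoke_ner_subagent(text: str) -> dict:
    """Subagent boundary: in the real pipeline this is a Task tool call.
    Stubbed here with a fixed response so the sketch runs standalone."""
    return {"institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}],
            "locations": [], "identifiers": []}

def build_institutions(result: dict) -> list[dict]:
    """InstitutionBuilder step: validate entries and attach provenance."""
    return [
        {**inst, "provenance": "subagent-ner"}
        for inst in result.get("institutions", [])
        if inst.get("name")
    ]

def run_pipeline(raw_json: str) -> list[dict]:
    text = parse_conversation(raw_json)
    result = invoke_ner_subagent(text)
    return build_institutions(result)
```

The main code never imports an NLP library; everything inside `invoke_ner_subagent` lives on the other side of the subagent boundary.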
Extraction Tasks Delegated to Subagents
Task 1: Institution Name Extraction
Input: Conversation text
Output: List of institution names with types
Subagent Approach: NER for ORG entities + keyword filtering
Task 2: Location Extraction
Input: Conversation text
Output: Locations with geocoding data
Subagent Approach: NER for GPE/LOC entities + geocoding API calls
Task 3: Relationship Extraction
Input: Conversation text
Output: Relationships between institutions
Subagent Approach: Dependency parsing + relation extraction
Task 4: Collection Metadata Extraction
Input: Conversation text
Output: Collection details (size, subject, dates)
Subagent Approach: Pattern matching + entity extraction
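Each of these tasks returns JSON that the main code must validate before model building. A minimal sketch of that validation step, using stdlib dataclasses in place of the project's Pydantic models (the class and field names here are illustrative, not the real schema):

```python
from dataclasses import dataclass

@dataclass
class ExtractedInstitution:
    name: str
    type: str = "UNKNOWN"

def validate_institutions(payload: dict) -> list[ExtractedInstitution]:
    """Reject malformed subagent entries instead of letting them into the pipeline."""
    validated = []
    for item in payload.get("institutions", []):
        # Skip anything that is not a dict with a non-empty name
        if not isinstance(item, dict) or not item.get("name"):
            continue
        validated.append(
            ExtractedInstitution(name=item["name"], type=item.get("type", "UNKNOWN"))
        )
    return validated
```

Defensive validation at the boundary is what makes the subagent swappable: the main code only depends on the JSON contract, not on how the entities were extracted.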
Main Code Responsibilities
What stays in the main codebase (NOT delegated to subagents):
1. Pattern Matching for Identifiers
   - ISIL code regex: [A-Z]{2}-[A-Za-z0-9]+
   - Wikidata ID regex: Q[0-9]+
   - VIAF ID extraction from URLs
   - KvK number validation
2. CSV Parsing
   - Read ISIL registry CSV
   - Parse Dutch organizations CSV
   - Map CSV columns to models
3. Data Validation
   - LinkML schema validation
   - Pydantic model validation
   - Cross-reference checking
4. Data Integration
   - Merge CSV and conversation data
   - Conflict resolution
   - Deduplication (using rapidfuzz)
5. Export
   - JSON-LD generation
   - RDF/Turtle serialization
   - Parquet export
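The identifier patterns above stay in the main code because plain regex handles them. A sketch of an `extract_identifiers` helper built from the ISIL and Wikidata patterns listed here (the helper name and return shape match the unit test in the next section, but this implementation is illustrative):

```python
import re

# Patterns from the responsibilities list above
ISIL_RE = re.compile(r"\b([A-Z]{2}-[A-Za-z0-9]+)\b")
WIKIDATA_RE = re.compile(r"\b(Q[0-9]+)\b")

def extract_identifiers(text: str) -> list[dict]:
    """Pure regex extraction -- no subagent involved."""
    identifiers = []
    for value in ISIL_RE.findall(text):
        identifiers.append({"scheme": "ISIL", "value": value})
    for value in WIKIDATA_RE.findall(text):
        identifiers.append({"scheme": "Wikidata", "value": value})
    return identifiers
```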
Testing Strategy
Unit Tests (No Subagents)
def test_identifier_extraction():
    """Test regex pattern matching without subagents."""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
Integration Tests (With Mocked Subagents)
@pytest.mark.subagent
def test_institution_extraction_with_mock_subagent(mocker):
    """Test with a mocked subagent response."""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)
    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"
End-to-End Tests (Real Subagents)
@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with a real subagent (slow, requires network)."""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0
Performance Considerations
Latency
- Subagent invocation: ~2-5 seconds overhead per call
- NER processing: ~1-10 seconds depending on text length
- Total: ~3-15 seconds per conversation file
Optimization Strategies
- Batch Processing: Process multiple conversations in parallel subagents
- Caching: Cache subagent results keyed by conversation UUID
- Incremental Processing: Only process new/updated conversations
- Selective Extraction: Use cheap pattern matching first, subagents only when needed
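The caching strategy can be sketched as a store keyed by conversation UUID plus a hash of the text, so that an edited conversation is reprocessed while unchanged ones are served from cache. This is an assumption about how such a layer could look, not existing project code:

```python
import hashlib

class SubagentCache:
    """In-memory cache of subagent results. Keyed by conversation UUID plus a
    content hash so edited conversations invalidate their own entry."""

    def __init__(self):
        self._store = {}

    def _key(self, uuid: str, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        return f"{uuid}:{digest}"

    def get_or_compute(self, uuid: str, text: str, compute):
        key = self._key(uuid, text)
        if key not in self._store:
            self._store[key] = compute(text)  # the expensive subagent call
        return self._store[key]
```

With ~2–5 seconds of invocation overhead per call, skipping repeat invocations for the 139+ conversation files is the single largest latency win.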
Resource Usage
- Main process: Low memory (~100MB), no GPU needed
- Subagent process: High memory (~2-4GB for NLP models), optional GPU
- Benefit: Main application stays lightweight
Migration Path
If we later decide to bring NER into the main codebase:
- Implement an NERExtractor class that replicates subagent behavior
- Add spaCy/transformers dependencies to pyproject.toml
- Update extraction code to call local NER instead of subagents
- Keep the subagent interface as a fallback option
The architecture supports both approaches without major refactoring.
Alternative Approaches Considered
Alternative 1: Direct spaCy Integration
Pros: Lower latency, more control
Cons: Heavy dependencies, harder to swap implementations
Decision: Rejected due to complexity and maintenance burden
Alternative 2: External API Service (e.g., GPT-4 API)
Pros: No local dependencies, very flexible
Cons: Cost per request, requires API keys, network dependency
Decision: Could be used by subagents, but not as main architecture
Alternative 3: Hybrid Approach
Pros: Use regex for simple cases, subagents for complex
Cons: Two code paths to maintain
Decision: Partially adopted (regex for identifiers, subagents for NER)
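The partially adopted hybrid amounts to a cheap-first router: run the regex path, and only pay subagent latency when the patterns find nothing. A sketch under that assumption (the router function and the ISIL pattern reuse are illustrative; the subagent call is passed in as a plain callable so the example stays self-contained):

```python
import re

ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b")

def extract(text: str, subagent_call) -> list:
    """Cheap regex path first; fall back to the expensive subagent path."""
    matches = ISIL_RE.findall(text)
    if matches:
        return [{"scheme": "ISIL", "value": m} for m in matches]
    return subagent_call(text)  # only invoked when regex finds nothing
```

The cost of the "two code paths" drawback is contained in one routing function, which keeps the trade-off auditable.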
References
- Task Tool Documentation: OpenCode agent framework
- spaCy Documentation: https://spacy.io
- LinkML Schema: schemas/heritage_custodian.yaml
- Design Patterns: docs/plan/global_glam/05-design-patterns.md
Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |
Version: 1.0
Last Updated: 2025-11-05
Status: Active