# Subagent-Based NER Architecture

## Overview

This document describes the architectural decision to use **coding subagents** for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.

## Architecture Decision

### Decision

Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.

### Status

**Accepted** (2025-11-05)

### Context

The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.

Traditional approaches would involve:

- Installing spaCy, transformers, and PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference

### Decision Drivers

1. **Separation of Concerns**: Keep extraction logic separate from data pipeline logic
2. **Flexibility**: Allow experimentation with different NER approaches without changing main code
3. **Resource Management**: Subagents can manage heavy NLP dependencies independently
4. **Maintainability**: Cleaner main codebase without NLP-specific code
5. **Modularity**: Subagents can be swapped or upgraded without affecting the pipeline
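The drivers above boil down to a single seam in the code. The sketch below is purely illustrative — `EntityExtractor`, `SubagentExtractor`, and `run_pipeline` are hypothetical names, not the project's actual API: the pipeline depends only on a narrow interface, so a subagent-backed extractor and a future local spaCy-backed one are interchangeable (drivers 1, 2, and 5).

```python
from typing import Protocol


class EntityExtractor(Protocol):
    """The narrow interface the pipeline depends on (hypothetical name)."""

    def extract(self, text: str) -> dict: ...


class SubagentExtractor:
    """Delegates NER to a coding subagent; heavy NLP deps live elsewhere."""

    def extract(self, text: str) -> dict:
        # A real implementation would invoke the Task tool here;
        # stubbed out for illustration.
        return {"institutions": [], "locations": [], "identifiers": []}


def run_pipeline(extractor: EntityExtractor, text: str) -> dict:
    # The pipeline never imports spaCy/transformers directly (driver 1).
    return extractor.extract(text)


result = run_pipeline(SubagentExtractor(), "Het Rijksmuseum in Amsterdam")
```

Swapping in a local extractor later (see Migration Path) would only require another class satisfying the same protocol.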
### Consequences

#### Positive

- ✅ **Clean Dependencies**: Main application has minimal dependencies (no PyTorch, spaCy, or transformers)
- ✅ **Flexibility**: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ **Testability**: Easier to mock extraction results for testing
- ✅ **Scalability**: Subagents can run in parallel, distributed across workers
- ✅ **Maintainability**: Clear separation between "what to extract" and "how to extract"

#### Negative

- ⚠️ **Complexity**: Additional layer of abstraction
- ⚠️ **Debugging**: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ **Latency**: Subagent invocation adds overhead compared to direct function calls
- ⚠️ **Control**: Less fine-grained control over NER parameters

#### Neutral

- 🔄 **Testing Strategy**: Need integration tests that use real subagents
- 🔄 **Documentation**: Must document the subagent interface and expectations

## Implementation Pattern

### Subagent Invocation

```python
from glam_extractor.task import Task

# Use a subagent for NER extraction. The doubled braces ({{ }}) keep the
# literal "{scheme, value}" out of f-string interpolation, which would
# otherwise evaluate it as an expression.
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",
)

# Process subagent results
institutions = result.get("institutions", [])
```

### Data Flow

```
┌─────────────────────┐
│  Conversation JSON  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ ConversationParser  │  (Main code)
│ - Parse JSON        │
│ - Extract text      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  SUBAGENT BOUNDARY  │  ← Task tool invocation
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Coding Subagent    │  (Autonomous)
│ - Load spaCy model  │
│ - Run NER           │
│ - Extract entities  │
│ - Return JSON       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ InstitutionBuilder  │  (Main code)
│ - Validate results  │
│ - Build models      │
│ - Add provenance    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ HeritageCustodian   │
│  (Pydantic models)  │
└─────────────────────┘
```

## Extraction Tasks Delegated to Subagents

### Task 1: Institution Name Extraction

- **Input**: Conversation text
- **Output**: List of institution names with types
- **Subagent Approach**: NER for ORG entities + keyword filtering

### Task 2: Location Extraction

- **Input**: Conversation text
- **Output**: Locations with geocoding data
- **Subagent Approach**: NER for GPE/LOC entities + geocoding API calls

### Task 3: Relationship Extraction

- **Input**: Conversation text
- **Output**: Relationships between institutions
- **Subagent Approach**: Dependency parsing + relation extraction

### Task 4: Collection Metadata Extraction

- **Input**: Conversation text
- **Output**: Collection details (size, subject, dates)
- **Subagent Approach**: Pattern matching + entity extraction

## Main Code Responsibilities

What **stays in the main codebase** (NOT delegated to subagents):

1. **Pattern Matching for Identifiers**
   - ISIL code regex: `[A-Z]{2}-[A-Za-z0-9]+`
   - Wikidata ID regex: `Q[0-9]+`
   - VIAF ID extraction from URLs
   - KvK number validation
2. **CSV Parsing**
   - Read the ISIL registry CSV
   - Parse the Dutch organizations CSV
   - Map CSV columns to models
3. **Data Validation**
   - LinkML schema validation
   - Pydantic model validation
   - Cross-reference checking
4. **Data Integration**
   - Merge CSV and conversation data
   - Conflict resolution
   - Deduplication (using rapidfuzz)
5. **Export**
   - JSON-LD generation
   - RDF/Turtle serialization
   - Parquet export
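As a rough illustration of the pattern matching that stays in the main code, here is a minimal, self-contained sketch built from the two regexes listed above. `extract_identifiers` mirrors the helper used in the testing examples but is not the project's actual implementation, and the scheme labels are illustrative.

```python
import re

# Identifier patterns kept in the main codebase (no subagent involved).
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b")
WIKIDATA_RE = re.compile(r"\bQ[0-9]+\b")


def extract_identifiers(text: str) -> list[dict]:
    """Return {scheme, value} pairs found via pure regex matching."""
    found = [{"scheme": "ISIL", "value": m.group(0)} for m in ISIL_RE.finditer(text)]
    found += [{"scheme": "WIKIDATA", "value": m.group(0)} for m in WIKIDATA_RE.finditer(text)]
    return found


if __name__ == "__main__":
    print(extract_identifiers("The ISIL code is NL-AsdRM (Wikidata: Q190804)"))
```

Cheap matches like these run first; only text that regex cannot handle is handed to a subagent (see Selective Extraction below).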
## Testing Strategy

### Unit Tests (No Subagents)

```python
def test_identifier_extraction():
    """Test regex pattern matching without subagents."""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
```

### Integration Tests (With Mocked Subagents)

```python
import pytest


@pytest.mark.subagent
def test_institution_extraction_with_mock_subagent(mocker):
    """Test with a mocked subagent response."""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)

    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"
```

### End-to-End Tests (Real Subagents)

```python
import pytest


@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with a real subagent (slow, requires network)."""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0
```

## Performance Considerations

### Latency

- Subagent invocation: ~2-5 seconds of overhead per call
- NER processing: ~1-10 seconds depending on text length
- **Total**: ~3-15 seconds per conversation file

### Optimization Strategies

1. **Batch Processing**: Process multiple conversations in parallel subagents
2. **Caching**: Cache subagent results keyed by conversation UUID
3. **Incremental Processing**: Only process new or updated conversations
4. **Selective Extraction**: Use cheap pattern matching first, subagents only when needed

### Resource Usage

- Main process: low memory (~100 MB), no GPU needed
- Subagent process: high memory (~2-4 GB for NLP models), optional GPU
- **Benefit**: The main application stays lightweight

## Migration Path

If we later decide to bring NER into the main codebase:
1. Implement a `NERExtractor` class that replicates the subagent behavior
2. Add spaCy/transformers dependencies to `pyproject.toml`
3. Update extraction code to call local NER instead of subagents
4. Keep the subagent interface as a fallback option

The architecture supports both approaches without major refactoring.

## Alternative Approaches Considered

### Alternative 1: Direct spaCy Integration

- **Pros**: Lower latency, more control
- **Cons**: Heavy dependencies, harder to swap implementations
- **Decision**: Rejected due to complexity and maintenance burden

### Alternative 2: External API Service (e.g., GPT-4 API)

- **Pros**: No local dependencies, very flexible
- **Cons**: Cost per request, requires API keys, network dependency
- **Decision**: Could be used by subagents, but not as the main architecture

### Alternative 3: Hybrid Approach

- **Pros**: Use regex for simple cases, subagents for complex ones
- **Cons**: Two code paths to maintain
- **Decision**: Partially adopted (regex for identifiers, subagents for NER)

## References

- **Task Tool Documentation**: OpenCode agent framework
- **spaCy Documentation**: https://spacy.io
- **LinkML Schema**: `schemas/heritage_custodian.yaml`
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md`

## Decision Log

| Date | Decision | Rationale |
|------|----------|-----------|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |

---

**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Active