# Subagent-Based NER Architecture

## Overview

This document describes the architectural decision to use **coding subagents** for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.

## Architecture Decision

### Decision

Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.

### Status

**Accepted** (2025-11-05)

### Context

The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.

Traditional approaches would involve:

- Installing spaCy, transformers, and PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference

### Decision Drivers

1. **Separation of Concerns**: Keep extraction logic separate from data pipeline logic
2. **Flexibility**: Allow experimentation with different NER approaches without changing main code
3. **Resource Management**: Subagents can manage heavy NLP dependencies independently
4. **Maintainability**: Cleaner main codebase without NLP-specific code
5. **Modularity**: Subagents can be swapped or upgraded without affecting the pipeline
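The drivers above boil down to a single seam in the code. The sketch below is purely illustrative — `EntityExtractor`, `SubagentExtractor`, and `run_pipeline` are hypothetical names, not the project's actual API: the pipeline depends only on a narrow interface, so a subagent-backed extractor and a future local spaCy-backed one are interchangeable (drivers 1, 2, and 5).

```python
from typing import Protocol


class EntityExtractor(Protocol):
    """The narrow interface the pipeline depends on (hypothetical name)."""

    def extract(self, text: str) -> dict: ...


class SubagentExtractor:
    """Delegates NER to a coding subagent; heavy NLP deps live elsewhere."""

    def extract(self, text: str) -> dict:
        # A real implementation would invoke the Task tool here;
        # stubbed out for illustration.
        return {"institutions": [], "locations": [], "identifiers": []}


def run_pipeline(extractor: EntityExtractor, text: str) -> dict:
    # The pipeline never imports spaCy/transformers directly (driver 1).
    return extractor.extract(text)


result = run_pipeline(SubagentExtractor(), "Het Rijksmuseum in Amsterdam")
```

Swapping in a local extractor later (see Migration Path) would only require another class satisfying the same protocol.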
### Consequences

#### Positive

- ✅ **Clean Dependencies**: Main application has minimal dependencies (no PyTorch, spaCy, or transformers)
- ✅ **Flexibility**: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ **Testability**: Easier to mock extraction results for testing
- ✅ **Scalability**: Subagents can run in parallel, distributed across workers
- ✅ **Maintainability**: Clear separation between "what to extract" and "how to extract"

#### Negative

- ⚠️ **Complexity**: Additional layer of abstraction
- ⚠️ **Debugging**: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ **Latency**: Subagent invocation adds overhead compared to direct function calls
- ⚠️ **Control**: Less fine-grained control over NER parameters

#### Neutral

- 🔄 **Testing Strategy**: Need integration tests that use real subagents
- 🔄 **Documentation**: Must document the subagent interface and expectations

## Implementation Pattern

### Subagent Invocation

```python
from glam_extractor.task import Task

# Use a subagent for NER extraction. The doubled braces ({{ }}) keep the
# literal "{scheme, value}" out of f-string interpolation, which would
# otherwise evaluate it as an expression.
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",
)

# Process subagent results
institutions = result.get("institutions", [])
```

### Data Flow

```
┌─────────────────────┐
│  Conversation JSON  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ ConversationParser  │  (Main code)
│ - Parse JSON        │
│ - Extract text      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  SUBAGENT BOUNDARY  │  ← Task tool invocation
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Coding Subagent    │  (Autonomous)
│ - Load spaCy model  │
│ - Run NER           │
│ - Extract entities  │
│ - Return JSON       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ InstitutionBuilder  │  (Main code)
│ - Validate results  │
│ - Build models      │
│ - Add provenance    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ HeritageCustodian   │
│  (Pydantic models)  │
└─────────────────────┘
```

## Extraction Tasks Delegated to Subagents

### Task 1: Institution Name Extraction

- **Input**: Conversation text
- **Output**: List of institution names with types
- **Subagent Approach**: NER for ORG entities + keyword filtering

### Task 2: Location Extraction

- **Input**: Conversation text
- **Output**: Locations with geocoding data
- **Subagent Approach**: NER for GPE/LOC entities + geocoding API calls

### Task 3: Relationship Extraction

- **Input**: Conversation text
- **Output**: Relationships between institutions
- **Subagent Approach**: Dependency parsing + relation extraction

### Task 4: Collection Metadata Extraction

- **Input**: Conversation text
- **Output**: Collection details (size, subject, dates)
- **Subagent Approach**: Pattern matching + entity extraction

## Main Code Responsibilities

What **stays in the main codebase** (NOT delegated to subagents):

1. **Pattern Matching for Identifiers**
   - ISIL code regex: `[A-Z]{2}-[A-Za-z0-9]+`
   - Wikidata ID regex: `Q[0-9]+`
   - VIAF ID extraction from URLs
   - KvK number validation
2. **CSV Parsing**
   - Read the ISIL registry CSV
   - Parse the Dutch organizations CSV
   - Map CSV columns to models
3. **Data Validation**
   - LinkML schema validation
   - Pydantic model validation
   - Cross-reference checking
4. **Data Integration**
   - Merge CSV and conversation data
   - Conflict resolution
   - Deduplication (using rapidfuzz)
5. **Export**
   - JSON-LD generation
   - RDF/Turtle serialization
   - Parquet export
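As a rough illustration of the pattern matching that stays in the main code, here is a minimal, self-contained sketch built from the two regexes listed above. `extract_identifiers` mirrors the helper used in the testing examples but is not the project's actual implementation, and the scheme labels are illustrative.

```python
import re

# Identifier patterns kept in the main codebase (no subagent involved).
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b")
WIKIDATA_RE = re.compile(r"\bQ[0-9]+\b")


def extract_identifiers(text: str) -> list[dict]:
    """Return {scheme, value} pairs found via pure regex matching."""
    found = [{"scheme": "ISIL", "value": m.group(0)} for m in ISIL_RE.finditer(text)]
    found += [{"scheme": "WIKIDATA", "value": m.group(0)} for m in WIKIDATA_RE.finditer(text)]
    return found


if __name__ == "__main__":
    print(extract_identifiers("The ISIL code is NL-AsdRM (Wikidata: Q190804)"))
```

Cheap matches like these run first; only text that regex cannot handle is handed to a subagent (see Selective Extraction below).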
## Testing Strategy

### Unit Tests (No Subagents)

```python
def test_identifier_extraction():
    """Test regex pattern matching without subagents."""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
```

### Integration Tests (With Mocked Subagents)

```python
import pytest


@pytest.mark.subagent
def test_institution_extraction_with_mock_subagent(mocker):
    """Test with a mocked subagent response."""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)

    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"
```

### End-to-End Tests (Real Subagents)

```python
import pytest


@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with a real subagent (slow, requires network)."""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0
```

## Performance Considerations

### Latency

- Subagent invocation: ~2-5 seconds of overhead per call
- NER processing: ~1-10 seconds depending on text length
- **Total**: ~3-15 seconds per conversation file

### Optimization Strategies

1. **Batch Processing**: Process multiple conversations in parallel subagents
2. **Caching**: Cache subagent results keyed by conversation UUID
3. **Incremental Processing**: Only process new or updated conversations
4. **Selective Extraction**: Use cheap pattern matching first, subagents only when needed

### Resource Usage

- Main process: low memory (~100 MB), no GPU needed
- Subagent process: high memory (~2-4 GB for NLP models), optional GPU
- **Benefit**: The main application stays lightweight

## Migration Path

If we later decide to bring NER into the main codebase:
1. Implement a `NERExtractor` class that replicates the subagent behavior
2. Add spaCy/transformers dependencies to `pyproject.toml`
3. Update extraction code to call local NER instead of subagents
4. Keep the subagent interface as a fallback option

The architecture supports both approaches without major refactoring.

## Alternative Approaches Considered

### Alternative 1: Direct spaCy Integration

- **Pros**: Lower latency, more control
- **Cons**: Heavy dependencies, harder to swap implementations
- **Decision**: Rejected due to complexity and maintenance burden

### Alternative 2: External API Service (e.g., GPT-4 API)

- **Pros**: No local dependencies, very flexible
- **Cons**: Cost per request, requires API keys, network dependency
- **Decision**: Could be used by subagents, but not as the main architecture

### Alternative 3: Hybrid Approach

- **Pros**: Use regex for simple cases, subagents for complex ones
- **Cons**: Two code paths to maintain
- **Decision**: Partially adopted (regex for identifiers, subagents for NER)

## References

- **Task Tool Documentation**: OpenCode agent framework
- **spaCy Documentation**: https://spacy.io
- **LinkML Schema**: `schemas/heritage_custodian.yaml`
- **Design Patterns**: `docs/plan/global_glam/05-design-patterns.md`

## Decision Log

| Date | Decision | Rationale |
|------|----------|-----------|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |

---

**Version**: 1.0
**Last Updated**: 2025-11-05
**Status**: Active