glam/docs/plan/global_glam/07-subagent-architecture.md

Subagent-Based NER Architecture

Overview

This document describes the architectural decision to use coding subagents for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.

Architecture Decision

Decision

Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.

Status

Accepted (2025-11-05)

Context

The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.

Traditional approaches would involve:

  • Installing spaCy, transformers, PyTorch as dependencies
  • Managing NLP model downloads and storage
  • Writing NER extraction code in the main application
  • Handling multilingual model selection
  • Managing GPU/CPU resources for model inference

Decision Drivers

  1. Separation of Concerns: Keep extraction logic separate from data pipeline logic
  2. Flexibility: Allow experimentation with different NER approaches without changing main code
  3. Resource Management: Subagents can manage heavy NLP dependencies independently
  4. Maintainability: Cleaner main codebase without NLP-specific code
  5. Modularity: Subagents can be swapped or upgraded without affecting pipeline

Consequences

Positive

  • Clean Dependencies: Main application has minimal dependencies (no PyTorch, spaCy, transformers)
  • Flexibility: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
  • Testability: Easier to mock extraction results for testing
  • Scalability: Subagents can run in parallel, distributed across workers
  • Maintainability: Clear separation between "what to extract" and "how to extract"

Negative

  • ⚠️ Complexity: Additional layer of abstraction
  • ⚠️ Debugging: Harder to debug extraction issues (need to inspect subagent behavior)
  • ⚠️ Latency: Subagent invocation adds overhead compared to direct function calls
  • ⚠️ Control: Less fine-grained control over NER parameters

Neutral

  • 🔄 Testing Strategy: Need integration tests that use real subagents
  • 🔄 Documentation: Must document subagent interface and expectations

Implementation Pattern

Subagent Invocation

from glam_extractor.task import Task

# Use subagent for NER extraction
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
    Extract all GLAM institution names, locations, and identifiers
    from the following text:

    {conversation_text}

    Return results as JSON with fields:
    - institutions: list of institution names
    - locations: list of locations
    - identifiers: list of {{"scheme": ..., "value": ...}} pairs
    """,  # doubled braces keep the f-string from interpolating scheme/value
)

# Process subagent results
institutions = result.get("institutions", [])
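In practice the "process subagent results" step should not assume clean JSON: a subagent reply may wrap its answer in prose or code fences. A minimal defensive parser might look like this (parse_subagent_json is a hypothetical helper name, not an existing API):

```python
# Hypothetical helper for the result-processing step: subagent replies may
# wrap JSON in prose or code fences, so extract and parse it defensively.
import json
import re

def parse_subagent_json(reply: str) -> dict:
    """Extract the first JSON object from a subagent reply, or return {}."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```

Returning an empty dict on failure lets the caller fall back to `result.get("institutions", [])` without special-casing malformed replies.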

Data Flow

┌─────────────────────┐
│ Conversation JSON   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ ConversationParser  │ (Main code)
│ - Parse JSON        │
│ - Extract text      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ SUBAGENT BOUNDARY   │ ← Task tool invocation
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Coding Subagent     │ (Autonomous)
│ - Load spaCy model  │
│ - Run NER           │
│ - Extract entities  │
│ - Return JSON       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ InstitutionBuilder  │ (Main code)
│ - Validate results  │
│ - Build models      │
│ - Add provenance    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ HeritageCustodian   │
│ (Pydantic models)   │
└─────────────────────┘
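The stages in the diagram can be sketched as glue code. The names below (run_subagent_ner, build_institutions, process_conversation) are illustrative stand-ins for the real ConversationParser, Task invocation, and InstitutionBuilder; run_subagent_ner returns canned data here in place of a live subagent call:

```python
def run_subagent_ner(text: str) -> dict:
    """Stand-in for the Task tool invocation (returns canned data here)."""
    return {"institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}],
            "locations": [], "identifiers": []}

def build_institutions(raw: dict, source: str) -> list[dict]:
    """Main-code step: validate results and attach provenance."""
    return [
        {**inst, "provenance": {"source": source, "method": "subagent-ner"}}
        for inst in raw.get("institutions", [])
        if inst.get("name")  # drop entries without a usable name
    ]

def process_conversation(conversation: dict, source: str) -> list[dict]:
    # 1. ConversationParser: pull message text out of the conversation JSON
    text = " ".join(m.get("text", "") for m in conversation.get("messages", []))
    # 2. Subagent boundary: delegate NER to the subagent
    raw = run_subagent_ner(text)
    # 3. InstitutionBuilder: validate results and add provenance
    return build_institutions(raw, source)
```

Note that only steps 1 and 3 live in the main codebase; step 2 is the only place the heavy NLP work happens.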

Extraction Tasks Delegated to Subagents

Task 1: Institution Name Extraction

Input: Conversation text
Output: List of institution names with types
Subagent Approach: NER for ORG entities + keyword filtering
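A minimal sketch of what the subagent might run internally for this task, assuming spaCy and a multilingual model such as xx_ent_wiki_sm are available in the subagent's environment (the keyword list is an illustrative assumption):

```python
# Hypothetical sketch of the subagent's internal NER step; NOT part of the
# main codebase. Assumes spaCy + a multilingual model in the subagent env.
GLAM_KEYWORDS = {"museum", "library", "archive", "archief",
                 "gallery", "bibliotheek", "collection"}

def looks_like_glam(name: str) -> bool:
    """Keyword filter applied to ORG entities returned by NER."""
    return any(kw in name.lower() for kw in GLAM_KEYWORDS)

def extract_institution_names(text: str) -> list[dict]:
    import spacy  # heavy dependency stays inside the subagent process
    nlp = spacy.load("xx_ent_wiki_sm")  # multilingual NER model (assumption)
    return [
        {"name": ent.text, "type": "ORG"}
        for ent in nlp(text).ents
        if ent.label_ == "ORG" and looks_like_glam(ent.text)
    ]
```

Keeping the keyword filter as a pure function makes that part testable without loading the model.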

Task 2: Location Extraction

Input: Conversation text
Output: Locations with geocoding data
Subagent Approach: NER for GPE/LOC entities + geocoding API calls

Task 3: Relationship Extraction

Input: Conversation text
Output: Relationships between institutions
Subagent Approach: Dependency parsing + relation extraction

Task 4: Collection Metadata Extraction

Input: Conversation text
Output: Collection details (size, subject, dates)
Subagent Approach: Pattern matching + entity extraction
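The pattern-matching half of this task could look roughly like the following; the field names and regexes are illustrative assumptions, not the project's actual rules:

```python
# Illustrative pattern matching for collection metadata (size, date range).
import re

SIZE_PATTERN = re.compile(r"([\d.,]+)\s*(?:objects|items|works|books)",
                          re.IGNORECASE)
DATE_RANGE_PATTERN = re.compile(
    r"\b(1[0-9]{3}|20[0-9]{2})\s*[-\u2013]\s*(1[0-9]{3}|20[0-9]{2})\b")

def extract_collection_metadata(text: str) -> dict:
    metadata = {}
    if m := SIZE_PATTERN.search(text):
        # Strip thousands separators before converting to an integer
        metadata["size"] = int(m.group(1).replace(",", "").replace(".", ""))
    if m := DATE_RANGE_PATTERN.search(text):
        metadata["date_range"] = (int(m.group(1)), int(m.group(2)))
    return metadata
```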

Main Code Responsibilities

What stays in the main codebase (NOT delegated to subagents):

  1. Pattern Matching for Identifiers

    • ISIL code regex: [A-Z]{2}-[A-Za-z0-9]+
    • Wikidata ID regex: Q[0-9]+
    • VIAF ID extraction from URLs
    • KvK number validation
  2. CSV Parsing

    • Read ISIL registry CSV
    • Parse Dutch organizations CSV
    • Map CSV columns to models
  3. Data Validation

    • LinkML schema validation
    • Pydantic model validation
    • Cross-reference checking
  4. Data Integration

    • Merge CSV and conversation data
    • Conflict resolution
    • Deduplication (using rapidfuzz)
  5. Export

    • JSON-LD generation
    • RDF/Turtle serialization
    • Parquet export
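The identifier pattern matching in item 1 can be sketched as a small pure function; extract_identifiers is an assumed helper name, and the regexes follow the patterns listed above:

```python
# Sketch of the pattern-matching layer that stays in the main codebase.
import re

PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "WIKIDATA": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),  # extract ID from URL
}

def extract_identifiers(text: str) -> list[dict]:
    found = []
    for scheme, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the capture group when the pattern has one (VIAF URLs)
            value = match.group(1) if pattern.groups else match.group(0)
            found.append({"scheme": scheme, "value": value})
    return found
```

Because this is pure regex, it runs in microseconds and needs no subagent round-trip.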

Testing Strategy

Unit Tests (No Subagents)

def test_identifier_extraction():
    """Test regex pattern matching without subagents"""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]

Integration Tests (With Mocked Subagents)

def test_institution_extraction_with_mock_subagent(mocker):
    """Test with mocked subagent response"""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)
    
    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"

End-to-End Tests (Real Subagents)

@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with real subagent (slow, requires network)"""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0

Performance Considerations

Latency

  • Subagent invocation: ~2-5 seconds overhead per call
  • NER processing: ~1-10 seconds depending on text length
  • Total: ~3-15 seconds per conversation file

Optimization Strategies

  1. Batch Processing: Process multiple conversations in parallel subagents
  2. Caching: Cache subagent results keyed by conversation UUID
  3. Incremental Processing: Only process new/updated conversations
  4. Selective Extraction: Use cheap pattern matching first, subagents only when needed
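Strategy 2 (caching) can be sketched as a thin wrapper, assuming each conversation carries a stable UUID and subagent results are JSON-serializable; cached_extract and the cache location are assumptions:

```python
# Sketch of a per-conversation result cache keyed by conversation UUID.
import json
from pathlib import Path

def cached_extract(conversation_uuid: str, text: str, extract_fn,
                   cache_dir: Path = Path(".cache/ner")) -> dict:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{conversation_uuid}.json"
    if cache_file.exists():
        # Cache hit: skip the subagent round-trip entirely
        return json.loads(cache_file.read_text())
    result = extract_fn(text)  # cache miss: pay the subagent cost once
    cache_file.write_text(json.dumps(result))
    return result
```

With the 3-15 second per-file cost noted above, a cache like this makes re-runs over the 139+ conversation files nearly free.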

Resource Usage

  • Main process: Low memory (~100MB), no GPU needed
  • Subagent process: High memory (~2-4GB for NLP models), optional GPU
  • Benefit: Main application stays lightweight

Migration Path

If we later decide to bring NER into the main codebase:

  1. Implement NERExtractor class that replicates subagent behavior
  2. Add spaCy/transformers dependencies to pyproject.toml
  3. Update extraction code to call local NER instead of subagents
  4. Keep subagent interface as fallback option

The architecture supports both approaches without major refactoring.
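One way to keep both approaches swappable is a shared extractor protocol; the class names below (EntityExtractor, SubagentExtractor, LocalNERExtractor) are assumptions sketching the migration path, not existing code:

```python
# Sketch of a common interface so the pipeline is agnostic to the backend.
from typing import Protocol

class EntityExtractor(Protocol):
    def extract(self, text: str) -> dict: ...

class SubagentExtractor:
    """Current approach: delegate to a coding subagent via the Task tool."""
    def extract(self, text: str) -> dict:
        raise NotImplementedError("would invoke the Task tool in production")

class LocalNERExtractor:
    """Future approach: run spaCy/transformers in-process."""
    def extract(self, text: str) -> dict:
        raise NotImplementedError("would load a local NER model in production")

def run_pipeline(text: str, extractor: EntityExtractor) -> dict:
    # The pipeline depends only on the protocol, not on a concrete backend
    return extractor.extract(text)
```

Swapping backends then means changing one constructor call, not the extraction code.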

Alternative Approaches Considered

Alternative 1: Direct spaCy Integration

Pros: Lower latency, more control
Cons: Heavy dependencies, harder to swap implementations
Decision: Rejected due to complexity and maintenance burden

Alternative 2: External API Service (e.g., GPT-4 API)

Pros: No local dependencies, very flexible
Cons: Cost per request, requires API keys, network dependency
Decision: Could be used by subagents, but not as main architecture

Alternative 3: Hybrid Approach

Pros: Use regex for simple cases, subagents for complex
Cons: Two code paths to maintain
Decision: Partially adopted (regex for identifiers, subagents for NER)

References

  • Task Tool Documentation: OpenCode agent framework
  • spaCy Documentation: https://spacy.io
  • LinkML Schema: schemas/heritage_custodian.yaml
  • Design Patterns: docs/plan/global_glam/05-design-patterns.md

Decision Log

Date         Decision                         Rationale
2025-11-05   Use subagents for NER            Clean separation, flexibility, maintainability
2025-11-05   Keep regex in main code          Simple patterns don't need subagents
2025-11-05   Remove spaCy from dependencies   Not used directly in main code

Version: 1.0
Last Updated: 2025-11-05
Status: Active