Subagent-Based NER Architecture
Overview
This document describes the architectural decision to use coding subagents for Named Entity Recognition (NER) instead of directly integrating NLP libraries like spaCy or transformers into the main codebase.
Architecture Decision
Decision
Use coding subagents via the Task tool for all NER and entity extraction tasks, rather than directly importing and using NLP libraries in the main application code.
Status
Accepted (2025-11-05)
Context
The GLAM data extraction project needs to extract structured information (institution names, locations, identifiers, etc.) from 139+ conversation JSON files containing unstructured text in 60+ languages.
Traditional approaches would involve:
- Installing spaCy, transformers, PyTorch as dependencies
- Managing NLP model downloads and storage
- Writing NER extraction code in the main application
- Handling multilingual model selection
- Managing GPU/CPU resources for model inference
Decision Drivers
- Separation of Concerns: Keep extraction logic separate from data pipeline logic
- Flexibility: Allow experimentation with different NER approaches without changing main code
- Resource Management: Subagents can manage heavy NLP dependencies independently
- Maintainability: Cleaner main codebase without NLP-specific code
- Modularity: Subagents can be swapped or upgraded without affecting pipeline
Consequences
Positive
- ✅ Clean Dependencies: Main application has minimal dependencies (no PyTorch, spaCy, transformers)
- ✅ Flexibility: Can use different extraction methods (spaCy, GPT-4, regex, custom) without code changes
- ✅ Testability: Easier to mock extraction results for testing
- ✅ Scalability: Subagents can run in parallel, distributed across workers
- ✅ Maintainability: Clear separation between "what to extract" and "how to extract"
Negative
- ⚠️ Complexity: Additional layer of abstraction
- ⚠️ Debugging: Harder to debug extraction issues (need to inspect subagent behavior)
- ⚠️ Latency: Subagent invocation adds overhead compared to direct function calls
- ⚠️ Control: Less fine-grained control over NER parameters
Neutral
- 🔄 Testing Strategy: Need integration tests that use real subagents
- 🔄 Documentation: Must document subagent interface and expectations
Implementation Pattern
Subagent Invocation
from glam_extractor.task import Task

# Use a subagent for NER extraction
result = Task(
    subagent_type="general",
    description="Extract institutions from text",
    prompt=f"""
Extract all GLAM institution names, locations, and identifiers
from the following text:

{conversation_text}

Return results as JSON with fields:
- institutions: list of institution names
- locations: list of locations
- identifiers: list of {{scheme, value}} pairs
""",
)

# Process subagent results
institutions = result.get("institutions", [])
Data Flow
┌─────────────────────┐
│ Conversation JSON │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ ConversationParser │ (Main code)
│ - Parse JSON │
│ - Extract text │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ SUBAGENT BOUNDARY │ ← Task tool invocation
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Coding Subagent │ (Autonomous)
│ - Load spaCy model │
│ - Run NER │
│ - Extract entities │
│ - Return JSON │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ InstitutionBuilder │ (Main code)
│ - Validate results │
│ - Build models │
│ - Add provenance │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ HeritageCustodian │
│ (Pydantic models) │
└─────────────────────┘
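The flow above can be sketched as a thin orchestration function in the main code. This is a sketch only: `parse_conversation`, `invoke_ner_subagent`, and `build_institutions` are hypothetical names standing in for the real ConversationParser, Task invocation, and InstitutionBuilder components, and the subagent call is stubbed so the example is self-contained.

```python
import json

def parse_conversation(raw_json: str) -> str:
    """ConversationParser step: pull the free text out of a conversation record."""
    record = json.loads(raw_json)
    return record.get("text", "")

def invoke_ner_subagent(text: str) -> dict:
    """Subagent boundary: in the real pipeline this is a Task tool call.
    Stubbed here with a fixed response so the sketch runs standalone."""
    return {"institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}],
            "locations": [], "identifiers": []}

def build_institutions(result: dict) -> list[dict]:
    """InstitutionBuilder step: validate entries and attach provenance."""
    return [
        {**inst, "provenance": "subagent-ner"}
        for inst in result.get("institutions", [])
        if inst.get("name")
    ]

def run_pipeline(raw_json: str) -> list[dict]:
    text = parse_conversation(raw_json)
    result = invoke_ner_subagent(text)
    return build_institutions(result)
```

The main code never imports an NLP library; everything inside `invoke_ner_subagent` lives on the other side of the subagent boundary.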
Extraction Tasks Delegated to Subagents
Task 1: Institution Name Extraction
Input: Conversation text
Output: List of institution names with types
Subagent Approach: NER for ORG entities + keyword filtering
Task 2: Location Extraction
Input: Conversation text
Output: Locations with geocoding data
Subagent Approach: NER for GPE/LOC entities + geocoding API calls
Task 3: Relationship Extraction
Input: Conversation text
Output: Relationships between institutions
Subagent Approach: Dependency parsing + relation extraction
Task 4: Collection Metadata Extraction
Input: Conversation text
Output: Collection details (size, subject, dates)
Subagent Approach: Pattern matching + entity extraction
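Each of these tasks returns JSON that the main code must validate before model building. A minimal sketch of that validation step, using stdlib dataclasses in place of the project's Pydantic models (the class and field names here are illustrative, not the real schema):

```python
from dataclasses import dataclass

@dataclass
class ExtractedInstitution:
    name: str
    type: str = "UNKNOWN"

def validate_institutions(payload: dict) -> list[ExtractedInstitution]:
    """Reject malformed subagent entries instead of letting them into the pipeline."""
    validated = []
    for item in payload.get("institutions", []):
        # Skip anything that is not a dict with a non-empty name
        if not isinstance(item, dict) or not item.get("name"):
            continue
        validated.append(
            ExtractedInstitution(name=item["name"], type=item.get("type", "UNKNOWN"))
        )
    return validated
```

Defensive validation at the boundary is what makes the subagent swappable: the main code only depends on the JSON contract, not on how the entities were extracted.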
Main Code Responsibilities
What stays in the main codebase (NOT delegated to subagents):
1. Pattern Matching for Identifiers
   - ISIL code regex: [A-Z]{2}-[A-Za-z0-9]+
   - Wikidata ID regex: Q[0-9]+
   - VIAF ID extraction from URLs
   - KvK number validation
2. CSV Parsing
   - Read ISIL registry CSV
   - Parse Dutch organizations CSV
   - Map CSV columns to models
3. Data Validation
   - LinkML schema validation
   - Pydantic model validation
   - Cross-reference checking
4. Data Integration
   - Merge CSV and conversation data
   - Conflict resolution
   - Deduplication (using rapidfuzz)
5. Export
   - JSON-LD generation
   - RDF/Turtle serialization
   - Parquet export
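The identifier patterns above stay in the main code because plain regex handles them. A sketch of an `extract_identifiers` helper built from the ISIL and Wikidata patterns listed here (the helper name and return shape match the unit test in the next section, but this implementation is illustrative):

```python
import re

# Patterns from the responsibilities list above
ISIL_RE = re.compile(r"\b([A-Z]{2}-[A-Za-z0-9]+)\b")
WIKIDATA_RE = re.compile(r"\b(Q[0-9]+)\b")

def extract_identifiers(text: str) -> list[dict]:
    """Pure regex extraction -- no subagent involved."""
    identifiers = []
    for value in ISIL_RE.findall(text):
        identifiers.append({"scheme": "ISIL", "value": value})
    for value in WIKIDATA_RE.findall(text):
        identifiers.append({"scheme": "Wikidata", "value": value})
    return identifiers
```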
Testing Strategy
Unit Tests (No Subagents)
def test_identifier_extraction():
    """Test regex pattern matching without subagents."""
    text = "The ISIL code is NL-AsdRM"
    identifiers = extract_identifiers(text)  # Pure regex, no subagent
    assert identifiers == [{"scheme": "ISIL", "value": "NL-AsdRM"}]
Integration Tests (With Mocked Subagents)
@pytest.mark.subagent
def test_institution_extraction_with_mock_subagent(mocker):
    """Test with a mocked subagent response."""
    mock_result = {
        "institutions": [{"name": "Rijksmuseum", "type": "MUSEUM"}]
    }
    mocker.patch("glam_extractor.task.Task", return_value=mock_result)
    result = extract_institutions(sample_text)
    assert len(result) == 1
    assert result[0].name == "Rijksmuseum"
End-to-End Tests (Real Subagents)
@pytest.mark.subagent
@pytest.mark.slow
def test_real_extraction_pipeline():
    """Test with a real subagent (slow, requires network)."""
    conversation = load_sample_conversation()
    institutions = extract_with_subagent(conversation)  # Real subagent call
    assert len(institutions) > 0
Performance Considerations
Latency
- Subagent invocation: ~2-5 seconds overhead per call
- NER processing: ~1-10 seconds depending on text length
- Total: ~3-15 seconds per conversation file
Optimization Strategies
- Batch Processing: Process multiple conversations in parallel subagents
- Caching: Cache subagent results keyed by conversation UUID
- Incremental Processing: Only process new/updated conversations
- Selective Extraction: Use cheap pattern matching first, subagents only when needed
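The caching strategy can be sketched as a store keyed by conversation UUID plus a hash of the text, so that an edited conversation is reprocessed while unchanged ones are served from cache. This is an assumption about how such a layer could look, not existing project code:

```python
import hashlib

class SubagentCache:
    """In-memory cache of subagent results. Keyed by conversation UUID plus a
    content hash so edited conversations invalidate their own entry."""

    def __init__(self):
        self._store = {}

    def _key(self, uuid: str, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        return f"{uuid}:{digest}"

    def get_or_compute(self, uuid: str, text: str, compute):
        key = self._key(uuid, text)
        if key not in self._store:
            self._store[key] = compute(text)  # the expensive subagent call
        return self._store[key]
```

With ~2–5 seconds of invocation overhead per call, skipping repeat invocations for the 139+ conversation files is the single largest latency win.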
Resource Usage
- Main process: Low memory (~100MB), no GPU needed
- Subagent process: High memory (~2-4GB for NLP models), optional GPU
- Benefit: Main application stays lightweight
Migration Path
If we later decide to bring NER into the main codebase:
- Implement an NERExtractor class that replicates subagent behavior
- Add spaCy/transformers dependencies to pyproject.toml
- Update extraction code to call local NER instead of subagents
- Keep the subagent interface as a fallback option
The architecture supports both approaches without major refactoring.
Alternative Approaches Considered
Alternative 1: Direct spaCy Integration
Pros: Lower latency, more control
Cons: Heavy dependencies, harder to swap implementations
Decision: Rejected due to complexity and maintenance burden
Alternative 2: External API Service (e.g., GPT-4 API)
Pros: No local dependencies, very flexible
Cons: Cost per request, requires API keys, network dependency
Decision: Could be used by subagents, but not as main architecture
Alternative 3: Hybrid Approach
Pros: Use regex for simple cases, subagents for complex
Cons: Two code paths to maintain
Decision: Partially adopted (regex for identifiers, subagents for NER)
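The partially adopted hybrid amounts to a cheap-first router: run the regex path, and only pay subagent latency when the patterns find nothing. A sketch under that assumption (the router function and the ISIL pattern reuse are illustrative; the subagent call is passed in as a plain callable so the example stays self-contained):

```python
import re

ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b")

def extract(text: str, subagent_call) -> list:
    """Cheap regex path first; fall back to the expensive subagent path."""
    matches = ISIL_RE.findall(text)
    if matches:
        return [{"scheme": "ISIL", "value": m} for m in matches]
    return subagent_call(text)  # only invoked when regex finds nothing
```

The cost of the "two code paths" drawback is contained in one routing function, which keeps the trade-off auditable.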
References
- Task Tool Documentation: OpenCode agent framework
- spaCy Documentation: https://spacy.io
- LinkML Schema: schemas/heritage_custodian.yaml
- Design Patterns: docs/plan/global_glam/05-design-patterns.md
Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2025-11-05 | Use subagents for NER | Clean separation, flexibility, maintainability |
| 2025-11-05 | Keep regex in main code | Simple patterns don't need subagents |
| 2025-11-05 | Remove spaCy from dependencies | Not used directly in main code |
Version: 1.0
Last Updated: 2025-11-05
Status: Active