glam/docs/progress/session-03-architecture-update.md

# Session 3: Architecture Documentation Update

**Date**: 2025-11-05
**Task**: Update `docs/plan/global_glam/02-architecture.md` to reflect the subagent-based NER architecture

## Changes Made

### 1. Added Extraction Architecture Overview (New Section)

Added a new "Extraction Architecture: Hybrid Approach" section that describes the two extraction paths and links to the rationale:
- **Pattern Matching (Main Code)**: For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
- **Subagent-Based NER**: For unstructured entities (institution names, locations, relationships)
- A reference to `docs/plan/global_glam/07-subagent-architecture.md` for the detailed rationale

**Location**: Lines 30-47 (new section after "High-Level Architecture")
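
To illustrate the pattern-matching half of the hybrid approach, here is a minimal sketch of how structured identifiers could be pulled out with Python's `re` module. The `PATTERNS` table, the `extract_identifiers` function, and the regular expressions themselves are assumptions made for illustration; they are not taken from the project's code. The ISIL pattern, for example, only covers a two-letter country prefix with an alphanumeric suffix. The sample values in the usage note are invented.

```python
import re

# Illustrative patterns only (assumptions, not the project's actual expressions).
PATTERNS = {
    # Simplified ISIL: two-letter country prefix, hyphen, up to 11 alphanumerics.
    "isil": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]{1,11}\b"),
    # Wikidata item IDs: "Q" followed by a positive integer.
    "wikidata": re.compile(r"\bQ[1-9]\d*\b"),
    # VIAF IDs as they appear in viaf.org URLs; the group captures the number.
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # KvK numbers: eight digits following a "KvK" label.
    "kvk": re.compile(r"\bKvK[:\s]+(\d{8})\b"),
    # URLs: scheme plus everything up to whitespace or a closing delimiter.
    "url": re.compile(r"https?://[^\s<>\"')]+"),
}


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return each identifier type mapped to its unique matches, in order of appearance."""
    found: dict[str, list[str]] = {}
    for name, pattern in PATTERNS.items():
        # Strip sentence punctuation that may trail a match (mainly URLs).
        matches = [m.rstrip(".,;:") for m in pattern.findall(text)]
        # dict.fromkeys de-duplicates while preserving first-seen order.
        found[name] = list(dict.fromkeys(matches))
    return found
```

Calling `extract_identifiers("Example Museum (ISIL NL-AbcDE, Wikidata Q12345)")` would return `NL-AbcDE` under `"isil"` and `Q12345` under `"wikidata"`, with empty lists for the other keys. Institution names such as "Example Museum" are deliberately not matched here, since that is the part left to subagent-based NER.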

### 2. Updated NLP Extraction Pipeline Section

**Before**:
- Mentioned direct use of spaCy, transformers, scikit-learn
- Single "Entity Extractor" component
- No distinction between pattern matching and NER

**After**:
- Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
- Explicitly states which extractors use subagents
- Reference to subagent architecture document

**Location**: Lines 107-130

### 3. Updated Technology Stack

**Before**:
```
**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification
```

**After**:
```
**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)
```

**Location**: Lines 182-186

### 4. Updated Key Libraries Section

**Before**:
- Listed spaCy, transformers, scikit-learn as direct dependencies

**After**:
- Removed NLP libraries from main dependencies
- Added "Pattern Matching" and "Subagent Orchestration" categories
- Added a note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

**Location**: Lines 542-551

### 5. Updated Data Flow Architecture Diagram

**Before**:
```
Conversation Parser
    ↓
NLP Extraction Pipeline  ──> [SQLite Cache]
    ↓
Web Crawler
```

**After**:
```
Conversation Parser
    ↓
Pattern-based Extraction  ──> Extract: ISIL, Wikidata, URLs
    ↓
SUBAGENT BOUNDARY  ──> Launch subagents for NER
    ↓
Coding Subagents  ──> Subagents use spaCy/transformers
    ↓
Web Crawler
```

**Location**: Lines 429-479 (updated diagram)
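
The updated flow can be sketched in code as a two-stage pipeline in which the main code never imports an NLP library and only hands text across the subagent boundary. Everything named here (`ExtractionResult`, `SubagentRunner`, `run_pipeline`, `fake_subagent`) is hypothetical and exists only for illustration. In particular, the real subagent invocation goes through the Task tool, whose interface is not shown in this document, so it is represented by an injected callable.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExtractionResult:
    """Combined output of both stages (hypothetical structure)."""
    identifiers: dict[str, list[str]] = field(default_factory=dict)
    entities: list[dict] = field(default_factory=list)


# Hypothetical stand-in for a subagent launch: takes text, returns entity dicts.
SubagentRunner = Callable[[str], list[dict]]


def run_pipeline(text: str, launch_subagent: SubagentRunner) -> ExtractionResult:
    # Stage 1 (main code): pattern-based extraction of structured identifiers.
    identifiers = {"wikidata": re.findall(r"\bQ[1-9]\d*\b", text)}
    # Stage 2 (subagent boundary): delegate NER; any spaCy/transformers usage
    # happens on the other side of this call, not in the main code.
    entities = launch_subagent(text)
    return ExtractionResult(identifiers=identifiers, entities=entities)


def fake_subagent(text: str) -> list[dict]:
    """Test double that 'recognises' one hard-coded institution name."""
    if "Example Museum" in text:
        return [{"type": "institution", "name": "Example Museum"}]
    return []
```

Passing the subagent launcher as a parameter keeps the main module free of NLP imports and lets the pipeline be exercised with a stub such as `fake_subagent`, for example `run_pipeline("Example Museum (Q12345)", fake_subagent)`.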

### 6. Added Related Documentation Section

Added references to the following at the end of the document:
- Subagent Architecture document
- Design Patterns
- Data Standardization
- Dependencies
- LinkML Schema
- Agent Instructions (AGENTS.md)

**Location**: End of document (after "Integration Opportunities")

## Impact

These changes ensure that:
1. ✅ The architecture documentation accurately reflects the implementation (no spaCy in main code)
2. ✅ Developers have clear guidance on when to use pattern matching vs subagents
3. ✅ Related documentation is properly referenced
4. ✅ The technology stack listing is accurate (no misleading dependencies)

## Files Modified

1. `docs/plan/global_glam/02-architecture.md` - **UPDATED**

## Next Steps

According to the session resume summary, the next priority tasks are:

### HIGH PRIORITY
1. **CSVParser** - Parse Dutch ISIL registry and organizations CSV files
2. **LinkMLValidator** - Validate HeritageCustodian records against schema

### MEDIUM PRIORITY
3. **InstitutionExtractor** - Use Task tool to launch subagents for NER

## Status

✅ **Architecture documentation update COMPLETE**

All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.