# Session 3: Architecture Documentation Update

**Date**: 2025-11-05

**Task**: Update `docs/plan/global_glam/02-architecture.md` to reflect subagent-based NER architecture

## Changes Made

### 1. Added Extraction Architecture Overview (New Section)

Added a new "Extraction Architecture: Hybrid Approach" section that clearly explains:

- **Pattern Matching (Main Code)**: For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
- **Subagent-Based NER**: For unstructured entities (institution names, locations, relationships)
- Reference to `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale

**Location**: Lines 30-47 (new section after "High-Level Architecture")

### 2. Updated NLP Extraction Pipeline Section

**Before**:

- Mentioned direct use of spaCy, transformers, scikit-learn
- Single "Entity Extractor" component
- No distinction between pattern matching and NER

**After**:

- Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
- Explicitly states which extractors use subagents
- Reference to subagent architecture document

**Location**: Lines 107-130

### 3. Updated Technology Stack

**Before**:

```
**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification
```

**After**:

```
**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)
```

**Location**: Lines 182-186

### 4. Updated Key Libraries Section

**Before**:

- Listed spaCy, transformers, scikit-learn as direct dependencies

**After**:

- Removed NLP libraries from main dependencies
- Added "Pattern Matching" and "Subagent Orchestration" categories
- Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

**Location**: Lines 542-551

### 5. Updated Data Flow Architecture Diagram

**Before**:

```
Conversation Parser
    ↓
NLP Extraction Pipeline ──> [SQLite Cache]
    ↓
Web Crawler
```

**After**:

```
Conversation Parser
    ↓
Pattern-based Extraction ──> Extract: ISIL, Wikidata, URLs
    ↓
SUBAGENT BOUNDARY ──> Launch subagents for NER
    ↓
Coding Subagents ──> Subagents use spaCy/transformers
    ↓
Web Crawler
```

**Location**: Lines 429-479 (updated diagram)

### 6. Added Related Documentation Section

Added comprehensive references at the end of the document:

- Subagent Architecture document
- Design Patterns
- Data Standardization
- Dependencies
- LinkML Schema
- Agent Instructions (AGENTS.md)

**Location**: End of document (after "Integration Opportunities")

## Impact

These changes ensure that:

1. ✅ Architecture documentation accurately reflects implementation (no spaCy in main code)
2. ✅ Clear guidance for developers on when to use pattern matching vs subagents
3. ✅ Proper references to related documentation
4. ✅ Accurate technology stack listing (no misleading dependencies)

## Files Modified

1. `docs/plan/global_glam/02-architecture.md` - **UPDATED**

## Next Steps

According to the session resume summary, the next priority tasks are:

### HIGH PRIORITY

1. **CSVParser** - Parse Dutch ISIL registry and organizations CSV files
2. **LinkMLValidator** - Validate HeritageCustodian records against schema

### MEDIUM PRIORITY

3. **InstitutionExtractor** - Use Task tool to launch subagents for NER

## Status

✅ **Architecture documentation update COMPLETE**

All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.
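
## Appendix: Illustrative Sketch of the Pattern Matching / Subagent Split

The hybrid approach summarized under "Changes Made" can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`extract_identifiers`, `build_ner_request`), the request dictionary shape, and the regular expressions are all assumptions made for this example. The ISIL pattern in particular is deliberately simplified, and the sample text uses made-up identifiers. The point is only that the main code relies on Python's `re` module alone, while anything needing NER is packaged up for a subagent.

```python
from __future__ import annotations

import re

# Simplified, illustrative patterns (assumptions, not the project's real ones).
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]{1,11}\b")  # e.g. "NL-" + local code
WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")               # Wikidata item IDs
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")          # bare http(s) URLs


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Main-code step: find structured identifiers with regex only."""
    return {
        "isil": sorted(set(ISIL_RE.findall(text))),
        "wikidata": sorted(set(WIKIDATA_RE.findall(text))),
        "url": sorted(set(URL_RE.findall(text))),
    }


def build_ner_request(text: str, found: dict[str, list[str]]) -> dict:
    """Hypothetical hand-off payload for the subagent boundary.

    The main code does not import spaCy or transformers; it only describes
    the work. How the payload reaches a subagent is outside this sketch.
    """
    return {
        "task": "extract institution names, locations, relationships",
        "text": text,
        "already_extracted": found,
    }


if __name__ == "__main__":
    # Made-up sample text; the identifiers are placeholders.
    sample = (
        "Example Archive (ISIL NL-XxxYY, Wikidata Q12345) "
        "see https://example.org/archive for details"
    )
    ids = extract_identifiers(sample)
    print(ids)
    print(build_ner_request(sample, ids)["task"])
```

Keeping the hand-off as plain data means the main application stays free of NLP dependencies, which matches the note added to the Key Libraries section.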