Session 3: Architecture Documentation Update
Date: 2025-11-05
Task: Update docs/plan/global_glam/02-architecture.md to reflect subagent-based NER architecture
Changes Made
1. Added Extraction Architecture Overview (New Section)
Added a new "Extraction Architecture: Hybrid Approach" section that clearly explains:
- Pattern Matching (Main Code): For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
- Subagent-Based NER: For unstructured entities (institution names, locations, relationships)
- Reference to docs/plan/global_glam/07-subagent-architecture.md for detailed rationale
Location: Lines 30-47 (new section after "High-Level Architecture")
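For illustration, the pattern-matching half of the hybrid approach could look roughly like the sketch below. It is not taken from the project code: the function name `extract_identifiers` and the regular expressions are assumptions, and the ISIL pattern is deliberately simplified (two-letter country prefix only), so it will produce false positives on strings such as "US-based". The sample values in the usage note are illustrative only.

```python
import re

# Simplified, assumed patterns; a production version would need stricter rules.
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b")  # e.g. NL-HaNA
WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")                   # e.g. Q12345
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return de-duplicated structured identifiers found in free text."""
    # Strip sentence punctuation that the URL pattern may have swallowed.
    urls = {u.rstrip(".,;:") for u in URL_RE.findall(text)}
    return {
        "isil": sorted(set(ISIL_RE.findall(text))),
        "wikidata": sorted(set(WIKIDATA_RE.findall(text))),
        "url": sorted(urls),
    }
```

Calling `extract_identifiers("ISIL NL-HaNA, Wikidata Q12345, site https://example.org/archive.")` returns one entry per category. Institution names, locations, and relationships are intentionally out of scope here, since those go to the subagents.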
2. Updated NLP Extraction Pipeline Section
Before:
- Mentioned direct use of spaCy, transformers, scikit-learn
- Single "Entity Extractor" component
- No distinction between pattern matching and NER
After:
- Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
- Explicitly states which extractors use subagents
- Reference to subagent architecture document
Location: Lines 107-130
3. Updated Technology Stack
Before:
**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification
After:
**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)
Location: Lines 182-186
4. Updated Key Libraries Section
Before:
- Listed spaCy, transformers, scikit-learn as direct dependencies
After:
- Removed NLP libraries from main dependencies
- Added "Pattern Matching" and "Subagent Orchestration" categories
- Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."
Location: Lines 542-551
5. Updated Data Flow Architecture Diagram
Before:
Conversation Parser
↓
NLP Extraction Pipeline ──> [SQLite Cache]
↓
Web Crawler
After:
Conversation Parser
↓
Pattern-based Extraction ──> Extract: ISIL, Wikidata, URLs
↓
SUBAGENT BOUNDARY ──> Launch subagents for NER
↓
Coding Subagents ──> Subagents use spaCy/transformers
↓
Web Crawler
Location: Lines 429-479 (updated diagram)
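The updated flow can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `run_extraction`, `SubagentLauncher`, and the prompt wording are hypothetical, and the real Task tool invocation is represented only by a caller-supplied function so that the main code never imports spaCy or transformers itself.

```python
import re
from typing import Callable

# Hypothetical stand-in for the Task tool: takes a prompt, returns entity dicts.
SubagentLauncher = Callable[[str], list[dict]]

# Simplified, assumed ISIL pattern (two-letter country prefix only).
ISIL_RE = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b")


def run_extraction(text: str, launch_subagent: SubagentLauncher) -> dict:
    """Run pattern-based extraction in main code, then delegate NER."""
    # Step 1: pattern-based extraction stays in the main code.
    record: dict = {"identifiers": {"isil": sorted(set(ISIL_RE.findall(text)))}}

    # Step 2: SUBAGENT BOUNDARY. Everything past this point runs in a
    # subagent, which is free to choose its own NLP tooling.
    prompt = (
        "Extract institution names, locations, and relationships "
        "from the following text:\n" + text
    )
    record["entities"] = launch_subagent(prompt)
    return record
```

In a test, `launch_subagent` can be a stub that returns a fixed list, which keeps the main code testable without any NLP libraries installed.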
6. Added Related Documentation Section
Added comprehensive references at the end of the document:
- Subagent Architecture document
- Design Patterns
- Data Standardization
- Dependencies
- LinkML Schema
- Agent Instructions (AGENTS.md)
Location: End of document (after "Integration Opportunities")
Impact
These changes ensure that:
- ✅ Architecture documentation accurately reflects implementation (no spaCy in main code)
- ✅ Clear guidance for developers on when to use pattern matching vs subagents
- ✅ Proper references to related documentation
- ✅ Accurate technology stack listing (no misleading dependencies)
Files Modified
docs/plan/global_glam/02-architecture.md - UPDATED
Next Steps
According to the session resume summary, the next priority tasks are:
HIGH PRIORITY
- CSVParser - Parse Dutch ISIL registry and organizations CSV files
- LinkMLValidator - Validate HeritageCustodian records against schema
MEDIUM PRIORITY
- InstitutionExtractor - Use Task tool to launch subagents for NER
Status
✅ Architecture documentation update COMPLETE
All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.