glam/docs/progress/session-03-architecture-update.md
2025-11-19 23:25:22 +01:00

# Session 3: Architecture Documentation Update
**Date**: 2025-11-05

**Task**: Update `docs/plan/global_glam/02-architecture.md` to reflect subagent-based NER architecture
## Changes Made
### 1. Added Extraction Architecture Overview (New Section)
Added a new "Extraction Architecture: Hybrid Approach" section that clearly explains:
- **Pattern Matching (Main Code)**: For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
- **Subagent-Based NER**: For unstructured entities (institution names, locations, relationships)
- Reference to `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale

**Location**: Lines 30-47 (new section after "High-Level Architecture")
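
As a rough illustration of the pattern-matching half of the hybrid approach, the sketch below pulls structured identifiers out of free text with Python's `re` module. The regular expressions, the `PATTERNS` table, and the `extract_identifiers` helper are assumptions made for this example, not the project's actual patterns, and the sample values are made up.

```python
import re

# Illustrative patterns only (assumptions); the project's real patterns may differ.
PATTERNS = {
    # ISIL: country prefix, hyphen, unit identifier of up to 11 characters
    "isil": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b"),
    # Wikidata item ID: "Q" followed by digits
    "wikidata": re.compile(r"\bQ[1-9]\d*\b"),
    # VIAF: numeric ID taken from a viaf.org URL
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # KvK: 8-digit number, only accepted when preceded by the "KvK" keyword
    "kvk": re.compile(r"\bKvK(?:-nummer)?[:\s]+(\d{8})\b"),
    "url": re.compile(r"https?://[^\s<>()]+"),
}


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return de-duplicated matches per identifier type, in order of appearance."""
    found: dict[str, list[str]] = {}
    for name, pattern in PATTERNS.items():
        matches: list[str] = []
        for m in pattern.finditer(text):
            # Use the capture group when the pattern defines one, else the whole match.
            value = m.group(1) if pattern.groups else m.group(0)
            if value not in matches:
                matches.append(value)
        found[name] = matches
    return found
```

Anything these patterns cannot capture (institution names, locations, relationships) is what the notes assign to the subagent-based NER path.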
### 2. Updated NLP Extraction Pipeline Section
**Before**:
- Mentioned direct use of spaCy, transformers, scikit-learn
- Single "Entity Extractor" component
- No distinction between pattern matching and NER

**After**:
- Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
- Explicitly states which extractors use subagents
- Reference to subagent architecture document

**Location**: Lines 107-130
### 3. Updated Technology Stack
**Before**:
```
**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification
```
**After**:
```
**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)
```
**Location**: Lines 182-186
### 4. Updated Key Libraries Section
**Before**:
- Listed spaCy, transformers, scikit-learn as direct dependencies

**After**:
- Removed NLP libraries from main dependencies
- Added "Pattern Matching" and "Subagent Orchestration" categories
- Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

**Location**: Lines 542-551
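
One way to keep that note true over time is a small check that main-application source files never import the NLP libraries. The sketch below is an assumption about how such a check could look, not something the project is known to ship; `FORBIDDEN` and `forbidden_imports` are hypothetical names.

```python
import ast

# Top-level import names of the libraries the note excludes from main code.
# (PyTorch is imported as "torch".)
FORBIDDEN = {"spacy", "transformers", "torch"}


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages imported by the given source code."""
    hits: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are ignored.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in FORBIDDEN:
                hits.add(top_level)
    return hits
```

Run over each main-code module's source text, an empty result means the module stays on the pattern-matching side of the dependency split.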
### 5. Updated Data Flow Architecture Diagram
**Before**:
```
Conversation Parser
NLP Extraction Pipeline ──> [SQLite Cache]
Web Crawler
```
**After**:
```
Conversation Parser
Pattern-based Extraction ──> Extract: ISIL, Wikidata, URLs
SUBAGENT BOUNDARY ──> Launch subagents for NER
Coding Subagents ──> Subagents use spaCy/transformers
Web Crawler
```
**Location**: Lines 429-479 (updated diagram)
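
The updated flow can be sketched in code as two stages separated by the subagent boundary. This is a minimal sketch under stated assumptions: the notes do not show how the Task tool is invoked, so the subagent is modelled as a plain injected callable (`run_subagent`), and `ExtractionResult`, `extract`, and `fake_subagent` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical interface: a subagent runner takes a task prompt and returns
# a list of entity dicts. The real Task tool call is not shown in these notes.
SubagentRunner = Callable[[str], list[dict]]

WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")
URL_RE = re.compile(r"https?://\S+")


@dataclass
class ExtractionResult:
    identifiers: dict[str, list[str]] = field(default_factory=dict)
    entities: list[dict] = field(default_factory=list)


def extract(text: str, run_subagent: SubagentRunner) -> ExtractionResult:
    # Stage 1 (main code): pattern-based extraction, no NLP libraries involved.
    identifiers = {
        "wikidata": WIKIDATA_RE.findall(text),
        "url": URL_RE.findall(text),
    }
    # --- SUBAGENT BOUNDARY ---
    # Stage 2: NER is delegated; the subagent decides which tools to use.
    prompt = "Extract institution names and locations from:\n" + text
    entities = run_subagent(prompt)
    return ExtractionResult(identifiers=identifiers, entities=entities)


def fake_subagent(prompt: str) -> list[dict]:
    # Stand-in for local testing; a real subagent would analyse the prompt.
    return [{"type": "institution", "name": "Sample Museum"}]


result = extract("Sample Museum (Q12345) at https://example.org/museum", fake_subagent)
```

Injecting the runner keeps the main code free of NLP imports and lets the boundary be exercised in tests with a stub.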
### 6. Added Related Documentation Section
Added references to related documents at the end of the document:
- Subagent Architecture document
- Design Patterns
- Data Standardization
- Dependencies
- LinkML Schema
- Agent Instructions (AGENTS.md)

**Location**: End of document (after "Integration Opportunities")
## Impact
These changes ensure that:
1. ✅ Architecture documentation accurately reflects implementation (no spaCy in main code)
2. ✅ Clear guidance for developers on when to use pattern matching vs subagents
3. ✅ Proper references to related documentation
4. ✅ Accurate technology stack listing (no misleading dependencies)
## Files Modified
1. `docs/plan/global_glam/02-architecture.md` - **UPDATED**
## Next Steps
According to the session resume summary, the next priority tasks are:
### HIGH PRIORITY
1. **CSVParser** - Parse Dutch ISIL registry and organizations CSV files
2. **LinkMLValidator** - Validate HeritageCustodian records against schema
### MEDIUM PRIORITY
3. **InstitutionExtractor** - Use Task tool to launch subagents for NER
## Status
**Architecture documentation update COMPLETE**
All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.