# Session 3: Architecture Documentation Update

**Date**: 2025-11-05
**Task**: Update `docs/plan/global_glam/02-architecture.md` to reflect subagent-based NER architecture

## Changes Made

### 1. Added Extraction Architecture Overview (New Section)

Added a new "Extraction Architecture: Hybrid Approach" section that clearly explains:
- **Pattern Matching (Main Code)**: For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
- **Subagent-Based NER**: For unstructured entities (institution names, locations, relationships)
- Reference to `docs/plan/global_glam/07-subagent-architecture.md` for detailed rationale

**Location**: Lines 30-47 (new section after "High-Level Architecture")

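The pattern-matching half of the hybrid approach can be illustrated with a minimal sketch using only Python's `re` module. The regular expressions, the `PATTERNS` table, and the `extract_identifiers` function are assumptions made for illustration; they are deliberately simplified (the ISIL pattern, for example, only accepts two-letter country prefixes and will produce false positives on strings such as `US-based`) and are not the project's actual rules. The identifier values in the sample text are made up.

```python
import re

# Simplified, illustrative patterns (assumptions, not the project's real rules).
PATTERNS = {
    "isil": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b"),
    "wikidata": re.compile(r"\bQ[1-9]\d*\b"),
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    "kvk": re.compile(r"\bKvK[\s:#-]*(\d{8})\b", re.IGNORECASE),
    "url": re.compile(r"https?://[^\s<>()]+"),
}


def extract_identifiers(text):
    """Return a dict mapping identifier kind to a de-duplicated list of matches."""
    found = {}
    for kind, pattern in PATTERNS.items():
        values = []
        for match in pattern.finditer(text):
            # Use the capture group when the pattern defines one, else the full match.
            value = match.group(1) if pattern.groups else match.group(0)
            # Drop sentence punctuation that may trail a URL.
            value = value.rstrip(".,;")
            if value not in values:
                values.append(value)
        found[kind] = values
    return found


sample = (
    "Example Archive (ISIL NL-HaNA, Wikidata Q12345), "
    "VIAF https://viaf.org/viaf/123456789, KvK 12345678."
)
identifiers = extract_identifiers(sample)
```

Unstructured entities such as institution names have no comparable surface pattern, which is why the section assigns them to subagent-based NER instead.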
### 2. Updated NLP Extraction Pipeline Section

**Before**:
- Mentioned direct use of spaCy, transformers, scikit-learn
- Single "Entity Extractor" component
- No distinction between pattern matching and NER

**After**:
- Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
- Explicitly states which extractors use subagents
- Reference to subagent architecture document

**Location**: Lines 107-130

### 3. Updated Technology Stack

**Before**:
```
**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification
```

**After**:
```
**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)
```

**Location**: Lines 182-186

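To give a feel for the fuzzy-matching role in the updated stack, here is a small sketch of matching a noisy institution name against known names. The stack names `rapidfuzz` for this; to keep the example free of third-party dependencies, it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in, so scores will differ from what `rapidfuzz` would produce. The `normalize` and `best_match` helpers, the 0.8 cutoff, and the candidate names are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def best_match(query, candidates, cutoff=0.8):
    """Return (candidate, score) for the most similar candidate, or None below the cutoff."""
    best = None
    normalized_query = normalize(query)
    for candidate in candidates:
        score = SequenceMatcher(None, normalized_query, normalize(candidate)).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (candidate, score)
    return best


known_names = ["Rijksmuseum Amsterdam", "Van Gogh Museum"]
match = best_match("Rijks museum  Amsterdam", known_names)
no_match = best_match("Completely different", known_names)
```

Here `match` resolves to the `Rijksmuseum Amsterdam` entry with a score above the cutoff, while `no_match` is `None`.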
### 4. Updated Key Libraries Section

**Before**:
- Listed spaCy, transformers, scikit-learn as direct dependencies

**After**:
- Removed NLP libraries from main dependencies
- Added "Pattern Matching" and "Subagent Orchestration" categories
- Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

**Location**: Lines 542-551

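One way to keep that note true over time is a guard that scans main-application source for imports of the NLP libraries. The sketch below is an assumption about how such a check could look, not something the architecture document prescribes; `FORBIDDEN` and `forbidden_imports` are hypothetical names. It parses source text with the standard `ast` module, so the scanned libraries do not need to be installed.

```python
import ast

# Top-level import names of the libraries the note excludes from the main application.
FORBIDDEN = {"spacy", "transformers", "torch"}


def forbidden_imports(source, forbidden=FORBIDDEN):
    """Return the sorted forbidden top-level packages imported by `source`."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from package.module import name" statements only.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                hits.add(root)
    return sorted(hits)


clean_source = "import re\nfrom rapidfuzz import fuzz\n"
leaky_source = "import spacy\nfrom transformers import pipeline\nimport torch.nn as nn\n"

clean_hits = forbidden_imports(clean_source)
leaky_hits = forbidden_imports(leaky_source)
```

In a test suite, the same function could be applied to each `.py` file of the main package, failing the build when the returned list is non-empty.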
### 5. Updated Data Flow Architecture Diagram

**Before**:
```
Conversation Parser
↓
NLP Extraction Pipeline ──> [SQLite Cache]
↓
Web Crawler
```

**After**:
```
Conversation Parser
↓
Pattern-based Extraction ──> Extract: ISIL, Wikidata, URLs
↓
SUBAGENT BOUNDARY ──> Launch subagents for NER
↓
Coding Subagents ──> Subagents use spaCy/transformers
↓
Web Crawler
```

**Location**: Lines 429-479 (updated diagram)

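The updated flow can be sketched in code as a pipeline in which pattern extraction runs in-process and NER is reached only through an injected callable. Everything named here (`ExtractionResult`, `extract_patterns`, `run_pipeline`, `fake_subagent`) is hypothetical, and the callable merely stands in for a subagent launched via the Task tool, whose real interface this document does not describe.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical signature: text in, list of entity dicts out.
NerRunner = Callable[[str], List[dict]]


@dataclass
class ExtractionResult:
    identifiers: Dict[str, List[str]] = field(default_factory=dict)
    entities: List[dict] = field(default_factory=list)


def extract_patterns(text):
    """Main-code stage: structured identifiers via regular expressions only."""
    return {
        "wikidata": re.findall(r"\bQ[1-9]\d*\b", text),
        "urls": re.findall(r"https?://[^\s<>()]+", text),
    }


def run_pipeline(text, ner_runner):
    """Run pattern extraction first, then hand the text across the subagent boundary."""
    result = ExtractionResult(identifiers=extract_patterns(text))
    # SUBAGENT BOUNDARY: the main code never imports spaCy or transformers itself.
    result.entities = ner_runner(text)
    return result


def fake_subagent(text):
    """Test double for a coding subagent; a real one would choose its own NLP tools."""
    if "Example Museum" in text:
        return [{"type": "institution", "name": "Example Museum"}]
    return []


result = run_pipeline("Example Museum (Q42) https://example.org/about", fake_subagent)
```

Passing the NER step in as a parameter keeps the boundary explicit and lets the main code be tested with a stub, without any NLP library installed.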
### 6. Added Related Documentation Section

Added comprehensive references at the end of the document:
- Subagent Architecture document
- Design Patterns
- Data Standardization
- Dependencies
- LinkML Schema
- Agent Instructions (AGENTS.md)

**Location**: End of document (after "Integration Opportunities")

## Impact

These changes ensure that:
1. ✅ Architecture documentation accurately reflects implementation (no spaCy in main code)
2. ✅ Clear guidance for developers on when to use pattern matching vs subagents
3. ✅ Proper references to related documentation
4. ✅ Accurate technology stack listing (no misleading dependencies)

## Files Modified

1. `docs/plan/global_glam/02-architecture.md` - **UPDATED**

## Next Steps

According to the session resume summary, the next priority tasks are:

### HIGH PRIORITY
1. **CSVParser** - Parse Dutch ISIL registry and organizations CSV files
2. **LinkMLValidator** - Validate HeritageCustodian records against schema

### MEDIUM PRIORITY
3. **InstitutionExtractor** - Use Task tool to launch subagents for NER

## Status

✅ **Architecture documentation update COMPLETE**

All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.