glam/docs/progress/session-03-architecture-update.md
2025-11-19 23:25:22 +01:00

Session 3: Architecture Documentation Update

Date: 2025-11-05
Task: Update docs/plan/global_glam/02-architecture.md to reflect subagent-based NER architecture

Changes Made

1. Added Extraction Architecture Overview (New Section)

Added a new "Extraction Architecture: Hybrid Approach" section that clearly explains:

  • Pattern Matching (Main Code): For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
  • Subagent-Based NER: For unstructured entities (institution names, locations, relationships)
  • Where to find the detailed rationale: docs/plan/global_glam/07-subagent-architecture.md

Location: Lines 30-47 (new section after "High-Level Architecture")
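
The pattern-matching half of the hybrid approach can be illustrated with a minimal sketch using only Python's `re` module. The patterns below are simplified assumptions made for illustration (for example, the ISIL pattern only covers two-letter country prefixes and would also match strings such as "US-based"); they are not the rules used in the project, and the name `extract_identifiers` is hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not the project's actual rules).
PATTERNS = {
    # ISIL: two-letter prefix, hyphen, then up to 11 identifier characters.
    "isil": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b"),
    # Wikidata item IDs: "Q" followed by a positive integer.
    "wikidata": re.compile(r"\bQ[1-9]\d*\b"),
    # VIAF IDs, taken from viaf.org URLs.
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # KvK numbers: 8 digits, here only when preceded by the label "KvK".
    "kvk": re.compile(r"\bKvK[\s:#-]*(\d{8})\b", re.IGNORECASE),
    # URLs: http(s) scheme up to the next whitespace or closing bracket.
    "url": re.compile(r"https?://[^\s<>)\]]+"),
}


def extract_identifiers(text):
    """Return a dict mapping identifier type to the unique matches found in text."""
    found = {}
    for name, pattern in PATTERNS.items():
        # Use capture group 1 when the pattern defines one, else the whole match.
        values = [
            m.group(1) if pattern.groups else m.group(0)
            for m in pattern.finditer(text)
        ]
        # Deduplicate while preserving first-seen order.
        found[name] = list(dict.fromkeys(values))
    return found


# Usage with placeholder values (not real registry data):
sample = (
    "Example Museum (ISIL NL-AsdRM, Wikidata Q12345) is listed at "
    "https://viaf.org/viaf/159624082 and registered under KvK 12345678."
)
identifiers = extract_identifiers(sample)
```

Unstructured entities such as institution names have no comparable fixed shape, which is why the section assigns them to subagents rather than to regular expressions.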

2. Updated NLP Extraction Pipeline Section

Before:

  • Mentioned direct use of spaCy, transformers, scikit-learn
  • Single "Entity Extractor" component
  • No distinction between pattern matching and NER

After:

  • Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
  • Explicitly states which extractors use subagents
  • Reference to subagent architecture document

Location: Lines 107-130

3. Updated Technology Stack

Before:

**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification

After:

**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)

Location: Lines 182-186
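
As a rough illustration of the fuzzy-matching role that `rapidfuzz` plays in the updated stack, the sketch below matches a misspelled institution name against a list of known names. It assumes `rapidfuzz.fuzz.ratio` as the scorer and falls back to the standard library's `difflib` when `rapidfuzz` is not installed, so that the example runs either way. The helper `best_match`, the cutoff of 85, and the sample names are hypothetical.

```python
from difflib import SequenceMatcher

try:
    # rapidfuzz is the library named in the updated technology stack.
    from rapidfuzz import fuzz

    def similarity(a, b):
        """Similarity score between 0 and 100."""
        return fuzz.ratio(a, b)

except ImportError:
    # Stand-in so the sketch still runs without rapidfuzz installed.
    def similarity(a, b):
        """Similarity score between 0 and 100 (difflib fallback)."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_match(name, known_names, cutoff=85.0):
    """Return (known_name, score) for the closest known name, or None below cutoff."""
    query = name.casefold()
    scored = [(known, similarity(query, known.casefold())) for known in known_names]
    if not scored:
        return None
    known, score = max(scored, key=lambda pair: pair[1])
    return (known, score) if score >= cutoff else None


# Usage with illustrative names:
known = ["Rijksmuseum Amsterdam", "Nationaal Archief", "Koninklijke Bibliotheek"]
match = best_match("Rijksmusuem Amsterdam", known)
```

Because this only compares strings, it fits in the main code without pulling in any NLP library.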

4. Updated Key Libraries Section

Before:

  • Listed spaCy, transformers, scikit-learn as direct dependencies

After:

  • Removed NLP libraries from main dependencies
  • Added "Pattern Matching" and "Subagent Orchestration" categories
  • Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

Location: Lines 542-551

5. Updated Data Flow Architecture Diagram

Before:

Conversation Parser
    ↓
NLP Extraction Pipeline  ──> [SQLite Cache]
    ↓
Web Crawler

After:

Conversation Parser
    ↓
Pattern-based Extraction  ──> Extract: ISIL, Wikidata, URLs
    ↓
SUBAGENT BOUNDARY  ──> Launch subagents for NER
    ↓
Coding Subagents  ──> Subagents use spaCy/transformers
    ↓
Web Crawler

Location: Lines 429-479 (updated diagram)
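
The updated flow can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `launch_subagent` is a hypothetical callable standing in for whatever the Task tool invocation looks like, `ExtractionResult` and `run_extraction` are made-up names, and the two regular expressions are simplified.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Simplified patterns for the main-code step (assumptions for illustration).
WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")
URL_RE = re.compile(r"https?://[^\s<>)\]]+")

# Hypothetical interface: takes conversation text, returns entity dicts from a subagent.
SubagentLauncher = Callable[[str], list]


@dataclass
class ExtractionResult:
    identifiers: dict = field(default_factory=dict)  # from pattern matching
    entities: list = field(default_factory=list)     # from subagent NER


def extract_patterns(text):
    """Main-code step: structured identifiers only, no NLP libraries imported."""
    return {
        "wikidata": WIKIDATA_RE.findall(text),
        "url": URL_RE.findall(text),
    }


def run_extraction(text, launch_subagent):
    """Run pattern matching first, then cross the subagent boundary for NER."""
    identifiers = extract_patterns(text)
    # --- SUBAGENT BOUNDARY: anything below may use spaCy/transformers ---
    entities = launch_subagent(text)
    return ExtractionResult(identifiers=identifiers, entities=entities)


def stub_subagent(text):
    """Stand-in for a real subagent; returns a fixed, made-up entity."""
    return [{"type": "institution", "name": "Example Museum"}]


result = run_extraction(
    "Example Museum (Q12345), see https://example.org/about", stub_subagent
)
```

Passing the launcher in as a parameter keeps the main code free of NLP imports and lets tests substitute a stub, as done here.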

6. Added References Section

Added cross-references to related documentation at the end of the document:

  • Subagent Architecture document
  • Design Patterns
  • Data Standardization
  • Dependencies
  • LinkML Schema
  • Agent Instructions (AGENTS.md)

Location: End of document (after "Integration Opportunities")

Impact

These changes ensure that:

  1. The architecture documentation accurately reflects the implementation (no spaCy in main code)
  2. Developers have clear guidance on when to use pattern matching versus subagents
  3. Related documentation is properly referenced
  4. The technology stack listing is accurate (no misleading dependencies)

Files Modified

  1. docs/plan/global_glam/02-architecture.md - UPDATED

Next Steps

According to the session resume summary, the next priority tasks are:

HIGH PRIORITY

  1. CSVParser - Parse Dutch ISIL registry and organizations CSV files
  2. LinkMLValidator - Validate HeritageCustodian records against schema

MEDIUM PRIORITY

  1. InstitutionExtractor - Use Task tool to launch subagents for NER

Status

Architecture documentation update COMPLETE

All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.