kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

Session 3: Architecture Documentation Update

Date: 2025-11-05
Task: Update docs/plan/global_glam/02-architecture.md to reflect subagent-based NER architecture

Changes Made

1. Added Extraction Architecture Overview (New Section)

Added a new "Extraction Architecture: Hybrid Approach" section that explains:

Pattern Matching (Main Code): For structured identifiers (ISIL, Wikidata, VIAF, KvK, URLs)
Subagent-Based NER: For unstructured entities (institution names, locations, relationships)
Detailed rationale: See docs/plan/global_glam/07-subagent-architecture.md

Location: Lines 30-47 (new section after "High-Level Architecture")
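
As an illustration of the pattern-matching half of the hybrid approach, here is a minimal sketch using only Python's `re` module. The regular expressions, the `PATTERNS` table, and the `extract_identifiers` helper are assumptions made for this example, not code from the project. The patterns are deliberately simplified (the ISIL pattern, for instance, would also match a phrase like "US-based"), and the identifiers in the sample text are made up.

```python
import re

# Simplified, illustrative patterns (assumptions, not the project's own).
# Real-world identifiers have more edge cases than these cover.
PATTERNS = {
    # ISIL: two-letter country prefix, hyphen, unit identifier of up to 11 characters
    "isil": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9:/-]{1,11}\b"),
    # Wikidata item IDs: "Q" followed by a positive integer
    "wikidata": re.compile(r"\bQ[1-9]\d*\b"),
    # VIAF: numeric ID taken from a viaf.org URL
    "viaf": re.compile(r"viaf\.org/viaf/(\d+)"),
    # KvK: 8-digit number, only accepted when preceded by the "KvK" keyword
    "kvk": re.compile(r"\bKvK[\s:]*(\d{8})\b"),
    # URLs: http(s) scheme up to the next whitespace or closing delimiter
    "url": re.compile(r"https?://[^\s<>\"')]+"),
}


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return the sorted, de-duplicated matches for each identifier type."""
    found = {}
    for name, pattern in PATTERNS.items():
        # Use the capture group when the pattern defines one, else the whole match.
        matches = [
            m.group(1) if pattern.groups else m.group(0)
            for m in pattern.finditer(text)
        ]
        found[name] = sorted(set(matches))
    return found


# Made-up sample text; none of these identifiers refer to a real institution.
SAMPLE = (
    "Example Museum (ISIL NL-ExmplM, Wikidata Q12345, KvK 12345678): "
    "see https://example.org/museum and https://viaf.org/viaf/123456789 for details."
)
identifiers = extract_identifiers(SAMPLE)
```

Institution names, locations, and relationships in the same text have no fixed surface form, which is why they are left to the subagent-based NER step.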

2. Updated NLP Extraction Pipeline Section

Before:

Mentioned direct use of spaCy, transformers, scikit-learn
Single "Entity Extractor" component
No distinction between pattern matching and NER

After:

Clear separation: "Entity Extractor (Main Code - Pattern Matching)" vs "Entity Extractor (Subagent-Based NER)"
Explicitly states which extractors use subagents
Reference to subagent architecture document

Location: Lines 107-130

3. Updated Technology Stack

Before:

**Technology**:
- spaCy: Core NLP framework
- transformers: For advanced NER (BERT-based models)
- scikit-learn: Classification tasks
- regex: Pattern-based extraction
- langdetect: Language identification

After:

**Technology**:
- Pattern Matching (main code): Python `re` module, `rapidfuzz` for fuzzy matching
- NER & Extraction (subagents): Coding subagents choose appropriate tools (spaCy, transformers, etc.)
- Task Orchestration: Task tool for subagent invocation
- Text Processing: `langdetect` for language identification (main code)

Location: Lines 182-186

4. Updated Key Libraries Section

Before:

Listed spaCy, transformers, scikit-learn as direct dependencies

After:

Removed NLP libraries from main dependencies
Added "Pattern Matching" and "Subagent Orchestration" categories
Added important note: "NLP libraries (spaCy, transformers, PyTorch) are NOT dependencies of the main application. They are used by coding subagents when needed for entity extraction."

Location: Lines 542-551
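
One way to keep the "not a dependency of the main application" rule from eroding is a small check over main-code sources. The sketch below is an assumption about how such a guard could look, not something the project is known to ship: it uses the standard `ast` module to flag imports of `spacy`, `transformers`, or `torch` (the import name of PyTorch). The `FORBIDDEN` set and the `forbidden_imports` helper are hypothetical names.

```python
import ast

# Top-level packages that, per the note above, belong to subagents only.
FORBIDDEN = {"spacy", "transformers", "torch"}


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from package import name" statements only.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "spacy.tokens" -> "spacy"
            if root in FORBIDDEN:
                found.add(root)
    return found


main_code = "import re\nfrom rapidfuzz import fuzz\n"
subagent_code = "import spacy\nfrom transformers import pipeline\n"
```

Calling `forbidden_imports` on each main-code file in a test suite would surface an accidental direct NLP import early; subagent-written scripts would be excluded from the check.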

5. Updated Data Flow Architecture Diagram

Before:

Conversation Parser
    ↓
NLP Extraction Pipeline  ──> [SQLite Cache]
    ↓
Web Crawler

After:

Conversation Parser
    ↓
Pattern-based Extraction  ──> Extract: ISIL, Wikidata, URLs
    ↓
SUBAGENT BOUNDARY  ──> Launch subagents for NER
    ↓
Coding Subagents  ──> Subagents use spaCy/transformers
    ↓
Web Crawler

Location: Lines 429-479 (updated diagram)
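
The updated flow can be sketched in code as two stages with an explicit hand-off. This is a hedged illustration only: the Task tool is an agent facility rather than a Python API, so the subagent is represented here by an injected callable. `SubagentRunner`, `ExtractionResult`, `run_extraction`, and `fake_subagent` are hypothetical names, and the two regular expressions are simplified.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type: takes raw text, returns the entities a subagent found.
SubagentRunner = Callable[[str], list[dict]]

# Simplified patterns for the main-code stage (assumptions for this sketch).
WIKIDATA_RE = re.compile(r"\bQ[1-9]\d*\b")
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


@dataclass
class ExtractionResult:
    identifiers: dict[str, list[str]] = field(default_factory=dict)
    entities: list[dict] = field(default_factory=list)


def run_extraction(text: str, run_subagent: SubagentRunner) -> ExtractionResult:
    # Stage 1 (main code): deterministic pattern matching for structured identifiers.
    identifiers = {
        "wikidata": WIKIDATA_RE.findall(text),
        "url": URL_RE.findall(text),
    }
    # Stage 2 (subagent boundary): unstructured NER is delegated to the callable.
    entities = run_subagent(text)
    return ExtractionResult(identifiers=identifiers, entities=entities)


def fake_subagent(text: str) -> list[dict]:
    # Stand-in for a real subagent: returns a canned entity for demonstration.
    if "Example Museum" in text:
        return [{"type": "institution", "name": "Example Museum"}]
    return []


result = run_extraction("Example Museum (Q12345) https://example.org", fake_subagent)
```

Injecting the runner keeps the main code free of NLP imports and lets tests substitute a stub for the subagent.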

6. Added Related Documentation Section

Added references to related documentation at the end of the document:

Subagent Architecture document
Design Patterns
Data Standardization
Dependencies
LinkML Schema
Agent Instructions (AGENTS.md)

Location: End of document (after "Integration Opportunities")

Impact

These changes ensure that:

✅ Architecture documentation accurately reflects the implementation (no spaCy in main code)
✅ Clear guidance for developers on when to use pattern matching vs subagents
✅ Proper references to related documentation
✅ Accurate technology stack listing (no misleading dependencies)

Files Modified

docs/plan/global_glam/02-architecture.md - UPDATED

Next Steps

According to the session resume summary, the next priority tasks are:

HIGH PRIORITY

CSVParser - Parse Dutch ISIL registry and organizations CSV files
LinkMLValidator - Validate HeritageCustodian records against schema

MEDIUM PRIORITY

InstitutionExtractor - Use Task tool to launch subagents for NER

Status

✅ Architecture documentation update COMPLETE

All references to direct spaCy/transformers usage have been removed from the main architecture document and replaced with accurate descriptions of the subagent-based approach.
