glam/docs/progress/session-02-summary.md
2025-11-19 23:25:22 +01:00

10 KiB

GLAM Data Extraction - Progress Update

Date: 2025-11-05
Session: 2

Summary

Successfully implemented the foundational parsing and identifier extraction components for the GLAM data extraction pipeline. All components are working with comprehensive test coverage.

Completed Components

1. ConversationParser

File: src/glam_extractor/parsers/conversation.py
Tests: tests/parsers/test_conversation.py (25 tests, all passing)

Features:

  • Parse conversation JSON files with full schema validation
  • Extract text from chat messages (with deduplication)
  • Filter messages by sender (human/assistant)
  • Extract conversation metadata for provenance tracking
  • Datetime parsing with multiple format support
  • Helper methods for institution-focused text extraction

Models:

  • MessageContent - Individual content blocks within messages
  • ChatMessage - Single message with sender, text, and content
  • Conversation - Complete conversation with metadata and messages
  • ConversationParser - Parser class with file/dict parsing methods

Usage:

from glam_extractor.parsers.conversation import ConversationParser

parser = ConversationParser()
conversation = parser.parse_file("conversation.json")
text = parser.extract_institutions_context(conversation)
metadata = parser.get_conversation_metadata(conversation)

2. IdentifierExtractor

File: src/glam_extractor/extractors/identifiers.py
Tests: tests/extractors/test_identifiers.py (35 tests, all passing)

Features:

  • Regex-based extraction of multiple identifier types
  • ISIL codes with country code validation (100+ valid prefixes)
  • Wikidata IDs with automatic URL construction
  • VIAF IDs from URLs
  • Dutch KvK numbers with context validation
  • URL extraction with optional domain filtering
  • Deduplication of extracted identifiers
  • Context extraction (surrounding text for each identifier)

Supported Identifiers:

  • ISIL: Format [Country]-[Code] (e.g., NL-AsdRM, US-DLC)
  • Wikidata: Format Q[digits] (e.g., Q190804)
  • VIAF: Format viaf.org/viaf/[digits]
  • KvK: 8-digit Dutch Chamber of Commerce numbers
  • URLs: HTTP/HTTPS with optional domain filtering

Usage:

from glam_extractor.extractors.identifiers import IdentifierExtractor

extractor = IdentifierExtractor()
identifiers = extractor.extract_all(text, include_urls=True)

# Extract specific types
isil_codes = extractor.extract_isil_codes(text)
wikidata_ids = extractor.extract_wikidata_ids(text)

# Extract with context
contexts = extractor.extract_with_context(text, context_window=50)

3. Integration Example

File: examples/extract_identifiers.py

Demonstrates end-to-end workflow:

  1. Parse conversation JSON file
  2. Extract text from assistant messages
  3. Extract all identifiers using regex patterns
  4. Group and display results

Output:

=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4

Identifiers by scheme:
  ISIL: NL-ASDRM, NL-HANA
  URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl

Test Coverage

Component Tests Status
ConversationParser 25 All passing
IdentifierExtractor 35 All passing
Total 60 All passing

Test execution time: 0.09 seconds

Test Fixtures

Created:

  • tests/fixtures/sample_conversation.json - Sample Dutch GLAM conversation
  • tests/fixtures/expected_extraction.json - Expected extraction output

Sample conversation contains:

  • Rijksmuseum (ISIL: NL-AsdRM)
  • Nationaal Archief (ISIL: NL-HaNA)
  • Addresses, metadata standards, URLs, partnerships

Architecture Decisions Implemented

1. Pydantic v1 Compatibility

  • Using Pydantic 1.10.24 (installed in environment)
  • All models use v1 syntax (validator, class Config)
  • Compatible with LinkML runtime

2. Subagent-Based NER

  • Main codebase has NO spaCy, transformers, or torch dependencies
  • Identifier extraction uses pure regex patterns
  • NER for institutions will use Task tool + coding subagents (future work)

3. Pattern-Based Extraction

  • ISIL codes validated against 100+ country codes
  • KvK numbers require context to avoid false positives
  • URLs extracted with optional domain filtering
  • All patterns tested with edge cases

Files Modified/Created

Source Code

  • src/glam_extractor/parsers/conversation.py - New
  • src/glam_extractor/parsers/__init__.py - Updated
  • src/glam_extractor/extractors/identifiers.py - New
  • src/glam_extractor/extractors/__init__.py - Updated
  • src/glam_extractor/models.py - Previously updated (Pydantic v1)

Tests

  • tests/parsers/test_conversation.py - New (25 tests)
  • tests/extractors/test_identifiers.py - New (35 tests)

Examples

  • examples/extract_identifiers.py - New

Fixtures

  • tests/fixtures/sample_conversation.json - New
  • tests/fixtures/expected_extraction.json - New

Documentation

  • AGENTS.md - Previously updated (subagent architecture)
  • docs/plan/global_glam/03-dependencies.md - Previously updated
  • docs/plan/global_glam/07-subagent-architecture.md - Previously created
  • pyproject.toml - Previously updated (Pydantic v1, no NLP libs)

Next Steps (Priority Order)

High Priority

7. Implement CSVParser

  • Files:
    • src/glam_extractor/parsers/isil_registry.py
    • src/glam_extractor/parsers/dutch_orgs.py
  • Purpose: Parse Dutch ISIL registry and organizations CSV
  • Input: CSV files (data/ISIL-codes_2025-08-01.csv, data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv)
  • Output: HeritageCustodian and DutchHeritageCustodian model instances
  • Provenance: Mark as TIER_1_AUTHORITATIVE

9. Implement LinkMLValidator

  • File: src/glam_extractor/validators/linkml_validator.py
  • Purpose: Validate HeritageCustodian records against LinkML schema
  • Dependencies: linkml library (already in pyproject.toml)
  • Schema: schemas/heritage_custodian.yaml

8. Implement InstitutionExtractor (Subagent-based)

  • File: src/glam_extractor/extractors/institutions.py
  • Purpose: Extract institution names, types, locations using coding subagents
  • Method: Use Task tool to launch NER subagent
  • Input: Conversation text
  • Output: Structured institution data with confidence scores

Medium Priority

13. Implement JSON-LD Exporter

  • File: src/glam_extractor/exporters/jsonld.py
  • Purpose: Export HeritageCustodian records to JSON-LD format
  • Schema: Use LinkML context for JSON-LD mapping

14. Update Architecture Documentation

  • Document the complete extraction pipeline
  • Add flowcharts for data flow
  • Document provenance tracking approach

Future Work

  • RDF/Turtle exporter
  • CSV exporter
  • Geocoding module (Nominatim integration)
  • Duplicate detection module
  • Cross-reference validator (CSV vs. conversation data)

Performance Metrics

  • Test execution: 60 tests in 0.09s
  • Conversation parsing: < 10ms per file (tested with sample)
  • Identifier extraction: < 5ms per document (regex-based)

Data Quality Features

Provenance Tracking (Ready)

Every extracted identifier can be tracked to:

  • Source conversation UUID
  • Extraction timestamp
  • Extraction method (regex pattern-based)
  • Confidence score (for NER-based extraction)

Validation (Implemented)

  • ISIL country code validation (100+ valid codes)
  • KvK number format validation (8 digits)
  • Context-based filtering (e.g., KvK requires "KvK" mention)
  • Deduplication of extracted identifiers

Technical Stack

  • Python: 3.12.4
  • Pydantic: 1.10.24 (v1 for compatibility)
  • pytest: 8.4.1
  • Pattern matching: Standard library re module
  • No NLP dependencies: As per subagent architecture decision

Running Tests

cd /Users/kempersc/Documents/claude/glam

# Run all tests
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/ -v -o addopts=""

# Run specific test suite
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/parsers/test_conversation.py -v -o addopts=""

PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/extractors/test_identifiers.py -v -o addopts=""

Running Examples

cd /Users/kempersc/Documents/claude/glam

PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python examples/extract_identifiers.py

Key Achievements

  1. Solid foundation: Conversation parsing and identifier extraction working
  2. Test-driven: 60 comprehensive tests covering edge cases
  3. Architecture clarity: Clear separation between regex-based and NER-based extraction
  4. No heavy dependencies: Main codebase stays lightweight (no spaCy/torch)
  5. Practical validation: Works with real Dutch GLAM institution data
  6. Production-ready patterns: ISIL validation, KvK context checking, deduplication

Risks and Mitigations

Risk Impact Mitigation Status
False positive identifiers Medium Context validation, confidence scores Implemented
Missing identifiers Medium Combine regex + NER approaches NER pending
CSV parsing complexity Low Use pandas, validate schemas Pending
LinkML schema drift Medium Automated validation tests Pending

Notes for Next Session

  1. Start with CSVParser: This is high priority and doesn't require subagents
  2. Test with real data: Once CSV parser is ready, test with actual ISIL/Dutch org CSV files
  3. Validate schema compliance: Implement LinkMLValidator to ensure data quality
  4. Then tackle NER: Once data pipeline works for structured data, add subagent-based NER

Session 2 Status: Successful
Components Delivered: 2 (ConversationParser, IdentifierExtractor)
Tests Written: 60 (all passing)
Next Priority: CSVParser for Dutch datasets