10 KiB
GLAM Data Extraction - Progress Update
Date: 2025-11-05
Session: 2
Summary
Successfully implemented the foundational parsing and identifier extraction components for the GLAM data extraction pipeline. All components are working with comprehensive test coverage.
Completed Components
1. ConversationParser ✅
File: src/glam_extractor/parsers/conversation.py
Tests: tests/parsers/test_conversation.py (25 tests, all passing)
Features:
- Parse conversation JSON files with full schema validation
- Extract text from chat messages (with deduplication)
- Filter messages by sender (human/assistant)
- Extract conversation metadata for provenance tracking
- Datetime parsing with multiple format support
- Helper methods for institution-focused text extraction
Models:
MessageContent- Individual content blocks within messagesChatMessage- Single message with sender, text, and contentConversation- Complete conversation with metadata and messagesConversationParser- Parser class with file/dict parsing methods
Usage:
from glam_extractor.parsers.conversation import ConversationParser
parser = ConversationParser()
conversation = parser.parse_file("conversation.json")
text = parser.extract_institutions_context(conversation)
metadata = parser.get_conversation_metadata(conversation)
2. IdentifierExtractor ✅
File: src/glam_extractor/extractors/identifiers.py
Tests: tests/extractors/test_identifiers.py (35 tests, all passing)
Features:
- Regex-based extraction of multiple identifier types
- ISIL codes with country code validation (100+ valid prefixes)
- Wikidata IDs with automatic URL construction
- VIAF IDs from URLs
- Dutch KvK numbers with context validation
- URL extraction with optional domain filtering
- Deduplication of extracted identifiers
- Context extraction (surrounding text for each identifier)
Supported Identifiers:
- ISIL: Format
[Country]-[Code](e.g., NL-AsdRM, US-DLC) - Wikidata: Format
Q[digits](e.g., Q190804) - VIAF: Format
viaf.org/viaf/[digits] - KvK: 8-digit Dutch Chamber of Commerce numbers
- URLs: HTTP/HTTPS with optional domain filtering
Usage:
from glam_extractor.extractors.identifiers import IdentifierExtractor
extractor = IdentifierExtractor()
identifiers = extractor.extract_all(text, include_urls=True)
# Extract specific types
isil_codes = extractor.extract_isil_codes(text)
wikidata_ids = extractor.extract_wikidata_ids(text)
# Extract with context
contexts = extractor.extract_with_context(text, context_window=50)
3. Integration Example ✅
File: examples/extract_identifiers.py
Demonstrates end-to-end workflow:
- Parse conversation JSON file
- Extract text from assistant messages
- Extract all identifiers using regex patterns
- Group and display results
Output:
=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4
Identifiers by scheme:
ISIL: NL-ASDRM, NL-HANA
URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl
Test Coverage
| Component | Tests | Status |
|---|---|---|
| ConversationParser | 25 | ✅ All passing |
| IdentifierExtractor | 35 | ✅ All passing |
| Total | 60 | ✅ All passing |
Test execution time: 0.09 seconds
Test Fixtures
Created:
tests/fixtures/sample_conversation.json- Sample Dutch GLAM conversationtests/fixtures/expected_extraction.json- Expected extraction output
Sample conversation contains:
- Rijksmuseum (ISIL: NL-AsdRM)
- Nationaal Archief (ISIL: NL-HaNA)
- Addresses, metadata standards, URLs, partnerships
Architecture Decisions Implemented
1. Pydantic v1 Compatibility ✅
- Using Pydantic 1.10.24 (installed in environment)
- All models use v1 syntax (
validator,class Config) - Compatible with LinkML runtime
2. Subagent-Based NER ✅
- Main codebase has NO spaCy, transformers, or torch dependencies
- Identifier extraction uses pure regex patterns
- NER for institutions will use Task tool + coding subagents (future work)
3. Pattern-Based Extraction ✅
- ISIL codes validated against 100+ country codes
- KvK numbers require context to avoid false positives
- URLs extracted with optional domain filtering
- All patterns tested with edge cases
Files Modified/Created
Source Code
- ✅
src/glam_extractor/parsers/conversation.py- New - ✅
src/glam_extractor/parsers/__init__.py- Updated - ✅
src/glam_extractor/extractors/identifiers.py- New - ✅
src/glam_extractor/extractors/__init__.py- Updated - ✅
src/glam_extractor/models.py- Previously updated (Pydantic v1)
Tests
- ✅
tests/parsers/test_conversation.py- New (25 tests) - ✅
tests/extractors/test_identifiers.py- New (35 tests)
Examples
- ✅
examples/extract_identifiers.py- New
Fixtures
- ✅
tests/fixtures/sample_conversation.json- New - ✅
tests/fixtures/expected_extraction.json- New
Documentation
- ✅
AGENTS.md- Previously updated (subagent architecture) - ✅
docs/plan/global_glam/03-dependencies.md- Previously updated - ✅
docs/plan/global_glam/07-subagent-architecture.md- Previously created - ✅
pyproject.toml- Previously updated (Pydantic v1, no NLP libs)
Next Steps (Priority Order)
High Priority
7. Implement CSVParser
- Files:
src/glam_extractor/parsers/isil_registry.pysrc/glam_extractor/parsers/dutch_orgs.py
- Purpose: Parse Dutch ISIL registry and organizations CSV
- Input: CSV files (
data/ISIL-codes_2025-08-01.csv,data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv) - Output:
HeritageCustodianandDutchHeritageCustodianmodel instances - Provenance: Mark as TIER_1_AUTHORITATIVE
9. Implement LinkMLValidator
- File:
src/glam_extractor/validators/linkml_validator.py - Purpose: Validate HeritageCustodian records against LinkML schema
- Dependencies:
linkmllibrary (already in pyproject.toml) - Schema:
schemas/heritage_custodian.yaml
8. Implement InstitutionExtractor (Subagent-based)
- File:
src/glam_extractor/extractors/institutions.py - Purpose: Extract institution names, types, locations using coding subagents
- Method: Use Task tool to launch NER subagent
- Input: Conversation text
- Output: Structured institution data with confidence scores
Medium Priority
13. Implement JSON-LD Exporter
- File:
src/glam_extractor/exporters/jsonld.py - Purpose: Export HeritageCustodian records to JSON-LD format
- Schema: Use LinkML context for JSON-LD mapping
14. Update Architecture Documentation
- Document the complete extraction pipeline
- Add flowcharts for data flow
- Document provenance tracking approach
Future Work
- RDF/Turtle exporter
- CSV exporter
- Geocoding module (Nominatim integration)
- Duplicate detection module
- Cross-reference validator (CSV vs. conversation data)
Performance Metrics
- Test execution: 60 tests in 0.09s
- Conversation parsing: < 10ms per file (tested with sample)
- Identifier extraction: < 5ms per document (regex-based)
Data Quality Features
Provenance Tracking (Ready)
Every extracted identifier can be tracked to:
- Source conversation UUID
- Extraction timestamp
- Extraction method (regex pattern-based)
- Confidence score (for NER-based extraction)
Validation (Implemented)
- ISIL country code validation (100+ valid codes)
- KvK number format validation (8 digits)
- Context-based filtering (e.g., KvK requires "KvK" mention)
- Deduplication of extracted identifiers
Technical Stack
- Python: 3.12.4
- Pydantic: 1.10.24 (v1 for compatibility)
- pytest: 8.4.1
- Pattern matching: Standard library
remodule - No NLP dependencies: As per subagent architecture decision
Running Tests
cd /Users/kempersc/Documents/claude/glam
# Run all tests
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
python -m pytest tests/ -v -o addopts=""
# Run specific test suite
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
python -m pytest tests/parsers/test_conversation.py -v -o addopts=""
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
python -m pytest tests/extractors/test_identifiers.py -v -o addopts=""
Running Examples
cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
python examples/extract_identifiers.py
Key Achievements
- ✅ Solid foundation: Conversation parsing and identifier extraction working
- ✅ Test-driven: 60 comprehensive tests covering edge cases
- ✅ Architecture clarity: Clear separation between regex-based and NER-based extraction
- ✅ No heavy dependencies: Main codebase stays lightweight (no spaCy/torch)
- ✅ Practical validation: Works with real Dutch GLAM institution data
- ✅ Production-ready patterns: ISIL validation, KvK context checking, deduplication
Risks and Mitigations
| Risk | Impact | Mitigation | Status |
|---|---|---|---|
| False positive identifiers | Medium | Context validation, confidence scores | ✅ Implemented |
| Missing identifiers | Medium | Combine regex + NER approaches | ⏳ NER pending |
| CSV parsing complexity | Low | Use pandas, validate schemas | ⏳ Pending |
| LinkML schema drift | Medium | Automated validation tests | ⏳ Pending |
Notes for Next Session
- Start with CSVParser: This is high priority and doesn't require subagents
- Test with real data: Once CSV parser is ready, test with actual ISIL/Dutch org CSV files
- Validate schema compliance: Implement LinkMLValidator to ensure data quality
- Then tackle NER: Once data pipeline works for structured data, add subagent-based NER
Session 2 Status: ✅ Successful
Components Delivered: 2 (ConversationParser, IdentifierExtractor)
Tests Written: 60 (all passing)
Next Priority: CSVParser for Dutch datasets