kempersc 3c80de87e0 add isil entries

2025-11-19 23:25:22 +01:00

10 KiB

Raw Blame History

GLAM Data Extraction - Progress Update

Date: 2025-11-05
Session: 2

Summary

Successfully implemented the foundational parsing and identifier extraction components for the GLAM data extraction pipeline. All components are working with comprehensive test coverage.

Completed Components

1. ConversationParser ✅

File: src/glam_extractor/parsers/conversation.py
Tests: tests/parsers/test_conversation.py (25 tests, all passing)

Features:

Parse conversation JSON files with full schema validation
Extract text from chat messages (with deduplication)
Filter messages by sender (human/assistant)
Extract conversation metadata for provenance tracking
Datetime parsing with multiple format support
Helper methods for institution-focused text extraction

Models:

MessageContent - Individual content blocks within messages
ChatMessage - Single message with sender, text, and content
Conversation - Complete conversation with metadata and messages
ConversationParser - Parser class with file/dict parsing methods

Usage:

from glam_extractor.parsers.conversation import ConversationParser

parser = ConversationParser()
conversation = parser.parse_file("conversation.json")
text = parser.extract_institutions_context(conversation)
metadata = parser.get_conversation_metadata(conversation)

2. IdentifierExtractor ✅

File: src/glam_extractor/extractors/identifiers.py
Tests: tests/extractors/test_identifiers.py (35 tests, all passing)

Features:

Regex-based extraction of multiple identifier types
ISIL codes with country code validation (100+ valid prefixes)
Wikidata IDs with automatic URL construction
VIAF IDs from URLs
Dutch KvK numbers with context validation
URL extraction with optional domain filtering
Deduplication of extracted identifiers
Context extraction (surrounding text for each identifier)

Supported Identifiers:

ISIL: Format [Country]-[Code] (e.g., NL-AsdRM, US-DLC)
Wikidata: Format Q[digits] (e.g., Q190804)
VIAF: Format viaf.org/viaf/[digits]
KvK: 8-digit Dutch Chamber of Commerce numbers
URLs: HTTP/HTTPS with optional domain filtering

Usage:

from glam_extractor.extractors.identifiers import IdentifierExtractor

extractor = IdentifierExtractor()
identifiers = extractor.extract_all(text, include_urls=True)

# Extract specific types
isil_codes = extractor.extract_isil_codes(text)
wikidata_ids = extractor.extract_wikidata_ids(text)

# Extract with context
contexts = extractor.extract_with_context(text, context_window=50)

3. Integration Example ✅

File: examples/extract_identifiers.py

Demonstrates end-to-end workflow:

Parse conversation JSON file
Extract text from assistant messages
Extract all identifiers using regex patterns
Group and display results

Output:

=== Conversation: Test Dutch GLAM Institutions ===
Messages: 4
Total identifiers found: 4

Identifiers by scheme:
  ISIL: NL-ASDRM, NL-HANA
  URL: https://www.rijksmuseum.nl/en/rijksstudio, https://www.nationaalarchief.nl

Test Coverage

Component	Tests	Status
ConversationParser	25	✅ All passing
IdentifierExtractor	35	✅ All passing
Total	60	✅ All passing

Test execution time: 0.09 seconds

Test Fixtures

Created:

tests/fixtures/sample_conversation.json - Sample Dutch GLAM conversation
tests/fixtures/expected_extraction.json - Expected extraction output

Sample conversation contains:

Rijksmuseum (ISIL: NL-AsdRM)
Nationaal Archief (ISIL: NL-HaNA)
Addresses, metadata standards, URLs, partnerships

Architecture Decisions Implemented

1. Pydantic v1 Compatibility ✅

Using Pydantic 1.10.24 (installed in environment)
All models use v1 syntax (validator, class Config)
Compatible with LinkML runtime

2. Subagent-Based NER ✅

Main codebase has NO spaCy, transformers, or torch dependencies
Identifier extraction uses pure regex patterns
NER for institutions will use Task tool + coding subagents (future work)

3. Pattern-Based Extraction ✅

ISIL codes validated against 100+ country codes
KvK numbers require context to avoid false positives
URLs extracted with optional domain filtering
All patterns tested with edge cases

Files Modified/Created

Source Code

✅ src/glam_extractor/parsers/conversation.py - New
✅ src/glam_extractor/parsers/__init__.py - Updated
✅ src/glam_extractor/extractors/identifiers.py - New
✅ src/glam_extractor/extractors/__init__.py - Updated
✅ src/glam_extractor/models.py - Previously updated (Pydantic v1)

Tests

✅ tests/parsers/test_conversation.py - New (25 tests)
✅ tests/extractors/test_identifiers.py - New (35 tests)

Examples

✅ examples/extract_identifiers.py - New

Fixtures

✅ tests/fixtures/sample_conversation.json - New
✅ tests/fixtures/expected_extraction.json - New

Documentation

✅ AGENTS.md - Previously updated (subagent architecture)
✅ docs/plan/global_glam/03-dependencies.md - Previously updated
✅ docs/plan/global_glam/07-subagent-architecture.md - Previously created
✅ pyproject.toml - Previously updated (Pydantic v1, no NLP libs)

Next Steps (Priority Order)

High Priority

7. Implement CSVParser

Files:
- src/glam_extractor/parsers/isil_registry.py
- src/glam_extractor/parsers/dutch_orgs.py
Purpose: Parse Dutch ISIL registry and organizations CSV
Input: CSV files (data/ISIL-codes_2025-08-01.csv, data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv)
Output: HeritageCustodian and DutchHeritageCustodian model instances
Provenance: Mark as TIER_1_AUTHORITATIVE

9. Implement LinkMLValidator

File: src/glam_extractor/validators/linkml_validator.py
Purpose: Validate HeritageCustodian records against LinkML schema
Dependencies: linkml library (already in pyproject.toml)
Schema: schemas/heritage_custodian.yaml

8. Implement InstitutionExtractor (Subagent-based)

File: src/glam_extractor/extractors/institutions.py
Purpose: Extract institution names, types, locations using coding subagents
Method: Use Task tool to launch NER subagent
Input: Conversation text
Output: Structured institution data with confidence scores

Medium Priority

13. Implement JSON-LD Exporter

File: src/glam_extractor/exporters/jsonld.py
Purpose: Export HeritageCustodian records to JSON-LD format
Schema: Use LinkML context for JSON-LD mapping

14. Update Architecture Documentation

Document the complete extraction pipeline
Add flowcharts for data flow
Document provenance tracking approach

Future Work

RDF/Turtle exporter
CSV exporter
Geocoding module (Nominatim integration)
Duplicate detection module
Cross-reference validator (CSV vs. conversation data)

Performance Metrics

Test execution: 60 tests in 0.09s
Conversation parsing: < 10ms per file (tested with sample)
Identifier extraction: < 5ms per document (regex-based)

Data Quality Features

Provenance Tracking (Ready)

Every extracted identifier can be tracked to:

Source conversation UUID
Extraction timestamp
Extraction method (regex pattern-based)
Confidence score (for NER-based extraction)

Validation (Implemented)

ISIL country code validation (100+ valid codes)
KvK number format validation (8 digits)
Context-based filtering (e.g., KvK requires "KvK" mention)
Deduplication of extracted identifiers

Technical Stack

Python: 3.12.4
Pydantic: 1.10.24 (v1 for compatibility)
pytest: 8.4.1
Pattern matching: Standard library re module
No NLP dependencies: As per subagent architecture decision

Running Tests

cd /Users/kempersc/Documents/claude/glam

# Run all tests
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/ -v -o addopts=""

# Run specific test suite
PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/parsers/test_conversation.py -v -o addopts=""

PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python -m pytest tests/extractors/test_identifiers.py -v -o addopts=""

Running Examples

cd /Users/kempersc/Documents/claude/glam

PYTHONPATH=/Users/kempersc/Documents/claude/glam/src:$PYTHONPATH \
  python examples/extract_identifiers.py

Key Achievements

✅ Solid foundation: Conversation parsing and identifier extraction working
✅ Test-driven: 60 comprehensive tests covering edge cases
✅ Architecture clarity: Clear separation between regex-based and NER-based extraction
✅ No heavy dependencies: Main codebase stays lightweight (no spaCy/torch)
✅ Practical validation: Works with real Dutch GLAM institution data
✅ Production-ready patterns: ISIL validation, KvK context checking, deduplication

Risks and Mitigations

Risk	Impact	Mitigation	Status
False positive identifiers	Medium	Context validation, confidence scores	✅ Implemented
Missing identifiers	Medium	Combine regex + NER approaches	⏳ NER pending
CSV parsing complexity	Low	Use pandas, validate schemas	⏳ Pending
LinkML schema drift	Medium	Automated validation tests	⏳ Pending

Notes for Next Session

Start with CSVParser: This is high priority and doesn't require subagents
Test with real data: Once CSV parser is ready, test with actual ISIL/Dutch org CSV files
Validate schema compliance: Implement LinkMLValidator to ensure data quality
Then tackle NER: Once data pipeline works for structured data, add subagent-based NER

Session 2 Status: ✅ Successful
Components Delivered: 2 (ConversationParser, IdentifierExtractor)
Tests Written: 60 (all passing)
Next Priority: CSVParser for Dutch datasets

10 KiB Raw Blame History