glam/SESSION-RESUME.md
2025-11-19 23:25:22 +01:00

🎯 Session 2 Summary - Ready for Next Session

What We Accomplished

Components Implemented

  1. ConversationParser - Parse Claude conversation JSON files
  2. IdentifierExtractor - Extract ISIL, Wikidata, VIAF, KvK, URLs using regex

Test Coverage

  • 60 tests, all passing in 0.11 seconds
  • Comprehensive edge case coverage
  • Real-world data validation (Rijksmuseum, Nationaal Archief)

Files Created/Modified

Source Code (4 files):

  • src/glam_extractor/parsers/conversation.py - NEW
  • src/glam_extractor/parsers/__init__.py - UPDATED
  • src/glam_extractor/extractors/identifiers.py - NEW
  • src/glam_extractor/extractors/__init__.py - UPDATED

Tests (2 files):

  • tests/parsers/test_conversation.py - NEW (25 tests)
  • tests/extractors/test_identifiers.py - NEW (35 tests)

Examples (2 files):

  • examples/extract_identifiers.py - NEW
  • examples/README.md - NEW

Documentation (1 file):

  • docs/progress/session-02-summary.md - NEW

Test Fixtures (previously created):

  • tests/fixtures/sample_conversation.json
  • tests/fixtures/expected_extraction.json

🚀 Quick Start for Next Session

Running Tests

cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python -m pytest tests/ -v -o addopts=""

Running Examples

cd /Users/kempersc/Documents/claude/glam
PYTHONPATH=./src:$PYTHONPATH python examples/extract_identifiers.py

📋 Next Priority Tasks

Task 7: CSVParser (HIGH PRIORITY - Start Here!)

Why: Parse authoritative Dutch GLAM data (TIER_1)
Files to create:

  • src/glam_extractor/parsers/isil_registry.py
  • src/glam_extractor/parsers/dutch_orgs.py
  • tests/parsers/test_csv_parsers.py

Data sources:

  • data/ISIL-codes_2025-08-01.csv (~300 Dutch institutions)
  • data/voorbeeld_lijst_organisaties_en_diensten-totaallijst_nederland.csv (40+ columns)

Output: HeritageCustodian and DutchHeritageCustodian model instances

Key points:

  • Use pandas for CSV parsing
  • Map CSV columns to Pydantic models
  • Set provenance: data_source=ISIL_REGISTRY, data_tier=TIER_1_AUTHORITATIVE
  • Handle Dutch-specific fields: KvK, gemeente_code, provincie, etc.
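The column-to-model mapping above can be sketched as follows. The real parser should use pandas as noted; this dependency-free sketch uses the stdlib csv module, and the column names ("ISIL", "Name"), the dataclass stand-in for the Pydantic HeritageCustodian model, and the sample rows are all illustrative assumptions, not the project's actual schema or data:

```python
import csv
import io
from dataclasses import dataclass

# Stand-in for the Pydantic HeritageCustodian model; field names are
# illustrative, not the project's actual schema.
@dataclass
class HeritageCustodian:
    isil: str
    name: str
    data_source: str = "ISIL_REGISTRY"
    data_tier: str = "TIER_1_AUTHORITATIVE"

def parse_isil_csv(text: str) -> list[HeritageCustodian]:
    """Map CSV rows to model instances, setting provenance on each record."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        # Column names ("ISIL", "Name") are assumed; adjust to the real file.
        records.append(HeritageCustodian(isil=row["ISIL"].strip(),
                                         name=row["Name"].strip()))
    return records

sample = "ISIL,Name\nNL-HaNA,Nationaal Archief\nNL-0001,Example Museum\n"
for rec in parse_isil_csv(sample):
    print(rec.isil, rec.name, rec.data_tier)
```

With pandas the loop body becomes a `pd.read_csv(...).iterrows()` pass, but the mapping-plus-provenance shape stays the same.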

Task 9: LinkMLValidator (HIGH PRIORITY)

Why: Validate data quality against schema
File: src/glam_extractor/validators/linkml_validator.py

What to do:

  • Use linkml library (already in dependencies)
  • Validate HeritageCustodian records against schemas/heritage_custodian.yaml
  • Report validation errors with clear messages
  • Test with both valid and invalid data
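A minimal stand-in for the error-reporting shape described above. The real implementation would load schemas/heritage_custodian.yaml through the linkml library; here a plain dict plays the role of the schema, and the slot names and pattern are illustrative assumptions:

```python
import re

# Illustrative schema stand-in; the real one is a LinkML YAML file.
SCHEMA = {
    "required": ["isil", "name"],           # assumed slot names
    "patterns": {"isil": r"^[A-Z]{1,4}-"},  # assumed ISIL prefix constraint
}

def validate_record(record: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of human-readable validation errors (empty = valid)."""
    errors = []
    for field in schema["required"]:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    for field, pattern in schema["patterns"].items():
        value = record.get(field)
        if value and not re.match(pattern, value):
            errors.append(f"{field} {value!r} does not match {pattern!r}")
    return errors

print(validate_record({"isil": "NL-HaNA", "name": "Nationaal Archief"}))  # []
print(validate_record({"isil": "hana"}))  # two clear error messages
```

Whatever backend does the checking, returning a flat list of plain-language messages keeps "report validation errors with clear messages" and "test with both valid and invalid data" easy.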

Task 8: InstitutionExtractor (MEDIUM PRIORITY)

Why: Extract institution names using NER
File: src/glam_extractor/extractors/institutions.py

Key approach:

  • Use Task tool to launch coding subagent
  • Subagent runs spaCy/transformers for NER
  • Main code stays lightweight (no NLP dependencies)
  • Return structured data with confidence scores
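The "lightweight main code" side of this design can be sketched as below: the NER itself runs in the subagent, and the main process only parses the subagent's JSON reply into confidence-scored results. The reply format, class name, and threshold are assumptions, not a fixed contract:

```python
import json
from dataclasses import dataclass

# Hypothetical structure for one NER hit returned by the subagent.
@dataclass
class InstitutionMention:
    name: str
    confidence: float
    source_message: int

def parse_subagent_reply(reply: str, min_confidence: float = 0.5) -> list[InstitutionMention]:
    """Parse the subagent's JSON reply and drop low-confidence mentions."""
    mentions = [InstitutionMention(**item) for item in json.loads(reply)]
    return [m for m in mentions if m.confidence >= min_confidence]

reply = json.dumps([
    {"name": "Rijksmuseum", "confidence": 0.97, "source_message": 3},
    {"name": "museum", "confidence": 0.21, "source_message": 7},  # filtered out
])
print([m.name for m in parse_subagent_reply(reply)])  # ['Rijksmuseum']
```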

🔑 Key Architecture Points

  1. Pydantic v1: Using v1.10.24 for compatibility
  2. No NLP in main code: Use Task tool + subagents for NER
  3. Pattern matching: ISIL, Wikidata, VIAF, KvK via regex
  4. Provenance tracking: Every record tracks source, method, confidence
  5. Data tiers: TIER_1 (CSV) > TIER_4 (conversation NLP)

📊 Current Status

| Component            | Status | Tests | Priority  |
| -------------------- | ------ | ----- | --------- |
| Models (Pydantic v1) | Done   | N/A   | -         |
| ConversationParser   | Done   | 25/25 | -         |
| IdentifierExtractor  | Done   | 35/35 | -         |
| CSVParser            | TODO   | -     | 🔴 HIGH   |
| LinkMLValidator      | TODO   | -     | 🔴 HIGH   |
| InstitutionExtractor | TODO   | -     | 🟡 MEDIUM |
| JSON-LD Exporter     | TODO   | -     | 🟡 MEDIUM |

🎓 What You Learned

ConversationParser:

  • Parse JSON with Pydantic validation
  • Handle datetime parsing with multiple formats
  • Extract and deduplicate text content
  • Filter by sender (human/assistant)
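The "multiple formats" point above boils down to trying each known timestamp layout in turn. A minimal sketch, with an assumed (not the parser's actual) format list:

```python
from datetime import datetime

# Illustrative format list; extend with whatever layouts the export uses.
FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S")

def parse_timestamp(raw: str) -> datetime:
    raw = raw.replace("Z", "+00:00")  # normalize the Zulu suffix for %z
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {raw!r}")

print(parse_timestamp("2025-11-19T23:25:22Z").tzinfo)  # UTC
```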

IdentifierExtractor:

  • Regex pattern matching for identifiers
  • Country code validation (ISIL)
  • Context-based filtering (KvK)
  • URL extraction with domain filtering
  • Deduplication strategies
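These techniques combine into a pattern table plus an order-preserving dedup pass. The regexes below are simplified illustrations, not the extractor's exact patterns (real ISIL and KvK rules have more edge cases), and the sample values are made up:

```python
import re

# Simplified patterns for illustration only; the real extractor is stricter.
PATTERNS = {
    "wikidata": re.compile(r"\bQ\d+\b"),
    "isil": re.compile(r"\b[A-Z]{1,4}-[A-Za-z0-9/:\-]+\b"),
    "kvk": re.compile(r"\bKvK[:\s]*(\d{8})\b", re.IGNORECASE),  # context-based
}

def extract_identifiers(text: str) -> dict[str, list[str]]:
    found = {}
    for kind, pattern in PATTERNS.items():
        # dict.fromkeys deduplicates while preserving first-seen order
        found[kind] = list(dict.fromkeys(pattern.findall(text)))
    return found

text = "NL-HaNA appears with Q42, again Q42; KvK: 12345678 (sample values)"
print(extract_identifiers(text))
```

Requiring the literal "KvK" context before the eight digits is what keeps arbitrary 8-digit numbers from being misread as Chamber of Commerce numbers.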

Integration:

  • Combining multiple extractors in a pipeline
  • Pydantic models for data validation
  • Test-driven development with pytest
  • Fixture-based testing
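The pipeline idea can be sketched as: each extractor is a callable from text to a dict of findings, and the pipeline merges them per message. The extractor names, signatures, and regexes here are illustrative, not the project's API:

```python
import re

def run_pipeline(messages: list[str], extractors) -> list[dict]:
    """Apply every extractor to every message and merge their findings."""
    results = []
    for i, text in enumerate(messages):
        findings = {}
        for extractor in extractors:
            findings.update(extractor(text))
        results.append({"message": i, "findings": findings})
    return results

# Two toy extractors standing in for IdentifierExtractor methods.
def find_wikidata(text: str) -> dict:
    return {"wikidata": re.findall(r"\bQ\d+\b", text)}

def find_urls(text: str) -> dict:
    return {"urls": re.findall(r"https?://\S+", text)}

out = run_pipeline(["See Q42 at https://example.org"], [find_wikidata, find_urls])
print(out[0]["findings"])
```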

🐛 Known Issues

None! All 60 tests passing.

🎯 Success Metrics This Session

  • 2 major components implemented
  • 60 comprehensive tests (100% passing)
  • 0.11s test execution time
  • Real-world data validation
  • Working integration example
  • Clear documentation

💡 Tips for Next Session

  1. Start with CSVParser: High priority, no subagents needed, validates with real data
  2. Check CSV files exist: Verify data files are at expected paths
  3. Test incrementally: Parse one CSV first, then the other
  4. Use pandas: Makes CSV parsing much easier
  5. Validate early: Once CSVParser works, implement LinkMLValidator immediately

📂 Project Structure (Completed Parts)

glam/
├── src/glam_extractor/
│   ├── models.py                    ✅ (Pydantic v1 models)
│   ├── parsers/
│   │   ├── __init__.py              ✅
│   │   └── conversation.py          ✅ NEW
│   └── extractors/
│       ├── __init__.py              ✅
│       └── identifiers.py           ✅ NEW
├── tests/
│   ├── fixtures/
│   │   ├── sample_conversation.json ✅
│   │   └── expected_extraction.json ✅
│   ├── parsers/
│   │   └── test_conversation.py     ✅ NEW (25 tests)
│   └── extractors/
│       └── test_identifiers.py      ✅ NEW (35 tests)
├── examples/
│   ├── README.md                    ✅ NEW
│   └── extract_identifiers.py       ✅ NEW
└── docs/
    └── progress/
        └── session-02-summary.md    ✅ NEW

Ready to continue! Start with CSVParser next session. 🚀