# V5 Extraction - Quick Reference
## Status: ✅ 75% PRECISION ACHIEVED
### Architecture
```
Conversation Text → Subagent NER → V5 Validation → Clean Institutions (75% precision)
```
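As a rough sketch of how the two stages compose (every name below is an illustrative assumption, not the project's actual API, and the subagent call is stubbed so the example runs):

```python
def subagent_ner(text: str) -> list[dict]:
    """Stand-in for the subagent NER stage; the real version prompts an LLM
    and parses its JSON reply."""
    # Hard-coded candidates so the sketch is self-contained.
    return [
        {"name": "Van Abbemuseum", "country": "NL", "confidence": 0.95},
        {"name": "Europeana Network", "country": "", "confidence": 0.40},
    ]

def passes_v5_filters(candidate: dict) -> bool:
    """Stand-in for the three V5 filters (country, organization, proper name)."""
    return bool(candidate.get("country")) and candidate["confidence"] >= 0.5

def extract_institutions(text: str) -> list[dict]:
    """Conversation text -> subagent NER -> V5 validation -> clean institutions."""
    return [c for c in subagent_ner(text) if passes_v5_filters(c)]

print([c["name"] for c in extract_institutions("…conversation text…")])  # ['Van Abbemuseum']
```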
### What Works
- ✅ **Subagent NER**: Clean, accurate names (no mangling)
- ✅ **V5 Validation**: 3 filters (country, organization, proper name)
- ✅ **75% precision**: 3/4 correct (up from V4's 50%)
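The document does not spell out the filter logic, but the three V5 checks might be sketched like this (the country whitelist, keyword set, and all function names are illustrative assumptions, not the project's actual code):

```python
ISO_COUNTRIES = {"NL", "DE", "FR", "GB", "BE"}  # illustrative subset of ISO 3166-1 alpha-2
ORG_KEYWORDS = {"network", "association", "consortium", "foundation"}

def has_valid_country(c: dict) -> bool:
    """Filter 1: country must be a known 2-letter ISO code."""
    return c.get("country") in ISO_COUNTRIES

def is_not_organization(c: dict) -> bool:
    """Filter 2: exclude umbrella organizations and networks."""
    name = c.get("name", "").lower()
    return not any(k in name for k in ORG_KEYWORDS)

def is_proper_name(c: dict) -> bool:
    """Filter 3: require a capitalized proper name, not a generic descriptor."""
    name = c.get("name", "")
    return bool(name) and name[0].isupper()

def passes_v5(c: dict) -> bool:
    """A candidate must clear all three filters to count as a clean institution."""
    return has_valid_country(c) and is_not_organization(c) and is_proper_name(c)
```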
### What Doesn't Work
- ❌ **Pattern-based extraction**: 0% precision (names mangled)
### Commands
**Run V5 demonstration:**
```bash
bash /Users/kempersc/apps/glam/scripts/demo_v5_success.sh
```
**Test subagent + V5 integration:**
```bash
python /Users/kempersc/apps/glam/scripts/test_subagent_v5_integration.py
```
### Subagent NER Prompt Template
```
Extract ALL heritage institutions from the following text.

Return JSON array with:

{
  "name": "Full institution name",
  "institution_type": "MUSEUM | ARCHIVE | LIBRARY | GALLERY",
  "city": "City name",
  "country": "2-letter ISO code",
  "isil_code": "ISIL code if mentioned",
  "confidence": 0.0-1.0
}

Rules:
1. Preserve full names (e.g., "Van Abbemuseum", not "Abbemuseum")
2. Classify by primary function
3. Determine country from city names or context
4. Exclude: organizations, networks, generic descriptors
```
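Whatever mechanism invokes the subagent, its reply should be parsed and sanity-checked before V5 validation; a minimal sketch (the reply string below is a fabricated example of the expected shape, not real subagent output):

```python
import json

# Fabricated subagent reply matching the template's schema.
raw_reply = """[
  {"name": "Van Abbemuseum", "institution_type": "MUSEUM",
   "city": "Eindhoven", "country": "NL", "isil_code": null,
   "confidence": 0.95}
]"""

institutions = json.loads(raw_reply)
for inst in institutions:
    # Basic schema checks before handing candidates to the V5 filters.
    assert inst["name"], "empty institution name"
    assert inst["institution_type"] in {"MUSEUM", "ARCHIVE", "LIBRARY", "GALLERY"}
    assert 0.0 <= inst["confidence"] <= 1.0

print([i["name"] for i in institutions])  # ['Van Abbemuseum']
```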
### Next Steps for Production
1. Implement `extract_from_text_subagent()` in `InstitutionExtractor`
2. Update batch extraction scripts
3. Process 139 conversation files
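Step 1 might take roughly this shape (a sketch only; the real `InstitutionExtractor` internals are not shown in this document, and the subagent call and filter helpers are stubbed assumptions):

```python
import json

class InstitutionExtractor:
    """Sketch of the planned V5 entry point; existing extractor methods omitted."""

    def _call_subagent(self, text: str) -> str:
        # Stand-in for prompting the NER subagent with the template above.
        return '[{"name": "Van Abbemuseum", "country": "NL", "confidence": 0.9}]'

    def _passes_v5(self, c: dict) -> bool:
        # Stand-in for the country / organization / proper-name filters.
        return bool(c.get("country")) and c.get("confidence", 0.0) >= 0.5

    def extract_from_text_subagent(self, text: str) -> list[dict]:
        """Planned V5 path: subagent NER, then V5 validation."""
        candidates = json.loads(self._call_subagent(text))
        return [c for c in candidates if self._passes_v5(c)]
```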
### Files
- **Documentation**: `output/V5_VALIDATION_SUMMARY.md`
- **Session Summary**: `SESSION_SUMMARY_V5.md`
- **Test Scripts**: `scripts/test_subagent_v5_integration.py`
- **Demo**: `scripts/demo_v5_success.sh`

---
**Result:** V5 achieves 75% precision via subagent NER plus validation filters.