# Session Summary: Dutch Institution Extraction Validation
**Date**: November 7, 2025
**Session Goal**: Validate extraction quality against authoritative ISIL registry (Priority 1 from last session)

---
## What We Accomplished ✅
### 1. Created Validation Script
**File**: `scripts/validate_dutch_extraction.py`
**Features**:
- Loads extracted NL institutions from batch extraction CSV (58 institutions)
- Loads ISIL registry ground truth (365 authoritative institutions)
- Cross-links by ISIL code (exact matching)
- Fuzzy name matching using `rapidfuzz` (≥85% similarity threshold)
- Calculates precision, recall, and F1 score
- Identifies false positives and false negatives
- Generates comprehensive validation report
**Technology**:
- Uses regex parsing (reused pattern from `isil_registry.py`)
- Fuzzy matching with 3 strategies: ratio, partial_ratio, token_sort_ratio
- Name normalization (remove common Dutch terms, punctuation)
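
The normalization and multi-strategy matching listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/validate_dutch_extraction.py`: it swaps in the standard library's `difflib.SequenceMatcher` for `rapidfuzz`, covers only a ratio-style and a token-sort-style score (no partial ratio), and the `STOP_TERMS` list and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop-term list; the real script's list is not shown in this summary.
STOP_TERMS = {"het", "de", "stichting"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop common Dutch terms."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in STOP_TERMS)

def ratio(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a ratio scorer)."""
    if not a or not b:
        return 0.0  # avoid treating two empty strings as a perfect match
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a, b):
    """Compare names after sorting their tokens, so word order is ignored."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def best_score(extracted, registry):
    """Best score across the available strategies, on normalized names."""
    a, b = normalize(extracted), normalize(registry)
    return max(ratio(a, b), token_sort_ratio(a, b))

def is_match(extracted, registry, threshold=85.0):
    """Apply the similarity threshold (85 by default, as in the report)."""
    return best_score(extracted, registry) >= threshold

print(is_match("Scheepvaartmuseum", "Het Scheepvaartmuseum"))
print(is_match("Library of Congress", "Rijksmuseum"))
```

Taking the maximum over several strategies raises recall of the matcher itself, but it also makes spurious 100% scores more likely once aggressive normalization has stripped most of a name, so the stop-term list deserves review.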
### 2. Ran Validation Analysis
**Command**:
```bash
python scripts/validate_dutch_extraction.py > output/dutch_validation_report.txt
```
**Output Files**:
1. `output/dutch_validation_report.txt` - Full validation report (console output)
2. `output/DUTCH_VALIDATION_ANALYSIS.md` - Detailed analysis document
---
## Key Findings
### Quality Metrics 📊
| Metric | Value | Grade |
|--------|-------|-------|
| **Precision** | 27.6% | ❌ Poor |
| **Recall** | 4.4% | ❌ Very Low |
| **F1 Score** | 7.6% | ❌ Failing |

**Interpretation**:
- **Precision 27.6%**: Only about 1 in 4 extracted NL institutions matched the registry; the other 72.4% are false positives
- **Recall 4.4%**: Found only 16 out of 365 known Dutch institutions (95.6% false negative rate)
- **F1 Score 7.6%**: Overall extraction quality is very low
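
As a quick check, the three metrics can be recomputed from the raw counts reported in this summary (16 matches, 58 extracted, 365 registry entries). The function name below is illustrative and is not taken from the validation script.

```python
def precision_recall_f1(true_positives, extracted, registry):
    """Compute precision, recall, and F1 from raw counts."""
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / registry if registry else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from this report: 16 matches, 58 extracted, 365 in the ISIL registry.
p, r, f1 = precision_recall_f1(16, 58, 365)
print(f"Precision {p:.1%}, Recall {r:.1%}, F1 {f1:.1%}")
# → Precision 27.6%, Recall 4.4%, F1 7.6%
```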
### Matching Results 🎯
**Total Extracted (country=NL)**: 58 institutions
**Total ISIL Registry**: 365 institutions
**Matches**:
- ✅ **ISIL code matches**: 1 institution (1.7% of extracted)
- ⚠️ **Fuzzy name matches**: 15 institutions (25.9% of extracted)
- ✅ **Total correct**: 16 institutions (27.6% precision)
**Errors**:
- ❌ **False positives**: 42 institutions (72.4% of extracted)
- ❌ **False negatives**: 349 institutions (95.6% of registry missed)
### Top Validated Institutions ✅
These 16 institutions matched the ISIL registry (high confidence):
| Extracted Name | Registry Name | ISIL Code | Match Type |
|----------------|---------------|-----------|------------|
| Gogh Museum | Van Gogh Museum | NL-AsdVGM | ISIL (exact) |
| Van Gogh Museum | Van Gogh Museum | NL-AsdVGM | Fuzzy 100% |
| Verzetsmuseum | Verzetsmuseum Amsterdam | NL-AsdVMA | Fuzzy 100% |
| Rijksmuseum | Rijksmuseum | NL-AsdRM | Fuzzy 100% |
| Scheepvaartmuseum | Het Scheepvaartmuseum | NL-AsdHSM | Fuzzy 100% |
| Major Amsterdam Museum | Amsterdam Museum | NL-AsdAM | Fuzzy 100% |
| Groninger Museum | Groninger Archieven | NL-GnGRA | Fuzzy 100% |
| Fotoarchief | Twents Fotoarchief | NL-OdzTFA | Fuzzy 100% |
| ... | ... | ... | ... |
---
## False Positive Analysis (42 institutions) ❌
**Definition**: Extracted as `country=NL` but NOT in authoritative ISIL registry
### Categories
#### 1. **Sentence Fragments** (19 instances, 45%)
Examples:
- "for Museum"
- "for Archives"
- "Archivees of the"
- "Archivees and the"
- "Library, Archive"
- "Museum Connections:"
- "Galleries, Libraries, Archive"
**Root Cause**: NER extracting incomplete sentences, list headers, or markdown artifacts
#### 2. **Generic/Vague Names** (13 instances, 31%)
Examples:
- "Dutch Museum"
- "Dutch Archive"
- "Dutch National Archive"
- "General Pattern: Most Dutch Museum"
- "Latest Museum"
- "Core Museum"
- "Maritime Museum"
**Root Cause**: Extracting descriptive text instead of proper institution names
#### 3. **Non-Dutch Institutions Misclassified** (6 instances, 14%)
Examples:
- "Library of Congress" (US, not NL)
- "Linnaeus University" (Sweden)
- "International Islamic University" (Malaysia)
- "University Malaysia"
**Root Cause**: Country code assignment errors in multi-country conversations
#### 4. **Legitimate But Not in ISIL Registry** (4 instances, 10%)
Examples:
- "Stedelijk Museum Amsterdam"
- "Van Abbemuseum"
- "Koninklijke Library"
- "University of Groningen"
**Root Cause**: These may be real institutions but lack ISIL codes (not yet assigned or not eligible)
**Note**: Some of these could be true positives that need manual verification.

---
## False Negative Analysis (349 institutions) ❌
**Definition**: In ISIL registry but NOT extracted from conversations
### High-Value Missing Institutions
Major institutions that should have been found:
| Institution | ISIL Code | City | Type |
|-------------|-----------|------|------|
| Anne Frank Stichting | NL-AsdAFS | Amsterdam | Museum |
| Nationaal Archief | NL-HaNA | The Hague | Archive |
| Koninklijke Bibliotheek | NL-HaKB | The Hague | Library |
| NIOD | NL-AsdNIOD | Amsterdam | Research |
| Stadsarchief Amsterdam | NL-AsdSAA | Amsterdam | Archive |
| Regionaal Archief Alkmaar | NL-AmrRAA | Alkmaar | Archive |
| Drents Archief | NL-AsnDA | Assen | Archive |
### Why So Many False Negatives?
**Root Causes**:
1. **Conversation Coverage Bias** (PRIMARY ISSUE)
   - Conversations focus on **global/international GLAM** (60+ countries)
   - Only ~5-10 conversations (out of 453) focus on the Netherlands
   - Dutch institutions are mentioned **incidentally**, not systematically
   - Most Dutch conversations discuss metadata standards, not specific institutions
2. **Generic Name Filtering** (Intended Behavior)
   - Quality filters intentionally remove "National Archive", "Library"
   - Some legitimate institutions are filtered as a side effect (e.g., "Nationaal Archief")
3. **Regional Institution Underrepresentation**
   - Conversations discuss major museums (Rijksmuseum, Van Gogh)
   - They skip regional archives, city libraries, and specialized collections
   - The ISIL registry's 365 institutions are spread across cities throughout the Netherlands
4. **NER Model Limitations**
   - General NER models, not heritage-specific
   - May miss Dutch compound words ("Streekarchief Rijnlands Midden")
---
## Complete Extraction List (58 extracted NL institutions)
✅ = Matched registry | ❌ = False positive
```
1. ❌ Dutch Museum | conf: 0.7
2. ✅ Museumm and Rijksmuseum → Rijksmuseum [NL-AsdRM] | conf: 0.7
3. ✅ Van Gogh Museum → Van Gogh Museum [NL-AsdVGM] | conf: 0.7
4. ❌ Resistance Museum | conf: 0.7
5. ❌ for Museum | conf: 0.7
6. ✅ Verzetsmuseum → Verzetsmuseum Amsterdam [NL-AsdVMA] | conf: 0.5
7. ✅ Gogh Museum → Van Gogh Museum [NL-AsdVGM] (ISIL match) | conf: 1.0
8. ❌ General Pattern: Most Dutch Museum | conf: 0.7
9. ✅ Major Amsterdam Museum → Amsterdam Museum [NL-AsdAM] | conf: 0.7
10. ✅ Rijksmuseum → Rijksmuseum van Oudheden [NL-LdnRMO] | conf: 0.5
11. ✅ Scheepvaartmuseum → Het Scheepvaartmuseum [NL-AsdHSM] | conf: 0.5
12. ❌ Maritime Museum | conf: 0.7
13. ❌ Linnaeus University (Sweden, not NL!) | conf: 0.7
14. ❌ Archivees of the | conf: 0.7
15. ❌ for Archives | conf: 0.8
16. ❌ Dutch Archive | conf: 0.8
17. ❌ Archivees and the | conf: 0.8
18. ❌ HMML, Library | conf: 0.7
19. ❌ KB National Library | conf: 0.7
20. ❌ Library of Congress (US, not NL!) | conf: 0.7
21. ❌ Dutch National Archive | conf: 0.7
22. ❌ Latest Museum | conf: 0.7
23. ❌ Library, Archive | conf: 0.7
24. ❌ LIDO/CIDOC-CRM Museum | conf: 0.7
25. ❌ Core Museum | conf: 0.7
26. ❌ Museum Connections: | conf: 0.7
27. ❌ Stedelijk Museum (likely real, but no ISIL match) | conf: 0.7
28. ❌ Museum Bureau Amsterdam | conf: 0.7
29. ❌ Stedelijk Museum Amsterdam and Stedelijk Museum | conf: 0.7
30. ✅ Museum Amsterdam → Stadsarchief Amsterdam [NL-AsdSAA] | conf: 0.7
31. ❌ Koninklijke Library (likely KB, but no match) | conf: 0.7
32. ❌ Libraries, Archive | conf: 0.8
33. ❌ Libraries, Archives, and Museum | conf: 0.8
34. ❌ Corporate Archives | conf: 0.8
35. ❌ Religious Archives | conf: 0.8
36. ❌ Family Archives | conf: 0.8
37. ❌ Archives Limburg | conf: 0.8
38. ❌ for Research Institute | conf: 0.8
39. ❌ Van Abbemuseum and Het Noordbrabants Museum | conf: 0.8
40. ❌ Abbemuseum | conf: 0.6
41. ❌ UNESCO-recognized Archives | conf: 0.8
42. ❌ Galleries, Libraries, Archive | conf: 0.8
43. ❌ University of Groningen | conf: 0.7
44. ✅ Groninger Museum → Groninger Archieven [NL-GnGRA] | conf: 0.7
45. ✅ University Museum → Wageningen University [NL-WgWUR] | conf: 0.7
46. ❌ for Archive | conf: 0.8
47. ❌ Library FabLab | conf: 0.8
48. ✅ Fotoarchief → Twents Fotoarchief [NL-OdzTFA] | conf: 0.6
49. ❌ Archive Net | conf: 0.8
50. ❌ Frisian Archives | conf: 0.8
51. ❌ Fries Archive | conf: 0.8
52. ❌ Natuurmuseum | conf: 0.6
53. ❌ Purchasing System), Noord Veluws Archive | conf: 0.7
54. ❌ for Noord-Hollands Archive | conf: 0.8
55. ❌ IFLA Library | conf: 0.8
56. ❌ Sociology and Anthropology International Islamic University | conf: 0.8
57. ❌ University Malaysia (Malaysia, not NL!) | conf: 0.8
58. ❌ Studies/Southeast Asian Studies) Leiden University | conf: 0.8
```
**Summary**:
- ✅ **16 correct matches** (27.6%)
- ❌ **42 false positives** (72.4%)
---
## Recommendations (Prioritized)
### 🔴 **Immediate Actions** (This Week)
#### 1. **Strengthen Quality Filters**
Add filters to `scripts/batch_extract_institutions.py`:
```python
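import re

# Block generic patterns (matched case-insensitively against the stripped name)
generic_dutch_patterns = [
    r'^dutch (museum|archive|library)$',
    r'^for (museum|archive|library|archives?)$',
    r'^(museum|archive|library) amsterdam$',
    r'^major .* museum$',
    r'^general pattern:',
    r'^latest museum$',
    r'^core museum$',
    r'^.*museum connections:.*$',
    r'^galleries,? libraries,? archives?$',
    r'^libraries,? archives?$',
    r'^corporate archives?$',
    r'^religious archives?$',
    r'^family archives?$',
]

# Block sentence fragments (strengthen existing)
fragment_patterns = [
    r'^(for|of|and|the|a|an)\s',  # Already exists
    r':\s*$',                     # Ends with colon
    r'^\(',                       # Starts with parenthesis
]

def rejection_reason(name, country, city):
    """Return a rejection reason, or None to keep the candidate.

    Hypothetical helper: wire the result into the script's own reject logic.
    """
    text = name.strip()
    if any(re.search(p, text, re.IGNORECASE) for p in generic_dutch_patterns):
        return "Generic name"
    if any(re.search(p, text, re.IGNORECASE) for p in fragment_patterns):
        return "Sentence fragment"
    # For country=NL, require city name
    if country == 'NL' and not city:
        return "No city for Dutch institution"
    return None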
```
**Expected Impact**: Reduce false positives from 42 → 15 (64% reduction)
#### 2. **Fix Country Code Assignment**
Current logic is unreliable. Implement stricter validation:
```python
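def validate_dutch_candidate(name, country, city, dutch_cities):
    """Return (country, rejection_reason); hypothetical helper name.

    `dutch_cities` is the set of city names loaded from the ISIL registry.
    """
    reason = None
    # Validate Dutch institutions
    if country == 'NL':
        # Check if city is a known Dutch city
        if city and city not in dutch_cities:
            country = 'UNKNOWN'
        # Require minimum name quality
        if len(name.split()) < 2:  # Single-word names
            reason = "Single-word name for NL institution"
    return country, reason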
```
**Expected Impact**: Remove 6 misclassified institutions (Linnaeus Univ, Library of Congress, etc.)
#### 3. **Enhance ISIL Extraction**
Only 1/58 institutions had ISIL codes. Improve patterns:
```python
isil_patterns = [
    r'ISIL[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\b([A-Z]{2}-[A-Za-z]{2,}[A-Z0-9]{2,})\b',  # Standalone NL-AsdRM
    r'code[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\(([A-Z]{2}-[A-Za-z0-9]+)\)',
]
```
**Expected Impact**: Increase ISIL extraction from 1.7% → 10-15%
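
One way these patterns might be applied to conversation text is sketched below. The function name `extract_isil_codes` and the sample sentence are made up for illustration; only the patterns themselves come from the list above, and the actual integration point in the extractor is not shown in this summary.

```python
import re

ISIL_PATTERNS = [
    r'ISIL[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\b([A-Z]{2}-[A-Za-z]{2,}[A-Z0-9]{2,})\b',  # Standalone NL-AsdRM
    r'code[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\(([A-Z]{2}-[A-Za-z0-9]+)\)',
]

def extract_isil_codes(text):
    """Return unique ISIL-like codes, ordered by first position in the text."""
    first_seen = {}
    for pattern in ISIL_PATTERNS:
        for match in re.finditer(pattern, text):
            code, pos = match.group(1), match.start(1)
            # Several patterns can hit the same code; keep the earliest position.
            if code not in first_seen or pos < first_seen[code]:
                first_seen[code] = pos
    return sorted(first_seen, key=first_seen.get)

# Hypothetical sample sentence using codes that appear in this report.
sample = "The Rijksmuseum (NL-AsdRM) and Nationaal Archief, ISIL: NL-HaNA."
print(extract_isil_codes(sample))
# → ['NL-AsdRM', 'NL-HaNA']
```

Because the patterns only check the shape of a code, any hit is still worth confirming against the registry before it is treated as an ISIL match.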
### 🟡 **Medium-Term Actions** (Next Month)
#### 4. **Use Dedicated Dutch Conversation Files**
- Identify conversations specifically about Dutch GLAM
- Create separate extraction pipeline with stricter filters
- Cross-check against ISIL registry at extraction time
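
A cross-check at extraction time could be as simple as an exact lookup by ISIL code. The sketch below assumes registry rows and candidates are plain dictionaries with `isil` and `name` keys; those field names and both function names are hypothetical.

```python
def build_registry_index(registry_rows):
    """Map ISIL code -> registry name for constant-time lookups."""
    return {row["isil"]: row["name"] for row in registry_rows}

def cross_check(candidate, index):
    """Annotate a candidate with its registry name if its ISIL code is known."""
    code = candidate.get("isil")
    if code and code in index:
        return {**candidate, "registry_name": index[code], "verified": True}
    return {**candidate, "registry_name": None, "verified": False}

# Two registry entries taken from the tables in this report.
index = build_registry_index([
    {"isil": "NL-AsdRM", "name": "Rijksmuseum"},
    {"isil": "NL-HaNA", "name": "Nationaal Archief"},
])
print(cross_check({"name": "Rijksmuseum", "isil": "NL-AsdRM"}, index))
print(cross_check({"name": "Dutch Museum", "isil": None}, index))
```

Candidates without an ISIL code fall through as unverified here; a name-based fallback would be needed to catch them.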
#### 5. **Enrich with Web Scraping**
For 349 missing ISIL registry institutions:
- Use `crawl4ai` to scrape institutional websites
- Upgrade data tier from TIER_4 → TIER_2
- Complement conversation extraction
#### 6. **Analyze Confidence Score Distribution**
- Plot confidence for matched vs. unmatched institutions
- Determine optimal threshold (likely 0.85-0.9)
- Currently using 0.5+ (too permissive)
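
Before plotting, a plain threshold sweep can already show how a cut-off would behave. The sketch below uses eight rows copied from the extraction list in this summary, not the full data set, and the `sweep` helper is an assumption rather than existing project code.

```python
# (confidence, matched_registry) pairs copied from the list in this summary.
rows = [
    (0.7, False),  # Dutch Museum
    (0.7, True),   # Van Gogh Museum
    (0.5, True),   # Verzetsmuseum
    (1.0, True),   # Gogh Museum (ISIL match)
    (0.8, False),  # for Archives
    (0.8, False),  # Corporate Archives
    (0.6, True),   # Fotoarchief
    (0.6, False),  # Natuurmuseum
]

def sweep(rows, thresholds):
    """For each threshold, return (threshold, rows kept, precision of kept rows)."""
    results = []
    for t in thresholds:
        kept = [matched for conf, matched in rows if conf >= t]
        precision = sum(kept) / len(kept) if kept else None
        results.append((t, len(kept), precision))
    return results

for t, kept, precision in sweep(rows, [0.5, 0.7, 0.85]):
    print(f"threshold={t}: kept={kept}, precision={precision}")
```

On this small sample a threshold of 0.85 keeps a single row, so any candidate threshold should be checked against how many institutions it would leave.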
### 🟢 **Long-Term Actions** (3-6 Months)
#### 7. **Build Dutch-Specific NER Model**
Train on:
- ISIL registry (365 institutions)
- Dutch organizations CSV (1,351 institutions)
- Annotated conversation excerpts
#### 8. **Integrate with External APIs**
- Collections Netherlands
- Wikidata SPARQL
- OpenStreetMap validation
---
## Conclusion
### Current Assessment
The Dutch institution extraction from conversations achieves **poor quality (27.6% precision, 4.4% recall)**. Primary issues:
1. **Conversations are NOT institution catalogs** - they discuss metadata standards rather than listing institutions
2. **Quality filters insufficient** - too many generic names pass through
3. **Country assignment unreliable** - institutions from other countries are misclassified as Dutch
4. **ISIL extraction nearly non-functional** - only 1.7% of extracted institutions have identifiers
### Is This Acceptable?
**For exploratory research**: Yes, with caveats
- The 16 validated institutions are valuable for linking conversational context
- Low recall is expected given conversation coverage
**For production use**: No
- 72% false positive rate is unacceptable
- Must implement immediate actions #1-3 before using this data
### Path Forward
**Recommended Next Steps**:
1. ✅ **DONE**: Validate Dutch extraction quality (this session)
2. ⏭️ **NEXT**: Implement enhanced quality filters (#1-3 above)
3. ⏭️ **THEN**: Re-run batch extraction (v4) and measure improvement
4. ⏭️ **THEN**: Analyze confidence score distribution (#6)
5. ⏭️ **THEN**: Web scraping to complement conversation data (#5)
**Do NOT rely on conversation NLP alone for comprehensive Dutch institution data.**

---
## Files Created This Session
1. `scripts/validate_dutch_extraction.py` - Validation script
2. `output/dutch_validation_report.txt` - Console output report
3. `output/DUTCH_VALIDATION_ANALYSIS.md` - Detailed analysis document
4. `SESSION_SUMMARY_NOV7_DUTCH_VALIDATION.md` - This summary
---
## Next Session TODO
### Priority 1: Implement Enhanced Quality Filters ⚠️
**Goal**: Reduce false positive rate from 72% → <30%
**Tasks**:
1. Add 15+ new generic pattern filters to `batch_extract_institutions.py`
2. Strengthen country code validation for `country=NL`
3. Require city names for Dutch institutions
4. Enhance ISIL extraction patterns in `nlp_extractor.py`
5. Re-run batch extraction (v4)
6. Re-run Dutch validation to measure improvement
**Expected Metrics After v4**:
- Precision: 27.6% → 60-70%
- Recall: 4.4% → 3-4% (slight drop due to stricter filters)
- F1 Score: 7.6% → ~6-8% (F1 can never exceed twice the recall, so it stays low until recall improves)
- False positives: 42 → 10-15
### Priority 2: Generate Quality Filter Analysis Report
**Goal**: Document filter effectiveness across all 594 institutions (not just NL)
**Tasks**:
1. Create `output/QUALITY_FILTER_ANALYSIS_v3.md`
2. Breakdown of all 8 filters with examples
3. Country distribution analysis
4. Comparison with v2 results
### Priority 3: Confidence Score Analysis
**Goal**: Understand why confidence filter removed 0 institutions
**Tasks**:
1. Plot confidence score distribution (histogram)
2. Compare matched vs. false positive confidence scores
3. Determine optimal threshold for Dutch institutions
4. Document findings
---
**Session End Time**: November 7, 2025
**Total Time**: ~2 hours
**Status**: Priority 1 Complete - Ready for Priority 2 (Enhanced Filters)