glam/SESSION_SUMMARY_NOV7_DUTCH_VALIDATION.md

# Session Summary: Dutch Institution Extraction Validation

**Date**: November 7, 2025
**Session Goal**: Validate extraction quality against authoritative ISIL registry (Priority 1 from last session)

---

## What We Accomplished ✅

### 1. Created Validation Script

**File**: `scripts/validate_dutch_extraction.py`

**Features**:
- Loads extracted NL institutions from batch extraction CSV (58 institutions)
- Loads ISIL registry ground truth (365 authoritative institutions)
- Cross-links by ISIL code (exact matching)
- Fuzzy name matching using `rapidfuzz` (≥85% similarity threshold)
- Calculates precision, recall, and F1 score
- Identifies false positives and false negatives
- Generates comprehensive validation report

**Technology**:
- Uses regex parsing (reused pattern from `isil_registry.py`)
- Fuzzy matching with 3 strategies: ratio, partial_ratio, token_sort_ratio
- Name normalization (remove common Dutch terms, punctuation)
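
The two-stage matching described above (exact ISIL code first, fuzzy name second) can be sketched roughly as follows. This is not the code in `validate_dutch_extraction.py`: the stop-term list, the record layout, and the function names are assumptions for illustration, and the standard-library `difflib.SequenceMatcher` stands in for `rapidfuzz` so the example has no dependencies.

```python
import string
from difflib import SequenceMatcher

# Hypothetical stop-term list; the real script's list is not shown in this summary.
STOP_TERMS = {"het", "de", "stichting"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop stop terms."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in cleaned.split() if tok not in STOP_TERMS)

def similarity(a: str, b: str) -> float:
    """0-100 score on normalized names (stdlib stand-in for rapidfuzz's ratio)."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match(extracted: dict, registry: list, threshold: float = 85.0):
    """Return (registry_entry, match_type), or (None, None) if nothing matches."""
    # Stage 1: exact ISIL code match
    code = extracted.get("isil")
    if code:
        for entry in registry:
            if entry["isil"] == code:
                return entry, "isil"
    # Stage 2: best fuzzy name match at or above the threshold
    best = max(registry, key=lambda e: similarity(extracted["name"], e["name"]), default=None)
    if best and similarity(extracted["name"], best["name"]) >= threshold:
        return best, "fuzzy"
    return None, None

registry = [
    {"name": "Van Gogh Museum", "isil": "NL-AsdVGM"},
    {"name": "Het Scheepvaartmuseum", "isil": "NL-AsdHSM"},
]
print(match({"name": "Gogh Museum", "isil": "NL-AsdVGM"}, registry))
print(match({"name": "Scheepvaartmuseum"}, registry))
print(match({"name": "Library of Congress"}, registry))
```

One design consequence worth keeping in mind: the more terms normalization strips, the more distinct institutions collapse onto the same normalized name, so an aggressive stop-term list trades missed matches for false ones.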

### 2. Ran Validation Analysis

**Command**:
```bash
python scripts/validate_dutch_extraction.py > output/dutch_validation_report.txt
```

**Output Files**:
1. `output/dutch_validation_report.txt` - Full validation report (console output)
2. `output/DUTCH_VALIDATION_ANALYSIS.md` - Detailed analysis document

---

## Key Findings

### Quality Metrics 📊

| Metric | Value | Grade |
|--------|-------|-------|
| **Precision** | 27.6% | ❌ Poor |
| **Recall** | 4.4% | ❌ Very Low |
| **F1 Score** | 7.6% | ❌ Failing |

**Interpretation**:
- **Precision 27.6%**: Only about 1 in 4 extracted NL institutions matched the registry (72.4% of extractions are false positives)
- **Recall 4.4%**: Found only 16 out of 365 known Dutch institutions (95.6% false negative rate)
- **F1 Score 7.6%**: Overall extraction quality is very low
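
The three figures follow directly from the counts reported in this summary (16 matches, 42 false positives, 349 false negatives). A minimal sketch of the arithmetic; the function name is illustrative and not taken from the validation script:

```python
def extraction_metrics(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Precision, recall, and F1 from raw match counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Counts from this session: 16 matches, 42 false positives, 349 false negatives
metrics = extraction_metrics(16, 42, 349)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because F1 is a harmonic mean, it stays close to the smaller of its two inputs, which is why the 4.4% recall dominates the score.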

### Matching Results 🎯

**Total Extracted (country=NL)**: 58 institutions
**Total ISIL Registry**: 365 institutions

**Matches**:
- ✅ **ISIL code matches**: 1 institution (1.7% of extracted)
- ⚠️ **Fuzzy name matches**: 15 institutions (25.9% of extracted)
- ✅ **Total correct**: 16 institutions (27.6% precision)

**Errors**:
- ❌ **False positives**: 42 institutions (72.4% of extracted)
- ❌ **False negatives**: 349 institutions (95.6% of registry missed)

### Top Validated Institutions ✅

These 16 extracted names matched an ISIL registry entry (first 8 shown; fuzzy matches still need manual review):

| Extracted Name | Registry Name | ISIL Code | Match Type |
|----------------|---------------|-----------|------------|
| Gogh Museum | Van Gogh Museum | NL-AsdVGM | ISIL (exact) |
| Van Gogh Museum | Van Gogh Museum | NL-AsdVGM | Fuzzy 100% |
| Verzetsmuseum | Verzetsmuseum Amsterdam | NL-AsdVMA | Fuzzy 100% |
| Rijksmuseum | Rijksmuseum | NL-AsdRM | Fuzzy 100% |
| Scheepvaartmuseum | Het Scheepvaartmuseum | NL-AsdHSM | Fuzzy 100% |
| Major Amsterdam Museum | Amsterdam Museum | NL-AsdAM | Fuzzy 100% |
| Groninger Museum | Groninger Archieven | NL-GnGRA | Fuzzy 100% |
| Fotoarchief | Twents Fotoarchief | NL-OdzTFA | Fuzzy 100% |
| ... | ... | ... | ... |

---

## False Positive Analysis (42 institutions) ❌

**Definition**: Extracted as `country=NL` but NOT in authoritative ISIL registry

### Categories

#### 1. **Sentence Fragments** (19 instances, 45%)

Examples:
- "for Museum"
- "for Archives"
- "Archivees of the"
- "Archivees and the"
- "Library, Archive"
- "Museum Connections:"
- "Galleries, Libraries, Archive"

**Root Cause**: NER extracting incomplete sentences, list headers, or markdown artifacts

#### 2. **Generic/Vague Names** (13 instances, 31%)

Examples:
- "Dutch Museum"
- "Dutch Archive"
- "Dutch National Archive"
- "General Pattern: Most Dutch Museum"
- "Latest Museum"
- "Core Museum"
- "Maritime Museum"

**Root Cause**: Extracting descriptive text instead of proper institution names

#### 3. **Non-Dutch Institutions Misclassified** (6 instances, 14%)

Examples:
- "Library of Congress" (US, not NL)
- "Linnaeus University" (Sweden)
- "International Islamic University" (Malaysia)
- "University Malaysia"

**Root Cause**: Country code assignment errors in multi-country conversations

#### 4. **Legitimate But Not in ISIL Registry** (4 instances, 10%)

Examples:
- "Stedelijk Museum Amsterdam"
- "Van Abbemuseum"
- "Koninklijke Library"
- "University of Groningen"

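**Root Cause**: These are likely real institutions that either lack an ISIL code (not yet assigned or not eligible) or appear in the registry under a different name (e.g., "Koninklijke Library" vs. "Koninklijke Bibliotheek", NL-HaKB)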

**Note**: Some of these could be true positives that need manual verification.

---

## False Negative Analysis (349 institutions) ❌

**Definition**: In ISIL registry but NOT extracted from conversations

### High-Value Missing Institutions

Major institutions that should have been found:

| Institution | ISIL Code | City | Type |
|-------------|-----------|------|------|
| Anne Frank Stichting | NL-AsdAFS | Amsterdam | Museum |
| Nationaal Archief | NL-HaNA | The Hague | Archive |
| Koninklijke Bibliotheek | NL-HaKB | The Hague | Library |
| NIOD | NL-AsdNIOD | Amsterdam | Research |
| Stadsarchief Amsterdam | NL-AsdSAA | Amsterdam | Archive |
| Regionaal Archief Alkmaar | NL-AmrRAA | Alkmaar | Archive |
| Drents Archief | NL-AsnDA | Assen | Archive |

### Why So Many False Negatives?

**Root Causes**:

1. **Conversation Coverage Bias** (PRIMARY ISSUE)
   - Conversations focus on **global/international GLAM** (60+ countries)
   - Only ~5-10 conversations (out of 453) focus on Netherlands
   - Dutch institutions mentioned **incidentally**, not systematically
   - Most Dutch conversations discuss metadata standards, not specific institutions

2. **Generic Name Filtering** (Intended Behavior)
   - Quality filters intentionally remove "National Archive", "Library"
   - Some legitimate institutions filtered (e.g., "Nationaal Archief")

3. **Regional Institution Underrepresentation**
   - Conversations discuss major museums (Rijksmuseum, Van Gogh)
   - Skip regional archives, city libraries, specialized collections
   - ISIL registry covers all 365 registered institutions nationwide, not just those in major cities

4. **NER Model Limitations**
   - General NER models, not heritage-specific
   - May miss Dutch compound words ("Streekarchief Rijnlands Midden")

---

## Complete List of Extracted NL Institutions (58)

✅ = Matched registry | ❌ = False positive

```
 1. ❌ Dutch Museum                                                 | conf: 0.7
 2. ✅ Museumm and Rijksmuseum → Rijksmuseum [NL-AsdRM]            | conf: 0.7
 3. ✅ Van Gogh Museum → Van Gogh Museum [NL-AsdVGM]               | conf: 0.7
 4. ❌ Resistance Museum                                            | conf: 0.7
 5. ❌ for Museum                                                   | conf: 0.7
 6. ✅ Verzetsmuseum → Verzetsmuseum Amsterdam [NL-AsdVMA]         | conf: 0.5
 7. ✅ Gogh Museum → Van Gogh Museum [NL-AsdVGM] (ISIL match)      | conf: 1.0
 8. ❌ General Pattern: Most Dutch Museum                           | conf: 0.7
 9. ✅ Major Amsterdam Museum → Amsterdam Museum [NL-AsdAM]        | conf: 0.7
10. ✅ Rijksmuseum → Rijksmuseum van Oudheden [NL-LdnRMO]          | conf: 0.5
11. ✅ Scheepvaartmuseum → Het Scheepvaartmuseum [NL-AsdHSM]       | conf: 0.5
12. ❌ Maritime Museum                                              | conf: 0.7
13. ❌ Linnaeus University (Sweden, not NL!)                        | conf: 0.7
14. ❌ Archivees of the                                             | conf: 0.7
15. ❌ for Archives                                                 | conf: 0.8
16. ❌ Dutch Archive                                                | conf: 0.8
17. ❌ Archivees and the                                            | conf: 0.8
18. ❌ HMML, Library                                                | conf: 0.7
19. ❌ KB National Library                                          | conf: 0.7
20. ❌ Library of Congress (US, not NL!)                            | conf: 0.7
21. ❌ Dutch National Archive                                       | conf: 0.7
22. ❌ Latest Museum                                                | conf: 0.7
23. ❌ Library, Archive                                             | conf: 0.7
24. ❌ LIDO/CIDOC-CRM Museum                                        | conf: 0.7
25. ❌ Core Museum                                                  | conf: 0.7
26. ❌ Museum Connections:                                          | conf: 0.7
27. ❌ Stedelijk Museum (likely real, but no ISIL match)            | conf: 0.7
28. ❌ Museum Bureau Amsterdam                                      | conf: 0.7
29. ❌ Stedelijk Museum Amsterdam and Stedelijk Museum              | conf: 0.7
30. ✅ Museum Amsterdam → Stadsarchief Amsterdam [NL-AsdSAA]       | conf: 0.7
31. ❌ Koninklijke Library (likely KB, but no match)                | conf: 0.7
32. ❌ Libraries, Archive                                           | conf: 0.8
33. ❌ Libraries, Archives, and Museum                              | conf: 0.8
34. ❌ Corporate Archives                                           | conf: 0.8
35. ❌ Religious Archives                                           | conf: 0.8
36. ❌ Family Archives                                              | conf: 0.8
37. ❌ Archives Limburg                                             | conf: 0.8
38. ❌ for Research Institute                                       | conf: 0.8
39. ❌ Van Abbemuseum and Het Noordbrabants Museum                  | conf: 0.8
40. ❌ Abbemuseum                                                   | conf: 0.6
41. ❌ UNESCO-recognized Archives                                   | conf: 0.8
42. ❌ Galleries, Libraries, Archive                                | conf: 0.8
43. ❌ University of Groningen                                      | conf: 0.7
44. ✅ Groninger Museum → Groninger Archieven [NL-GnGRA]           | conf: 0.7
45. ✅ University Museum → Wageningen University [NL-WgWUR]        | conf: 0.7
46. ❌ for Archive                                                  | conf: 0.8
47. ❌ Library FabLab                                               | conf: 0.8
48. ✅ Fotoarchief → Twents Fotoarchief [NL-OdzTFA]                | conf: 0.6
49. ❌ Archive Net                                                  | conf: 0.8
50. ❌ Frisian Archives                                             | conf: 0.8
51. ❌ Fries Archive                                                | conf: 0.8
52. ❌ Natuurmuseum                                                 | conf: 0.6
53. ❌ Purchasing System), Noord Veluws Archive                     | conf: 0.7
54. ❌ for Noord-Hollands Archive                                   | conf: 0.8
55. ❌ IFLA Library                                                 | conf: 0.8
56. ❌ Sociology and Anthropology International Islamic University  | conf: 0.8
57. ❌ University Malaysia (Malaysia, not NL!)                      | conf: 0.8
58. ❌ Studies/Southeast Asian Studies) Leiden University           | conf: 0.8
```

**Summary**:
- ✅ **16 correct matches** (27.6%)
- ❌ **42 false positives** (72.4%)

---

## Recommendations (Prioritized)

### 🔴 **Immediate Actions** (This Week)

#### 1. **Strengthen Quality Filters**

Add filters to `scripts/batch_extract_institutions.py`:

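```python
import re

# Block generic patterns (matched case-insensitively against the whole name)
generic_dutch_patterns = [
    r'^dutch (museum|archive|library)$',
    r'^for (museum|archive|library|archives?)$',
    r'^(museum|archive|library) amsterdam$',
    r'^major .* museum$',  # Note: also drops "Major Amsterdam Museum", counted above as a match
    r'^general pattern:',
    r'^latest museum$',
    r'^core museum$',
    r'^.*museum connections:.*$',
    r'^galleries,? libraries,? archives?$',
    r'^libraries,? archives?$',
    r'^corporate archives?$',
    r'^religious archives?$',
    r'^family archives?$',
]

# Block sentence fragments (strengthen existing)
fragment_patterns = [
    r'^(for|of|and|the|a|an)\s',  # Already exists
    r':\s*$',  # Ends with colon
    r'^\(',  # Starts with parenthesis
]

def rejection_reason(name, country, city):
    """Return why a candidate should be rejected, or None to keep it.

    Assumes the caller skips any candidate with a non-None reason.
    """
    text = name.strip()
    if any(re.search(p, text, re.IGNORECASE) for p in generic_dutch_patterns):
        return "Generic name"
    if any(re.search(p, text, re.IGNORECASE) for p in fragment_patterns):
        return "Sentence fragment"
    # For country=NL, require city name
    if country == 'NL' and not city:
        return "No city for Dutch institution"
    return None
```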

**Expected Impact**: Reduce false positives from 42 → 15 (64% reduction)

#### 2. **Fix Country Code Assignment**

Current logic is unreliable. Implement stricter validation:

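```python
def validate_nl_institution(name, country, city, dutch_cities):
    """Return (country, rejection_reason); rejection_reason is None if kept.

    dutch_cities: set of known Dutch city names, e.g. from
    load_dutch_cities() built on the ISIL registry.
    """
    if country != 'NL':
        return country, None

    # Check if city is a known Dutch city
    if city and city not in dutch_cities:
        country = 'UNKNOWN'

    # Require minimum name quality.
    # Caution: this also rejects validated single-word names such as
    # "Rijksmuseum" and "Verzetsmuseum"; consider exempting registry matches.
    if len(name.split()) < 2:  # Single-word names
        return country, "Single-word name for NL institution"
    return country, None
```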

**Expected Impact**: Remove 6 misclassified institutions (Linnaeus Univ, Library of Congress, etc.)

#### 3. **Enhance ISIL Extraction**

Only 1/58 institutions had ISIL codes. Improve patterns:

```python
isil_patterns = [
    r'ISIL[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\b([A-Z]{2}-[A-Za-z]{2,}[A-Z0-9]{2,})\b',  # Standalone NL-AsdRM
    r'code[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\(([A-Z]{2}-[A-Za-z0-9]+)\)',
]
```

**Expected Impact**: Increase ISIL extraction from 1.7% → 10-15%
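
To see how these patterns behave on real text, the sketch below applies them and collects unique codes. The function name and the optional country filter are assumptions for illustration, not part of `nlp_extractor.py`. The sample sentence is made up to show that the standalone pattern is permissive enough to pick up look-alikes such as `US-ASCII`.

```python
import re

# Patterns copied from the recommendation above
ISIL_PATTERNS = [
    r'ISIL[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\b([A-Z]{2}-[A-Za-z]{2,}[A-Z0-9]{2,})\b',  # Standalone NL-AsdRM
    r'code[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\(([A-Z]{2}-[A-Za-z0-9]+)\)',
]

def extract_isil_codes(text, country=None):
    """Return sorted unique ISIL-like codes; optionally keep one country prefix."""
    found = set()
    for pattern in ISIL_PATTERNS:
        found.update(re.findall(pattern, text))
    if country:
        found = {code for code in found if code.startswith(country + "-")}
    return sorted(found)

sample = "The Rijksmuseum (NL-AsdRM) and the Nationaal Archief, ISIL: NL-HaNA, use US-ASCII."
print(extract_isil_codes(sample))        # includes the spurious "US-ASCII"
print(extract_isil_codes(sample, "NL"))  # only NL- prefixed codes
```

Filtering on the country prefix, or checking candidates against the registry's known codes, is one way to keep the looser patterns from adding new false positives.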

### 🟡 **Medium-Term Actions** (Next Month)

#### 4. **Use Dedicated Dutch Conversation Files**

- Identify conversations specifically about Dutch GLAM
- Create separate extraction pipeline with stricter filters
- Cross-check against ISIL registry at extraction time

#### 5. **Enrich with Web Scraping**

For 349 missing ISIL registry institutions:
- Use `crawl4ai` to scrape institutional websites
- Upgrade data tier from TIER_4 → TIER_2
- Complement conversation extraction

#### 6. **Analyze Confidence Score Distribution**

- Plot confidence for matched vs. unmatched institutions
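- Determine whether a usable threshold exists: in the list above, matches score 0.5-1.0 and false positives 0.6-0.8, so a 0.85-0.9 cutoff would keep only the single ISIL match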
- Currently using 0.5+ (too permissive)
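
A threshold sweep is one simple way to run this analysis before plotting anything. The sketch below is an assumption about how it could be done, not existing project code; the sample holds only entries 1-8 of the extracted list, so the numbers are illustrative rather than a result for all 58.

```python
def sweep_thresholds(records, thresholds):
    """For each threshold, return (kept_count, precision among kept records)."""
    results = {}
    for threshold in thresholds:
        kept = [matched for conf, matched in records if conf >= threshold]
        precision = sum(kept) / len(kept) if kept else None
        results[threshold] = (len(kept), precision)
    return results

# (confidence, matched_registry) for entries 1-8 of the extracted list
sample = [
    (0.7, False), (0.7, True), (0.7, True), (0.7, False),
    (0.7, False), (0.5, True), (1.0, True), (0.7, False),
]
for threshold, (kept, precision) in sweep_thresholds(sample, [0.5, 0.7, 0.85]).items():
    print(threshold, kept, precision)
```

If precision only improves once nearly every record has been dropped, the confidence score is not a useful filter on its own and needs to be combined with the pattern and country checks.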

### 🟢 **Long-Term Actions** (3-6 Months)

#### 7. **Build Dutch-Specific NER Model**

Train on:
- ISIL registry (365 institutions)
- Dutch organizations CSV (1,351 institutions)
- Annotated conversation excerpts

#### 8. **Integrate with External APIs**

- Collections Netherlands
- Wikidata SPARQL
- OpenStreetMap validation
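
As a starting point for the Wikidata option, the sketch below builds a query for items that have an ISIL identifier (property P791) and are located in the Netherlands (P17 = Q55). It only constructs the request URL and sends nothing; the function name is hypothetical, and the query would need tuning (for example paging and a descriptive User-Agent) before real use.

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# P791 = ISIL identifier, P17 = country, Q55 = Netherlands
QUERY = """
SELECT ?item ?itemLabel ?isil WHERE {
  ?item wdt:P791 ?isil ;
        wdt:P17 wd:Q55 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
"""

def build_request_url(query: str = QUERY) -> str:
    """Build a GET URL for the Wikidata Query Service (no request is sent)."""
    return f"{WIKIDATA_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"

print(build_request_url())
```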

---

## Conclusion

### Current Assessment

The Dutch institution extraction from conversations achieves **poor quality (27.6% precision, 4.4% recall)**. Primary issues:

1. ❌ **Conversations are NOT institution catalogs** - they discuss metadata standards, not list institutions
2. ❌ **Quality filters insufficient** - too many generic names pass through
3. ❌ **Country assignment unreliable** - institutions from other countries misclassified
4. ❌ **ISIL extraction nearly non-functional** - only 1.7% have identifiers

### Is This Acceptable?

**For exploratory research**: Yes, with caveats
- The 16 validated institutions are valuable for linking conversational context
- Low recall is expected given conversation coverage

**For production use**: No
- 72% false positive rate is unacceptable
- Must implement immediate actions #1-3 before using this data

### Path Forward

**Recommended Next Steps**:

1. ✅ **DONE**: Validate Dutch extraction quality (this session)
2. ⏭️ **NEXT**: Implement enhanced quality filters (#1-3 above)
3. ⏭️ **THEN**: Re-run batch extraction (v4) and measure improvement
4. ⏭️ **THEN**: Analyze confidence score distribution (#6)
5. ⏭️ **THEN**: Web scraping to complement conversation data (#5)

**Do NOT rely on conversation NLP alone for comprehensive Dutch institution data.**

---

## Files Created This Session

1. ✅ `scripts/validate_dutch_extraction.py` - Validation script
2. ✅ `output/dutch_validation_report.txt` - Console output report
3. ✅ `output/DUTCH_VALIDATION_ANALYSIS.md` - Detailed analysis document
4. ✅ `SESSION_SUMMARY_NOV7_DUTCH_VALIDATION.md` - This summary

---

## Next Session TODO

### Priority 1: Implement Enhanced Quality Filters ⚠️

**Goal**: Reduce false positive rate from 72% → <30%

**Tasks**:
1. Add 15+ new generic pattern filters to `batch_extract_institutions.py`
2. Strengthen country code validation for `country=NL`
3. Require city names for Dutch institutions
4. Enhance ISIL extraction patterns in `nlp_extractor.py`
5. Re-run batch extraction (v4)
6. Re-run Dutch validation to measure improvement

**Expected Metrics After v4**:
- Precision: 27.6% → 60-70%
- Recall: 4.4% → 3-4% (slight drop due to stricter filters)
- F1 Score: 7.6% → ~6-8% (F1 is the harmonic mean of precision and recall, so it stays capped by the 3-4% recall)
- False positives: 42 → 10-15

### Priority 2: Generate Quality Filter Analysis Report

**Goal**: Document filter effectiveness across all 594 institutions (not just NL)

**Tasks**:
1. Create `output/QUALITY_FILTER_ANALYSIS_v3.md`
2. Breakdown of all 8 filters with examples
3. Country distribution analysis
4. Comparison with v2 results

### Priority 3: Confidence Score Analysis

**Goal**: Understand why confidence filter removed 0 institutions

**Tasks**:
1. Plot confidence score distribution (histogram)
2. Compare matched vs. false positive confidence scores
3. Determine optimal threshold for Dutch institutions
4. Document findings

---

**Session End Time**: November 7, 2025
**Total Time**: ~2 hours
**Status**: ✅ Validation (Priority 1 from last session) complete; next up is Enhanced Quality Filters (Priority 1 for next session)