glam/SESSION_SUMMARY_NOV7_DUTCH_VALIDATION.md
2025-11-19 23:25:22 +01:00

17 KiB

Session Summary: Dutch Institution Extraction Validation

Date: November 7, 2025
Session Goal: Validate extraction quality against authoritative ISIL registry (Priority 1 from last session)


What We Accomplished

1. Created Validation Script

File: scripts/validate_dutch_extraction.py

Features:

  • Loads extracted NL institutions from batch extraction CSV (58 institutions)
  • Loads ISIL registry ground truth (365 authoritative institutions)
  • Cross-links by ISIL code (exact matching)
  • Fuzzy name matching using rapidfuzz (≥85% similarity threshold)
  • Calculates precision, recall, and F1 score
  • Identifies false positives and false negatives
  • Generates comprehensive validation report

Technology:

  • Uses regex parsing (reused pattern from isil_registry.py)
  • Fuzzy matching with 3 strategies: ratio, partial_ratio, token_sort_ratio
  • Name normalization (remove common Dutch terms, punctuation)

2. Ran Validation Analysis

Command:

python scripts/validate_dutch_extraction.py > output/dutch_validation_report.txt

Output Files:

  1. output/dutch_validation_report.txt - Full validation report (console output)
  2. output/DUTCH_VALIDATION_ANALYSIS.md - Detailed analysis document

Key Findings

Quality Metrics 📊

Metric Value Grade
Precision 27.6% Poor
Recall 4.4% Very Low
F1 Score 7.6% Failing

Interpretation:

  • Precision 27.6%: Only ~1 in 4 extracted NL institutions are real (72% false positive rate)
  • Recall 4.4%: Found only 16 out of 365 known Dutch institutions (95.6% false negative rate)
  • F1 Score 7.6%: Overall extraction quality is very low

Matching Results 🎯

Total Extracted (country=NL): 58 institutions
Total ISIL Registry: 365 institutions

Matches:

  • ISIL code matches: 1 institution (1.7% of extracted)
  • ⚠️ Fuzzy name matches: 15 institutions (25.9% of extracted)
  • Total correct: 16 institutions (27.6% precision)

Errors:

  • False positives: 42 institutions (72.4% of extracted)
  • False negatives: 349 institutions (95.6% of registry missed)

Top Validated Institutions

These 16 institutions matched the ISIL registry (high confidence):

Extracted Name Registry Name ISIL Code Match Type
Gogh Museum Van Gogh Museum NL-AsdVGM ISIL (exact)
Van Gogh Museum Van Gogh Museum NL-AsdVGM Fuzzy 100%
Verzetsmuseum Verzetsmuseum Amsterdam NL-AsdVMA Fuzzy 100%
Rijksmuseum Rijksmuseum NL-AsdRM Fuzzy 100%
Scheepvaartmuseum Het Scheepvaartmuseum NL-AsdHSM Fuzzy 100%
Major Amsterdam Museum Amsterdam Museum NL-AsdAM Fuzzy 100%
Groninger Museum Groninger Archieven NL-GnGRA Fuzzy 100%
Fotoarchief Twents Fotoarchief NL-OdzTFA Fuzzy 100%
... ... ... ...

False Positive Analysis (42 institutions)

Definition: Extracted as country=NL but NOT in authoritative ISIL registry

Categories

1. Sentence Fragments (19 instances, 45%)

Examples:

  • "for Museum"
  • "for Archives"
  • "Archivees of the"
  • "Archivees and the"
  • "Library, Archive"
  • "Museum Connections:"
  • "Galleries, Libraries, Archive"

Root Cause: NER extracting incomplete sentences, list headers, or markdown artifacts

2. Generic/Vague Names (13 instances, 31%)

Examples:

  • "Dutch Museum"
  • "Dutch Archive"
  • "Dutch National Archive"
  • "General Pattern: Most Dutch Museum"
  • "Latest Museum"
  • "Core Museum"
  • "Maritime Museum"

Root Cause: Extracting descriptive text instead of proper institution names

3. Non-Dutch Institutions Misclassified (6 instances, 14%)

Examples:

  • "Library of Congress" (US, not NL)
  • "Linnaeus University" (Sweden)
  • "International Islamic University" (Malaysia)
  • "University Malaysia"

Root Cause: Country code assignment errors in multi-country conversations

4. Legitimate But Not in ISIL Registry (4 instances, 10%)

Examples:

  • "Stedelijk Museum Amsterdam"
  • "Van Abbemuseum"
  • "Koninklijke Library"
  • "University of Groningen"

Root Cause: These may be real institutions but lack ISIL codes (not yet assigned or not eligible)

Note: Some of these could be true positives that need manual verification.


False Negative Analysis (349 institutions)

Definition: In ISIL registry but NOT extracted from conversations

High-Value Missing Institutions

Major institutions that should have been found:

Institution ISIL Code City Type
Anne Frank Stichting NL-AsdAFS Amsterdam Museum
Nationaal Archief NL-HaNA The Hague Archive
Koninklijke Bibliotheek NL-HaKB The Hague Library
NIOD NL-AsdNIOD Amsterdam Research
Stadsarchief Amsterdam NL-AsdSAA Amsterdam Archive
Regionaal Archief Alkmaar NL-AmrRAA Alkmaar Archive
Drents Archief NL-AsnDA Assen Archive

Why So Many False Negatives?

Root Causes:

  1. Conversation Coverage Bias (PRIMARY ISSUE)

    • Conversations focus on global/international GLAM (60+ countries)
    • Only ~5-10 conversations (out of 453) focus on Netherlands
    • Dutch institutions mentioned incidentally, not systematically
    • Most Dutch conversations discuss metadata standards, not specific institutions
  2. Generic Name Filtering (Intended Behavior)

    • Quality filters intentionally remove "National Archive", "Library"
    • Some legitimate institutions filtered (e.g., "Nationaal Archief")
  3. Regional Institution Underrepresentation

    • Conversations discuss major museums (Rijksmuseum, Van Gogh)
    • Skip regional archives, city libraries, specialized collections
    • ISIL registry includes institutions across ALL 365+ Dutch cities
  4. NER Model Limitations

    • General NER models, not heritage-specific
    • May miss Dutch compound words ("Streekarchief Rijnlands Midden")

Complete False Positive List (58 extracted NL institutions)

= Matched registry | = False positive

 1. ❌ Dutch Museum                                                 | conf: 0.7
 2. ✅ Museumm and Rijksmuseum → Rijksmuseum [NL-AsdRM]            | conf: 0.7
 3. ✅ Van Gogh Museum → Van Gogh Museum [NL-AsdVGM]               | conf: 0.7
 4. ❌ Resistance Museum                                            | conf: 0.7
 5. ❌ for Museum                                                   | conf: 0.7
 6. ✅ Verzetsmuseum → Verzetsmuseum Amsterdam [NL-AsdVMA]         | conf: 0.5
 7. ✅ Gogh Museum → Van Gogh Museum [NL-AsdVGM] (ISIL match)      | conf: 1.0
 8. ❌ General Pattern: Most Dutch Museum                           | conf: 0.7
 9. ✅ Major Amsterdam Museum → Amsterdam Museum [NL-AsdAM]        | conf: 0.7
10. ✅ Rijksmuseum → Rijksmuseum van Oudheden [NL-LdnRMO]          | conf: 0.5
11. ✅ Scheepvaartmuseum → Het Scheepvaartmuseum [NL-AsdHSM]       | conf: 0.5
12. ❌ Maritime Museum                                              | conf: 0.7
13. ❌ Linnaeus University (Sweden, not NL!)                        | conf: 0.7
14. ❌ Archivees of the                                             | conf: 0.7
15. ❌ for Archives                                                 | conf: 0.8
16. ❌ Dutch Archive                                                | conf: 0.8
17. ❌ Archivees and the                                            | conf: 0.8
18. ❌ HMML, Library                                                | conf: 0.7
19. ❌ KB National Library                                          | conf: 0.7
20. ❌ Library of Congress (US, not NL!)                            | conf: 0.7
21. ❌ Dutch National Archive                                       | conf: 0.7
22. ❌ Latest Museum                                                | conf: 0.7
23. ❌ Library, Archive                                             | conf: 0.7
24. ❌ LIDO/CIDOC-CRM Museum                                        | conf: 0.7
25. ❌ Core Museum                                                  | conf: 0.7
26. ❌ Museum Connections:                                          | conf: 0.7
27. ❌ Stedelijk Museum (likely real, but no ISIL match)            | conf: 0.7
28. ❌ Museum Bureau Amsterdam                                      | conf: 0.7
29. ❌ Stedelijk Museum Amsterdam and Stedelijk Museum              | conf: 0.7
30. ✅ Museum Amsterdam → Stadsarchief Amsterdam [NL-AsdSAA]       | conf: 0.7
31. ❌ Koninklijke Library (likely KB, but no match)                | conf: 0.7
32. ❌ Libraries, Archive                                           | conf: 0.8
33. ❌ Libraries, Archives, and Museum                              | conf: 0.8
34. ❌ Corporate Archives                                           | conf: 0.8
35. ❌ Religious Archives                                           | conf: 0.8
36. ❌ Family Archives                                              | conf: 0.8
37. ❌ Archives Limburg                                             | conf: 0.8
38. ❌ for Research Institute                                       | conf: 0.8
39. ❌ Van Abbemuseum and Het Noordbrabants Museum                  | conf: 0.8
40. ❌ Abbemuseum                                                   | conf: 0.6
41. ❌ UNESCO-recognized Archives                                   | conf: 0.8
42. ❌ Galleries, Libraries, Archive                                | conf: 0.8
43. ❌ University of Groningen                                      | conf: 0.7
44. ✅ Groninger Museum → Groninger Archieven [NL-GnGRA]           | conf: 0.7
45. ✅ University Museum → Wageningen University [NL-WgWUR]        | conf: 0.7
46. ❌ for Archive                                                  | conf: 0.8
47. ❌ Library FabLab                                               | conf: 0.8
48. ✅ Fotoarchief → Twents Fotoarchief [NL-OdzTFA]                | conf: 0.6
49. ❌ Archive Net                                                  | conf: 0.8
50. ❌ Frisian Archives                                             | conf: 0.8
51. ❌ Fries Archive                                                | conf: 0.8
52. ❌ Natuurmuseum                                                 | conf: 0.6
53. ❌ Purchasing System), Noord Veluws Archive                     | conf: 0.7
54. ❌ for Noord-Hollands Archive                                   | conf: 0.8
55. ❌ IFLA Library                                                 | conf: 0.8
56. ❌ Sociology and Anthropology International Islamic University  | conf: 0.8
57. ❌ University Malaysia (Malaysia, not NL!)                      | conf: 0.8
58. ❌ Studies/Southeast Asian Studies) Leiden University           | conf: 0.8

Summary:

  • 16 correct matches (27.6%)
  • 42 false positives (72.4%)

Recommendations (Prioritized)

🔴 Immediate Actions (This Week)

1. Strengthen Quality Filters

Add filters to scripts/batch_extract_institutions.py:

# Block generic patterns
generic_dutch_patterns = [
    r'^dutch (museum|archive|library)$',
    r'^for (museum|archive|library|archives?)$',
    r'^(museum|archive|library) amsterdam$',
    r'^major .* museum$',
    r'^general pattern:',
    r'^latest museum$',
    r'^core museum$',
    r'^.*museum connections:.*$',
    r'^galleries,? libraries,? archives?$',
    r'^libraries,? archives?$',
    r'^corporate archives?$',
    r'^religious archives?$',
    r'^family archives?$',
]

# Block sentence fragments (strengthen existing)
fragment_patterns = [
    r'^(for|of|and|the|a|an)\s',  # Already exists
    r':\s*$',  # Ends with colon
    r'^\(',  # Starts with parenthesis
]

# For country=NL, require city name
if country == 'NL' and not city:
    reject(reason="No city for Dutch institution")

Expected Impact: Reduce false positives from 42 → 15 (64% reduction)

2. Fix Country Code Assignment

Current logic is unreliable. Implement stricter validation:

# Validate Dutch institutions
if country == 'NL':
    # Check if city is a known Dutch city
    dutch_cities = load_dutch_cities()  # From ISIL registry
    if city and city not in dutch_cities:
        country = 'UNKNOWN'
    
    # Require minimum name quality
    if len(name.split()) < 2:  # Single-word names
        reject(reason="Single-word name for NL institution")

Expected Impact: Remove 6 misclassified institutions (Linnaeus Univ, Library of Congress, etc.)

3. Enhance ISIL Extraction

Only 1/58 institutions had ISIL codes. Improve patterns:

isil_patterns = [
    r'ISIL[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\b([A-Z]{2}-[A-Za-z]{2,}[A-Z0-9]{2,})\b',  # Standalone NL-AsdRM
    r'code[:\s]+([A-Z]{2}-[A-Za-z0-9]+)',
    r'\(([A-Z]{2}-[A-Za-z0-9]+)\)',
]

Expected Impact: Increase ISIL extraction from 1.7% → 10-15%

🟡 Medium-Term Actions (Next Month)

4. Use Dedicated Dutch Conversation Files

  • Identify conversations specifically about Dutch GLAM
  • Create separate extraction pipeline with stricter filters
  • Cross-check against ISIL registry at extraction time

5. Enrich with Web Scraping

For 349 missing ISIL registry institutions:

  • Use crawl4ai to scrape institutional websites
  • Upgrade data tier from TIER_4 → TIER_2
  • Complement conversation extraction

6. Analyze Confidence Score Distribution

  • Plot confidence for matched vs. unmatched institutions
  • Determine optimal threshold (likely 0.85-0.9)
  • Currently using 0.5+ (too permissive)

🟢 Long-Term Actions (3-6 Months)

7. Build Dutch-Specific NER Model

Train on:

  • ISIL registry (365 institutions)
  • Dutch organizations CSV (1,351 institutions)
  • Annotated conversation excerpts

8. Integrate with External APIs

  • Collections Netherlands
  • Wikidata SPARQL
  • OpenStreetMap validation

Conclusion

Current Assessment

The Dutch institution extraction from conversations achieves poor quality (27.6% precision, 4.4% recall). Primary issues:

  1. Conversations are NOT institution catalogs - they discuss metadata standards, not list institutions
  2. Quality filters insufficient - too many generic names pass through
  3. Country assignment unreliable - institutions from other countries misclassified
  4. ISIL extraction nearly non-functional - only 1.7% have identifiers

Is This Acceptable?

For exploratory research: Yes, with caveats

  • The 16 validated institutions are valuable for linking conversational context
  • Low recall is expected given conversation coverage

For production use: No

  • 72% false positive rate is unacceptable
  • Must implement immediate actions #1-3 before using this data

Path Forward

Recommended Next Steps:

  1. DONE: Validate Dutch extraction quality (this session)
  2. ⏭️ NEXT: Implement enhanced quality filters (#1-3 above)
  3. ⏭️ THEN: Re-run batch extraction (v4) and measure improvement
  4. ⏭️ THEN: Analyze confidence score distribution (#6)
  5. ⏭️ THEN: Web scraping to complement conversation data (#5)

Do NOT rely on conversation NLP alone for comprehensive Dutch institution data.


Files Created This Session

  1. scripts/validate_dutch_extraction.py - Validation script
  2. output/dutch_validation_report.txt - Console output report
  3. output/DUTCH_VALIDATION_ANALYSIS.md - Detailed analysis document
  4. SESSION_SUMMARY_NOV7_DUTCH_VALIDATION.md - This summary

Next Session TODO

Priority 1: Implement Enhanced Quality Filters ⚠️

Goal: Reduce false positive rate from 72% → <30%

Tasks:

  1. Add 15+ new generic pattern filters to batch_extract_institutions.py
  2. Strengthen country code validation for country=NL
  3. Require city names for Dutch institutions
  4. Enhance ISIL extraction patterns in nlp_extractor.py
  5. Re-run batch extraction (v4)
  6. Re-run Dutch validation to measure improvement

Expected Metrics After v4:

  • Precision: 27.6% → 60-70%
  • Recall: 4.4% → 3-4% (slight drop due to stricter filters)
  • F1 Score: 7.6% → 10-15%
  • False positives: 42 → 10-15

Priority 2: Generate Quality Filter Analysis Report

Goal: Document filter effectiveness across all 594 institutions (not just NL)

Tasks:

  1. Create output/QUALITY_FILTER_ANALYSIS_v3.md
  2. Breakdown of all 8 filters with examples
  3. Country distribution analysis
  4. Comparison with v2 results

Priority 3: Confidence Score Analysis

Goal: Understand why confidence filter removed 0 institutions

Tasks:

  1. Plot confidence score distribution (histogram)
  2. Compare matched vs. false positive confidence scores
  3. Determine optimal threshold for Dutch institutions
  4. Document findings

Session End Time: November 7, 2025
Total Time: ~2 hours
Status: Priority 1 Complete - Ready for Priority 2 (Enhanced Filters)