
V5 Extraction Design - Improvements Based on Dutch Validation

Date: 2025-11-07
Baseline: V4 extraction with 50% precision on Dutch institutions
Goal: Achieve 75-90% precision through targeted improvements


Problem Statement

V4 extraction of Dutch heritage institutions achieved 50% precision (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.

Validated Error Categories (from 12 Dutch extractions):

| Error Type | Count | % | Examples |
|---|---|---|---|
| Geographic errors | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| Organizations vs institutions | 1 | 8.3% | IFLA (federation) extracted as library |
| Networks/platforms | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| Academic departments | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| Concepts/services | 1 | 8.3% | Library FabLab (service type, not a named institution) |
| Valid institutions | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |

Root Cause Analysis

1. Geographic Validation Weakness

Current V4 behavior:

  • Infers country from conversation filename (e.g., "Dutch_institutions.json" → NL)
  • Problem: Applies inferred country to ALL institutions in conversation, even if text explicitly mentions other countries
  • Example error: Conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL

V4 code location: nlp_extractor.py:509-511

# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country

Why this fails:

  • No validation that institution actually belongs to inferred country
  • No check for explicit country mentions in the surrounding context
  • Silent override of actual geographic context

2. Entity Type Classification Gaps

Current V4 behavior:

  • Classifies based on keywords (museum, library, archive, etc.)
  • Problem: Doesn't distinguish between:
    • Physical institutions (Rijksmuseum)
    • Organizations (IFLA, UNESCO, ICOM)
    • Networks (Archive Net, Museum Association)
    • Academic units (departments, study programmes)

V4 code location: nlp_extractor.py:606-800 (_extract_institution_names)

Why this fails:

  • Pattern matching on "library" catches "IFLA Library" (organization name)
  • No semantic understanding of entity hierarchy (network vs. member institution)
  • No blacklist for known organization/network names

3. Proper Name Validation Missing

Current V4 behavior:

  • Extracts capitalized phrases containing institution keywords
  • Problem: Accepts generic descriptors and concepts as institution names

Examples of false positives:

  • "Library FabLab" (service concept, like "library café")
  • "Archive Net" (network abbreviation)
  • "Dutch Museum" (too generic, likely part of discussion)

V4 code location: nlp_extractor.py:667-700 (Pattern 1a/1b extraction)

Why this fails:

  • No minimum name length (single-word names allowed)
  • No validation that name is a proper noun (not just capitalized generic term)
  • No blacklist for known generic patterns

V5 Improvement Strategy

Priority 1: Geographic Validation Enhancement

Goal: Eliminate wrong-country extractions (16.7% of errors)

Implementation:

  1. Context-based country validation:

    def _validate_country_context(self, sentence: str, name: str, inferred_country: str) -> Optional[str]:
        """
        Validate that inferred country is actually correct for this institution.
    
        Returns:
        - Explicit country code if found in sentence
        - None if explicit country contradicts inferred country
        - inferred_country only if no contradictory evidence
        """
    
  2. Explicit country mention detection:

    • Pattern: "[Name] in [Country]"
    • Pattern: "[Name], [City], [Country]"
    • Pattern: "[Country] institution [Name]"
  3. Contradiction detection:

    • If sentence contains "Malaysia" and inferred country is "NL" → reject extraction
    • If sentence contains "Islamic University" without "Netherlands" → reject for NL
  4. Confidence penalty for inferred-only country:

    • Explicit country in text: confidence +0.2
    • Inferred country only: confidence +0.0 (no bonus)
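
The steps above can be sketched as a single helper. This is an illustrative sketch only: the `COUNTRY_NAMES` lookup table and the standalone function name are assumptions, not the actual V5 code, which would live as a method on the extractor class.

```python
# Hypothetical sketch of the context-based country validator described above.
# COUNTRY_NAMES is an illustrative subset, not a complete mapping.
from typing import Optional

COUNTRY_NAMES = {
    "netherlands": "NL", "dutch": "NL",
    "malaysia": "MY", "malaysian": "MY",
    "belgium": "BE", "germany": "DE",
}

def validate_country_context(sentence: str, name: str,
                             inferred_country: Optional[str]) -> Optional[str]:
    """Return the explicit country found in the sentence, or the inferred
    country only when nothing in the text contradicts it."""
    lowered = sentence.lower()
    mentioned = {code for word, code in COUNTRY_NAMES.items() if word in lowered}
    if not mentioned:
        return inferred_country          # no evidence either way
    if inferred_country in mentioned:
        return inferred_country          # explicit confirmation
    # Contradiction: prefer the single explicit mention, else give up
    return mentioned.pop() if len(mentioned) == 1 else None
```

With this in place, a sentence like "University Malaysia is a leading research institution in Kuala Lumpur" yields MY rather than the inferred NL, so downstream code can reject the extraction or reassign the country.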

Expected impact: Reduce geographic errors from 16.7% to <5%


Priority 2: Entity Type Classification Rules

Goal: Filter out organizations, networks, and departments (25% of errors)

Implementation:

  1. Organization blacklist:

    ORGANIZATION_BLACKLIST = {
        # International organizations
        'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
        'International Federation of Library Associations',
        'International Council of Museums',
    
        # Networks and associations
        'Archive Net', 'Netwerk Oorlogsbronnen',
        'Museum Association', 'Archives Association',
    
        # Academic units (generic)
        'Studies', 'Department of', 'Faculty of',
        'School of', 'Institute for',
    }
    
  2. Entity type detection patterns:

    # Pattern: "X is a network of Y institutions"
    NETWORK_PATTERNS = [
        r'\b(\w+)\s+is\s+a\s+network',
        r'\b(\w+)\s+platform\s+connecting',
        r'\bnetwork\s+of\s+\d+\s+\w+',
    ]
    
    # Pattern: "X is an organization that"
    ORGANIZATION_PATTERNS = [
        r'\b(\w+)\s+is\s+an?\s+organization',
        r'\b(\w+)\s+is\s+an?\s+association',
        r'\b(\w+)\s+is\s+a\s+federation',
    ]
    
  3. Academic unit detection:

    • Reject if the name contains "Studies" without a proper institutional name
    • Reject: "Southeast Asian Studies" (bare department name)
    • Accept: "Southeast Asian Studies Library, Leiden University" (institution keyword "Library" plus a parent institution)
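
A minimal sketch combining the blacklist and pattern checks above into one filter. The blacklist here is abbreviated and the standalone function name is an assumption; the real method would take the extractor's full blacklist.

```python
import re

# Abbreviated copy of the organization blacklist above, for illustration.
ORGANIZATION_BLACKLIST = {"IFLA", "UNESCO", "ICOM", "Archive Net"}

FILTER_PATTERNS = [
    r"\bis\s+a\s+network\b",            # "X is a network of Y institutions"
    r"\bnetwork\s+of\s+\d+",
    r"\bplatform\s+connecting\b",
    r"\bis\s+an?\s+organi[sz]ation\b",  # "X is an organization that ..."
    r"\bis\s+an?\s+association\b",
    r"\bis\s+a\s+federation\b",
]

def is_organization_or_network(name: str, sentence: str) -> bool:
    """True when the candidate should be filtered out, not extracted."""
    if name in ORGANIZATION_BLACKLIST:
        return True
    return any(re.search(p, sentence, re.IGNORECASE) for p in FILTER_PATTERNS)
```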

Expected impact: Reduce organization/network errors from 25% to <10%


Priority 3: Proper Name Validation

Goal: Filter generic descriptors and concepts (8.3% of errors)

Implementation:

  1. Minimum proper name requirements:

    def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
        """
        Validate that name is a proper institution name, not a generic term.
    
        Requirements:
        - Minimum 2 words for most types (except compounds like "Rijksmuseum")
        - At least one word that's NOT just the institution type keyword
        - Not in generic descriptor blacklist
        """
    
  2. Generic descriptor blacklist:

    GENERIC_DESCRIPTORS = {
        'Library FabLab',        # Service concept
        'Museum Café',           # Facility type
        'Archive Reading Room',  # Room/service
        'Museum Shop',
        'Library Makerspace',
    
        # Too generic
        'Dutch Museum',
        'Local Archive',
        'University Library',  # Without specific university name
    }
    
  3. Compound word validation:

    • Single-word names allowed ONLY if:
      • Contains an institution keyword as suffix (e.g., Rijksmuseum)
      • First letter capitalized (proper noun)
      • Not in generic blacklist
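
The three rules above can be sketched as follows. The keyword set and descriptor blacklist are abbreviated, and the standalone function name is an assumption; the real method also receives the surrounding sentence.

```python
# Illustrative sketch of the proper-name validator; abbreviated word lists.
TYPE_KEYWORDS = {"museum", "library", "archive", "gallery"}
GENERIC_DESCRIPTORS = {"library fablab", "museum café", "dutch museum",
                       "university library", "local archive"}

def is_proper_institutional_name(name: str) -> bool:
    """Reject generic descriptors; allow single words only for
    capitalised compounds ending in a type keyword (e.g. Rijksmuseum)."""
    lowered = name.lower()
    if lowered in GENERIC_DESCRIPTORS:
        return False
    words = name.split()
    if len(words) == 1:
        return (name[0].isupper()
                and any(lowered.endswith(k) and lowered != k
                        for k in TYPE_KEYWORDS))
    # Multi-word: need at least one word that is not just a type keyword
    return any(w.lower() not in TYPE_KEYWORDS for w in words)
```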

Expected impact: Reduce concept/descriptor errors from 8.3% to <5%


Priority 4: Enhanced Confidence Scoring

Current V4 scoring (base 0.3):

  • +0.2 if has institution type
  • +0.1 if has location
  • +0.3 if has identifier
  • +0.2 if 2-6 words
  • +0.2 if explicit "is a" pattern

V5 improved scoring:

def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.
    
    Base: 0.2 (lower than v4 to penalize uncertain extractions)
    
    Positive signals:
    - +0.3 Has institution type keyword
    - +0.2 Has explicit location in text (city OR country)
    - +0.4 Has identifier (ISIL/Wikidata/VIAF)
    - +0.2 Name is 2-6 words
    - +0.2 Explicit "is a" or "located in" pattern
    - +0.1 Country from explicit mention (not just inferred)
    
    Negative signals:
    - -0.2 Single-word name without compound validation
    - -0.3 Name matches generic descriptor pattern
    - -0.2 Country only inferred (not mentioned in text)
    - -0.5 Name in organization/network blacklist
    
    Threshold: 0.6 (increased from v4's 0.5)
    """
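
The docstring above can be written out directly as arithmetic. This sketch replaces the real extraction state with boolean inputs; the parameter names are assumptions for illustration.

```python
def calculate_confidence_v5(*, has_type: bool, has_explicit_location: bool,
                            has_identifier: bool, word_count: int,
                            has_definition_pattern: bool,
                            country_source: str,  # 'explicit' | 'inferred' | 'none'
                            single_word_uncompounded: bool = False,
                            matches_generic: bool = False,
                            blacklisted: bool = False) -> float:
    """Sum the V5 positive and negative signals, clamped to [0, 1]."""
    score = 0.2                                    # lower base than V4's 0.3
    if has_type:                      score += 0.3
    if has_explicit_location:         score += 0.2
    if has_identifier:                score += 0.4
    if 2 <= word_count <= 6:          score += 0.2
    if has_definition_pattern:        score += 0.2
    if country_source == "explicit":  score += 0.1
    if single_word_uncompounded:      score -= 0.2
    if matches_generic:               score -= 0.3
    if country_source == "inferred":  score -= 0.2
    if blacklisted:                   score -= 0.5
    return max(0.0, min(1.0, score))
```

Note how the penalties interact with the 0.6 threshold: a typed, 2-word name whose country is only inferred scores 0.2 + 0.3 + 0.2 - 0.2 = 0.5 and is rejected unless it picks up another positive signal.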

Expected impact: Better separation of valid (>0.6) vs invalid (<0.6) institutions


Implementation Plan

Phase 1: Core Validation Enhancements (High Priority)

Files to modify:

  • src/glam_extractor/extractors/nlp_extractor.py

New methods to add:

  1. _validate_country_context(sentence, name, inferred_country) -> Optional[str]

    • Detect explicit country mentions
    • Check for contradictions with inferred country
    • Return validated country or None
  2. _is_organization_or_network(name, sentence) -> bool

    • Check against organization blacklist
    • Detect network/association patterns
    • Return True if should be filtered
  3. _is_proper_institutional_name(name, sentence) -> bool

    • Validate minimum name requirements
    • Check generic descriptor blacklist
    • Validate compound words
  4. _calculate_confidence_v5(...) -> float

    • Implement enhanced scoring with penalties
    • Use country_source parameter

Modified methods:

  1. _extract_entities(text, inferred_country):

    # BEFORE (v4):
    if not country and inferred_country:
        country = inferred_country
    
    # AFTER (v5):
    # Validate country from context first
    validated_country = self._validate_country_context(
        sentence, name, inferred_country
    )
    if validated_country:
        country = validated_country
        country_source = 'explicit'
    elif not country and inferred_country:
        country = inferred_country
        country_source = 'inferred'
    else:
        country_source = 'none'
    
  2. _extract_institution_names(sentence):

    # Add validation before adding to names list
    for potential_name in extracted_names:
        # V5: Filter organizations and networks
        if self._is_organization_or_network(potential_name, sentence):
            continue
    
        # V5: Validate proper institutional name
        if not self._is_proper_institutional_name(potential_name, sentence):
            continue
    
        names.append(potential_name)
    

Phase 2: Quality Filter Updates (Medium Priority)

File: scripts/batch_extract_institutions.py

Enhancements to apply_quality_filters() method:

  1. Add Dutch-specific validation for NL extractions:

    # For NL institutions, require explicit city mention
    if country == 'NL' and not city:
        removed_reasons['nl_missing_city'] += 1
        continue
    
  2. Add organization/network detection:

    # Check organization blacklist (case-insensitive: the blacklist
    # mixes acronyms like 'IFLA' with mixed-case names like 'Archive Net')
    if name.upper() in {entry.upper() for entry in ORGANIZATION_BLACKLIST}:
        removed_reasons['organization_not_institution'] += 1
        continue
    

Phase 3: Testing and Validation

Test dataset: Dutch conversation files only (subset for faster iteration)

Validation methodology:

  1. Run v5 extraction on Dutch conversations
  2. Compare results to v4 (12 institutions)
  3. Validate new extractions via web search
  4. Calculate precision, recall, F1 score
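
Step 4 can use a small generic helper (not part of the extractor) that computes the three scores from manually validated counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from true/false positive and false
    negative counts obtained by manual web validation."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# V4 baseline on Dutch data: 6 valid of 12 extracted -> precision 0.5
```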

Success criteria:

  • Precision: ≥75% (up from 50%)
  • False positive reduction: ≥50% (from 6 errors to ≤3)
  • Valid institutions preserved: 6/6 (no regression)

Testing Strategy

Unit Tests

# tests/extractors/test_nlp_extractor_v5.py

def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to Netherlands."""
    extractor = InstitutionExtractor()
    
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text, 
        conversation_name="Dutch_institutions"  # Inferred country: NL
    )
    
    # Should NOT extract or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'

def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    
    # Should not extract IFLA as an institution
    assert len(result.value) == 0 or 'IFLA' not in [inst.name for inst in result.value]

def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as institution."""
    extractor = InstitutionExtractor()
    
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    
    # Should not extract generic service descriptor
    assert len(result.value) == 0 or 'Library FabLab' not in [inst.name for inst in result.value]

Integration Tests

def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.
    
    Expected: Precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()
    
    # Manual validation required
    # Expected: 6-10 institutions extracted (vs 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10

Rollout Plan

Stage 1: Dutch-only validation (THIS SPRINT)

  • Implement v5 improvements
  • Test on Dutch conversations only
  • Validate precision ≥75%
  • Compare v4 vs v5 results

Stage 2: Multi-country validation

  • Test on Brazil, Mexico, Chile (v4 countries)
  • Ensure improvements don't harm other regions
  • Adjust blacklists for region-specific patterns

Stage 3: Full re-extraction

  • Run v5 on all 139 conversation files
  • Generate new output/v5_institutions.json
  • Validate sample across multiple countries
  • Update documentation

Risk Mitigation

Risk 1: Over-filtering (false negatives)

Concern: Stricter validation might reject valid institutions

Mitigation:

  • Preserve v4 output for comparison
  • Log all filtered institutions with reason
  • Manual review of filtered items
  • Adjustable confidence threshold (default 0.6, can lower to 0.55)

Risk 2: Regional bias

Concern: Dutch-optimized rules might not work globally

Mitigation:

  • Blacklists should be culturally neutral where possible
  • Test on diverse regions (Asia, Africa, Latin America)
  • Separate region-specific rules (e.g., DUTCH_SPECIFIC_FILTERS)

Risk 3: Regression on valid institutions

Concern: Might lose some of the 6 valid Dutch institutions

Mitigation:

  • Run v5 on same Dutch conversations
  • Compare extracted names to v4's 6 valid institutions
  • If any valid institution missing, adjust filters
  • Maintain whitelist of known-good institutions

Metrics and Monitoring

Key Metrics

| Metric | V4 Baseline | V5 Target | Measurement Method |
|---|---|---|---|
| Precision (Dutch) | 50% | ≥75% | Web validation of sample |
| Geographic errors | 16.7% | <5% | Manual review |
| Organization errors | 25% | <10% | Blacklist matching |
| Generic descriptor errors | 8.3% | <5% | Pattern matching |
| Total extracted (Dutch) | 12 | 6-10 | Count after filters |
| Valid institutions preserved | 6/6 | 6/6 | Compare to v4 valid list |

Success Criteria

Must achieve:

  • Precision ≥75% on Dutch institutions
  • All 6 v4-valid institutions still extracted
  • ≤3 false positives (down from 6)

Nice to have:

  • Precision ≥80%
  • No geographic errors (0%)
  • Confidence scores >0.7 for all valid institutions

Files Modified

New files:

  • docs/V5_EXTRACTION_DESIGN.md (this document)
  • tests/extractors/test_nlp_extractor_v5.py (unit tests)
  • output/v5_dutch_institutions.json (v5 extraction results)
  • output/V5_VALIDATION_COMPARISON.md (v4 vs v5 analysis)

Modified files:

  • src/glam_extractor/extractors/nlp_extractor.py (core improvements)
  • scripts/batch_extract_institutions.py (quality filter updates)

Next Actions

  1. Design v5 improvements (this document)
  2. Implement core validation methods (geographic, organization, proper name)
  3. Update confidence scoring (v5 algorithm)
  4. Run v5 extraction on Dutch conversations
  5. Validate v5 results (web search + comparison to v4)
  6. Measure precision improvement (target ≥75%)
  7. Document findings (V5_VALIDATION_COMPARISON.md)

Status: Design complete, ready for implementation
Owner: GLAM extraction team
Review date: 2025-11-07
Implementation timeline: 1-2 days