
V5 Extraction Design - Improvements Based on Dutch Validation

Date: 2025-11-07
Baseline: V4 extraction with 50% precision on Dutch institutions
Goal: Achieve 75-90% precision through targeted improvements


Problem Statement

V4 extraction of Dutch heritage institutions achieved 50% precision (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.

Validated Error Categories (from 12 Dutch extractions):

| Error Type | Count | % | Examples |
|---|---|---|---|
| Geographic errors | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| Organizations vs institutions | 1 | 8.3% | IFLA (federation) extracted as library |
| Networks/platforms | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| Academic departments | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| Concepts/services | 1 | 8.3% | Library FabLab (service type, not a named institution) |
| Valid institutions | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |

Root Cause Analysis

1. Geographic Validation Weakness

Current V4 behavior:

  • Infers country from conversation filename (e.g., "Dutch_institutions.json" → NL)
  • Problem: Applies inferred country to ALL institutions in conversation, even if text explicitly mentions other countries
  • Example error: Conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL

V4 code location: nlp_extractor.py:509-511

# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country

Why this fails:

  • No validation that institution actually belongs to inferred country
  • No check for explicit country mentions in the surrounding context
  • Silent override of actual geographic context

2. Entity Type Classification Gaps

Current V4 behavior:

  • Classifies based on keywords (museum, library, archive, etc.)
  • Problem: Doesn't distinguish between:
    • Physical institutions (Rijksmuseum)
    • Organizations (IFLA, UNESCO, ICOM)
    • Networks (Archive Net, Museum Association)
    • Academic units (departments, study programmes)

V4 code location: nlp_extractor.py:606-800 (_extract_institution_names)

Why this fails:

  • Pattern matching on "library" catches "IFLA Library" (organization name)
  • No semantic understanding of entity hierarchy (network vs. member institution)
  • No blacklist for known organization/network names

3. Proper Name Validation Missing

Current V4 behavior:

  • Extracts capitalized phrases containing institution keywords
  • Problem: Accepts generic descriptors and concepts as institution names

Examples of false positives:

  • "Library FabLab" (service concept, like "library café")
  • "Archive Net" (network abbreviation)
  • "Dutch Museum" (too generic, likely part of discussion)

V4 code location: nlp_extractor.py:667-700 (Pattern 1a/1b extraction)

Why this fails:

  • No minimum name length (single-word names allowed)
  • No validation that name is a proper noun (not just capitalized generic term)
  • No blacklist for known generic patterns

V5 Improvement Strategy

Priority 1: Geographic Validation Enhancement

Goal: Eliminate wrong-country extractions (16.7% of errors)

Implementation:

  1. Context-based country validation:

    def _validate_country_context(self, sentence: str, name: str, inferred_country: str) -> Optional[str]:
        """
        Validate that inferred country is actually correct for this institution.
    
        Returns:
        - Explicit country code if found in sentence
        - None if explicit country contradicts inferred country
        - inferred_country only if no contradictory evidence
        """
    
  2. Explicit country mention detection:

    • Pattern: "[Name] in [Country]"
    • Pattern: "[Name], [City], [Country]"
    • Pattern: "[Country] institution [Name]"
  3. Contradiction detection:

    • If sentence contains "Malaysia" and inferred country is "NL" → reject extraction
    • If sentence contains "Islamic University" without "Netherlands" → reject for NL
  4. Confidence penalty for inferred-only country:

    • Explicit country in text: confidence +0.2
    • Inferred country only: confidence +0.0 (no bonus)
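
The steps above can be sketched as a single helper. This is an illustrative sketch only: the `COUNTRY_NAMES` lookup table and the standalone function name are assumptions, not the actual V5 code, which would live as a method on the extractor class.

```python
# Hypothetical sketch of the context-based country validator described above.
# COUNTRY_NAMES is an illustrative subset, not a complete mapping.
from typing import Optional

COUNTRY_NAMES = {
    "netherlands": "NL", "dutch": "NL",
    "malaysia": "MY", "malaysian": "MY",
    "belgium": "BE", "germany": "DE",
}

def validate_country_context(sentence: str, name: str,
                             inferred_country: Optional[str]) -> Optional[str]:
    """Return the explicit country found in the sentence, or the inferred
    country only when nothing in the text contradicts it."""
    lowered = sentence.lower()
    mentioned = {code for word, code in COUNTRY_NAMES.items() if word in lowered}
    if not mentioned:
        return inferred_country          # no evidence either way
    if inferred_country in mentioned:
        return inferred_country          # explicit confirmation
    # Contradiction: prefer the single explicit mention, else give up
    return mentioned.pop() if len(mentioned) == 1 else None
```

With this in place, a sentence like "University Malaysia is a leading research institution in Kuala Lumpur" yields MY rather than the inferred NL, so downstream code can reject the extraction or reassign the country.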

Expected impact: Reduce geographic errors from 16.7% to <5%


Priority 2: Entity Type Classification Rules

Goal: Filter out organizations, networks, and departments (25% of errors)

Implementation:

  1. Organization blacklist:

    ORGANIZATION_BLACKLIST = {
        # International organizations
        'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
        'International Federation of Library Associations',
        'International Council of Museums',
    
        # Networks and associations
        'Archive Net', 'Netwerk Oorlogsbronnen',
        'Museum Association', 'Archives Association',
    
        # Academic units (generic)
        'Studies', 'Department of', 'Faculty of',
        'School of', 'Institute for',
    }
    
  2. Entity type detection patterns:

    # Pattern: "X is a network of Y institutions"
    NETWORK_PATTERNS = [
        r'\b(\w+)\s+is\s+a\s+network',
        r'\b(\w+)\s+platform\s+connecting',
        r'\bnetwork\s+of\s+\d+\s+\w+',
    ]
    
    # Pattern: "X is an organization that"
    ORGANIZATION_PATTERNS = [
        r'\b(\w+)\s+is\s+an?\s+organization',
        r'\b(\w+)\s+is\s+an?\s+association',
        r'\b(\w+)\s+is\s+a\s+federation',
    ]
    
  3. Academic unit detection:

    • Reject if the name contains "Studies" without a proper institutional name
    • Reject: "Southeast Asian Studies" (bare department name)
    • Accept: "Southeast Asian Studies Library, Leiden University" (institution keyword "Library" plus a parent institution)
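
A minimal sketch combining the blacklist and pattern checks above into one filter. The blacklist here is abbreviated and the standalone function name is an assumption; the real method would take the extractor's full blacklist.

```python
import re

# Abbreviated copy of the organization blacklist above, for illustration.
ORGANIZATION_BLACKLIST = {"IFLA", "UNESCO", "ICOM", "Archive Net"}

FILTER_PATTERNS = [
    r"\bis\s+a\s+network\b",            # "X is a network of Y institutions"
    r"\bnetwork\s+of\s+\d+",
    r"\bplatform\s+connecting\b",
    r"\bis\s+an?\s+organi[sz]ation\b",  # "X is an organization that ..."
    r"\bis\s+an?\s+association\b",
    r"\bis\s+a\s+federation\b",
]

def is_organization_or_network(name: str, sentence: str) -> bool:
    """True when the candidate should be filtered out, not extracted."""
    if name in ORGANIZATION_BLACKLIST:
        return True
    return any(re.search(p, sentence, re.IGNORECASE) for p in FILTER_PATTERNS)
```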

Expected impact: Reduce organization/network errors from 25% to <10%


Priority 3: Proper Name Validation

Goal: Filter generic descriptors and concepts (8.3% of errors)

Implementation:

  1. Minimum proper name requirements:

    def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
        """
        Validate that name is a proper institution name, not a generic term.
    
        Requirements:
        - Minimum 2 words for most types (except compounds like "Rijksmuseum")
        - At least one word that's NOT just the institution type keyword
        - Not in generic descriptor blacklist
        """
    
  2. Generic descriptor blacklist:

    GENERIC_DESCRIPTORS = {
        'Library FabLab',        # Service concept
        'Museum Café',           # Facility type
        'Archive Reading Room',  # Room/service
        'Museum Shop',
        'Library Makerspace',
    
        # Too generic
        'Dutch Museum',
        'Local Archive',
        'University Library',  # Without specific university name
    }
    
  3. Compound word validation:

    • Single-word names allowed ONLY if:
      • Contains an institution keyword as suffix (e.g., Rijksmuseum)
      • First letter capitalized (proper noun)
      • Not in generic blacklist
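
The three rules above can be sketched as follows. The keyword set and descriptor blacklist are abbreviated, and the standalone function name is an assumption; the real method also receives the surrounding sentence.

```python
# Illustrative sketch of the proper-name validator; abbreviated word lists.
TYPE_KEYWORDS = {"museum", "library", "archive", "gallery"}
GENERIC_DESCRIPTORS = {"library fablab", "museum café", "dutch museum",
                       "university library", "local archive"}

def is_proper_institutional_name(name: str) -> bool:
    """Reject generic descriptors; allow single words only for
    capitalised compounds ending in a type keyword (e.g. Rijksmuseum)."""
    lowered = name.lower()
    if lowered in GENERIC_DESCRIPTORS:
        return False
    words = name.split()
    if len(words) == 1:
        return (name[0].isupper()
                and any(lowered.endswith(k) and lowered != k
                        for k in TYPE_KEYWORDS))
    # Multi-word: need at least one word that is not just a type keyword
    return any(w.lower() not in TYPE_KEYWORDS for w in words)
```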

Expected impact: Reduce concept/descriptor errors from 8.3% to <5%


Priority 4: Enhanced Confidence Scoring

Current V4 scoring (base 0.3):

  • +0.2 if has institution type
  • +0.1 if has location
  • +0.3 if has identifier
  • +0.2 if 2-6 words
  • +0.2 if explicit "is a" pattern

V5 improved scoring:

def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.
    
    Base: 0.2 (lower than v4 to penalize uncertain extractions)
    
    Positive signals:
    - +0.3 Has institution type keyword
    - +0.2 Has explicit location in text (city OR country)
    - +0.4 Has identifier (ISIL/Wikidata/VIAF)
    - +0.2 Name is 2-6 words
    - +0.2 Explicit "is a" or "located in" pattern
    - +0.1 Country from explicit mention (not just inferred)
    
    Negative signals:
    - -0.2 Single-word name without compound validation
    - -0.3 Name matches generic descriptor pattern
    - -0.2 Country only inferred (not mentioned in text)
    - -0.5 Name in organization/network blacklist
    
    Threshold: 0.6 (increased from v4's 0.5)
    """
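
The docstring above can be written out directly as arithmetic. This sketch replaces the real extraction state with boolean inputs; the parameter names are assumptions for illustration.

```python
def calculate_confidence_v5(*, has_type: bool, has_explicit_location: bool,
                            has_identifier: bool, word_count: int,
                            has_definition_pattern: bool,
                            country_source: str,  # 'explicit' | 'inferred' | 'none'
                            single_word_uncompounded: bool = False,
                            matches_generic: bool = False,
                            blacklisted: bool = False) -> float:
    """Sum the V5 positive and negative signals, clamped to [0, 1]."""
    score = 0.2                                    # lower base than V4's 0.3
    if has_type:                      score += 0.3
    if has_explicit_location:         score += 0.2
    if has_identifier:                score += 0.4
    if 2 <= word_count <= 6:          score += 0.2
    if has_definition_pattern:        score += 0.2
    if country_source == "explicit":  score += 0.1
    if single_word_uncompounded:      score -= 0.2
    if matches_generic:               score -= 0.3
    if country_source == "inferred":  score -= 0.2
    if blacklisted:                   score -= 0.5
    return max(0.0, min(1.0, score))
```

Note how the penalties interact with the 0.6 threshold: a typed, 2-word name whose country is only inferred scores 0.2 + 0.3 + 0.2 - 0.2 = 0.5 and is rejected unless it picks up another positive signal.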

Expected impact: Better separation of valid (>0.6) vs invalid (<0.6) institutions


Implementation Plan

Phase 1: Core Validation Enhancements (High Priority)

Files to modify:

  • src/glam_extractor/extractors/nlp_extractor.py

New methods to add:

  1. _validate_country_context(sentence, name, inferred_country) -> Optional[str]

    • Detect explicit country mentions
    • Check for contradictions with inferred country
    • Return validated country or None
  2. _is_organization_or_network(name, sentence) -> bool

    • Check against organization blacklist
    • Detect network/association patterns
    • Return True if should be filtered
  3. _is_proper_institutional_name(name, sentence) -> bool

    • Validate minimum name requirements
    • Check generic descriptor blacklist
    • Validate compound words
  4. _calculate_confidence_v5(...) -> float

    • Implement enhanced scoring with penalties
    • Use country_source parameter

Modified methods:

  1. _extract_entities(text, inferred_country):

    # BEFORE (v4):
    if not country and inferred_country:
        country = inferred_country
    
    # AFTER (v5):
    # Validate country from context first
    validated_country = self._validate_country_context(
        sentence, name, inferred_country
    )
    if validated_country:
        country = validated_country
        country_source = 'explicit'
    elif not country and inferred_country:
        country = inferred_country
        country_source = 'inferred'
    else:
        country_source = 'none'
    
  2. _extract_institution_names(sentence):

    # Add validation before adding to names list
    for potential_name in extracted_names:
        # V5: Filter organizations and networks
        if self._is_organization_or_network(potential_name, sentence):
            continue
    
        # V5: Validate proper institutional name
        if not self._is_proper_institutional_name(potential_name, sentence):
            continue
    
        names.append(potential_name)
    

Phase 2: Quality Filter Updates (Medium Priority)

File: scripts/batch_extract_institutions.py

Enhancements to apply_quality_filters() method:

  1. Add Dutch-specific validation for NL extractions:

    # For NL institutions, require explicit city mention
    if country == 'NL' and not city:
        removed_reasons['nl_missing_city'] += 1
        continue
    
  2. Add organization/network detection:

    # Check organization blacklist (case-insensitive: the blacklist
    # mixes acronyms like 'IFLA' with mixed-case names like 'Archive Net')
    if name.upper() in {entry.upper() for entry in ORGANIZATION_BLACKLIST}:
        removed_reasons['organization_not_institution'] += 1
        continue
    

Phase 3: Testing and Validation

Test dataset: Dutch conversation files only (subset for faster iteration)

Validation methodology:

  1. Run v5 extraction on Dutch conversations
  2. Compare results to v4 (12 institutions)
  3. Validate new extractions via web search
  4. Calculate precision, recall, F1 score
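
Step 4 can use a small generic helper (not part of the extractor) that computes the three scores from manually validated counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from true/false positive and false
    negative counts obtained by manual web validation."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# V4 baseline on Dutch data: 6 valid of 12 extracted -> precision 0.5
```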

Success criteria:

  • Precision: ≥75% (up from 50%)
  • False positive reduction: ≥50% (from 6 errors to ≤3)
  • Valid institutions preserved: 6/6 (no regression)

Testing Strategy

Unit Tests

# tests/extractors/test_nlp_extractor_v5.py

def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to Netherlands."""
    extractor = InstitutionExtractor()
    
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text, 
        conversation_name="Dutch_institutions"  # Inferred country: NL
    )
    
    # Should NOT extract or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'

def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    
    # Should not extract IFLA as an institution
    assert len(result.value) == 0 or 'IFLA' not in [inst.name for inst in result.value]

def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as institution."""
    extractor = InstitutionExtractor()
    
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    
    # Should not extract generic service descriptor
    assert len(result.value) == 0 or 'Library FabLab' not in [inst.name for inst in result.value]

Integration Tests

def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.
    
    Expected: Precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()
    
    # Manual validation required
    # Expected: 6-10 institutions extracted (vs 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10

Rollout Plan

Stage 1: Dutch-only validation (THIS SPRINT)

  • Implement v5 improvements
  • Test on Dutch conversations only
  • Validate precision ≥75%
  • Compare v4 vs v5 results

Stage 2: Multi-country validation

  • Test on Brazil, Mexico, Chile (v4 countries)
  • Ensure improvements don't harm other regions
  • Adjust blacklists for region-specific patterns

Stage 3: Full re-extraction

  • Run v5 on all 139 conversation files
  • Generate new output/v5_institutions.json
  • Validate sample across multiple countries
  • Update documentation

Risk Mitigation

Risk 1: Over-filtering (false negatives)

Concern: Stricter validation might reject valid institutions

Mitigation:

  • Preserve v4 output for comparison
  • Log all filtered institutions with reason
  • Manual review of filtered items
  • Adjustable confidence threshold (default 0.6, can lower to 0.55)

Risk 2: Regional bias

Concern: Dutch-optimized rules might not work globally

Mitigation:

  • Blacklists should be culturally neutral where possible
  • Test on diverse regions (Asia, Africa, Latin America)
  • Separate region-specific rules (e.g., DUTCH_SPECIFIC_FILTERS)

Risk 3: Regression on valid institutions

Concern: Might lose some of the 6 valid Dutch institutions

Mitigation:

  • Run v5 on same Dutch conversations
  • Compare extracted names to v4's 6 valid institutions
  • If any valid institution missing, adjust filters
  • Maintain whitelist of known-good institutions

Metrics and Monitoring

Key Metrics

| Metric | V4 Baseline | V5 Target | Measurement Method |
|---|---|---|---|
| Precision (Dutch) | 50% | ≥75% | Web validation of sample |
| Geographic errors | 16.7% | <5% | Manual review |
| Organization errors | 25% | <10% | Blacklist matching |
| Generic descriptor errors | 8.3% | <5% | Pattern matching |
| Total extracted (Dutch) | 12 | 6-10 | Count after filters |
| Valid institutions preserved | 6/6 | 6/6 | Compare to v4 valid list |

Success Criteria

Must achieve:

  • Precision ≥75% on Dutch institutions
  • All 6 v4-valid institutions still extracted
  • ≤3 false positives (down from 6)

Nice to have:

  • Precision ≥80%
  • No geographic errors (0%)
  • Confidence scores >0.7 for all valid institutions

Files Modified

New files:

  • docs/V5_EXTRACTION_DESIGN.md (this document)
  • tests/extractors/test_nlp_extractor_v5.py (unit tests)
  • output/v5_dutch_institutions.json (v5 extraction results)
  • output/V5_VALIDATION_COMPARISON.md (v4 vs v5 analysis)

Modified files:

  • src/glam_extractor/extractors/nlp_extractor.py (core improvements)
  • scripts/batch_extract_institutions.py (quality filter updates)

Next Actions

  1. Design v5 improvements (this document)
  2. Implement core validation methods (geographic, organization, proper name)
  3. Update confidence scoring (v5 algorithm)
  4. Run v5 extraction on Dutch conversations
  5. Validate v5 results (web search + comparison to v4)
  6. Measure precision improvement (target ≥75%)
  7. Document findings (V5_VALIDATION_COMPARISON.md)

Status: Design complete, ready for implementation
Owner: GLAM extraction team
Review date: 2025-11-07
Implementation timeline: 1-2 days