# V5 Extraction Design - Improvements Based on Dutch Validation
**Date:** 2025-11-07
**Baseline:** V4 extraction with 50% precision on Dutch institutions
**Goal:** Achieve 75-90% precision through targeted improvements
---
## Problem Statement
V4 extraction of Dutch heritage institutions achieved **50% precision** (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.
### Validated Error Categories (from 12 Dutch extractions):
| Error Type | Count | % | Examples |
|------------|-------|---|----------|
| **Geographic errors** | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| **Organizations vs Institutions** | 1 | 8.3% | IFLA (federation) extracted as library |
| **Networks/Platforms** | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| **Academic Departments** | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| **Concepts/Services** | 1 | 8.3% | Library FabLab (service type, not named institution) |
| **Valid institutions** | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |
---
## Root Cause Analysis
### 1. Geographic Validation Weakness
**Current V4 behavior:**
- Infers country from conversation filename (e.g., "Dutch_institutions.json" → NL)
- **Problem:** Applies inferred country to ALL institutions in conversation, even if text explicitly mentions other countries
- Example error: Conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL
**V4 code location:** `nlp_extractor.py:509-511`
```python
# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country
```
**Why this fails:**
- No validation that institution actually belongs to inferred country
- No check for explicit country mentions in the surrounding context
- Silent override of actual geographic context
### 2. Entity Type Classification Gaps
**Current V4 behavior:**
- Classifies based on keywords (museum, library, archive, etc.)
- **Problem:** Doesn't distinguish between:
  - **Physical institutions** (Rijksmuseum) ✅
  - **Organizations** (IFLA, UNESCO, ICOM) ❌
  - **Networks** (Archive Net, Museum Association) ❌
  - **Academic units** (departments, study programmes) ❌
**V4 code location:** `nlp_extractor.py:606-800` (`_extract_institution_names`)
**Why this fails:**
- Pattern matching on "library" catches "IFLA Library" (organization name)
- No semantic understanding of entity hierarchy (network vs. member institution)
- No blacklist for known organization/network names
### 3. Proper Name Validation Missing
**Current V4 behavior:**
- Extracts capitalized phrases containing institution keywords
- **Problem:** Accepts generic descriptors and concepts as institution names
**Examples of false positives:**
- "Library FabLab" (service concept, like "library café")
- "Archive Net" (network abbreviation)
- "Dutch Museum" (too generic, likely part of discussion)
**V4 code location:** `nlp_extractor.py:667-700` (Pattern 1a/1b extraction)
**Why this fails:**
- No minimum name length (single-word names allowed)
- No validation that name is a proper noun (not just capitalized generic term)
- No blacklist for known generic patterns
---
## V5 Improvement Strategy
### Priority 1: Geographic Validation Enhancement
**Goal:** Eliminate wrong-country extractions (16.7% of errors)
**Implementation:**
1. **Context-based country validation:**
```python
def _validate_country_context(self, sentence: str, name: str, inferred_country: str) -> Optional[str]:
    """
    Validate that the inferred country is actually correct for this institution.

    Returns:
        - Explicit country code if found in sentence
        - None if explicit country contradicts inferred country
        - inferred_country only if no contradictory evidence
    """
```
2. **Explicit country mention detection:**
   - Pattern: "[Name] in [Country]"
   - Pattern: "[Name], [City], [Country]"
   - Pattern: "[Country] institution [Name]"
3. **Contradiction detection:**
   - If sentence contains "Malaysia" and inferred country is "NL" → reject extraction
   - If sentence contains "Islamic University" without "Netherlands" → reject for NL
4. **Confidence penalty for inferred-only country:**
   - Explicit country in text: confidence +0.2
   - Inferred country only: confidence +0.0 (no bonus)
**Expected impact:** Reduce geographic errors from 16.7% to <5%
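To make the contradiction check concrete, here is a minimal sketch of context-based country validation. The `COUNTRY_NAMES` table and the standalone `validate_country_context` function are illustrative assumptions only; the real method would live on the extractor class and use its full country gazetteer.

```python
import re
from typing import Optional

# Hypothetical country-name → ISO-code table for illustration only;
# the extractor would use its full gazetteer.
COUNTRY_NAMES = {
    "netherlands": "NL",
    "malaysia": "MY",
    "germany": "DE",
    "belgium": "BE",
}

def validate_country_context(sentence: str, inferred_country: Optional[str]) -> Optional[str]:
    """Return a validated country code, or None when an explicit
    mention in the sentence contradicts the inferred country."""
    mentioned = {
        code for name, code in COUNTRY_NAMES.items()
        if re.search(rf"\b{name}\b", sentence, re.IGNORECASE)
    }
    if not mentioned:
        return inferred_country              # no evidence either way
    if inferred_country in mentioned:
        return inferred_country              # explicit confirmation
    if inferred_country is None and len(mentioned) == 1:
        return mentioned.pop()               # explicit country, nothing inferred
    return None                              # contradiction: reject the extraction
```

Under this scheme a sentence mentioning Malaysia inside a conversation inferred as NL returns `None`, so the caller drops the extraction instead of silently assigning `country=NL`.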
---
### Priority 2: Entity Type Classification Rules
**Goal:** Filter out organizations, networks, and departments (25% of errors)
**Implementation:**
1. **Organization blacklist:**
```python
ORGANIZATION_BLACKLIST = {
    # International organizations
    'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
    'International Federation of Library Associations',
    'International Council of Museums',
    # Networks and associations
    'Archive Net', 'Netwerk Oorlogsbronnen',
    'Museum Association', 'Archives Association',
    # Academic units (generic)
    'Studies', 'Department of', 'Faculty of',
    'School of', 'Institute for',
}
```
2. **Entity type detection patterns:**
```python
# Pattern: "X is a network of Y institutions"
NETWORK_PATTERNS = [
    r'\b(\w+)\s+is\s+a\s+network',
    r'\b(\w+)\s+platform\s+connecting',
    r'\bnetwork\s+of\s+\d+\s+\w+',
]

# Pattern: "X is an organization that"
ORGANIZATION_PATTERNS = [
    r'\b(\w+)\s+is\s+an?\s+organization',
    r'\b(\w+)\s+is\s+an?\s+association',
    r'\b(\w+)\s+is\s+an?\s+federation',
]
```
3. **Academic unit detection:**
   - Reject if name contains "Studies" without a proper institutional name
     - Reject: "Southeast Asian Studies" (department, not an institution)
     - Accept: "Southeast Asian Studies Library, Leiden University" ("Library" keyword plus named parent institution)
**Expected impact:** Reduce organization/network errors from 25% to <10%
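The blacklist and pattern checks above can be combined into a single predicate. This is a minimal sketch under assumptions: the blacklist is trimmed for illustration, and `FILTER_PATTERNS` merges the network and organization patterns into one list of sentence-level cues.

```python
import re

# Trimmed illustrative subset of the blacklist defined above.
ORGANIZATION_BLACKLIST = {"IFLA", "UNESCO", "ICOM", "ICOMOS", "ICA", "ARCHIVE NET"}

# Sentence-level cues that the candidate is a network or organization.
FILTER_PATTERNS = [
    r"\bis\s+a\s+network\b",
    r"\bplatform\s+connecting\b",
    r"\bnetwork\s+of\s+\d+\b",
    r"\bis\s+an?\s+(organization|association|federation)\b",
]

def is_organization_or_network(name: str, sentence: str) -> bool:
    """True when the candidate should be filtered out as an organization,
    association, or network rather than a physical institution."""
    if name.upper() in ORGANIZATION_BLACKLIST:
        return True
    return any(re.search(p, sentence, re.IGNORECASE) for p in FILTER_PATTERNS)
```

Checking the sentence as well as the name catches cases like "Archive Net is a network of 250+ institutions" even when the name itself is not blacklisted.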
---
### Priority 3: Proper Name Validation
**Goal:** Filter generic descriptors and concepts (8.3% of errors)
**Implementation:**
1. **Minimum proper name requirements:**
```python
def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
    """
    Validate that name is a proper institution name, not a generic term.

    Requirements:
        - Minimum 2 words for most types (except compounds like "Rijksmuseum")
        - At least one word that's NOT just the institution type keyword
        - Not in generic descriptor blacklist
    """
```
2. **Generic descriptor blacklist:**
```python
GENERIC_DESCRIPTORS = {
    'Library FabLab',        # Service concept
    'Museum Café',           # Facility type
    'Archive Reading Room',  # Room/service
    'Museum Shop',
    'Library Makerspace',
    # Too generic
    'Dutch Museum',
    'Local Archive',
    'University Library',    # Without specific university name
}
```
3. **Compound word validation:**
   - Single-word names allowed ONLY if:
     - Contains institution keyword as suffix (Rijksmuseum ✅)
     - First letter capitalized (proper noun)
     - Not in generic blacklist
**Expected impact:** Reduce concept/descriptor errors from 8.3% to <5%
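The three rules above can be sketched as a small predicate. The descriptor set and suffix tuple are illustrative subsets (the real lists are defined above and would be configurable); suffix matching stands in for the full compound-word check.

```python
# Illustrative subsets; the full lists are defined above.
GENERIC_DESCRIPTORS = {"library fablab", "museum café", "dutch museum",
                       "local archive", "university library"}
INSTITUTION_SUFFIXES = ("museum", "bibliotheek", "archief")

def is_proper_institutional_name(name: str) -> bool:
    """Multi-word proper names pass unless blacklisted; single words pass
    only as capitalized compounds ending in an institution keyword."""
    if name.lower() in GENERIC_DESCRIPTORS:
        return False
    if len(name.split()) >= 2:
        return True
    return name[:1].isupper() and name.lower().endswith(INSTITUTION_SUFFIXES)
```

This keeps "Rijksmuseum" (capitalized compound with an institution suffix) while rejecting "Library FabLab" (blacklisted descriptor) and "Studies" (bare single word).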
---
### Priority 4: Enhanced Confidence Scoring
**Current V4 scoring (base 0.3):**
- +0.2 if has institution type
- +0.1 if has location
- +0.3 if has identifier
- +0.2 if 2-6 words
- +0.2 if explicit "is a" pattern
**V5 improved scoring:**
```python
def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str,  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.

    Base: 0.2 (lower than v4 to penalize uncertain extractions)

    Positive signals:
        +0.3  Has institution type keyword
        +0.2  Has explicit location in text (city OR country)
        +0.4  Has identifier (ISIL/Wikidata/VIAF)
        +0.2  Name is 2-6 words
        +0.2  Explicit "is a" or "located in" pattern
        +0.1  Country from explicit mention (not just inferred)

    Negative signals:
        -0.2  Single-word name without compound validation
        -0.3  Name matches generic descriptor pattern
        -0.2  Country only inferred (not mentioned in text)
        -0.5  Name in organization/network blacklist

    Threshold: 0.6 (increased from v4's 0.5)
    """
```
**Expected impact:** Better separation of valid (>0.6) vs invalid (<0.6) institutions
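The scoring rules can be sketched as a pure function over pre-computed signals (signal extraction itself is omitted, and the parameter names are illustrative rather than the actual method signature):

```python
def calculate_confidence_v5(
    has_type: bool,
    has_location: bool,
    has_identifier: bool,
    word_count: int,
    has_is_a_pattern: bool,
    country_source: str,   # 'explicit' | 'inferred' | 'none'
    is_generic: bool,
    is_blacklisted: bool,
) -> float:
    score = 0.2  # lower base than v4's 0.3
    if has_type:
        score += 0.3
    if has_location:
        score += 0.2
    if has_identifier:
        score += 0.4
    if 2 <= word_count <= 6:
        score += 0.2
    if has_is_a_pattern:
        score += 0.2
    if country_source == "explicit":
        score += 0.1
    if word_count == 1:
        score -= 0.2   # single-word name without compound validation
    if is_generic:
        score -= 0.3
    if country_source == "inferred":
        score -= 0.2
    if is_blacklisted:
        score -= 0.5
    return max(0.0, min(1.0, score))
```

A candidate is accepted only when the clamped score reaches the 0.6 threshold; a blacklisted single-word name with an inferred country bottoms out at 0.0 even when it carries a type keyword.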
---
## Implementation Plan
### Phase 1: Core Validation Enhancements (High Priority)
**Files to modify:**
- `src/glam_extractor/extractors/nlp_extractor.py`
**New methods to add:**
1. `_validate_country_context(sentence, name, inferred_country) -> Optional[str]`
   - Detect explicit country mentions
   - Check for contradictions with inferred country
   - Return validated country or None
2. `_is_organization_or_network(name, sentence) -> bool`
   - Check against organization blacklist
   - Detect network/association patterns
   - Return True if should be filtered
3. `_is_proper_institutional_name(name, sentence) -> bool`
   - Validate minimum name requirements
   - Check generic descriptor blacklist
   - Validate compound words
4. `_calculate_confidence_v5(...) -> float`
   - Implement enhanced scoring with penalties
   - Use country_source parameter
**Modified methods:**
1. `_extract_entities(text, inferred_country)`:
```python
# BEFORE (v4):
if not country and inferred_country:
    country = inferred_country

# AFTER (v5): validate country from context first
validated_country = self._validate_country_context(
    sentence, name, inferred_country
)
if validated_country:
    country = validated_country
    country_source = 'explicit'
elif not country and inferred_country:
    country = inferred_country
    country_source = 'inferred'
else:
    country_source = 'none'
```
2. `_extract_institution_names(sentence)`:
```python
# Add validation before adding to names list
for potential_name in extracted_names:
    # V5: Filter organizations and networks
    if self._is_organization_or_network(potential_name, sentence):
        continue
    # V5: Validate proper institutional name
    if not self._is_proper_institutional_name(potential_name, sentence):
        continue
    names.append(potential_name)
```
### Phase 2: Quality Filter Updates (Medium Priority)
**File:** `scripts/batch_extract_institutions.py`
**Enhancements to `apply_quality_filters()` method:**
1. Add Dutch-specific validation for NL extractions:
```python
# For NL institutions, require explicit city mention
if country == 'NL' and not city:
    removed_reasons['nl_missing_city'] += 1
    continue
```
2. Add organization/network detection:
```python
# Check organization blacklist
if name.upper() in ORGANIZATION_BLACKLIST:
    removed_reasons['organization_not_institution'] += 1
    continue
```
### Phase 3: Testing and Validation
**Test dataset:** Dutch conversation files only (subset for faster iteration)
**Validation methodology:**
1. Run v5 extraction on Dutch conversations
2. Compare results to v4 (12 institutions)
3. Validate new extractions via web search
4. Calculate precision, recall, F1 score
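Precision, recall, and F1 for step 4 follow the standard definitions. As a sanity check, the v4 Dutch baseline (6 valid of 12 extracted, assuming for illustration that no valid institution was missed) reproduces the 50% precision figure:

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Standard precision/recall/F1, guarding against empty denominators."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# V4 Dutch baseline: 6 valid / 12 extracted, recall assumed perfect
p, r, f = precision_recall_f1(true_pos=6, false_pos=6, false_neg=0)
# p == 0.5 → the 50% precision baseline
```

Hitting the V5 target with the same 6 valid institutions requires at most 2 false positives (6/8 = 75%).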
**Success criteria:**
- **Precision:** ≥75% (up from 50%)
- **False positive reduction:** ≥50% (from 6 errors to ≤3)
- **Valid institutions preserved:** 6/6 (no regression)
---
## Testing Strategy
### Unit Tests
```python
# tests/extractors/test_nlp_extractor_v5.py

def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to the Netherlands."""
    extractor = InstitutionExtractor()
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text,
        conversation_name="Dutch_institutions"  # Inferred country: NL
    )
    # Should NOT extract, or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'


def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    # Should not extract IFLA as an institution
    assert 'IFLA' not in [inst.name for inst in result.value]


def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as an institution."""
    extractor = InstitutionExtractor()
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    # Should not extract the generic service descriptor
    assert 'Library FabLab' not in [inst.name for inst in result.value]
```
### Integration Tests
```python
def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.

    Expected: precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()
    # Manual validation required
    # Expected: 6-10 institutions extracted (vs 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10
```
---
## Rollout Plan
### Stage 1: Dutch-only validation (THIS SPRINT)
- Implement v5 improvements
- Test on Dutch conversations only
- Validate precision ≥75%
- Compare v4 vs v5 results
### Stage 2: Multi-country validation
- Test on Brazil, Mexico, Chile (v4 countries)
- Ensure improvements don't harm other regions
- Adjust blacklists for region-specific patterns
### Stage 3: Full re-extraction
- Run v5 on all 139 conversation files
- Generate new `output/v5_institutions.json`
- Validate sample across multiple countries
- Update documentation
---
## Risk Mitigation
### Risk 1: Over-filtering (false negatives)
**Concern:** Stricter validation might reject valid institutions
**Mitigation:**
- Preserve v4 output for comparison
- Log all filtered institutions with reason
- Manual review of filtered items
- Adjustable confidence threshold (default 0.6, can lower to 0.55)
### Risk 2: Regional bias
**Concern:** Dutch-optimized rules might not work globally
**Mitigation:**
- Blacklists should be culturally neutral where possible
- Test on diverse regions (Asia, Africa, Latin America)
- Separate region-specific rules (e.g., `DUTCH_SPECIFIC_FILTERS`)
### Risk 3: Regression on valid institutions
**Concern:** Might lose some of the 6 valid Dutch institutions
**Mitigation:**
- Run v5 on same Dutch conversations
- Compare extracted names to v4's 6 valid institutions
- If any valid institution missing, adjust filters
- Maintain whitelist of known-good institutions
---
## Metrics and Monitoring
### Key Metrics
| Metric | V4 Baseline | V5 Target | Measurement Method |
|--------|-------------|-----------|-------------------|
| **Precision (Dutch)** | 50% | 75% | Web validation of sample |
| **Geographic errors** | 16.7% | <5% | Manual review |
| **Organization errors** | 25% | <10% | Blacklist matching |
| **Generic descriptor errors** | 8.3% | <5% | Pattern matching |
| **Total extracted (Dutch)** | 12 | 6-10 | Count after filters |
| **Valid institutions preserved** | 6/6 | 6/6 | Compare to v4 valid list |
### Success Criteria
**Must achieve:**
- Precision ≥75%
- All 6 v4-valid institutions still extracted
- ≤3 false positives (down from 6)
**Nice to have:**
- Precision ≥80%
- No geographic errors (0%)
- Confidence scores >0.7 for all valid institutions
---
## Files Modified
**New files:**
- `docs/V5_EXTRACTION_DESIGN.md` (this document)
- `tests/extractors/test_nlp_extractor_v5.py` (unit tests)
- `output/v5_dutch_institutions.json` (v5 extraction results)
- `output/V5_VALIDATION_COMPARISON.md` (v4 vs v5 analysis)
**Modified files:**
- `src/glam_extractor/extractors/nlp_extractor.py` (core improvements)
- `scripts/batch_extract_institutions.py` (quality filter updates)
---
## Next Actions
1. **Design v5 improvements** (this document)
2. **Implement core validation methods** (geographic, organization, proper name)
3. **Update confidence scoring** (v5 algorithm)
4. **Run v5 extraction on Dutch conversations**
5. **Validate v5 results** (web search + comparison to v4)
6. **Measure precision improvement** (target ≥75%)
7. **Document findings** (V5_VALIDATION_COMPARISON.md)
---
**Status:** Design complete, ready for implementation
**Owner:** GLAM extraction team
**Review date:** 2025-11-07
**Implementation timeline:** 1-2 days