# V5 Extraction Design - Improvements Based on Dutch Validation
**Date:** 2025-11-07
**Baseline:** V4 extraction with 50% precision on Dutch institutions
**Goal:** Achieve 75-90% precision through targeted improvements
---
## Problem Statement
V4 extraction of Dutch heritage institutions achieved **50% precision** (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.
### Validated Error Categories (from 12 Dutch extractions):
| Error Type | Count | % | Examples |
|------------|-------|---|----------|
| **Geographic errors** | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| **Organizations vs Institutions** | 1 | 8.3% | IFLA (federation) extracted as library |
| **Networks/Platforms** | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| **Academic Departments** | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| **Concepts/Services** | 1 | 8.3% | Library FabLab (service type, not named institution) |
| **Valid institutions** | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |
---
## Root Cause Analysis
### 1. Geographic Validation Weakness
**Current V4 behavior:**
- Infers country from conversation filename (e.g., "Dutch_institutions.json" → NL)
- **Problem:** Applies inferred country to ALL institutions in conversation, even if text explicitly mentions other countries
- Example error: Conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL
**V4 code location:** `nlp_extractor.py:509-511`
```python
# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country
```
**Why this fails:**
- No validation that institution actually belongs to inferred country
- No check for explicit country mentions in the surrounding context
- Silent override of actual geographic context
### 2. Entity Type Classification Gaps
**Current V4 behavior:**
- Classifies based on keywords (museum, library, archive, etc.)
- **Problem:** Doesn't distinguish between:
  - **Physical institutions** (Rijksmuseum) ✅
  - **Organizations** (IFLA, UNESCO, ICOM) ❌
  - **Networks** (Archive Net, Museum Association) ❌
  - **Academic units** (departments, study programmes) ❌
**V4 code location:** `nlp_extractor.py:606-800` (`_extract_institution_names`)
**Why this fails:**
- Pattern matching on "library" catches "IFLA Library" (organization name)
- No semantic understanding of entity hierarchy (network vs. member institution)
- No blacklist for known organization/network names
### 3. Proper Name Validation Missing
**Current V4 behavior:**
- Extracts capitalized phrases containing institution keywords
- **Problem:** Accepts generic descriptors and concepts as institution names
**Examples of false positives:**
- "Library FabLab" (service concept, like "library café")
- "Archive Net" (network abbreviation)
- "Dutch Museum" (too generic, likely part of discussion)
**V4 code location:** `nlp_extractor.py:667-700` (Pattern 1a/1b extraction)
**Why this fails:**
- No minimum name length (single-word names allowed)
- No validation that name is a proper noun (not just capitalized generic term)
- No blacklist for known generic patterns
---
## V5 Improvement Strategy
### Priority 1: Geographic Validation Enhancement
**Goal:** Eliminate wrong-country extractions (16.7% of errors)
**Implementation:**
1. **Context-based country validation:**
```python
def _validate_country_context(self, sentence: str, name: str, inferred_country: str) -> Optional[str]:
    """
    Validate that the inferred country is actually correct for this institution.

    Returns:
        - Explicit country code if found in sentence
        - None if explicit country contradicts inferred country
        - inferred_country only if no contradictory evidence
    """
```
2. **Explicit country mention detection:**
   - Pattern: "[Name] in [Country]"
   - Pattern: "[Name], [City], [Country]"
   - Pattern: "[Country] institution [Name]"
3. **Contradiction detection:**
   - If sentence contains "Malaysia" and inferred country is "NL" → reject extraction
   - If sentence contains "Islamic University" without "Netherlands" → reject for NL
4. **Confidence penalty for inferred-only country:**
   - Explicit country in text: confidence +0.2
   - Inferred country only: confidence +0.0 (no bonus)
**Expected impact:** Reduce geographic errors from 16.7% to <5%
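To make the contradiction check concrete, here is a minimal sketch of context-based country validation. The `COUNTRY_NAMES` table and the standalone `validate_country_context` function are illustrative assumptions only; the real method would live on the extractor class and use its full country gazetteer.

```python
import re
from typing import Optional

# Hypothetical country-name → ISO-code table for illustration only;
# the extractor would use its full gazetteer.
COUNTRY_NAMES = {
    "netherlands": "NL",
    "malaysia": "MY",
    "germany": "DE",
    "belgium": "BE",
}

def validate_country_context(sentence: str, inferred_country: Optional[str]) -> Optional[str]:
    """Return a validated country code, or None when an explicit
    mention in the sentence contradicts the inferred country."""
    mentioned = {
        code for name, code in COUNTRY_NAMES.items()
        if re.search(rf"\b{name}\b", sentence, re.IGNORECASE)
    }
    if not mentioned:
        return inferred_country              # no evidence either way
    if inferred_country in mentioned:
        return inferred_country              # explicit confirmation
    if inferred_country is None and len(mentioned) == 1:
        return mentioned.pop()               # explicit country, nothing inferred
    return None                              # contradiction: reject the extraction
```

Under this scheme a sentence mentioning Malaysia inside a conversation inferred as NL returns `None`, so the caller drops the extraction instead of silently assigning `country=NL`.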
---
### Priority 2: Entity Type Classification Rules
**Goal:** Filter out organizations, networks, and departments (25% of errors)
**Implementation:**
1. **Organization blacklist:**
```python
ORGANIZATION_BLACKLIST = {
    # International organizations
    'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
    'International Federation of Library Associations',
    'International Council of Museums',
    # Networks and associations
    'Archive Net', 'Netwerk Oorlogsbronnen',
    'Museum Association', 'Archives Association',
    # Academic units (generic)
    'Studies', 'Department of', 'Faculty of',
    'School of', 'Institute for',
}
```
2. **Entity type detection patterns:**
```python
# Pattern: "X is a network of Y institutions"
NETWORK_PATTERNS = [
    r'\b(\w+)\s+is\s+a\s+network',
    r'\b(\w+)\s+platform\s+connecting',
    r'\bnetwork\s+of\s+\d+\s+\w+',
]

# Pattern: "X is an organization that"
ORGANIZATION_PATTERNS = [
    r'\b(\w+)\s+is\s+an?\s+organization',
    r'\b(\w+)\s+is\s+an?\s+association',
    r'\b(\w+)\s+is\s+an?\s+federation',
]
```
3. **Academic unit detection:**
   - Reject if name contains "Studies" without a proper institutional name
     - Reject: "Southeast Asian Studies" (department, not an institution)
     - Accept: "Southeast Asian Studies Library, Leiden University" ("Library" keyword plus named parent institution)
**Expected impact:** Reduce organization/network errors from 25% to <10%
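The blacklist and pattern checks above can be combined into a single predicate. This is a minimal sketch under assumptions: the blacklist is trimmed for illustration, and `FILTER_PATTERNS` merges the network and organization patterns into one list of sentence-level cues.

```python
import re

# Trimmed illustrative subset of the blacklist defined above.
ORGANIZATION_BLACKLIST = {"IFLA", "UNESCO", "ICOM", "ICOMOS", "ICA", "ARCHIVE NET"}

# Sentence-level cues that the candidate is a network or organization.
FILTER_PATTERNS = [
    r"\bis\s+a\s+network\b",
    r"\bplatform\s+connecting\b",
    r"\bnetwork\s+of\s+\d+\b",
    r"\bis\s+an?\s+(organization|association|federation)\b",
]

def is_organization_or_network(name: str, sentence: str) -> bool:
    """True when the candidate should be filtered out as an organization,
    association, or network rather than a physical institution."""
    if name.upper() in ORGANIZATION_BLACKLIST:
        return True
    return any(re.search(p, sentence, re.IGNORECASE) for p in FILTER_PATTERNS)
```

Checking the sentence as well as the name catches cases like "Archive Net is a network of 250+ institutions" even when the name itself is not blacklisted.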
---
### Priority 3: Proper Name Validation
**Goal:** Filter generic descriptors and concepts (8.3% of errors)
**Implementation:**
1. **Minimum proper name requirements:**
```python
def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
    """
    Validate that name is a proper institution name, not a generic term.

    Requirements:
        - Minimum 2 words for most types (except compounds like "Rijksmuseum")
        - At least one word that's NOT just the institution type keyword
        - Not in generic descriptor blacklist
    """
```
2. **Generic descriptor blacklist:**
```python
GENERIC_DESCRIPTORS = {
    'Library FabLab',        # Service concept
    'Museum Café',           # Facility type
    'Archive Reading Room',  # Room/service
    'Museum Shop',
    'Library Makerspace',
    # Too generic
    'Dutch Museum',
    'Local Archive',
    'University Library',    # Without specific university name
}
```
3. **Compound word validation:**
   - Single-word names allowed ONLY if:
     - Contains institution keyword as suffix (Rijksmuseum ✅)
     - First letter capitalized (proper noun)
     - Not in generic blacklist
**Expected impact:** Reduce concept/descriptor errors from 8.3% to <5%
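The three rules above can be sketched as a small predicate. The descriptor set and suffix tuple are illustrative subsets (the real lists are defined above and would be configurable); suffix matching stands in for the full compound-word check.

```python
# Illustrative subsets; the full lists are defined above.
GENERIC_DESCRIPTORS = {"library fablab", "museum café", "dutch museum",
                       "local archive", "university library"}
INSTITUTION_SUFFIXES = ("museum", "bibliotheek", "archief")

def is_proper_institutional_name(name: str) -> bool:
    """Multi-word proper names pass unless blacklisted; single words pass
    only as capitalized compounds ending in an institution keyword."""
    if name.lower() in GENERIC_DESCRIPTORS:
        return False
    if len(name.split()) >= 2:
        return True
    return name[:1].isupper() and name.lower().endswith(INSTITUTION_SUFFIXES)
```

This keeps "Rijksmuseum" (capitalized compound with an institution suffix) while rejecting "Library FabLab" (blacklisted descriptor) and "Studies" (bare single word).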
---
### Priority 4: Enhanced Confidence Scoring
**Current V4 scoring (base 0.3):**
- +0.2 if has institution type
- +0.1 if has location
- +0.3 if has identifier
- +0.2 if 2-6 words
- +0.2 if explicit "is a" pattern
**V5 improved scoring:**
```python
def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str,  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.

    Base: 0.2 (lower than v4 to penalize uncertain extractions)

    Positive signals:
        +0.3  Has institution type keyword
        +0.2  Has explicit location in text (city OR country)
        +0.4  Has identifier (ISIL/Wikidata/VIAF)
        +0.2  Name is 2-6 words
        +0.2  Explicit "is a" or "located in" pattern
        +0.1  Country from explicit mention (not just inferred)

    Negative signals:
        -0.2  Single-word name without compound validation
        -0.3  Name matches generic descriptor pattern
        -0.2  Country only inferred (not mentioned in text)
        -0.5  Name in organization/network blacklist

    Threshold: 0.6 (increased from v4's 0.5)
    """
```
**Expected impact:** Better separation of valid (>0.6) vs invalid (<0.6) institutions
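The scoring rules can be sketched as a pure function over pre-computed signals (signal extraction itself is omitted, and the parameter names are illustrative rather than the actual method signature):

```python
def calculate_confidence_v5(
    has_type: bool,
    has_location: bool,
    has_identifier: bool,
    word_count: int,
    has_is_a_pattern: bool,
    country_source: str,   # 'explicit' | 'inferred' | 'none'
    is_generic: bool,
    is_blacklisted: bool,
) -> float:
    score = 0.2  # lower base than v4's 0.3
    if has_type:
        score += 0.3
    if has_location:
        score += 0.2
    if has_identifier:
        score += 0.4
    if 2 <= word_count <= 6:
        score += 0.2
    if has_is_a_pattern:
        score += 0.2
    if country_source == "explicit":
        score += 0.1
    if word_count == 1:
        score -= 0.2   # single-word name without compound validation
    if is_generic:
        score -= 0.3
    if country_source == "inferred":
        score -= 0.2
    if is_blacklisted:
        score -= 0.5
    return max(0.0, min(1.0, score))
```

A candidate is accepted only when the clamped score reaches the 0.6 threshold; a blacklisted single-word name with an inferred country bottoms out at 0.0 even when it carries a type keyword.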
---
## Implementation Plan
### Phase 1: Core Validation Enhancements (High Priority)
**Files to modify:**
- `src/glam_extractor/extractors/nlp_extractor.py`
**New methods to add:**
1. `_validate_country_context(sentence, name, inferred_country) -> Optional[str]`
   - Detect explicit country mentions
   - Check for contradictions with inferred country
   - Return validated country or None
2. `_is_organization_or_network(name, sentence) -> bool`
   - Check against organization blacklist
   - Detect network/association patterns
   - Return True if should be filtered
3. `_is_proper_institutional_name(name, sentence) -> bool`
   - Validate minimum name requirements
   - Check generic descriptor blacklist
   - Validate compound words
4. `_calculate_confidence_v5(...) -> float`
   - Implement enhanced scoring with penalties
   - Use country_source parameter
**Modified methods:**
1. `_extract_entities(text, inferred_country)`:
```python
# BEFORE (v4):
if not country and inferred_country:
    country = inferred_country

# AFTER (v5): validate country from context first
validated_country = self._validate_country_context(
    sentence, name, inferred_country
)
if validated_country:
    country = validated_country
    country_source = 'explicit'
elif not country and inferred_country:
    country = inferred_country
    country_source = 'inferred'
else:
    country_source = 'none'
```
2. `_extract_institution_names(sentence)`:
```python
# Add validation before adding to names list
for potential_name in extracted_names:
    # V5: Filter organizations and networks
    if self._is_organization_or_network(potential_name, sentence):
        continue
    # V5: Validate proper institutional name
    if not self._is_proper_institutional_name(potential_name, sentence):
        continue
    names.append(potential_name)
```
### Phase 2: Quality Filter Updates (Medium Priority)
**File:** `scripts/batch_extract_institutions.py`
**Enhancements to `apply_quality_filters()` method:**
1. Add Dutch-specific validation for NL extractions:
```python
# For NL institutions, require explicit city mention
if country == 'NL' and not city:
    removed_reasons['nl_missing_city'] += 1
    continue
```
2. Add organization/network detection:
```python
# Check organization blacklist
if name.upper() in ORGANIZATION_BLACKLIST:
    removed_reasons['organization_not_institution'] += 1
    continue
```
### Phase 3: Testing and Validation
**Test dataset:** Dutch conversation files only (subset for faster iteration)
**Validation methodology:**
1. Run v5 extraction on Dutch conversations
2. Compare results to v4 (12 institutions)
3. Validate new extractions via web search
4. Calculate precision, recall, F1 score
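Precision, recall, and F1 for step 4 follow the standard definitions. As a sanity check, the v4 Dutch baseline (6 valid of 12 extracted, assuming for illustration that no valid institution was missed) reproduces the 50% precision figure:

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Standard precision/recall/F1, guarding against empty denominators."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# V4 Dutch baseline: 6 valid / 12 extracted, recall assumed perfect
p, r, f = precision_recall_f1(true_pos=6, false_pos=6, false_neg=0)
# p == 0.5 → the 50% precision baseline
```

Hitting the V5 target with the same 6 valid institutions requires at most 2 false positives (6/8 = 75%).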
**Success criteria:**
- **Precision:** ≥75% (up from 50%)
- **False positive reduction:** ≥50% (from 6 errors to ≤3)
- **Valid institutions preserved:** 6/6 (no regression)
---
## Testing Strategy
### Unit Tests
```python
# tests/extractors/test_nlp_extractor_v5.py

def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to the Netherlands."""
    extractor = InstitutionExtractor()
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text,
        conversation_name="Dutch_institutions"  # Inferred country: NL
    )
    # Should NOT extract, or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'


def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    # Should not extract IFLA as an institution
    assert 'IFLA' not in [inst.name for inst in result.value]


def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as an institution."""
    extractor = InstitutionExtractor()
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    # Should not extract the generic service descriptor
    assert 'Library FabLab' not in [inst.name for inst in result.value]
```
### Integration Tests
```python
def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.

    Expected: precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()
    # Manual validation required
    # Expected: 6-10 institutions extracted (vs 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10
```
---
## Rollout Plan
### Stage 1: Dutch-only validation (THIS SPRINT)
- Implement v5 improvements
- Test on Dutch conversations only
- Validate precision ≥75%
- Compare v4 vs v5 results
### Stage 2: Multi-country validation
- Test on Brazil, Mexico, Chile (v4 countries)
- Ensure improvements don't harm other regions
- Adjust blacklists for region-specific patterns
### Stage 3: Full re-extraction
- Run v5 on all 139 conversation files
- Generate new `output/v5_institutions.json`
- Validate sample across multiple countries
- Update documentation
---
## Risk Mitigation
### Risk 1: Over-filtering (false negatives)
**Concern:** Stricter validation might reject valid institutions
**Mitigation:**
- Preserve v4 output for comparison
- Log all filtered institutions with reason
- Manual review of filtered items
- Adjustable confidence threshold (default 0.6, can lower to 0.55)
### Risk 2: Regional bias
**Concern:** Dutch-optimized rules might not work globally
**Mitigation:**
- Blacklists should be culturally neutral where possible
- Test on diverse regions (Asia, Africa, Latin America)
- Separate region-specific rules (e.g., `DUTCH_SPECIFIC_FILTERS`)
### Risk 3: Regression on valid institutions
**Concern:** Might lose some of the 6 valid Dutch institutions
**Mitigation:**
- Run v5 on same Dutch conversations
- Compare extracted names to v4's 6 valid institutions
- If any valid institution missing, adjust filters
- Maintain whitelist of known-good institutions
---
## Metrics and Monitoring
### Key Metrics
| Metric | V4 Baseline | V5 Target | Measurement Method |
|--------|-------------|-----------|-------------------|
| **Precision (Dutch)** | 50% | 75% | Web validation of sample |
| **Geographic errors** | 16.7% | <5% | Manual review |
| **Organization errors** | 25% | <10% | Blacklist matching |
| **Generic descriptor errors** | 8.3% | <5% | Pattern matching |
| **Total extracted (Dutch)** | 12 | 6-10 | Count after filters |
| **Valid institutions preserved** | 6/6 | 6/6 | Compare to v4 valid list |
### Success Criteria
**Must achieve:**
- Precision ≥75%
- All 6 v4-valid institutions still extracted
- ≤3 false positives (down from 6)
**Nice to have:**
- Precision ≥80%
- No geographic errors (0%)
- Confidence scores >0.7 for all valid institutions
---
## Files Modified
**New files:**
- `docs/V5_EXTRACTION_DESIGN.md` (this document)
- `tests/extractors/test_nlp_extractor_v5.py` (unit tests)
- `output/v5_dutch_institutions.json` (v5 extraction results)
- `output/V5_VALIDATION_COMPARISON.md` (v4 vs v5 analysis)
**Modified files:**
- `src/glam_extractor/extractors/nlp_extractor.py` (core improvements)
- `scripts/batch_extract_institutions.py` (quality filter updates)
---
## Next Actions
1. **Design v5 improvements** (this document)
2. **Implement core validation methods** (geographic, organization, proper name)
3. **Update confidence scoring** (v5 algorithm)
4. **Run v5 extraction on Dutch conversations**
5. **Validate v5 results** (web search + comparison to v4)
6. **Measure precision improvement** (target ≥75%)
7. **Document findings** (V5_VALIDATION_COMPARISON.md)
---
**Status:** Design complete, ready for implementation
**Owner:** GLAM extraction team
**Review date:** 2025-11-07
**Implementation timeline:** 1-2 days