V5 Extraction Design - Improvements Based on Dutch Validation
Date: 2025-11-07
Baseline: V4 extraction with 50% precision on Dutch institutions
Goal: Achieve 75-90% precision through targeted improvements
Problem Statement
V4 extraction of Dutch heritage institutions achieved 50% precision (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.
Validated Error Categories (from 12 Dutch extractions):
| Error Type | Count | % | Examples |
|---|---|---|---|
| Geographic errors | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| Organizations vs Institutions | 1 | 8.3% | IFLA (federation) extracted as library |
| Networks/Platforms | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| Academic Departments | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| Concepts/Services | 1 | 8.3% | Library FabLab (service type, not named institution) |
| Valid institutions | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |
Root Cause Analysis
1. Geographic Validation Weakness
Current V4 behavior:
- Infers country from conversation filename (e.g., "Dutch_institutions.json" → NL)
- Problem: Applies inferred country to ALL institutions in conversation, even if text explicitly mentions other countries
- Example error: Conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL
V4 code location: nlp_extractor.py:509-511
```python
# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country
```
Why this fails:
- No validation that institution actually belongs to inferred country
- No check for explicit country mentions in the surrounding context
- Silent override of actual geographic context
2. Entity Type Classification Gaps
Current V4 behavior:
- Classifies based on keywords (museum, library, archive, etc.)
- Problem: Doesn't distinguish between:
- Physical institutions (Rijksmuseum) ✅
- Organizations (IFLA, UNESCO, ICOM) ❌
- Networks (Archive Net, Museum Association) ❌
- Academic units (departments, study programmes) ❌
V4 code location: nlp_extractor.py:606-800 (_extract_institution_names)
Why this fails:
- Pattern matching on "library" catches "IFLA Library" (organization name)
- No semantic understanding of entity hierarchy (network vs. member institution)
- No blacklist for known organization/network names
3. Proper Name Validation Missing
Current V4 behavior:
- Extracts capitalized phrases containing institution keywords
- Problem: Accepts generic descriptors and concepts as institution names
Examples of false positives:
- "Library FabLab" (service concept, like "library café")
- "Archive Net" (network abbreviation)
- "Dutch Museum" (too generic, likely part of discussion)
V4 code location: nlp_extractor.py:667-700 (Pattern 1a/1b extraction)
Why this fails:
- No minimum name length (single-word names allowed)
- No validation that name is a proper noun (not just capitalized generic term)
- No blacklist for known generic patterns
V5 Improvement Strategy
Priority 1: Geographic Validation Enhancement
Goal: Eliminate wrong-country extractions (16.7% of errors)
Implementation:
1. Context-based country validation:

```python
def _validate_country_context(
    self, sentence: str, name: str, inferred_country: str
) -> Optional[str]:
    """
    Validate that the inferred country is actually correct for this institution.

    Returns:
    - Explicit country code if found in the sentence
    - None if an explicit country contradicts the inferred country
    - inferred_country only if there is no contradictory evidence
    """
```

2. Explicit country mention detection:
   - Pattern: "[Name] in [Country]"
   - Pattern: "[Name], [City], [Country]"
   - Pattern: "[Country] institution [Name]"

3. Contradiction detection:
   - If the sentence contains "Malaysia" and the inferred country is "NL" → reject the extraction
   - If the sentence contains "Islamic University" without "Netherlands" → reject for NL

4. Confidence penalty for inferred-only country:
   - Explicit country in the text: confidence +0.2
   - Inferred country only: confidence +0.0 (no bonus)
Expected impact: Reduce geographic errors from 16.7% to <5%
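As a concrete illustration, the country-validation logic described above could be sketched as below. `COUNTRY_KEYWORDS` is a hypothetical stand-in for whatever country/keyword data the extractor already holds; this is a minimal sketch, not the planned implementation.

```python
import re
from typing import Optional

# Hypothetical keyword map; the real extractor's country data may differ.
COUNTRY_KEYWORDS = {
    'NL': ['Netherlands', 'Dutch', 'Amsterdam', 'Leiden'],
    'MY': ['Malaysia', 'Malaysian', 'Kuala Lumpur'],
}

def validate_country_context(sentence: str, inferred_country: str) -> Optional[str]:
    """Return the inferred country if the sentence confirms it or stays silent,
    or None when a different country is explicitly named (contradiction)."""
    mentioned = {
        code for code, words in COUNTRY_KEYWORDS.items()
        if any(re.search(rf'\b{re.escape(word)}\b', sentence) for word in words)
    }
    if inferred_country in mentioned:
        return inferred_country   # explicit confirmation in the text
    if mentioned:
        return None               # another country is explicitly mentioned
    return inferred_country       # no evidence either way: keep the inference
```

Returning None signals that the extraction should be rejected rather than silently assigned the inferred country, matching the contradiction rule above.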
Priority 2: Entity Type Classification Rules
Goal: Filter out organizations, networks, and departments (25% of errors)
Implementation:
1. Organization blacklist:

```python
ORGANIZATION_BLACKLIST = {
    # International organizations
    'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
    'International Federation of Library Associations',
    'International Council of Museums',
    # Networks and associations
    'Archive Net', 'Netwerk Oorlogsbronnen',
    'Museum Association', 'Archives Association',
    # Academic units (generic)
    'Studies', 'Department of', 'Faculty of',
    'School of', 'Institute for',
}
```

2. Entity type detection patterns:

```python
# Pattern: "X is a network of Y institutions"
NETWORK_PATTERNS = [
    r'\b(\w+)\s+is\s+a\s+network',
    r'\b(\w+)\s+platform\s+connecting',
    r'\bnetwork\s+of\s+\d+\s+\w+',
]

# Pattern: "X is an organization that"
ORGANIZATION_PATTERNS = [
    r'\b(\w+)\s+is\s+an?\s+organization',
    r'\b(\w+)\s+is\s+an?\s+association',
    r'\b(\w+)\s+is\s+a\s+federation',
]
```

3. Academic unit detection:
   - Reject if the name contains "Studies" without a proper institutional name
   - Example: "Southeast Asian Studies" ❌
   - Example: "Southeast Asian Studies Library, Leiden University" ✅ ("Library" keyword present)
Expected impact: Reduce organization/network errors from 25% to <10%
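A minimal sketch combining both filtering mechanisms described above. The blacklist here is abbreviated for illustration, and `is_organization_or_network` mirrors the planned method but is shown standalone:

```python
import re

# Abbreviated blacklist for illustration; the full set is defined above.
ORGANIZATION_BLACKLIST = {'IFLA', 'UNESCO', 'ICOM', 'Archive Net', 'Museum Association'}

NETWORK_PATTERNS = [
    r'\bis\s+a\s+network\b',
    r'\bplatform\s+connecting\b',
    r'\bnetwork\s+of\s+\d+',
]

def is_organization_or_network(name: str, sentence: str) -> bool:
    """True if the candidate should be filtered out as an organization or network."""
    # Case-insensitive blacklist check
    if name.casefold() in {entry.casefold() for entry in ORGANIZATION_BLACKLIST}:
        return True
    # Contextual check: the sentence describes the entity as a network/platform
    return any(re.search(p, sentence, re.IGNORECASE) for p in NETWORK_PATTERNS)
```

The contextual patterns catch entities like "Archive Net" even when they are not blacklisted, because the surrounding sentence identifies them as networks.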
Priority 3: Proper Name Validation
Goal: Filter generic descriptors and concepts (8.3% of errors)
Implementation:
1. Minimum proper name requirements:

```python
def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
    """
    Validate that name is a proper institution name, not a generic term.

    Requirements:
    - Minimum 2 words for most types (except compounds like "Rijksmuseum")
    - At least one word that's NOT just the institution type keyword
    - Not in the generic descriptor blacklist
    """
```

2. Generic descriptor blacklist:

```python
GENERIC_DESCRIPTORS = {
    'Library FabLab',        # Service concept
    'Museum Café',           # Facility type
    'Archive Reading Room',  # Room/service
    'Museum Shop',
    'Library Makerspace',    # Too generic
    'Dutch Museum',
    'Local Archive',
    'University Library',    # Without a specific university name
}
```

3. Compound word validation. Single-word names are allowed ONLY if the name:
   - Contains an institution keyword as a suffix (Rijksmuseum ✅)
   - Starts with a capital letter (proper noun)
   - Is not in the generic blacklist
Expected impact: Reduce concept/descriptor errors from 8.3% to <5%
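The single-word and multi-word rules above can be sketched as one predicate. `TYPE_KEYWORDS` and the descriptor set are illustrative subsets of the extractor's real lists:

```python
# Illustrative subsets; the real blacklist and keyword list are larger.
GENERIC_DESCRIPTORS = {'Library FabLab', 'Museum Café', 'Dutch Museum', 'University Library'}
TYPE_KEYWORDS = {'museum', 'library', 'archive'}

def is_proper_institutional_name(name: str) -> bool:
    """True if name looks like a proper institution name, not a generic term."""
    if name in GENERIC_DESCRIPTORS:
        return False
    words = name.split()
    if len(words) == 1:
        # Single words pass only as capitalized compounds like "Rijksmuseum":
        # a type keyword as suffix, with something in front of it.
        return name[0].isupper() and any(
            name.lower().endswith(kw) and name.lower() != kw
            for kw in TYPE_KEYWORDS
        )
    # Multi-word names need at least one word beyond the bare type keyword.
    return any(word.lower() not in TYPE_KEYWORDS for word in words)
```

This accepts "Rijksmuseum" and "Van Abbemuseum" while rejecting bare "Museum" and blacklisted descriptors.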
Priority 4: Enhanced Confidence Scoring
Current V4 scoring (base 0.3):
- +0.2 if has institution type
- +0.1 if has location
- +0.3 if has identifier
- +0.2 if 2-6 words
- +0.2 if explicit "is a" pattern
V5 improved scoring:
```python
def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.

    Base: 0.2 (lower than v4 to penalize uncertain extractions)

    Positive signals:
    - +0.3 Has institution type keyword
    - +0.2 Has explicit location in text (city OR country)
    - +0.4 Has identifier (ISIL/Wikidata/VIAF)
    - +0.2 Name is 2-6 words
    - +0.2 Explicit "is a" or "located in" pattern
    - +0.1 Country from explicit mention (not just inferred)

    Negative signals:
    - -0.2 Single-word name without compound validation
    - -0.3 Name matches generic descriptor pattern
    - -0.2 Country only inferred (not mentioned in text)
    - -0.5 Name in organization/network blacklist

    Threshold: 0.6 (increased from v4's 0.5)
    """
```
Expected impact: Better separation of valid (>0.6) vs invalid (<0.6) institutions
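The scoring rules reduce to simple additions and penalties. This boolean-flag sketch is a simplification of the planned method (which takes the richer arguments shown above), clamped to [0, 1]:

```python
def confidence_v5(
    has_type: bool,          # institution type keyword present
    has_location: bool,      # explicit city or country in text
    has_identifier: bool,    # ISIL/Wikidata/VIAF
    word_count: int,
    has_pattern: bool,       # "is a" / "located in"
    country_source: str,     # 'explicit' | 'inferred' | 'none'
    is_single_word: bool,
    matches_generic: bool,
    is_blacklisted: bool,
) -> float:
    score = 0.2  # lower base than v4's 0.3
    score += 0.3 if has_type else 0.0
    score += 0.2 if has_location else 0.0
    score += 0.4 if has_identifier else 0.0
    score += 0.2 if 2 <= word_count <= 6 else 0.0
    score += 0.2 if has_pattern else 0.0
    score += 0.1 if country_source == 'explicit' else 0.0
    score -= 0.2 if is_single_word else 0.0
    score -= 0.3 if matches_generic else 0.0
    score -= 0.2 if country_source == 'inferred' else 0.0
    score -= 0.5 if is_blacklisted else 0.0
    return max(0.0, min(1.0, score))
```

A fully attested institution saturates at 1.0, while a blacklisted name carrying only a type keyword is driven to 0.0, well under the 0.6 threshold.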
Implementation Plan
Phase 1: Core Validation Enhancements (High Priority)
Files to modify:
src/glam_extractor/extractors/nlp_extractor.py
New methods to add:
1. _validate_country_context(sentence, name, inferred_country) -> Optional[str]
   - Detect explicit country mentions
   - Check for contradictions with the inferred country
   - Return the validated country or None

2. _is_organization_or_network(name, sentence) -> bool
   - Check against the organization blacklist
   - Detect network/association patterns
   - Return True if the name should be filtered

3. _is_proper_institutional_name(name, sentence) -> bool
   - Validate minimum name requirements
   - Check the generic descriptor blacklist
   - Validate compound words

4. _calculate_confidence_v5(...) -> float
   - Implement enhanced scoring with penalties
   - Use the country_source parameter
Modified methods:
1. _extract_entities(text, inferred_country):

```python
# BEFORE (v4):
if not country and inferred_country:
    country = inferred_country

# AFTER (v5):
# Validate country from context first
validated_country = self._validate_country_context(
    sentence, name, inferred_country
)
if validated_country:
    country = validated_country
    country_source = 'explicit'
elif not country and inferred_country:
    country = inferred_country
    country_source = 'inferred'
else:
    country_source = 'none'
```

2. _extract_institution_names(sentence):

```python
# Add validation before adding to the names list
for potential_name in extracted_names:
    # V5: Filter organizations and networks
    if self._is_organization_or_network(potential_name, sentence):
        continue
    # V5: Validate proper institutional name
    if not self._is_proper_institutional_name(potential_name, sentence):
        continue
    names.append(potential_name)
```
Phase 2: Quality Filter Updates (Medium Priority)
File: scripts/batch_extract_institutions.py
Enhancements to the apply_quality_filters() method:

1. Dutch-specific validation for NL extractions:

```python
# For NL institutions, require an explicit city mention
if country == 'NL' and not city:
    removed_reasons['nl_missing_city'] += 1
    continue
```

2. Organization/network detection:

```python
# Case-insensitive match against the blacklist (entries are mixed-case,
# so compare uppercased forms on both sides)
if name.upper() in {entry.upper() for entry in ORGANIZATION_BLACKLIST}:
    removed_reasons['organization_not_institution'] += 1
    continue
```
Phase 3: Testing and Validation
Test dataset: Dutch conversation files only (subset for faster iteration)
Validation methodology:
- Run v5 extraction on Dutch conversations
- Compare results to v4 (12 institutions)
- Validate new extractions via web search
- Calculate precision, recall, F1 score
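The precision/recall/F1 calculation in the last step is standard; a small helper makes the v4 baseline concrete (6 true positives, 6 false positives):

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Compute precision, recall, and F1 from validation counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

With the v4 Dutch numbers (6 valid of 12 extracted), precision comes out at the reported 50%; recall cannot be computed without a gold list of all institutions actually present in the conversations.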
Success criteria:
- Precision: ≥75% (up from 50%)
- False positive reduction: ≥50% (from 6 errors to ≤3)
- Valid institutions preserved: 6/6 (no regression)
Testing Strategy
Unit Tests
```python
# tests/extractors/test_nlp_extractor_v5.py

def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to the Netherlands."""
    extractor = InstitutionExtractor()
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text,
        conversation_name="Dutch_institutions"  # Inferred country: NL
    )
    # Should NOT extract, or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'


def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    # Should not extract IFLA as an institution
    assert len(result.value) == 0 or 'IFLA' not in [inst.name for inst in result.value]


def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as an institution."""
    extractor = InstitutionExtractor()
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    # Should not extract a generic service descriptor
    assert len(result.value) == 0 or 'Library FabLab' not in [inst.name for inst in result.value]
```
Integration Tests
```python
def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.

    Expected: Precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()
    # Manual validation required
    # Expected: 6-8 institutions extracted (vs 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10
```
Rollout Plan
Stage 1: Dutch-only validation (THIS SPRINT)
- Implement v5 improvements
- Test on Dutch conversations only
- Validate precision ≥75%
- Compare v4 vs v5 results
Stage 2: Multi-country validation
- Test on Brazil, Mexico, Chile (v4 countries)
- Ensure improvements don't harm other regions
- Adjust blacklists for region-specific patterns
Stage 3: Full re-extraction
- Run v5 on all 139 conversation files
- Generate new output/v5_institutions.json
- Validate a sample across multiple countries
- Update documentation
Risk Mitigation
Risk 1: Over-filtering (false negatives)
Concern: Stricter validation might reject valid institutions
Mitigation:
- Preserve v4 output for comparison
- Log all filtered institutions with reason
- Manual review of filtered items
- Adjustable confidence threshold (default 0.6, can lower to 0.55)
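The "log all filtered institutions with reason" mitigation could be as simple as the sketch below; `filter_with_logging` and its arguments are illustrative names, not an existing project API:

```python
from collections import Counter

removed_reasons = Counter()   # reason -> count, for the run summary
filtered_log = []             # full records for manual review

def filter_with_logging(candidates, predicate, reason):
    """Keep candidates that pass predicate; log the rest with a reason."""
    kept = []
    for candidate in candidates:
        if predicate(candidate):
            kept.append(candidate)
        else:
            removed_reasons[reason] += 1
            filtered_log.append({'name': candidate, 'reason': reason})
    return kept
```

Keeping the full filtered record (not just a count) is what makes the manual review of rejected items feasible.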
Risk 2: Regional bias
Concern: Dutch-optimized rules might not work globally
Mitigation:
- Blacklists should be culturally neutral where possible
- Test on diverse regions (Asia, Africa, Latin America)
- Separate region-specific rules (e.g., DUTCH_SPECIFIC_FILTERS)
Risk 3: Regression on valid institutions
Concern: Might lose some of the 6 valid Dutch institutions
Mitigation:
- Run v5 on same Dutch conversations
- Compare extracted names to v4's 6 valid institutions
- If any valid institution missing, adjust filters
- Maintain whitelist of known-good institutions
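A whitelist regression check might look like the sketch below. Only two of the six validated names appear in this document, so the set is deliberately left incomplete and must be filled in from the v4 validation results:

```python
# Two of the six validated names are given in this document; the rest
# are elided ("etc.") and must come from the v4 validation results.
V4_VALID_DUTCH = {
    'Historisch Centrum Overijssel',
    'Van Abbemuseum',
    # ... four more validated institutions
}

def regression_check(v5_names):
    """Return v4-validated institutions missing from the v5 output."""
    return V4_VALID_DUTCH - set(v5_names)
```

An empty result means no regression; any returned name indicates a filter that needs loosening.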
Metrics and Monitoring
Key Metrics
| Metric | V4 Baseline | V5 Target | Measurement Method |
|---|---|---|---|
| Precision (Dutch) | 50% | ≥75% | Web validation of sample |
| Geographic errors | 16.7% | <5% | Manual review |
| Organization errors | 25% | <10% | Blacklist matching |
| Generic descriptor errors | 8.3% | <5% | Pattern matching |
| Total extracted (Dutch) | 12 | 6-10 | Count after filters |
| Valid institutions preserved | 6/6 | 6/6 | Compare to v4 valid list |
Success Criteria
Must achieve:
- ✅ Precision ≥75% on Dutch institutions
- ✅ All 6 v4-valid institutions still extracted
- ✅ ≤3 false positives (down from 6)
Nice to have:
- Precision ≥80%
- No geographic errors (0%)
- Confidence scores >0.7 for all valid institutions
Files Modified
New files:
- docs/V5_EXTRACTION_DESIGN.md (this document)
- tests/extractors/test_nlp_extractor_v5.py (unit tests)
- output/v5_dutch_institutions.json (v5 extraction results)
- output/V5_VALIDATION_COMPARISON.md (v4 vs v5 analysis)
Modified files:
- src/glam_extractor/extractors/nlp_extractor.py (core improvements)
- scripts/batch_extract_institutions.py (quality filter updates)
Next Actions
- ✅ Design v5 improvements (this document)
- ⏳ Implement core validation methods (geographic, organization, proper name)
- ⏳ Update confidence scoring (v5 algorithm)
- ⏳ Run v5 extraction on Dutch conversations
- ⏳ Validate v5 results (web search + comparison to v4)
- ⏳ Measure precision improvement (target ≥75%)
- ⏳ Document findings (V5_VALIDATION_COMPARISON.md)
Status: Design complete, ready for implementation
Owner: GLAM extraction team
Review date: 2025-11-07
Implementation timeline: 1-2 days