# V5 Extraction Design - Improvements Based on Dutch Validation

**Date:** 2025-11-07
**Baseline:** V4 extraction with 50% precision on Dutch institutions
**Goal:** Achieve 75-90% precision through targeted improvements

---

## Problem Statement

V4 extraction of Dutch heritage institutions achieved **50% precision** (6 valid / 12 extracted) when validated via web search. This is unacceptable for production use.

### Validated Error Categories (from 12 Dutch extractions):

| Error Type | Count | % | Examples |
|------------|-------|---|----------|
| **Geographic errors** | 2 | 16.7% | University Malaysia, Islamic University Malaysia extracted as Dutch |
| **Organizations vs Institutions** | 1 | 8.3% | IFLA (federation) extracted as library |
| **Networks/Platforms** | 1 | 8.3% | Archive Net (network of 250+ institutions) |
| **Academic Departments** | 1 | 8.3% | Southeast Asian Studies (department, not institution) |
| **Concepts/Services** | 1 | 8.3% | Library FabLab (service type, not named institution) |
| **Valid institutions** | 6 | 50.0% | Historisch Centrum Overijssel, Van Abbemuseum, etc. |

---

## Root Cause Analysis

### 1. Geographic Validation Weakness

**Current V4 behavior:**
- Infers the country from the conversation filename (e.g., "Dutch_institutions.json" → NL)
- **Problem:** Applies the inferred country to ALL institutions in a conversation, even when the text explicitly mentions other countries
- Example error: a conversation about Dutch institutions mentions "University Malaysia" → wrongly assigned country=NL

**V4 code location:** `nlp_extractor.py:509-511`

```python
# Use inferred country as fallback if not found in text
if not country and inferred_country:
    country = inferred_country
```

**Why this fails:**
- No validation that the institution actually belongs to the inferred country
- No check for explicit country mentions in the surrounding context
- Silent override of the actual geographic context
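The failure mode is easy to reproduce in isolation. A minimal, self-contained sketch (a simplified stand-in for the V4 fallback, not the actual extractor code):

```python
from typing import Optional


def v4_country_fallback(country_in_text: Optional[str],
                        inferred_country: Optional[str]) -> Optional[str]:
    """Simplified stand-in for the fallback at nlp_extractor.py:509-511."""
    country = country_in_text
    # Use inferred country as fallback if not found in text
    if not country and inferred_country:
        country = inferred_country
    return country


# No explicit country code is found for "University Malaysia",
# so the filename-inferred country silently wins:
assigned = v4_country_fallback(country_in_text=None, inferred_country="NL")
print(assigned)  # NL -- wrong: the institution is Malaysian
```

Because the fallback never consults the sentence itself, any conversation-level inference propagates to every institution mentioned, which is exactly the Malaysia-as-NL error above.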
### 2. Entity Type Classification Gaps

**Current V4 behavior:**
- Classifies based on keywords (museum, library, archive, etc.)
- **Problem:** Doesn't distinguish between:
  - **Physical institutions** (Rijksmuseum) ✅
  - **Organizations** (IFLA, UNESCO, ICOM) ❌
  - **Networks** (Archive Net, Museum Association) ❌
  - **Academic units** (departments, study programmes) ❌

**V4 code location:** `nlp_extractor.py:606-800` (`_extract_institution_names`)

**Why this fails:**
- Pattern matching on "library" catches "IFLA Library" (an organization name)
- No semantic understanding of entity hierarchy (network vs. member institution)
- No blacklist for known organization/network names

### 3. Proper Name Validation Missing

**Current V4 behavior:**
- Extracts capitalized phrases containing institution keywords
- **Problem:** Accepts generic descriptors and concepts as institution names

**Examples of false positives:**
- "Library FabLab" (a service concept, like "library café")
- "Archive Net" (network abbreviation)
- "Dutch Museum" (too generic, likely part of discussion)

**V4 code location:** `nlp_extractor.py:667-700` (Pattern 1a/1b extraction)

**Why this fails:**
- No minimum name length (single-word names allowed)
- No validation that the name is a proper noun (not just a capitalized generic term)
- No blacklist for known generic patterns

---

## V5 Improvement Strategy

### Priority 1: Geographic Validation Enhancement

**Goal:** Eliminate wrong-country extractions (16.7% of extractions)

**Implementation:**

1. **Context-based country validation:**

   ```python
   def _validate_country_context(
       self, sentence: str, name: str, inferred_country: str
   ) -> Optional[str]:
       """
       Validate that the inferred country is actually correct for this institution.

       Returns:
       - Explicit country code if found in sentence
       - None if explicit country contradicts inferred country
       - inferred_country only if no contradictory evidence
       """
   ```
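One way the stub above could be fleshed out, sketched with a tiny illustrative vocabulary (the `COUNTRY_NAMES` map is a stand-in, not part of the codebase, and the real method would also use the `name` argument):

```python
import re
from typing import Optional

# Illustrative stand-in; a real mapping would cover far more names and demonyms.
COUNTRY_NAMES = {
    "netherlands": "NL", "dutch": "NL",
    "malaysia": "MY", "malaysian": "MY",
    "germany": "DE", "german": "DE",
}


def validate_country_context(sentence: str,
                             inferred_country: Optional[str]) -> Optional[str]:
    """Sketch of the validation rules from the docstring above."""
    mentioned = {code for word, code in COUNTRY_NAMES.items()
                 if re.search(rf"\b{word}\b", sentence, re.IGNORECASE)}
    if not mentioned:
        return inferred_country          # no contradictory evidence
    if len(mentioned) == 1:
        code = mentioned.pop()
        if inferred_country is None or code == inferred_country:
            return code                  # explicit mention confirms or supplies
        return None                      # explicit country contradicts inference
    return None                          # multiple countries: too ambiguous


print(validate_country_context(
    "University Malaysia is in Kuala Lumpur.", "NL"))    # None -> reject
print(validate_country_context(
    "The Van Abbemuseum is in the Netherlands.", "NL"))  # NL
```

Returning None on contradiction (rather than silently reassigning) keeps the decision to drop or re-country the extraction in the caller, where confidence scoring happens.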
2. **Explicit country mention detection:**
   - Pattern: "[Name] in [Country]"
   - Pattern: "[Name], [City], [Country]"
   - Pattern: "[Country] institution [Name]"

3. **Contradiction detection:**
   - If the sentence contains "Malaysia" and the inferred country is "NL" → reject the extraction
   - If the sentence contains "Islamic University" without "Netherlands" → reject for NL

4. **Confidence penalty for inferred-only country:**
   - Explicit country in text: confidence +0.2
   - Inferred country only: confidence +0.0 (no bonus)

**Expected impact:** Reduce geographic errors from 16.7% to <5%

---

### Priority 2: Entity Type Classification Rules

**Goal:** Filter out organizations, networks, and departments (25% of extractions)

**Implementation:**

1. **Organization blacklist:**

   ```python
   ORGANIZATION_BLACKLIST = {
       # International organizations
       'IFLA', 'UNESCO', 'ICOM', 'ICOMOS', 'ICA',
       'International Federation of Library Associations',
       'International Council of Museums',
       # Networks and associations
       'Archive Net', 'Netwerk Oorlogsbronnen',
       'Museum Association', 'Archives Association',
       # Academic units (generic)
       'Studies', 'Department of', 'Faculty of', 'School of', 'Institute for',
   }
   ```

2. **Entity type detection patterns:**

   ```python
   # Pattern: "X is a network of Y institutions"
   NETWORK_PATTERNS = [
       r'\b(\w+)\s+is\s+a\s+network',
       r'\b(\w+)\s+platform\s+connecting',
       r'\bnetwork\s+of\s+\d+\s+\w+',
   ]

   # Pattern: "X is an organization that"
   ORGANIZATION_PATTERNS = [
       r'\b(\w+)\s+is\s+an?\s+organization',
       r'\b(\w+)\s+is\s+an?\s+association',
       r'\b(\w+)\s+is\s+a\s+federation',
   ]
   ```
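A sketch of how the blacklist and patterns above might combine into one filter (the method name follows the Phase 1 plan; the blacklist and patterns are abbreviated here):

```python
import re

# Abbreviated subsets of the full lists above -- illustrative only.
ORGANIZATION_BLACKLIST = {'IFLA', 'UNESCO', 'ICOM', 'Archive Net',
                          'Museum Association'}
NETWORK_PATTERNS = [
    r'\bis\s+a\s+network',
    r'\bplatform\s+connecting',
    r'\bnetwork\s+of\s+\d+',
]
ORGANIZATION_PATTERNS = [
    r'\bis\s+an?\s+organization',
    r'\bis\s+an?\s+association',
    r'\bis\s+a\s+federation',
]


def is_organization_or_network(name: str, sentence: str) -> bool:
    """Return True when the candidate should be filtered out."""
    # Blacklist match is case-insensitive on the full name.
    if name.upper() in {entry.upper() for entry in ORGANIZATION_BLACKLIST}:
        return True
    # Pattern signals only count when the sentence mentions this candidate.
    if name.lower() in sentence.lower():
        for pattern in NETWORK_PATTERNS + ORGANIZATION_PATTERNS:
            if re.search(pattern, sentence, re.IGNORECASE):
                return True
    return False


print(is_organization_or_network(
    "Archive Net", "Archive Net is a network of 250 institutions."))    # True
print(is_organization_or_network(
    "Van Abbemuseum", "The Van Abbemuseum is a museum in Eindhoven."))  # False
```

Gating the pattern check on the candidate's presence in the sentence avoids filtering a valid institution just because some other entity in the same sentence is described as a network.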
3. **Academic unit detection:**
   - Reject if the name contains "Studies" without a proper institutional name
   - Example: "Southeast Asian Studies" ❌
   - Example: "Southeast Asian Studies Library, Leiden University" ✅ (the "Library" keyword is present)

**Expected impact:** Reduce organization/network errors from 25% to <10%

---

### Priority 3: Proper Name Validation

**Goal:** Filter generic descriptors and concepts (8.3% of extractions)

**Implementation:**

1. **Minimum proper name requirements:**

   ```python
   def _is_proper_institutional_name(self, name: str, sentence: str) -> bool:
       """
       Validate that the name is a proper institution name, not a generic term.

       Requirements:
       - Minimum 2 words for most types (except compounds like "Rijksmuseum")
       - At least one word that is NOT just the institution type keyword
       - Not in the generic descriptor blacklist
       """
   ```

2. **Generic descriptor blacklist:**

   ```python
   GENERIC_DESCRIPTORS = {
       'Library FabLab',        # Service concept
       'Museum Café',           # Facility type
       'Archive Reading Room',  # Room/service
       'Museum Shop',
       'Library Makerspace',    # Too generic
       'Dutch Museum',
       'Local Archive',
       'University Library',    # Without a specific university name
   }
   ```

3. **Compound word validation:**
   - Single-word names are allowed ONLY if:
     - The name contains an institution keyword as a suffix (Rijksmuseum ✅)
     - The first letter is capitalized (proper noun)
     - The name is not in the generic blacklist

**Expected impact:** Reduce concept/descriptor errors from 8.3% to <5%

---

### Priority 4: Enhanced Confidence Scoring

**Current V4 scoring (base 0.3):**
- +0.2 if has institution type
- +0.1 if has location
- +0.3 if has identifier
- +0.2 if 2-6 words
- +0.2 if explicit "is a" pattern

**V5 improved scoring:**

```python
def _calculate_confidence_v5(
    self,
    name: str,
    institution_type: Optional[InstitutionType],
    city: Optional[str],
    country: Optional[str],
    identifiers: List[Identifier],
    sentence: str,
    country_source: str,  # 'explicit' | 'inferred' | 'none'
) -> float:
    """
    V5 confidence scoring with stricter validation.

    Base: 0.2 (lower than v4 to penalize uncertain extractions)

    Positive signals:
    - +0.3 Has institution type keyword
    - +0.2 Has explicit location in text (city OR country)
    - +0.4 Has identifier (ISIL/Wikidata/VIAF)
    - +0.2 Name is 2-6 words
    - +0.2 Explicit "is a" or "located in" pattern
    - +0.1 Country from explicit mention (not just inferred)

    Negative signals:
    - -0.2 Single-word name without compound validation
    - -0.3 Name matches a generic descriptor pattern
    - -0.2 Country only inferred (not mentioned in text)
    - -0.5 Name in organization/network blacklist

    Threshold: 0.6 (increased from v4's 0.5)
    """
```

**Expected impact:** Better separation of valid (>0.6) vs. invalid (<0.6) institutions

---

## Implementation Plan

### Phase 1: Core Validation Enhancements (High Priority)

**Files to modify:**
- `src/glam_extractor/extractors/nlp_extractor.py`

**New methods to add:**

1. `_validate_country_context(sentence, name, inferred_country) -> Optional[str]`
   - Detect explicit country mentions
   - Check for contradictions with the inferred country
   - Return the validated country or None
2. `_is_organization_or_network(name, sentence) -> bool`
   - Check against the organization blacklist
   - Detect network/association patterns
   - Return True if the candidate should be filtered
3. `_is_proper_institutional_name(name, sentence) -> bool`
   - Validate minimum name requirements
   - Check the generic descriptor blacklist
   - Validate compound words
4. `_calculate_confidence_v5(...) -> float`
   - Implement enhanced scoring with penalties
   - Use the country_source parameter

**Modified methods:**
1. `_extract_entities(text, inferred_country)`:

   ```python
   # BEFORE (v4):
   if not country and inferred_country:
       country = inferred_country

   # AFTER (v5): validate the country against the sentence context first
   validated_country = self._validate_country_context(
       sentence, name, inferred_country
   )
   if validated_country:
       country = validated_country
       country_source = 'explicit'
   elif not country and inferred_country:
       country = inferred_country
       country_source = 'inferred'
   else:
       country_source = 'none'
   ```

2. `_extract_institution_names(sentence)`:

   ```python
   # Add validation before adding to the names list
   for potential_name in extracted_names:
       # V5: Filter organizations and networks
       if self._is_organization_or_network(potential_name, sentence):
           continue
       # V5: Validate proper institutional name
       if not self._is_proper_institutional_name(potential_name, sentence):
           continue
       names.append(potential_name)
   ```

### Phase 2: Quality Filter Updates (Medium Priority)

**File:** `scripts/batch_extract_institutions.py`

**Enhancements to the `apply_quality_filters()` method:**

1. Add Dutch-specific validation for NL extractions:

   ```python
   # For NL institutions, require an explicit city mention
   if country == 'NL' and not city:
       removed_reasons['nl_missing_city'] += 1
       continue
   ```

2. Add organization/network detection:

   ```python
   # Check the organization blacklist (case-insensitive: the blacklist
   # holds mixed-case entries, so compare uppercased forms)
   if name.upper() in {entry.upper() for entry in ORGANIZATION_BLACKLIST}:
       removed_reasons['organization_not_institution'] += 1
       continue
   ```

### Phase 3: Testing and Validation

**Test dataset:** Dutch conversation files only (a subset, for faster iteration)

**Validation methodology:**
1. Run the v5 extraction on Dutch conversations
2. Compare results to v4 (12 institutions)
3. Validate new extractions via web search
4. Calculate precision, recall, and F1 score

**Success criteria:**
- **Precision:** ≥75% (up from 50%)
- **False positive reduction:** ≥50% (from 6 errors to ≤3)
- **Valid institutions preserved:** 6/6 (no regression)

---

## Testing Strategy

### Unit Tests

```python
# tests/extractors/test_nlp_extractor_v5.py
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor


def test_geographic_validation():
    """Test that Malaysian institutions aren't assigned to the Netherlands."""
    extractor = InstitutionExtractor()
    text = "University Malaysia is a leading research institution in Kuala Lumpur."
    result = extractor.extract_from_text(
        text,
        conversation_name="Dutch_institutions",  # inferred country: NL
    )
    # Should NOT extract, or should assign country=MY, not NL
    assert len(result.value) == 0 or result.value[0].locations[0].country != 'NL'


def test_organization_filtering():
    """Test that IFLA is not extracted as a library."""
    extractor = InstitutionExtractor()
    text = "IFLA (International Federation of Library Associations) sets library standards."
    result = extractor.extract_from_text(text)
    # Should not extract IFLA as an institution
    assert len(result.value) == 0 or 'IFLA' not in [inst.name for inst in result.value]


def test_generic_descriptor_filtering():
    """Test that 'Library FabLab' is not extracted as an institution."""
    extractor = InstitutionExtractor()
    text = "The Library FabLab provides 3D printing services to patrons."
    result = extractor.extract_from_text(text)
    # Should not extract a generic service descriptor
    assert len(result.value) == 0 or 'Library FabLab' not in [inst.name for inst in result.value]
```

### Integration Tests

```python
from pathlib import Path

# BatchInstitutionExtractor lives in scripts/batch_extract_institutions.py


def test_dutch_extraction_precision():
    """
    Test v5 extraction on real Dutch conversations.

    Expected: Precision ≥75% (6 valid / ≤8 total)
    """
    batch_extractor = BatchInstitutionExtractor(
        conversation_dir=Path("path/to/dutch/conversations"),
        output_dir=Path("output/v5_test")
    )
    stats = batch_extractor.process_all(country_filter="dutch")
    batch_extractor.apply_quality_filters()

    # Manual validation required
    # Expected: 6-10 institutions after filtering (vs. 12 in v4)
    assert 6 <= len(batch_extractor.all_institutions) <= 10
```

---

## Rollout Plan

### Stage 1: Dutch-only validation (THIS SPRINT)
- Implement the v5 improvements
- Test on Dutch conversations only
- Validate precision ≥75%
- Compare v4 vs. v5 results

### Stage 2: Multi-country validation
- Test on Brazil, Mexico, and Chile (v4 countries)
- Ensure the improvements don't harm other regions
- Adjust blacklists for region-specific patterns

### Stage 3: Full re-extraction
- Run v5 on all 139 conversation files
- Generate a new `output/v5_institutions.json`
- Validate a sample across multiple countries
- Update documentation

---

## Risk Mitigation

### Risk 1: Over-filtering (false negatives)

**Concern:** Stricter validation might reject valid institutions

**Mitigation:**
- Preserve the v4 output for comparison
- Log all filtered institutions with the reason
- Manual review of filtered items
- Adjustable confidence threshold (default 0.6, can be lowered to 0.55)

### Risk 2: Regional bias

**Concern:** Dutch-optimized rules might not work globally

**Mitigation:**
- Blacklists should be culturally neutral where possible
- Test on diverse regions (Asia, Africa, Latin America)
- Separate region-specific rules (e.g., `DUTCH_SPECIFIC_FILTERS`)

### Risk 3: Regression on valid institutions

**Concern:** Might lose some of the 6 valid Dutch institutions

**Mitigation:**
- Run v5 on the same Dutch conversations
- Compare extracted names to v4's 6 valid institutions
- If any valid institution goes missing, adjust the filters
- Maintain a whitelist of known-good institutions

---

## Metrics and Monitoring

### Key Metrics

| Metric | V4 Baseline | V5 Target | Measurement Method |
|--------|-------------|-----------|--------------------|
| **Precision (Dutch)** | 50% | ≥75% | Web validation of sample |
| **Geographic errors** | 16.7% | <5% | Manual review |
| **Organization errors** | 25% | <10% | Blacklist matching |
| **Generic descriptor errors** | 8.3% | <5% | Pattern matching |
| **Total extracted (Dutch)** | 12 | 6-10 | Count after filters |
| **Valid institutions preserved** | 6/6 | 6/6 | Compare to v4 valid list |

### Success Criteria

**Must achieve:**
- ✅ Precision ≥75% on Dutch institutions
- ✅ All 6 v4-valid institutions still extracted
- ✅ ≤3 false positives (down from 6)

**Nice to have:**
- Precision ≥80%
- No geographic errors (0%)
- Confidence scores >0.7 for all valid institutions

---

## Files Modified

**New files:**
- `docs/V5_EXTRACTION_DESIGN.md` (this document)
- `tests/extractors/test_nlp_extractor_v5.py` (unit tests)
- `output/v5_dutch_institutions.json` (v5 extraction results)
- `output/V5_VALIDATION_COMPARISON.md` (v4 vs. v5 analysis)

**Modified files:**
- `src/glam_extractor/extractors/nlp_extractor.py` (core improvements)
- `scripts/batch_extract_institutions.py` (quality filter updates)

---

## Next Actions

1. ✅ **Design v5 improvements** (this document)
2. ⏳ **Implement core validation methods** (geographic, organization, proper name)
3. ⏳ **Update confidence scoring** (v5 algorithm)
4. ⏳ **Run v5 extraction on Dutch conversations**
5. ⏳ **Validate v5 results** (web search + comparison to v4)
6. ⏳ **Measure precision improvement** (target ≥75%)
7. ⏳ **Document findings** (V5_VALIDATION_COMPARISON.md)

---

**Status:** Design complete, ready for implementation
**Owner:** GLAM extraction team
**Review date:** 2025-11-07
**Implementation timeline:** 1-2 days