# NLP Institution Extractor

## Overview

The `InstitutionExtractor` class provides NLP-based extraction of heritage institution data from unstructured conversation text. It uses pattern matching, keyword detection, and heuristic rules to identify museums, libraries, archives, galleries, and other GLAM institutions.

## Features

- **Institution Name Extraction**: Identifies institution names using capitalization patterns and keyword context
- **Type Classification**: Classifies institutions into 13 types (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
- **Multilingual Support**: Recognizes institution keywords in English, Dutch, Spanish, Portuguese, French, German, and more
- **Identifier Extraction**: Extracts ISIL codes, Wikidata IDs, VIAF IDs, and KvK numbers
- **Location Extraction**: Identifies cities and countries mentioned in text
- **Confidence Scoring**: Assigns 0.0-1.0 confidence scores based on available evidence
- **Provenance Tracking**: Records extraction method, date, confidence, and source conversation

## Installation

The extractor is part of the `glam_extractor` package and has no external NLP dependencies (uses pure Python pattern matching).

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor
```

## Usage

### Basic Text Extraction

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

extractor = InstitutionExtractor()
text = "The Rijksmuseum in Amsterdam (ISIL: NL-AsdRM) is a major art museum."
result = extractor.extract_from_text(text)

if result.success:
    for institution in result.value:
        print(f"{institution.name} - {institution.institution_type}")
        print(f"Confidence: {institution.provenance.confidence_score}")
```

### Extracting from Conversation

```python
from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation file
parser = ConversationParser()
conversation = parser.parse_file("path/to/conversation.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_conversation(conversation)

if result.success:
    print(f"Found {len(result.value)} institutions")
    for inst in result.value:
        print(f"- {inst.name} ({inst.institution_type})")
```

### With Provenance Tracking

```python
result = extractor.extract_from_text(
    text="The Amsterdam Museum has ISIL code NL-AsdAM.",
    conversation_id="conv-12345",
    conversation_name="Dutch Heritage Institutions"
)

if result.success and result.value:
    institution = result.value[0]

    # Access provenance metadata
    prov = institution.provenance
    print(f"Data Source: {prov.data_source}")    # CONVERSATION_NLP
    print(f"Data Tier: {prov.data_tier}")        # TIER_4_INFERRED
    print(f"Confidence: {prov.confidence_score}")
    print(f"Method: {prov.extraction_method}")
    print(f"Conversation: {prov.conversation_id}")
```

## Extracted Data Model

The extractor returns `HeritageCustodian` objects with the following fields populated:

| Field | Description | Example |
|-------|-------------|---------|
| `name` | Institution name | "Rijksmuseum" |
| `institution_type` | Type enum | `InstitutionType.MUSEUM` |
| `organization_status` | Status enum | `OrganizationStatus.UNKNOWN` |
| `locations` | List of Location objects | `[Location(city="Amsterdam", country="NL")]` |
| `identifiers` | List of Identifier objects | `[Identifier(scheme="ISIL", value="NL-AsdRM")]` |
| `provenance` | Provenance metadata | See below |
| `description` | Contextual description | Includes conversation name and text snippet |

### Provenance Fields

Every extracted record includes complete provenance metadata:

```python
provenance = Provenance(
    data_source=DataSource.CONVERSATION_NLP,
    data_tier=DataTier.TIER_4_INFERRED,
    extraction_date=datetime.now(timezone.utc),
    extraction_method="Pattern matching + heuristic NER",
    confidence_score=0.85,
    conversation_id="conversation-uuid",
    source_url=None,
    verified_date=None,
    verified_by=None
)
```

## Pattern Detection

### Identifier Patterns

The extractor recognizes the following identifier patterns:

- **ISIL codes**: `[A-Z]{2}-[A-Za-z0-9]+` (e.g., `NL-AsdRM`, `US-MWA`)
- **Wikidata IDs**: `Q[0-9]+` (e.g., `Q924335`)
- **VIAF IDs**: `viaf.org/viaf/[0-9]+` (e.g., `viaf.org/viaf/123456789`)
- **KvK numbers**: `[0-9]{8}` (Dutch Chamber of Commerce)

### Institution Type Keywords

Multilingual keyword detection for institution types:

```python
InstitutionType.MUSEUM: [
    'museum', 'museo', 'museu', 'musée', 'muzeum', 'muzeul',
    'kunstmuseum', 'kunsthalle', 'muzej'
]
InstitutionType.LIBRARY: [
    'library', 'biblioteca', 'bibliothek', 'bibliotheek',
    'bibliothèque', 'biblioteka', 'national library'
]
InstitutionType.ARCHIVE: [
    'archive', 'archivo', 'archiv', 'archief', 'archives',
    'arkiv', 'national archive'
]
```

### Location Patterns

- **City extraction**: `in [City Name]` pattern (e.g., "in Amsterdam")
- **Country extraction**: From ISIL code prefix (e.g., `NL-` → `NL`)

### Name Extraction

Two primary patterns:

1. **Keyword-based**: `[Name] + [Type Keyword]`
   - Example: "Rijks" + "museum" → "Rijks Museum"
2. **ISIL-based**: `[ISIL Code] for [Name]`
   - Example: "NL-AsdAM for Amsterdam Museum"

## Confidence Scoring

Confidence scores range from 0.0 to 1.0 based on the following criteria:

| Evidence | Score Increment |
|----------|----------------|
| Base score | +0.3 |
| Has institution type | +0.2 |
| Has location (city) | +0.1 |
| Has identifier (ISIL, Wikidata, VIAF) | +0.3 |
| Name length is 2-6 words | +0.2 |
| Explicit "is a" pattern | +0.2 |

**Maximum score**: 1.0

**Interpretation**:

- **0.9-1.0**: Explicit, unambiguous mentions with context
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context; may need verification
- **0.3-0.5**: Low confidence; likely needs verification
- **0.0-0.3**: Very uncertain; flag for manual review

## Error Handling

The extractor uses the Result pattern for error handling:

```python
result = extractor.extract_from_text(text)

if result.success:
    institutions = result.value  # List[HeritageCustodian]
    for inst in institutions:
        # Process institution
        pass
else:
    error_message = result.error  # str
    print(f"Extraction failed: {error_message}")
```

## Limitations

1. **Name Extraction Accuracy**: Pattern-based name extraction may produce incomplete names or variants
   - Example: "Rijksmuseum" might be extracted as "Rijks Museum" or "Rijks Museu"
   - Mitigation: Deduplication by name (case-insensitive)
2. **No Dependency Parsing**: Does not perform syntactic parsing; relies on keyword proximity
   - Complex sentences may confuse the extractor
   - Names with articles (The, Het, La, Le) may be inconsistently captured
3. **Multilingual Coverage**: Keyword lists cover major European languages but not all languages
   - Can be extended by adding keywords to `ExtractionPatterns.INSTITUTION_KEYWORDS`
4. **Context-Dependent Accuracy**: Works best on well-formed sentences with clear institution mentions
   - Fragmentary text or lists may produce lower-quality extractions
5. **No Entity Linking**: Does not link entities to external knowledge bases (Wikidata, VIAF)
   - Identifiers are extracted but not validated

## Extending the Extractor

### Adding New Institution Keywords

```python
from glam_extractor.extractors.nlp_extractor import ExtractionPatterns
from glam_extractor.models import InstitutionType

# Add new keywords
ExtractionPatterns.INSTITUTION_KEYWORDS[InstitutionType.MUSEUM].extend([
    'museion',   # Greek
    'מוזיאון',    # Hebrew
    '博物馆'      # Chinese
])
```

### Custom Confidence Scoring

Subclass `InstitutionExtractor` and override `_calculate_confidence`:

```python
class CustomExtractor(InstitutionExtractor):
    def _calculate_confidence(self, name, institution_type, city, identifiers, sentence):
        # Custom scoring logic
        score = 0.5
        if len(identifiers) > 1:
            score += 0.4
        return min(1.0, score)
```

## Testing

Run the test suite:

```bash
pytest tests/test_nlp_extractor.py -v
```

Test coverage: **90%**

Run the demo:

```bash
python examples/demo_nlp_extractor.py
```

## Performance

- **Speed**: ~1000 sentences/second on modern hardware
- **Memory**: Minimal (no ML models loaded)
- **Scalability**: Can process 139 conversation files in seconds

## Future Enhancements

Potential improvements (as noted in AGENTS.md):

1. **Use spaCy NER models** via subagents for better named entity recognition
2. **Dependency parsing** for more accurate name extraction
3. **Entity linking** to Wikidata/VIAF for validation
4. **Geocoding integration** to enrich location data with coordinates
5. **Cross-reference validation** with CSV registries (ISIL, Dutch orgs)

## References

- **Schema**: `schemas/heritage_custodian.yaml`
- **Models**: `src/glam_extractor/models.py`
- **Conversation Parser**: `src/glam_extractor/parsers/conversation.py`
- **AGENTS.md**: Lines 62-380 (NLP extraction tasks and patterns)
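To make the identifier grammar under Pattern Detection concrete, the sketch below compiles the four documented patterns and scans a string with them. `IDENTIFIER_PATTERNS` and `find_identifiers` are illustrative names, not part of the package API:

```python
import re

# Compiled patterns mirroring the documented identifier grammar
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "WIKIDATA": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
    "KVK": re.compile(r"\b[0-9]{8}\b"),
}

def find_identifiers(text):
    """Return (scheme, value) pairs for every documented pattern that matches."""
    found = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            # VIAF captures the numeric ID in group 1; the others match whole
            value = match.group(1) if pattern.groups else match.group(0)
            found.append((scheme, value))
    return found
```

Note that the KvK pattern (`[0-9]{8}`) is deliberately loose and will match any standalone eight-digit run, which is one reason extracted identifiers still need validation (see Limitations).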
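The Confidence Scoring table can be read as a simple additive model. The function below is an illustrative reconstruction of those weights, not the package's actual `_calculate_confidence`; in particular, the regex used for the explicit "is a" check is an assumption:

```python
import re

def score_confidence(name, institution_type, city, identifiers, sentence):
    """Apply the documented evidence weights, capped at 1.0."""
    score = 0.3                              # base score
    if institution_type is not None:
        score += 0.2                         # has institution type
    if city:
        score += 0.1                         # has location (city)
    if identifiers:
        score += 0.3                         # has identifier (ISIL, Wikidata, VIAF)
    if 2 <= len(name.split()) <= 6:
        score += 0.2                         # name length is 2-6 words
    if re.search(r"\bis\s+an?\b", sentence):
        score += 0.2                         # explicit "is a" pattern
    return min(1.0, score)
```

With all six evidence signals present the raw sum is 1.3, which the cap reduces to the documented maximum of 1.0.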
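The Error Handling section assumes a Result object exposing `.success`, `.value`, and `.error`. A minimal sketch of that shape, with hypothetical `ok`/`fail` constructors (the real class ships with the package), might look like:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    """Carrier for either a successful value or an error message."""
    success: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def ok(cls, value):
        # Successful result: value populated, error empty
        return cls(success=True, value=value)

    @classmethod
    def fail(cls, error):
        # Failed result: error populated, value empty
        return cls(success=False, error=error)
```

Callers branch on `result.success` rather than catching exceptions, which matches the usage shown throughout this document.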