# NLP Institution Extractor

## Overview
The InstitutionExtractor class provides NLP-based extraction of heritage institution data from unstructured conversation text. It uses pattern matching, keyword detection, and heuristic rules to identify museums, libraries, archives, galleries, and other GLAM institutions.
## Features
- Institution Name Extraction: Identifies institution names using capitalization patterns and keyword context
- Type Classification: Classifies institutions into 13 types (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
- Multilingual Support: Recognizes institution keywords in English, Dutch, Spanish, Portuguese, French, German, and more
- Identifier Extraction: Extracts ISIL codes, Wikidata IDs, VIAF IDs, and KvK numbers
- Location Extraction: Identifies cities and countries mentioned in text
- Confidence Scoring: Assigns 0.0-1.0 confidence scores based on available evidence
- Provenance Tracking: Records extraction method, date, confidence, and source conversation
## Installation

The extractor is part of the glam_extractor package and has no external NLP dependencies (it uses pure Python pattern matching).

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor
```
## Usage

### Basic Text Extraction

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

extractor = InstitutionExtractor()
text = "The Rijksmuseum in Amsterdam (ISIL: NL-AsdRM) is a major art museum."
result = extractor.extract_from_text(text)

if result.success:
    for institution in result.value:
        print(f"{institution.name} - {institution.institution_type}")
        print(f"Confidence: {institution.provenance.confidence_score}")
```
### Extracting from a Conversation

```python
from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation file
parser = ConversationParser()
conversation = parser.parse_file("path/to/conversation.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_conversation(conversation)

if result.success:
    print(f"Found {len(result.value)} institutions")
    for inst in result.value:
        print(f"- {inst.name} ({inst.institution_type})")
```
### With Provenance Tracking

```python
result = extractor.extract_from_text(
    text="The Amsterdam Museum has ISIL code NL-AsdAM.",
    conversation_id="conv-12345",
    conversation_name="Dutch Heritage Institutions",
)

if result.success and result.value:
    institution = result.value[0]

    # Access provenance metadata
    prov = institution.provenance
    print(f"Data Source: {prov.data_source}")    # CONVERSATION_NLP
    print(f"Data Tier: {prov.data_tier}")        # TIER_4_INFERRED
    print(f"Confidence: {prov.confidence_score}")
    print(f"Method: {prov.extraction_method}")
    print(f"Conversation: {prov.conversation_id}")
```
## Extracted Data Model

The extractor returns `HeritageCustodian` objects with the following fields populated:

| Field | Description | Example |
|---|---|---|
| `name` | Institution name | `"Rijksmuseum"` |
| `institution_type` | Type enum | `InstitutionType.MUSEUM` |
| `organization_status` | Status enum | `OrganizationStatus.UNKNOWN` |
| `locations` | List of `Location` objects | `[Location(city="Amsterdam", country="NL")]` |
| `identifiers` | List of `Identifier` objects | `[Identifier(scheme="ISIL", value="NL-AsdRM")]` |
| `provenance` | Provenance metadata | See below |
| `description` | Contextual description | Includes conversation name and text snippet |
### Provenance Fields

Every extracted record includes complete provenance metadata:

```python
provenance = Provenance(
    data_source=DataSource.CONVERSATION_NLP,
    data_tier=DataTier.TIER_4_INFERRED,
    extraction_date=datetime.now(timezone.utc),
    extraction_method="Pattern matching + heuristic NER",
    confidence_score=0.85,
    conversation_id="conversation-uuid",
    source_url=None,
    verified_date=None,
    verified_by=None,
)
```
## Pattern Detection

### Identifier Patterns

The extractor recognizes the following identifier patterns:

- ISIL codes: `[A-Z]{2}-[A-Za-z0-9]+` (e.g., `NL-AsdRM`, `US-MWA`)
- Wikidata IDs: `Q[0-9]+` (e.g., `Q924335`)
- VIAF IDs: `viaf.org/viaf/[0-9]+` (e.g., `viaf.org/viaf/123456789`)
- KvK numbers: `[0-9]{8}` (Dutch Chamber of Commerce)
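As a rough illustration, the identifier patterns above can be expressed as standalone regular expressions. The sketch below is self-contained; `IDENTIFIER_PATTERNS` and `find_identifiers` are illustrative names, not the package's actual API:

```python
import re

# Illustrative regexes mirroring the documented identifier patterns.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "Wikidata": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/[0-9]+"),
    "KvK": re.compile(r"\b[0-9]{8}\b"),
}

def find_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (scheme, value) pairs found in the text."""
    found = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((scheme, match.group(0)))
    return found
```

Note that bare `[0-9]{8}` KvK matching is deliberately loose and will match any eight-digit run, which is one reason extracted identifiers carry confidence scores rather than being treated as validated.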
### Institution Type Keywords

Multilingual keyword detection for institution types:

```python
InstitutionType.MUSEUM: [
    'museum', 'museo', 'museu', 'musée', 'muzeum', 'muzeul',
    'kunstmuseum', 'kunsthalle', 'muzej'
]
InstitutionType.LIBRARY: [
    'library', 'biblioteca', 'bibliothek', 'bibliotheek',
    'bibliothèque', 'biblioteka', 'national library'
]
InstitutionType.ARCHIVE: [
    'archive', 'archivo', 'archiv', 'archief', 'archives',
    'arkiv', 'national archive'
]
```
### Location Patterns

- City extraction: `in [City Name]` pattern (e.g., "in Amsterdam")
- Country extraction: from the ISIL code prefix (e.g., `NL-` → `NL`)
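Sketched in isolation, these two heuristics amount to the following (helper names here are assumptions, not the extractor's real internals):

```python
import re

# "in [City Name]" heuristic: capture capitalized word(s) after "in".
CITY_PATTERN = re.compile(r"\bin ([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*)")
# ISIL codes start with a two-letter country prefix, e.g. "NL-AsdRM".
ISIL_PREFIX = re.compile(r"^([A-Z]{2})-")

def extract_city(sentence: str):
    """Return the first "in City" match, or None."""
    match = CITY_PATTERN.search(sentence)
    return match.group(1) if match else None

def extract_country_from_isil(isil_code: str):
    """Return the two-letter country prefix of an ISIL code, or None."""
    match = ISIL_PREFIX.match(isil_code)
    return match.group(1) if match else None
```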
### Name Extraction

Two primary patterns:

1. Keyword-based: `[Name] + [Type Keyword]`. Example: "Rijks" + "museum" → "Rijks Museum"
2. ISIL-based: `[ISIL Code] for [Name]`. Example: "NL-AsdAM for Amsterdam Museum"
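The ISIL-based pattern, for example, can be sketched as a single regex (illustrative only; `parse_isil_name` is not a real package function):

```python
import re

# Matches "[ISIL Code] for [Capitalized Name]" as described above.
ISIL_FOR_NAME = re.compile(
    r"\b([A-Z]{2}-[A-Za-z0-9]+) for ((?:[A-Z][a-zA-Z]+ ?)+)"
)

def parse_isil_name(sentence: str):
    """Return (name, isil_code) if the pattern matches, else None."""
    match = ISIL_FOR_NAME.search(sentence)
    if match:
        return match.group(2).strip(), match.group(1)
    return None
```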
## Confidence Scoring
Confidence scores range from 0.0 to 1.0 based on the following criteria:
| Evidence | Score Increment |
|---|---|
| Base score | +0.3 |
| Has institution type | +0.2 |
| Has location (city) | +0.1 |
| Has identifier (ISIL, Wikidata, VIAF) | +0.3 |
| Name length is 2-6 words | +0.2 |
| Explicit "is a" pattern | +0.2 |
Maximum score: 1.0
Interpretation:
- 0.9-1.0: Explicit, unambiguous mentions with context
- 0.7-0.9: Clear mentions with some ambiguity
- 0.5-0.7: Inferred from context, may need verification
- 0.3-0.5: Low confidence, likely needs verification
- 0.0-0.3: Very uncertain, flag for manual review
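The increments above can be sketched as a standalone scoring function (a sketch of the documented table; the extractor's actual `_calculate_confidence` may differ in signature and detail):

```python
def calculate_confidence(name, institution_type, city, identifiers, sentence):
    """Apply the documented score increments, capped at 1.0."""
    score = 0.3                                       # base score
    if institution_type is not None:
        score += 0.2                                  # has institution type
    if city is not None:
        score += 0.1                                  # has location (city)
    if identifiers:
        score += 0.3                                  # has identifier
    if 2 <= len(name.split()) <= 6:
        score += 0.2                                  # name length is 2-6 words
    if " is a " in sentence or " is an " in sentence:
        score += 0.2                                  # explicit "is a" pattern
    return min(1.0, score)
```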
## Error Handling

The extractor uses the Result pattern for error handling:

```python
result = extractor.extract_from_text(text)

if result.success:
    institutions = result.value  # List[HeritageCustodian]
    for inst in institutions:
        # Process institution
        pass
else:
    error_message = result.error  # str
    print(f"Extraction failed: {error_message}")
```
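For readers unfamiliar with the pattern, a minimal `Result` wrapper looks roughly like this (a sketch only; the package's actual class may differ):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    """Success/failure wrapper: check .success before reading .value."""
    success: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def ok(cls, value: T) -> "Result[T]":
        return cls(success=True, value=value)

    @classmethod
    def fail(cls, error: str) -> "Result[T]":
        return cls(success=False, error=error)
```

The design choice here is that callers are forced to branch on `success` instead of catching exceptions, which keeps extraction failures explicit in batch pipelines.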
## Limitations

1. Name Extraction Accuracy: pattern-based name extraction may produce incomplete names or variants
   - Example: "Rijksmuseum" might be extracted as "Rijks Museum" or "Rijks Museu"
   - Mitigation: deduplication by name (case-insensitive)
2. No Dependency Parsing: no syntactic parsing is performed; the extractor relies on keyword proximity
   - Complex sentences may confuse the extractor
   - Names with articles (The, Het, La, Le) may be inconsistently captured
3. Multilingual Coverage: keyword lists cover major European languages but not all languages
   - Can be extended by adding keywords to `ExtractionPatterns.INSTITUTION_KEYWORDS`
4. Context-Dependent Accuracy: works best on well-formed sentences with clear institution mentions
   - Fragmentary text or lists may produce lower-quality extractions
5. No Entity Linking: entities are not linked to external knowledge bases (Wikidata, VIAF)
   - Identifiers are extracted but not validated
## Extending the Extractor

### Adding New Institution Keywords

```python
from glam_extractor.extractors.nlp_extractor import ExtractionPatterns
from glam_extractor.models import InstitutionType

# Add new keywords
ExtractionPatterns.INSTITUTION_KEYWORDS[InstitutionType.MUSEUM].extend([
    'museion',   # Greek
    'מוזיאון',    # Hebrew
    '博物馆'      # Chinese
])
```
### Custom Confidence Scoring

Subclass `InstitutionExtractor` and override `_calculate_confidence`:

```python
class CustomExtractor(InstitutionExtractor):
    def _calculate_confidence(self, name, institution_type, city, identifiers, sentence):
        # Custom scoring logic
        score = 0.5
        if len(identifiers) > 1:
            score += 0.4
        return min(1.0, score)
```
## Testing

Run the test suite:

```shell
pytest tests/test_nlp_extractor.py -v
```

Test coverage: 90%

Run the demo:

```shell
python examples/demo_nlp_extractor.py
```
## Performance
- Speed: ~1000 sentences/second on modern hardware
- Memory: Minimal (no ML models loaded)
- Scalability: Can process 139 conversation files in seconds
## Future Enhancements
Potential improvements (as noted in AGENTS.md):
- Use spaCy NER models via subagents for better named entity recognition
- Dependency parsing for more accurate name extraction
- Entity linking to Wikidata/VIAF for validation
- Geocoding integration to enrich location data with coordinates
- Cross-reference validation with CSV registries (ISIL, Dutch orgs)
## References

- Schema: `schemas/heritage_custodian.yaml`
- Models: `src/glam_extractor/models.py`
- Conversation Parser: `src/glam_extractor/parsers/conversation.py`
- AGENTS.md: lines 62-380 (NLP extraction tasks and patterns)