# NLP Institution Extractor

## Overview

The `InstitutionExtractor` class provides NLP-based extraction of heritage institution data from unstructured conversation text. It uses pattern matching, keyword detection, and heuristic rules to identify museums, libraries, archives, galleries, and other GLAM institutions.

## Features

- **Institution Name Extraction**: Identifies institution names using capitalization patterns and keyword context
- **Type Classification**: Classifies institutions into 13 types (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
- **Multilingual Support**: Recognizes institution keywords in English, Dutch, Spanish, Portuguese, French, German, and more
- **Identifier Extraction**: Extracts ISIL codes, Wikidata IDs, VIAF IDs, and KvK numbers
- **Location Extraction**: Identifies cities and countries mentioned in text
- **Confidence Scoring**: Assigns 0.0-1.0 confidence scores based on available evidence
- **Provenance Tracking**: Records extraction method, date, confidence, and source conversation

## Installation

The extractor is part of the `glam_extractor` package and has no external NLP dependencies; it relies on pure-Python pattern matching.

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor
```

## Usage

### Basic Text Extraction

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

extractor = InstitutionExtractor()

text = "The Rijksmuseum in Amsterdam (ISIL: NL-AsdRM) is a major art museum."
result = extractor.extract_from_text(text)

if result.success:
    for institution in result.value:
        print(f"{institution.name} - {institution.institution_type}")
        print(f"Confidence: {institution.provenance.confidence_score}")
```

### Extracting from Conversation

```python
from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation file
parser = ConversationParser()
conversation = parser.parse_file("path/to/conversation.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_conversation(conversation)

if result.success:
    print(f"Found {len(result.value)} institutions")
    for inst in result.value:
        print(f"- {inst.name} ({inst.institution_type})")
```

### With Provenance Tracking

```python
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

extractor = InstitutionExtractor()

# Pass conversation metadata so it is recorded in each record's provenance
result = extractor.extract_from_text(
    text="The Amsterdam Museum has ISIL code NL-AsdAM.",
    conversation_id="conv-12345",
    conversation_name="Dutch Heritage Institutions"
)

if result.success and result.value:
    institution = result.value[0]

    # Access provenance metadata
    prov = institution.provenance
    print(f"Data Source: {prov.data_source}")  # CONVERSATION_NLP
    print(f"Data Tier: {prov.data_tier}")  # TIER_4_INFERRED
    print(f"Confidence: {prov.confidence_score}")
    print(f"Method: {prov.extraction_method}")
    print(f"Conversation: {prov.conversation_id}")
```

## Extracted Data Model

The extractor returns `HeritageCustodian` objects with the following fields populated:

| Field | Description | Example |
|-------|-------------|---------|
| `name` | Institution name | "Rijksmuseum" |
| `institution_type` | Type enum | `InstitutionType.MUSEUM` |
| `organization_status` | Status enum | `OrganizationStatus.UNKNOWN` |
| `locations` | List of Location objects | `[Location(city="Amsterdam", country="NL")]` |
| `identifiers` | List of Identifier objects | `[Identifier(scheme="ISIL", value="NL-AsdRM")]` |
| `provenance` | Provenance metadata | See below |
| `description` | Contextual description | Includes conversation name and text snippet |

### Provenance Fields

Every extracted record includes complete provenance metadata. The values below are illustrative:

```python
from datetime import datetime, timezone

# Provenance, DataSource, and DataTier are the package's model classes;
# see the Models entry under References.
provenance = Provenance(
    data_source=DataSource.CONVERSATION_NLP,
    data_tier=DataTier.TIER_4_INFERRED,
    extraction_date=datetime.now(timezone.utc),
    extraction_method="Pattern matching + heuristic NER",
    confidence_score=0.85,
    conversation_id="conversation-uuid",
    source_url=None,
    verified_date=None,
    verified_by=None
)
```

## Pattern Detection

### Identifier Patterns

The extractor recognizes the following identifier patterns:

- **ISIL codes**: `[A-Z]{2}-[A-Za-z0-9]+` (e.g., `NL-AsdRM`, `US-MWA`)
- **Wikidata IDs**: `Q[0-9]+` (e.g., `Q924335`)
- **VIAF IDs**: `viaf.org/viaf/[0-9]+` (e.g., `viaf.org/viaf/123456789`)
- **KvK numbers**: `[0-9]{8}` (Dutch Chamber of Commerce)

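As a minimal sketch of how the expressions listed above could be applied with Python's `re` module: the names `IDENTIFIER_PATTERNS` and `find_identifiers` are hypothetical, and the word boundaries and digit lookarounds are assumptions added here to cut down on false positives, not something the extractor is known to do.

```python
import re

# Hypothetical pattern table built from the expressions listed above.
# The \b anchors and lookarounds are additions for this sketch.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "Wikidata": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
    # Exactly eight digits, not part of a longer number or a URL path.
    "KvK": re.compile(r"(?<![0-9/])[0-9]{8}(?![0-9])"),
}


def find_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (scheme, value) pairs for every pattern match in `text`."""
    found = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the captured group when the pattern defines one (VIAF).
            value = match.group(1) if pattern.groups else match.group(0)
            found.append((scheme, value))
    return found


sample = (
    "Rijksmuseum (ISIL: NL-AsdRM, Wikidata Q924335, KvK 12345678), "
    "see viaf.org/viaf/123456789"
)
print(find_identifiers(sample))
```

The bare eight-digit KvK expression is the broadest of the four, so any real use would probably want surrounding context (for example the word "KvK") before trusting a match.
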
### Institution Type Keywords

Multilingual keyword detection for institution types:

```python
# Excerpt from ExtractionPatterns.INSTITUTION_KEYWORDS (three of the 13 types)
INSTITUTION_KEYWORDS = {
    InstitutionType.MUSEUM: [
        'museum', 'museo', 'museu', 'musée', 'muzeum', 'muzeul',
        'kunstmuseum', 'kunsthalle', 'muzej'
    ],
    InstitutionType.LIBRARY: [
        'library', 'biblioteca', 'bibliothek', 'bibliotheek',
        'bibliothèque', 'biblioteka', 'national library'
    ],
    InstitutionType.ARCHIVE: [
        'archive', 'archivo', 'archiv', 'archief', 'archives',
        'arkiv', 'national archive'
    ],
}
```

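To illustrate how a keyword table like this might drive type classification, here is a small sketch. The stand-in `InstitutionType` enum, the `classify` function, and the longest-keyword tie-break are all assumptions made for the example; the package's own matching rules may differ.

```python
from enum import Enum


class InstitutionType(Enum):
    # Stand-in for the package's enum; only three members are shown.
    MUSEUM = "MUSEUM"
    LIBRARY = "LIBRARY"
    ARCHIVE = "ARCHIVE"


KEYWORDS = {
    InstitutionType.MUSEUM: ["museum", "museo", "museu", "musée", "muzeum"],
    InstitutionType.LIBRARY: ["library", "biblioteca", "bibliothek", "bibliotheek"],
    InstitutionType.ARCHIVE: ["archive", "archivo", "archiv", "archief", "arkiv"],
}


def classify(text: str):
    """Return the type whose longest keyword occurs in `text`, or None."""
    lowered = text.lower()
    best_type, best_len = None, 0
    for inst_type, keywords in KEYWORDS.items():
        for keyword in keywords:
            # Plain substring match, so "Rijksmuseum" still hits "museum".
            if keyword in lowered and len(keyword) > best_len:
                best_type, best_len = inst_type, len(keyword)
    return best_type


print(classify("Het Nationaal Archief in Den Haag"))
print(classify("Biblioteca Nacional de España"))
```

Substring matching is what lets compound names such as "Rijksmuseum" classify correctly, at the cost of occasional hits inside unrelated words.
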
### Location Patterns

- **City extraction**: `in [City Name]` pattern (e.g., "in Amsterdam")
- **Country extraction**: From ISIL code prefix (e.g., `NL-` → `NL`)

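The two location rules above can be sketched as follows. The regular expressions and the function names `extract_city` and `country_from_isil` are illustrative assumptions, not the extractor's actual code.

```python
import re

# "in" followed by one or more capitalized words, e.g. "in Amsterdam".
CITY_PATTERN = re.compile(r"\bin ((?:[A-Z][\w'-]*)(?: [A-Z][\w'-]*)*)")
# The two-letter prefix of an ISIL code is captured as the country code.
ISIL_PATTERN = re.compile(r"\b([A-Z]{2})-[A-Za-z0-9]+\b")


def extract_city(sentence: str):
    """Return the capitalized phrase after the first 'in', or None."""
    match = CITY_PATTERN.search(sentence)
    return match.group(1) if match else None


def country_from_isil(sentence: str):
    """Return the ISIL prefix of the first ISIL-like code, or None."""
    match = ISIL_PATTERN.search(sentence)
    return match.group(1) if match else None


sentence = "The Rijksmuseum in Amsterdam (ISIL: NL-AsdRM) is a major art museum."
print(extract_city(sentence), country_from_isil(sentence))
```
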
### Name Extraction

Two primary patterns:

1. **Keyword-based**: `[Name] + [Type Keyword]`
   - Example: "Rijks" + "museum" → "Rijks Museum"

2. **ISIL-based**: `[ISIL Code] for [Name]`
   - Example: "NL-AsdAM for Amsterdam Museum"

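A simplified sketch of both patterns is shown below. It is an approximation under stated assumptions: `names_by_keyword`, `name_by_isil`, and the three-keyword tuple are hypothetical, and the keyword-based variant simply keeps any run of capitalized words that contains a type keyword, so it does not split or re-capitalize names the way the "Rijks Museum" example above suggests the real extractor does.

```python
import re

# One or more consecutive capitalized words.
CAPITALIZED_RUN = re.compile(r"(?:[A-Z][\w'-]*)(?: [A-Z][\w'-]*)*")
# "[ISIL Code] for [Name]": capture the code and the capitalized words after "for".
ISIL_NAME = re.compile(
    r"\b([A-Z]{2}-[A-Za-z0-9]+) for ((?:[A-Z][\w'-]*)(?: [A-Z][\w'-]*)*)"
)
TYPE_KEYWORDS = ("museum", "library", "archive")  # illustrative subset


def names_by_keyword(sentence: str) -> list[str]:
    """Keep capitalized runs that contain a type keyword."""
    names = []
    for run in CAPITALIZED_RUN.findall(sentence):
        if any(keyword in run.lower() for keyword in TYPE_KEYWORDS):
            names.append(run)
    return names


def name_by_isil(sentence: str):
    """Return (isil_code, name) for the first '[code] for [Name]' match, or None."""
    match = ISIL_NAME.search(sentence)
    return (match.group(1), match.group(2)) if match else None


print(names_by_keyword("We visited Amsterdam Museum yesterday."))
print(name_by_isil("The code NL-AsdAM for Amsterdam Museum was confirmed."))
```

Because a leading "The" is itself capitalized, this sketch would return "The Rijksmuseum" for a sentence that starts with the article, which is one way the article inconsistency noted under Limitations can arise.
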
## Confidence Scoring

Confidence scores fall on a 0.0 to 1.0 scale and are built up from the following evidence:

| Evidence | Score Increment |
|----------|----------------|
| Base score | +0.3 |
| Has institution type | +0.2 |
| Has location (city) | +0.1 |
| Has identifier (ISIL, Wikidata, VIAF) | +0.3 |
| Name length is 2-6 words | +0.2 |
| Explicit "is a" pattern | +0.2 |

**Maximum score**: 1.0 (the increments sum to 1.3, so the total is capped)

**Interpretation**:
- **0.9-1.0**: Explicit, unambiguous mentions with context
- **0.7-0.9**: Clear mentions with some ambiguity
- **0.5-0.7**: Inferred from context, may need verification
- **0.3-0.5**: Low confidence, likely needs verification
- **0.0-0.3**: Very uncertain, flag for manual review

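The increment table above can be sketched as a single function. This is a minimal illustration rather than the package's implementation: the parameter list mirrors the `_calculate_confidence` signature shown under Extending the Extractor, while the exact checks (for instance, the substring test for "is a") and the rounding are assumptions.

```python
def calculate_confidence(name, institution_type=None, city=None,
                         identifiers=(), sentence=""):
    """Sum the evidence increments from the table and cap the total at 1.0."""
    score = 0.3  # base score
    if institution_type is not None:
        score += 0.2
    if city:
        score += 0.1
    if identifiers:
        score += 0.3
    if 2 <= len(name.split()) <= 6:
        score += 0.2
    # Assumed check for the explicit "is a" pattern (whole words only).
    if " is a " in f" {sentence.lower()} ":
        score += 0.2
    # Rounding hides floating-point noise such as 0.6000000000000001.
    return round(min(1.0, score), 2)


print(calculate_confidence("Rijksmuseum", "MUSEUM", "Amsterdam"))
print(calculate_confidence(
    "Amsterdam Museum", "MUSEUM", "Amsterdam", ["NL-AsdAM"],
    "The Amsterdam Museum is a city museum.",
))
```

With every kind of evidence present the raw sum is 1.3, so the cap is what keeps the second call at 1.0.
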
## Error Handling

The extractor uses the Result pattern for error handling:

```python
result = extractor.extract_from_text(text)

if result.success:
    institutions = result.value  # List[HeritageCustodian]
    for inst in institutions:
        # Process institution
        pass
else:
    error_message = result.error  # str
    print(f"Extraction failed: {error_message}")
```

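For readers unfamiliar with the Result pattern, the sketch below shows one way such a type could look. Only the `success`, `value`, and `error` attributes come from the usage above; the `ok` and `fail` constructors and the toy `extract_names` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Carries either a value or an error message instead of raising."""
    success: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def ok(cls, value):
        return cls(success=True, value=value)

    @classmethod
    def fail(cls, message):
        return cls(success=False, error=message)


def extract_names(text):
    """Toy stand-in for an extraction call that reports errors via Result."""
    if not isinstance(text, str) or not text.strip():
        return Result.fail("text must be a non-empty string")
    # Keep title-cased words as a crude placeholder for real extraction.
    return Result.ok([word for word in text.split() if word.istitle()])


result = extract_names("the Rijksmuseum in Amsterdam")
print(result.value if result.success else result.error)
```

The benefit of this design is that callers handle failures through an ordinary branch on `success` rather than wrapping every call in `try`/`except`.
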
## Limitations

1. **Name Extraction Accuracy**: Pattern-based name extraction may produce incomplete names or variants
   - Example: "Rijksmuseum" might be extracted as "Rijks Museum" or "Rijks Museu"
   - Mitigation: Deduplication by name (case-insensitive)

2. **No Dependency Parsing**: Does not perform syntactic parsing, relies on keyword proximity
   - Complex sentences may confuse the extractor
   - Names with articles (The, Het, La, Le) may be inconsistently captured

3. **Multilingual Coverage**: Keyword lists cover major European languages but not all languages
   - Can be extended by adding keywords to `ExtractionPatterns.INSTITUTION_KEYWORDS`

4. **Context-Dependent Accuracy**: Works best on well-formed sentences with clear institution mentions
   - Fragmentary text or lists may produce lower-quality extractions

5. **No Entity Linking**: Does not link entities to external knowledge bases (Wikidata, VIAF)
   - Identifiers are extracted but not validated

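The case-insensitive deduplication mentioned under the first limitation could look roughly like this. The sketch uses plain dictionaries in place of `HeritageCustodian` objects, and keeping the highest-confidence record per name is an assumption; the document only states that duplicates are merged by name.

```python
def deduplicate_by_name(institutions):
    """Keep one record per case-insensitive name, preferring higher confidence."""
    best = {}
    for inst in institutions:
        # Collapse whitespace and ignore case when comparing names.
        key = " ".join(inst["name"].split()).casefold()
        if key not in best or inst["confidence"] > best[key]["confidence"]:
            best[key] = inst
    return list(best.values())


records = [
    {"name": "Rijksmuseum", "confidence": 0.6},
    {"name": "RIJKSMUSEUM", "confidence": 0.9},
    {"name": "Amsterdam Museum", "confidence": 0.8},
]
print(deduplicate_by_name(records))
```

Note that this only merges names that differ in case or spacing between words; variants such as "Rijks Museum" versus "Rijksmuseum" would still survive as separate records.
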
## Extending the Extractor

### Adding New Institution Keywords

```python
from glam_extractor.extractors.nlp_extractor import ExtractionPatterns
from glam_extractor.models import InstitutionType

# Add new keywords
ExtractionPatterns.INSTITUTION_KEYWORDS[InstitutionType.MUSEUM].extend([
    'museion',  # Greek
    'מוזיאון',  # Hebrew
    '博物馆'  # Chinese
])
```

### Custom Confidence Scoring

Subclass `InstitutionExtractor` and override `_calculate_confidence`:

```python
class CustomExtractor(InstitutionExtractor):
    def _calculate_confidence(self, name, institution_type, city, identifiers, sentence):
        # Custom scoring logic
        score = 0.5
        if len(identifiers) > 1:
            score += 0.4
        return min(1.0, score)
```

## Testing

Run the test suite:

```bash
pytest tests/test_nlp_extractor.py -v
```

Test coverage: **90%**

Run the demo:

```bash
python examples/demo_nlp_extractor.py
```

## Performance

- **Speed**: ~1000 sentences/second on modern hardware
- **Memory**: Minimal (no ML models loaded)
- **Scalability**: Can process 139 conversation files in seconds

## Future Enhancements

Potential improvements (as noted in AGENTS.md):

1. **Use spaCy NER models** via subagents for better named entity recognition
2. **Dependency parsing** for more accurate name extraction
3. **Entity linking** to Wikidata/VIAF for validation
4. **Geocoding integration** to enrich location data with coordinates
5. **Cross-reference validation** with CSV registries (ISIL, Dutch orgs)

## References

- **Schema**: `schemas/heritage_custodian.yaml`
- **Models**: `src/glam_extractor/models.py`
- **Conversation Parser**: `src/glam_extractor/parsers/conversation.py`
- **AGENTS.md**: Lines 62-380 (NLP extraction tasks and patterns)