
NLP Institution Extractor

Overview

The InstitutionExtractor class provides NLP-based extraction of heritage institution data from unstructured conversation text. It uses pattern matching, keyword detection, and heuristic rules to identify museums, libraries, archives, galleries, and other GLAM institutions.

Features

  • Institution Name Extraction: Identifies institution names using capitalization patterns and keyword context
  • Type Classification: Classifies institutions into 13 types (MUSEUM, LIBRARY, ARCHIVE, GALLERY, etc.)
  • Multilingual Support: Recognizes institution keywords in English, Dutch, Spanish, Portuguese, French, German, and more
  • Identifier Extraction: Extracts ISIL codes, Wikidata IDs, VIAF IDs, and KvK numbers
  • Location Extraction: Identifies cities and countries mentioned in text
  • Confidence Scoring: Assigns 0.0-1.0 confidence scores based on available evidence
  • Provenance Tracking: Records extraction method, date, confidence, and source conversation

Installation

The extractor is part of the glam_extractor package and has no external NLP dependencies (uses pure Python pattern matching).

from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

Usage

Basic Text Extraction

from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

extractor = InstitutionExtractor()

text = "The Rijksmuseum in Amsterdam (ISIL: NL-AsdRM) is a major art museum."
result = extractor.extract_from_text(text)

if result.success:
    for institution in result.value:
        print(f"{institution.name} - {institution.institution_type}")
        print(f"Confidence: {institution.provenance.confidence_score}")

Extracting from Conversation

from glam_extractor.parsers.conversation import ConversationParser
from glam_extractor.extractors.nlp_extractor import InstitutionExtractor

# Parse conversation file
parser = ConversationParser()
conversation = parser.parse_file("path/to/conversation.json")

# Extract institutions
extractor = InstitutionExtractor()
result = extractor.extract_from_conversation(conversation)

if result.success:
    print(f"Found {len(result.value)} institutions")
    for inst in result.value:
        print(f"- {inst.name} ({inst.institution_type})")

With Provenance Tracking

result = extractor.extract_from_text(
    text="The Amsterdam Museum has ISIL code NL-AsdAM.",
    conversation_id="conv-12345",
    conversation_name="Dutch Heritage Institutions"
)

if result.success and result.value:
    institution = result.value[0]
    
    # Access provenance metadata
    prov = institution.provenance
    print(f"Data Source: {prov.data_source}")  # CONVERSATION_NLP
    print(f"Data Tier: {prov.data_tier}")      # TIER_4_INFERRED
    print(f"Confidence: {prov.confidence_score}")
    print(f"Method: {prov.extraction_method}")
    print(f"Conversation: {prov.conversation_id}")

Extracted Data Model

The extractor returns HeritageCustodian objects with the following fields populated:

| Field | Description | Example |
|---|---|---|
| name | Institution name | "Rijksmuseum" |
| institution_type | Type enum | InstitutionType.MUSEUM |
| organization_status | Status enum | OrganizationStatus.UNKNOWN |
| locations | List of Location objects | [Location(city="Amsterdam", country="NL")] |
| identifiers | List of Identifier objects | [Identifier(scheme="ISIL", value="NL-AsdRM")] |
| provenance | Provenance metadata | See below |
| description | Contextual description | Includes conversation name and text snippet |

Provenance Fields

Every extracted record includes complete provenance metadata:

provenance = Provenance(
    data_source=DataSource.CONVERSATION_NLP,
    data_tier=DataTier.TIER_4_INFERRED,
    extraction_date=datetime.now(timezone.utc),
    extraction_method="Pattern matching + heuristic NER",
    confidence_score=0.85,
    conversation_id="conversation-uuid",
    source_url=None,
    verified_date=None,
    verified_by=None
)

Pattern Detection

Identifier Patterns

The extractor recognizes the following identifier patterns:

  • ISIL codes: [A-Z]{2}-[A-Za-z0-9]+ (e.g., NL-AsdRM, US-MWA)
  • Wikidata IDs: Q[0-9]+ (e.g., Q924335)
  • VIAF IDs: viaf.org/viaf/[0-9]+ (e.g., viaf.org/viaf/123456789)
  • KvK numbers: [0-9]{8} (Dutch Chamber of Commerce)
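The patterns above can be sketched as plain re expressions. This is an illustrative stand-in; the exact regexes compiled in ExtractionPatterns may differ:

```python
import re

# Illustrative regexes for the identifier patterns above;
# the expressions shipped in ExtractionPatterns may differ.
IDENTIFIER_PATTERNS = {
    "ISIL": re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+\b"),
    "WIKIDATA": re.compile(r"\bQ[0-9]+\b"),
    "VIAF": re.compile(r"viaf\.org/viaf/([0-9]+)"),
    "KVK": re.compile(r"\b[0-9]{8}\b"),  # Dutch Chamber of Commerce
}

def find_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (scheme, value) pairs found in free text."""
    found = []
    for scheme, pattern in IDENTIFIER_PATTERNS.items():
        for match in pattern.finditer(text):
            # VIAF captures the numeric part; the others take the whole match
            value = match.group(1) if pattern.groups else match.group(0)
            found.append((scheme, value))
    return found
```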

Institution Type Keywords

Multilingual keyword detection for institution types:

InstitutionType.MUSEUM: [
    'museum', 'museo', 'museu', 'musée', 'muzeum', 'muzeul',
    'kunstmuseum', 'kunsthalle', 'muzej'
]

InstitutionType.LIBRARY: [
    'library', 'biblioteca', 'bibliothek', 'bibliotheek',
    'bibliothèque', 'biblioteka', 'national library'
]

InstitutionType.ARCHIVE: [
    'archive', 'archivo', 'archiv', 'archief', 'archives',
    'arkiv', 'national archive'
]

Location Patterns

  • City extraction: in [City Name] pattern (e.g., "in Amsterdam")
  • Country extraction: From the ISIL code's country prefix (e.g., the NL in NL-AsdRM)

Name Extraction

Two primary patterns:

  1. Keyword-based: [Name] + [Type Keyword]

    • Example: "Rijks" + "museum" → "Rijks Museum"
  2. ISIL-based: [ISIL Code] for [Name]

    • Example: "NL-AsdAM for Amsterdam Museum"
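The two patterns can be sketched as regexes plus the case-insensitive deduplication mentioned under Limitations. These expressions are illustrative, not the package's actual ones, and they share the article-capture quirk noted below ("The Rijksmuseum" keeps its article):

```python
import re

# 1. Keyword-based: capitalized words ending in a type keyword.
KEYWORD_NAME = re.compile(
    r"\b((?:[A-Z][a-zA-Z]*\s+){0,3}[a-zA-Z]*"
    r"(?:[Mm]useum|[Ll]ibrary|[Aa]rchive|[Gg]allery))\b"
)
# 2. ISIL-based: "[ISIL code] for [Name]".
ISIL_NAME = re.compile(r"\b[A-Z]{2}-[A-Za-z0-9]+ for ((?:[A-Z][a-zA-Z]*\s*)+)")

def extract_names(text: str) -> list[str]:
    """Collect candidate names from both patterns, deduplicated
    case-insensitively, keeping the first occurrence."""
    names = [m.group(1).strip() for m in KEYWORD_NAME.finditer(text)]
    names += [m.group(1).strip() for m in ISIL_NAME.finditer(text)]
    seen, unique = set(), []
    for name in names:
        if name.lower() not in seen:
            seen.add(name.lower())
            unique.append(name)
    return unique
```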

Confidence Scoring

Confidence scores range from 0.0 to 1.0 based on the following criteria:

| Evidence | Score Increment |
|---|---|
| Base score | +0.3 |
| Has institution type | +0.2 |
| Has location (city) | +0.1 |
| Has identifier (ISIL, Wikidata, VIAF) | +0.3 |
| Name length is 2-6 words | +0.2 |
| Explicit "is a" pattern | +0.2 |

Maximum score: 1.0

Interpretation:

  • 0.9-1.0: Explicit, unambiguous mentions with context
  • 0.7-0.9: Clear mentions with some ambiguity
  • 0.5-0.7: Inferred from context, may need verification
  • 0.3-0.5: Low confidence, likely needs verification
  • 0.0-0.3: Very uncertain, flag for manual review
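The additive table above translates directly into a small function. The signature here is illustrative; the package's _calculate_confidence receives the extracted fields themselves rather than booleans:

```python
def score_confidence(has_type: bool, has_city: bool, has_identifier: bool,
                     name_word_count: int, explicit_is_a: bool) -> float:
    """Sketch of the additive confidence rules, capped at 1.0."""
    score = 0.3  # base score
    if has_type:
        score += 0.2
    if has_city:
        score += 0.1
    if has_identifier:
        score += 0.3
    if 2 <= name_word_count <= 6:
        score += 0.2
    if explicit_is_a:
        score += 0.2
    return min(1.0, score)
```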

Error Handling

The extractor uses the Result pattern for error handling:

result = extractor.extract_from_text(text)

if result.success:
    institutions = result.value  # List[HeritageCustodian]
    for inst in institutions:
        # Process institution
        pass
else:
    error_message = result.error  # str
    print(f"Extraction failed: {error_message}")

Limitations

  1. Name Extraction Accuracy: Pattern-based name extraction may produce incomplete names or variants

    • Example: "Rijksmuseum" might be extracted as "Rijks Museum" or "Rijks Museu"
    • Mitigation: Deduplication by name (case-insensitive)
  2. No Dependency Parsing: Does not perform syntactic parsing; it relies on keyword proximity instead

    • Complex sentences may confuse the extractor
    • Names with articles (The, Het, La, Le) may be inconsistently captured
  3. Multilingual Coverage: Keyword lists cover major European languages but not all languages

    • Can be extended by adding keywords to ExtractionPatterns.INSTITUTION_KEYWORDS
  4. Context-Dependent Accuracy: Works best on well-formed sentences with clear institution mentions

    • Fragmentary text or lists may produce lower-quality extractions
  5. No Entity Linking: Does not link entities to external knowledge bases (Wikidata, VIAF)

    • Identifiers are extracted but not validated

Extending the Extractor

Adding New Institution Keywords

from glam_extractor.extractors.nlp_extractor import ExtractionPatterns
from glam_extractor.models import InstitutionType

# Add new keywords
ExtractionPatterns.INSTITUTION_KEYWORDS[InstitutionType.MUSEUM].extend([
    'museion',  # Greek
    'מוזיאון',  # Hebrew
    '博物馆'  # Chinese
])

Custom Confidence Scoring

Subclass InstitutionExtractor and override _calculate_confidence:

class CustomExtractor(InstitutionExtractor):
    def _calculate_confidence(self, name, institution_type, city, identifiers, sentence):
        # Custom scoring logic
        score = 0.5
        if len(identifiers) > 1:
            score += 0.4
        return min(1.0, score)

Testing

Run the test suite:

pytest tests/test_nlp_extractor.py -v

Test coverage: 90%

Run the demo:

python examples/demo_nlp_extractor.py

Performance

  • Speed: ~1000 sentences/second on modern hardware
  • Memory: Minimal (no ML models loaded)
  • Scalability: Can process 139 conversation files in seconds

Future Enhancements

Potential improvements (as noted in AGENTS.md):

  1. Use spaCy NER models via subagents for better named entity recognition
  2. Dependency parsing for more accurate name extraction
  3. Entity linking to Wikidata/VIAF for validation
  4. Geocoding integration to enrich location data with coordinates
  5. Cross-reference validation with CSV registries (ISIL, Dutch orgs)

References

  • Schema: schemas/heritage_custodian.yaml
  • Models: src/glam_extractor/models.py
  • Conversation Parser: src/glam_extractor/parsers/conversation.py
  • AGENTS.md: Lines 62-380 (NLP extraction tasks and patterns)