
Name Matching Algorithms for Heritage Institution Deduplication

Executive Summary

This document analyzes string similarity algorithms for matching heritage institution names. Our current implementation uses Python's difflib.SequenceMatcher, which failed to match 4 out of 536 museums (a 0.75% failure rate) due to legal entity prefixes such as "Stichting" (Dutch for foundation).

Recommendation: Implement a hybrid multi-layer approach combining:

  1. Name normalization (remove legal entity prefixes)
  2. Token-based matching (TF-IDF + Soft Jaccard)
  3. Character-based refinement (Jaro-Winkler)

Current Implementation

Algorithm: difflib.SequenceMatcher

File: scripts/integrate_museumregister_nl.py

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Calculate string similarity ratio."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

Threshold: 0.85 (85% similarity required for match)

Algorithm Details

SequenceMatcher uses the Ratcliff/Obershelp algorithm (also called Gestalt Pattern Matching):

  • Finds the longest contiguous matching subsequence
  • Recursively finds matches in the remaining parts
  • Returns ratio: 2.0 * M / T where M = matches, T = total characters
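The ratio above can be checked directly against the first failure case with nothing but the standard library:

```python
from difflib import SequenceMatcher

a, b = "drents museum", "stichting drents museum"
sm = SequenceMatcher(None, a, b)
# M = 13 matching characters, T = 13 + 23 = 36 total characters
# ratio = 2.0 * 13 / 36 ≈ 0.722, below the 0.85 threshold
print(round(sm.ratio(), 3))  # 0.722
```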

Failure Cases

| Museum Register Name | NDE Entry Name | Similarity | Status |
|----------------------|----------------|------------|--------|
| Drents Museum | Stichting Drents Museum | 0.722 | FAILED |
| Stedelijk Museum Alkmaar | Stichting Stedelijk Museum Alkmaar | 0.828 | FAILED |
| Museum Het Warenhuis | Het Warenhuis - Museum Het Land van Axel | 0.433 | FAILED |
| Museum Joure | Museum Joure | 1.000 | WRONG MATCH* |

*Museum Joure matched correctly by name but received enrichment data for a different institution (Museum MORE) because an earlier URL-based match took precedence.


Algorithm Taxonomy

1. Edit-Distance Based (Character-Level)

Levenshtein Distance

  • Metric: Minimum number of single-character edits (insertions, deletions, substitutions)
  • Complexity: O(n×m) time and space
  • Best for: Typos, OCR errors, minor spelling variations
  • Weakness: Sensitive to word order, poor for long strings with reordered tokens
# Example
levenshtein("Drents Museum", "Stichting Drents Museum")  # Distance: 10

Damerau-Levenshtein

  • Adds transposition as a valid edit operation
  • Better for keyboard typos (e.g., "teh" → "the")

Normalized Levenshtein

  • 1 - (levenshtein_distance / max(len(a), len(b)))
  • Returns similarity ratio 0-1
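A minimal sketch of both variants, using only the standard library (a production pipeline would use a C-backed library such as rapidfuzz):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Drents Museum", "Stichting Drents Museum"))  # 10
print(round(normalized_levenshtein("Drents Museum", "Stichting Drents Museum"), 2))  # 0.57
```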

2. Phonetic Algorithms

Soundex

  • Encodes words by sound (English-focused)
  • "Rijksmuseum" and "Ryksmuzeum" → same code
  • Weakness: Language-specific, poor for Dutch/German names
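A compact American Soundex sketch illustrates the "Rijksmuseum"/"Ryksmuzeum" claim (an illustration only, not part of the proposed pipeline):

```python
def soundex(name: str) -> str:
    """American Soundex: first letter + up to three consonant-group digits."""
    codes = {c: str(d) for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    name = name.upper()
    digits, prev = [], codes.get(name[0])
    for c in name[1:]:
        if c in "HW":        # H and W are skipped and do not reset the run
            continue
        d = codes.get(c)     # vowels and Y have no code and reset the run
        if d and d != prev:
            digits.append(d)
        prev = d
    return (name[0] + "".join(digits) + "000")[:4]

print(soundex("Rijksmuseum"))  # R252
print(soundex("Ryksmuzeum"))   # R252, identical code despite the misspelling
```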

Metaphone / Double Metaphone

  • More sophisticated phonetic encoding
  • Better for non-English names
  • Still limited for Dutch institutional names

3. Token-Based (Word-Level)

Jaccard Similarity

  • |A ∩ B| / |A ∪ B| (intersection over union of token sets)
  • Best for: Names with different word orders
  • Weakness: Exact token matching only
# Example
jaccard({"Drents", "Museum"}, {"Stichting", "Drents", "Museum"})
# = 2/3 = 0.667

Soft Jaccard (Generalized Jaccard)

  • Allows fuzzy token matching using secondary similarity measure
  • Tokens match if similarity > threshold
  • Best for: Token-level typos + word reordering
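A sketch of the idea, using difflib's SequenceMatcher as a stand-in for the Jaro-Winkler secondary measure:

```python
from difflib import SequenceMatcher

def soft_jaccard(a: set[str], b: set[str], threshold: float = 0.85) -> float:
    """Jaccard over tokens, where tokens 'match' if a secondary
    similarity measure clears a threshold."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    unmatched = set(b)
    matches = 0
    for t in a:
        best = max(unmatched, key=lambda u: sim(t, u), default=None)
        if best is not None and sim(t, best) >= threshold:
            matches += 1
            unmatched.discard(best)
    return matches / (len(a) + len(b) - matches)

# "drentse" (a token-level typo) still matches "drents", so the score
# equals plain Jaccard on the corrected tokens: 2/3 ≈ 0.667
print(soft_jaccard({"drentse", "museum"}, {"stichting", "drents", "museum"}))
```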

TF-IDF Cosine Similarity

  • Weights tokens by Term Frequency × Inverse Document Frequency
  • Common words (like "Museum") get lower weight
  • Best for: Large corpora with repeated terms
# "Stichting" appears in many Dutch institution names
# → Low IDF weight → Less impact on similarity
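The effect can be seen by computing raw IDF weights over a toy corpus (the corpus below is invented for illustration):

```python
import math

# Toy corpus of tokenized Dutch institution names (illustrative only)
corpus = [
    ["stichting", "drents", "museum"],
    ["stichting", "stedelijk", "museum", "alkmaar"],
    ["stichting", "museum", "joure"],
]

def idf(token: str) -> float:
    df = sum(token in doc for doc in corpus)  # document frequency
    return math.log(len(corpus) / df)

print(idf("stichting"))         # 0.0, appears everywhere, carries no weight
print(idf("museum"))            # 0.0
print(round(idf("drents"), 2))  # 1.1, rare token, dominates the similarity
```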

Overlap Coefficient

  • |A ∩ B| / min(|A|, |B|)
  • Good when one name is a subset of another
  • "Drents Museum" vs "Stichting Drents Museum" → 2/2 = 1.0
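The subset case above, as a one-line check:

```python
def overlap_coefficient(a: set[str], b: set[str]) -> float:
    return len(a & b) / min(len(a), len(b))

a = {"drents", "museum"}
b = {"stichting", "drents", "museum"}
print(overlap_coefficient(a, b))  # 1.0, since a is a subset of b
```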

4. Hybrid Methods

Jaro-Winkler

  • Character-based with prefix bonus
  • Designed for short strings (names)
  • Weights matching characters by proximity
  • Best for: Person names, short entity names
# Jaro-Winkler formula:
# jaro = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)
# winkler = jaro + (l * p * (1 - jaro))
# where l = common prefix length (max 4), p = scaling factor (0.1)
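The formula translates to a short reference implementation (for production, prefer a library such as rapidfuzz; the pair "MARTHA"/"MARHTA" is the classic textbook example):

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched = []
    for i, c in enumerate(s1):  # collect matches within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched.append(c)
                break
    m = len(matched)
    if m == 0:
        return 0.0
    s2_matched = [s2[j] for j in range(len(s2)) if used[j]]
    t = sum(a != b for a, b in zip(matched, s2_matched)) // 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    j = jaro(s1, s2)
    l = 0  # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```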

Monge-Elkan

  • Hybrid token + character approach
  • For each token in A, finds best-matching token in B
  • Uses secondary measure (Jaro-Winkler) for token comparison
  • Returns average of best matches
# Monge-Elkan("Drents Museum", "Stichting Drents Museum")
# Token "Drents" → best match "Drents" (1.0)
# Token "Museum" → best match "Museum" (1.0)
# Average = 1.0 ✓
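The worked example above can be sketched in a few lines, again with SequenceMatcher standing in for the Jaro-Winkler secondary measure:

```python
from difflib import SequenceMatcher

def monge_elkan(a_tokens: list[str], b_tokens: list[str]) -> float:
    """For each token in A, take its best-matching token in B; average the scores."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return sum(max(sim(t, u) for u in b_tokens) for t in a_tokens) / len(a_tokens)

print(monge_elkan(["drents", "museum"], ["stichting", "drents", "museum"]))  # 1.0
```

Note that the measure is asymmetric: swapping the arguments lowers the score, because "stichting" has no good counterpart among the tokens of the shorter name.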

Soft TF-IDF

  • Combines TF-IDF weighting with Jaro-Winkler token similarity
  • Best overall performer in Cohen et al. (2003) benchmark
  • Handles both token reordering and character-level variations

5. Semantic / ML-Based

Transformer Embeddings (BERT, Sentence-BERT)

  • Encode names as dense vectors
  • Cosine similarity in embedding space
  • Best for: Semantic variations, synonyms, translations
  • Weakness: Overkill for syntactic matching, requires training data

Learned Matchers (Magellan, DeepMatcher)

  • Train classifiers on labeled match/non-match pairs
  • Can combine multiple features
  • Weakness: Requires labeled training data

Multi-Layer Matching Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    INPUT STRINGS                            │
│  A: "Museum Joure"                                          │
│  B: "Stichting Museum Joure"                                │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 1: URL MATCHING (Exact)                              │
│  - Normalize URLs (remove protocol, www, trailing slash)    │
│  - Exact match → MATCH with confidence 1.0                  │
└─────────────────────────────────────────────────────────────┘
                            │ No match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 2: NAME NORMALIZATION                                │
│  - Remove legal entity prefixes (Stichting, B.V., etc.)     │
│  - Remove articles (de, het, the, etc.)                     │
│  - Lowercase, strip whitespace                              │
│  A': "museum joure"                                         │
│  B': "museum joure"                                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 3: EXACT NORMALIZED MATCH                            │
│  - If normalized names are identical → MATCH (0.98)         │
└─────────────────────────────────────────────────────────────┘
                            │ No exact match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 4: TOKEN-BASED MATCHING                              │
│  - Tokenize both strings                                    │
│  - Compute Soft Jaccard with Jaro-Winkler                   │
│  - OR Monge-Elkan similarity                                │
│  - Threshold: 0.80                                          │
└─────────────────────────────────────────────────────────────┘
                            │ Below threshold
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 5: CHARACTER-BASED FALLBACK                          │
│  - Jaro-Winkler on normalized names                         │
│  - Threshold: 0.85                                          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
                    MATCH / NO MATCH
The normalization in Layer 2 relies on the following prefix lists:

DUTCH_LEGAL_PREFIXES = [
    'stichting',           # Foundation
    'vereniging',          # Association
    'coöperatie',          # Cooperative
    'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',  # Public limited company
    'besloten vennootschap', 'b.v.', 'bv',   # Private limited company
    'vennootschap onder firma', 'v.o.f.', 'vof',
    'commanditaire vennootschap', 'c.v.', 'cv',
    'maatschap',
    'eenmanszaak',
]

INTERNATIONAL_PREFIXES = [
    'ltd', 'ltd.', 'limited',
    'inc', 'inc.', 'incorporated',
    'corp', 'corp.', 'corporation',
    'gmbh', 'g.m.b.h.',
    'ag', 'a.g.',
    's.a.', 'sa',
    'plc', 'p.l.c.',
]

def normalize_institution_name(name: str) -> str:
    """Remove legal entity prefixes and normalize."""
    name = name.lower().strip()
    
    # Remove legal prefixes
    for prefix in DUTCH_LEGAL_PREFIXES + INTERNATIONAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix) + 1:]
        if name.endswith(' ' + prefix):
            name = name[:-(len(prefix) + 1)]
    
    # Remove articles
    articles = ['de', 'het', 'een', 'the', 'a', 'an']
    tokens = name.split()
    tokens = [t for t in tokens if t not in articles]
    
    return ' '.join(tokens).strip()

Python Library Recommendations

1. RapidFuzz

Why: Fastest implementation, MIT licensed, actively maintained, drop-in replacement for fuzzywuzzy.

from rapidfuzz import fuzz, process
from rapidfuzz.distance import JaroWinkler

# Simple ratio (like SequenceMatcher)
fuzz.ratio("Drents Museum", "Stichting Drents Museum")  # 72

# Token-based (handles word reordering)
fuzz.token_sort_ratio("Drents Museum", "Stichting Drents Museum")  # 72
fuzz.token_set_ratio("Drents Museum", "Stichting Drents Museum")   # 100 ✓

# Partial ratio (substring matching)
fuzz.partial_ratio("Drents Museum", "Stichting Drents Museum")     # 100 ✓

# Jaro-Winkler
JaroWinkler.similarity("Drents Museum", "Stichting Drents Museum") # 0.83

Installation: pip install rapidfuzz

2. py_stringmatching

Why: Comprehensive library with all major algorithms, developed by UW-Madison data integration group.

import py_stringmatching as sm

# Tokenizers
ws_tok = sm.WhitespaceTokenizer()
qgram_tok = sm.QgramTokenizer(qval=3)

# Measures
jw = sm.JaroWinkler()
me = sm.MongeElkan()

# SoftTfIdf needs a corpus of tokenized names to derive IDF weights
corpus = [ws_tok.tokenize(n) for n in ["Drents Museum", "Stichting Drents Museum"]]
soft_tfidf = sm.SoftTfIdf(corpus, sim_func=jw.get_sim_score, threshold=0.8)

# Usage
tokens_a = ws_tok.tokenize("Drents Museum")
tokens_b = ws_tok.tokenize("Stichting Drents Museum")
me.get_sim_score(tokens_a, tokens_b)  # Uses Jaro-Winkler internally

Installation: pip install py_stringmatching

3. recordlinkage

Why: Full record linkage toolkit with blocking, comparison, classification.

import recordlinkage

# Create comparison object
compare = recordlinkage.Compare()
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.exact('city', 'city')

# Compare candidate pairs
features = compare.compute(candidate_links, df_a, df_b)

Installation: pip install recordlinkage


Algorithm Comparison for Our Use Cases

| Algorithm | "Drents Museum" vs "Stichting Drents Museum" | Handles Prefix | Speed | Recommended |
|-----------|----------------------------------------------|----------------|-------|-------------|
| SequenceMatcher | 0.72 | No | Fast | No |
| Levenshtein | 0.57 | No | Fast | No |
| Jaro-Winkler | 0.83 | No | Fast | Partial |
| Jaccard | 0.67 | No | Fast | No |
| Token Set Ratio | 1.00 | Yes ✓ | Fast | Yes |
| Partial Ratio | 1.00 | Yes ✓ | Fast | Yes |
| Monge-Elkan | 1.00 | Yes ✓ | Medium | Yes |
| Normalized + JW | 1.00 | Yes ✓ | Fast | Yes |

Proposed Implementation

Updated integrate_museumregister_nl.py

from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

DUTCH_LEGAL_PREFIXES = [
    'stichting', 'vereniging', 'coöperatie', 'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',
    'besloten vennootschap', 'b.v.', 'bv',
]

def normalize_name(name: str) -> str:
    """Normalize institution name by removing legal prefixes."""
    if not name:
        return ''
    name = name.lower().strip()
    for prefix in DUTCH_LEGAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix):].strip()
    return name

def multi_layer_similarity(name_a: str, name_b: str) -> tuple[float, str]:
    """
    Multi-layer similarity matching.
    Returns (score, method_used).
    """
    if not name_a or not name_b:
        return 0.0, 'empty'
    
    # Layer 1: Exact match (case-insensitive)
    if name_a.lower().strip() == name_b.lower().strip():
        return 1.0, 'exact'
    
    # Layer 2: Normalized exact match
    norm_a = normalize_name(name_a)
    norm_b = normalize_name(name_b)
    if norm_a == norm_b:
        return 0.98, 'normalized_exact'
    
    # Layer 3: Token set ratio (handles subset relationships)
    token_set_score = fuzz.token_set_ratio(norm_a, norm_b) / 100
    if token_set_score >= 0.95:
        return token_set_score, 'token_set'
    
    # Layer 4: Partial ratio (substring matching)
    partial_score = fuzz.partial_ratio(norm_a, norm_b) / 100
    if partial_score >= 0.90:
        return partial_score, 'partial'
    
    # Layer 5: Jaro-Winkler on normalized names
    jw_score = JaroWinkler.similarity(norm_a, norm_b)
    if jw_score >= 0.85:
        return jw_score, 'jaro_winkler'
    
    # Layer 6: Standard token sort ratio
    token_sort_score = fuzz.token_sort_ratio(norm_a, norm_b) / 100
    return token_sort_score, 'token_sort'


def match_museum_to_nde(museum: dict, nde_entries: list) -> tuple | None:
    """Find matching NDE entry for a museum."""
    museum_name = museum['name']
    # normalize_url (defined elsewhere in the script) strips protocol, www and trailing slash
    museum_url = normalize_url(museum.get('website_url', ''))
    
    best_match = None
    best_score = 0.0
    best_method = None
    
    for filepath, data in nde_entries:
        oe = data.get('original_entry', {})
        nde_name = oe.get('organisatie', '')
        nde_url = normalize_url(oe.get('webadres_organisatie', ''))
        
        # URL match (highest priority)
        if museum_url and nde_url and museum_url == nde_url:
            return (filepath, data, 'URL', 1.0)
        
        # Multi-layer name matching
        if nde_name:
            score, method = multi_layer_similarity(museum_name, nde_name)
            if score > best_score:
                best_score = score
                best_match = (filepath, data)
                best_method = method
    
    # Return if good enough match
    if best_match and best_score >= 0.80:
        return (best_match[0], best_match[1], best_method, best_score)
    
    return None

References

  1. Cohen, W.W., Ravikumar, P., Fienberg, S.E. (2003). A Comparison of String Distance Metrics for Name-Matching Tasks. IJCAI-03 Workshop on Information Integration on the Web.

  2. Zając, A.O. (2025). A Comparative Evaluation of Organization Name Matching: String-Based Similarity vs. Transformer Embeddings. Utrecht University Master Thesis.

  3. Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.

  4. RapidFuzz Documentation: https://rapidfuzz.github.io/RapidFuzz/

  5. py_stringmatching Tutorial: https://anhaidgroup.github.io/py_stringmatching/

  6. Python Record Linkage Toolkit: https://recordlinkage.readthedocs.io/


Appendix: Test Results

Before (SequenceMatcher only)

Matched: 532/536 (99.25%)
Failed:
  - Drents Museum (0.72)
  - Stedelijk Museum Alkmaar (0.83)
  - Museum Het Warenhuis (0.43)
  - Museum Joure (wrong match)

After (Multi-layer with RapidFuzz)

Expected: 536/536 (100%)
All legal prefix issues resolved
All subset relationships handled

Document created: 2025-12-02 Last updated: 2025-12-02