
Name Matching Algorithms for Heritage Institution Deduplication

Executive Summary

This document analyzes string similarity algorithms for matching heritage institution names. Our current implementation uses Python's difflib.SequenceMatcher, which failed to match 4 out of 536 museums (a 0.75% failure rate) due to legal entity prefixes such as "Stichting" (Dutch for foundation).

Recommendation: Implement a hybrid multi-layer approach combining:

  1. Name normalization (remove legal entity prefixes)
  2. Token-based matching (TF-IDF + Soft Jaccard)
  3. Character-based refinement (Jaro-Winkler)

Current Implementation

Algorithm: difflib.SequenceMatcher

File: scripts/integrate_museumregister_nl.py

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Calculate string similarity ratio."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

Threshold: 0.85 (85% similarity required for match)

Algorithm Details

SequenceMatcher uses the Ratcliff/Obershelp algorithm (also called Gestalt Pattern Matching):

  • Finds the longest contiguous matching subsequence
  • Recursively finds matches in the remaining parts
  • Returns ratio: 2.0 * M / T where M = matches, T = total characters
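The ratio above can be checked directly against the first failure case with nothing but the standard library:

```python
from difflib import SequenceMatcher

a, b = "drents museum", "stichting drents museum"
sm = SequenceMatcher(None, a, b)
# M = 13 matching characters, T = 13 + 23 = 36 total characters
# ratio = 2.0 * 13 / 36 ≈ 0.722, below the 0.85 threshold
print(round(sm.ratio(), 3))  # 0.722
```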

Failure Cases

| Museum Register Name | NDE Entry Name | Similarity | Status |
|----------------------|----------------|------------|--------|
| Drents Museum | Stichting Drents Museum | 0.722 | FAILED |
| Stedelijk Museum Alkmaar | Stichting Stedelijk Museum Alkmaar | 0.828 | FAILED |
| Museum Het Warenhuis | Het Warenhuis - Museum Het Land van Axel | 0.433 | FAILED |
| Museum Joure | Museum Joure | 1.000 | WRONG MATCH* |

*Museum Joure matched correctly by name but received enrichment data for a different institution (Museum MORE) because an earlier URL-based match took precedence.


Algorithm Taxonomy

1. Edit-Distance Based (Character-Level)

Levenshtein Distance

  • Metric: Minimum number of single-character edits (insertions, deletions, substitutions)
  • Complexity: O(n×m) time and space
  • Best for: Typos, OCR errors, minor spelling variations
  • Weakness: Sensitive to word order, poor for long strings with reordered tokens
# Example
levenshtein("Drents Museum", "Stichting Drents Museum")  # Distance: 10

Damerau-Levenshtein

  • Adds transposition as a valid edit operation
  • Better for keyboard typos (e.g., "teh" → "the")

Normalized Levenshtein

  • 1 - (levenshtein_distance / max(len(a), len(b)))
  • Returns similarity ratio 0-1
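A minimal sketch of both variants, using only the standard library (a production pipeline would use a C-backed library such as rapidfuzz):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Drents Museum", "Stichting Drents Museum"))  # 10
print(round(normalized_levenshtein("Drents Museum", "Stichting Drents Museum"), 2))  # 0.57
```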

2. Phonetic Algorithms

Soundex

  • Encodes words by sound (English-focused)
  • "Rijksmuseum" and "Ryksmuzeum" → same code
  • Weakness: Language-specific, poor for Dutch/German names
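A compact American Soundex sketch illustrates the "Rijksmuseum"/"Ryksmuzeum" claim (an illustration only, not part of the proposed pipeline):

```python
def soundex(name: str) -> str:
    """American Soundex: first letter + up to three consonant-group digits."""
    codes = {c: str(d) for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    name = name.upper()
    digits, prev = [], codes.get(name[0])
    for c in name[1:]:
        if c in "HW":        # H and W are skipped and do not reset the run
            continue
        d = codes.get(c)     # vowels and Y have no code and reset the run
        if d and d != prev:
            digits.append(d)
        prev = d
    return (name[0] + "".join(digits) + "000")[:4]

print(soundex("Rijksmuseum"))  # R252
print(soundex("Ryksmuzeum"))   # R252, identical code despite the misspelling
```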

Metaphone / Double Metaphone

  • More sophisticated phonetic encoding
  • Better for non-English names
  • Still limited for Dutch institutional names

3. Token-Based (Word-Level)

Jaccard Similarity

  • |A ∩ B| / |A ∪ B| (intersection over union of token sets)
  • Best for: Names with different word orders
  • Weakness: Exact token matching only
# Example
jaccard({"Drents", "Museum"}, {"Stichting", "Drents", "Museum"})
# = 2/3 = 0.667

Soft Jaccard (Generalized Jaccard)

  • Allows fuzzy token matching using secondary similarity measure
  • Tokens match if similarity > threshold
  • Best for: Token-level typos + word reordering
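A sketch of the idea, using difflib's SequenceMatcher as a stand-in for the Jaro-Winkler secondary measure:

```python
from difflib import SequenceMatcher

def soft_jaccard(a: set[str], b: set[str], threshold: float = 0.85) -> float:
    """Jaccard over tokens, where tokens 'match' if a secondary
    similarity measure clears a threshold."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    unmatched = set(b)
    matches = 0
    for t in a:
        best = max(unmatched, key=lambda u: sim(t, u), default=None)
        if best is not None and sim(t, best) >= threshold:
            matches += 1
            unmatched.discard(best)
    return matches / (len(a) + len(b) - matches)

# "drentse" (a token-level typo) still matches "drents", so the score
# equals plain Jaccard on the corrected tokens: 2/3 ≈ 0.667
print(soft_jaccard({"drentse", "museum"}, {"stichting", "drents", "museum"}))
```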

TF-IDF Cosine Similarity

  • Weights tokens by Term Frequency × Inverse Document Frequency
  • Common words (like "Museum") get lower weight
  • Best for: Large corpora with repeated terms
# "Stichting" appears in many Dutch institution names
# → Low IDF weight → Less impact on similarity
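The effect can be seen by computing raw IDF weights over a toy corpus (the corpus below is invented for illustration):

```python
import math

# Toy corpus of tokenized Dutch institution names (illustrative only)
corpus = [
    ["stichting", "drents", "museum"],
    ["stichting", "stedelijk", "museum", "alkmaar"],
    ["stichting", "museum", "joure"],
]

def idf(token: str) -> float:
    df = sum(token in doc for doc in corpus)  # document frequency
    return math.log(len(corpus) / df)

print(idf("stichting"))         # 0.0, appears everywhere, carries no weight
print(idf("museum"))            # 0.0
print(round(idf("drents"), 2))  # 1.1, rare token, dominates the similarity
```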

Overlap Coefficient

  • |A ∩ B| / min(|A|, |B|)
  • Good when one name is a subset of another
  • "Drents Museum" vs "Stichting Drents Museum" → 2/2 = 1.0
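The subset case above, as a one-line check:

```python
def overlap_coefficient(a: set[str], b: set[str]) -> float:
    return len(a & b) / min(len(a), len(b))

a = {"drents", "museum"}
b = {"stichting", "drents", "museum"}
print(overlap_coefficient(a, b))  # 1.0, since a is a subset of b
```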

4. Hybrid Methods

Jaro-Winkler

  • Character-based with prefix bonus
  • Designed for short strings (names)
  • Weights matching characters by proximity
  • Best for: Person names, short entity names
# Jaro-Winkler formula:
# jaro = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)
# winkler = jaro + (l * p * (1 - jaro))
# where l = common prefix length (max 4), p = scaling factor (0.1)
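The formula translates to a short reference implementation (for production, prefer a library such as rapidfuzz; the pair "MARTHA"/"MARHTA" is the classic textbook example):

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched = []
    for i, c in enumerate(s1):  # collect matches within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched.append(c)
                break
    m = len(matched)
    if m == 0:
        return 0.0
    s2_matched = [s2[j] for j in range(len(s2)) if used[j]]
    t = sum(a != b for a, b in zip(matched, s2_matched)) // 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    j = jaro(s1, s2)
    l = 0  # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```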

Monge-Elkan

  • Hybrid token + character approach
  • For each token in A, finds best-matching token in B
  • Uses secondary measure (Jaro-Winkler) for token comparison
  • Returns average of best matches
# Monge-Elkan("Drents Museum", "Stichting Drents Museum")
# Token "Drents" → best match "Drents" (1.0)
# Token "Museum" → best match "Museum" (1.0)
# Average = 1.0 ✓
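The worked example above can be sketched in a few lines, again with SequenceMatcher standing in for the Jaro-Winkler secondary measure:

```python
from difflib import SequenceMatcher

def monge_elkan(a_tokens: list[str], b_tokens: list[str]) -> float:
    """For each token in A, take its best-matching token in B; average the scores."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return sum(max(sim(t, u) for u in b_tokens) for t in a_tokens) / len(a_tokens)

print(monge_elkan(["drents", "museum"], ["stichting", "drents", "museum"]))  # 1.0
```

Note that the measure is asymmetric: swapping the arguments lowers the score, because "stichting" has no good counterpart among the tokens of the shorter name.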

Soft TF-IDF

  • Combines TF-IDF weighting with Jaro-Winkler token similarity
  • Best overall performer in Cohen et al. (2003) benchmark
  • Handles both token reordering and character-level variations

5. Semantic / ML-Based

Transformer Embeddings (BERT, Sentence-BERT)

  • Encode names as dense vectors
  • Cosine similarity in embedding space
  • Best for: Semantic variations, synonyms, translations
  • Weakness: Overkill for syntactic matching, requires training data

Learned Matchers (Magellan, DeepMatcher)

  • Train classifiers on labeled match/non-match pairs
  • Can combine multiple features
  • Weakness: Requires labeled training data

Multi-Layer Matching Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    INPUT STRINGS                            │
│  A: "Museum Joure"                                          │
│  B: "Stichting Museum Joure"                                │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 1: URL MATCHING (Exact)                              │
│  - Normalize URLs (remove protocol, www, trailing slash)    │
│  - Exact match → MATCH with confidence 1.0                  │
└─────────────────────────────────────────────────────────────┘
                            │ No match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 2: NAME NORMALIZATION                                │
│  - Remove legal entity prefixes (Stichting, B.V., etc.)     │
│  - Remove articles (de, het, the, etc.)                     │
│  - Lowercase, strip whitespace                              │
│  A': "museum joure"                                         │
│  B': "museum joure"                                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 3: EXACT NORMALIZED MATCH                            │
│  - If normalized names are identical → MATCH (0.98)         │
└─────────────────────────────────────────────────────────────┘
                            │ No exact match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 4: TOKEN-BASED MATCHING                              │
│  - Tokenize both strings                                    │
│  - Compute Soft Jaccard with Jaro-Winkler                   │
│  - OR Monge-Elkan similarity                                │
│  - Threshold: 0.80                                          │
└─────────────────────────────────────────────────────────────┘
                            │ Below threshold
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 5: CHARACTER-BASED FALLBACK                          │
│  - Jaro-Winkler on normalized names                         │
│  - Threshold: 0.85                                          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
                    MATCH / NO MATCH
The normalization in Layer 2 relies on the following prefix lists:

DUTCH_LEGAL_PREFIXES = [
    'stichting',           # Foundation
    'vereniging',          # Association
    'coöperatie',          # Cooperative
    'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',  # Public limited company
    'besloten vennootschap', 'b.v.', 'bv',   # Private limited company
    'vennootschap onder firma', 'v.o.f.', 'vof',
    'commanditaire vennootschap', 'c.v.', 'cv',
    'maatschap',
    'eenmanszaak',
]

INTERNATIONAL_PREFIXES = [
    'ltd', 'ltd.', 'limited',
    'inc', 'inc.', 'incorporated',
    'corp', 'corp.', 'corporation',
    'gmbh', 'g.m.b.h.',
    'ag', 'a.g.',
    's.a.', 'sa',
    'plc', 'p.l.c.',
]

def normalize_institution_name(name: str) -> str:
    """Remove legal entity prefixes and normalize."""
    name = name.lower().strip()
    
    # Remove legal prefixes
    for prefix in DUTCH_LEGAL_PREFIXES + INTERNATIONAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix) + 1:]
        if name.endswith(' ' + prefix):
            name = name[:-(len(prefix) + 1)]
    
    # Remove articles
    articles = ['de', 'het', 'een', 'the', 'a', 'an']
    tokens = name.split()
    tokens = [t for t in tokens if t not in articles]
    
    return ' '.join(tokens).strip()

Python Library Recommendations

1. RapidFuzz

Why: Fastest implementation, MIT licensed, actively maintained, drop-in replacement for fuzzywuzzy.

from rapidfuzz import fuzz, process
from rapidfuzz.distance import JaroWinkler

# Simple ratio (like SequenceMatcher)
fuzz.ratio("Drents Museum", "Stichting Drents Museum")  # 72

# Token-based (handles word reordering)
fuzz.token_sort_ratio("Drents Museum", "Stichting Drents Museum")  # 72
fuzz.token_set_ratio("Drents Museum", "Stichting Drents Museum")   # 100 ✓

# Partial ratio (substring matching)
fuzz.partial_ratio("Drents Museum", "Stichting Drents Museum")     # 100 ✓

# Jaro-Winkler
JaroWinkler.similarity("Drents Museum", "Stichting Drents Museum") # 0.83

Installation: pip install rapidfuzz

2. py_stringmatching

Why: Comprehensive library with all major algorithms, developed by UW-Madison data integration group.

import py_stringmatching as sm

# Tokenizers
ws_tok = sm.WhitespaceTokenizer()
qgram_tok = sm.QgramTokenizer(qval=3)

# Measures
jw = sm.JaroWinkler()
me = sm.MongeElkan()

# SoftTfIdf needs a corpus of tokenized names to derive IDF weights
corpus = [ws_tok.tokenize(n) for n in ["Drents Museum", "Stichting Drents Museum"]]
soft_tfidf = sm.SoftTfIdf(corpus, sim_func=jw.get_sim_score, threshold=0.8)

# Usage
tokens_a = ws_tok.tokenize("Drents Museum")
tokens_b = ws_tok.tokenize("Stichting Drents Museum")
me.get_sim_score(tokens_a, tokens_b)  # Uses Jaro-Winkler internally

Installation: pip install py_stringmatching

3. recordlinkage

Why: Full record linkage toolkit with blocking, comparison, classification.

import recordlinkage

# Create comparison object
compare = recordlinkage.Compare()
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.exact('city', 'city')

# Compare candidate pairs
features = compare.compute(candidate_links, df_a, df_b)

Installation: pip install recordlinkage


Algorithm Comparison for Our Use Cases

| Algorithm | "Drents Museum" vs "Stichting Drents Museum" | Handles Prefix | Speed | Recommended |
|-----------|----------------------------------------------|----------------|-------|-------------|
| SequenceMatcher | 0.72 | No | Fast | No |
| Levenshtein | 0.57 | No | Fast | No |
| Jaro-Winkler | 0.83 | No | Fast | Partial |
| Jaccard | 0.67 | No | Fast | No |
| Token Set Ratio | 1.00 | Yes ✓ | Fast | Yes |
| Partial Ratio | 1.00 | Yes ✓ | Fast | Yes |
| Monge-Elkan | 1.00 | Yes ✓ | Medium | Yes |
| Normalized + JW | 1.00 | Yes ✓ | Fast | Yes |

Proposed Implementation

Updated integrate_museumregister_nl.py

from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

DUTCH_LEGAL_PREFIXES = [
    'stichting', 'vereniging', 'coöperatie', 'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',
    'besloten vennootschap', 'b.v.', 'bv',
]

def normalize_name(name: str) -> str:
    """Normalize institution name by removing legal prefixes."""
    if not name:
        return ''
    name = name.lower().strip()
    for prefix in DUTCH_LEGAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix):].strip()
    return name

def multi_layer_similarity(name_a: str, name_b: str) -> tuple[float, str]:
    """
    Multi-layer similarity matching.
    Returns (score, method_used).
    """
    if not name_a or not name_b:
        return 0.0, 'empty'
    
    # Layer 1: Exact match (case-insensitive)
    if name_a.lower().strip() == name_b.lower().strip():
        return 1.0, 'exact'
    
    # Layer 2: Normalized exact match
    norm_a = normalize_name(name_a)
    norm_b = normalize_name(name_b)
    if norm_a == norm_b:
        return 0.98, 'normalized_exact'
    
    # Layer 3: Token set ratio (handles subset relationships)
    token_set_score = fuzz.token_set_ratio(norm_a, norm_b) / 100
    if token_set_score >= 0.95:
        return token_set_score, 'token_set'
    
    # Layer 4: Partial ratio (substring matching)
    partial_score = fuzz.partial_ratio(norm_a, norm_b) / 100
    if partial_score >= 0.90:
        return partial_score, 'partial'
    
    # Layer 5: Jaro-Winkler on normalized names
    jw_score = JaroWinkler.similarity(norm_a, norm_b)
    if jw_score >= 0.85:
        return jw_score, 'jaro_winkler'
    
    # Layer 6: Standard token sort ratio
    token_sort_score = fuzz.token_sort_ratio(norm_a, norm_b) / 100
    return token_sort_score, 'token_sort'


def match_museum_to_nde(museum: dict, nde_entries: list) -> tuple | None:
    """Find matching NDE entry for a museum."""
    museum_name = museum['name']
    # normalize_url (defined elsewhere in the script) strips protocol, www and trailing slash
    museum_url = normalize_url(museum.get('website_url', ''))
    
    best_match = None
    best_score = 0.0
    best_method = None
    
    for filepath, data in nde_entries:
        oe = data.get('original_entry', {})
        nde_name = oe.get('organisatie', '')
        nde_url = normalize_url(oe.get('webadres_organisatie', ''))
        
        # URL match (highest priority)
        if museum_url and nde_url and museum_url == nde_url:
            return (filepath, data, 'URL', 1.0)
        
        # Multi-layer name matching
        if nde_name:
            score, method = multi_layer_similarity(museum_name, nde_name)
            if score > best_score:
                best_score = score
                best_match = (filepath, data)
                best_method = method
    
    # Return if good enough match
    if best_match and best_score >= 0.80:
        return (best_match[0], best_match[1], best_method, best_score)
    
    return None

References

  1. Cohen, W.W., Ravikumar, P., Fienberg, S.E. (2003). A Comparison of String Distance Metrics for Name-Matching Tasks. IJCAI-03 Workshop on Information Integration on the Web.

  2. Zając, A.O. (2025). A Comparative Evaluation of Organization Name Matching: String-Based Similarity vs. Transformer Embeddings. Utrecht University Master Thesis.

  3. Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.

  4. RapidFuzz Documentation: https://rapidfuzz.github.io/RapidFuzz/

  5. py_stringmatching Tutorial: https://anhaidgroup.github.io/py_stringmatching/

  6. Python Record Linkage Toolkit: https://recordlinkage.readthedocs.io/


Appendix: Test Results

Before (SequenceMatcher only)

Matched: 532/536 (99.25%)
Failed:
  - Drents Museum (0.72)
  - Stedelijk Museum Alkmaar (0.83)
  - Museum Het Warenhuis (0.43)
  - Museum Joure (wrong match)

After (Multi-layer with RapidFuzz)

Expected: 536/536 (100%)
All legal prefix issues resolved
All subset relationships handled

Document created: 2025-12-02 Last updated: 2025-12-02