489 lines
17 KiB
Markdown
489 lines
17 KiB
Markdown
# Name Matching Algorithms for Heritage Institution Deduplication
|
||
|
||
## Executive Summary
|
||
|
||
This document analyzes string similarity algorithms for matching heritage institution names. Our current implementation uses Python's `difflib.SequenceMatcher` which failed to match 4 out of 536 museums (0.75% failure rate) due to legal entity prefixes like "Stichting" (Dutch foundation).
|
||
|
||
**Recommendation**: Implement a **hybrid multi-layer approach** combining:
|
||
1. **Name normalization** (remove legal entity prefixes)
|
||
2. **Token-based matching** (TF-IDF + Soft Jaccard)
|
||
3. **Character-based refinement** (Jaro-Winkler)
|
||
|
||
---
|
||
|
||
## Current Implementation
|
||
|
||
### Algorithm: `difflib.SequenceMatcher`
|
||
|
||
**File**: `scripts/integrate_museumregister_nl.py`
|
||
|
||
```python
|
||
from difflib import SequenceMatcher
|
||
|
||
def similarity(a: str, b: str) -> float:
|
||
"""Calculate string similarity ratio."""
|
||
return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
|
||
```
|
||
|
||
**Threshold**: 0.85 (85% similarity required for match)
|
||
|
||
### Algorithm Details
|
||
|
||
`SequenceMatcher` uses the **Ratcliff/Obershelp** algorithm (also called Gestalt Pattern Matching):
|
||
- Finds the longest contiguous matching subsequence
|
||
- Recursively finds matches in the remaining parts
|
||
- Returns ratio: `2.0 * M / T` where M = matches, T = total characters
|
||
|
||
### Failure Cases
|
||
|
||
| Museum Register Name | NDE Entry Name | Similarity | Status |
|
||
|---------------------|----------------|------------|--------|
|
||
| Drents Museum | Stichting Drents Museum | 0.722 | FAILED |
|
||
| Stedelijk Museum Alkmaar | Stichting Stedelijk Museum Alkmaar | 0.828 | FAILED |
|
||
| Museum Het Warenhuis | Het Warenhuis - Museum Het Land van Axel | 0.433 | FAILED |
|
||
| Museum Joure | Museum Joure | 1.000 | WRONG MATCH* |
|
||
|
||
*Museum Joure matched correctly by name but got wrong enrichment data (Museum MORE) due to earlier URL-based match.
|
||
|
||
---
|
||
|
||
## Algorithm Taxonomy
|
||
|
||
### 1. Edit-Distance Based (Character-Level)
|
||
|
||
#### Levenshtein Distance
|
||
- **Metric**: Minimum number of single-character edits (insertions, deletions, substitutions)
|
||
- **Complexity**: O(n×m) time and space
|
||
- **Best for**: Typos, OCR errors, minor spelling variations
|
||
- **Weakness**: Sensitive to word order, poor for long strings with reordered tokens
|
||
|
||
```python
|
||
# Example
|
||
levenshtein("Drents Museum", "Stichting Drents Museum") # Distance: 10
|
||
```
|
||
|
||
#### Damerau-Levenshtein
|
||
- Adds **transposition** as a valid edit operation
|
||
- Better for keyboard typos (e.g., "teh" → "the")
|
||
|
||
#### Normalized Levenshtein
|
||
- `1 - (levenshtein_distance / max(len(a), len(b)))`
|
||
- Returns similarity ratio 0-1
|
||
|
||
### 2. Phonetic Algorithms
|
||
|
||
#### Soundex
|
||
- Encodes words by sound (English-focused)
|
||
- "Rijksmuseum" and "Ryksmuzeum" → same code
|
||
- **Weakness**: Language-specific, poor for Dutch/German names
|
||
|
||
#### Metaphone / Double Metaphone
|
||
- More sophisticated phonetic encoding
|
||
- Better for non-English names
|
||
- Still limited for Dutch institutional names
|
||
|
||
### 3. Token-Based (Word-Level)
|
||
|
||
#### Jaccard Similarity
|
||
- `|A ∩ B| / |A ∪ B|` (intersection over union of token sets)
|
||
- **Best for**: Names with different word orders
|
||
- **Weakness**: Exact token matching only
|
||
|
||
```python
|
||
# Example
|
||
jaccard({"Drents", "Museum"}, {"Stichting", "Drents", "Museum"})
|
||
# = 2/3 = 0.667
|
||
```
|
||
|
||
#### Soft Jaccard (Generalized Jaccard)
|
||
- Allows fuzzy token matching using secondary similarity measure
|
||
- Tokens match if similarity > threshold
|
||
- **Best for**: Token-level typos + word reordering
|
||
|
||
#### TF-IDF Cosine Similarity
|
||
- Weights tokens by Term Frequency × Inverse Document Frequency
|
||
- Common words (like "Museum") get lower weight
|
||
- **Best for**: Large corpora with repeated terms
|
||
|
||
```python
|
||
# "Stichting" appears in many Dutch institution names
|
||
# → Low IDF weight → Less impact on similarity
|
||
```
|
||
|
||
#### Overlap Coefficient
|
||
- `|A ∩ B| / min(|A|, |B|)`
|
||
- Good when one name is a subset of another
|
||
- "Drents Museum" vs "Stichting Drents Museum" → 2/2 = 1.0
|
||
|
||
### 4. Hybrid Methods
|
||
|
||
#### Jaro-Winkler
|
||
- Character-based with prefix bonus
|
||
- Designed for short strings (names)
|
||
- Weights matching characters by proximity
|
||
- **Best for**: Person names, short entity names
|
||
|
||
```python
|
||
# Jaro-Winkler formula:
|
||
# jaro = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)
|
||
# winkler = jaro + (l * p * (1 - jaro))
|
||
# where l = common prefix length (max 4), p = scaling factor (0.1)
|
||
```
|
||
|
||
#### Monge-Elkan
|
||
- **Hybrid token + character approach**
|
||
- For each token in A, finds best-matching token in B
|
||
- Uses secondary measure (Jaro-Winkler) for token comparison
|
||
- Returns average of best matches
|
||
|
||
```python
|
||
# Monge-Elkan("Drents Museum", "Stichting Drents Museum")
|
||
# Token "Drents" → best match "Drents" (1.0)
|
||
# Token "Museum" → best match "Museum" (1.0)
|
||
# Average = 1.0 ✓
|
||
```
|
||
|
||
#### Soft TF-IDF
|
||
- Combines TF-IDF weighting with Jaro-Winkler token similarity
|
||
- **Best overall performer** in Cohen et al. (2003) benchmark
|
||
- Handles both token reordering and character-level variations
|
||
|
||
### 5. Semantic / ML-Based
|
||
|
||
#### Transformer Embeddings (BERT, Sentence-BERT)
|
||
- Encode names as dense vectors
|
||
- Cosine similarity in embedding space
|
||
- **Best for**: Semantic variations, synonyms, translations
|
||
- **Weakness**: Overkill for syntactic matching, requires training data
|
||
|
||
#### Learned Matchers (Magellan, DeepMatcher)
|
||
- Train classifiers on labeled match/non-match pairs
|
||
- Can combine multiple features
|
||
- **Weakness**: Requires labeled training data
|
||
|
||
---
|
||
|
||
## Recommended Solution for Heritage Institutions
|
||
|
||
### Multi-Layer Matching Pipeline
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ INPUT STRINGS │
|
||
│ A: "Museum Joure" │
|
||
│ B: "Stichting Museum Joure" │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│
|
||
▼
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ LAYER 1: URL MATCHING (Exact) │
|
||
│ - Normalize URLs (remove protocol, www, trailing slash) │
|
||
│ - Exact match → MATCH with confidence 1.0 │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│ No match
|
||
▼
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ LAYER 2: NAME NORMALIZATION │
|
||
│ - Remove legal entity prefixes (Stichting, B.V., etc.) │
|
||
│ - Remove articles (de, het, the, etc.) │
|
||
│ - Lowercase, strip whitespace │
|
||
│ A': "museum joure" │
|
||
│ B': "museum joure" │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│
|
||
▼
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ LAYER 3: EXACT NORMALIZED MATCH │
|
||
│ - If normalized names are identical → MATCH (0.98) │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│ No exact match
|
||
▼
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ LAYER 4: TOKEN-BASED MATCHING │
|
||
│ - Tokenize both strings │
|
||
│ - Compute Soft Jaccard with Jaro-Winkler │
|
||
│ - OR Monge-Elkan similarity │
|
||
│ - Threshold: 0.80 │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│ Below threshold
|
||
▼
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ LAYER 5: CHARACTER-BASED FALLBACK │
|
||
│ - Jaro-Winkler on normalized names │
|
||
│ - Threshold: 0.85 │
|
||
└─────────────────────────────────────────────────────────────┘
|
||
│
|
||
▼
|
||
MATCH / NO MATCH
|
||
```
|
||
|
||
### Legal Entity Prefix Removal
|
||
|
||
```python
|
||
DUTCH_LEGAL_PREFIXES = [
|
||
'stichting', # Foundation
|
||
'vereniging', # Association
|
||
'coöperatie', # Cooperative
|
||
'coöperatieve',
|
||
'naamloze vennootschap', 'n.v.', 'nv', # Public limited company
|
||
'besloten vennootschap', 'b.v.', 'bv', # Private limited company
|
||
'vennootschap onder firma', 'v.o.f.', 'vof',
|
||
'commanditaire vennootschap', 'c.v.', 'cv',
|
||
'maatschap',
|
||
'eenmanszaak',
|
||
]
|
||
|
||
INTERNATIONAL_PREFIXES = [
|
||
'ltd', 'ltd.', 'limited',
|
||
'inc', 'inc.', 'incorporated',
|
||
'corp', 'corp.', 'corporation',
|
||
'gmbh', 'g.m.b.h.',
|
||
'ag', 'a.g.',
|
||
's.a.', 'sa',
|
||
'plc', 'p.l.c.',
|
||
]
|
||
|
||
def normalize_institution_name(name: str) -> str:
|
||
"""Remove legal entity prefixes and normalize."""
|
||
name = name.lower().strip()
|
||
|
||
# Remove legal prefixes
|
||
for prefix in DUTCH_LEGAL_PREFIXES + INTERNATIONAL_PREFIXES:
|
||
if name.startswith(prefix + ' '):
|
||
name = name[len(prefix) + 1:]
|
||
if name.endswith(' ' + prefix):
|
||
name = name[:-(len(prefix) + 1)]
|
||
|
||
# Remove articles
|
||
articles = ['de', 'het', 'een', 'the', 'a', 'an']
|
||
tokens = name.split()
|
||
tokens = [t for t in tokens if t not in articles]
|
||
|
||
return ' '.join(tokens).strip()
|
||
```
|
||
|
||
---
|
||
|
||
## Python Library Recommendations
|
||
|
||
### 1. RapidFuzz (Recommended)
|
||
|
||
**Why**: Fastest implementation, MIT licensed, actively maintained, drop-in replacement for fuzzywuzzy.
|
||
|
||
```python
|
||
from rapidfuzz import fuzz, process
|
||
from rapidfuzz.distance import JaroWinkler
|
||
|
||
# Simple ratio (like SequenceMatcher)
|
||
fuzz.ratio("Drents Museum", "Stichting Drents Museum") # 72
|
||
|
||
# Token-based (handles word reordering)
|
||
fuzz.token_sort_ratio("Drents Museum", "Stichting Drents Museum") # 72
|
||
fuzz.token_set_ratio("Drents Museum", "Stichting Drents Museum") # 100 ✓
|
||
|
||
# Partial ratio (substring matching)
|
||
fuzz.partial_ratio("Drents Museum", "Stichting Drents Museum") # 100 ✓
|
||
|
||
# Jaro-Winkler
|
||
JaroWinkler.similarity("Drents Museum", "Stichting Drents Museum") # 0.83
|
||
```
|
||
|
||
**Installation**: `pip install rapidfuzz`
|
||
|
||
### 2. py_stringmatching
|
||
|
||
**Why**: Comprehensive library with all major algorithms, developed by UW-Madison data integration group.
|
||
|
||
```python
|
||
import py_stringmatching as sm
|
||
|
||
# Tokenizers
|
||
ws_tok = sm.WhitespaceTokenizer()
|
||
qgram_tok = sm.QgramTokenizer(qval=3)
|
||
|
||
# Measures
|
||
jw = sm.JaroWinkler()
|
||
me = sm.MongeElkan()
|
||
soft_tfidf = sm.SoftTfIdf(corpus, sim_func=jw.get_sim_score)
|
||
|
||
# Usage
|
||
tokens_a = ws_tok.tokenize("Drents Museum")
|
||
tokens_b = ws_tok.tokenize("Stichting Drents Museum")
|
||
me.get_sim_score(tokens_a, tokens_b) # Uses Jaro-Winkler internally
|
||
```
|
||
|
||
**Installation**: `pip install py_stringmatching`
|
||
|
||
### 3. recordlinkage
|
||
|
||
**Why**: Full record linkage toolkit with blocking, comparison, classification.
|
||
|
||
```python
|
||
import recordlinkage
|
||
|
||
# Create comparison object
|
||
compare = recordlinkage.Compare()
|
||
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
|
||
compare.exact('city', 'city')
|
||
|
||
# Compare candidate pairs
|
||
features = compare.compute(candidate_links, df_a, df_b)
|
||
```
|
||
|
||
**Installation**: `pip install recordlinkage`
|
||
|
||
---
|
||
|
||
## Algorithm Comparison for Our Use Cases
|
||
|
||
| Algorithm | "Drents Museum" vs "Stichting Drents Museum" | Handles Prefix | Speed | Recommended |
|
||
|-----------|----------------------------------------------|----------------|-------|-------------|
|
||
| SequenceMatcher | 0.72 ❌ | No | Fast | No |
|
||
| Levenshtein | 0.57 ❌ | No | Fast | No |
|
||
| Jaro-Winkler | 0.83 ❌ | No | Fast | Partial |
|
||
| Jaccard | 0.67 ❌ | No | Fast | No |
|
||
| **Token Set Ratio** | **1.00 ✓** | **Yes** | Fast | **Yes** |
|
||
| **Partial Ratio** | **1.00 ✓** | **Yes** | Fast | **Yes** |
|
||
| **Monge-Elkan** | **1.00 ✓** | **Yes** | Medium | **Yes** |
|
||
| **Normalized + JW** | **1.00 ✓** | **Yes** | Fast | **Yes** |
|
||
|
||
---
|
||
|
||
## Proposed Implementation
|
||
|
||
### Updated `integrate_museumregister_nl.py`
|
||
|
||
```python
|
||
from rapidfuzz import fuzz
|
||
from rapidfuzz.distance import JaroWinkler
|
||
|
||
DUTCH_LEGAL_PREFIXES = [
|
||
'stichting', 'vereniging', 'coöperatie', 'coöperatieve',
|
||
'naamloze vennootschap', 'n.v.', 'nv',
|
||
'besloten vennootschap', 'b.v.', 'bv',
|
||
]
|
||
|
||
def normalize_name(name: str) -> str:
|
||
"""Normalize institution name by removing legal prefixes."""
|
||
if not name:
|
||
return ''
|
||
name = name.lower().strip()
|
||
for prefix in DUTCH_LEGAL_PREFIXES:
|
||
if name.startswith(prefix + ' '):
|
||
name = name[len(prefix):].strip()
|
||
return name
|
||
|
||
def multi_layer_similarity(name_a: str, name_b: str) -> tuple[float, str]:
|
||
"""
|
||
Multi-layer similarity matching.
|
||
Returns (score, method_used).
|
||
"""
|
||
if not name_a or not name_b:
|
||
return 0.0, 'empty'
|
||
|
||
# Layer 1: Exact match (case-insensitive)
|
||
if name_a.lower().strip() == name_b.lower().strip():
|
||
return 1.0, 'exact'
|
||
|
||
# Layer 2: Normalized exact match
|
||
norm_a = normalize_name(name_a)
|
||
norm_b = normalize_name(name_b)
|
||
if norm_a == norm_b:
|
||
return 0.98, 'normalized_exact'
|
||
|
||
# Layer 3: Token set ratio (handles subset relationships)
|
||
token_set_score = fuzz.token_set_ratio(norm_a, norm_b) / 100
|
||
if token_set_score >= 0.95:
|
||
return token_set_score, 'token_set'
|
||
|
||
# Layer 4: Partial ratio (substring matching)
|
||
partial_score = fuzz.partial_ratio(norm_a, norm_b) / 100
|
||
if partial_score >= 0.90:
|
||
return partial_score, 'partial'
|
||
|
||
# Layer 5: Jaro-Winkler on normalized names
|
||
jw_score = JaroWinkler.similarity(norm_a, norm_b)
|
||
if jw_score >= 0.85:
|
||
return jw_score, 'jaro_winkler'
|
||
|
||
# Layer 6: Standard token sort ratio
|
||
token_sort_score = fuzz.token_sort_ratio(norm_a, norm_b) / 100
|
||
return token_sort_score, 'token_sort'
|
||
|
||
|
||
def match_museum_to_nde(museum: dict, nde_entries: list) -> tuple | None:
|
||
"""Find matching NDE entry for a museum."""
|
||
museum_name = museum['name']
|
||
museum_url = normalize_url(museum.get('website_url', ''))
|
||
|
||
best_match = None
|
||
best_score = 0.0
|
||
best_method = None
|
||
|
||
for filepath, data in nde_entries:
|
||
oe = data.get('original_entry', {})
|
||
nde_name = oe.get('organisatie', '')
|
||
nde_url = normalize_url(oe.get('webadres_organisatie', ''))
|
||
|
||
# URL match (highest priority)
|
||
if museum_url and nde_url and museum_url == nde_url:
|
||
return (filepath, data, 'URL', 1.0)
|
||
|
||
# Multi-layer name matching
|
||
if nde_name:
|
||
score, method = multi_layer_similarity(museum_name, nde_name)
|
||
if score > best_score:
|
||
best_score = score
|
||
best_match = (filepath, data)
|
||
best_method = method
|
||
|
||
# Return if good enough match
|
||
if best_match and best_score >= 0.80:
|
||
return (best_match[0], best_match[1], best_method, best_score)
|
||
|
||
return None
|
||
```
|
||
|
||
---
|
||
|
||
## References
|
||
|
||
1. Cohen, W.W., Ravikumar, P., Fienberg, S.E. (2003). **A Comparison of String Distance Metrics for Name-Matching Tasks**. IJCAI-03 Workshop on Information Integration on the Web. [PDF](https://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf)
|
||
|
||
2. Zając, A.O. (2025). **A Comparative Evaluation of Organization Name Matching: String-Based Similarity vs. Transformer Embeddings**. Utrecht University Master Thesis.
|
||
|
||
3. Christen, P. (2012). **Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection**. Springer.
|
||
|
||
4. RapidFuzz Documentation: https://rapidfuzz.github.io/RapidFuzz/
|
||
|
||
5. py_stringmatching Tutorial: https://anhaidgroup.github.io/py_stringmatching/
|
||
|
||
6. Python Record Linkage Toolkit: https://recordlinkage.readthedocs.io/
|
||
|
||
---
|
||
|
||
## Appendix: Test Results
|
||
|
||
### Before (SequenceMatcher only)
|
||
|
||
```
|
||
Matched: 532/536 (99.25%)
|
||
Failed:
|
||
- Drents Museum (0.72)
|
||
- Stedelijk Museum Alkmaar (0.83)
|
||
- Museum Het Warenhuis (0.43)
|
||
- Museum Joure (wrong match)
|
||
```
|
||
|
||
### After (Multi-layer with RapidFuzz)
|
||
|
||
```
|
||
Expected: 536/536 (100%)
|
||
All legal prefix issues resolved
|
||
All subset relationships handled
|
||
```
|
||
|
||
---
|
||
|
||
*Document created: 2025-12-02*
|
||
*Last updated: 2025-12-02*
|