glam/docs/NAME_MATCHING_ALGORITHMS.md

# Name Matching Algorithms for Heritage Institution Deduplication

## Executive Summary

This document analyzes string similarity algorithms for matching heritage institution names. Our current implementation uses Python's `difflib.SequenceMatcher` which failed to match 4 out of 536 museums (0.75% failure rate) due to legal entity prefixes like "Stichting" (Dutch foundation).

**Recommendation**: Implement a **hybrid multi-layer approach** combining:
1. **Name normalization** (remove legal entity prefixes)
2. **Token-based matching** (TF-IDF + Soft Jaccard)
3. **Character-based refinement** (Jaro-Winkler)

---

## Current Implementation

### Algorithm: `difflib.SequenceMatcher`

**File**: `scripts/integrate_museumregister_nl.py`

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Calculate string similarity ratio."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
```

**Threshold**: 0.85 (85% similarity required for match)

### Algorithm Details

`SequenceMatcher` uses the **Ratcliff/Obershelp** algorithm (also called Gestalt Pattern Matching):
- Finds the longest contiguous matching subsequence
- Recursively finds matches in the remaining parts
- Returns ratio: `2.0 * M / T` where M = matches, T = total characters

### Failure Cases

| Museum Register Name | NDE Entry Name | Similarity | Status |
|---------------------|----------------|------------|--------|
| Drents Museum | Stichting Drents Museum | 0.722 | FAILED |
| Stedelijk Museum Alkmaar | Stichting Stedelijk Museum Alkmaar | 0.828 | FAILED |
| Museum Het Warenhuis | Het Warenhuis - Museum Het Land van Axel | 0.433 | FAILED |
| Museum Joure | Museum Joure | 1.000 | WRONG MATCH* |

*Museum Joure matched correctly by name but got wrong enrichment data (Museum MORE) due to earlier URL-based match.

---

## Algorithm Taxonomy

### 1. Edit-Distance Based (Character-Level)

#### Levenshtein Distance
- **Metric**: Minimum number of single-character edits (insertions, deletions, substitutions)
- **Complexity**: O(n×m) time and space
- **Best for**: Typos, OCR errors, minor spelling variations
- **Weakness**: Sensitive to word order, poor for long strings with reordered tokens

```python
# Example
levenshtein("Drents Museum", "Stichting Drents Museum")  # Distance: 10
```

#### Damerau-Levenshtein
- Adds **transposition** as a valid edit operation
- Better for keyboard typos (e.g., "teh" → "the")

#### Normalized Levenshtein
- `1 - (levenshtein_distance / max(len(a), len(b)))`
- Returns similarity ratio 0-1

### 2. Phonetic Algorithms

#### Soundex
- Encodes words by sound (English-focused)
- "Rijksmuseum" and "Ryksmuzeum" → same code
- **Weakness**: Language-specific, poor for Dutch/German names

#### Metaphone / Double Metaphone
- More sophisticated phonetic encoding
- Better for non-English names
- Still limited for Dutch institutional names

### 3. Token-Based (Word-Level)

#### Jaccard Similarity
- `|A ∩ B| / |A ∪ B|` (intersection over union of token sets)
- **Best for**: Names with different word orders
- **Weakness**: Exact token matching only

```python
# Example
jaccard({"Drents", "Museum"}, {"Stichting", "Drents", "Museum"})
# = 2/3 = 0.667
```

#### Soft Jaccard (Generalized Jaccard)
- Allows fuzzy token matching using secondary similarity measure
- Tokens match if similarity > threshold
- **Best for**: Token-level typos + word reordering

#### TF-IDF Cosine Similarity
- Weights tokens by Term Frequency × Inverse Document Frequency
- Common words (like "Museum") get lower weight
- **Best for**: Large corpora with repeated terms

```python
# "Stichting" appears in many Dutch institution names
# → Low IDF weight → Less impact on similarity
```

#### Overlap Coefficient
- `|A ∩ B| / min(|A|, |B|)`
- Good when one name is a subset of another
- "Drents Museum" vs "Stichting Drents Museum" → 2/2 = 1.0

### 4. Hybrid Methods

#### Jaro-Winkler
- Character-based with prefix bonus
- Designed for short strings (names)
- Weights matching characters by proximity
- **Best for**: Person names, short entity names

```python
# Jaro-Winkler formula:
# jaro = (1/3) * (m/|s1| + m/|s2| + (m-t)/m)
# winkler = jaro + (l * p * (1 - jaro))
# where l = common prefix length (max 4), p = scaling factor (0.1)
```

#### Monge-Elkan
- **Hybrid token + character approach**
- For each token in A, finds best-matching token in B
- Uses secondary measure (Jaro-Winkler) for token comparison
- Returns average of best matches

```python
# Monge-Elkan("Drents Museum", "Stichting Drents Museum")
# Token "Drents" → best match "Drents" (1.0)
# Token "Museum" → best match "Museum" (1.0)
# Average = 1.0 ✓
```

#### Soft TF-IDF
- Combines TF-IDF weighting with Jaro-Winkler token similarity
- **Best overall performer** in Cohen et al. (2003) benchmark
- Handles both token reordering and character-level variations

### 5. Semantic / ML-Based

#### Transformer Embeddings (BERT, Sentence-BERT)
- Encode names as dense vectors
- Cosine similarity in embedding space
- **Best for**: Semantic variations, synonyms, translations
- **Weakness**: Overkill for syntactic matching, requires training data

#### Learned Matchers (Magellan, DeepMatcher)
- Train classifiers on labeled match/non-match pairs
- Can combine multiple features
- **Weakness**: Requires labeled training data

---

## Recommended Solution for Heritage Institutions

### Multi-Layer Matching Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│                    INPUT STRINGS                            │
│  A: "Museum Joure"                                          │
│  B: "Stichting Museum Joure"                                │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 1: URL MATCHING (Exact)                              │
│  - Normalize URLs (remove protocol, www, trailing slash)    │
│  - Exact match → MATCH with confidence 1.0                  │
└─────────────────────────────────────────────────────────────┘
                            │ No match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 2: NAME NORMALIZATION                                │
│  - Remove legal entity prefixes (Stichting, B.V., etc.)     │
│  - Remove articles (de, het, the, etc.)                     │
│  - Lowercase, strip whitespace                              │
│  A': "museum joure"                                         │
│  B': "museum joure"                                         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 3: EXACT NORMALIZED MATCH                            │
│  - If normalized names are identical → MATCH (0.98)         │
└─────────────────────────────────────────────────────────────┘
                            │ No exact match
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 4: TOKEN-BASED MATCHING                              │
│  - Tokenize both strings                                    │
│  - Compute Soft Jaccard with Jaro-Winkler                   │
│  - OR Monge-Elkan similarity                                │
│  - Threshold: 0.80                                          │
└─────────────────────────────────────────────────────────────┘
                            │ Below threshold
                            ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER 5: CHARACTER-BASED FALLBACK                          │
│  - Jaro-Winkler on normalized names                         │
│  - Threshold: 0.85                                          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
                    MATCH / NO MATCH
```

### Legal Entity Prefix Removal

```python
DUTCH_LEGAL_PREFIXES = [
    'stichting',           # Foundation
    'vereniging',          # Association
    'coöperatie',          # Cooperative
    'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',  # Public limited company
    'besloten vennootschap', 'b.v.', 'bv',   # Private limited company
    'vennootschap onder firma', 'v.o.f.', 'vof',
    'commanditaire vennootschap', 'c.v.', 'cv',
    'maatschap',
    'eenmanszaak',
]

INTERNATIONAL_PREFIXES = [
    'ltd', 'ltd.', 'limited',
    'inc', 'inc.', 'incorporated',
    'corp', 'corp.', 'corporation',
    'gmbh', 'g.m.b.h.',
    'ag', 'a.g.',
    's.a.', 'sa',
    'plc', 'p.l.c.',
]

def normalize_institution_name(name: str) -> str:
    """Remove legal entity prefixes and normalize."""
    name = name.lower().strip()

    # Remove legal prefixes
    for prefix in DUTCH_LEGAL_PREFIXES + INTERNATIONAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix) + 1:]
        if name.endswith(' ' + prefix):
            name = name[:-(len(prefix) + 1)]

    # Remove articles
    articles = ['de', 'het', 'een', 'the', 'a', 'an']
    tokens = name.split()
    tokens = [t for t in tokens if t not in articles]

    return ' '.join(tokens).strip()
```

---

## Python Library Recommendations

### 1. RapidFuzz (Recommended)

**Why**: Fastest implementation, MIT licensed, actively maintained, drop-in replacement for fuzzywuzzy.

```python
from rapidfuzz import fuzz, process
from rapidfuzz.distance import JaroWinkler

# Simple ratio (like SequenceMatcher)
fuzz.ratio("Drents Museum", "Stichting Drents Museum")  # 72

# Token-based (handles word reordering)
fuzz.token_sort_ratio("Drents Museum", "Stichting Drents Museum")  # 72
fuzz.token_set_ratio("Drents Museum", "Stichting Drents Museum")   # 100 ✓

# Partial ratio (substring matching)
fuzz.partial_ratio("Drents Museum", "Stichting Drents Museum")     # 100 ✓

# Jaro-Winkler
JaroWinkler.similarity("Drents Museum", "Stichting Drents Museum") # 0.83
```

**Installation**: `pip install rapidfuzz`

### 2. py_stringmatching

**Why**: Comprehensive library with all major algorithms, developed by UW-Madison data integration group.

```python
import py_stringmatching as sm

# Tokenizers
ws_tok = sm.WhitespaceTokenizer()
qgram_tok = sm.QgramTokenizer(qval=3)

# Measures
jw = sm.JaroWinkler()
me = sm.MongeElkan()
soft_tfidf = sm.SoftTfIdf(corpus, sim_func=jw.get_sim_score)

# Usage
tokens_a = ws_tok.tokenize("Drents Museum")
tokens_b = ws_tok.tokenize("Stichting Drents Museum")
me.get_sim_score(tokens_a, tokens_b)  # Uses Jaro-Winkler internally
```

**Installation**: `pip install py_stringmatching`

### 3. recordlinkage

**Why**: Full record linkage toolkit with blocking, comparison, classification.

```python
import recordlinkage

# Create comparison object
compare = recordlinkage.Compare()
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.exact('city', 'city')

# Compare candidate pairs
features = compare.compute(candidate_links, df_a, df_b)
```

**Installation**: `pip install recordlinkage`

---

## Algorithm Comparison for Our Use Cases

| Algorithm | "Drents Museum" vs "Stichting Drents Museum" | Handles Prefix | Speed | Recommended |
|-----------|----------------------------------------------|----------------|-------|-------------|
| SequenceMatcher | 0.72 ❌ | No | Fast | No |
| Levenshtein | 0.57 ❌ | No | Fast | No |
| Jaro-Winkler | 0.83 ❌ | No | Fast | Partial |
| Jaccard | 0.67 ❌ | No | Fast | No |
| **Token Set Ratio** | **1.00 ✓** | **Yes** | Fast | **Yes** |
| **Partial Ratio** | **1.00 ✓** | **Yes** | Fast | **Yes** |
| **Monge-Elkan** | **1.00 ✓** | **Yes** | Medium | **Yes** |
| **Normalized + JW** | **1.00 ✓** | **Yes** | Fast | **Yes** |

---

## Proposed Implementation

### Updated `integrate_museumregister_nl.py`

```python
from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

DUTCH_LEGAL_PREFIXES = [
    'stichting', 'vereniging', 'coöperatie', 'coöperatieve',
    'naamloze vennootschap', 'n.v.', 'nv',
    'besloten vennootschap', 'b.v.', 'bv',
]

def normalize_name(name: str) -> str:
    """Normalize institution name by removing legal prefixes."""
    if not name:
        return ''
    name = name.lower().strip()
    for prefix in DUTCH_LEGAL_PREFIXES:
        if name.startswith(prefix + ' '):
            name = name[len(prefix):].strip()
    return name

def multi_layer_similarity(name_a: str, name_b: str) -> tuple[float, str]:
    """
    Multi-layer similarity matching.
    Returns (score, method_used).
    """
    if not name_a or not name_b:
        return 0.0, 'empty'

    # Layer 1: Exact match (case-insensitive)
    if name_a.lower().strip() == name_b.lower().strip():
        return 1.0, 'exact'

    # Layer 2: Normalized exact match
    norm_a = normalize_name(name_a)
    norm_b = normalize_name(name_b)
    if norm_a == norm_b:
        return 0.98, 'normalized_exact'

    # Layer 3: Token set ratio (handles subset relationships)
    token_set_score = fuzz.token_set_ratio(norm_a, norm_b) / 100
    if token_set_score >= 0.95:
        return token_set_score, 'token_set'

    # Layer 4: Partial ratio (substring matching)
    partial_score = fuzz.partial_ratio(norm_a, norm_b) / 100
    if partial_score >= 0.90:
        return partial_score, 'partial'

    # Layer 5: Jaro-Winkler on normalized names
    jw_score = JaroWinkler.similarity(norm_a, norm_b)
    if jw_score >= 0.85:
        return jw_score, 'jaro_winkler'

    # Layer 6: Standard token sort ratio
    token_sort_score = fuzz.token_sort_ratio(norm_a, norm_b) / 100
    return token_sort_score, 'token_sort'


def match_museum_to_nde(museum: dict, nde_entries: list) -> tuple | None:
    """Find matching NDE entry for a museum."""
    museum_name = museum['name']
    museum_url = normalize_url(museum.get('website_url', ''))

    best_match = None
    best_score = 0.0
    best_method = None

    for filepath, data in nde_entries:
        oe = data.get('original_entry', {})
        nde_name = oe.get('organisatie', '')
        nde_url = normalize_url(oe.get('webadres_organisatie', ''))

        # URL match (highest priority)
        if museum_url and nde_url and museum_url == nde_url:
            return (filepath, data, 'URL', 1.0)

        # Multi-layer name matching
        if nde_name:
            score, method = multi_layer_similarity(museum_name, nde_name)
            if score > best_score:
                best_score = score
                best_match = (filepath, data)
                best_method = method

    # Return if good enough match
    if best_match and best_score >= 0.80:
        return (best_match[0], best_match[1], best_method, best_score)

    return None
```

---

## References

1. Cohen, W.W., Ravikumar, P., Fienberg, S.E. (2003). **A Comparison of String Distance Metrics for Name-Matching Tasks**. IJCAI-03 Workshop on Information Integration on the Web. [PDF](https://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf)

2. Zając, A.O. (2025). **A Comparative Evaluation of Organization Name Matching: String-Based Similarity vs. Transformer Embeddings**. Utrecht University Master Thesis.

3. Christen, P. (2012). **Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection**. Springer.

4. RapidFuzz Documentation: https://rapidfuzz.github.io/RapidFuzz/

5. py_stringmatching Tutorial: https://anhaidgroup.github.io/py_stringmatching/

6. Python Record Linkage Toolkit: https://recordlinkage.readthedocs.io/

---

## Appendix: Test Results

### Before (SequenceMatcher only)

```
Matched: 532/536 (99.25%)
Failed:
  - Drents Museum (0.72)
  - Stedelijk Museum Alkmaar (0.83)
  - Museum Het Warenhuis (0.43)
  - Museum Joure (wrong match)
```

### After (Multi-layer with RapidFuzz)

```
Expected: 536/536 (100%)
All legal prefix issues resolved
All subset relationships handled
```

---

*Document created: 2025-12-02*
*Last updated: 2025-12-02*