Entity Resolution Patterns
Version: 0.1.0
Last Updated: 2025-01-09
Related: PiCo Ontology Analysis | Cultural Naming Conventions
1. Overview
Entity resolution (ER) is the process of determining whether multiple observations refer to the same real-world person. This is fundamental to PPID's goal of linking POIDs into PRIDs.
This document covers:
- Theoretical foundations
- Challenges specific to heritage/genealogical data
- Algorithms and techniques
- Confidence scoring
- Human-in-the-loop patterns
2. The Entity Resolution Problem
2.1 Core Challenge
```
Source A:              Source B:              Source C:
┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Jan van Berg │       │ J. v.d. Berg │       │ Johannes Berg│
│ Archivist    │       │ Sr. Archivist│       │ Archives     │
│ Haarlem      │       │ NHA          │       │ North Holland│
│ LinkedIn     │       │ Website      │       │ Email sig    │
└──────────────┘       └──────────────┘       └──────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                               ▼
                         Same person?
                               │
                               ▼
                    ┌──────────────────┐
                    │ Jan van den Berg │
                    │ Sr. Archivist    │
                    │ NHA, Haarlem     │
                    │ PRID-xxxx-...    │
                    └──────────────────┘
```
2.2 Why This Is Hard
| Challenge | Example |
|---|---|
| Name variations | "Jan", "Johannes", "J.", "John" |
| Spelling variations | "Berg", "Bergh", "van der Berg" |
| Missing data | Birthdate unknown in 40% of records |
| Conflicting data | Source A: born 1965, Source B: born 1966 |
| Common names | 1,200 "Jan de Vries" in the Netherlands |
| Name changes | Marriage, religious conversion, migration |
| Historical records | Handwriting interpretation, OCR errors |
3. Entity Resolution Framework
3.1 Pipeline Architecture
```
                 ENTITY RESOLUTION PIPELINE

1. PREPROCESSING
   ├─ Normalize names, dates, locations
   ├─ Extract features: phonetic codes, n-grams
   └─ Standardize formats
          │
          ▼
2. BLOCKING
   ├─ Reduce comparison space (O(n²) → O(n))
   ├─ Group by: surname phonetic, birth year, location
   └─ Multiple blocking keys for recall
          │
          ▼
3. PAIRWISE COMPARISON
   ├─ Compare candidate pairs within blocks
   ├─ Calculate similarity scores per field
   └─ Aggregate into match probability
          │
          ▼
4. CLASSIFICATION
   ├─ Match / Non-match / Possible match
   └─ Threshold-based or ML classifier
          │
          ▼
5. CLUSTERING
   ├─ Group matched pairs into entities
   ├─ Handle transitivity: A=B, B=C → A=C
   └─ Resolve conflicts
          │
          ▼
6. HUMAN REVIEW (optional)
   ├─ Review uncertain matches
   ├─ Split incorrect clusters
   └─ Merge missed matches
```
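The six stages above can be sketched end-to-end in a few dozen lines. Everything here is illustrative: the record fields (`id`, `surname`, `birth_year`), the single first-letter-plus-decade blocking key, and the `SequenceMatcher`-based score are stand-ins for the richer components described in sections 4 through 8, not the production implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def preprocess(record: dict) -> dict:
    """Stage 1: normalize (here, just the surname)."""
    record = dict(record)
    record['surname'] = record['surname'].strip().lower()
    return record


def blocking_key(record: dict) -> str:
    """Stage 2: block on surname initial plus birth decade."""
    return f"{record['surname'][:1]}:{record['birth_year'] // 10}"


def score(a: dict, b: dict) -> float:
    """Stage 3: toy pairwise similarity on the surname only."""
    return SequenceMatcher(None, a['surname'], b['surname']).ratio()


def resolve(records: list[dict], threshold: float = 0.85) -> list[set[str]]:
    """Stages 1-5: returns clusters of record IDs believed to be one person."""
    records = [preprocess(r) for r in records]

    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)

    # Union-find parent map for stage 5 clustering
    parent = {r['id']: r['id'] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                # Stage 4: threshold classification; stage 5: merge matches
                if score(a, b) >= threshold:
                    parent[find(a['id'])] = find(b['id'])

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r['id'])].add(r['id'])
    return list(clusters.values())
```

On a toy input, `resolve` groups "Berg" and "Bergh" born a year apart into one cluster while keeping "Smit" separate; stage 6 (human review) would then inspect the borderline merges.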
4. Preprocessing
4.1 Name Normalization
```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """
    Normalize a name for comparison.

    Steps:
      1. Unicode normalization (NFKC)
      2. Lowercase
      3. Remove diacritics
      4. Standardize whitespace
      5. Remove punctuation (keeping hyphens)

    Abbreviation expansion is handled separately by expand_abbreviations().
    """
    # Unicode normalize
    name = unicodedata.normalize('NFKC', name)
    # Lowercase
    name = name.lower()
    # Remove diacritics
    name = ''.join(
        c for c in unicodedata.normalize('NFD', name)
        if unicodedata.category(c) != 'Mn'
    )
    # Standardize whitespace
    name = ' '.join(name.split())
    # Remove punctuation (except hyphens in names)
    name = re.sub(r'[^\w\s-]', '', name)
    return name


def expand_abbreviations(name: str, lang: str = 'nl') -> str:
    """Expand common name abbreviations and drop academic titles.

    Works token by token, so replacements never fire inside longer words
    and removed titles don't leave double spaces behind.
    """
    expansions = {
        'nl': {
            'j.': 'jan',
            'p.': 'pieter',
            'h.': 'hendrik',
            'c.': 'cornelis',
            'a.': 'abraham',
            'mr.': '',
            'dr.': '',
            'ir.': '',
            'drs.': '',
        }
    }
    table = expansions.get(lang, {})
    tokens = [table.get(t.lower(), t) for t in name.split()]
    return ' '.join(t for t in tokens if t)
```
4.2 Dutch Surname Particle Handling
```python
DUTCH_PARTICLES = {
    'van', 'van de', 'van den', 'van der', 'van het', "van 't",
    'de', 'den', 'der', 'het', "'t", 'te', 'ter', 'ten',
    'op', 'op de', 'op den', 'op het', "op 't",
    'in', 'in de', 'in den', 'in het', "in 't",
    'aan', 'aan de', 'aan den', 'aan het',
    'onder', 'onder de', 'onder den', 'onder het',
    'over', 'over de', 'over den', 'over het',
    'bij', 'bij de', 'bij den', 'bij het',
}

# First words of all particle sequences, used to locate where particles begin
PARTICLE_FIRST_WORDS = {p.split()[0] for p in DUTCH_PARTICLES}


def parse_dutch_name(full_name: str) -> dict:
    """
    Parse a Dutch name into components.

    Returns:
        {
            'given_names': ['Jan', 'Pieter'],
            'particles': 'van der',
            'surname': 'Berg',
            'full_surname': 'van der Berg'
        }
    """
    parts = full_name.split()

    # Find where the particles start
    particle_start = None
    for i, part in enumerate(parts):
        if part.lower() in PARTICLE_FIRST_WORDS:
            particle_start = i
            break

    if particle_start is None:
        # No particles - assume the last word is the surname
        return {
            'given_names': parts[:-1],
            'particles': '',
            'surname': parts[-1] if parts else '',
            'full_surname': parts[-1] if parts else ''
        }

    given_names = parts[:particle_start]
    remaining = parts[particle_start:]

    # Find the longest matching particle sequence (at most two words,
    # e.g. 'van der'), always leaving at least one word for the surname
    for length in range(min(2, len(remaining) - 1), 0, -1):
        candidate = ' '.join(remaining[:length]).lower()
        if candidate in DUTCH_PARTICLES:
            return {
                'given_names': given_names,
                'particles': ' '.join(remaining[:length]),
                'surname': ' '.join(remaining[length:]),
                'full_surname': ' '.join(remaining)
            }

    # No recognized particle - treat everything remaining as surname
    return {
        'given_names': given_names,
        'particles': '',
        'surname': ' '.join(remaining),
        'full_surname': ' '.join(remaining)
    }
```
4.3 Phonetic Encoding
```python
# Multiple phonetic algorithms for different name origins

def soundex(name: str) -> str:
    """Standard American Soundex encoding."""
    if not name:
        return ''

    # Soundex digit mapping
    mapping = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
        's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3',
        'l': '4',
        'm': '5', 'n': '5',
        'r': '6',
    }

    name = name.lower()
    code = name[0].upper()
    prev_digit = mapping.get(name[0], '')

    for char in name[1:]:
        digit = mapping.get(char, '')
        if digit and digit != prev_digit:
            code += digit
        # 'h' and 'w' are transparent; vowels reset the previous digit,
        # so identical codes separated by a vowel are both kept
        if char not in 'hw':
            prev_digit = digit

    return (code + '000')[:4]


def double_metaphone(name: str) -> tuple[str, str]:
    """
    Double Metaphone encoding - returns primary and alternate codes.
    Better for European names than Soundex.

    Note: use an external library (e.g. metaphone, jellyfish) for a
    full implementation.
    """
    from metaphone import doublemetaphone
    return doublemetaphone(name)


def cologne_phonetic(name: str) -> str:
    """
    Kölner Phonetik - optimized for German names.
    Better for Dutch names than Soundex.
    """
    # ... (implementation omitted - use a dedicated library in practice)
    raise NotImplementedError
```
5. Blocking Strategies
5.1 Why Blocking?
Without blocking, comparing N records pairwise requires N(N−1)/2 ≈ N²/2 comparisons:
- 10,000 records → ~50 million comparisons
- 1 million records → ~500 billion comparisons
Blocking reduces this by only comparing records within the same "block".
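The quadratic blow-up is easy to verify with the exact pair-count formula:

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 10,000 records is already ~50 million comparisons;
# 1 million records is ~500 billion.
```

Even at a million comparisons per second, the unblocked million-record case would take years, which is why blocking is not optional at scale.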
5.2 Blocking Key Functions
```python
from collections import defaultdict


def generate_blocking_keys(record: dict) -> list[str]:
    """
    Generate multiple blocking keys for a person record.
    Multiple keys improve recall (finding all matches).

    Args:
        record: Person observation with name, dates, location

    Returns:
        List of blocking keys
    """
    keys = []

    name = record.get('name', {})
    surname = name.get('surname', '')
    given = name.get('given_name', '')
    birth_year = record.get('birth_year')
    location = record.get('location', {}).get('city', '')

    # Key 1: Surname Soundex
    if surname:
        keys.append(f"soundex:{soundex(surname)}")

    # Key 2: First 3 chars of surname + birth decade
    if surname and birth_year:
        decade = (birth_year // 10) * 10
        keys.append(f"s3y:{surname[:3].lower()}:{decade}")

    # Key 3: Given name initial + surname Soundex
    if given and surname:
        keys.append(f"is:{given[0].lower()}:{soundex(surname)}")

    # Key 4: Location + birth year window (±1 year)
    if location and birth_year:
        for year in (birth_year - 1, birth_year, birth_year + 1):
            keys.append(f"ly:{location[:3].lower()}:{year}")

    # Key 5: Double Metaphone of surname
    if surname:
        dm1, dm2 = double_metaphone(surname)
        if dm1:
            keys.append(f"dm1:{dm1}")
        if dm2:
            keys.append(f"dm2:{dm2}")

    return keys


def build_blocks(records: list[dict]) -> dict[str, list[str]]:
    """Build a blocking index: key → list of record IDs."""
    blocks = defaultdict(list)
    for record in records:
        record_id = record['id']
        for key in generate_blocking_keys(record):
            blocks[key].append(record_id)
    return blocks
```
5.3 Block Size Management
```python
def get_candidate_pairs(blocks: dict, max_block_size: int = 1000) -> set[tuple]:
    """
    Generate candidate pairs from blocks.
    Skip blocks that are too large (common names).
    """
    pairs = set()
    for key, record_ids in blocks.items():
        if len(record_ids) > max_block_size:
            # Block too large - likely a common name
            # Log the key for manual review
            continue
        # Generate all pairs within the block
        for i, id1 in enumerate(record_ids):
            for id2 in record_ids[i + 1:]:
                # Ensure consistent ordering
                pair = (min(id1, id2), max(id1, id2))
                pairs.add(pair)
    return pairs
```
6. Similarity Metrics
6.1 String Similarity
```python
from difflib import SequenceMatcher


def jaro_winkler(s1: str, s2: str) -> float:
    """
    Jaro-Winkler similarity - good for names.
    Gives higher scores when strings match from the beginning.

    Returns: 0.0 to 1.0
    """
    # Use an external library for an optimized implementation
    from jellyfish import jaro_winkler_similarity
    return jaro_winkler_similarity(s1, s2)


def levenshtein_ratio(s1: str, s2: str) -> float:
    """
    Normalized edit-distance similarity (SequenceMatcher ratio,
    a Levenshtein-like measure).

    Returns: 0.0 to 1.0 (1.0 = identical)
    """
    return SequenceMatcher(None, s1, s2).ratio()


def token_set_ratio(s1: str, s2: str) -> float:
    """
    Token set similarity - handles word order differences.
    "Jan van Berg" vs "Berg, Jan van" → high similarity

    Returns: 0.0 to 1.0
    """
    # fuzzywuzzy's maintained successor, "thefuzz", has the same API
    from fuzzywuzzy import fuzz
    return fuzz.token_set_ratio(s1, s2) / 100.0
```
6.2 Date Similarity
```python
def date_similarity(date1: dict, date2: dict) -> float:
    """
    Compare dates with uncertainty handling.

    Args:
        date1, date2: Dicts with keys: year, month, day, precision
            precision: 'exact', 'year', 'decade', 'century', 'unknown'

    Returns:
        0.0 to 1.0 (1.0 = exact match)
    """
    p1 = date1.get('precision', 'unknown')
    p2 = date2.get('precision', 'unknown')

    # If either is unknown, we can't compare
    if p1 == 'unknown' or p2 == 'unknown':
        return 0.5  # Neutral - doesn't help or hurt

    y1, y2 = date1.get('year'), date2.get('year')
    if y1 is None or y2 is None:
        return 0.5

    year_diff = abs(y1 - y2)

    # Exact year match
    if year_diff == 0:
        if p1 == 'exact' and p2 == 'exact':
            m1, m2 = date1.get('month'), date2.get('month')
            d1, d2 = date1.get('day'), date2.get('day')
            if m1 and m2 and m1 == m2:
                if d1 and d2 and d1 == d2:
                    return 1.0   # Exact match
                return 0.95      # Same month
            return 0.90          # Same year
        return 0.85              # Same year, at least one imprecise

    # Allow a 1-year difference (recording errors are common)
    if year_diff == 1:
        return 0.70

    # Allow a 2-year difference with a lower score
    if year_diff == 2:
        return 0.40

    # Larger differences are increasingly unlikely
    if year_diff <= 5:
        return 0.20

    return 0.0  # Too different
```
6.3 Location Similarity
```python
def location_similarity(loc1: dict, loc2: dict) -> float:
    """
    Compare locations with hierarchy awareness.

    Args:
        loc1, loc2: Dicts with keys: city, region, country, coordinates

    Returns:
        0.0 to 1.0
    """
    # Exact city match
    if loc1.get('city') and loc2.get('city'):
        city1 = normalize_name(loc1['city'])
        city2 = normalize_name(loc2['city'])
        if city1 == city2:
            return 1.0
        # Fuzzy city match
        city_sim = jaro_winkler(city1, city2)
        if city_sim > 0.9:
            return 0.9

    # Region match (if cities don't match)
    if loc1.get('region') and loc2.get('region'):
        if normalize_name(loc1['region']) == normalize_name(loc2['region']):
            return 0.6

    # Country match only
    if loc1.get('country') and loc2.get('country'):
        if loc1['country'] == loc2['country']:
            return 0.3

    # Geographic distance (if coordinates available)
    # haversine_distance is assumed to be defined elsewhere, returning km
    if loc1.get('coordinates') and loc2.get('coordinates'):
        dist = haversine_distance(loc1['coordinates'], loc2['coordinates'])
        if dist < 10:  # km
            return 0.8
        if dist < 50:
            return 0.5
        if dist < 100:
            return 0.3

    return 0.0
```
7. Match Scoring
7.1 Weighted Combination
```python
def calculate_match_score(obs1: dict, obs2: dict) -> dict:
    """
    Calculate the overall match score between two observations.

    Returns:
        {
            'score': float (0.0 to 1.0),
            'confidence': float (0.0 to 1.0),
            'field_scores': {...},
            'explanation': str
        }
    """
    # Field weights (must sum to 1.0)
    weights = {
        'name': 0.40,
        'birth_date': 0.25,
        'location': 0.15,
        'institution': 0.15,
        'role': 0.05,
    }

    field_scores = {}

    # Name comparison (most important)
    field_scores['name'] = compare_names(obs1.get('name', {}), obs2.get('name', {}))

    # Birth date comparison
    field_scores['birth_date'] = date_similarity(
        obs1.get('birth_date', {}), obs2.get('birth_date', {})
    )

    # Location comparison
    field_scores['location'] = location_similarity(
        obs1.get('location', {}), obs2.get('location', {})
    )

    # Institution comparison (GHCID)
    inst1 = obs1.get('institution_ghcid')
    inst2 = obs2.get('institution_ghcid')
    field_scores['institution'] = 1.0 if inst1 and inst1 == inst2 else 0.0

    # Role comparison
    role1 = obs1.get('role', '').lower()
    role2 = obs2.get('role', '').lower()
    field_scores['role'] = token_set_ratio(role1, role2) if role1 and role2 else 0.5

    # Weighted score
    total_score = sum(
        field_scores[field] * weight
        for field, weight in weights.items()
    )

    # Confidence based on data completeness
    # (0.5 is the neutral score used when a field is missing)
    fields_present = sum(1 for f in field_scores if field_scores[f] != 0.5)
    confidence = fields_present / len(field_scores)

    # Generate a human-readable explanation
    # (generate_match_explanation is assumed to be defined elsewhere)
    explanation = generate_match_explanation(field_scores, weights)

    return {
        'score': total_score,
        'confidence': confidence,
        'field_scores': field_scores,
        'explanation': explanation
    }


def compare_names(name1: dict, name2: dict) -> float:
    """Name comparison across full-name, surname, and given-name fields."""
    scores = []

    # Full name comparison
    full1 = name1.get('literal_name', '')
    full2 = name2.get('literal_name', '')
    if full1 and full2:
        scores.append(token_set_ratio(full1, full2))

    # Surname comparison
    sur1 = name1.get('surname', '')
    sur2 = name2.get('surname', '')
    if sur1 and sur2:
        scores.append(jaro_winkler(sur1, sur2) * 1.2)  # Weight surname higher

    # Given name comparison
    given1 = name1.get('given_name', '')
    given2 = name2.get('given_name', '')
    if given1 and given2:
        # Handle initials
        if len(given1) == 1 or len(given2) == 1:
            if given1[0].lower() == given2[0].lower():
                scores.append(0.7)  # Initial match
            # Mismatched initials contribute no score either way
        else:
            scores.append(jaro_winkler(given1, given2))

    return min(1.0, sum(scores) / len(scores)) if scores else 0.5
```
7.2 Classification Thresholds
```python
def classify_match(score: float, confidence: float) -> str:
    """
    Classify a pair as match / non-match / possible.

    Returns: 'match', 'non_match', 'possible'
    """
    # High-confidence thresholds
    if confidence >= 0.7:
        if score >= 0.85:
            return 'match'
        if score <= 0.30:
            return 'non_match'
        return 'possible'

    # Low confidence - be more conservative
    if score >= 0.92:
        return 'match'
    if score <= 0.20:
        return 'non_match'
    return 'possible'
```
8. Clustering
8.1 Transitive Closure
```python
from collections import defaultdict


def cluster_matches(matches: list[tuple[str, str]]) -> list[set[str]]:
    """
    Cluster matched pairs using Union-Find.

    Args:
        matches: List of (id1, id2) matched pairs

    Returns:
        List of clusters (sets of IDs)
    """
    # Union-Find data structure with path compression
    parent = {}

    def find(x):
        if x not in parent:
            parent[x] = x
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py

    # Build clusters
    for id1, id2 in matches:
        union(id1, id2)

    # Extract clusters
    clusters = defaultdict(set)
    for x in parent:
        clusters[find(x)].add(x)

    return list(clusters.values())
```
8.2 Conflict Resolution
```python
def resolve_cluster_conflicts(cluster: set[str], records: dict) -> dict:
    """
    Resolve conflicting data within a cluster to create a reconstruction.

    Strategy: vote, weighted by source confidence.
    """
    reconstruction = {}

    # Collect all values for each field
    field_values = defaultdict(list)
    for record_id in cluster:
        record = records[record_id]
        source_confidence = record.get('provenance', {}).get('confidence', 0.5)
        for field, value in record.items():
            if field not in ('id', 'provenance'):
                field_values[field].append({
                    'value': value,
                    'source': record_id,
                    'confidence': source_confidence
                })

    # Vote for the best value per field
    for field, values in field_values.items():
        if not values:
            continue

        # Group identical values
        value_groups = defaultdict(list)
        for v in values:
            value_groups[str(v['value'])].append(v)

        # Select the group with the highest total confidence
        best_value = max(
            value_groups.items(),
            key=lambda x: sum(v['confidence'] for v in x[1])
        )

        reconstruction[field] = {
            # Take the winning group's value (original type, not the str key)
            'value': best_value[1][0]['value'],
            'sources': [v['source'] for v in best_value[1]],
            # Winning group's confidence mass as a share of all votes
            'confidence': sum(v['confidence'] for v in best_value[1]) / len(values)
        }

    return reconstruction
```
9. Handling Uncertainty
9.1 Uncertain Links
```turtle
# PiCo-style uncertainty modeling

# High-confidence match
<poid:7a3b-...> picom:certainSameAs <poid:8c4d-...> ;
    picom:matchConfidence 0.95 .

# Possible match (human review needed)
<poid:7a3b-...> picom:possibleSameAs <poid:9d5e-...> ;
    picom:matchConfidence 0.65 ;
    picom:matchReviewStatus "pending" .

# Explicit non-match (after review)
<poid:7a3b-...> picom:notSameAs <poid:1234-...> ;
    picom:differentPersonConfidence 0.90 ;
    picom:differentPersonReason "Different birthdates 20 years apart" .
```
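Such statements can be emitted directly from classifier output. The helper below is a minimal sketch: the `picom:` predicate names follow the snippet above, while the 0.85 cut-off between certain and possible is an illustrative assumption, not a fixed part of the model.

```python
def same_as_triples(poid1: str, poid2: str, confidence: float,
                    match_threshold: float = 0.85) -> str:
    """Render a (certain|possible)SameAs statement as a Turtle fragment.

    The match_threshold default of 0.85 is illustrative.
    """
    if confidence >= match_threshold:
        lines = [
            f"<{poid1}> picom:certainSameAs <{poid2}> ;",
            f"    picom:matchConfidence {confidence:.2f} .",
        ]
    else:
        lines = [
            f"<{poid1}> picom:possibleSameAs <{poid2}> ;",
            f"    picom:matchConfidence {confidence:.2f} ;",
            '    picom:matchReviewStatus "pending" .',
        ]
    return '\n'.join(lines)
```

A full implementation would also handle `picom:notSameAs` after review and serialize through an RDF library rather than string formatting.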
9.2 Confidence Propagation
```python
def propagate_confidence(cluster_confidence: list[dict]) -> float:
    """
    Calculate overall cluster confidence from pairwise confidences.

    Uses the weakest-link principle: a cluster is only as strong as
    its weakest connection.
    """
    if not cluster_confidence:
        return 0.0

    # Build the list of confidence edges
    edges = [
        (conf['id1'], conf['id2'], conf['confidence'])
        for conf in cluster_confidence
    ]

    # Weakest link and average
    # (simplified - in practice, take the minimum over a proper
    # spanning tree of the match graph)
    min_confidence = min(c for _, _, c in edges)
    avg_confidence = sum(c for _, _, c in edges) / len(edges)

    # Blend minimum and average
    return 0.7 * min_confidence + 0.3 * avg_confidence
```
10. Human-in-the-Loop
10.1 Review Queue
```python
def generate_review_queue(possible_matches: list[dict]) -> list[dict]:
    """
    Prioritize uncertain matches for human review.

    Priorities:
      1. High-value records (staff at major institutions)
      2. Borderline scores (near threshold)
      3. Conflicting evidence
    """
    queue = []
    for match in possible_matches:
        priority = calculate_review_priority(match)
        queue.append({
            'match': match,
            'priority': priority,
            # get_priority_reason is assumed to be defined elsewhere
            'reason': get_priority_reason(match)
        })
    return sorted(queue, key=lambda x: x['priority'], reverse=True)


def calculate_review_priority(match: dict) -> float:
    """Calculate a review priority score."""
    score = 0.0

    # Near threshold = high priority
    match_score = match['score']
    if 0.40 <= match_score <= 0.80:
        score += 0.3

    # Conflicting fields = high priority
    field_scores = match.get('field_scores', {})
    high_scores = sum(1 for s in field_scores.values() if s > 0.8)
    low_scores = sum(1 for s in field_scores.values() if s < 0.3)
    if high_scores > 0 and low_scores > 0:
        score += 0.4  # Conflicting evidence

    # High-profile (Dutch) institution = high priority
    if match.get('institution_ghcid', '').startswith('NL-'):
        score += 0.2

    return score
```
10.2 Review Interface Data
```json
{
  "review_id": "rev-12345",
  "observation_a": {
    "poid": "POID-7a3b-c4d5-e6f7-890X",
    "name": "Jan van den Berg",
    "role": "Senior Archivist",
    "institution": "Noord-Hollands Archief",
    "source": "linkedin.com/in/jan-van-den-berg",
    "retrieved": "2025-01-09"
  },
  "observation_b": {
    "poid": "POID-8c4d-e5f6-g7h8-901Y",
    "name": "J. v.d. Berg",
    "role": "Archivaris",
    "institution": "NHA",
    "source": "noord-hollandsarchief.nl/medewerkers",
    "retrieved": "2025-01-08"
  },
  "match_score": 0.72,
  "field_scores": {
    "name": 0.85,
    "role": 0.90,
    "institution": 1.00,
    "birth_date": 0.50
  },
  "system_recommendation": "possible_match",
  "review_options": [
    {"action": "confirm_match", "label": "Same Person"},
    {"action": "reject_match", "label": "Different People"},
    {"action": "needs_more_info", "label": "Need More Information"}
  ]
}
```
11. Performance Optimization
11.1 Indexing Strategy
```sql
-- PostgreSQL indexes for entity resolution

-- soundex() is provided by the fuzzystrmatch extension
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Phonetic code index for blocking
CREATE INDEX idx_soundex_surname ON person_observations
    USING btree (soundex(surname));

-- Trigram index for fuzzy name matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_name_trgm ON person_observations
    USING gin (name gin_trgm_ops);

-- Birth year range index
CREATE INDEX idx_birth_year ON person_observations
    USING btree (birth_year);

-- Composite blocking key index
-- (non-function expressions need an extra set of parentheses)
CREATE INDEX idx_blocking ON person_observations
    USING btree (soundex(surname), (birth_year / 10));
```
11.2 Batch Processing
```python
import asyncio


async def process_entity_resolution_batch(
    new_observations: list[dict],
    existing_index: 'BlockingIndex',
    batch_size: int = 1000,
) -> list[dict]:
    """
    Process new observations against existing records in batches.

    BlockingIndex.find_candidates and score_pair are assumed to be
    defined elsewhere.
    """
    results = []

    for i in range(0, len(new_observations), batch_size):
        batch = new_observations[i:i + batch_size]

        # Generate blocking keys
        batch_keys = [generate_blocking_keys(obs) for obs in batch]

        # Find candidate pairs
        candidates = existing_index.find_candidates(batch_keys)

        # Score candidates concurrently
        scores = await asyncio.gather(*[
            score_pair(obs, candidate)
            for obs, candidate in candidates
        ])

        # Classify and collect results
        for score in scores:
            classification = classify_match(score['score'], score['confidence'])
            results.append({
                **score,
                'classification': classification,
            })

    return results
```
12. Evaluation Metrics
12.1 Standard Metrics
| Metric | Formula | Target |
|---|---|---|
| Precision | TP / (TP + FP) | > 0.95 |
| Recall | TP / (TP + FN) | > 0.90 |
| F1 Score | 2 × (P × R) / (P + R) | > 0.92 |
| Pairs Completeness | Matched pairs found / Total true pairs | > 0.90 |
| Pairs Quality | True matches in candidates / Total candidates | > 0.80 |
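Given pairwise counts of true positives, false positives, and false negatives from a labeled evaluation set, the first three metrics in the table follow directly from their formulas:

```python
def evaluation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from pairwise match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'precision': precision, 'recall': recall, 'f1': f1}
```

For example, 90 true matches found, 5 false matches, and 10 missed matches give precision ≈ 0.947 and recall 0.90, meeting the precision target but sitting right at the recall floor.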
12.2 Heritage-Specific Metrics
| Metric | Description | Target |
|---|---|---|
| Cross-source accuracy | Matches across different source types | > 0.90 |
| Historical accuracy | Matches involving records >50 years old | > 0.85 |
| Name variant coverage | Recall on known name variations | > 0.88 |
| Conflict resolution accuracy | Correct value selected in conflicts | > 0.92 |
13. References
Academic Sources
- Christen, P. (2012). "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection."
- Fellegi, I.P. & Sunter, A.B. (1969). "A Theory for Record Linkage." Journal of the American Statistical Association.
- Naumann, F. & Herschel, M. (2010). "An Introduction to Duplicate Detection."
Tools and Libraries
- dedupe (Python): https://github.com/dedupeio/dedupe
- RecordLinkage (R): https://cran.r-project.org/package=RecordLinkage
- FRIL: https://fril.sourceforge.net/
Genealogical Entity Resolution
- Efremova, J., et al. (2014). "Record Linkage in Genealogical Data."
- Bloothooft, G. (2015). "Learning Name Variants from True Person Resolution."