glam/docs/plan/person_pid/06_entity_resolution_patterns.md


Entity Resolution Patterns

Version: 0.1.0
Last Updated: 2025-01-09
Related: PiCo Ontology Analysis | Cultural Naming Conventions


1. Overview

Entity resolution (ER) is the process of determining whether multiple observations refer to the same real-world person. This is fundamental to PPID's goal of linking POIDs into PRIDs.

This document covers:

  • Theoretical foundations
  • Challenges specific to heritage/genealogical data
  • Algorithms and techniques
  • Confidence scoring
  • Human-in-the-loop patterns

2. The Entity Resolution Problem

2.1 Core Challenge

Source A:               Source B:               Source C:
┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Jan van Berg │       │ J. v.d. Berg │       │ Johannes Berg│
│ Archivist    │       │ Sr. Archivist│       │ Archives     │
│ Haarlem      │       │ NHA          │       │ North Holland│
│ LinkedIn     │       │ Website      │       │ Email sig    │
└──────────────┘       └──────────────┘       └──────────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                               ▼
                        Same person?
                               │
                               ▼
                    ┌──────────────────┐
                    │ Jan van den Berg │
                    │ Sr. Archivist    │
                    │ NHA, Haarlem     │
                    │ PRID-xxxx-...    │
                    └──────────────────┘

2.2 Why This Is Hard

| Challenge | Example |
|---|---|
| Name variations | "Jan", "Johannes", "J.", "John" |
| Spelling variations | "Berg", "Bergh", "van der Berg" |
| Missing data | Birthdate unknown in 40% of records |
| Conflicting data | Source A: born 1965, Source B: born 1966 |
| Common names | 1,200 "Jan de Vries" in the Netherlands |
| Name changes | Marriage, religious conversion, migration |
| Historical records | Handwriting interpretation, OCR errors |

3. Entity Resolution Framework

3.1 Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ENTITY RESOLUTION PIPELINE                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. PREPROCESSING                                                │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Normalize names, dates, locations                     │    │
│     │ Extract features: phonetic codes, n-grams             │    │
│     │ Standardize formats                                   │    │
│     └──────────────────────────────────────────────────────┘    │
│                              │                                   │
│                              ▼                                   │
│  2. BLOCKING                                                     │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Reduce comparison space (O(n²) → O(n))               │    │
│     │ Group by: surname phonetic, birth year, location     │    │
│     │ Multiple blocking keys for recall                     │    │
│     └──────────────────────────────────────────────────────┘    │
│                              │                                   │
│                              ▼                                   │
│  3. PAIRWISE COMPARISON                                          │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Compare candidate pairs within blocks                │    │
│     │ Calculate similarity scores per field                │    │
│     │ Aggregate into match probability                      │    │
│     └──────────────────────────────────────────────────────┘    │
│                              │                                   │
│                              ▼                                   │
│  4. CLASSIFICATION                                               │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Match / Non-match / Possible match                   │    │
│     │ Threshold-based or ML classifier                     │    │
│     └──────────────────────────────────────────────────────┘    │
│                              │                                   │
│                              ▼                                   │
│  5. CLUSTERING                                                   │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Group matched pairs into entities                    │    │
│     │ Handle transitivity: A=B, B=C → A=C                  │    │
│     │ Resolve conflicts                                     │    │
│     └──────────────────────────────────────────────────────┘    │
│                              │                                   │
│                              ▼                                   │
│  6. HUMAN REVIEW (optional)                                      │
│     ┌──────────────────────────────────────────────────────┐    │
│     │ Review uncertain matches                             │    │
│     │ Split incorrect clusters                             │    │
│     │ Merge missed matches                                  │    │
│     └──────────────────────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

4. Preprocessing

4.1 Name Normalization

import unicodedata
import re

def normalize_name(name: str) -> str:
    """
    Normalize name for comparison.
    
    Steps:
    1. Unicode normalization (NFKC)
    2. Lowercase
    3. Remove diacritics
    4. Standardize whitespace
    5. Remove punctuation
    6. Expand common abbreviations
    """
    # Unicode normalize
    name = unicodedata.normalize('NFKC', name)
    
    # Lowercase
    name = name.lower()
    
    # Remove diacritics
    name = ''.join(
        c for c in unicodedata.normalize('NFD', name)
        if unicodedata.category(c) != 'Mn'
    )
    
    # Standardize whitespace
    name = ' '.join(name.split())
    
    # Remove punctuation (except hyphens in names)
    name = re.sub(r'[^\w\s-]', '', name)
    
    return name
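The diacritic-removal step (NFD decomposition plus filtering of combining marks, Unicode category `Mn`) is the least obvious part; in isolation:

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    # NFD decomposition splits 'é' into 'e' plus a combining accent
    # (category 'Mn'), which the filter then drops.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

print(strip_diacritics('José Müller'))  # -> Jose Muller
```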


def expand_abbreviations(name: str, lang: str = 'nl') -> str:
    """Expand common name abbreviations and drop titles.

    Token-based, so the shorter 'dr.' entry cannot mangle 'drs.'.
    """
    expansions = {
        'nl': {
            'j.': 'jan',
            'p.': 'pieter',
            'h.': 'hendrik',
            'c.': 'cornelis',
            'a.': 'abraham',
            'mr.': '',
            'dr.': '',
            'ir.': '',
            'drs.': '',
        }
    }
    
    table = expansions.get(lang, {})
    tokens = (table.get(tok.lower(), tok) for tok in name.split())
    return ' '.join(t for t in tokens if t)

4.2 Dutch Surname Particle Handling

DUTCH_PARTICLES = {
    'van', 'van de', 'van den', 'van der', 'van het', "van 't",
    'de', 'den', 'der', 'het', "'t", 'te', 'ter', 'ten',
    'op', 'op de', 'op den', 'op het', "op 't",
    'in', 'in de', 'in den', 'in het', "in 't",
    'aan', 'aan de', 'aan den', 'aan het',
    'onder', 'onder de', 'onder den', 'onder het',
    'over', 'over de', 'over den', 'over het',
    'bij', 'bij de', 'bij den', 'bij het',
}


def parse_dutch_name(full_name: str) -> dict:
    """
    Parse Dutch name into components.
    
    Returns:
        {
            'given_names': ['Jan', 'Pieter'],
            'particles': 'van der',
            'surname': 'Berg',
            'full_surname': 'van der Berg'
        }
    """
    parts = full_name.split()
    
    # Find where particles start: match on the first word of any known
    # particle sequence, so 'op de', 'in den', etc. are also detected
    particle_first_words = {p.split()[0] for p in DUTCH_PARTICLES}
    particle_start = None
    for i, part in enumerate(parts):
        if part.lower() in particle_first_words:
            particle_start = i
            break
    
    if particle_start is None:
        # No particles - assume last word is surname
        return {
            'given_names': parts[:-1],
            'particles': '',
            'surname': parts[-1] if parts else '',
            'full_surname': parts[-1] if parts else ''
        }
    
    given_names = parts[:particle_start]
    remaining = parts[particle_start:]
    
    # Find longest matching particle sequence
    for length in range(min(3, len(remaining)), 0, -1):
        candidate = ' '.join(remaining[:length]).lower()
        if candidate in DUTCH_PARTICLES:
            return {
                'given_names': given_names,
                'particles': ' '.join(remaining[:length]),
                'surname': ' '.join(remaining[length:]),
                'full_surname': ' '.join(remaining)
            }
    
    # No recognized particle - treat all as surname
    return {
        'given_names': given_names,
        'particles': '',
        'surname': ' '.join(remaining),
        'full_surname': ' '.join(remaining)
    }
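The longest-match loop matters: a short particle must not win when a longer sequence applies. A reduced standalone sketch (particle set trimmed for brevity):

```python
PARTICLES = {'van', 'de', 'van de', 'van der', 'van den'}

def split_surname(words: list[str]) -> tuple[str, str]:
    # Try the longest candidate prefix first, so 'van der' beats plain 'van'
    for length in range(min(3, len(words)), 0, -1):
        if ' '.join(words[:length]).lower() in PARTICLES:
            return ' '.join(words[:length]), ' '.join(words[length:])
    return '', ' '.join(words)

print(split_surname(['van', 'der', 'Berg']))  # -> ('van der', 'Berg')
```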

4.3 Phonetic Encoding

# Multiple phonetic algorithms for different name origins

def soundex(name: str) -> str:
    """Standard Soundex encoding."""
    if not name:
        return ''
    
    # Soundex mapping
    mapping = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3',
        'l': '4',
        'm': '5', 'n': '5',
        'r': '6',
    }
    
    name = name.lower()
    code = name[0].upper()
    prev_digit = mapping.get(name[0], '')
    
    for char in name[1:]:
        digit = mapping.get(char, '')
        if digit and digit != prev_digit:
            code += digit
        # Vowels separate repeated codes; 'h' and 'w' are transparent
        if char not in 'hw':
            prev_digit = digit
    
    return (code + '000')[:4]


def double_metaphone(name: str) -> tuple[str, str]:
    """
    Double Metaphone encoding - returns primary and alternate codes.
    Better for European names than Soundex.
    
    Note: Use external library (e.g., fuzzy, jellyfish) for full implementation.
    """
    # Simplified - in practice use a library
    from metaphone import doublemetaphone
    return doublemetaphone(name)


def cologne_phonetic(name: str) -> str:
    """
    Kölner Phonetik - optimized for German names.
    Often a better fit for Dutch names than Soundex.
    
    Note: Use an external library (e.g. abydos) for the full rule table
    rather than hand-rolling it.
    """
    raise NotImplementedError

5. Blocking Strategies

5.1 Why Blocking?

Without blocking, comparing N records requires roughly N²/2 comparisons:

  • 10,000 records → 50 million comparisons
  • 1 million records → 500 billion comparisons

Blocking reduces this by only comparing records within the same "block".
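The effect is easy to see on toy data (record layout illustrative):

```python
from collections import defaultdict
from itertools import combinations

records = [('r1', 'berg'), ('r2', 'berg'), ('r3', 'vries'),
           ('r4', 'vries'), ('r5', 'vries'), ('r6', 'smit')]

# Without blocking: every record is compared with every other one
naive_pairs = len(list(combinations(records, 2)))  # n*(n-1)/2 = 15

# With blocking on surname: comparisons only happen inside each block
blocks = defaultdict(list)
for rec_id, surname in records:
    blocks[surname].append(rec_id)
blocked_pairs = sum(len(list(combinations(ids, 2))) for ids in blocks.values())

print(naive_pairs, blocked_pairs)  # 15 vs 4 (1 'berg' pair + 3 'vries' pairs)
```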

5.2 Blocking Key Functions

def generate_blocking_keys(record: dict) -> list[str]:
    """
    Generate multiple blocking keys for a person record.
    Multiple keys improve recall (finding all matches).
    
    Args:
        record: Person observation with name, dates, location
    
    Returns:
        List of blocking keys
    """
    keys = []
    
    name = record.get('name', {})
    surname = name.get('surname', '')
    given = name.get('given_name', '')
    birth_year = record.get('birth_year')
    location = record.get('location', {}).get('city', '')
    
    # Key 1: Surname Soundex
    if surname:
        keys.append(f"soundex:{soundex(surname)}")
    
    # Key 2: First 3 chars of surname + birth decade
    if surname and birth_year:
        decade = (birth_year // 10) * 10
        keys.append(f"s3y:{surname[:3].lower()}:{decade}")
    
    # Key 3: Given name initial + surname Soundex
    if given and surname:
        keys.append(f"is:{given[0].lower()}:{soundex(surname)}")
    
    # Key 4: Location + birth year window
    if location and birth_year:
        keys.append(f"ly:{location[:3].lower()}:{birth_year}")
        keys.append(f"ly:{location[:3].lower()}:{birth_year-1}")
        keys.append(f"ly:{location[:3].lower()}:{birth_year+1}")
    
    # Key 5: Double Metaphone of surname
    if surname:
        dm1, dm2 = double_metaphone(surname)
        if dm1:
            keys.append(f"dm1:{dm1}")
        if dm2:
            keys.append(f"dm2:{dm2}")
    
    return keys


from collections import defaultdict


def build_blocks(records: list[dict]) -> dict[str, list[str]]:
    """
    Build blocking index: key → list of record IDs.
    """
    blocks = defaultdict(list)
    
    for record in records:
        record_id = record['id']
        for key in generate_blocking_keys(record):
            blocks[key].append(record_id)
    
    return blocks

5.3 Block Size Management

def get_candidate_pairs(blocks: dict, max_block_size: int = 1000) -> set[tuple]:
    """
    Generate candidate pairs from blocks.
    Skip blocks that are too large (common names).
    """
    pairs = set()
    
    for key, record_ids in blocks.items():
        if len(record_ids) > max_block_size:
            # Block too large - likely a common name
            # Log for manual review
            continue
        
        # Generate all pairs within block
        for i, id1 in enumerate(record_ids):
            for id2 in record_ids[i+1:]:
                # Ensure consistent ordering
                pair = (min(id1, id2), max(id1, id2))
                pairs.add(pair)
    
    return pairs

6. Similarity Metrics

6.1 String Similarity

from difflib import SequenceMatcher


def jaro_winkler(s1: str, s2: str) -> float:
    """
    Jaro-Winkler similarity - good for names.
    Gives higher scores when strings match from the beginning.
    
    Returns: 0.0 to 1.0
    """
    # Use external library for optimized implementation
    from jellyfish import jaro_winkler_similarity
    return jaro_winkler_similarity(s1, s2)


def levenshtein_ratio(s1: str, s2: str) -> float:
    """
    Normalized edit-distance-style similarity.
    (difflib uses Ratcliff-Obershelp matching, not true Levenshtein;
    use jellyfish.levenshtein_distance if the exact metric is needed.)
    
    Returns: 0.0 to 1.0 (1.0 = identical)
    """
    return SequenceMatcher(None, s1, s2).ratio()


def token_set_ratio(s1: str, s2: str) -> float:
    """
    Token set similarity - handles word order differences.
    "Jan van Berg" vs "Berg, Jan van" → high similarity
    
    Returns: 0.0 to 1.0
    """
    from fuzzywuzzy import fuzz
    return fuzz.token_set_ratio(s1, s2) / 100.0
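Token-based scorers are used because plain sequence similarity is sensitive to word order. A stdlib-only illustration of the idea (a simplified token-sort, not fuzzywuzzy's exact algorithm):

```python
from difflib import SequenceMatcher

def seq_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    # Normalizing token order makes "Surname, Given" formats comparable
    norm = lambda s: ' '.join(sorted(s.replace(',', ' ').split()))
    return seq_ratio(norm(a), norm(b))

print(token_sort_ratio('jan van berg', 'berg, jan van'))  # -> 1.0
```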

6.2 Date Similarity

def date_similarity(date1: dict, date2: dict) -> float:
    """
    Compare dates with uncertainty handling.
    
    Args:
        date1, date2: Dicts with keys: year, month, day, precision
            precision: 'exact', 'year', 'decade', 'century', 'unknown'
    
    Returns:
        0.0 to 1.0 (1.0 = exact match)
    """
    p1 = date1.get('precision', 'unknown')
    p2 = date2.get('precision', 'unknown')
    
    # If either is unknown, can't compare
    if p1 == 'unknown' or p2 == 'unknown':
        return 0.5  # Neutral - doesn't help or hurt
    
    y1, y2 = date1.get('year'), date2.get('year')
    
    if y1 is None or y2 is None:
        return 0.5
    
    year_diff = abs(y1 - y2)
    
    # Exact year match
    if year_diff == 0:
        if p1 == 'exact' and p2 == 'exact':
            m1, m2 = date1.get('month'), date2.get('month')
            d1, d2 = date1.get('day'), date2.get('day')
            
            if m1 and m2 and m1 == m2:
                if d1 and d2 and d1 == d2:
                    return 1.0  # Exact match
                return 0.95  # Same month
            return 0.90  # Same year
        return 0.85  # Same year, at least one imprecise
    
    # Allow 1-year difference (recording errors common)
    if year_diff == 1:
        return 0.70
    
    # Allow 2-year difference with lower score
    if year_diff == 2:
        return 0.40
    
    # Larger differences increasingly unlikely
    if year_diff <= 5:
        return 0.20
    
    return 0.0  # Too different
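The year-difference bands can be exercised in isolation; this is a condensed restatement of the logic above, year precision only:

```python
def year_band_score(y1: int, y2: int) -> float:
    # Same bands as date_similarity, restricted to year precision
    diff = abs(y1 - y2)
    if diff == 0:
        return 0.90
    if diff == 1:
        return 0.70  # off-by-one recording errors are common
    if diff == 2:
        return 0.40
    return 0.20 if diff <= 5 else 0.0

print(year_band_score(1965, 1966))  # -> 0.7
```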

6.3 Location Similarity

def location_similarity(loc1: dict, loc2: dict) -> float:
    """
    Compare locations with hierarchy awareness.
    
    Args:
        loc1, loc2: Dicts with keys: city, region, country, coordinates
    
    Returns:
        0.0 to 1.0
    """
    # Exact city match
    if loc1.get('city') and loc2.get('city'):
        city1 = normalize_name(loc1['city'])
        city2 = normalize_name(loc2['city'])
        
        if city1 == city2:
            return 1.0
        
        # Fuzzy city match
        city_sim = jaro_winkler(city1, city2)
        if city_sim > 0.9:
            return 0.9
    
    # Region match (if cities don't match)
    if loc1.get('region') and loc2.get('region'):
        if normalize_name(loc1['region']) == normalize_name(loc2['region']):
            return 0.6
    
    # Country match only
    if loc1.get('country') and loc2.get('country'):
        if loc1['country'] == loc2['country']:
            return 0.3
    
    # Geographic distance (if coordinates available)
    if loc1.get('coordinates') and loc2.get('coordinates'):
        dist = haversine_distance(loc1['coordinates'], loc2['coordinates'])
        if dist < 10:  # km
            return 0.8
        if dist < 50:
            return 0.5
        if dist < 100:
            return 0.3
    
    return 0.0
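location_similarity calls haversine_distance, which is not defined elsewhere in this document; a standard great-circle implementation (demo coordinates are approximate):

```python
import math

def haversine_distance(coord1: tuple[float, float],
                       coord2: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, coord1)
    lat2, lon2 = map(math.radians, coord2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # Earth radius ~6371 km

# Amsterdam to Haarlem - well inside the <50 km band
print(round(haversine_distance((52.37, 4.90), (52.39, 4.64)), 1))
```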

7. Match Scoring

7.1 Weighted Combination

def calculate_match_score(obs1: dict, obs2: dict) -> dict:
    """
    Calculate overall match score between two observations.
    
    Returns:
        {
            'score': float (0.0 to 1.0),
            'confidence': float (0.0 to 1.0),
            'field_scores': {...},
            'explanation': str
        }
    """
    # Field weights (must sum to 1.0)
    weights = {
        'name': 0.40,
        'birth_date': 0.25,
        'location': 0.15,
        'institution': 0.15,
        'role': 0.05,
    }
    
    field_scores = {}
    
    # Name comparison (most important)
    name1 = obs1.get('name', {})
    name2 = obs2.get('name', {})
    field_scores['name'] = compare_names(name1, name2)
    
    # Birth date comparison
    birth1 = obs1.get('birth_date', {})
    birth2 = obs2.get('birth_date', {})
    field_scores['birth_date'] = date_similarity(birth1, birth2)
    
    # Location comparison
    loc1 = obs1.get('location', {})
    loc2 = obs2.get('location', {})
    field_scores['location'] = location_similarity(loc1, loc2)
    
    # Institution comparison (GHCID) - neutral when either side is missing,
    # so absent data does not count as negative evidence
    inst1 = obs1.get('institution_ghcid')
    inst2 = obs2.get('institution_ghcid')
    if inst1 and inst2:
        field_scores['institution'] = 1.0 if inst1 == inst2 else 0.0
    else:
        field_scores['institution'] = 0.5
    
    # Role comparison
    role1 = obs1.get('role', '').lower()
    role2 = obs2.get('role', '').lower()
    field_scores['role'] = token_set_ratio(role1, role2) if role1 and role2 else 0.5
    
    # Weighted score
    total_score = sum(
        field_scores[field] * weight 
        for field, weight in weights.items()
    )
    
    # Confidence based on data completeness (0.5 marks a field that
    # could not be compared)
    fields_present = sum(1 for s in field_scores.values() if s != 0.5)
    confidence = fields_present / len(field_scores)
    
    # Generate explanation (generate_match_explanation is a reporting
    # helper, not shown here)
    explanation = generate_match_explanation(field_scores, weights)
    
    return {
        'score': total_score,
        'confidence': confidence,
        'field_scores': field_scores,
        'explanation': explanation
    }
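Worked numbers for the weighted combination, using the field scores from the review example in Section 10.2 (the location score of 1.00 is assumed, since that example omits location):

```python
weights = {'name': 0.40, 'birth_date': 0.25, 'location': 0.15,
           'institution': 0.15, 'role': 0.05}
field_scores = {'name': 0.85, 'birth_date': 0.50, 'location': 1.00,
                'institution': 1.00, 'role': 0.90}

# Weighted sum: 0.34 + 0.125 + 0.15 + 0.15 + 0.045
total = sum(field_scores[f] * w for f, w in weights.items())
print(round(total, 3))  # -> 0.81
```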


def compare_names(name1: dict, name2: dict) -> float:
    """
    Sophisticated name comparison.
    """
    scores = []
    
    # Full name comparison
    full1 = name1.get('literal_name', '')
    full2 = name2.get('literal_name', '')
    if full1 and full2:
        scores.append(token_set_ratio(full1, full2))
    
    # Surname comparison
    sur1 = name1.get('surname', '')
    sur2 = name2.get('surname', '')
    if sur1 and sur2:
        scores.append(jaro_winkler(sur1, sur2) * 1.2)  # Weight surname higher
    
    # Given name comparison
    given1 = name1.get('given_name', '')
    given2 = name2.get('given_name', '')
    if given1 and given2:
        # Handle initials
        if len(given1) == 1 or len(given2) == 1:
            if given1[0].lower() == given2[0].lower():
                scores.append(0.7)  # Initial match
            else:
                scores.append(0.0)  # Mismatched initials count against the match
        else:
            scores.append(jaro_winkler(given1, given2))
    
    return min(1.0, sum(scores) / len(scores)) if scores else 0.5

7.2 Classification Thresholds

def classify_match(score: float, confidence: float) -> str:
    """
    Classify pair as match/non-match/possible.
    
    Returns: 'match', 'non_match', 'possible'
    """
    # High confidence thresholds
    if confidence >= 0.7:
        if score >= 0.85:
            return 'match'
        if score <= 0.30:
            return 'non_match'
        return 'possible'
    
    # Low confidence - be more conservative
    if score >= 0.92:
        return 'match'
    if score <= 0.20:
        return 'non_match'
    return 'possible'

8. Clustering

8.1 Transitive Closure

from collections import defaultdict


def cluster_matches(matches: list[tuple[str, str]]) -> list[set[str]]:
    """
    Cluster matched pairs using Union-Find.
    
    Args:
        matches: List of (id1, id2) matched pairs
    
    Returns:
        List of clusters (sets of IDs)
    """
    # Union-Find data structure
    parent = {}
    
    def find(x):
        if x not in parent:
            parent[x] = x
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]
    
    def union(x, y):
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py
    
    # Build clusters
    for id1, id2 in matches:
        union(id1, id2)
    
    # Extract clusters
    clusters = defaultdict(set)
    for x in parent:
        clusters[find(x)].add(x)
    
    return list(clusters.values())
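Transitivity in action on a toy match list; this minimal union-find mirrors the structure of cluster_matches above:

```python
from collections import defaultdict

def cluster(matches):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values())

# A=B and B=C imply A=C even though the (A, C) pair was never scored
print(cluster([('A', 'B'), ('B', 'C'), ('D', 'E')]))
# -> [['A', 'B', 'C'], ['D', 'E']]
```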

8.2 Conflict Resolution

from collections import defaultdict


def resolve_cluster_conflicts(cluster: set[str], records: dict) -> dict:
    """
    Resolve conflicting data within a cluster to create reconstruction.
    
    Strategy: Vote with confidence weighting
    """
    reconstruction = {}
    
    # Collect all values for each field
    field_values = defaultdict(list)
    
    for record_id in cluster:
        record = records[record_id]
        source_confidence = record.get('provenance', {}).get('confidence', 0.5)
        
        for field, value in record.items():
            if field not in ['id', 'provenance']:
                field_values[field].append({
                    'value': value,
                    'source': record_id,
                    'confidence': source_confidence
                })
    
    # Vote for best value per field
    for field, values in field_values.items():
        if not values:
            continue
        
        # Group identical values
        value_groups = defaultdict(list)
        for v in values:
            value_groups[str(v['value'])].append(v)
        
        # Select highest total confidence
        best_value = max(
            value_groups.items(),
            key=lambda x: sum(v['confidence'] for v in x[1])
        )
        
        reconstruction[field] = {
            'value': best_value[1][0]['value'],  # Winning value (original type)
            'sources': [v['source'] for v in best_value[1]],
            'confidence': sum(v['confidence'] for v in best_value[1]) / len(values)
        }
    
    return reconstruction

9. Handling Uncertainty

9.1 Match Assertions

# PiCo-style uncertainty modeling

# High confidence match
<poid:7a3b-...> picom:certainSameAs <poid:8c4d-...> ;
    picom:matchConfidence 0.95 .

# Possible match (human review needed)
<poid:7a3b-...> picom:possibleSameAs <poid:9d5e-...> ;
    picom:matchConfidence 0.65 ;
    picom:matchReviewStatus "pending" .

# Explicit non-match (after review)
<poid:7a3b-...> picom:notSameAs <poid:1234-...> ;
    picom:differentPersonConfidence 0.90 ;
    picom:differentPersonReason "Different birthdates 20 years apart" .

9.2 Confidence Propagation

def propagate_confidence(cluster_confidence: list[dict]) -> float:
    """
    Calculate overall cluster confidence from pairwise confidences.
    
    Uses weakest link principle: cluster is only as strong as 
    its weakest connection.
    """
    if not cluster_confidence:
        return 0.0
    
    # Build graph of confidences
    edges = []
    for conf in cluster_confidence:
        edges.append((conf['id1'], conf['id2'], conf['confidence']))
    
    # Find minimum spanning tree confidence
    # (simplified - in practice use proper MST algorithm)
    min_confidence = min(c for _, _, c in edges)
    avg_confidence = sum(c for _, _, c in edges) / len(edges)
    
    # Blend minimum and average
    return 0.7 * min_confidence + 0.3 * avg_confidence
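Worked numbers (values illustrative): one weak pairwise link drags the blended cluster confidence down even when the other links are strong:

```python
pairwise = [0.90, 0.95, 0.60]  # one weak connection in the cluster

# 0.7 * weakest link + 0.3 * average, as in propagate_confidence
blended = 0.7 * min(pairwise) + 0.3 * (sum(pairwise) / len(pairwise))
print(round(blended, 3))  # -> 0.665
```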

10. Human-in-the-Loop

10.1 Review Queue

def generate_review_queue(possible_matches: list[dict]) -> list[dict]:
    """
    Prioritize uncertain matches for human review.
    
    Priorities:
    1. High-value records (staff at major institutions)
    2. Borderline scores (near threshold)
    3. Conflicting evidence
    """
    queue = []
    
    for match in possible_matches:
        priority = calculate_review_priority(match)
        queue.append({
            'match': match,
            'priority': priority,
            'reason': get_priority_reason(match)
        })
    
    return sorted(queue, key=lambda x: x['priority'], reverse=True)


def calculate_review_priority(match: dict) -> float:
    """Calculate review priority score."""
    score = 0.0
    
    # Near threshold = high priority
    match_score = match['score']
    if 0.40 <= match_score <= 0.80:
        score += 0.3
    
    # Conflicting fields = high priority
    field_scores = match.get('field_scores', {})
    high_scores = sum(1 for s in field_scores.values() if s > 0.8)
    low_scores = sum(1 for s in field_scores.values() if s < 0.3)
    if high_scores > 0 and low_scores > 0:
        score += 0.4  # Conflicting evidence
    
    # High-profile institution = high priority
    if match.get('institution_ghcid', '').startswith('NL-'):
        score += 0.2
    
    return score

10.2 Review Interface Data

{
  "review_id": "rev-12345",
  "observation_a": {
    "poid": "POID-7a3b-c4d5-e6f7-890X",
    "name": "Jan van den Berg",
    "role": "Senior Archivist",
    "institution": "Noord-Hollands Archief",
    "source": "linkedin.com/in/jan-van-den-berg",
    "retrieved": "2025-01-09"
  },
  "observation_b": {
    "poid": "POID-8c4d-e5f6-g7h8-901Y",
    "name": "J. v.d. Berg",
    "role": "Archivaris",
    "institution": "NHA",
    "source": "noord-hollandsarchief.nl/medewerkers",
    "retrieved": "2025-01-08"
  },
  "match_score": 0.72,
  "field_scores": {
    "name": 0.85,
    "role": 0.90,
    "institution": 1.00,
    "birth_date": 0.50
  },
  "system_recommendation": "possible_match",
  "review_options": [
    {"action": "confirm_match", "label": "Same Person"},
    {"action": "reject_match", "label": "Different People"},
    {"action": "needs_more_info", "label": "Need More Information"}
  ]
}

11. Performance Optimization

11.1 Indexing Strategy

-- PostgreSQL indexes for entity resolution

-- Phonetic code index for blocking (soundex() comes from fuzzystrmatch)
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE INDEX idx_soundex_surname ON person_observations 
    USING btree (soundex(surname));

-- Trigram index for fuzzy name matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_name_trgm ON person_observations 
    USING gin (name gin_trgm_ops);

-- Birth year range index
CREATE INDEX idx_birth_year ON person_observations 
    USING btree (birth_year);

-- Composite blocking key index (non-column expressions need their own parentheses)
CREATE INDEX idx_blocking ON person_observations 
    USING btree (soundex(surname), (birth_year / 10));

11.2 Batch Processing

import asyncio


async def process_entity_resolution_batch(
    new_observations: list[dict],
    existing_index: 'BlockingIndex',
    batch_size: int = 1000
) -> list[dict]:
    """
    Process new observations against existing records in batches.
    (score_pair is assumed to be an async wrapper around calculate_match_score.)
    """
    results = []
    
    for i in range(0, len(new_observations), batch_size):
        batch = new_observations[i:i + batch_size]
        
        # Generate blocking keys
        batch_keys = [generate_blocking_keys(obs) for obs in batch]
        
        # Find candidate pairs
        candidates = existing_index.find_candidates(batch_keys)
        
        # Score candidates in parallel
        scores = await asyncio.gather(*[
            score_pair(obs, candidate)
            for obs, candidate in candidates
        ])
        
        # Classify and collect results
        for score in scores:
            classification = classify_match(score['score'], score['confidence'])
            results.append({
                **score,
                'classification': classification
            })
    
    return results

12. Evaluation Metrics

12.1 Standard Metrics

| Metric | Formula | Target |
|---|---|---|
| Precision | TP / (TP + FP) | > 0.95 |
| Recall | TP / (TP + FN) | > 0.90 |
| F1 Score | 2 × (P × R) / (P + R) | > 0.92 |
| Pairs Completeness | Matched pairs found / Total true pairs | > 0.90 |
| Pairs Quality | True matches in candidates / Total candidates | > 0.80 |
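Applying the formulas to hypothetical counts (90 true positives, 4 false positives, 10 false negatives):

```python
tp, fp, fn = 90, 4, 10  # illustrative evaluation counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# -> 0.957 0.9 0.928
```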

12.2 Heritage-Specific Metrics

| Metric | Description | Target |
|---|---|---|
| Cross-source accuracy | Matches across different source types | > 0.90 |
| Historical accuracy | Matches involving records >50 years old | > 0.85 |
| Name variant coverage | Recall on known name variations | > 0.88 |
| Conflict resolution accuracy | Correct value selected in conflicts | > 0.92 |

13. References

Academic Sources

  • Christen, P. (2012). "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection."
  • Fellegi, I.P. & Sunter, A.B. (1969). "A Theory for Record Linkage." Journal of the American Statistical Association.
  • Naumann, F. & Herschel, M. (2010). "An Introduction to Duplicate Detection."

Tools and Libraries

  • jellyfish - Jaro-Winkler, Levenshtein, and phonetic encodings (used in Section 6.1)
  • fuzzywuzzy - token-based fuzzy string matching (Section 6.1)
  • metaphone - Double Metaphone implementation (Section 4.3)

Genealogical Entity Resolution

  • Efremova, J., et al. (2014). "Record Linkage in Genealogical Data."
  • Bloothooft, G. (2015). "Learning Name Variants from True Person Resolution."