# Rule 46: Entity Resolution in Person Enrichment - No Heuristics

## Status: CRITICAL

## The Core Principle

🚨 **SIMILAR OR IDENTICAL NAMES ARE NEVER SUFFICIENT FOR ENTITY RESOLUTION.**

A web search result mentioning "Carmen Juliá born 1952" is **NOT** evidence that the Carmen Juliá in our person profile was born in 1952. Names are not unique identifiers - there are thousands of people with the same name worldwide.

**Entity resolution requires verification of MULTIPLE independent identity attributes:**

| Attribute | Purpose | Example |
|-----------|---------|---------|
| **Age/Birth Year** | Temporal consistency | Both sources describe someone in their 40s |
| **Career Path** | Professional identity | Both are art curators, not one curator and one actress |
| **Location** | Geographic consistency | Both are based in UK, not one UK and one Venezuela |
| **Employer** | Institutional affiliation | Both work at New Contemporaries |
| **Education** | Academic background | Same university or field |

**Minimum Requirement**: At least **3 of 5** attributes must match before attributing ANY claim from a web source. Name match alone = **AUTOMATIC REJECTION**.

## Problem Statement

When enriching person profiles via web search (Linkup, Exa, etc.), search results often return data about **different people with similar or identical names**. Without proper entity resolution, the enrichment process can attribute false claims to the wrong person.

**Example Failure** (Carmen Juliá - UK Art Curator):

- Source profile: Carmen Juliá, Curator at New Contemporaries (UK)
- Birth year extracted: 1952 from Carmen Julia **Álvarez** (Venezuelan actress)
- Spouse extracted: "actors Eduardo Serrano" from the Venezuelan actress
- ResearchGate: Carmen Julia **Navarro** (Mexican hydrogeologist)
- Academia.edu: Carmen Julia **Gutiérrez** (Spanish medieval studies)

All data is from **different people** - none is the actual Carmen Juliá who is a UK-based art curator.

**Why This Happened**: The enrichment script used regex pattern matching to extract "born 1952" without verifying that the Wikipedia article described the SAME person.

## The Rule

### DO NOT use name matching as the basis for entity resolution. EVER.

For person enrichment via web search:

**FORBIDDEN** (Name-based extraction):

- ❌ Extracting birth years from any search result mentioning "Carmen Julia born..."
- ❌ Attributing social media profiles just because the name appears
- ❌ Claiming relationships (spouse, parent, child) from web text pattern matching
- ❌ Assigning academic profiles (ResearchGate, Academia.edu, Google Scholar) based on name matching alone
- ❌ Using Wikipedia articles without verifying ALL identity attributes
- ❌ Trusting genealogy sites (Geni, Ancestry, MyHeritage) which describe historical namesakes
- ❌ Using IMDB for birth years (actors with same names)

**REQUIRED** (Multi-Attribute Entity Resolution):

1. **Verify identity via MULTIPLE attributes** - name alone is INSUFFICIENT
2. **Cross-reference with known facts** (employer, location, job title from LinkedIn)
3. **Detect conflicting signals** - actress vs curator, Venezuela vs UK, 1950s birth vs active 2020s career
4. **Reject ambiguous matches** - if source doesn't clearly identify the same person, reject the claim
5. **Document rejection rationale** - log why claim was rejected for audit trail

## Entity Resolution Verification Checklist

Before attributing a web claim to a person profile, verify MULTIPLE identity attributes:

| # | Attribute | What to Check | Example Match | Example Conflict |
|---|-----------|---------------|---------------|------------------|
| 1 | **Career/Profession** | Same field/industry | Both are curators | Source says "actress", profile is curator |
| 2 | **Employer** | Same institution | Both at Rijksmuseum | Source says "film studio", profile is museum |
| 3 | **Location** | Same city/country | Both UK-based | Source says Venezuela, profile is UK |
| 4 | **Age Range** | Plausible for career | Birth 1980s, active 2020s | Birth 1952, still active in 2025 as junior |
| 5 | **Education** | Same university/field | Both art history | Source says "medical school" |

**Minimum requirement**: At least **3 of 5** attributes must match. Name match alone = **AUTOMATIC REJECTION**.

**Any conflicting signal = AUTOMATIC REJECTION** (e.g., source says "actress" when profile is "curator").

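
The checklist combines two rules: any conflict rejects outright, and otherwise at least 3 of 5 attributes must match. A minimal sketch of that decision follows, assuming each attribute has already been classified as a match, a conflict, or unknown. The `Signal` enum, the attribute keys, and `checklist_decision` are hypothetical names introduced for illustration; they are not part of the enrichment scripts.

```python
from enum import Enum


class Signal(Enum):
    """Outcome of comparing one identity attribute (hypothetical helper type)."""
    MATCH = "match"
    CONFLICT = "conflict"
    UNKNOWN = "unknown"


# The five checklist attributes, in table order (key names are assumptions).
ATTRIBUTES = ("career", "employer", "location", "age_range", "education")


def checklist_decision(signals: dict[str, Signal], required_matches: int = 3) -> tuple[bool, str]:
    """Any conflict rejects; otherwise require `required_matches` matching attributes."""
    conflicts = [a for a in ATTRIBUTES if signals.get(a, Signal.UNKNOWN) is Signal.CONFLICT]
    if conflicts:
        return False, "conflict:" + ",".join(conflicts)

    matches = [a for a in ATTRIBUTES if signals.get(a, Signal.UNKNOWN) is Signal.MATCH]
    if len(matches) < required_matches:
        return False, f"insufficient:{len(matches)}/{len(ATTRIBUTES)}"
    return True, f"verified:{len(matches)}/{len(ATTRIBUTES)}"


# The Venezuelan actress page conflicts on career and location, so it is rejected
# before any matches are counted.
actress_page = {"career": Signal.CONFLICT, "location": Signal.CONFLICT}
print(checklist_decision(actress_page))  # (False, 'conflict:career,location')
```

Checking conflicts before counting matches means a source that happens to match three attributes is still rejected if a fourth contradicts the profile, which is how the checklist is worded. Attributes missing from `signals` are treated as unknown, so a name-only hit yields zero matches and is rejected.
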
## Sources with High Entity Resolution Risk

| Source Type | Risk Level | Why | Action |
|-------------|------------|-----|--------|
| Wikipedia | CRITICAL | Many people with same name have pages | Reject unless 4/5 attributes match |
| IMDB | CRITICAL | Actors with common names | Reject all - never use for birth years |
| Genealogy sites | CRITICAL | Historical persons with same name | **ALWAYS REJECT** - these are ancestors/namesakes |
| Academic profiles | HIGH | Multiple researchers with same name | Verify institution and research field match |
| Social media | HIGH | Many accounts with similar handles | Verify employer/location in bio |
| News articles | MEDIUM | May mention multiple people | Read full context, verify identity |
| Institutional websites | LOW | Usually about their own staff | Good source if person works there |

## Automatic Rejection Triggers

The following MUST trigger **automatic claim rejection**:

### Profession Conflicts

If source profession differs from profile profession, REJECT:

```
Source: "actress", "actor", "singer", "footballer", "politician"
Profile: "curator", "archivist", "librarian", "conservator", "registrar"
→ REJECT (these are different people)
```

### Location Conflicts

If source location conflicts with profile location, REJECT:

```
Source: "Venezuela", "Mexico", "Brazil"
Profile: "UK", "Netherlands", "France"
→ REJECT (these are different people)
```

### Age Conflicts

If source age is implausible for profile career stage, REJECT:

```
Source: Born 1922, 1915, 1939
Profile: Currently active professional in 2025
→ REJECT (person would be 86-110 years old)

Source: Born 2007, 2004
Profile: Senior curator
→ REJECT (person would be 18-21, too young)
```

### Genealogy Source

If source is from genealogy/ancestry site, ALWAYS REJECT:

```
Domains: geni.com, ancestry.*, familysearch.org, findagrave.com, myheritage.*
→ ALWAYS REJECT (these describe historical namesakes, not the living person)
```

## Implementation in Enrichment Scripts

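
As a complement to the validator in this section, the age and genealogy triggers can be sketched as small standalone checks. This is an illustrative sketch, not the enrichment script's actual code: `age_conflict`, `is_genealogy_source`, and the two age bounds are hypothetical names, with the 23-80 range assumed from the 2025 examples in this rule. The hostname check assumes absolute URLs (with a scheme).

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed bounds: ages 23-80 are treated as plausible for an active professional,
# mirroring the 2025 birth-year examples in this rule.
MIN_ACTIVE_AGE = 23
MAX_ACTIVE_AGE = 80

# Site names of the genealogy domains listed in this rule.
GENEALOGY_SITES = {"geni", "ancestry", "familysearch", "findagrave", "myheritage"}


def age_conflict(birth_year: int, reference_year: int = 2025) -> Optional[str]:
    """Return a rejection reason if the implied age is implausible, else None."""
    age = reference_year - birth_year
    if age < MIN_ACTIVE_AGE or age > MAX_ACTIVE_AGE:
        return f"age_conflict (would be {age} in {reference_year})"
    return None


def is_genealogy_source(url: str) -> bool:
    """True if any label of the URL's hostname is a known genealogy site name."""
    host = (urlparse(url).hostname or "").lower()
    return any(label in GENEALOGY_SITES for label in host.split("."))


print(age_conflict(1939))  # age_conflict (would be 86 in 2025)
print(age_conflict(1952))  # None
print(is_genealogy_source("https://www.ancestry.co.uk/genealogy/records/x"))  # True
```

Note that a 1952 birth year passes the age check on its own (73 in 2025), so in the Carmen Juliá case it is the profession and location conflicts that reject the claim. Matching on hostname labels rather than on the raw URL string avoids false hits when a site name only appears in the path.
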
```python
import re


def _mentions(term: str, text: str) -> bool:
    """Whole-word match (optional plural), so 'uk' does not match 'ukraine'."""
    return re.search(rf"\b{re.escape(term)}(?:e?s)?\b", text) is not None


def validate_entity_match(profile: dict, search_result: dict) -> tuple[bool, str]:
    """
    Validate that a search result refers to the same person as the profile.

    REQUIRES: At least 3 of 5 identity attributes must match.
    Name match alone is INSUFFICIENT and automatically rejected.

    Returns (is_valid, reason)
    """
    # `or` fallbacks guard against missing keys, empty lists, and explicit None values
    affiliations = profile.get('affiliations') or [{}]
    profile_data = profile.get('profile_data') or {}
    profile_employer = (affiliations[0].get('custodian_name') or '').lower()
    profile_location = (profile_data.get('location') or '').lower()
    profile_role = (profile_data.get('headline') or '').lower()

    source_text = (search_result.get('answer') or '').lower()
    source_url = (search_result.get('source_url') or '').lower()

    # AUTOMATIC REJECTION: Genealogy sources
    genealogy_domains = ['geni.com', 'ancestry.', 'familysearch.', 'findagrave.', 'myheritage.']
    if any(domain in source_url for domain in genealogy_domains):
        return False, "genealogy_source_rejected"

    # AUTOMATIC REJECTION: Profession conflicts
    heritage_roles = ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection', 'heritage']
    entertainment_roles = ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete']

    profile_is_heritage = any(role in profile_role for role in heritage_roles)
    # Whole-word matching avoids false rejections such as 'actor' inside 'factor'
    source_is_entertainment = any(_mentions(role, source_text) for role in entertainment_roles)

    if profile_is_heritage and source_is_entertainment:
        return False, "conflicting_profession"

    # AUTOMATIC REJECTION: Location conflicts
    if profile_location:
        location_conflicts = [
            ('venezuela', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
            ('caracas', 'london'), ('mexico city', 'amsterdam')
        ]
        for source_loc, profile_loc in location_conflicts:
            if _mentions(source_loc, source_text) and _mentions(profile_loc, profile_location):
                return False, "conflicting_location"

    # Count positive identity attribute matches (need 3 of 5)
    matches = 0
    match_details = []

    # 1. Employer match
    if profile_employer and profile_employer in source_text:
        matches += 1
        match_details.append(f"employer:{profile_employer}")

    # 2. Location match
    if profile_location and profile_location in source_text:
        matches += 1
        match_details.append(f"location:{profile_location}")

    # 3. Role/profession match
    if profile_role:
        role_words = [w for w in profile_role.split() if len(w) > 4]
        if any(word in source_text for word in role_words):
            matches += 1
            match_details.append("role_match")

    # 4. Education/institution match (if available)
    profile_education = profile_data.get('education') or []
    if profile_education:
        edu_names = [e.get('school', '').lower() for e in profile_education if e.get('school')]
        if any(edu in source_text for edu in edu_names):
            matches += 1
            match_details.append("education_match")

    # 5. Time period match (career dates)
    # (implementation depends on available data; until it exists, at most 4 checks can match)

    # REQUIRE 3 OF 5 MATCHES
    if matches < 3:
        return False, f"insufficient_identity_verification (only {matches}/5 attributes matched)"

    return True, f"verified ({matches}/5 matches: {', '.join(match_details)})"
```

## Claim Rejection Patterns

The following patterns should trigger automatic claim rejection:

```python
# Genealogy sources - ALWAYS REJECT
GENEALOGY_DOMAINS = [
    'geni.com', 'ancestry.com', 'ancestry.co.uk', 'familysearch.org',
    'findagrave.com', 'myheritage.com', 'wikitree.com', 'geneanet.org'
]

# Profession conflicts - if profile has one and source has another, REJECT
PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}

# Location conflicts - if source describes person in location X and profile is location Y, REJECT
LOCATION_PAIRS = [
    ('venezuela', 'uk'), ('venezuela', 'netherlands'), ('venezuela', 'germany'),
    ('mexico', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
    ('caracas', 'london'), ('caracas', 'amsterdam'),
]

# Age impossibility - if birth year makes current career implausible, REJECT
MIN_PLAUSIBLE_BIRTH_YEAR = 1945  # Would be 80 in 2025 - still plausible but verify
MAX_PLAUSIBLE_BIRTH_YEAR = 2002  # Would be 23 in 2025 - plausible for junior roles
```

## Handling Rejected Claims

When a claim fails entity resolution:

```json
{
  "claim_type": "birth_year",
  "claim_value": 1952,
  "entity_resolution": {
    "status": "REJECTED",
    "reason": "conflicting_profession",
    "details": "Source describes Venezuelan actress, profile is UK curator",
    "source_identity": "Carmen Julia Álvarez (Venezuelan actress)",
    "profile_identity": "Carmen Juliá (UK art curator)",
    "rejected_at": "2026-01-11T15:00:00Z",
    "rejected_by": "entity_resolution_validator_v1"
  }
}
```

## Special Cases

### Common Names

For very common names (e.g., "John Smith", "Maria García", "Jan de Vries"), require **4 of 5** verification checks instead of 3. The more common the name, the higher the threshold.

| Name Commonality | Required Matches |
|------------------|------------------|
| Unique name (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderately common (e.g., "Carmen Juliá") | 3 of 5 |
| Very common (e.g., "Jan de Vries") | 4 of 5 |
| Extremely common (e.g., "John Smith") | 5 of 5 or reject |

### Abbreviated Names

For profiles with abbreviated names (e.g., "J. Smith"), entity resolution is inherently uncertain:

- Set `entity_resolution_confidence: "very_low"`
- Require **human review** for all claims
- Do NOT attribute web claims automatically

### Historical Persons

When sources describe historical/deceased persons:

- Check if death date conflicts with profile activity (living person active in 2025)
- **ALWAYS REJECT** genealogy site data
- Reject any source describing events before 1950 unless profile is known to be historical

### Wikipedia Articles

Wikipedia is particularly dangerous because:

- Many people with the same name have articles
- Search engines return Wikipedia first
- The Wikipedia Carmen Julia Álvarez article describes a Venezuelan actress born 1952
- This is a DIFFERENT PERSON from Carmen Juliá the UK curator

**For Wikipedia sources**:

1. Read the FULL article, not just snippets
2. Verify the Wikipedia subject's profession matches the profile
3. Verify the Wikipedia subject's location matches the profile
4. If ANY conflict detected → REJECT

## Audit Trail

All entity resolution decisions must be logged:

```json
{
  "enrichment_history": [
    {
      "enrichment_timestamp": "2026-01-11T15:00:00Z",
      "enrichment_agent": "enrich_person_comprehensive.py v1.4.0",
      "entity_resolution_decisions": [
        {
          "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
          "decision": "REJECTED",
          "reason": "Different person - Venezuelan actress, not UK curator"
        }
      ],
      "claims_rejected_count": 5,
      "claims_accepted_count": 1
    }
  ]
}
```

## See Also

- Rule 21: Data Fabrication is Strictly Prohibited
- Rule 26: Person Data Provenance - Web Claims for Staff Information
- Rule 45: Inferred Data Must Be Explicit with Provenance