Rule 46: Entity Resolution in Person Enrichment - No Heuristics

Status: CRITICAL

The Core Principle

🚨 SIMILAR OR IDENTICAL NAMES ARE NEVER SUFFICIENT FOR ENTITY RESOLUTION.

A web search result mentioning "Carmen Juliá born 1952" is NOT evidence that the Carmen Juliá in our person profile was born in 1952. Names are not unique identifiers - there are thousands of people with the same name worldwide.

Entity resolution requires verification of MULTIPLE independent identity attributes:

| Attribute | Purpose | Example |
| --- | --- | --- |
| Age/Birth Year | Temporal consistency | Both sources describe someone in their 40s |
| Career Path | Professional identity | Both are art curators, not one curator and one actress |
| Location | Geographic consistency | Both are based in UK, not one UK and one Venezuela |
| Employer | Institutional affiliation | Both work at New Contemporaries |
| Education | Academic background | Same university or field |

Minimum Requirement: At least 3 of 5 attributes must match before attributing ANY claim from a web source. Name match alone = AUTOMATIC REJECTION.

Problem Statement

When enriching person profiles via web search (Linkup, Exa, etc.), search results often return data about different people with similar or identical names. Without proper entity resolution, the enrichment process can attribute false claims to the wrong person.

Example Failure (Carmen Juliá - UK Art Curator):

  • Source profile: Carmen Juliá, Curator at New Contemporaries (UK)
  • Birth year extracted: 1952 from Carmen Julia Álvarez (Venezuelan actress)
  • Spouse extracted: "actors Eduardo Serrano" from the Venezuelan actress
  • ResearchGate: Carmen Julia Navarro (Mexican hydrogeologist)
  • Academia.edu: Carmen Julia Gutiérrez (Spanish medieval studies)

All data is from different people - none is the actual Carmen Juliá who is a UK-based art curator.

Why This Happened: The enrichment script used regex pattern matching to extract "born 1952" without verifying that the Wikipedia article described the SAME person.

The Rule

DO NOT use name matching as the basis for entity resolution. EVER.

For person enrichment via web search:

FORBIDDEN (Name-based extraction):

  • Extracting birth years from any search result mentioning "Carmen Julia born..."
  • Attributing social media profiles just because the name appears
  • Claiming relationships (spouse, parent, child) from web text pattern matching
  • Assigning academic profiles (ResearchGate, Academia.edu, Google Scholar) based on name matching alone
  • Using Wikipedia articles without verifying ALL identity attributes
  • Trusting genealogy sites (Geni, Ancestry, MyHeritage) which describe historical namesakes
  • Using IMDB for birth years (actors with same names)

REQUIRED (Multi-Attribute Entity Resolution):

  1. Verify identity via MULTIPLE attributes - name alone is INSUFFICIENT
  2. Cross-reference with known facts (employer, location, job title from LinkedIn)
  3. Detect conflicting signals - actress vs curator, Venezuela vs UK, 1950s birth vs active 2020s career
  4. Reject ambiguous matches - if source doesn't clearly identify the same person, reject the claim
  5. Document rejection rationale - log why claim was rejected for audit trail

Entity Resolution Verification Checklist

Before attributing a web claim to a person profile, verify MULTIPLE identity attributes:

| # | Attribute | What to Check | Example Match | Example Conflict |
| --- | --- | --- | --- | --- |
| 1 | Career/Profession | Same field/industry | Both are curators | Source says "actress", profile is curator |
| 2 | Employer | Same institution | Both at Rijksmuseum | Source says "film studio", profile is museum |
| 3 | Location | Same city/country | Both UK-based | Source says Venezuela, profile is UK |
| 4 | Age Range | Plausible for career | Birth 1980s, active 2020s | Birth 1952, still active in 2025 as junior |
| 5 | Education | Same university/field | Both art history | Source says "medical school" |

Minimum requirement: At least 3 of 5 attributes must match. Name match alone = AUTOMATIC REJECTION.

Any conflicting signal = AUTOMATIC REJECTION (e.g., source says "actress" when profile is "curator").

Sources with High Entity Resolution Risk

| Source Type | Risk Level | Why | Action |
| --- | --- | --- | --- |
| Wikipedia | CRITICAL | Many people with same name have pages | Reject unless 4/5 attributes match |
| IMDB | CRITICAL | Actors with common names | Reject all - never use for birth years |
| Genealogy sites | CRITICAL | Historical persons with same name | ALWAYS REJECT - these are ancestors/namesakes |
| Academic profiles | HIGH | Multiple researchers with same name | Verify institution and research field match |
| Social media | HIGH | Many accounts with similar handles | Verify employer/location in bio |
| News articles | MEDIUM | May mention multiple people | Read full context, verify identity |
| Institutional websites | LOW | Usually about their own staff | Good source if person works there |
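
A minimal sketch of how the risk levels above could be looked up from a source URL. The function name `classify_source_risk`, the `SOURCE_RISK_BY_DOMAIN` mapping, the specific domains listed, and the `"UNKNOWN"` fallback are all assumptions made for illustration; they are not part of the existing enrichment scripts.

```python
from urllib.parse import urlparse

# Illustrative, non-exhaustive mapping based on the risk table; the specific
# domains chosen here are assumptions of this sketch.
SOURCE_RISK_BY_DOMAIN = {
    "wikipedia.org": "CRITICAL",
    "imdb.com": "CRITICAL",
    "geni.com": "CRITICAL",
    "researchgate.net": "HIGH",
    "academia.edu": "HIGH",
}


def classify_source_risk(source_url: str) -> str:
    """Return the risk level for a URL, or "UNKNOWN" if the host is unlisted."""
    host = (urlparse(source_url).hostname or "").lower()
    for domain, risk in SOURCE_RISK_BY_DOMAIN.items():
        # Match the domain itself or any subdomain (e.g. en.wikipedia.org).
        if host == domain or host.endswith("." + domain):
            return risk
    return "UNKNOWN"
```

Matching on the parsed hostname rather than on the raw URL string avoids treating a path segment such as `/wiki/imdb.com` as a domain hit. URLs without a scheme have no parsed hostname and fall through to `"UNKNOWN"`.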

Automatic Rejection Triggers

The following MUST trigger automatic claim rejection:

Profession Conflicts

If source profession differs from profile profession, REJECT:

Source: "actress", "actor", "singer", "footballer", "politician"
Profile: "curator", "archivist", "librarian", "conservator", "registrar"
→ REJECT (these are different people)

Location Conflicts

If source location conflicts with profile location, REJECT:

Source: "Venezuela", "Mexico", "Brazil"
Profile: "UK", "Netherlands", "France"
→ REJECT (these are different people)

Age Conflicts

If source age is implausible for profile career stage, REJECT:

Source: Born 1922, 1915, 1939
Profile: Currently active professional in 2025
→ REJECT (person would be 86-110 years old)

Source: Born 2007, 2004
Profile: Senior curator
→ REJECT (person would be 18-21, too young)

Genealogy Source

If source is from genealogy/ancestry site, ALWAYS REJECT:

Domains: geni.com, ancestry.*, familysearch.org, findagrave.com, myheritage.*
→ ALWAYS REJECT (these describe historical namesakes, not the living person)
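
One way to apply the domain list above is to compare against the parsed hostname instead of searching the whole URL. This is a sketch under stated assumptions: `is_genealogy_source`, `GENEALOGY_EXACT`, and `GENEALOGY_WILDCARD` are hypothetical names, and the wildcard entries `ancestry.*` and `myheritage.*` are interpreted here as "any hostname label equal to that name".

```python
from urllib.parse import urlparse

# Exact domains from the list above.
GENEALOGY_EXACT = {"geni.com", "familysearch.org", "findagrave.com"}
# Stand-ins for the wildcard entries ancestry.* and myheritage.*
# (interpretation assumed by this sketch).
GENEALOGY_WILDCARD = {"ancestry", "myheritage"}


def is_genealogy_source(source_url: str) -> bool:
    """Return True if the URL's hostname belongs to a listed genealogy site."""
    host = (urlparse(source_url).hostname or "").lower()
    # Exact domains: match the host itself or any subdomain of it.
    if any(host == d or host.endswith("." + d) for d in GENEALOGY_EXACT):
        return True
    # Wildcard entries: match when any hostname label equals the site name,
    # which covers country variants such as www.ancestry.co.uk.
    return any(label in GENEALOGY_WILDCARD for label in host.split("."))
```

Because only the hostname is inspected, a page like `https://example.org/ancestry.html` is not flagged. The label-based wildcard check errs on the side of rejection, which is consistent with the ALWAYS REJECT policy.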

Implementation in Enrichment Scripts

def validate_entity_match(profile: dict, search_result: dict) -> tuple[bool, str]:
    """
    Validate that a search result refers to the same person as the profile.
    
    REQUIRES: At least 3 of 5 identity attributes must match.
    Name match alone is INSUFFICIENT and automatically rejected.
    
    Returns (is_valid, reason)
    """
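    # Guard against an empty affiliations list and against None values,
    # either of which would otherwise raise before any check runs.
    affiliations = profile.get('affiliations') or [{}]
    profile_data = profile.get('profile_data') or {}
    profile_employer = (affiliations[0].get('custodian_name') or '').lower()
    profile_location = (profile_data.get('location') or '').lower()
    profile_role = (profile_data.get('headline') or '').lower()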
    
    source_text = search_result.get('answer', '').lower()
    source_url = search_result.get('source_url', '').lower()
    
    # AUTOMATIC REJECTION: Genealogy sources
    genealogy_domains = ['geni.com', 'ancestry.', 'familysearch.', 'findagrave.', 'myheritage.']
    if any(domain in source_url for domain in genealogy_domains):
        return False, "genealogy_source_rejected"
    
    # AUTOMATIC REJECTION: Profession conflicts
    heritage_roles = ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection', 'heritage']
    entertainment_roles = ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete']
    
    profile_is_heritage = any(role in profile_role for role in heritage_roles)
    source_is_entertainment = any(role in source_text for role in entertainment_roles)
    
    if profile_is_heritage and source_is_entertainment:
        return False, "conflicting_profession"
    
    # AUTOMATIC REJECTION: Location conflicts
    if profile_location:
        location_conflicts = [
            ('venezuela', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
            ('caracas', 'london'), ('mexico city', 'amsterdam')
        ]
        for source_loc, profile_loc in location_conflicts:
            if source_loc in source_text and profile_loc in profile_location:
                return False, "conflicting_location"
    
    # Count positive identity attribute matches (need 3 of 5)
    matches = 0
    match_details = []
    
    # 1. Employer match
    if profile_employer and profile_employer in source_text:
        matches += 1
        match_details.append(f"employer:{profile_employer}")
    
    # 2. Location match
    if profile_location and profile_location in source_text:
        matches += 1
        match_details.append(f"location:{profile_location}")
    
    # 3. Role/profession match
    if profile_role:
        role_words = [w for w in profile_role.split() if len(w) > 4]
        if any(word in source_text for word in role_words):
            matches += 1
            match_details.append("role_match")
    
    # 4. Education/institution match (if available)
    profile_education = profile.get('profile_data', {}).get('education', [])
    if profile_education:
        edu_names = [e.get('school', '').lower() for e in profile_education if e.get('school')]
        if any(edu in source_text for edu in edu_names):
            matches += 1
            match_details.append("education_match")
    
    # 5. Time period match (career dates)
    # Not implemented here (depends on available data), so at most 4 of the
    # 5 attributes can match in this version.
    
    # REQUIRE 3 OF 5 MATCHES
    if matches < 3:
        return False, f"insufficient_identity_verification (only {matches}/5 attributes matched)"
    
    return True, f"verified ({matches}/5 matches: {', '.join(match_details)})"

Claim Rejection Patterns

The following patterns should trigger automatic claim rejection:

# Genealogy sources - ALWAYS REJECT
GENEALOGY_DOMAINS = [
    'geni.com', 'ancestry.com', 'ancestry.co.uk', 'familysearch.org',
    'findagrave.com', 'myheritage.com', 'wikitree.com', 'geneanet.org'
]

# Profession conflicts - if profile has one and source has another, REJECT
PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}

# Location conflicts - if source describes person in location X and profile is location Y, REJECT
LOCATION_PAIRS = [
    ('venezuela', 'uk'), ('venezuela', 'netherlands'), ('venezuela', 'germany'),
    ('mexico', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
    ('caracas', 'london'), ('caracas', 'amsterdam'),
]

# Age impossibility - if birth year makes current career implausible, REJECT
MIN_PLAUSIBLE_BIRTH_YEAR = 1945  # Would be 80 in 2025 - still plausible but verify
MAX_PLAUSIBLE_BIRTH_YEAR = 2002  # Would be 23 in 2025 - plausible for junior roles
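
The constants above could be applied roughly as follows. This is a sketch, not the existing validator: the helper names `profession_categories`, `has_profession_conflict`, and `is_plausible_birth_year` are hypothetical, the constants are copied so the block is self-contained, and whole-word matching is a design choice of this example.

```python
import re

PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}
MIN_PLAUSIBLE_BIRTH_YEAR = 1945
MAX_PLAUSIBLE_BIRTH_YEAR = 2002


def profession_categories(text: str) -> set[str]:
    """Return the categories whose keywords occur as whole words in text."""
    lowered = text.lower()
    return {
        category
        for category, keywords in PROFESSION_CONFLICTS.items()
        if any(re.search(rf"\b{re.escape(kw)}\b", lowered) for kw in keywords)
    }


def has_profession_conflict(profile_role: str, source_text: str) -> bool:
    """True when both sides name a category and they have none in common."""
    profile_cats = profession_categories(profile_role)
    source_cats = profession_categories(source_text)
    return bool(profile_cats) and bool(source_cats) and profile_cats.isdisjoint(source_cats)


def is_plausible_birth_year(year: int) -> bool:
    """Check a birth year against the plausibility bounds."""
    return MIN_PLAUSIBLE_BIRTH_YEAR <= year <= MAX_PLAUSIBLE_BIRTH_YEAR
```

Note that a birth year of 1952 passes the bounds check on its own; in the Carmen Juliá case the claim is rejected by the profession and location conflicts, not by age. Whole-word matching keeps "model" from firing on "remodelled", at the cost of missing inflected forms such as "curators".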

Handling Rejected Claims

When a claim fails entity resolution:

{
  "claim_type": "birth_year",
  "claim_value": 1952,
  "entity_resolution": {
    "status": "REJECTED",
    "reason": "conflicting_profession",
    "details": "Source describes Venezuelan actress, profile is UK curator",
    "source_identity": "Carmen Julia Álvarez (Venezuelan actress)",
    "profile_identity": "Carmen Juliá (UK art curator)",
    "rejected_at": "2026-01-11T15:00:00Z",
    "rejected_by": "entity_resolution_validator_v1"
  }
}
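
A record like the one above could be assembled by a small helper. This is a sketch only: `build_rejected_claim` and its parameters are hypothetical names, and the `now` argument is assumed to be a timezone-aware datetime (it exists mainly to make the output reproducible).

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_rejected_claim(
    claim_type: str,
    claim_value: Any,
    reason: str,
    details: str,
    source_identity: str,
    profile_identity: str,
    validator: str = "entity_resolution_validator_v1",
    now: Optional[datetime] = None,
) -> dict:
    """Build a rejection record in the shape shown above."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "entity_resolution": {
            "status": "REJECTED",
            "reason": reason,
            "details": details,
            "source_identity": source_identity,
            "profile_identity": profile_identity,
            # UTC timestamp with a trailing "Z", matching the example record.
            "rejected_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rejected_by": validator,
        },
    }
```

The `reason` value would typically be the string returned by the validator (for example `conflicting_profession`), so the stored record and the validation result stay in sync.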

Special Cases

Common Names

For common names, raise the threshold above the default 3 of 5; the more common the name, the higher the threshold. Very common names (e.g., "Maria García", "Jan de Vries") require 4 of 5 verification checks, and extremely common names (e.g., "John Smith") require all 5.

| Name Commonality | Required Matches |
| --- | --- |
| Unique name (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderately common (e.g., "Carmen Juliá") | 3 of 5 |
| Very common (e.g., "Jan de Vries") | 4 of 5 |
| Extremely common (e.g., "John Smith") | 5 of 5 or reject |
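
The thresholds above can be expressed as a simple lookup. This sketch assumes the commonality label has already been determined elsewhere (this rule does not say how); the label strings, `REQUIRED_MATCHES`, `required_matches`, and `passes_threshold` are hypothetical names.

```python
# Required attribute matches (out of 5) per name-commonality label.
REQUIRED_MATCHES = {
    "unique": 2,
    "moderately_common": 3,
    "very_common": 4,
    "extremely_common": 5,
}


def required_matches(name_commonality: str) -> int:
    """Return the match threshold for a commonality label.

    Unknown labels fall back to the strictest threshold; that fallback is
    an assumption of this sketch, chosen to fail closed.
    """
    return REQUIRED_MATCHES.get(name_commonality, 5)


def passes_threshold(matches: int, name_commonality: str) -> bool:
    """True when the number of matched attributes meets the threshold."""
    return matches >= required_matches(name_commonality)
```

For example, 3 matched attributes suffice for a moderately common name but not for a very common one.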

Abbreviated Names

For profiles with abbreviated names (e.g., "J. Smith"), entity resolution is inherently uncertain:

  • Set entity_resolution_confidence: "very_low"
  • Require human review for all claims
  • Do NOT attribute web claims automatically
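
A possible way to detect abbreviated names and apply the three points above. This is a sketch under assumptions: only the first name token is inspected, and `is_abbreviated_name`, `abbreviated_name_policy`, and the keys `requires_human_review` and `auto_attribute` are hypothetical; only `entity_resolution_confidence` comes from this rule.

```python
import re
from typing import Optional

# Initials such as "J", "J.", "J.P." or "J.P": single letters separated by
# periods. [^\W\d_] matches any Unicode letter.
_INITIALS = re.compile(r"(?:[^\W\d_]\.)+[^\W\d_]?|[^\W\d_]")


def is_abbreviated_name(full_name: str) -> bool:
    """True when the first token of the name consists only of initials."""
    tokens = full_name.split()
    return bool(tokens) and _INITIALS.fullmatch(tokens[0]) is not None


def abbreviated_name_policy(full_name: str) -> Optional[dict]:
    """Return the restrictive policy for abbreviated names, else None."""
    if not is_abbreviated_name(full_name):
        return None
    return {
        "entity_resolution_confidence": "very_low",
        "requires_human_review": True,   # hypothetical field name
        "auto_attribute": False,         # hypothetical field name
    }
```

Names where the initial is glued to the surname without a space (for instance "J.Smith") are not caught by this check and would need separate handling.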

Historical Persons

When sources describe historical/deceased persons:

  • Check if death date conflicts with profile activity (living person active in 2025)
  • ALWAYS REJECT genealogy site data
  • Reject any source describing events before 1950 unless profile is known to be historical

Wikipedia Articles

Wikipedia is particularly dangerous because:

  • Many people with the same name have articles
  • Search engines return Wikipedia first
  • The Wikipedia Carmen Julia Álvarez article describes a Venezuelan actress born 1952
  • This is a DIFFERENT PERSON from Carmen Juliá the UK curator

For Wikipedia sources:

  1. Read the FULL article, not just snippets
  2. Verify the Wikipedia subject's profession matches the profile
  3. Verify the Wikipedia subject's location matches the profile
  4. If ANY conflict detected → REJECT

Audit Trail

All entity resolution decisions must be logged:

{
  "enrichment_history": [
    {
      "enrichment_timestamp": "2026-01-11T15:00:00Z",
      "enrichment_agent": "enrich_person_comprehensive.py v1.4.0",
      "entity_resolution_decisions": [
        {
          "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
          "decision": "REJECTED",
          "reason": "Different person - Venezuelan actress, not UK curator"
        }
      ],
      "claims_rejected_count": 5,
      "claims_accepted_count": 1
    }
  ]
}
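
An entry in this shape could be appended to a profile as sketched below. `record_enrichment` and its parameters are hypothetical names, and the claim counts are passed in explicitly because the example above shows them as independent of the number of logged decisions.

```python
REQUIRED_DECISION_KEYS = ("source_url", "decision", "reason")


def record_enrichment(
    profile: dict,
    agent: str,
    timestamp: str,
    decisions: list[dict],
    claims_accepted: int,
    claims_rejected: int,
) -> dict:
    """Append one enrichment entry to profile["enrichment_history"]."""
    for decision in decisions:
        missing = [k for k in REQUIRED_DECISION_KEYS if not decision.get(k)]
        if missing:
            # Refuse to log a decision that lacks its rationale or source.
            raise ValueError(f"decision is missing fields: {missing}")
    entry = {
        "enrichment_timestamp": timestamp,
        "enrichment_agent": agent,
        "entity_resolution_decisions": list(decisions),
        "claims_rejected_count": claims_rejected,
        "claims_accepted_count": claims_accepted,
    }
    profile.setdefault("enrichment_history", []).append(entry)
    return entry
```

Raising on an incomplete decision keeps the audit trail from silently accumulating entries without a reason.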

See Also

  • Rule 21: Data Fabrication is Strictly Prohibited
  • Rule 26: Person Data Provenance - Web Claims for Staff Information
  • Rule 45: Inferred Data Must Be Explicit with Provenance