# Rule 46: Entity Resolution - Names Are NEVER Sufficient

## Status: CRITICAL

## 🚨 DATA QUALITY IS OF UTMOST IMPORTANCE 🚨

**Wrong data is worse than no data.** Attributing a birth year, spouse, or social media profile to the wrong person is a **critical data quality failure** that undermines the entire dataset's trustworthiness.

**ALL enrichments MUST be done MANUALLY and double-checked.** Automated web search enrichment has been DISABLED due to catastrophic entity resolution failures (540+ false claims removed in Jan 2026).

**The cost of false data**:
- Corrupts downstream analysis and reporting
- Creates legal/privacy risks (attributing data to the wrong person)
- Destroys user trust in the dataset
- Requires expensive manual cleanup

---

## 🚫 AUTOMATED ENRICHMENT IS PROHIBITED 🚫

**DO NOT USE** automated scripts to enrich person profiles with web search data. The `enrich_person_comprehensive.py` script has been deprecated.

**Why automated enrichment failed**:
- Web searches return data about DIFFERENT people with similar names
- Regex pattern matching cannot distinguish between namesakes
- Wikipedia, IMDB, ResearchGate, and Instagram all returned data about the wrong people
- Example: a "Carmen Juliá" search returned a Venezuelan actress, a Mexican hydrogeologist, and a Spanish medievalist - NONE were the UK art curator

**ONLY ALLOWED enrichment methods**:
1. **Manual research** - Human curator verifies that the source refers to the correct person
2. **Institutional sources** - Data from the person's employer website (verified)
3. **LinkedIn profile data** - Already verified via direct profile access
4. **ORCID/Wikidata** - If the person has a verified identifier

---

## The Core Principle

🚨 **SIMILAR OR IDENTICAL NAMES ARE NEVER SUFFICIENT FOR ENTITY RESOLUTION.**

A web search result mentioning "Carmen Juliá born 1952" is **NOT** evidence that the Carmen Juliá in our person profile was born in 1952. Names are not unique identifiers - there are thousands of people with the same name worldwide.

**Entity resolution requires verification of MULTIPLE independent identity attributes:**

| Attribute | Purpose | Example |
|-----------|---------|---------|
| **Age/Birth Year** | Temporal consistency | Both sources describe someone in their 40s |
| **Career Path** | Professional identity | Both are art curators, not one curator and one actress |
| **Location** | Geographic consistency | Both are based in UK, not one UK and one Venezuela |
| **Employer** | Institutional affiliation | Both work at New Contemporaries |
| **Education** | Academic background | Same university or field |

**Minimum Requirement**: At least **3 of 5** attributes must match before attributing ANY claim from a web source. Name match alone = **AUTOMATIC REJECTION**.

## Problem Statement

When enriching person profiles via web search (Linkup, Exa, etc.), search results often return data about **different people with similar or identical names**. Without proper entity resolution, the enrichment process can attribute false claims to the wrong person.

**Example Failure** (Carmen Juliá - UK Art Curator):
- Source profile: Carmen Juliá, Curator at New Contemporaries (UK)
- Birth year extracted: 1952 from Carmen Julia **Álvarez** (Venezuelan actress)
- Spouse extracted: "actors Eduardo Serrano" from the Venezuelan actress
- ResearchGate: Carmen Julia **Navarro** (Mexican hydrogeologist)
- Academia.edu: Carmen Julia **Gutiérrez** (Spanish medieval studies)

All data is from **different people** - none is the actual Carmen Juliá who is a UK-based art curator.

**Why This Happened**: The enrichment script used regex pattern matching to extract "born 1952" without verifying that the Wikipedia article described the SAME person.

## The Rule

### DO NOT use name matching as the basis for entity resolution. EVER.

For person enrichment via web search:

**FORBIDDEN** (Name-based extraction):
- ❌ Extracting birth years from any search result mentioning "Carmen Julia born..."
- ❌ Attributing social media profiles just because the name appears
- ❌ Claiming relationships (spouse, parent, child) from web text pattern matching
- ❌ Assigning academic profiles (ResearchGate, Academia.edu, Google Scholar) based on name matching alone
- ❌ Using Wikipedia articles without verifying ALL identity attributes
- ❌ Trusting genealogy sites (Geni, Ancestry, MyHeritage), which often describe historical namesakes
- ❌ Using IMDB for birth years (actors with same names)

**REQUIRED** (Multi-Attribute Entity Resolution):
1. **Verify identity via MULTIPLE attributes** - name alone is INSUFFICIENT
2. **Cross-reference with known facts** (employer, location, job title from LinkedIn)
3. **Detect conflicting signals** - actress vs curator, Venezuela vs UK, 1950s birth vs active 2020s career
4. **Reject ambiguous matches** - if the source doesn't clearly identify the same person, reject the claim
5. **Document rejection rationale** - log why the claim was rejected, for the audit trail

## Entity Resolution Verification Checklist

Before attributing a web claim to a person profile, verify MULTIPLE identity attributes:

| # | Attribute | What to Check | Example Match | Example Conflict |
|---|-----------|---------------|---------------|------------------|
| 1 | **Career/Profession** | Same field/industry | Both are curators | Source says "actress", profile is curator |
| 2 | **Employer** | Same institution | Both at Rijksmuseum | Source says "film studio", profile is museum |
| 3 | **Location** | Same city/country | Both UK-based | Source says Venezuela, profile is UK |
| 4 | **Age Range** | Plausible for career | Birth 1980s, active 2020s | Birth 1952, but profile shows a junior role in 2025 |
| 5 | **Education** | Same university/field | Both art history | Source says "medical school" |

**Minimum requirement**: At least **3 of 5** attributes must match. Name match alone = **AUTOMATIC REJECTION**.

**Any conflicting signal = AUTOMATIC REJECTION** (e.g., source says "actress" when profile is "curator").

## Sources with High Entity Resolution Risk

These sources are NOT forbidden, but require **stricter verification thresholds** due to high false-positive rates:

| Source Type | Risk Level | Why | Required Matches |
|-------------|------------|-----|------------------|
| Genealogy sites | CRITICAL | Historical persons with same name | 5/5 attributes (or explicit link to living person) |
| IMDB | CRITICAL | Actors with common names | 5/5 attributes (unless person works in film/TV) |
| Wikipedia | HIGH | Many people with same name have pages | 4/5 attributes match |
| Academic profiles | HIGH | Multiple researchers with same name | 4/5 attributes + institution match |
| Social media | HIGH | Many accounts with similar handles | 4/5 attributes + verify employer/location in bio |
| News articles | MEDIUM | May mention multiple people | 3/5 attributes + read full context |
| Institutional websites | LOW | Usually about their own staff | 2/5 attributes (good source if person works there) |

**Key point**: High-risk sources CAN be used if you verify enough identity attributes. The risk level determines the verification threshold, not whether the source is allowed.

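As a rough illustration, the thresholds in the table above could be looked up from a source URL as sketched here. The `DOMAIN_SOURCE_TYPES` mapping, the `required_matches` function, and the fallback of 3 (the rule's general minimum) are assumptions for this sketch, not part of any existing script. The extra conditions in the table (institution match, bio check, reading full context) still need a human curator.

```python
from urllib.parse import urlparse

# Required attribute matches per source type, copied from the risk table.
REQUIRED_MATCHES = {
    "genealogy": 5,
    "imdb": 5,
    "wikipedia": 4,
    "academic": 4,
    "social_media": 4,
    "news": 3,
    "institutional": 2,
}

# Hypothetical and deliberately incomplete: registrable domain -> source type.
DOMAIN_SOURCE_TYPES = {
    "geni.com": "genealogy",
    "findagrave.com": "genealogy",
    "imdb.com": "imdb",
    "wikipedia.org": "wikipedia",
    "researchgate.net": "academic",
    "academia.edu": "academic",
    "instagram.com": "social_media",
}

DEFAULT_REQUIRED = 3  # assumption: fall back to the rule's general 3-of-5 minimum


def required_matches(source_url: str) -> int:
    """Return how many of the 5 identity attributes must match for this URL."""
    host = (urlparse(source_url).hostname or "").lower()
    for domain, source_type in DOMAIN_SOURCE_TYPES.items():
        # Compare against the hostname, not the whole URL, so that a domain
        # name appearing in a path or query string does not trigger a match.
        if host == domain or host.endswith("." + domain):
            return REQUIRED_MATCHES[source_type]
    return DEFAULT_REQUIRED
```

Matching on the parsed hostname rather than on a substring of the URL avoids classifying, say, a news article whose path happens to contain `geni.com` as a genealogy source.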
## Red Flags Requiring Investigation

The following are **red flags** that require careful investigation - NOT automatic rejection. People change careers and relocate.

### Profession Differences
If source profession differs from profile profession, **investigate**:
```
Source: "actress", "actor", "singer"
Profile: "curator", "archivist", "librarian"

ASK: Did this person change careers?
- Check timeline: Did acting career END before heritage career BEGAN?
- Check for transition evidence: "former actress turned curator"
- If careers overlap in time → likely different people → REJECT
- If sequential careers with clear transition → may be same person → ACCEPT with documentation
```

### Location Differences
If source location differs from profile location, **investigate**:
```
Source: "Venezuela", "Mexico", "Brazil"
Profile: "UK", "Netherlands", "France"

ASK: Did this person relocate?
- Check timeline: When were they in each location?
- Check for migration evidence: education abroad, international career moves
- If locations overlap in time → likely different people → REJECT
- If sequential locations with clear move → may be same person → ACCEPT with documentation
```

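The timeline questions for profession and location differences both reduce to an interval-overlap test. The sketch below is one way a curator's helper might express it; the function names, the `NEEDS_TRANSITION_EVIDENCE` label, and the use of 2025 as "present" are assumptions made for illustration.

```python
from typing import Optional, Tuple

REFERENCE_YEAR = 2025  # assumption: an open-ended period runs to this year

Period = Tuple[int, Optional[int]]  # (start_year, end_year or None if ongoing)


def periods_overlap(a: Period, b: Period) -> bool:
    """Return True if two inclusive year ranges share at least one year."""
    a_start, a_end = a
    b_start, b_end = b
    a_end = REFERENCE_YEAR if a_end is None else a_end
    b_end = REFERENCE_YEAR if b_end is None else b_end
    return a_start <= b_end and b_start <= a_end


def red_flag_verdict(source_period: Period, profile_period: Period) -> str:
    """Apply the timeline test to a profession or location difference."""
    if periods_overlap(source_period, profile_period):
        # Two careers (or places) at the same time: likely different people.
        return "REJECT"
    # Sequential periods are only acceptable with documented transition evidence,
    # which a human curator has to find; this helper cannot establish it.
    return "NEEDS_TRANSITION_EVIDENCE"
```

Because the ranges are inclusive, a change that happens within a single year (one period ending in 2012, the next starting in 2012) counts as an overlap in this sketch and would need manual review rather than a blind rejection.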
### When to Actually REJECT

Reject when investigation shows **no plausible connection**:
```
Example: Carmen Julia Álvarez (Venezuelan actress, active 1970s-2000s)
vs Carmen Juliá (UK curator, active 2015-present)

- DIFFERENT professions on DIFFERENT continents
- No evidence of career change or relocation
- Birth year 1952 makes current junior curator role implausible
→ REJECT: These are clearly different people
```

### Age Conflicts (Still Automatic Rejection)
If source age is **physically implausible** for profile career stage, REJECT:
```
Source: Born 1922, 1915, 1939
Profile: Currently active professional in 2025
→ REJECT (person would be 86-110 years old)

Source: Born 2007, 2004
Profile: Senior curator
→ REJECT (person would be 18-21, too young)
```

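The age arithmetic can be made explicit with a small check. This sketch reuses the `MIN_PLAUSIBLE_BIRTH_YEAR` and `MAX_PLAUSIBLE_BIRTH_YEAR` bounds listed under Claim Rejection Patterns; the function name and the reason strings are hypothetical.

```python
# Bounds as given under "Claim Rejection Patterns"; both are tied to 2025.
MIN_PLAUSIBLE_BIRTH_YEAR = 1945
MAX_PLAUSIBLE_BIRTH_YEAR = 2002


def check_birth_year(birth_year: int, reference_year: int = 2025) -> tuple[bool, str]:
    """Return (is_plausible, reason) for a currently active professional."""
    age = reference_year - birth_year
    if birth_year < MIN_PLAUSIBLE_BIRTH_YEAR:
        return False, f"implausible_age (would be {age} in {reference_year}, too old)"
    if birth_year > MAX_PLAUSIBLE_BIRTH_YEAR:
        return False, f"implausible_age (would be {age} in {reference_year}, too young)"
    return True, f"plausible_age ({age} in {reference_year})"
```

This only catches birth years outside the fixed bounds. A birth year such as 1952 passes the check even though it is implausible for a junior role, so career stage still has to be judged by the curator.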
### Genealogy Source
Genealogy sources require **5 of 5 attribute matches** due to high false-positive rates:
```
Domains: geni.com, ancestry.*, familysearch.org, findagrave.com, myheritage.*
→ REQUIRE 5/5 attribute matches (these often describe historical namesakes)
→ Exception: If source explicitly links to living person with verifiable connection
```

## Implementation in Enrichment Scripts

```python
def validate_entity_match(profile: dict, search_result: dict) -> tuple[bool, str]:
    """
    Validate that a search result refers to the same person as the profile.

    REQUIRES: At least 3 of 5 identity attributes must match.
    Name match alone is INSUFFICIENT and automatically rejected.

    Returns (is_valid, reason)
    """
    # `or [{}]` also covers an empty affiliations list, which would raise IndexError on [0]
    affiliations = profile.get('affiliations') or [{}]
    profile_employer = affiliations[0].get('custodian_name', '').lower()
    profile_location = profile.get('profile_data', {}).get('location', '').lower()
    profile_role = profile.get('profile_data', {}).get('headline', '').lower()

    source_text = search_result.get('answer', '').lower()
    source_url = search_result.get('source_url', '').lower()

    # AUTOMATIC REJECTION: Genealogy sources
    genealogy_domains = ['geni.com', 'ancestry.', 'familysearch.', 'findagrave.', 'myheritage.']
    if any(domain in source_url for domain in genealogy_domains):
        return False, "genealogy_source_rejected"

    # AUTOMATIC REJECTION: Profession conflicts
    heritage_roles = ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection', 'heritage']
    entertainment_roles = ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete']

    profile_is_heritage = any(role in profile_role for role in heritage_roles)
    source_is_entertainment = any(role in source_text for role in entertainment_roles)

    if profile_is_heritage and source_is_entertainment:
        return False, "conflicting_profession"

    # AUTOMATIC REJECTION: Location conflicts
    if profile_location:
        location_conflicts = [
            ('venezuela', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
            ('caracas', 'london'), ('mexico city', 'amsterdam')
        ]
        for source_loc, profile_loc in location_conflicts:
            if source_loc in source_text and profile_loc in profile_location:
                return False, "conflicting_location"

    # Count positive identity attribute matches (need 3 of 5)
    matches = 0
    match_details = []

    # 1. Employer match
    if profile_employer and profile_employer in source_text:
        matches += 1
        match_details.append(f"employer:{profile_employer}")

    # 2. Location match
    if profile_location and profile_location in source_text:
        matches += 1
        match_details.append(f"location:{profile_location}")

    # 3. Role/profession match (any headline word longer than 4 characters)
    if profile_role:
        role_words = [w for w in profile_role.split() if len(w) > 4]
        if any(word in source_text for word in role_words):
            matches += 1
            match_details.append("role_match")

    # 4. Education/institution match (if available)
    profile_education = profile.get('profile_data', {}).get('education', [])
    if profile_education:
        edu_names = [e.get('school', '').lower() for e in profile_education if e.get('school')]
        if any(edu in source_text for edu in edu_names):
            matches += 1
            match_details.append("education_match")

    # 5. Time period match (career dates)
    # (implementation depends on available data)
    # Until this check exists, at most 4 of the 5 attributes can be counted.

    # REQUIRE 3 OF 5 MATCHES
    if matches < 3:
        return False, f"insufficient_identity_verification (only {matches}/5 attributes matched)"

    return True, f"verified ({matches}/5 matches: {', '.join(match_details)})"
```

## Claim Rejection Patterns

The following patterns should trigger automatic claim rejection:

```python
# Genealogy sources - ALWAYS REJECT
GENEALOGY_DOMAINS = [
    'geni.com', 'ancestry.com', 'ancestry.co.uk', 'familysearch.org',
    'findagrave.com', 'myheritage.com', 'wikitree.com', 'geneanet.org'
]

# Profession conflicts - if profile has one and source has another, REJECT
PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}

# Location conflicts - if source describes person in location X and profile is location Y, REJECT
LOCATION_PAIRS = [
    ('venezuela', 'uk'), ('venezuela', 'netherlands'), ('venezuela', 'germany'),
    ('mexico', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
    ('caracas', 'london'), ('caracas', 'amsterdam'),
]

# Age impossibility - if birth year makes current career implausible, REJECT
MIN_PLAUSIBLE_BIRTH_YEAR = 1945  # Would be 80 in 2025 - still plausible but verify
MAX_PLAUSIBLE_BIRTH_YEAR = 2002  # Would be 23 in 2025 - plausible for junior roles
```

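One way the `PROFESSION_CONFLICTS` table might be applied is sketched below. The helper names are hypothetical, and matching on whole words is a choice made for this sketch so that, for example, "actor" does not fire on "factor".

```python
import re

PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}


def profession_categories(text: str) -> set[str]:
    """Return every category whose terms appear in the text as whole words."""
    lowered = text.lower()
    found = set()
    for category, terms in PROFESSION_CONFLICTS.items():
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            found.add(category)
    return found


def has_profession_conflict(profile_headline: str, source_text: str) -> bool:
    """True when both sides name a profession and no category is shared."""
    profile_cats = profession_categories(profile_headline)
    source_cats = profession_categories(source_text)
    if not profile_cats or not source_cats:
        return False  # nothing to compare; absence of evidence is not a conflict
    return profile_cats.isdisjoint(source_cats)
```

A source that mentions both categories ("curator and former actress") is not flagged here, which leaves room for the career-change investigation described under Red Flags.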
## Handling Rejected Claims

When a claim fails entity resolution:

```json
{
  "claim_type": "birth_year",
  "claim_value": 1952,
  "entity_resolution": {
    "status": "REJECTED",
    "reason": "conflicting_profession",
    "details": "Source describes Venezuelan actress, profile is UK curator",
    "source_identity": "Carmen Julia Álvarez (Venezuelan actress)",
    "profile_identity": "Carmen Juliá (UK art curator)",
    "rejected_at": "2026-01-11T15:00:00Z",
    "rejected_by": "entity_resolution_validator_v1"
  }
}
```

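A record of this shape could be assembled with a small helper such as the following. This is a minimal sketch: the function name and its parameters are assumptions, while the field names follow the JSON example above.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_rejection_record(
    claim_type: str,
    claim_value: Any,
    reason: str,
    details: str,
    source_identity: str,
    profile_identity: str,
    rejected_by: str = "entity_resolution_validator_v1",
    now: Optional[datetime] = None,
) -> dict:
    """Build a rejected-claim record in the documented shape.

    `now` is expected to be a timezone-aware UTC datetime; it defaults to
    the current UTC time and exists mainly to make the output reproducible.
    """
    timestamp = now or datetime.now(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "entity_resolution": {
            "status": "REJECTED",
            "reason": reason,
            "details": details,
            "source_identity": source_identity,
            "profile_identity": profile_identity,
            "rejected_at": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rejected_by": rejected_by,
        },
    }
```

When serializing with `json.dumps`, passing `ensure_ascii=False` keeps names such as "Juliá" readable in the stored file.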
## Special Cases

### Common Names

For very common names (e.g., "Maria García", "Jan de Vries"), require **4 of 5** verification checks instead of 3; for extremely common names (e.g., "John Smith"), require all 5. The more common the name, the higher the threshold.

| Name Commonality | Required Matches |
|------------------|------------------|
| Unique name (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderately common (e.g., "Carmen Juliá") | 3 of 5 |
| Very common (e.g., "Jan de Vries") | 4 of 5 |
| Extremely common (e.g., "John Smith") | 5 of 5 or reject |

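The name-commonality table translates directly into a lookup. In this sketch the category keys and the function name are hypothetical, and deciding which category a name belongs to is left to the curator.

```python
# Thresholds copied from the name-commonality table; the keys are hypothetical labels.
REQUIRED_BY_NAME_COMMONALITY = {
    "unique": 2,
    "moderately_common": 3,
    "very_common": 4,
    "extremely_common": 5,
}


def name_threshold_met(matched_attributes: int, commonality: str) -> bool:
    """Check a match count against the threshold for the name's commonality.

    An unknown commonality label raises KeyError on purpose, so an
    unclassified name cannot silently pass with a default threshold.
    """
    required = REQUIRED_BY_NAME_COMMONALITY[commonality]
    return matched_attributes >= required
```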
### Abbreviated Names

For profiles with abbreviated names (e.g., "J. Smith"), entity resolution is inherently uncertain:
- Set `entity_resolution_confidence: "very_low"`
- Require **human review** for all claims
- Do NOT attribute web claims automatically

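Detecting an abbreviated name can be approximated with a regular expression over the name's tokens. The sketch below is an assumption about how such a check might look; the function names and the `requires_human_review` key are hypothetical, while `entity_resolution_confidence` is the field named in the list above.

```python
import re

# A token counts as an initial if it is a single letter ("J") or one or more
# letter-plus-period groups ("J.", "J.R.R."). [^\W\d_] matches any Unicode letter.
_INITIAL = re.compile(r"(?:[^\W\d_]\.)+|[^\W\d_]")


def is_abbreviated_name(full_name: str) -> bool:
    """Return True if any whitespace-separated token is just an initial."""
    return any(_INITIAL.fullmatch(token) for token in full_name.split())


def abbreviation_flags(full_name: str) -> dict:
    """Return the flags to set on a profile whose name is abbreviated."""
    if is_abbreviated_name(full_name):
        return {
            "entity_resolution_confidence": "very_low",
            "requires_human_review": True,  # hypothetical field name
        }
    return {}
```

Single-letter name particles (such as the Spanish "y") are also flagged by this pattern, which errs on the side of sending more profiles to human review.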
### Historical Persons

When sources describe historical/deceased persons:
- Check if death date conflicts with profile activity (living person active in 2025)
- **ALWAYS REJECT** genealogy site data
- Reject any source describing events before 1950 unless profile is known to be historical

### Wikipedia Articles

Wikipedia is particularly dangerous because:
- Many people with the same name have articles
- Search engines return Wikipedia first
- The Wikipedia Carmen Julia Álvarez article describes a Venezuelan actress born 1952
- This is a DIFFERENT PERSON from Carmen Juliá the UK curator

**For Wikipedia sources**:
1. Read the FULL article, not just snippets
2. Verify the Wikipedia subject's profession matches the profile
3. Verify the Wikipedia subject's location matches the profile
4. If ANY conflict detected → REJECT

## Audit Trail

All entity resolution decisions must be logged:

```json
{
  "enrichment_history": [
    {
      "enrichment_timestamp": "2026-01-11T15:00:00Z",
      "enrichment_agent": "enrich_person_comprehensive.py v1.4.0",
      "entity_resolution_decisions": [
        {
          "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
          "decision": "REJECTED",
          "reason": "Different person - Venezuelan actress, not UK curator"
        }
      ],
      "claims_rejected_count": 5,
      "claims_accepted_count": 1
    }
  ]
}
```

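Appending decisions to such a history entry might look like the sketch below. The helper names, the `"ACCEPTED"` decision value, and the `claims_affected` parameter are assumptions; only the field names come from the JSON example above.

```python
from datetime import datetime, timezone
from typing import Optional


def new_history_entry(agent: str, now: Optional[datetime] = None) -> dict:
    """Start an enrichment_history entry with empty decisions and zero counts."""
    timestamp = now or datetime.now(timezone.utc)
    return {
        "enrichment_timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "enrichment_agent": agent,
        "entity_resolution_decisions": [],
        "claims_rejected_count": 0,
        "claims_accepted_count": 0,
    }


def record_decision(entry: dict, source_url: str, decision: str,
                    reason: str, claims_affected: int = 1) -> None:
    """Log one source-level decision and update the claim counters in place.

    One decision about a source can cover several claims drawn from it,
    hence the separate `claims_affected` count (an assumption of this sketch).
    """
    if decision not in ("ACCEPTED", "REJECTED"):
        raise ValueError(f"unknown decision: {decision!r}")
    entry["entity_resolution_decisions"].append(
        {"source_url": source_url, "decision": decision, "reason": reason}
    )
    counter = "claims_accepted_count" if decision == "ACCEPTED" else "claims_rejected_count"
    entry[counter] += claims_affected
```

Keeping the rejection reason next to the source URL means a later reviewer can see why a namesake was excluded without repeating the investigation.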
## See Also

- Rule 21: Data Fabrication is Strictly Prohibited
- Rule 26: Person Data Provenance - Web Claims for Staff Information
- Rule 45: Inferred Data Must Be Explicit with Provenance