# Rule 46: Entity Resolution - Names Are NEVER Sufficient

## Status: CRITICAL

## 🚨 DATA QUALITY IS OF UTMOST IMPORTANCE 🚨

**Wrong data is worse than no data.** Attributing a birth year, spouse, or social media profile to the wrong person is a **critical data quality failure** that undermines the entire dataset's trustworthiness.

**ALL enrichments MUST be done MANUALLY and double-checked.** Automated web search enrichment has been DISABLED due to catastrophic entity resolution failures (540+ false claims removed in Jan 2026).

**The cost of false data**:
- Corrupts downstream analysis and reporting
- Creates legal/privacy risks (attributing data to wrong person)
- Destroys user trust in the dataset
- Requires expensive manual cleanup

---

## 🚫 AUTOMATED ENRICHMENT IS PROHIBITED 🚫

**DO NOT USE** automated scripts to enrich person profiles with web search data.

**Why automated enrichment failed**:
- Web searches return data about DIFFERENT people with similar names
- Regex pattern matching cannot distinguish between namesakes
- Wikipedia, IMDB, ResearchGate, Instagram all returned data from wrong people
- Example: "Carmen Juliá" search returned Venezuelan actress, Mexican hydrogeologist, Spanish medievalist - NONE were the UK art curator

**ONLY ALLOWED enrichment methods**:
1. **Manual research** - Human curator verifies source refers to the correct person
2. **Institutional sources** - Data from the person's employer website (verified)
3. **LinkedIn profile data** - Already verified via direct profile access
4. **ORCID/Wikidata** - If the person has a verified identifier

---

## The Core Principle

🚨 **SIMILAR OR IDENTICAL NAMES ARE NEVER SUFFICIENT FOR ENTITY RESOLUTION.**

A web search result mentioning "Carmen Juliá born 1952" is **NOT** evidence that the Carmen Juliá in our person profile was born in 1952. Names are not unique identifiers - there are thousands of people with the same name worldwide.

**Entity resolution requires verification of MULTIPLE independent identity attributes:**

| Attribute | Purpose | Example |
|-----------|---------|---------|
| **Age/Birth Year** | Temporal consistency | Both sources describe someone in their 40s |
| **Career Path** | Professional identity | Both are art curators, not one curator and one actress |
| **Location** | Geographic consistency | Both are based in UK, not one UK and one Venezuela |
| **Employer** | Institutional affiliation | Both work at New Contemporaries |
| **Education** | Academic background | Same university or field |

**Minimum Requirement**: At least **3 of 5** attributes must match before attributing ANY claim from a web source. Name match alone = **AUTOMATIC REJECTION**.

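
The 3-of-5 minimum can be written down as a small helper that a curator might use to record a manual comparison. This is a minimal sketch, not part of any existing tooling: the `IdentityAttributes` class, its field names, and the case-insensitive string comparison are all assumptions made for illustration. Real comparisons (for example "UK" versus "London") still need human judgment.

```python
from dataclasses import dataclass, fields
from typing import Optional

MIN_MATCHES = 3  # "at least 3 of 5" from this rule


@dataclass
class IdentityAttributes:
    """Hypothetical record of the five identity attributes; None means unknown."""
    age_range: Optional[str] = None
    career: Optional[str] = None
    location: Optional[str] = None
    employer: Optional[str] = None
    education: Optional[str] = None


def count_matches(profile: IdentityAttributes, source: IdentityAttributes) -> int:
    """Count attributes that are known on both sides and agree (case-insensitive)."""
    matches = 0
    for field in fields(IdentityAttributes):
        p = getattr(profile, field.name)
        s = getattr(source, field.name)
        if p is not None and s is not None and p.strip().lower() == s.strip().lower():
            matches += 1
    return matches


def meets_minimum(profile: IdentityAttributes, source: IdentityAttributes) -> bool:
    """A name match contributes nothing here: only the five attributes are counted."""
    return count_matches(profile, source) >= MIN_MATCHES
```

Because the name is not one of the counted fields, a source that shares only the name scores zero and fails the minimum.
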
## Problem Statement

When enriching person profiles via web search (Linkup, Exa, etc.), search results often return data about **different people with similar or identical names**. Without proper entity resolution, the enrichment process can attribute false claims to the wrong person.

**Example Failure** (Carmen Juliá - UK Art Curator):
- Source profile: Carmen Juliá, Curator at New Contemporaries (UK)
- Birth year extracted: 1952 from Carmen Julia **Álvarez** (Venezuelan actress)
- Spouse extracted: "actors Eduardo Serrano" from the Venezuelan actress
- ResearchGate: Carmen Julia **Navarro** (Mexican hydrogeologist)
- Academia.edu: Carmen Julia **Gutiérrez** (Spanish medieval studies)

All data is from **different people** - none is the actual Carmen Juliá who is a UK-based art curator.

**Why This Happened**: The enrichment script used regex pattern matching to extract "born 1952" without verifying that the Wikipedia article described the SAME person.

## The Rule

### DO NOT use name matching as the basis for entity resolution. EVER.

For person enrichment via web search:

**FORBIDDEN** (Name-based extraction):
- ❌ Extracting birth years from any search result mentioning "Carmen Julia born..."
- ❌ Attributing social media profiles just because the name appears
- ❌ Claiming relationships (spouse, parent, child) from web text pattern matching
- ❌ Assigning academic profiles (ResearchGate, Academia.edu, Google Scholar) based on name matching alone
- ❌ Using Wikipedia articles without verifying ALL identity attributes
- ❌ Trusting genealogy sites (Geni, Ancestry, MyHeritage) which describe historical namesakes
- ❌ Using IMDB for birth years (actors with same names)

**REQUIRED** (Multi-Attribute Entity Resolution):
1. **Verify identity via MULTIPLE attributes** - name alone is INSUFFICIENT
2. **Cross-reference with known facts** (employer, location, job title from LinkedIn)
3. **Detect conflicting signals** - actress vs curator, Venezuela vs UK, 1950s birth vs active 2020s career
4. **Reject ambiguous matches** - if source doesn't clearly identify the same person, reject the claim
5. **Document rejection rationale** - log why claim was rejected for audit trail

## Entity Resolution Verification Checklist

Before attributing a web claim to a person profile, verify MULTIPLE identity attributes:

| # | Attribute | What to Check | Example Match | Example Conflict |
|---|-----------|---------------|---------------|------------------|
| 1 | **Career/Profession** | Same field/industry | Both are curators | Source says "actress", profile is curator |
| 2 | **Employer** | Same institution | Both at Rijksmuseum | Source says "film studio", profile is museum |
| 3 | **Location** | Same city/country | Both UK-based | Source says Venezuela, profile is UK |
| 4 | **Age Range** | Plausible for career | Birth 1980s, active 2020s | Birth 1952, still active in 2025 as junior |
| 5 | **Education** | Same university/field | Both art history | Source says "medical school" |

**Minimum requirement**: At least **3 of 5** attributes must match. Name match alone = **AUTOMATIC REJECTION**.

**Any conflicting signal = AUTOMATIC REJECTION** (e.g., source says "actress" when profile is "curator").

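
One way to record the outcome of this checklist is to give each of the five checks one of three states and derive the decision from them. The sketch below is illustrative only: the `Check` enum, the `CHECKLIST` keys, and the `decide` function are hypothetical names, and the states are meant to be filled in by a human curator after reading the source.

```python
from enum import Enum
from typing import Dict


class Check(Enum):
    MATCH = "match"
    CONFLICT = "conflict"
    UNKNOWN = "unknown"  # attribute missing in the source or in the profile


# Hypothetical keys for the five checklist rows above.
CHECKLIST = ("career", "employer", "location", "age_range", "education")


def decide(checks: Dict[str, Check], required: int = 3) -> str:
    """Return "ACCEPTED" or "REJECTED" for one web claim.

    Attributes missing from `checks` are treated as UNKNOWN.
    """
    results = [checks.get(name, Check.UNKNOWN) for name in CHECKLIST]
    if Check.CONFLICT in results:
        return "REJECTED"  # any conflicting signal rejects the claim
    if results.count(Check.MATCH) < required:
        return "REJECTED"  # too few matches; a name-only hit has zero
    return "ACCEPTED"
```

For example, `decide({})` models a name-only hit and returns `"REJECTED"`, as does a source where three attributes match but the profession conflicts.
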
## Sources with High Entity Resolution Risk

These sources are NOT forbidden, but require **stricter verification thresholds** due to high false-positive rates:

| Source Type | Risk Level | Why | Required Matches |
|-------------|------------|-----|------------------|
| Genealogy sites | CRITICAL | Historical persons with same name | 5/5 attributes (or explicit link to living person) |
| IMDB | CRITICAL | Actors with common names | 5/5 attributes (unless person works in film/TV) |
| Wikipedia | HIGH | Many people with same name have pages | 4/5 attributes match |
| Academic profiles | HIGH | Multiple researchers with same name | 4/5 attributes + institution match |
| Social media | HIGH | Many accounts with similar handles | 4/5 attributes + verify employer/location in bio |
| News articles | MEDIUM | May mention multiple people | 3/5 attributes + read full context |
| Institutional websites | LOW | Usually about their own staff | 2/5 attributes (good source if person works there) |

**Key point**: High-risk sources CAN be used if you verify enough identity attributes. The risk level determines the verification threshold, not whether the source is allowed.

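
The thresholds in the table can be kept in a single lookup so that every review applies the same numbers. This is a sketch under stated assumptions: the dictionary keys are hypothetical labels for the source types, the extra conditions in the table (institution match, bio check, reading full context, the film/TV exception) are not modelled, and falling back to 5 for unlisted source types is a cautious choice made for this example.

```python
# Required attribute matches (out of 5), transcribed from the table above.
REQUIRED_MATCHES = {
    "genealogy": 5,
    "imdb": 5,
    "wikipedia": 4,
    "academic_profile": 4,
    "social_media": 4,
    "news_article": 3,
    "institutional_website": 2,
}

STRICTEST = 5  # assumption: unlisted source types get the strictest threshold


def required_matches(source_type: str) -> int:
    """Look up the match threshold for a source type."""
    return REQUIRED_MATCHES.get(source_type, STRICTEST)


def passes_threshold(source_type: str, matched: int) -> bool:
    """True when the number of verified attributes reaches the threshold."""
    if not 0 <= matched <= 5:
        raise ValueError("matched must be between 0 and 5")
    return matched >= required_matches(source_type)
```
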
## Red Flags Requiring Investigation

The following are **red flags** that require careful investigation - NOT automatic rejection. People change careers and relocate.

### Profession Differences
If source profession differs from profile profession, **investigate**:
```
Source: "actress", "actor", "singer"
Profile: "curator", "archivist", "librarian"

ASK: Did this person change careers?
- Check timeline: Did acting career END before heritage career BEGAN?
- Check for transition evidence: "former actress turned curator"
- If careers overlap in time → likely different people → REJECT
- If sequential careers with clear transition → may be same person → ACCEPT with documentation
```

### Location Differences
If source location differs from profile location, **investigate**:
```
Source: "Venezuela", "Mexico", "Brazil"
Profile: "UK", "Netherlands", "France"

ASK: Did this person relocate?
- Check timeline: When were they in each location?
- Check for migration evidence: education abroad, international career moves
- If locations overlap in time → likely different people → REJECT
- If sequential locations with clear move → may be same person → ACCEPT with documentation
```

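
Both investigations come down to the same timeline question: do the two periods overlap, or does one follow the other? A minimal sketch of that check follows. The `Period` type, the function names, and the `"INVESTIGATE_FURTHER"` outcome for sequential periods without transition evidence are assumptions for illustration; the rule text itself only defines the REJECT and ACCEPT-with-documentation cases.

```python
from typing import Optional, Tuple

# (start_year, end_year); end_year=None means the period is still ongoing.
Period = Tuple[int, Optional[int]]


def periods_overlap(a: Period, b: Period, current_year: int) -> bool:
    """True if two year ranges share at least one year."""
    a_end = a[1] if a[1] is not None else current_year
    b_end = b[1] if b[1] is not None else current_year
    return a[0] <= b_end and b[0] <= a_end


def red_flag_outcome(source: Period, profile: Period,
                     transition_evidence: bool, current_year: int) -> str:
    """Apply the timeline questions to a profession or location difference."""
    if periods_overlap(source, profile, current_year):
        return "REJECT"  # overlapping in time: likely different people
    if transition_evidence:
        return "ACCEPT_WITH_DOCUMENTATION"  # sequential with a clear transition
    return "INVESTIGATE_FURTHER"  # sequential, but nothing links the two yet
```

For instance, a source career of `(1995, 2010)` and a profile career of `(2012, None)` do not overlap, so the outcome depends on whether the curator found transition evidence such as "former actress turned curator".
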
### When to Actually REJECT

Reject when investigation shows **no plausible connection**:
```
Example: Carmen Julia Álvarez (Venezuelan actress, active 1970s-2000s)
vs Carmen Juliá (UK curator, active 2015-present)

- Active periods in DIFFERENT professions on DIFFERENT continents
- No evidence of career change or relocation
- Birth year 1952 makes current junior curator role implausible
→ REJECT: These are clearly different people
```

### Age Conflicts (Still Automatic Rejection)
If source age is **physically implausible** for profile career stage, REJECT:
```
Source: Born 1922, 1915, 1939
Profile: Currently active professional in 2025
→ REJECT (person would be 86-110 years old)

Source: Born 2007, 2004
Profile: Senior curator
→ REJECT (person would be 18-21, too young)
```

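
The age check is simple arithmetic and can be sketched as below. The function names and the `min_age` / `max_age` parameters are hypothetical; this rule does not fix exact age bounds per career stage, so the caller has to supply them.

```python
def age_in_year(birth_year: int, year: int) -> int:
    """Approximate age in `year` (ignores month and day of birth)."""
    return year - birth_year


def is_age_plausible(birth_year: int, year: int, min_age: int, max_age: int) -> bool:
    """True if the implied age falls inside the bounds chosen for the career stage."""
    return min_age <= age_in_year(birth_year, year) <= max_age
```

With illustrative bounds of 23 to 80 for an active professional, `is_age_plausible(1939, 2025, 23, 80)` and `is_age_plausible(2007, 2025, 23, 80)` are both false, while a birth year of 1985 passes.
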
### Genealogy Source
Genealogy sources require **5 of 5 attribute matches** due to high false-positive rates:
```
Domains: geni.com, ancestry.*, familysearch.org, findagrave.com, myheritage.*
→ REQUIRE 5/5 attribute matches (these often describe historical namesakes)
→ Exception: If source explicitly links to living person with verifiable connection
```

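
Recognising these domains, including the `ancestry.*` and `myheritage.*` wildcards, can be sketched with the standard library's `urllib.parse`. The names below are hypothetical, and the wildcard handling is an assumption: any host with `ancestry` or `myheritage` as a label before the final one is flagged, which errs on the strict side. The function expects full URLs with a scheme such as `https://`.

```python
from urllib.parse import urlparse

# Domains listed above with a fixed top-level domain.
GENEALOGY_EXACT = {"geni.com", "familysearch.org", "findagrave.com"}
# Brands listed above with a wildcard TLD (ancestry.*, myheritage.*).
GENEALOGY_ANY_TLD = {"ancestry", "myheritage"}


def is_genealogy_url(url: str) -> bool:
    """True if the URL's host belongs to one of the listed genealogy domains."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # no scheme or no host: cannot classify
    # Exact domain or any subdomain of it (e.g. www.geni.com).
    if any(host == d or host.endswith("." + d) for d in GENEALOGY_EXACT):
        return True
    # Wildcard TLD: the brand appears as a label before the final label.
    labels = host.split(".")
    return any(label in GENEALOGY_ANY_TLD for label in labels[:-1])
```

A URL flagged by this check would then need 5 of 5 attribute matches, or the explicit living-person exception, before any claim from it is considered.
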
## Claim Rejection Patterns

The following inconsistent patterns should trigger automatic claim rejection:

```python
# Genealogy sources conflict - ALWAYS REJECT
GENEALOGY_DOMAINS = [
    'geni.com', 'ancestry.com', 'ancestry.co.uk', 'familysearch.org',
    'findagrave.com', 'myheritage.com', 'wikitree.com', 'geneanet.org'
]

# Profession conflicts - if profile has one and source has another, REJECT
PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}

# Location conflicts - if source describes person in location X and profile is location Y, REJECT
LOCATION_PAIRS = [
    ('venezuela', 'uk'), ('venezuela', 'netherlands'), ('venezuela', 'germany'),
    ('mexico', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
    ('caracas', 'london'), ('caracas', 'amsterdam'),
]

# Age impossibility - if birth year makes current career implausible, REJECT.
# For instance, for a junior role:
MIN_PLAUSIBLE_BIRTH_YEAR = 1945  # Would be 80 in 2025 - still plausible but verify
MAX_PLAUSIBLE_BIRTH_YEAR = 2002  # Would be 23 in 2025 - plausible for junior roles
```

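
As an illustration of how the `PROFESSION_CONFLICTS` groups might be applied, the sketch below maps a free-text job title to a group and flags a conflict when the profile and the source land in different groups. The helper names are hypothetical, the dictionary is copied here only so the example runs on its own, and whole-word matching is an assumption made to avoid hits such as "actor" inside "contractor".

```python
import re
from typing import Optional

PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}


def profession_category(title: str) -> Optional[str]:
    """Return the first group whose keyword appears as a whole word in the title."""
    text = title.lower()
    for category, keywords in PROFESSION_CONFLICTS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return category
    return None  # title not covered by any group


def professions_conflict(profile_title: str, source_title: str) -> bool:
    """True only when both titles are classified and fall into different groups."""
    profile_cat = profession_category(profile_title)
    source_cat = profession_category(source_title)
    return profile_cat is not None and source_cat is not None and profile_cat != source_cat
```

An unclassified title yields no conflict rather than a false alarm, so a curator still has to read the source for professions outside these groups.
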
## Handling Rejected Claims

When a claim fails entity resolution:

```json
{
  "claim_type": "birth_year",
  "claim_value": 1952,
  "entity_resolution": {
    "status": "REJECTED",
    "reason": "conflicting_profession",
    "details": "Source describes Venezuelan actress, profile is UK curator",
    "source_identity": "Carmen Julia Álvarez (Venezuelan actress)",
    "profile_identity": "Carmen Juliá (UK art curator)",
    "rejected_at": "2026-01-11T15:00:00Z",
    "rejected_by": "entity_resolution_validator_v1"
  }
}
```

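
A record of this shape can be assembled with a small helper so that the field names and the UTC timestamp format stay consistent. This is a sketch, not existing tooling: `build_rejection` and its parameters are hypothetical, and `now` is assumed to be a timezone-aware `datetime` when supplied.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_rejection(claim_type: str, claim_value: Any, reason: str, details: str,
                    source_identity: str, profile_identity: str, rejected_by: str,
                    now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build a rejected-claim record in the JSON shape shown above."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "entity_resolution": {
            "status": "REJECTED",
            "reason": reason,
            "details": details,
            "source_identity": source_identity,
            "profile_identity": profile_identity,
            # ISO 8601 in UTC with a trailing "Z", matching the example record.
            "rejected_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rejected_by": rejected_by,
        },
    }
```

The result is a plain dictionary, so it can be written out with `json.dumps(record, ensure_ascii=False)` to keep names such as "Juliá" readable.
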
## Special Cases

### Common Names

For very common names (e.g., "John Smith", "Maria García", "Jan de Vries"), require **4 of 5** verification checks instead of 3. The more common the name, the higher the threshold.

| Name Commonality | Required Matches |
|------------------|------------------|
| Unique name (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderately common (e.g., "Carmen Juliá") | 3 of 5 |
| Very common (e.g., "Jan de Vries") | 4 of 5 |
| Extremely common (e.g., "John Smith") | 5 of 5 or reject |

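
This rule does not spell out how the name-commonality threshold interacts with the per-source threshold from the source risk table. One cautious reading, shown here purely as an assumption, is to apply the stricter of the two. The dictionary keys and the `effective_threshold` function are hypothetical names, and judging how common a name is remains a manual decision.

```python
# Required matches (out of 5) by name commonality, from the table above.
NAME_REQUIRED = {
    "unique": 2,
    "moderately_common": 3,
    "very_common": 4,
    "extremely_common": 5,
}


def effective_threshold(name_commonality: str, source_required: int) -> int:
    """Assumed policy: use the stricter of the name-based and source-based thresholds."""
    if not 0 <= source_required <= 5:
        raise ValueError("source_required must be between 0 and 5")
    return max(NAME_REQUIRED[name_commonality], source_required)
```

Under this reading a unique name found on a Wikipedia page (4 of 5 for the source) still needs 4 matches, and an extremely common name needs 5 regardless of the source.
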
### Abbreviated Names

For profiles with abbreviated names (e.g., "J. Smith"), entity resolution is inherently uncertain:
- Set `entity_resolution_confidence: "very_low"`
- Require **human review** for all claims
- Do NOT attribute web claims automatically

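
Spotting abbreviated names before any research starts can be sketched with a short pattern check. The function names are hypothetical and the heuristic is an assumption: a name part counts as an initial if it is made of letter-plus-period pairs ("J.", "J.R.") or is a single uppercase letter ("J"). Lowercase single letters such as the Spanish "y" are deliberately not treated as initials.

```python
import re
from typing import Optional

# One or more letter+period pairs, e.g. "J." or "J.R." (Unicode letters allowed).
_DOTTED_INITIALS = re.compile(r"(?:[^\W\d_]\.)+")


def _is_initial(token: str) -> bool:
    if _DOTTED_INITIALS.fullmatch(token):
        return True
    return len(token) == 1 and token.isalpha() and token.isupper()


def is_abbreviated(name: str) -> bool:
    """True if any whitespace-separated part of the name looks like an initial."""
    return any(_is_initial(token) for token in name.split())


def resolution_confidence(name: str) -> Optional[str]:
    """Return "very_low" for abbreviated names; None means no override applies."""
    return "very_low" if is_abbreviated(name) else None
```

A profile flagged this way would be routed to human review for every claim, in line with the bullets above.
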
### Historical Persons

When sources describe historical/deceased persons:
- Check if death date conflicts with profile activity (living person active in 2025)
- **ALWAYS REJECT** genealogy site data
- Reject any source describing events before 1950 unless profile is known to be historical

### Wikipedia Articles

Wikipedia is particularly dangerous because:
- Many people with the same name have articles
- Search engines return Wikipedia first
- The Wikipedia Carmen Julia Álvarez article describes a Venezuelan actress born 1952
- This is a DIFFERENT PERSON from Carmen Juliá the UK curator

**For Wikipedia sources**:
1. Read the FULL article, not just snippets
2. Verify the Wikipedia subject's profession matches the profile
3. Verify the Wikipedia subject's location matches the profile
4. If ANY conflict detected → REJECT

## Audit Trail

All entity resolution decisions must be logged:

```json
{
  "enrichment_history": [
    {
      "enrichment_timestamp": "2026-01-11T15:00:00Z",
      "enrichment_agent": "enrich_person_comprehensive.py v1.4.0",
      "entity_resolution_decisions": [
        {
          "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
          "decision": "REJECTED",
          "reason": "Different person - Venezuelan actress, not UK curator"
        }
      ],
      "claims_rejected_count": 5,
      "claims_accepted_count": 1
    }
  ]
}
```

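
Keeping the decision list and the two counters in step is easiest when a single helper updates both. The sketch below assumes the entry layout shown above. `record_decision` and its `claims` parameter are hypothetical; `claims` exists because one source can yield several claims, which is why the counters in the example exceed the number of listed decisions.

```python
from typing import Any, Dict

VALID_DECISIONS = ("ACCEPTED", "REJECTED")


def record_decision(entry: Dict[str, Any], source_url: str, decision: str,
                    reason: str, claims: int = 1) -> None:
    """Append one decision to an enrichment_history entry and update its counters."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"decision must be one of {VALID_DECISIONS}")
    if claims < 1:
        raise ValueError("claims must be at least 1")
    entry.setdefault("entity_resolution_decisions", []).append({
        "source_url": source_url,
        "decision": decision,
        "reason": reason,
    })
    counter = "claims_accepted_count" if decision == "ACCEPTED" else "claims_rejected_count"
    entry[counter] = entry.get(counter, 0) + claims
```

Because the entry is mutated in place, the same dictionary can be appended to `enrichment_history` once all sources for a profile have been reviewed.
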
## See Also

- Rule 21: Data Fabrication is Strictly Prohibited
- Rule 26: Person Data Provenance - Web Claims for Staff Information
- Rule 45: Inferred Data Must Be Explicit with Provenance