Rule 46: Entity Resolution - Names Are NEVER Sufficient
Status: CRITICAL
🚨 DATA QUALITY IS OF UTMOST IMPORTANCE 🚨
Wrong data is worse than no data. Attributing a birth year, spouse, or social media profile to the wrong person is a critical data quality failure that undermines the entire dataset's trustworthiness.
ALL enrichments MUST be done MANUALLY and double-checked. Automated web search enrichment has been DISABLED due to catastrophic entity resolution failures (540+ false claims removed in Jan 2026).
The cost of false data:
- Corrupts downstream analysis and reporting
- Creates legal/privacy risks (attributing data to wrong person)
- Destroys user trust in the dataset
- Requires expensive manual cleanup
🚫 AUTOMATED ENRICHMENT IS PROHIBITED 🚫
DO NOT USE automated scripts to enrich person profiles with web search data.
Why automated enrichment failed:
- Web searches return data about DIFFERENT people with similar names
- Regex pattern matching cannot distinguish between namesakes
- Wikipedia, IMDB, ResearchGate, Instagram all returned data from wrong people
- Example: "Carmen Juliá" search returned Venezuelan actress, Mexican hydrogeologist, Spanish medievalist - NONE were the UK art curator
ONLY ALLOWED enrichment methods:
- Manual research - Human curator verifies source refers to the correct person
- Institutional sources - Data from the person's employer website (verified)
- LinkedIn profile data - Already verified via direct profile access
- ORCID/Wikidata - If the person has a verified identifier
The Core Principle
🚨 SIMILAR OR IDENTICAL NAMES ARE NEVER SUFFICIENT FOR ENTITY RESOLUTION.
A web search result mentioning "Carmen Juliá born 1952" is NOT evidence that the Carmen Juliá in our person profile was born in 1952. Names are not unique identifiers - there are thousands of people with the same name worldwide.
Entity resolution requires verification of MULTIPLE independent identity attributes:
| Attribute | Purpose | Example |
|---|---|---|
| Age/Birth Year | Temporal consistency | Both sources describe someone in their 40s |
| Career Path | Professional identity | Both are art curators, not one curator and one actress |
| Location | Geographic consistency | Both are based in UK, not one UK and one Venezuela |
| Employer | Institutional affiliation | Both work at New Contemporaries |
| Education | Academic background | Same university or field |
Minimum Requirement: At least 3 of 5 attributes must match before attributing ANY claim from a web source. Name match alone = AUTOMATIC REJECTION.
Problem Statement
When enriching person profiles via web search (Linkup, Exa, etc.), search results often return data about different people with similar or identical names. Without proper entity resolution, the enrichment process can attribute false claims to the wrong person.
Example Failure (Carmen Juliá - UK Art Curator):
- Source profile: Carmen Juliá, Curator at New Contemporaries (UK)
- Birth year extracted: 1952 from Carmen Julia Álvarez (Venezuelan actress)
- Spouse extracted: "actors Eduardo Serrano" from the Venezuelan actress
- ResearchGate: Carmen Julia Navarro (Mexican hydrogeologist)
- Academia.edu: Carmen Julia Gutiérrez (Spanish medieval studies)
All data is from different people - none is the actual Carmen Juliá who is a UK-based art curator.
Why This Happened: The enrichment script used regex pattern matching to extract "born 1952" without verifying that the Wikipedia article described the SAME person.
The Rule
DO NOT use name matching as the basis for entity resolution. EVER.
For person enrichment via web search:
FORBIDDEN (Name-based extraction):
- ❌ Extracting birth years from any search result mentioning "Carmen Julia born..."
- ❌ Attributing social media profiles just because the name appears
- ❌ Claiming relationships (spouse, parent, child) from web text pattern matching
- ❌ Assigning academic profiles (ResearchGate, Academia.edu, Google Scholar) based on name matching alone
- ❌ Using Wikipedia articles without verifying ALL identity attributes
- ❌ Trusting genealogy sites (Geni, Ancestry, MyHeritage) which describe historical namesakes
- ❌ Using IMDB for birth years (actors with same names)
REQUIRED (Multi-Attribute Entity Resolution):
- Verify identity via MULTIPLE attributes - name alone is INSUFFICIENT
- Cross-reference with known facts (employer, location, job title from LinkedIn)
- Detect conflicting signals - actress vs curator, Venezuela vs UK, 1950s birth vs active 2020s career
- Reject ambiguous matches - if source doesn't clearly identify the same person, reject the claim
- Document rejection rationale - log why claim was rejected for audit trail
Entity Resolution Verification Checklist
Before attributing a web claim to a person profile, verify MULTIPLE identity attributes:
| # | Attribute | What to Check | Example Match | Example Conflict |
|---|---|---|---|---|
| 1 | Career/Profession | Same field/industry | Both are curators | Source says "actress", profile is curator |
| 2 | Employer | Same institution | Both at Rijksmuseum | Source says "film studio", profile is museum |
| 3 | Location | Same city/country | Both UK-based | Source says Venezuela, profile is UK |
| 4 | Age Range | Plausible for career | Birth 1980s, active 2020s | Birth 1952, still active in 2025 as junior |
| 5 | Education | Same university/field | Both art history | Source says "medical school" |
Minimum requirement: At least 3 of 5 attributes must match. Name match alone = AUTOMATIC REJECTION.
Any conflicting signal = AUTOMATIC REJECTION (e.g., source says "actress" when profile is "curator").
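The checklist can be written down as a small decision helper. The sketch below is not an automated enrichment step: it assumes a human curator has already assessed each attribute by hand and only turns that assessment into a status. The names `Signal`, `ATTRIBUTES`, and `resolve` are hypothetical, and a `CONFLICT` here is assumed to mean a conflict the curator has investigated and confirmed.

```python
from enum import Enum
from typing import Dict, Tuple


class Signal(Enum):
    MATCH = "match"
    CONFLICT = "conflict"   # confirmed by the curator after investigation
    UNKNOWN = "unknown"     # the source is silent on this attribute


# The five checklist attributes; "profession" stands for Career/Profession.
ATTRIBUTES = ("profession", "employer", "location", "age_range", "education")


def resolve(signals: Dict[str, Signal], required_matches: int = 3) -> Tuple[str, str]:
    """Turn a curator's per-attribute assessment into (status, reason)."""
    observed = {a: signals.get(a, Signal.UNKNOWN) for a in ATTRIBUTES}
    # Any confirmed conflict rejects the claim, regardless of other matches.
    conflicts = [a for a, s in observed.items() if s is Signal.CONFLICT]
    if conflicts:
        return "REJECTED", "conflicting_" + conflicts[0]
    # Otherwise the claim needs enough positive matches; the name never counts.
    matches = sum(1 for s in observed.values() if s is Signal.MATCH)
    if matches < required_matches:
        return "REJECTED", f"insufficient_matches_{matches}_of_{len(ATTRIBUTES)}"
    return "ACCEPTED", f"{matches}_of_{len(ATTRIBUTES)}_attributes_match"


carmen = {"profession": Signal.CONFLICT, "location": Signal.CONFLICT}
print(resolve(carmen))  # ('REJECTED', 'conflicting_profession')
```

Missing attributes default to `UNKNOWN`, so a source that says nothing about the person can never pass on the name alone.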
Sources with High Entity Resolution Risk
These sources are NOT forbidden, but require stricter verification thresholds due to high false-positive rates:
| Source Type | Risk Level | Why | Required Matches |
|---|---|---|---|
| Genealogy sites | CRITICAL | Historical persons with same name | 5/5 attributes (or explicit link to living person) |
| IMDB | CRITICAL | Actors with common names | 5/5 attributes (unless person works in film/TV) |
| Wikipedia | HIGH | Many people with same name have pages | 4/5 attributes match |
| Academic profiles | HIGH | Multiple researchers with same name | 4/5 attributes + institution match |
| Social media | HIGH | Many accounts with similar handles | 4/5 attributes + verify employer/location in bio |
| News articles | MEDIUM | May mention multiple people | 3/5 attributes + read full context |
| Institutional websites | LOW | Usually about their own staff | 2/5 attributes (good source if person works there) |
Key point: High-risk sources CAN be used if you verify enough identity attributes. The risk level determines the verification threshold, not whether the source is allowed.
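One way to keep these thresholds in one place is a plain lookup table. This is a minimal sketch: the source-type keys and the function name are hypothetical, and falling back to 5 for an unlisted source type is an assumption rather than part of this rule. The extra conditions in the table (institution match, reading the full context, and so on) still have to be checked by hand.

```python
# Required attribute matches per source type, copied from the table above.
# The key names are hypothetical labels for the table's "Source Type" column.
REQUIRED_MATCHES_BY_SOURCE = {
    "genealogy": 5,
    "imdb": 5,
    "wikipedia": 4,
    "academic_profile": 4,
    "social_media": 4,
    "news_article": 3,
    "institutional_website": 2,
}

# Assumption: an unlisted source type gets the strictest threshold.
STRICTEST = 5


def required_matches(source_type: str) -> int:
    """Return how many of the 5 identity attributes must match."""
    return REQUIRED_MATCHES_BY_SOURCE.get(source_type.strip().lower(), STRICTEST)


print(required_matches("wikipedia"))     # 4
print(required_matches("unknown_blog"))  # 5
```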
Red Flags Requiring Investigation
The following are red flags that require careful investigation - NOT automatic rejection. People change careers and relocate.
Profession Differences
If source profession differs from profile profession, investigate:
Source: "actress", "actor", "singer"
Profile: "curator", "archivist", "librarian"
ASK: Did this person change careers?
- Check timeline: Did acting career END before heritage career BEGAN?
- Check for transition evidence: "former actress turned curator"
- If careers overlap in time → likely different people → REJECT
- If sequential careers with clear transition → may be same person → ACCEPT with documentation
Location Differences
If source location differs from profile location, investigate:
Source: "Venezuela", "Mexico", "Brazil"
Profile: "UK", "Netherlands", "France"
ASK: Did this person relocate?
- Check timeline: When were they in each location?
- Check for migration evidence: education abroad, international career moves
- If locations overlap in time → likely different people → REJECT
- If sequential locations with clear move → may be same person → ACCEPT with documentation
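The timeline question in both checks reduces to whether two periods overlap. The sketch below shows only that comparison, using made-up years; the function name and the `current_year` default are assumptions. A non-overlapping result does not accept anything by itself: the curator still needs evidence of the career change or the move.

```python
from typing import Optional


def periods_overlap(a_start: int, a_end: Optional[int],
                    b_start: int, b_end: Optional[int],
                    current_year: int = 2025) -> bool:
    """True if two year ranges share at least one year.

    An end of None means "still ongoing" and is treated as current_year.
    """
    a_last = current_year if a_end is None else a_end
    b_last = current_year if b_end is None else b_end
    return a_start <= b_last and b_start <= a_last


# Hypothetical timelines, for illustration only.
print(periods_overlap(1990, 2012, 2005, None))  # True  -> likely different people
print(periods_overlap(1990, 2010, 2015, None))  # False -> look for transition evidence
```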
When to Actually REJECT
Reject when investigation shows no plausible connection:
Example: Carmen Julia Álvarez (Venezuelan actress, active 1970s-2000s)
vs Carmen Juliá (UK curator, active 2015-present)
- Active periods in DIFFERENT professions on DIFFERENT continents (1970s-2000s vs 2015-present)
- No evidence of career change or relocation
- Birth year 1952 makes current junior curator role implausible
→ REJECT: These are clearly different people
Age Conflicts (Still Automatic Rejection)
If source age is physically implausible for profile career stage, REJECT:
Source: Born 1922, 1915, 1939
Profile: Currently active professional in 2025
→ REJECT (person would be 86-110 years old)
Source: Born 2007, 2004
Profile: Senior curator
→ REJECT (person would be 18-21, too young)
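The ages quoted above are plain subtraction from the reference year, which the sketch below reproduces. The function names are hypothetical, and the age bounds passed to `age_is_plausible` are illustrative assumptions that depend on the role; this rule does not fix them. Ages derived from a birth year alone can be off by one depending on the birthday.

```python
def age_in(birth_year: int, reference_year: int = 2025) -> int:
    """Approximate age in reference_year (ignores month and day)."""
    return reference_year - birth_year


def age_is_plausible(birth_year: int, min_age: int, max_age: int,
                     reference_year: int = 2025) -> bool:
    """True if the implied age falls inside the bounds chosen for the role."""
    return min_age <= age_in(birth_year, reference_year) <= max_age


for year in (1915, 1922, 1939, 2004, 2007):
    print(year, age_in(year))  # 110, 103, 86, 21, 18

# Assumed bounds for a senior role: at least 30, at most 80.
print(age_is_plausible(2004, min_age=30, max_age=80))  # False
```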
Genealogy Source
Genealogy sources require 5 of 5 attribute matches due to high false-positive rates:
Domains: geni.com, ancestry.*, familysearch.org, findagrave.com, myheritage.*
→ REQUIRE 5/5 attribute matches (these often describe historical namesakes)
→ Exception: If source explicitly links to living person with verifiable connection
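A curator's tooling could flag these domains so the 5/5 threshold is not overlooked. The sketch below uses only the standard library's `urllib.parse.urlsplit`; the set names and the function are hypothetical. The wildcard entries (`ancestry.*`, `myheritage.*`) are matched loosely on any host label except the last, which is an assumption and can over-flag an unrelated host such as `ancestry.example.com`.

```python
from urllib.parse import urlsplit

# Domains listed above. Wildcard entries are kept as bare names.
GENEALOGY_EXACT = {"geni.com", "familysearch.org", "findagrave.com"}
GENEALOGY_ANY_TLD = {"ancestry", "myheritage"}


def is_genealogy_source(url: str) -> bool:
    """True if the URL's host belongs to one of the listed genealogy sites."""
    host = (urlsplit(url).hostname or "").lower()
    # Exact domain or any subdomain of it (www.geni.com, but not notgeni.com).
    for domain in GENEALOGY_EXACT:
        if host == domain or host.endswith("." + domain):
            return True
    # Loose match for "ancestry.*": the name appears as a label before the TLD.
    labels = host.split(".")
    return any(label in GENEALOGY_ANY_TLD for label in labels[:-1])


print(is_genealogy_source("https://www.ancestry.co.uk/some/record"))  # True
print(is_genealogy_source("https://en.wikipedia.org/wiki/Example"))   # False
```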
Claim Rejection Patterns
The following inconsistency patterns should trigger automatic claim rejection:
# Genealogy sources conflict - ALWAYS REJECT
GENEALOGY_DOMAINS = [
    'geni.com', 'ancestry.com', 'ancestry.co.uk', 'familysearch.org',
    'findagrave.com', 'myheritage.com', 'wikitree.com', 'geneanet.org'
]
# Profession conflicts - if profile has one and source has another, REJECT
PROFESSION_CONFLICTS = {
    'heritage': ['curator', 'archivist', 'librarian', 'conservator', 'registrar', 'collection manager'],
    'entertainment': ['actress', 'actor', 'singer', 'footballer', 'politician', 'model', 'athlete'],
    'medical': ['doctor', 'nurse', 'surgeon', 'physician'],
    'tech': ['software engineer', 'developer', 'programmer'],
}
# Location conflicts - if source describes person in location X and profile is location Y, REJECT
LOCATION_PAIRS = [
    ('venezuela', 'uk'), ('venezuela', 'netherlands'), ('venezuela', 'germany'),
    ('mexico', 'uk'), ('mexico', 'netherlands'), ('brazil', 'france'),
    ('caracas', 'london'), ('caracas', 'amsterdam'),
]
# Age impossibility - if birth year makes current career implausible, REJECT. For instance, for a Junior role:
MIN_PLAUSIBLE_BIRTH_YEAR = 1945  # Would be 80 in 2025 - still plausible but verify
MAX_PLAUSIBLE_BIRTH_YEAR = 2002  # Would be 23 in 2025 - plausible for junior roles
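As an illustration of how a category table like the one above might be consulted, the sketch below looks up the category of each profession and reports a mismatch only when both are known. It repeats a trimmed copy of the mapping so that it runs on its own; the function names are hypothetical, and exact lowercase matching is an assumption (real job titles are messier). The result is a flag for the curator's attention, not a verdict.

```python
from typing import Optional

# Trimmed copy of the category mapping, repeated so the sketch is self-contained.
PROFESSION_CATEGORIES = {
    "heritage": ["curator", "archivist", "librarian", "conservator"],
    "entertainment": ["actress", "actor", "singer"],
    "medical": ["doctor", "nurse", "surgeon", "physician"],
}


def category_of(profession: str) -> Optional[str]:
    """Return the category containing this profession, or None if unlisted."""
    term = profession.strip().lower()
    for category, terms in PROFESSION_CATEGORIES.items():
        if term in terms:
            return category
    return None


def professions_conflict(profile_profession: str, source_profession: str) -> bool:
    """True only when both professions are listed and fall in different categories."""
    profile_cat = category_of(profile_profession)
    source_cat = category_of(source_profession)
    return profile_cat is not None and source_cat is not None and profile_cat != source_cat


print(professions_conflict("curator", "actress"))    # True
print(professions_conflict("curator", "archivist"))  # False
```

An unlisted profession yields `False`, so the absence of a flag says nothing about whether the two records describe the same person.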
Handling Rejected Claims
When a claim fails entity resolution:
{
  "claim_type": "birth_year",
  "claim_value": 1952,
  "entity_resolution": {
    "status": "REJECTED",
    "reason": "conflicting_profession",
    "details": "Source describes Venezuelan actress, profile is UK curator",
    "source_identity": "Carmen Julia Álvarez (Venezuelan actress)",
    "profile_identity": "Carmen Juliá (UK art curator)",
    "rejected_at": "2026-01-11T15:00:00Z",
    "rejected_by": "entity_resolution_validator_v1"
  }
}
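A record of this shape can be assembled with a small helper so that every rejection carries the same fields. This is a sketch under assumptions: the function name and its parameters are hypothetical, and only the field names are taken from the example above.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_rejection(claim_type: str, claim_value: Any, reason: str, details: str,
                    source_identity: str, profile_identity: str, rejected_by: str,
                    rejected_at: Optional[datetime] = None) -> Dict[str, Any]:
    """Build a rejected-claim record with a UTC timestamp ending in 'Z'."""
    moment = (rejected_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "entity_resolution": {
            "status": "REJECTED",
            "reason": reason,
            "details": details,
            "source_identity": source_identity,
            "profile_identity": profile_identity,
            "rejected_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rejected_by": rejected_by,
        },
    }


record = build_rejection(
    "birth_year", 1952, "conflicting_profession",
    "Source describes Venezuelan actress, profile is UK curator",
    "Carmen Julia Álvarez (Venezuelan actress)", "Carmen Juliá (UK art curator)",
    rejected_by="entity_resolution_validator_v1",
    rejected_at=datetime(2026, 1, 11, 15, 0, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```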
Special Cases
Common Names
For common names (e.g., "John Smith", "Maria García", "Jan de Vries"), require at least 4 of 5 verification checks instead of the default 3. The more common the name, the higher the threshold.
| Name Commonality | Required Matches |
|---|---|
| Unique name (e.g., "Xander Vermeulen-Oosterhuis") | 2 of 5 |
| Moderately common (e.g., "Carmen Juliá") | 3 of 5 |
| Very common (e.g., "Jan de Vries") | 4 of 5 |
| Extremely common (e.g., "John Smith") | 5 of 5 or reject |
Abbreviated Names
For profiles with abbreviated names (e.g., "J. Smith"), entity resolution is inherently uncertain:
- Set entity_resolution_confidence: "very_low"
- Require human review for all claims
- Do NOT attribute web claims automatically
Historical Persons
When sources describe historical/deceased persons:
- Check if death date conflicts with profile activity (living person active in 2025)
- ALWAYS REJECT genealogy site data
- Reject any source describing events before 1950 unless profile is known to be historical
Wikipedia Articles
Wikipedia is particularly dangerous because:
- Many people with the same name have articles
- Search engines return Wikipedia first
- The Wikipedia Carmen Julia Álvarez article describes a Venezuelan actress born 1952
- This is a DIFFERENT PERSON from Carmen Juliá the UK curator
For Wikipedia sources:
- Read the FULL article, not just snippets
- Verify the Wikipedia subject's profession matches the profile
- Verify the Wikipedia subject's location matches the profile
- If ANY conflict detected → REJECT
Audit Trail
All entity resolution decisions must be logged:
{
  "enrichment_history": [
    {
      "enrichment_timestamp": "2026-01-11T15:00:00Z",
      "enrichment_agent": "enrich_person_comprehensive.py v1.4.0",
      "entity_resolution_decisions": [
        {
          "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
          "decision": "REJECTED",
          "reason": "Different person - Venezuelan actress, not UK curator"
        }
      ],
      "claims_rejected_count": 5,
      "claims_accepted_count": 1
    }
  ]
}
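Before an entry is written to the log, each decision can be checked for the three fields shown above. The sketch below is one possible shape for that check: the function name is hypothetical, and `"ACCEPTED"` is an assumed counterpart to the `"REJECTED"` value in the example.

```python
from typing import Dict, List

REQUIRED_KEYS = ("source_url", "decision", "reason")
# Assumption: "ACCEPTED" is the counterpart of the "REJECTED" value shown above.
ALLOWED_DECISIONS = {"ACCEPTED", "REJECTED"}


def decision_problems(decision: Dict[str, str]) -> List[str]:
    """Return a list of problems with one decision entry (empty if it is fine)."""
    problems = [
        f"missing or empty field: {key}"
        for key in REQUIRED_KEYS
        if not str(decision.get(key, "")).strip()
    ]
    value = decision.get("decision")
    if value and value not in ALLOWED_DECISIONS:
        problems.append(f"unknown decision: {value}")
    return problems


entry = {
    "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
    "decision": "REJECTED",
    "reason": "Different person - Venezuelan actress, not UK curator",
}
print(decision_problems(entry))  # []
```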
See Also
- Rule 21: Data Fabrication is Strictly Prohibited
- Rule 26: Person Data Provenance - Web Claims for Staff Information
- Rule 45: Inferred Data Must Be Explicit with Provenance