# Rule 47: Disambiguation Entity Profiles - Prevent Repeated Entity Resolution Errors

## Status: CRITICAL

## Summary

When entity resolution determines that a web source describes a **different person** with a similar name, **create a PPID profile for that person** in `data/person/`. The PPID system is universal - ANY person who ever lived can have a profile, regardless of heritage relevance.

---

## The Universal PPID Principle

**In principle, all persons on Earth should be assigned PPIDs** - whether or not they are active in the heritage field. This includes:

- Heritage workers (curators, archivists, librarians, etc.)
- Non-heritage professionals (actors, doctors, athletes, etc.)
- Historical persons (deceased individuals from any era)
- Public figures and private individuals

The `heritage_relevance` field indicates whether someone works in the heritage sector, but does NOT determine whether they can have a profile. **Anyone can have a PPID.**

---

## The Problem

During entity resolution, we often discover that web search results describe a **different person** with a similar name:

| Heritage Profile | Namesake Discovered | Why Different |
|------------------|---------------------|---------------|
| Carmen Juliá (UK curator) | Carmen Julia Álvarez (Venezuelan actress) | Different profession, location, timeline |
| Jan de Vries (Rijksmuseum curator) | Jan de Vries (footballer) | Different profession |
| Robert Ritter (heritage worker) | Robert Ritter (Nazi doctor, 1901-1951) | Different era, profession |

Without creating a profile for the namesake, future enrichment attempts may:

1. Re-discover the same namesake
2. Waste time re-investigating
3. Risk attributing false claims again

---

## The Solution: Create PPID Profiles for Namesakes

When entity resolution proves two entities are different, **create a regular PPID profile for the namesake**:

1. Use standard PPID naming convention (no special prefix)
2. Set `heritage_relevance.is_heritage_relevant: false`
3. Document the disambiguation in BOTH profiles

---

## Example: Venezuelan Actress Profile

```json
{
  "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
  "profile_data": {
    "full_name": "Carmen Julia Álvarez",
    "profession": "actress",
    "nationality": "Venezuelan",
    "birth_year": 1952,
    "birth_location": "Caracas, Venezuela",
    "active_period": "1970s-2000s"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0,
    "reason": "Entertainment industry professional - actress in film and television"
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
        "name": "Carmen Juliá",
        "profession": "curator",
        "employer": "New Contemporaries",
        "location": "UK",
        "why_different": "Different profession (actress vs curator), different location (Venezuela vs UK), overlapping active periods in incompatible roles"
      }
    ],
    "disambiguation_note": "This is the Venezuelan actress, NOT the UK-based art curator."
  },
  "web_claims": [
    {
      "claim_type": "birth_year",
      "claim_value": 1952,
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    },
    {
      "claim_type": "profession",
      "claim_value": "actress",
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    }
  ],
  "extraction_metadata": {
    "created_at": "2026-01-11T15:00:00Z",
    "created_by": "manual-human-curator",
    "creation_reason": "Created during entity resolution to distinguish from heritage worker Carmen Juliá"
  }
}
```

---

## Update the Heritage Profile Too

The heritage profile should also reference the disambiguation:

```json
{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  },
  "heritage_relevance": {
    "is_heritage_relevant": true,
    "relevance_score": 0.85
  },
  "disambiguation_notes": {
    "known_namesakes": [
      {
        "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
        "name": "Carmen Julia Álvarez",
        "profession": "actress",
        "location": "Venezuela",
        "why_not_same_person": "Different profession, location, timeline"
      }
    ],
    "disambiguation_warning": "Web searches for 'Carmen Julia' return data about Venezuelan actress Carmen Julia Álvarez (born 1952). This is a DIFFERENT person."
  }
}
```

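
The cross-references in the two JSON examples above could be written by a small helper. The following is a minimal Python sketch, not the project's actual tooling: the function name `link_namesakes` and the decision to copy only `ppid`, `name`, and the reason text are assumptions made for illustration. The field names (`disambiguation_notes`, `known_namesakes`, `commonly_confused_with`, `why_not_same_person`, `why_different`) are taken from the examples. The helper is idempotent, so re-running it does not duplicate entries.

```python
from typing import Any, Dict

Profile = Dict[str, Any]


def link_namesakes(heritage: Profile, namesake: Profile, reason: str) -> None:
    """Cross-reference two profiles in place (hypothetical helper)."""
    # Heritage profile: record the namesake under known_namesakes.
    known = heritage.setdefault("disambiguation_notes", {}).setdefault(
        "known_namesakes", []
    )
    if not any(entry.get("ppid") == namesake["ppid"] for entry in known):
        known.append(
            {
                "ppid": namesake["ppid"],
                "name": namesake["profile_data"]["full_name"],
                "why_not_same_person": reason,
            }
        )

    # Namesake profile: point back under commonly_confused_with.
    confused = namesake.setdefault("disambiguation_notes", {}).setdefault(
        "commonly_confused_with", []
    )
    if not any(entry.get("ppid") == heritage["ppid"] for entry in confused):
        confused.append(
            {
                "ppid": heritage["ppid"],
                "name": heritage["profile_data"]["full_name"],
                "why_different": reason,
            }
        )


# Example with the two profiles from this rule (only the fields the helper reads).
heritage_profile = {
    "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
    "profile_data": {"full_name": "Carmen Juliá"},
}
namesake_profile = {
    "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
    "profile_data": {"full_name": "Carmen Julia Álvarez"},
}
link_namesakes(
    heritage_profile, namesake_profile, "Different profession, location, timeline"
)
```
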
---
## When to Create Namesake Profiles

Create a PPID profile for a namesake when:

1. **Entity resolution proves they are a different person**
2. **They are notable enough** to appear in search results repeatedly (Wikipedia, IMDb, news)
3. **The confusion risk is high** (similar name, some overlapping attributes)

**Do NOT create profiles for**:

- Random social media accounts with no notable presence
- Obvious mismatches unlikely to recur in searches

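
The criteria above can be expressed as a simple gate. This is an illustrative Python sketch that assumes all three conditions must hold together, which the rule implies but does not state outright. The names `NamesakeAssessment` and `should_create_namesake_profile` are hypothetical, and how each flag is determined (for example, what counts as "notable") is left to the curator.

```python
from dataclasses import dataclass


@dataclass
class NamesakeAssessment:
    """Curator's findings about a candidate namesake (hypothetical structure)."""

    proven_different: bool      # entity resolution proved a different person
    notable: bool               # recurs in search results (Wikipedia, IMDb, news)
    high_confusion_risk: bool   # similar name plus some overlapping attributes


def should_create_namesake_profile(assessment: NamesakeAssessment) -> bool:
    """Return True only when every criterion is met (assumed conjunction)."""
    return (
        assessment.proven_different
        and assessment.notable
        and assessment.high_confusion_risk
    )


# A notable, easily confused namesake qualifies; a random social media
# account with no notable presence does not.
actress = NamesakeAssessment(True, True, True)
random_account = NamesakeAssessment(True, False, False)
```
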
---
## Benefits

1. **Universal person database**: Any person can have a PPID
2. **Prevents repeated mistakes**: Future enrichment can check for known namesakes
3. **Bidirectional linking**: Both profiles reference each other
4. **Consistent data model**: No special file naming or profile types needed
5. **Audit trail**: Documents why profiles were created

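
Benefit 2 can be sketched as a pre-check that runs before a web claim is accepted. The Python below is an assumption-laden illustration: `find_known_namesake` and `_normalize` are hypothetical names, and matching on an accent-insensitive, case-insensitive name is one possible heuristic rather than a documented procedure. In line with Rule 46, a hit is only a warning that the source may describe the recorded namesake, and a miss proves nothing about identity.

```python
import unicodedata
from typing import Optional


def _normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def find_known_namesake(profile: dict, source_name: str) -> Optional[dict]:
    """Return the known_namesakes entry whose name matches source_name, if any."""
    notes = profile.get("disambiguation_notes", {})
    for entry in notes.get("known_namesakes", []):
        if _normalize(entry.get("name", "")) == _normalize(source_name):
            return entry
    return None


# Heritage profile reduced to the fields this check reads.
heritage_profile = {
    "disambiguation_notes": {
        "known_namesakes": [
            {
                "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
                "name": "Carmen Julia Álvarez",
            }
        ]
    }
}
hit = find_known_namesake(heritage_profile, "Carmen Julia Alvarez")
```
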
---
## Workflow

### Step 1: During Entity Resolution

When you reject a claim due to identity mismatch with a notable namesake:

```
1. Document WHY the source describes a different person
2. Check if the namesake is notable (Wikipedia, IMDb, frequent search results)
3. If notable → Create PPID profile for the namesake
4. Link both profiles via disambiguation_notes
```

### Step 2: Create Namesake Profile

Use standard PPID naming:

```
ID_{birth-location}_{birth-decade}_{current-location}_{death-decade}_{NAME}.json
```

Example: `ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json`

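
The naming template above can be assembled mechanically. The Python sketch below is illustrative only: `ppid_filename` is a hypothetical helper, and the name handling (strip accents, uppercase, join words with hyphens) is inferred from the examples in this rule rather than taken from a PPID specification.

```python
import re
import unicodedata


def ppid_filename(
    birth_location: str,
    birth_part: str,
    current_location: str,
    death_part: str,
    full_name: str,
) -> str:
    """Build a profile filename following the template in Step 2.

    Location and date components are passed through verbatim; the
    examples use "XXXX" (or "XX-XX-XXX") for unknown values.
    """
    # Assumed name normalization: drop accents, uppercase, hyphenate.
    decomposed = unicodedata.normalize("NFKD", full_name)
    ascii_upper = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).upper()
    name_token = re.sub(r"[^A-Z0-9]+", "-", ascii_upper).strip("-")
    return (
        f"ID_{birth_location}_{birth_part}_"
        f"{current_location}_{death_part}_{name_token}.json"
    )


filename = ppid_filename(
    "VE-XX-CCS", "1952", "VE-XX-CCS", "XXXX", "Carmen Julia Álvarez"
)
```

With these inputs the helper reproduces the example filename shown above. The date components are passed through as strings rather than derived, because this rule does not spell out how they are computed.
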
### Step 3: Update Both Profiles

- Namesake profile: Add `commonly_confused_with` pointing to heritage profile
- Heritage profile: Add `known_namesakes` pointing to namesake profile

---

## Historical Persons

Historical persons (deceased) can also have PPID profiles:

```json
{
  "ppid": "ID_DE-XX-XXX_1901_DE-XX-XXX_1951_ROBERT-RITTER",
  "profile_data": {
    "full_name": "Robert Ritter",
    "profession": "physician",
    "birth_year": 1901,
    "death_year": 1951,
    "nationality": "German",
    "historical_note": "Nazi-era physician involved in racial hygiene programs"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ROBERT-RITTER",
        "name": "Robert Ritter",
        "profession": "heritage worker",
        "why_different": "Different era - historical figure (1901-1951) vs living heritage professional"
      }
    ]
  }
}
```

---

## Related Rules

- **Rule 46**: Entity Resolution - Names Are NEVER Sufficient
- **Rule 21**: Data Fabrication is Strictly Prohibited
- **Rule 26**: Person Data Provenance - Web Claims for Staff Information

---

## Summary

**The PPID system is universal.** When you discover during entity resolution that a web source describes a different person:

1. **Create a regular PPID profile** for the namesake (actress, historical figure, etc.)
2. **Set `heritage_relevance.is_heritage_relevant: false`** (unless they happen to also work in heritage)
3. **Link both profiles** via `disambiguation_notes`
4. **Use standard PPID naming** - no special prefixes needed

This builds a comprehensive person database while preventing entity resolution errors.