glam/.opencode/rules/entity_resolution/disambiguation-entity-profiles.md
kempersc 554fe520ea Add comprehensive rules for LinkML schema management and ontology mapping
2026-02-15 19:20:09 +01:00


# Rule 47: Disambiguation Entity Profiles - Prevent Repeated Entity Resolution Errors
## Status: CRITICAL
## Summary
When entity resolution determines that a web source describes a **different person** with a similar name, **create a PPID profile for that person** in `data/person/`. The PPID system is universal - ANY person who ever lived can have a profile, regardless of heritage relevance.
---
## The Universal PPID Principle
**In principle, all persons on Earth should be assigned PPIDs** - whether or not they are active in the heritage field. This includes:
- Heritage workers (curators, archivists, librarians, etc.)
- Non-heritage professionals (actors, doctors, athletes, etc.)
- Historical persons (deceased individuals from any era)
- Public figures and private individuals
The `heritage_relevance` field indicates whether someone works in the heritage sector, but does NOT determine whether they can have a profile. **Anyone can have a PPID.**
---
## The Problem
During entity resolution, we often discover that web search results describe a **different person** with a similar name:
| Heritage Profile | Namesake Discovered | Why Different |
|------------------|---------------------|---------------|
| Carmen Juliá (UK curator) | Carmen Julia Álvarez (Venezuelan actress) | Different profession, location, timeline |
| Jan de Vries (Rijksmuseum curator) | Jan de Vries (footballer) | Different profession |
| Robert Ritter (heritage worker) | Robert Ritter (Nazi doctor, 1901-1951) | Different era, profession |
Without a profile for the namesake, future enrichment attempts may:
1. Re-discover the same namesake
2. Waste time re-investigating a mismatch that was already resolved
3. Risk attributing the namesake's claims to the heritage profile again
---
## The Solution: Create PPID Profiles for Namesakes
When entity resolution proves two entities are different, **create a regular PPID profile for the namesake**:
1. Use standard PPID naming convention (no special prefix)
2. Set `heritage_relevance.is_heritage_relevant: false`
3. Document the disambiguation in BOTH profiles
---
## Example: Venezuelan Actress Profile
```json
{
  "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
  "profile_data": {
    "full_name": "Carmen Julia Álvarez",
    "profession": "actress",
    "nationality": "Venezuelan",
    "birth_year": 1952,
    "birth_location": "Caracas, Venezuela",
    "active_period": "1970s-2000s"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0,
    "reason": "Entertainment industry professional - actress in film and television"
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
        "name": "Carmen Juliá",
        "profession": "curator",
        "employer": "New Contemporaries",
        "location": "UK",
        "why_different": "Different profession (actress vs curator), different location (Venezuela vs UK), overlapping active periods in incompatible roles"
      }
    ],
    "disambiguation_note": "This is the Venezuelan actress, NOT the UK-based art curator."
  },
  "web_claims": [
    {
      "claim_type": "birth_year",
      "claim_value": 1952,
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    },
    {
      "claim_type": "profession",
      "claim_value": "actress",
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    }
  ],
  "extraction_metadata": {
    "created_at": "2026-01-11T15:00:00Z",
    "created_by": "manual-human-curator",
    "creation_reason": "Created during entity resolution to distinguish from heritage worker Carmen Juliá"
  }
}
```
---
## Update the Heritage Profile Too
The heritage profile should also reference the disambiguation:
```json
{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  },
  "heritage_relevance": {
    "is_heritage_relevant": true,
    "relevance_score": 0.85
  },
  "disambiguation_notes": {
    "known_namesakes": [
      {
        "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
        "name": "Carmen Julia Álvarez",
        "profession": "actress",
        "location": "Venezuela",
        "why_not_same_person": "Different profession, location, timeline"
      }
    ],
    "disambiguation_warning": "Web searches for 'Carmen Julia' return data about Venezuelan actress Carmen Julia Álvarez (born 1952). This is a DIFFERENT person."
  }
}
```
---
## When to Create Namesake Profiles
Create a PPID profile for a namesake when all of the following hold:
1. **Entity resolution proves they are a different person**
2. **They are notable enough** to appear in search results repeatedly (Wikipedia, IMDb, news)
3. **The confusion risk is high** (similar name, some overlapping attributes)
**Do NOT create profiles for**:
- Random social media accounts with no notable presence
- Obvious mismatches unlikely to recur in searches
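
The criteria above can be sketched as a single predicate. This is a minimal illustration, assuming all three conditions must hold together; `NamesakeCandidate`, its field names, and `should_create_namesake_profile` are hypothetical names and not part of the PPID schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NamesakeCandidate:
    """Hypothetical summary of what entity resolution found out about a namesake."""

    proven_different_person: bool  # outcome of entity resolution
    notable_sources: list[str] = field(default_factory=list)  # e.g. ["wikipedia", "imdb"]
    high_confusion_risk: bool = False  # similar name plus some overlapping attributes


def should_create_namesake_profile(candidate: NamesakeCandidate) -> bool:
    """Return True only when every criterion for a namesake profile is met."""
    return (
        candidate.proven_different_person
        and len(candidate.notable_sources) > 0
        and candidate.high_confusion_risk
    )


# A notable actress with a near-identical name qualifies;
# an anonymous social media account with no notable sources does not.
actress = NamesakeCandidate(True, ["wikipedia", "imdb"], True)
random_account = NamesakeCandidate(True, [], True)
print(should_create_namesake_profile(actress), should_create_namesake_profile(random_account))
```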
---
## Benefits
1. **Universal person database**: Any person can have a PPID
2. **Prevents repeated mistakes**: Future enrichment can check for known namesakes
3. **Bidirectional linking**: Both profiles reference each other
4. **Consistent data model**: No special file naming or profile types needed
5. **Audit trail**: Documents why profiles were created
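
As a rough sketch of benefit 2, an enrichment step could compare an incoming web claim against the `known_namesakes` entries before accepting it. The function name `find_conflicting_namesake` and the matching heuristic (a claim whose `claim_type` and `claim_value` equal an attribute recorded for a namesake) are assumptions for illustration only; a match is a prompt for closer review, not proof of identity.

```python
from typing import Optional


def find_conflicting_namesake(heritage_profile: dict, claim: dict) -> Optional[dict]:
    """Return the known namesake whose recorded attribute equals the claim, if any.

    Assumes the profile layout used in this rule: namesakes are listed under
    disambiguation_notes.known_namesakes, and a claim carries claim_type and
    claim_value. The comparison is a hypothetical heuristic.
    """
    notes = heritage_profile.get("disambiguation_notes", {})
    for namesake in notes.get("known_namesakes", []):
        recorded = namesake.get(claim["claim_type"])
        if recorded is not None and str(recorded).lower() == str(claim["claim_value"]).lower():
            return namesake
    return None


heritage = {
    "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
    "disambiguation_notes": {
        "known_namesakes": [
            {
                "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
                "name": "Carmen Julia Álvarez",
                "profession": "actress",
                "location": "Venezuela",
            }
        ]
    },
}

# A scraped "profession: actress" claim points at the namesake, so it should be reviewed.
suspect = find_conflicting_namesake(heritage, {"claim_type": "profession", "claim_value": "Actress"})
print(suspect["ppid"] if suspect else "no known namesake matches")
```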
---
## Workflow
### Step 1: During Entity Resolution
When you reject a claim due to identity mismatch with a notable namesake:
```
1. Document WHY the source describes a different person
2. Check if the namesake is notable (Wikipedia, IMDb, frequent search results)
3. If notable → Create PPID profile for the namesake
4. Link both profiles via disambiguation_notes
```
### Step 2: Create Namesake Profile
Use standard PPID naming:
```
ID_{birth-location}_{birth-decade}_{current-location}_{death-decade}_{NAME}.json
```
Example: `ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json`
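
The naming pattern can be assembled with a small helper. This is a sketch under assumptions inferred from the examples in this rule: unknown components are written as `XXXX` or `XX-XX-XXX`, and the name token is upper-cased, stripped of diacritics, and joined with hyphens. The pattern names decades while the examples carry full years, so the helper simply accepts whatever string is supplied. `name_token`, `build_ppid`, and `ppid_filename` are hypothetical names.

```python
import re
import unicodedata


def name_token(full_name: str) -> str:
    """Upper-case the name, drop diacritics, and join the words with hyphens."""
    decomposed = unicodedata.normalize("NFKD", full_name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[A-Za-z0-9]+", ascii_only)
    return "-".join(word.upper() for word in words)


def build_ppid(
    full_name: str,
    birth_location: str = "XX-XX-XXX",
    birth_date: str = "XXXX",
    current_location: str = "XX-XX-XXX",
    death_date: str = "XXXX",
) -> str:
    """Assemble a PPID string; the placeholder defaults stand for unknown components."""
    return "_".join(
        ["ID", birth_location, birth_date, current_location, death_date, name_token(full_name)]
    )


def ppid_filename(ppid: str) -> str:
    """Profiles are stored as <ppid>.json."""
    return f"{ppid}.json"


print(ppid_filename(build_ppid("Carmen Julia Álvarez", "VE-XX-CCS", "1952", "VE-XX-CCS")))
# -> ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json
```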
### Step 3: Update Both Profiles
- Namesake profile: Add `commonly_confused_with` pointing to heritage profile
- Heritage profile: Add `known_namesakes` pointing to namesake profile
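
The two updates can be applied together so the link is always bidirectional. The helper below is a minimal sketch operating on already-loaded profile dictionaries; `link_namesakes` is a hypothetical name, and only the `ppid`, `name`, and reason fields are written here (the examples in this rule also carry `profession` and `location`). Calling it twice does not duplicate entries.

```python
def link_namesakes(heritage: dict, namesake: dict, reason: str) -> None:
    """Cross-reference two profiles via disambiguation_notes, in place."""
    # Namesake profile -> heritage profile
    confused = namesake.setdefault("disambiguation_notes", {}).setdefault(
        "commonly_confused_with", []
    )
    if not any(entry.get("ppid") == heritage["ppid"] for entry in confused):
        confused.append(
            {
                "ppid": heritage["ppid"],
                "name": heritage["profile_data"]["full_name"],
                "why_different": reason,
            }
        )

    # Heritage profile -> namesake profile
    known = heritage.setdefault("disambiguation_notes", {}).setdefault("known_namesakes", [])
    if not any(entry.get("ppid") == namesake["ppid"] for entry in known):
        known.append(
            {
                "ppid": namesake["ppid"],
                "name": namesake["profile_data"]["full_name"],
                "why_not_same_person": reason,
            }
        )


curator = {
    "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
    "profile_data": {"full_name": "Carmen Juliá"},
}
actress = {
    "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
    "profile_data": {"full_name": "Carmen Julia Álvarez"},
}
link_namesakes(curator, actress, "Different profession, location, timeline")
```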
---
## Historical Persons
Historical persons (deceased) can also have PPID profiles:
```json
{
  "ppid": "ID_DE-XX-XXX_1901_DE-XX-XXX_1951_ROBERT-RITTER",
  "profile_data": {
    "full_name": "Robert Ritter",
    "profession": "physician",
    "birth_year": 1901,
    "death_year": 1951,
    "nationality": "German",
    "historical_note": "Nazi-era physician involved in racial hygiene programs"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ROBERT-RITTER",
        "name": "Robert Ritter",
        "profession": "heritage worker",
        "why_different": "Different era - historical figure (1901-1951) vs living heritage professional"
      }
    ]
  }
}
```
---
## Related Rules
- **Rule 46**: Entity Resolution - Names Are NEVER Sufficient
- **Rule 21**: Data Fabrication is Strictly Prohibited
- **Rule 26**: Person Data Provenance - Web Claims for Staff Information
---
## Summary
**The PPID system is universal.** When you discover during entity resolution that a web source describes a different person:
1. **Create a regular PPID profile** for the namesake (actress, historical figure, etc.)
2. **Set `heritage_relevance.is_heritage_relevant: false`** (unless they happen to also work in heritage)
3. **Link both profiles** via `disambiguation_notes`
4. **Use standard PPID naming** - no special prefixes needed
This builds a comprehensive person database while preventing entity resolution errors.