glam/.opencode/rules/entity_resolution/disambiguation-entity-profiles.md
kempersc 554fe520ea Add comprehensive rules for LinkML schema management and ontology mapping
2026-02-15 19:20:09 +01:00


Rule 47: Disambiguation Entity Profiles - Prevent Repeated Entity Resolution Errors

Status: CRITICAL

Summary

When entity resolution determines that a web source describes a different person with a similar name, and that person is notable enough to recur in search results, create a PPID profile for that person in data/person/. The PPID system is universal: any person who ever lived can have a profile, regardless of heritage relevance.


The Universal PPID Principle

In principle, every person, living or deceased, can be assigned a PPID, whether or not they are active in the heritage field. This includes:

  • Heritage workers (curators, archivists, librarians, etc.)
  • Non-heritage professionals (actors, doctors, athletes, etc.)
  • Historical persons (deceased individuals from any era)
  • Public figures and private individuals

The heritage_relevance field indicates whether someone works in the heritage sector, but does NOT determine whether they can have a profile. Anyone can have a PPID.


The Problem

During entity resolution, we often discover that web search results describe a different person with a similar name:

| Heritage Profile | Namesake Discovered | Why Different |
|---|---|---|
| Carmen Juliá (UK curator) | Carmen Julia Álvarez (Venezuelan actress) | Different profession, location, timeline |
| Jan de Vries (Rijksmuseum curator) | Jan de Vries (footballer) | Different profession |
| Robert Ritter (heritage worker) | Robert Ritter (Nazi doctor, 1901-1951) | Different era, profession |

Without a profile for the namesake, future enrichment attempts may:

  1. Re-discover the same namesake
  2. Waste time re-investigating it
  3. Attribute the namesake's claims to the heritage profile again

The Solution: Create PPID Profiles for Namesakes

When entity resolution proves two entities are different, create a regular PPID profile for the namesake:

  1. Use standard PPID naming convention (no special prefix)
  2. Set heritage_relevance.is_heritage_relevant: false
  3. Document the disambiguation in BOTH profiles

Example: Venezuelan Actress Profile

{
  "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
  "profile_data": {
    "full_name": "Carmen Julia Álvarez",
    "profession": "actress",
    "nationality": "Venezuelan",
    "birth_year": 1952,
    "birth_location": "Caracas, Venezuela",
    "active_period": "1970s-2000s"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0,
    "reason": "Entertainment industry professional - actress in film and television"
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
        "name": "Carmen Juliá",
        "profession": "curator",
        "employer": "New Contemporaries",
        "location": "UK",
        "why_different": "Different profession (actress vs curator), different location (Venezuela vs UK), overlapping active periods in incompatible roles"
      }
    ],
    "disambiguation_note": "This is the Venezuelan actress, NOT the UK-based art curator."
  },
  "web_claims": [
    {
      "claim_type": "birth_year",
      "claim_value": 1952,
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    },
    {
      "claim_type": "profession",
      "claim_value": "actress",
      "provenance": {
        "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
        "retrieved_on": "2026-01-11T14:30:00Z",
        "retrieval_agent": "manual-human-curator"
      }
    }
  ],
  "extraction_metadata": {
    "created_at": "2026-01-11T15:00:00Z",
    "created_by": "manual-human-curator",
    "creation_reason": "Created during entity resolution to distinguish from heritage worker Carmen Juliá"
  }
}

Update the Heritage Profile Too

The heritage profile should also reference the disambiguation:

{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  },
  "heritage_relevance": {
    "is_heritage_relevant": true,
    "relevance_score": 0.85
  },
  "disambiguation_notes": {
    "known_namesakes": [
      {
        "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
        "name": "Carmen Julia Álvarez",
        "profession": "actress",
        "location": "Venezuela",
        "why_not_same_person": "Different profession, location, timeline"
      }
    ],
    "disambiguation_warning": "Web searches for 'Carmen Julia' return data about Venezuelan actress Carmen Julia Álvarez (born 1952). This is a DIFFERENT person."
  }
}
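
A minimal sketch of how future enrichment might consult known_namesakes before attributing a claim. The function name find_known_namesake and the plain substring match are assumptions made for illustration and are not part of this rule. A hit only flags the source for review; per Rule 46, a name match alone never proves identity in either direction.

```python
from typing import Any, Optional


def find_known_namesake(profile: dict[str, Any], source_text: str) -> Optional[dict[str, Any]]:
    """Return the first recorded namesake whose name occurs in source_text, else None.

    Hypothetical helper: a case-insensitive substring match is a deliberately
    simple heuristic that flags a source for review, nothing more.
    """
    notes = profile.get("disambiguation_notes", {})
    haystack = source_text.casefold()
    for namesake in notes.get("known_namesakes", []):
        if namesake["name"].casefold() in haystack:
            return namesake
    return None


# Trimmed copy of the heritage profile shown above.
heritage_profile = {
    "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
    "disambiguation_notes": {
        "known_namesakes": [
            {
                "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
                "name": "Carmen Julia Álvarez",
                "why_not_same_person": "Different profession, location, timeline",
            }
        ]
    },
}

snippet = "Carmen Julia Álvarez (born 1952) is a Venezuelan actress."
hit = find_known_namesake(heritage_profile, snippet)
if hit is not None:
    print(f"Review before use: source mentions known namesake {hit['ppid']}")
```

The sketch assumes both strings use the same Unicode normalization form; a real check would probably also compare attributes such as profession or birth year.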

When to Create Namesake Profiles

Create a PPID profile for a namesake when:

  1. Entity resolution proves they are a different person
  2. They are notable enough to appear in search results repeatedly (Wikipedia, IMDb, news)
  3. The confusion risk is high (similar name, some overlapping attributes)

Do NOT create profiles for:

  • Random social media accounts with no notable presence
  • Obvious mismatches unlikely to recur in searches
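
The criteria above can be sketched as a small decision helper. The class and field names are hypothetical, and treating the three criteria as jointly required (a logical AND) is an assumption; the rule lists them without saying how they combine.

```python
from dataclasses import dataclass


@dataclass
class NamesakeAssessment:
    """Outcome of reviewing a rejected source (hypothetical structure)."""
    proven_different: bool     # entity resolution established a different person
    recurs_in_search: bool     # e.g. has Wikipedia, IMDb or news coverage
    high_confusion_risk: bool  # similar name plus some overlapping attributes


def should_create_namesake_profile(assessment: NamesakeAssessment) -> bool:
    """Assumes all three criteria must hold before a profile is created."""
    return (
        assessment.proven_different
        and assessment.recurs_in_search
        and assessment.high_confusion_risk
    )


actress = NamesakeAssessment(True, True, True)
random_account = NamesakeAssessment(True, False, False)
print(should_create_namesake_profile(actress))
print(should_create_namesake_profile(random_account))
```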

Benefits

  1. Universal person database: Any person can have a PPID
  2. Prevents repeated mistakes: Future enrichment can check for known namesakes
  3. Bidirectional linking: Both profiles reference each other
  4. Consistent data model: No special file naming or profile types needed
  5. Audit trail: Documents why profiles were created

Workflow

Step 1: During Entity Resolution

When you reject a claim due to identity mismatch with a notable namesake:

1. Document WHY the source describes a different person
2. Check if the namesake is notable (Wikipedia, IMDb, frequent search results)
3. If notable → Create PPID profile for the namesake
4. Link both profiles via disambiguation_notes

Step 2: Create Namesake Profile

Use standard PPID naming:

ID_{birth-location}_{birth-decade}_{current-location}_{death-decade}_{NAME}.json

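Example: ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json

Unknown components are filled with X placeholders (XXXX for a date, XX or XXX inside a location code). Note that the examples in this rule put a full year (1952, 1901, 1951) in the date positions.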
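
A minimal sketch of assembling such a filename. The helper names, the diacritic stripping and the whitespace-to-hyphen rule are inferred from the examples in this rule (Álvarez becomes ALVAREZ) and are assumptions, not a documented algorithm.

```python
import unicodedata
from typing import Optional

UNKNOWN_LOCATION = "XX-XX-XXX"  # placeholder as used in the examples of this rule
UNKNOWN_DATE = "XXXX"


def normalize_name(full_name: str) -> str:
    """Strip diacritics, uppercase, and join name parts with hyphens (assumed convention)."""
    decomposed = unicodedata.normalize("NFKD", full_name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "-".join(without_marks.upper().split())


def ppid_filename(
    full_name: str,
    birth_location: Optional[str] = None,
    birth_date: Optional[str] = None,
    current_location: Optional[str] = None,
    death_date: Optional[str] = None,
) -> str:
    """Build a PPID filename, substituting X placeholders for unknown components."""
    parts = [
        "ID",
        birth_location or UNKNOWN_LOCATION,
        birth_date or UNKNOWN_DATE,
        current_location or UNKNOWN_LOCATION,
        death_date or UNKNOWN_DATE,
        normalize_name(full_name),
    ]
    return "_".join(parts) + ".json"


print(ppid_filename("Carmen Julia Álvarez", "VE-XX-CCS", "1952", "VE-XX-CCS"))
# → ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ.json
```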

Step 3: Update Both Profiles

  • Namesake profile: Add commonly_confused_with pointing to heritage profile
  • Heritage profile: Add known_namesakes pointing to namesake profile
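
The two updates can be sketched as one helper that writes both directions at once, so a link is never left one-sided. The function name link_namesakes and the choice of fields copied into each entry are assumptions; only the keys commonly_confused_with and known_namesakes come from this rule.

```python
from typing import Any


def _add_entry(profile: dict[str, Any], key: str, entry: dict[str, Any]) -> None:
    """Append entry under disambiguation_notes[key] unless that PPID is already listed."""
    notes = profile.setdefault("disambiguation_notes", {})
    entries = notes.setdefault(key, [])
    if all(existing["ppid"] != entry["ppid"] for existing in entries):
        entries.append(entry)


def link_namesakes(heritage: dict[str, Any], namesake: dict[str, Any], reason: str) -> None:
    """Record the disambiguation in both profiles (hypothetical helper, mutates in place)."""
    _add_entry(namesake, "commonly_confused_with", {
        "ppid": heritage["ppid"],
        "name": heritage["profile_data"]["full_name"],
        "why_different": reason,
    })
    _add_entry(heritage, "known_namesakes", {
        "ppid": namesake["ppid"],
        "name": namesake["profile_data"]["full_name"],
        "why_not_same_person": reason,
    })


curator = {
    "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
    "profile_data": {"full_name": "Carmen Juliá"},
}
actress = {
    "ppid": "ID_VE-XX-CCS_1952_VE-XX-CCS_XXXX_CARMEN-JULIA-ALVAREZ",
    "profile_data": {"full_name": "Carmen Julia Álvarez"},
}
link_namesakes(curator, actress, "Different profession, location, timeline")
link_namesakes(curator, actress, "Different profession, location, timeline")  # second call adds nothing
```

Skipping entries whose PPID is already present keeps repeated enrichment runs from piling up duplicates.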

Historical Persons

Historical persons (deceased) can also have PPID profiles:

{
  "ppid": "ID_DE-XX-XXX_1901_DE-XX-XXX_1951_ROBERT-RITTER",
  "profile_data": {
    "full_name": "Robert Ritter",
    "profession": "physician",
    "birth_year": 1901,
    "death_year": 1951,
    "nationality": "German",
    "historical_note": "Nazi-era physician involved in racial hygiene programs"
  },
  "heritage_relevance": {
    "is_heritage_relevant": false,
    "relevance_score": 0.0
  },
  "disambiguation_notes": {
    "commonly_confused_with": [
      {
        "ppid": "ID_XX-XX-XXX_XXXX_XX-XX-XXX_XXXX_ROBERT-RITTER",
        "name": "Robert Ritter",
        "profession": "heritage worker",
        "why_different": "Different era - historical figure (1901-1951) vs living heritage professional"
      }
    ]
  }
}
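
A small consistency check can catch a PPID whose date components disagree with profile_data. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes the date positions hold a full year as in the examples in this rule, with XXXX meaning unknown.

```python
from typing import Any


def parse_ppid(ppid: str) -> dict[str, str]:
    """Split a PPID into its six underscore-separated components."""
    prefix, birth_loc, birth_date, current_loc, death_date, name = ppid.split("_")
    if prefix != "ID":
        raise ValueError(f"unexpected PPID prefix: {prefix!r}")
    return {
        "birth_location": birth_loc,
        "birth_date": birth_date,
        "current_location": current_loc,
        "death_date": death_date,
        "name": name,
    }


def date_mismatches(profile: dict[str, Any]) -> list[str]:
    """List date components of the PPID that contradict profile_data (assumed full years)."""
    parts = parse_ppid(profile["ppid"])
    data = profile.get("profile_data", {})
    problems = []
    for field, component_key in (("birth_year", "birth_date"), ("death_year", "death_date")):
        year = data.get(field)
        component = parts[component_key]
        if year is not None and component != "XXXX" and component != str(year):
            problems.append(f"{field}={year} but PPID has {component}")
    return problems


ritter = {
    "ppid": "ID_DE-XX-XXX_1901_DE-XX-XXX_1951_ROBERT-RITTER",
    "profile_data": {"full_name": "Robert Ritter", "birth_year": 1901, "death_year": 1951},
}
print(date_mismatches(ritter))
```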

Related Rules

  • Rule 46: Entity Resolution - Names Are NEVER Sufficient
  • Rule 21: Data Fabrication is Strictly Prohibited
  • Rule 26: Person Data Provenance - Web Claims for Staff Information

Summary

The PPID system is universal. When you discover during entity resolution that a web source describes a different person:

  1. Create a regular PPID profile for the namesake (actress, historical figure, etc.)
  2. Set heritage_relevance.is_heritage_relevant: false (unless they happen to also work in heritage)
  3. Link both profiles via disambiguation_notes
  4. Use standard PPID naming - no special prefixes needed

This builds a comprehensive person database while preventing entity resolution errors.