glam/.opencode/PERSON_ENTITY_DEDUPLICATION_RULE.md
2025-12-14 17:09:55 +01:00

Person Entity Deduplication Rule

Version: 1.0.0
Created: 2025-12-14
Applies To: Person entity profiles in data/custodian/person/entity/


Problem Statement

Duplicate person entity files can occur when:

  1. The same person is extracted at different times (different timestamps)
  2. LinkedIn URL slugs vary (e.g., with/without numeric suffix)
  3. Manual extraction overlaps with automated extraction
  4. Name variations lead to separate file creation

File Naming Convention

Person entity files follow this pattern:

{linkedin-slug}_{ISO-timestamp}.json

Examples:

  • frank-kanhai-a4119683_20251210T230007Z.json
  • frank-kanhai-a4119683_20251213T160000Z.json (same person, different time)
  • tom-de-smet_20251214T000000Z.json
  • tom-de-smet-5695436_20251211T073000Z.json (same person, different slug format)
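The naming pattern can be parsed mechanically. The following is a minimal sketch, assuming the timestamp always uses the compact UTC form shown in the examples (`YYYYMMDDTHHMMSSZ`); the function name `parse_entity_filename` is hypothetical and not part of any existing tooling.

```python
import re
from datetime import datetime, timezone

# Pattern from the convention above: {linkedin-slug}_{ISO-timestamp}.json
FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


def parse_entity_filename(filename):
    """Return (slug, extraction time in UTC), or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    timestamp = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ")
    return match["slug"], timestamp.replace(tzinfo=timezone.utc)
```

Files for which the function returns `None` do not follow the convention and are worth reviewing by hand.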

Duplicate Detection

Indicators of Duplicates

  1. Same LinkedIn slug (with different timestamps)
  2. Same person name with different slug formats
  3. Same LinkedIn URL in extraction_metadata.linkedin_url
  4. Matching unique identifiers (ORCID, ISNI, email)
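Indicator 3 can be checked once the files are loaded. The sketch below assumes each record is the parsed JSON of one entity file and that `extraction_metadata.linkedin_url` is a plain string; the helper names and the normalization rules (lowercasing, dropping query string and trailing slash) are illustrative choices, not an established convention. ORCID, ISNI, and email are left out because this rule does not say where those fields live.

```python
from collections import defaultdict


def normalize_linkedin_url(url):
    """Lowercase and drop query string, fragment, and trailing slash."""
    url = url.strip().lower().split("?", 1)[0].split("#", 1)[0]
    return url.rstrip("/")


def group_by_linkedin_url(records):
    """records maps filename -> parsed JSON.

    Returns normalized URL -> sorted filenames, only for URLs that
    occur in more than one file.
    """
    groups = defaultdict(list)
    for filename, record in records.items():
        url = record.get("extraction_metadata", {}).get("linkedin_url")
        if url:
            groups[normalize_linkedin_url(url)].append(filename)
    return {url: sorted(names) for url, names in groups.items() if len(names) > 1}
```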

Detection Commands

# Find slugs that occur in more than one file (same slug, different timestamps)
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d

# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
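The shell commands above only catch files that share an identical slug. To also surface slug-format variants, a rough sketch is to group on the slug with an ID-like suffix removed. The heuristic used here (drop the last hyphen-separated token if it contains a digit) is an assumption, and the groups it returns are candidates for review only, since different people can share a name.

```python
from collections import defaultdict


def base_name(slug):
    """Heuristic: drop a final hyphen-separated token that contains a digit."""
    head, sep, tail = slug.rpartition("-")
    return head if sep and any(ch.isdigit() for ch in tail) else slug


def candidate_duplicates(filenames):
    """Group entity filenames by slug with any ID-like suffix removed."""
    groups = defaultdict(list)
    for name in filenames:
        slug = name.split("_", 1)[0]  # same split as `cut -d'_' -f1`
        groups[base_name(slug)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```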

Merge Strategy

Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE

When merging duplicates:

  1. Keep the NEWER file as the base (more recent extraction)
  2. Preserve ALL data from both files (additive only - per Rule 5 in AGENTS.md)
  3. Use newer values for conflicting scalar fields
  4. Merge arrays (deduplicate where appropriate)
  5. Document merge in provenance

Merge Priority (Newer Wins for Conflicts)

| Field Type | Merge Strategy |
|---|---|
| extraction_metadata | Keep newer, note older in previous_extractions |
| profile_data scalars | Newer value wins |
| profile_data.experience[] | Merge arrays, dedupe by company+title+dates |
| profile_data.education[] | Merge arrays, dedupe by institution+degree |
| profile_data.skills[] | Union of all skills |
| contact_data | Keep more complete version |
| heritage_sector_relevance | Keep more detailed assessment |
| heritage_relevant_experience | Preserve if newer file lacks equivalent |
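The `profile_data` rows of the merge priority list could be implemented roughly as follows. This is a sketch under stated assumptions: the entry keys `company`, `title`, `start_date`, `end_date`, `institution`, and `degree` are hypothetical names for the concepts listed, their values are assumed to be strings or null, and `skills` is assumed to be a list of strings. Treating an empty newer value as "no conflict" is one reading of the preserve-all-data principle, not something this rule spells out.

```python
# Hypothetical entry field names; the rule names the concepts, not the JSON keys.
EXPERIENCE_KEY = ("company", "title", "start_date", "end_date")
EDUCATION_KEY = ("institution", "degree")
EMPTY_VALUES = (None, "", [], {})


def merge_lists(older, newer, key_fields):
    """Newer entries first, then older entries whose key was not already seen."""
    merged, seen = [], set()
    for entry in list(newer) + list(older):
        key = tuple(entry.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged


def merge_profile_data(older, newer):
    """Newer scalars win, arrays are merged, keys only in older survive."""
    merged = dict(older)
    for field, value in newer.items():
        # Newer wins, except that an empty newer value never erases older data.
        if field not in merged or value not in EMPTY_VALUES:
            merged[field] = value
    for field, key in (("experience", EXPERIENCE_KEY), ("education", EDUCATION_KEY)):
        if field in older or field in newer:
            merged[field] = merge_lists(older.get(field, []), newer.get(field, []), key)
    if "skills" in older or "skills" in newer:
        # dict.fromkeys removes duplicates while keeping first-seen order.
        combined = newer.get("skills", []) + older.get("skills", [])
        merged["skills"] = list(dict.fromkeys(combined))
    return merged
```

When two entries share a key, this sketch keeps the newer entry whole and drops the older one, so any detail present only in the older entry should be checked by hand before the older file is removed.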

Merge Procedure

Step 1: Identify Duplicates

# List both files
ls -la data/custodian/person/entity/frank-kanhai*

Step 2: Read Both Files

Compare content to understand differences:

  • Which has more complete profile_data?
  • Which has contact_data or heritage_sector_relevance?
  • Which has more recent extraction_date?

Step 3: Create Merged File

Use the newer timestamp for the final filename:

frank-kanhai-a4119683_20251213T160000Z.json  (keep this name)
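Because the timestamp is fixed-width and always in UTC, comparing the timestamp tokens as strings gives the chronological order. A small sketch of picking the newer filename this way (the helper name `newest_filename` is hypothetical):

```python
def newest_filename(filenames):
    """Pick the name whose timestamp token (after the last underscore) is latest."""
    def timestamp_token(name):
        return name.rsplit("_", 1)[1]
    return max(filenames, key=timestamp_token)
```

This only selects the timestamp. When the duplicates use different slug formats, the slug part of the final filename is decided separately, as described under Handling Slug Variations.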

Step 4: Merge Content

{
  "extraction_metadata": {
    // From newer file
    "extraction_date": "2025-12-13T16:00:00Z",
    "extraction_method": "exa_contents",
    // Add reference to older extraction
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    // Merged content from both files
  },
  "contact_data": {
    // From whichever file has it
  },
  "heritage_sector_relevance": {
    // From whichever file has it, or merge assessments
  }
}
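The `extraction_metadata` part of this template can be produced programmatically. The sketch below is one possible approach, not existing tooling: the function name and the `merged_on` parameter are hypothetical, and carrying over `previous_extractions` already present in the older file is an assumption made so that earlier merge history is not lost.

```python
import copy
from datetime import datetime, timezone


def merge_extraction_metadata(older_meta, newer_meta, older_filename, merged_on=None):
    """Keep the newer metadata and record the older extraction as history."""
    merged = copy.deepcopy(newer_meta)
    if merged_on is None:
        merged_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    history = merged.setdefault("previous_extractions", [])
    # Carry over history the older file accumulated from earlier merges.
    history.extend(copy.deepcopy(older_meta.get("previous_extractions", [])))
    history.append({
        "source_file": older_filename,
        "extraction_date": older_meta.get("extraction_date"),
        "merged_on": merged_on,
    })
    return merged
```

The inputs are copied rather than modified, so both source records stay intact until the merged file has been written and checked.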

Step 5: Validate Merged File

python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"

Step 6: Delete Older File

Only after the merged file validates:

rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
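A JSON syntax check does not show whether the merge dropped anything. Before the older file is removed, a rough additional check is to confirm that every key path in the older record still exists in the merged one. The sketch below assumes both records are already parsed into dictionaries; `missing_paths` is a hypothetical helper, and it compares keys only, not values or list contents.

```python
def missing_paths(older, merged, prefix=""):
    """List key paths present in older but absent from merged.

    Nested dictionaries are compared recursively; lists and scalars
    are only checked for the presence of their key.
    """
    missing = []
    for key, value in older.items():
        path = f"{prefix}.{key}" if prefix else key
        if key not in merged:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            missing.extend(missing_paths(value, merged[key], path))
    return missing
```

An empty result is a necessary condition for the additive-only rule, not proof that the merge is complete.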

Example Merge

Before (Two Files)

File 1 (older): frank-kanhai-a4119683_20251210T230007Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-10T23:00:07Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor"
  }
}

File 2 (newer): frank-kanhai-a4119683_20251213T160000Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}

After (Merged File)

Merged: frank-kanhai-a4119683_20251213T160000Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z",
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}

Handling Slug Variations

When the same person has files with different slug formats:

| Variation | Example | Resolution |
|---|---|---|
| With/without numeric suffix | tom-de-smet vs tom-de-smet-5695436 | Keep the one matching actual LinkedIn URL |
| Typos | jon-smith vs john-smith | Keep the correct spelling |
| Unicode normalization | muller vs müller | Keep ASCII-normalized version |
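For the Unicode case, an ASCII-normalized slug can be derived with the standard library. This is a sketch of one common approach (NFKD decomposition followed by dropping non-ASCII characters); the function name `ascii_slug` is hypothetical.

```python
import unicodedata


def ascii_slug(slug):
    """Decompose accented characters and drop the combining marks: 'müller' -> 'muller'."""
    decomposed = unicodedata.normalize("NFKD", slug)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()
```

Characters that have no ASCII decomposition, such as `ø` or `ß`, are removed entirely by this approach, so slugs containing them need a manual mapping.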

Determining Correct Slug

  1. Check extraction_metadata.linkedin_url in both files
  2. Use the slug that matches the actual LinkedIn profile URL
  3. If both are valid (LinkedIn allows multiple URL formats), prefer the one with the numeric suffix, which is less likely to collide with another profile
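The three checks above can be sketched in code. The example assumes profile URLs take the form `.../in/{slug}` and treats any final hyphen-separated token containing a digit as the "numeric suffix" (the example slug `a4119683` is alphanumeric, so a strict digits-only test would miss it). Both function names are hypothetical.

```python
from urllib.parse import urlparse


def slug_from_linkedin_url(url):
    """Return the path segment after '/in/', or None if there is none."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) >= 2 and segments[0] == "in":
        return segments[1].lower()
    return None


def choose_slug(candidates, linkedin_urls):
    """Prefer a slug confirmed by a URL; among those, one with an ID-like suffix."""
    confirmed = {slug_from_linkedin_url(url) for url in linkedin_urls}
    matching = [slug for slug in candidates if slug in confirmed]
    if not matching:
        return None  # nothing confirmed by a URL; needs manual review
    with_suffix = [
        slug for slug in matching
        if any(ch.isdigit() for ch in slug.rsplit("-", 1)[-1])
    ]
    return (with_suffix or matching)[0]
```

Returning `None` when no URL confirms either slug keeps the decision with a human reviewer instead of guessing.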

Non-Person Entity Files

A file in data/custodian/person/entity/ may turn out NOT to describe a person (e.g., it describes an organization). Detect and handle such files as follows.

Detection

# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
grep -l '"company"' data/custodian/person/entity/*.json | head -5

Resolution

  1. Do NOT delete - preserves data provenance
  2. Move to archive with documentation:
mkdir -p data/custodian/person/entity/archive/non_person
mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
   data/custodian/person/entity/archive/non_person/
  3. Create a README in the archive folder explaining why the files were moved
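The move and the README note can be done in one step so that the documentation is never forgotten. The following is a minimal sketch: the function name, the README filename `README.md`, and the one-line entry format are assumptions, while the archive location follows the `archive/non_person` path used above.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_non_person(file_path, reason):
    """Move a non-person file to archive/non_person/ and log why in a README."""
    source = Path(file_path)
    archive_dir = source.parent / "archive" / "non_person"
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    if target.exists():
        # Never overwrite: an existing archived copy may hold different data.
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.move(str(source), str(target))
    moved_on = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    with open(archive_dir / "README.md", "a", encoding="utf-8") as readme:
        readme.write(f"- {source.name}: moved {moved_on}; {reason}\n")
    return target
```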

Prevention

To prevent future duplicates:

  1. Check before creating: Search for existing files by LinkedIn slug
  2. Use consistent slug format: Prefer {name}-{numeric} format when available
  3. Update existing files: Instead of creating a new file, update the existing one with a new timestamp
  4. Document extraction source: Clear provenance prevents confusion
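The "check before creating" step can be a short lookup run before any new file is written. A sketch, assuming the directory layout and naming pattern described in this rule; `existing_files_for_slug` is a hypothetical helper name.

```python
from pathlib import Path


def existing_files_for_slug(entity_dir, slug):
    """Return existing entity files named '{slug}_*.json', oldest first."""
    return sorted(Path(entity_dir).glob(f"{slug}_*.json"))
```

If the list is non-empty, the extraction should update the most recent file (the last element) instead of creating a new one. The lookup matches the exact slug only, so a slug-format variant such as `tom-de-smet` versus `tom-de-smet-5695436` still needs the checks from Handling Slug Variations.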

References

  • AGENTS.md: Rule 5 (Never Delete Enriched Data)
  • AGENTS.md: Rule 20 (Person Entity Profiles)
  • PERSON_ENTITY_PROFILE_FORMAT_RULE.md: Entity file structure
  • HERITAGE_SECTOR_RELEVANCE_SCORING.md: Scoring guidelines