# Person Entity Deduplication Rule

**Version**: 1.0.0
**Created**: 2025-12-14
**Applies To**: Person entity profiles in `data/custodian/person/entity/`

---

## Problem Statement

Duplicate person entity files can occur when:

1. The same person is extracted at different times (different timestamps)
2. LinkedIn URL slugs vary (e.g., with/without numeric suffix)
3. Manual extraction overlaps with automated extraction
4. Name variations lead to separate file creation

---

## File Naming Convention

Person entity files follow this pattern:

```
{linkedin-slug}_{ISO-timestamp}.json
```

Examples:

- `frank-kanhai-a4119683_20251210T230007Z.json`
- `frank-kanhai-a4119683_20251213T160000Z.json` (same person, different time)
- `tom-de-smet_20251214T000000Z.json`
- `tom-de-smet-5695436_20251211T073000Z.json` (same person, different slug format)

---

## Duplicate Detection

### Indicators of Duplicates

1. **Same LinkedIn slug** (with different timestamps)
2. **Same person name** with different slug formats
3. **Same LinkedIn URL** in `extraction_metadata.linkedin_url`
4. **Matching unique identifiers** (ORCID, ISNI, email)

### Detection Commands

```bash
# Find potential duplicates by name prefix
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d

# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
```

---

## Merge Strategy

### Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE

When merging duplicates:

1. **Keep the NEWER file** as the base (more recent extraction)
2. **Preserve ALL data** from both files (additive only, per Rule 5 in AGENTS.md)
3. **Use newer values** for conflicting scalar fields
4. **Merge arrays** (deduplicate where appropriate)
5. **Document the merge** in provenance

### Merge Priority (Newer Wins for Conflicts)

| Field Type | Merge Strategy |
|------------|----------------|
| `extraction_metadata` | Keep newer, note older in `previous_extractions` |
| `profile_data` scalars | Newer value wins |
| `profile_data.experience[]` | Merge arrays, dedupe by company+title+dates |
| `profile_data.education[]` | Merge arrays, dedupe by institution+degree |
| `profile_data.skills[]` | Union of all skills |
| `contact_data` | Keep the more complete version |
| `heritage_sector_relevance` | Keep the more detailed assessment |
| `heritage_relevant_experience` | Preserve if the newer file lacks an equivalent |

---

## Merge Procedure

### Step 1: Identify Duplicates

```bash
# List both files
ls -la data/custodian/person/entity/frank-kanhai*
```

### Step 2: Read Both Files

Compare content to understand the differences:

- Which has more complete `profile_data`?
- Which has `contact_data` or `heritage_sector_relevance`?
- Which has the more recent `extraction_date`?
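The Step 2 comparison can be scripted. Below is a minimal sketch, assuming the two candidate files have already been loaded as dictionaries; the `summarize` helper is illustrative, and the sample records mirror the example files later in this rule.

```python
def summarize(entity: dict) -> dict:
    """Summarize the merge-relevant fields of one loaded entity file."""
    profile = entity.get("profile_data", {})
    return {
        "extraction_date": entity.get("extraction_metadata", {}).get("extraction_date"),
        "profile_fields": sorted(profile),                       # completeness of profile_data
        "has_contact_data": "contact_data" in entity,            # does it carry contact info?
        "has_heritage_relevance": "heritage_sector_relevance" in entity,
    }

# Sample records mirroring the example files in this rule
older = {
    "extraction_metadata": {"extraction_date": "2025-12-10T23:00:07Z"},
    "profile_data": {"name": "Frank Kanhai", "headline": "Senior Advisor"},
}
newer = {
    "extraction_metadata": {"extraction_date": "2025-12-13T16:00:00Z"},
    "profile_data": {
        "name": "Frank Kanhai",
        "headline": "Senior Advisor at Nationaal Archief",
        "experience": [],
    },
    "contact_data": {},
}

print(summarize(older))
print(summarize(newer))
```

A side-by-side view of these summaries answers the three questions above at a glance before any merging is attempted.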
### Step 3: Create Merged File

Use the **newer timestamp** for the final filename:

```
frank-kanhai-a4119683_20251213T160000Z.json (keep this name)
```

### Step 4: Merge Content

```json
{
  "extraction_metadata": {
    // From newer file
    "extraction_date": "2025-12-13T16:00:00Z",
    "extraction_method": "exa_contents",
    // Add reference to older extraction
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    // Merged content from both files
  },
  "contact_data": {
    // From whichever file has it
  },
  "heritage_sector_relevance": {
    // From whichever file has it, or merge assessments
  }
}
```

### Step 5: Delete Older File

After successful merge and validation:

```bash
rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
```

### Step 6: Validate Merged File

```bash
python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"
```

---

## Example Merge

### Before (Two Files)

**File 1** (older): `frank-kanhai-a4119683_20251210T230007Z.json`

```json
{
  "extraction_metadata": { "extraction_date": "2025-12-10T23:00:07Z" },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor"
  }
}
```

**File 2** (newer): `frank-kanhai-a4119683_20251213T160000Z.json`

```json
{
  "extraction_metadata": { "extraction_date": "2025-12-13T16:00:00Z" },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```

### After (Merged File)

**Merged**: `frank-kanhai-a4119683_20251213T160000Z.json`

```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z",
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```

---

## Handling Slug Variations

When the same person has files with different slug formats:

| Variation | Example | Resolution |
|-----------|---------|------------|
| With/without numeric suffix | `tom-de-smet` vs `tom-de-smet-5695436` | Keep the one matching the actual LinkedIn URL |
| Typos | `jon-smith` vs `john-smith` | Keep the correct spelling |
| Unicode normalization | `muller` vs `müller` | Keep the ASCII-normalized version |

### Determining the Correct Slug

1. Check `extraction_metadata.linkedin_url` in both files
2. Use the slug that matches the actual LinkedIn profile URL
3. If both are valid (LinkedIn allows multiple URL formats), prefer the one with the numeric suffix (more unique)

---

## Non-Person Entity Files

If a file in `data/custodian/person/entity/` is NOT a person (e.g., an organization):

### Detection

```bash
# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
grep -l '"company"' data/custodian/person/entity/*.json | head -5
```

### Resolution

1. **Do NOT delete** - this preserves data provenance
2. **Move to archive** with documentation:

   ```bash
   mkdir -p data/custodian/person/entity/archive/non_person
   mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
      data/custodian/person/entity/archive/non_person/
   ```

3. **Create a README** in the archive folder explaining why files were moved

---

## Prevention

To prevent future duplicates:
1. **Check before creating**: Search for existing files by LinkedIn slug
2. **Use consistent slug format**: Prefer the `{name}-{numeric}` format when available
3. **Update existing files**: Instead of creating a new file, update the existing one with a new timestamp
4. **Document extraction source**: Clear provenance prevents confusion

---

## References

- **AGENTS.md**: Rule 5 (Never Delete Enriched Data)
- **AGENTS.md**: Rule 20 (Person Entity Profiles)
- **PERSON_ENTITY_PROFILE_FORMAT_RULE.md**: Entity file structure
- **HERITAGE_SECTOR_RELEVANCE_SCORING.md**: Scoring guidelines
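For completeness, the merge strategy described in this rule can be sketched as a single helper. This is a minimal sketch, not a fixed implementation: `merge_entities` is a hypothetical name, and deduplicating array items by their `repr` is an illustrative stand-in for the per-field dedupe keys in the merge priority table (company+title+dates, institution+degree, etc.).

```python
import copy


def merge_entities(older: dict, newer: dict, source_file: str, merged_on: str) -> dict:
    """Additive merge: the newer file is the base, older data is preserved."""
    merged = copy.deepcopy(newer)

    # profile_data: newer scalars win; keys missing from newer come from older
    profile = merged.setdefault("profile_data", {})
    for key, value in older.get("profile_data", {}).items():
        if isinstance(value, list):
            # Arrays: union, deduplicated by serialized item content
            existing = profile.get(key, [])
            seen = {repr(item) for item in existing}
            profile[key] = existing + [v for v in value if repr(v) not in seen]
        else:
            profile.setdefault(key, value)  # newer value wins when both exist

    # Whole sections the newer file lacks are carried over intact
    for section in ("contact_data", "heritage_sector_relevance",
                    "heritage_relevant_experience"):
        if section in older and section not in merged:
            merged[section] = copy.deepcopy(older[section])

    # Provenance: record the older extraction in previous_extractions
    meta = merged.setdefault("extraction_metadata", {})
    meta.setdefault("previous_extractions", []).append({
        "source_file": source_file,
        "extraction_date": older.get("extraction_metadata", {}).get("extraction_date"),
        "merged_on": merged_on,
    })
    return merged
```

Running this on the two example files above yields the merged file shown in the Example Merge section: the newer headline wins, `contact_data` is carried over, and the older extraction is recorded under `previous_extractions`.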