Person Entity Deduplication Rule
Version: 1.0.0
Created: 2025-12-14
Applies To: Person entity profiles in data/custodian/person/entity/
Problem Statement
Duplicate person entity files can occur when:
- The same person is extracted at different times (different timestamps)
- LinkedIn URL slugs vary (e.g., with/without numeric suffix)
- Manual extraction overlaps with automated extraction
- Name variations lead to separate file creation
File Naming Convention
Person entity files follow this pattern:
{linkedin-slug}_{ISO-timestamp}.json
Examples:
- frank-kanhai-a4119683_20251210T230007Z.json
- frank-kanhai-a4119683_20251213T160000Z.json (same person, different time)
- tom-de-smet_20251214T000000Z.json
- tom-de-smet-5695436_20251211T073000Z.json (same person, different slug format)
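The naming pattern can be checked mechanically. The following is a minimal sketch, assuming the timestamp is always the compact UTC form seen in the examples (YYYYMMDDTHHMMSSZ); the names `FILENAME_RE`, `EntityFilename`, and `parse_entity_filename` are hypothetical and not part of any existing tooling.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Assumed pattern: {linkedin-slug}_{YYYYMMDD}T{HHMMSS}Z.json
# The slug is everything before the last underscore.
FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


class EntityFilename(NamedTuple):
    slug: str
    extracted_at: datetime


def parse_entity_filename(filename: str) -> EntityFilename:
    """Split a person entity filename into its slug and UTC timestamp."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Not a person entity filename: {filename!r}")
    extracted_at = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ")
    return EntityFilename(match["slug"], extracted_at.replace(tzinfo=timezone.utc))
```

Comparing the parsed `extracted_at` values of two files is one way to decide which file is the newer one.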
Duplicate Detection
Indicators of Duplicates
- Same LinkedIn slug (with different timestamps)
- Same person name with different slug formats
- Same LinkedIn URL in extraction_metadata.linkedin_url
- Matching unique identifiers (ORCID, ISNI, email)
Detection Commands
# List slugs that appear in more than one file (same slug, different timestamps)
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d
# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
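The shell commands above only catch files that share an identical slug. A rough sketch of how slug-format variations (with or without an ID suffix) might also be surfaced follows. The suffix heuristic is an assumption, not a LinkedIn rule, and the function names are hypothetical. Results are candidates only: two different people can share a name, so each group still needs manual confirmation.

```python
import re
from collections import defaultdict

# Heuristic (an assumption): a trailing "-token" that contains at least one
# digit, such as "-5695436" or "-a4119683", is treated as a LinkedIn ID suffix.
ID_SUFFIX_RE = re.compile(r"-[0-9a-z]*\d[0-9a-z]*$")


def base_slug(filename: str) -> str:
    """Return the slug with the timestamp and any ID-like suffix removed."""
    slug = filename.rsplit("_", 1)[0]
    return ID_SUFFIX_RE.sub("", slug)


def find_duplicate_candidates(filenames):
    """Group filenames by base slug and keep groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        if name.endswith(".json"):
            groups[base_slug(name)].append(name)
    return {base: sorted(names) for base, names in groups.items() if len(names) > 1}
```

A typical call would pass `os.listdir("data/custodian/person/entity/")` as `filenames`.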
Merge Strategy
Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE
When merging duplicates:
- Keep the NEWER file as the base (more recent extraction)
- Preserve ALL data from both files (additive only - per Rule 5 in AGENTS.md)
- Use newer values for conflicting scalar fields
- Merge arrays (deduplicate where appropriate)
- Document merge in provenance
Merge Priority (Newer Wins for Conflicts)
| Field Type | Merge Strategy |
|---|---|
| extraction_metadata | Keep newer, note older in previous_extractions |
| profile_data scalars | Newer value wins |
| profile_data.experience[] | Merge arrays, dedupe by company+title+dates |
| profile_data.education[] | Merge arrays, dedupe by institution+degree |
| profile_data.skills[] | Union of all skills |
| contact_data | Keep more complete version |
| heritage_sector_relevance | Keep more detailed assessment |
| heritage_relevant_experience | Preserve if newer file lacks equivalent |
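The array rows of the table can be sketched as follows. The field names used as keys (`company`, `title`, `dates`, `institution`, `degree`) are assumptions taken from the table wording and may not match the actual profile schema. Case-insensitive matching is also an assumption, and both function names are hypothetical.

```python
def merge_records(older, newer, key_fields):
    """Union two lists of dicts; entries with equal key fields count as one.

    When both files contain the same entry, fields from the newer entry win,
    and fields that exist only in the older entry are preserved.
    """
    merged = {}
    # Older entries go in first so that newer entries overwrite on conflict.
    for record in [*older, *newer]:
        key = tuple(str(record.get(field, "")).strip().casefold() for field in key_fields)
        merged[key] = {**merged.get(key, {}), **record}
    return list(merged.values())


def merge_skills(older, newer):
    """Union of skills, case-insensitive, keeping the newer file's spelling."""
    seen = {}
    for skill in [*newer, *older]:
        seen.setdefault(skill.strip().casefold(), skill)
    return list(seen.values())
```

For example, `merge_records(old_experience, new_experience, ("company", "title", "dates"))` would cover the experience row, and `("institution", "degree")` the education row.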
Merge Procedure
Step 1: Identify Duplicates
# List both files
ls -la data/custodian/person/entity/frank-kanhai*
Step 2: Read Both Files
Compare content to understand differences:
- Which has more complete profile_data?
- Which has contact_data or heritage_sector_relevance?
- Which has more recent extraction_date?
Step 3: Create Merged File
Use the newer timestamp for the final filename:
frank-kanhai-a4119683_20251213T160000Z.json (keep this name)
Step 4: Merge Content
{
"extraction_metadata": {
// From newer file
"extraction_date": "2025-12-13T16:00:00Z",
"extraction_method": "exa_contents",
// Add reference to older extraction
"previous_extractions": [
{
"source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
"extraction_date": "2025-12-10T23:00:07Z",
"merged_on": "2025-12-14T00:00:00Z"
}
]
},
"profile_data": {
// Merged content from both files
},
"contact_data": {
// From whichever file has it
},
"heritage_sector_relevance": {
// From whichever file has it, or merge assessments
}
}
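The template above could be produced programmatically along these lines. This is a minimal sketch, not the project's merge tool: `merge_profiles` is a hypothetical name, and it only handles top-level sections and their scalar fields. Where both files hold an array, the newer array is kept as-is, so array merging per the priority table would still be a separate step.

```python
import copy
from datetime import datetime, timezone


def merge_profiles(older, newer, older_filename, merged_on=None):
    """Merge two parsed entity files: newer is the base, nothing is dropped."""
    merged = copy.deepcopy(newer)

    # Additive pass: copy sections and fields that only the older file has.
    for section, old_value in older.items():
        if section == "extraction_metadata":
            continue  # recorded under previous_extractions below
        if section not in merged:
            merged[section] = copy.deepcopy(old_value)
        elif isinstance(old_value, dict) and isinstance(merged[section], dict):
            for field, value in old_value.items():
                # setdefault keeps the newer value when both files have the field.
                merged[section].setdefault(field, copy.deepcopy(value))

    # Provenance: note the older extraction in the merged metadata.
    if merged_on is None:
        merged_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    metadata = merged.setdefault("extraction_metadata", {})
    metadata.setdefault("previous_extractions", []).append({
        "source_file": older_filename,
        "extraction_date": older.get("extraction_metadata", {}).get("extraction_date"),
        "merged_on": merged_on,
    })
    return merged
```

The inputs are deep-copied, so the parsed originals stay unchanged and can be compared against the result before anything is written to disk.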
Step 5: Delete Older File
Only after the merged file has passed validation (Step 6), remove the older file:
rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
Step 6: Validate Merged File
python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"
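The command above confirms only that the file is syntactically valid JSON. Because the merge principle is to preserve all data, it may also help to check that every key path in the older file still exists in the merged file. The helper below is a hypothetical sketch of such a check. It compares keys only, not values or array contents.

```python
def missing_paths(older, merged, prefix=""):
    """Return dotted key paths that exist in `older` but not in `merged`."""
    missing = []
    for key, old_value in older.items():
        path = f"{prefix}{key}"
        if key not in merged:
            missing.append(path)
        elif isinstance(old_value, dict) and isinstance(merged[key], dict):
            # Recurse into nested objects such as profile_data.
            missing.extend(missing_paths(old_value, merged[key], prefix=f"{path}."))
    return missing
```

An empty result means no key from the older file was lost. Load both files with `json.load` and run the check before the older file is removed.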
Example Merge
Before (Two Files)
File 1 (older): frank-kanhai-a4119683_20251210T230007Z.json
{
"extraction_metadata": {
"extraction_date": "2025-12-10T23:00:07Z"
},
"profile_data": {
"name": "Frank Kanhai",
"headline": "Senior Advisor"
}
}
File 2 (newer): frank-kanhai-a4119683_20251213T160000Z.json
{
"extraction_metadata": {
"extraction_date": "2025-12-13T16:00:00Z"
},
"profile_data": {
"name": "Frank Kanhai",
"headline": "Senior Advisor at Nationaal Archief",
"experience": [...]
},
"contact_data": {...}
}
After (Merged File)
Merged: frank-kanhai-a4119683_20251213T160000Z.json
{
"extraction_metadata": {
"extraction_date": "2025-12-13T16:00:00Z",
"previous_extractions": [
{
"source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
"extraction_date": "2025-12-10T23:00:07Z",
"merged_on": "2025-12-14T00:00:00Z"
}
]
},
"profile_data": {
"name": "Frank Kanhai",
"headline": "Senior Advisor at Nationaal Archief",
"experience": [...]
},
"contact_data": {...}
}
Handling Slug Variations
When the same person has files with different slug formats:
| Variation | Example | Resolution |
|---|---|---|
| With/without numeric suffix | tom-de-smet vs tom-de-smet-5695436 | Keep the one matching actual LinkedIn URL |
| Typos | jon-smith vs john-smith | Keep the correct spelling |
| Unicode normalization | muller vs müller | Keep ASCII-normalized version |
Determining Correct Slug
- Check extraction_metadata.linkedin_url in both files
- Use the slug that matches the actual LinkedIn profile URL
- If both are valid (LinkedIn allows multiple URL formats), prefer the one with the numeric suffix, since it is less likely to collide with another person's slug
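The first two checks can be sketched in a few lines. This assumes profile URLs have the common form https://www.linkedin.com/in/{slug}/; `slug_from_linkedin_url` is a hypothetical helper, and lowercasing the slug is an assumption about how filenames are written.

```python
from urllib.parse import unquote, urlparse


def slug_from_linkedin_url(url):
    """Extract the profile slug from a LinkedIn /in/ URL, or None if absent."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "in":
        # unquote handles percent-encoded characters in the slug.
        return unquote(parts[1]).lower()
    return None
```

Comparing the returned slug with the part of the filename before the underscore shows which of two files matches the recorded profile URL. A `None` result (for example for a /company/ URL) suggests the file may not describe a person.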
Non-Person Entity Files
If a file in data/custodian/person/entity/ is NOT a person (e.g., an organization):
Detection
# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
grep -l '"company"' data/custodian/person/entity/*.json | head -5
Resolution
- Do NOT delete - preserves data provenance
- Move to archive with documentation:
mkdir -p data/custodian/person/entity/archive/non_person
mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
data/custodian/person/entity/archive/non_person/
- Create README in archive folder explaining why files were moved
Prevention
To prevent future duplicates:
- Check before creating: Search for existing files by LinkedIn slug
- Use consistent slug format: Prefer {name}-{numeric} format when available
- Update existing files: Instead of creating a new file, update the existing one with a new timestamp
- Document extraction source: Clear provenance prevents confusion
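The "check before creating" step could look like the sketch below, which an extraction script might call before writing a new file. The function name is hypothetical. It matches exact slugs only, so slug-format variations still need the checks described under Handling Slug Variations.

```python
from pathlib import Path


def existing_files_for_slug(entity_dir, slug):
    """List entity files already present for exactly this LinkedIn slug."""
    return sorted(path.name for path in Path(entity_dir).glob(f"{slug}_*.json"))
```

If the returned list is not empty, the existing file should be updated rather than a second one created.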
References
- AGENTS.md: Rule 5 (Never Delete Enriched Data)
- AGENTS.md: Rule 20 (Person Entity Profiles)
- PERSON_ENTITY_PROFILE_FORMAT_RULE.md: Entity file structure
- HERITAGE_SECTOR_RELEVANCE_SCORING.md: Scoring guidelines