# Person Entity Deduplication Rule

**Version**: 1.0.0
**Created**: 2025-12-14
**Applies To**: Person entity profiles in `data/custodian/person/entity/`

---

## Problem Statement

Duplicate person entity files can occur when:
1. The same person is extracted at different times (different timestamps)
2. LinkedIn URL slugs vary (e.g., with/without numeric suffix)
3. Manual extraction overlaps with automated extraction
4. Name variations lead to separate file creation

---

## File Naming Convention

Person entity files follow this pattern:
```
{linkedin-slug}_{ISO-timestamp}.json
```

Examples:
- `frank-kanhai-a4119683_20251210T230007Z.json`
- `frank-kanhai-a4119683_20251213T160000Z.json` (same person, different time)
- `tom-de-smet_20251214T000000Z.json`
- `tom-de-smet-5695436_20251211T073000Z.json` (same person, different slug format)

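
The naming pattern above can be split back into its two parts programmatically. The following is a minimal sketch, not part of the rule itself: the function name `parse_entity_filename` and the regular expression are illustrative assumptions, and the timestamp format is inferred from the examples (compact UTC, `YYYYMMDDTHHMMSSZ`).

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Assumed layout, inferred from the examples: slug, underscore,
# compact UTC timestamp, ".json". The slug match is greedy, so a slug
# that itself contains "_" is still split at the last underscore.
FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


class EntityFilename(NamedTuple):
    slug: str
    extracted_at: datetime


def parse_entity_filename(filename: str) -> EntityFilename:
    """Split a person entity filename into slug and extraction time (hypothetical helper)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a person entity filename: {filename!r}")
    extracted_at = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ")
    return EntityFilename(match["slug"], extracted_at.replace(tzinfo=timezone.utc))
```
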
---

## Duplicate Detection

### Indicators of Duplicates

1. **Same LinkedIn slug** (with different timestamps)
2. **Same person name** with different slug formats
3. **Same LinkedIn URL** in `extraction_metadata.linkedin_url`
4. **Matching unique identifiers** (ORCID, ISNI, email)

### Detection Commands

```bash
# Find slugs that appear in more than one file (same slug, different timestamps).
# Slug variations (e.g. with/without numeric suffix) are not caught by this check.
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d

# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
```

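
To also surface candidates whose slugs differ only by a trailing identifier, files can be grouped by the name part of the slug. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the suffix pattern (a final hyphen-separated token containing at least one digit, as in `a4119683` or `5695436`) is a guess based on the examples in this document. The groups are candidates only; two different people can share a name, so confirm each group against `extraction_metadata.linkedin_url` before merging. A typical call would pass `os.listdir("data/custodian/person/entity/")`.

```python
import re
from collections import defaultdict

# Assumption: an ID suffix is a final "-token" that contains a digit.
ID_SUFFIX_RE = re.compile(r"-(?=[0-9a-z]*\d)[0-9a-z]+$")


def base_name(filename: str) -> str:
    """Return the slug without timestamp and without a trailing ID suffix."""
    slug = filename.rsplit("_", 1)[0]
    return ID_SUFFIX_RE.sub("", slug)


def find_duplicate_groups(filenames):
    """Group filenames by base name; keep only groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        if name.endswith(".json"):
            groups[base_name(name)].append(name)
    return {base: sorted(names) for base, names in groups.items() if len(names) > 1}
```
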
---

## Merge Strategy

### Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE

When merging duplicates:

1. **Keep the NEWER file** as the base (more recent extraction)
2. **Preserve ALL data** from both files (additive only, per Rule 5 in AGENTS.md)
3. **Use newer values** for conflicting scalar fields
4. **Merge arrays** (deduplicate where appropriate)
5. **Document the merge** in provenance

### Merge Priority (Newer Wins for Conflicts)

| Field Type | Merge Strategy |
|------------|----------------|
| `extraction_metadata` | Keep newer, note older in `previous_extractions` |
| `profile_data` scalars | Newer value wins |
| `profile_data.experience[]` | Merge arrays, dedupe by company+title+dates |
| `profile_data.education[]` | Merge arrays, dedupe by institution+degree |
| `profile_data.skills[]` | Union of all skills |
| `contact_data` | Keep more complete version |
| `heritage_sector_relevance` | Keep more detailed assessment |
| `heritage_relevant_experience` | Preserve if newer file lacks equivalent |

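
The priority table can be sketched as a single merge function. This is an illustration under several assumptions, not a reference implementation: the entry keys `company`, `title`, `dates`, `institution`, and `degree` are guessed from the table wording, skills are assumed to be plain strings, and "more complete" or "more detailed" is approximated by counting non-empty values. Provenance (`previous_extractions`) is deliberately left out of this function.

```python
import copy

# Assumed entry keys used for deduplication, taken from the table wording.
ARRAY_KEYS = {
    "experience": ("company", "title", "dates"),
    "education": ("institution", "degree"),
}


def _dedupe(items, key_fields):
    """Keep the first entry for each combination of key fields."""
    seen, unique = set(), []
    for item in items:
        key = tuple(repr(item.get(field)) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def _filled(value):
    """Count non-empty leaf values; a rough proxy for 'more complete'."""
    if isinstance(value, dict):
        return sum(_filled(v) for v in value.values())
    if isinstance(value, list):
        return sum(_filled(v) for v in value)
    return 0 if value in (None, "") else 1


def merge_profiles(older: dict, newer: dict) -> dict:
    """Merge two profiles with the newer file as base; inputs are not modified."""
    merged = copy.deepcopy(newer)
    old_pd = copy.deepcopy(older.get("profile_data", {}))
    new_pd = merged.setdefault("profile_data", {})
    for key, old_value in old_pd.items():
        if key in ARRAY_KEYS:
            new_pd[key] = _dedupe((new_pd.get(key) or []) + old_value, ARRAY_KEYS[key])
        elif key == "skills":
            # Union that keeps first-seen order, newer skills first.
            new_pd[key] = list(dict.fromkeys((new_pd.get(key) or []) + old_value))
        elif key not in new_pd:
            new_pd[key] = old_value  # older-only field is preserved
        # otherwise: conflicting scalar, newer value wins
    for section in ("contact_data", "heritage_sector_relevance"):
        if _filled(older.get(section)) > _filled(newer.get(section)):
            merged[section] = copy.deepcopy(older[section])
    if "heritage_relevant_experience" in older and "heritage_relevant_experience" not in merged:
        merged["heritage_relevant_experience"] = copy.deepcopy(
            older["heritage_relevant_experience"]
        )
    return merged
```
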
---

## Merge Procedure

### Step 1: Identify Duplicates

```bash
# List both files
ls -la data/custodian/person/entity/frank-kanhai*
```

### Step 2: Read Both Files

Compare content to understand differences:
- Which has more complete `profile_data`?
- Which has `contact_data` or `heritage_sector_relevance`?
- Which has more recent `extraction_date`?

### Step 3: Create Merged File

Use the **newer timestamp** for the final filename:

```
frank-kanhai-a4119683_20251213T160000Z.json (keep this name)
```

### Step 4: Merge Content

The template below shows the target structure. The `//` comments are annotations for this guide only; JSON has no comment syntax, so they must not appear in the actual file.

```json
{
  "extraction_metadata": {
    // From newer file
    "extraction_date": "2025-12-13T16:00:00Z",
    "extraction_method": "exa_contents",
    // Add reference to older extraction
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    // Merged content from both files
  },
  "contact_data": {
    // From whichever file has it
  },
  "heritage_sector_relevance": {
    // From whichever file has it, or merge assessments
  }
}
```

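
The `previous_extractions` entry shown above can be produced mechanically. A minimal sketch, assuming a hypothetical helper `record_previous_extraction` and assuming that any history already present in the older file should be carried over as well (in keeping with the preserve-all-data principle):

```python
from datetime import datetime, timezone


def record_previous_extraction(merged, older, older_filename, merged_on=None):
    """Append a provenance entry for the older file to the merged profile.

    `merged_on` is expected to be a UTC datetime; it defaults to the current time.
    """
    merged_on = merged_on or datetime.now(timezone.utc)
    older_meta = older.get("extraction_metadata", {})
    history = merged.setdefault("extraction_metadata", {}).setdefault(
        "previous_extractions", []
    )
    # Assumption: history from an earlier merge of the older file is kept too.
    history.extend(older_meta.get("previous_extractions", []))
    history.append(
        {
            "source_file": older_filename,
            "extraction_date": older_meta.get("extraction_date"),
            "merged_on": merged_on.strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
    )
    return merged
```

The function modifies `merged` in place and returns it, so it can be chained after whatever routine produced the merged content.
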
### Step 5: Delete Older File

Delete the older file only after the merge is complete and the merged file has passed the validation in Step 6:

```bash
rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
```

### Step 6: Validate Merged File

This command checks JSON syntax only; it does not verify that all data from the older file was carried over.

```bash
python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"
```

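
A syntax check alone does not confirm that the merge was recorded. As a slightly stricter gate before deleting the older file, something like the following could be used. It is a sketch only: the function name is hypothetical, and treating `profile_data` as a required key is an assumption based on the examples in this document.

```python
import json
from pathlib import Path


def merged_file_is_valid(merged_path, older_filename) -> bool:
    """Return True if the merged file parses and references the older file."""
    try:
        data = json.loads(Path(merged_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing/unreadable file, bad encoding, or invalid JSON.
        return False
    if not isinstance(data, dict) or "profile_data" not in data:
        return False
    history = data.get("extraction_metadata", {}).get("previous_extractions", [])
    return any(entry.get("source_file") == older_filename for entry in history)
```
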
---

## Example Merge

### Before (Two Files)

**File 1** (older): `frank-kanhai-a4119683_20251210T230007Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-10T23:00:07Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor"
  }
}
```

**File 2** (newer): `frank-kanhai-a4119683_20251213T160000Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```

### After (Merged File)

**Merged**: `frank-kanhai-a4119683_20251213T160000Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z",
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```

---

## Handling Slug Variations

When the same person has files with different slug formats:

| Variation | Example | Resolution |
|-----------|---------|------------|
| With/without numeric suffix | `tom-de-smet` vs `tom-de-smet-5695436` | Keep the one matching actual LinkedIn URL |
| Typos | `jon-smith` vs `john-smith` | Keep the correct spelling |
| Unicode normalization | `muller` vs `müller` | Keep ASCII-normalized version |

### Determining Correct Slug

1. Check `extraction_metadata.linkedin_url` in both files
2. Use the slug that matches the actual LinkedIn profile URL
3. If both are valid (LinkedIn allows multiple URL formats), prefer the one with numeric suffix (more unique)

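
Two of these checks are easy to automate: reading the slug out of `extraction_metadata.linkedin_url` and producing the ASCII-normalized form. The sketch below assumes profile URLs of the form `https://<host>/in/<slug>`; both helper names are hypothetical. Note that NFKD decomposition only strips combining marks (so `müller` becomes `muller`); letters without a decomposition, such as `ß` or `ø`, are dropped entirely and would need a separate transliteration step.

```python
import unicodedata
from urllib.parse import unquote, urlparse


def slug_from_linkedin_url(url: str) -> str:
    """Extract the profile slug from a LinkedIn URL (assumes a /in/<slug> path)."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {url!r}")
    return unquote(parts[1])  # decode percent-encoded characters


def ascii_slug(slug: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", slug)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```
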
---

## Non-Person Entity Files

If a file in `data/custodian/person/entity/` is NOT a person (e.g., an organization):

### Detection

```bash
# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
# Weaker hint: a "company" key may also appear in genuine person profiles, so review matches manually
grep -l '"company"' data/custodian/person/entity/*.json | head -5
```

### Resolution

1. **Do NOT delete**: this preserves data provenance
2. **Move to archive** with documentation:

   ```bash
   mkdir -p data/custodian/person/entity/archive/non_person
   mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
      data/custodian/person/entity/archive/non_person/
   ```

3. **Create README** in archive folder explaining why files were moved

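
The move and the README note can be done in one step so that the documentation is never forgotten. This is a sketch only: `archive_non_person` is a hypothetical helper, and the README filename (`README.md`) and its one-line-per-file format are assumptions, since the rule does not prescribe them.

```python
import shutil
from pathlib import Path


def archive_non_person(file_path, reason, entity_dir="data/custodian/person/entity"):
    """Move a non-person file into archive/non_person and log why in the README."""
    archive_dir = Path(entity_dir) / "archive" / "non_person"
    archive_dir.mkdir(parents=True, exist_ok=True)
    source = Path(file_path)
    target = archive_dir / source.name
    if target.exists():
        # Never overwrite an already archived file.
        raise FileExistsError(f"already archived: {target}")
    shutil.move(str(source), str(target))
    # Assumed README name and format: one bullet per archived file.
    with open(archive_dir / "README.md", "a", encoding="utf-8") as readme:
        readme.write(f"- `{source.name}`: {reason}\n")
    return target
```
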
---

## Prevention

To prevent future duplicates:

1. **Check before creating**: Search for existing files by LinkedIn slug
2. **Use consistent slug format**: Prefer `{name}-{numeric}` format when available
3. **Update existing files**: Instead of creating a new file, update the existing one with a new timestamp
4. **Document extraction source**: Clear provenance prevents confusion

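
The "check before creating" step could look like the following in an extraction script. It is a minimal sketch with a hypothetical function name, and it matches the exact slug only; variants such as `tom-de-smet` versus `tom-de-smet-5695436` still need a separate name-based search.

```python
import glob
import os


def existing_files_for_slug(entity_dir, slug):
    """List existing entity files for exactly this slug (any timestamp)."""
    pattern = os.path.join(
        glob.escape(str(entity_dir)),   # escape so the path is taken literally
        glob.escape(slug) + "_*.json",  # only the timestamp part is a wildcard
    )
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```
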
---

## References

- **AGENTS.md**: Rule 5 (Never Delete Enriched Data)
- **AGENTS.md**: Rule 20 (Person Entity Profiles)
- **PERSON_ENTITY_PROFILE_FORMAT_RULE.md**: Entity file structure
- **HERITAGE_SECTOR_RELEVANCE_SCORING.md**: Scoring guidelines