# Person Entity Deduplication Rule
**Version**: 1.0.0
**Created**: 2025-12-14
**Applies To**: Person entity profiles in `data/custodian/person/entity/`
---
## Problem Statement
Duplicate person entity files can occur when:
1. The same person is extracted at different times (different timestamps)
2. LinkedIn URL slugs vary (e.g., with/without numeric suffix)
3. Manual extraction overlaps with automated extraction
4. Name variations lead to separate file creation
---
## File Naming Convention
Person entity files follow this pattern:
```
{linkedin-slug}_{ISO-timestamp}.json
```
Examples:
- `frank-kanhai-a4119683_20251210T230007Z.json`
- `frank-kanhai-a4119683_20251213T160000Z.json` (same person, different time)
- `tom-de-smet_20251214T000000Z.json`
- `tom-de-smet-5695436_20251211T073000Z.json` (same person, different slug format)
---
## Duplicate Detection
### Indicators of Duplicates
1. **Same LinkedIn slug** (with different timestamps)
2. **Same person name** with different slug formats
3. **Same LinkedIn URL** in `extraction_metadata.linkedin_url`
4. **Matching unique identifiers** (ORCID, ISNI, email)
### Detection Commands
```bash
# Find potential duplicates by name prefix
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d
# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
```
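The same grouping can be done in Python when shell tools are unavailable. This is a hypothetical helper (not part of any existing tooling) that groups filenames by the slug portion before the first underscore:

```python
# Hypothetical helper: group person entity filenames by LinkedIn slug
# (the part before the first "_") and report slugs with more than one file.
from collections import defaultdict

def find_duplicate_slugs(filenames):
    """Return {slug: [filenames]} for slugs that appear more than once."""
    groups = defaultdict(list)
    for name in filenames:
        if not name.endswith(".json"):
            continue
        slug = name.split("_", 1)[0]  # e.g. "frank-kanhai-a4119683"
        groups[slug].append(name)
    return {slug: sorted(files) for slug, files in groups.items() if len(files) > 1}
```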
---
## Merge Strategy
### Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE
When merging duplicates:
1. **Keep the NEWER file** as the base (more recent extraction)
2. **Preserve ALL data** from both files (additive only - per Rule 5 in AGENTS.md)
3. **Use newer values** for conflicting scalar fields
4. **Merge arrays** (deduplicate where appropriate)
5. **Document merge** in provenance
### Merge Priority (Newer Wins for Conflicts)
| Field Type | Merge Strategy |
|------------|----------------|
| `extraction_metadata` | Keep newer, note older in `previous_extractions` |
| `profile_data` scalars | Newer value wins |
| `profile_data.experience[]` | Merge arrays, dedupe by company+title+dates |
| `profile_data.education[]` | Merge arrays, dedupe by institution+degree |
| `profile_data.skills[]` | Union of all skills |
| `contact_data` | Keep more complete version |
| `heritage_sector_relevance` | Keep more detailed assessment |
| `heritage_relevant_experience` | Preserve if newer file lacks equivalent |
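The table above can be sketched in Python. This is a minimal illustration of the "newer wins for scalars, merge-and-dedupe for arrays" policy, assuming the field names shown in this document; it is not a canonical implementation:

```python
# Hypothetical sketch of the merge-priority table: newer scalars win,
# arrays are merged with per-field deduplication keys, skills are unioned.
def dedupe_by_keys(items, keys):
    """Merge array entries, keeping the first occurrence of each key tuple."""
    seen, merged = set(), []
    for item in items:
        fingerprint = tuple(item.get(k) for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            merged.append(item)
    return merged

def merge_profile_data(older, newer):
    """Merge two profile_data dicts; newer values take precedence."""
    merged = dict(older)
    # Scalars (and nested dicts): newer value wins on conflict.
    merged.update({k: v for k, v in newer.items() if not isinstance(v, list)})
    # Arrays: newer entries first, deduplicated by the keys from the table.
    merged["experience"] = dedupe_by_keys(
        newer.get("experience", []) + older.get("experience", []),
        ("company", "title", "dates"),
    )
    merged["education"] = dedupe_by_keys(
        newer.get("education", []) + older.get("education", []),
        ("institution", "degree"),
    )
    # Skills: union of both files.
    merged["skills"] = sorted(set(older.get("skills", [])) | set(newer.get("skills", [])))
    return merged
```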
---
## Merge Procedure
### Step 1: Identify Duplicates
```bash
# List both files
ls -la data/custodian/person/entity/frank-kanhai*
```
### Step 2: Read Both Files
Compare content to understand differences:
- Which has more complete `profile_data`?
- Which has `contact_data` or `heritage_sector_relevance`?
- Which has more recent `extraction_date`?
### Step 3: Create Merged File
Use the **newer timestamp** for the final filename:
```
frank-kanhai-a4119683_20251213T160000Z.json (keep this name)
```
### Step 4: Merge Content
```json
{
  "extraction_metadata": {
    // From newer file
    "extraction_date": "2025-12-13T16:00:00Z",
    "extraction_method": "exa_contents",
    // Add reference to older extraction
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    // Merged content from both files
  },
  "contact_data": {
    // From whichever file has it
  },
  "heritage_sector_relevance": {
    // From whichever file has it, or merge assessments
  }
}
```
### Step 5: Validate Merged File
```bash
python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"
```
### Step 6: Delete Older File
Only after the merged file validates:
```bash
rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
```
---
## Example Merge
### Before (Two Files)
**File 1** (older): `frank-kanhai-a4119683_20251210T230007Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-10T23:00:07Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor"
  }
}
```
**File 2** (newer): `frank-kanhai-a4119683_20251213T160000Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```
### After (Merged File)
**Merged**: `frank-kanhai-a4119683_20251213T160000Z.json`
```json
{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z",
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}
```
---
## Handling Slug Variations
When the same person has files with different slug formats:
| Variation | Example | Resolution |
|-----------|---------|------------|
| With/without numeric suffix | `tom-de-smet` vs `tom-de-smet-5695436` | Keep the one matching actual LinkedIn URL |
| Typos | `jon-smith` vs `john-smith` | Keep the correct spelling |
| Unicode normalization | `muller` vs `müller` | Keep ASCII-normalized version |
### Determining Correct Slug
1. Check `extraction_metadata.linkedin_url` in both files
2. Use the slug that matches the actual LinkedIn profile URL
3. If both are valid (LinkedIn allows multiple URL formats), prefer the one with numeric suffix (more unique)
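The slug check above can be automated. This hypothetical sketch derives the canonical slug from the recorded `extraction_metadata.linkedin_url` (assuming the standard `/in/<slug>/` URL shape) and keeps files whose filename slug matches it:

```python
# Hypothetical helper: derive the canonical slug from the recorded LinkedIn
# URL so the file whose filename slug matches the live profile can be kept.
from urllib.parse import urlparse

def slug_from_linkedin_url(url):
    """Return the profile slug from a URL like https://www.linkedin.com/in/<slug>/."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[1] if len(parts) >= 2 and parts[0] == "in" else None

def pick_canonical_files(candidates):
    """candidates: {filename: linkedin_url}. Return filenames whose slug
    (the part before the first "_") matches the slug in the recorded URL."""
    return [
        name for name, url in candidates.items()
        if slug_from_linkedin_url(url) == name.split("_", 1)[0]
    ]
```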
---
## Non-Person Entity Files
If a file in `data/custodian/person/entity/` is NOT a person (e.g., an organization):
### Detection
```bash
# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
# Note: "company" also appears inside person experience entries,
# so treat these matches as candidates and review them manually
grep -l '"company"' data/custodian/person/entity/*.json | head -5
```
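Because the grep heuristics produce false positives, a structural check on the parsed JSON is more precise. This is a hypothetical heuristic assuming the field names used elsewhere in this rule, not an authoritative classifier:

```python
# Hypothetical heuristic: flag entities that carry organization indicators.
# Assumes the "type" and "profile_data" fields described in this rule.
def looks_like_organization(entity):
    """Return True if the parsed entity JSON looks like an organization."""
    if entity.get("type") == "Organization":
        return True
    profile = entity.get("profile_data", {})
    # A person profile is expected to carry a "name"; a top-level "company"
    # without one suggests an organization was filed in the person directory.
    return "company" in profile and "name" not in profile
```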
### Resolution
1. **Do NOT delete** - preserves data provenance
2. **Move to archive** with documentation:
```bash
mkdir -p data/custodian/person/entity/archive/non_person
mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
data/custodian/person/entity/archive/non_person/
```
3. **Create README** in archive folder explaining why files were moved
---
## Prevention
To prevent future duplicates:
1. **Check before creating**: Search for existing files by LinkedIn slug
2. **Use consistent slug format**: Prefer `{name}-{numeric}` format when available
3. **Update existing files**: Instead of creating new file, update existing one with new timestamp
4. **Document extraction source**: Clear provenance prevents confusion
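Point 1 above can be enforced with a pre-creation guard. This hypothetical sketch checks the entity directory (the path stated in this rule) for existing files with the same slug before a new file is written:

```python
# Hypothetical pre-creation guard: before writing a new entity file,
# look for an existing file with the same LinkedIn slug and update
# that file instead of creating a duplicate.
from pathlib import Path

ENTITY_DIR = Path("data/custodian/person/entity")  # path from this rule

def existing_files_for_slug(slug, entity_dir=ENTITY_DIR):
    """Return any entity files already on disk for this LinkedIn slug."""
    return sorted(p.name for p in entity_dir.glob(f"{slug}_*.json"))
```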
---
## References
- **AGENTS.md**: Rule 5 (Never Delete Enriched Data)
- **AGENTS.md**: Rule 20 (Person Entity Profiles)
- **PERSON_ENTITY_PROFILE_FORMAT_RULE.md**: Entity file structure
- **HERITAGE_SECTOR_RELEVANCE_SCORING.md**: Scoring guidelines