kempersc/glam


kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00


Person Entity Deduplication Rule

Version: 1.0.0
Created: 2025-12-14
Applies To: Person entity profiles in data/custodian/person/entity/

Problem Statement

Duplicate person entity files can occur when:

The same person is extracted at different times (different timestamps)
LinkedIn URL slugs vary (e.g., with/without numeric suffix)
Manual extraction overlaps with automated extraction
Name variations lead to separate file creation

File Naming Convention

Person entity files follow this pattern:

{linkedin-slug}_{ISO-timestamp}.json

Examples:

frank-kanhai-a4119683_20251210T230007Z.json
frank-kanhai-a4119683_20251213T160000Z.json (same person, different time)
tom-de-smet_20251214T000000Z.json
tom-de-smet-5695436_20251211T073000Z.json (same person, different slug format)
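
The naming pattern above can be split mechanically into slug and timestamp. The following is a minimal sketch, assuming the timestamp always has the compact `YYYYMMDDTHHMMSSZ` form seen in the examples; the names `FILENAME_RE`, `EntityFilename` and `parse_entity_filename` are hypothetical helpers, not part of the repository.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Assumed from the convention above: {linkedin-slug}_{ISO-timestamp}.json
FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


class EntityFilename(NamedTuple):
    slug: str
    timestamp: datetime


def parse_entity_filename(filename: str) -> EntityFilename:
    """Split a person entity filename into its slug and UTC timestamp."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a person entity filename: {filename!r}")
    # The trailing Z marks UTC, so attach the UTC timezone explicitly.
    ts = datetime.strptime(match.group("ts"), "%Y%m%dT%H%M%SZ")
    return EntityFilename(match.group("slug"), ts.replace(tzinfo=timezone.utc))
```

Comparing the parsed timestamps of two files with the same slug shows which one counts as the newer extraction.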

Duplicate Detection

Indicators of Duplicates

Same LinkedIn slug (with different timestamps)
Same person name with different slug formats
Same LinkedIn URL in extraction_metadata.linkedin_url
Matching unique identifiers (ORCID, ISNI, email)

Detection Commands

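# Find slugs that appear in more than one file (same slug, different timestamps).
# Slug variants such as tom-de-smet vs tom-de-smet-5695436 are NOT caught by this check.
ls data/custodian/person/entity/ | cut -d'_' -f1 | sort | uniq -d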

# Find files for a specific person
ls data/custodian/person/entity/ | grep "frank-kanhai"
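
The shell commands above only compare filenames. To catch files that share a LinkedIn URL under different slugs, the parsed JSON can be grouped by `extraction_metadata.linkedin_url`. This is a sketch under assumptions: `normalize_linkedin_url` and `group_by_linkedin_url` are hypothetical names, and the normalization rules (lowercasing, dropping scheme, `www.`, query string and trailing slash) are a guess at what counts as "the same URL".

```python
from collections import defaultdict
from typing import Any, Dict, List


def normalize_linkedin_url(url: str) -> str:
    """Reduce a profile URL to a comparable form (assumed normalization)."""
    url = url.strip().lower().split("?", 1)[0].rstrip("/")
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def group_by_linkedin_url(records: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each normalized URL to the filenames sharing it (duplicates only).

    `records` maps a filename to its already-parsed JSON content.
    """
    groups = defaultdict(list)
    for filename, data in records.items():
        meta = data.get("extraction_metadata") if isinstance(data, dict) else None
        url = meta.get("linkedin_url") if isinstance(meta, dict) else None
        if isinstance(url, str) and url.strip():
            groups[normalize_linkedin_url(url)].append(filename)
    # Keep only URLs that occur in more than one file.
    return {url: sorted(names) for url, names in groups.items() if len(names) > 1}
```

Files without a `linkedin_url` are skipped rather than reported, so this check complements the filename-based one instead of replacing it.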

Merge Strategy

Principle: PRESERVE ALL DATA, KEEP NEWER STRUCTURE

When merging duplicates:

Keep the NEWER file as the base (more recent extraction)
Preserve ALL data from both files (additive only - per Rule 5 in AGENTS.md)
Use newer values for conflicting scalar fields
Merge arrays (deduplicate where appropriate)
Document merge in provenance

Merge Priority (Newer Wins for Conflicts)

| Field Type | Merge Strategy |
|---|---|
| `extraction_metadata` | Keep newer, note older in `previous_extractions` |
| `profile_data` scalars | Newer value wins |
| `profile_data.experience[]` | Merge arrays, dedupe by company+title+dates |
| `profile_data.education[]` | Merge arrays, dedupe by institution+degree |
| `profile_data.skills[]` | Union of all skills |
| `contact_data` | Keep more complete version |
| `heritage_sector_relevance` | Keep more detailed assessment |
| `heritage_relevant_experience` | Preserve if newer file lacks equivalent |
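
The `profile_data` rows of the table could be implemented roughly as follows. This is an illustrative sketch, not the project's merge code. The key names inside array items (`company`, `title`, `dates`, `institution`, `degree`) are assumptions taken from the wording of the table, and the choice to let an empty newer value fall back to the older one is an assumption made to stay consistent with the additive-only principle.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed key names used to recognise duplicate array items.
ARRAY_KEYS: Dict[str, Tuple[str, ...]] = {
    "experience": ("company", "title", "dates"),
    "education": ("institution", "degree"),
    "skills": (),
}


def _item_key(item: Any, fields: Tuple[str, ...]) -> Any:
    """Build a case-insensitive identity for one array item."""
    if isinstance(item, dict) and fields:
        return tuple(str(item.get(f, "")).strip().casefold() for f in fields)
    if isinstance(item, str):
        return item.strip().casefold()
    return json.dumps(item, sort_keys=True, default=str)


def _dedupe(items: List[Any], fields: Tuple[str, ...]) -> List[Any]:
    seen, unique = set(), []
    for item in items:
        key = _item_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def merge_profile_data(older: Dict[str, Any], newer: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two profile_data dicts: additive, newer wins on conflicts."""
    merged = dict(older)  # start from older so older-only fields survive
    for key, value in newer.items():
        if key in ARRAY_KEYS and isinstance(value, list):
            old_items = older.get(key)
            if not isinstance(old_items, list):
                old_items = []
            # Newer items come first, so they win when duplicates collide.
            merged[key] = _dedupe(value + old_items, ARRAY_KEYS[key])
        elif value in (None, "", [], {}) and key in older:
            continue  # assumption: an empty newer value never erases older data
        else:
            merged[key] = value
    return merged
```

Placing newer items ahead of older ones means that when two experience entries match on company, title and dates, the newer entry with any extra fields is the one retained.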

Merge Procedure

Step 1: Identify Duplicates

# List both files
ls -la data/custodian/person/entity/frank-kanhai*

Step 2: Read Both Files

Compare content to understand differences:

Which has more complete profile_data?
Which has contact_data or heritage_sector_relevance?
Which has more recent extraction_date?

Step 3: Create Merged File

Use the newer timestamp for the final filename:

frank-kanhai-a4119683_20251213T160000Z.json  (keep this name)

Step 4: Merge Content

{
  "extraction_metadata": {
    // From newer file
    "extraction_date": "2025-12-13T16:00:00Z",
    "extraction_method": "exa_contents",
    // Add reference to older extraction
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    // Merged content from both files
  },
  "contact_data": {
    // From whichever file has it
  },
  "heritage_sector_relevance": {
    // From whichever file has it, or merge assessments
  }
}
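
The `previous_extractions` bookkeeping shown above can be produced programmatically. A minimal sketch, assuming the three keys from the template (`source_file`, `extraction_date`, `merged_on`); the function name `record_previous_extraction` is hypothetical, and carrying over any history the older file had already accumulated is an assumption, not something this rule states.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def record_previous_extraction(
    newer_meta: Dict[str, Any],
    older_meta: Dict[str, Any],
    older_filename: str,
    merged_on: Optional[str] = None,
) -> Dict[str, Any]:
    """Return newer extraction_metadata with the older extraction noted."""
    if merged_on is None:
        merged_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    merged = copy.deepcopy(newer_meta)  # never mutate the caller's data
    history = list(merged.get("previous_extractions", []))
    # Assumption: history recorded in the older file is kept as well.
    history.extend(copy.deepcopy(older_meta.get("previous_extractions", [])))
    history.append(
        {
            "source_file": older_filename,
            "extraction_date": older_meta.get("extraction_date"),
            "merged_on": merged_on,
        }
    )
    merged["previous_extractions"] = history
    return merged
```

Passing `merged_on` explicitly keeps the output reproducible; leaving it out stamps the current UTC time.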

Step 5: Validate Merged File

Confirm that the merged file parses before removing anything:

python3 -m json.tool data/custodian/person/entity/frank-kanhai-a4119683_20251213T160000Z.json > /dev/null && echo "Valid JSON"

Step 6: Delete Older File

Only after the merged file has validated successfully:

rm data/custodian/person/entity/frank-kanhai-a4119683_20251210T230007Z.json
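
Validation and deletion can be tied together so that the older file is removed only when the merged file checks out. The sketch below is one possible guard, not an existing script in the repository. `validate_then_delete` and `REQUIRED_KEYS` are hypothetical, and treating `extraction_metadata` and `profile_data` as required top-level keys is an assumption based on the examples in this rule.

```python
import json
from pathlib import Path

# Assumed minimum structure of a merged person entity file.
REQUIRED_KEYS = ("extraction_metadata", "profile_data")


def validate_then_delete(merged_path, older_path) -> None:
    """Delete older_path only if merged_path is valid and documents the merge."""
    merged_path, older_path = Path(merged_path), Path(older_path)
    with merged_path.open(encoding="utf-8") as fh:
        data = json.load(fh)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("merged file must contain a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"merged file lacks keys: {missing}")
    meta = data["extraction_metadata"]
    history = meta.get("previous_extractions", []) if isinstance(meta, dict) else []
    sources = [e.get("source_file") for e in history if isinstance(e, dict)]
    if older_path.name not in sources:
        raise ValueError("merge is not documented in previous_extractions")
    older_path.unlink()  # reached only when every check above passed
```

Any failed check raises before `unlink()` runs, so an invalid or undocumented merge leaves both files in place.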

Example Merge

Before (Two Files)

File 1 (older): frank-kanhai-a4119683_20251210T230007Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-10T23:00:07Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor"
  }
}

File 2 (newer): frank-kanhai-a4119683_20251213T160000Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z"
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}

After (Merged File)

Merged: frank-kanhai-a4119683_20251213T160000Z.json

{
  "extraction_metadata": {
    "extraction_date": "2025-12-13T16:00:00Z",
    "previous_extractions": [
      {
        "source_file": "frank-kanhai-a4119683_20251210T230007Z.json",
        "extraction_date": "2025-12-10T23:00:07Z",
        "merged_on": "2025-12-14T00:00:00Z"
      }
    ]
  },
  "profile_data": {
    "name": "Frank Kanhai",
    "headline": "Senior Advisor at Nationaal Archief",
    "experience": [...]
  },
  "contact_data": {...}
}

Handling Slug Variations

When the same person has files with different slug formats:

| Variation | Example | Resolution |
|---|---|---|
| With/without numeric suffix | `tom-de-smet` vs `tom-de-smet-5695436` | Keep the one matching actual LinkedIn URL |
| Typos | `jon-smith` vs `john-smith` | Keep the correct spelling |
| Unicode normalization | `muller` vs `müller` | Keep ASCII-normalized version |

Determining Correct Slug

Check extraction_metadata.linkedin_url in both files
Use the slug that matches the actual LinkedIn profile URL
If both are valid (LinkedIn allows multiple URL formats), prefer the one with the numeric suffix, since it is less likely to collide with another person's slug
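
The three checks above can be expressed as a small selection function. This is a sketch under assumptions: `slug_from_linkedin_url` and `choose_slug` are hypothetical names, profile URLs are assumed to have the form `.../in/<slug>`, and a "numeric suffix" is interpreted loosely as a final hyphen-separated segment containing at least one digit (so `a4119683` also counts).

```python
import re
from typing import Dict, Optional
from urllib.parse import unquote, urlparse

# Assumed meaning of "numeric suffix": last segment contains a digit.
SUFFIX_RE = re.compile(r"-[0-9a-z]*\d[0-9a-z]*$")


def slug_from_linkedin_url(url: str) -> Optional[str]:
    """Extract the slug from a URL of the assumed form .../in/<slug>."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return unquote(parts[1]).lower()
    return None


def choose_slug(candidates: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the slug to keep.

    `candidates` maps each file's slug to the linkedin_url stored in that file.
    Returns None when no slug matches its recorded URL (manual review needed).
    """
    valid = [
        slug
        for slug, url in candidates.items()
        if url and slug_from_linkedin_url(url) == slug
    ]
    if not valid:
        return None
    with_suffix = [slug for slug in valid if SUFFIX_RE.search(slug)]
    return (with_suffix or valid)[0]
```

Returning `None` instead of guessing keeps the typo case (`jon-smith` vs `john-smith`) safe: only a slug confirmed by its own recorded URL is ever selected.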

Non-Person Entity Files

If a file in data/custodian/person/entity/ is NOT a person (e.g., an organization):

Detection

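# Check if file contains organization indicators
grep -l '"type": "Organization"' data/custodian/person/entity/*.json
# Weaker signal: a "company" key may also appear in genuine person profiles,
# so treat these hits as candidates for manual review only
grep -l '"company"' data/custodian/person/entity/*.json | head -5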
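
The first `grep` matches the text `"type": "Organization"` anywhere in a file and depends on exact spacing. A structural check on the parsed JSON avoids the whitespace sensitivity. This is a sketch only: `contains_organization_type` is a hypothetical helper, and the `type` key is assumed from the grep pattern rather than from a documented schema.

```python
from typing import Any


def contains_organization_type(node: Any, top_level_only: bool = False) -> bool:
    """Return True if the parsed JSON carries "type": "Organization".

    With top_level_only=True only the root object is inspected, which avoids
    flagging a person whose nested data describes an organization.
    """
    if isinstance(node, dict):
        if node.get("type") == "Organization":
            return True
        if top_level_only:
            return False
        return any(contains_organization_type(v) for v in node.values())
    if isinstance(node, list) and not top_level_only:
        return any(contains_organization_type(v) for v in node)
    return False
```

The default recursive mode mirrors the behaviour of `grep`; the `top_level_only` mode is stricter and may be the better fit when person profiles embed organization records.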

Resolution

Do NOT delete the file: keeping it preserves data provenance
Move to archive with documentation:

mkdir -p data/custodian/person/entity/archive/non_person
mv data/custodian/person/entity/nationaal-archief_20251213T171606Z.json \
   data/custodian/person/entity/archive/non_person/

Create README in archive folder explaining why files were moved

Prevention

To prevent future duplicates:

Check before creating: Search for existing files by LinkedIn slug
Use consistent slug format: Prefer {name}-{numeric} format when available
Update existing files: Instead of creating a new file, update the existing one and give it the new timestamp
Document extraction source: Clear provenance prevents confusion
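
The "check before creating" step can be sketched as a lookup over the existing filenames. `find_existing_files` is a hypothetical helper. The matching rule (same slug, or one slug equal to the other plus a hyphenated suffix) is a deliberately loose heuristic and an assumption, so its hits are candidates for review rather than confirmed duplicates.

```python
from typing import Iterable, List


def _slug_of(filename: str) -> str:
    """Slug part of {linkedin-slug}_{ISO-timestamp}.json."""
    return filename.rsplit("_", 1)[0]


def _related(a: str, b: str) -> bool:
    # Same slug, or one is the other with an extra "-suffix" appended.
    return a == b or a.startswith(b + "-") or b.startswith(a + "-")


def find_existing_files(filenames: Iterable[str], slug: str) -> List[str]:
    """List existing entity files whose slug equals or extends `slug`."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".json") and _related(_slug_of(name), slug)
    )
```

An empty result suggests that creating a new file is safe; a non-empty one indicates an existing file should be inspected, and probably updated, first.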

References

AGENTS.md: Rule 5 (Never Delete Enriched Data)
AGENTS.md: Rule 20 (Person Entity Profiles)
PERSON_ENTITY_PROFILE_FORMAT_RULE.md: Entity file structure
HERITAGE_SECTOR_RELEVANCE_SCORING.md: Scoring guidelines
