glam/data/person/_entity_resolution_cleanup_report.md
2026-01-11 12:15:27 +01:00

Person Enrichment Cleanup Report

Date: 2026-01-11 02:53

Summary

Critical entity resolution failures were discovered in person profile enrichment: the enrichment script attributed data about different people with similar names to our person profiles.

Issues Found

| Issue | Count | Example |
| --- | --- | --- |
| Birth years from wrong persons | 122 | Carmen Juliá born 1952 (was Venezuelan actress, not UK curator) |
| Wikipedia articles about different people | 42 | Robert Ritter = Nazi doctor, not heritage worker |
| Genealogy sources (historical namesakes) | 8 | Birth year 1922 from geni.com |
| ResearchGate/Academia from wrong researchers | 80+ | Carmen Julia Navarro = hydrogeologist, not curator |
| Social media from random accounts | 150+ | Instagram accounts of different people |

Cleanup Actions

  1. Removed all enriched birth_year claims (234 claims, 124 files)
  2. Removed all social_connection claims (spouse, family from Wikipedia)
  3. Removed all social_media_content claims (Instagram follower counts)
  4. Removed claims from high-risk sources (Wikipedia, IMDb, ResearchGate, Academia.edu, Instagram, TikTok)

Total claims removed: 540+
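
The cleanup actions above amount to a filter over each profile's enriched claims. The following is a minimal sketch of that filter, not the script that was actually run. It assumes each enriched claim is a dict with `claim_type` and `source` keys. Those field names and the `split_claims` helper are hypothetical, since this report does not describe the profile schema.

```python
# Claim types and sources named in the cleanup actions above.
REMOVED_CLAIM_TYPES = {"birth_year", "social_connection", "social_media_content"}
HIGH_RISK_SOURCES = {
    "wikipedia", "imdb", "researchgate", "academia.edu", "instagram", "tiktok",
}


def split_claims(enriched_claims):
    """Partition enriched claims into (kept, removed) lists.

    Assumes every input claim came from enrichment; the field names
    "claim_type" and "source" are illustrative, not the real schema.
    """
    kept, removed = [], []
    for claim in enriched_claims:
        source = str(claim.get("source", "")).lower()
        risky = (
            claim.get("claim_type") in REMOVED_CLAIM_TYPES
            or source in HIGH_RISK_SOURCES
        )
        (removed if risky else kept).append(claim)
    return kept, removed
```

The `removed` list is what an audit trail would record, so that every deletion can be reviewed or reverted later.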

Rule 46 Added

Added a new critical rule to prevent future entity resolution failures:

Rule 46: Entity Resolution - Names Are NEVER Sufficient

Key requirements:

  • Similar or identical names are NEVER sufficient for entity resolution
  • At least 3 of 5 identity attributes must match (career, employer, location, age, education)
  • Any conflicting signal (e.g., "actress" vs "curator") = automatic rejection
  • Genealogy sites = ALWAYS reject
  • Wikipedia = reject unless 4/5 attributes match
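
The requirements above can be read as a small decision procedure. The sketch below is one possible encoding, assuming that attribute comparison has already happened upstream and produced a set of matching and a set of conflicting attributes. The `Candidate` type, the `accept_candidate` function, and the source sets are hypothetical names introduced for illustration. The authoritative wording is in the rule file itself.

```python
from dataclasses import dataclass, field

# The five identity attributes named in Rule 46. Names are deliberately absent.
IDENTITY_ATTRIBUTES = {"career", "employer", "location", "age", "education"}

# Hypothetical source classification; a real list would be longer.
GENEALOGY_SOURCES = {"geni.com"}
WIKIPEDIA_SOURCES = {"wikipedia.org"}


@dataclass
class Candidate:
    source: str                                   # domain of the external source
    matches: set = field(default_factory=set)     # attributes that agree with the profile
    conflicts: set = field(default_factory=set)   # attributes that contradict the profile


def accept_candidate(candidate: Candidate) -> bool:
    """Return True only if the candidate passes the checks listed above."""
    if candidate.source in GENEALOGY_SOURCES:
        return False  # genealogy sites are always rejected
    if candidate.conflicts:
        return False  # any conflicting signal is an automatic rejection
    # Only the five identity attributes count; a matching name adds nothing.
    matched = len(candidate.matches & IDENTITY_ATTRIBUTES)
    required = 4 if candidate.source in WIKIPEDIA_SOURCES else 3
    return matched >= required
```

For instance, a Wikipedia article that agrees on name and age but describes an actress when the profile describes a curator would be built as `Candidate("wikipedia.org", matches={"age"}, conflicts={"career"})` and rejected on the conflict alone.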

Files Modified

  • .opencode/rules/entity-resolution-no-heuristics.md - Enhanced with stricter requirements
  • AGENTS.md - Added Rule 46 summary
  • data/person/_birth_year_removal_log.json - Audit trail of removed claims
  • 207 person profile files cleaned

Remaining Work

The remaining 609 enriched claims (position, education, hobby, award) may still have entity resolution issues but are lower risk. Future enrichment MUST implement the entity resolution validation in Rule 46.