glam/docs/MANUAL_PERSON_ENRICHMENT_WORKFLOW.md
2026-01-11 12:15:27 +01:00

Manual Person Enrichment Workflow

Version: 1.0.0
Created: 2026-01-11
Status: MANDATORY (automated enrichment is PROHIBITED)


⚠️ CRITICAL: Automated Enrichment is Prohibited

Automated web search enrichment has been permanently disabled due to catastrophic entity resolution failures discovered in January 2026:

  • 540+ false claims were attributed to wrong people with similar names
  • Birth years from Venezuelan actresses attributed to UK curators
  • Death years attributed to living people
  • Social media from random namesakes attributed to heritage workers

ALL person enrichment must now be done MANUALLY with human verification that the source refers to the correct person.

See: .opencode/rules/entity-resolution-no-heuristics.md (Rule 46)


Why Manual Enrichment is Required

The Entity Resolution Problem

Web searches for "Carmen Juliá" return data about:

  • Carmen Julia Álvarez (Venezuelan actress, born 1952)
  • Carmen Julia Navarro (Mexican hydrogeologist)
  • Carmen Julia Gutiérrez (Spanish medievalist)

None of these is the actual Carmen Juliá, a UK art curator at New Contemporaries.

Name matching alone CANNOT distinguish between namesakes. Only a human can:

  1. Read the full source context
  2. Cross-reference multiple identity attributes
  3. Detect conflicting signals (actress vs curator, Venezuela vs UK)
  4. Make an informed judgment about entity identity

The Cost of Wrong Data

| Impact | Description |
|--------|-------------|
| Corrupts analysis | Downstream reports use false birth years, wrong affiliations |
| Legal/privacy risk | Attributing data to wrong person violates privacy |
| Destroys trust | Users lose confidence in entire dataset |
| Expensive cleanup | Manual removal of 540+ false claims took hours |

Allowed Enrichment Methods

1. LinkedIn Profile Data (PREFERRED)

LinkedIn profiles are self-reported by the person and already verified through profile access.

What to extract:

  • Current and past positions
  • Education history
  • Skills and endorsements
  • Publications (if listed)
  • Certifications
  • Languages

How to extract:

  1. Navigate to person's LinkedIn profile
  2. Save page as HTML: File > Save Page As > "Webpage, Complete"
  3. Run: python scripts/parse_linkedin_html.py <saved_file.html>
  4. Review extracted data before committing

Provenance:

{
  "claim_type": "position",
  "claim_value": {"title": "Curator", "organization": "Rijksmuseum"},
  "provenance": {
    "source_url": "https://www.linkedin.com/in/person-slug/",
    "retrieval_agent": "manual-linkedin-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "extraction_method": "parse_linkedin_html.py"
  }
}
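
A minimal sketch of wrapping a reviewed value in the provenance layout above. `build_linkedin_claim` is a hypothetical helper, not part of the repository; the field names are copied from the example, and the URL check (host `linkedin.com`, path starting with `/in/`) is an assumption about what counts as a personal profile URL.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def build_linkedin_claim(claim_type, claim_value, profile_url, retrieved_on=None):
    """Wrap a human-reviewed value in the LinkedIn provenance layout (hypothetical helper)."""
    parsed = urlparse(profile_url)
    host = (parsed.hostname or "").lower()
    # Assumption: personal profiles live under /in/<slug>/ on linkedin.com.
    if host not in {"linkedin.com", "www.linkedin.com"} or not parsed.path.startswith("/in/"):
        raise ValueError(f"not a LinkedIn profile URL: {profile_url}")
    # Normalise to UTC so the trailing "Z" in the timestamp is truthful.
    when = (retrieved_on or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "provenance": {
            "source_url": profile_url,
            "retrieval_agent": "manual-linkedin-extraction",
            "retrieved_on": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "extraction_method": "parse_linkedin_html.py",
        },
    }
```

The helper only formats a record; the review in step 4 still has to happen before the claim is committed.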

2. Institutional Sources (VERIFIED)

Data from the person's employer website (museum, archive, library) about their own staff.

Allowed sources:

  • Staff directory pages
  • "About us" / "Team" pages
  • Press releases about staff appointments
  • Annual reports listing staff

Verification:

  • URL must be the institution's official domain
  • Content must explicitly identify the person

Example:

{
  "claim_type": "position",
  "claim_value": {"title": "Head of Collections", "organization": "Van Gogh Museum"},
  "provenance": {
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "retrieval_agent": "manual-institutional-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Listed on official museum team page"
  }
}
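
The first verification rule (the URL must be on the institution's official domain) can be sketched as a hostname comparison. `is_on_official_domain` is a hypothetical name, and the official domain is assumed to be supplied by the person doing the enrichment. The check accepts the domain itself and its subdomains, and rejects look-alike hosts that merely contain the domain as a substring.

```python
from urllib.parse import urlparse


def is_on_official_domain(source_url, official_domain):
    """Return True if source_url is hosted on official_domain or one of its subdomains."""
    host = (urlparse(source_url).hostname or "").lower().rstrip(".")
    domain = official_domain.lower().strip(".")
    if not host or not domain:
        return False
    # Exact match, or a subdomain such as www.<domain>; a plain substring test
    # would wrongly accept hosts like <domain>.example.com.
    return host == domain or host.endswith("." + domain)


print(is_on_official_domain(
    "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "vangoghmuseum.nl",
))
```

This covers only the domain rule. Whether the page content explicitly identifies the person remains a human judgment.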

3. Verified Identifier Lookup (ORCID, Wikidata)

If the person has a verified identifier (ORCID, Wikidata QID), data from that source is acceptable.

ORCID:

  • Must match by ORCID ID, not name search
  • Publications, affiliations, employment from ORCID record

Wikidata:

  • Must have confirmed Wikidata QID for this specific person
  • Not from a random Wikidata search by name

Example:

{
  "claim_type": "birth_year",
  "claim_value": 1975,
  "provenance": {
    "source_url": "https://www.wikidata.org/wiki/Q12345678",
    "wikidata_property": "P569",
    "retrieval_agent": "manual-wikidata-lookup",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Wikidata QID confirmed via ISNI link"
  }
}
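
Because lookups must go by identifier rather than by name, it can help to confirm that a string is a well-formed identifier before using it. The sketch below is an illustration with hypothetical function names: it checks the ORCID iD shape and its ISO 7064 MOD 11-2 check character, and the `Q<number>` shape of a Wikidata QID. A well-formed identifier only rules out typos; it does not show that the identifier belongs to this specific person.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_wellformed_orcid(orcid):
    """Check the 16-character ORCID iD shape and its MOD 11-2 check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


def is_wellformed_qid(qid):
    """Check that the string looks like a Wikidata item ID such as Q12345678."""
    return bool(QID_PATTERN.fullmatch(qid))


print(is_wellformed_orcid("0000-0002-1825-0097"))
print(is_wellformed_qid("Q12345678"))
```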

4. Manual Web Research (WITH VERIFICATION)

If you must use web search, follow these mandatory steps:

Step 1: Search and Gather Sources

Search for the person's name + employer + role:

"Carmen Juliá" "New Contemporaries" curator

Step 2: Entity Resolution Checklist

For EACH source, verify at least 3 of 5 identity attributes match:

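| # | Attribute | Profile Value | Source Value | Match? |
|---|-----------|---------------|--------------|--------|
| 1 | Career/Profession | Curator | | |
| 2 | Employer | New Contemporaries | | |
| 3 | Location | UK | | |
| 4 | Age/Time Period | Active 2020s | | |
| 5 | Education | [if known] | | |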

Minimum 3 of 5 must match. Name match alone = REJECT.
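
The tally behind the checklist can be sketched as follows. The human reviewer supplies a judgment per attribute (`True` for a match, `False` for a conflict, `None` or absent for "not stated"); the code only counts and does no matching of its own. `tally_checklist` and the attribute keys are hypothetical names chosen for this illustration.

```python
ATTRIBUTES = ("profession", "employer", "location", "time_period", "education")


def tally_checklist(judgements, required=3):
    """Count human match judgements against the 3-of-5 minimum."""
    unknown = set(judgements) - set(ATTRIBUTES)
    if unknown:
        # A name match is deliberately not one of the five attributes.
        raise ValueError(f"not a checklist attribute: {sorted(unknown)}")
    matched = [a for a in ATTRIBUTES if judgements.get(a) is True]
    conflicts = [a for a in ATTRIBUTES if judgements.get(a) is False]
    return {
        "attributes_matched": matched,
        "match_count": len(matched),
        "conflicts": conflicts,
        "meets_minimum": len(matched) >= required,
    }


result = tally_checklist({
    "profession": True,
    "employer": True,
    "location": True,
    "time_period": None,   # not stated in the source
    "education": None,     # not stated in the source
})
print(result["match_count"], result["meets_minimum"])
```

Passing a stricter `required` value covers sources that need 4 or 5 matches.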

Step 3: Investigate Red Flags

Red flags requiring investigation (NOT automatic rejection - people change careers and relocate):

  • ⚠️ Source profession differs (actress vs curator) → Investigate: Did they change careers?
  • ⚠️ Source location differs (Venezuela vs UK) → Investigate: Did they relocate?
  • ⚠️ Time gap in career → Investigate: Career break or different person?

When to REJECT after investigation:

  • Overlapping timelines in different professions/locations (one person cannot be an actress in Venezuela AND a curator in the UK at the same time)
  • No evidence of career change or relocation
  • Birth year makes current career stage implausible
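
Two of the rejection criteria reduce to simple arithmetic: whether two activity periods overlap, and how old a birth year would make the person. A minimal sketch, with hypothetical function names; the year values in the usage lines are illustrative only and are not taken from any record. An end year of `None` stands for "present".

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b, today_year=None):
    """Return True if two inclusive year ranges share at least one year."""
    this_year = today_year or date.today().year
    end_a = this_year if end_a is None else end_a
    end_b = this_year if end_b is None else end_b
    return start_a <= end_b and start_b <= end_a


def age_in_year(birth_year, year):
    """Age reached during the given calendar year."""
    return year - birth_year


# Illustrative values: one career 1970-2009, another 2005-present.
print(periods_overlap(1970, 2009, 2005, None, today_year=2026))
print(age_in_year(1952, 2026))
```

An overlap is a signal for the investigator, not a verdict: the reviewer still has to decide whether the two activities are genuinely incompatible.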

Step 4: Document Verification

Record your verification in the claim provenance:

{
  "claim_type": "education",
  "claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
  "provenance": {
    "source_url": "https://example.org/interview-carmen-julia",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T12:30:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Article explicitly mentions work at New Contemporaries in London"
    }
  }
}

High-Risk Sources (Extra Verification Required)

The following sources have high entity resolution risk and require extra careful verification. They are NOT forbidden, but you must apply stricter matching thresholds:

| Source | Risk Level | Why | Required Matches |
|--------|------------|-----|------------------|
| Genealogy sites (geni.com, ancestry., familysearch.org, myheritage.) | CRITICAL | Often describe historical namesakes | 5 of 5 attributes |
| IMDB | CRITICAL | Many actors share common names | 5 of 5 attributes |
| Wikipedia (by name search) | HIGH | Many people with same name have articles | 4 of 5 attributes |
| Instagram / TikTok / Social media | HIGH | Cannot easily verify account ownership | 4 of 5 attributes |
| ResearchGate / Academia.edu | HIGH | Multiple researchers with same name | 4 of 5 attributes |
| News articles | MEDIUM | May mention different person with same name | 3 of 5 attributes |
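
The thresholds in this table can be expressed as a small lookup. In this sketch the category keys and function names are hypothetical, and the reviewer is assumed to pick the source category by hand; any category not listed falls back to the general 3-of-5 minimum.

```python
DEFAULT_REQUIRED = 3

# Required matches per source category, taken from the table above.
REQUIRED_MATCHES = {
    "genealogy": 5,
    "imdb": 5,
    "wikipedia_name_search": 4,
    "social_media": 4,
    "research_network": 4,   # ResearchGate / Academia.edu
    "news": 3,
}


def required_matches(category):
    """Number of identity attributes that must match for this source category."""
    return REQUIRED_MATCHES.get(category, DEFAULT_REQUIRED)


def passes_threshold(category, match_count):
    """True if the human-verified match count reaches the category's threshold."""
    return match_count >= required_matches(category)


print(passes_threshold("wikipedia_name_search", 4))
print(passes_threshold("genealogy", 4))
```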

Using High-Risk Sources Correctly

These sources CAN be used if you verify enough identity attributes:

  1. Genealogy sites: May be valid if person is historical AND dates/locations/profession all match
  2. IMDB: May be valid if person actually works in film/TV AND other attributes match
  3. Wikipedia: Read the FULL article - if profession, employer, location, and time period all match, it's likely correct
  4. Social media: Check bio for employer/location mentions that match profile

Example - Using Wikipedia correctly:

Profile: Jan de Vries, Curator at Rijksmuseum, Amsterdam
Wikipedia: Jan de Vries (art historian)
  - Mentions "curator at Rijksmuseum" ✅
  - Mentions "Amsterdam" ✅  
  - Mentions "art history PhD from University of Amsterdam" ✅
  - Active dates 2010-present ✅
→ At least 4 of 5 attributes match (Wikipedia threshold met) → ACCEPT (with documentation)

Example - Correctly rejecting Wikipedia:

Profile: Carmen Juliá, Curator at New Contemporaries, UK
Wikipedia: Carmen Julia Álvarez
  - Profession: "actress" ❌ (conflict!)
  - Location: "Venezuela" ❌ (conflict!)
→ Profession AND location conflict → REJECT

Workflow Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    Person Enrichment Request                      │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Check LinkedIn Profile (PREFERRED)                     │
│  - If accessible, extract and use LinkedIn data                  │
│  - Self-reported by person, already verified                     │
└─────────────────────────────────────────────────────────────────┘
                                │
                    LinkedIn not available?
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 2: Check Institutional Website                             │
│  - Find person on employer's official website                    │
│  - Staff directory, team page, press releases                    │
└─────────────────────────────────────────────────────────────────┘
                                │
                    Not on employer website?
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 3: Check Verified Identifiers                              │
│  - ORCID (by ID, not name)                                       │
│  - Wikidata (by confirmed QID)                                   │
└─────────────────────────────────────────────────────────────────┘
                                │
                    No verified identifiers?
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 4: Manual Web Research (WITH VERIFICATION)                 │
│  - Search with name + employer + role                            │
│  - Verify 3 of 5 identity attributes                             │
│  - Check for profession/location conflicts                       │
│  - Document verification in provenance                           │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 5: Record Claim with Full Provenance                       │
│  - source_url, retrieved_on, retrieval_agent                     │
│  - entity_resolution block with verification details             │
│  - verification_notes explaining match                           │
└─────────────────────────────────────────────────────────────────┘
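
The order of preference in the diagram can be sketched as a simple cascade. The function name, parameter names, and returned labels are hypothetical; the three flags are answers the enricher determines by hand before choosing a method.

```python
def choose_enrichment_method(has_linkedin, on_employer_site, has_verified_id):
    """Return the first applicable method, in the order the diagram prescribes."""
    if has_linkedin:
        return "linkedin_profile"
    if on_employer_site:
        return "institutional_website"
    if has_verified_id:
        return "verified_identifier"
    # Last resort: web research with the full entity resolution checklist.
    return "manual_web_research"


print(choose_enrichment_method(False, True, True))
```

Whichever method is chosen, the claim is still recorded with full provenance as in Step 5.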

Example: Correct Enrichment

Person Profile

{
  "ppid": "ID_NL-NH-AMS_198X_NL-NH-AMS_XXXX_JAN-DE-VRIES",
  "profile_data": {
    "full_name": "Jan de Vries",
    "headline": "Curator of Dutch Art at Rijksmuseum"
  }
}

Manual Research Process

  1. Search: "Jan de Vries" "Rijksmuseum" curator

  2. Found source: https://www.rijksmuseum.nl/en/about-us/team

  3. Entity resolution check:

    • Profession: "Curator" (matches)
    • Employer: "Rijksmuseum" (matches)
    • Location: "Amsterdam" (matches)
    • Age: Not stated
    • Education: Not stated
    • Result: 3 of 5 attributes match (2 not stated) → ACCEPT
  4. Add claim:

{
  "claim_type": "position",
  "claim_value": {
    "title": "Curator of Dutch Art",
    "organization": "Rijksmuseum"
  },
  "provenance": {
    "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T14:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T14:15:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Listed on official Rijksmuseum team page"
    }
  }
}

Example: Correct Rejection

Person Profile

{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  }
}

Manual Research Process

  1. Search: "Carmen Julia" born biography

  2. Found source: Wikipedia - Carmen Julia Álvarez

  3. Red flags detected (investigate, don't auto-reject):

    • ⚠️ Profession: "actress" vs "curator" → Did she change careers?
    • ⚠️ Location: "Venezuela" vs "UK" → Did she relocate?
    • ⚠️ Birth year: 1952 (would be 74 in 2026)
  4. Investigation:

    • Wikipedia shows Carmen Julia Álvarez was active as actress 1970s-2000s in Venezuela
    • Profile shows Carmen Juliá is active as curator 2015-present in UK
    • These careers overlap in time (2000s) on different continents
    • No evidence of career transition from acting to curating
    • Age 74 is possible but unusual for "Curator at New Contemporaries" (typically younger role)
  5. Conclusion: Overlapping timelines in incompatible roles → REJECT

  6. Log rejection:

{
  "rejected_claim": {
    "claim_type": "birth_year",
    "claim_value": 1952,
    "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez"
  },
  "rejection_reason": "overlapping_incompatible_careers",
  "rejection_details": "Wikipedia describes Venezuelan actress active 1970s-2000s; profile is UK curator active 2015-present. Overlapping timelines on different continents with no evidence of career transition.",
  "investigation_performed": true,
  "rejected_by": "kempersc",
  "rejected_at": "2026-01-11T14:30:00Z"
}
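
A rejection record in this shape can be assembled with a small helper. This is a sketch under assumptions: `build_rejection` is a hypothetical name, and where the record is stored is not specified here, so the example only builds and serializes it. `ensure_ascii=False` keeps accented characters such as "Á" readable in the output.

```python
import json
from datetime import datetime, timezone


def build_rejection(claim_type, claim_value, source_url, reason, details,
                    rejected_by, rejected_at=None):
    """Assemble a rejection record in the layout shown above (hypothetical helper)."""
    if not details.strip():
        raise ValueError("rejection_details must explain the decision")
    when = (rejected_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "rejected_claim": {
            "claim_type": claim_type,
            "claim_value": claim_value,
            "source_url": source_url,
        },
        "rejection_reason": reason,
        "rejection_details": details,
        "investigation_performed": True,
        "rejected_by": rejected_by,
        "rejected_at": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


record = build_rejection(
    "birth_year", 1952,
    "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez",
    "overlapping_incompatible_careers",
    "Source describes a Venezuelan actress; profile is a UK curator.",
    "kempersc",
    datetime(2026, 1, 11, 14, 30, tzinfo=timezone.utc),
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```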

Scripts and Tools

| Script | Purpose | Status |
|--------|---------|--------|
| scripts/parse_linkedin_html.py | Extract data from saved LinkedIn HTML | Active |
| scripts/enrich_person_comprehensive.py | Automated web enrichment | 🚫 DEPRECATED |
| scripts/validate_person_claims.py | Validate claim provenance | Active |

Checklist for Manual Enrichment

Before committing any enrichment:

  • High-risk sources verified with appropriate threshold (5/5 for genealogy/IMDB, 4/5 for Wikipedia/social)
  • At least 3 of 5 identity attributes verified
  • Red flags investigated (profession/location differences checked for career changes or relocations)
  • No overlapping incompatible timelines (can't be in two places/careers simultaneously)
  • Birth year is plausible for career stage
  • Full provenance recorded with entity_resolution block
  • Verification notes explain why this is the same person
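
The mechanical items on this checklist (required provenance fields, a consistent match count) can be checked before committing. The sketch below is an illustration only and is not the logic of `scripts/validate_person_claims.py`; the function name and the list of required keys are assumptions based on the web-research examples in this document. It returns a list of problems, empty when none are found.

```python
REQUIRED_PROVENANCE = ("source_url", "retrieval_agent", "retrieved_on")
REQUIRED_RESOLUTION = (
    "verified_by", "verification_date", "attributes_matched",
    "match_count", "verification_notes",
)


def find_provenance_problems(claim, required_matches=3):
    """List what is missing or inconsistent in a web-research claim's provenance."""
    problems = []
    provenance = claim.get("provenance") or {}
    for key in REQUIRED_PROVENANCE:
        if not provenance.get(key):
            problems.append(f"missing provenance.{key}")
    resolution = provenance.get("entity_resolution")
    if resolution is None:
        problems.append("missing provenance.entity_resolution")
        return problems
    for key in REQUIRED_RESOLUTION:
        if key not in resolution:
            problems.append(f"missing entity_resolution.{key}")
    matched = resolution.get("attributes_matched") or []
    if resolution.get("match_count") != len(matched):
        problems.append("match_count differs from number of attributes_matched")
    if len(matched) < required_matches:
        problems.append(f"only {len(matched)} attributes matched, need {required_matches}")
    return problems


claim = {
    "claim_type": "position",
    "provenance": {
        "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
        "retrieval_agent": "manual-human-curator",
        "retrieved_on": "2026-01-11T14:00:00Z",
        "entity_resolution": {
            "verified_by": "kempersc",
            "verification_date": "2026-01-11T14:15:00Z",
            "attributes_matched": ["profession", "employer", "location"],
            "match_count": 3,
            "verification_notes": "Listed on official Rijksmuseum team page",
        },
    },
}
print(find_provenance_problems(claim))
```

Red flags, timeline conflicts, and birth-year plausibility are not covered here; those remain human judgments.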

See also:

  • .opencode/rules/entity-resolution-no-heuristics.md - Rule 46 (CRITICAL)
  • AGENTS.md - Rule 21 (Data Fabrication Prohibited), Rule 26 (Person Data Provenance)
  • data/person/_ENRICHMENT_CLEANUP_FINAL_REPORT.md - Cleanup report from Jan 2026

Remember: Wrong data is worse than no data. When in doubt, DO NOT add the claim.