# Manual Person Enrichment Workflow

**Version**: 1.0.0
**Created**: 2026-01-11
**Status**: MANDATORY (automated enrichment is PROHIBITED)

---

## ⚠️ CRITICAL: Automated Enrichment is Prohibited

Automated web search enrichment has been **permanently disabled** due to catastrophic entity resolution failures discovered in January 2026:

- **540+ false claims** were attributed to wrong people with similar names
- Birth years from Venezuelan actresses attributed to UK curators
- Death years attributed to **living people**
- Social media from random namesakes attributed to heritage workers

**ALL person enrichment must now be done MANUALLY** with human verification that the source refers to the correct person.

See: `.opencode/rules/entity-resolution-no-heuristics.md` (Rule 46)

---

## Why Manual Enrichment is Required

### The Entity Resolution Problem

Web searches for "Carmen Juliá" return data about:

- Carmen Julia **Álvarez** (Venezuelan actress, born 1952)
- Carmen Julia **Navarro** (Mexican hydrogeologist)
- Carmen Julia **Gutiérrez** (Spanish medievalist)

**None of these** is the actual Carmen Juliá who is a UK art curator at New Contemporaries.

Name matching alone **CANNOT** distinguish between namesakes. Only a human can:

1. Read the full source context
2. Cross-reference multiple identity attributes
3. Detect conflicting signals (actress vs curator, Venezuela vs UK)
4. Make an informed judgment about entity identity

### The Cost of Wrong Data

| Impact | Description |
|--------|-------------|
| **Corrupts analysis** | Downstream reports use false birth years, wrong affiliations |
| **Legal/privacy risk** | Attributing data to wrong person violates privacy |
| **Destroys trust** | Users lose confidence in entire dataset |
| **Expensive cleanup** | Manual removal of 540+ false claims took hours |

---

## Allowed Enrichment Methods

### 1. LinkedIn Profile Data (PREFERRED)

LinkedIn profiles are **self-reported** by the person and already verified through profile access.

**What to extract**:

- Current and past positions
- Education history
- Skills and endorsements
- Publications (if listed)
- Certifications
- Languages

**How to extract**:

1. Navigate to person's LinkedIn profile
2. Save page as HTML: File > Save Page As > "Webpage, Complete"
3. Run: `python scripts/parse_linkedin_html.py`
4. Review extracted data before committing

**Provenance**:

```json
{
  "claim_type": "position",
  "claim_value": {"title": "Curator", "organization": "Rijksmuseum"},
  "provenance": {
    "source_url": "https://www.linkedin.com/in/person-slug/",
    "retrieval_agent": "manual-linkedin-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "extraction_method": "parse_linkedin_html.py"
  }
}
```

### 2. Institutional Sources (VERIFIED)

Data from the person's employer website (museum, archive, library) about their own staff.

**Allowed sources**:

- Staff directory pages
- "About us" / "Team" pages
- Press releases about staff appointments
- Annual reports listing staff

**Verification**:

- URL must be the institution's official domain
- Content must explicitly identify the person

**Example**:

```json
{
  "claim_type": "position",
  "claim_value": {"title": "Head of Collections", "organization": "Van Gogh Museum"},
  "provenance": {
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "retrieval_agent": "manual-institutional-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Listed on official museum team page"
  }
}
```

### 3. Verified Identifier Lookup (ORCID, Wikidata)

If the person has a verified identifier (ORCID, Wikidata QID), data from that source is acceptable.

**ORCID**:

- Must match by ORCID ID, not name search
- Publications, affiliations, employment from ORCID record

**Wikidata**:

- Must have confirmed Wikidata QID for this specific person
- Not from a random Wikidata search by name

**Example**:

```json
{
  "claim_type": "birth_year",
  "claim_value": 1975,
  "provenance": {
    "source_url": "https://www.wikidata.org/wiki/Q12345678",
    "wikidata_property": "P569",
    "retrieval_agent": "manual-wikidata-lookup",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Wikidata QID confirmed via ISNI link"
  }
}
```

### 4. Manual Web Research (WITH VERIFICATION)

If you must use web search, follow these **mandatory** steps:

#### Step 1: Search and Gather Sources

Search for the person's name + employer + role:

```
"Carmen Juliá" "New Contemporaries" curator
```

#### Step 2: Entity Resolution Checklist

For EACH source, verify **at least 3 of 5** identity attributes match:

| # | Attribute | Profile Value | Source Value | Match? |
|---|-----------|---------------|--------------|--------|
| 1 | Career/Profession | Curator | | ☐ |
| 2 | Employer | New Contemporaries | | ☐ |
| 3 | Location | UK | | ☐ |
| 4 | Age/Time Period | Active 2020s | | ☐ |
| 5 | Education | [if known] | | ☐ |

**Minimum 3 of 5 must match.** Name match alone = REJECT.

#### Step 3: Investigate Red Flags

**Red flags requiring investigation** (NOT automatic rejection - people change careers and relocate):

- ⚠️ Source profession differs (actress vs curator) → **Investigate**: Did they change careers?
- ⚠️ Source location differs (Venezuela vs UK) → **Investigate**: Did they relocate?
- ⚠️ Time gap in career → **Investigate**: Career break or different person?
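
The counting part of the Step 2 checklist can be kept mechanical while the judgment stays with the reviewer. The sketch below is a hypothetical helper, not one of the existing scripts: `tally_checklist` and the attribute keys `time_period` and `education` are assumptions made for illustration. A human records a yes/no answer per attribute; the helper only counts the answers and reports whether the threshold is met. It never compares names or values itself, and meeting the threshold does not replace the red-flag investigation.

```python
# Hypothetical helper: tallies a reviewer's manual checklist answers.
# Attribute keys are assumed; only "profession", "employer", and "location"
# appear in this document's provenance examples.
ATTRIBUTES = ("profession", "employer", "location", "time_period", "education")


def tally_checklist(answers, required=3):
    """Count the attributes a human marked as matching.

    answers maps attribute name -> True (match), False (conflict),
    or None / absent (not stated in the source).
    """
    unknown = set(answers) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    matched = [name for name in ATTRIBUTES if answers.get(name) is True]
    return {
        "attributes_matched": matched,
        "match_count": len(matched),
        "meets_threshold": len(matched) >= required,
    }


if __name__ == "__main__":
    # Reviewer confirmed three attributes; two were not stated in the source.
    result = tally_checklist({"profession": True, "employer": True, "location": True})
    print(result["match_count"], result["meets_threshold"])  # prints: 3 True
```

Passing `required=4` or `required=5` applies the stricter thresholds that this document sets for high-risk sources.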

**When to REJECT after investigation**:

- ❌ Overlapping timelines in different professions/locations (can't be actress in Venezuela AND curator in UK simultaneously)
- ❌ No evidence of career change or relocation
- ❌ Birth year makes current career stage implausible

#### Step 4: Document Verification

Record your verification in the claim provenance:

```json
{
  "claim_type": "education",
  "claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
  "provenance": {
    "source_url": "https://example.org/interview-carmen-julia",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T12:30:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Article explicitly mentions work at New Contemporaries in London"
    }
  }
}
```

---

## High-Risk Sources (Extra Verification Required)

The following sources have **high entity resolution risk** and require extra careful verification. They are NOT forbidden, but you must apply stricter matching thresholds:

| Source | Risk Level | Why | Required Matches |
|--------|------------|-----|------------------|
| **Genealogy sites** (geni.com, ancestry.*, familysearch.org, myheritage.*) | CRITICAL | Often describe historical namesakes | 5 of 5 attributes |
| **IMDb** | CRITICAL | Many actors share common names | 5 of 5 attributes |
| **Wikipedia (by name search)** | HIGH | Many people with same name have articles | 4 of 5 attributes |
| **Instagram / TikTok / Social media** | HIGH | Cannot easily verify account ownership | 4 of 5 attributes |
| **ResearchGate / Academia.edu** | HIGH | Multiple researchers with same name | 4 of 5 attributes |
| **News articles** | MEDIUM | May mention different person with same name | 3 of 5 attributes |

### Using High-Risk Sources Correctly

**These sources CAN be used** if you verify enough identity attributes:

1. **Genealogy sites**: May be valid if person is historical AND dates/locations/profession all match
2. **IMDb**: May be valid if person actually works in film/TV AND other attributes match
3. **Wikipedia**: Read the FULL article - if profession, employer, location, and time period all match, it's likely correct
4. **Social media**: Check bio for employer/location mentions that match profile

**Example - Using Wikipedia correctly**:

```
Profile: Jan de Vries, Curator at Rijksmuseum, Amsterdam
Wikipedia: Jan de Vries (art historian)
  - Mentions "curator at Rijksmuseum" ✅
  - Mentions "Amsterdam" ✅
  - Mentions "art history PhD from University of Amsterdam" ✅
  - Active dates 2010-present ✅
  → At least 4 of 5 attributes match → ACCEPT (with documentation)
```

**Example - Correctly rejecting Wikipedia**:

```
Profile: Carmen Juliá, Curator at New Contemporaries, UK
Wikipedia: Carmen Julia Álvarez
  - Profession: "actress" ❌ (conflict!)
  - Location: "Venezuela" ❌ (conflict!)
  → Profession AND location conflict → REJECT
```

---

## Workflow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                    Person Enrichment Request                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Check LinkedIn Profile (PREFERRED)                     │
│    - If accessible, extract and use LinkedIn data               │
│    - Self-reported by person, already verified                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                      LinkedIn not available?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 2: Check Institutional Website                            │
│    - Find person on employer's official website                 │
│    - Staff directory, team page, press releases                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                     Not on employer website?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 3: Check Verified Identifiers                             │
│    - ORCID (by ID, not name)                                    │
│    - Wikidata (by confirmed QID)                                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                     No verified identifiers?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 4: Manual Web Research (WITH VERIFICATION)                │
│    - Search with name + employer + role                         │
│    - Verify 3 of 5 identity attributes                          │
│    - Check for profession/location conflicts                    │
│    - Document verification in provenance                        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 5: Record Claim with Full Provenance                      │
│    - source_url, retrieved_on, retrieval_agent                  │
│    - entity_resolution block with verification details          │
│    - verification_notes explaining match                        │
└─────────────────────────────────────────────────────────────────┘
```

---

## Example: Correct Enrichment

### Person Profile

```json
{
  "ppid": "ID_NL-NH-AMS_198X_NL-NH-AMS_XXXX_JAN-DE-VRIES",
  "profile_data": {
    "full_name": "Jan de Vries",
    "headline": "Curator of Dutch Art at Rijksmuseum"
  }
}
```

### Manual Research Process

1. **Search**: `"Jan de Vries" "Rijksmuseum" curator`
2. **Found source**: https://www.rijksmuseum.nl/en/about-us/team
3. **Entity resolution check**:
   - ✅ Profession: "Curator" (matches)
   - ✅ Employer: "Rijksmuseum" (matches)
   - ✅ Location: "Amsterdam" (matches)
   - ⚪ Age: Not stated
   - ⚪ Education: Not stated
   - **Result**: 3 of 5 attributes match (2 not stated in the source, none conflicting) → ACCEPT
4. **Add claim**:

```json
{
  "claim_type": "position",
  "claim_value": {
    "title": "Curator of Dutch Art",
    "organization": "Rijksmuseum"
  },
  "provenance": {
    "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T14:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T14:15:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Listed on official Rijksmuseum team page"
    }
  }
}
```

---

## Example: Correct Rejection

### Person Profile

```json
{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  }
}
```

### Manual Research Process

1. **Search**: `"Carmen Julia" born biography`
2. **Found source**: Wikipedia - Carmen Julia Álvarez
3. **Red flags detected** (investigate, don't auto-reject):
   - ⚠️ Profession: "actress" vs "curator" → Did she change careers?
   - ⚠️ Location: "Venezuela" vs "UK" → Did she relocate?
   - ⚠️ Birth year: 1952 (would be 74 in 2026)
4. **Investigation**:
   - Wikipedia shows Carmen Julia Álvarez was **active as actress 1970s-2000s** in Venezuela
   - Profile shows Carmen Juliá is **active as curator 2015-present** in UK
   - These careers **overlap in time** (2000s) on **different continents**
   - No evidence of career transition from acting to curating
   - Age 74 is possible but unusual for "Curator at New Contemporaries" (typically younger role)
5. **Conclusion**: Overlapping timelines in incompatible roles → **REJECT**
6. **Log rejection**:

```json
{
  "rejected_claim": {
    "claim_type": "birth_year",
    "claim_value": 1952,
    "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez"
  },
  "rejection_reason": "overlapping_incompatible_careers",
  "rejection_details": "Wikipedia describes Venezuelan actress active 1970s-2000s; profile is UK curator active 2015-present. Overlapping timelines on different continents with no evidence of career transition.",
  "investigation_performed": true,
  "rejected_by": "kempersc",
  "rejected_at": "2026-01-11T14:30:00Z"
}
```

---

## Scripts and Tools

| Script | Purpose | Status |
|--------|---------|--------|
| `scripts/parse_linkedin_html.py` | Extract data from saved LinkedIn HTML | ✅ Active |
| `scripts/enrich_person_comprehensive.py` | Automated web enrichment | 🚫 **DEPRECATED** |
| `scripts/validate_person_claims.py` | Validate claim provenance | ✅ Active |

---

## Checklist for Manual Enrichment

Before committing any enrichment:

- [ ] High-risk sources verified with appropriate threshold (5/5 for genealogy/IMDb, 4/5 for Wikipedia/social)
- [ ] At least 3 of 5 identity attributes verified
- [ ] Red flags investigated (profession/location differences checked for career changes or relocations)
- [ ] No overlapping incompatible timelines (can't be in two places/careers simultaneously)
- [ ] Birth year is plausible for career stage
- [ ] Full provenance recorded with `entity_resolution` block
- [ ] Verification notes explain why this is the same person

---

## Related Documentation

- `.opencode/rules/entity-resolution-no-heuristics.md` - Rule 46 (CRITICAL)
- `AGENTS.md` - Rule 21 (Data Fabrication Prohibited), Rule 26 (Person Data Provenance)
- `data/person/_ENRICHMENT_CLEANUP_FINAL_REPORT.md` - Cleanup report from Jan 2026

---

**Remember: Wrong data is worse than no data. When in doubt, DO NOT add the claim.**
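
The mechanical items in the checklist above (required provenance fields and the match threshold per source type) could be checked before committing a manually researched claim. The following is a minimal sketch under stated assumptions: it is not the contents of `scripts/validate_person_claims.py`, the function names `required_matches` and `checklist_problems` are hypothetical, and the domain-to-threshold mapping is one possible reading of the high-risk table. It only inspects what the reviewer already recorded; it performs no entity resolution.

```python
from urllib.parse import urlparse

# Assumed mapping from the high-risk table to concrete domains (illustrative).
HIGH_RISK_DOMAINS = {
    "geni.com": 5, "familysearch.org": 5, "imdb.com": 5,
    "wikipedia.org": 4, "instagram.com": 4, "tiktok.com": 4,
    "researchgate.net": 4, "academia.edu": 4,
}
WILDCARD_LABELS = {"ancestry", "myheritage"}  # the table lists ancestry.*, myheritage.*
REQUIRED_FIELDS = ("source_url", "retrieval_agent", "retrieved_on")


def required_matches(source_url):
    """Return how many of the 5 identity attributes this source requires."""
    host = (urlparse(source_url).hostname or "").lower()
    labels = host.split(".")
    # Check every domain suffix so that en.wikipedia.org maps to wikipedia.org.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in HIGH_RISK_DOMAINS:
            return HIGH_RISK_DOMAINS[suffix]
    if WILDCARD_LABELS & set(labels):
        return 5
    return 3  # default minimum: 3 of 5


def checklist_problems(claim):
    """List mechanical checklist failures for a manual web-research claim."""
    problems = []
    prov = claim.get("provenance", {})
    for name in REQUIRED_FIELDS:
        if not prov.get(name):
            problems.append(f"missing provenance.{name}")
    er = prov.get("entity_resolution")
    if er is None:
        problems.append("missing entity_resolution block")
        return problems
    matched = er.get("attributes_matched", [])
    if er.get("match_count") != len(matched):
        problems.append("match_count does not equal len(attributes_matched)")
    needed = required_matches(prov.get("source_url", ""))
    if len(matched) < needed:
        problems.append(f"only {len(matched)} attributes matched, {needed} required")
    if not er.get("verification_notes"):
        problems.append("missing verification_notes")
    return problems


if __name__ == "__main__":
    claim = {
        "claim_type": "position",
        "provenance": {
            "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
            "retrieval_agent": "manual-human-curator",
            "retrieved_on": "2026-01-11T14:00:00Z",
            "entity_resolution": {
                "attributes_matched": ["profession", "employer", "location"],
                "match_count": 3,
                "verification_notes": "Listed on official Rijksmuseum team page",
            },
        },
    }
    print(checklist_problems(claim))  # prints: []
```

An empty list means only that the recorded provenance is complete and meets the threshold. Red-flag investigation, timeline checks, and birth-year plausibility remain human judgments.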