Manual Person Enrichment Workflow
Version: 1.0.0
Created: 2026-01-11
Status: MANDATORY (automated enrichment is PROHIBITED)
⚠️ CRITICAL: Automated Enrichment is Prohibited
Automated web search enrichment has been permanently disabled due to catastrophic entity resolution failures discovered in January 2026:
- 540+ false claims were attributed to wrong people with similar names
- Birth years from Venezuelan actresses attributed to UK curators
- Death years attributed to living people
- Social media from random namesakes attributed to heritage workers
ALL person enrichment must now be done MANUALLY with human verification that the source refers to the correct person.
See: .opencode/rules/entity-resolution-no-heuristics.md (Rule 46)
Why Manual Enrichment is Required
The Entity Resolution Problem
Web searches for "Carmen Juliá" return data about:
- Carmen Julia Álvarez (Venezuelan actress, born 1952)
- Carmen Julia Navarro (Mexican hydrogeologist)
- Carmen Julia Gutiérrez (Spanish medievalist)
None of these is the actual Carmen Juliá who is a UK art curator at New Contemporaries.
Name matching alone CANNOT distinguish between namesakes. Only a human can:
- Read the full source context
- Cross-reference multiple identity attributes
- Detect conflicting signals (actress vs curator, Venezuela vs UK)
- Make an informed judgment about entity identity
The Cost of Wrong Data
| Impact | Description |
|---|---|
| Corrupts analysis | Downstream reports use false birth years, wrong affiliations |
| Legal/privacy risk | Attributing data to the wrong person violates privacy |
| Destroys trust | Users lose confidence in entire dataset |
| Expensive cleanup | Manual removal of 540+ false claims took hours |
Allowed Enrichment Methods
1. LinkedIn Profile Data (PREFERRED)
LinkedIn profiles are self-reported by the person and already verified through profile access.
What to extract:
- Current and past positions
- Education history
- Skills and endorsements
- Publications (if listed)
- Certifications
- Languages
How to extract:
- Navigate to person's LinkedIn profile
- Save page as HTML: File > Save Page As > "Webpage, Complete"
- Run:
python scripts/parse_linkedin_html.py <saved_file.html>
- Review extracted data before committing
Provenance:
{
"claim_type": "position",
"claim_value": {"title": "Curator", "organization": "Rijksmuseum"},
"provenance": {
"source_url": "https://www.linkedin.com/in/person-slug/",
"retrieval_agent": "manual-linkedin-extraction",
"retrieved_on": "2026-01-11T12:00:00Z",
"extraction_method": "parse_linkedin_html.py"
}
}
2. Institutional Sources (VERIFIED)
Data from the person's employer website (museum, archive, library) about their own staff.
Allowed sources:
- Staff directory pages
- "About us" / "Team" pages
- Press releases about staff appointments
- Annual reports listing staff
Verification:
- URL must be the institution's official domain
- Content must explicitly identify the person
Example:
{
"claim_type": "position",
"claim_value": {"title": "Head of Collections", "organization": "Van Gogh Museum"},
"provenance": {
"source_url": "https://www.vangoghmuseum.nl/en/about/organisation/team",
"retrieval_agent": "manual-institutional-extraction",
"retrieved_on": "2026-01-11T12:00:00Z",
"verification_notes": "Listed on official museum team page"
}
}
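The domain requirement can be supported by a small helper. This is a minimal sketch, assuming the curator supplies the institution's official domain by hand; `is_official_domain` is a hypothetical name, not an existing script in this repository. It only confirms where the page is hosted. Confirming that the page explicitly identifies the right person remains a human step.

```python
from urllib.parse import urlparse


def is_official_domain(source_url: str, official_domain: str) -> bool:
    """Return True if source_url is served from official_domain or a subdomain.

    Hypothetical helper: official_domain is entered by the curator,
    it is not looked up automatically.
    """
    host = (urlparse(source_url).hostname or "").lower()
    domain = official_domain.lower().lstrip(".")
    # Accept an exact match or a true subdomain such as "www.<domain>".
    # A plain suffix test without the leading dot would wrongly accept
    # look-alike hosts such as "fake<domain>".
    return host == domain or host.endswith("." + domain)
```

For example, `is_official_domain("https://www.vangoghmuseum.nl/en/about/organisation/team", "vangoghmuseum.nl")` is true, while a look-alike host such as `vangoghmuseum.nl.example.org` is not accepted.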
3. Verified Identifier Lookup (ORCID, Wikidata)
If the person has a verified identifier (ORCID, Wikidata QID), data from that source is acceptable.
ORCID:
- Must match by ORCID ID, not name search
- Publications, affiliations, employment from ORCID record
Wikidata:
- Must have confirmed Wikidata QID for this specific person
- Not from a random Wikidata search by name
Example:
{
"claim_type": "birth_year",
"claim_value": 1975,
"provenance": {
"source_url": "https://www.wikidata.org/wiki/Q12345678",
"wikidata_property": "P569",
"retrieval_agent": "manual-wikidata-lookup",
"retrieved_on": "2026-01-11T12:00:00Z",
"verification_notes": "Wikidata QID confirmed via ISNI link"
}
}
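Because lookups must start from an identifier rather than a name, it can help to check that the value in hand is actually shaped like an identifier. The sketch below is illustrative only; the function names are hypothetical. The ORCID check uses the ISO 7064 MOD 11-2 check digit that ORCID iDs carry, and the QID check only tests the `Q` + digits form. Neither proves the identifier belongs to this person; that confirmation is still manual.

```python
import re

# 16 characters in four hyphen-separated groups; the last may be "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")
# Wikidata item IDs: "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_valid_orcid(orcid: str) -> bool:
    """Check ORCID iD format and its ISO 7064 MOD 11-2 check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[15] == expected


def is_wikidata_qid(value: str) -> bool:
    """Check that value looks like a Wikidata QID (format only)."""
    return QID_PATTERN.fullmatch(value) is not None
```

A name string such as "Jan de Vries" fails both checks, which is the point: a lookup that starts from a name is a name search and falls under the manual web research rules instead.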
4. Manual Web Research (WITH VERIFICATION)
If you must use web search, follow these mandatory steps:
Step 1: Search and Gather Sources
Search for the person's name + employer + role:
"Carmen Juliá" "New Contemporaries" curator
Step 2: Entity Resolution Checklist
For EACH source, verify at least 3 of 5 identity attributes match:
| # | Attribute | Profile Value | Source Value | Match? |
|---|---|---|---|---|
| 1 | Career/Profession | Curator | | ☐ |
| 2 | Employer | New Contemporaries | | ☐ |
| 3 | Location | UK | | ☐ |
| 4 | Age/Time Period | Active 2020s | | ☐ |
| 5 | Education | [if known] | | ☐ |
Minimum 3 of 5 must match. Name match alone = REJECT.
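The checklist result can be tallied in a consistent way once a human has filled it in. The sketch below does not decide identity and does not compare any text; it only counts judgements the curator has already made. The function name and the attribute keys `time_period` and `education` are assumptions chosen to sit alongside the `profession`, `employer`, and `location` keys used in the provenance examples.

```python
from typing import Dict, List, Optional

# Assumed keys for the five checklist rows.
ATTRIBUTES = ("profession", "employer", "location", "time_period", "education")


def tally_checklist(judgements: Dict[str, Optional[bool]], required: int = 3) -> dict:
    """Summarise a human-completed checklist.

    judgements maps attribute -> True (matches), False (conflicts),
    or None / absent (source does not state it).
    """
    unknown = sorted(set(judgements) - set(ATTRIBUTES))
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")
    matched: List[str] = [a for a in ATTRIBUTES if judgements.get(a) is True]
    conflicts: List[str] = [a for a in ATTRIBUTES if judgements.get(a) is False]
    return {
        "attributes_matched": matched,
        "match_count": len(matched),
        "conflicts": conflicts,  # red flags that still need investigation
        "meets_threshold": len(matched) >= required,
    }
```

Meeting the threshold is necessary, not sufficient: any entry in `conflicts` still has to go through the red-flag investigation before a claim is accepted.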
Step 3: Investigate Red Flags
Red flags requiring investigation (NOT automatic rejection - people change careers and relocate):
- ⚠️ Source profession differs (actress vs curator) → Investigate: Did they change careers?
- ⚠️ Source location differs (Venezuela vs UK) → Investigate: Did they relocate?
- ⚠️ Time gap in career → Investigate: Career break or different person?
When to REJECT after investigation:
- ❌ Overlapping timelines in different professions/locations (can't be actress in Venezuela AND curator in UK simultaneously)
- ❌ No evidence of career change or relocation
- ❌ Birth year makes current career stage implausible
Step 4: Document Verification
Record your verification in the claim provenance:
{
"claim_type": "education",
"claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
"provenance": {
"source_url": "https://example.org/interview-carmen-julia",
"retrieval_agent": "manual-human-curator",
"retrieved_on": "2026-01-11T12:00:00Z",
"entity_resolution": {
"verified_by": "kempersc",
"verification_date": "2026-01-11T12:30:00Z",
"attributes_matched": ["profession", "employer", "location"],
"match_count": 3,
"verification_notes": "Article explicitly mentions work at New Contemporaries in London"
}
}
}
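A quick self-check before committing can catch an incomplete `entity_resolution` block. This is a hedged sketch, not the behaviour of `scripts/validate_person_claims.py` (which is not described here); `provenance_problems` is a hypothetical helper, and the list of required keys is simply taken from the example above. It applies to manual web research claims, which are the ones that carry an `entity_resolution` block.

```python
from typing import Any, Dict, List

# Keys taken from the example claim; treat this list as an assumption.
REQUIRED_PROVENANCE = ("source_url", "retrieval_agent", "retrieved_on")
REQUIRED_RESOLUTION = (
    "verified_by",
    "verification_date",
    "attributes_matched",
    "match_count",
    "verification_notes",
)


def provenance_problems(claim: Dict[str, Any], min_matches: int = 3) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    prov = claim.get("provenance") or {}
    for key in REQUIRED_PROVENANCE:
        if not prov.get(key):
            problems.append(f"provenance.{key} is missing")
    resolution = prov.get("entity_resolution")
    if not isinstance(resolution, dict):
        problems.append("provenance.entity_resolution is missing")
        return problems
    for key in REQUIRED_RESOLUTION:
        if key not in resolution:
            problems.append(f"entity_resolution.{key} is missing")
    matched = resolution.get("attributes_matched") or []
    if resolution.get("match_count") != len(matched):
        problems.append("match_count does not equal len(attributes_matched)")
    if len(matched) < min_matches:
        problems.append(f"only {len(matched)} attributes matched, need {min_matches}")
    return problems
```

Pass a higher `min_matches` when the source is in a stricter risk category.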
High-Risk Sources (Extra Verification Required)
The following sources have high entity resolution risk and require extra careful verification. They are NOT forbidden, but you must apply stricter matching thresholds:
| Source | Risk Level | Why | Required Matches |
|---|---|---|---|
| Genealogy sites (geni.com, ancestry., familysearch.org, myheritage.) | CRITICAL | Often describe historical namesakes | 5 of 5 attributes |
| IMDB | CRITICAL | Many actors share common names | 5 of 5 attributes |
| Wikipedia (by name search) | HIGH | Many people with same name have articles | 4 of 5 attributes |
| Instagram / TikTok / Social media | HIGH | Cannot easily verify account ownership | 4 of 5 attributes |
| ResearchGate / Academia.edu | HIGH | Multiple researchers with same name | 4 of 5 attributes |
| News articles | MEDIUM | May mention different person with same name | 3 of 5 attributes |
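The thresholds in this table can be looked up from a source URL so the right bar is applied consistently. This is a minimal sketch under stated assumptions: `required_matches` is a hypothetical helper, sites are recognised by a label in the hostname (for example `wikipedia` in `en.wikipedia.org`), and anything unrecognised falls back to the 3 of 5 baseline. Social media platforms other than the two named in the table are not covered and need a manual decision.

```python
from urllib.parse import urlparse

# Hostname label -> required matching attributes (out of 5), from the table.
REQUIRED_MATCHES_BY_LABEL = {
    "geni": 5,
    "ancestry": 5,
    "familysearch": 5,
    "myheritage": 5,
    "imdb": 5,
    "wikipedia": 4,
    "instagram": 4,
    "tiktok": 4,
    "researchgate": 4,
    "academia": 4,
}
DEFAULT_REQUIRED_MATCHES = 3  # baseline, also used for news articles


def required_matches(source_url: str) -> int:
    """Return how many of the 5 identity attributes must match for this URL."""
    host = (urlparse(source_url).hostname or "").lower()
    labels = host.split(".")
    # Take the strictest threshold among any recognised hostname labels.
    return max(
        (REQUIRED_MATCHES_BY_LABEL.get(label, DEFAULT_REQUIRED_MATCHES) for label in labels),
        default=DEFAULT_REQUIRED_MATCHES,
    )
```

The result is only the minimum count; the curator still performs and documents the attribute comparison by hand.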
Using High-Risk Sources Correctly
These sources CAN be used if you verify enough identity attributes:
- Genealogy sites: May be valid if person is historical AND dates/locations/profession all match
- IMDB: May be valid if person actually works in film/TV AND other attributes match
- Wikipedia: Read the FULL article - if profession, employer, location, and time period all match, it's likely correct
- Social media: Check bio for employer/location mentions that match profile
Example - Using Wikipedia correctly:
Profile: Jan de Vries, Curator at Rijksmuseum, Amsterdam
Wikipedia: Jan de Vries (art historian)
- Mentions "curator at Rijksmuseum" ✅
- Mentions "Amsterdam" ✅
- Mentions "art history PhD from University of Amsterdam" ✅
- Active dates 2010-present ✅
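→ All 5 attributes match (profession, employer, location, education, time period), which exceeds the 4 of 5 threshold for Wikipedia → ACCEPT (with documentation)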
Example - Correctly rejecting Wikipedia:
Profile: Carmen Juliá, Curator at New Contemporaries, UK
Wikipedia: Carmen Julia Álvarez
- Profession: "actress" ❌ (conflict!)
- Location: "Venezuela" ❌ (conflict!)
→ Profession AND location conflict → REJECT
Workflow Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Person Enrichment Request │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Check LinkedIn Profile (PREFERRED) │
│ - If accessible, extract and use LinkedIn data │
│ - Self-reported by person, already verified │
└─────────────────────────────────────────────────────────────────┘
│
LinkedIn not available?
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Check Institutional Website │
│ - Find person on employer's official website │
│ - Staff directory, team page, press releases │
└─────────────────────────────────────────────────────────────────┘
│
Not on employer website?
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Check Verified Identifiers │
│ - ORCID (by ID, not name) │
│ - Wikidata (by confirmed QID) │
└─────────────────────────────────────────────────────────────────┘
│
No verified identifiers?
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 4: Manual Web Research (WITH VERIFICATION) │
│ - Search with name + employer + role │
│ - Verify 3 of 5 identity attributes │
│ - Check for profession/location conflicts │
│ - Document verification in provenance │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 5: Record Claim with Full Provenance │
│ - source_url, retrieved_on, retrieval_agent │
│ - entity_resolution block with verification details │
│ - verification_notes explaining match │
└─────────────────────────────────────────────────────────────────┘
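The order of preference in the diagram can be written down as a simple fallback chain. This is an illustrative sketch only; the function name and the returned labels are hypothetical, and the three inputs are answers a human has already established, not automated lookups.

```python
def next_enrichment_step(
    linkedin_available: bool,
    on_employer_site: bool,
    has_verified_identifier: bool,
) -> str:
    """Pick the first applicable method, in the order shown in the diagram."""
    if linkedin_available:
        return "linkedin_profile"
    if on_employer_site:
        return "institutional_website"
    if has_verified_identifier:
        return "verified_identifier"
    # Last resort: requires the full entity resolution checklist.
    return "manual_web_research"
```

Whichever step is chosen, the claim is still recorded with full provenance as in Step 5.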
Example: Correct Enrichment
Person Profile
{
"ppid": "ID_NL-NH-AMS_198X_NL-NH-AMS_XXXX_JAN-DE-VRIES",
"profile_data": {
"full_name": "Jan de Vries",
"headline": "Curator of Dutch Art at Rijksmuseum"
}
}
Manual Research Process
- Search: "Jan de Vries" "Rijksmuseum" curator
- Found source: https://www.rijksmuseum.nl/en/about-us/team
- Entity resolution check:
  - ✅ Profession: "Curator" (matches)
  - ✅ Employer: "Rijksmuseum" (matches)
  - ✅ Location: "Amsterdam" (matches)
  - ⚪ Age: Not stated
  - ⚪ Education: Not stated
  - Result: 3 of 3 checked attributes match → ACCEPT
- Add claim:
{
"claim_type": "position",
"claim_value": {
"title": "Curator of Dutch Art",
"organization": "Rijksmuseum"
},
"provenance": {
"source_url": "https://www.rijksmuseum.nl/en/about-us/team",
"retrieval_agent": "manual-human-curator",
"retrieved_on": "2026-01-11T14:00:00Z",
"entity_resolution": {
"verified_by": "kempersc",
"verification_date": "2026-01-11T14:15:00Z",
"attributes_matched": ["profession", "employer", "location"],
"match_count": 3,
"verification_notes": "Listed on official Rijksmuseum team page"
}
}
}
Example: Correct Rejection
Person Profile
{
"ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
"profile_data": {
"full_name": "Carmen Juliá",
"headline": "Curator at New Contemporaries"
}
}
Manual Research Process
- Search: "Carmen Julia" born biography
- Found source: Wikipedia - Carmen Julia Álvarez
- Red flags detected (investigate, don't auto-reject):
  - ⚠️ Profession: "actress" vs "curator" → Did she change careers?
  - ⚠️ Location: "Venezuela" vs "UK" → Did she relocate?
  - ⚠️ Birth year: 1952 (would be 74 in 2026)
- Investigation:
  - Wikipedia shows Carmen Julia Álvarez was active as actress 1970s-2000s in Venezuela
  - Profile shows Carmen Juliá is active as curator 2015-present in UK
  - These careers overlap in time (2000s) on different continents
  - No evidence of career transition from acting to curating
  - Age 74 is possible but unusual for "Curator at New Contemporaries" (typically younger role)
- Conclusion: Overlapping timelines in incompatible roles → REJECT
- Log rejection:
{
"rejected_claim": {
"claim_type": "birth_year",
"claim_value": 1952,
"source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez"
},
"rejection_reason": "overlapping_incompatible_careers",
"rejection_details": "Wikipedia describes Venezuelan actress active 1970s-2000s; profile is UK curator active 2015-present. Overlapping timelines on different continents with no evidence of career transition.",
"investigation_performed": true,
"rejected_by": "kempersc",
"rejected_at": "2026-01-11T14:30:00Z"
}
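Rejection records are easier to keep uniform if they are assembled by one helper. The sketch below mirrors the field names of the record above; `build_rejection_record` itself is a hypothetical function, and where the record is stored is left open because this document does not specify it.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_rejection_record(
    claim_type: str,
    claim_value: Any,
    source_url: str,
    reason: str,
    details: str,
    rejected_by: str,
    rejected_at: Optional[datetime] = None,
    investigation_performed: bool = True,
) -> Dict[str, Any]:
    """Assemble a rejection record with a UTC timestamp in ISO 8601 'Z' form."""
    moment = rejected_at or datetime.now(timezone.utc)
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess.
        raise ValueError("rejected_at must be timezone-aware")
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "rejected_claim": {
            "claim_type": claim_type,
            "claim_value": claim_value,
            "source_url": source_url,
        },
        "rejection_reason": reason,
        "rejection_details": details,
        "investigation_performed": investigation_performed,
        "rejected_by": rejected_by,
        "rejected_at": stamp,
    }
```

Serialise the result with `json.dumps(record, ensure_ascii=False, indent=2)` so that names such as "Álvarez" stay readable in the log.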
Scripts and Tools
| Script | Purpose | Status |
|---|---|---|
| scripts/parse_linkedin_html.py | Extract data from saved LinkedIn HTML | ✅ Active |
| scripts/enrich_person_comprehensive.py | Automated web enrichment | 🚫 DEPRECATED |
| scripts/validate_person_claims.py | Validate claim provenance | ✅ Active |
Checklist for Manual Enrichment
Before committing any enrichment:
- High-risk sources verified with appropriate threshold (5/5 for genealogy/IMDB, 4/5 for Wikipedia/social)
- At least 3 of 5 identity attributes verified
- Red flags investigated (profession/location differences checked for career changes or relocations)
- No overlapping incompatible timelines (can't be in two places/careers simultaneously)
- Birth year is plausible for career stage
- Full provenance recorded with entity_resolution block
- Verification notes explain why this is the same person
Related Documentation
- .opencode/rules/entity-resolution-no-heuristics.md - Rule 46 (CRITICAL)
- AGENTS.md - Rule 21 (Data Fabrication Prohibited), Rule 26 (Person Data Provenance)
- data/person/_ENRICHMENT_CLEANUP_FINAL_REPORT.md - Cleanup report from Jan 2026
Remember: Wrong data is worse than no data. When in doubt, DO NOT add the claim.