glam/docs/MANUAL_PERSON_ENRICHMENT_WORKFLOW.md
2026-01-11 12:15:27 +01:00

# Manual Person Enrichment Workflow
**Version**: 1.0.0
**Created**: 2026-01-11
**Status**: MANDATORY (automated enrichment is PROHIBITED)
---
## ⚠️ CRITICAL: Automated Enrichment is Prohibited
Automated web search enrichment has been **permanently disabled** due to catastrophic entity resolution failures discovered in January 2026:
- **540+ false claims** were attributed to wrong people with similar names
- Birth years from Venezuelan actresses attributed to UK curators
- Death years attributed to **living people**
- Social media from random namesakes attributed to heritage workers
**ALL person enrichment must now be done MANUALLY** with human verification that the source refers to the correct person.
See: `.opencode/rules/entity-resolution-no-heuristics.md` (Rule 46)
---
## Why Manual Enrichment is Required
### The Entity Resolution Problem
Web searches for "Carmen Juliá" return data about:
- Carmen Julia **Álvarez** (Venezuelan actress, born 1952)
- Carmen Julia **Navarro** (Mexican hydrogeologist)
- Carmen Julia **Gutiérrez** (Spanish medievalist)
**None of these** is the actual Carmen Juliá, a UK art curator at New Contemporaries.
Name matching alone **CANNOT** distinguish between namesakes. Only a human can:
1. Read the full source context
2. Cross-reference multiple identity attributes
3. Detect conflicting signals (actress vs curator, Venezuela vs UK)
4. Make an informed judgment about entity identity
### The Cost of Wrong Data
| Impact | Description |
|--------|-------------|
| **Corrupts analysis** | Downstream reports use false birth years, wrong affiliations |
| **Legal/privacy risk** | Attributing data to wrong person violates privacy |
| **Destroys trust** | Users lose confidence in entire dataset |
| **Expensive cleanup** | Manual removal of 540+ false claims took hours |
---
## Allowed Enrichment Methods
### 1. LinkedIn Profile Data (PREFERRED)
LinkedIn profiles are **self-reported** by the person and already verified through profile access.
**What to extract**:
- Current and past positions
- Education history
- Skills and endorsements
- Publications (if listed)
- Certifications
- Languages
**How to extract**:
1. Navigate to person's LinkedIn profile
2. Save page as HTML: File > Save Page As > "Webpage, Complete"
3. Run: `python scripts/parse_linkedin_html.py <saved_file.html>`
4. Review extracted data before committing
**Provenance**:
```json
{
  "claim_type": "position",
  "claim_value": {"title": "Curator", "organization": "Rijksmuseum"},
  "provenance": {
    "source_url": "https://www.linkedin.com/in/person-slug/",
    "retrieval_agent": "manual-linkedin-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "extraction_method": "parse_linkedin_html.py"
  }
}
```
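Once the extracted data has been reviewed, a record in this shape can be assembled in code. The following is a minimal sketch and not part of the repository: `build_claim` and its parameters are hypothetical names, and only the field names are taken from the example above.

```python
from datetime import datetime, timezone


def build_claim(claim_type, claim_value, source_url, retrieval_agent,
                retrieved_on=None, **extra_provenance):
    """Assemble a claim dict with a provenance block (hypothetical helper)."""
    # Default to the current time; reject naive datetimes so the "Z" suffix is honest.
    when = retrieved_on or datetime.now(timezone.utc)
    if when.tzinfo is None:
        raise ValueError("retrieved_on must be timezone-aware")
    provenance = {
        "source_url": source_url,
        "retrieval_agent": retrieval_agent,
        "retrieved_on": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Optional fields such as extraction_method or verification_notes.
    provenance.update(extra_provenance)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "provenance": provenance,
    }


claim = build_claim(
    "position",
    {"title": "Curator", "organization": "Rijksmuseum"},
    "https://www.linkedin.com/in/person-slug/",
    "manual-linkedin-extraction",
    retrieved_on=datetime(2026, 1, 11, 12, 0, tzinfo=timezone.utc),
    extraction_method="parse_linkedin_html.py",
)
```

Building the record does not replace the review step: the helper only guarantees that the provenance fields are present and that the timestamp is in UTC.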
### 2. Institutional Sources (VERIFIED)
Data from the person's employer website (museum, archive, library) about their own staff.
**Allowed sources**:
- Staff directory pages
- "About us" / "Team" pages
- Press releases about staff appointments
- Annual reports listing staff
**Verification**:
- URL must be the institution's official domain
- Content must explicitly identify the person
**Example**:
```json
{
  "claim_type": "position",
  "claim_value": {"title": "Head of Collections", "organization": "Van Gogh Museum"},
  "provenance": {
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "retrieval_agent": "manual-institutional-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Listed on official museum team page"
  }
}
```
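The "official domain" rule can be checked mechanically before a human reads the page. This is a minimal sketch using only the standard library; `is_official_domain` is a hypothetical helper, and the official domain itself must still be confirmed by a person.

```python
from urllib.parse import urlparse


def is_official_domain(source_url, official_domain):
    """Return True if source_url is hosted on official_domain or a subdomain of it."""
    parsed = urlparse(source_url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    official = official_domain.lower().strip(".")
    # Compare whole labels so "fakevangoghmuseum.nl" does not pass as "vangoghmuseum.nl".
    return host == official or host.endswith("." + official)


ok = is_official_domain(
    "https://www.vangoghmuseum.nl/en/about/organisation/team", "vangoghmuseum.nl"
)
lookalike = is_official_domain("https://vangoghmuseum.nl.example.com/team", "vangoghmuseum.nl")
```

A passing check only covers the first verification bullet; whether the page explicitly identifies the person remains a manual judgment.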
### 3. Verified Identifier Lookup (ORCID, Wikidata)
If the person has a verified identifier (ORCID, Wikidata QID), data from that source is acceptable.
**ORCID**:
- Must match by ORCID ID, not name search
- Publications, affiliations, employment from ORCID record
**Wikidata**:
- Must have confirmed Wikidata QID for this specific person
- Not from a random Wikidata search by name
**Example**:
```json
{
  "claim_type": "birth_year",
  "claim_value": 1975,
  "provenance": {
    "source_url": "https://www.wikidata.org/wiki/Q12345678",
    "wikidata_property": "P569",
    "retrieval_agent": "manual-wikidata-lookup",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Wikidata QID confirmed via ISNI link"
  }
}
```
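Because lookups must go by identifier rather than by name, it helps to reject malformed identifiers before querying anything. The sketch below checks the ORCID check digit (ISO 7064 MOD 11-2, as published by ORCID) and the basic shape of a Wikidata QID. The function names are hypothetical, and a well-formed identifier says nothing about whether it belongs to the right person.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_valid_orcid(orcid):
    """Check the format and the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[15] == expected


def is_wikidata_qid(value):
    """Check that value looks like a Wikidata item ID such as Q12345678."""
    return QID_PATTERN.fullmatch(value) is not None


# 0000-0002-1825-0097 is ORCID's well-known test record.
valid = is_valid_orcid("0000-0002-1825-0097")
typo = is_valid_orcid("0000-0002-1825-0098")
```

A name string such as `"Carmen Juliá"` fails both checks, which is the point: a name is never an acceptable lookup key.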
### 4. Manual Web Research (WITH VERIFICATION)
If you must use web search, follow these **mandatory** steps:
#### Step 1: Search and Gather Sources
Search for the person's name + employer + role:
```
"Carmen Juliá" "New Contemporaries" curator
```
#### Step 2: Entity Resolution Checklist
For EACH source, verify **at least 3 of 5** identity attributes match:
| # | Attribute | Profile Value | Source Value | Match? |
|---|-----------|---------------|--------------|--------|
| 1 | Career/Profession | Curator | | ☐ |
| 2 | Employer | New Contemporaries | | ☐ |
| 3 | Location | UK | | ☐ |
| 4 | Age/Time Period | Active 2020s | | ☐ |
| 5 | Education | [if known] | | ☐ |
**Minimum 3 of 5 must match.** Name match alone = REJECT.
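The checklist can be tallied in code once a human has filled it in. This is a minimal sketch: `tally_attributes` is a hypothetical helper, and the keys `time_period` and `education` are assumed names for the last two rows of the table (the first three follow the `attributes_matched` values used in this document).

```python
ATTRIBUTES = ("profession", "employer", "location", "time_period", "education")


def tally_attributes(checks, required=3):
    """Summarize a filled-in checklist.

    checks maps attribute name -> True (match), False (conflict), None (not stated).
    """
    matched = [a for a in ATTRIBUTES if checks.get(a) is True]
    conflicts = [a for a in ATTRIBUTES if checks.get(a) is False]
    return {
        "attributes_matched": matched,
        "match_count": len(matched),
        "conflicts": conflicts,
        "meets_threshold": len(matched) >= required,
    }


summary = tally_attributes({
    "profession": True,
    "employer": True,
    "location": True,
    "time_period": None,   # not stated in the source
    "education": None,     # not known for this profile
})
```

Meeting the threshold is necessary but not sufficient: any entry in `conflicts` still has to go through the red-flag investigation in Step 3.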
#### Step 3: Investigate Red Flags
**Red flags requiring investigation** (NOT automatic rejection - people change careers and relocate):
- ⚠️ Source profession differs (actress vs curator) → **Investigate**: Did they change careers?
- ⚠️ Source location differs (Venezuela vs UK) → **Investigate**: Did they relocate?
- ⚠️ Time gap in career → **Investigate**: Career break or different person?
**When to REJECT after investigation**:
- ❌ Overlapping timelines in different professions/locations (nobody is an actress in Venezuela AND a curator in the UK at the same time)
- ❌ No evidence of career change or relocation
- ❌ Birth year makes current career stage implausible
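The first rejection criterion is an interval-overlap test on career periods. The sketch below uses inclusive year ranges; `periods_overlap` is a hypothetical helper and the years in the usage lines are made up for illustration.

```python
def periods_overlap(start_a, end_a, start_b, end_b):
    """Return True if two inclusive year ranges overlap.

    An end year of None means the activity is ongoing.
    """
    end_a = float("inf") if end_a is None else end_a
    end_b = float("inf") if end_b is None else end_b
    return start_a <= end_b and start_b <= end_a


# Illustrative years only: one career 1990-2010, another from 2005 onward.
simultaneous = periods_overlap(1990, 2010, 2005, None)
# One career 1990-2000, another from 2005 onward: a gap, not an overlap.
sequential = periods_overlap(1990, 2000, 2005, None)
```

An overlap in incompatible roles or locations is grounds for rejection. A gap is not grounds for acceptance: it only means the other two criteria (evidence of a transition, plausible birth year) decide the outcome.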
#### Step 4: Document Verification
Record your verification in the claim provenance:
```json
{
  "claim_type": "education",
  "claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
  "provenance": {
    "source_url": "https://example.org/interview-carmen-julia",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T12:30:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Article explicitly mentions work at New Contemporaries in London"
    }
  }
}
```
---
## High-Risk Sources (Extra Verification Required)
The following sources have **high entity resolution risk** and require extra careful verification. They are NOT forbidden, but you must apply stricter matching thresholds:
| Source | Risk Level | Why | Required Matches |
|--------|------------|-----|------------------|
| **Genealogy sites** (geni.com, ancestry.*, familysearch.org, myheritage.*) | CRITICAL | Often describe historical namesakes | 5 of 5 attributes |
| **IMDB** | CRITICAL | Many actors share common names | 5 of 5 attributes |
| **Wikipedia (by name search)** | HIGH | Many people with same name have articles | 4 of 5 attributes |
| **Instagram / TikTok / Social media** | HIGH | Cannot easily verify account ownership | 4 of 5 attributes |
| **ResearchGate / Academia.edu** | HIGH | Multiple researchers with same name | 4 of 5 attributes |
| **News articles** | MEDIUM | May mention different person with same name | 3 of 5 attributes |
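The thresholds in this table can be looked up from the source URL. The following is a rough sketch, not an existing script: `required_matches` is a hypothetical helper that matches on hostname labels so that `ancestry.*` and `myheritage.*` are covered. News sites cannot be recognized by hostname, so everything unlisted falls back to the 3-of-5 baseline.

```python
from urllib.parse import urlparse

# Hostname labels taken from the table above.
CRITICAL_LABELS = {"geni", "ancestry", "familysearch", "myheritage", "imdb"}
HIGH_LABELS = {"wikipedia", "instagram", "tiktok", "researchgate", "academia"}


def required_matches(source_url):
    """Return how many of the 5 identity attributes must match for this source."""
    host = (urlparse(source_url).hostname or "").lower()
    labels = set(host.split("."))
    if labels & CRITICAL_LABELS:
        return 5
    if labels & HIGH_LABELS:
        return 4
    return 3  # baseline, including news articles


wikipedia_threshold = required_matches("https://en.wikipedia.org/wiki/Jan_de_Vries")
genealogy_threshold = required_matches("https://www.ancestry.co.uk/some-record")
```

Matching on any label can over-match (for example a subdomain named `academia` on an unrelated site), which errs toward the stricter threshold.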
### Using High-Risk Sources Correctly
**These sources CAN be used** if you verify enough identity attributes:
1. **Genealogy sites**: May be valid if person is historical AND dates/locations/profession all match
2. **IMDB**: May be valid if person actually works in film/TV AND other attributes match
3. **Wikipedia**: Read the FULL article - if profession, employer, location, and time period all match, it's likely correct
4. **Social media**: Check bio for employer/location mentions that match profile
**Example - Using Wikipedia correctly**:
```
Profile: Jan de Vries, Curator at Rijksmuseum, Amsterdam
Wikipedia: Jan de Vries (art historian)
- Mentions "curator at Rijksmuseum" ✅
- Mentions "Amsterdam" ✅
- Mentions "art history PhD from University of Amsterdam" ✅
- Active dates 2010-present ✅
→ All four checks pass, covering at least 4 of 5 attributes → ACCEPT (with documentation)
```
**Example - Correctly rejecting Wikipedia**:
```
Profile: Carmen Juliá, Curator at New Contemporaries, UK
Wikipedia: Carmen Julia Álvarez
- Profession: "actress" ❌ (conflict!)
- Location: "Venezuela" ❌ (conflict!)
→ Profession AND location conflict → REJECT
```
---
## Workflow Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│                    Person Enrichment Request                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Check LinkedIn Profile (PREFERRED)                     │
│  - If accessible, extract and use LinkedIn data                 │
│  - Self-reported by person, already verified                    │
└─────────────────────────────────────────────────────────────────┘
                                 │  LinkedIn not available?
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 2: Check Institutional Website                            │
│  - Find person on employer's official website                   │
│  - Staff directory, team page, press releases                   │
└─────────────────────────────────────────────────────────────────┘
                                 │  Not on employer website?
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 3: Check Verified Identifiers                             │
│  - ORCID (by ID, not name)                                      │
│  - Wikidata (by confirmed QID)                                  │
└─────────────────────────────────────────────────────────────────┘
                                 │  No verified identifiers?
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 4: Manual Web Research (WITH VERIFICATION)                │
│  - Search with name + employer + role                           │
│  - Verify 3 of 5 identity attributes                            │
│  - Check for profession/location conflicts                      │
│  - Document verification in provenance                          │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 5: Record Claim with Full Provenance                      │
│  - source_url, retrieved_on, retrieval_agent                    │
│  - entity_resolution block with verification details            │
│  - verification_notes explaining match                          │
└─────────────────────────────────────────────────────────────────┘
```
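The diagram is a strict fall-through: each step is tried only when the previous one is unavailable. A minimal sketch of that ordering follows; `next_enrichment_step` and its boolean parameters are hypothetical names, and the answers to all three questions come from a human, not from automated search.

```python
def next_enrichment_step(linkedin_accessible, on_employer_site, has_verified_identifier):
    """Return the first applicable source in the order given by the workflow."""
    if linkedin_accessible:
        return "linkedin_profile"
    if on_employer_site:
        return "institutional_website"
    if has_verified_identifier:
        return "verified_identifier"
    # Last resort: requires the full entity resolution checklist.
    return "manual_web_research"


step = next_enrichment_step(
    linkedin_accessible=False,
    on_employer_site=True,
    has_verified_identifier=True,
)
```

Whichever source is chosen, the claim is still recorded with full provenance as in Step 5.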
---
## Example: Correct Enrichment
### Person Profile
```json
{
  "ppid": "ID_NL-NH-AMS_198X_NL-NH-AMS_XXXX_JAN-DE-VRIES",
  "profile_data": {
    "full_name": "Jan de Vries",
    "headline": "Curator of Dutch Art at Rijksmuseum"
  }
}
```
### Manual Research Process
1. **Search**: `"Jan de Vries" "Rijksmuseum" curator`
2. **Found source**: https://www.rijksmuseum.nl/en/about-us/team
3. **Entity resolution check**:
   - ✅ Profession: "Curator" (matches)
   - ✅ Employer: "Rijksmuseum" (matches)
   - ✅ Location: "Amsterdam" (matches)
   - ⚪ Age: Not stated
   - ⚪ Education: Not stated
   - **Result**: 3 of 5 attributes match, 2 are not stated, none conflict → ACCEPT
4. **Add claim**:
```json
{
  "claim_type": "position",
  "claim_value": {
    "title": "Curator of Dutch Art",
    "organization": "Rijksmuseum"
  },
  "provenance": {
    "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T14:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T14:15:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Listed on official Rijksmuseum team page"
    }
  }
}
```
---
## Example: Correct Rejection
### Person Profile
```json
{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  }
}
```
### Manual Research Process
1. **Search**: `"Carmen Julia" born biography`
2. **Found source**: Wikipedia - Carmen Julia Álvarez
3. **Red flags detected** (investigate, don't auto-reject):
   - ⚠️ Profession: "actress" vs "curator" → Did she change careers?
   - ⚠️ Location: "Venezuela" vs "UK" → Did she relocate?
   - ⚠️ Birth year: 1952 (would be 74 in 2026)
4. **Investigation**:
   - Wikipedia shows Carmen Julia Álvarez was **active as an actress 1970s-2000s** in Venezuela
   - Profile shows Carmen Juliá is **active as a curator 2015-present** in the UK
   - These careers are on **different continents** and do not overlap in time, so one person would need both a career change and a relocation between the 2000s and 2015
   - No evidence of a career transition from acting to curating, or of a move from Venezuela to the UK
   - Age 74 is possible but unusual for "Curator at New Contemporaries" (typically held by someone younger)
5. **Conclusion**: No evidence of career change or relocation, and a birth year that is implausible for the career stage → **REJECT**
6. **Log rejection**:
```json
{
  "rejected_claim": {
    "claim_type": "birth_year",
    "claim_value": 1952,
    "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez"
  },
  "rejection_reason": "no_evidence_of_career_change",
  "rejection_details": "Wikipedia describes Venezuelan actress active 1970s-2000s; profile is UK curator active 2015-present. Different continents, no evidence of career transition or relocation, and birth year 1952 is implausible for the career stage.",
  "investigation_performed": true,
  "rejected_by": "kempersc",
  "rejected_at": "2026-01-11T14:30:00Z"
}
```
---
## Scripts and Tools
| Script | Purpose | Status |
|--------|---------|--------|
| `scripts/parse_linkedin_html.py` | Extract data from saved LinkedIn HTML | ✅ Active |
| `scripts/enrich_person_comprehensive.py` | Automated web enrichment | 🚫 **DEPRECATED** |
| `scripts/validate_person_claims.py` | Validate claim provenance | ✅ Active |
---
## Checklist for Manual Enrichment
Before committing any enrichment:
- [ ] High-risk sources verified with appropriate threshold (5/5 for genealogy/IMDB, 4/5 for Wikipedia/social)
- [ ] At least 3 of 5 identity attributes verified
- [ ] Red flags investigated (profession/location differences checked for career changes or relocations)
- [ ] No overlapping incompatible timelines (can't be in two places/careers simultaneously)
- [ ] Birth year is plausible for career stage
- [ ] Full provenance recorded with `entity_resolution` block
- [ ] Verification notes explain why this is the same person
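
The provenance items on this checklist can be pre-checked before a commit. The sketch below is illustrative only and does not reproduce the logic of `scripts/validate_person_claims.py`; `provenance_problems` is a hypothetical helper, and the required field lists are inferred from the examples in this document.

```python
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")
REQUIRED_RESOLUTION = (
    "verified_by", "verification_date", "attributes_matched",
    "match_count", "verification_notes",
)


def provenance_problems(claim, required_matches=3):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    provenance = claim.get("provenance") or {}
    for key in REQUIRED_PROVENANCE:
        if not provenance.get(key):
            problems.append(f"missing provenance.{key}")
    resolution = provenance.get("entity_resolution")
    if not isinstance(resolution, dict):
        problems.append("missing entity_resolution block")
        return problems
    for key in REQUIRED_RESOLUTION:
        if not resolution.get(key):
            problems.append(f"missing entity_resolution.{key}")
    matched = resolution.get("attributes_matched") or []
    if resolution.get("match_count") != len(matched):
        problems.append("match_count does not equal len(attributes_matched)")
    if len(matched) < required_matches:
        problems.append(f"only {len(matched)} attributes matched, need {required_matches}")
    return problems


example_claim = {
    "claim_type": "education",
    "claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
    "provenance": {
        "source_url": "https://example.org/interview-carmen-julia",
        "retrieval_agent": "manual-human-curator",
        "retrieved_on": "2026-01-11T12:00:00Z",
        "entity_resolution": {
            "verified_by": "kempersc",
            "verification_date": "2026-01-11T12:30:00Z",
            "attributes_matched": ["profession", "employer", "location"],
            "match_count": 3,
            "verification_notes": "Article explicitly mentions work at New Contemporaries in London",
        },
    },
}
issues = provenance_problems(example_claim)
```

For a high-risk source, pass the stricter threshold (for example `required_matches=4` for Wikipedia). A clean result only shows that the record is complete; it cannot show that the human judgment behind it was correct.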
---
## Related Documentation
- `.opencode/rules/entity-resolution-no-heuristics.md` - Rule 46 (CRITICAL)
- `AGENTS.md` - Rule 21 (Data Fabrication Prohibited), Rule 26 (Person Data Provenance)
- `data/person/_ENRICHMENT_CLEANUP_FINAL_REPORT.md` - Cleanup report from Jan 2026
---
**Remember: Wrong data is worse than no data. When in doubt, DO NOT add the claim.**