# Manual Person Enrichment Workflow

**Version**: 1.0.0
**Created**: 2026-01-11
**Status**: MANDATORY (automated enrichment is PROHIBITED)

---

## ⚠️ CRITICAL: Automated Enrichment is Prohibited

Automated web search enrichment has been **permanently disabled** due to catastrophic entity resolution failures discovered in January 2026:

- **540+ false claims** were attributed to the wrong people with similar names
- Birth years of Venezuelan actresses were attributed to UK curators
- Death years were attributed to **living people**
- Social media accounts of unrelated namesakes were attributed to heritage workers

**ALL person enrichment must now be done MANUALLY** with human verification that the source refers to the correct person.

See: `.opencode/rules/entity-resolution-no-heuristics.md` (Rule 46)

---

## Why Manual Enrichment is Required

### The Entity Resolution Problem

Web searches for "Carmen Juliá" return data about:
- Carmen Julia **Álvarez** (Venezuelan actress, born 1952)
- Carmen Julia **Navarro** (Mexican hydrogeologist)
- Carmen Julia **Gutiérrez** (Spanish medievalist)

**None of these** is the actual Carmen Juliá who is a UK art curator at New Contemporaries.

Name matching alone **CANNOT** distinguish between namesakes. Only a human can:
1. Read the full source context
2. Cross-reference multiple identity attributes
3. Detect conflicting signals (actress vs curator, Venezuela vs UK)
4. Make an informed judgment about entity identity

### The Cost of Wrong Data

| Impact | Description |
|--------|-------------|
| **Corrupts analysis** | Downstream reports use false birth years, wrong affiliations |
| **Legal/privacy risk** | Attributing data to wrong person violates privacy |
| **Destroys trust** | Users lose confidence in entire dataset |
| **Expensive cleanup** | Manual removal of 540+ false claims took hours |

---

## Allowed Enrichment Methods

### 1. LinkedIn Profile Data (PREFERRED)

LinkedIn profiles are **self-reported** by the person, and identity is already verified because the profile is accessed directly rather than found by name search.

**What to extract**:
- Current and past positions
- Education history
- Skills and endorsements
- Publications (if listed)
- Certifications
- Languages

**How to extract**:
1. Navigate to the person's LinkedIn profile
2. Save page as HTML: File > Save Page As > "Webpage, Complete"
3. Run: `python scripts/parse_linkedin_html.py <saved_file.html>`
4. Review extracted data before committing

**Provenance**:
```json
{
  "claim_type": "position",
  "claim_value": {"title": "Curator", "organization": "Rijksmuseum"},
  "provenance": {
    "source_url": "https://www.linkedin.com/in/person-slug/",
    "retrieval_agent": "manual-linkedin-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "extraction_method": "parse_linkedin_html.py"
  }
}
```

### 2. Institutional Sources (VERIFIED)

Data from the person's employer website (museum, archive, library) about their own staff.

**Allowed sources**:
- Staff directory pages
- "About us" / "Team" pages
- Press releases about staff appointments
- Annual reports listing staff

**Verification**:
- URL must be the institution's official domain
- Content must explicitly identify the person

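The domain requirement can be sketched in Python as a quick sanity check. This is only an illustration and not one of the project's scripts: the function name `is_official_domain` is hypothetical, and the official domain has to be supplied by the curator, never guessed from the person's name. The second requirement (the page explicitly identifies the person) still needs a human reading the page.

```python
from urllib.parse import urlsplit


def is_official_domain(source_url: str, official_domain: str) -> bool:
    """Return True if source_url is hosted on official_domain or a subdomain of it."""
    host = (urlsplit(source_url).hostname or "").lower()
    official = official_domain.lower().lstrip(".")
    # Require an exact match or a real subdomain. A plain suffix test would
    # also accept look-alike hosts such as "fake-vangoghmuseum.nl".
    return host == official or host.endswith("." + official)


print(is_official_domain(
    "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "vangoghmuseum.nl",
))
print(is_official_domain("https://fake-vangoghmuseum.nl/team", "vangoghmuseum.nl"))
```

The first call accepts the museum's own team page; the second rejects a look-alike host.
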
**Example**:
```json
{
  "claim_type": "position",
  "claim_value": {"title": "Head of Collections", "organization": "Van Gogh Museum"},
  "provenance": {
    "source_url": "https://www.vangoghmuseum.nl/en/about/organisation/team",
    "retrieval_agent": "manual-institutional-extraction",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Listed on official museum team page"
  }
}
```

### 3. Verified Identifier Lookup (ORCID, Wikidata)

If the person has a verified identifier (ORCID, Wikidata QID), data from that source is acceptable.

**ORCID**:
- Must match by ORCID ID, not name search
- Publications, affiliations, and employment may be taken from the ORCID record

**Wikidata**:
- Must have confirmed Wikidata QID for this specific person
- Must not come from a Wikidata search by name

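Before looking up a record by identifier, it can help to catch typos in the identifier itself. The sketch below is illustrative only (the function names are hypothetical): it checks the ORCID format and its ISO 7064 MOD 11-2 check digit, and the basic shape of a Wikidata QID. A well-formed identifier says nothing about *whose* identifier it is; confirming that it belongs to this specific person remains a manual step.

```python
import re

ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


def is_valid_orcid(orcid_id: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid_id):
        return False
    digits = orcid_id.replace("-", "")
    total = 0
    for char in digits[:-1]:  # the first 15 digits feed the checksum
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def is_wellformed_qid(qid: str) -> bool:
    """Check that a string looks like a Wikidata item ID (e.g. Q12345678)."""
    return bool(QID_PATTERN.match(qid))


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_wellformed_qid("Q12345678"))
```
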
**Example**:
```json
{
  "claim_type": "birth_year",
  "claim_value": 1975,
  "provenance": {
    "source_url": "https://www.wikidata.org/wiki/Q12345678",
    "wikidata_property": "P569",
    "retrieval_agent": "manual-wikidata-lookup",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "verification_notes": "Wikidata QID confirmed via ISNI link"
  }
}
```

### 4. Manual Web Research (WITH VERIFICATION)

If you must use web search, follow these **mandatory** steps:

#### Step 1: Search and Gather Sources

Search for the person's name + employer + role:
```
"Carmen Juliá" "New Contemporaries" curator
```

#### Step 2: Entity Resolution Checklist

For EACH source, verify **at least 3 of 5** identity attributes match:

| # | Attribute | Profile Value | Source Value | Match? |
|---|-----------|---------------|--------------|--------|
| 1 | Career/Profession | Curator | | ☐ |
| 2 | Employer | New Contemporaries | | ☐ |
| 3 | Location | UK | | ☐ |
| 4 | Age/Time Period | Active 2020s | | ☐ |
| 5 | Education | [if known] | | ☐ |

**Minimum 3 of 5 must match.** Name match alone = REJECT.

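The tally in the checklist can be sketched as a small Python helper. This is an assumption-laden illustration, not project code: the attribute keys and function names are hypothetical, and the values compared are the ones a human reviewer has already read from the source and typed in. The helper only counts; it does not decide what a source says, and it must not be used to automate matching.

```python
ATTRIBUTES = ("profession", "employer", "location", "time_period", "education")


def matched_attributes(profile: dict, source: dict) -> list[str]:
    """Return the attributes the reviewer recorded with the same value on both sides."""
    matched = []
    for name in ATTRIBUTES:
        profile_value = profile.get(name)
        source_value = source.get(name)
        # A value missing on either side counts as "not stated", never as a match.
        if profile_value and source_value and profile_value.casefold() == source_value.casefold():
            matched.append(name)
    return matched


def meets_threshold(profile: dict, source: dict, required: int = 3) -> bool:
    """True when at least `required` attributes match (default: 3 of 5)."""
    return len(matched_attributes(profile, source)) >= required


profile = {"profession": "Curator", "employer": "New Contemporaries",
           "location": "UK", "time_period": "Active 2020s"}
namesake = {"profession": "Actress", "location": "Venezuela"}
print(matched_attributes(profile, namesake), meets_threshold(profile, namesake))
```

For the namesake above, no attribute matches, so the source fails the minimum regardless of the shared name.
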
#### Step 3: Investigate Red Flags

**Red flags requiring investigation** (NOT automatic rejection - people change careers and relocate):
- ⚠️ Source profession differs (actress vs curator) → **Investigate**: Did they change careers?
- ⚠️ Source location differs (Venezuela vs UK) → **Investigate**: Did they relocate?
- ⚠️ Time gap in career → **Investigate**: Career break or different person?

**When to REJECT after investigation**:
- ❌ Overlapping timelines in different professions/locations (can't be actress in Venezuela AND curator in UK simultaneously)
- ❌ No evidence of career change or relocation
- ❌ Birth year makes current career stage implausible

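Two of the rejection criteria are simple arithmetic once the reviewer has written down the years. The following sketch (hypothetical function names, hypothetical example years) shows the overlap test for two inclusive year ranges and the age calculation used to judge career-stage plausibility. How a vague period such as "2000s" is turned into concrete years is the reviewer's call and should be noted in the verification notes.

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two inclusive year ranges share at least one year."""
    return a_start <= b_end and b_start <= a_end


def age_in_year(birth_year: int, year: int) -> int:
    """Approximate age reached during `year` for someone born in `birth_year`."""
    return year - birth_year


# Hypothetical careers: 1995-2010 in one country, 2005-2020 in another.
print(ranges_overlap(1995, 2010, 2005, 2020))  # ranges share 2005-2010
# Hypothetical careers with a gap between them: overlap alone does not reject.
print(ranges_overlap(1995, 2010, 2015, 2020))
print(age_in_year(1952, 2026))
```

When the ranges do not overlap, the timeline criterion is simply inconclusive; the other two criteria (evidence of a career change or relocation, plausibility of the birth year) still have to be checked.
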
#### Step 4: Document Verification

Record your verification in the claim provenance:

```json
{
  "claim_type": "education",
  "claim_value": {"institution": "Courtauld Institute", "degree": "MA Art History"},
  "provenance": {
    "source_url": "https://example.org/interview-carmen-julia",
    "retrieval_agent": "manual-human-curator",
    "retrieved_on": "2026-01-11T12:00:00Z",
    "entity_resolution": {
      "verified_by": "kempersc",
      "verification_date": "2026-01-11T12:30:00Z",
      "attributes_matched": ["profession", "employer", "location"],
      "match_count": 3,
      "verification_notes": "Article explicitly mentions work at New Contemporaries in London"
    }
  }
}
```

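A record of this shape can be assembled with a small helper so that `match_count` never drifts from `attributes_matched`. This is a minimal sketch under stated assumptions: `build_claim` is a hypothetical name, the field names are copied from the JSON above, and using the current UTC time as `verification_date` is an illustrative choice.

```python
import json
from datetime import datetime, timezone


def build_claim(claim_type, claim_value, source_url, retrieved_on,
                verified_by, attributes_matched, verification_notes):
    """Assemble a manually verified claim with an entity_resolution block."""
    if not verification_notes.strip():
        raise ValueError("verification_notes must explain why this is the same person")
    matched = list(attributes_matched)
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "provenance": {
            "source_url": source_url,
            "retrieval_agent": "manual-human-curator",
            "retrieved_on": retrieved_on,
            "entity_resolution": {
                "verified_by": verified_by,
                # Timestamp of the human verification, in UTC.
                "verification_date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                "attributes_matched": matched,
                # Derived from the list so the two fields cannot disagree.
                "match_count": len(matched),
                "verification_notes": verification_notes,
            },
        },
    }


claim = build_claim(
    "education",
    {"institution": "Courtauld Institute", "degree": "MA Art History"},
    "https://example.org/interview-carmen-julia",
    "2026-01-11T12:00:00Z",
    "kempersc",
    ["profession", "employer", "location"],
    "Article explicitly mentions work at New Contemporaries in London",
)
print(json.dumps(claim, indent=2))
```
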
---

## High-Risk Sources (Extra Verification Required)

The following sources carry a **high entity resolution risk** and require especially careful verification. They are NOT forbidden, but you must apply stricter matching thresholds:

| Source | Risk Level | Why | Required Matches |
|--------|------------|-----|------------------|
| **Genealogy sites** (geni.com, ancestry.*, familysearch.org, myheritage.*) | CRITICAL | Often describe historical namesakes | 5 of 5 attributes |
| **IMDB** | CRITICAL | Many actors share common names | 5 of 5 attributes |
| **Wikipedia (by name search)** | HIGH | Many people with same name have articles | 4 of 5 attributes |
| **Instagram / TikTok / Social media** | HIGH | Cannot easily verify account ownership | 4 of 5 attributes |
| **ResearchGate / Academia.edu** | HIGH | Multiple researchers with same name | 4 of 5 attributes |
| **News articles** | MEDIUM | May mention different person with same name | 3 of 5 attributes |

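The table can be read as a lookup from source to required number of matches. The sketch below is one possible encoding, with several assumptions: the names are hypothetical, the mapping of site names to domains (for example IMDB to `imdb.com`) is mine, and the `ancestry.*` / `myheritage.*` wildcards are interpreted as "any host containing that label". Social media platforms not named in the table are not covered and fall through to the default, so the reviewer must still classify them by hand.

```python
from urllib.parse import urlsplit

# Required matches (out of 5) per domain, following the table above.
HIGH_RISK_DOMAINS = {
    "geni.com": 5, "familysearch.org": 5, "imdb.com": 5,
    "wikipedia.org": 4, "instagram.com": 4, "tiktok.com": 4,
    "researchgate.net": 4, "academia.edu": 4,
}
# Brands listed with a wildcard TLD in the table (ancestry.*, myheritage.*).
HIGH_RISK_BRANDS = {"ancestry": 5, "myheritage": 5}
DEFAULT_REQUIRED = 3  # baseline for manual web research


def required_match_count(source_url: str) -> int:
    """Return how many of the 5 identity attributes must match for this source."""
    host = (urlsplit(source_url).hostname or "").lower()
    for domain, required in HIGH_RISK_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return required
    labels = host.split(".")
    for brand, required in HIGH_RISK_BRANDS.items():
        if brand in labels:
            return required
    return DEFAULT_REQUIRED


print(required_match_count("https://en.wikipedia.org/wiki/Jan_de_Vries"))
print(required_match_count("https://www.ancestry.co.uk/some-record"))
```
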
### Using High-Risk Sources Correctly

**These sources CAN be used** if you verify enough identity attributes:

1. **Genealogy sites**: May be valid if person is historical AND dates/locations/profession all match
2. **IMDB**: May be valid if person actually works in film/TV AND other attributes match
3. **Wikipedia**: Read the FULL article - if profession, employer, location, and time period all match, it's likely correct
4. **Social media**: Check bio for employer/location mentions that match profile

**Example - Using Wikipedia correctly**:
```
Profile: Jan de Vries, Curator at Rijksmuseum, Amsterdam
Wikipedia: Jan de Vries (art historian)
  - Mentions "curator at Rijksmuseum" ✅ (profession, employer)
  - Mentions "Amsterdam" ✅ (location)
  - Mentions "art history PhD from University of Amsterdam" ✅ (education)
  - Active dates 2010-present ✅ (time period)
→ All four statements match, covering 5 of 5 attributes (Wikipedia threshold: 4 of 5) → ACCEPT (with documentation)
```

**Example - Correctly rejecting Wikipedia**:
```
Profile: Carmen Juliá, Curator at New Contemporaries, UK
Wikipedia: Carmen Julia Álvarez
  - Profession: "actress" ❌ (conflict!)
  - Location: "Venezuela" ❌ (conflict!)
→ Profession AND location conflict, with no evidence of a career change or relocation → REJECT
```

---

## Workflow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                    Person Enrichment Request                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Check LinkedIn Profile (PREFERRED)                     │
│  - If accessible, extract and use LinkedIn data                 │
│  - Self-reported by person, already verified                    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                      LinkedIn not available?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 2: Check Institutional Website                            │
│  - Find person on employer's official website                   │
│  - Staff directory, team page, press releases                   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                     Not on employer website?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 3: Check Verified Identifiers                             │
│  - ORCID (by ID, not name)                                      │
│  - Wikidata (by confirmed QID)                                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                     No verified identifiers?
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 4: Manual Web Research (WITH VERIFICATION)                │
│  - Search with name + employer + role                           │
│  - Verify 3 of 5 identity attributes                            │
│  - Check for profession/location conflicts                      │
│  - Document verification in provenance                          │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│  Step 5: Record Claim with Full Provenance                      │
│  - source_url, retrieved_on, retrieval_agent                    │
│  - entity_resolution block with verification details            │
│  - verification_notes explaining match                          │
└─────────────────────────────────────────────────────────────────┘
```

---

## Example: Correct Enrichment

### Person Profile
```json
{
  "ppid": "ID_NL-NH-AMS_198X_NL-NH-AMS_XXXX_JAN-DE-VRIES",
  "profile_data": {
    "full_name": "Jan de Vries",
    "headline": "Curator of Dutch Art at Rijksmuseum"
  }
}
```

### Manual Research Process

1. **Search**: `"Jan de Vries" "Rijksmuseum" curator`

2. **Found source**: https://www.rijksmuseum.nl/en/about-us/team

3. **Entity resolution check**:
   - ✅ Profession: "Curator" (matches)
   - ✅ Employer: "Rijksmuseum" (matches)
   - ✅ Location: "Amsterdam" (matches)
   - ⚪ Age: Not stated
   - ⚪ Education: Not stated
   - **Result**: 3 of 5 attributes match (the other 2 are not stated), which meets the 3-of-5 minimum → ACCEPT

4. **Add claim**:
   ```json
   {
     "claim_type": "position",
     "claim_value": {
       "title": "Curator of Dutch Art",
       "organization": "Rijksmuseum"
     },
     "provenance": {
       "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
       "retrieval_agent": "manual-human-curator",
       "retrieved_on": "2026-01-11T14:00:00Z",
       "entity_resolution": {
         "verified_by": "kempersc",
         "verification_date": "2026-01-11T14:15:00Z",
         "attributes_matched": ["profession", "employer", "location"],
         "match_count": 3,
         "verification_notes": "Listed on official Rijksmuseum team page"
       }
     }
   }
   ```

---

## Example: Correct Rejection

### Person Profile
```json
{
  "ppid": "ID_UK-XX-XXX_XXXX_UK-XX-XXX_XXXX_CARMEN-JULIA",
  "profile_data": {
    "full_name": "Carmen Juliá",
    "headline": "Curator at New Contemporaries"
  }
}
```

### Manual Research Process

1. **Search**: `"Carmen Julia" born biography`

2. **Found source**: Wikipedia - Carmen Julia Álvarez

3. **Red flags detected** (investigate, don't auto-reject):
   - ⚠️ Profession: "actress" vs "curator" → Did she change careers?
   - ⚠️ Location: "Venezuela" vs "UK" → Did she relocate?
   - ⚠️ Birth year: 1952 (would be 74 in 2026)

4. **Investigation**:
   - Wikipedia shows Carmen Julia Álvarez was **active as actress 1970s-2000s** in Venezuela
   - Profile shows Carmen Juliá is **active as curator 2015-present** in UK
   - As stated, these date ranges do not strictly overlap, so the timeline alone is not conclusive
   - No evidence of a career transition from acting to curating, or of a relocation from Venezuela to the UK
   - Age 74 is possible but unusual for "Curator at New Contemporaries" (typically younger role)

5. **Conclusion**: No evidence of career change or relocation, and birth year implausible for the career stage → **REJECT**

6. **Log rejection**:
   ```json
   {
     "rejected_claim": {
       "claim_type": "birth_year",
       "claim_value": 1952,
       "source_url": "https://en.wikipedia.org/wiki/Carmen_Julia_Álvarez"
     },
     "rejection_reason": "no_evidence_of_career_transition",
     "rejection_details": "Wikipedia describes Venezuelan actress born 1952, active 1970s-2000s; profile is UK curator active 2015-present. No evidence of career transition or relocation, and birth year is implausible for the role.",
     "investigation_performed": true,
     "rejected_by": "kempersc",
     "rejected_at": "2026-01-11T14:30:00Z"
   }
   ```

---

## Scripts and Tools

| Script | Purpose | Status |
|--------|---------|--------|
| `scripts/parse_linkedin_html.py` | Extract data from saved LinkedIn HTML | ✅ Active |
| `scripts/enrich_person_comprehensive.py` | Automated web enrichment | 🚫 **DEPRECATED** |
| `scripts/validate_person_claims.py` | Validate claim provenance | ✅ Active |

---

## Checklist for Manual Enrichment

Before committing any enrichment:

- [ ] High-risk sources verified with appropriate threshold (5/5 for genealogy/IMDB, 4/5 for Wikipedia/social)
- [ ] At least 3 of 5 identity attributes verified
- [ ] Red flags investigated (profession/location differences checked for career changes or relocations)
- [ ] No overlapping incompatible timelines (can't be in two places/careers simultaneously)
- [ ] Birth year is plausible for career stage
- [ ] Full provenance recorded with `entity_resolution` block
- [ ] Verification notes explain why this is the same person

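The mechanical parts of this checklist (field presence and match counts) can be pre-checked before a human signs off. The sketch below is a hypothetical helper, not a description of `scripts/validate_person_claims.py`, whose behavior is not documented here. It assumes the claim layout shown in the manual web research examples, and it cannot judge whether red flags were actually investigated.

```python
REQUIRED_PROVENANCE = ("source_url", "retrieved_on", "retrieval_agent")
REQUIRED_RESOLUTION = ("verified_by", "verification_date", "attributes_matched",
                       "match_count", "verification_notes")


def checklist_problems(claim: dict, required: int = 3) -> list[str]:
    """Return a list of mechanical checklist failures; an empty list means none found."""
    problems = []
    provenance = claim.get("provenance", {})
    for key in REQUIRED_PROVENANCE:
        if not provenance.get(key):
            problems.append(f"missing provenance.{key}")
    resolution = provenance.get("entity_resolution")
    if not isinstance(resolution, dict):
        problems.append("missing provenance.entity_resolution block")
        return problems
    for key in REQUIRED_RESOLUTION:
        if key not in resolution:
            problems.append(f"missing entity_resolution.{key}")
    matched = resolution.get("attributes_matched", [])
    if resolution.get("match_count") != len(matched):
        problems.append("match_count does not equal len(attributes_matched)")
    if len(matched) < required:
        problems.append(f"only {len(matched)} attributes matched, {required} required")
    if not str(resolution.get("verification_notes", "")).strip():
        problems.append("verification_notes is empty")
    return problems


claim = {
    "claim_type": "position",
    "claim_value": {"title": "Curator of Dutch Art", "organization": "Rijksmuseum"},
    "provenance": {
        "source_url": "https://www.rijksmuseum.nl/en/about-us/team",
        "retrieval_agent": "manual-human-curator",
        "retrieved_on": "2026-01-11T14:00:00Z",
        "entity_resolution": {
            "verified_by": "kempersc",
            "verification_date": "2026-01-11T14:15:00Z",
            "attributes_matched": ["profession", "employer", "location"],
            "match_count": 3,
            "verification_notes": "Listed on official Rijksmuseum team page",
        },
    },
}
print(checklist_problems(claim))
```

Pass a higher `required` value (4 or 5) for high-risk sources.
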
---

## Related Documentation

- `.opencode/rules/entity-resolution-no-heuristics.md` - Rule 46 (CRITICAL)
- `AGENTS.md` - Rule 21 (Data Fabrication Prohibited), Rule 26 (Person Data Provenance)
- `data/person/_ENRICHMENT_CLEANUP_FINAL_REPORT.md` - Cleanup report from Jan 2026

---

**Remember: Wrong data is worse than no data. When in doubt, DO NOT add the claim.**