glam/.opencode/PERSON_DATA_PROVENANCE_RULE.md

# Person Data Provenance Rule

## Rule Summary

**Rule 26** in `AGENTS.md`: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

## Purpose

Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:

1. **Data Verification** - Any claim about a person can be checked against its source
2. **Legal Compliance** - GDPR and similar privacy regulations require transparency about where personal data came from
3. **Update Management** - Knowing sources enables systematic refresh cycles
4. **Credibility** - Academic and institutional users need citation trails

## What Requires Web Claims

### Person Data Types Requiring Provenance

| Data Type | Source Types | Provenance Required |
|-----------|--------------|---------------------|
| **Full Name** | LinkedIn, institutional pages | YES - always |
| **Job Title/Role** | LinkedIn, about pages, staff directories | YES - always |
| **Department** | Institutional pages | YES - always |
| **Email** | Contact pages, staff directories | YES - always |
| **Phone** | Contact pages | YES - always |
| **Professional History** | LinkedIn profiles | YES - always |
| **Education** | LinkedIn, CVs, institutional bios | YES - always |
| **Specialization** | Institutional bios, publications | YES - always |
| **Start Date** | News articles, announcements | RECOMMENDED |
| **Photo** | LinkedIn, institutional pages | RECOMMENDED |

## Claim Types

### Standard Person Claim Types

| Claim Type | Description | Example Value |
|------------|-------------|---------------|
| `full_name` | Person's complete name | "Taco Dibbits" |
| `role_title` | Current job title | "General Director" |
| `department` | Department or division | "Curatorial Department" |
| `email` | Work email address | "t.dibbits@rijksmuseum.nl" |
| `phone` | Work phone number | "+31 20 674 7000" |
| `start_date` | When role began | "2020-01-15" |
| `end_date` | When role ended | "2024-12-31" |
| `education` | Degree and institution | "PhD Art History, University of Amsterdam" |
| `specialization` | Area of expertise | "17th Century Dutch Painting" |
| `previous_employer` | Prior organization | "Metropolitan Museum of Art" |
| `biography` | Brief bio text | "Dr. Dibbits has led..." |
| `photo_url` | Profile image URL | "https://media.licdn.com/..." |

## Web Claim Structure

### Basic Web Claim

```yaml
web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0
```

### Required Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `claim_type` | string | YES | One of the values listed under Standard Person Claim Types |
| `claim_value` | string | YES | The extracted value |
| `source_url` | string | YES | URL where data was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to element |
| `xpath_match_score` | float | RECOMMENDED | Match between the claim value and the text at the XPath: 1.0 = exact, below 1.0 = fuzzy |
| `html_file` | string | RECOMMENDED | Path to archived HTML |
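
The field requirements above can be checked mechanically. The following is a minimal Python sketch, not the project's actual validator: the function name `check_claim_fields` and the shape of its return value are assumptions made for illustration, while the field names come from the table.

```python
from datetime import datetime

# Field names taken from the Required Fields table.
REQUIRED_FIELDS = ("claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "xpath_match_score", "html_file")


def check_claim_fields(claim: dict) -> dict:
    """Report missing fields and timestamp problems for a single web claim.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    missing_required = [f for f in REQUIRED_FIELDS if claim.get(f) in (None, "")]
    missing_recommended = [f for f in RECOMMENDED_FIELDS if claim.get(f) in (None, "")]

    errors = []
    retrieved_on = claim.get("retrieved_on")
    if retrieved_on:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            errors.append("retrieved_on is not an ISO 8601 timestamp")

    return {
        "missing_required": missing_required,
        "missing_recommended": missing_recommended,
        "errors": errors,
    }
```

A claim passes when `missing_required` and `errors` are both empty; entries in `missing_recommended` are warnings rather than failures.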

### Retrieval Agent Values

| Value | Description | Best For |
|-------|-------------|----------|
| `firecrawl` | FireCrawl MCP | Institutional pages |
| `playwright` | Playwright browser | JS-heavy sites |
| `exa_crawling_exa` | Exa crawl | LinkedIn profiles |
| `exa_linkedin_search_exa` | Exa LinkedIn search | Finding profiles |
| `manual` | Manual inspection | Last resort |
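
Because `retrieval_agent` is an enum, unknown values can be rejected early. This is one possible sketch in Python; the class name `RetrievalAgent` and the helper `parse_retrieval_agent` are hypothetical, and only the five string values are taken from the table above.

```python
from enum import Enum


class RetrievalAgent(str, Enum):
    """Allowed retrieval_agent values (from the Retrieval Agent Values table)."""

    FIRECRAWL = "firecrawl"
    PLAYWRIGHT = "playwright"
    EXA_CRAWLING = "exa_crawling_exa"
    EXA_LINKEDIN_SEARCH = "exa_linkedin_search_exa"
    MANUAL = "manual"


def parse_retrieval_agent(value: str) -> RetrievalAgent:
    """Convert a raw string to a RetrievalAgent, or raise a descriptive ValueError."""
    try:
        return RetrievalAgent(value)
    except ValueError:
        allowed = ", ".join(agent.value for agent in RetrievalAgent)
        raise ValueError(
            f"unknown retrieval_agent {value!r}; expected one of: {allowed}"
        ) from None
```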

## Complete Staff Entry Example

```yaml
staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true

    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
```

## LinkedIn Integration

### When LinkedIn Data is Available

Per **Rule 12** (Person Data Reference Pattern), full LinkedIn profiles are stored separately:

```yaml
# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
```

### Person Profile File Structure

Per **Rule 20** (Person Entity Profiles), profile files in `data/custodian/person/entity/` have the following structure:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

## Staff Discovery Workflow

### Step 1: Scrape Institutional Staff Pages

```python
# MCP tool calls, written in call notation (these are not shell commands)

# Find team/staff pages
firecrawl_firecrawl_map(
  url="https://www.institution.org",
  search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
  url="https://www.institution.org/about/team",
  formats=["markdown"]
)
```

### Step 2: Extract Names and Roles

For each person identified:
1. Extract full name
2. Extract job title
3. Note XPath for each element
4. Archive source HTML
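
The extraction steps above can be sketched as a small helper that turns one extracted element into a web claim. This is an illustration under stated assumptions: the function names `match_score` and `build_web_claim` are hypothetical, and using `difflib.SequenceMatcher` for fuzzy scores is an assumption, since this rule does not define how `xpath_match_score` values below 1.0 are computed.

```python
from datetime import datetime, timezone
from difflib import SequenceMatcher
from typing import Optional


def match_score(claim_value: str, element_text: str) -> float:
    """Return 1.0 for an exact match (ignoring extra whitespace), else a similarity ratio.

    Assumption: the fuzzy score is a difflib ratio; the rule itself does not specify it.
    """
    a = " ".join(claim_value.split())
    b = " ".join(element_text.split())
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def build_web_claim(claim_type: str, claim_value: str, source_url: str,
                    xpath: str, element_text: str, retrieval_agent: str,
                    retrieved_on: Optional[str] = None) -> dict:
    """Assemble a web claim dict using the field names from this document."""
    if retrieved_on is None:
        # Default to the current UTC time in the format used by the examples.
        retrieved_on = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": source_url,
        "xpath": xpath,
        "retrieved_on": retrieved_on,
        "retrieval_agent": retrieval_agent,
        "xpath_match_score": match_score(claim_value, element_text),
    }
```

Here `element_text` is the text found at `xpath` in the archived HTML, so a score below 1.0 signals that the stored value was cleaned up or only partially matches the page.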

### Step 3: Search LinkedIn for Profiles

```python
# MCP tool call (not a shell command)
# For each identified staff member
exa_linkedin_search_exa(
  query="Taco Dibbits Rijksmuseum",
  searchType="profiles",
  numResults=5
)
```

### Step 4: Extract LinkedIn Profiles

```python
# MCP tool call (not a shell command)
# When profile URL is known
exa_crawling_exa(
  url="https://www.linkedin.com/in/taco-dibbits-12345",
  maxCharacters=10000
)
```

### Step 5: Create Person Entity Files

Save each extracted profile to `data/custodian/person/entity/{slug}_{timestamp}.json`.
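
The file name pattern can be built as follows. This sketch assumes, based on the examples elsewhere in this document, that `{slug}` is the last path segment of the LinkedIn profile URL and that `{timestamp}` is a compact UTC timestamp such as `20250115T103000Z`; the function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse

ENTITY_DIR = PurePosixPath("data/custodian/person/entity")


def linkedin_slug(linkedin_url: str) -> str:
    """Extract the profile slug from a https://www.linkedin.com/in/<slug> URL."""
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {linkedin_url!r}")
    return parts[1]


def profile_path(linkedin_url: str, retrieved_at: datetime) -> str:
    """Build the entity file path; retrieved_at should be timezone-aware."""
    stamp = retrieved_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return str(ENTITY_DIR / f"{linkedin_slug(linkedin_url)}_{stamp}.json")
```

For the URL and retrieval time used in the complete staff entry example, this yields the same path as that example's `profile_data_path`.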

### Step 6: Update Custodian YAML

Add staff entries with:
- Basic info (name, role)
- Web claims with provenance
- LinkedIn profile reference (if available)

## Provenance Sources Priority

When multiple sources provide the same information:

| Priority | Source | Reliability |
|----------|--------|-------------|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |

When sources conflict, document both with provenance and note the discrepancy.
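
One way to apply this priority order in code is to sort claims by source category. The sketch below is hypothetical in two respects: the `source_kind` field is not part of the web claim structure defined in this document, and the category keys are invented labels for the six table rows.

```python
# Priority numbers mirror the table; the keys are illustrative labels only.
SOURCE_PRIORITY = {
    "institutional_website": 1,
    "linkedin": 2,
    "news": 3,
    "conference_program": 4,
    "academic_publication": 5,
    "third_party_database": 6,
}


def rank_claims(claims: list) -> list:
    """Return claims ordered from most to least reliable source.

    Claims with an unknown or missing source_kind sort last. Nothing is
    dropped, so conflicting claims remain available for documentation.
    """
    lowest = len(SOURCE_PRIORITY) + 1
    return sorted(claims, key=lambda c: SOURCE_PRIORITY.get(c.get("source_kind"), lowest))
```

Ranking only decides which value to display first; both claims should still be stored with their provenance.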

## Validation Checklist

Before marking staff data complete, verify:

- [ ] Every staff member has a `person_id`
- [ ] Full name has a web claim with `source_url`
- [ ] Role/title has a web claim with `source_url`
- [ ] All claims have a `retrieved_on` timestamp
- [ ] All claims have `retrieval_agent` specified
- [ ] LinkedIn profiles stored in `person/entity/` (not inline)
- [ ] XPath included where HTML was scraped
- [ ] No fabricated data (per Rule 21)
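
Several checklist items lend themselves to automation. The following Python sketch covers the structural checks only; the function name `check_staff_entry` is hypothetical, and treating an inline `profile_data` key as a violation is an assumption about how inline LinkedIn data would appear. XPath coverage and the absence of fabricated data still need human review.

```python
def check_staff_entry(entry: dict) -> list:
    """Return a list of checklist problems for one staff entry (empty if none)."""
    problems = []

    if not entry.get("person_id"):
        problems.append("missing person_id")

    claims = entry.get("web_claims") or []

    # Full name and role/title each need at least one claim with a source_url.
    for required_type in ("full_name", "role_title"):
        has_sourced_claim = any(
            c.get("claim_type") == required_type and c.get("source_url")
            for c in claims
        )
        if not has_sourced_claim:
            problems.append(f"no {required_type} claim with source_url")

    # Every claim needs a timestamp and a retrieval agent.
    for index, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                problems.append(f"web_claims[{index}] missing {field}")

    # Assumption: inline LinkedIn data would show up as a profile_data key.
    if "profile_data" in entry:
        problems.append("LinkedIn profile data is inline; store it in person/entity/")

    return problems
```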

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 14**: Exa MCP LinkedIn profile extraction
- **Rule 16**: LinkedIn photo URLs (CDN, not overlay page)
- **Rule 17**: LinkedIn connection unique identifiers
- **Rule 19**: HTML-only LinkedIn extraction
- **Rule 20**: Person entity profiles stored individually
- **Rule 21**: Data fabrication strictly prohibited

## Tools Reference

### For Institutional Websites

| Tool | MCP Name | Use Case |
|------|----------|----------|
| FireCrawl Scrape | `firecrawl_firecrawl_scrape` | Staff pages |
| Playwright Snapshot | `playwright_browser_snapshot` | JS-heavy sites |

### For LinkedIn

| Tool | MCP Name | Use Case |
|------|----------|----------|
| Exa Crawl | `exa_crawling_exa` | Profile extraction (URL known) |
| Exa LinkedIn Search | `exa_linkedin_search_exa` | Find profiles |
| Exa Web Search | `exa_web_search_exa` | Fallback search |

## Error Handling

### Missing XPath

If the XPath cannot be determined, set `xpath: null` and explain why in `notes`:
```yaml
web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"
```

### Conflicting Sources

Record both claims, each with its own provenance, and describe the discrepancy in `notes`:
```yaml
web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
```
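
Conflicts like the one above can be found by grouping claims by `claim_type` and looking for more than one distinct value. This is a minimal sketch; the function name `find_conflicts` and its return shape are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(claims: list) -> dict:
    """Map each claim_type with differing values to {claim_value: [source_url, ...]}."""
    by_type = defaultdict(dict)
    for claim in claims:
        sources = by_type[claim["claim_type"]].setdefault(claim["claim_value"], [])
        sources.append(claim["source_url"])
    # A type is in conflict when its sources disagree on the value.
    return {ctype: values for ctype, values in by_type.items() if len(values) > 1}
```

Claim types that can legitimately hold several values, such as `education` or `previous_employer`, would also be reported by this check and would need to be excluded or reviewed by hand.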

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 26
- `schemas/20251121/linkml/modules/classes/PersonObservation.yaml`
- `schemas/20251121/linkml/modules/classes/StaffRole.yaml`
- `data/custodian/person/entity/` (profile storage)