glam/.opencode/PERSON_DATA_PROVENANCE_RULE.md
2025-12-14 17:09:55 +01:00

# Person Data Provenance Rule
## Rule Summary
**Rule 26** in `AGENTS.md`: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.
## Purpose
Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:
1. **Data Verification** - Any claim about a person can be verified against its source
2. **Legal Compliance** - GDPR and privacy regulations require data source transparency
3. **Update Management** - Knowing sources enables systematic refresh cycles
4. **Credibility** - Academic and institutional users need citation trails
## What Requires Web Claims
### Person Data Types Requiring Provenance
| Data Type | Source Types | Provenance Required |
|-----------|--------------|---------------------|
| **Full Name** | LinkedIn, institutional pages | YES - always |
| **Job Title/Role** | LinkedIn, about pages, staff directories | YES - always |
| **Department** | Institutional pages | YES - always |
| **Email** | Contact pages, staff directories | YES - always |
| **Phone** | Contact pages | YES - always |
| **Professional History** | LinkedIn profiles | YES - always |
| **Education** | LinkedIn, CVs, institutional bios | YES - always |
| **Specialization** | Institutional bios, publications | YES - always |
| **Start Date** | News articles, announcements | RECOMMENDED |
| **Photo** | LinkedIn, institutional pages | RECOMMENDED |
## Claim Types
### Standard Person Claim Types
| Claim Type | Description | Example Value |
|------------|-------------|---------------|
| `full_name` | Person's complete name | "Taco Dibbits" |
| `role_title` | Current job title | "General Director" |
| `department` | Department or division | "Curatorial Department" |
| `email` | Work email address | "t.dibbits@rijksmuseum.nl" |
| `phone` | Work phone number | "+31 20 674 7000" |
| `start_date` | When role began | "2020-01-15" |
| `end_date` | When role ended | "2024-12-31" |
| `education` | Degree and institution | "PhD Art History, University of Amsterdam" |
| `specialization` | Area of expertise | "17th Century Dutch Painting" |
| `previous_employer` | Prior organization | "Metropolitan Museum of Art" |
| `biography` | Brief bio text | "Dr. Dibbits has led..." |
| `photo_url` | Profile image URL | "https://media.licdn.com/..." |
## Web Claim Structure
### Basic Web Claim
```yaml
web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0
```
### Required Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `claim_type` | string | YES | Type from claim types table |
| `claim_value` | string | YES | The extracted value |
| `source_url` | string | YES | URL where data was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to element |
| `xpath_match_score` | float | RECOMMENDED | 1.0 exact, <1.0 fuzzy |
| `html_file` | string | RECOMMENDED | Path to archived HTML |
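The required-field rules in the table above could be checked with a short script. This is a minimal sketch, assuming each claim has been loaded into a plain Python dict; the function names `missing_required_fields` and `has_valid_timestamp` are hypothetical and not part of any project tooling.

```python
from datetime import datetime

# Required keys taken from the Required Fields table above.
REQUIRED_FIELDS = (
    "claim_type",
    "claim_value",
    "source_url",
    "retrieved_on",
    "retrieval_agent",
)


def missing_required_fields(claim: dict) -> list[str]:
    """Return the required fields that are absent or empty in a claim."""
    return [field for field in REQUIRED_FIELDS if not claim.get(field)]


def has_valid_timestamp(claim: dict) -> bool:
    """Check that retrieved_on parses as an ISO 8601 timestamp."""
    value = str(claim.get("retrieved_on", ""))
    try:
        # Swap a trailing "Z" for an explicit offset so that
        # older Python versions can parse the value too.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


claim = {
    "claim_type": "full_name",
    "claim_value": "Taco Dibbits",
    "source_url": "https://www.rijksmuseum.nl/en/about-us/organisation",
    "retrieved_on": "2025-01-15T10:30:00Z",
    "retrieval_agent": "firecrawl",
}
print(missing_required_fields(claim), has_valid_timestamp(claim))
```

Recommended fields such as `xpath` are deliberately not enforced here, since the table marks them as optional.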
### Retrieval Agent Values
| Value | Description | Best For |
|-------|-------------|----------|
| `firecrawl` | FireCrawl MCP | Institutional pages |
| `playwright` | Playwright browser | JS-heavy sites |
| `exa_crawling_exa` | Exa crawl | LinkedIn profiles |
| `exa_linkedin_search_exa` | Exa LinkedIn search | Finding profiles |
| `manual` | Manual inspection | Last resort |
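Because `retrieval_agent` is an enum, the allowed values in the table above can be captured in code so that typos are caught early. The sketch below is an illustration only; the class name `RetrievalAgent` and the helper `is_known_agent` are assumptions, not existing project code.

```python
from enum import Enum


class RetrievalAgent(str, Enum):
    """Allowed retrieval_agent values, copied from the table above."""

    FIRECRAWL = "firecrawl"
    PLAYWRIGHT = "playwright"
    EXA_CRAWLING = "exa_crawling_exa"
    EXA_LINKEDIN_SEARCH = "exa_linkedin_search_exa"
    MANUAL = "manual"


def is_known_agent(value: str) -> bool:
    """Return True if value is one of the documented retrieval agents."""
    try:
        RetrievalAgent(value)
        return True
    except ValueError:
        return False
```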
## Complete Staff Entry Example
```yaml
staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true
    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
```
## LinkedIn Integration
### When LinkedIn Data is Available
Per **Rule 12** (Person Data Reference Pattern), full LinkedIn profiles are stored separately:
```yaml
# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
```
### Person Profile File Structure
Per **Rule 20** (Person Entity Profiles), profile files in `data/custodian/person/entity/` use the following structure:
```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```
## Staff Discovery Workflow
### Step 1: Scrape Institutional Staff Pages
```python
# Find team/staff pages
firecrawl_firecrawl_map(
    url="https://www.institution.org",
    search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
    url="https://www.institution.org/about/team",
    formats=["markdown"]
)
```
### Step 2: Extract Names and Roles
For each person identified:
1. Extract full name
2. Extract job title
3. Note XPath for each element
4. Archive source HTML
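The extraction steps above produce one web claim per element. A minimal sketch of assembling those claims, assuming the name, role, and XPaths have already been read from the scraped page; `make_web_claim` is a hypothetical helper, not an existing project function.

```python
from datetime import datetime, timezone


def make_web_claim(claim_type, claim_value, source_url, xpath,
                   retrieval_agent, retrieved_on=None):
    """Build one web claim dict with the fields used in this rule."""
    if retrieved_on is None:
        retrieved_on = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" is always accurate.
    stamp = retrieved_on.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": source_url,
        "xpath": xpath,
        "retrieved_on": stamp,
        "retrieval_agent": retrieval_agent,
    }


page = "https://www.rijksmuseum.nl/en/about-us/organisation"
when = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
claims = [
    make_web_claim("full_name", "Taco Dibbits", page,
                   "/html/body/main/section[2]/div[1]/h2", "firecrawl", when),
    make_web_claim("role_title", "General Director", page,
                   "/html/body/main/section[2]/div[1]/p[1]", "firecrawl", when),
]
```

Passing the same `retrieved_on` value for every claim taken from one page keeps the timestamps consistent with a single retrieval.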
### Step 3: Search LinkedIn for Profiles
```python
# For each identified staff member
exa_linkedin_search_exa(
    query="Taco Dibbits Rijksmuseum",
    searchType="profiles",
    numResults=5
)
```
### Step 4: Extract LinkedIn Profiles
```python
# When profile URL is known
exa_crawling_exa(
    url="https://www.linkedin.com/in/taco-dibbits-12345",
    maxCharacters=10000
)
```
### Step 5: Create Person Entity Files
Save extracted profile to `data/custodian/person/entity/{slug}_{timestamp}.json`
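The `{slug}_{timestamp}.json` pattern can be derived mechanically. The sketch below assumes, based only on the example paths in this document, that the slug is the last path segment of the LinkedIn URL and that the timestamp is UTC in `YYYYMMDDTHHMMSSZ` form; `profile_path` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

ENTITY_DIR = "data/custodian/person/entity"


def profile_path(linkedin_url: str, retrieved_on: datetime) -> str:
    """Derive the entity file path from a LinkedIn URL and retrieval time."""
    # The slug is the last non-empty path segment, e.g. "taco-dibbits-12345".
    slug = urlparse(linkedin_url).path.rstrip("/").split("/")[-1]
    if not slug:
        raise ValueError(f"no profile slug in URL: {linkedin_url}")
    stamp = retrieved_on.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ENTITY_DIR}/{slug}_{stamp}.json"


path = profile_path(
    "https://www.linkedin.com/in/taco-dibbits-12345",
    datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(path)
```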
### Step 6: Update Custodian YAML
Add staff entries with:
- Basic info (name, role)
- Web claims with provenance
- LinkedIn profile reference (if available)
## Provenance Sources Priority
When multiple sources provide the same information, prefer them in this order:
| Priority | Source | Reliability |
|----------|--------|-------------|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |
When sources conflict, document both with provenance and note the discrepancy.
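The priority table can be applied as a sort order that keeps every claim rather than discarding the lower-ranked ones. This is a sketch under one loud assumption: each claim carries a `source_kind` label, which is a hypothetical field that does not appear in the web claim structure defined in this rule.

```python
# Priority numbers follow the table above; the keys are hypothetical labels.
SOURCE_PRIORITY = {
    "institutional_website": 1,
    "linkedin": 2,
    "news": 3,
    "conference_program": 4,
    "academic_publication": 5,
    "third_party_database": 6,
}


def rank_claims(claims: list[dict]) -> list[dict]:
    """Order claims from most to least reliable without dropping any."""
    # Unknown source kinds sort last; sorted() is stable, so ties keep
    # their original order.
    return sorted(
        claims,
        key=lambda claim: SOURCE_PRIORITY.get(claim.get("source_kind"), 99),
    )


ranked = rank_claims([
    {"claim_value": "Chief Curator", "source_kind": "linkedin"},
    {"claim_value": "Senior Curator", "source_kind": "institutional_website"},
])
print(ranked[0]["claim_value"])
```

The first element is the preferred value, while the remaining elements stay available so the discrepancy can still be documented.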
## Validation Checklist
Before marking staff data complete, verify:
- [ ] Every staff member has `person_id`
- [ ] Full name has a web claim with `source_url`
- [ ] Role/title has a web claim with `source_url`
- [ ] All claims have `retrieved_on` timestamp
- [ ] All claims have `retrieval_agent` specified
- [ ] LinkedIn profiles stored in `person/entity/` (not inline)
- [ ] XPath included where HTML was scraped
- [ ] No fabricated data (per Rule 21)
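Several of the checklist items above are mechanical and could be automated. The following is a minimal sketch, assuming a staff entry loaded as a dict; `check_staff_entry` is a hypothetical name, and treating an inline `profile_data` key as the sign of an inlined LinkedIn profile is an assumption. Fabrication (Rule 21) and XPath coverage still need human review.

```python
def check_staff_entry(entry: dict) -> list[str]:
    """Return a list of checklist violations for one staff entry."""
    problems = []
    if not entry.get("person_id"):
        problems.append("missing person_id")

    claims = entry.get("web_claims") or []
    # Full name and role/title each need a claim with a source_url.
    for required_type in ("full_name", "role_title"):
        if not any(
            c.get("claim_type") == required_type and c.get("source_url")
            for c in claims
        ):
            problems.append(f"no {required_type} claim with source_url")

    # Every claim needs a timestamp and a retrieval agent.
    for index, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                problems.append(f"claim {index} missing {field}")

    # Assumption: inline profile data shows up under a "profile_data" key.
    if "profile_data" in entry:
        problems.append("LinkedIn profile stored inline")
    return problems
```

An empty list means the automated checks passed.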
## Related Rules
- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 14**: Exa MCP LinkedIn profile extraction
- **Rule 16**: LinkedIn photo URLs (CDN, not overlay page)
- **Rule 17**: LinkedIn connection unique identifiers
- **Rule 19**: HTML-only LinkedIn extraction
- **Rule 20**: Person entity profiles stored individually
- **Rule 21**: Data fabrication strictly prohibited
## Tools Reference
### For Institutional Websites
| Tool | MCP Name | Use Case |
|------|----------|----------|
| FireCrawl Scrape | `firecrawl_firecrawl_scrape` | Staff pages |
| Playwright Snapshot | `playwright_browser_snapshot` | JS-heavy sites |
### For LinkedIn
| Tool | MCP Name | Use Case |
|------|----------|----------|
| Exa Crawl | `exa_crawling_exa` | Profile extraction (URL known) |
| Exa LinkedIn Search | `exa_linkedin_search_exa` | Find profiles |
| Exa Web Search | `exa_web_search_exa` | Fallback search |
## Error Handling
### Missing XPath
If the XPath cannot be determined, set `xpath` to `null` and explain why in `notes`:
```yaml
web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"
```
### Conflicting Sources
Document both claims:
```yaml
web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
```
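Conflicts like the one above can be surfaced automatically before a reviewer adds the `notes` text. A minimal sketch, assuming the claims are loaded as a list of dicts; `find_conflicts` is a hypothetical helper, and exact string comparison is a simplification that ignores case and whitespace differences.

```python
from collections import defaultdict


def find_conflicts(claims: list[dict]) -> dict[str, list[str]]:
    """Map each claim_type to its distinct values when more than one exists."""
    values = defaultdict(list)
    for claim in claims:
        value = claim["claim_value"]
        # Keep first-seen order and skip exact duplicates.
        if value not in values[claim["claim_type"]]:
            values[claim["claim_type"]].append(value)
    return {ctype: vals for ctype, vals in values.items() if len(vals) > 1}


claims = [
    {"claim_type": "role_title", "claim_value": "Senior Curator"},
    {"claim_type": "role_title", "claim_value": "Chief Curator"},
    {"claim_type": "full_name", "claim_value": "Example Person"},
]
print(find_conflicts(claims))
```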
## Version History
| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |
## See Also
- `AGENTS.md` Rule 26
- `schemas/20251121/linkml/modules/classes/PersonObservation.yaml`
- `schemas/20251121/linkml/modules/classes/StaffRole.yaml`
- `data/custodian/person/entity/` (profile storage)