# Person Data Provenance Rule

## Rule Summary

**Rule 26** in `AGENTS.md`: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

## Purpose

Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:

1. **Data Verification** - Any claim about a person can be verified against its source
2. **Legal Compliance** - GDPR and privacy regulations require data source transparency
3. **Update Management** - Knowing the sources enables systematic refresh cycles
4. **Credibility** - Academic and institutional users need citation trails

## What Requires Web Claims

### Person Data Types Requiring Provenance

| Data Type | Source Types | Provenance Required |
|-----------|--------------|---------------------|
| **Full Name** | LinkedIn, institutional pages | YES - always |
| **Job Title/Role** | LinkedIn, about pages, staff directories | YES - always |
| **Department** | Institutional pages | YES - always |
| **Email** | Contact pages, staff directories | YES - always |
| **Phone** | Contact pages | YES - always |
| **Professional History** | LinkedIn profiles | YES - always |
| **Education** | LinkedIn, CVs, institutional bios | YES - always |
| **Specialization** | Institutional bios, publications | YES - always |
| **Start Date** | News articles, announcements | RECOMMENDED |
| **Photo** | LinkedIn, institutional pages | RECOMMENDED |

## Claim Types

### Standard Person Claim Types

| Claim Type | Description | Example Value |
|------------|-------------|---------------|
| `full_name` | Person's complete name | "Taco Dibbits" |
| `role_title` | Current job title | "General Director" |
| `department` | Department or division | "Curatorial Department" |
| `email` | Work email address | "t.dibbits@rijksmuseum.nl" |
| `phone` | Work phone number | "+31 20 674 7000" |
| `start_date` | When role began | "2020-01-15" |
| `end_date` | When role ended | "2024-12-31" |
| `education` | Degree and institution | "PhD Art History, University of Amsterdam" |
| `specialization` | Area of expertise | "17th Century Dutch Painting" |
| `previous_employer` | Prior organization | "Metropolitan Museum of Art" |
| `biography` | Brief bio text | "Dr. Dibbits has led..." |
| `photo_url` | Profile image URL | "https://media.licdn.com/..." |

## Web Claim Structure

### Basic Web Claim

```yaml
web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0
```

### Required Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `claim_type` | string | YES | Type from claim types table |
| `claim_value` | string | YES | The extracted value |
| `source_url` | string | YES | URL where data was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to element |
| `xpath_match_score` | float | RECOMMENDED | 1.0 exact, <1.0 fuzzy |
| `html_file` | string | RECOMMENDED | Path to archived HTML |
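
The table describes `xpath_match_score` only as "1.0 exact, <1.0 fuzzy" and does not say how a fuzzy score is computed. The sketch below is one possible reading, assuming the score compares the claimed value with the text found at the XPath. The function name, the whitespace normalisation, and the use of `difflib` are all assumptions, not part of the rule.

```python
from difflib import SequenceMatcher


def xpath_match_score(claim_value: str, element_text: str) -> float:
    """Return 1.0 for an exact match, otherwise a similarity ratio below 1.0."""
    # Assumption: whitespace differences caused by HTML layout are ignored.
    claimed = " ".join(claim_value.split())
    found = " ".join(element_text.split())
    if claimed == found:
        return 1.0
    # Assumption: fuzzy matches are scored with difflib's ratio (0.0 to 1.0).
    return SequenceMatcher(None, claimed, found).ratio()


exact = xpath_match_score("Taco Dibbits", "  Taco   Dibbits ")
fuzzy = xpath_match_score("Taco Dibbits", "Taco Dibbits, General Director")
print(exact, fuzzy)
```

A score well below 1.0 would suggest the XPath points at a larger element than the claimed value, which may be worth reviewing by hand.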

### Retrieval Agent Values

| Value | Description | Best For |
|-------|-------------|----------|
| `firecrawl` | FireCrawl MCP | Institutional pages |
| `playwright` | Playwright browser | JS-heavy sites |
| `exa_crawling_exa` | Exa crawl | LinkedIn profiles |
| `exa_linkedin_search_exa` | Exa LinkedIn search | Finding profiles |
| `manual` | Manual inspection | Last resort |
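
The field and agent tables above can be turned into a simple per-claim check. This is a minimal sketch, assuming claims have already been loaded from YAML into plain dictionaries. The function name `validate_web_claim` and the wording of the messages are hypothetical.

```python
from datetime import datetime

# Fields marked YES in the Required Fields table.
REQUIRED_FIELDS = ("claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent")
# Values listed in the Retrieval Agent Values table.
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa_crawling_exa", "exa_linkedin_search_exa", "manual"}


def validate_web_claim(claim: dict) -> list[str]:
    """Return a list of problems; an empty list means the claim passes."""
    problems = [f"missing required field: {name}" for name in REQUIRED_FIELDS if not claim.get(name)]
    agent = claim.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        problems.append(f"unknown retrieval_agent: {agent}")
    retrieved_on = claim.get("retrieved_on")
    if retrieved_on:
        try:
            # Map the trailing "Z" used in the examples to an explicit UTC offset.
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieved_on is not ISO 8601: {retrieved_on}")
    score = claim.get("xpath_match_score")
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append(f"xpath_match_score out of range: {score}")
    return problems


claim = {
    "claim_type": "full_name",
    "claim_value": "Taco Dibbits",
    "source_url": "https://www.rijksmuseum.nl/en/about-us/organisation",
    "retrieved_on": "2025-01-15T10:30:00Z",
    "retrieval_agent": "firecrawl",
    "xpath_match_score": 1.0,
}
print(validate_web_claim(claim))  # prints []
```

RECOMMENDED fields such as `xpath` and `html_file` are deliberately not enforced here, since the table does not make them mandatory.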

## Complete Staff Entry Example

```yaml
staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true

    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
```
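
The `person_id` values in this document (`rijksmuseum_staff_0001_taco_dibbits`, `rijksmuseum_staff_0042_alexandr_belov`) appear to follow the pattern `{institution}_staff_{4-digit index}_{name slug}`. That pattern is inferred from the examples and is not stated by the rule, so the helper below is only a sketch under that assumption. The name `make_person_id` and the accent-stripping step are hypothetical.

```python
import re
import unicodedata


def make_person_id(institution_slug: str, index: int, full_name: str) -> str:
    """Build an ID such as rijksmuseum_staff_0001_taco_dibbits (inferred pattern)."""
    # Strip accents so the slug stays ASCII (assumption, not required by the rule).
    ascii_name = unicodedata.normalize("NFKD", full_name).encode("ascii", "ignore").decode("ascii")
    # Lowercase and collapse every run of non-alphanumerics into one underscore.
    name_slug = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    return f"{institution_slug}_staff_{index:04d}_{name_slug}"


print(make_person_id("rijksmuseum", 1, "Taco Dibbits"))  # prints rijksmuseum_staff_0001_taco_dibbits
```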

## LinkedIn Integration

### When LinkedIn Data is Available

Per **Rule 12** (Person Data Reference Pattern), full LinkedIn profiles are stored separately:

```yaml
# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
```

### Person Profile File Structure

Per **Rule 20** (Person Entity Profiles), profile files in `data/custodian/person/entity/` have the following structure:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

## Staff Discovery Workflow

### Step 1: Scrape Institutional Staff Pages

```python
# MCP tool calls shown in call notation (not shell commands)

# Find team/staff pages
firecrawl_firecrawl_map(
    url="https://www.institution.org",
    search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
    url="https://www.institution.org/about/team",
    formats=["markdown"]
)
```

### Step 2: Extract Names and Roles

For each person identified:
1. Extract full name
2. Extract job title
3. Note XPath for each element
4. Archive source HTML
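
Noting the XPath for each extracted element can be automated once the page has been parsed into a tree. The sketch below uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup, so it assumes the archived HTML has already been cleaned into XHTML; real pages would usually need a dedicated HTML parser. The helper name `absolute_xpath` and the sample page are hypothetical. Positions are written only where a step is ambiguous, which reproduces the style of the XPaths used elsewhere in this document.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root: ET.Element, target: ET.Element) -> str:
    """Build an absolute XPath for target, indexing only ambiguous steps."""
    # ElementTree nodes do not know their parent, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        step = node.tag
        if parent is not None:
            same_tag = [sib for sib in parent if sib.tag == node.tag]
            if len(same_tag) > 1:
                # XPath positions are 1-based.
                step += f"[{same_tag.index(node) + 1}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


# Hypothetical, well-formed sample page.
PAGE = """<html><body><main>
<section><h1>About</h1></section>
<section>
  <div><h2>Taco Dibbits</h2><p>General Director</p></div>
  <div><h2>Example Person</h2><p>Curator</p></div>
</section>
</main></body></html>"""

root = ET.fromstring(PAGE)
for heading in root.iter("h2"):
    print(heading.text, absolute_xpath(root, heading))
```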

### Step 3: Search LinkedIn for Profiles

```python
# For each identified staff member (MCP tool call, not a shell command)
exa_linkedin_search_exa(
    query="Taco Dibbits Rijksmuseum",
    searchType="profiles",
    numResults=5
)
```

### Step 4: Extract LinkedIn Profiles

```python
# When profile URL is known (MCP tool call, not a shell command)
exa_crawling_exa(
    url="https://www.linkedin.com/in/taco-dibbits-12345",
    maxCharacters=10000
)
```

### Step 5: Create Person Entity Files

Save the extracted profile to `data/custodian/person/entity/{slug}_{timestamp}.json`.
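
The `{slug}_{timestamp}` pattern can be derived from the profile URL and the retrieval time. This is a minimal sketch, assuming the slug is the last path segment of the LinkedIn URL and the timestamp is compact UTC, as in the example paths in this document (`alexandr-belov-bb547b46_20251210T120000Z.json`). The function name `profile_path` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

ENTITY_DIR = "data/custodian/person/entity"


def profile_path(linkedin_url: str, retrieved: datetime) -> str:
    """Return the entity file path for a LinkedIn profile URL."""
    # The slug is the last path segment of the profile URL, e.g. /in/<slug>.
    slug = urlparse(linkedin_url).path.rstrip("/").split("/")[-1]
    # Compact UTC timestamp; pass a timezone-aware datetime to avoid local-time surprises.
    stamp = retrieved.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ENTITY_DIR}/{slug}_{stamp}.json"


print(profile_path(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
))
```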

### Step 6: Update Custodian YAML

Add staff entries with:
- Basic info (name, role)
- Web claims with provenance
- LinkedIn profile reference (if available)

## Provenance Sources Priority

When multiple sources provide the same information:

| Priority | Source | Reliability |
|----------|--------|-------------|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |

When sources conflict, document both with provenance and note the discrepancy.
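
The priority table can be applied mechanically when a single value has to be chosen for display, while every claim is still kept on record. The sketch below assumes each claim carries a `source_kind` label. That field and its values are hypothetical and are not part of the web claim structure defined by this rule.

```python
# Priorities from the table above; the keys are hypothetical labels.
SOURCE_PRIORITY = {
    "institutional_website": 1,
    "linkedin": 2,
    "news": 3,
    "conference_program": 4,
    "academic_publication": 5,
    "third_party_database": 6,
}


def preferred_claim(claims: list[dict]) -> dict:
    """Pick the claim from the highest-priority source (lowest number)."""
    # Unknown or missing source kinds sort last instead of raising.
    return min(claims, key=lambda c: SOURCE_PRIORITY.get(c.get("source_kind"), 99))


claims = [
    {"claim_type": "role_title", "claim_value": "Chief Curator", "source_kind": "linkedin"},
    {"claim_type": "role_title", "claim_value": "Senior Curator", "source_kind": "institutional_website"},
]
print(preferred_claim(claims)["claim_value"])  # prints Senior Curator
```

Selecting a preferred value does not replace the requirement to document both claims. The lower-priority claim stays in `web_claims` with its own provenance.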

## Validation Checklist

Before marking staff data complete, verify:

- [ ] Every staff member has `person_id`
- [ ] Full name has web claim with source_url
- [ ] Role/title has web claim with source_url
- [ ] All claims have `retrieved_on` timestamp
- [ ] All claims have `retrieval_agent` specified
- [ ] LinkedIn profiles stored in `person/entity/` (not inline)
- [ ] XPath included where HTML was scraped
- [ ] No fabricated data (per Rule 21)
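
Most of the checklist can be checked automatically for each staff entry. The following sketch assumes a staff entry has been loaded into a dictionary shaped like the complete example earlier in this document. The function name `check_staff_entry` is hypothetical, and treating a `profile_data` key as the sign of an inline LinkedIn profile is an assumption. The XPath and fabrication items still need human review.

```python
def check_staff_entry(entry: dict) -> list[str]:
    """Return checklist failures for one staff entry; an empty list means it passes."""
    failures = []
    if not entry.get("person_id"):
        failures.append("missing person_id")
    claims = entry.get("web_claims", [])
    # Name and role each need at least one claim that carries a source_url.
    for required_type in ("full_name", "role_title"):
        if not any(c.get("claim_type") == required_type and c.get("source_url") for c in claims):
            failures.append(f"no {required_type} claim with source_url")
    for i, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                failures.append(f"claim {i} missing {field}")
    # Assumption: an inline LinkedIn profile would appear as a "profile_data" key.
    if "profile_data" in entry or "profile_data" in entry.get("linkedin_claim", {}):
        failures.append("LinkedIn profile stored inline")
    return failures


entry = {
    "person_id": "rijksmuseum_staff_0001_taco_dibbits",
    "web_claims": [
        {"claim_type": "full_name", "source_url": "https://www.rijksmuseum.nl/en/about-us/organisation",
         "retrieved_on": "2025-01-15T10:30:00Z", "retrieval_agent": "firecrawl"},
        {"claim_type": "role_title", "source_url": "https://www.rijksmuseum.nl/en/about-us/organisation",
         "retrieved_on": "2025-01-15T10:30:00Z", "retrieval_agent": "firecrawl"},
    ],
}
print(check_staff_entry(entry))  # prints []
```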

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 14**: Exa MCP LinkedIn profile extraction
- **Rule 16**: LinkedIn photo URLs (CDN, not overlay page)
- **Rule 17**: LinkedIn connection unique identifiers
- **Rule 19**: HTML-only LinkedIn extraction
- **Rule 20**: Person entity profiles stored individually
- **Rule 21**: Data fabrication strictly prohibited

## Tools Reference

### For Institutional Websites

| Tool | MCP Name | Use Case |
|------|----------|----------|
| FireCrawl Scrape | `firecrawl_firecrawl_scrape` | Staff pages |
| Playwright Snapshot | `playwright_browser_snapshot` | JS-heavy sites |

### For LinkedIn

| Tool | MCP Name | Use Case |
|------|----------|----------|
| Exa Crawl | `exa_crawling_exa` | Profile extraction (URL known) |
| Exa LinkedIn Search | `exa_linkedin_search_exa` | Find profiles |
| Exa Web Search | `exa_web_search_exa` | Fallback search |

## Error Handling

### Missing XPath

If XPath cannot be determined:
```yaml
web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"
```

### Conflicting Sources

Document both claims:
```yaml
web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
```
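
Conflicts like the one above can be surfaced automatically so that a `notes` entry is not forgotten. This is a minimal sketch, assuming claims are plain dictionaries and that two claims conflict when they share a `claim_type` but differ in `claim_value`. The function name `find_conflicts` is hypothetical, and the check would need refining for claim types that legitimately hold several values, such as `education`.

```python
from collections import defaultdict


def find_conflicts(claims: list[dict]) -> dict[str, list[str]]:
    """Map each claim_type to its distinct values when more than one exists."""
    values = defaultdict(list)
    for claim in claims:
        value = claim["claim_value"]
        # Keep first-seen order and skip repeats of the same value.
        if value not in values[claim["claim_type"]]:
            values[claim["claim_type"]].append(value)
    return {ctype: vals for ctype, vals in values.items() if len(vals) > 1}


web_claims = [
    {"claim_type": "role_title", "claim_value": "Senior Curator",
     "source_url": "https://institution.org/team"},
    {"claim_type": "role_title", "claim_value": "Chief Curator",
     "source_url": "https://linkedin.com/in/example"},
    {"claim_type": "full_name", "claim_value": "Example Person",
     "source_url": "https://institution.org/team"},
]
print(find_conflicts(web_claims))
```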

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 26
- `schemas/20251121/linkml/modules/classes/PersonObservation.yaml`
- `schemas/20251121/linkml/modules/classes/StaffRole.yaml`
- `data/custodian/person/entity/` (profile storage)