Person Data Provenance Rule
Rule Summary
Rule 26 in AGENTS.md: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.
Purpose
Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:
- Data Verification - Any claim about a person can be verified against its source
- Legal Compliance - GDPR and privacy regulations require data source transparency
- Update Management - Knowing sources enables systematic refresh cycles
- Credibility - Academic and institutional users need citation trails
What Requires Web Claims
Person Data Types Requiring Provenance
| Data Type | Source Types | Provenance Required |
|---|---|---|
| Full Name | LinkedIn, institutional pages | YES - always |
| Job Title/Role | LinkedIn, about pages, staff directories | YES - always |
| Department | Institutional pages | YES - always |
| Email | Contact pages, staff directories | YES - always |
| Phone | Contact pages | YES - always |
| Professional History | LinkedIn profiles | YES - always |
| Education | LinkedIn, CVs, institutional bios | YES - always |
| Specialization | Institutional bios, publications | YES - always |
| Start Date | News articles, announcements | RECOMMENDED |
| Photo | LinkedIn, institutional pages | RECOMMENDED |
Claim Types
Standard Person Claim Types
| Claim Type | Description | Example Value |
|---|---|---|
| full_name | Person's complete name | "Taco Dibbits" |
| role_title | Current job title | "General Director" |
| department | Department or division | "Curatorial Department" |
| email | Work email address | "t.dibbits@rijksmuseum.nl" |
| phone | Work phone number | "+31 20 674 7000" |
| start_date | When role began | "2020-01-15" |
| end_date | When role ended | "2024-12-31" |
| education | Degree and institution | "PhD Art History, University of Amsterdam" |
| specialization | Area of expertise | "17th Century Dutch Painting" |
| previous_employer | Prior organization | "Metropolitan Museum of Art" |
| biography | Brief bio text | "Dr. Dibbits has led..." |
| photo_url | Profile image URL | "https://media.licdn.com/..." |
Web Claim Structure
Basic Web Claim
web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0
Required Fields
| Field | Type | Required | Description |
|---|---|---|---|
| claim_type | string | YES | Type from claim types table |
| claim_value | string | YES | The extracted value |
| source_url | string | YES | URL where data was found |
| retrieved_on | datetime | YES | ISO 8601 timestamp |
| retrieval_agent | enum | YES | Tool used for extraction |
| xpath | string | RECOMMENDED | XPath to element |
| xpath_match_score | float | RECOMMENDED | 1.0 exact, <1.0 fuzzy |
| html_file | string | RECOMMENDED | Path to archived HTML |
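The field requirements above can be checked mechanically. The following is a minimal sketch, not part of the rule itself: the function name `check_web_claim` and the shape of its return value are assumptions made for illustration, while the field names come from the table.

```python
from datetime import datetime

# Field names are taken from the Required Fields table.
REQUIRED_FIELDS = ("claim_type", "claim_value", "source_url",
                   "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "xpath_match_score", "html_file")


def check_web_claim(claim: dict) -> dict:
    """Report missing fields and timestamp validity for one web claim.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    missing_required = [f for f in REQUIRED_FIELDS if claim.get(f) in (None, "")]
    missing_recommended = [f for f in RECOMMENDED_FIELDS if claim.get(f) is None]

    timestamp_ok = False
    raw = claim.get("retrieved_on")
    if isinstance(raw, str):
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(raw.replace("Z", "+00:00"))
            timestamp_ok = True
        except ValueError:
            timestamp_ok = False

    return {
        "missing_required": missing_required,
        "missing_recommended": missing_recommended,
        "timestamp_ok": timestamp_ok,
    }
```

A claim with an empty `missing_required` list and `timestamp_ok` set to true satisfies the YES rows of the table; entries in `missing_recommended` are worth reviewing but do not make the claim invalid.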
Retrieval Agent Values
| Value | Description | Best For |
|---|---|---|
| firecrawl | FireCrawl MCP | Institutional pages |
| playwright | Playwright browser | JS-heavy sites |
| exa_crawling_exa | Exa crawl | LinkedIn profiles |
| exa_linkedin_search_exa | Exa LinkedIn search | Finding profiles |
| manual | Manual inspection | Last resort |
Complete Staff Entry Example
staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true
    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
LinkedIn Integration
When LinkedIn Data is Available
Per Rule 12 (Person Data Reference Pattern), full LinkedIn profiles are stored separately:
# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
Person Profile File Structure
Per Rule 20 (Person Entity Profiles), profile files in data/custodian/person/entity/ use the following structure:
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
Staff Discovery Workflow
Step 1: Scrape Institutional Staff Pages
# Find team/staff pages
firecrawl_firecrawl_map(
    url="https://www.institution.org",
    search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
    url="https://www.institution.org/about/team",
    formats=["markdown"]
)
Step 2: Extract Names and Roles
For each person identified:
- Extract full name
- Extract job title
- Note XPath for each element
- Archive source HTML
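Once a name or title and its XPath have been noted, the result can be recorded as a web claim. The following is a minimal sketch of that step: the helper `make_web_claim` is hypothetical, and it only fills the fields listed in the Required Fields table plus an optional xpath.

```python
from datetime import datetime, timezone


def make_web_claim(claim_type, claim_value, source_url, retrieval_agent,
                   xpath=None, retrieved_on=None):
    """Assemble one web claim dict with a UTC ISO 8601 timestamp.

    Hypothetical helper. If retrieved_on is omitted, the current time is used.
    A naive datetime is interpreted as local time by astimezone().
    """
    when = retrieved_on or datetime.now(timezone.utc)
    claim = {
        "claim_type": claim_type,
        "claim_value": claim_value,
        "source_url": source_url,
        "retrieved_on": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "retrieval_agent": retrieval_agent,
    }
    if xpath is not None:
        # xpath is RECOMMENDED, so it is only written when it was determined.
        claim["xpath"] = xpath
    return claim


example = make_web_claim(
    "full_name",
    "Taco Dibbits",
    "https://www.rijksmuseum.nl/en/about-us/organisation",
    "firecrawl",
    xpath="/html/body/main/section[2]/div[1]/h2",
    retrieved_on=datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(example["retrieved_on"])  # prints 2025-01-15T10:30:00Z
```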
Step 3: Search LinkedIn for Profiles
# For each identified staff member
exa_linkedin_search_exa(
    query="Taco Dibbits Rijksmuseum",
    searchType="profiles",
    numResults=5
)
Step 4: Extract LinkedIn Profiles
# When profile URL is known
exa_crawling_exa(
    url="https://www.linkedin.com/in/taco-dibbits-12345",
    maxCharacters=10000
)
Step 5: Create Person Entity Files
Save extracted profile to data/custodian/person/entity/{slug}_{timestamp}.json
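The example paths in this document suggest that {slug} is the last segment of the LinkedIn profile URL and {timestamp} is a compact UTC timestamp such as 20251210T120000Z. The sketch below builds a path on that reading; both interpretations are inferred from the examples, and the function name `profile_path` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

ENTITY_DIR = "data/custodian/person/entity"


def profile_path(linkedin_url: str, retrieved_on: datetime) -> str:
    """Return the entity file path for a LinkedIn profile.

    Assumptions: slug = last URL path segment, timestamp = YYYYMMDDTHHMMSSZ in UTC.
    """
    slug = urlparse(linkedin_url).path.rstrip("/").split("/")[-1]
    stamp = retrieved_on.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ENTITY_DIR}/{slug}_{stamp}.json"


print(profile_path(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
))
# prints data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
```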
Step 6: Update Custodian YAML
Add staff entries with:
- Basic info (name, role)
- Web claims with provenance
- LinkedIn profile reference (if available)
Provenance Sources Priority
When multiple sources provide the same information:
| Priority | Source | Reliability |
|---|---|---|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |
When sources conflict, document both with provenance and note the discrepancy.
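As a rough illustration of this policy, the sketch below groups claims by claim_type, keeps every conflicting claim, and orders them by source priority. It is a simplification: only priorities 1 (institutional website) and 2 (LinkedIn) are detected from the URL host, all other sources fall back to the lowest rank, and the names `source_priority` and `find_conflicts` are hypothetical.

```python
from urllib.parse import urlparse

LOWEST_PRIORITY = 6  # fallback for sources this sketch cannot classify


def source_priority(url: str, institution_domain: str) -> int:
    """Rank a source URL: 1 = institutional site, 2 = LinkedIn, else lowest."""
    host = (urlparse(url).hostname or "").lower()
    if host == institution_domain or host.endswith("." + institution_domain):
        return 1
    if host == "linkedin.com" or host.endswith(".linkedin.com"):
        return 2
    return LOWEST_PRIORITY


def find_conflicts(claims: list, institution_domain: str) -> dict:
    """Map each claim_type with differing values to its claims, best source first.

    No claim is dropped: conflicting claims are all kept so the discrepancy
    stays documented.
    """
    by_type = {}
    for claim in claims:
        by_type.setdefault(claim["claim_type"], []).append(claim)

    conflicts = {}
    for claim_type, group in by_type.items():
        if len({c["claim_value"] for c in group}) > 1:
            conflicts[claim_type] = sorted(
                group,
                key=lambda c: source_priority(c["source_url"], institution_domain),
            )
    return conflicts
```

The first claim in each returned list comes from the highest-priority source; the remaining claims can be annotated with a notes field describing the discrepancy.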
Validation Checklist
Before marking staff data complete, verify:
- Every staff member has person_id
- Full name has web claim with source_url
- Role/title has web claim with source_url
- All claims have retrieved_on timestamp
- All claims have retrieval_agent specified
- LinkedIn profiles stored in person/entity/ (not inline)
- XPath included where HTML was scraped
- No fabricated data (per Rule 21)
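Several of these checks can be automated for a single staff entry. The sketch below covers the mechanical items only; whether data was fabricated still needs human review. The function name `check_staff_entry` is hypothetical, and treating an inline profile_data key as an "inline LinkedIn profile" is an assumption based on the profile file structure shown earlier.

```python
def check_staff_entry(entry: dict) -> list:
    """Return a list of checklist problems for one staff entry (empty if none)."""
    problems = []

    if not entry.get("person_id"):
        problems.append("missing person_id")

    claims = entry.get("web_claims") or []

    # Full name and role/title each need a claim that carries a source_url.
    for required_type in ("full_name", "role_title"):
        if not any(
            c.get("claim_type") == required_type and c.get("source_url")
            for c in claims
        ):
            problems.append(f"no {required_type} claim with source_url")

    # Every claim needs a retrieval timestamp and a retrieval agent.
    for i, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                problems.append(f"claim {i} missing {field}")

    # Assumption: an inline profile shows up as a profile_data key on the entry.
    if "profile_data" in entry:
        problems.append("LinkedIn profile stored inline")

    return problems
```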
Related Rules
- Rule 6: WebObservation claims MUST have XPath provenance
- Rule 12: Person data reference pattern (file paths, not inline)
- Rule 14: Exa MCP LinkedIn profile extraction
- Rule 16: LinkedIn photo URLs (CDN, not overlay page)
- Rule 17: LinkedIn connection unique identifiers
- Rule 19: HTML-only LinkedIn extraction
- Rule 20: Person entity profiles stored individually
- Rule 21: Data fabrication strictly prohibited
Tools Reference
For Institutional Websites
| Tool | MCP Name | Use Case |
|---|---|---|
| FireCrawl Scrape | firecrawl_firecrawl_scrape | Staff pages |
| Playwright Snapshot | playwright_browser_snapshot | JS-heavy sites |
For LinkedIn
| Tool | MCP Name | Use Case |
|---|---|---|
| Exa Crawl | exa_crawling_exa | Profile extraction (URL known) |
| Exa LinkedIn Search | exa_linkedin_search_exa | Find profiles |
| Exa Web Search | exa_web_search_exa | Fallback search |
Error Handling
Missing XPath
If XPath cannot be determined:
web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"
Conflicting Sources
Document both claims:
web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
Version History
| Date | Change |
|---|---|
| 2025-01-15 | Initial rule creation |
See Also
- AGENTS.md Rule 26
- schemas/20251121/linkml/modules/classes/PersonObservation.yaml
- schemas/20251121/linkml/modules/classes/StaffRole.yaml
- data/custodian/person/entity/ (profile storage)