# Person Data Provenance Rule

## Rule Summary

**Rule 26** in `AGENTS.md`: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

## Purpose

Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:

1. **Data Verification** - Any claim about a person can be verified against its source
2. **Legal Compliance** - GDPR and privacy regulations require data source transparency
3. **Update Management** - Knowing sources enables systematic refresh cycles
4. **Credibility** - Academic and institutional users need citation trails

## What Requires Web Claims

### Person Data Types Requiring Provenance

| Data Type | Source Types | Provenance Required |
|-----------|--------------|---------------------|
| **Full Name** | LinkedIn, institutional pages | YES - always |
| **Job Title/Role** | LinkedIn, about pages, staff directories | YES - always |
| **Department** | Institutional pages | YES - always |
| **Email** | Contact pages, staff directories | YES - always |
| **Phone** | Contact pages | YES - always |
| **Professional History** | LinkedIn profiles | YES - always |
| **Education** | LinkedIn, CVs, institutional bios | YES - always |
| **Specialization** | Institutional bios, publications | YES - always |
| **Start Date** | News articles, announcements | RECOMMENDED |
| **Photo** | LinkedIn, institutional pages | RECOMMENDED |

## Claim Types

### Standard Person Claim Types

| Claim Type | Description | Example Value |
|------------|-------------|---------------|
| `full_name` | Person's complete name | "Taco Dibbits" |
| `role_title` | Current job title | "General Director" |
| `department` | Department or division | "Curatorial Department" |
| `email` | Work email address | "t.dibbits@rijksmuseum.nl" |
| `phone` | Work phone number | "+31 20 674 7000" |
| `start_date` | When role began | "2020-01-15" |
| `end_date` | When role ended | "2024-12-31" |
| `education` | Degree and institution | "PhD Art History, University of Amsterdam" |
| `specialization` | Area of expertise | "17th Century Dutch Painting" |
| `previous_employer` | Prior organization | "Metropolitan Museum of Art" |
| `biography` | Brief bio text | "Dr. Dibbits has led..." |
| `photo_url` | Profile image URL | "https://media.licdn.com/..." |

## Web Claim Structure

### Basic Web Claim

```yaml
web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0
```

### Required Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `claim_type` | string | YES | Type from claim types table |
| `claim_value` | string | YES | The extracted value |
| `source_url` | string | YES | URL where data was found |
| `retrieved_on` | datetime | YES | ISO 8601 timestamp |
| `retrieval_agent` | enum | YES | Tool used for extraction |
| `xpath` | string | RECOMMENDED | XPath to element |
| `xpath_match_score` | float | RECOMMENDED | 1.0 exact, <1.0 fuzzy |
| `html_file` | string | RECOMMENDED | Path to archived HTML |

### Retrieval Agent Values

| Value | Description | Best For |
|-------|-------------|----------|
| `firecrawl` | FireCrawl MCP | Institutional pages |
| `playwright` | Playwright browser | JS-heavy sites |
| `exa_crawling_exa` | Exa crawl | LinkedIn profiles |
| `exa_linkedin_search_exa` | Exa LinkedIn search | Finding profiles |
| `manual` | Manual inspection | Last resort |

## Complete Staff Entry Example

```yaml
staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true

    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0

    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
```

## LinkedIn Integration

### When LinkedIn Data is Available

Per **Rule 12** (Person Data Reference Pattern), full LinkedIn profiles are stored separately:

```yaml
# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json
```

### Person Profile File Structure

Per **Rule 20** (Person Entity Profiles), profile files in `data/custodian/person/entity/` follow this structure:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

## Staff Discovery Workflow

### Step 1: Scrape Institutional Staff Pages

```bash
# Find team/staff pages
firecrawl_firecrawl_map(
  url="https://www.institution.org",
  search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
  url="https://www.institution.org/about/team",
  formats=["markdown"]
)
```

### Step 2: Extract Names and Roles

For each person identified:

1. Extract full name
2. Extract job title
3. Note XPath for each element
4. Archive source HTML

### Step 3: Search LinkedIn for Profiles

```bash
# For each identified staff member
exa_linkedin_search_exa(
  query="Taco Dibbits Rijksmuseum",
  searchType="profiles",
  numResults=5
)
```

### Step 4: Extract LinkedIn Profiles

```bash
# When profile URL is known
exa_crawling_exa(
  url="https://www.linkedin.com/in/taco-dibbits-12345",
  maxCharacters=10000
)
```

### Step 5: Create Person Entity Files

Save the extracted profile to `data/custodian/person/entity/{slug}_{timestamp}.json`

### Step 6: Update Custodian YAML

Add staff entries with:

- Basic info (name, role)
- Web claims with provenance
- LinkedIn profile reference (if available)

## Provenance Sources Priority

When multiple sources provide the same information:

| Priority | Source | Reliability |
|----------|--------|-------------|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |

When sources conflict, document both with provenance and note the discrepancy.

## Validation Checklist

Before marking staff data complete, verify:

- [ ] Every staff member has `person_id`
- [ ] Full name has web claim with source_url
- [ ] Role/title has web claim with source_url
- [ ] All claims have `retrieved_on` timestamp
- [ ] All claims have `retrieval_agent` specified
- [ ] LinkedIn profiles stored in `person/entity/` (not inline)
- [ ] XPath included where HTML was scraped
- [ ] No fabricated data (per Rule 21)

## Related Rules

- **Rule 6**: WebObservation claims MUST have XPath provenance
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 14**: Exa MCP LinkedIn profile extraction
- **Rule 16**: LinkedIn photo URLs (CDN, not overlay page)
- **Rule 17**: LinkedIn connection unique identifiers
- **Rule 19**: HTML-only LinkedIn extraction
- **Rule 20**: Person entity profiles stored individually
- **Rule 21**: Data fabrication strictly prohibited

## Tools Reference

### For Institutional Websites

| Tool | MCP Name | Use Case |
|------|----------|----------|
| FireCrawl Scrape | `firecrawl_firecrawl_scrape` | Staff pages |
| Playwright Snapshot | `playwright_browser_snapshot` | JS-heavy sites |

### For LinkedIn

| Tool | MCP Name | Use Case |
|------|----------|----------|
| Exa Crawl | `exa_crawling_exa` | Profile extraction (URL known) |
| Exa LinkedIn Search | `exa_linkedin_search_exa` | Find profiles |
| Exa Web Search | `exa_web_search_exa` | Fallback search |

## Error Handling

### Missing XPath

If XPath cannot be determined:

```yaml
web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"
```

### Conflicting Sources

Document both claims:

```yaml
web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl

  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
```

## Version History

| Date | Change |
|------|--------|
| 2025-01-15 | Initial rule creation |

## See Also

- `AGENTS.md` Rule 26
- `schemas/20251121/linkml/modules/classes/PersonObservation.yaml`
- `schemas/20251121/linkml/modules/classes/StaffRole.yaml`
- `data/custodian/person/entity/` (profile storage)
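
## Appendix: Checklist Validation Sketch

Several items from the Validation Checklist and the Required Fields table can be checked mechanically. The sketch below is one possible approach, assuming a staff entry has already been loaded into a plain Python dict (for example with a YAML parser). The function names `validate_claim` and `validate_staff_entry` are hypothetical and do not refer to any existing tool in this project.

```python
from datetime import datetime

# Fields marked YES in the Required Fields table.
REQUIRED_FIELDS = (
    "claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent",
)
# Values listed in the Retrieval Agent Values table.
RETRIEVAL_AGENTS = {
    "firecrawl", "playwright", "exa_crawling_exa",
    "exa_linkedin_search_exa", "manual",
}


def validate_claim(claim):
    """Return a list of problems for one web claim (empty list means OK)."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not claim.get(field)
    ]
    agent = claim.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        problems.append(f"unknown retrieval_agent: {agent}")
    timestamp = claim.get("retrieved_on")
    if timestamp:
        try:
            # Older Python versions reject a trailing "Z", so normalise it.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieved_on is not ISO 8601: {timestamp}")
    return problems


def validate_staff_entry(entry):
    """Check person_id, name/role claims, and every web claim of one entry."""
    problems = []
    if not entry.get("person_id"):
        problems.append("missing person_id")
    claims = entry.get("web_claims") or []
    claim_types = {claim.get("claim_type") for claim in claims}
    for needed in ("full_name", "role_title"):
        if needed not in claim_types:
            problems.append(f"no web claim for {needed}")
    for index, claim in enumerate(claims):
        problems.extend(
            f"web_claims[{index}]: {problem}" for problem in validate_claim(claim)
        )
    return problems
```

Checklist items that need judgment, such as confirming that no data was fabricated (Rule 21), are outside what a script like this can verify.
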