glam/.opencode/PERSON_DATA_PROVENANCE_RULE.md
2025-12-14 17:09:55 +01:00

Person Data Provenance Rule

Rule Summary

Rule 26 in AGENTS.md: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

Purpose

Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:

  1. Data Verification - Any claim about a person can be checked against its source
  2. Legal Compliance - GDPR and other privacy regulations require transparency about where personal data came from
  3. Update Management - Knowing sources enables systematic refresh cycles
  4. Credibility - Academic and institutional users need citation trails

What Requires Web Claims

Person Data Types Requiring Provenance

| Data Type | Source Types | Provenance Required |
|---|---|---|
| Full Name | LinkedIn, institutional pages | YES - always |
| Job Title/Role | LinkedIn, about pages, staff directories | YES - always |
| Department | Institutional pages | YES - always |
| Email | Contact pages, staff directories | YES - always |
| Phone | Contact pages | YES - always |
| Professional History | LinkedIn profiles | YES - always |
| Education | LinkedIn, CVs, institutional bios | YES - always |
| Specialization | Institutional bios, publications | YES - always |
| Start Date | News articles, announcements | RECOMMENDED |
| Photo | LinkedIn, institutional pages | RECOMMENDED |

Claim Types

Standard Person Claim Types

| Claim Type | Description | Example Value |
|---|---|---|
| full_name | Person's complete name | "Taco Dibbits" |
| role_title | Current job title | "General Director" |
| department | Department or division | "Curatorial Department" |
| email | Work email address | "t.dibbits@rijksmuseum.nl" |
| phone | Work phone number | "+31 20 674 7000" |
| start_date | When role began | "2020-01-15" |
| end_date | When role ended | "2024-12-31" |
| education | Degree and institution | "PhD Art History, University of Amsterdam" |
| specialization | Area of expertise | "17th Century Dutch Painting" |
| previous_employer | Prior organization | "Metropolitan Museum of Art" |
| biography | Brief bio text | "Dr. Dibbits has led..." |
| photo_url | Profile image URL | "https://media.licdn.com/..." |

Web Claim Structure

Basic Web Claim

web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0

Required Fields

| Field | Type | Required | Description |
|---|---|---|---|
| claim_type | string | YES | Type from claim types table |
| claim_value | string | YES | The extracted value |
| source_url | string | YES | URL where data was found |
| retrieved_on | datetime | YES | ISO 8601 timestamp |
| retrieval_agent | enum | YES | Tool used for extraction |
| xpath | string | RECOMMENDED | XPath to element |
| xpath_match_score | float | RECOMMENDED | 1.0 exact, <1.0 fuzzy |
| html_file | string | RECOMMENDED | Path to archived HTML |

Retrieval Agent Values

| Value | Description | Best For |
|---|---|---|
| firecrawl | FireCrawl MCP | Institutional pages |
| playwright | Playwright browser | JS-heavy sites |
| exa_crawling_exa | Exa crawl | LinkedIn profiles |
| exa_linkedin_search_exa | Exa LinkedIn search | Finding profiles |
| manual | Manual inspection | Last resort |
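
The required fields and retrieval agent values above lend themselves to a mechanical check. The following is a minimal sketch, not part of the rule itself: the function name validate_web_claim is hypothetical, and the 0.0 to 1.0 range for xpath_match_score is an assumption (the rule only says 1.0 is exact and lower values are fuzzy).

```python
from datetime import datetime

# Fields marked YES in the Required Fields table.
REQUIRED_FIELDS = ("claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent")
# Values from the Retrieval Agent Values table.
RETRIEVAL_AGENTS = {"firecrawl", "playwright", "exa_crawling_exa", "exa_linkedin_search_exa", "manual"}


def validate_web_claim(claim: dict) -> list[str]:
    """Return a list of problems found in one web claim (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not claim.get(field):
            problems.append(f"missing required field: {field}")
    agent = claim.get("retrieval_agent")
    if agent and agent not in RETRIEVAL_AGENTS:
        problems.append(f"unknown retrieval_agent: {agent}")
    retrieved_on = claim.get("retrieved_on")
    if retrieved_on:
        try:
            # fromisoformat() before Python 3.11 rejects a trailing "Z".
            datetime.fromisoformat(str(retrieved_on).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"retrieved_on is not ISO 8601: {retrieved_on}")
    score = claim.get("xpath_match_score")
    # Assumption: scores live in [0.0, 1.0].
    if score is not None and not 0.0 <= score <= 1.0:
        problems.append(f"xpath_match_score out of range: {score}")
    return problems
```

An empty return value means the claim carries every mandatory provenance field; the RECOMMENDED fields (xpath, html_file) are deliberately not enforced here.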

Complete Staff Entry Example

staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true
    
    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
        
      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
        
      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
    
    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
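
The person_id values in this document appear to follow the pattern {institution}_staff_{4-digit index}_{name}. That pattern is inferred from the two examples shown here, not stated by the rule, so the sketch below is only an illustration; slugify and make_person_id are hypothetical helper names.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase ASCII slug with underscores, e.g. 'Taco Dibbits' -> 'taco_dibbits'."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def make_person_id(institution: str, index: int, full_name: str) -> str:
    """Build an ID in the inferred {institution}_staff_{NNNN}_{name} shape."""
    return f"{slugify(institution)}_staff_{index:04d}_{slugify(full_name)}"


print(make_person_id("Rijksmuseum", 1, "Taco Dibbits"))
# rijksmuseum_staff_0001_taco_dibbits
```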

LinkedIn Integration

When LinkedIn Data is Available

Per Rule 12 (Person Data Reference Pattern), full LinkedIn profiles are stored separately:

# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json

Person Profile File Structure

Per Rule 20 (Person Entity Profiles), profile files are stored in data/custodian/person/entity/ and have the following structure:

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}

Staff Discovery Workflow

Step 1: Scrape Institutional Staff Pages

# Find team/staff pages
firecrawl_firecrawl_map(
  url="https://www.institution.org",
  search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
  url="https://www.institution.org/about/team",
  formats=["markdown"]
)

Step 2: Extract Names and Roles

For each person identified:

  1. Extract full name
  2. Extract job title
  3. Note XPath for each element
  4. Archive source HTML
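
When noting the XPath for an element, the claim also records an xpath_match_score (1.0 for an exact match, lower for a fuzzy one). This document does not define how fuzzy scores are computed, so the sketch below is an assumption: it treats a whitespace-normalized equal string as exact and otherwise falls back to the similarity ratio from Python's difflib. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences do not affect the match."""
    return " ".join(text.split())


def xpath_match_score(claim_value: str, element_text: str) -> float:
    """Score how well the text found at an XPath matches the claimed value."""
    claimed, found = normalize(claim_value), normalize(element_text)
    if claimed == found:
        return 1.0
    # Assumed fuzzy metric: difflib similarity ratio, always below 1.0 for unequal strings.
    return SequenceMatcher(None, claimed, found).ratio()
```

A claim whose element text reads "Dr. Taco Dibbits" while the recorded value is "Taco Dibbits" would score below 1.0 under this scheme, flagging it for a closer look.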

Step 3: Search LinkedIn for Profiles

# For each identified staff member
exa_linkedin_search_exa(
  query="Taco Dibbits Rijksmuseum",
  searchType="profiles",
  numResults=5
)

Step 4: Extract LinkedIn Profiles

# When profile URL is known
exa_crawling_exa(
  url="https://www.linkedin.com/in/taco-dibbits-12345",
  maxCharacters=10000
)

Step 5: Create Person Entity Files

Save extracted profile to data/custodian/person/entity/{slug}_{timestamp}.json
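
The file name can be derived from the LinkedIn URL and the retrieval time. The sketch below assumes, based on the example paths in this document, that {slug} is the last path segment of the LinkedIn URL and {timestamp} is the UTC retrieval time in compact form; linkedin_slug and profile_path are hypothetical helper names.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

ENTITY_DIR = "data/custodian/person/entity"


def linkedin_slug(linkedin_url: str) -> str:
    """Return the last path segment, e.g. 'taco-dibbits-12345'."""
    path = urlparse(linkedin_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def profile_path(linkedin_url: str, retrieved_on: str) -> str:
    """Build {ENTITY_DIR}/{slug}_{timestamp}.json from an ISO 8601 timestamp."""
    # Replace "Z" so fromisoformat() also works before Python 3.11.
    moment = datetime.fromisoformat(retrieved_on.replace("Z", "+00:00"))
    # Timestamps without an offset are interpreted as local time by astimezone().
    stamp = moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ENTITY_DIR}/{linkedin_slug(linkedin_url)}_{stamp}.json"


print(profile_path("https://www.linkedin.com/in/taco-dibbits-12345", "2025-01-15T10:30:00Z"))
# data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
```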

Step 6: Update Custodian YAML

Add staff entries with:

  • Basic info (name, role)
  • Web claims with provenance
  • LinkedIn profile reference (if available)

Provenance Sources Priority

When multiple sources provide the same information:

| Priority | Source | Reliability |
|---|---|---|
| 1 | Official institutional website | Highest |
| 2 | LinkedIn profile | High |
| 3 | News articles/press releases | Medium-High |
| 4 | Conference programs | Medium |
| 5 | Academic publications | Medium |
| 6 | Third-party databases | Lower |

When sources conflict, document both with provenance and note the discrepancy.
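
One way to apply the priority table is to sort claims by source rank while keeping all of them. The sketch below is illustrative only: the source_kind key and its labels are hypothetical and are not part of the web claim schema described in this document, so a real pipeline would need its own way of classifying each source_url.

```python
# Ranks from the priority table; the keys are hypothetical labels.
SOURCE_PRIORITY = {
    "institutional_website": 1,
    "linkedin_profile": 2,
    "news_or_press_release": 3,
    "conference_program": 4,
    "academic_publication": 5,
    "third_party_database": 6,
}
UNKNOWN_RANK = len(SOURCE_PRIORITY) + 1


def rank_claims(claims: list[dict]) -> list[dict]:
    """Order claims from most to least reliable source without dropping any."""
    return sorted(claims, key=lambda c: SOURCE_PRIORITY.get(c.get("source_kind"), UNKNOWN_RANK))


claims = [
    {"claim_type": "role_title", "claim_value": "Chief Curator", "source_kind": "linkedin_profile"},
    {"claim_type": "role_title", "claim_value": "Senior Curator", "source_kind": "institutional_website"},
]
print(rank_claims(claims)[0]["claim_value"])
# Senior Curator
```

Sorting rather than filtering keeps lower-priority claims on record, in line with the requirement to document both sides of a discrepancy.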

Validation Checklist

Before marking staff data complete, verify:

  • Every staff member has person_id
  • Full name has web claim with source_url
  • Role/title has web claim with source_url
  • All claims have retrieved_on timestamp
  • All claims have retrieval_agent specified
  • LinkedIn profiles stored in person/entity/ (not inline)
  • XPath included where HTML was scraped
  • No fabricated data (per Rule 21)

Related Rules

  • Rule 6: WebObservation claims MUST have XPath provenance
  • Rule 12: Person data reference pattern (file paths, not inline)
  • Rule 14: Exa MCP LinkedIn profile extraction
  • Rule 16: LinkedIn photo URLs (CDN, not overlay page)
  • Rule 17: LinkedIn connection unique identifiers
  • Rule 19: HTML-only LinkedIn extraction
  • Rule 20: Person entity profiles stored individually
  • Rule 21: Data fabrication strictly prohibited
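
Several of the validation checklist items (person_id present, sourced name and role claims, timestamps and agents on every claim, no inline LinkedIn profile) can be checked mechanically. The following is a rough sketch under stated assumptions: check_staff_entry is a hypothetical name, and an inline profile is detected by the presence of a profile_data key, which is a guess based on the profile file structure shown earlier.

```python
def check_staff_entry(entry: dict) -> list[str]:
    """Return checklist violations for one staff entry (empty if none found)."""
    problems = []
    if not entry.get("person_id"):
        problems.append("missing person_id")
    claims = entry.get("web_claims") or []
    # Full name and role/title each need at least one claim with a source_url.
    for required_type in ("full_name", "role_title"):
        if not any(c.get("claim_type") == required_type and c.get("source_url") for c in claims):
            problems.append(f"no {required_type} claim with source_url")
    # Every claim needs a retrieval timestamp and agent.
    for i, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                problems.append(f"web_claims[{i}] missing {field}")
    # Assumption: an inline LinkedIn profile shows up as a profile_data key.
    if "profile_data" in entry:
        problems.append("LinkedIn profile stored inline; reference a file in person/entity/ instead")
    return problems
```

Items such as "no fabricated data" cannot be verified this way and still need human review.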

Tools Reference

For Institutional Websites

| Tool | MCP Name | Use Case |
|---|---|---|
| FireCrawl Scrape | firecrawl_firecrawl_scrape | Staff pages |
| Playwright Snapshot | playwright_browser_snapshot | JS-heavy sites |

For LinkedIn

| Tool | MCP Name | Use Case |
|---|---|---|
| Exa Crawl | exa_crawling_exa | Profile extraction (URL known) |
| Exa LinkedIn Search | exa_linkedin_search_exa | Find profiles |
| Exa Web Search | exa_web_search_exa | Fallback search |

Error Handling

Missing XPath

If XPath cannot be determined, set xpath to null and explain why in notes:

web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"

Conflicting Sources

Document both claims, each with its own provenance, and record the discrepancy in notes:

web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    
  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"
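
Discrepancies like the one above can be surfaced automatically before a reviewer adds the note. This is a minimal sketch, assuming claims are plain dictionaries as in the YAML examples; find_conflicts is a hypothetical helper name.

```python
from collections import defaultdict


def find_conflicts(web_claims: list[dict]) -> dict[str, list[str]]:
    """Map each claim_type to its distinct values when more than one is recorded."""
    values_by_type = defaultdict(list)
    for claim in web_claims:
        values = values_by_type[claim["claim_type"]]
        if claim["claim_value"] not in values:
            values.append(claim["claim_value"])
    return {ctype: values for ctype, values in values_by_type.items() if len(values) > 1}


claims = [
    {"claim_type": "role_title", "claim_value": "Senior Curator", "source_url": "https://institution.org/team"},
    {"claim_type": "role_title", "claim_value": "Chief Curator", "source_url": "https://linkedin.com/in/example"},
    {"claim_type": "full_name", "claim_value": "Example Person", "source_url": "https://institution.org/team"},
]
print(find_conflicts(claims))
# {'role_title': ['Senior Curator', 'Chief Curator']}
```

Claim types that can legitimately hold several values, such as education or previous_employer, would need to be excluded before treating the result as a list of conflicts.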

Version History

| Date | Change |
|---|---|
| 2025-01-15 | Initial rule creation |

See Also

  • AGENTS.md Rule 26
  • schemas/20251121/linkml/modules/classes/PersonObservation.yaml
  • schemas/20251121/linkml/modules/classes/StaffRole.yaml
  • data/custodian/person/entity/ (profile storage)