kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00

Person Data Provenance Rule

Rule Summary

Rule 26 in AGENTS.md: All person/staff data associated with heritage custodians MUST have web claim provenance. Staff information without verifiable sources is unacceptable.

Purpose

Person data provenance ensures that all information about individuals associated with heritage institutions can be traced back to authoritative sources. This is critical for:

Data Verification - Any claim about a person can be verified against its source
Legal Compliance - GDPR and other privacy regulations require transparency about data sources
Update Management - Knowing sources enables systematic refresh cycles
Credibility - Academic and institutional users need citation trails

What Requires Web Claims

Person Data Types Requiring Provenance

Data Type	Source Types	Provenance Required
Full Name	LinkedIn, institutional pages	YES - always
Job Title/Role	LinkedIn, about pages, staff directories	YES - always
Department	Institutional pages	YES - always
Email	Contact pages, staff directories	YES - always
Phone	Contact pages	YES - always
Professional History	LinkedIn profiles	YES - always
Education	LinkedIn, CVs, institutional bios	YES - always
Specialization	Institutional bios, publications	YES - always
Start Date	News articles, announcements	RECOMMENDED
Photo	LinkedIn, institutional pages	RECOMMENDED

Claim Types

Standard Person Claim Types

Claim Type	Description	Example Value
`full_name`	Person's complete name	"Taco Dibbits"
`role_title`	Current job title	"General Director"
`department`	Department or division	"Curatorial Department"
`email`	Work email address	"t.dibbits@rijksmuseum.nl"
`phone`	Work phone number	"+31 20 674 7000"
`start_date`	When role began	"2020-01-15"
`end_date`	When role ended	"2024-12-31"
`education`	Degree and institution	"PhD Art History, University of Amsterdam"
`specialization`	Area of expertise	"17th Century Dutch Painting"
`previous_employer`	Prior organization	"Metropolitan Museum of Art"
`biography`	Brief bio text	"Dr. Dibbits has led..."
`photo_url`	Profile image URL	"https://media.licdn.com/..."

Web Claim Structure

Basic Web Claim

web_claims:
  - claim_type: full_name
    claim_value: Taco Dibbits
    source_url: https://www.rijksmuseum.nl/en/about-us/organisation
    xpath: /html/body/main/section[2]/div[1]/h2
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    xpath_match_score: 1.0

Required Fields

Field	Type	Required	Description
`claim_type`	string	YES	Type from claim types table
`claim_value`	string	YES	The extracted value
`source_url`	string	YES	URL where data was found
`retrieved_on`	datetime	YES	ISO 8601 timestamp
`retrieval_agent`	enum	YES	Tool used for extraction
`xpath`	string	RECOMMENDED	XPath to element
`xpath_match_score`	float	RECOMMENDED	1.0 for an exact match, below 1.0 for a fuzzy match
`html_file`	string	RECOMMENDED	Path to archived HTML
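The field table above can be checked mechanically. The following is a minimal sketch, assuming a claim has already been loaded into a plain Python dict; the function name `check_web_claim` and the shape of its return value are illustrative and not part of the rule.

```python
from datetime import datetime

# Field names are taken from the Required Fields table.
REQUIRED_FIELDS = ("claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent")
RECOMMENDED_FIELDS = ("xpath", "xpath_match_score", "html_file")


def check_web_claim(claim: dict) -> dict:
    """Report missing fields and whether retrieved_on parses as ISO 8601 (hypothetical helper)."""
    missing_required = [f for f in REQUIRED_FIELDS if claim.get(f) in (None, "")]
    missing_recommended = [f for f in RECOMMENDED_FIELDS if claim.get(f) in (None, "")]

    timestamp_ok = False
    raw = claim.get("retrieved_on")
    if isinstance(raw, str):
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalise it first.
            datetime.fromisoformat(raw.replace("Z", "+00:00"))
            timestamp_ok = True
        except ValueError:
            timestamp_ok = False

    return {
        "missing_required": missing_required,
        "missing_recommended": missing_recommended,
        "timestamp_ok": timestamp_ok,
    }
```

A claim passes when `missing_required` is empty and `timestamp_ok` is true; missing recommended fields are reported separately so they can be treated as warnings rather than failures.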

Retrieval Agent Values

Value	Description	Best For
`firecrawl`	FireCrawl MCP	Institutional pages
`playwright`	Playwright browser	JS-heavy sites
`exa_crawling_exa`	Exa crawl	LinkedIn profiles
`exa_linkedin_search_exa`	Exa LinkedIn search	Finding profiles
`manual`	Manual inspection	Last resort
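Because `retrieval_agent` is an enum, unknown values can be rejected at load time. This is one possible sketch: the member values come from the table above, while the class name `RetrievalAgent` and the helper `parse_retrieval_agent` are assumptions made for illustration.

```python
from enum import Enum


class RetrievalAgent(str, Enum):
    """Allowed retrieval_agent values, copied from the table above."""

    FIRECRAWL = "firecrawl"
    PLAYWRIGHT = "playwright"
    EXA_CRAWLING = "exa_crawling_exa"
    EXA_LINKEDIN_SEARCH = "exa_linkedin_search_exa"
    MANUAL = "manual"


def parse_retrieval_agent(value: str) -> RetrievalAgent:
    """Convert a raw string to a RetrievalAgent, with a readable error for unknown values."""
    try:
        return RetrievalAgent(value)
    except ValueError:
        allowed = ", ".join(agent.value for agent in RetrievalAgent)
        raise ValueError(f"unknown retrieval_agent {value!r}; expected one of: {allowed}") from None
```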

Complete Staff Entry Example

staff:
  - person_id: "rijksmuseum_staff_0001_taco_dibbits"
    name: Taco Dibbits
    role: General Director
    department: Executive Management
    current: true
    
    # Web claims with full provenance
    web_claims:
      - claim_type: full_name
        claim_value: Taco Dibbits
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/h2
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
        
      - claim_type: role_title
        claim_value: General Director
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[1]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
        
      - claim_type: biography
        claim_value: "Taco Dibbits has been General Director of the Rijksmuseum since 2016..."
        source_url: https://www.rijksmuseum.nl/en/about-us/organisation
        xpath: /html/body/main/section[2]/div[1]/p[2]
        retrieved_on: "2025-01-15T10:30:00Z"
        retrieval_agent: firecrawl
        xpath_match_score: 1.0
    
    # LinkedIn profile (separate file reference per Rule 12)
    linkedin_claim:
      linkedin_url: https://www.linkedin.com/in/taco-dibbits-12345
      profile_data_path: data/custodian/person/entity/taco-dibbits-12345_20250115T103000Z.json
      retrieved_on: "2025-01-15T10:30:00Z"
      retrieval_agent: exa_crawling_exa
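The `person_id` values in this document appear to follow the pattern `{custodian}_staff_{4-digit index}_{name}`. The rule does not state this pattern explicitly, so the sketch below is inferred from the examples only; `slugify_name` and `make_person_id` are hypothetical helper names.

```python
import re
import unicodedata


def slugify_name(name: str) -> str:
    """Lowercase a name, strip diacritics, and join the words with underscores."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def make_person_id(custodian_slug: str, index: int, full_name: str) -> str:
    """Build an identifier shaped like the examples (assumed pattern, not confirmed by the rule)."""
    if index < 1:
        raise ValueError("index must be a positive integer")
    return f"{custodian_slug}_staff_{index:04d}_{slugify_name(full_name)}"
```

For example, `make_person_id("rijksmuseum", 1, "Taco Dibbits")` reproduces the identifier used in the staff entry above.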

LinkedIn Integration

When LinkedIn Data is Available

Per Rule 12 (Person Data Reference Pattern), full LinkedIn profiles are stored in separate files, and the custodian YAML holds only a reference:

# In custodian YAML - reference only
staff:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    person_profile_path: data/custodian/person/entity/alexandr-belov-bb547b46_20251210T120000Z.json

Person Profile File Structure

Per Rule 20 (Person Entity Profiles), profile files are stored in data/custodian/person/entity/ and have the following structure:

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/rijksmuseum_staff_20250115T103000Z.json",
    "staff_id": "rijksmuseum_staff_0042_alexandr_belov",
    "extraction_date": "2025-01-15T10:30:00Z",
    "extraction_method": "exa_crawling_exa",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Alexandr Belov",
    "headline": "Collection Information Specialist at Rijksmuseum",
    "location": "Amsterdam, Netherlands",
    "about": "...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
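A profile file with the structure above can be sanity-checked after it is written. The sketch below assumes the two top-level sections and the metadata keys shown in the example are all expected; Rule 20 may define the required set differently, and `profile_problems` is a hypothetical helper name.

```python
import json

# Keys shown in the example profile; treating all of them as expected is an assumption.
EXPECTED_METADATA_KEYS = {
    "source_file",
    "staff_id",
    "extraction_date",
    "extraction_method",
    "extraction_agent",
    "linkedin_url",
}


def profile_problems(profile_json: str) -> list:
    """Return a list of structural problems found in a person profile JSON document."""
    profile = json.loads(profile_json)
    problems = []

    metadata = profile.get("extraction_metadata")
    if not isinstance(metadata, dict):
        problems.append("missing section: extraction_metadata")
    else:
        for key in sorted(EXPECTED_METADATA_KEYS - set(metadata)):
            problems.append(f"extraction_metadata lacks {key}")

    data = profile.get("profile_data")
    if not isinstance(data, dict):
        problems.append("missing section: profile_data")
    elif not data.get("name"):
        problems.append("profile_data lacks name")

    return problems
```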

Staff Discovery Workflow

Step 1: Scrape Institutional Staff Pages

# Find team/staff pages
firecrawl_firecrawl_map(
  url="https://www.institution.org",
  search="staff OR team OR about OR organization"
)

# Scrape identified pages
firecrawl_firecrawl_scrape(
  url="https://www.institution.org/about/team",
  formats=["markdown"]
)

Step 2: Extract Names and Roles

For each person identified:

Extract full name
Extract job title
Note XPath for each element
Archive source HTML

Step 3: Search LinkedIn for Profiles

# For each identified staff member
exa_linkedin_search_exa(
  query="Taco Dibbits Rijksmuseum",
  searchType="profiles",
  numResults=5
)

Step 4: Extract LinkedIn Profiles

# When profile URL is known
exa_crawling_exa(
  url="https://www.linkedin.com/in/taco-dibbits-12345",
  maxCharacters=10000
)

Step 5: Create Person Entity Files

Save each extracted profile to data/custodian/person/entity/{slug}_{timestamp}.json
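The file name can be derived from the LinkedIn URL and the retrieval time. The sketch below assumes that `{slug}` is the path segment after `/in/` and that `{timestamp}` is UTC in the compact form `YYYYMMDDTHHMMSSZ`; both are inferred from the example paths in this document. The function names are illustrative.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

ENTITY_DIR = "data/custodian/person/entity"


def linkedin_slug(linkedin_url: str) -> str:
    """Return the profile slug, i.e. the path segment after /in/ (assumed convention)."""
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {linkedin_url!r}")
    return parts[1]


def profile_path(linkedin_url: str, retrieved_at: datetime) -> str:
    """Build the entity file path from a profile URL and a timezone-aware retrieval time."""
    if retrieved_at.tzinfo is None:
        raise ValueError("retrieved_at must be timezone-aware")
    stamp = retrieved_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ENTITY_DIR}/{linkedin_slug(linkedin_url)}_{stamp}.json"
```

Requiring a timezone-aware datetime avoids silently interpreting a naive value as local time, which would shift the timestamp in the file name.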

Step 6: Update Custodian YAML

Add staff entries with:

Basic info (name, role)
Web claims with provenance
LinkedIn profile reference (if available)

Provenance Sources Priority

When multiple sources provide the same information:

Priority	Source	Reliability
1	Official institutional website	Highest
2	LinkedIn profile	High
3	News articles/press releases	Medium-High
4	Conference programs	Medium
5	Academic publications	Medium
6	Third-party databases	Lower

When sources conflict, document both with provenance and note the discrepancy.
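The priority table and the conflict rule can be applied in code. This is a minimal sketch under one loud assumption: each claim carries a `source_category` key naming its row in the table. That key is hypothetical and not part of the web claim structure defined above, and the category labels are invented for illustration.

```python
from collections import defaultdict

# Priorities follow the table above; the category labels themselves are hypothetical.
SOURCE_PRIORITY = {
    "institutional_website": 1,
    "linkedin": 2,
    "news": 3,
    "conference_program": 4,
    "academic_publication": 5,
    "third_party_database": 6,
}


def rank_claims(claims: list) -> list:
    """Sort claims so the most reliable source comes first; unknown categories go last."""
    lowest = len(SOURCE_PRIORITY) + 1
    return sorted(claims, key=lambda c: SOURCE_PRIORITY.get(c.get("source_category"), lowest))


def find_conflicts(claims: list) -> dict:
    """Map each claim_type to its distinct values when more than one value was observed."""
    values = defaultdict(set)
    for claim in claims:
        values[claim["claim_type"]].add(claim["claim_value"])
    return {ctype: sorted(vals) for ctype, vals in values.items() if len(vals) > 1}
```

Conflicting claims are reported rather than dropped, so both can be kept with their provenance as the rule requires. Claim types that can legitimately hold several values, such as `previous_employer`, would need to be excluded from the conflict check.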

Validation Checklist

Before marking staff data complete, verify:

Every staff member has person_id
Full name has web claim with source_url
Role/title has web claim with source_url
All claims have retrieved_on timestamp
All claims have retrieval_agent specified
LinkedIn profiles stored in person/entity/ (not inline)
XPath included where HTML was scraped
No fabricated data (per Rule 21)
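Most of the checklist can be automated. The sketch below assumes a staff entry has been loaded into a dict with the keys used in the examples in this document (`person_id`, `web_claims`, `linkedin_claim.profile_data_path`, `person_profile_path`); the function name `checklist_failures` is illustrative. The XPath and fabrication items are left out because they depend on how the data was collected and need review against the source.

```python
ENTITY_PREFIX = "data/custodian/person/entity/"


def checklist_failures(entry: dict) -> list:
    """Return human-readable failures for the automatable checklist items."""
    failures = []

    if not entry.get("person_id"):
        failures.append("missing person_id")

    claims = entry.get("web_claims") or []
    for claim_type in ("full_name", "role_title"):
        sourced = any(
            c.get("claim_type") == claim_type and c.get("source_url") for c in claims
        )
        if not sourced:
            failures.append(f"no {claim_type} claim with source_url")

    for index, claim in enumerate(claims):
        for field in ("retrieved_on", "retrieval_agent"):
            if not claim.get(field):
                failures.append(f"web_claims[{index}] missing {field}")

    # Both reference styles used in this document are checked.
    paths = [
        entry.get("person_profile_path"),
        (entry.get("linkedin_claim") or {}).get("profile_data_path"),
    ]
    for path in paths:
        if path and not path.startswith(ENTITY_PREFIX):
            failures.append(f"profile stored outside {ENTITY_PREFIX}: {path}")

    return failures
```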

Related Rules

Rule 6: WebObservation claims MUST have XPath provenance
Rule 12: Person data reference pattern (file paths, not inline)
Rule 14: Exa MCP LinkedIn profile extraction
Rule 16: LinkedIn photo URLs (CDN, not overlay page)
Rule 17: LinkedIn connection unique identifiers
Rule 19: HTML-only LinkedIn extraction
Rule 20: Person entity profiles stored individually
Rule 21: Data fabrication strictly prohibited

Tools Reference

For Institutional Websites

Tool	MCP Name	Use Case
FireCrawl Scrape	`firecrawl_firecrawl_scrape`	Staff pages
Playwright Snapshot	`playwright_browser_snapshot`	JS-heavy sites

For LinkedIn

Tool	MCP Name	Use Case
Exa Crawl	`exa_crawling_exa`	Profile extraction (URL known)
Exa LinkedIn Search	`exa_linkedin_search_exa`	Find profiles
Exa Web Search	`exa_web_search_exa`	Fallback search

Error Handling

Missing XPath

If the XPath cannot be determined, set it to null and explain why in notes:

web_claims:
  - claim_type: full_name
    claim_value: Example Person
    source_url: https://example.org/team
    xpath: null  # Could not determine - page uses dynamic rendering
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: playwright
    notes: "Extracted from Playwright accessibility snapshot; XPath not available for dynamically rendered content"

Conflicting Sources

Document both claims:

web_claims:
  - claim_type: role_title
    claim_value: Senior Curator
    source_url: https://institution.org/team
    retrieved_on: "2025-01-15T10:30:00Z"
    retrieval_agent: firecrawl
    
  - claim_type: role_title
    claim_value: Chief Curator
    source_url: https://linkedin.com/in/example
    retrieved_on: "2025-01-15T11:00:00Z"
    retrieval_agent: exa_crawling_exa
    notes: "Title differs from institutional website - may be outdated"

Version History

Date	Change
2025-01-15	Initial rule creation
