kempersc 1b1cfbfca0 enrich custodians

2025-12-11 22:32:09 +01:00


# Exa MCP LinkedIn Profile Extraction Rules

## Overview

This document specifies the rules for extracting LinkedIn profile data using the Exa MCP server tools. LinkedIn profiles are a critical data source for enriching heritage custodian staff records.

## Available Exa Tools for LinkedIn

| Tool | Purpose | When to Use |
|------|---------|-------------|
| `exa_linkedin_search_exa` | Search for LinkedIn profiles/companies | Finding profiles when URL is unknown |
| `exa_crawling_exa` | Extract content from specific URL | **PREFERRED** when profile URL is known |
| `exa_web_search_exa` | General web search | Fallback for finding LinkedIn URLs |

## Recommended Workflow

### When Profile URL is Known (PREFERRED)

Use `exa_crawling_exa` directly with the LinkedIn profile URL:

```yaml
Tool: exa_crawling_exa
Parameters:
  url: "https://www.linkedin.com/in/{profile-slug}"
  maxCharacters: 10000  # Recommended for comprehensive extraction
```

**Advantages:**

- Returns structured markdown with full profile content
- Includes complete career history with dates and durations
- Captures education, skills, languages, and about section
- Returns profile image URL
- Includes company metadata (size, founding year, industry)
- Low cost (~$0.001 per request)

### When Profile URL is Unknown

**First:** Try `exa_linkedin_search_exa`:

```yaml
Tool: exa_linkedin_search_exa
Parameters:
  query: "{person name} {organization name}"
  searchType: "profiles"
  numResults: 5
```

**If the search fails:** Use `exa_web_search_exa` with a site restriction:

```yaml
Tool: exa_web_search_exa
Parameters:
  query: "site:linkedin.com/in/ {person name} {organization}"
  numResults: 5
```

**Then:** Use `exa_crawling_exa` with the discovered URL for full extraction.
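The workflow above can be sketched as a small planning function that returns the tool calls to attempt, in order. This is only an illustration: `plan_tool_calls` and the `{"tool": ..., "params": ...}` dictionary shape are hypothetical, and how an MCP client actually dispatches these calls is outside the scope of this sketch. The tool names and parameters are the ones listed above.

```python
from typing import Optional


def plan_tool_calls(person: str, organization: str,
                    profile_url: Optional[str] = None) -> list[dict]:
    """Return the ordered Exa tool calls to attempt for one person (hypothetical helper)."""
    if profile_url:
        # Preferred path: the URL is known, so crawl it directly.
        return [{
            "tool": "exa_crawling_exa",
            "params": {"url": profile_url, "maxCharacters": 10000},
        }]
    # URL unknown: LinkedIn search first, site-restricted web search as fallback.
    # Whichever call yields a profile URL is then followed by a crawl of that URL.
    return [
        {
            "tool": "exa_linkedin_search_exa",
            "params": {
                "query": f"{person} {organization}",
                "searchType": "profiles",
                "numResults": 5,
            },
        },
        {
            "tool": "exa_web_search_exa",
            "params": {
                "query": f"site:linkedin.com/in/ {person} {organization}",
                "numResults": 5,
            },
        },
    ]
```

Once a search step yields a URL, calling `plan_tool_calls` again with `profile_url` set gives the final crawl step.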

## Output Format

Person profile JSON files are stored in `data/custodian/person/entity/`.

### File Naming Convention

`{linkedin-slug}_{ISO-timestamp}.json`

Example: `alexandr-belov-bb547b46_20251210T120000Z.json`
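As a minimal sketch of the naming convention, the filename can be derived from the profile URL and the extraction time. The helper name `profile_filename` is hypothetical, and the compact UTC timestamp format (`YYYYMMDDTHHMMSSZ`) is inferred from the example above rather than stated as a rule.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def profile_filename(linkedin_url: str, extracted_at: datetime) -> str:
    """Build '{linkedin-slug}_{ISO-timestamp}.json' (hypothetical helper)."""
    # The slug is the path segment that follows '/in/'.
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {linkedin_url}")
    if extracted_at.tzinfo is None:
        raise ValueError("extracted_at must be timezone-aware")
    # Assumed format: compact ISO 8601 in UTC, matching the documented example.
    stamp = extracted_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{parts[1]}_{stamp}.json"


name = profile_filename(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
)
print(name)  # alexandr-belov-bb547b46_20251210T120000Z.json
```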

### Required JSON Structure

🚨 **CRITICAL:** ALL profiles MUST use structured JSON format. Raw content dumps are NOT acceptable.

See `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md` for comprehensive format requirements.

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  },
  "profile_data": {
    "name": "<full name>",
    "linkedin_url": "<full URL>",
    "headline": "<LinkedIn headline>",
    "location": "<city, region, country>",
    "connections": "<500 connections • 2,000 followers>",
    "about": "<full about section text>",
    "experience": [
      {
        "title": "<job title>",
        "company": "<company name>",
        "duration": "<Mon YYYY - Mon YYYY • X years Y months>",
        "location": "<city, country>",
        "company_details": "<Company: size • Founded year • Type • Industry>",
        "department": "<department • Level: level>",
        "description": "<role description if provided>"
      }
    ],
    "education": [
      {
        "degree": "<degree type and field>",
        "institution": "<school name>",
        "duration": "<start - end • X years>"
      }
    ],
    "skills": ["<skill1>", "<skill2>", ...],
    "languages": [
      {"language": "<name>", "proficiency": "<level>"}
    ],
    "profile_image_url": "<LinkedIn CDN image URL>"
  }
}
```

### extraction_agent Requirement

🚨 **CRITICAL:** The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.

This keeps provenance tracking consistent and gives every profile the same audit trail.
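A profile file can be checked against the structure and the `extraction_agent` rule before it is written. The sketch below is an assumption-laden illustration: `validate_profile` is a hypothetical helper, and treating every top-level key of the template above as mandatory is an interpretation, not something the format rule file is quoted as saying.

```python
EXPECTED_AGENT = "claude-opus-4.5"

# Keys taken from the template above; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "extraction_metadata": {
        "source_file", "staff_id", "extraction_date", "extraction_method",
        "extraction_agent", "linkedin_url", "cost_usd", "request_id",
    },
    "profile_data": {
        "name", "linkedin_url", "headline", "location", "connections", "about",
        "experience", "education", "skills", "languages", "profile_image_url",
    },
}


def validate_profile(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    problems = []
    for section, required in REQUIRED_KEYS.items():
        block = doc.get(section)
        if not isinstance(block, dict):
            problems.append(f"{section}: missing or not an object")
            continue
        for key in sorted(required - set(block)):
            problems.append(f"{section}.{key}: missing")
    meta = doc.get("extraction_metadata")
    agent = meta.get("extraction_agent") if isinstance(meta, dict) else None
    if agent != EXPECTED_AGENT:
        problems.append(f"extraction_metadata.extraction_agent: must be {EXPECTED_AGENT!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file in a single pass.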


## Heritage-Relevant Experience Tagging

When extracting profiles for heritage custodian staff, identify and tag positions at:

- **Museums**: Collections, curatorial, conservation roles
- **Libraries**: Cataloging, acquisition, research assistance
- **Archives**: Archivists, digitization, records management
- **Research institutions**: Academic libraries, scholarly heritage
- **National/government cultural bodies**: National libraries, archives, cultural ministries
- **Heritage NGOs**: Preservation societies, cultural foundations
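One simple way to apply these categories is keyword matching on the job title and company name. The sketch below is deliberately naive: the function name `tag_heritage`, the category labels, and the keyword lists are all hypothetical, and real tagging would likely need a curated vocabulary and manual review.

```python
# Hypothetical keyword lists; extend or replace with a curated vocabulary.
HERITAGE_KEYWORDS = {
    "museum": ("museum", "curator", "conservation"),
    "library": ("library", "librarian", "cataloging"),
    "archive": ("archive", "archivist", "records management", "digitization"),
    "research_institution": ("university", "research institute"),
    "government_cultural_body": ("national library", "national archive", "ministry of culture"),
    "heritage_ngo": ("preservation society", "heritage foundation", "cultural foundation"),
}


def tag_heritage(position: dict) -> list[str]:
    """Return the heritage categories whose keywords appear in the title or company."""
    text = f"{position.get('title', '')} {position.get('company', '')}".lower()
    return [
        category
        for category, keywords in HERITAGE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


tags = tag_heritage({"title": "Film Archivist", "company": "Eye Filmmuseum"})
print(tags)  # ['museum', 'archive']
```

Substring matching is crude (it will match "museum" inside "Filmmuseum", which is useful here, but it can also produce false positives), so the tags are best treated as suggestions.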

## Data Quality Rules

1. **Preserve raw Exa response metadata** - Always include `exa_request_id`, `exa_cost_dollars`, and `exa_source`

2. **Use ISO country codes** - Convert location strings to ISO 3166-1 alpha-2 codes

3. **Normalize proficiency levels** - Use LinkedIn's standard terms:
   - Native or bilingual proficiency
   - Full professional proficiency
   - Professional working proficiency
   - Limited working proficiency
   - Elementary proficiency

4. **Calculate current status** - Parse dates to set `current: true` for ongoing positions

5. **Extract structured skills** - Create `skills_structured` with professional terminology even when the LinkedIn `skills` list uses informal terms
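Rules 3 and 4 lend themselves to small helpers. The following is a sketch under stated assumptions: the function names are hypothetical, unknown proficiency strings are returned as `None` rather than guessed, and an ongoing position is assumed to end in the word "Present" in the `duration` string.

```python
import re
from typing import Optional

LINKEDIN_PROFICIENCY = (
    "Native or bilingual proficiency",
    "Full professional proficiency",
    "Professional working proficiency",
    "Limited working proficiency",
    "Elementary proficiency",
)


def normalize_proficiency(raw: str) -> Optional[str]:
    """Map a case- or whitespace-variant string to LinkedIn's standard term, else None."""
    cleaned = " ".join(raw.split()).lower()
    for level in LINKEDIN_PROFICIENCY:
        if cleaned == level.lower():
            return level
    return None


def is_current(duration: str) -> bool:
    """True when the date range in 'Mon YYYY - Mon YYYY • X years' ends in 'Present'."""
    date_range = duration.split("•")[0]
    # Accept either a hyphen or an en dash between start and end.
    end = re.split(r"\s[-\u2013]\s", date_range.strip())[-1]
    return end.strip().lower() == "present"


print(normalize_proficiency("full professional  PROFICIENCY"))  # Full professional proficiency
print(is_current("Jan 2020 - Present • 5 years 11 months"))     # True
print(is_current("Jan 2018 - Dec 2019 • 2 years"))              # False
```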

## Custodian File Reference Pattern

When a person profile is extracted, update the custodian file to reference it:

```yaml
collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  current: true
  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
```

See `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` for the complete reference pattern rules.

## Rate Limits and Costs

- Exa crawling: ~$0.001 per URL
- Exa LinkedIn search: ~$0.003 per search
- Cached responses: faster, but billed at the same rate
- Rate limits: respect Exa's rate limits (typically generous for authenticated users)
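For budgeting a batch, the approximate prices above give a quick estimate. This sketch assumes one crawl per profile and exactly one LinkedIn search per profile whose URL is unknown; fallback web searches are not priced in this document, so they are left out. The function name is hypothetical.

```python
CRAWL_COST_USD = 0.001            # approximate, per URL
LINKEDIN_SEARCH_COST_USD = 0.003  # approximate, per search


def estimate_cost(known_urls: int, unknown_urls: int) -> float:
    """Rough batch cost: every profile is crawled once; unknown URLs need one search first."""
    crawls = known_urls + unknown_urls
    total = crawls * CRAWL_COST_USD + unknown_urls * LINKEDIN_SEARCH_COST_USD
    return round(total, 6)


print(estimate_cost(known_urls=100, unknown_urls=0))  # 0.1
print(estimate_cost(known_urls=0, unknown_urls=50))   # 0.2
```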

## Troubleshooting

### LinkedIn Search Returns Wrong Profiles

The `exa_linkedin_search_exa` tool may return:

- Posts by other people mentioning the target person
- Profiles with similar names
- Company page posts instead of personal profiles

**Solution:** Use `exa_crawling_exa` with the direct profile URL when available.

### Connection Lists Cannot Be Accessed

Exa CANNOT access LinkedIn connection lists. Connection data requires authenticated LinkedIn access.

**Workaround:** If the user provides connection list data (collected by browsing manually), create a separate connections file:

`data/custodian/person/{linkedin-slug}_connections_{timestamp}.json`

Structure the connections file with:

- `source_metadata` - URL, timestamp, scrape method
- `connections[]` - Array of connection objects with name, headline, organization
- `network_analysis` - Aggregated statistics and heritage-relevant counts
- `heritage_network_insights` - Cluster analysis of heritage connections

Reference it from the main profile:

`"linkedin_connections_file": "data/custodian/person/{slug}_connections_{timestamp}.json"`

### Profile Content is Truncated

Increase the `maxCharacters` parameter (the default is often 3000):

```yaml
exa_crawling_exa:
  url: "https://www.linkedin.com/in/..."
  maxCharacters: 10000  # or higher for very extensive profiles
```

### Profile Returns 403/Blocked

LinkedIn may block some requests. Exa typically handles this through caching and rotation, but if the block persists:

- Try again after some time
- Use an alternative search to verify that the profile exists
- Document in `raw_exa_response.note` that the profile was inaccessible
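The retry-then-document steps above could look like the sketch below. Everything here is illustrative: `crawl_with_retry` is a hypothetical helper, `fetch` stands in for whatever performs the actual `exa_crawling_exa` call and is assumed to return `None` when the profile is blocked, and the attempt count and delays are arbitrary placeholders.

```python
import time
from typing import Callable, Optional


def crawl_with_retry(fetch: Callable[[], Optional[dict]],
                     attempts: int = 3,
                     base_delay: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() up to `attempts` times, waiting longer between each try."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    # Still blocked: record the failure where the rules above say to document it.
    return {
        "raw_exa_response": {
            "note": f"profile inaccessible after {attempts} attempts",
        }
    }
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.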

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.1.0 | 2025-12-10 | Added connection list handling (manual scrape required) |
| 1.0.0 | 2025-12-10 | Initial documentation based on Eye Filmmuseum staff extraction |

- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person file reference pattern
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation requirements
- `AGENTS.md` - Rule 12: Person Data Reference Pattern
