Exa MCP LinkedIn Profile Extraction Rules
Overview
This document specifies the rules for extracting LinkedIn profile data using the Exa MCP server tools. LinkedIn profiles are a critical data source for enriching heritage custodian staff records.
Available Exa Tools for LinkedIn
| Tool | Purpose | When to Use |
|---|---|---|
| `exa_linkedin_search_exa` | Search for LinkedIn profiles/companies | Finding profiles when URL is unknown |
| `exa_crawling_exa` | Extract content from specific URL | PREFERRED when profile URL is known |
| `exa_web_search_exa` | General web search | Fallback for finding LinkedIn URLs |
Recommended Workflow
When Profile URL is Known (PREFERRED)
Use exa_crawling_exa directly with the LinkedIn profile URL:
Tool: exa_crawling_exa
Parameters:
url: "https://www.linkedin.com/in/{profile-slug}"
maxCharacters: 10000 # Recommended for comprehensive extraction
Advantages:
- Returns structured markdown with full profile content
- Includes complete career history with dates and durations
- Captures education, skills, languages, and about section
- Returns profile image URL
- Includes company metadata (size, founding year, industry)
- Low cost (~$0.001 per request)
When Profile URL is Unknown
1. First: Try `exa_linkedin_search_exa`:

       Tool: exa_linkedin_search_exa
       Parameters:
         query: "{person name} {organization name}"
         searchType: "profiles"
         numResults: 5

2. If search fails: Use `exa_web_search_exa` with site restriction:

       Tool: exa_web_search_exa
       Parameters:
         query: "site:linkedin.com/in/ {person name} {organization}"
         numResults: 5

3. Then: Use `exa_crawling_exa` with the discovered URL for full extraction
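The fallback order above can be sketched in Python. This is a minimal illustration, not part of the rule itself: the Exa tools are MCP tools rather than Python functions, so `linkedin_search` and `web_search` are hypothetical callables that the caller supplies, each assumed to take a query string and return a list of candidate URLs.

```python
from typing import Callable, List, Optional

# Hypothetical signature: takes a query string, returns candidate URLs.
SearchFn = Callable[[str], List[str]]


def find_profile_url(
    person: str,
    organization: str,
    linkedin_search: SearchFn,
    web_search: SearchFn,
) -> Optional[str]:
    """Return the first personal-profile URL found, or None."""
    attempts = [
        (linkedin_search, f"{person} {organization}"),
        (web_search, f"site:linkedin.com/in/ {person} {organization}"),
    ]
    for search, query in attempts:
        # The web search only runs if the LinkedIn search found no profile.
        for url in search(query):
            # Keep personal profiles only; skip posts and company pages.
            if "linkedin.com/in/" in url:
                return url
    return None
```

The returned URL would then be passed to `exa_crawling_exa` for full extraction. Filtering on `linkedin.com/in/` is a coarse guard against the post and company-page results described under Troubleshooting; it does not confirm that the profile belongs to the right person.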
Output Format
Person profile JSON files are stored in: data/custodian/person/entity/
File Naming Convention
{linkedin-slug}_{ISO-timestamp}.json
Example: alexandr-belov-bb547b46_20251210T120000Z.json
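A small helper can derive the file name from a profile URL and an extraction time. This is a sketch under stated assumptions: the function name `profile_filename` is hypothetical, and the timestamp is assumed to be the compact UTC form used in the example above.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def profile_filename(linkedin_url: str, extracted_at: datetime) -> str:
    """Build '{linkedin-slug}_{ISO-timestamp}.json' for a profile URL."""
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a personal profile URL: {linkedin_url}")
    slug = parts[1]
    # Convert to UTC first; a naive datetime is treated as local time.
    stamp = extracted_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug}_{stamp}.json"


name = profile_filename(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
)
print(name)  # alexandr-belov-bb547b46_20251210T120000Z.json
```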
Required JSON Structure
🚨 CRITICAL: ALL profiles MUST use structured JSON format. Raw content dumps are NOT acceptable.
See .opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md for comprehensive format requirements.
{
"extraction_metadata": {
"source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
"staff_id": "{custodian}_staff_{index}_{name_slug}",
"extraction_date": "2025-12-10T16:00:00Z",
"extraction_method": "exa_contents",
"extraction_agent": "claude-opus-4.5",
"linkedin_url": "https://www.linkedin.com/in/{slug}",
"cost_usd": 0,
"request_id": "{exa_request_id}"
},
"profile_data": {
"name": "<full name>",
"linkedin_url": "<full URL>",
"headline": "<LinkedIn headline>",
"location": "<city, region, country>",
"connections": "<500 connections • 2,000 followers>",
"about": "<full about section text>",
"experience": [
{
"title": "<job title>",
"company": "<company name>",
"duration": "<Mon YYYY - Mon YYYY • X years Y months>",
"location": "<city, country>",
"company_details": "<Company: size • Founded year • Type • Industry>",
"department": "<department • Level: level>",
"description": "<role description if provided>"
}
],
"education": [
{
"degree": "<degree type and field>",
"institution": "<school name>",
"duration": "<start - end • X years>"
}
],
"skills": ["<skill1>", "<skill2>", ...],
"languages": [
{"language": "<name>", "proficiency": "<level>"}
],
"profile_image_url": "<LinkedIn CDN image URL>"
}
}
extraction_agent Requirement
🚨 CRITICAL: The extraction_agent field MUST always be set to "claude-opus-4.5".
This ensures consistent provenance tracking and audit trail.
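A lightweight check before writing a file can catch the most common violations: a raw content dump in place of structured data, or a wrong `extraction_agent`. The sketch below is illustrative only. The function name `profile_problems` and the choice of which fields to check are assumptions, and the authoritative requirements remain those in `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md`.

```python
from typing import Any, Dict, List

REQUIRED_AGENT = "claude-opus-4.5"
# Fields shown as arrays in the structure above.
LIST_FIELDS = ("experience", "education", "skills", "languages")


def profile_problems(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    for section in ("extraction_metadata", "profile_data"):
        # A string here usually means a raw content dump.
        if not isinstance(doc.get(section), dict):
            problems.append(f"{section} must be an object")
    metadata = doc.get("extraction_metadata")
    if isinstance(metadata, dict) and metadata.get("extraction_agent") != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be {REQUIRED_AGENT!r}")
    profile = doc.get("profile_data")
    if isinstance(profile, dict):
        for field in LIST_FIELDS:
            if field in profile and not isinstance(profile[field], list):
                problems.append(f"profile_data.{field} must be an array")
    return problems
```

Returning a list of problems rather than raising on the first one lets a batch run report every issue in a file at once.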
## Heritage-Relevant Experience Tagging
When extracting profiles for heritage custodian staff, identify and tag positions at:
- **Museums**: Collections, curatorial, conservation roles
- **Libraries**: Cataloging, acquisition, research assistance
- **Archives**: Archivists, digitization, records management
- **Research institutions**: Academic libraries, scholarly heritage
- **National/government cultural bodies**: National libraries, archives, cultural ministries
- **Heritage NGOs**: Preservation societies, cultural foundations
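One way to make a first pass at this tagging is a keyword match over each position's `title` and `company`. The sketch below is a coarse heuristic under stated assumptions: the keyword lists and tag names are hypothetical, cover only three of the categories above, and substring matching will produce false positives that need review.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword lists; extend them for the institutions you cover.
HERITAGE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "museum": ("museum", "curator", "conservation", "collections"),
    "library": ("library", "librarian", "cataloging"),
    "archive": ("archive", "archivist", "digitization", "records management"),
}


def heritage_tags(position: Dict[str, str]) -> List[str]:
    """Return the heritage tags whose keywords appear in title or company."""
    text = f"{position.get('title', '')} {position.get('company', '')}".lower()
    return [
        tag
        for tag, words in HERITAGE_KEYWORDS.items()
        if any(word in text for word in words)
    ]


print(heritage_tags({"title": "Archivist", "company": "National Archives"}))
# ['archive']
```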
## Data Quality Rules
1. **Preserve raw Exa response metadata** - Always include `exa_request_id`, `exa_cost_dollars`, and `exa_source`
2. **Use ISO country codes** - Convert location strings to ISO 3166-1 alpha-2 codes
3. **Normalize proficiency levels** - Use LinkedIn's standard terms:
- Native or bilingual proficiency
- Full professional proficiency
- Professional working proficiency
- Limited working proficiency
- Elementary proficiency
4. **Calculate current status** - Parse dates to set `current: true` for ongoing positions
5. **Extract structured skills** - Create `skills_structured` with professional terminology even when LinkedIn `skills` list uses informal terms
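Rules 3 and 4 lend themselves to small helpers. The sketch below is illustrative: the function names are hypothetical, and it assumes that an ongoing position ends its date range with the word "Present", which this document does not specify.

```python
from typing import Optional

STANDARD_LEVELS = (
    "Native or bilingual proficiency",
    "Full professional proficiency",
    "Professional working proficiency",
    "Limited working proficiency",
    "Elementary proficiency",
)


def normalize_proficiency(raw: str) -> Optional[str]:
    """Map a raw label onto a standard term, or None when nothing matches."""
    cleaned = " ".join(raw.lower().split())
    for level in STANDARD_LEVELS:
        # Accept the full term or the term without the word "proficiency".
        short = level.lower().replace(" proficiency", "")
        if cleaned in (level.lower(), short):
            return level
    return None


def is_current(duration: str) -> bool:
    """True when the date range ends in 'Present' (assumed marker)."""
    date_range = duration.split("•")[0]
    _, _, end = date_range.partition(" - ")
    return end.strip().lower() == "present"


print(normalize_proficiency("  full professional "))  # Full professional proficiency
print(is_current("Jan 2020 - Present • 5 years 11 months"))  # True
```

Labels that do not match a standard term return `None` rather than a guess, so they can be flagged for manual review.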
## Custodian File Reference Pattern
When a person profile is extracted, update the custodian file to reference it:
```yaml
collection_management_specialist:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    current: true
    person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
```
See: .opencode/PERSON_DATA_REFERENCE_PATTERN.md for complete reference pattern rules.
Rate Limits and Costs
- Exa crawling: ~$0.001 per URL
- Exa LinkedIn search: ~$0.003 per search
- Cached responses: Faster to return, but billed at the same rate
- Rate limits: Respect Exa's rate limits (typically generous for authenticated users)
Troubleshooting
LinkedIn Search Returns Wrong Profiles
The exa_linkedin_search_exa tool may return:
- Posts by other people mentioning the target person
- Profiles with similar names
- Company page posts instead of personal profiles
Solution: Use exa_crawling_exa with direct profile URL when available.
Connection Lists Cannot Be Accessed
Exa CANNOT access LinkedIn connection lists. Connection data requires authenticated LinkedIn access.
Workaround: If the user provides connection list data (gathered by browsing manually), create a separate connections file:
data/custodian/person/{linkedin-slug}_connections_{timestamp}.json
Structure the connections file with:
- `source_metadata` - URL, timestamp, scrape method
- `connections[]` - Array of connection objects with name, headline, organization
- `network_analysis` - Aggregated statistics and heritage-relevant counts
- `heritage_network_insights` - Cluster analysis of heritage connections
Reference from main profile:
"linkedin_connections_file": "data/custodian/person/{slug}_connections_{timestamp}.json"
Profile Content is Truncated
Increase the maxCharacters parameter (the default is often 3000):
exa_crawling_exa:
url: "https://www.linkedin.com/in/..."
maxCharacters: 10000 # or higher for very extensive profiles
Profile Returns 403/Blocked
LinkedIn may block some requests. Exa typically handles this through caching and rotation, but if the block persists:
- Try again after some time
- Use alternative search to verify profile exists
- Document in `raw_exa_response.note` that the profile was inaccessible
Version History
| Version | Date | Changes |
|---|---|---|
| 1.1.0 | 2025-12-10 | Added connection list handling (manual scrape required) |
| 1.0.0 | 2025-12-10 | Initial documentation based on Eye Filmmuseum staff extraction |
Related Rules
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person file reference pattern
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation requirements
- `AGENTS.md` - Rule 12: Person Data Reference Pattern