glam/.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
2025-12-11 22:32:09 +01:00

Exa MCP LinkedIn Profile Extraction Rules

Overview

This document specifies the rules for extracting LinkedIn profile data using the Exa MCP server tools. LinkedIn profiles are a critical data source for enriching heritage custodian staff records.

Available Exa Tools for LinkedIn

| Tool | Purpose | When to Use |
| --- | --- | --- |
| exa_linkedin_search_exa | Search for LinkedIn profiles/companies | Finding profiles when URL is unknown |
| exa_crawling_exa | Extract content from specific URL | PREFERRED when profile URL is known |
| exa_web_search_exa | General web search | Fallback for finding LinkedIn URLs |

When Profile URL is Known (PREFERRED)

Use exa_crawling_exa directly with the LinkedIn profile URL:

Tool: exa_crawling_exa
Parameters:
  url: "https://www.linkedin.com/in/{profile-slug}"
  maxCharacters: 10000  # Recommended for comprehensive extraction

Advantages:

  • Returns structured markdown with full profile content
  • Includes complete career history with dates and durations
  • Captures education, skills, languages, and about section
  • Returns profile image URL
  • Includes company metadata (size, founding year, industry)
  • Low cost (~$0.001 per request)

When Profile URL is Unknown

  1. First: Try exa_linkedin_search_exa:

    Tool: exa_linkedin_search_exa
    Parameters:
      query: "{person name} {organization name}"
      searchType: "profiles"
      numResults: 5
    
  2. If search fails: Use exa_web_search_exa with site restriction:

    Tool: exa_web_search_exa
    Parameters:
      query: "site:linkedin.com/in/ {person name} {organization}"
      numResults: 5
    
  3. Then: Use exa_crawling_exa with the discovered URL for full extraction
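
The decision between the known-URL and unknown-URL paths can be sketched as a small planning function. This is only an illustration: the function name `plan_tool_calls` and the dict layout are hypothetical, and the actual invocation of the Exa MCP tools is left to the calling agent. Tool names and parameters are the ones listed above.

```python
from typing import Optional


def plan_tool_calls(person: str, organization: str,
                    profile_url: Optional[str] = None) -> list[dict]:
    """Return the ordered Exa tool calls to attempt for one person.

    Hypothetical helper: it only builds the call descriptions, it does
    not talk to the Exa MCP server.
    """
    if profile_url:
        # Preferred path: the profile URL is known, crawl it directly.
        return [{
            "tool": "exa_crawling_exa",
            "params": {"url": profile_url, "maxCharacters": 10000},
        }]
    # URL unknown: LinkedIn search first, site-restricted web search as fallback.
    return [
        {
            "tool": "exa_linkedin_search_exa",
            "params": {
                "query": f"{person} {organization}",
                "searchType": "profiles",
                "numResults": 5,
            },
        },
        {
            "tool": "exa_web_search_exa",
            "params": {
                "query": f"site:linkedin.com/in/ {person} {organization}",
                "numResults": 5,
            },
        },
    ]
```

Once either search yields a profile URL, calling `plan_tool_calls` again with `profile_url` set gives the final crawl step.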

Output Format

Person profile JSON files are stored in: data/custodian/person/entity/

File Naming Convention

{linkedin-slug}_{ISO-timestamp}.json

Example: alexandr-belov-bb547b46_20251210T120000Z.json
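
A minimal sketch of deriving the file name from a profile URL and an extraction time. The helper name `profile_filename` is hypothetical, and it assumes the slug is the path segment directly after `/in/`.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def profile_filename(linkedin_url: str, extracted_at: datetime) -> str:
    """Build '{linkedin-slug}_{ISO-timestamp}.json' for a profile URL."""
    if extracted_at.tzinfo is None:
        raise ValueError("extracted_at must be timezone-aware")
    # "/in/alexandr-belov-bb547b46/" -> ["in", "alexandr-belov-bb547b46"]
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {linkedin_url}")
    slug = parts[1]
    # Compact ISO 8601 timestamp in UTC, e.g. 20251210T120000Z
    stamp = extracted_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug}_{stamp}.json"


name = profile_filename(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
)
# name == "alexandr-belov-bb547b46_20251210T120000Z.json"
```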

Required JSON Structure

🚨 CRITICAL: ALL profiles MUST use structured JSON format. Raw content dumps are NOT acceptable.

See .opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md for comprehensive format requirements.

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  },
  "profile_data": {
    "name": "<full name>",
    "linkedin_url": "<full URL>",
    "headline": "<LinkedIn headline>",
    "location": "<city, region, country>",
    "connections": "<500 connections • 2,000 followers>",
    "about": "<full about section text>",
    "experience": [
      {
        "title": "<job title>",
        "company": "<company name>",
        "duration": "<Mon YYYY - Mon YYYY • X years Y months>",
        "location": "<city, country>",
        "company_details": "<Company: size • Founded year • Type • Industry>",
        "department": "<department • Level: level>",
        "description": "<role description if provided>"
      }
    ],
    "education": [
      {
        "degree": "<degree type and field>",
        "institution": "<school name>",
        "duration": "<start - end • X years>"
      }
    ],
    "skills": ["<skill1>", "<skill2>", ...],
    "languages": [
      {"language": "<name>", "proficiency": "<level>"}
    ],
    "profile_image_url": "<LinkedIn CDN image URL>"
  }
}

extraction_agent Requirement

🚨 CRITICAL: The extraction_agent field MUST always be set to "claude-opus-4.5".

This ensures consistent provenance tracking and audit trail.
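
A pre-save check along these lines could catch raw dumps and a wrong extraction_agent value. It is a sketch, not the project's validator: the function name `validate_profile` is hypothetical, and treating every metadata key from the template above as mandatory is an assumption.

```python
REQUIRED_AGENT = "claude-opus-4.5"

# Keys shown under extraction_metadata in the template above.
# Assumption: all of them are mandatory.
METADATA_KEYS = (
    "source_file", "staff_id", "extraction_date", "extraction_method",
    "extraction_agent", "linkedin_url", "cost_usd", "request_id",
)


def validate_profile(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    if not isinstance(doc.get("profile_data"), dict):
        problems.append("profile_data object is missing (raw content dumps are not acceptable)")
    metadata = doc.get("extraction_metadata")
    if not isinstance(metadata, dict):
        problems.append("extraction_metadata object is missing")
        return problems
    for key in METADATA_KEYS:
        if key not in metadata:
            problems.append(f"extraction_metadata.{key} is missing")
    agent = metadata.get("extraction_agent")
    if agent is not None and agent != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be {REQUIRED_AGENT!r}, got {agent!r}")
    return problems
```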


## Heritage-Relevant Experience Tagging

When extracting profiles for heritage custodian staff, identify and tag positions at:

- **Museums**: Collections, curatorial, conservation roles
- **Libraries**: Cataloging, acquisition, research assistance
- **Archives**: Archivists, digitization, records management
- **Research institutions**: Academic libraries, scholarly heritage
- **National/government cultural bodies**: National libraries, archives, cultural ministries
- **Heritage NGOs**: Preservation societies, cultural foundations

## Data Quality Rules

1. **Preserve raw Exa response metadata** - Always include `exa_request_id`, `exa_cost_dollars`, and `exa_source`

2. **Use ISO country codes** - Convert location strings to ISO 3166-1 alpha-2 codes

3. **Normalize proficiency levels** - Use LinkedIn's standard terms:
   - Native or bilingual proficiency
   - Full professional proficiency
   - Professional working proficiency
   - Limited working proficiency
   - Elementary proficiency

4. **Calculate current status** - Parse dates to set `current: true` for ongoing positions

5. **Extract structured skills** - Create `skills_structured` with professional terminology even when LinkedIn `skills` list uses informal terms
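
Rules 3 and 4 can be sketched as two small helpers. The function names are hypothetical. The sketch assumes that ongoing positions end their date range with the word "Present" (this document does not specify the marker), and that short forms such as "Native" should map to the full LinkedIn term.

```python
import re
from typing import Optional

LINKEDIN_PROFICIENCY = [
    "Native or bilingual proficiency",
    "Full professional proficiency",
    "Professional working proficiency",
    "Limited working proficiency",
    "Elementary proficiency",
]


def normalize_proficiency(raw: str) -> Optional[str]:
    """Map a free-form level to LinkedIn's standard term, or None if unknown."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    if not key:
        return None
    for term in LINKEDIN_PROFICIENCY:
        # Exact match, or a prefix such as "native" / "full professional"
        if term.lower().startswith(key):
            return term
    return None


def is_current(duration: str) -> bool:
    """True when a '<start> - <end> • <length>' string ends in 'Present'.

    Assumption: 'Present' marks an ongoing position.
    """
    date_range = duration.split("•")[0].strip()
    end = re.split(r"\s[-–]\s", date_range)[-1]
    return end.strip().lower() == "present"
```

Returning `None` for unrecognized levels, rather than guessing, leaves the decision to a reviewer.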

## Custodian File Reference Pattern

When a person profile is extracted, update the custodian file to reference it:

```yaml
collection_management_specialist:
- name: Alexandr Belov
  role: Collection/Information Specialist
  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
  current: true
  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
```

See: .opencode/PERSON_DATA_REFERENCE_PATTERN.md for complete reference pattern rules.

Rate Limits and Costs

  • Exa crawling: ~$0.001 per URL
  • Exa LinkedIn search: ~$0.003 per search
  • Cached responses: faster, but billed at the same cost
  • Rate limits: Respect Exa's rate limits (typically generous for authenticated users)
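
As a rough budgeting aid, the approximate prices above can be combined into a batch estimate. The function name is hypothetical, and the sketch assumes each unknown-URL profile needs exactly one LinkedIn search followed by one crawl (fallback web searches are not priced here).

```python
from decimal import Decimal

CRAWL_COST = Decimal("0.001")   # approx. USD per crawled URL
SEARCH_COST = Decimal("0.003")  # approx. USD per LinkedIn search


def estimate_batch_cost(known_urls: int, unknown_urls: int) -> Decimal:
    """Approximate USD cost of extracting a batch of profiles."""
    crawls = known_urls + unknown_urls  # every profile is crawled once
    searches = unknown_urls             # assumption: one search per unknown URL
    return crawls * CRAWL_COST + searches * SEARCH_COST


# 10 known + 5 unknown: 15 crawls and 5 searches = 0.015 + 0.015
total = estimate_batch_cost(10, 5)
```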

Troubleshooting

LinkedIn Search Returns Wrong Profiles

The exa_linkedin_search_exa tool may return:

  • Posts by other people mentioning the target person
  • Profiles with similar names
  • Company page posts instead of personal profiles

Solution: Use exa_crawling_exa with the direct profile URL when available.

Connection Lists Cannot Be Accessed

Exa CANNOT access LinkedIn connection lists. Connection data requires authenticated LinkedIn access.

Workaround: If the user provides connection list data (gathered by browsing manually), create a separate connections file:

data/custodian/person/{linkedin-slug}_connections_{timestamp}.json

Structure the connections file with:

  • source_metadata - URL, timestamp, scrape method
  • connections[] - Array of connection objects with name, headline, organization
  • network_analysis - Aggregated statistics and heritage-relevant counts
  • heritage_network_insights - Cluster analysis of heritage connections
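
A skeleton for such a file might be assembled as follows. Only the four top-level keys and the name/headline/organization fields come from the list above. The function name, the key names inside `source_metadata`, the `"manual_browse"` value, and the `total_connections` counter are illustrative assumptions.

```python
def build_connections_file(profile_url: str, scraped_at: str,
                           connections: list[dict]) -> dict:
    """Assemble the connections document from manually gathered rows."""
    return {
        "source_metadata": {
            # Key names and the method value are assumptions
            "url": profile_url,
            "timestamp": scraped_at,
            "scrape_method": "manual_browse",
        },
        "connections": [
            {
                "name": c.get("name"),
                "headline": c.get("headline"),
                "organization": c.get("organization"),
            }
            for c in connections
        ],
        # Assumed minimal statistic; heritage-relevant counts need
        # a separate classification step
        "network_analysis": {"total_connections": len(connections)},
        "heritage_network_insights": {},
    }
```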

Reference from main profile:

"linkedin_connections_file": "data/custodian/person/{slug}_connections_{timestamp}.json"

Profile Content is Truncated

Increase the maxCharacters parameter (the default is often 3000):

exa_crawling_exa:
  url: "https://www.linkedin.com/in/..."
  maxCharacters: 10000  # or higher for very extensive profiles

Profile Returns 403/Blocked

LinkedIn may block some requests. Exa typically handles this through caching and rotation, but if the block persists:

  1. Try again after some time
  2. Use alternative search to verify profile exists
  3. Document in raw_exa_response.note that profile was inaccessible
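
Steps 1 and 3 can be sketched as a retry loop with increasing delays. The `fetch` callable is a hypothetical stand-in for whatever wrapper invokes exa_crawling_exa and returns `None` when the profile is blocked. The attempt count and the delays are arbitrary assumptions, not values recommended by Exa.

```python
import time
from typing import Callable, Optional


def crawl_with_retry(fetch: Callable[[str], Optional[dict]], url: str,
                     attempts: int = 3, base_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry a blocked crawl, then record the failure in raw_exa_response.note."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Wait longer after each failure: 60 s, 120 s, ...
            sleep(base_delay * 2 ** attempt)
    return {
        "raw_exa_response": {
            "note": f"Profile inaccessible after {attempts} attempts: {url}"
        }
    }
```

Passing `sleep` as a parameter keeps the function testable without real waiting.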

Version History

| Version | Date | Changes |
| --- | --- | --- |
| 1.1.0 | 2025-12-10 | Added connection list handling (manual scrape required) |
| 1.0.0 | 2025-12-10 | Initial documentation based on Eye Filmmuseum staff extraction |

Related Documentation

  • .opencode/PERSON_DATA_REFERENCE_PATTERN.md - Person file reference pattern
  • .opencode/DATA_PRESERVATION_RULES.md - Data preservation requirements
  • AGENTS.md - Rule 12: Person Data Reference Pattern