# Exa MCP LinkedIn Profile Extraction Rules

## Overview

This document specifies the rules for extracting LinkedIn profile data using the Exa MCP server tools. LinkedIn profiles are a critical data source for enriching heritage custodian staff records.

## Available Exa Tools for LinkedIn

| Tool | Purpose | When to Use |
|------|---------|-------------|
| `exa_linkedin_search_exa` | Search for LinkedIn profiles/companies | Finding profiles when URL is unknown |
| `exa_crawling_exa` | Extract content from specific URL | **PREFERRED** when profile URL is known |
| `exa_web_search_exa` | General web search | Fallback for finding LinkedIn URLs |

## Recommended Workflow

### When Profile URL is Known (PREFERRED)

Use `exa_crawling_exa` directly with the LinkedIn profile URL:

```
Tool: exa_crawling_exa
Parameters:
  url: "https://www.linkedin.com/in/{profile-slug}"
  maxCharacters: 10000  # Recommended for comprehensive extraction
```

**Advantages**:
- Returns structured markdown with full profile content
- Includes complete career history with dates and durations
- Captures education, skills, languages, and about section
- Returns profile image URL
- Includes company metadata (size, founding year, industry)
- Low cost (~$0.001 per request)

### When Profile URL is Unknown

1. **First**: Try `exa_linkedin_search_exa`:
   ```
   Tool: exa_linkedin_search_exa
   Parameters:
     query: "{person name} {organization name}"
     searchType: "profiles"
     numResults: 5
   ```

2. **If search fails**: Use `exa_web_search_exa` with site restriction:
   ```
   Tool: exa_web_search_exa
   Parameters:
     query: "site:linkedin.com/in/ {person name} {organization}"
     numResults: 5
   ```

3. **Then**: Use `exa_crawling_exa` with discovered URL for full extraction

## Output Format

Person profile JSON files are stored in: `data/custodian/person/entity/`

### File Naming Convention

```
{linkedin-slug}_{ISO-timestamp}.json
```

Example: `alexandr-belov-bb547b46_20251210T120000Z.json`

### Required JSON Structure

**🚨 CRITICAL: ALL profiles MUST use structured JSON format. Raw content dumps are NOT acceptable.**

See `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md` for comprehensive format requirements.

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  },
  "profile_data": {
    "name": "",
    "linkedin_url": "",
    "headline": "",
    "location": "",
    "connections": "<500 connections • 2,000 followers>",
    "about": "",
    "experience": [
      {
        "title": "",
        "company": "",
        "duration": "",
        "location": "",
        "company_details": "",
        "department": "",
        "description": ""
      }
    ],
    "education": [
      {
        "degree": "",
        "institution": "",
        "duration": ""
      }
    ],
    "skills": ["", "", ...],
    "languages": [
      {"language": "", "proficiency": ""}
    ],
    "profile_image_url": ""
  }
}
```

### extraction_agent Requirement

**🚨 CRITICAL: The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.**

This ensures consistent provenance tracking and audit trail.
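
The file naming convention and the `extraction_agent` requirement above lend themselves to a small pre-save check. The following is a minimal sketch, not part of the project tooling: the function names `linkedin_slug`, `profile_filename`, and `validate_profile` are hypothetical, and the sketch assumes the timestamp is a timezone-aware `datetime` rendered in UTC.

```python
import re
from datetime import datetime, timezone

# Value mandated by the extraction_agent requirement in this document.
REQUIRED_AGENT = "claude-opus-4.5"


def linkedin_slug(url: str) -> str:
    """Return the profile slug from a linkedin.com/in/{slug} URL."""
    match = re.search(r"linkedin\.com/in/([^/?#]+)", url)
    if match is None:
        raise ValueError(f"not a LinkedIn profile URL: {url}")
    return match.group(1)


def profile_filename(url: str, when: datetime) -> str:
    """Build {linkedin-slug}_{ISO-timestamp}.json (hypothetical helper).

    Assumes `when` is timezone-aware; it is converted to UTC first.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{linkedin_slug(url)}_{stamp}.json"


def validate_profile(doc: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Both top-level sections from the required structure must be objects.
    for key in ("extraction_metadata", "profile_data"):
        if not isinstance(doc.get(key), dict):
            problems.append(f"missing or non-object section: {key}")
    meta = doc.get("extraction_metadata")
    if isinstance(meta, dict) and meta.get("extraction_agent") != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be {REQUIRED_AGENT!r}")
    return problems
```

Calling `profile_filename("https://www.linkedin.com/in/alexandr-belov-bb547b46", datetime(2025, 12, 10, 12, 0, tzinfo=timezone.utc))` reproduces the example file name given above. The validator only covers the top-level sections and the agent field; field-level checks would follow `.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md`.
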

## Heritage-Relevant Experience Tagging

When extracting profiles for heritage custodian staff, identify and tag positions at:

- **Museums**: Collections, curatorial, conservation roles
- **Libraries**: Cataloging, acquisition, research assistance
- **Archives**: Archivists, digitization, records management
- **Research institutions**: Academic libraries, scholarly heritage
- **National/government cultural bodies**: National libraries, archives, cultural ministries
- **Heritage NGOs**: Preservation societies, cultural foundations

## Data Quality Rules

1. **Preserve raw Exa response metadata** - Always include `exa_request_id`, `exa_cost_dollars`, and `exa_source`
2. **Use ISO country codes** - Convert location strings to ISO 3166-1 alpha-2 codes
3. **Normalize proficiency levels** - Use LinkedIn's standard terms:
   - Native or bilingual proficiency
   - Full professional proficiency
   - Professional working proficiency
   - Limited working proficiency
   - Elementary proficiency
4. **Calculate current status** - Parse dates to set `current: true` for ongoing positions
5. **Extract structured skills** - Create `skills_structured` with professional terminology even when LinkedIn `skills` list uses informal terms

## Custodian File Reference Pattern

When person profile is extracted, update custodian file to reference it:

```yaml
collection_management_specialist:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    current: true
    person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
```

See: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` for complete reference pattern rules.
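
The `current` flag in the reference entry and the proficiency terms in Data Quality Rules 3 and 4 can be derived mechanically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that ongoing positions carry a duration string whose end date reads "Present" (for example `Jan 2020 - Present`), which this document does not specify.

```python
import re
from typing import Optional

# LinkedIn's standard proficiency terms, as listed in Data Quality Rule 3.
PROFICIENCY_LEVELS = [
    "Native or bilingual proficiency",
    "Full professional proficiency",
    "Professional working proficiency",
    "Limited working proficiency",
    "Elementary proficiency",
]

_SUFFIX = " proficiency"


def normalize_proficiency(raw: str) -> Optional[str]:
    """Map a raw proficiency string to a standard term, or None if unknown.

    Matching ignores case and extra whitespace, and also accepts the
    short form without the trailing word "proficiency".
    """
    cleaned = " ".join(raw.split()).lower()
    for level in PROFICIENCY_LEVELS:
        full = level.lower()
        short = full[: -len(_SUFFIX)]
        if cleaned in (full, short):
            return level
    return None


def is_current(duration: str) -> bool:
    """Return True when the duration string ends its date range with 'Present'.

    ASSUMPTION: the duration format ("<start> - Present ...") is a guess
    for illustration and is not defined by the extraction rules.
    """
    return re.search(r"[-\u2013]\s*present\b", duration, re.IGNORECASE) is not None
```

Unrecognized proficiency strings return `None` rather than a guessed level, so the caller can keep the raw value and flag it for review instead of silently rewriting it.
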

## Rate Limits and Costs

- **Exa crawling**: ~$0.001 per URL
- **Exa LinkedIn search**: ~$0.003 per search
- **Cached responses**: Faster and same cost
- **Rate limits**: Respect Exa's rate limits (typically generous for authenticated users)

## Troubleshooting

### LinkedIn Search Returns Wrong Profiles

The `exa_linkedin_search_exa` tool may return:
- Posts by other people mentioning the target person
- Profiles with similar names
- Company page posts instead of personal profiles

**Solution**: Use `exa_crawling_exa` with the direct profile URL when available.

### Connection Lists Cannot Be Accessed

**Exa CANNOT access LinkedIn connection lists.** Connection data requires authenticated LinkedIn access.

**Workaround**: If the user provides connection list data (via manual browse), create a separate connections file:

```
data/custodian/person/{linkedin-slug}_connections_{timestamp}.json
```

Structure the connections file with:
- `source_metadata` - URL, timestamp, scrape method
- `connections[]` - Array of connection objects with name, headline, organization
- `network_analysis` - Aggregated statistics and heritage-relevant counts
- `heritage_network_insights` - Cluster analysis of heritage connections

Reference from main profile:

```json
"linkedin_connections_file": "data/custodian/person/{slug}_connections_{timestamp}.json"
```

### Profile Content is Truncated

Increase the `maxCharacters` parameter (default is often 3000):

```
exa_crawling_exa:
  url: "https://www.linkedin.com/in/..."
  maxCharacters: 10000  # or higher for very extensive profiles
```

### Profile Returns 403/Blocked

LinkedIn may block some requests. Exa typically handles this through caching and rotation, but if the block persists:

1. Try again after some time
2. Use alternative search to verify that the profile exists
3. Document in `raw_exa_response.note` that the profile was inaccessible

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.1.0 | 2025-12-10 | Added connection list handling (manual scrape required) |
| 1.0.0 | 2025-12-10 | Initial documentation based on Eye Filmmuseum staff extraction |

## Related Rules

- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person file reference pattern
- `.opencode/DATA_PRESERVATION_RULES.md` - Data preservation requirements
- `AGENTS.md` - Rule 12: Person Data Reference Pattern