# LinkedIn Photo URL Extraction Rule

## Problem

LinkedIn profile URLs like `https://www.linkedin.com/in/giovannafossati/` have a trivially derivable photo overlay page:

- Profile URL: `https://www.linkedin.com/in/giovannafossati/`
- Overlay URL: `https://www.linkedin.com/in/giovannafossati/overlay/photo/`

**The overlay URL is useless for data storage** because:

1. It requires JavaScript rendering to display the actual image
2. It cannot be directly embedded in applications
3. It provides no direct access to the image file

## Solution: Extract the Actual CDN Photo URL

The image shown on the LinkedIn photo overlay page is hosted on LinkedIn's CDN at `media.licdn.com`. Extract and store that **actual image URL**, not the overlay page URL.

### Example

**Profile**: `https://www.linkedin.com/in/giovannafossati/`

**WRONG** (overlay page - derivable, useless):

```
https://www.linkedin.com/in/giovannafossati/overlay/photo/
```

**CORRECT** (actual CDN image - must be extracted and stored):

```
https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M
```

## CDN URL Structure

LinkedIn photo CDN URLs follow this pattern:

```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```

### Components:

- **Host**: `media.licdn.com`
- **Path**: `/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...`
- **Size Options**: `100_100`, `200_200`, `400_400`, `800_800`
- **Expiry** (`e=`): Unix timestamp (in seconds) at which the URL expires
- **Token** (`t=`): Authentication/integrity token

### Size Preference

Always prefer the largest available size for archival purposes:

1. `800_800` (preferred)
2. `400_400`
3. `200_200`
4. `100_100`

## Extraction Workflow

### Method 1: Browser Inspection (Manual)

1. Navigate to profile: `https://www.linkedin.com/in/{slug}/`
2. Click on profile photo to open overlay
3. Right-click on the photo → "Copy Image Address"
4. The URL should start with `https://media.licdn.com/dms/image/`

### Method 2: Playwright Automation

```python
from typing import Optional

from playwright.sync_api import sync_playwright


def extract_linkedin_photo_url(profile_url: str) -> Optional[str]:
    """Extract the actual CDN photo URL from a LinkedIn profile.

    Returns None if the matched image element has no `src` attribute.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()

            # Navigate to profile
            page.goto(profile_url)
            page.wait_for_load_state('networkidle')

            # Click on profile photo to open overlay
            page.click('button[aria-label*="photo"]')

            # Wait for the CDN-hosted image, then extract its URL
            img = page.wait_for_selector('img[src*="media.licdn.com"]')
            return img.get_attribute('src') if img else None
        finally:
            # Close the browser even if navigation or a selector times out
            browser.close()
```

### Method 3: Exa MCP Tool

When using `exa_crawling_exa` or `exa_linkedin_search_exa`, look for URLs matching:

```regex
https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+
```

## JSON Storage Format

In person profile JSON files (`data/custodian/person/*.json`):

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M",
  "profile_data": {...}
}
```

### When CDN URL Not Available

If the actual CDN URL cannot be extracted, set `linkedin_photo_url` to `null` and use a `photo_urls` object with alternative sources:

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/anne-gant-59908a18",
  "linkedin_photo_url": null,
  "photo_urls": {
    "source_name": "https://example.com/photo.jpg",
    "primary": "https://example.com/photo.jpg",
    "photo_notes": "LinkedIn CDN URL not available. Using alternative source."
  }
}
```

## URL Expiration

**IMPORTANT**: LinkedIn CDN URLs have an expiration timestamp (`e=` parameter).

- URLs are time-limited; read the `e=` value rather than assuming a fixed lifetime. The example URL above carries `e=1766620800`, which is 2025-12-25 UTC
- The token (`t=`) becomes invalid after expiry
- For long-term archival, consider:
  1. Downloading and storing the image locally
  2. Recording the extraction timestamp
  3. Planning for periodic re-extraction

## Anti-Patterns

### ❌ WRONG: Store overlay page URL

```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
```

This is **derivable from the profile URL** and requires JavaScript to render.

### ❌ WRONG: Store profile URL in photo field

```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/"
```

This is not a photo URL at all.

### ✅ CORRECT: Store actual CDN URL

```json
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/..."
```

## Related Documentation

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - LinkedIn profile extraction with Exa MCP
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile file structure
- `AGENTS.md` Rule 14 - Exa MCP LinkedIn Profile Extraction

---

**Created**: 2025-12-09

**Version**: 1.0

**Status**: PRODUCTION
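
## Appendix: Validation Sketch

The CDN URL structure, size preference, and expiry rules in this document can be checked mechanically before a value is written to `linkedin_photo_url`. The following is a minimal sketch under stated assumptions, not part of any existing tooling: the function names `parse_cdn_photo_url` and `pick_largest`, the returned dictionary keys, and the choice to reject anything not served over `https` from `media.licdn.com` are all illustrative.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Path pattern from the "CDN URL Structure" section; group names are hypothetical.
_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)"
    r"/profile-displayphoto-shrink_(?P<width>\d+)_(?P<height>\d+)/"
)


def parse_cdn_photo_url(url: str) -> Optional[dict]:
    """Return image_id, size, and expiry for a CDN photo URL, or None otherwise."""
    parts = urlparse(url)
    # Overlay and profile URLs live on www.linkedin.com, so they fail this check.
    if parts.scheme != "https" or parts.hostname != "media.licdn.com":
        return None
    match = _PATH_RE.match(parts.path)
    if match is None:
        return None

    # `e=` is a Unix timestamp in seconds; it may be absent on malformed input.
    expiry_values = parse_qs(parts.query).get("e", [])
    expires_at = None
    if expiry_values and expiry_values[0].isdigit():
        expires_at = datetime.fromtimestamp(int(expiry_values[0]), tz=timezone.utc)

    return {
        "image_id": match.group("image_id"),
        "size": int(match.group("width")),
        "expires_at": expires_at,
    }


def pick_largest(urls: List[str]) -> Optional[str]:
    """Pick the valid CDN URL with the largest size, per the size preference list."""
    candidates = []
    for url in urls:
        info = parse_cdn_photo_url(url)
        if info is not None:
            candidates.append((info["size"], url))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

For the example URL in this document, `parse_cdn_photo_url` yields image ID `C4D03AQHQCBcoih82SQ`, size `800`, and an expiry of 2025-12-25 00:00 UTC, while the overlay URL and the bare profile URL both return `None`. Comparing `expires_at` with the current time is one way to decide when re-extraction is due.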