# LinkedIn Photo URL Extraction Rule

## Problem

LinkedIn profile URLs like `https://www.linkedin.com/in/giovannafossati/` have a trivially derivable photo overlay page:
- Profile URL: `https://www.linkedin.com/in/giovannafossati/`
- Overlay URL: `https://www.linkedin.com/in/giovannafossati/overlay/photo/`

**The overlay URL is useless for data storage** because:
1. It requires JavaScript rendering to display the actual image
2. It cannot be directly embedded in applications
3. It provides no direct access to the image file

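
To make "trivially derivable" concrete, here is a minimal sketch of the derivation. The helper name `overlay_url` is hypothetical and not part of any existing tooling; it only appends the fixed `overlay/photo/` suffix shown above.

```python
def overlay_url(profile_url: str) -> str:
    """Derive the photo overlay page URL from a LinkedIn profile URL."""
    # Strip any trailing slash first so ".../in/slug" and ".../in/slug/"
    # both produce the same overlay URL.
    return profile_url.rstrip("/") + "/overlay/photo/"
```

Because a one-line function can rebuild the overlay URL from the profile URL at any time, storing it adds no information to a record.
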
## Solution: Extract the Actual CDN Photo URL

The photo overlay page loads the **actual image** from LinkedIn's CDN at `media.licdn.com`. That CDN URL, not the overlay page URL, is the value to extract and store.

### Example

**Profile**: `https://www.linkedin.com/in/giovannafossati/`

**WRONG** (overlay page - derivable, useless):
```
https://www.linkedin.com/in/giovannafossati/overlay/photo/
```

**CORRECT** (actual CDN image - must be extracted and stored):
```
https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M
```

## CDN URL Structure

LinkedIn photo CDN URLs follow this pattern:
```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```

### Components:
- **Host**: `media.licdn.com`
- **Path**: `/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...`
- **Size Options**: `100_100`, `200_200`, `400_400`, `800_800`
- **Expiry** (`e=`): Unix timestamp, in seconds, at which the URL expires
- **Token** (`t=`): Authentication/integrity token

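
The components above can be pulled apart with the standard library. The following is a sketch only: the function name `parse_cdn_photo_url` and the returned dictionary keys are illustrative assumptions, and the path regex encodes nothing beyond the pattern documented in this section.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Path layout taken from the documented pattern; group names are illustrative.
_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)"
    r"/profile-displayphoto-shrink_(?P<width>\d+)_(?P<height>\d+)/"
)


def parse_cdn_photo_url(url: str) -> Optional[dict]:
    """Split a LinkedIn CDN photo URL into its components, or return None."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.netloc != "media.licdn.com":
        return None
    match = _PATH_RE.match(parts.path)
    if match is None:
        return None
    query = parse_qs(parts.query)
    expiry = query.get("e", [None])[0]
    return {
        "image_id": match["image_id"],
        "size": f"{match['width']}_{match['height']}",
        # e= is a Unix timestamp in seconds; keep None if absent or malformed.
        "expiry": int(expiry) if expiry and expiry.isdigit() else None,
        "token": query.get("t", [None])[0],
    }
```

Returning `None` for anything that does not match keeps overlay and profile URLs from being mistaken for photo URLs.
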
### Size Preference

Always prefer the largest available size for archival purposes:
1. `800_800` (preferred)
2. `400_400`
3. `200_200`
4. `100_100`

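
When a page exposes several renditions of the same photo, the preference order above can be applied mechanically. This is a minimal sketch; `pick_largest` is a hypothetical helper, and it ranks candidates by the pixel area read from the `shrink_{SIZE}_{SIZE}` path segment.

```python
import re
from typing import Iterable, Optional

_SIZE_RE = re.compile(r"profile-displayphoto-shrink_(\d+)_(\d+)")


def pick_largest(urls: Iterable[str]) -> Optional[str]:
    """Return the candidate URL with the largest declared size, if any."""
    best_url, best_area = None, -1
    for url in urls:
        match = _SIZE_RE.search(url)
        if match is None:
            continue  # not a profile photo rendition; ignore it
        area = int(match.group(1)) * int(match.group(2))
        if area > best_area:
            best_url, best_area = url, area
    return best_url
```

Ranking by area rather than by a hard-coded list means an unexpected size still sorts correctly.
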
## Extraction Workflow

### Method 1: Browser Inspection (Manual)

1. Navigate to profile: `https://www.linkedin.com/in/{slug}/`
2. Click on profile photo to open overlay
3. Right-click on the photo → "Copy Image Address"
4. Confirm that the copied URL starts with `https://media.licdn.com/dms/image/`

### Method 2: Playwright Automation

```python
from typing import Optional

from playwright.sync_api import sync_playwright


def extract_linkedin_photo_url(profile_url: str) -> Optional[str]:
    """Extract the actual CDN photo URL from a LinkedIn profile, or None."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()

            # Navigate to profile
            page.goto(profile_url)
            page.wait_for_load_state('networkidle')

            # Click on profile photo to open overlay
            page.click('button[aria-label*="photo"]')
            page.wait_for_selector('img[src*="media.licdn.com"]')

            # Extract the CDN URL; query_selector returns None if nothing matches
            img = page.query_selector('img[src*="media.licdn.com"]')
            return img.get_attribute('src') if img else None
        finally:
            # Close the browser even if navigation or a selector times out
            browser.close()
```

### Method 3: Exa MCP Tool

When using `exa_crawling_exa` or `exa_linkedin_search_exa`, look for URLs matching:
```regex
https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+
```

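
As an illustration of applying that pattern, the sketch below scans a block of crawled text for matching URLs. It assumes the tool output is already available as a plain string; `find_cdn_photo_urls` is a hypothetical helper, not part of the Exa tooling.

```python
import re
from typing import List

# Same pattern as above, compiled once for reuse.
CDN_PHOTO_RE = re.compile(
    r"""https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+"""
)


def find_cdn_photo_urls(text: str) -> List[str]:
    """Return unique CDN photo URLs found in text, in order of appearance."""
    found: List[str] = []
    for url in CDN_PHOTO_RE.findall(text):
        if url not in found:
            found.append(url)
    return found
```

The pattern stops at whitespace and quote characters, so a URL embedded in an HTML attribute or a JSON string is captured without its closing delimiter.
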
## JSON Storage Format

In person profile JSON files (`data/custodian/person/*.json`):

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M",
  "profile_data": {...}
}
```

### When CDN URL Not Available

If the actual CDN URL cannot be extracted, set `linkedin_photo_url` to `null` and use a `photo_urls` object with alternative sources:

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/anne-gant-59908a18",
  "linkedin_photo_url": null,
  "photo_urls": {
    "source_name": "https://example.com/photo.jpg",
    "primary": "https://example.com/photo.jpg",
    "photo_notes": "LinkedIn CDN URL not available. Using alternative source."
  }
}
```

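
The two storage cases can be produced by one small function. This is a sketch under assumptions: `build_photo_fields` and its parameters are hypothetical names, and the prefix check is the same one used in Method 1.

```python
from typing import Optional

CDN_PREFIX = "https://media.licdn.com/dms/image/"


def build_photo_fields(
    profile_url: str,
    cdn_url: Optional[str] = None,
    fallback_source: Optional[str] = None,
    fallback_url: Optional[str] = None,
) -> dict:
    """Build the photo-related fields of a person profile record."""
    record = {"linkedin_profile_url": profile_url, "linkedin_photo_url": None}
    if cdn_url and cdn_url.startswith(CDN_PREFIX):
        # A real CDN URL was extracted: store it directly.
        record["linkedin_photo_url"] = cdn_url
    elif fallback_source and fallback_url:
        # No CDN URL: keep the field null and record the alternative source.
        record["photo_urls"] = {
            fallback_source: fallback_url,
            "primary": fallback_url,
            "photo_notes": "LinkedIn CDN URL not available. Using alternative source.",
        }
    return record
```

Serialising the result with `json.dumps` writes Python `None` as JSON `null`, which matches the format shown above.
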
## URL Expiration

**IMPORTANT**: LinkedIn CDN URLs have an expiration timestamp (`e=` parameter).

- Lifetimes vary, so read `e=` on each URL rather than assuming a fixed validity period; the example URL above carries `e=1766620800`, which is 2025-12-25 00:00:00 UTC
- The token (`t=`) becomes invalid after expiry
- For long-term archival, consider:
  1. Downloading and storing the image locally
  2. Recording the extraction timestamp
  3. Planning for periodic re-extraction

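
To plan re-extraction, the `e=` value can be decoded into a date. The sketch below uses only the standard library; the function names `cdn_url_expiry` and `is_expired` are illustrative, and treating a URL without a usable `e=` value as expired is a deliberately conservative assumption.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def cdn_url_expiry(url: str) -> Optional[datetime]:
    """Return the expiry encoded in the e= parameter as a UTC datetime."""
    values = parse_qs(urlsplit(url).query).get("e")
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """True if the URL has passed its expiry or carries no usable e= value."""
    expiry = cdn_url_expiry(url)
    if expiry is None:
        return True  # assumption: an undated URL is not safe to rely on
    current = now or datetime.now(timezone.utc)
    return current >= expiry
```

Storing the decoded expiry next to the extraction timestamp makes it possible to schedule re-extraction before a URL stops resolving.
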
## Anti-Patterns

### ❌ WRONG: Store overlay page URL
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
```
This is **derivable from the profile URL** and requires JavaScript to render.

### ❌ WRONG: Store profile URL in photo field
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/"
```
This is not a photo URL at all.

### ✅ CORRECT: Store actual CDN URL
```json
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/..."
```

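
A check along these lines could flag the anti-patterns before a record is written. It is a sketch only: `classify_photo_value` and its four labels are hypothetical, and the host list covers just the forms that appear in this document.

```python
import re
from urllib.parse import urlsplit

_LINKEDIN_HOSTS = ("www.linkedin.com", "linkedin.com")
_PROFILE_RE = re.compile(r"^/in/[^/]+/?$")
_OVERLAY_RE = re.compile(r"^/in/[^/]+/overlay/photo/?$")


def classify_photo_value(url: str) -> str:
    """Label a candidate value as 'cdn', 'overlay', 'profile', or 'unknown'."""
    parts = urlsplit(url)
    if parts.netloc == "media.licdn.com" and parts.path.startswith("/dms/image/"):
        return "cdn"
    if parts.netloc in _LINKEDIN_HOSTS:
        if _OVERLAY_RE.match(parts.path):
            return "overlay"
        if _PROFILE_RE.match(parts.path):
            return "profile"
    return "unknown"
```

Only a value labelled `cdn` belongs in `linkedin_photo_url`; `overlay` and `profile` correspond to the two wrong cases above.
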
## Related Documentation

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - LinkedIn profile extraction with Exa MCP
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile file structure
- `AGENTS.md` Rule 14 - Exa MCP LinkedIn Profile Extraction

---

**Created**: 2025-12-09
**Version**: 1.0
**Status**: PRODUCTION