glam/docs/LINKEDIN_PHOTO_URL_EXTRACTION.md
2025-12-10 13:01:13 +01:00

# LinkedIn Photo URL Extraction Rule
## Problem
LinkedIn profile URLs like `https://www.linkedin.com/in/giovannafossati/` have a trivially derivable photo overlay page:
- Profile URL: `https://www.linkedin.com/in/giovannafossati/`
- Overlay URL: `https://www.linkedin.com/in/giovannafossati/overlay/photo/`
**The overlay URL is useless for data storage** because:
1. It requires JavaScript rendering to display the actual image
2. It cannot be directly embedded in applications
3. It provides no direct access to the image file
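To illustrate how little information the overlay URL carries, the following minimal sketch derives it from the profile URL alone. The helper name `overlay_url` is hypothetical and not part of any existing module.

```python
def overlay_url(profile_url: str) -> str:
    """Derive the photo overlay page URL from a LinkedIn profile URL."""
    # Normalise the trailing slash so ".../in/slug" and ".../in/slug/" both work.
    return profile_url.rstrip("/") + "/overlay/photo/"


print(overlay_url("https://www.linkedin.com/in/giovannafossati/"))
# prints https://www.linkedin.com/in/giovannafossati/overlay/photo/
```

Because this string can always be recomputed, storing it adds nothing beyond the profile URL itself.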
## Solution: Extract the Actual CDN Photo URL
The image displayed on the LinkedIn photo overlay page is served from LinkedIn's CDN at `media.licdn.com`. That CDN address is the **actual image URL**, and it is the value that must be extracted and stored.
### Example
**Profile**: `https://www.linkedin.com/in/giovannafossati/`
**WRONG** (overlay page - derivable, useless):
```
https://www.linkedin.com/in/giovannafossati/overlay/photo/
```
**CORRECT** (actual CDN image - must be extracted and stored):
```
https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M
```
## CDN URL Structure
LinkedIn photo CDN URLs follow this pattern:
```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```
### Components:
- **Host**: `media.licdn.com`
- **Path**: `/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...`
- **Size Options**: `100_100`, `200_200`, `400_400`, `800_800`
- **Expiry** (`e=`): Unix timestamp when URL expires
- **Token** (`t=`): Authentication/integrity token
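The components listed above can be pulled apart with the standard library. This is a sketch that assumes the path layout matches the pattern shown in this section. The function name `parse_photo_url` and the returned dictionary keys are illustrative choices, not an existing API.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed path layout, taken only from the pattern documented above.
_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)/"
    r"profile-displayphoto-shrink_(?P<width>\d+)_(?P<height>\d+)/"
)


def parse_photo_url(url: str) -> Optional[dict]:
    """Split a LinkedIn CDN photo URL into its components, or return None."""
    parts = urlsplit(url)
    if parts.hostname != "media.licdn.com":
        return None
    match = _PATH_RE.match(parts.path)
    if match is None:
        return None
    query = parse_qs(parts.query)
    expiry = query.get("e", [None])[0]
    return {
        "image_id": match["image_id"],
        "size": f"{match['width']}_{match['height']}",
        # Expiry is kept as an integer Unix timestamp when it is numeric.
        "expiry": int(expiry) if expiry and expiry.isdigit() else None,
        "token": query.get("t", [None])[0],
    }
```

Returning `None` for anything that does not match keeps overlay and profile URLs from being mistaken for photo URLs.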
### Size Preference
Always prefer the largest available size for archival purposes:
1. `800_800` (preferred)
2. `400_400`
3. `200_200`
4. `100_100`
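When several candidate URLs are available for the same person, the preference order above amounts to picking the largest size. A minimal sketch follows. The helper names `photo_size` and `pick_largest` are hypothetical.

```python
import re
from typing import List, Optional

_SIZE_RE = re.compile(r"profile-displayphoto-shrink_(\d+)_(\d+)")


def photo_size(url: str) -> int:
    """Return the pixel width encoded in the URL path, or 0 if none is found."""
    match = _SIZE_RE.search(url)
    return int(match.group(1)) if match else 0


def pick_largest(urls: List[str]) -> Optional[str]:
    """Return the candidate URL with the largest encoded size, if any."""
    # Ignore URLs that carry no size segment (e.g. overlay or profile URLs).
    candidates = [u for u in urls if photo_size(u) > 0]
    return max(candidates, key=photo_size, default=None)
```

Sorting by the encoded number, not by a fixed list of sizes, means the helper still works if a size outside the four listed above turns up.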
## Extraction Workflow
### Method 1: Browser Inspection (Manual)
1. Navigate to profile: `https://www.linkedin.com/in/{slug}/`
2. Click on profile photo to open overlay
3. Right-click on the photo → "Copy Image Address"
4. The URL should start with `https://media.licdn.com/dms/image/`
### Method 2: Playwright Automation
```python
from typing import Optional

from playwright.sync_api import sync_playwright


def extract_linkedin_photo_url(profile_url: str) -> Optional[str]:
    """Extract the actual CDN photo URL from a LinkedIn profile.

    Returns None if no CDN image element is found. Assumes the profile is
    viewable in this browser session; login handling is not covered here.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()

            # Navigate to profile
            page.goto(profile_url)
            page.wait_for_load_state("networkidle")

            # Click on profile photo to open overlay
            page.click('button[aria-label*="photo"]')
            page.wait_for_selector('img[src*="media.licdn.com"]')

            # Extract the CDN URL
            img = page.query_selector('img[src*="media.licdn.com"]')
            return img.get_attribute("src") if img else None
        finally:
            # Close the browser even if a step above raises
            browser.close()
```
### Method 3: Exa MCP Tool
When using `exa_crawling_exa` or `exa_linkedin_search_exa`, look for URLs matching:
```regex
https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+
```
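As a sketch of how the pattern above might be applied to crawled text, the snippet below compiles the same regex and returns each distinct match once. The function name `find_photo_urls` is hypothetical, and the input is assumed to be plain text or HTML returned by the tool.

```python
import re
from typing import List

# Same pattern as documented above, compiled once for reuse.
PHOTO_URL_RE = re.compile(
    r"""https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+"""
)


def find_photo_urls(text: str) -> List[str]:
    """Return all distinct CDN photo URLs in text, in order of appearance."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(PHOTO_URL_RE.findall(text)))
```

The pattern stops at whitespace or a quote character, so URLs embedded in `src="..."` attributes are captured without the closing quote.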
## JSON Storage Format
In person profile JSON files (`data/custodian/person/*.json`):
```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M",
  "profile_data": {...}
}
```
### When CDN URL Not Available
If the actual CDN URL cannot be extracted, set `linkedin_photo_url` to `null` and use a `photo_urls` object with alternative sources:
```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/anne-gant-59908a18",
  "linkedin_photo_url": null,
  "photo_urls": {
    "source_name": "https://example.com/photo.jpg",
    "primary": "https://example.com/photo.jpg",
    "photo_notes": "LinkedIn CDN URL not available. Using alternative source."
  }
}
```
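The fallback rule can be sketched as a small helper that builds the photo-related fields of a record. The function name `photo_fields`, the choice of the first fallback as `primary`, and the `museum_site` key in the usage example are all assumptions made for illustration.

```python
from typing import Dict, Optional

NOTE = "LinkedIn CDN URL not available. Using alternative source."


def photo_fields(cdn_url: Optional[str],
                 fallbacks: Optional[Dict[str, str]] = None) -> dict:
    """Build the photo fields for a person record."""
    if cdn_url and cdn_url.startswith("https://media.licdn.com/dms/image/"):
        return {"linkedin_photo_url": cdn_url}

    # No usable CDN URL: store null and attach alternative sources, if any.
    record: dict = {"linkedin_photo_url": None}
    if fallbacks:
        photo_urls = dict(fallbacks)
        # Assumption: the first listed alternative is treated as primary.
        photo_urls["primary"] = next(iter(fallbacks.values()))
        photo_urls["photo_notes"] = NOTE
        record["photo_urls"] = photo_urls
    return record


print(photo_fields(None, {"museum_site": "https://example.com/photo.jpg"}))
```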
## URL Expiration
**IMPORTANT**: LinkedIn CDN URLs have an expiration timestamp (`e=` parameter).
- Do not rely on a long lifetime: the example URL above has `e=1766620800`, which is 2025-12-25 00:00 UTC, about two weeks after this document was created
- The token (`t=`) becomes invalid after expiry
- For long-term archival, consider:
  1. Downloading and storing the image locally
  2. Recording the extraction timestamp
  3. Planning for periodic re-extraction
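To plan re-extraction, the `e=` parameter can be read back from a stored URL. The sketch below assumes `e=` is a Unix timestamp in seconds, as described above. The names `expiry_of` and `needs_reextraction` and the 7-day default margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def expiry_of(url: str) -> Optional[datetime]:
    """Return the expiry of a CDN URL as a UTC datetime, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("e")
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def needs_reextraction(url: str, now: Optional[datetime] = None,
                       margin_days: int = 7) -> bool:
    """True if the URL has no expiry, has expired, or expires within the margin."""
    expiry = expiry_of(url)
    if expiry is None:
        return True
    now = now or datetime.now(timezone.utc)
    return expiry - now <= timedelta(days=margin_days)
```

Passing `now` explicitly makes the check reproducible in batch jobs and tests.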
## Anti-Patterns
### ❌ WRONG: Store overlay page URL
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
```
This is **derivable from the profile URL** and requires JavaScript to render.
### ❌ WRONG: Store profile URL in photo field
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/"
```
This is not a photo URL at all.
### ✅ CORRECT: Store actual CDN URL
```json
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/..."
```
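A simple guard can catch the anti-patterns in this section before a record is written. This is a sketch under the assumption that only `https://media.licdn.com/dms/image/...` URLs (or `null`) are acceptable in `linkedin_photo_url`. The function name `is_valid_photo_field` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_valid_photo_field(value: Optional[str]) -> bool:
    """Check a candidate value for the linkedin_photo_url field."""
    if value is None:
        # null is allowed when no CDN URL could be extracted.
        return True
    parts = urlsplit(value)
    return (
        parts.scheme == "https"
        and parts.hostname == "media.licdn.com"
        and parts.path.startswith("/dms/image/")
    )


# Overlay and profile URLs are rejected; CDN URLs pass.
print(is_valid_photo_field("https://www.linkedin.com/in/giovannafossati/overlay/photo/"))
print(is_valid_photo_field("https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/"))
```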
## Related Documentation
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - LinkedIn profile extraction with Exa MCP
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile file structure
- `AGENTS.md` Rule 14 - Exa MCP LinkedIn Profile Extraction
---
**Created**: 2025-12-09
**Version**: 1.0
**Status**: PRODUCTION