glam/docs/LINKEDIN_PHOTO_URL_EXTRACTION.md

# LinkedIn Photo URL Extraction Rule

## Problem

LinkedIn profile URLs like `https://www.linkedin.com/in/giovannafossati/` have a trivially derivable photo overlay page:
- Profile URL: `https://www.linkedin.com/in/giovannafossati/`
- Overlay URL: `https://www.linkedin.com/in/giovannafossati/overlay/photo/`
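
To make "trivially derivable" concrete, here is a minimal sketch of the derivation. The helper name `overlay_url` is hypothetical and not part of any existing module; it only appends the fixed suffix shown above.

```python
def overlay_url(profile_url: str) -> str:
    """Derive the photo overlay URL by appending a fixed suffix to the profile URL."""
    # Normalize the trailing slash so both ".../in/slug" and ".../in/slug/" work.
    return profile_url.rstrip("/") + "/overlay/photo/"


print(overlay_url("https://www.linkedin.com/in/giovannafossati/"))
```

Because the overlay URL can be recomputed from the profile URL at any time, storing it adds no information.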

**The overlay URL is useless for data storage** because:
1. It requires JavaScript rendering to display the actual image
2. It cannot be directly embedded in applications
3. It provides no direct access to the image file

## Solution: Extract the Actual CDN Photo URL

The image displayed on the LinkedIn photo overlay page is served from LinkedIn's CDN at `media.licdn.com`. That **actual image URL** is the value to extract and store.

### Example

**Profile**: `https://www.linkedin.com/in/giovannafossati/`

**WRONG** (overlay page - derivable, useless):
```
https://www.linkedin.com/in/giovannafossati/overlay/photo/
```

**CORRECT** (actual CDN image - must be extracted and stored):
```
https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M
```

## CDN URL Structure

LinkedIn photo CDN URLs follow this pattern:
```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```

### Components
- **Host**: `media.licdn.com`
- **Path**: `/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...`
- **Size Options**: `100_100`, `200_200`, `400_400`, `800_800`
- **Expiry** (`e=`): Unix timestamp, in seconds, at which the URL expires
- **Token** (`t=`): Authentication/integrity token
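
The components above can be pulled apart with the standard library. This is a sketch under the assumption that the URL follows the pattern documented here; `parse_cdn_photo_url` and the keys of the returned dict are illustrative names, not an existing API.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Matches the documented path prefix and captures the image ID and size.
_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)"
    r"/profile-displayphoto-shrink_(?P<width>\d+)_(?P<height>\d+)/"
)


def parse_cdn_photo_url(url: str) -> Optional[dict]:
    """Split a CDN photo URL into its documented components, or return None."""
    parts = urlsplit(url)
    if parts.hostname != "media.licdn.com":
        return None
    match = _PATH_RE.match(parts.path)
    if match is None:
        return None
    query = parse_qs(parts.query)
    expiry = query.get("e", [None])[0]
    return {
        "image_id": match["image_id"],
        "size": f"{match['width']}_{match['height']}",
        "expiry": int(expiry) if expiry and expiry.isdigit() else None,
        "token": query.get("t", [None])[0],
    }
```

A `None` result means the value is not a CDN photo URL of the documented shape, which is a useful check before storing it.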

### Size Preference

Always prefer the largest available size for archival purposes:
1. `800_800` (preferred)
2. `400_400`
3. `200_200`
4. `100_100`
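
When a page exposes several renditions of the same photo, the preference order above amounts to picking the URL with the largest size in its path. A minimal sketch, assuming the candidates have already been collected; `pick_largest` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

# Captures the first size number in "profile-displayphoto-shrink_{SIZE}_{SIZE}".
_SIZE_RE = re.compile(r"profile-displayphoto-shrink_(\d+)_\d+")


def pick_largest(urls: Iterable[str]) -> Optional[str]:
    """Return the candidate URL with the largest size, or None if none match."""
    best_url, best_size = None, -1
    for url in urls:
        match = _SIZE_RE.search(url)
        if match is None:
            continue  # not a profile photo URL of the documented shape
        size = int(match.group(1))
        if size > best_size:
            best_url, best_size = url, size
    return best_url
```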

## Extraction Workflow

### Method 1: Browser Inspection (Manual)

1. Navigate to profile: `https://www.linkedin.com/in/{slug}/`
2. Click on profile photo to open overlay
3. Right-click on the photo → "Copy Image Address"
4. The URL should start with `https://media.licdn.com/dms/image/`

### Method 2: Playwright Automation

```python
from typing import Optional

from playwright.sync_api import sync_playwright


def extract_linkedin_photo_url(profile_url: str) -> Optional[str]:
    """Extract the actual CDN photo URL from a LinkedIn profile.

    Raises Playwright's TimeoutError if the button or image never appears.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()

            # Navigate to the profile and wait for network activity to settle
            page.goto(profile_url)
            page.wait_for_load_state('networkidle')

            # Click on the profile photo to open the overlay
            page.click('button[aria-label*="photo"]')

            # wait_for_selector returns the matched element, so no second query is needed
            img = page.wait_for_selector('img[src*="media.licdn.com"]')
            return img.get_attribute('src') if img else None
        finally:
            # Close the browser even if navigation or a selector wait fails
            browser.close()
```

### Method 3: Exa MCP Tool

When using `exa_crawling_exa` or `exa_linkedin_search_exa`, scan the returned text for URLs matching:
```regex
https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+
```
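
The pattern can be applied to crawled text with Python's `re` module. The sketch below assumes the tool output is available as a plain string and that it may be HTML, where `&` in query strings appears as `&amp;`. `find_cdn_photo_urls` is a hypothetical helper, not part of the Exa tools.

```python
import re
from typing import List

# Same pattern as above, compiled once.
CDN_PHOTO_RE = re.compile(
    r"""https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+"""
)


def find_cdn_photo_urls(text: str) -> List[str]:
    """Return the distinct CDN photo URLs found in text, in order of appearance."""
    # Undo HTML escaping of "&" so the e=, v= and t= parameters stay intact.
    text = text.replace("&amp;", "&")
    found: List[str] = []
    for url in CDN_PHOTO_RE.findall(text):
        if url not in found:
            found.append(url)
    return found
```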

## JSON Storage Format

In person profile JSON files (`data/custodian/person/*.json`):

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M",
  "profile_data": {...}
}
```

### When CDN URL Not Available

If the actual CDN URL cannot be extracted, set `linkedin_photo_url` to `null` and use a `photo_urls` object with alternative sources:

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/anne-gant-59908a18",
  "linkedin_photo_url": null,
  "photo_urls": {
    "source_name": "https://example.com/photo.jpg",
    "primary": "https://example.com/photo.jpg",
    "photo_notes": "LinkedIn CDN URL not available. Using alternative source."
  }
}
```
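
The two storage cases can be produced by one small function. This is a sketch only: `photo_fields` is a hypothetical helper, and passing the fallback as a `(source_name, url)` pair is an assumption about how alternative sources are tracked.

```python
from typing import Optional, Tuple

CDN_PREFIX = "https://media.licdn.com/dms/image/"
FALLBACK_NOTE = "LinkedIn CDN URL not available. Using alternative source."


def photo_fields(cdn_url: Optional[str],
                 fallback: Optional[Tuple[str, str]] = None) -> dict:
    """Build the photo-related JSON fields for a person profile."""
    # Only a real CDN URL may go into linkedin_photo_url.
    if cdn_url is not None and cdn_url.startswith(CDN_PREFIX):
        return {"linkedin_photo_url": cdn_url}

    fields: dict = {"linkedin_photo_url": None}
    if fallback is not None:
        source_name, url = fallback
        fields["photo_urls"] = {
            source_name: url,
            "primary": url,
            "photo_notes": FALLBACK_NOTE,
        }
    return fields
```

Merging the returned dict into the profile record keeps `linkedin_photo_url` either a CDN URL or `null`, never an overlay or profile URL.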

## URL Expiration

**IMPORTANT**: LinkedIn CDN URLs have an expiration timestamp (`e=` parameter).

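- Lifetimes vary, so read the `e=` value of each URL instead of assuming one. The example URL above carries `e=1766620800`, which is 2025-12-25 00:00:00 UTC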
- The token (`t=`) becomes invalid after expiry
- For long-term archival, consider:
  1. Downloading and storing the image locally
  2. Recording the extraction timestamp
  3. Planning for periodic re-extraction
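
Planning re-extraction is easier when the expiry is decoded at storage time. A minimal sketch using only the standard library; the function names and the 30-day default margin are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def expiry_of(url: str) -> Optional[datetime]:
    """Decode the e= parameter into an aware UTC datetime, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("e")
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def needs_reextraction(url: str,
                       now: Optional[datetime] = None,
                       margin: timedelta = timedelta(days=30)) -> bool:
    """True if the URL has no readable expiry or expires within the margin."""
    expiry = expiry_of(url)
    if expiry is None:
        return True  # unknown lifetime: treat as due for re-extraction
    now = now or datetime.now(timezone.utc)
    return expiry - now <= margin
```

Storing the decoded expiry next to the extraction timestamp lets a periodic job select only the records that are about to go stale.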

## Anti-Patterns

### ❌ WRONG: Store overlay page URL
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
```
This is **derivable from the profile URL** and requires JavaScript to render.

### ❌ WRONG: Store profile URL in photo field
```json
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/"
```
This is not a photo URL at all.

### ✅ CORRECT: Store actual CDN URL
```json
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/..."
```
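
These cases can be checked mechanically before a record is written. The classifier below is a sketch; `classify_photo_url` and its labels are hypothetical names, and it inspects only the host and path.

```python
from typing import Optional
from urllib.parse import urlsplit


def classify_photo_url(value: Optional[str]) -> str:
    """Label a candidate linkedin_photo_url value.

    Returns one of: "missing", "cdn", "overlay", "profile", "other".
    """
    if value is None:
        return "missing"
    parts = urlsplit(value)
    host = parts.hostname or ""
    if host == "media.licdn.com" and parts.path.startswith("/dms/image/"):
        return "cdn"
    is_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    if is_linkedin and parts.path.startswith("/in/"):
        # The overlay page is the profile path plus a fixed suffix.
        if parts.path.rstrip("/").endswith("/overlay/photo"):
            return "overlay"
        return "profile"
    return "other"
```

Only `"cdn"` is acceptable in `linkedin_photo_url`; `"overlay"` and `"profile"` correspond to the two anti-patterns above.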

## Related Documentation

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - LinkedIn profile extraction with Exa MCP
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile file structure
- `AGENTS.md` Rule 14 - Exa MCP LinkedIn Profile Extraction

---

**Created**: 2025-12-09
**Version**: 1.0
**Status**: PRODUCTION