kempersc 41959f0766 correct HCID!

2025-12-10 13:01:13 +01:00

LinkedIn Photo URL Extraction Rule

Problem

LinkedIn profile URLs like https://www.linkedin.com/in/giovannafossati/ have a trivially derivable photo overlay page:

Profile URL: https://www.linkedin.com/in/giovannafossati/
Overlay URL: https://www.linkedin.com/in/giovannafossati/overlay/photo/

The overlay URL is useless for data storage because:

It requires JavaScript rendering to display the actual image
It cannot be directly embedded in applications
It provides no direct access to the image file
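
To make "trivially derivable" concrete, here is a minimal sketch of the derivation. The helper name `overlay_url` is hypothetical and not part of any existing code base; it only shows that the overlay URL carries no information beyond the profile URL itself.

```python
def overlay_url(profile_url: str) -> str:
    """Derive the photo overlay URL from a profile URL (illustrative helper)."""
    # Normalise the trailing slash, then append the fixed overlay suffix.
    return profile_url.rstrip("/") + "/overlay/photo/"


print(overlay_url("https://www.linkedin.com/in/giovannafossati/"))
```

Because the result is a pure function of the profile URL, storing it adds nothing to the record.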

Solution: Extract the Actual CDN Photo URL

The LinkedIn photo overlay page loads the actual image from LinkedIn's CDN at media.licdn.com. That CDN URL, not the overlay URL, is what must be extracted and stored.

Example

Profile: https://www.linkedin.com/in/giovannafossati/

WRONG (overlay page - derivable, useless):

https://www.linkedin.com/in/giovannafossati/overlay/photo/

CORRECT (actual CDN image - must be extracted and stored):

https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M

CDN URL Structure

LinkedIn photo CDN URLs follow this pattern:

https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}

Components:

Host: media.licdn.com
Path: /dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...
Size Options: 100_100, 200_200, 400_400, 800_800
Expiry (e=): Unix timestamp when URL expires
Token (t=): Authentication/integrity token
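
The components listed above can be pulled apart with the standard library. This is a sketch under the assumption that the path layout is exactly the one shown in this section; the function name `parse_cdn_photo_url` and the returned dictionary keys are illustrative choices, not an established schema.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed path layout, taken from the pattern documented above.
_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)/"
    r"profile-displayphoto-shrink_(?P<w>\d+)_(?P<h>\d+)/"
)


def parse_cdn_photo_url(url: str) -> Optional[dict]:
    """Split a LinkedIn CDN photo URL into its components, or return None."""
    parts = urlsplit(url)
    if parts.hostname != "media.licdn.com":
        return None
    match = _PATH_RE.match(parts.path)
    if match is None:
        return None
    query = parse_qs(parts.query)
    expiry = query.get("e", [""])[0]
    return {
        "image_id": match.group("image_id"),
        "size": (int(match.group("w")), int(match.group("h"))),
        # e= is kept as an integer Unix timestamp (seconds).
        "expiry": int(expiry) if expiry.isdigit() else None,
        "token": query.get("t", [None])[0],
    }
```

Returning `None` for anything that does not match keeps overlay and profile URLs from being mistaken for CDN URLs.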

Size Preference

Always prefer the largest available size for archival purposes:

800_800 (preferred)
400_400
200_200
100_100
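
When several candidate URLs were collected for the same person, the preference order above amounts to picking the one with the largest size in its path. A minimal sketch, assuming the size always appears as `shrink_{SIZE}_{SIZE}`; `pick_largest` is a hypothetical helper name.

```python
import re
from typing import Iterable, Optional

_SIZE_RE = re.compile(r"profile-displayphoto-shrink_(\d+)_(\d+)")


def _width(url: str) -> int:
    """Return the width encoded in the URL path, or -1 if there is none."""
    match = _SIZE_RE.search(url)
    return int(match.group(1)) if match else -1


def pick_largest(urls: Iterable[str]) -> Optional[str]:
    """Return the candidate with the largest encoded size, or None."""
    candidates = [u for u in urls if _width(u) > 0]
    return max(candidates, key=_width) if candidates else None
```

URLs without a recognisable size segment are ignored rather than guessed at.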

Extraction Workflow

Method 1: Browser Inspection (Manual)

Navigate to profile: https://www.linkedin.com/in/{slug}/
Click on profile photo to open overlay
Right-click on the photo → "Copy Image Address"
Verify that the copied URL starts with https://media.licdn.com/dms/image/

Method 2: Playwright Automation

from typing import Optional

from playwright.sync_api import sync_playwright


def extract_linkedin_photo_url(profile_url: str) -> Optional[str]:
    """Extract the actual CDN photo URL from a LinkedIn profile.

    Raises Playwright's TimeoutError if the photo button or the CDN image
    does not appear (for example, when LinkedIn shows a login wall).
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()

            # Navigate to profile
            page.goto(profile_url)
            page.wait_for_load_state('networkidle')

            # Click on profile photo to open overlay
            page.click('button[aria-label*="photo"]')

            # Wait for the CDN-hosted image; wait_for_selector returns the element
            img = page.wait_for_selector('img[src*="media.licdn.com"]')

            # Extract the CDN URL (None if the element or attribute is missing)
            return img.get_attribute('src') if img else None
        finally:
            # Always close the browser, even if a step above raised
            browser.close()

Method 3: Exa MCP Tool

When using exa_crawling_exa or exa_linkedin_search_exa, look for URLs matching:

https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+
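
One way to apply the pattern above to crawled text is sketched here. It assumes the crawl result is available as a plain string; the function name `find_cdn_photo_urls` is hypothetical. The `html.unescape` step is an added precaution for HTML sources, where `&` in query strings is often encoded as `&amp;`.

```python
import html
import re
from typing import List

# The pattern from this section, compiled unchanged.
CDN_PHOTO_RE = re.compile(
    r"""https://media\.licdn\.com/dms/image/v2/[^/]+/profile-displayphoto-shrink_\d+_\d+/[^?]+\?[^\s"']+"""
)


def find_cdn_photo_urls(text: str) -> List[str]:
    """Return unique CDN photo URLs found in text, in order of appearance."""
    matches = (html.unescape(m) for m in CDN_PHOTO_RE.findall(text))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(matches))
```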

JSON Storage Format

In person profile JSON files (data/custodian/person/*.json):

{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M",
  "profile_data": {...}
}

When CDN URL Not Available

If the actual CDN URL cannot be extracted, set linkedin_photo_url to null and use a photo_urls object with alternative sources:

{
  "linkedin_profile_url": "https://www.linkedin.com/in/anne-gant-59908a18",
  "linkedin_photo_url": null,
  "photo_urls": {
    "source_name": "https://example.com/photo.jpg",
    "primary": "https://example.com/photo.jpg",
    "photo_notes": "LinkedIn CDN URL not available. Using alternative source."
  }
}
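
The two storage cases can be produced by one small builder. This is a sketch only: `build_photo_fields` is a hypothetical helper, and it assumes that `source_name` in the example above stands for a key naming the alternative source and that `primary` should point at the first alternative supplied.

```python
import json
from typing import Dict, Optional

CDN_PREFIX = "https://media.licdn.com/dms/image/"


def build_photo_fields(profile_url: str,
                       cdn_url: Optional[str] = None,
                       alternatives: Optional[Dict[str, str]] = None) -> dict:
    """Return the photo-related fields for a person profile record."""
    record = {"linkedin_profile_url": profile_url, "linkedin_photo_url": None}
    if cdn_url and cdn_url.startswith(CDN_PREFIX):
        record["linkedin_photo_url"] = cdn_url
    elif alternatives:
        photo_urls = dict(alternatives)
        # Assumption: the first alternative supplied is treated as primary.
        photo_urls["primary"] = next(iter(alternatives.values()))
        photo_urls["photo_notes"] = (
            "LinkedIn CDN URL not available. Using alternative source."
        )
        record["photo_urls"] = photo_urls
    return record


print(json.dumps(
    build_photo_fields(
        "https://www.linkedin.com/in/anne-gant-59908a18",
        alternatives={"source_name": "https://example.com/photo.jpg"},
    ),
    indent=2,
))
```

A value that is not a CDN URL (such as an overlay URL) is never written to `linkedin_photo_url`; the field stays `null`.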

URL Expiration

IMPORTANT: LinkedIn CDN URLs have an expiration timestamp (e= parameter).

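Do not assume a long lifetime: the example URL above (e=1766620800) expires on 2025-12-25 UTC, 16 days after this document's creation date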
The token (t=) becomes invalid after expiry
For long-term archival, consider:
1. Downloading and storing the image locally
2. Recording the extraction timestamp
3. Planning for periodic re-extraction
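
To plan re-extraction, the e= parameter can be read straight from a stored URL. A minimal sketch, assuming e= is a Unix timestamp in seconds as described above; the helper names and the seven-day safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def cdn_url_expiry(url: str) -> Optional[datetime]:
    """Return the expiry encoded in e= as a UTC datetime, or None."""
    values = parse_qs(urlsplit(url).query).get("e", [])
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def needs_reextraction(url: str,
                       now: Optional[datetime] = None,
                       margin_days: int = 7) -> bool:
    """True if the URL has no readable expiry or expires within the margin."""
    expiry = cdn_url_expiry(url)
    if expiry is None:
        return True
    now = now or datetime.now(timezone.utc)
    return expiry - now <= timedelta(days=margin_days)
```

Passing `now` explicitly makes the check reproducible, and the same value can be recorded as the extraction timestamp.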

Anti-Patterns

❌ WRONG: Store overlay page URL

"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"

This is derivable from the profile URL and requires JavaScript to render.

❌ WRONG: Store profile URL in photo field

"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/"

This is not a photo URL at all.

✅ CORRECT: Store actual CDN URL

"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/..."
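
The three cases above can be told apart mechanically before a record is written. This sketch uses a hypothetical helper, `classify_photo_field`, with labels chosen for illustration only.

```python
from typing import Optional
from urllib.parse import urlsplit

LINKEDIN_HOSTS = {"linkedin.com", "www.linkedin.com"}


def classify_photo_field(value: Optional[str]) -> str:
    """Classify a linkedin_photo_url value as cdn, overlay, profile, missing or other."""
    if value is None:
        return "missing"
    parts = urlsplit(value)
    host = parts.hostname or ""
    if host == "media.licdn.com" and parts.path.startswith("/dms/image/"):
        return "cdn"
    if host in LINKEDIN_HOSTS and parts.path.startswith("/in/"):
        # Overlay pages end in /overlay/photo/; anything else is the profile itself.
        if parts.path.rstrip("/").endswith("/overlay/photo"):
            return "overlay"
        return "profile"
    return "other"
```

Only a value classified as `cdn` (or an explicit `null`) belongs in `linkedin_photo_url`.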

See Also

.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - LinkedIn profile extraction with Exa MCP
.opencode/PERSON_DATA_REFERENCE_PATTERN.md - Person profile file structure
AGENTS.md Rule 14 - Exa MCP LinkedIn Profile Extraction

Created: 2025-12-09
Version: 1.0
Status: PRODUCTION
