# Rule 16: LinkedIn Photo URLs Must Be CDN URLs, Not Overlay Pages

## Core Rule

**🚨 CRITICAL: When storing LinkedIn profile photos, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**

The LinkedIn photo overlay page (`/overlay/photo/`) is **trivially derivable** from any profile URL and provides no value. The actual image file URL from LinkedIn's CDN is what must be extracted and stored.

## URL Transformation

| URL Type | Example | Store? |
|----------|---------|--------|
| Profile URL | `https://www.linkedin.com/in/giovannafossati/` | Store in `linkedin_profile_url` |
| Overlay Page URL | `https://www.linkedin.com/in/giovannafossati/overlay/photo/` | ❌ NEVER STORE (derivable) |
| CDN Image URL | `https://media.licdn.com/dms/image/v2/C4D03AQ.../profile-displayphoto-shrink_800_800/...` | ✅ Store in `linkedin_photo_url` |

## Why This Matters

1. **Overlay URLs are derivable**: `{profile_url}overlay/photo/` - no information value
2. **Overlay URLs require JavaScript**: Cannot be directly embedded or rendered
3. **CDN URLs are direct links**: Can be embedded, downloaded, verified
4. **CDN URLs prove extraction effort**: Demonstrate actual profile access

## Derivability Rule

**If a URL can be trivially derived from another stored URL, DO NOT store it separately.**

```
linkedin_profile_url → overlay/photo/            ← DERIVABLE, don't store
linkedin_profile_url → media.licdn.com CDN URL   ← NOT DERIVABLE, must store
```

## Implementation

### CORRECT Storage Pattern

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M"
}
```

### WRONG Storage Pattern (NEVER DO THIS)

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
}
```

## How to Extract CDN URLs

### Method 1: Browser (Manual)

1. Go to profile → Click photo → Right-click → "Copy Image Address"
2. URL should start with `https://media.licdn.com/dms/image/`

### Method 2: Playwright Automation

```python
# Navigate to overlay page, extract img[src*="media.licdn.com"]
```

### Method 3: Exa MCP Tools

Use `exa_crawling_exa` with the profile URL and look for CDN URLs in the response.

## Fallback: Alternative Photo Sources

When LinkedIn CDN URL cannot be extracted, use `photo_urls` object:

```json
{
  "linkedin_photo_url": null,
  "photo_urls": {
    "indiana_university_blog": "https://blogs.libraries.indiana.edu/.../headshot.jpeg",
    "screen_daily": "https://d1nslcd7m2225b.cloudfront.net/.../photo.jpeg",
    "primary": "https://blogs.libraries.indiana.edu/.../headshot.jpeg",
    "photo_credit": "Indiana University"
  }
}
```

## Validation

When reviewing person profile JSON files:

1. ✅ `linkedin_photo_url` is `null` OR starts with `https://media.licdn.com/`
2. ❌ `linkedin_photo_url` contains `/overlay/photo/` - **FIX IMMEDIATELY**
3. ❌ `linkedin_photo_url` equals `linkedin_profile_url` - **FIX IMMEDIATELY**

## CDN URL Structure Reference

```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```

Sizes: `100_100`, `200_200`, `400_400`, `800_800` (prefer `800_800`)

## See Also

- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` - Complete extraction documentation
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa MCP extraction rules
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile structure

---

**Rule Number**: 16
**Created**: 2025-12-09
**Status**: PRODUCTION
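
## Appendix: Validation Sketch

The three validation checks and the derivability rule above can be sketched in Python as follows. This is a minimal illustration, not part of the rule itself: the function names `derive_overlay_url` and `check_linkedin_photo_url` are hypothetical, and the sketch assumes a person profile has already been loaded into a plain `dict` that uses the `linkedin_profile_url` and `linkedin_photo_url` keys shown earlier.

```python
from typing import List, Optional

# Prefix required by the validation rule for any non-null photo URL.
CDN_PREFIX = "https://media.licdn.com/"
OVERLAY_SUFFIX = "/overlay/photo/"


def derive_overlay_url(profile_url: str) -> str:
    """Build the overlay page URL from a profile URL.

    The fact that this is a one-line string operation is the reason the
    overlay URL should never be stored separately. (Hypothetical helper.)
    """
    return profile_url.rstrip("/") + OVERLAY_SUFFIX


def check_linkedin_photo_url(profile: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes.

    (Hypothetical helper; assumes `profile` is an already-parsed JSON object.)
    """
    photo_url: Optional[str] = profile.get("linkedin_photo_url")
    profile_url: Optional[str] = profile.get("linkedin_profile_url")
    problems: List[str] = []

    # Check 1: null is explicitly allowed (use the photo_urls fallback instead).
    if photo_url is None:
        return problems

    # Check 2: overlay page URLs must never be stored.
    if OVERLAY_SUFFIX in photo_url:
        problems.append("linkedin_photo_url contains /overlay/photo/")

    # Check 3: the photo URL must not simply repeat the profile URL
    # (compared without trailing slashes, since both forms appear in practice).
    if profile_url is not None and photo_url.rstrip("/") == profile_url.rstrip("/"):
        problems.append("linkedin_photo_url equals linkedin_profile_url")

    # Check 1 (continued): any non-null value must be a media.licdn.com URL.
    if not photo_url.startswith(CDN_PREFIX):
        problems.append("linkedin_photo_url does not start with " + CDN_PREFIX)

    return problems
```

Run against the WRONG storage pattern above, this sketch reports two problems (the overlay path and the missing CDN prefix); a record with `"linkedin_photo_url": null` passes with an empty list. Returning every violation rather than stopping at the first is a design choice made here so a reviewer sees the full picture in one pass.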