glam/.opencode/LINKEDIN_PHOTO_CDN_RULE.md
2025-12-10 13:01:13 +01:00

# Rule 16: LinkedIn Photo URLs Must Be CDN URLs, Not Overlay Pages
## Core Rule
**🚨 CRITICAL: When storing LinkedIn profile photos, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**
The LinkedIn photo overlay page (`/overlay/photo/`) is **trivially derivable** from any profile URL and provides no value. The actual image file URL from LinkedIn's CDN is what must be extracted and stored.
## URL Transformation
| URL Type | Example | Store? |
|----------|---------|--------|
| Profile URL | `https://www.linkedin.com/in/giovannafossati/` | Store in `linkedin_profile_url` |
| Overlay Page URL | `https://www.linkedin.com/in/giovannafossati/overlay/photo/` | ❌ NEVER STORE (derivable) |
| CDN Image URL | `https://media.licdn.com/dms/image/v2/C4D03AQ.../profile-displayphoto-shrink_800_800/...` | ✅ Store in `linkedin_photo_url` |
## Why This Matters
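1. **Overlay URLs are derivable**: append `overlay/photo/` to the profile URL (`{profile_url}/overlay/photo/`), so storing them adds no information
2. **Overlay URLs are pages that require JavaScript**: they are not image files and cannot be directly embedded or rendered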
3. **CDN URLs are direct links**: Can be embedded, downloaded, verified
4. **CDN URLs prove extraction effort**: Demonstrate actual profile access
## Derivability Rule
**If a URL can be trivially derived from another stored URL, DO NOT store it separately.**
```
linkedin_profile_url → overlay/photo/ ← DERIVABLE, don't store
linkedin_profile_url → media.licdn.com CDN URL ← NOT DERIVABLE, must store
```
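As a rough illustration of the derivability rule, the sketch below rebuilds the overlay URL from a stored profile URL and flags a candidate value that is nothing more than that derived URL. The helper names `derive_overlay_url` and `is_derivable_from_profile` are hypothetical and do not refer to any existing module in this repository.

```python
def derive_overlay_url(profile_url: str) -> str:
    """Build the overlay page URL from a profile URL, with or without a trailing slash."""
    return profile_url.rstrip("/") + "/overlay/photo/"


def is_derivable_from_profile(candidate_url: str, profile_url: str) -> bool:
    """Return True when candidate_url is just the overlay page of profile_url."""
    # Compare without trailing slashes so both stored forms are treated alike.
    return candidate_url.rstrip("/") == derive_overlay_url(profile_url).rstrip("/")
```

A value for which `is_derivable_from_profile` returns `True` carries no information beyond the profile URL and, under this rule, should not be stored.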
## Implementation
### CORRECT Storage Pattern
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/profile-displayphoto-shrink_800_800/0/1517545267594?e=1766620800&v=beta&t=R1_3Tm1cgNanjfgJZkXHBUiQcQik7_QSdt94d87I52M"
}
```
### WRONG Storage Pattern (NEVER DO THIS)
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
"linkedin_photo_url": "https://www.linkedin.com/in/giovannafossati/overlay/photo/"
}
```
## How to Extract CDN URLs
### Method 1: Browser (Manual)
1. Go to profile → Click photo → Right-click → "Copy Image Address"
2. URL should start with `https://media.licdn.com/dms/image/`
### Method 2: Playwright Automation
```python
# Navigate to overlay page, extract img[src*="media.licdn.com"]
```
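One possible expansion of the comment above, using Playwright's synchronous API. This is a sketch under assumptions rather than the project's actual extractor: the function name `extract_cdn_photo_url`, the `storage_state` parameter (a path to a saved logged-in session), and the timeout are all illustrative, and LinkedIn may require an authenticated session or serve different markup than this selector expects.

```python
from typing import Optional

# CSS selector from Method 2: any <img> whose src points at the LinkedIn CDN.
CDN_IMG_SELECTOR = 'img[src*="media.licdn.com"]'


def extract_cdn_photo_url(overlay_url: str,
                          storage_state: Optional[str] = None,
                          timeout_ms: int = 10_000) -> Optional[str]:
    """Open the overlay page and return the first CDN image src, or None if none appears."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # storage_state (assumption): path to a previously saved session file.
            context = browser.new_context(storage_state=storage_state)
            page = context.new_page()
            page.goto(overlay_url, wait_until="domcontentloaded")
            img = page.locator(CDN_IMG_SELECTOR).first
            try:
                # Wait until a matching <img> is attached to the DOM.
                img.wait_for(state="attached", timeout=timeout_ms)
            except PlaywrightTimeoutError:
                return None
            return img.get_attribute("src")
        finally:
            browser.close()
```

Returning `None` on timeout lets the caller fall back to storing `null` rather than an overlay URL.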
### Method 3: Exa MCP Tools
Use `exa_crawling_exa` with the profile URL and look for CDN URLs in the response.
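Looking for CDN URLs in a response can be sketched as a plain-text scan. The snippet assumes the crawl result is available as a string; the exact response shape of `exa_crawling_exa` is not described here, and `find_cdn_urls` is a hypothetical helper name.

```python
import re
from typing import List

# Matches LinkedIn CDN image URLs up to the first whitespace, quote, or closing bracket.
CDN_URL_RE = re.compile(r"""https://media\.licdn\.com/dms/image/[^\s"'<>)\]]+""")


def find_cdn_urls(text: str) -> List[str]:
    """Return unique CDN image URLs found in text, in order of first appearance."""
    # HTML sources escape '&' as '&amp;' inside attribute values; undo that.
    urls = [match.replace("&amp;", "&") for match in CDN_URL_RE.findall(text)]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(urls))
```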
## Fallback: Alternative Photo Sources
When the LinkedIn CDN URL cannot be extracted, set `linkedin_photo_url` to `null` and use the `photo_urls` object:
```json
{
"linkedin_photo_url": null,
"photo_urls": {
"indiana_university_blog": "https://blogs.libraries.indiana.edu/.../headshot.jpeg",
"screen_daily": "https://d1nslcd7m2225b.cloudfront.net/.../photo.jpeg",
"primary": "https://blogs.libraries.indiana.edu/.../headshot.jpeg",
"photo_credit": "Indiana University"
}
}
```
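A consumer of this structure might resolve a display photo as sketched below. It assumes, based only on the example above, that `photo_urls["primary"]` names the preferred fallback; `pick_photo_url` is a hypothetical helper, not an existing function.

```python
from typing import Optional


def pick_photo_url(profile: dict) -> Optional[str]:
    """Prefer the LinkedIn CDN URL; otherwise fall back to photo_urls['primary']."""
    linkedin_url = profile.get("linkedin_photo_url")
    if linkedin_url:
        return linkedin_url
    # photo_urls may be missing or null when no alternative source exists.
    photo_urls = profile.get("photo_urls") or {}
    return photo_urls.get("primary")
```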
## Validation
When reviewing person profile JSON files:
1. `linkedin_photo_url` must be `null` OR start with `https://media.licdn.com/`
2. If `linkedin_photo_url` contains `/overlay/photo/` - **FIX IMMEDIATELY**
3. If `linkedin_photo_url` equals `linkedin_profile_url` - **FIX IMMEDIATELY**
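The checks above could be automated along these lines. This is a minimal sketch: `check_linkedin_photo_url` and its message strings are illustrative, and the profile is assumed to be an already-parsed JSON object.

```python
from typing import List

CDN_PREFIX = "https://media.licdn.com/"


def check_linkedin_photo_url(profile: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the profile passes."""
    problems: List[str] = []
    photo = profile.get("linkedin_photo_url")
    profile_url = profile.get("linkedin_profile_url")

    # null is explicitly allowed (see the fallback section).
    if photo is None:
        return problems
    if "/overlay/photo/" in photo:
        problems.append("overlay page URL stored instead of CDN URL")
    # Ignore trailing slashes when comparing against the profile URL.
    if profile_url and photo.rstrip("/") == profile_url.rstrip("/"):
        problems.append("photo URL equals profile URL")
    if not photo.startswith(CDN_PREFIX):
        problems.append("photo URL does not start with " + CDN_PREFIX)
    return problems
```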
## CDN URL Structure Reference
```
https://media.licdn.com/dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/profile-displayphoto-shrink_{SIZE}_{SIZE}/0/{TIMESTAMP}?e={EXPIRY}&v=beta&t={TOKEN}
```
Sizes: `100_100`, `200_200`, `400_400`, `800_800` (prefer `800_800`)
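To make the structure concrete, the sketch below splits a CDN URL into the placeholders shown above and picks the largest size from a set of candidates. The function names are hypothetical, and the pattern only covers the `v2` profile-photo layout documented here; other LinkedIn CDN layouts would return `None`.

```python
import re
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlsplit

# Path layout from the reference above: /dms/image/v2/{IMAGE_ID}/profile-displayphoto-shrink_{SIZE}_{SIZE}/...
CDN_PATH_RE = re.compile(
    r"^/dms/image/v2/(?P<image_id>[^/]+)/profile-displayphoto-shrink_(?P<w>\d+)_(?P<h>\d+)/"
)


def parse_cdn_photo_url(url: str) -> Optional[dict]:
    """Return image_id, size, expiry and token for a CDN photo URL, or None if it does not match."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.netloc != "media.licdn.com":
        return None
    match = CDN_PATH_RE.match(parts.path)
    if match is None:
        return None
    query = parse_qs(parts.query)
    return {
        "image_id": match["image_id"],
        "size": int(match["w"]),
        "expiry": query.get("e", [None])[0],  # raw value of the e= parameter
        "token": query.get("t", [None])[0],   # raw value of the t= parameter
    }


def prefer_largest(urls: Iterable[str]) -> Optional[str]:
    """Pick the URL with the largest size among those that parse, or None if none do."""
    sized = []
    for url in urls:
        info = parse_cdn_photo_url(url)
        if info is not None:
            sized.append((info["size"], url))
    return max(sized, key=lambda pair: pair[0])[1] if sized else None
```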
## See Also
- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` - Complete extraction documentation
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa MCP extraction rules
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Person profile structure
---
**Rule Number**: 16
**Created**: 2025-12-09
**Status**: PRODUCTION