LinkedIn Profile Fetching System - Complete Implementation
Overview
A complete system to fetch LinkedIn profile data using the Exa API for people found in staff files, with duplicate prevention and threading for efficiency.
Files Created
Main Scripts
- scripts/fetch_linkedin_profiles_complete.py - Main script (RECOMMENDED)
  - Processes ALL 24 staff directories automatically
  - Interactive batch size selection
  - Complete duplicate prevention
  - Threading (3 workers)
  - Rate limiting (1 second delay)
- scripts/fetch_linkedin_profiles_exa_final.py - Core script
  - For processing single directories
  - Same features as main script
- scripts/run_linkedin_fetcher.sh - Quick start script
  - Loads .env file automatically
  - Environment validation
  - One-click execution
Utility Scripts
- scripts/test_fetch_profiles.py - Test with 3 profiles
- scripts/test_linkedin_urls.py - Preview URLs to be fetched
Quick Start
cd /Users/kempersc/apps/glam
./scripts/run_linkedin_fetcher.sh
The system will:
- Load environment from the .env file
- Scan all 24 staff directories
- Check existing profiles (5,338 already exist)
- Show how many new profiles to fetch
- Ask how many to process (batch mode)
- Fetch using Exa API with GLM-4.6
- Save structured JSON to /data/custodian/person/entity/
Data Flow
Staff Files (24 directories)
↓
Extract LinkedIn URLs (5,338 unique)
↓
Check existing profiles (already have 5,338)
↓
Fetch only new profiles
↓
Save as {slug}_{timestamp}.json
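The duplicate check in this flow can be sketched as follows. This is a minimal illustration, not the code from the scripts: the helper names (`linkedin_slug`, `existing_slugs`, `urls_to_fetch`, `output_path`) are hypothetical, and it assumes the timestamp part of `{slug}_{timestamp}.json` contains no underscore (here `YYYYMMDDTHHMMSSZ`), so the slug is everything before the last underscore.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List, Optional, Set
from urllib.parse import unquote, urlparse


def linkedin_slug(url: str) -> Optional[str]:
    """Return the slug of a .../in/<slug> profile URL, or None for other URLs."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return unquote(parts[1]).lower()
    return None


def existing_slugs(entity_dir: Path) -> Set[str]:
    """Collect slugs from files named {slug}_{timestamp}.json."""
    slugs = set()
    for path in entity_dir.glob("*.json"):
        # Assumption: the timestamp has no underscore, so split on the last one.
        slug, sep, _timestamp = path.stem.rpartition("_")
        if sep:
            slugs.add(slug)
    return slugs


def urls_to_fetch(urls: Iterable[str], entity_dir: Path) -> List[str]:
    """Keep only URLs whose slug is not on disk and not seen earlier in the list."""
    seen = existing_slugs(entity_dir)
    pending = []
    for url in urls:
        slug = linkedin_slug(url)
        if slug and slug not in seen:
            seen.add(slug)
            pending.append(url)
    return pending


def output_path(entity_dir: Path, slug: str) -> Path:
    """Build the {slug}_{timestamp}.json path using a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return entity_dir / f"{slug}_{stamp}.json"
```

Comparing by slug rather than by full URL means that variants such as a trailing slash or a missing `www.` still map to the same profile.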
Output Format
Each profile is saved as:
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/...",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "...",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
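A record of this shape could be assembled roughly as below. This is a sketch under assumptions: `build_record` and `missing_profile_keys` are hypothetical helpers, and `request_id` is assumed to be the MD5 hex digest of the LinkedIn URL, since the format above only says "md5_hash" without naming the hashed input.

```python
import hashlib
from datetime import datetime, timezone

# Keys listed under "profile_data" in the output format above.
PROFILE_KEYS = (
    "name", "linkedin_url", "headline", "location", "connections", "about",
    "experience", "education", "skills", "languages", "profile_image_url",
)


def build_record(slug, linkedin_url, profile_data):
    """Wrap fetched profile data in the documented metadata envelope."""
    return {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            # UTC timestamp in ISO 8601 format.
            "extraction_date": datetime.now(timezone.utc).isoformat(),
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": linkedin_url,
            "cost_usd": 0,
            # Assumption: the hash input is the profile URL.
            "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
        },
        "profile_data": profile_data,
    }


def missing_profile_keys(record):
    """List documented profile_data keys that are absent from a record."""
    return [key for key in PROFILE_KEYS if key not in record["profile_data"]]
```

Running `missing_profile_keys` before saving is one way to spot partially parsed profiles early.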
Features Implemented
✅ Duplicate Prevention: Checks existing profiles by LinkedIn slug
✅ Threading: 3 parallel workers for efficiency
✅ Rate Limiting: 1-second delay between API calls
✅ Progress Tracking: Real-time progress bar
✅ Error Handling: Graceful API error handling
✅ Logging: Detailed results log with timestamps
✅ Interactive: Choose batch size or process all
✅ Environment Loading: Automatic .env file loading
✅ Structured Output: Follows project schema exactly
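To illustrate how three workers and a 1-second delay can coexist, here is a minimal sketch using only the standard library. It assumes the delay is enforced globally across all workers; the scripts may instead apply it per worker. `RateLimiter`, `fetch_all`, and the `fetch_one` callback are hypothetical names, and the real API call is left to the caller.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RateLimiter:
    """Space out calls so that starts are at least `interval` seconds apart."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.interval
        time.sleep(start - now)


def fetch_all(urls, fetch_one, workers=3, interval=1.0):
    """Run fetch_one(url) on a thread pool; return (results, errors) dicts."""
    limiter = RateLimiter(interval)
    results, errors = {}, {}

    def task(url):
        limiter.wait()
        return fetch_one(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = str(exc)
    return results, errors
```

Collecting errors per URL instead of raising lets one failed profile pass without stopping the rest of the batch.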
Statistics
- Staff Directories: 24
- Total LinkedIn URLs: 5,338
- Existing Profiles: 5,338 (already fetched)
- New Profiles to Fetch: Varies (check when running)
Requirements
- Python 3.7+
- httpx
- tqdm
- ZAI_API_TOKEN in environment or .env
Installation
pip install httpx tqdm
Configuration
Set ZAI_API_TOKEN in your .env file:
ZAI_API_TOKEN=your_token_here
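For reference, loading the token without extra dependencies might look like the sketch below. It is an illustration only: `load_env_file` and `require_token` are hypothetical helpers, and the parser handles plain `KEY=VALUE` lines, not the full .env syntax.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines into os.environ, keeping values that are already set."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def require_token(name="ZAI_API_TOKEN"):
    """Return the token or raise with a hint on how to set it."""
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} not set: add it to .env or export it")
    return token
```

Because `setdefault` is used, a token exported in the shell takes precedence over the one in the .env file.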
Usage Examples
Fetch all new profiles:
./scripts/run_linkedin_fetcher.sh
Process specific number:
When prompted, enter a number such as 50 to process only the first 50 profiles.
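The batch prompt can be thought of as a simple slice over the pending list. The sketch below is hypothetical: `select_batch` is not a function from the scripts, and it assumes that an empty answer or "all" means "process everything".

```python
def select_batch(pending, answer):
    """Return the part of `pending` selected by the user's answer."""
    answer = answer.strip().lower()
    if answer in ("", "all"):
        return list(pending)
    count = int(answer)  # raises ValueError for non-numeric input
    if count < 0:
        raise ValueError("batch size must not be negative")
    return list(pending)[:count]
```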
Process single directory:
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/dir
Test system:
python scripts/test_fetch_profiles.py
File Locations
- Staff Files: /data/custodian/person/affiliated/parsed/
- Entity Profiles: /data/custodian/person/entity/
- Logs: fetch_log_YYYYMMDD_HHMMSS.txt
- Scripts: /scripts/
Notes
- The system is designed to prevent duplicates - existing profiles are automatically skipped
- Rate limiting prevents API quota issues
- All timestamps are in UTC (ISO 8601 format)
- Failed fetches are logged with error details
- The shell script automatically loads the .env file
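A log in the fetch_log_YYYYMMDD_HHMMSS.txt naming scheme with UTC timestamps could be written as sketched below. The tab-separated line layout and the helper names `new_log_path` and `log_result` are assumptions for illustration, not the format the scripts actually emit.

```python
from datetime import datetime, timezone
from pathlib import Path


def new_log_path(log_dir="."):
    """Build a fetch_log_YYYYMMDD_HHMMSS.txt path from the current UTC time."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return Path(log_dir) / f"fetch_log_{stamp}.txt"


def log_result(log_path, url, ok, detail=""):
    """Append one line: ISO 8601 UTC timestamp, status, URL, optional detail."""
    stamp = datetime.now(timezone.utc).isoformat()
    status = "OK" if ok else "FAILED"
    line = f"{stamp}\t{status}\t{url}\t{detail}".rstrip()
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```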
Troubleshooting
- "ZAI_API_TOKEN not set":
  - Add token to .env file
  - Or export: export ZAI_API_TOKEN=token
- Rate limit errors:
  - The script includes 1-second delays
  - Reduce workers if needed (edit script)
- Parsing failures:
  - Some profiles may be private or restricted
  - Check the log file for details
- Network errors:
  - Script will retry on next run
  - Check internet connection
Success Indicators
✅ Script runs without errors
✅ Progress bar completes
✅ Log file shows successful fetches
✅ New JSON files appear in /data/custodian/person/entity/
✅ No duplicate profiles created
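The last indicator can be checked after a run with a short script. This is a sketch, assuming files are named {slug}_{timestamp}.json and that the timestamp contains no underscore; `find_duplicate_profiles` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_profiles(entity_dir):
    """Map each slug that has more than one JSON file to its file names."""
    by_slug = defaultdict(list)
    for path in sorted(Path(entity_dir).glob("*.json")):
        # Assumption: the slug is everything before the last underscore.
        slug, sep, _timestamp = path.stem.rpartition("_")
        if sep:
            by_slug[slug].append(path.name)
    return {slug: names for slug, names in by_slug.items() if len(names) > 1}
```

An empty result for /data/custodian/person/entity/ would indicate that no slug was saved twice.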