LinkedIn Profile Fetcher
This system fetches LinkedIn profile data using the Exa API for people found in staff files, storing structured data in /data/custodian/person/entity/.
Scripts
1. fetch_linkedin_profiles_complete.py ⭐ RECOMMENDED
Complete script that processes ALL staff directories and fetches only new profiles.
Features:
- Processes all 24 staff directories automatically
- Extracts LinkedIn URLs from staff JSON files
- Checks existing profiles to prevent duplicates
- Interactive: asks how many profiles to fetch
- Uses Exa API with GLM-4.6 model for profile extraction
- Threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
- Detailed logging with success/failure tracking
Usage:
python fetch_linkedin_profiles_complete.py
2. fetch_linkedin_profiles_exa_final.py
Core script that fetches LinkedIn profiles from a single staff directory.
Features:
- Extracts LinkedIn URLs from staff JSON files
- Uses Exa API with GLM-4.6 model for profile extraction
- Prevents duplicate entries by checking existing profiles
- Uses threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
Usage:
python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>
3. test_fetch_profiles.py
Test script that fetches 3 sample profiles to verify the system works.
Usage:
python test_fetch_profiles.py
4. test_linkedin_urls.py
Utility script to show LinkedIn URLs that would be fetched from staff files.
Usage:
python test_linkedin_urls.py <path_to_staff_directory>
5. fetch_all_linkedin_profiles.py
Wrapper script that runs the fetcher on the main staff directory.
Usage:
python fetch_all_linkedin_profiles.py
Data Structure
Input (Staff Files)
Staff files should have this structure:
{
"staff": [
{
"name": "Person Name",
"linkedin_url": "https://www.linkedin.com/in/slug",
"linkedin_profile_url": "https://www.linkedin.com/in/slug", // Alternative field name
...
}
]
}
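The input structure above can be read with a few lines of Python. This is a minimal sketch, not the scripts' actual code: the helper names `linkedin_slug` and `extract_linkedin_urls` are hypothetical, and it assumes the slug is the path segment that follows `/in/`.

```python
import json
from pathlib import Path
from typing import Dict, Optional
from urllib.parse import urlparse


def linkedin_slug(url: str) -> Optional[str]:
    """Return the slug of a https://www.linkedin.com/in/<slug> URL, or None."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1]
    return None


def extract_linkedin_urls(staff_doc: dict) -> Dict[str, str]:
    """Map slug -> URL for every staff entry that carries a LinkedIn link."""
    found: Dict[str, str] = {}
    for person in staff_doc.get("staff", []):
        # Either field name may hold the profile link.
        url = person.get("linkedin_url") or person.get("linkedin_profile_url")
        if not url:
            continue
        slug = linkedin_slug(url)
        if slug and slug not in found:  # keep the first URL seen per slug
            found[slug] = url
    return found


def load_staff_file(path: Path) -> dict:
    """Load one staff JSON file (hypothetical helper)."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Keying the result by slug makes it straightforward to compare against profiles that were already saved.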
Output (Entity Files)
Each profile is saved as {slug}_{timestamp}.json in /data/custodian/person/entity/:
{
"extraction_metadata": {
"source_file": "staff_parsing",
"staff_id": "{slug}_profile",
"extraction_date": "2025-12-11T...",
"extraction_method": "exa_crawling_glm46",
"extraction_agent": "claude-opus-4.5",
"linkedin_url": "https://www.linkedin.com/in/slug",
"cost_usd": 0,
"request_id": "md5_hash"
},
"profile_data": {
"name": "Full Name",
"linkedin_url": "https://www.linkedin.com/in/slug",
"headline": "Current Position",
"location": "City, Country",
"connections": "500+ connections",
"about": "Professional summary...",
"experience": [...],
"education": [...],
"skills": [...],
"languages": [...],
"profile_image_url": "https://..."
}
}
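As an illustration of the output format, the sketch below assembles the metadata block and the `{slug}_{timestamp}.json` filename. Several details are assumptions because the README does not specify them: the function names, the compact timestamp format in the filename, and the choice of the LinkedIn URL as the MD5 input for `request_id`.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_entity_record(slug, linkedin_url, profile_data, now=None):
    """Return (filename, record) for one fetched profile."""
    now = now or datetime.now(timezone.utc)  # timestamps are UTC, ISO 8601
    record = {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": linkedin_url,
            "cost_usd": 0,
            # Assumption: the hash input is the URL; the README only says "md5_hash".
            "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
        },
        "profile_data": {"linkedin_url": linkedin_url, **profile_data},
    }
    # Assumption: a compact UTC timestamp; the real filename format may differ.
    filename = f"{slug}_{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    return filename, record


def save_entity(out_dir: Path, filename: str, record: dict) -> Path:
    """Write the record as pretty-printed UTF-8 JSON and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```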
Setup
- Set the ZAI_API_TOKEN environment variable:
  export ZAI_API_TOKEN=your_token_here
- Install dependencies:
  pip install httpx tqdm
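A script can verify the token up front instead of failing partway through a run. The helper below is a hypothetical sketch (`require_token` is not a name taken from the scripts); it raises the same "ZAI_API_TOKEN not set" message listed under Troubleshooting.

```python
import os


def require_token(env=os.environ, name="ZAI_API_TOKEN"):
    """Return the API token, or raise if the variable is missing or blank."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} not set")
    return token
```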
Running
To fetch all profiles:
cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py
To fetch from a specific directory:
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files
To test the system:
python scripts/test_fetch_profiles.py
Features
- Duplicate Prevention: Checks existing profiles by LinkedIn slug
- Threading: Processes up to 3 profiles in parallel
- Rate Limiting: 1-second delay between API calls
- Progress Tracking: Shows progress bar with success/failure counts
- Error Handling: Graceful handling of API errors and parsing failures
- Logging: Saves detailed results log with timestamps
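The duplicate check, the 3-worker thread pool, and the 1-second spacing between requests can be combined roughly as follows. This is a sketch under stated assumptions, not the scripts' implementation: `RateLimiter`, `fetch_new_profiles`, and the `fetch_one` callback are hypothetical names, and an existing profile is detected by a filename that starts with `{slug}_`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


class RateLimiter:
    """Space out calls across threads so they start at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def fetch_new_profiles(urls_by_slug, entity_dir: Path, fetch_one,
                       max_workers: int = 3, min_interval: float = 1.0):
    """Fetch only slugs with no saved file; return success and failure lists."""
    names = [p.name for p in entity_dir.glob("*.json")] if entity_dir.is_dir() else []
    todo = {s: u for s, u in urls_by_slug.items()
            if not any(n.startswith(f"{s}_") for n in names)}
    limiter = RateLimiter(min_interval)
    results = {"success": [], "failed": []}

    def task(slug, url):
        limiter.wait()  # rate limit applies across all worker threads
        return fetch_one(slug, url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, s, u): s for s, u in todo.items()}
        for fut in as_completed(futures):
            slug = futures[fut]
            try:
                fut.result()
                results["success"].append(slug)
            except Exception as exc:  # record the failure and keep going
                results["failed"].append((slug, str(exc)))
    return results
```

Sharing one limiter between the workers keeps the request rate bounded regardless of how many threads are running, while the pool still overlaps the time spent waiting on responses.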
File Locations
- Staff Files: /data/custodian/person/affiliated/parsed/
- Entity Profiles: /data/custodian/person/entity/
- Logs: fetch_log_YYYYMMDD_HHMMSS.txt
Notes
- Each run is limited to the first 50 new profiles, a cap intended for testing
- Existing profiles are skipped based on LinkedIn slug
- Failed fetches are logged with error messages
- All timestamps are in UTC (ISO 8601 format)
Troubleshooting
- "ZAI_API_TOKEN not set": Set the environment variable
- Rate limiting: The script includes 1-second delays between requests
- Parsing failures: Some LinkedIn profiles may be private or have unusual structures
- Network errors: Run the script again to retry failed URLs