# LinkedIn Profile Fetcher

This system fetches LinkedIn profile data using the Exa API for people found in staff files, storing structured data in `/data/custodian/person/entity/`.

## Scripts

### 1. `fetch_linkedin_profiles_complete.py` ⭐ **RECOMMENDED**
Complete script that processes ALL staff directories and fetches only new profiles.

**Features:**
- Processes all 24 staff directories automatically
- Extracts LinkedIn URLs from staff JSON files
- Checks existing profiles to prevent duplicates
- Interactive: asks how many profiles to fetch
- Uses Exa API with GLM-4.6 model for profile extraction
- Threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
- Detailed logging with success/failure tracking

**Usage:**
```bash
python fetch_linkedin_profiles_complete.py
```

### 2. `fetch_linkedin_profiles_exa_final.py`
Core script that fetches LinkedIn profiles from a single staff directory.

**Features:**
- Extracts LinkedIn URLs from staff JSON files
- Uses Exa API with GLM-4.6 model for profile extraction
- Prevents duplicate entries by checking existing profiles
- Uses threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema

**Usage:**
```bash
python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>
```

### 3. `test_fetch_profiles.py`
Test script that fetches 3 sample profiles to verify the system works.

**Usage:**
```bash
python test_fetch_profiles.py
```

### 4. `test_linkedin_urls.py`
Utility script that lists the LinkedIn URLs that would be fetched from staff files, without fetching them.

**Usage:**
```bash
python test_linkedin_urls.py <path_to_staff_directory>
```

### 5. `fetch_all_linkedin_profiles.py`
Wrapper script that runs the fetcher on the main staff directory.

**Usage:**
```bash
python fetch_all_linkedin_profiles.py
```

## Data Structure

### Input (Staff Files)
Staff files should have this structure:
```json
{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug", // Alternative field name
      ...
    }
  ]
}
```

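
Reading the profile links out of a staff file with this structure could look like the sketch below. It is an illustration, not the scripts' actual code: `extract_linkedin_urls` is a hypothetical helper, and it assumes that `linkedin_url` takes precedence when both field names are present. The `//` comment and `...` in the example above are illustrative only; real staff files must be valid JSON for `json.loads` to accept them.

```python
import json
from pathlib import Path


def extract_linkedin_urls(staff_file: Path) -> list[str]:
    """Return the unique LinkedIn URLs found in one staff JSON file."""
    data = json.loads(staff_file.read_text(encoding="utf-8"))
    urls: list[str] = []
    for person in data.get("staff", []):
        # Assumption: prefer `linkedin_url`, fall back to the alternative field name.
        url = person.get("linkedin_url") or person.get("linkedin_profile_url")
        if url and url not in urls:
            urls.append(url)
    return urls
```

Entries without either field are skipped, and a URL that appears more than once in the file is returned only once.
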
### Output (Entity Files)
Each profile is saved as `{slug}_{timestamp}.json` in `/data/custodian/person/entity/`:

```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```

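
The `{slug}_{timestamp}.json` naming can be sketched as follows. This is a hedged illustration: `slug_from_url` and `entity_path` are hypothetical helpers, and the timestamp layout (`%Y%m%dT%H%M%SZ`) is an assumption, since the exact format used by the scripts is not documented here.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

ENTITY_DIR = Path("/data/custodian/person/entity")


def slug_from_url(linkedin_url: str) -> str:
    """Return the last path segment of a profile URL (the part after /in/)."""
    path = urlparse(linkedin_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def entity_path(linkedin_url: str, when: Optional[datetime] = None) -> Path:
    """Build the output path for one profile, using a UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    # Assumed timestamp layout; the real scripts may format it differently.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return ENTITY_DIR / f"{slug_from_url(linkedin_url)}_{stamp}.json"
```

Taking the slug from the parsed URL path, rather than splitting the raw string, keeps trailing slashes and query parameters out of the file name.
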
## Setup

1. **Set the `ZAI_API_TOKEN` environment variable:**
   ```bash
   export ZAI_API_TOKEN=your_token_here
   ```

2. **Install dependencies:**
   ```bash
   pip install httpx tqdm
   ```

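
A script can verify the token at startup before doing any work. The helper below is a minimal sketch under that assumption; `require_token` is a hypothetical name, and the exact error message the real scripts print may differ.

```python
import os
import sys


def require_token(name: str = "ZAI_API_TOKEN") -> str:
    """Return the API token from the environment, or exit with a clear message."""
    token = os.environ.get(name, "").strip()
    if not token:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"{name} not set; export it before running the fetcher")
    return token
```

Failing early keeps a missing token from surfacing later as a less obvious authentication error in the middle of a run.
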
## Running

### To fetch all profiles:
```bash
cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py
```

### To fetch from a specific directory:
```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files
```

### To test the system:
```bash
python scripts/test_fetch_profiles.py
```

## Features

- **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
- **Threading**: Processes up to 3 profiles in parallel
- **Rate Limiting**: 1-second delay between API calls
- **Progress Tracking**: Shows progress bar with success/failure counts
- **Error Handling**: Graceful handling of API errors and parsing failures
- **Logging**: Saves detailed results log with timestamps

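
One way to combine a 3-worker thread pool with a 1-second delay between API calls is sketched below. It is an illustration rather than the scripts' actual code: `RateLimiter`, `fetch_all`, and the `fetch` callable are hypothetical names, and the real scripts may space requests differently (for example, per worker instead of globally).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


class RateLimiter:
    """Let calls start at most once per `min_interval` seconds, across all threads."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Sleeping while holding the lock makes other threads queue up behind us.
        with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            self._next_allowed = time.monotonic() + self.min_interval


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    workers: int = 3,
    min_interval: float = 1.0,
) -> dict[str, str]:
    """Run `fetch` for every URL and report "ok" or the failure reason per URL."""
    limiter = RateLimiter(min_interval)

    def task(url: str) -> object:
        limiter.wait()
        return fetch(url)

    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                results[url] = "ok"
            except Exception as exc:  # record the failure and keep going
                results[url] = f"failed: {exc}"
    return results
```

Catching exceptions per future means one failed profile does not stop the remaining fetches, which matches the success/failure tracking described above.
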
## File Locations

- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`

## Notes

- Each run is limited to the first 50 new profiles (a limit intended for testing)
- Existing profiles are skipped based on their LinkedIn slug
- Failed fetches are logged with error messages
- All timestamps are in UTC (ISO 8601 format)

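
The skip-existing and 50-profile rules can be sketched as a single selection step. This is a hedged illustration: `has_profile` and `select_new` are hypothetical helpers, and the check assumes that entity files are named `{slug}_{timestamp}.json` and that slugs contain no underscores or glob wildcard characters.

```python
from pathlib import Path
from urllib.parse import urlparse


def has_profile(entity_dir: Path, slug: str) -> bool:
    """True if any `{slug}_*.json` file already exists in the entity directory."""
    return any(entity_dir.glob(f"{slug}_*.json"))


def select_new(urls: list[str], entity_dir: Path, limit: int = 50) -> list[str]:
    """Return up to `limit` URLs whose slug has no saved profile yet."""
    selected: list[str] = []
    seen: set[str] = set()
    for url in urls:
        slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
        # Skip empty slugs, repeats within this run, and already-saved profiles.
        if not slug or slug in seen or has_profile(entity_dir, slug):
            continue
        seen.add(slug)
        selected.append(url)
        if len(selected) >= limit:
            break
    return selected
```

Because only saved profiles are matched, a URL whose fetch failed on an earlier run is selected again the next time.
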
## Troubleshooting

1. **"ZAI_API_TOKEN not set"**: Export the `ZAI_API_TOKEN` environment variable before running the script
2. **Rate limiting**: The script already waits 1 second between requests
3. **Parsing failures**: Some LinkedIn profiles are private or have unusual structures and may fail to parse
4. **Network errors**: Failed URLs are retried the next time the script is run