glam/scripts/README_linkedin_fetcher.md
2025-12-12 00:40:26 +01:00

# LinkedIn Profile Fetcher
This system uses the Exa API to fetch LinkedIn profile data for people listed in staff files, and stores the structured results in `/data/custodian/person/entity/`.
## Scripts
### 1. `fetch_linkedin_profiles_complete.py` ⭐ **RECOMMENDED**
Complete script that processes ALL staff directories and fetches only new profiles.
**Features:**
- Processes all 24 staff directories automatically
- Extracts LinkedIn URLs from staff JSON files
- Checks existing profiles to prevent duplicates
- Interactive: asks how many profiles to fetch
- Uses Exa API with GLM-4.6 model for profile extraction
- Threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
- Detailed logging with success/failure tracking
**Usage:**
```bash
python fetch_linkedin_profiles_complete.py
```
### 2. `fetch_linkedin_profiles_exa_final.py`
Core script that fetches LinkedIn profiles from a single staff directory.
**Features:**
- Extracts LinkedIn URLs from staff JSON files
- Uses Exa API with GLM-4.6 model for profile extraction
- Prevents duplicate entries by checking existing profiles
- Uses threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
**Usage:**
```bash
python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>
```
### 3. `test_fetch_profiles.py`
Test script that fetches 3 sample profiles to verify the system works.
**Usage:**
```bash
python test_fetch_profiles.py
```
### 4. `test_linkedin_urls.py`
Utility script to show LinkedIn URLs that would be fetched from staff files.
**Usage:**
```bash
python test_linkedin_urls.py <path_to_staff_directory>
```
### 5. `fetch_all_linkedin_profiles.py`
Wrapper script that runs the fetcher on the main staff directory.
**Usage:**
```bash
python fetch_all_linkedin_profiles.py
```
## Data Structure
### Input (Staff Files)
Staff files should have this structure:
```json
{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug", // Alternative field name
      ...
    }
  ]
}
```
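The lookup across both field names can be sketched as follows. This is a minimal illustration, not the code from the scripts themselves: the helper names `slug_from_url` and `extract_linkedin_urls` are hypothetical, and the order in which the two fields are checked is an assumption.

```python
from typing import Dict, Optional
from urllib.parse import urlparse

# Field names taken from the staff file structure above.
# Checking them in this order is an assumption.
URL_FIELDS = ("linkedin_url", "linkedin_profile_url")


def slug_from_url(url: str) -> Optional[str]:
    """Return the slug of a /in/<slug> LinkedIn URL, or None if it has none."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1]
    return None


def extract_linkedin_urls(staff_data: dict) -> Dict[str, str]:
    """Map slug -> URL for every staff entry that carries a usable URL."""
    found: Dict[str, str] = {}
    for person in staff_data.get("staff", []):
        # Take the first field that is present and non-empty.
        url = next((person[f] for f in URL_FIELDS if person.get(f)), None)
        if not url:
            continue
        slug = slug_from_url(url)
        if slug:
            # Keep the first URL seen for a slug; later repeats are ignored.
            found.setdefault(slug, url)
    return found
```

Keying the result by slug means a person listed in several staff files only appears once in the fetch queue.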
### Output (Entity Files)
Each profile is saved as `{slug}_{timestamp}.json` in `/data/custodian/person/entity/`:
```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```
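As a rough sketch of how such a record and its filename could be assembled: the function `build_entity_record` is hypothetical, the filename timestamp format is an assumption, and hashing the LinkedIn URL to produce `request_id` is also an assumption (the example above only shows that the value is an MD5 hash). The fixed metadata values are copied from the example.

```python
import hashlib
from datetime import datetime, timezone


def build_entity_record(slug, linkedin_url, profile_data, now=None):
    """Return (filename, record) for one fetched profile."""
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format; the README only specifies {slug}_{timestamp}.json.
    filename = f"{slug}_{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    record = {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),  # UTC, ISO 8601
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": linkedin_url,
            "cost_usd": 0,
            # Assumption: the MD5 input is the profile URL.
            "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
        },
        "profile_data": profile_data,
    }
    return filename, record
```

Passing `now` explicitly makes the function deterministic, which is convenient when testing.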
## Setup
1. **Set ZAI_API_TOKEN environment variable:**
```bash
export ZAI_API_TOKEN=your_token_here
```
2. **Install dependencies:**
```bash
pip install httpx tqdm
```
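A script can check for the token up front and fail with a clear message before any request is made. The sketch below is one way to do that; the helper name `get_api_token` and the exact error wording are assumptions.

```python
import os


def get_api_token(env=None):
    """Read ZAI_API_TOKEN from the environment or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("ZAI_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "ZAI_API_TOKEN not set; export it before running the fetcher"
        )
    return token
```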
## Running
### To fetch all profiles:
```bash
cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py
```
### To fetch from a specific directory:
```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files
```
### To test the system:
```bash
python scripts/test_fetch_profiles.py
```
## Features
- **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
- **Threading**: Processes up to 3 profiles in parallel
- **Rate Limiting**: 1-second delay between API calls
- **Progress Tracking**: Shows progress bar with success/failure counts
- **Error Handling**: Graceful handling of API errors and parsing failures
- **Logging**: Saves detailed results log with timestamps
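The combination of threading and rate limiting can be sketched with the standard library alone. This is an illustration under stated assumptions, not the scripts' actual implementation: `RateLimiter` and `fetch_all` are hypothetical names, `fetch_one` stands in for the real API call, and the sketch assumes the 1-second delay applies across all workers rather than per worker.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        time.sleep(max(0.0, slot - now))


def fetch_all(urls, fetch_one, max_workers=3, min_interval=1.0):
    """Run fetch_one(url) for each URL and collect successes and failures."""
    limiter = RateLimiter(min_interval)
    results = {"success": [], "failed": []}

    def task(url):
        limiter.wait()
        return fetch_one(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                results["success"].append(url)
            except Exception as exc:  # record the error and keep going
                results["failed"].append((url, str(exc)))
    return results
```

Results are gathered in the main thread via `as_completed`, so the success and failure lists need no extra locking.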
## File Locations
- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`
## Notes
- The script limits each run to the first 50 new profiles (a safeguard for testing)
- Existing profiles are skipped based on LinkedIn slug
- Failed fetches are logged with error messages
- All timestamps are in UTC (ISO 8601 format)
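Skipping existing profiles and capping a run could look like the sketch below. It assumes entity files are named `{slug}_{timestamp}.json` as described under Output, so a slug counts as already fetched when any file with that prefix exists. The names `already_fetched` and `select_new_profiles` are hypothetical.

```python
import glob
from pathlib import Path


def already_fetched(slug, entity_dir):
    """True if a file named {slug}_<anything>.json exists in entity_dir."""
    pattern = f"{glob.escape(slug)}_*.json"
    return any(Path(entity_dir).glob(pattern))


def select_new_profiles(candidates, entity_dir, limit=50):
    """Return up to `limit` (slug, url) pairs that have no entity file yet.

    `candidates` maps slug -> LinkedIn URL; dict order is preserved.
    """
    new = [
        (slug, url)
        for slug, url in candidates.items()
        if not already_fetched(slug, entity_dir)
    ]
    return new[:limit]
```

Because a failed fetch never produces an entity file, its slug stays in the candidate list on the next run.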
## Troubleshooting
1. **"ZAI_API_TOKEN not set"**: Set the environment variable
2. **Rate limiting**: The script includes 1-second delays between requests
3. **Parsing failures**: Some LinkedIn profiles may be private or have unusual structures
4. **Network errors**: Failed URLs are not saved as profiles, so running the script again retries them