# LinkedIn Profile Fetching System - Complete Implementation

## Overview

A complete system to fetch LinkedIn profile data using the Exa API for people found in staff files, with duplicate prevention and threading for efficiency.

## Files Created

### Main Scripts

1. **`scripts/fetch_linkedin_profiles_complete.py`** - Main script (RECOMMENDED)
   - Processes all 24 staff directories automatically
   - Interactive batch-size selection
   - Complete duplicate prevention
   - Threading (3 workers)
   - Rate limiting (1-second delay)

2. **`scripts/fetch_linkedin_profiles_exa_final.py`** - Core script
   - Processes a single directory
   - Same features as the main script

3. **`scripts/run_linkedin_fetcher.sh`** - Quick-start script
   - Loads the `.env` file automatically
   - Validates the environment
   - One-click execution

### Utility Scripts

4. **`scripts/test_fetch_profiles.py`** - Test run with 3 profiles
5. **`scripts/test_linkedin_urls.py`** - Preview the URLs to be fetched

## Quick Start

```bash
cd /Users/kempersc/apps/glam
./scripts/run_linkedin_fetcher.sh
```

The system will:

1. Load the environment from the `.env` file
2. Scan all 24 staff directories
3. Check existing profiles (5,338 already exist)
4. Show how many new profiles need to be fetched
5. Ask how many to process (batch mode)
6. Fetch each profile using the Exa API with GLM-4.6
7. Save structured JSON to `/data/custodian/person/entity/`
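
The batch prompt in step 5 can be sketched roughly as below. The function names and prompt wording are illustrative, not the script's actual code; the real script may parse the reply differently.

```python
# Hypothetical sketch of the interactive batch-size prompt (step 5 above).
def parse_batch_choice(answer: str, total: int) -> int:
    """Interpret the user's reply: blank or 'all' means every new profile,
    otherwise a number capped at the total available."""
    answer = answer.strip().lower()
    if answer in ("", "all"):
        return total
    return min(int(answer), total)

def choose_batch(total: int) -> int:
    """Prompt once and return the number of profiles to process."""
    reply = input(f"{total} new profiles found. How many to fetch? [all] ")
    return parse_batch_choice(reply, total)
```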

## Data Flow

```
Staff Files (24 directories)
        ↓
Extract LinkedIn URLs (5,338 unique)
        ↓
Check existing profiles (already have 5,338)
        ↓
Fetch only new profiles
        ↓
Save as {slug}_{timestamp}.json
```
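
The extract-and-deduplicate steps above can be sketched as follows. The helper names are hypothetical, and the filename parsing assumes the timestamp is a single underscore-separated suffix of `{slug}_{timestamp}.json`; the actual script may split differently.

```python
# Illustrative sketch of the URL-extraction / duplicate-check steps.
from pathlib import Path
from urllib.parse import urlparse

def linkedin_slug(url: str) -> str:
    """Return the profile slug, e.g. .../in/jane-doe/ -> 'jane-doe'."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1].lower()

def find_new_urls(urls, entity_dir: Path):
    """Keep only URLs whose slug has no existing profile file.
    Assumes filenames look like {slug}_{timestamp}.json."""
    existing = {f.stem.rsplit("_", 1)[0] for f in entity_dir.glob("*.json")}
    return [u for u in urls if linkedin_slug(u) not in existing]
```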

## Output Format

Each profile is saved as:

```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/...",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "...",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```
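
A minimal sketch of wrapping a fetched profile in the schema above and writing it as `{slug}_{timestamp}.json`. The `save_profile` helper and the exact field values are illustrative assumptions, not the script's actual code.

```python
# Hypothetical sketch: wrap profile data in the output schema and save it.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def save_profile(slug: str, url: str, profile: dict, out_dir: Path) -> Path:
    now = datetime.now(timezone.utc)  # all timestamps in UTC, ISO 8601
    record = {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),
            "extraction_method": "exa_crawling_glm46",
            "linkedin_url": url,
            "cost_usd": 0,
            "request_id": hashlib.md5(url.encode()).hexdigest(),
        },
        "profile_data": profile,
    }
    path = out_dir / f"{slug}_{now.strftime('%Y%m%d%H%M%S')}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```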

## Features Implemented

✅ **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
✅ **Threading**: 3 parallel workers for efficiency
✅ **Rate Limiting**: 1-second delay between API calls
✅ **Progress Tracking**: Real-time progress bar
✅ **Error Handling**: Graceful handling of API errors
✅ **Logging**: Detailed results log with timestamps
✅ **Interactive**: Choose a batch size or process everything
✅ **Environment Loading**: Automatic `.env` file loading
✅ **Structured Output**: Follows the project schema exactly
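
The threading-plus-rate-limiting combination listed above can be sketched as follows: a shared lock spaces call starts at least `delay` seconds apart while the workers fetch in parallel. `fetch_profile` stands in for the real Exa API call; the names here are illustrative, not the script's actual implementation.

```python
# Sketch: 3 workers share a lock that enforces a minimum gap between
# API call starts, approximating the script's 1-second rate limit.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_rate_lock = threading.Lock()
_last_start = [0.0]  # mutable holder for the last call-start time

def rate_limited(url, fetch_profile, delay=1.0):
    with _rate_lock:
        wait = delay - (time.monotonic() - _last_start[0])
        if wait > 0:
            time.sleep(wait)
        _last_start[0] = time.monotonic()
    # Lock is released before the fetch, so workers overlap on I/O
    # while call *starts* stay at least `delay` seconds apart.
    return fetch_profile(url)

def fetch_all(urls, fetch_profile, workers=3, delay=1.0):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: rate_limited(u, fetch_profile, delay), urls))
```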

## Statistics

- **Staff Directories**: 24
- **Total LinkedIn URLs**: 5,338
- **Existing Profiles**: 5,338 (already fetched)
- **New Profiles to Fetch**: varies (check when running)

## Requirements

- Python 3.7+
- httpx
- tqdm
- `ZAI_API_TOKEN` in the environment or `.env`

## Installation

```bash
pip install httpx tqdm
```

## Configuration

Set `ZAI_API_TOKEN` in your `.env` file:

```
ZAI_API_TOKEN=your_token_here
```
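
The lookup the shell script provides can be sketched as: prefer the environment variable, fall back to parsing the `.env` file. This `load_token` helper is an illustrative assumption, not the project's actual loader.

```python
# Hypothetical sketch of the ZAI_API_TOKEN lookup with .env fallback.
import os
from pathlib import Path
from typing import Optional

def load_token(env_file: Path = Path(".env")) -> Optional[str]:
    token = os.environ.get("ZAI_API_TOKEN")
    if token:
        return token
    if env_file.exists():
        for line in env_file.read_text().splitlines():
            if line.startswith("ZAI_API_TOKEN="):
                return line.split("=", 1)[1].strip()
    return None
```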

## Usage Examples

### Fetch all new profiles

```bash
./scripts/run_linkedin_fetcher.sh
```

### Process a specific number

When prompted, enter a number such as `50` to process only the first 50 profiles.

### Process a single directory

```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/dir
```

### Test the system

```bash
python scripts/test_fetch_profiles.py
```

## File Locations

- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`
- **Scripts**: `/scripts/`

## Notes

1. The system prevents duplicates: existing profiles are automatically skipped
2. Rate limiting prevents API quota issues
3. All timestamps are in UTC (ISO 8601 format)
4. Failed fetches are logged with error details
5. The shell script automatically loads the `.env` file

## Troubleshooting

1. **"ZAI_API_TOKEN not set"**
   - Add the token to your `.env` file
   - Or export it: `export ZAI_API_TOKEN=token`

2. **Rate limit errors**
   - The script already includes 1-second delays
   - Reduce the number of workers if needed (edit the script)

3. **Parsing failures**
   - Some profiles may be private or restricted
   - Check the log file for details

4. **Network errors**
   - Check your internet connection
   - Failed profiles will be retried on the next run

## Success Indicators

✅ Script runs without errors
✅ Progress bar completes
✅ Log file shows successful fetches
✅ New JSON files appear in `/data/custodian/person/entity/`
✅ No duplicate profiles created