2025-12-12 00:40:26 +01:00


# LinkedIn Profile Fetching System - Complete Implementation
## Overview
A complete system that fetches LinkedIn profile data via the Exa API for people found in staff files, with duplicate prevention and threaded fetching for efficiency.
## Files Created
### Main Scripts
1. **`scripts/fetch_linkedin_profiles_complete.py`** - Main script (RECOMMENDED)
   - Processes ALL 24 staff directories automatically
   - Interactive batch size selection
   - Complete duplicate prevention
   - Threading (3 workers)
   - Rate limiting (1-second delay)
2. **`scripts/fetch_linkedin_profiles_exa_final.py`** - Core script
   - For processing single directories
   - Same features as the main script
3. **`scripts/run_linkedin_fetcher.sh`** - Quick-start script
   - Loads the `.env` file automatically
   - Environment validation
   - One-click execution
### Utility Scripts
4. **`scripts/test_fetch_profiles.py`** - Test run with 3 profiles
5. **`scripts/test_linkedin_urls.py`** - Preview the URLs to be fetched
## Quick Start
```bash
cd /Users/kempersc/apps/glam
./scripts/run_linkedin_fetcher.sh
```
The system will:
1. Load environment from `.env` file
2. Scan all 24 staff directories
3. Check existing profiles (5,338 already exist)
4. Show how many new profiles to fetch
5. Ask how many to process (batch mode)
6. Fetch using Exa API with GLM-4.6
7. Save structured JSON to `/data/custodian/person/entity/`
## Data Flow
```
Staff Files (24 directories)
        ↓
Extract LinkedIn URLs (5,338 unique)
        ↓
Check existing profiles (already have 5,338)
        ↓
Fetch only new profiles
        ↓
Save as {slug}_{timestamp}.json
```
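The last step of the flow, saving as `{slug}_{timestamp}.json`, can be sketched as follows. The helper names and the timestamp format (`YYYYMMDD_HHMMSS`, mirroring the log file naming) are assumptions for illustration, not lifted from the actual scripts:

```python
# Hypothetical sketch of the {slug}_{timestamp}.json naming step.
from datetime import datetime, timezone
from urllib.parse import urlparse


def linkedin_slug(url: str) -> str:
    """Extract the profile slug from a URL like
    https://www.linkedin.com/in/jane-doe-123/."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def profile_filename(url: str) -> str:
    """Build a {slug}_{timestamp}.json output filename (UTC timestamp)."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{linkedin_slug(url)}_{ts}.json"
```

Using the slug (rather than the full URL) as the key is what makes the duplicate check below robust to trailing slashes and query strings.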
## Output Format
Each profile is saved as:
```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/...",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "...",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```
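A quick way to confirm that a saved record matches this shape is a key-presence check. This is an illustrative sketch: the required-key sets are inferred from the example above, and `validate_profile` is not part of the shipped scripts:

```python
# Hypothetical validator for the profile JSON shape shown above.
import json

REQUIRED_TOP_LEVEL = {"extraction_metadata", "profile_data"}
REQUIRED_METADATA = {"source_file", "staff_id", "extraction_date",
                     "extraction_method", "linkedin_url"}


def validate_profile(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the record
    has the expected top-level and metadata keys."""
    doc = json.loads(raw)
    problems = [f"missing top-level key: {k}"
                for k in REQUIRED_TOP_LEVEL - doc.keys()]
    meta = doc.get("extraction_metadata", {})
    problems += [f"missing metadata key: {k}"
                 for k in REQUIRED_METADATA - meta.keys()]
    return problems
```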
## Features Implemented
- **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
- **Threading**: 3 parallel workers for efficiency
- **Rate Limiting**: 1-second delay between API calls
- **Progress Tracking**: Real-time progress bar
- **Error Handling**: Graceful API error handling
- **Logging**: Detailed results log with timestamps
- **Interactive**: Choose batch size or process all
- **Environment Loading**: Automatic `.env` file loading
- **Structured Output**: Follows project schema exactly
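The duplicate prevention, 3-worker threading, and 1-second rate limit could be combined roughly as below. All names here (`existing_slugs`, `fetch_new`, the placeholder `fetch_profile`) are hypothetical, and the real Exa API call is stubbed out:

```python
# Sketch of slug-based dedup plus rate-limited threaded fetching.
# Assumes {slug}_{YYYYMMDD}_{HHMMSS}.json filenames on disk.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from threading import Lock

_rate_lock = Lock()


def existing_slugs(entity_dir: Path) -> set[str]:
    """Slugs already saved, parsed from {slug}_{timestamp}.json names."""
    if not entity_dir.is_dir():
        return set()
    return {p.stem.rsplit("_", 2)[0] for p in entity_dir.glob("*.json")}


def fetch_profile(url: str) -> dict:
    # Placeholder for the real Exa API call. The shared lock enforces
    # roughly 1 second between request starts across all workers.
    with _rate_lock:
        time.sleep(1)
    return {"linkedin_url": url}


def fetch_new(urls, skip_slugs):
    """Fetch only URLs whose slug is not already on disk, 3 at a time."""
    todo = [u for u in urls if u.rstrip("/").split("/")[-1] not in skip_slugs]
    results = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(fetch_profile, u): u for u in todo}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```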
## Statistics
- **Staff Directories**: 24
- **Total LinkedIn URLs**: 5,338
- **Existing Profiles**: 5,338 (already fetched)
- **New Profiles to Fetch**: Varies (check when running)
## Requirements
- Python 3.7+
- httpx
- tqdm
- ZAI_API_TOKEN in environment or .env
## Installation
```bash
pip install httpx tqdm
```
## Configuration
Set `ZAI_API_TOKEN` in your `.env` file:
```
ZAI_API_TOKEN=your_token_here
```
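The shell script handles `.env` loading for you; when running the Python scripts directly, a minimal loader might look like this (a sketch of the general pattern, not the project's actual loader):

```python
# Minimal .env loader sketch: KEY=VALUE lines, '#' comments ignored.
# Does not overwrite variables already set in the environment.
import os
from pathlib import Path


def load_dotenv(path: str = ".env") -> None:
    env_file = Path(path)
    if not env_file.is_file():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(),
                              value.strip().strip('"').strip("'"))
```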
## Usage Examples
### Fetch all new profiles:
```bash
./scripts/run_linkedin_fetcher.sh
```
### Process specific number:
When prompted, enter a number such as `50` to process only the first 50 profiles.
### Process single directory:
```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/dir
```
### Test system:
```bash
python scripts/test_fetch_profiles.py
```
## File Locations
- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`
- **Scripts**: `/scripts/`
## Notes
1. The system is designed to prevent duplicates: existing profiles are automatically skipped
2. Rate limiting prevents API quota issues
3. All timestamps are in UTC (ISO 8601 format)
4. Failed fetches are logged with error details
5. The shell script automatically loads the `.env` file
## Troubleshooting
1. **"ZAI_API_TOKEN not set"**:
   - Add the token to your `.env` file
   - Or export it: `export ZAI_API_TOKEN=token`
2. **Rate limit errors**:
   - The script includes 1-second delays
   - Reduce the number of workers if needed (edit the script)
3. **Parsing failures**:
   - Some profiles may be private or restricted
   - Check the log file for details
4. **Network errors**:
   - Failed fetches will be retried on the next run
   - Check your internet connection
## Success Indicators
✅ Script runs without errors
✅ Progress bar completes
✅ Log file shows successful fetches
✅ New JSON files appear in `/data/custodian/person/entity/`
✅ No duplicate profiles created