kempersc 03263f67d6 moved web archives

2025-12-12 00:40:26 +01:00

4.7 KiB


LinkedIn Profile Fetching System - Complete Implementation

Overview

Fetches LinkedIn profile data through the Exa API for people found in the staff files. Profiles that already exist are skipped, and fetching runs on three worker threads with a one-second delay between API calls.

Files Created

Main Scripts

scripts/fetch_linkedin_profiles_complete.py - Main script (RECOMMENDED)
- Processes ALL 24 staff directories automatically
- Interactive batch size selection
- Complete duplicate prevention
- Threading (3 workers)
- Rate limiting (1 second delay)
scripts/fetch_linkedin_profiles_exa_final.py - Core script
- For processing single directories
- Same features as main script
scripts/run_linkedin_fetcher.sh - Quick start script
- Loads .env file automatically
- Environment validation
- One-click execution

Utility Scripts

scripts/test_fetch_profiles.py - Test with 3 profiles
scripts/test_linkedin_urls.py - Preview URLs to be fetched

Quick Start

cd /Users/kempersc/apps/glam
./scripts/run_linkedin_fetcher.sh

The system will:

Load environment from .env file
Scan all 24 staff directories
Check existing profiles (5,338 already exist)
Show how many new profiles to fetch
Ask how many to process (batch mode)
Fetch using Exa API with GLM-4.6
Save structured JSON to /data/custodian/person/entity/

Data Flow

Staff Files (24 directories)
    ↓
Extract LinkedIn URLs (5,338 unique)
    ↓
Check existing profiles (already have 5,338)
    ↓
Fetch only new profiles
    ↓
Save as {slug}_{timestamp}.json
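
The flow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes the slug is the path segment after `/in/` and that already-fetched profiles are recognised through the `linkedin_url` stored in each saved JSON file. The function names `linkedin_slug`, `existing_slugs`, and `urls_to_fetch` are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import unquote, urlparse


def linkedin_slug(url: str) -> str:
    """Return the lowercased slug after /in/, or "" for non-profile URLs."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "in":
        return unquote(parts[1]).lower()
    return ""


def existing_slugs(entity_dir: Path) -> set:
    """Collect slugs from the linkedin_url stored in each saved profile."""
    slugs = set()
    for path in entity_dir.glob("*.json"):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
            url = record["extraction_metadata"]["linkedin_url"]
            slug = linkedin_slug(url) if isinstance(url, str) else ""
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or unexpected file: skip it
        if slug:
            slugs.add(slug)
    return slugs


def urls_to_fetch(urls, entity_dir: Path) -> list:
    """Keep only profile URLs whose slug is neither saved nor repeated."""
    done = existing_slugs(entity_dir)
    seen, pending = set(), []
    for url in urls:
        slug = linkedin_slug(url)
        if slug and slug not in done and slug not in seen:
            seen.add(slug)
            pending.append(url)
    return pending
```

Comparing slugs rather than full URLs means that trailing slashes, query strings, and letter case do not produce duplicate fetches.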

Output Format

Each profile is saved as:

{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/...",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "...",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
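
A record in this shape could be assembled and written as sketched below. Several details are assumptions made for illustration: `request_id` is taken to be the MD5 hex digest of the LinkedIn URL (the source only says "md5_hash"), the filename timestamp format is a guess, and `build_record` and `save_record` are hypothetical names.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_record(slug: str, linkedin_url: str, profile_data: dict) -> dict:
    """Wrap parsed profile fields in the metadata envelope shown above."""
    now = datetime.now(timezone.utc)
    return {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),  # UTC, ISO 8601
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": linkedin_url,
            "cost_usd": 0,
            # Assumption: the MD5 is computed over the profile URL.
            "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
        },
        "profile_data": {"linkedin_url": linkedin_url, **profile_data},
    }


def save_record(record: dict, slug: str, entity_dir: Path) -> Path:
    """Write the record as {slug}_{timestamp}.json and return the path."""
    # Assumption: the real timestamp format in filenames may differ.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = entity_dir / f"{slug}_{stamp}.json"
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```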

Features Implemented

✅ Duplicate Prevention: Checks existing profiles by LinkedIn slug
✅ Threading: 3 parallel workers for efficiency
✅ Rate Limiting: 1-second delay between API calls
✅ Progress Tracking: Real-time progress bar
✅ Error Handling: Graceful API error handling
✅ Logging: Detailed results log with timestamps
✅ Interactive: Choose batch size or process all
✅ Environment Loading: Automatic .env file loading
✅ Structured Output: Follows project schema exactly
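
One way to combine the three workers, the one-second delay, and graceful error handling is sketched below. It is an illustration under assumptions: the delay is treated as a global minimum spacing between call starts across all threads (the real script may delay per worker instead), `fetch_one` stands in for the actual Exa API call, and `RateLimiter` and `fetch_all` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RateLimiter:
    """Space out calls by at least `min_interval` seconds across threads."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fetch_all(urls, fetch_one, workers: int = 3, min_interval: float = 1.0):
    """Run fetch_one(url) for every URL; return (results, errors) dicts."""
    limiter = RateLimiter(min_interval)
    results, errors = {}, {}

    def task(url):
        limiter.wait()
        return fetch_one(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failure must not stop the batch
                errors[url] = str(exc)
    return results, errors
```

Lowering `workers` or raising `min_interval` is the natural adjustment if the API still reports rate limit errors.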

Statistics

Staff Directories: 24
Total LinkedIn URLs: 5,338
Existing Profiles: 5,338 (already fetched)
New Profiles to Fetch: Varies (check when running)

Requirements

Python 3.7+
httpx
tqdm
ZAI_API_TOKEN in environment or .env

Installation

pip install httpx tqdm

Configuration

Set ZAI_API_TOKEN in your .env file:

ZAI_API_TOKEN=your_token_here
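
For scripts run without the shell wrapper, the same loading and validation could be done in Python. The sketch below is a hypothetical minimal parser for simple `KEY=VALUE` lines, not the project's loader; `load_env_file` and `require_token` are illustrative names, and values already present in the environment are left untouched.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict:
    """Parse KEY=VALUE lines and add missing keys to os.environ."""
    loaded = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)  # exported values take priority
    return loaded


def require_token(name: str = "ZAI_API_TOKEN") -> str:
    """Return the token or fail early with a clear message."""
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} not set: add it to .env or export it")
    return token
```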

Usage Examples

Fetch all new profiles:

./scripts/run_linkedin_fetcher.sh

Process a specific number of profiles:

When prompted, enter a number such as 50 to process only the first 50 profiles.
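
The batch prompt can be thought of as a small function over the list of pending URLs. This is a sketch only: treating an empty answer or `all` as "process everything" is an assumption about the prompt, and `select_batch` is a hypothetical name.

```python
def select_batch(pending: list, answer: str) -> list:
    """Return the first N pending items, or all of them."""
    text = answer.strip().lower()
    if text in ("", "all"):  # assumption: these answers mean "no limit"
        return list(pending)
    try:
        count = int(text)
    except ValueError:
        raise ValueError(f"expected a number or 'all', got {answer!r}") from None
    if count < 0:
        raise ValueError("batch size cannot be negative")
    return list(pending[:count])
```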

Process single directory:

python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/dir

Test the system with 3 profiles:

python scripts/test_fetch_profiles.py

File Locations

Staff Files: /data/custodian/person/affiliated/parsed/
Entity Profiles: /data/custodian/person/entity/
Logs: fetch_log_YYYYMMDD_HHMMSS.txt
Scripts: /scripts/
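
The log filename pattern above maps directly onto a `strftime` format. In this sketch the tab-separated line layout and the names `new_log_path` and `log_result` are assumptions for illustration; only the `fetch_log_YYYYMMDD_HHMMSS.txt` pattern comes from this document.

```python
from datetime import datetime, timezone
from pathlib import Path


def new_log_path(log_dir: Path, now=None) -> Path:
    """Build fetch_log_YYYYMMDD_HHMMSS.txt from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return log_dir / f"fetch_log_{now.strftime('%Y%m%d_%H%M%S')}.txt"


def log_result(log_path: Path, url: str, ok: bool, detail: str = "") -> None:
    """Append one timestamped result line (hypothetical line format)."""
    stamp = datetime.now(timezone.utc).isoformat()
    status = "OK" if ok else "FAILED"
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(f"{stamp}\t{status}\t{url}\t{detail}\n")
```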

Notes

The system prevents duplicates: existing profiles are skipped automatically
Rate limiting reduces the risk of hitting API quota limits
All timestamps are in UTC (ISO 8601 format)
Failed fetches are logged with error details
The shell script automatically loads the .env file

Troubleshooting

"ZAI_API_TOKEN not set":
- Add token to .env file
- Or export it in the shell: export ZAI_API_TOKEN=your_token_here
Rate limit errors:
- The script includes 1-second delays
- Reduce workers if needed (edit script)
Parsing failures:
- Some profiles may be private or restricted
- Check the log file for details
Network errors:
- Profiles that failed are retried on the next run
- Check internet connection

Success Indicators

✅ Script runs without errors
✅ Progress bar completes
✅ Log file shows successful fetches
✅ New JSON files appear in /data/custodian/person/entity/
✅ No duplicate profiles created
