glam/scripts/README_linkedin_fetcher.md
2025-12-12 00:40:26 +01:00

LinkedIn Profile Fetcher

This system fetches LinkedIn profile data using the Exa API for people found in staff files, storing structured data in /data/custodian/person/entity/.

Scripts

1. fetch_linkedin_profiles_complete.py

Complete script that processes ALL staff directories and fetches only profiles that have not been saved yet.

Features:

  • Processes all 24 staff directories automatically
  • Extracts LinkedIn URLs from staff JSON files
  • Checks existing profiles to prevent duplicates
  • Interactive: asks how many profiles to fetch
  • Uses Exa API with GLM-4.6 model for profile extraction
  • Threading for parallel processing (3 workers)
  • Rate limiting (1 second delay between requests)
  • Structured JSON output following the project schema
  • Detailed logging with success/failure tracking

Usage:

python fetch_linkedin_profiles_complete.py

2. fetch_linkedin_profiles_exa_final.py

Core script that fetches LinkedIn profiles from a single staff directory.

Features:

  • Extracts LinkedIn URLs from staff JSON files
  • Uses Exa API with GLM-4.6 model for profile extraction
  • Prevents duplicate entries by checking existing profiles
  • Uses threading for parallel processing (3 workers)
  • Rate limiting (1 second delay between requests)
  • Structured JSON output following the project schema

Usage:

python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>

3. test_fetch_profiles.py

Test script that fetches 3 sample profiles to verify the system works.

Usage:

python test_fetch_profiles.py

4. test_linkedin_urls.py

Utility script to show LinkedIn URLs that would be fetched from staff files.

Usage:

python test_linkedin_urls.py <path_to_staff_directory>

5. fetch_all_linkedin_profiles.py

Wrapper script that runs the fetcher on the main staff directory.

Usage:

python fetch_all_linkedin_profiles.py

Data Structure

Input (Staff Files)

Staff files should have this structure:

{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug",  // Alternative field name
      ...
    }
  ]
}
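
The URL extraction step can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the function name `extract_linkedin_urls` is hypothetical, and only the two field names shown in the structure above are taken from this README.

```python
# Field names taken from the staff-file structure above.
URL_FIELDS = ("linkedin_url", "linkedin_profile_url")


def extract_linkedin_urls(staff_data: dict) -> list[str]:
    """Return the unique LinkedIn URLs found in one parsed staff file.

    Hypothetical helper: the real scripts may order or filter differently.
    """
    urls = []
    seen = set()
    for person in staff_data.get("staff", []):
        # Use the first field that holds a non-empty value.
        url = next((person[f] for f in URL_FIELDS if person.get(f)), None)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Staff entries without either field are skipped, so a file that mixes both field names still yields a single de-duplicated list.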

Output (Entity Files)

Each profile is saved as {slug}_{timestamp}.json in /data/custodian/person/entity/:

{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
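
As a rough sketch of how the file name and the `extraction_metadata` block above might be assembled: the helper names are hypothetical, the timestamp format in the file name is an assumption (the README only says `{slug}_{timestamp}.json`), and hashing the URL for `request_id` is likewise an assumption (the README only says `md5_hash`).

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlparse


def slug_from_url(linkedin_url: str) -> str:
    """Return the last path segment of a /in/<slug> profile URL."""
    return urlparse(linkedin_url).path.rstrip("/").split("/")[-1]


def entity_filename(linkedin_url: str, now: datetime) -> str:
    # Assumption: compact UTC timestamp; the real format is not documented here.
    return f"{slug_from_url(linkedin_url)}_{now.strftime('%Y%m%dT%H%M%SZ')}.json"


def extraction_metadata(linkedin_url: str, now: datetime) -> dict:
    slug = slug_from_url(linkedin_url)
    return {
        "source_file": "staff_parsing",
        "staff_id": f"{slug}_profile",
        "extraction_date": now.isoformat(),
        "extraction_method": "exa_crawling_glm46",
        "extraction_agent": "claude-opus-4.5",
        "linkedin_url": linkedin_url,
        "cost_usd": 0,
        # Assumption: the request ID is the MD5 hex digest of the URL.
        "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
    }
```

Passing `datetime.now(timezone.utc)` as `now` keeps the timestamps in UTC, and taking `now` as a parameter makes the helpers easy to test.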

Setup

  1. Set ZAI_API_TOKEN environment variable:

    export ZAI_API_TOKEN=your_token_here
    
  2. Install dependencies:

    pip install httpx tqdm
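
A start-up check for the token could look like the sketch below. The function name `require_api_token` is hypothetical; only the variable name `ZAI_API_TOKEN` and the error text come from this README.

```python
import os


def require_api_token(env=os.environ) -> str:
    """Return the API token, or fail early with a clear message."""
    token = env.get("ZAI_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("ZAI_API_TOKEN not set")
    return token
```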
    

Running

To fetch all profiles:

cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py

To fetch from a specific directory:

python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files

To test the system:

python scripts/test_fetch_profiles.py

Features

  • Duplicate Prevention: Checks existing profiles by LinkedIn slug
  • Threading: Processes up to 3 profiles in parallel
  • Rate Limiting: 1-second delay between API calls
  • Progress Tracking: Shows progress bar with success/failure counts
  • Error Handling: Graceful handling of API errors and parsing failures
  • Logging: Saves detailed results log with timestamps
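
The threading, rate limiting, and error handling listed above can be combined roughly as follows. This is a sketch under assumptions, not the scripts' implementation: `RateLimiter`, `fetch_all`, and the `fetch_one` callback are hypothetical names, and the sketch spaces calls globally across all workers, whereas the real scripts may simply sleep inside each worker.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def fetch_all(urls, fetch_one, workers: int = 3, min_interval: float = 1.0):
    """Run fetch_one(url) for every URL; return (successes, failures)."""
    limiter = RateLimiter(min_interval)
    successes, failures = {}, {}

    def task(url):
        limiter.wait()
        return fetch_one(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(task, url) for url in urls}
        for url, future in futures.items():
            try:
                successes[url] = future.result()
            except Exception as exc:  # one bad profile should not stop the run
                failures[url] = str(exc)
    return successes, failures
```

Collecting failures in a separate dictionary is what makes a per-URL results log straightforward to write afterwards.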

File Locations

  • Staff Files: /data/custodian/person/affiliated/parsed/
  • Entity Profiles: /data/custodian/person/entity/
  • Logs: fetch_log_YYYYMMDD_HHMMSS.txt

Notes

  • The script limits each run to the first 50 new profiles (a safeguard for testing)
  • Existing profiles are skipped based on LinkedIn slug
  • Failed fetches are logged with error messages
  • All timestamps are in UTC (ISO 8601 format)
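
Skipping by slug and capping a run at 50 profiles might be implemented along these lines. The helper names are hypothetical, and the sketch assumes the timestamp part of `{slug}_{timestamp}.json` contains no underscore, so that splitting on the last underscore recovers the slug.

```python
from pathlib import Path


def existing_slugs(entity_dir: str) -> set[str]:
    """Slugs that already have a {slug}_{timestamp}.json file."""
    # Assumption: the timestamp has no underscore, so rsplit keeps
    # slugs that themselves contain underscores intact.
    return {p.stem.rsplit("_", 1)[0] for p in Path(entity_dir).glob("*_*.json")}


def select_new_urls(urls, known_slugs, limit=50):
    """Return up to `limit` URLs whose slug has not been seen yet."""
    new = []
    seen = set(known_slugs)
    for url in urls:
        if len(new) >= limit:
            break
        slug = url.rstrip("/").rsplit("/", 1)[-1]
        if slug not in seen:
            seen.add(slug)
            new.append(url)
    return new
```

Because a failed fetch never writes an entity file, its slug stays out of `existing_slugs` and the URL is selected again on the next run.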

Troubleshooting

  1. "ZAI_API_TOKEN not set": Set the environment variable
  2. Rate limiting: The script includes 1-second delays between requests
  3. Parsing failures: Some LinkedIn profiles may be private or have unusual structures
  4. Network errors: Run the script again; URLs that failed have no saved profile, so they are retried on the next run