kempersc 03263f67d6 moved web archives

2025-12-12 00:40:26 +01:00

LinkedIn Profile Fetcher

This system uses the Exa API to fetch LinkedIn profile data for people listed in staff files and stores the structured results in /data/custodian/person/entity/.

Scripts

1. `fetch_linkedin_profiles_complete.py` ⭐ RECOMMENDED

Complete script that processes ALL staff directories and fetches only new profiles.

Features:

Processes all 24 staff directories automatically
Extracts LinkedIn URLs from staff JSON files
Checks existing profiles to prevent duplicates
Interactive: asks how many profiles to fetch
Uses Exa API with GLM-4.6 model for profile extraction
Threading for parallel processing (3 workers)
Rate limiting (1 second delay between requests)
Structured JSON output following the project schema
Detailed logging with success/failure tracking
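The threading and rate-limiting features listed above could be combined roughly as follows. This is a minimal sketch, not the script's actual code: `RateLimiter`, `fetch_all`, and the `fetch_one` callback are hypothetical names, and the real Exa request is deliberately left out.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other workers can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.min_interval
        time.sleep(max(0.0, start - now))


def fetch_all(urls, fetch_one, workers=3, min_interval=1.0):
    """Call `fetch_one(url)` for every URL; return {url: (ok, result_or_error)}."""
    limiter = RateLimiter(min_interval)

    def task(url):
        limiter.wait()
        try:
            return url, (True, fetch_one(url))
        except Exception as exc:  # record the failure and keep going
            return url, (False, str(exc))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, urls))
```

With three workers and a one-second interval, requests still start at most once per second; the extra workers only help when a single request takes longer than the interval.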

Usage:

python fetch_linkedin_profiles_complete.py

2. `fetch_linkedin_profiles_exa_final.py`

Core script that fetches LinkedIn profiles from a single staff directory.

Features:

Extracts LinkedIn URLs from staff JSON files
Uses Exa API with GLM-4.6 model for profile extraction
Prevents duplicate entries by checking existing profiles
Uses threading for parallel processing (3 workers)
Rate limiting (1 second delay between requests)
Structured JSON output following the project schema

Usage:

python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>

3. `test_fetch_profiles.py`

Test script that fetches 3 sample profiles to verify the system works.

Usage:

python test_fetch_profiles.py

4. `test_linkedin_urls.py`

Utility script to show LinkedIn URLs that would be fetched from staff files.

Usage:

python test_linkedin_urls.py <path_to_staff_directory>

5. `fetch_all_linkedin_profiles.py`

Wrapper script that runs the fetcher on the main staff directory.

Usage:

python fetch_all_linkedin_profiles.py

Data Structure

Input (Staff Files)

Staff files should have this structure:

{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug",  // Alternative field name
      ...
    }
  ]
}
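Reading this structure might look like the sketch below. It assumes the real staff files are plain JSON (without the explanatory `//` comment and `...` shown above) and that profile URLs have the form `https://www.linkedin.com/in/<slug>`. The function names `slug_from_url` and `linkedin_urls` are illustrative, not taken from the scripts.

```python
import json
from pathlib import Path
from typing import Dict, Optional
from urllib.parse import urlparse

# Field names from the staff-file structure; the first non-empty one wins.
URL_FIELDS = ("linkedin_url", "linkedin_profile_url")


def slug_from_url(url: str) -> Optional[str]:
    """Return <slug> from a .../in/<slug> URL, or None if the path differs."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1]
    return None


def linkedin_urls(staff_file: Path) -> Dict[str, str]:
    """Map slug -> URL for every staff entry that carries a LinkedIn URL."""
    data = json.loads(staff_file.read_text(encoding="utf-8"))
    found: Dict[str, str] = {}
    for person in data.get("staff", []):
        url = next((person[f] for f in URL_FIELDS if person.get(f)), None)
        slug = slug_from_url(url) if url else None
        if slug:
            found.setdefault(slug, url)  # keep the first URL seen per slug
    return found
```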

Output (Entity Files)

Each profile is saved as {slug}_{timestamp}.json in /data/custodian/person/entity/:

{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
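As an illustration of how such a record and its file name could be assembled, here is a hedged sketch. Two details are assumptions because the README does not specify them: what the MD5 `request_id` is computed from (the URL is used here) and the exact timestamp format in the file name. `build_entity` is a hypothetical helper; the literal metadata values are copied from the sample above.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_entity(slug, url, profile_data, now=None):
    """Return (filename, record) in the entity-file layout shown above."""
    now = now or datetime.now(timezone.utc)
    record = {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),  # UTC, ISO 8601
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": url,
            "cost_usd": 0,
            # Assumption: the hash input is undocumented; the URL is a guess.
            "request_id": hashlib.md5(url.encode("utf-8")).hexdigest(),
        },
        "profile_data": profile_data,
    }
    # Assumption: compact UTC timestamp; the real format is not documented.
    filename = f"{slug}_{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    return filename, record


if __name__ == "__main__":
    name, rec = build_entity(
        "slug", "https://www.linkedin.com/in/slug", {"name": "Full Name"}
    )
    print(name)
    print(json.dumps(rec, indent=2))
```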

Setup

Set the `ZAI_API_TOKEN` environment variable:
```
export ZAI_API_TOKEN=your_token_here
```
Install dependencies:
```
pip install httpx tqdm
```
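A script can check for the token before doing any work. The helper below is a sketch under that assumption; `require_token` is a hypothetical name, and the scripts may perform this check differently.

```python
import os


def require_token(env=os.environ):
    """Return the API token, or raise if ZAI_API_TOKEN is missing or blank."""
    token = env.get("ZAI_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("ZAI_API_TOKEN not set")
    return token
```

Failing early this way avoids starting worker threads that would all fail with the same authentication error.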

Running

To fetch all profiles:

cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py

To fetch from a specific directory:

python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files

To test the system:

python scripts/test_fetch_profiles.py

Features

Duplicate Prevention: Checks existing profiles by LinkedIn slug
Threading: Processes up to 3 profiles in parallel
Rate Limiting: 1-second delay between API calls
Progress Tracking: Shows progress bar with success/failure counts
Error Handling: Graceful handling of API errors and parsing failures
Logging: Saves detailed results log with timestamps
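Duplicate prevention by slug can be sketched as below. To avoid guessing the timestamp format in entity file names, this version reads the `linkedin_url` stored in each file's `extraction_metadata` instead of parsing the file name; the actual scripts may do it differently. `slug_of`, `existing_slugs`, and `new_only` are hypothetical names, and URLs are assumed to carry no query string.

```python
import json
from pathlib import Path


def slug_of(url):
    """Last path segment of a profile URL, ignoring a trailing slash."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def existing_slugs(entity_dir):
    """Collect the slugs of all profiles already saved in `entity_dir`."""
    slugs = set()
    for path in Path(entity_dir).glob("*.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            slugs.add(slug_of(data["extraction_metadata"]["linkedin_url"]))
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or unexpected file: skip it
    return slugs


def new_only(candidate_urls, entity_dir):
    """Keep only URLs whose slug has no saved profile yet."""
    seen = existing_slugs(entity_dir)
    return [u for u in candidate_urls if slug_of(u) not in seen]
```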

File Locations

Staff Files: /data/custodian/person/affiliated/parsed/
Entity Profiles: /data/custodian/person/entity/
Logs: fetch_log_YYYYMMDD_HHMMSS.txt
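The log file name pattern maps directly onto a `strftime` format. The sketch below assumes the log timestamps are in UTC like the rest of the system; `log_name` and `log_line` are hypothetical helpers, and the tab-separated line layout is an illustration only.

```python
from datetime import datetime, timezone


def log_name(now=None):
    """Build a file name of the form fetch_log_YYYYMMDD_HHMMSS.txt."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("fetch_log_%Y%m%d_%H%M%S.txt")


def log_line(url, ok, detail="", now=None):
    """One tab-separated log entry with an ISO 8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    status = "OK" if ok else "FAILED"
    return "\t".join([now.isoformat(), status, url, detail])
```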

Notes

For testing, the script limits each run to the first 50 new profiles
Existing profiles are skipped based on LinkedIn slug
Failed fetches are logged with error messages
All timestamps are in UTC (ISO 8601 format)

Troubleshooting

"ZAI_API_TOKEN not set": Export the environment variable as described under Setup
Rate limiting: The script already waits 1 second between requests
Parsing failures: Some LinkedIn profiles may be private or have unusual structures
Network errors: Failed URLs are retried the next time the script is run
