# LinkedIn Profile Fetcher

This system fetches LinkedIn profile data using the Exa API for people found in staff files, storing structured data in `/data/custodian/person/entity/`.

## Scripts

### 1. `fetch_linkedin_profiles_complete.py` ⭐ **RECOMMENDED**
Complete script that processes ALL staff directories and fetches only new profiles.

**Features:**
- Processes all 24 staff directories automatically
- Extracts LinkedIn URLs from staff JSON files
- Checks existing profiles to prevent duplicates
- Interactive: asks how many profiles to fetch
- Uses Exa API with GLM-4.6 model for profile extraction
- Threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
- Detailed logging with success/failure tracking

**Usage:**
```bash
python fetch_linkedin_profiles_complete.py
```

### 2. `fetch_linkedin_profiles_exa_final.py`
Core script that fetches LinkedIn profiles from a single staff directory.

**Features:**
- Extracts LinkedIn URLs from staff JSON files
- Uses Exa API with GLM-4.6 model for profile extraction
- Prevents duplicate entries by checking existing profiles
- Uses threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema

**Usage:**
```bash
python fetch_linkedin_profiles_exa_final.py <path_to_staff_directory>
```

### 3. `test_fetch_profiles.py`
Test script that fetches 3 sample profiles to verify the system works.

**Usage:**
```bash
python test_fetch_profiles.py
```

### 4. `test_linkedin_urls.py`
Utility script to show LinkedIn URLs that would be fetched from staff files.

**Usage:**
```bash
python test_linkedin_urls.py <path_to_staff_directory>
```

### 5. `fetch_all_linkedin_profiles.py`
Wrapper script that runs the fetcher on the main staff directory.

**Usage:**
```bash
python fetch_all_linkedin_profiles.py
```

## Data Structure

### Input (Staff Files)
Staff files should have this structure:
```json
{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug",  // Alternative field name
      ...
    }
  ]
}
```
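
As a rough illustration of how URLs could be read from a file with this structure, the sketch below walks the `staff` array, accepts either field name, and derives the slug from the URL path. The function names `slug_from_url` and `extract_linkedin_urls` are hypothetical and are not taken from the scripts themselves.

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Return the slug from a LinkedIn /in/<slug> URL, or None otherwise."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1]
    return None


def extract_linkedin_urls(staff_doc):
    """Collect (name, url) pairs from a parsed staff document (a dict)."""
    found = []
    for person in staff_doc.get("staff", []):
        # Either field name may carry the profile URL.
        url = person.get("linkedin_url") or person.get("linkedin_profile_url")
        if url and slug_from_url(url):
            found.append((person.get("name"), url))
    return found
```

Entries without a usable `/in/` URL are simply skipped here; the real scripts may log or handle them differently.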

### Output (Entity Files)
Each profile is saved as `{slug}_{timestamp}.json` in `/data/custodian/person/entity/`:

```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```
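
The naming scheme and metadata block above could be produced along the following lines. This is a minimal sketch, not the scripts' actual code: `build_record` and `save_record` are hypothetical names, the compact timestamp format in the file name is an assumption (the source only says `{timestamp}`), and hashing the LinkedIn URL to obtain `request_id` is likewise an assumption.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_record(slug, linkedin_url, profile_data, now=None):
    """Assemble an entity record in the shape shown above."""
    now = now or datetime.now(timezone.utc)
    return {
        "extraction_metadata": {
            "source_file": "staff_parsing",
            "staff_id": f"{slug}_profile",
            "extraction_date": now.isoformat(),  # UTC, ISO 8601
            "extraction_method": "exa_crawling_glm46",
            "extraction_agent": "claude-opus-4.5",
            "linkedin_url": linkedin_url,
            "cost_usd": 0,
            # Assumption: the request ID is an MD5 hash of the profile URL.
            "request_id": hashlib.md5(linkedin_url.encode("utf-8")).hexdigest(),
        },
        "profile_data": profile_data,
    }


def save_record(record, slug, out_dir, now=None):
    """Write the record to `{slug}_{timestamp}.json` and return the path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slug}_{stamp}.json"
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Passing `now` explicitly keeps the file name and `extraction_date` consistent with each other and makes the functions easy to test.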

## Setup

1. **Set ZAI_API_TOKEN environment variable:**
   ```bash
   export ZAI_API_TOKEN=your_token_here
   ```

2. **Install dependencies:**
   ```bash
   pip install httpx tqdm
   ```
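
A script can check for the token before making any requests. The helper below is a small sketch of such a check; the name `require_api_token` is hypothetical, and the error text simply mirrors the message listed under Troubleshooting.

```python
import os


def require_api_token(env=None):
    """Return the ZAI_API_TOKEN value, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("ZAI_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("ZAI_API_TOKEN not set")
    return token
```

Calling `require_api_token()` once at startup fails fast instead of producing one authentication error per profile.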

## Running

### To fetch all profiles:
```bash
cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py
```

### To fetch from a specific directory:
```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files
```

### To test the system:
```bash
python scripts/test_fetch_profiles.py
```

## Features

- **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
- **Threading**: Processes up to 3 profiles in parallel
- **Rate Limiting**: 1-second delay between API calls
- **Progress Tracking**: Shows progress bar with success/failure counts
- **Error Handling**: Graceful handling of API errors and parsing failures
- **Logging**: Saves detailed results log with timestamps
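
The threading and rate-limiting behaviour listed above can be combined roughly as follows. This is a sketch under assumptions: `RateLimiter` and `fetch_all` are hypothetical names, `fetch_one` stands in for whatever function performs the API call, and the delay is applied globally across workers, which the scripts may or may not do.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RateLimiter:
    """Space out calls across threads by at least `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def fetch_all(urls, fetch_one, max_workers=3, min_interval=1.0):
    """Run fetch_one(url) in a small thread pool; return (results, errors)."""
    limiter = RateLimiter(min_interval)
    results, errors = {}, {}

    def task(url):
        limiter.wait()
        return fetch_one(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = str(exc)
    return results, errors
```

Collecting failures in a separate dictionary keeps one bad profile from aborting the run and gives the log something concrete to report.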

## File Locations

- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`

## Notes

- The script fetches at most the first 50 new profiles per run; this limit is in place for testing
- Existing profiles are skipped based on LinkedIn slug
- Failed fetches are logged with error messages
- All timestamps are in UTC (ISO 8601 format)
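
The skip-existing and 50-per-run rules above might look like this in code. It is only a sketch: `has_profile` and `select_new` are hypothetical names, and it assumes an existing profile can be recognised by a file named `{slug}_*.json` in the entity directory.

```python
import glob
from pathlib import Path


def has_profile(slug, entity_dir):
    """True if any `{slug}_*.json` file exists in entity_dir."""
    # Escape the slug so characters such as `[` are matched literally.
    pattern = f"{glob.escape(slug)}_*.json"
    return any(Path(entity_dir).glob(pattern))


def select_new(candidates, entity_dir, limit=50):
    """Keep (slug, url) pairs without a saved profile, up to `limit`."""
    selected, seen = [], set()
    for slug, url in candidates:
        # Skip repeats within this run as well as slugs already on disk.
        if slug in seen or has_profile(slug, entity_dir):
            continue
        seen.add(slug)
        selected.append((slug, url))
        if len(selected) >= limit:
            break
    return selected
```

Note that a prefix match like this would treat `john` as already fetched if a file for `john_doe` exists; comparing against the `linkedin_url` stored inside each file would avoid that at the cost of reading every file.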

## Troubleshooting

1. **"ZAI_API_TOKEN not set"**: Export the `ZAI_API_TOKEN` environment variable as described under Setup
2. **Rate limiting**: The script already waits 1 second between requests
3. **Parsing failures**: Some LinkedIn profiles are private or have unusual structures and may fail to parse
4. **Network errors**: Run the script again; URLs that failed are retried on the next run