# LinkedIn Profile Fetcher

This system fetches LinkedIn profile data using the Exa API for people found in staff files and stores the structured data in `/data/custodian/person/entity/`.

## Scripts

### 1. `fetch_linkedin_profiles_complete.py` ⭐ **RECOMMENDED**

Complete script that processes ALL staff directories and fetches only new profiles.

**Features:**
- Processes all 24 staff directories automatically
- Extracts LinkedIn URLs from staff JSON files
- Checks existing profiles to prevent duplicates
- Interactive: asks how many profiles to fetch
- Uses Exa API with GLM-4.6 model for profile extraction
- Threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema
- Detailed logging with success/failure tracking

**Usage:**
```bash
python fetch_linkedin_profiles_complete.py
```

### 2. `fetch_linkedin_profiles_exa_final.py`

Core script that fetches LinkedIn profiles from a single staff directory.

**Features:**
- Extracts LinkedIn URLs from staff JSON files
- Uses Exa API with GLM-4.6 model for profile extraction
- Prevents duplicate entries by checking existing profiles
- Uses threading for parallel processing (3 workers)
- Rate limiting (1 second delay between requests)
- Structured JSON output following the project schema

**Usage:**
```bash
python fetch_linkedin_profiles_exa_final.py
```

### 3. `test_linkedin_urls.py`

Utility script that shows the LinkedIn URLs that would be fetched from staff files.

**Usage:**
```bash
python test_linkedin_urls.py
```

### 4. `fetch_all_linkedin_profiles.py`

Wrapper script that runs the fetcher on the main staff directory.

**Usage:**
```bash
python fetch_all_linkedin_profiles.py
```

### 5. `test_fetch_profiles.py`

Test script that fetches 3 sample profiles to verify the system works.
**Usage:**
```bash
python test_fetch_profiles.py
```

## Data Structure

### Input (Staff Files)

Staff files should have this structure:

```json
{
  "staff": [
    {
      "name": "Person Name",
      "linkedin_url": "https://www.linkedin.com/in/slug",
      "linkedin_profile_url": "https://www.linkedin.com/in/slug",  // Alternative field name
      ...
    }
  ]
}
```

### Output (Entity Files)

Each profile is saved as `{slug}_{timestamp}.json` in `/data/custodian/person/entity/`:

```json
{
  "extraction_metadata": {
    "source_file": "staff_parsing",
    "staff_id": "{slug}_profile",
    "extraction_date": "2025-12-11T...",
    "extraction_method": "exa_crawling_glm46",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "cost_usd": 0,
    "request_id": "md5_hash"
  },
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/slug",
    "headline": "Current Position",
    "location": "City, Country",
    "connections": "500+ connections",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...],
    "languages": [...],
    "profile_image_url": "https://..."
  }
}
```

## Setup

1. **Set the `ZAI_API_TOKEN` environment variable:**
   ```bash
   export ZAI_API_TOKEN=your_token_here
   ```

2. **Install dependencies:**
   ```bash
   pip install httpx tqdm
   ```

## Running

### To fetch all profiles:
```bash
cd /Users/kempersc/apps/glam
python scripts/fetch_all_linkedin_profiles.py
```

### To fetch from a specific directory:
```bash
python scripts/fetch_linkedin_profiles_exa_final.py /path/to/staff/files
```

### To test the system:
```bash
python scripts/test_fetch_profiles.py
```

## Features

- **Duplicate Prevention**: Checks existing profiles by LinkedIn slug
- **Threading**: Processes up to 3 profiles in parallel
- **Rate Limiting**: 1-second delay between API calls
- **Progress Tracking**: Shows a progress bar with success/failure counts
- **Error Handling**: Handles API errors and parsing failures gracefully
- **Logging**: Saves a detailed results log with timestamps

## File Locations

- **Staff Files**: `/data/custodian/person/affiliated/parsed/`
- **Entity Profiles**: `/data/custodian/person/entity/`
- **Logs**: `fetch_log_YYYYMMDD_HHMMSS.txt`

## Notes

- For testing, the script fetches at most the first 50 new profiles per run
- Existing profiles are skipped based on LinkedIn slug
- Failed fetches are logged with error messages
- All timestamps are in UTC (ISO 8601 format)

## Troubleshooting

1. **"ZAI_API_TOKEN not set"**: Set the environment variable as described in Setup
2. **Rate limiting**: The script already waits 1 second between requests
3. **Parsing failures**: Some LinkedIn profiles may be private or have unusual structures
4. **Network errors**: Running the script again retries the URLs that failed
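
## Sketch: Duplicate Prevention

The slug-based duplicate check described under Features and Notes could look roughly like the sketch below. It is not the code from the scripts: the function names `extract_slug`, `existing_slugs`, and `new_urls` are hypothetical, and it assumes the slug is recovered from the `extraction_metadata.linkedin_url` field of each saved entity file rather than from the file name.

```python
import json
import re
from pathlib import Path
from typing import List, Optional, Set

# Assumption: profile URLs look like https://www.linkedin.com/in/<slug>
_SLUG_RE = re.compile(r"linkedin\.com/in/([^/?#\s]+)")


def extract_slug(url: str) -> Optional[str]:
    """Return the lowercased profile slug, or None if the URL has none."""
    match = _SLUG_RE.search(url)
    return match.group(1).lower() if match else None


def existing_slugs(entity_dir: Path) -> Set[str]:
    """Collect the slugs of profiles already saved in entity_dir."""
    slugs: Set[str] = set()
    for path in entity_dir.glob("*.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed files are ignored
        if not isinstance(data, dict):
            continue
        url = data.get("extraction_metadata", {}).get("linkedin_url", "")
        slug = extract_slug(url)
        if slug:
            slugs.add(slug)
    return slugs


def new_urls(staff_file: Path, known: Set[str]) -> List[str]:
    """Return LinkedIn URLs from a staff file whose slug is not yet known."""
    data = json.loads(staff_file.read_text(encoding="utf-8"))
    seen = set(known)  # copy, so the caller's set is left untouched
    fresh: List[str] = []
    for person in data.get("staff", []):
        # Either field name from the input structure may carry the URL.
        url = person.get("linkedin_url") or person.get("linkedin_profile_url")
        slug = extract_slug(url) if url else None
        if slug and slug not in seen:
            seen.add(slug)  # also drops repeats within the same file
            fresh.append(url)
    return fresh
```

A caller would build the known set once with `existing_slugs(Path("/data/custodian/person/entity/"))` and pass it to `new_urls` for each staff file. Reading the slug from the stored metadata avoids guessing the timestamp format in the file name.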
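
## Sketch: Parallel Fetching with Rate Limiting

The combination of 3 worker threads and a 1-second delay between requests could be arranged as in the sketch below. It is an illustration under assumptions, not the scripts' actual code: `RateLimiter`, `fetch_all`, and the `fetch_one` callable are hypothetical names, and the delay is assumed to apply globally across all threads rather than per thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Optional, Tuple

Result = Tuple[str, Optional[Any], Optional[str]]  # (url, data, error)


class RateLimiter:
    """Spaces calls at least min_interval seconds apart across threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.min_interval
        if start > now:
            time.sleep(start - now)


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], Any],
    workers: int = 3,
    delay: float = 1.0,
) -> List[Result]:
    """Fetch every URL with a small thread pool and a shared rate limit.

    fetch_one is a hypothetical callable that performs the API request
    for one URL; failures are captured per URL instead of aborting the run.
    """
    limiter = RateLimiter(delay)

    def task(url: str) -> Result:
        limiter.wait()
        try:
            return url, fetch_one(url), None
        except Exception as exc:  # record the error and keep going
            return url, None, str(exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(task, urls))
```

Each result tuple carries either the fetched data or an error message, which maps onto the success/failure counts in the log. URLs with an error have no saved profile, so a later run would pick them up again.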