# Person Entity Profile Format Rule

## Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

**🚨 CRITICAL: Person entity profiles stored in `data/custodian/person/entity/` MUST be properly structured JSON, NOT raw content dumps from Exa API responses.**

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.

---

## The Problem: Raw Content Dumps

Exa API returns LinkedIn profile data as unstructured markdown text. Simply dumping this raw content into a JSON file creates:

- ❌ **Unparseable data** - No structured fields to query
- ❌ **Inconsistent format** - Each profile has different structure
- ❌ **Missing metadata** - No extraction provenance tracking
- ❌ **Difficult processing** - Cannot programmatically extract career history
- ❌ **Wasted effort** - Re-extraction required for structured data

---

## Required JSON Structure

### Top-Level Structure

```json
{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}
```

### extraction_metadata (REQUIRED)

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}
```

**Required Fields**:

| Field | Type | Description |
|-------|------|-------------|
| `source_file` | string | Path to source staff list file |
| `staff_id` | string | Unique staff identifier from source |
| `extraction_date` | string | ISO 8601 timestamp |
| `extraction_method` | string | `exa_contents`, `exa_crawling_exa`, or `manual` |
| `extraction_agent` | string | **MUST be `claude-opus-4.5`** |
| `linkedin_url` | string | Full LinkedIn profile URL |
| `cost_usd` | number | Exa API cost (0 for contents endpoint) |
| `request_id` | string | Exa request ID for tracing |

### profile_data (REQUIRED)

```json
{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

**Required profile_data Fields**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | YES | Full name from profile |
| `linkedin_url` | string | YES | Profile URL |
| `headline` | string | YES | Professional headline |
| `location` | string | YES | Geographic location |
| `connections` | string | NO | Connection/follower count string |
| `about` | string | NO | About section text |
| `experience` | array | YES | Array of experience objects |
| `education` | array | NO | Array of education objects |
| `skills` | array | NO | Array of skill strings |
| `languages` | array | NO | Array of language objects |
| `profile_image_url` | string | NO | LinkedIn profile photo URL |

### experience[] Object Structure

```json
{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}
```

### education[] Object Structure

```json
{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}
```

### languages[] Object Structure

```json
{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}
```

---

## ❌ WRONG: Raw Content Dump

```json
{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}
```

This is **UNACCEPTABLE** because:

- No structured fields
- No extraction metadata
- Cannot query programmatically
- No provenance tracking

---

## ✅ CORRECT: Structured Format

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}
```

---

## Formatting Script

Use the provided formatting script to convert raw Exa dumps to proper format:

```bash
python scripts/format_linkedin_profile.py
```

This script:

1. Identifies raw content dump files in `data/custodian/person/entity/`
2. Parses the raw markdown content
3. Extracts structured fields (name, headline, experience, etc.)
4. Creates proper `extraction_metadata` block
5. Overwrites files with structured format

---

## Extraction Agent Requirement

**The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.**

This ensures:

- Consistent provenance tracking
- Clear audit trail of which model performed extraction
- Reproducibility of extraction process

---

## Validation

Before committing person entity files, verify:

1. **JSON is valid**: `python -c "import json; json.load(open('file.json'))"`
2. **Required fields present**: `extraction_metadata`, `profile_data`, `name`, `linkedin_url`, `experience`
3. **extraction_agent is correct**: Must be `"claude-opus-4.5"`
4. **No raw content dumps**: Check that `content` field does not exist at top level

---

## Migration: Converting Raw Dumps

If you find raw content dump files:

1. Run formatting script: `python scripts/format_linkedin_profile.py`
2. Verify output is properly structured
3. Check for any remaining raw files
4. Report files that couldn't be automatically converted

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |

---

## Related Rules

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa tool usage guidelines
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern for custodian files
- `AGENTS.md` - Rule 20: Person Entity Profiles - Individual File Storage
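
---

The four checks listed under Validation could be automated along the following lines. This is a minimal sketch, not part of the repository: the function names `validate_profile` and `validate_file` and the constants are hypothetical, and the required-field list simply mirrors the one given in the Validation section.

```python
import json

# Hypothetical constants mirroring the Validation section of this rule.
REQUIRED_TOP_LEVEL = ("extraction_metadata", "profile_data")
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "experience")
EXPECTED_AGENT = "claude-opus-4.5"


def validate_profile(profile) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    if not isinstance(profile, dict):
        return ["top level is not a JSON object"]

    problems = []

    # Check 4: a top-level "content" field marks a raw content dump.
    if "content" in profile:
        problems.append("raw content dump: top-level 'content' field present")

    # Check 2 (part 1): both top-level blocks must exist and be objects.
    for key in REQUIRED_TOP_LEVEL:
        if not isinstance(profile.get(key), dict):
            problems.append(f"missing or non-object block: {key}")

    # Check 2 (part 2): required fields inside profile_data.
    profile_data = profile.get("profile_data")
    if isinstance(profile_data, dict):
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile_data:
                problems.append(f"missing profile_data field: {field}")

    # Check 3: extraction_agent must match the mandated value.
    metadata = profile.get("extraction_metadata")
    if isinstance(metadata, dict):
        agent = metadata.get("extraction_agent")
        if agent != EXPECTED_AGENT:
            problems.append(f"extraction_agent is {agent!r}, expected {EXPECTED_AGENT!r}")

    return problems


def validate_file(path) -> list[str]:
    """Check 1: load the file as JSON, then run the structural checks."""
    try:
        with open(path, encoding="utf-8") as handle:
            profile = json.load(handle)
    except (OSError, ValueError) as exc:  # json.JSONDecodeError is a ValueError
        return [f"cannot load JSON: {exc}"]
    return validate_profile(profile)
```

Calling `validate_file` on each `*.json` file under `data/custodian/person/entity/` and reporting every non-empty result would be one way to run these checks before a commit. Returning a list of problems rather than raising on the first failure lets a single pass report everything wrong with a file.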