glam/.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md

# Person Entity Profile Format Rule

## Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

**🚨 CRITICAL: Person entity profiles stored in `data/custodian/person/entity/` MUST be properly structured JSON, NOT raw content dumps from Exa API responses.**

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.

---

## The Problem: Raw Content Dumps

The Exa API returns LinkedIn profile data as unstructured markdown text. Dumping this raw content into a JSON file unchanged creates:

- ❌ **Unparseable data** - No structured fields to query
- ❌ **Inconsistent format** - Each profile has different structure
- ❌ **Missing metadata** - No extraction provenance tracking
- ❌ **Difficult processing** - Cannot programmatically extract career history
- ❌ **Wasted effort** - Re-extraction required for structured data

---

## Required JSON Structure

### Top-Level Structure

```json
{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}
```

### extraction_metadata (REQUIRED)

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}
```

**Required Fields**:

| Field | Type | Description |
|-------|------|-------------|
| `source_file` | string | Path to source staff list file |
| `staff_id` | string | Unique staff identifier from source |
| `extraction_date` | string | ISO 8601 timestamp in UTC, e.g. `2025-12-10T16:00:00Z` |
| `extraction_method` | string | `exa_contents`, `exa_crawling_exa`, or `manual` |
| `extraction_agent` | string | **MUST be `claude-opus-4.5`** |
| `linkedin_url` | string | Full LinkedIn profile URL |
| `cost_usd` | number | Exa API cost (0 for contents endpoint) |
| `request_id` | string | Exa request ID for tracing |

### profile_data (REQUIRED)

```json
{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

**profile_data Fields** (the Required column marks which are mandatory):

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | YES | Full name from profile |
| `linkedin_url` | string | YES | Profile URL |
| `headline` | string | YES | Professional headline |
| `location` | string | YES | Geographic location |
| `connections` | string | NO | Connection/follower count string |
| `about` | string | NO | About section text |
| `summary` | string | NO | AI-generated summary if `about` is truncated |
| `experience` | array | YES | Array of experience objects |
| `education` | array | NO | Array of education objects |
| `skills` | array | NO | Array of skill strings |
| `languages` | array | NO | Array of language objects |
| `profile_image_url` | string | NO | LinkedIn profile photo URL |
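The field table can be turned into a mechanical check. The sketch below, with the hypothetical names `PROFILE_FIELDS` and `check_profile_data`, maps each field from the example and table above to an expected Python type and a required flag, then reports missing or mistyped fields. It does not look inside the arrays.

```python
# Field name -> (expected Python type, required?), taken from the example and table above
PROFILE_FIELDS = {
    "name": (str, True),
    "linkedin_url": (str, True),
    "headline": (str, True),
    "location": (str, True),
    "connections": (str, False),
    "about": (str, False),
    "summary": (str, False),
    "experience": (list, True),
    "education": (list, False),
    "skills": (list, False),
    "languages": (list, False),
    "profile_image_url": (str, False),
}


def check_profile_data(profile_data):
    """Return a list of human-readable problems; an empty list means the block passes."""
    problems = []
    for field, (expected_type, required) in PROFILE_FIELDS.items():
        if field not in profile_data:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(profile_data[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems
```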

### experience[] Object Structure

```json
{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}
```
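The `duration` value packs a date range and an elapsed time into one string separated by `•`. If downstream code needs the parts separately, they can be split as in this sketch. The helper name `split_duration` and the returned keys are assumptions for illustration; the rule itself only requires the combined string.

```python
def split_duration(duration):
    """Split 'Start - End • X years Y months' into its parts.

    Returns a dict with 'start', 'end' and 'length'; parts that are
    absent from the input come back as None.
    """
    date_range, _, length = duration.partition("•")
    start, sep, end = date_range.partition(" - ")
    return {
        "start": start.strip() or None,
        "end": (end.strip() or None) if sep else None,
        "length": length.strip() or None,
    }
```

The same function works for the `duration` field of `education[]` objects, which uses the same separator layout.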

### education[] Object Structure

```json
{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}
```

### languages[] Object Structure

```json
{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}
```

---

## ❌ WRONG: Raw Content Dump

```json
{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}
```

This is **UNACCEPTABLE** because:
- No structured fields
- No extraction metadata
- Cannot query programmatically
- No provenance tracking

---

## ✅ CORRECT: Structured Format

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}
```

---

## Formatting Script

Use the provided formatting script to convert raw Exa dumps to the structured format described above:

```bash
python scripts/format_linkedin_profile.py
```

This script:
1. Identifies raw content dump files in `data/custodian/person/entity/`
2. Parses the raw markdown content
3. Extracts structured fields (name, headline, experience, etc.)
4. Creates proper `extraction_metadata` block
5. Overwrites each file with the structured format
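The internals of `scripts/format_linkedin_profile.py` are not reproduced in this rule. To illustrate steps 2 and 3 only, here is a rough sketch that assumes the raw markdown follows the layout of the raw-dump example in this document: an `# ` heading with the name, then a headline line, then a location line, then `## ` sections. The function `parse_raw_profile` and the intermediate `raw_sections` key are hypothetical, and real Exa output may not match this layout.

```python
def parse_raw_profile(content):
    """Extract name, headline, location and '## ' sections from raw markdown.

    Illustrative only: assumes the header lines appear in the order
    name, headline, location before the first '## ' section.
    """
    header, sections, current = [], {}, None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("## "):
            # Start a new section keyed by its lowercased title
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        else:
            header.append(line)
    name = header[0].lstrip("# ").strip() if header else ""
    return {
        "name": name,
        "headline": header[1] if len(header) > 1 else "",
        "location": header[2] if len(header) > 2 else "",
        "about": "\n".join(sections.get("about", [])),
        # Unparsed section lines, kept for later steps such as building experience[]
        "raw_sections": sections,
    }
```

Turning the lines under `experience` into `experience[]` objects needs further parsing that depends on how each profile is laid out, so it is left out of this sketch.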

---

## Extraction Agent Requirement

**The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.**

This ensures:
- Consistent provenance tracking
- Clear audit trail of which model performed extraction
- Reproducibility of extraction process

---

## Validation

Before committing person entity files, verify:

1. **JSON is valid**: `python -c "import json; json.load(open('file.json'))"`
2. **Required fields present**: `extraction_metadata`, `profile_data`, and within `profile_data`: `name`, `linkedin_url`, `headline`, `location`, `experience`
3. **extraction_agent is correct**: Must be `"claude-opus-4.5"`
4. **No raw content dumps**: Check that `content` field does not exist at top level
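The four checks can be run together. The following sketch takes a file's text and returns a list of problems; `validate_entity_text` is a hypothetical name, not an existing repository tool. For required fields it checks every `profile_data` field marked YES in the field table.

```python
import json

REQUIRED_AGENT = "claude-opus-4.5"
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "headline", "location", "experience")


def validate_entity_text(text):
    """Run the checks above on a file's text; return a list of problems."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if "content" in doc:
        problems.append("raw content dump: top-level 'content' field present")
    meta = doc.get("extraction_metadata")
    profile = doc.get("profile_data")
    if not isinstance(meta, dict):
        problems.append("missing extraction_metadata")
    elif meta.get("extraction_agent") != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be '{REQUIRED_AGENT}'")
    if not isinstance(profile, dict):
        problems.append("missing profile_data")
    else:
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"missing profile_data.{field}")
    return problems
```

A caller might read each candidate file with `Path(path).read_text(encoding="utf-8")` and refuse to commit when the returned list is non-empty.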

---

## Migration: Converting Raw Dumps

If you find raw content dump files:

1. Run formatting script: `python scripts/format_linkedin_profile.py`
2. Verify output is properly structured
3. Check for any remaining raw files
4. Report files that couldn't be automatically converted
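Steps 3 and 4 can be supported by a small directory scan. This sketch assumes entity files are `*.json` files directly inside the directory passed in, and it treats a file as raw when its top-level object has a `content` key or lacks `profile_data`. The function name `find_raw_dumps` is hypothetical.

```python
import json
from pathlib import Path


def find_raw_dumps(entity_dir):
    """Return (raw, unreadable) lists of paths found under entity_dir.

    raw: files that parse as JSON but still look like content dumps.
    unreadable: files that could not be read or parsed at all.
    """
    raw, unreadable = [], []
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # ValueError covers both JSON and text-decoding failures
            unreadable.append(path)
            continue
        if not isinstance(doc, dict) or "content" in doc or "profile_data" not in doc:
            raw.append(path)
    return raw, unreadable
```

Calling `find_raw_dumps("data/custodian/person/entity/")` after the formatting script has run gives two lists to include in the report: files still in raw form and files that need manual attention.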

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |

---

## Related Rules

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa tool usage guidelines
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern for custodian files
- `AGENTS.md` - Rule 20: Person Entity Profiles - Individual File Storage