glam/.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md
2025-12-11 22:32:09 +01:00

# Person Entity Profile Format Rule
## Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

**🚨 CRITICAL: Person entity profiles stored in `data/custodian/person/entity/` MUST be properly structured JSON, NOT raw content dumps from Exa API responses.**

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.

---
## The Problem: Raw Content Dumps
The Exa API returns LinkedIn profile data as unstructured markdown text. Simply dumping this raw content into a JSON file creates:

- **Unparseable data** - No structured fields to query
- **Inconsistent format** - Each profile has a different structure
- **Missing metadata** - No extraction provenance tracking
- **Difficult processing** - Cannot programmatically extract career history
- **Wasted effort** - Re-extraction required for structured data

---
## Required JSON Structure
### Top-Level Structure
```json
{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}
```
### extraction_metadata (REQUIRED)
```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}
```
**Required Fields**:
| Field | Type | Description |
|-------|------|-------------|
| `source_file` | string | Path to source staff list file |
| `staff_id` | string | Unique staff identifier from source |
| `extraction_date` | string | ISO 8601 timestamp |
| `extraction_method` | string | `exa_contents`, `exa_crawling_exa`, or `manual` |
| `extraction_agent` | string | **MUST be `claude-opus-4.5`** |
| `linkedin_url` | string | Full LinkedIn profile URL |
| `cost_usd` | number | Exa API cost (0 for contents endpoint) |
| `request_id` | string | Exa request ID for tracing |
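
As a minimal sketch, the metadata block could be assembled with a small helper like the one below. The function name `build_extraction_metadata` and its parameters are hypothetical and not part of the repository; only the field names and allowed values come from the table above.

```python
from datetime import datetime, timezone

# Allowed values for `extraction_method`, taken from the table above.
ALLOWED_METHODS = {"exa_contents", "exa_crawling_exa", "manual"}
# The rule fixes the agent value; it is not configurable.
EXTRACTION_AGENT = "claude-opus-4.5"


def build_extraction_metadata(source_file, staff_id, linkedin_url, request_id,
                              method="exa_contents", cost_usd=0, now=None):
    """Hypothetical helper: return a dict with all eight required fields."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown extraction_method: {method!r}")
    now = now or datetime.now(timezone.utc)
    return {
        "source_file": source_file,
        "staff_id": staff_id,
        # ISO 8601 in UTC with a trailing "Z", matching the examples in this rule.
        "extraction_date": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "extraction_method": method,
        "extraction_agent": EXTRACTION_AGENT,
        "linkedin_url": linkedin_url,
        "cost_usd": cost_usd,
        "request_id": request_id,
    }
```

Keeping the agent value in one constant means a caller cannot accidentally write a different model name into the file.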
### profile_data (REQUIRED)
```json
{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```
**Required profile_data Fields**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | YES | Full name from profile |
| `linkedin_url` | string | YES | Profile URL |
| `headline` | string | YES | Professional headline |
| `location` | string | YES | Geographic location |
| `connections` | string | NO | Connection/follower count string |
| `about` | string | NO | About section text |
| `summary` | string | NO | AI-generated summary, used if `about` is truncated |
| `experience` | array | YES | Array of experience objects |
| `education` | array | NO | Array of education objects |
| `skills` | array | NO | Array of skill strings |
| `languages` | array | NO | Array of language objects |
| `profile_image_url` | string | NO | LinkedIn profile photo URL |
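
The field requirements can be checked mechanically. The sketch below is one possible approach, not an existing repository utility: the name `check_profile_data` is hypothetical, and the mapping from JSON types to Python types (string to `str`, array to `list`) is an assumption based on how the standard `json` module decodes values. The field lists come from the table and the sample above.

```python
# Required profile_data fields and their expected Python types.
REQUIRED_PROFILE_FIELDS = {
    "name": str,
    "linkedin_url": str,
    "headline": str,
    "location": str,
    "experience": list,
}
# Optional fields are only type-checked when they are present.
OPTIONAL_PROFILE_FIELDS = {
    "connections": str,
    "about": str,
    "summary": str,
    "education": list,
    "skills": list,
    "languages": list,
    "profile_image_url": str,
}


def check_profile_data(profile_data):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected in REQUIRED_PROFILE_FIELDS.items():
        if field not in profile_data:
            problems.append(f"missing required field: {field}")
        elif not isinstance(profile_data[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field, expected in OPTIONAL_PROFILE_FIELDS.items():
        if field in profile_data and not isinstance(profile_data[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file in a single pass.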
### experience[] Object Structure
```json
{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}
```
### education[] Object Structure
```json
{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}
```
### languages[] Object Structure
```json
{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}
```
---
## ❌ WRONG: Raw Content Dump
```json
{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}
```
This is **UNACCEPTABLE** because:
- No structured fields
- No extraction metadata
- Cannot query programmatically
- No provenance tracking
---
## ✅ CORRECT: Structured Format
```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}
```
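
To illustrate what the structured format makes possible, the sketch below pulls career history out of a profile with ordinary dictionary access. The function name `career_history` is hypothetical, and the inline sample is a trimmed copy of the example above rather than a real file.

```python
import json


def career_history(profile):
    """Return (title, company, duration) tuples from a structured profile dict."""
    experience = profile.get("profile_data", {}).get("experience", [])
    return [(job.get("title"), job.get("company"), job.get("duration"))
            for job in experience]


# Trimmed copy of the structured example above.
sample = json.loads("""
{
  "profile_data": {
    "name": "John Smith",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years"
      }
    ]
  }
}
""")

history = career_history(sample)
```

The same query against a raw content dump would require re-parsing markdown text for every profile.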
---
## Formatting Script
Use the provided formatting script to convert raw Exa dumps to proper format:
```bash
python scripts/format_linkedin_profile.py
```
This script:
1. Identifies raw content dump files in `data/custodian/person/entity/`
2. Parses the raw markdown content
3. Extracts structured fields (name, headline, experience, etc.)
4. Creates proper `extraction_metadata` block
5. Overwrites files with structured format
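
The script itself is not reproduced in this rule. As a rough illustration of steps 2 and 3 only, a parser for markdown shaped like the raw dump example in this document might look like the sketch below. The function name `parse_raw_profile`, the assumed layout (an H1 with the name, then headline and location lines, then `## ` sections), and the omission of experience parsing are all assumptions; the actual `scripts/format_linkedin_profile.py` may work differently.

```python
def parse_raw_profile(content):
    """Hypothetical sketch: split raw markdown into a few structured fields.

    Assumes an H1 holding the name, two plain lines (headline, location),
    then `## Section` blocks such as `## About`.
    """
    name = ""
    preamble = []   # plain lines before the first `## ` heading
    sections = {}   # section title -> list of body lines
    current = None
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("# ") and not name:
            name = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is None:
            preamble.append(line)
        else:
            sections[current].append(line)
    return {
        "name": name,
        "headline": preamble[0] if len(preamble) > 0 else "",
        "location": preamble[1] if len(preamble) > 1 else "",
        "about": " ".join(sections.get("About", [])),
        # Experience parsing is left out of this sketch; entries vary too
        # much in real profiles to handle in a few lines.
        "experience": [],
    }
```

A result like this would still need the `extraction_metadata` block (step 4) before it satisfies the rule.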
---
## Extraction Agent Requirement
**The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.**
This ensures:
- Consistent provenance tracking
- Clear audit trail of which model performed extraction
- Reproducibility of extraction process
---
## Validation
Before committing person entity files, verify:
1. **JSON is valid**: `python -c "import json; json.load(open('file.json'))"`
2. **Required fields present**: `extraction_metadata` and `profile_data` at the top level; `name`, `linkedin_url`, `headline`, `location`, and `experience` inside `profile_data`
3. **extraction_agent is correct**: Must be `"claude-opus-4.5"`
4. **No raw content dumps**: Check that `content` field does not exist at top level
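
The four checks above can be combined into one function. This is a minimal sketch, assuming the required `profile_data` fields are those marked YES in the field table earlier in this rule; the name `validate_profile_file` is hypothetical and does not refer to an existing script.

```python
import json

REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "headline", "location", "experience")


def validate_profile_file(path):
    """Return a list of problems for one person entity file; empty means OK."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        return [f"cannot load JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    problems = []
    if "content" in data:
        problems.append("top-level `content` field present (raw content dump)")

    metadata = data.get("extraction_metadata")
    if not isinstance(metadata, dict):
        problems.append("missing `extraction_metadata`")
    elif metadata.get("extraction_agent") != "claude-opus-4.5":
        problems.append("`extraction_agent` must be \"claude-opus-4.5\"")

    profile = data.get("profile_data")
    if not isinstance(profile, dict):
        problems.append("missing `profile_data`")
    else:
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"missing `profile_data.{field}`")
    return problems
```

Calling `validate_profile_file("file.json")` on a compliant file returns an empty list, so the result can be used directly as a pass/fail signal before committing.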
---
## Migration: Converting Raw Dumps
If you find raw content dump files:
1. Run formatting script: `python scripts/format_linkedin_profile.py`
2. Verify output is properly structured
3. Check for any remaining raw files
4. Report files that couldn't be automatically converted
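
For step 3, a short scan can list the files that still need attention. The sketch below assumes profiles are stored as `*.json` files directly inside `data/custodian/person/entity/`; the function name `find_raw_dumps` and the heuristic (a top-level `content` key, or no `profile_data` key) are assumptions, not part of the formatting script.

```python
import json
from pathlib import Path


def find_raw_dumps(entity_dir="data/custodian/person/entity"):
    """Return sorted paths of JSON files that look unconverted or unparseable."""
    remaining = []
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            # Files that are not valid JSON also need manual attention.
            remaining.append(path)
            continue
        if not isinstance(data, dict) or "content" in data or "profile_data" not in data:
            remaining.append(path)
    return remaining
```

An empty result suggests the migration is complete; any paths returned are candidates for the report in step 4.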
---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |
---
## Related Rules
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa tool usage guidelines
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern for custodian files
- `AGENTS.md` - Rule 20: Person Entity Profiles - Individual File Storage