Person Entity Profile Format Rule
Rule: ALL Person Entity Profiles MUST Use Structured JSON Format
🚨 CRITICAL: Person entity profiles stored in data/custodian/person/entity/ MUST be properly structured JSON, NOT raw content dumps from Exa API responses.
This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.
The Problem: Raw Content Dumps
Exa API returns LinkedIn profile data as unstructured markdown text. Simply dumping this raw content into a JSON file creates:
- ❌ Unparseable data - No structured fields to query
- ❌ Inconsistent format - Each profile has different structure
- ❌ Missing metadata - No extraction provenance tracking
- ❌ Difficult processing - Cannot programmatically extract career history
- ❌ Wasted effort - Re-extraction required for structured data
Required JSON Structure
Top-Level Structure
{
"extraction_metadata": { ... },
"profile_data": { ... }
}
extraction_metadata (REQUIRED)
{
"extraction_metadata": {
"source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
"staff_id": "{custodian}_staff_{index}_{name_slug}",
"extraction_date": "2025-12-10T16:00:00Z",
"extraction_method": "exa_contents",
"extraction_agent": "claude-opus-4.5",
"linkedin_url": "https://www.linkedin.com/in/{slug}",
"cost_usd": 0,
"request_id": "{exa_request_id}"
}
}
Required Fields:
| Field | Type | Description |
|---|---|---|
| source_file | string | Path to source staff list file |
| staff_id | string | Unique staff identifier from source |
| extraction_date | string | ISO 8601 timestamp |
| extraction_method | string | exa_contents, exa_crawling_exa, or manual |
| extraction_agent | string | MUST be claude-opus-4.5 |
| linkedin_url | string | Full LinkedIn profile URL |
| cost_usd | number | Exa API cost (0 for contents endpoint) |
| request_id | string | Exa request ID for tracing |
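A minimal sketch of assembling this block in Python, so that all eight required fields are always present and the timestamp is ISO 8601 in UTC. The helper name `build_extraction_metadata` and its parameters are hypothetical and not part of the existing scripts.

```python
from datetime import datetime, timezone

# Values listed for extraction_method in the table above.
ALLOWED_METHODS = {"exa_contents", "exa_crawling_exa", "manual"}


def build_extraction_metadata(source_file, staff_id, linkedin_url, request_id,
                              extraction_method="exa_contents", cost_usd=0):
    """Return an extraction_metadata dict with all eight required fields."""
    if extraction_method not in ALLOWED_METHODS:
        raise ValueError(f"unknown extraction_method: {extraction_method!r}")
    return {
        "source_file": source_file,
        "staff_id": staff_id,
        # ISO 8601 UTC timestamp with a trailing "Z", as in the examples
        "extraction_date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "extraction_method": extraction_method,
        "extraction_agent": "claude-opus-4.5",
        "linkedin_url": linkedin_url,
        "cost_usd": cost_usd,
        "request_id": request_id,
    }
```

Rejecting unknown `extraction_method` values early keeps typos out of the provenance record.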
profile_data (REQUIRED)
{
"profile_data": {
"name": "Full Name",
"linkedin_url": "https://www.linkedin.com/in/{slug}",
"headline": "Current professional headline",
"location": "City, Region, Country",
"connections": "500 connections • 2,000 followers",
"about": "Full about section text or summary",
"summary": "AI-generated summary if about is truncated",
"experience": [ ... ],
"education": [ ... ],
"skills": [ ... ],
"languages": [ ... ],
"profile_image_url": "https://media.licdn.com/..."
}
}
profile_data Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | YES | Full name from profile |
| linkedin_url | string | YES | Profile URL |
| headline | string | YES | Professional headline |
| location | string | YES | Geographic location |
| connections | string | NO | Connection/follower count string |
| about | string | NO | About section text |
| summary | string | NO | AI-generated summary if about is truncated |
| experience | array | YES | Array of experience objects |
| education | array | NO | Array of education objects |
| skills | array | NO | Array of skill strings |
| languages | array | NO | Array of language objects |
| profile_image_url | string | NO | LinkedIn profile photo URL |
experience[] Object Structure
{
"title": "Job Title",
"company": "Company Name",
"duration": "Start Date - End Date • X years Y months",
"location": "City, Country",
"company_details": "Company: size • Founded year • Type • Industry",
"department": "Department • Level: Level",
"description": "Role description if provided"
}
education[] Object Structure
{
"degree": "Degree Title and Field",
"institution": "Institution Name",
"duration": "Start - End • X years"
}
languages[] Object Structure
{
"language": "Language Name",
"proficiency": "Native or bilingual proficiency"
}
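The three nested object shapes above can be written down as Python `TypedDict` classes, which lets a type checker flag misspelled keys. This is a sketch only: the class names are hypothetical, and because this rule does not state which keys inside these objects are mandatory, every key is treated as optional (`total=False`).

```python
from typing import TypedDict


class Experience(TypedDict, total=False):
    title: str
    company: str
    duration: str          # e.g. "2020 - Present • 5 years"
    location: str
    company_details: str
    department: str
    description: str


class Education(TypedDict, total=False):
    degree: str
    institution: str
    duration: str


class Language(TypedDict, total=False):
    language: str
    proficiency: str


# At runtime a TypedDict is a plain dict, so it serializes with json.dumps as usual.
job: Experience = {
    "title": "Software Engineer",
    "company": "Company X",
    "duration": "2020 - Present • 5 years",
}
```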
❌ WRONG: Raw Content Dump
{
"content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}
This is UNACCEPTABLE because:
- No structured fields
- No extraction metadata
- Cannot query programmatically
- No provenance tracking
✅ CORRECT: Structured Format
{
"extraction_metadata": {
"source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
"staff_id": "nationaal-archief_staff_0042_john_smith",
"extraction_date": "2025-12-11T10:00:00Z",
"extraction_method": "exa_contents",
"extraction_agent": "claude-opus-4.5",
"linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
"cost_usd": 0,
"request_id": "req_abc123xyz"
},
"profile_data": {
"name": "John Smith",
"linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
"headline": "Software Engineer at Company X",
"location": "Amsterdam, Netherlands",
"connections": "500 connections • 1,000 followers",
"about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
"experience": [
{
"title": "Software Engineer",
"company": "Company X",
"duration": "2020 - Present • 5 years",
"location": "Amsterdam, Netherlands",
"company_details": "Company: 51-200 employees • Founded 2015 • Technology"
}
],
"education": [
{
"degree": "MSc Computer Science",
"institution": "University of Amsterdam",
"duration": "2015 - 2017 • 2 years"
}
],
"skills": ["Python", "Data Engineering", "Heritage Digitization"],
"languages": [
{"language": "Dutch", "proficiency": "Native"},
{"language": "English", "proficiency": "Full professional proficiency"}
],
"profile_image_url": "https://media.licdn.com/dms/image/..."
}
}
Formatting Script
Use the provided formatting script to convert raw Exa dumps to proper format:
python scripts/format_linkedin_profile.py
This script:
- Identifies raw content dump files in data/custodian/person/entity/
- Parses the raw markdown content
- Extracts structured fields (name, headline, experience, etc.)
- Creates proper extraction_metadata block
- Overwrites files with structured format
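To illustrate the parsing step, here is a minimal sketch that pulls a few fields out of raw markdown shaped like the WRONG example above. It is not the code of scripts/format_linkedin_profile.py. The function name `parse_raw_profile` is hypothetical, and the layout it assumes (an H1 with the name, then headline and location lines, then `## ` sections) is inferred from that one example.

```python
def parse_raw_profile(content: str) -> dict:
    """Extract name, headline, location and about text from raw profile markdown."""
    name = ""
    intro = []        # non-empty lines between the H1 and the first "## " section
    sections = {}     # section title -> list of body lines
    current = None
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("# "):
            name = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is None:
            intro.append(line)
        else:
            # includes "### ..." sub-headings, kept as plain body lines here
            sections[current].append(line)
    return {
        "name": name,
        # Assumption: headline comes first and location second, as in the example
        "headline": intro[0] if len(intro) > 0 else "",
        "location": intro[1] if len(intro) > 1 else "",
        "about": " ".join(sections.get("About", [])),
    }
```

Real Exa responses may deviate from this layout, so a parser along these lines should leave a field empty rather than guess when the expected lines are absent.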
Extraction Agent Requirement
The extraction_agent field MUST always be set to "claude-opus-4.5".
This ensures:
- Consistent provenance tracking
- Clear audit trail of which model performed extraction
- Reproducibility of extraction process
Validation
Before committing person entity files, verify:
- JSON is valid: python -c "import json; json.load(open('file.json'))"
- Required fields present: extraction_metadata, profile_data, name, linkedin_url, experience
- extraction_agent is correct: Must be "claude-opus-4.5"
- No raw content dumps: Check that content field does not exist at top level
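The four checks above can be combined into one function. The sketch below is one possible shape, with hypothetical names (`check_profile`, `validate_profile_file`). It checks only the fields named in this checklist. The profile_data table also marks headline and location as required, so they could be added to the tuple.

```python
import json

# Fields named in the checklist above; extend with "headline" and "location"
# to enforce every field the profile_data table marks as required.
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "experience")
EXPECTED_AGENT = "claude-opus-4.5"


def check_profile(data) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    if "content" in data:
        problems.append("raw content dump: top-level 'content' field present")
    metadata = data.get("extraction_metadata")
    profile = data.get("profile_data")
    if not isinstance(metadata, dict):
        problems.append("missing extraction_metadata")
    elif metadata.get("extraction_agent") != EXPECTED_AGENT:
        problems.append("extraction_agent is not " + EXPECTED_AGENT)
    if not isinstance(profile, dict):
        problems.append("missing profile_data")
    else:
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"profile_data.{field} is missing")
    return problems


def validate_profile_file(path: str) -> list[str]:
    """Load one file and run check_profile; read and parse errors are reported too."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read or parse {path}: {exc}"]
    return check_profile(data)
```

Returning a list of problems instead of raising on the first one lets a pre-commit run report everything wrong with a file in a single pass.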
Migration: Converting Raw Dumps
If you find raw content dump files:
- Run formatting script: python scripts/format_linkedin_profile.py
- Verify output is properly structured
- Check for any remaining raw files
- Report files that couldn't be automatically converted
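For the last two steps, a short scan can list the files that still need attention after the formatting script has run. This is a sketch under assumptions: the function name `find_raw_dumps` is hypothetical, and it assumes profile files sit directly in the entity directory as `*.json`.

```python
import json
from pathlib import Path


def find_raw_dumps(entity_dir: str = "data/custodian/person/entity") -> list[Path]:
    """Return files that are unparseable, still hold 'content', or lack profile_data."""
    remaining = []
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            remaining.append(path)   # unreadable files also need manual attention
            continue
        if not isinstance(data, dict) or "content" in data or "profile_data" not in data:
            remaining.append(path)
    return remaining


if __name__ == "__main__":
    for leftover in find_raw_dumps():
        print(leftover)
```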
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |
Related Rules
- .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - Exa tool usage guidelines
- .opencode/PERSON_DATA_REFERENCE_PATTERN.md - Reference pattern for custodian files
- AGENTS.md - Rule 20: Person Entity Profiles - Individual File Storage