glam/.opencode/PERSON_ENTITY_PROFILE_FORMAT_RULE.md
2025-12-11 22:32:09 +01:00

Person Entity Profile Format Rule

Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

🚨 CRITICAL: Person entity profiles stored in data/custodian/person/entity/ MUST be properly structured JSON, NOT raw content dumps from Exa API responses.

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.


The Problem: Raw Content Dumps

The Exa API returns LinkedIn profile data as unstructured markdown text. Dumping this raw content into a JSON file unchanged causes the following problems:

  • Unparseable data - No structured fields to query
  • Inconsistent format - Each profile has different structure
  • Missing metadata - No extraction provenance tracking
  • Difficult processing - Cannot programmatically extract career history
  • Wasted effort - Re-extraction required for structured data

Required JSON Structure

Top-Level Structure

{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}

extraction_metadata (REQUIRED)

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}

Required Fields:

| Field | Type | Description |
|-------|------|-------------|
| source_file | string | Path to source staff list file |
| staff_id | string | Unique staff identifier from source |
| extraction_date | string | ISO 8601 timestamp |
| extraction_method | string | exa_contents, exa_crawling_exa, or manual |
| extraction_agent | string | MUST be claude-opus-4.5 |
| linkedin_url | string | Full LinkedIn profile URL |
| cost_usd | number | Exa API cost (0 for contents endpoint) |
| request_id | string | Exa request ID for tracing |
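
The field list above can be checked mechanically. The following is a minimal sketch, not part of the repository's tooling: the function name `check_extraction_metadata` and the wording of the problem messages are assumptions made for illustration, while the field names, types, and allowed `extraction_method` values come from the table.

```python
from datetime import datetime

# Field names and types as listed in the Required Fields table.
REQUIRED_METADATA_FIELDS = {
    "source_file": str,
    "staff_id": str,
    "extraction_date": str,
    "extraction_method": str,
    "extraction_agent": str,
    "linkedin_url": str,
    "cost_usd": (int, float),
    "request_id": str,
}
ALLOWED_METHODS = {"exa_contents", "exa_crawling_exa", "manual"}


def check_extraction_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    for field, expected in REQUIRED_METADATA_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(meta[field], bool) or not isinstance(meta[field], expected):
            problems.append(f"wrong type for {field}")

    method = meta.get("extraction_method")
    if isinstance(method, str) and method not in ALLOWED_METHODS:
        problems.append(f"unknown extraction_method: {method}")

    date = meta.get("extraction_date")
    if isinstance(date, str):
        try:
            # Older Python versions do not accept a trailing "Z", so normalize it.
            datetime.fromisoformat(date.replace("Z", "+00:00"))
        except ValueError:
            problems.append("extraction_date is not an ISO 8601 timestamp")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a file in a single pass.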

profile_data (REQUIRED)

{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}

Required profile_data Fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | YES | Full name from profile |
| linkedin_url | string | YES | Profile URL |
| headline | string | YES | Professional headline |
| location | string | YES | Geographic location |
| connections | string | NO | Connection/follower count string |
| about | string | NO | About section text |
| summary | string | NO | AI-generated summary if about is truncated |
| experience | array | YES | Array of experience objects |
| education | array | NO | Array of education objects |
| skills | array | NO | Array of skill strings |
| languages | array | NO | Array of language objects |
| profile_image_url | string | NO | LinkedIn profile photo URL |
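
The required/optional split above could be enforced with a small table-driven check. This is a sketch under stated assumptions: `check_profile_data` is a hypothetical helper, and treating `summary` (which appears in the example `profile_data` block) as optional is an assumption.

```python
# (expected type, required?) per field, following the profile_data field table.
PROFILE_FIELDS = {
    "name": (str, True),
    "linkedin_url": (str, True),
    "headline": (str, True),
    "location": (str, True),
    "connections": (str, False),
    "about": (str, False),
    "summary": (str, False),  # shown in the example block; optional is an assumption
    "experience": (list, True),
    "education": (list, False),
    "skills": (list, False),
    "languages": (list, False),
    "profile_image_url": (str, False),
}


def check_profile_data(profile: dict) -> list[str]:
    """Report missing required fields and fields whose JSON type is wrong."""
    problems = []
    for field, (expected, required) in PROFILE_FIELDS.items():
        if field not in profile:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(profile[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```

Optional fields are only type-checked when present, so a profile with no education or skills still passes.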

experience[] Object Structure

{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}

education[] Object Structure

{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}

languages[] Object Structure

{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}
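
The `duration` strings in the experience and education objects pack a date range and an elapsed length into one field, separated by `•`. A consumer that needs the parts separately could split them as sketched below. The helper name `split_duration` is hypothetical, and the sketch assumes the range always uses " - " and the length always follows a `•`, as in the templates above.

```python
def split_duration(duration: str) -> dict:
    """Split 'Start - End • Length' into its parts.

    'end' and 'length' are None when the corresponding part is absent.
    """
    date_range, _, length = duration.partition("•")
    start, _, end = date_range.partition(" - ")
    return {
        "start": start.strip(),
        "end": end.strip() or None,
        "length": length.strip() or None,
    }
```

For example, `split_duration("2015 - 2017 • 2 years")` returns `{"start": "2015", "end": "2017", "length": "2 years"}`.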

WRONG: Raw Content Dump

{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}

This is UNACCEPTABLE because:

  • No structured fields
  • No extraction metadata
  • Cannot query programmatically
  • No provenance tracking

CORRECT: Structured Format

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}

Formatting Script

Use the provided formatting script to convert raw Exa dumps to proper format:

python scripts/format_linkedin_profile.py

This script:

  1. Identifies raw content dump files in data/custodian/person/entity/
  2. Parses the raw markdown content
  3. Extracts structured fields (name, headline, experience, etc.)
  4. Creates a proper extraction_metadata block
  5. Overwrites each file in place with the structured format
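
The source of scripts/format_linkedin_profile.py is not reproduced here, so the following is only a sketch of what steps 2 and 3 might look like, not the script's actual implementation. It assumes the raw dump follows the layout of the WRONG example (a `# Name` heading, then headline and location paragraphs, then `## ` sections); the function name `parse_raw_profile` is hypothetical.

```python
import re


def parse_raw_profile(content: str) -> dict:
    """Pull name, headline, location, and about text out of a raw markdown dump."""
    # Split at each level-2 heading; the first chunk is the profile header.
    # The pattern requires a space after "##", so "### ..." lines are not split.
    header, *sections = re.split(r"^## ", content, flags=re.MULTILINE)

    header_lines = [line.strip() for line in header.splitlines() if line.strip()]
    name = header_lines[0].lstrip("# ").strip() if header_lines else ""

    # Index each section body by its lower-cased heading text.
    by_title = {}
    for section in sections:
        title, _, body = section.partition("\n")
        by_title[title.strip().lower()] = body.strip()

    return {
        "name": name,
        "headline": header_lines[1] if len(header_lines) > 1 else "",
        "location": header_lines[2] if len(header_lines) > 2 else "",
        "about": by_title.get("about", ""),
    }
```

Parsing experience and education entries would need further rules for the `### ` sub-headings and is left out of this sketch.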

Extraction Agent Requirement

The extraction_agent field MUST always be set to "claude-opus-4.5".

This ensures:

  • Consistent provenance tracking
  • Clear audit trail of which model performed extraction
  • Reproducibility of extraction process

Validation

Before committing person entity files, verify:

  1. JSON is valid: python -c "import json; json.load(open('file.json'))"
  2. Required fields present: extraction_metadata and profile_data at the top level; name, linkedin_url, headline, location, and experience inside profile_data
  3. extraction_agent is correct: Must be "claude-opus-4.5"
  4. No raw content dumps: Check that no content field exists at the top level
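
The four checks can be combined into one function that runs before a commit. This is a minimal sketch rather than an existing project tool: `validate_profile_text` is a hypothetical name, and the required `profile_data` fields are those marked YES in the field table.

```python
import json

REQUIRED_AGENT = "claude-opus-4.5"
# Fields marked YES in the profile_data field table.
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "headline", "location", "experience")


def validate_profile_text(text: str) -> list[str]:
    """Run the pre-commit checks on the text of one person entity file."""
    # Check 1: the file must be valid JSON.
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    # Check 4: a top-level "content" field marks a raw dump.
    if "content" in doc:
        problems.append("raw content dump: top-level 'content' field present")

    # Checks 2 and 3: required blocks, required fields, and the agent value.
    meta = doc.get("extraction_metadata")
    if not isinstance(meta, dict):
        problems.append("missing extraction_metadata")
    elif meta.get("extraction_agent") != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be {REQUIRED_AGENT!r}")

    profile = doc.get("profile_data")
    if not isinstance(profile, dict):
        problems.append("missing profile_data")
    else:
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"profile_data missing {field}")
    return problems
```

A file on disk can be checked with `validate_profile_text(open("file.json", encoding="utf-8").read())`; an empty list means all four checks passed.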

Migration: Converting Raw Dumps

If you find raw content dump files:

  1. Run formatting script: python scripts/format_linkedin_profile.py
  2. Verify output is properly structured
  3. Check for any remaining raw files
  4. Report files that couldn't be automatically converted
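
Steps 3 and 4 amount to a directory scan. The sketch below shows one way to do it; the function name `find_unconverted` and its result keys are hypothetical, and it assumes profile files sit directly in the entity directory with a `.json` extension.

```python
import json
from pathlib import Path


def find_unconverted(entity_dir: str) -> dict[str, list[Path]]:
    """Sort *.json files into converted, still-raw, and unreadable groups."""
    result = {"ok": [], "needs_conversion": [], "unreadable": []}
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # ValueError covers both JSON syntax errors and bad UTF-8.
            result["unreadable"].append(path)
            continue
        structured = (
            isinstance(doc, dict)
            and "content" not in doc
            and "extraction_metadata" in doc
            and "profile_data" in doc
        )
        result["ok" if structured else "needs_conversion"].append(path)
    return result
```

Calling `find_unconverted("data/custodian/person/entity")` after the formatting script has run gives the list of files to report: anything left under `needs_conversion` or `unreadable` could not be converted automatically.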

Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |

See Also

  • .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - Exa tool usage guidelines
  • .opencode/PERSON_DATA_REFERENCE_PATTERN.md - Reference pattern for custodian files
  • AGENTS.md - Rule 20: Person Entity Profiles - Individual File Storage