kempersc 1b1cfbfca0 enrich custodians

2025-12-11 22:32:09 +01:00

Person Entity Profile Format Rule

Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

🚨 CRITICAL: Person entity profiles stored in data/custodian/person/entity/ MUST be properly structured JSON, NOT raw content dumps from Exa API responses.

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.

The Problem: Raw Content Dumps

The Exa API returns LinkedIn profile data as unstructured markdown text. Dumping this raw content into a JSON file unchanged causes these problems:

❌ Unparseable data - No structured fields to query
❌ Inconsistent format - Each profile has different structure
❌ Missing metadata - No extraction provenance tracking
❌ Difficult processing - Cannot programmatically extract career history
❌ Wasted effort - Re-extraction required for structured data

Required JSON Structure

Top-Level Structure

{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}

extraction_metadata (REQUIRED)

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}

Required Fields:

Field	Type	Description
`source_file`	string	Path to source staff list file
`staff_id`	string	Unique staff identifier from source
`extraction_date`	string	ISO 8601 timestamp
`extraction_method`	string	`exa_contents`, `exa_crawling_exa`, or `manual`
`extraction_agent`	string	MUST be `claude-opus-4.5`
`linkedin_url`	string	Full LinkedIn profile URL
`cost_usd`	number	Exa API cost (0 for contents endpoint)
`request_id`	string	Exa request ID for tracing
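
The field table above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name `check_extraction_metadata` and the wording of the problem messages are assumptions for illustration, while the field names, types, and allowed values come from the table.

```python
from datetime import datetime

# Field names, types, and allowed values are taken from the table above.
REQUIRED_METADATA = {
    "source_file": str,
    "staff_id": str,
    "extraction_date": str,
    "extraction_method": str,
    "extraction_agent": str,
    "linkedin_url": str,
    "cost_usd": (int, float),
    "request_id": str,
}
ALLOWED_METHODS = {"exa_contents", "exa_crawling_exa", "manual"}
REQUIRED_AGENT = "claude-opus-4.5"


def check_extraction_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    for field, expected in REQUIRED_METADATA.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(meta[field], bool) or not isinstance(meta[field], expected):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems
    if meta["extraction_method"] not in ALLOWED_METHODS:
        problems.append("unknown extraction_method")
    if meta["extraction_agent"] != REQUIRED_AGENT:
        problems.append("extraction_agent must be claude-opus-4.5")
    try:
        # fromisoformat() before Python 3.11 rejects a trailing "Z", so
        # normalise it; the method accepts a common subset of ISO 8601.
        datetime.fromisoformat(meta["extraction_date"].replace("Z", "+00:00"))
    except ValueError:
        problems.append("extraction_date is not ISO 8601")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a file in one pass.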

profile_data (REQUIRED)

{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}

Required profile_data Fields:

Field	Type	Required	Description
`name`	string	YES	Full name from profile
`linkedin_url`	string	YES	Profile URL
`headline`	string	YES	Professional headline
`location`	string	YES	Geographic location
`connections`	string	NO	Connection/follower count string
`about`	string	NO	About section text
`summary`	string	NO	AI-generated summary, used if `about` is truncated
`experience`	array	YES	Array of experience objects
`education`	array	NO	Array of education objects
`skills`	array	NO	Array of skill strings
`languages`	array	NO	Array of language objects
`profile_image_url`	string	NO	LinkedIn profile photo URL

experience[] Object Structure

{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}
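
Because `duration` is stored as a single display string, a consumer that needs the date range and the length separately has to split it. A minimal sketch, assuming the separators are exactly " - " and "•" as in the template above; `split_duration` is a hypothetical helper, not part of the repository:

```python
def split_duration(duration: str) -> dict:
    """Split 'Start - End • length' into parts; missing parts become None."""
    # Everything after the bullet is the human-readable length.
    date_range, _, length = duration.partition("•")
    # The date range is 'Start - End'; a lone value is treated as the start.
    start, sep, end = date_range.partition(" - ")
    return {
        "start": start.strip() or None,
        "end": (end.strip() or None) if sep else None,
        "length": length.strip() or None,
    }
```

For example, `split_duration("2020 - Present • 5 years")` yields the start `2020`, the end `Present`, and the length `5 years`. The stored field itself stays a string, as the structure above requires.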

education[] Object Structure

{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}

languages[] Object Structure

{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}

❌ WRONG: Raw Content Dump

{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}

This is UNACCEPTABLE because:

No structured fields
No extraction metadata
Cannot query programmatically
No provenance tracking

✅ CORRECT: Structured Format

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}

Formatting Script

Use the provided formatting script to convert raw Exa dumps to the structured format:

python scripts/format_linkedin_profile.py

This script:

Identifies raw content dump files in data/custodian/person/entity/
Parses the raw markdown content
Extracts structured fields (name, headline, experience, etc.)
Creates proper extraction_metadata block
Overwrites files with structured format
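
The parsing and extraction steps can be sketched roughly as follows. This is not the code of scripts/format_linkedin_profile.py; it is an illustration that assumes the simplified markdown layout of the raw-dump example in this document (an `#` name heading, then headline and location lines, then `##` sections with `###` experience entries). `parse_raw_profile` is a hypothetical name, and real Exa output may well vary from this layout.

```python
def parse_raw_profile(markdown: str) -> dict:
    """Parse the simplified markdown layout shown in the raw-dump example."""
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    profile = {"name": "", "headline": "", "location": "", "about": "", "experience": []}
    section = "header"  # text before the first '## ' heading
    header_lines = []
    about_lines = []
    for line in lines:
        if line.startswith("## "):
            section = line[3:].strip().lower()
        elif line.startswith("### ") and section == "experience":
            # Assumed entry format: '### Title at Company'
            title, _, company = line[4:].partition(" at ")
            profile["experience"].append(
                {"title": title.strip(), "company": company.strip(), "duration": ""}
            )
        elif line.startswith("# "):
            profile["name"] = line[2:].strip()
        elif section == "header":
            header_lines.append(line)
        elif section == "about":
            about_lines.append(line)
        elif section == "experience" and profile["experience"]:
            # The first line under an entry is taken as its duration.
            entry = profile["experience"][-1]
            if not entry["duration"]:
                entry["duration"] = line
    if header_lines:
        profile["headline"] = header_lines[0]
    if len(header_lines) > 1:
        profile["location"] = header_lines[1]
    profile["about"] = " ".join(about_lines)
    return profile
```

The result covers only `profile_data`; the `extraction_metadata` block still has to be built from the source staff list and the Exa request, which this sketch does not attempt.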

Extraction Agent Requirement

The extraction_agent field MUST always be set to "claude-opus-4.5".

This ensures:

Consistent provenance tracking
Clear audit trail of which model performed extraction
Reproducibility of extraction process

Validation

Before committing person entity files, verify:

JSON is valid: python -c "import json; json.load(open('file.json'))"
Required fields present: extraction_metadata and profile_data at the top level; name, linkedin_url, headline, location, and experience inside profile_data
extraction_agent is correct: Must be "claude-opus-4.5"
No raw content dumps: Check that content field does not exist at top level
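
The checks above can be combined into one pre-commit helper. A minimal sketch, assuming a single file path as input; `validate_entity_file` and its messages are hypothetical, and the required `profile_data` fields are the ones marked YES in the field table earlier in this document.

```python
import json

REQUIRED_AGENT = "claude-opus-4.5"
# Fields marked YES in the profile_data table.
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "headline", "location", "experience")


def validate_entity_file(path: str) -> list[str]:
    """Run the pre-commit checks on one file; an empty list means it passed."""
    try:
        with open(path, encoding="utf-8") as handle:
            doc = json.load(handle)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    if "content" in doc:
        problems.append("raw content dump: top-level 'content' field present")

    metadata = doc.get("extraction_metadata")
    profile = doc.get("profile_data")
    if not isinstance(metadata, dict):
        problems.append("missing extraction_metadata")
    elif metadata.get("extraction_agent") != REQUIRED_AGENT:
        problems.append("extraction_agent must be claude-opus-4.5")
    if not isinstance(profile, dict):
        problems.append("missing profile_data")
    else:
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"profile_data is missing {field}")
    return problems
```

A caller could print the returned problems and exit non-zero when the list is not empty, which makes the helper usable from a commit hook.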

Migration: Converting Raw Dumps

If you find raw content dump files:

Run formatting script: python scripts/format_linkedin_profile.py
Verify output is properly structured
Check for any remaining raw files
Report files that couldn't be automatically converted
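
Checking for remaining raw files can be scripted as well. The sketch below assumes the entity files sit directly in one directory and end in `.json`; `find_raw_dumps` is a hypothetical helper, separate from the formatting script.

```python
import json
from pathlib import Path


def find_raw_dumps(entity_dir: str) -> list[str]:
    """Return sorted paths of JSON files that still look like raw content dumps."""
    remaining = []
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Unreadable files could not be converted either, so report them.
            remaining.append(str(path))
            continue
        if not isinstance(doc, dict) or "content" in doc or "profile_data" not in doc:
            remaining.append(str(path))
    return remaining
```

Running it against data/custodian/person/entity/ after the formatting script gives the list of files to report as not automatically converted.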

Version History

Version	Date	Changes
1.0.0	2025-12-11	Initial rule documentation based on Nationaal Archief extraction session

See Also

.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - Exa tool usage guidelines
.opencode/PERSON_DATA_REFERENCE_PATTERN.md - Reference pattern for custodian files
AGENTS.md - Rule 20: Person Entity Profiles - Individual File Storage
