# Person Entity Profile Format Rule

## Rule: ALL Person Entity Profiles MUST Use Structured JSON Format

**🚨 CRITICAL: Person entity profiles stored in `data/custodian/person/entity/` MUST be properly structured JSON, NOT raw content dumps from Exa API responses.**

This rule ensures data consistency, searchability, and interoperability across the heritage custodian knowledge base.

---

## The Problem: Raw Content Dumps

Exa API returns LinkedIn profile data as unstructured markdown text. Simply dumping this raw content into a JSON file creates:

- ❌ **Unparseable data** - No structured fields to query
- ❌ **Inconsistent format** - Each profile has different structure
- ❌ **Missing metadata** - No extraction provenance tracking
- ❌ **Difficult processing** - Cannot programmatically extract career history
- ❌ **Wasted effort** - Re-extraction required for structured data

---

## Required JSON Structure

### Top-Level Structure

```json
{
  "extraction_metadata": { ... },
  "profile_data": { ... }
}
```

### extraction_metadata (REQUIRED)

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/{custodian}_staff_{timestamp}.json",
    "staff_id": "{custodian}_staff_{index}_{name_slug}",
    "extraction_date": "2025-12-10T16:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "cost_usd": 0,
    "request_id": "{exa_request_id}"
  }
}
```

**Required Fields**:

| Field | Type | Description |
|-------|------|-------------|
| `source_file` | string | Path to source staff list file |
| `staff_id` | string | Unique staff identifier from source |
| `extraction_date` | string | ISO 8601 timestamp |
| `extraction_method` | string | `exa_contents`, `exa_crawling_exa`, or `manual` |
| `extraction_agent` | string | **MUST be `claude-opus-4.5`** |
| `linkedin_url` | string | Full LinkedIn profile URL |
| `cost_usd` | number | Exa API cost (0 for contents endpoint) |
| `request_id` | string | Exa request ID for tracing |

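As a rough illustration of how an `extraction_metadata` block might be assembled, the sketch below fills in every required field and stamps `extraction_date` with the current UTC time in ISO 8601 form. The helper name `build_extraction_metadata` and the two constants are hypothetical and not part of any existing script; the field names and the allowed `extraction_method` values come from the table above.

```python
from datetime import datetime, timezone

# Field names and allowed methods as listed in the Required Fields table.
REQUIRED_METADATA_FIELDS = (
    "source_file", "staff_id", "extraction_date", "extraction_method",
    "extraction_agent", "linkedin_url", "cost_usd", "request_id",
)
ALLOWED_METHODS = {"exa_contents", "exa_crawling_exa", "manual"}


def build_extraction_metadata(source_file, staff_id, linkedin_url, request_id,
                              extraction_method="exa_contents", cost_usd=0):
    """Assemble an extraction_metadata dict (hypothetical helper)."""
    if extraction_method not in ALLOWED_METHODS:
        raise ValueError(f"unknown extraction_method: {extraction_method!r}")
    return {
        "source_file": source_file,
        "staff_id": staff_id,
        # ISO 8601 UTC timestamp with a trailing "Z", e.g. 2025-12-10T16:00:00Z
        "extraction_date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "extraction_method": extraction_method,
        "extraction_agent": "claude-opus-4.5",  # fixed value required by this rule
        "linkedin_url": linkedin_url,
        "cost_usd": cost_usd,
        "request_id": request_id,
    }
```

Rejecting unknown `extraction_method` values when the block is built is one way to keep typos out of the provenance record; the same check could equally live in a separate validation step.
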
### profile_data (REQUIRED)

```json
{
  "profile_data": {
    "name": "Full Name",
    "linkedin_url": "https://www.linkedin.com/in/{slug}",
    "headline": "Current professional headline",
    "location": "City, Region, Country",
    "connections": "500 connections • 2,000 followers",
    "about": "Full about section text or summary",
    "summary": "AI-generated summary if about is truncated",
    "experience": [ ... ],
    "education": [ ... ],
    "skills": [ ... ],
    "languages": [ ... ],
    "profile_image_url": "https://media.licdn.com/..."
  }
}
```

**Required profile_data Fields**:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | YES | Full name from profile |
| `linkedin_url` | string | YES | Profile URL |
| `headline` | string | YES | Professional headline |
| `location` | string | YES | Geographic location |
| `connections` | string | NO | Connection/follower count string |
| `about` | string | NO | About section text |
| `summary` | string | NO | AI-generated summary if `about` is truncated |
| `experience` | array | YES | Array of experience objects |
| `education` | array | NO | Array of education objects |
| `skills` | array | NO | Array of skill strings |
| `languages` | array | NO | Array of language objects |
| `profile_image_url` | string | NO | LinkedIn profile photo URL |

### experience[] Object Structure

```json
{
  "title": "Job Title",
  "company": "Company Name",
  "duration": "Start Date - End Date • X years Y months",
  "location": "City, Country",
  "company_details": "Company: size • Founded year • Type • Industry",
  "department": "Department • Level: Level",
  "description": "Role description if provided"
}
```

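The `duration` field packs a date range and a length into one string, separated by `•`. If downstream code needs those parts separately, something like the following could split them. `split_duration` is a hypothetical helper, not an existing function, and it assumes the `Start - End • length` layout shown above; it makes no attempt to parse the dates themselves.

```python
def split_duration(duration):
    """Split 'Start - End • X years' into (start, end, length).

    Parts that are missing from the input come back as None.
    The date strings are returned as-is and are not validated.
    """
    # Everything before the bullet is the date range, everything after is the length.
    date_range, _, length = duration.partition("•")
    # The range uses ' - ' (with spaces), so hyphens inside a date are left alone.
    start, _, end = date_range.partition(" - ")
    return (start.strip() or None, end.strip() or None, length.strip() or None)
```

For example, `split_duration("2020 - Present • 5 years")` returns `("2020", "Present", "5 years")`, and a string without the `•` part yields `None` for the length.
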
### education[] Object Structure

```json
{
  "degree": "Degree Title and Field",
  "institution": "Institution Name",
  "duration": "Start - End • X years"
}
```

### languages[] Object Structure

```json
{
  "language": "Language Name",
  "proficiency": "Native or bilingual proficiency"
}
```

---

## ❌ WRONG: Raw Content Dump

```json
{
  "content": "# John Smith\n\nSoftware Engineer at Company X\n\nAmsterdam, Netherlands\n\n## About\n\nExperienced software engineer...\n\n## Experience\n\n### Software Engineer at Company X\n2020 - Present..."
}
```

This is **UNACCEPTABLE** because:
- No structured fields
- No extraction metadata
- Cannot query programmatically
- No provenance tracking

---

## ✅ CORRECT: Structured Format

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251210T155415Z.json",
    "staff_id": "nationaal-archief_staff_0042_john_smith",
    "extraction_date": "2025-12-11T10:00:00Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "cost_usd": 0,
    "request_id": "req_abc123xyz"
  },
  "profile_data": {
    "name": "John Smith",
    "linkedin_url": "https://www.linkedin.com/in/john-smith-12345",
    "headline": "Software Engineer at Company X",
    "location": "Amsterdam, Netherlands",
    "connections": "500 connections • 1,000 followers",
    "about": "Experienced software engineer with 10 years of experience in heritage digitization projects...",
    "experience": [
      {
        "title": "Software Engineer",
        "company": "Company X",
        "duration": "2020 - Present • 5 years",
        "location": "Amsterdam, Netherlands",
        "company_details": "Company: 51-200 employees • Founded 2015 • Technology"
      }
    ],
    "education": [
      {
        "degree": "MSc Computer Science",
        "institution": "University of Amsterdam",
        "duration": "2015 - 2017 • 2 years"
      }
    ],
    "skills": ["Python", "Data Engineering", "Heritage Digitization"],
    "languages": [
      {"language": "Dutch", "proficiency": "Native"},
      {"language": "English", "proficiency": "Full professional proficiency"}
    ],
    "profile_image_url": "https://media.licdn.com/dms/image/..."
  }
}
```

---

## Formatting Script

Use the provided formatting script to convert raw Exa dumps to proper format:

```bash
python scripts/format_linkedin_profile.py
```

This script:
1. Identifies raw content dump files in `data/custodian/person/entity/`
2. Parses the raw markdown content
3. Extracts structured fields (name, headline, experience, etc.)
4. Creates proper `extraction_metadata` block
5. Overwrites files with structured format

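The internals of `scripts/format_linkedin_profile.py` are not shown in this rule, so the following is only a minimal sketch of what steps 2 and 3 could look like, not the script's actual code. It assumes a raw dump laid out like the raw content dump example in this document: an `# ` heading with the name, then headline and location as the first two plain lines, then `## ` sections such as `About` and `Experience`. The function name `parse_raw_profile` and the `experience_lines` key are hypothetical, and real Exa output may well differ from this layout.

```python
def parse_raw_profile(content):
    """Split a raw markdown dump into a name, intro lines, and '## ' sections."""
    name = None
    intro = []      # non-empty lines between the H1 and the first '## ' heading
    sections = {}   # '## ' heading text -> list of lines in that section
    current = None  # line list of the section being filled; None before the first one
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith("# ") and name is None:
            name = stripped[2:].strip()
        elif stripped.startswith("## "):
            current = sections.setdefault(stripped[3:].strip(), [])
        elif not stripped:
            continue
        elif current is not None:
            current.append(stripped)
        else:
            intro.append(stripped)
    return {
        "name": name,
        # Assumption: the first two intro lines are headline and location,
        # as in the raw dump example in this document.
        "headline": intro[0] if len(intro) > 0 else None,
        "location": intro[1] if len(intro) > 1 else None,
        "about": " ".join(sections.get("About", [])) or None,
        # Left as raw lines; turning them into experience objects is a further step.
        "experience_lines": sections.get("Experience", []),
    }
```
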
---

## Extraction Agent Requirement

**The `extraction_agent` field MUST always be set to `"claude-opus-4.5"`.**

This ensures:
- Consistent provenance tracking
- Clear audit trail of which model performed extraction
- Reproducibility of extraction process

---

## Validation

Before committing person entity files, verify:

1. **JSON is valid**: `python -c "import json; json.load(open('file.json'))"`
2. **Required fields present**: `extraction_metadata`, `profile_data`, `name`, `linkedin_url`, `experience`
3. **extraction_agent is correct**: Must be `"claude-opus-4.5"`
4. **No raw content dumps**: Check that `content` field does not exist at top level

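The four checks above could be combined into one function along these lines. This is a sketch rather than an existing tool: `validate_profile_text` and the constants are hypothetical names, and it checks only the fields named in check 2, on the assumption that `name`, `linkedin_url`, and `experience` live inside `profile_data` as in the structure defined earlier.

```python
import json

# Fields named in check 2, grouped by where they live in the file.
REQUIRED_TOP_LEVEL = ("extraction_metadata", "profile_data")
REQUIRED_PROFILE_FIELDS = ("name", "linkedin_url", "experience")
REQUIRED_AGENT = "claude-opus-4.5"


def validate_profile_text(text):
    """Return a list of problems for one profile file's text (empty list = OK)."""
    try:
        doc = json.loads(text)  # check 1: JSON is valid
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]

    problems = []
    for key in REQUIRED_TOP_LEVEL:  # check 2: required blocks
        if not isinstance(doc.get(key), dict):
            problems.append(f"missing or malformed {key}")
    profile = doc.get("profile_data")
    if isinstance(profile, dict):  # check 2: required profile fields
        for field in REQUIRED_PROFILE_FIELDS:
            if field not in profile:
                problems.append(f"missing profile_data.{field}")
    metadata = doc.get("extraction_metadata")
    if isinstance(metadata, dict) and metadata.get("extraction_agent") != REQUIRED_AGENT:
        problems.append(f"extraction_agent must be {REQUIRED_AGENT!r}")  # check 3
    if "content" in doc:  # check 4: raw dump marker
        problems.append("top-level 'content' field present (raw dump)")
    return problems
```
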
---

## Migration: Converting Raw Dumps

If you find raw content dump files:

1. Run formatting script: `python scripts/format_linkedin_profile.py`
2. Verify output is properly structured
3. Check for any remaining raw files
4. Report files that couldn't be automatically converted

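For steps 3 and 4, a short directory scan can produce the list of files to report. The sketch below is one possible approach under stated assumptions: `classify_profiles` and the report keys are hypothetical names, a file counts as structured only if it has both top-level blocks and no top-level `content` field, and only `*.json` files directly inside the given directory are inspected.

```python
import json
from pathlib import Path


def classify_profiles(entity_dir):
    """Sort *.json files into structured, needs-conversion, and unreadable groups."""
    report = {"structured": [], "needs_conversion": [], "unreadable": []}
    for path in sorted(Path(entity_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # ValueError covers both invalid JSON and undecodable bytes.
            report["unreadable"].append(path.name)
            continue
        is_structured = (
            isinstance(doc, dict)
            and "content" not in doc
            and "extraction_metadata" in doc
            and "profile_data" in doc
        )
        key = "structured" if is_structured else "needs_conversion"
        report[key].append(path.name)
    return report
```
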
---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-11 | Initial rule documentation based on Nationaal Archief extraction session |

---

## Related Rules

- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa tool usage guidelines
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern for custodian files
- `AGENTS.md` - Rule 20: Person Entity Profiles - Individual File Storage