Person-Custodian Data Architecture
Rule: Single Source of Truth for Person Data
🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian files store only references and affiliation provenance.
This architecture ensures:
- No data duplication across custodian files
- Clean separation between person data and affiliation data
- Cross-custodian career tracking capability
- Single place to update person information
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ DATA ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PERSON ENTITY FILES CUSTODIAN YAML FILES │
│ (Single Source of Truth) (References Only) │
│ │
│ ┌─────────────────────────┐ ┌──────────────────────┐ │
│ │ person/entity/ │ │ NL-ZH-DHA-A-NA.yaml │ │
│ │ {slug}_{timestamp}.json │◄──────────│ │ │
│ │ │ ref │ person_observations: │ │
│ │ • profile_data │ │ staff: │ │
│ │ • web_claims ◄──────────┼───────────│ - person_id: ... │ │
│ │ • affiliations ─────────┼──────────►│ affiliation_ │ │
│ │ │ sync │ provenance: ... │ │
│ └─────────────────────────┘ └──────────────────────┘ │
│ │
│ │ │ │
│ │ same file │ different files │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌──────────────────────┐ │
│ │ Multiple custodians │ │ NL-NH-HAA-A-NHA.yaml │ │
│ │ reference SAME person │◄──────────│ │ │
│ │ entity file │ │ (same person, diff │ │
│ │ │ │ custodian) │ │
│ └─────────────────────────┘ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Directory Structure
data/custodian/
├── person/
│   ├── entity/                  # SINGLE SOURCE OF TRUTH for person data
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   └── ...
│   ├── affiliated/              # Custodian staff lists (parsed)
│   │   └── parsed/
│   │       └── nationaal-archief_staff_20251214T112147Z.json
│   └── connection/              # Professional network data
│       └── manual/
│           └── {slug}_connections_{timestamp}.json
│
├── NL-ZH-DHA-A-NA.yaml          # Custodian file - references persons
├── NL-NH-HAA-A-NHA.yaml         # Another custodian - may reference same persons
└── ...
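The entity file naming pattern `{slug}_{timestamp}.json` can be sketched as a small path builder. This is a minimal illustration, not the project's own code: `entity_file_path` and `ENTITY_DIR` are hypothetical names, and the timestamp format is inferred from the example file names in the tree above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed location, taken from the directory tree above.
ENTITY_DIR = Path("data/custodian/person/entity")


def entity_file_path(slug: str, observed_at: datetime) -> Path:
    """Build entity/{slug}_{timestamp}.json for one person (hypothetical helper)."""
    # Convert to UTC so the trailing "Z" in the file name is accurate.
    # Pass a timezone-aware datetime; naive values are treated as local time.
    stamp = observed_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return ENTITY_DIR / f"{slug}_{stamp}.json"


path = entity_file_path("bibianvanreeken", datetime(2025, 12, 11, tzinfo=timezone.utc))
print(path.as_posix())
# data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```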
What Goes Where
Person Entity Files (data/custodian/person/entity/)
Store ALL person-specific data:
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
Custodian YAML Files (data/custodian/NL-*.yaml)
Store ONLY references and affiliation provenance:
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      # AFFILIATION PROVENANCE ONLY - when/how was this association observed?
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      # References to entity file (contains full profile + web claims)
      linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
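Because the custodian file holds only a reference, a consumer has to follow `linkedin_profile_path` to reach the profile and web claims. The sketch below shows one way to do that, assuming the staff entry has already been parsed from YAML into a dict and that the path is relative to the repository root. `load_person_entity` is a hypothetical helper, and the demo writes a stand-in entity file to a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_person_entity(staff_entry: dict[str, Any], repo_root: Path) -> dict[str, Any]:
    """Follow linkedin_profile_path from a custodian staff entry to its entity file."""
    relative_path = staff_entry["linkedin_profile_path"]
    with (repo_root / relative_path).open(encoding="utf-8") as handle:
        return json.load(handle)


# Demo: a throwaway directory stands in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    rel = "data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json"
    (root / rel).parent.mkdir(parents=True)
    (root / rel).write_text(
        json.dumps({"profile_data": {"name": "Bibian van Reeken"}}), encoding="utf-8"
    )
    entity = load_person_entity({"linkedin_profile_path": rel}, root)

print(entity["profile_data"]["name"])  # Bibian van Reeken
```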
Data Flow
LinkedIn Company Page (staff list)
│
▼
┌───────────────────────────────────────────┐
│ 1. Parse HTML → Staff list JSON │
│ (parse_linkedin_html.py) │
│ Output: affiliated/parsed/{slug}.json │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ 2. Extract full profiles → Entity files │
│ (Exa crawling for each person) │
│ Output: entity/{slug}_{timestamp}.json │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ 3. Link to custodian YAML │
│ (link_person_observations.py) │
│ - Adds affiliation_provenance │
│ - Sets linkedin_profile_path │
│ - Updates entity affiliations array │
└───────────────────────────────────────────┘
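Step 3 can be sketched as follows. This is not the code of `link_person_observations.py`, only an illustration of the three actions listed in the box, operating on already-parsed dicts. The function name `link_person` and the shape of `staff_row` are assumptions made for the example.

```python
from typing import Any


def link_person(
    custodian: dict[str, Any],
    entity: dict[str, Any],
    staff_row: dict[str, Any],
    entity_path: str,
    provenance: dict[str, str],
) -> None:
    """Add a reference to the custodian record and sync the entity's affiliations."""
    staff = custodian.setdefault("person_observations", {}).setdefault("staff", [])
    entry = next((s for s in staff if s.get("person_id") == staff_row["person_id"]), None)
    if entry is None:
        entry = {"person_id": staff_row["person_id"]}
        staff.append(entry)
    # Only reference fields and affiliation provenance go into the custodian record.
    entry.update(
        person_name=staff_row["person_name"],
        role_title=staff_row["role_title"],
        affiliation_provenance=dict(provenance),
        linkedin_profile_path=entity_path,
    )
    # Additive update: append the affiliation only if this custodian is not listed yet.
    affiliations = entity.setdefault("affiliations", [])
    if not any(a.get("custodian_slug") == staff_row["custodian_slug"] for a in affiliations):
        affiliations.append({
            "custodian_name": staff_row["custodian_name"],
            "custodian_slug": staff_row["custodian_slug"],
            "role_title": staff_row["role_title"],
            "current": True,
            "observed_on": provenance["retrieved_on"],
            "source_url": provenance["source_url"],
        })


custodian: dict[str, Any] = {}
entity: dict[str, Any] = {"affiliations": []}
row = {
    "person_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "person_name": "Bibian van Reeken",
    "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
    "custodian_name": "Nationaal Archief",
    "custodian_slug": "nationaal-archief",
}
prov = {
    "source_url": "https://www.linkedin.com/company/nationaal-archief/people/",
    "retrieved_on": "2025-12-14T11:21:47Z",
    "retrieval_agent": "linkedin_html_parser",
}
entity_path = "data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json"
link_person(custodian, entity, row, entity_path, prov)
link_person(custodian, entity, row, entity_path, prov)  # second run changes nothing
```

Running the link twice leaves one staff entry and one affiliation, which keeps the step safe to repeat.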
Correct vs Incorrect Patterns
❌ WRONG - Web Claims in Custodian File
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      # WRONG! Web claims don't belong in custodian file
      web_claims:
        - claim_type: full_name
          claim_value: Bibian van Reeken
          source_url: https://www.linkedin.com/in/bibianvanreeken
          retrieved_on: '2025-12-14T11:21:47Z'
✅ CORRECT - Affiliation Provenance Only
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      # CORRECT! Only affiliation provenance in custodian file
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
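The wrong and correct patterns can be told apart mechanically. The check below is a minimal sketch that works on a custodian file already parsed into a dict. `find_violations` is a hypothetical helper, and the set of person-only keys is an assumption based on the entity file layout (`profile_data`, `web_claims`).

```python
from typing import Any

# Assumed list of keys that belong only in person entity files.
PERSON_ONLY_KEYS = {"web_claims", "profile_data"}


def find_violations(custodian: dict[str, Any]) -> list[str]:
    """Report staff entries that inline person data instead of referencing it."""
    violations = []
    staff = custodian.get("person_observations", {}).get("staff", [])
    for entry in staff:
        person_id = entry.get("person_id", "<unknown>")
        for key in sorted(entry.keys() & PERSON_ONLY_KEYS):
            violations.append(f"{person_id}: '{key}' belongs in the entity file")
    return violations


wrong = {"person_observations": {"staff": [
    {"person_id": "nationaal-archief_staff_0001_bibian_van_reeken",
     "web_claims": [{"claim_type": "full_name", "claim_value": "Bibian van Reeken"}]},
]}}
correct = {"person_observations": {"staff": [
    {"person_id": "nationaal-archief_staff_0001_bibian_van_reeken",
     "affiliation_provenance": {"retrieval_agent": "linkedin_html_parser"},
     "linkedin_profile_path":
         "data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json"},
]}}

print(find_violations(wrong))
print(find_violations(correct))  # []
```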
Benefits of This Architecture
1. No Data Duplication
Same person at multiple institutions → ONE entity file, multiple references
2. Single Source of Truth
Update a person's profile once → every custodian reference resolves to the updated entity file
3. Clean Separation of Concerns
- Entity file: Who is this person? (profile, claims)
- Custodian file: How are they associated? (affiliation provenance)
4. Cross-Custodian Career Tracking
Query all affiliations for a person from their entity file:
{
"affiliations": [
{"custodian_name": "Nationaal Archief", "current": true},
{"custodian_name": "Noord-Hollands Archief", "current": false}
]
}
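Such a query could look like the sketch below, which splits a person's affiliations into current and past custodians. `career_summary` is a hypothetical helper, and the input reuses the two affiliations from the example above.

```python
from typing import Any


def career_summary(entity: dict[str, Any]) -> dict[str, list[str]]:
    """Group a person's custodians by whether the affiliation is current."""
    affiliations = entity.get("affiliations", [])
    return {
        "current": [a["custodian_name"] for a in affiliations if a.get("current")],
        "past": [a["custodian_name"] for a in affiliations if not a.get("current")],
    }


entity = {
    "affiliations": [
        {"custodian_name": "Nationaal Archief", "current": True},
        {"custodian_name": "Noord-Hollands Archief", "current": False},
    ]
}
print(career_summary(entity))
# {'current': ['Nationaal Archief'], 'past': ['Noord-Hollands Archief']}
```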
5. Network Analysis Ready
Entity files contain both profile data and affiliations → Easy to build relationship graphs
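As one illustration of a relationship graph, the sketch below derives co-affiliation edges: two people are linked if their entity files list the same custodian. The function name, the placeholder people ("Person A", "Person B", "Person C") and the slug `noord-hollands-archief` are made up for the example and are not taken from the data.

```python
from collections import defaultdict
from itertools import combinations
from typing import Any


def co_affiliation_edges(entities: list[dict[str, Any]]) -> set[tuple[str, str]]:
    """Return name pairs of people who share at least one custodian."""
    members: dict[str, set[str]] = defaultdict(set)
    for entity in entities:
        name = entity["profile_data"]["name"]
        for affiliation in entity.get("affiliations", []):
            members[affiliation["custodian_slug"]].add(name)
    edges: set[tuple[str, str]] = set()
    for people in members.values():
        # Sorting makes each pair appear in one canonical order.
        edges.update(combinations(sorted(people), 2))
    return edges


# Placeholder entities, trimmed to the two fields the function reads.
entities = [
    {"profile_data": {"name": "Person A"},
     "affiliations": [{"custodian_slug": "nationaal-archief"}]},
    {"profile_data": {"name": "Person B"},
     "affiliations": [{"custodian_slug": "nationaal-archief"},
                      {"custodian_slug": "noord-hollands-archief"}]},
    {"profile_data": {"name": "Person C"},
     "affiliations": [{"custodian_slug": "noord-hollands-archief"}]},
]
edges = co_affiliation_edges(entities)
print(sorted(edges))
# [('Person A', 'Person B'), ('Person B', 'Person C')]
```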
Related Rules
- Rule 5: Never delete enriched data - additive only
- Rule 12: Person data reference pattern (file paths, not inline)
- Rule 20: Person entity profiles stored individually
- Rule 22: Custodian YAML is single source of truth for custodian data
- Rule 26: Person data provenance - web claims required
Scripts
| Script | Purpose |
|---|---|
| scripts/parse_linkedin_html.py | Parse LinkedIn company staff pages |
| scripts/link_person_observations.py | Link entity files to custodian YAML |
| scripts/fetch_linkedin_profiles_exa.py | Extract full profiles via Exa |
See Also
- docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md - Detailed documentation
- .opencode/PERSON_DATA_REFERENCE_PATTERN.md - Reference pattern details
- .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - Exa extraction rules