# Person-Custodian Data Architecture

## Rule: Single Source of Truth for Person Data

**🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian files store only references and affiliation provenance.**

This architecture ensures:

- No data duplication across custodian files
- Clean separation between person data and affiliation data
- Cross-custodian career tracking capability
- Single place to update person information

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          DATA ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  PERSON ENTITY FILES                   CUSTODIAN YAML FILES         │
│  (Single Source of Truth)              (References Only)            │
│                                                                     │
│  ┌─────────────────────────┐           ┌──────────────────────┐     │
│  │ person/entity/          │           │ NL-ZH-DHA-A-NA.yaml  │     │
│  │ {slug}_{timestamp}.json │◄──────────│                      │     │
│  │                         │    ref    │ person_observations: │     │
│  │ • profile_data          │           │   staff:             │     │
│  │ • web_claims ◄──────────┼───────────│   - person_id: ...   │     │
│  │ • affiliations ─────────┼──────────►│     affiliation_     │     │
│  │                         │   sync    │     provenance: ...  │     │
│  └─────────────────────────┘           └──────────────────────┘     │
│                                                                     │
│               │                                   │                 │
│               │ same file                         │ different files │
│               ▼                                   ▼                 │
│  ┌─────────────────────────┐           ┌──────────────────────┐     │
│  │ Multiple custodians     │           │ NL-NH-HAA-A-NHA.yaml │     │
│  │ reference SAME person   │◄──────────│                      │     │
│  │ entity file             │           │ (same person, diff   │     │
│  │                         │           │  custodian)          │     │
│  └─────────────────────────┘           └──────────────────────┘     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Directory Structure

```
data/custodian/
├── person/
│   ├── entity/                    # SINGLE SOURCE OF TRUTH for person data
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   └── ...
│   ├── affiliated/                # Custodian staff lists (parsed)
│   │   └── parsed/
│   │       └── nationaal-archief_staff_20251214T112147Z.json
│   └── connection/                # Professional network data
│       └── manual/
│           └── {slug}_connections_{timestamp}.json
│
├── NL-ZH-DHA-A-NA.yaml            # Custodian file - references persons
├── NL-NH-HAA-A-NHA.yaml           # Another custodian - may reference same persons
└── ...
```

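
The `{slug}_{timestamp}.json` naming used in the tree can be sketched in Python as follows. The helper name `entity_file_path` and the constant `ENTITY_DIR` are hypothetical, and the compact UTC timestamp format is inferred from the example filenames rather than taken from a documented convention. Pass a timezone-aware `datetime`; a naive one would be interpreted as local time.

```python
from datetime import datetime, timezone
from pathlib import Path

# Directory taken from the tree above; the constant name is hypothetical.
ENTITY_DIR = Path("data/custodian/person/entity")


def entity_file_path(slug: str, observed_at: datetime) -> Path:
    """Build `{slug}_{timestamp}.json` using a compact UTC timestamp."""
    stamp = observed_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return ENTITY_DIR / f"{slug}_{stamp}.json"


path = entity_file_path("bibianvanreeken", datetime(2025, 12, 11, tzinfo=timezone.utc))
print(path.as_posix())
```
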
---

## What Goes Where

### Person Entity Files (`data/custodian/person/entity/`)

**Store ALL person-specific data**:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```

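
A quick structural check for an entity record might look like the sketch below. It assumes the four top-level sections and the five web-claim fields shown in the example are all required; that is an inference from the example, not a documented schema, and `entity_problems` is a hypothetical helper name.

```python
# Sections and claim fields as they appear in the example above.
# Treating every one of them as required is an assumption.
REQUIRED_SECTIONS = ("extraction_metadata", "profile_data", "web_claims", "affiliations")
CLAIM_FIELDS = ("claim_type", "claim_value", "source_url", "retrieved_on", "retrieval_agent")


def entity_problems(entity: dict) -> list[str]:
    """List missing sections and missing web-claim fields in an entity record."""
    problems = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in entity]
    for index, claim in enumerate(entity.get("web_claims", [])):
        for field in CLAIM_FIELDS:
            if field not in claim:
                problems.append(f"web_claims[{index}] missing {field}")
    return problems


# Deliberately incomplete record with placeholder values.
incomplete = {
    "profile_data": {"name": "Example Person"},
    "web_claims": [{"claim_type": "full_name", "claim_value": "Example Person"}],
}
for problem in entity_problems(incomplete):
    print(problem)
```

An empty list means the record has every section and every claim carries its provenance fields.
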
### Custodian YAML Files (`data/custodian/NL-*.yaml`)

**Store ONLY references and affiliation provenance**:

```yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true

    # AFFILIATION PROVENANCE ONLY - when/how was this association observed?
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser

    # References to entity file (contains full profile + web claims)
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```

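
Because the custodian entry carries only a path, a consumer that needs profile data or web claims has to follow `linkedin_profile_path`. A minimal sketch, assuming the staff entry has already been parsed into a dict and that the path is relative to the repository root (`load_person_entity` and `repo_root` are hypothetical names):

```python
import json
from pathlib import Path


def load_person_entity(staff_entry: dict, repo_root: Path = Path(".")) -> dict:
    """Follow `linkedin_profile_path` from a staff entry to its entity file."""
    entity_path = repo_root / staff_entry["linkedin_profile_path"]
    with entity_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```
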
---

## Data Flow

```
LinkedIn Company Page (staff list)
        │
        ▼
┌─────────────────────────────────────────────┐
│  1. Parse HTML → Staff list JSON            │
│     (parse_linkedin_html.py)                │
│     Output: affiliated/parsed/{slug}.json   │
└─────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────┐
│  2. Extract full profiles → Entity files    │
│     (Exa crawling for each person)          │
│     Output: entity/{slug}_{timestamp}.json  │
└─────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────┐
│  3. Link to custodian YAML                  │
│     (link_person_observations.py)           │
│     - Adds affiliation_provenance           │
│     - Sets linkedin_profile_path            │
│     - Updates entity affiliations array     │
└─────────────────────────────────────────────┘
```

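
Step 3 can be illustrated on plain dicts. This is only a sketch of the three listed effects, not the logic of `link_person_observations.py`; the function `link_person` and the shape of the `observation` argument are assumptions made for the example.

```python
def link_person(custodian: dict, entity: dict, entity_path: str, observation: dict) -> None:
    """Reference the person from the custodian record and mirror the
    affiliation into the entity record. Both dicts are edited in place."""
    staff = custodian.setdefault("person_observations", {}).setdefault("staff", [])
    staff.append({
        "person_id": observation["person_id"],
        "person_name": entity["profile_data"]["name"],
        "role_title": observation["role_title"],
        "current": observation["current"],
        # Adds affiliation_provenance: when and how the association was observed.
        "affiliation_provenance": {
            "source_url": observation["source_url"],
            "retrieved_on": observation["retrieved_on"],
            "retrieval_agent": observation["retrieval_agent"],
        },
        # Sets linkedin_profile_path: a reference, never inline person data.
        "linkedin_profile_path": entity_path,
    })
    # Updates the entity affiliations array (append only, nothing is removed).
    entity.setdefault("affiliations", []).append({
        "custodian_name": observation["custodian_name"],
        "role_title": observation["role_title"],
        "current": observation["current"],
        "observed_on": observation["retrieved_on"],
        "source_url": observation["source_url"],
    })
```
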
---

## Correct vs Incorrect Patterns

### ❌ WRONG - Web Claims in Custodian File

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    # WRONG! Web claims don't belong in custodian file
    web_claims:
    - claim_type: full_name
      claim_value: Bibian van Reeken
      source_url: https://www.linkedin.com/in/bibianvanreeken
      retrieved_on: '2025-12-14T11:21:47Z'
```

### ✅ CORRECT - Affiliation Provenance Only

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    # CORRECT! Only affiliation provenance in custodian file
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```

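
The wrong pattern is easy to detect mechanically. The sketch below works on an already-parsed custodian record (a dict) and flags staff entries that embed person data. The key set `PERSON_ONLY_KEYS` is an assumption drawn from the entity example, and `misplaced_person_data` is a hypothetical helper, not an existing script.

```python
# Keys that belong in the person entity file, not in a custodian file.
# The exact list is an assumption, not an official schema.
PERSON_ONLY_KEYS = {"web_claims", "profile_data"}


def misplaced_person_data(custodian: dict) -> list[tuple[str, str]]:
    """Return (person_id, key) pairs for person data found in a custodian record."""
    staff = custodian.get("person_observations", {}).get("staff", [])
    return [
        (entry.get("person_id", "<unknown>"), key)
        for entry in staff
        for key in sorted(entry.keys() & PERSON_ONLY_KEYS)
    ]


# Placeholder record mirroring the WRONG pattern.
wrong = {"person_observations": {"staff": [
    {"person_id": "example_staff_0001", "web_claims": [{"claim_type": "full_name"}]},
]}}
print(misplaced_person_data(wrong))
```
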
---

## Benefits of This Architecture

### 1. No Data Duplication
Same person at multiple institutions → ONE entity file, multiple references

### 2. Single Source of Truth
Update a person's profile once → all custodian references stay up to date

### 3. Clean Separation of Concerns
- **Entity file**: Who is this person? (profile, claims)
- **Custodian file**: How are they associated? (affiliation provenance)

### 4. Cross-Custodian Career Tracking
Query all affiliations for a person from their entity file:
```json
{
  "affiliations": [
    {"custodian_name": "Nationaal Archief", "current": true},
    {"custodian_name": "Noord-Hollands Archief", "current": false}
  ]
}
```

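
Reading a career history out of an entity record is then a plain list traversal. A minimal sketch, assuming the entity JSON has been loaded into a dict; `custodians_for` is a hypothetical helper name.

```python
def custodians_for(entity: dict, current_only: bool = False) -> list[str]:
    """List the custodians a person is affiliated with, optionally only current ones."""
    affiliations = entity.get("affiliations", [])
    if current_only:
        affiliations = [a for a in affiliations if a.get("current")]
    return [a["custodian_name"] for a in affiliations]


entity = {
    "affiliations": [
        {"custodian_name": "Nationaal Archief", "current": True},
        {"custodian_name": "Noord-Hollands Archief", "current": False},
    ]
}
print(custodians_for(entity))
print(custodians_for(entity, current_only=True))
```
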
### 5. Network Analysis Ready
Entity files contain both profile data and affiliations → Easy to build relationship graphs

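
One simple relationship graph links two custodians whenever a person is affiliated with both. The sketch below builds those edges from a list of loaded entity records; the person names are placeholders and `shared_staff_edges` is a hypothetical helper, shown only to illustrate the idea.

```python
from collections import defaultdict
from itertools import combinations


def shared_staff_edges(entities: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Map each pair of custodians to the people affiliated with both."""
    edges = defaultdict(list)
    for entity in entities:
        name = entity["profile_data"]["name"]
        custodians = sorted({a["custodian_name"] for a in entity.get("affiliations", [])})
        for pair in combinations(custodians, 2):
            edges[pair].append(name)
    return dict(edges)


# Placeholder people; only the custodian names come from the examples above.
entities = [
    {"profile_data": {"name": "Person A"}, "affiliations": [
        {"custodian_name": "Nationaal Archief"},
        {"custodian_name": "Noord-Hollands Archief"},
    ]},
    {"profile_data": {"name": "Person B"}, "affiliations": [
        {"custodian_name": "Nationaal Archief"},
    ]},
]
print(shared_staff_edges(entities))
```
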
---

## Related Rules

- **Rule 5**: Never delete enriched data - additive only
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 20**: Person entity profiles stored individually
- **Rule 22**: Custodian YAML is single source of truth for custodian data
- **Rule 26**: Person data provenance - web claims required

---

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/parse_linkedin_html.py` | Parse LinkedIn company staff pages |
| `scripts/link_person_observations.py` | Link entity files to custodian YAML |
| `scripts/fetch_linkedin_profiles_exa.py` | Extract full profiles via Exa |

---

## See Also

- `docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md` - Detailed documentation
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern details
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa extraction rules