# Person-Custodian Data Architecture

## Rule: Single Source of Truth for Person Data

**🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian files store only references and affiliation provenance.**

This architecture ensures:
- No data duplication across custodian files
- Clean separation between person data and affiliation data
- Cross-custodian career tracking capability
- Single place to update person information

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                        DATA ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  PERSON ENTITY FILES                    CUSTODIAN YAML FILES        │
│  (Single Source of Truth)               (References Only)           │
│                                                                      │
│  ┌─────────────────────────┐           ┌──────────────────────┐    │
│  │ person/entity/          │           │ NL-ZH-DHA-A-NA.yaml  │    │
│  │ {slug}_{timestamp}.json │◄──────────│                      │    │
│  │                         │  ref      │ person_observations: │    │
│  │ • profile_data          │           │   staff:             │    │
│  │ • web_claims ◄──────────┼───────────│   - person_id: ...   │    │
│  │ • affiliations ─────────┼──────────►│     affiliation_     │    │
│  │                         │  sync     │     provenance: ...  │    │
│  └─────────────────────────┘           └──────────────────────┘    │
│                                                                      │
│           │                                     │                    │
│           │ same file                           │ different files   │
│           ▼                                     ▼                    │
│  ┌─────────────────────────┐           ┌──────────────────────┐    │
│  │ Multiple custodians     │           │ NL-NH-HAA-A-NHA.yaml │    │
│  │ reference SAME person   │◄──────────│                      │    │
│  │ entity file             │           │ (same person, diff   │    │
│  │                         │           │  custodian)          │    │
│  └─────────────────────────┘           └──────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Directory Structure

```
data/custodian/
├── person/
│   ├── entity/                    # SINGLE SOURCE OF TRUTH for person data
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   └── ...
│   ├── affiliated/                # Custodian staff lists (parsed)
│   │   └── parsed/
│   │       └── nationaal-archief_staff_20251214T112147Z.json
│   └── connection/                # Professional network data
│       └── manual/
│           └── {slug}_connections_{timestamp}.json
│
├── NL-ZH-DHA-A-NA.yaml           # Custodian file - references persons
├── NL-NH-HAA-A-NHA.yaml          # Another custodian - may reference same persons
└── ...
```
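
The entity file names above follow a `{slug}_{timestamp}.json` pattern. As a minimal sketch, assuming the timestamp is always a compact UTC value such as `20251209T170000Z`, the name can be split as shown below. `parse_entity_filename` and `EntityFileName` are hypothetical helpers, not part of the repository's scripts.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import NamedTuple


class EntityFileName(NamedTuple):
    slug: str
    timestamp: datetime


def parse_entity_filename(path: str) -> EntityFileName:
    """Split '{slug}_{timestamp}.json' into a slug and a UTC datetime."""
    stem = PurePosixPath(path).stem  # drops the directory and the '.json' suffix
    # Split on the LAST underscore so a slug that itself contains '_' stays intact.
    slug, sep, raw_ts = stem.rpartition("_")
    if not sep or not slug:
        raise ValueError("expected '<slug>_<timestamp>.json', got " + repr(path))
    # Assumption: timestamps look like 20251209T170000Z and are in UTC.
    timestamp = datetime.strptime(raw_ts, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return EntityFileName(slug, timestamp)
```

For example, `parse_entity_filename("data/custodian/person/entity/giovanna-fossati_20251209T170000Z.json").slug` yields `giovanna-fossati`.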

---

## What Goes Where

### Person Entity Files (`data/custodian/person/entity/`)

**Store ALL person-specific data**:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```

### Custodian YAML Files (`data/custodian/NL-*.yaml`)

**Store ONLY references and affiliation provenance**:

```yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    
    # AFFILIATION PROVENANCE ONLY - when/how was this association observed?
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    
    # References to entity file (contains full profile + web claims)
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```
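
Because the custodian entry holds only a path, a consumer has to follow `linkedin_profile_path` to reach the profile and web claims. The sketch below shows one way to do that, assuming the custodian YAML has already been parsed into a plain dict and that paths are relative to the repository root. `load_person_entity` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_person_entity(staff_entry: dict, repo_root: Path) -> dict:
    """Follow a staff entry's linkedin_profile_path to the entity JSON file."""
    rel_path = staff_entry.get("linkedin_profile_path")
    if not rel_path:
        raise ValueError(
            "staff entry %r has no linkedin_profile_path" % staff_entry.get("person_id")
        )
    # Assumption: the stored path is relative to the repository root.
    with open(repo_root / rel_path, encoding="utf-8") as fh:
        return json.load(fh)
```

The profile name, claims, and affiliations are then read from the returned dict rather than from the custodian entry.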

---

## Data Flow

```
LinkedIn Company Page (staff list)
        │
        ▼
┌───────────────────────────────────────────┐
│ 1. Parse HTML → Staff list JSON           │
│    (parse_linkedin_html.py)               │
│    Output: affiliated/parsed/             │
│      {slug}_staff_{timestamp}.json        │
└───────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────┐
│ 2. Extract full profiles → Entity files   │
│    (Exa crawling for each person)         │
│    Output: entity/{slug}_{timestamp}.json │
└───────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────┐
│ 3. Link to custodian YAML                 │
│    (link_person_observations.py)          │
│    - Adds affiliation_provenance          │
│    - Sets linkedin_profile_path           │
│    - Updates entity affiliations array    │
└───────────────────────────────────────────┘
```
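
Step 3 touches both sides of the link. The following is a rough sketch of that step on in-memory dicts, not the actual logic of `link_person_observations.py`. The function name `link_person`, its parameters, and the mapping of `observed_on` to `retrieved_on` are assumptions made for illustration.

```python
def link_person(entity: dict, custodian: dict, entity_path: str,
                affiliation: dict, retrieval_agent: str) -> None:
    """Mutate both documents in place so that they reference each other."""
    meta = entity["extraction_metadata"]
    staff = custodian.setdefault("person_observations", {}).setdefault("staff", [])

    # Custodian side: reference plus affiliation provenance, no profile data.
    if not any(s.get("person_id") == meta["staff_id"] for s in staff):
        staff.append({
            "person_id": meta["staff_id"],
            "person_name": entity["profile_data"]["name"],
            "role_title": affiliation["role_title"],
            "heritage_relevant": affiliation["heritage_relevant"],
            "heritage_type": affiliation["heritage_type"],
            "current": affiliation["current"],
            "affiliation_provenance": {
                "source_url": affiliation["source_url"],
                # Assumption: the observation time doubles as the retrieval time.
                "retrieved_on": affiliation["observed_on"],
                "retrieval_agent": retrieval_agent,
            },
            "linkedin_profile_url": meta["linkedin_url"],
            "linkedin_profile_path": entity_path,
        })

    # Entity side: append the affiliation once; existing entries are never removed.
    affiliations = entity.setdefault("affiliations", [])
    if not any(a.get("custodian_slug") == affiliation["custodian_slug"]
               for a in affiliations):
        affiliations.append(affiliation)
```

Both checks make the step idempotent, so re-running it for the same person and custodian adds nothing, which fits the additive-only rule listed under Related Rules.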

---

## Correct vs Incorrect Patterns

### ❌ WRONG - Web Claims in Custodian File

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    # WRONG! Web claims don't belong in custodian file
    web_claims:
      - claim_type: full_name
        claim_value: Bibian van Reeken
        source_url: https://www.linkedin.com/in/bibianvanreeken
        retrieved_on: '2025-12-14T11:21:47Z'
```

### ✅ CORRECT - Affiliation Provenance Only

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    # CORRECT! Only affiliation provenance in custodian file
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```
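
The two patterns above lend themselves to an automated check. This is a minimal sketch, assuming each staff entry has been parsed into a dict. The key sets and the function name `check_staff_entry` are illustrative assumptions, not an existing validator in the repository.

```python
# Assumption: these keys carry person data and belong only in entity files.
FORBIDDEN_KEYS = {"web_claims", "profile_data"}
# Assumption: every custodian staff entry needs these to be a usable reference.
REQUIRED_KEYS = {"person_id", "affiliation_provenance", "linkedin_profile_path"}


def check_staff_entry(entry: dict) -> list:
    """Return human-readable problems for one custodian staff entry."""
    present = set(entry)
    problems = []
    for key in sorted(FORBIDDEN_KEYS & present):
        problems.append(key + " belongs in the person entity file")
    for key in sorted(REQUIRED_KEYS - present):
        problems.append("missing " + key)
    return problems
```

An empty list means the entry follows the reference-only pattern. The WRONG example above would be reported for carrying `web_claims` and for lacking `affiliation_provenance` and `linkedin_profile_path`.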

---

## Benefits of This Architecture

### 1. No Data Duplication
Same person at multiple institutions → ONE entity file, multiple references

### 2. Single Source of Truth
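Update a person's profile once in the entity file → every custodian file that references it through `linkedin_profile_path` resolves to the updated data, with no per-custodian edits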

### 3. Clean Separation of Concerns
- **Entity file**: Who is this person? (profile, claims)
- **Custodian file**: How are they associated? (affiliation provenance)

### 4. Cross-Custodian Career Tracking
Query all affiliations for a person from their entity file:

```json
{
  "affiliations": [
    {"custodian_name": "Nationaal Archief", "current": true},
    {"custodian_name": "Noord-Hollands Archief", "current": false}
  ]
}
```
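
As a small illustration of such a query, the sketch below splits a loaded entity's affiliations into current and former custodians. It assumes the entity JSON has been read into a dict. `split_affiliations` is a hypothetical helper name.

```python
def split_affiliations(entity: dict) -> tuple:
    """Return (current, former) custodian names from an entity's affiliations."""
    current, former = [], []
    for aff in entity.get("affiliations", []):
        # A missing 'current' flag is treated as a former affiliation (assumption).
        target = current if aff.get("current") else former
        target.append(aff["custodian_name"])
    return current, former
```

Applied to the affiliations shown above, this gives `["Nationaal Archief"]` as current and `["Noord-Hollands Archief"]` as former.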

### 5. Network Analysis Ready
Entity files contain both profile data and affiliations → Easy to build relationship graphs
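
One possible starting point for such a graph is a set of co-affiliation edges: two people are connected when their entity files list the same custodian. The sketch below assumes the entity files are already loaded into dicts and uses `profile_data.name` as the node label for readability. A real implementation would more likely key on a stable identifier. `co_affiliation_edges` is a hypothetical function name.

```python
from collections import defaultdict
from itertools import combinations


def co_affiliation_edges(entities: list) -> set:
    """Return undirected (person, person) pairs that share a custodian."""
    members = defaultdict(set)
    for entity in entities:
        # Assumption: names are unique enough to serve as node labels here.
        person = entity["profile_data"]["name"]
        for aff in entity.get("affiliations", []):
            members[aff["custodian_name"]].add(person)

    edges = set()
    for people in members.values():
        # Sorting makes each pair canonical, so (A, B) and (B, A) collapse.
        edges.update(combinations(sorted(people), 2))
    return edges
```

The resulting pairs can be passed to any graph library as an edge list.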

---

## Related Rules

- **Rule 5**: Never delete enriched data - additive only
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 20**: Person entity profiles stored individually
- **Rule 22**: Custodian YAML is single source of truth for custodian data
- **Rule 26**: Person data provenance - web claims required

---

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/parse_linkedin_html.py` | Parse LinkedIn company staff pages |
| `scripts/link_person_observations.py` | Link entity files to custodian YAML |
| `scripts/fetch_linkedin_profiles_exa.py` | Extract full profiles via Exa |

---

## See Also

- `docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md` - Detailed documentation
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern details
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa extraction rules