glam/docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md

# Person-Custodian Data Architecture

## Overview

This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a **Single Source of Truth** pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.

## Table of Contents

1. [Architecture Principles](#architecture-principles)
2. [Directory Structure](#directory-structure)
3. [Data Model](#data-model)
4. [Person Entity Files](#person-entity-files)
5. [Custodian YAML Files](#custodian-yaml-files)
6. [Data Flow](#data-flow)
7. [Scripts and Tools](#scripts-and-tools)
8. [Examples](#examples)
9. [Migration Guide](#migration-guide)
10. [FAQ](#faq)
11. [Related Documentation](#related-documentation)

---

## Architecture Principles

### 1. Single Source of Truth

**Person entity files are the authoritative source for all person data.**

- Profile information (name, headline, about, experience, education, skills)
- Web claims (provenance for extracted data)
- Affiliations (all custodians this person is associated with)

### 2. Separation of Concerns

**Different data types live in different locations:**

| Concern | Location | Rationale |
|---------|----------|-----------|
| Who is this person? | Entity file | Reusable across custodians |
| What is their background? | Entity file | Belongs to the person, not the custodian |
| Where did we get this data? | Entity file (web_claims) | Provenance is per-claim |
| How are they affiliated? | Custodian file | Relationship-specific data |
| When did we observe this? | Both | Entity has claim timestamps; Custodian has affiliation timestamp |

### 3. No Data Duplication

**Same person appearing at multiple institutions → ONE entity file**

```
Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
│
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
│
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
```

### 4. Cross-Custodian Career Tracking

Entity files track all affiliations, enabling queries like:
- "Who has worked at multiple archives?"
- "Show career paths in the heritage sector"
- "Find people who moved from museums to archives"
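
As an illustration of the first query, here is a minimal sketch that scans entity files and reports people affiliated with two or more custodians. It assumes only the `affiliations[].custodian_slug` and `affiliations[].heritage_type` fields described in this document; the function names `load_entities` and `people_at_multiple_custodians` are hypothetical and not part of the project scripts.

```python
import json
from pathlib import Path


def load_entities(entity_dir):
    """Yield (file name, parsed JSON) for every entity file in entity_dir."""
    for path in sorted(Path(entity_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.name, json.load(fh)


def people_at_multiple_custodians(entities, heritage_type=None):
    """Map entity file name -> custodian slugs, for people with 2+ custodians.

    If heritage_type is given (e.g. "A"), only affiliations carrying that
    type code are counted.
    """
    result = {}
    for filename, entity in entities:
        slugs = []
        for aff in entity.get("affiliations", []):
            if heritage_type is not None and aff.get("heritage_type") != heritage_type:
                continue
            slug = aff.get("custodian_slug")
            if slug and slug not in slugs:
                slugs.append(slug)
        if len(slugs) >= 2:
            result[filename] = slugs
    return result
```

A call such as `people_at_multiple_custodians(load_entities("data/custodian/person/entity"), heritage_type="A")` would then approximate "Who has worked at multiple archives?", provided the affiliations carry a `heritage_type`.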

---

## Directory Structure

```
data/custodian/
├── person/
│   │
│   ├── entity/                           # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                       # Staff lists by custodian
│   │   ├── manual/                       # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                       # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                       # Professional network data
│       ├── manual/                       # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                       # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
│
├── NL-ZH-DHA-A-NA.yaml                  # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...
```

### File Naming Conventions

| File Type | Pattern | Example |
|-----------|---------|---------|
| Person entity | `{linkedin_slug}_{ISO_timestamp}.json` | `bibianvanreeken_20251211T000000Z.json` |
| Staff list (parsed) | `{custodian_slug}_staff_{ISO_timestamp}.json` | `nationaal-archief_staff_20251214T112147Z.json` |
| Connections | `{linkedin_slug}_connections_{ISO_timestamp}.json` | `giovanna-fossati_connections_20251211T140000Z.json` |
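
The person entity pattern can be built and parsed mechanically. The sketch below is one possible approach using only the standard library; the helper names `entity_filename` and `parse_entity_filename` are hypothetical, and the regular expression assumes the timestamp is always the last underscore-separated part before `.json`.

```python
import re
from datetime import datetime, timezone

# Compact ISO 8601 timestamp as used in file names, e.g. 20251211T000000Z.
_TS_FORMAT = "%Y%m%dT%H%M%SZ"
_ENTITY_NAME = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


def entity_filename(linkedin_slug, when):
    """Build '{linkedin_slug}_{ISO_timestamp}.json' from a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    return f"{linkedin_slug}_{when.astimezone(timezone.utc).strftime(_TS_FORMAT)}.json"


def parse_entity_filename(filename):
    """Split an entity file name into (linkedin_slug, UTC datetime)."""
    match = _ENTITY_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an entity file name: {filename!r}")
    when = datetime.strptime(match["ts"], _TS_FORMAT).replace(tzinfo=timezone.utc)
    return match["slug"], when
```

Applied to a staff list or connections file, the same parser would return the whole prefix (for example `giovanna-fossati_connections`) as the slug, so those file types would need their own pattern.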

---

## Data Model

### Conceptual Model

```
┌──────────────────┐         ┌──────────────────┐
│  Person Entity   │         │    Custodian     │
│                  │   N:M   │                  │
│  - profile_data  │◄───────►│  - name          │
│  - web_claims    │         │  - ghcid         │
│  - affiliations  │         │  - staff[]       │
│                  │         │                  │
└──────────────────┘         └──────────────────┘
        │                            │
        │ 1:N                        │ 1:N
        ▼                            ▼
┌──────────────────┐         ┌──────────────────┐
│    Web Claim     │         │   Staff Entry    │
│                  │         │                  │
│  - claim_type    │         │  - person_id     │
│  - claim_value   │         │  - person_name   │
│  - source_url    │         │  - role_title    │
│  - retrieved_on  │         │  - affiliation_  │
│  - retrieval_    │         │    provenance    │
│    agent         │         │  - linkedin_     │
│                  │         │    profile_path  │
└──────────────────┘         └──────────────────┘
```

### Key Relationships

| Relationship | Cardinality | Description |
|--------------|-------------|-------------|
| Person ↔ Custodian | N:M | Person can work at multiple custodians; Custodian has multiple staff |
| Person → WebClaim | 1:N | One person has many provenance claims |
| Person → Affiliation | 1:N | One person has many affiliations (tracked in entity file) |
| Custodian → StaffEntry | 1:N | One custodian has many staff entries |

---

## Person Entity Files

### Location

`data/custodian/person/entity/{linkedin_slug}_{timestamp}.json`

### Complete Schema

```jsonc
{
  "extraction_metadata": {
    "source_file": "string",         // Path to source staff list
    "staff_id": "string",            // Unique identifier
    "extraction_date": "ISO8601",    // When profile was extracted
    "extraction_method": "string",   // exa_contents, exa_crawling_exa, manual
    "extraction_agent": "string",    // claude-opus-4.5 for manual, empty for automated
    "linkedin_url": "string",        // Full LinkedIn profile URL
    "cost_usd": 0,                   // API cost (0 for Exa contents)
    "request_id": "string"           // Optional: Exa request ID
  },

  "linkedin_profile_url": "string",  // Canonical LinkedIn URL

  "profile_data": {
    "name": "string",                // Full name
    "headline": "string",            // Current role/headline
    "location": "string",            // City, Region, Country
    "connections": "string",         // "500 connections • 2,135 followers"
    "about": "string",               // Professional summary
    "experience": [                  // Work history
      {
        "title": "string",
        "company": "string",
        "duration": "string",
        "location": "string",
        "description": "string"
      }
    ],
    "education": [                   // Education history
      {
        "school": "string",
        "degree": "string",
        "field": "string",
        "years": "string"
      }
    ],
    "skills": ["string"],            // Skills list
    "languages": [                   // Languages
      {
        "language": "string",
        "proficiency": "string"
      }
    ],
    "profile_image_url": "string"    // CDN URL for profile photo
  },

  "web_claims": [                    // Provenance for extracted data
    {
      "claim_type": "string",        // full_name, role_title, location, etc.
      "claim_value": "string",       // The extracted value
      "source_url": "string",        // Where it was found
      "retrieved_on": "ISO8601",     // When it was retrieved
      "retrieval_agent": "string"    // linkedin_html_parser, exa_crawling_exa, etc.
    }
  ],

  "affiliations": [                  // All known custodian associations
    {
      "custodian_name": "string",    // Full custodian name
      "custodian_slug": "string",    // Normalized slug
      "role_title": "string",        // Role at this custodian
      "heritage_relevant": true,     // Is this a heritage role?
      "heritage_type": "A",          // GLAMORCUBESFIXPHDNT type code
      "current": true,               // Currently employed?
      "observed_on": "ISO8601",      // When this affiliation was observed
      "source_url": "string"         // Where this was observed
    }
  ]
}
```

### Required Fields

| Field | Required | Notes |
|-------|----------|-------|
| `extraction_metadata.extraction_date` | YES | ISO 8601 timestamp |
| `extraction_metadata.linkedin_url` | YES | Full LinkedIn profile URL |
| `linkedin_profile_url` | YES | Canonical URL (may duplicate above) |
| `profile_data.name` | YES | Full name |
| `web_claims` | YES | At least one claim (usually full_name) |
| `affiliations` | NO | May be empty if no custodian association known |
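
The required-field rules in this table can be checked before an entity file is written. The following is a minimal sketch, not the project's validator: `validate_entity` and `_get` are hypothetical names, and the check covers only presence, not format.

```python
REQUIRED_FIELDS = (
    "extraction_metadata.extraction_date",
    "extraction_metadata.linkedin_url",
    "linkedin_profile_url",
    "profile_data.name",
)


def _get(data, dotted_path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def validate_entity(entity):
    """Return a list of problems; an empty list means all required fields are present."""
    problems = [
        f"missing or empty: {field}"
        for field in REQUIRED_FIELDS
        if not _get(entity, field)
    ]
    claims = entity.get("web_claims")
    if not isinstance(claims, list) or not claims:
        problems.append("web_claims must contain at least one claim")
    # affiliations is optional and may be empty, so it is not checked here.
    return problems
```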

---

## Custodian YAML Files

### Location

`data/custodian/{GHCID}.yaml`

### Staff Entry Schema

```yaml
person_observations:
  staff:
  - person_id: string              # Unique identifier ({custodian_slug}_staff_{NNNN}_{name_slug})
    person_name: string            # Full name (for display/search)
    role_title: string             # Current role at this custodian
    heritage_relevant: boolean     # Is this a heritage-relevant role?
    heritage_type: string          # GLAMORCUBESFIXPHDNT type code
    current: boolean               # Currently employed?

    # AFFILIATION PROVENANCE - when/how was this association observed?
    affiliation_provenance:
      source_url: string           # Where this association was found
      retrieved_on: string         # ISO 8601 timestamp
      retrieval_agent: string      # Tool used (linkedin_html_parser, etc.)

    # REFERENCES to person entity file
    linkedin_profile_url: string   # For quick access/linking
    linkedin_profile_path: string  # Path to entity JSON file
```

### What NOT to Include

**Never put these in custodian YAML:**

- `web_claims` - Belongs in entity file
- `profile_data` - Belongs in entity file
- `experience` - Belongs in entity file
- `education` - Belongs in entity file
- `skills` - Belongs in entity file
- `about` - Belongs in entity file
- Full profile content of any kind
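
A quick way to enforce this list is to scan the staff entries of a parsed custodian file for the forbidden keys. The sketch below works on an already-parsed dict (for example the result of loading the YAML with a YAML library, which is an assumption here); `find_forbidden_keys` is a hypothetical helper name.

```python
FORBIDDEN_STAFF_KEYS = frozenset(
    {"web_claims", "profile_data", "experience", "education", "skills", "about"}
)


def find_forbidden_keys(custodian):
    """Return [(person_id, forbidden keys)] for staff entries carrying profile content."""
    staff = (custodian.get("person_observations") or {}).get("staff") or []
    violations = []
    for entry in staff:
        # Iterating a dict yields its keys, so this intersects with the entry's keys.
        bad = sorted(FORBIDDEN_STAFF_KEYS.intersection(entry))
        if bad:
            violations.append((entry.get("person_id", "<no person_id>"), bad))
    return violations
```

An empty result means no staff entry contains profile data that belongs in an entity file.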

---

## Data Flow

### Complete Pipeline

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA FLOW PIPELINE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
       │
       ▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html


PHASE 2: PARSING
─────────────────
Manual HTML file
       │
       ▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
       │
       │ Contains: List of {name, headline, linkedin_url, heritage_relevant}
       │


PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
       │
       ▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
       │
       │ Contains: Full profile_data, web_claims, affiliations
       │


PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
       │
       ▼ (link_person_observations.py)
       │
       ├──► Custodian YAML updated with:
       │    - person_observations.staff[] entries
       │    - affiliation_provenance
       │    - linkedin_profile_path references
       │
       └──► Entity files updated with:
            - web_claims (if not present)
            - affiliations array (new custodian added)
```

### Script Responsibilities

| Script | Input | Output | Purpose |
|--------|-------|--------|---------|
| `parse_linkedin_html.py` | Raw HTML | `affiliated/parsed/*.json` | Extract staff list |
| `fetch_linkedin_profiles_exa.py` | Staff list | `entity/*.json` | Extract full profiles |
| `link_person_observations.py` | Custodian YAML + staff list + entity files | Updated custodian YAML + entity files | Create references |

---

## Scripts and Tools

### parse_linkedin_html.py

**Purpose**: Parse LinkedIn company "People" pages to extract staff lists.

**Usage**:
```bash
python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
    --custodian-name "Nationaal Archief" \
    --custodian-slug "nationaal-archief" \
    --output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json
```

**Output**: JSON file with staff entries containing:
- `name`, `headline`, `linkedin_url`
- `heritage_relevant`, `heritage_type`
- `degree` (LinkedIn connection degree)

### link_person_observations.py

**Purpose**: Link person entity files to custodian YAML files.

**Usage**:
```bash
python scripts/link_person_observations.py \
    --custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --entity-dir data/custodian/person/entity
```

**Actions**:
1. Reads staff list to get person identifiers
2. Finds matching entity files in `entity/`
3. Updates custodian YAML with `person_observations.staff[]`
4. Adds `affiliation_provenance` and `linkedin_profile_path`
5. Updates entity files with new affiliations and web_claims
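
Steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the code of `link_person_observations.py`: the helper names are hypothetical, the entity file is matched on the last path segment of the LinkedIn URL, and the `person_id` layout is inferred from the examples in this document. The `current` flag is left out because the staff list output does not include it.

```python
from pathlib import Path
from urllib.parse import urlparse


def linkedin_slug(url):
    """Return the last path segment of a LinkedIn profile URL."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


def find_entity_file(entity_dir, slug):
    """Return the newest '{slug}_{timestamp}.json' in entity_dir, or None."""
    # Compact ISO timestamps sort chronologically as plain strings.
    matches = sorted(Path(entity_dir).glob(f"{slug}_*.json"))
    return matches[-1] if matches else None


def build_staff_entry(index, custodian_slug, record, entity_path, provenance):
    """Build one person_observations.staff[] entry containing references only."""
    # Assumed person_id layout: {custodian_slug}_staff_{NNNN}_{name_slug}.
    name_slug = "_".join(record["name"].lower().split())
    return {
        "person_id": f"{custodian_slug}_staff_{index:04d}_{name_slug}",
        "person_name": record["name"],
        "role_title": record.get("headline", ""),
        "heritage_relevant": record.get("heritage_relevant", False),
        "heritage_type": record.get("heritage_type"),
        "affiliation_provenance": dict(provenance),
        "linkedin_profile_url": record["linkedin_url"],
        "linkedin_profile_path": Path(entity_path).as_posix(),
    }
```

The entry deliberately carries no `profile_data` or `web_claims`; those stay in the entity file that `linkedin_profile_path` points to.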

### fetch_linkedin_profiles_exa.py

**Purpose**: Extract full LinkedIn profiles using Exa API.

**Usage**:
```bash
python scripts/fetch_linkedin_profiles_exa.py \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --output-dir data/custodian/person/entity \
    --limit 50
```

---

## Examples

### Example 1: Complete Person Entity File

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "connections": "500+ connections",
    "about": "Experienced project manager specializing in digitization...",
    "experience": [
      {
        "title": "Projectmanager Digitalisering",
        "company": "Nationaal Archief",
        "duration": "3 years",
        "location": "The Hague, Netherlands"
      }
    ],
    "education": [
      {
        "school": "Leiden University",
        "degree": "Master",
        "field": "History"
      }
    ],
    "skills": ["Project Management", "Digitization", "Archives"]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```

### Example 2: Custodian YAML Staff Section

```yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json

  - person_id: nationaal-archief_staff_0002_jan_de_vries
    person_name: Jan de Vries
    role_title: Senior Archivist
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
    linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json
```

### Example 3: Cross-Custodian Reference

Person works at two custodians:

**Entity file** (`sandra-den-hamer-66024510_20251209T190000Z.json`):
```json
{
  "affiliations": [
    {
      "custodian_name": "EYE Filmmuseum",
      "custodian_slug": "eye-filmmuseum",
      "role_title": "Director",
      "current": false,
      "observed_on": "2025-12-09T19:00:00Z"
    },
    {
      "custodian_name": "Netherlands Film Fund",
      "custodian_slug": "netherlands-filmfonds",
      "role_title": "Interim CEO",
      "current": true,
      "observed_on": "2025-12-14T10:00:00Z"
    }
  ]
}
```

**Custodian 1** (`NL-NH-AMS-U-EFM.yaml`):
```yaml
person_observations:
  staff:
  - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Director
    current: false
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```

**Custodian 2** (`NL-ZH-DHA-O-NFF.yaml`):
```yaml
person_observations:
  staff:
  - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Interim CEO
    current: true
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```

**Note**: Both custodians reference the SAME entity file!
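
To confirm that shared references really point at one file, the staff entries of several custodians can be grouped by `linkedin_profile_path`. A minimal sketch, assuming the custodian YAML files have already been parsed into dicts keyed by file name; `references_by_entity` is a hypothetical helper.

```python
from collections import defaultdict


def references_by_entity(custodians):
    """Map each linkedin_profile_path to the custodian files that reference it.

    `custodians` maps a custodian file name to its parsed YAML content.
    """
    refs = defaultdict(list)
    for filename, content in custodians.items():
        staff = (content.get("person_observations") or {}).get("staff") or []
        for entry in staff:
            path = entry.get("linkedin_profile_path")
            if path:
                refs[path].append(filename)
    return dict(refs)
```

For the two custodians above, the result would contain a single key (the Sandra den Hamer entity path) listing both YAML files.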

---

## Migration Guide

### Migrating from Inline Web Claims

If you have custodian files with inline `web_claims`, migrate them:

**Before** (incorrect):
```yaml
person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    web_claims:  # WRONG - should not be here
      - claim_type: full_name
        claim_value: John Doe
```

**After** (correct):
```yaml
person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/example/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json
```

**Migration steps**:
1. Create entity file with profile data + web claims
2. Remove `web_claims` from custodian YAML
3. Add `affiliation_provenance` block
4. Add `linkedin_profile_path` reference
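
Steps 2 to 4 can be applied to a single parsed staff entry as in the sketch below. `migrate_staff_entry` is a hypothetical helper, not an existing project script; it returns the removed claims so the caller can store them in the entity file (step 1).

```python
import copy


def migrate_staff_entry(entry, entity_path, provenance):
    """Return (migrated entry, removed web claims); the input entry is not modified."""
    migrated = copy.deepcopy(entry)
    # Step 2: remove inline web_claims; the caller moves them to the entity file.
    claims = migrated.pop("web_claims", [])
    # Step 3: add affiliation provenance unless the entry already has one.
    migrated.setdefault("affiliation_provenance", dict(provenance))
    # Step 4: reference the person entity file.
    migrated["linkedin_profile_path"] = entity_path
    return migrated, claims
```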

---

## FAQ

### Q: Why separate entity files from custodian files?

**A**: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.

### Q: Where do web claims go?

**A**: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.

### Q: What if I don't have a LinkedIn URL?

**A**: You can still create an entity file using other sources (institutional website, manual research). Use a different slug pattern based on the available identifier.

### Q: Can a person have multiple entity files?

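**A**: Ideally no: one person = one entity file. If duplicates are created by accident, they can be merged later. Match duplicates on the LinkedIn slug or `linkedin_profile_url`, not on `person_id`, which is scoped to a single custodian's staff list (the same person has a different `person_id` at each custodian).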

### Q: What timestamp format should I use?

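**A**: It depends on where the timestamp appears. File names use the ISO 8601 basic format without separators: `YYYYMMDDTHHMMSSZ` (e.g., `20251214T112147Z`). Timestamp fields inside the JSON and YAML files (`extraction_date`, `retrieved_on`, `observed_on`) use the extended format: `YYYY-MM-DDTHH:MM:SSZ` (e.g., `2025-12-14T11:21:47Z`).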

---

## Related Documentation

- **Agent Rules**: See `AGENTS.md` Rule 27
- **Agent Rule File**: `.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md`
- **Person Reference Pattern**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md`
- **LinkedIn Extraction**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md`
- **Data Fabrication**: `.opencode/DATA_FABRICATION_PROHIBITION.md` (Rule 21)