glam/.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
2025-12-14 17:09:55 +01:00

Person-Custodian Data Architecture

Rule: Single Source of Truth for Person Data

🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian files store only references and affiliation provenance.

This architecture ensures:

  • No data duplication across custodian files
  • Clean separation between person data and affiliation data
  • Cross-custodian career tracking capability
  • Single place to update person information

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        DATA ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  PERSON ENTITY FILES                    CUSTODIAN YAML FILES        │
│  (Single Source of Truth)               (References Only)           │
│                                                                      │
│  ┌─────────────────────────┐           ┌──────────────────────┐    │
│  │ person/entity/          │           │ NL-ZH-DHA-A-NA.yaml  │    │
│  │ {slug}_{timestamp}.json │◄──────────│                      │    │
│  │                         │  ref      │ person_observations: │    │
│  │ • profile_data          │           │   staff:             │    │
│  │ • web_claims ◄──────────┼───────────│   - person_id: ...   │    │
│  │ • affiliations ─────────┼──────────►│     affiliation_     │    │
│  │                         │  sync     │     provenance: ...  │    │
│  └─────────────────────────┘           └──────────────────────┘    │
│                                                                      │
│           │                                     │                    │
│           │ same file                           │ different files   │
│           ▼                                     ▼                    │
│  ┌─────────────────────────┐           ┌──────────────────────┐    │
│  │ Multiple custodians     │           │ NL-NH-HAA-A-NHA.yaml │    │
│  │ reference SAME person   │◄──────────│                      │    │
│  │ entity file             │           │ (same person, diff   │    │
│  │                         │           │  custodian)          │    │
│  └─────────────────────────┘           └──────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Directory Structure

data/custodian/
├── person/
│   ├── entity/                    # SINGLE SOURCE OF TRUTH for person data
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   └── ...
│   ├── affiliated/                # Custodian staff lists (parsed)
│   │   └── parsed/
│   │       └── nationaal-archief_staff_20251214T112147Z.json
│   └── connection/                # Professional network data
│       └── manual/
│           └── {slug}_connections_{timestamp}.json
│
├── NL-ZH-DHA-A-NA.yaml           # Custodian file - references persons
├── NL-NH-HAA-A-NHA.yaml          # Another custodian - may reference same persons
└── ...
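
The entity filename pattern {slug}_{timestamp}.json shown in the tree can be built and parsed with a small helper. This is a minimal sketch, not code from the project's scripts: the function names entity_filename and parse_entity_filename are hypothetical, and the timestamp layout is inferred from example names such as bibianvanreeken_20251211T000000Z.json.

```python
from datetime import datetime, timezone

# Timestamp layout inferred from names like bibianvanreeken_20251211T000000Z.json
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def entity_filename(slug: str, when: datetime) -> str:
    """Build '{slug}_{timestamp}.json' with the timestamp expressed in UTC.

    `when` is expected to be timezone-aware.
    """
    stamp = when.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{slug}_{stamp}.json"


def parse_entity_filename(name: str) -> tuple[str, datetime]:
    """Split an entity filename back into (slug, UTC timestamp)."""
    stem = name.removesuffix(".json")
    # rpartition splits on the LAST underscore, so slugs containing
    # underscores stay intact.
    slug, _, stamp = stem.rpartition("_")
    when = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return slug, when
```

Parsing the timestamp back out makes it possible to pick the most recent entity file when several exist for the same slug.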

What Goes Where

Person Entity Files (data/custodian/person/entity/)

Store ALL person-specific data:

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}

Custodian YAML Files (data/custodian/NL-*.yaml)

Store ONLY references and affiliation provenance:

person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    
    # AFFILIATION PROVENANCE ONLY - when/how was this association observed?
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    
    # References to entity file (contains full profile + web claims)
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json

Data Flow

LinkedIn Company Page (staff list)
        │
        ▼
┌───────────────────────────────────────────┐
│ 1. Parse HTML → Staff list JSON           │
│    (parse_linkedin_html.py)               │
│    Output: affiliated/parsed/             │
│      {slug}_staff_{timestamp}.json        │
└───────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────┐
│ 2. Extract full profiles → Entity files   │
│    (Exa crawling for each person)         │
│    Output: entity/{slug}_{timestamp}.json │
└───────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────┐
│ 3. Link to custodian YAML                 │
│    (link_person_observations.py)          │
│    - Adds affiliation_provenance          │
│    - Sets linkedin_profile_path           │
│    - Updates entity affiliations array    │
└───────────────────────────────────────────┘
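
Step 3 can be pictured as deriving a custodian staff entry from an entity record. The sketch below is illustrative only and is not the actual link_person_observations.py: the function name build_staff_reference is hypothetical, and mapping the affiliation's observed_on onto retrieved_on is an assumption based on the two values being identical in the examples in this document.

```python
def build_staff_reference(entity: dict, entity_path: str, custodian_slug: str,
                          retrieval_agent: str = "linkedin_html_parser") -> dict:
    """Derive a custodian staff entry (reference + affiliation provenance only)."""
    affiliation = next(
        (a for a in entity["affiliations"] if a["custodian_slug"] == custodian_slug),
        None,
    )
    if affiliation is None:
        raise LookupError(f"no affiliation with custodian {custodian_slug!r}")

    return {
        "person_id": entity["extraction_metadata"]["staff_id"],
        "person_name": entity["profile_data"]["name"],
        "role_title": affiliation["role_title"],
        "heritage_relevant": affiliation["heritage_relevant"],
        "heritage_type": affiliation["heritage_type"],
        "current": affiliation["current"],
        # Provenance of the association itself, not of the person's profile.
        "affiliation_provenance": {
            "source_url": affiliation["source_url"],
            # Assumption: observed_on doubles as the retrieval time.
            "retrieved_on": affiliation["observed_on"],
            "retrieval_agent": retrieval_agent,
        },
        # Pointers back to the single source of truth.
        "linkedin_profile_url": entity["extraction_metadata"]["linkedin_url"],
        "linkedin_profile_path": entity_path,
    }
```

The returned dictionary deliberately omits profile_data and web_claims, so serialising it into the custodian YAML cannot leak person data into the custodian file.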

Correct vs Incorrect Patterns

WRONG - Web Claims in Custodian File

# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    # WRONG! Web claims don't belong in custodian file
    web_claims:
      - claim_type: full_name
        claim_value: Bibian van Reeken
        source_url: https://www.linkedin.com/in/bibianvanreeken
        retrieved_on: '2025-12-14T11:21:47Z'

CORRECT - Affiliation Provenance Only

# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    # CORRECT! Only affiliation provenance in custodian file
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
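
The two patterns above lend themselves to an automated check. The following is a minimal sketch, assuming the custodian YAML has already been loaded into a dictionary; the function name find_misplaced_person_data and the ENTITY_ONLY_KEYS set are hypothetical, with the key list inferred from the entity file example in this document.

```python
# Keys that belong in person entity files, never in custodian staff entries.
# Inferred from the entity file example; extend if the entity schema grows.
ENTITY_ONLY_KEYS = {"profile_data", "web_claims", "affiliations", "extraction_metadata"}


def find_misplaced_person_data(custodian: dict) -> list[tuple[str, str]]:
    """Return (person_id, key) pairs where entity-only data sits in a custodian record."""
    staff = custodian.get("person_observations", {}).get("staff", [])
    violations = []
    for entry in staff:
        # dict key views support set intersection directly.
        for key in sorted(entry.keys() & ENTITY_ONLY_KEYS):
            violations.append((entry.get("person_id", "<unknown>"), key))
    return violations
```

An empty result means every staff entry holds only references and affiliation provenance; any returned pair names the entry and the key that should move to the entity file.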

Benefits of This Architecture

1. No Data Duplication

Same person at multiple institutions → ONE entity file, multiple references

2. Single Source of Truth

Update a person's profile once in the entity file → every custodian that points to it through linkedin_profile_path resolves to the updated data

3. Clean Separation of Concerns

  • Entity file: Who is this person? (profile, claims)
  • Custodian file: How are they associated? (affiliation provenance)

4. Cross-Custodian Career Tracking

Query all affiliations for a person from their entity file:

{
  "affiliations": [
    {"custodian_name": "Nationaal Archief", "current": true},
    {"custodian_name": "Noord-Hollands Archief", "current": false}
  ]
}
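
Reading a career history out of such a record takes only a few lines. This is a sketch under the assumption that the entity JSON has been loaded into a dictionary; career_summary is a hypothetical helper name, not an existing project function.

```python
def career_summary(entity: dict) -> dict[str, list[str]]:
    """Split a person's affiliations into current and former custodian names."""
    current: list[str] = []
    former: list[str] = []
    for affiliation in entity.get("affiliations", []):
        # A missing "current" flag is treated as a former affiliation.
        bucket = current if affiliation.get("current") else former
        bucket.append(affiliation["custodian_name"])
    return {"current": current, "former": former}
```

For the record above, the helper would report Nationaal Archief as current and Noord-Hollands Archief as former.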

5. Network Analysis Ready

Entity files contain both profile data and affiliations → Easy to build relationship graphs
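
As one possible starting point for such a graph, the sketch below groups people by custodian and lists pairs who share one. It assumes entity records are already loaded as dictionaries; the function names are hypothetical, and keying people by profile_data.name is a simplification chosen for readability (a stable identifier would be safer if names can collide).

```python
from collections import defaultdict
from itertools import combinations


def custodian_membership(entities: list[dict]) -> dict[str, set[str]]:
    """Map each custodian name to the set of person names affiliated with it."""
    members: dict[str, set[str]] = defaultdict(set)
    for entity in entities:
        person = entity["profile_data"]["name"]
        for affiliation in entity.get("affiliations", []):
            members[affiliation["custodian_name"]].add(person)
    return dict(members)


def shared_custodian_pairs(entities: list[dict]) -> list[tuple[str, str, str]]:
    """Return sorted (person_a, person_b, custodian) triples for people sharing a custodian."""
    pairs = []
    for custodian, people in custodian_membership(entities).items():
        for a, b in combinations(sorted(people), 2):
            pairs.append((a, b, custodian))
    return sorted(pairs)
```

Each triple is an edge between two people labelled with the custodian they have in common, which can be fed into any graph library.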


Related Rules

  • Rule 5: Never delete enriched data - additive only
  • Rule 12: Person data reference pattern (file paths, not inline)
  • Rule 20: Person entity profiles stored individually
  • Rule 22: Custodian YAML is single source of truth for custodian data
  • Rule 26: Person data provenance - web claims required

Scripts

| Script | Purpose |
|--------|---------|
| scripts/parse_linkedin_html.py | Parse LinkedIn company staff pages |
| scripts/link_person_observations.py | Link entity files to custodian YAML |
| scripts/fetch_linkedin_profiles_exa.py | Extract full profiles via Exa |

See Also

  • docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md - Detailed documentation
  • .opencode/PERSON_DATA_REFERENCE_PATTERN.md - Reference pattern details
  • .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md - Exa extraction rules