glam/docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
2025-12-14 17:09:55 +01:00

Person-Custodian Data Architecture

Overview

This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a Single Source of Truth pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.

Table of Contents

  1. Architecture Principles
  2. Directory Structure
  3. Data Model
  4. Person Entity Files
  5. Custodian YAML Files
  6. Data Flow
  7. Scripts and Tools
  8. Examples
  9. Migration Guide
  10. FAQ

Architecture Principles

1. Single Source of Truth

Person entity files are the authoritative source for all person data. Each entity file holds:

  • Profile information (name, headline, about, experience, education, skills)
  • Web claims (provenance for extracted data)
  • Affiliations (all custodians this person is associated with)

2. Separation of Concerns

Different data types live in different locations:

| Concern | Location | Rationale |
|---------|----------|-----------|
| Who is this person? | Entity file | Reusable across custodians |
| What is their background? | Entity file | Belongs to the person, not the custodian |
| Where did we get this data? | Entity file (web_claims) | Provenance is per-claim |
| How are they affiliated? | Custodian file | Relationship-specific data |
| When did we observe this? | Both | Entity has claim timestamps; custodian has affiliation timestamp |

3. No Data Duplication

Same person appearing at multiple institutions → ONE entity file

Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
│
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
│
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
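One way to check this property is to invert the references: collect every linkedin_profile_path across custodian files and group the custodians that point at each entity file. The sketch below is a minimal illustration that works on already-loaded custodian data (plain dicts). The function name `custodians_by_entity` and the `{ghcid: custodian_dict}` input shape are assumptions made for this example, not part of the project's scripts.

```python
from collections import defaultdict


def custodians_by_entity(custodians: dict[str, dict]) -> dict[str, list[str]]:
    """Map each referenced entity file path to the GHCIDs that reference it."""
    index: dict[str, list[str]] = defaultdict(list)
    for ghcid, data in custodians.items():
        staff = data.get("person_observations", {}).get("staff", [])
        for entry in staff:
            path = entry.get("linkedin_profile_path")
            if path:
                index[path].append(ghcid)
    return dict(index)


# Hypothetical in-memory data mirroring the tree above.
entity = "data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json"
custodians = {
    "NL-NH-AMS-U-EFM": {"person_observations": {"staff": [{"linkedin_profile_path": entity}]}},
    "NL-ZH-DHA-O-NFF": {"person_observations": {"staff": [{"linkedin_profile_path": entity}]}},
}
shared = custodians_by_entity(custodians)
```

An entity path that maps to more than one GHCID is the expected outcome for a person with several affiliations; two different entity paths for the same LinkedIn slug would indicate a duplicate.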

4. Cross-Custodian Career Tracking

Entity files track all affiliations, enabling queries like:

  • "Who has worked at multiple archives?"
  • "Show career paths in the heritage sector"
  • "Find people who moved from museums to archives"
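A query of the first kind could be sketched as follows. This is an illustrative example, not an existing project script: the helper names `load_entities` and `people_at_multiple_custodians` are hypothetical, and the code assumes each entity file follows the affiliations layout described in this document.

```python
import json
from pathlib import Path
from typing import Optional


def load_entities(entity_dir: Path) -> list[dict]:
    """Load every person entity JSON file in a directory."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(entity_dir.glob("*.json"))
    ]


def people_at_multiple_custodians(
    entities: list[dict], heritage_type: Optional[str] = None
) -> list[str]:
    """Return names of people with two or more distinct custodian affiliations.

    If heritage_type is given (for example "A", the code used for
    Nationaal Archief in this document), only affiliations of that type count.
    """
    names = []
    for entity in entities:
        slugs = {
            aff["custodian_slug"]
            for aff in entity.get("affiliations", [])
            if heritage_type is None or aff.get("heritage_type") == heritage_type
        }
        if len(slugs) >= 2:
            names.append(entity["profile_data"]["name"])
    return names
```

Calling `people_at_multiple_custodians(load_entities(Path("data/custodian/person/entity")), "A")` would then answer the first question for type "A" affiliations. Because all affiliations live in the entity file, no custodian YAML needs to be read for this query.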

Directory Structure

data/custodian/
├── person/
│   │
│   ├── entity/                           # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                       # Staff lists by custodian
│   │   ├── manual/                       # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                       # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                       # Professional network data
│       ├── manual/                       # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                       # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
│
├── NL-ZH-DHA-A-NA.yaml                  # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...

File Naming Conventions

| File Type | Pattern | Example |
|-----------|---------|---------|
| Person entity | {linkedin_slug}_{ISO_timestamp}.json | bibianvanreeken_20251211T000000Z.json |
| Staff list (parsed) | {custodian_slug}_staff_{ISO_timestamp}.json | nationaal-archief_staff_20251214T112147Z.json |
| Connections | {linkedin_slug}_connections_{ISO_timestamp}.json | giovanna-fossati_connections_20251211T140000Z.json |
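As a minimal sketch of the person entity pattern, the helpers below build and parse such file names with the standard library. The names `entity_filename` and `parse_entity_filename` are hypothetical and are not taken from the project's scripts; the code assumes timestamps are always in UTC, as the trailing Z in the examples suggests.

```python
import re
from datetime import datetime, timezone

FILENAME_TS_FORMAT = "%Y%m%dT%H%M%SZ"  # timestamp without separators, UTC
ENTITY_FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


def entity_filename(linkedin_slug: str, when: datetime) -> str:
    """Build '{linkedin_slug}_{ISO_timestamp}.json' from a timezone-aware datetime."""
    stamp = when.astimezone(timezone.utc).strftime(FILENAME_TS_FORMAT)
    return f"{linkedin_slug}_{stamp}.json"


def parse_entity_filename(filename: str) -> tuple[str, datetime]:
    """Split an entity file name back into (slug, UTC datetime)."""
    match = ENTITY_FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an entity filename: {filename!r}")
    when = datetime.strptime(match["ts"], FILENAME_TS_FORMAT).replace(tzinfo=timezone.utc)
    return match["slug"], when
```

The regular expression only anchors on the trailing timestamp, so a staff list or connections file name would also parse, with `_staff` or `_connections` left inside the slug. Telling the file types apart is left to the directory they live in.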

Data Model

Conceptual Model

┌──────────────────┐         ┌──────────────────┐
│  Person Entity   │         │    Custodian     │
│                  │   N:M   │                  │
│  - profile_data  │◄───────►│  - name          │
│  - web_claims    │         │  - ghcid         │
│  - affiliations  │         │  - staff[]       │
│                  │         │                  │
└──────────────────┘         └──────────────────┘
        │                            │
        │ 1:N                        │ 1:N
        ▼                            ▼
┌──────────────────┐         ┌──────────────────┐
│    Web Claim     │         │   Staff Entry    │
│                  │         │                  │
│  - claim_type    │         │  - person_id     │
│  - claim_value   │         │  - person_name   │
│  - source_url    │         │  - role_title    │
│  - retrieved_on  │         │  - affiliation_  │
│  - retrieval_    │         │    provenance    │
│    agent         │         │  - linkedin_     │
│                  │         │    profile_path  │
└──────────────────┘         └──────────────────┘

Key Relationships

| Relationship | Cardinality | Description |
|--------------|-------------|-------------|
| Person ↔ Custodian | N:M | Person can work at multiple custodians; Custodian has multiple staff |
| Person → WebClaim | 1:N | One person has many provenance claims |
| Person → Affiliation | 1:N | One person has many affiliations (tracked in entity file) |
| Custodian → StaffEntry | 1:N | One custodian has many staff entries |
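For readers who prefer code to diagrams, the relationships can be sketched with dataclasses. This is an illustrative in-memory model only: the class names are assumptions, the project itself stores this data as JSON and YAML files, and only a subset of the documented fields is shown.

```python
from dataclasses import dataclass, field


@dataclass
class WebClaim:
    claim_type: str
    claim_value: str
    source_url: str
    retrieved_on: str
    retrieval_agent: str


@dataclass
class Affiliation:
    custodian_name: str
    custodian_slug: str
    role_title: str
    current: bool


@dataclass
class PersonEntity:
    name: str
    web_claims: list[WebClaim] = field(default_factory=list)       # Person -> WebClaim, 1:N
    affiliations: list[Affiliation] = field(default_factory=list)  # Person -> Affiliation, 1:N


@dataclass
class StaffEntry:
    person_id: str
    person_name: str
    role_title: str
    linkedin_profile_path: str  # reference to the entity file, not a copy of it


@dataclass
class Custodian:
    ghcid: str
    name: str
    staff: list[StaffEntry] = field(default_factory=list)  # Custodian -> StaffEntry, 1:N


entity_path = "data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json"
person = PersonEntity(
    name="Sandra den Hamer",
    affiliations=[
        Affiliation("EYE Filmmuseum", "eye-filmmuseum", "Director", False),
        Affiliation("Netherlands Film Fund", "netherlands-filmfonds", "Interim CEO", True),
    ],
)
eye = Custodian(
    "NL-NH-AMS-U-EFM",
    "EYE Filmmuseum",
    [StaffEntry("eye-filmmuseum_staff_0001_sandra_den_hamer", "Sandra den Hamer", "Director", entity_path)],
)
```

The N:M relationship is never stored directly: it emerges from the affiliations list on the person side and the staff entries, each holding a path to the entity file, on the custodian side.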

Person Entity Files

Location

data/custodian/person/entity/{linkedin_slug}_{timestamp}.json

Complete Schema

{
  "extraction_metadata": {
    "source_file": "string",         // Path to source staff list
    "staff_id": "string",            // Unique identifier
    "extraction_date": "ISO8601",    // When profile was extracted
    "extraction_method": "string",   // exa_contents, exa_crawling_exa, manual
    "extraction_agent": "string",    // claude-opus-4.5 for manual, empty for automated
    "linkedin_url": "string",        // Full LinkedIn profile URL
    "cost_usd": 0,                   // API cost (0 for Exa contents)
    "request_id": "string"           // Optional: Exa request ID
  },
  
  "linkedin_profile_url": "string",  // Canonical LinkedIn URL
  
  "profile_data": {
    "name": "string",                // Full name
    "headline": "string",            // Current role/headline
    "location": "string",            // City, Region, Country
    "connections": "string",         // "500 connections • 2,135 followers"
    "about": "string",               // Professional summary
    "experience": [                  // Work history
      {
        "title": "string",
        "company": "string",
        "duration": "string",
        "location": "string",
        "description": "string"
      }
    ],
    "education": [                   // Education history
      {
        "school": "string",
        "degree": "string",
        "field": "string",
        "years": "string"
      }
    ],
    "skills": ["string"],            // Skills list
    "languages": [                   // Languages
      {
        "language": "string",
        "proficiency": "string"
      }
    ],
    "profile_image_url": "string"    // CDN URL for profile photo
  },
  
  "web_claims": [                    // Provenance for extracted data
    {
      "claim_type": "string",        // full_name, role_title, location, etc.
      "claim_value": "string",       // The extracted value
      "source_url": "string",        // Where it was found
      "retrieved_on": "ISO8601",     // When it was retrieved
      "retrieval_agent": "string"    // linkedin_html_parser, exa_crawling_exa, etc.
    }
  ],
  
  "affiliations": [                  // All known custodian associations
    {
      "custodian_name": "string",    // Full custodian name
      "custodian_slug": "string",    // Normalized slug
      "role_title": "string",        // Role at this custodian
      "heritage_relevant": true,     // Is this a heritage role?
      "heritage_type": "A",          // GLAMORCUBESFIXPHDNT type code
      "current": true,               // Currently employed?
      "observed_on": "ISO8601",      // When this affiliation was observed
      "source_url": "string"         // Where this was observed
    }
  ]
}

Required Fields

| Field | Required | Notes |
|-------|----------|-------|
| extraction_metadata.extraction_date | YES | ISO 8601 timestamp |
| extraction_metadata.linkedin_url | YES | Full LinkedIn profile URL |
| linkedin_profile_url | YES | Canonical URL (may duplicate above) |
| profile_data.name | YES | Full name |
| web_claims | YES | At least one claim (usually full_name) |
| affiliations | NO | May be empty if no custodian association known |
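These requirements are simple enough to check mechanically. The function below is a minimal sketch of such a check, assuming the entity file has already been loaded into a dict; `missing_required_fields` is a hypothetical helper name and not part of the project's scripts.

```python
def missing_required_fields(entity: dict) -> list[str]:
    """Return the dotted names of required fields that are absent or empty."""
    required = [
        ("extraction_metadata", "extraction_date"),
        ("extraction_metadata", "linkedin_url"),
        ("linkedin_profile_url",),
        ("profile_data", "name"),
        ("web_claims",),  # must hold at least one claim
    ]
    missing = []
    for path in required:
        value = entity
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        if not value:  # None, "" and [] all count as missing
            missing.append(".".join(path))
    return missing
```

An empty result means all required fields are present. The check does not validate formats (for example, that extraction_date is a well-formed timestamp), and it deliberately ignores affiliations, which may be empty.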

Custodian YAML Files

Location

data/custodian/{GHCID}.yaml

Staff Entry Schema

person_observations:
  staff:
  - person_id: string              # Unique identifier (custodian_staff_NNNN_name_slug)
    person_name: string            # Full name (for display/search)
    role_title: string             # Current role at this custodian
    heritage_relevant: boolean     # Is this a heritage-relevant role?
    heritage_type: string          # GLAMORCUBESFIXPHDNT type code
    current: boolean               # Currently employed?
    
    # AFFILIATION PROVENANCE - when/how was this association observed?
    affiliation_provenance:
      source_url: string           # Where this association was found
      retrieved_on: string         # ISO 8601 timestamp
      retrieval_agent: string      # Tool used (linkedin_html_parser, etc.)
    
    # REFERENCES to person entity file
    linkedin_profile_url: string   # For quick access/linking
    linkedin_profile_path: string  # Path to entity JSON file

What NOT to Include

Never put these in custodian YAML:

  • web_claims - Belongs in entity file
  • profile_data - Belongs in entity file
  • experience - Belongs in entity file
  • education - Belongs in entity file
  • skills - Belongs in entity file
  • about - Belongs in entity file
  • Full profile content of any kind
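A simple lint for this rule could look like the sketch below. It assumes the custodian YAML has already been parsed into a dict by whatever YAML loader the project uses; the names `FORBIDDEN_STAFF_KEYS` and `find_profile_leaks` are hypothetical.

```python
# Keys that belong in the person entity file, never in a custodian staff entry.
FORBIDDEN_STAFF_KEYS = {"web_claims", "profile_data", "experience", "education", "skills", "about"}


def find_profile_leaks(custodian: dict) -> list[tuple[str, str]]:
    """Return (person_id, key) pairs for staff entries that carry entity-only data."""
    staff = custodian.get("person_observations", {}).get("staff", [])
    return [
        (entry.get("person_id", "<unknown>"), key)
        for entry in staff
        for key in sorted(FORBIDDEN_STAFF_KEYS & entry.keys())
    ]
```

The check only looks at top-level keys of each staff entry, so it would not catch profile content nested under some other key.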

Data Flow

Complete Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA FLOW PIPELINE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
       │
       ▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html


PHASE 2: PARSING
─────────────────
Manual HTML file
       │
       ▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
       │
       │ Contains: List of {name, headline, linkedin_url, heritage_relevant}
       │


PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
       │
       ▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
       │
       │ Contains: Full profile_data, web_claims, affiliations
       │


PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
       │
       ▼ (link_person_observations.py)
       │
       ├──► Custodian YAML updated with:
       │    - person_observations.staff[] entries
       │    - affiliation_provenance
       │    - linkedin_profile_path references
       │
       └──► Entity files updated with:
            - web_claims (if not present)
            - affiliations array (new custodian added)

Script Responsibilities

| Script | Input | Output | Purpose |
|--------|-------|--------|---------|
| parse_linkedin_html.py | Raw HTML | affiliated/parsed/*.json | Extract staff list |
| fetch_linkedin_profiles_exa.py | Staff list | entity/*.json | Extract full profiles |
| link_person_observations.py | Entity files + Staff list | Updated YAML + Entity | Create references |

Scripts and Tools

parse_linkedin_html.py

Purpose: Parse LinkedIn company "People" pages to extract staff lists.

Usage:

python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
    --custodian-name "Nationaal Archief" \
    --custodian-slug "nationaal-archief" \
    --output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json

Output: JSON file with staff entries containing:

  • name, headline, linkedin_url
  • heritage_relevant, heritage_type
  • degree (LinkedIn connection degree)

link_person_observations.py

Purpose: Link person entity files to custodian YAML files.

Usage:

python scripts/link_person_observations.py \
    --custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --entity-dir data/custodian/person/entity

Actions:

  1. Reads staff list to get person identifiers
  2. Finds matching entity files in entity/
  3. Updates custodian YAML with person_observations.staff[]
  4. Adds affiliation_provenance and linkedin_profile_path
  5. Updates entity files with new affiliations and web_claims
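Steps 3 to 5 can be sketched for a single person as follows. This is not the code of link_person_observations.py: it is a minimal illustration on in-memory dicts, the function name `link_person` and its parameters are assumptions, and it uses the staff list headline as role_title because the examples in this document do the same. Adding web_claims to the entity is left out for brevity.

```python
def link_person(
    staff_row: dict,      # one entry from the parsed staff list
    person_id: str,
    entity: dict,         # loaded person entity file
    entity_path: str,
    custodian: dict,      # loaded custodian YAML
    custodian_name: str,
    custodian_slug: str,
    provenance: dict,     # source_url, retrieved_on, retrieval_agent
) -> None:
    """Add a reference to the custodian and an affiliation to the entity, in place."""
    staff = custodian.setdefault("person_observations", {}).setdefault("staff", [])
    # Skip the reference if this entity file is already linked.
    if not any(e.get("linkedin_profile_path") == entity_path for e in staff):
        staff.append({
            "person_id": person_id,
            "person_name": staff_row["name"],
            "role_title": staff_row["headline"],
            "heritage_relevant": staff_row.get("heritage_relevant"),
            "heritage_type": staff_row.get("heritage_type"),
            "affiliation_provenance": dict(provenance),
            "linkedin_profile_url": staff_row["linkedin_url"],
            "linkedin_profile_path": entity_path,
        })

    affiliations = entity.setdefault("affiliations", [])
    # Skip the affiliation if this custodian is already recorded.
    if not any(a.get("custodian_slug") == custodian_slug for a in affiliations):
        affiliations.append({
            "custodian_name": custodian_name,
            "custodian_slug": custodian_slug,
            "role_title": staff_row["headline"],
            "heritage_relevant": staff_row.get("heritage_relevant"),
            "heritage_type": staff_row.get("heritage_type"),
            "observed_on": provenance["retrieved_on"],
            "source_url": provenance["source_url"],
        })
```

Both updates are guarded so that running the link step twice does not create duplicate staff entries or affiliations. Only references and affiliation provenance are written to the custodian side; profile data stays in the entity.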

fetch_linkedin_profiles_exa.py

Purpose: Extract full LinkedIn profiles using Exa API.

Usage:

python scripts/fetch_linkedin_profiles_exa.py \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --output-dir data/custodian/person/entity \
    --limit 50

Examples

Example 1: Complete Person Entity File

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "connections": "500+ connections",
    "about": "Experienced project manager specializing in digitization...",
    "experience": [
      {
        "title": "Projectmanager Digitalisering",
        "company": "Nationaal Archief",
        "duration": "3 years",
        "location": "The Hague, Netherlands"
      }
    ],
    "education": [
      {
        "school": "Leiden University",
        "degree": "Master",
        "field": "History"
      }
    ],
    "skills": ["Project Management", "Digitization", "Archives"]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}

Example 2: Custodian YAML Staff Section

person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
    
  - person_id: nationaal-archief_staff_0002_jan_de_vries
    person_name: Jan de Vries
    role_title: Senior Archivist
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
    linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json

Example 3: Cross-Custodian Reference

Person works at two custodians:

Entity file (sandra-den-hamer-66024510_20251209T190000Z.json):

{
  "affiliations": [
    {
      "custodian_name": "EYE Filmmuseum",
      "custodian_slug": "eye-filmmuseum",
      "role_title": "Director",
      "current": false,
      "observed_on": "2025-12-09T19:00:00Z"
    },
    {
      "custodian_name": "Netherlands Film Fund",
      "custodian_slug": "netherlands-filmfonds",
      "role_title": "Interim CEO",
      "current": true,
      "observed_on": "2025-12-14T10:00:00Z"
    }
  ]
}

Custodian 1 (NL-NH-AMS-U-EFM.yaml):

person_observations:
  staff:
  - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Director
    current: false
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json

Custodian 2 (NL-ZH-DHA-O-NFF.yaml):

person_observations:
  staff:
  - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Interim CEO
    current: true
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json

Note: Both custodians reference the SAME entity file!


Migration Guide

Migrating from Inline Web Claims

If you have custodian files with inline web_claims, migrate them:

Before (incorrect):

person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    web_claims:  # WRONG - should not be here
      - claim_type: full_name
        claim_value: John Doe

After (correct):

person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/example/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json

Migration steps:

  1. Create entity file with profile data + web claims
  2. Remove web_claims from custodian YAML
  3. Add affiliation_provenance block
  4. Add linkedin_profile_path reference
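Steps 2 to 4 can be sketched for one staff entry as shown below. This is an illustrative helper, not an existing migration script: `migrate_staff_entry` is a hypothetical name, and the provenance values and entity path must be supplied by the caller from real source data. The removed claims are returned so that step 1 can write them into the entity file.

```python
def migrate_staff_entry(entry: dict, provenance: dict, entity_path: str) -> tuple[dict, list]:
    """Return (clean staff entry, removed web_claims) without mutating the input."""
    # Step 2: drop inline web_claims from the custodian-side entry.
    migrated = {key: value for key, value in entry.items() if key != "web_claims"}
    # Step 3: record how and when the affiliation was observed.
    migrated["affiliation_provenance"] = dict(provenance)
    # Step 4: point at the entity file that now owns the person data.
    migrated["linkedin_profile_path"] = entity_path
    return migrated, list(entry.get("web_claims", []))


before = {
    "person_id": "example_staff_0001_john_doe",
    "person_name": "John Doe",
    "web_claims": [{"claim_type": "full_name", "claim_value": "John Doe"}],
}
after, claims = migrate_staff_entry(
    before,
    {
        "source_url": "https://www.linkedin.com/company/example/people/",
        "retrieved_on": "2025-12-14T11:21:47Z",
        "retrieval_agent": "linkedin_html_parser",
    },
    "data/custodian/person/entity/johndoe_20251214T000000Z.json",
)
```

Returning a new dict instead of editing in place makes it easy to compare the entry before and after migration, and to abort without side effects if writing the entity file fails.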

FAQ

Q: Why separate entity files from custodian files?

A: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.

Q: Where do web claims go?

A: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.

Q: What if I don't have a LinkedIn URL?

A: You can still create an entity file using other sources (institutional website, manual research). Use a different slug pattern based on the available identifier.

Q: Can a person have multiple entity files?

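A: Ideally no: one person = one entity file. If duplicates are created by accident, they can be merged later. Note that person_id is assigned per custodian (Example 3 shows two different person_id values for the same person), so match duplicates on the LinkedIn profile URL rather than on person_id.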

Q: What timestamp format should I use?

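A: In file names, use ISO 8601 without separators: YYYYMMDDTHHMMSSZ (e.g., 20251214T112147Z). Timestamp fields inside JSON and YAML files use the form with separators, as in the examples above (e.g., 2025-12-14T11:21:47Z).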


Related Documentation

  • Agent Rules: See AGENTS.md Rule 27
  • Agent Rule File: .opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
  • Person Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md
  • LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
  • Data Fabrication: .opencode/DATA_FABRICATION_PROHIBITION.md (Rule 21)