kempersc c50c35fd3a enrich person custodian

2025-12-14 17:09:55 +01:00

22 KiB

Person-Custodian Data Architecture

Overview

This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a Single Source of Truth pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.

Table of Contents

Architecture Principles
Directory Structure
Data Model
Person Entity Files
Custodian YAML Files
Data Flow
Scripts and Tools
Examples
Migration Guide
FAQ

Architecture Principles

1. Single Source of Truth

Person entity files are the authoritative source for all person data, including:

Profile information (name, headline, about, experience, education, skills)
Web claims (provenance for extracted data)
Affiliations (all custodians this person is associated with)

2. Separation of Concerns

Different data types live in different locations:

Concern	Location	Rationale
Who is this person?	Entity file	Reusable across custodians
What is their background?	Entity file	Belongs to the person, not the custodian
Where did we get this data?	Entity file (web_claims)	Provenance is per-claim
How are they affiliated?	Custodian file	Relationship-specific data
When did we observe this?	Both	Entity has claim timestamps; Custodian has affiliation timestamp

3. No Data Duplication

Same person appearing at multiple institutions → ONE entity file

Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
│
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
│
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
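Because each person has exactly one entity file that lists every affiliation, questions that span institutions reduce to a scan over the `affiliations` arrays. The following is a minimal sketch, not part of the project's scripts: the function names are hypothetical, the entities are passed in as already-parsed dictionaries, and people are keyed by `profile_data.name` purely for readability (a real query would more likely key on the canonical LinkedIn URL).

```python
from collections import defaultdict


def custodians_per_person(entities):
    """Map each person's name to the set of custodian slugs in their affiliations."""
    result = defaultdict(set)
    for entity in entities:
        name = entity["profile_data"]["name"]
        for affiliation in entity.get("affiliations", []):
            result[name].add(affiliation["custodian_slug"])
    return dict(result)


def people_at_multiple_custodians(entities, minimum=2):
    """Return the sorted names of people affiliated with at least `minimum` custodians."""
    per_person = custodians_per_person(entities)
    return sorted(name for name, slugs in per_person.items() if len(slugs) >= minimum)


# Trimmed-down entities based on the examples in this document.
sample_entities = [
    {
        "profile_data": {"name": "Sandra den Hamer"},
        "affiliations": [
            {"custodian_slug": "eye-filmmuseum", "current": False},
            {"custodian_slug": "netherlands-filmfonds", "current": True},
        ],
    },
    {
        "profile_data": {"name": "Bibian van Reeken"},
        "affiliations": [{"custodian_slug": "nationaal-archief", "current": True}],
    },
]

print(people_at_multiple_custodians(sample_entities))
```

The same scan could be extended with `heritage_type` to answer questions such as "who moved from museums to archives", provided the affiliations carry that field.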

4. Cross-Custodian Career Tracking

Entity files track all affiliations, enabling queries like:

"Who has worked at multiple archives?"
"Show career paths in the heritage sector"
"Find people who moved from museums to archives"

Directory Structure

data/custodian/
├── person/
│   │
│   ├── entity/                           # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                       # Staff lists by custodian
│   │   ├── manual/                       # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                       # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                       # Professional network data
│       ├── manual/                       # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                       # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
│
├── NL-ZH-DHA-A-NA.yaml                  # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...

File Naming Conventions

File Type	Pattern	Example
Person entity	`{linkedin_slug}_{ISO_timestamp}.json`	`bibianvanreeken_20251211T000000Z.json`
Staff list (parsed)	`{custodian_slug}_staff_{ISO_timestamp}.json`	`nationaal-archief_staff_20251214T112147Z.json`
Connections	`{linkedin_slug}_connections_{ISO_timestamp}.json`	`giovanna-fossati_connections_20251211T140000Z.json`
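The person entity pattern can be built and parsed mechanically. This is a sketch under the assumption that the timestamp is always UTC and always has the compact `YYYYMMDDTHHMMSSZ` shape shown in the examples; `entity_filename` and `parse_entity_filename` are hypothetical helper names, not functions from the project's scripts.

```python
import re
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"
# Slug, underscore, compact UTC timestamp, ".json". Intended for entity files only.
ENTITY_NAME = re.compile(r"^(?P<slug>.+)_(?P<stamp>\d{8}T\d{6}Z)\.json$")


def entity_filename(linkedin_slug, observed_at):
    """Build '{linkedin_slug}_{ISO_timestamp}.json' from a timezone-aware datetime."""
    stamp = observed_at.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"{linkedin_slug}_{stamp}.json"


def parse_entity_filename(filename):
    """Split an entity file name into (linkedin_slug, UTC datetime)."""
    match = ENTITY_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an entity file name: {filename!r}")
    observed_at = datetime.strptime(match["stamp"], STAMP_FORMAT)
    return match["slug"], observed_at.replace(tzinfo=timezone.utc)


name = entity_filename("bibianvanreeken", datetime(2025, 12, 11, tzinfo=timezone.utc))
slug, observed_at = parse_entity_filename("sandra-den-hamer-66024510_20251209T190000Z.json")
```

Anchoring the pattern on the timestamp rather than on the slug keeps slugs with hyphens and trailing digits, such as `sandra-den-hamer-66024510`, intact.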

Data Model

Conceptual Model

┌──────────────────┐         ┌──────────────────┐
│  Person Entity   │         │    Custodian     │
│                  │   N:M   │                  │
│  - profile_data  │◄───────►│  - name          │
│  - web_claims    │         │  - ghcid         │
│  - affiliations  │         │  - staff[]       │
│                  │         │                  │
└──────────────────┘         └──────────────────┘
        │                            │
        │ 1:N                        │ 1:N
        ▼                            ▼
┌──────────────────┐         ┌──────────────────┐
│    Web Claim     │         │   Staff Entry    │
│                  │         │                  │
│  - claim_type    │         │  - person_id     │
│  - claim_value   │         │  - person_name   │
│  - source_url    │         │  - role_title    │
│  - retrieved_on  │         │  - affiliation_  │
│  - retrieval_    │         │    provenance    │
│    agent         │         │  - linkedin_     │
│                  │         │    profile_path  │
└──────────────────┘         └──────────────────┘

Key Relationships

Relationship	Cardinality	Description
Person ↔ Custodian	N:M	Person can work at multiple custodians; Custodian has multiple staff
Person → WebClaim	1:N	One person has many provenance claims
Person → Affiliation	1:N	One person has many affiliations (tracked in entity file)
Custodian → StaffEntry	1:N	One custodian has many staff entries

Person Entity Files

Location

data/custodian/person/entity/{linkedin_slug}_{timestamp}.json

Complete Schema

{
  "extraction_metadata": {
    "source_file": "string",         // Path to source staff list
    "staff_id": "string",            // Unique identifier
    "extraction_date": "ISO8601",    // When profile was extracted
    "extraction_method": "string",   // exa_contents, exa_crawling_exa, manual
    "extraction_agent": "string",    // claude-opus-4.5 for manual, empty for automated
    "linkedin_url": "string",        // Full LinkedIn profile URL
    "cost_usd": 0,                   // API cost (0 for Exa contents)
    "request_id": "string"           // Optional: Exa request ID
  },
  
  "linkedin_profile_url": "string",  // Canonical LinkedIn URL
  
  "profile_data": {
    "name": "string",                // Full name
    "headline": "string",            // Current role/headline
    "location": "string",            // City, Region, Country
    "connections": "string",         // "500 connections • 2,135 followers"
    "about": "string",               // Professional summary
    "experience": [                  // Work history
      {
        "title": "string",
        "company": "string",
        "duration": "string",
        "location": "string",
        "description": "string"
      }
    ],
    "education": [                   // Education history
      {
        "school": "string",
        "degree": "string",
        "field": "string",
        "years": "string"
      }
    ],
    "skills": ["string"],            // Skills list
    "languages": [                   // Languages
      {
        "language": "string",
        "proficiency": "string"
      }
    ],
    "profile_image_url": "string"    // CDN URL for profile photo
  },
  
  "web_claims": [                    // Provenance for extracted data
    {
      "claim_type": "string",        // full_name, role_title, location, etc.
      "claim_value": "string",       // The extracted value
      "source_url": "string",        // Where it was found
      "retrieved_on": "ISO8601",     // When it was retrieved
      "retrieval_agent": "string"    // linkedin_html_parser, exa_crawling_exa, etc.
    }
  ],
  
  "affiliations": [                  // All known custodian associations
    {
      "custodian_name": "string",    // Full custodian name
      "custodian_slug": "string",    // Normalized slug
      "role_title": "string",        // Role at this custodian
      "heritage_relevant": true,     // Is this a heritage role?
      "heritage_type": "A",          // GLAMORCUBESFIXPHDNT type code
      "current": true,               // Currently employed?
      "observed_on": "ISO8601",      // When this affiliation was observed
      "source_url": "string"         // Where this was observed
    }
  ]
}

Required Fields

Field	Required	Notes
`extraction_metadata.extraction_date`	YES	ISO 8601 timestamp
`extraction_metadata.linkedin_url`	YES	Full LinkedIn profile URL
`linkedin_profile_url`	YES	Canonical URL (may duplicate above)
`profile_data.name`	YES	Full name
`web_claims`	YES	At least one claim (usually full_name)
`affiliations`	NO	May be empty if no custodian association known
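The required-field table translates directly into a small check. The sketch below assumes the entity file has already been loaded into a dictionary (for instance with `json.load`); `missing_required_fields` is a hypothetical helper, and it treats empty strings and empty lists as missing.

```python
def missing_required_fields(entity):
    """Return the dotted names of required fields that are absent or empty."""
    missing = []
    metadata = entity.get("extraction_metadata", {})
    for key in ("extraction_date", "linkedin_url"):
        if not metadata.get(key):
            missing.append(f"extraction_metadata.{key}")
    if not entity.get("linkedin_profile_url"):
        missing.append("linkedin_profile_url")
    if not entity.get("profile_data", {}).get("name"):
        missing.append("profile_data.name")
    # At least one claim is required; an empty list does not count.
    if not entity.get("web_claims"):
        missing.append("web_claims")
    # 'affiliations' is optional and therefore not checked.
    return missing


complete = {
    "extraction_metadata": {
        "extraction_date": "2025-12-14T11:21:47Z",
        "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    },
    "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
    "profile_data": {"name": "Bibian van Reeken"},
    "web_claims": [{"claim_type": "full_name", "claim_value": "Bibian van Reeken"}],
}
print(missing_required_fields(complete))
```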

Custodian YAML Files

Location

data/custodian/{GHCID}.yaml

Staff Entry Schema

person_observations:
  staff:
  - person_id: string              # Unique identifier (custodian_staff_NNNN_name_slug)
    person_name: string            # Full name (for display/search)
    role_title: string             # Current role at this custodian
    heritage_relevant: boolean     # Is this a heritage-relevant role?
    heritage_type: string          # GLAMORCUBESFIXPHDNT type code
    current: boolean               # Currently employed?
    
    # AFFILIATION PROVENANCE - when/how was this association observed?
    affiliation_provenance:
      source_url: string           # Where this association was found
      retrieved_on: string         # ISO 8601 timestamp
      retrieval_agent: string      # Tool used (linkedin_html_parser, etc.)
    
    # REFERENCES to person entity file
    linkedin_profile_url: string   # For quick access/linking
    linkedin_profile_path: string  # Path to entity JSON file

What NOT to Include

Never put these in custodian YAML:

web_claims - Belongs in entity file
profile_data - Belongs in entity file
experience - Belongs in entity file
education - Belongs in entity file
skills - Belongs in entity file
about - Belongs in entity file
Full profile content of any kind
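A check for misplaced keys is easy to automate. This sketch assumes the custodian YAML has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`, which is not used here so that the snippet stays dependency-free); `ENTITY_ONLY_KEYS` and `misplaced_keys` are hypothetical names.

```python
# Keys that belong in the person entity file, per the list above.
ENTITY_ONLY_KEYS = {"web_claims", "profile_data", "experience", "education", "skills", "about"}


def misplaced_keys(custodian):
    """Return (person_id, key) pairs for entity-only keys found in staff entries."""
    staff = custodian.get("person_observations", {}).get("staff", [])
    problems = []
    for entry in staff:
        for key in sorted(entry.keys() & ENTITY_ONLY_KEYS):
            problems.append((entry.get("person_id"), key))
    return problems


custodian = {
    "person_observations": {
        "staff": [
            {
                "person_id": "example_staff_0001_john_doe",
                "person_name": "John Doe",
                "web_claims": [{"claim_type": "full_name", "claim_value": "John Doe"}],
            }
        ]
    }
}
print(misplaced_keys(custodian))
```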

Data Flow

Complete Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA FLOW PIPELINE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
       │
       ▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html


PHASE 2: PARSING
─────────────────
Manual HTML file
       │
       ▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
       │
       │ Contains: List of {name, headline, linkedin_url, heritage_relevant}
       │


PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
       │
       ▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
       │
       │ Contains: Full profile_data, web_claims, affiliations
       │


PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
       │
       ▼ (link_person_observations.py)
       │
       ├──► Custodian YAML updated with:
       │    - person_observations.staff[] entries
       │    - affiliation_provenance
       │    - linkedin_profile_path references
       │
       └──► Entity files updated with:
            - web_claims (if not present)
            - affiliations array (new custodian added)

Script Responsibilities

Script	Input	Output	Purpose
`parse_linkedin_html.py`	Raw HTML	`affiliated/parsed/*.json`	Extract staff list
`fetch_linkedin_profiles_exa.py`	Staff list	`entity/*.json`	Extract full profiles
`link_person_observations.py`	Entity files + Staff list + Custodian YAML	Updated YAML + Entity	Create references

Scripts and Tools

parse_linkedin_html.py

Purpose: Parse LinkedIn company "People" pages to extract staff lists.

Usage:

python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
    --custodian-name "Nationaal Archief" \
    --custodian-slug "nationaal-archief" \
    --output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json

Output: JSON file with staff entries containing:

name, headline, linkedin_url
heritage_relevant, heritage_type
degree (LinkedIn connection degree)

link_person_observations.py

Purpose: Link person entity files to custodian YAML files.

Usage:

python scripts/link_person_observations.py \
    --custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --entity-dir data/custodian/person/entity

Actions:

Reads staff list to get person identifiers
Finds matching entity files in entity/
Updates custodian YAML with person_observations.staff[]
Adds affiliation_provenance and linkedin_profile_path
Updates entity files with new affiliations and web_claims
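The core of the linking step can be sketched as follows. This is an illustration of the actions listed above, not the actual logic of `link_person_observations.py`: the function name, its parameters, and the duplicate checks are assumptions, and both files are represented as already-parsed dictionaries.

```python
def link_person(entity, entity_path, custodian, person_id, affiliation, retrieval_agent):
    """Add a reference-only staff entry to the custodian and record the affiliation on the entity."""
    staff = custodian.setdefault("person_observations", {}).setdefault("staff", [])
    # Only references and affiliation provenance go into the custodian entry.
    if all(entry.get("person_id") != person_id for entry in staff):
        staff.append({
            "person_id": person_id,
            "person_name": entity["profile_data"]["name"],
            "role_title": affiliation["role_title"],
            "current": affiliation["current"],
            "affiliation_provenance": {
                "source_url": affiliation["source_url"],
                "retrieved_on": affiliation["observed_on"],
                "retrieval_agent": retrieval_agent,
            },
            "linkedin_profile_url": entity["linkedin_profile_url"],
            "linkedin_profile_path": entity_path,
        })
    # The entity file keeps the full list of affiliations, one per custodian.
    known = {a["custodian_slug"] for a in entity.setdefault("affiliations", [])}
    if affiliation["custodian_slug"] not in known:
        entity["affiliations"].append(affiliation)


entity = {
    "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
    "profile_data": {"name": "Bibian van Reeken"},
}
affiliation = {
    "custodian_name": "Nationaal Archief",
    "custodian_slug": "nationaal-archief",
    "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
    "current": True,
    "observed_on": "2025-12-14T11:21:47Z",
    "source_url": "https://www.linkedin.com/company/nationaal-archief/people/",
}
custodian = {}
entity_path = "data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json"
for _ in range(2):  # Running the link twice must not create duplicates.
    link_person(entity, entity_path, custodian,
                "nationaal-archief_staff_0001_bibian_van_reeken",
                affiliation, "linkedin_html_parser")
```

Making the step idempotent means a staff list can be re-linked after new entity files arrive without duplicating entries on either side.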

fetch_linkedin_profiles_exa.py

Purpose: Extract full LinkedIn profiles using Exa API.

Usage:

python scripts/fetch_linkedin_profiles_exa.py \
    --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
    --output-dir data/custodian/person/entity \
    --limit 50

Examples

Example 1: Complete Person Entity File

{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "connections": "500+ connections",
    "about": "Experienced project manager specializing in digitization...",
    "experience": [
      {
        "title": "Projectmanager Digitalisering",
        "company": "Nationaal Archief",
        "duration": "3 years",
        "location": "The Hague, Netherlands"
      }
    ],
    "education": [
      {
        "school": "Leiden University",
        "degree": "Master",
        "field": "History"
      }
    ],
    "skills": ["Project Management", "Digitization", "Archives"]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}

Example 2: Custodian YAML Staff Section

person_observations:
  staff:
  - person_id: nationaal-archief_staff_0001_bibian_van_reeken
    person_name: Bibian van Reeken
    role_title: Projectmanager Digitalisering bij het Nationaal Archief
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
    linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
    
  - person_id: nationaal-archief_staff_0002_jan_de_vries
    person_name: Jan de Vries
    role_title: Senior Archivist
    heritage_relevant: true
    heritage_type: A
    current: true
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/nationaal-archief/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
    linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json

Example 3: Cross-Custodian Reference

Person works at two custodians:

Entity file (sandra-den-hamer-66024510_20251209T190000Z.json):

{
  "affiliations": [
    {
      "custodian_name": "EYE Filmmuseum",
      "custodian_slug": "eye-filmmuseum",
      "role_title": "Director",
      "current": false,
      "observed_on": "2025-12-09T19:00:00Z"
    },
    {
      "custodian_name": "Netherlands Film Fund",
      "custodian_slug": "netherlands-filmfonds",
      "role_title": "Interim CEO",
      "current": true,
      "observed_on": "2025-12-14T10:00:00Z"
    }
  ]
}

Custodian 1 (NL-NH-AMS-U-EFM.yaml):

person_observations:
  staff:
  - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Director
    current: false
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json

Custodian 2 (NL-ZH-DHA-O-NFF.yaml):

person_observations:
  staff:
  - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
    person_name: Sandra den Hamer
    role_title: Interim CEO
    current: true
    linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json

Note: Both custodians reference the SAME entity file!
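The shared reference can be verified from the custodian side by building a reverse index from entity path to the custodians that point at it. The sketch below assumes a mapping from GHCID to parsed custodian data; `custodians_by_entity_path` is a hypothetical helper.

```python
from collections import defaultdict


def custodians_by_entity_path(custodians):
    """Map each referenced entity file path to the GHCIDs of the custodians that reference it."""
    index = defaultdict(list)
    for ghcid, custodian in custodians.items():
        staff = custodian.get("person_observations", {}).get("staff", [])
        for entry in staff:
            path = entry.get("linkedin_profile_path")
            if path:
                index[path].append(ghcid)
    return dict(index)


shared_path = "data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json"
custodians = {
    "NL-NH-AMS-U-EFM": {"person_observations": {"staff": [
        {"person_id": "eye-filmmuseum_staff_0001_sandra_den_hamer",
         "linkedin_profile_path": shared_path},
    ]}},
    "NL-ZH-DHA-O-NFF": {"person_observations": {"staff": [
        {"person_id": "netherlands-filmfonds_staff_0001_sandra_den_hamer",
         "linkedin_profile_path": shared_path},
    ]}},
}
index = custodians_by_entity_path(custodians)
```

Any path that maps to more than one GHCID identifies a person tracked across custodians.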

Migration Guide

Migrating from Inline Web Claims

If you have custodian files with inline web_claims, migrate them:

Before (incorrect):

person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    web_claims:  # WRONG - should not be here
      - claim_type: full_name
        claim_value: John Doe

After (correct):

person_observations:
  staff:
  - person_id: example_staff_0001_john_doe
    person_name: John Doe
    affiliation_provenance:
      source_url: https://www.linkedin.com/company/example/people/
      retrieved_on: '2025-12-14T11:21:47Z'
      retrieval_agent: linkedin_html_parser
    linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json

Migration steps:

1. Create entity file with profile data + web claims
2. Remove web_claims from custodian YAML
3. Add affiliation_provenance block
4. Add linkedin_profile_path reference
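For a single staff entry, the migration can be sketched like this. The helper name `migrate_staff_entry` is hypothetical, the entry is assumed to be a parsed dictionary, and writing the returned claims into the entity file is left to the caller.

```python
def migrate_staff_entry(entry, provenance, entity_path):
    """Return (clean_entry, web_claims): the entry without inline claims, plus the claims to move."""
    clean = {key: value for key, value in entry.items() if key != "web_claims"}
    claims = list(entry.get("web_claims", []))
    clean["affiliation_provenance"] = dict(provenance)
    clean["linkedin_profile_path"] = entity_path
    return clean, claims


before = {
    "person_id": "example_staff_0001_john_doe",
    "person_name": "John Doe",
    "web_claims": [{"claim_type": "full_name", "claim_value": "John Doe"}],
}
after, moved_claims = migrate_staff_entry(
    before,
    provenance={
        "source_url": "https://www.linkedin.com/company/example/people/",
        "retrieved_on": "2025-12-14T11:21:47Z",
        "retrieval_agent": "linkedin_html_parser",
    },
    entity_path="data/custodian/person/entity/johndoe_20251214T000000Z.json",
)
```

The original entry is left untouched, so the inline claims remain available until the entity file has been written successfully.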

FAQ

Q: Why separate entity files from custodian files?

A: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.

Q: Where do web claims go?

A: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.

Q: What if I don't have a LinkedIn URL?

A: You can still create an entity file using other sources (institutional website, manual research). Use a different slug pattern based on the available identifier.

Q: Can a person have multiple entity files?

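A: Ideally no: one person = one entity file. If you create duplicates by accident, they can be merged later. Note that person_id values are scoped to a custodian (Example 3 shows two different person_id values for the same person), so the canonical linkedin_profile_url is the more reliable key when looking for duplicates.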

Q: What timestamp format should I use?

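A: It depends on where the timestamp appears. File names use the ISO 8601 basic format without separators: YYYYMMDDTHHMMSSZ (e.g., 20251214T112147Z). Timestamp fields inside JSON and YAML, such as retrieved_on and observed_on, use the extended format shown in the examples (e.g., 2025-12-14T11:21:47Z).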
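File names in this document carry compact stamps such as `20251214T112147Z`, while fields such as `retrieved_on` in the examples use `2025-12-14T11:21:47Z`. Converting between the two is a matter of reparsing with the standard library; the helper names below are hypothetical, and both assume UTC values ending in `Z`.

```python
from datetime import datetime

FIELD_FORMAT = "%Y-%m-%dT%H:%M:%SZ"   # as used in retrieved_on / observed_on
FILENAME_FORMAT = "%Y%m%dT%H%M%SZ"    # as used in file names


def to_filename_stamp(field_value):
    """Convert '2025-12-14T11:21:47Z' to '20251214T112147Z'."""
    return datetime.strptime(field_value, FIELD_FORMAT).strftime(FILENAME_FORMAT)


def to_field_value(filename_stamp):
    """Convert '20251214T112147Z' to '2025-12-14T11:21:47Z'."""
    return datetime.strptime(filename_stamp, FILENAME_FORMAT).strftime(FIELD_FORMAT)


print(to_filename_stamp("2025-12-14T11:21:47Z"))
print(to_field_value("20251214T112147Z"))
```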

Related Documentation

Agent Rules: See AGENTS.md Rule 27
Agent Rule File: .opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
Person Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md
LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
Data Fabrication: .opencode/DATA_FABRICATION_PROHIBITION.md (Rule 21)
