Person-Custodian Data Architecture
Overview
This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a Single Source of Truth pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.
Table of Contents
- Architecture Principles
- Directory Structure
- Data Model
- Person Entity Files
- Custodian YAML Files
- Data Flow
- Scripts and Tools
- Examples
- Migration Guide
- FAQ
Architecture Principles
1. Single Source of Truth
Person entity files are the authoritative source for all person data, including:
- Profile information (name, headline, about, experience, education, skills)
- Web claims (provenance for extracted data)
- Affiliations (all custodians this person is associated with)
2. Separation of Concerns
Different data types live in different locations:
| Concern | Location | Rationale |
|---|---|---|
| Who is this person? | Entity file | Reusable across custodians |
| What is their background? | Entity file | Belongs to the person, not the custodian |
| Where did we get this data? | Entity file (web_claims) | Provenance is per-claim |
| How are they affiliated? | Custodian file | Relationship-specific data |
| When did we observe this? | Both | Entity has claim timestamps; Custodian has affiliation timestamp |
3. No Data Duplication
Same person appearing at multiple institutions → ONE entity file
Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
│
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
│
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
4. Cross-Custodian Career Tracking
Entity files track all affiliations, enabling queries like:
- "Who has worked at multiple archives?"
- "Show career paths in the heritage sector"
- "Find people who moved from museums to archives"
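Queries like these can be answered by scanning the entity directory. The following is a minimal sketch, not part of the project's scripts: the function names `load_entities` and `people_at_multiple_custodians` are hypothetical, and it assumes every entity file follows the schema described in this document.

```python
import json
from pathlib import Path


def load_entities(entity_dir):
    """Yield the parsed JSON of every entity file in the directory."""
    for path in sorted(Path(entity_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield json.load(fh)


def people_at_multiple_custodians(entities, heritage_type=None):
    """Return [(name, [custodian_slug, ...])] for people with 2+ distinct custodians.

    If heritage_type is given (e.g. "A"), only affiliations carrying that
    type code are counted.
    """
    result = []
    for entity in entities:
        slugs = []
        for aff in entity.get("affiliations", []):
            if heritage_type and aff.get("heritage_type") != heritage_type:
                continue
            slug = aff.get("custodian_slug")
            if slug and slug not in slugs:
                slugs.append(slug)
        if len(slugs) >= 2:
            name = (entity.get("profile_data") or {}).get("name", "")
            result.append((name, slugs))
    return result
```

Calling `people_at_multiple_custodians(load_entities("data/custodian/person/entity"), heritage_type="A")` would approximate "Who has worked at multiple archives?", provided the `heritage_type` codes are filled in consistently.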
Directory Structure
data/custodian/
├── person/
│   │
│   ├── entity/                      # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                  # Staff lists by custodian
│   │   ├── manual/                  # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                  # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                  # Professional network data
│       ├── manual/                  # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                  # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
│
├── NL-ZH-DHA-A-NA.yaml              # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...
File Naming Conventions
| File Type | Pattern | Example |
|---|---|---|
| Person entity | {linkedin_slug}_{ISO_timestamp}.json | bibianvanreeken_20251211T000000Z.json |
| Staff list (parsed) | {custodian_slug}_staff_{ISO_timestamp}.json | nationaal-archief_staff_20251214T112147Z.json |
| Connections | {linkedin_slug}_connections_{ISO_timestamp}.json | giovanna-fossati_connections_20251211T140000Z.json |
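The naming patterns above can be produced and parsed mechanically. The sketch below is illustrative only: `entity_filename` and `parse_filename` are hypothetical helpers, and it assumes timestamps are always UTC and written in the compact `YYYYMMDDTHHMMSSZ` form shown in the examples.

```python
import re
from datetime import datetime, timezone

# Matches "<slug>_<YYYYMMDDTHHMMSSZ>.json"; the slug itself may contain underscores.
FILENAME_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


def entity_filename(linkedin_slug, when):
    """Build '{linkedin_slug}_{ISO_timestamp}.json' from a timezone-aware datetime."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{linkedin_slug}_{stamp}.json"


def parse_filename(filename):
    """Split a file name into (slug, timezone-aware UTC datetime)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    when = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return match["slug"], when
```

For staff lists and connection files the returned slug includes the `_staff` or `_connections` suffix, so callers that need the bare custodian or LinkedIn slug would strip it themselves.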
Data Model
Conceptual Model
┌──────────────────┐ ┌──────────────────┐
│ Person Entity │ │ Custodian │
│ │ N:M │ │
│ - profile_data │◄───────►│ - name │
│ - web_claims │ │ - ghcid │
│ - affiliations │ │ - staff[] │
│ │ │ │
└──────────────────┘ └──────────────────┘
│ │
│ 1:N │ 1:N
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Web Claim │ │ Staff Entry │
│ │ │ │
│ - claim_type │ │ - person_id │
│ - claim_value │ │ - person_name │
│ - source_url │ │ - role_title │
│ - retrieved_on │ │ - affiliation_ │
│ - retrieval_ │ │ provenance │
│ agent │ │ - linkedin_ │
│ │ │ profile_path │
└──────────────────┘ └──────────────────┘
Key Relationships
| Relationship | Cardinality | Description |
|---|---|---|
| Person ↔ Custodian | N:M | Person can work at multiple custodians; Custodian has multiple staff |
| Person → WebClaim | 1:N | One person has many provenance claims |
| Person → Affiliation | 1:N | One person has many affiliations (tracked in entity file) |
| Custodian → StaffEntry | 1:N | One custodian has many staff entries |
Person Entity Files
Location
data/custodian/person/entity/{linkedin_slug}_{timestamp}.json
Complete Schema
{
"extraction_metadata": {
"source_file": "string", // Path to source staff list
"staff_id": "string", // Unique identifier
"extraction_date": "ISO8601", // When profile was extracted
"extraction_method": "string", // exa_contents, exa_crawling_exa, manual
"extraction_agent": "string", // claude-opus-4.5 for manual, empty for automated
"linkedin_url": "string", // Full LinkedIn profile URL
"cost_usd": 0, // API cost (0 for Exa contents)
"request_id": "string" // Optional: Exa request ID
},
"linkedin_profile_url": "string", // Canonical LinkedIn URL
"profile_data": {
"name": "string", // Full name
"headline": "string", // Current role/headline
"location": "string", // City, Region, Country
"connections": "string", // "500 connections • 2,135 followers"
"about": "string", // Professional summary
"experience": [ // Work history
{
"title": "string",
"company": "string",
"duration": "string",
"location": "string",
"description": "string"
}
],
"education": [ // Education history
{
"school": "string",
"degree": "string",
"field": "string",
"years": "string"
}
],
"skills": ["string"], // Skills list
"languages": [ // Languages
{
"language": "string",
"proficiency": "string"
}
],
"profile_image_url": "string" // CDN URL for profile photo
},
"web_claims": [ // Provenance for extracted data
{
"claim_type": "string", // full_name, role_title, location, etc.
"claim_value": "string", // The extracted value
"source_url": "string", // Where it was found
"retrieved_on": "ISO8601", // When it was retrieved
"retrieval_agent": "string" // linkedin_html_parser, exa_crawling_exa, etc.
}
],
"affiliations": [ // All known custodian associations
{
"custodian_name": "string", // Full custodian name
"custodian_slug": "string", // Normalized slug
"role_title": "string", // Role at this custodian
"heritage_relevant": true, // Is this a heritage role?
"heritage_type": "A", // GLAMORCUBESFIXPHDNT type code
"current": true, // Currently employed?
"observed_on": "ISO8601", // When this affiliation was observed
"source_url": "string" // Where this was observed
}
]
}
Required Fields
| Field | Required | Notes |
|---|---|---|
| extraction_metadata.extraction_date | YES | ISO 8601 timestamp |
| extraction_metadata.linkedin_url | YES | Full LinkedIn profile URL |
| linkedin_profile_url | YES | Canonical URL (may duplicate above) |
| profile_data.name | YES | Full name |
| web_claims | YES | At least one claim (usually full_name) |
| affiliations | NO | May be empty if no custodian association known |
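The required-field table translates directly into a small check. This is a sketch under the assumption that an entity file has already been loaded into a Python dict; `missing_required_fields` is a hypothetical name, not an existing project function.

```python
def missing_required_fields(entity):
    """Return the dotted names of required fields that are absent or empty."""
    problems = []
    meta = entity.get("extraction_metadata") or {}
    if not meta.get("extraction_date"):
        problems.append("extraction_metadata.extraction_date")
    if not meta.get("linkedin_url"):
        problems.append("extraction_metadata.linkedin_url")
    if not entity.get("linkedin_profile_url"):
        problems.append("linkedin_profile_url")
    if not (entity.get("profile_data") or {}).get("name"):
        problems.append("profile_data.name")
    # web_claims must hold at least one claim; affiliations may be empty.
    if not entity.get("web_claims"):
        problems.append("web_claims")
    return problems
```

An empty return value means the file satisfies the table above; it does not validate the timestamp format or the URL syntax.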
Custodian YAML Files
Location
data/custodian/{GHCID}.yaml
Staff Entry Schema
person_observations:
  staff:
    - person_id: string              # Unique identifier (custodian_staff_NNNN_name_slug)
      person_name: string            # Full name (for display/search)
      role_title: string             # Current role at this custodian
      heritage_relevant: boolean     # Is this a heritage-relevant role?
      heritage_type: string          # GLAMORCUBESFIXPHDNT type code
      current: boolean               # Currently employed?
      # AFFILIATION PROVENANCE - when/how was this association observed?
      affiliation_provenance:
        source_url: string           # Where this association was found
        retrieved_on: string         # ISO 8601 timestamp
        retrieval_agent: string      # Tool used (linkedin_html_parser, etc.)
      # REFERENCES to person entity file
      linkedin_profile_url: string   # For quick access/linking
      linkedin_profile_path: string  # Path to entity JSON file
What NOT to Include
Never put these in custodian YAML:
- web_claims - Belongs in entity file
- profile_data - Belongs in entity file
- experience - Belongs in entity file
- education - Belongs in entity file
- skills - Belongs in entity file
- about - Belongs in entity file
- Full profile content of any kind
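A rule like this is easy to enforce with a lint-style check. The sketch below assumes the custodian YAML has already been parsed into a dict (for example with a YAML library of your choice); `FORBIDDEN_STAFF_KEYS` and `find_inline_profile_data` are hypothetical names introduced for illustration.

```python
# Keys that belong in the person entity file, never in a staff entry.
FORBIDDEN_STAFF_KEYS = {
    "web_claims", "profile_data", "experience", "education", "skills", "about",
}


def find_inline_profile_data(custodian):
    """Return [(person_id, [forbidden keys])] for each offending staff entry."""
    staff = (custodian.get("person_observations") or {}).get("staff") or []
    offenders = []
    for entry in staff:
        bad = sorted(FORBIDDEN_STAFF_KEYS.intersection(entry))
        if bad:
            offenders.append((entry.get("person_id"), bad))
    return offenders
```

The check only looks at top-level keys of each staff entry, so it would not catch profile content nested under some other key.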
Data Flow
Complete Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
│
▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html
PHASE 2: PARSING
─────────────────
Manual HTML file
│
▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
│
│ Contains: List of {name, headline, linkedin_url, heritage_relevant}
│
PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
│
▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
│
│ Contains: Full profile_data, web_claims, affiliations
│
PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
│
▼ (link_person_observations.py)
│
├──► Custodian YAML updated with:
│ - person_observations.staff[] entries
│ - affiliation_provenance
│ - linkedin_profile_path references
│
└──► Entity files updated with:
- web_claims (if not present)
- affiliations array (new custodian added)
Script Responsibilities
| Script | Input | Output | Purpose |
|---|---|---|---|
| parse_linkedin_html.py | Raw HTML | affiliated/parsed/*.json | Extract staff list |
| fetch_linkedin_profiles_exa.py | Staff list | entity/*.json | Extract full profiles |
| link_person_observations.py | Entity files + Staff list | Updated YAML + Entity | Create references |
Scripts and Tools
parse_linkedin_html.py
Purpose: Parse LinkedIn company "People" pages to extract staff lists.
Usage:
python scripts/parse_linkedin_html.py \
"data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
--custodian-name "Nationaal Archief" \
--custodian-slug "nationaal-archief" \
--output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json
Output: JSON file with staff entries containing:
- name, headline, linkedin_url
- heritage_relevant, heritage_type
- degree (LinkedIn connection degree)
link_person_observations.py
Purpose: Link person entity files to custodian YAML files.
Usage:
python scripts/link_person_observations.py \
--custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
--staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
--entity-dir data/custodian/person/entity
Actions:
- Reads staff list to get person identifiers
- Finds matching entity files in entity/
- Updates custodian YAML with person_observations.staff[]
- Adds affiliation_provenance and linkedin_profile_path
- Updates entity files with new affiliations and web_claims
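The core of the linking step can be sketched as two small functions. This is not the actual implementation of link_person_observations.py, whose internals are not documented here: `build_staff_entry` and `add_affiliation` are hypothetical helpers, and the default `current=True` is an assumption that people listed on a company page are current staff.

```python
def build_staff_entry(person_id, record, entity_path, source_url, retrieved_on,
                      retrieval_agent="linkedin_html_parser", current=True):
    """Build one person_observations.staff[] entry: references and provenance only."""
    return {
        "person_id": person_id,
        "person_name": record["name"],
        "role_title": record.get("headline", ""),
        "heritage_relevant": record.get("heritage_relevant", False),
        "heritage_type": record.get("heritage_type"),
        "current": current,  # assumption: caller decides; defaults to True
        "affiliation_provenance": {
            "source_url": source_url,
            "retrieved_on": retrieved_on,
            "retrieval_agent": retrieval_agent,
        },
        "linkedin_profile_url": record.get("linkedin_url"),
        "linkedin_profile_path": entity_path,
    }


def add_affiliation(entity, affiliation):
    """Append an affiliation to the entity unless that custodian is already listed.

    Returns True if the affiliation was added, False if it was a duplicate.
    """
    affiliations = entity.setdefault("affiliations", [])
    for existing in affiliations:
        if existing.get("custodian_slug") == affiliation.get("custodian_slug"):
            return False
    affiliations.append(affiliation)
    return True
```

Keeping the staff entry free of profile fields is what preserves the single-source-of-truth rule: everything else is reachable through `linkedin_profile_path`.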
fetch_linkedin_profiles_exa.py
Purpose: Extract full LinkedIn profiles using Exa API.
Usage:
python scripts/fetch_linkedin_profiles_exa.py \
--staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
--output-dir data/custodian/person/entity \
--limit 50
Examples
Example 1: Complete Person Entity File
{
"extraction_metadata": {
"source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
"staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
"extraction_date": "2025-12-14T11:21:47Z",
"extraction_method": "exa_contents",
"extraction_agent": "claude-opus-4.5",
"linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
"cost_usd": 0
},
"linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
"profile_data": {
"name": "Bibian van Reeken",
"headline": "Projectmanager Digitalisering bij het Nationaal Archief",
"location": "The Hague, South Holland, Netherlands",
"connections": "500+ connections",
"about": "Experienced project manager specializing in digitization...",
"experience": [
{
"title": "Projectmanager Digitalisering",
"company": "Nationaal Archief",
"duration": "3 years",
"location": "The Hague, Netherlands"
}
],
"education": [
{
"school": "Leiden University",
"degree": "Master",
"field": "History"
}
],
"skills": ["Project Management", "Digitization", "Archives"]
},
"web_claims": [
{
"claim_type": "full_name",
"claim_value": "Bibian van Reeken",
"source_url": "https://www.linkedin.com/in/bibianvanreeken",
"retrieved_on": "2025-12-14T11:21:47Z",
"retrieval_agent": "linkedin_html_parser"
},
{
"claim_type": "role_title",
"claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
"source_url": "https://www.linkedin.com/in/bibianvanreeken",
"retrieved_on": "2025-12-14T11:21:47Z",
"retrieval_agent": "linkedin_html_parser"
}
],
"affiliations": [
{
"custodian_name": "Nationaal Archief",
"custodian_slug": "nationaal-archief",
"role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
"heritage_relevant": true,
"heritage_type": "A",
"current": true,
"observed_on": "2025-12-14T11:21:47Z",
"source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
}
]
}
Example 2: Custodian YAML Staff Section
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
    - person_id: nationaal-archief_staff_0002_jan_de_vries
      person_name: Jan de Vries
      role_title: Senior Archivist
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
      linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json
Example 3: Cross-Custodian Reference
Person works at two custodians:
Entity file (sandra-den-hamer-66024510_20251209T190000Z.json):
{
"affiliations": [
{
"custodian_name": "EYE Filmmuseum",
"custodian_slug": "eye-filmmuseum",
"role_title": "Director",
"current": false,
"observed_on": "2025-12-09T19:00:00Z"
},
{
"custodian_name": "Netherlands Film Fund",
"custodian_slug": "netherlands-filmfonds",
"role_title": "Interim CEO",
"current": true,
"observed_on": "2025-12-14T10:00:00Z"
}
]
}
Custodian 1 (NL-NH-AMS-U-EFM.yaml):
person_observations:
  staff:
    - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Director
      current: false
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
Custodian 2 (NL-ZH-DHA-O-NFF.yaml):
person_observations:
  staff:
    - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Interim CEO
      current: true
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
Note: Both custodians reference the SAME entity file!
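To see which entity files are shared across custodians, the references can be inverted into an index keyed by entity path. The sketch below is illustrative: `index_references` is a hypothetical helper, and it assumes the custodian files have already been parsed into dicts keyed by their GHCID.

```python
from collections import defaultdict


def index_references(custodians):
    """Map each entity path to the [(ghcid, person_id)] entries that reference it."""
    index = defaultdict(list)
    for ghcid, custodian in custodians.items():
        staff = (custodian.get("person_observations") or {}).get("staff") or []
        for entry in staff:
            path = entry.get("linkedin_profile_path")
            if path:
                index[path].append((ghcid, entry.get("person_id")))
    return dict(index)
```

Any path that maps to more than one custodian identifies a person tracked across institutions, which is the case shown in this example.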
Migration Guide
Migrating from Inline Web Claims
If you have custodian files with inline web_claims, migrate them:
Before (incorrect):
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      web_claims:                    # WRONG - should not be here
        - claim_type: full_name
          claim_value: John Doe
After (correct):
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/example/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json
Migration steps:
- Create entity file with profile data + web claims
- Remove web_claims from custodian YAML
- Add affiliation_provenance block
- Add linkedin_profile_path reference
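The migration steps can be sketched for a single staff entry as follows. `migrate_staff_entry` is a hypothetical helper, not an existing script; it assumes the caller supplies real provenance values and the path of the entity file, and it returns the removed claims so they can be written into that entity file rather than discarded.

```python
def migrate_staff_entry(entry, entity_path, provenance):
    """Return (migrated_entry, moved_claims) without mutating the input entry.

    provenance must be a dict with source_url, retrieved_on and retrieval_agent
    taken from the original observation; nothing is invented here.
    """
    moved_claims = list(entry.get("web_claims", []))
    migrated = {key: value for key, value in entry.items() if key != "web_claims"}
    migrated["affiliation_provenance"] = dict(provenance)
    migrated["linkedin_profile_path"] = entity_path
    return migrated, moved_claims
```

The returned `moved_claims` list is what goes into the `web_claims` array of the new entity file in the first migration step.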
FAQ
Q: Why separate entity files from custodian files?
A: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.
Q: Where do web claims go?
A: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.
Q: What if I don't have a LinkedIn URL?
A: You can still create an entity file using other sources (institutional website, manual research). Use a different slug pattern based on the available identifier.
Q: Can a person have multiple entity files?
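A: Ideally no: one person = one entity file. If you create duplicates by accident, they can be merged later. Note that person_id is custodian-specific (the same person has a different person_id in each custodian file, as in Example 3), so match duplicates on linkedin_profile_url rather than on person_id.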
Q: What timestamp format should I use?
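A: In file names, use ISO 8601 basic format without separators: YYYYMMDDTHHMMSSZ (e.g., 20251214T112147Z). Timestamp fields inside the JSON and YAML files use the extended format, as in the examples above (e.g., 2025-12-14T11:21:47Z).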
Related Documentation
- Agent Rules: See AGENTS.md Rule 27
- Agent Rule File: .opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
- Person Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md
- LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
- Data Fabrication: .opencode/DATA_FABRICATION_PROHIBITION.md (Rule 21)