glam/docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md
2025-12-14 17:09:55 +01:00


# Person-Custodian Data Architecture
## Overview
This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a **Single Source of Truth** pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.
## Table of Contents
1. [Architecture Principles](#architecture-principles)
2. [Directory Structure](#directory-structure)
3. [Data Model](#data-model)
4. [Person Entity Files](#person-entity-files)
5. [Custodian YAML Files](#custodian-yaml-files)
6. [Data Flow](#data-flow)
7. [Scripts and Tools](#scripts-and-tools)
8. [Examples](#examples)
9. [Migration Guide](#migration-guide)
10. [FAQ](#faq)
---
## Architecture Principles
### 1. Single Source of Truth
**Person entity files are the authoritative source for all person data.**
- Profile information (name, headline, about, experience, education, skills)
- Web claims (provenance for extracted data)
- Affiliations (all custodians this person is associated with)
### 2. Separation of Concerns
**Different data types live in different locations:**
| Concern | Location | Rationale |
|---------|----------|-----------|
| Who is this person? | Entity file | Reusable across custodians |
| What is their background? | Entity file | Belongs to the person, not the custodian |
| Where did we get this data? | Entity file (web_claims) | Provenance is per-claim |
| How are they affiliated? | Custodian file | Relationship-specific data |
| When did we observe this? | Both | Entity has claim timestamps; Custodian has affiliation timestamp |
### 3. No Data Duplication
**Same person appearing at multiple institutions → ONE entity file**
```
Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
```
### 4. Cross-Custodian Career Tracking
Entity files track all affiliations, enabling queries like:
- "Who has worked at multiple archives?"
- "Show career paths in the heritage sector"
- "Find people who moved from museums to archives"
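As an illustration of the first query, here is a minimal sketch that scans entity files and reports people with two or more distinct custodians. The function names `load_entities` and `people_at_multiple_custodians` are hypothetical and not part of the project's scripts; the sketch only assumes the `affiliations` and `profile_data.name` fields described in this document.

```python
import json
from pathlib import Path


def load_entities(entity_dir):
    """Yield (path, parsed JSON) for every entity file in entity_dir."""
    for path in sorted(Path(entity_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path, json.load(fh)


def people_at_multiple_custodians(entities, heritage_type=None):
    """Return {name: [custodian_slug, ...]} for people with 2+ distinct custodians.

    If heritage_type is given (e.g. "A"), only affiliations carrying that
    heritage_type are counted.
    """
    result = {}
    for _path, entity in entities:
        slugs = []
        for aff in entity.get("affiliations", []):
            if heritage_type and aff.get("heritage_type") != heritage_type:
                continue
            slug = aff.get("custodian_slug")
            if slug and slug not in slugs:
                slugs.append(slug)
        if len(slugs) >= 2:
            # Keyed by display name for readability; two people sharing a
            # name would collide, so a real query should key on the file path.
            name = entity.get("profile_data", {}).get("name", "<unknown>")
            result[name] = slugs
    return result
```

A call such as `people_at_multiple_custodians(load_entities("data/custodian/person/entity"), heritage_type="A")` would then approximate "who has worked at multiple archives", assuming the `heritage_type` values are filled in consistently.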
---
## Directory Structure
```
data/custodian/
├── person/
│   │
│   ├── entity/                          # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                      # Staff lists by custodian
│   │   ├── manual/                      # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                      # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                      # Professional network data
│       ├── manual/                      # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                      # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
├── NL-ZH-DHA-A-NA.yaml                  # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...
```
### File Naming Conventions
| File Type | Pattern | Example |
|-----------|---------|---------|
| Person entity | `{linkedin_slug}_{ISO_timestamp}.json` | `bibianvanreeken_20251211T000000Z.json` |
| Staff list (parsed) | `{custodian_slug}_staff_{ISO_timestamp}.json` | `nationaal-archief_staff_20251214T112147Z.json` |
| Connections | `{linkedin_slug}_connections_{ISO_timestamp}.json` | `giovanna-fossati_connections_20251211T140000Z.json` |
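The entity pattern can be built and parsed mechanically. The sketch below is illustrative only: `entity_filename`, `parse_entity_filename`, and the regular expression are hypothetical helpers, and the sketch assumes timestamps are always UTC with a trailing `Z`, as in the examples above.

```python
import re
from datetime import datetime, timezone

# Assumed shape: slug, underscore, compact UTC timestamp, ".json".
ENTITY_RE = re.compile(r"^(?P<slug>.+)_(?P<ts>\d{8}T\d{6}Z)\.json$")


def entity_filename(linkedin_slug, when):
    """Build '{linkedin_slug}_{ISO_timestamp}.json' from a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{linkedin_slug}_{stamp}.json"


def parse_entity_filename(filename):
    """Split an entity filename into (slug, timezone-aware UTC datetime)."""
    match = ENTITY_RE.match(filename)
    if match is None:
        raise ValueError(f"not an entity filename: {filename!r}")
    when = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return match["slug"], when
```

Because the slug itself may contain hyphens and digits (for example `sandra-den-hamer-66024510`), the parser anchors on the final underscore plus timestamp rather than splitting on the first separator.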
---
## Data Model
### Conceptual Model
```
┌──────────────────┐         ┌──────────────────┐
│  Person Entity   │         │    Custodian     │
│                  │   N:M   │                  │
│ - profile_data   │◄───────►│ - name           │
│ - web_claims     │         │ - ghcid          │
│ - affiliations   │         │ - staff[]        │
│                  │         │                  │
└──────────────────┘         └──────────────────┘
         │                            │
         │ 1:N                        │ 1:N
         ▼                            ▼
┌──────────────────┐         ┌──────────────────┐
│    Web Claim     │         │   Staff Entry    │
│                  │         │                  │
│ - claim_type     │         │ - person_id      │
│ - claim_value    │         │ - person_name    │
│ - source_url     │         │ - role_title     │
│ - retrieved_on   │         │ - affiliation_   │
│ - retrieval_     │         │   provenance     │
│   agent          │         │ - linkedin_      │
│                  │         │   profile_path   │
└──────────────────┘         └──────────────────┘
```
### Key Relationships
| Relationship | Cardinality | Description |
|--------------|-------------|-------------|
| Person ↔ Custodian | N:M | Person can work at multiple custodians; Custodian has multiple staff |
| Person → WebClaim | 1:N | One person has many provenance claims |
| Person → Affiliation | 1:N | One person has many affiliations (tracked in entity file) |
| Custodian → StaffEntry | 1:N | One custodian has many staff entries |
---
## Person Entity Files
### Location
`data/custodian/person/entity/{linkedin_slug}_{timestamp}.json`
### Complete Schema
```json
{
  "extraction_metadata": {
    "source_file": "string",          // Path to source staff list
    "staff_id": "string",             // Unique identifier
    "extraction_date": "ISO8601",     // When profile was extracted
    "extraction_method": "string",    // exa_contents, exa_crawling_exa, manual
    "extraction_agent": "string",     // claude-opus-4.5 for manual, empty for automated
    "linkedin_url": "string",         // Full LinkedIn profile URL
    "cost_usd": 0,                    // API cost (0 for Exa contents)
    "request_id": "string"            // Optional: Exa request ID
  },
  "linkedin_profile_url": "string",   // Canonical LinkedIn URL
  "profile_data": {
    "name": "string",                 // Full name
    "headline": "string",             // Current role/headline
    "location": "string",             // City, Region, Country
    "connections": "string",          // "500 connections • 2,135 followers"
    "about": "string",                // Professional summary
    "experience": [                   // Work history
      {
        "title": "string",
        "company": "string",
        "duration": "string",
        "location": "string",
        "description": "string"
      }
    ],
    "education": [                    // Education history
      {
        "school": "string",
        "degree": "string",
        "field": "string",
        "years": "string"
      }
    ],
    "skills": ["string"],             // Skills list
    "languages": [                    // Languages
      {
        "language": "string",
        "proficiency": "string"
      }
    ],
    "profile_image_url": "string"     // CDN URL for profile photo
  },
  "web_claims": [                     // Provenance for extracted data
    {
      "claim_type": "string",         // full_name, role_title, location, etc.
      "claim_value": "string",        // The extracted value
      "source_url": "string",         // Where it was found
      "retrieved_on": "ISO8601",      // When it was retrieved
      "retrieval_agent": "string"     // linkedin_html_parser, exa_crawling_exa, etc.
    }
  ],
  "affiliations": [                   // All known custodian associations
    {
      "custodian_name": "string",     // Full custodian name
      "custodian_slug": "string",     // Normalized slug
      "role_title": "string",         // Role at this custodian
      "heritage_relevant": true,      // Is this a heritage role?
      "heritage_type": "A",           // GLAMORCUBESFIXPHDNT type code
      "current": true,                // Currently employed?
      "observed_on": "ISO8601",       // When this affiliation was observed
      "source_url": "string"          // Where this was observed
    }
  ]
}
```
### Required Fields
| Field | Required | Notes |
|-------|----------|-------|
| `extraction_metadata.extraction_date` | YES | ISO 8601 timestamp |
| `extraction_metadata.linkedin_url` | YES | Full LinkedIn profile URL |
| `linkedin_profile_url` | YES | Canonical URL (may duplicate above) |
| `profile_data.name` | YES | Full name |
| `web_claims` | YES | At least one claim (usually full_name) |
| `affiliations` | NO | May be empty if no custodian association known |
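The required-field table translates directly into a small check. This is a sketch rather than the project's validator: `missing_required_fields` is a hypothetical helper that takes an already-parsed entity dict and treats empty strings and empty lists as missing.

```python
def missing_required_fields(entity):
    """Return the dotted names of required fields that are absent or empty."""
    problems = []
    meta = entity.get("extraction_metadata") or {}
    for key in ("extraction_date", "linkedin_url"):
        if not meta.get(key):
            problems.append(f"extraction_metadata.{key}")
    if not entity.get("linkedin_profile_url"):
        problems.append("linkedin_profile_url")
    if not (entity.get("profile_data") or {}).get("name"):
        problems.append("profile_data.name")
    # web_claims must hold at least one claim; affiliations may be empty.
    if not entity.get("web_claims"):
        problems.append("web_claims")
    return problems
```

An empty return value means the entity satisfies the table; the check deliberately does not verify timestamp or URL syntax.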
---
## Custodian YAML Files
### Location
`data/custodian/{GHCID}.yaml`
### Staff Entry Schema
```yaml
person_observations:
  staff:
    - person_id: string              # Unique identifier (custodian_staff_NNNN_name_slug)
      person_name: string            # Full name (for display/search)
      role_title: string             # Current role at this custodian
      heritage_relevant: boolean     # Is this a heritage-relevant role?
      heritage_type: string          # GLAMORCUBESFIXPHDNT type code
      current: boolean               # Currently employed?
      # AFFILIATION PROVENANCE - when/how was this association observed?
      affiliation_provenance:
        source_url: string           # Where this association was found
        retrieved_on: string         # ISO 8601 timestamp
        retrieval_agent: string      # Tool used (linkedin_html_parser, etc.)
      # REFERENCES to person entity file
      linkedin_profile_url: string   # For quick access/linking
      linkedin_profile_path: string  # Path to entity JSON file
```
### What NOT to Include
**Never put these in custodian YAML:**
- `web_claims` - Belongs in entity file
- `profile_data` - Belongs in entity file
- `experience` - Belongs in entity file
- `education` - Belongs in entity file
- `skills` - Belongs in entity file
- `about` - Belongs in entity file
- Full profile content of any kind
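A lint-style check can catch these keys before they are committed. The sketch below assumes the custodian YAML has already been loaded into a dict (for example with a YAML library of your choice); `find_forbidden_keys` and `FORBIDDEN_STAFF_KEYS` are hypothetical names, not existing project code.

```python
# Keys that belong in the person entity file, per the list above.
FORBIDDEN_STAFF_KEYS = {
    "web_claims", "profile_data", "experience", "education", "skills", "about",
}


def find_forbidden_keys(custodian):
    """Return [(person_id, [forbidden keys])] for staff entries carrying profile content."""
    staff = (custodian.get("person_observations") or {}).get("staff") or []
    violations = []
    for entry in staff:
        bad = sorted(FORBIDDEN_STAFF_KEYS.intersection(entry))
        if bad:
            violations.append((entry.get("person_id"), bad))
    return violations
```

The key set covers only the fields named explicitly in the list; "full profile content of any kind" is broader and would still need human review.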
---
## Data Flow
### Complete Pipeline
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html
PHASE 2: PARSING
─────────────────
Manual HTML file
▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
│ Contains: List of {name, headline, linkedin_url, heritage_relevant}
PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
│ Contains: Full profile_data, web_claims, affiliations
PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
▼ (link_person_observations.py)
├──► Custodian YAML updated with:
│ - person_observations.staff[] entries
│ - affiliation_provenance
│ - linkedin_profile_path references
└──► Entity files updated with:
- web_claims (if not present)
- affiliations array (new custodian added)
```
### Script Responsibilities
| Script | Input | Output | Purpose |
|--------|-------|--------|---------|
| `parse_linkedin_html.py` | Raw HTML | `affiliated/parsed/*.json` | Extract staff list |
| `fetch_linkedin_profiles_exa.py` | Staff list | `entity/*.json` | Extract full profiles |
| `link_person_observations.py` | Custodian YAML + staff list + entity files | Updated YAML + entity files | Create references |
---
## Scripts and Tools
### parse_linkedin_html.py
**Purpose**: Parse LinkedIn company "People" pages to extract staff lists.
**Usage**:
```bash
python scripts/parse_linkedin_html.py \
"data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
--custodian-name "Nationaal Archief" \
--custodian-slug "nationaal-archief" \
--output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json
```
**Output**: JSON file with staff entries containing:
- `name`, `headline`, `linkedin_url`
- `heritage_relevant`, `heritage_type`
- `degree` (LinkedIn connection degree)
### link_person_observations.py
**Purpose**: Link person entity files to custodian YAML files.
**Usage**:
```bash
python scripts/link_person_observations.py \
--custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
--staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
--entity-dir data/custodian/person/entity
```
**Actions**:
1. Reads staff list to get person identifiers
2. Finds matching entity files in `entity/`
3. Updates custodian YAML with `person_observations.staff[]`
4. Adds `affiliation_provenance` and `linkedin_profile_path`
5. Updates entity files with new affiliations and web_claims
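To make steps 3 to 5 concrete, here is a rough sketch of the two data transformations involved. It is not the code of `link_person_observations.py`; the helper names are hypothetical, and two mappings are assumptions labeled in the comments (headline used as role title, listed staff treated as current).

```python
def build_staff_entry(person_id, staff_record, entity_path, source_url,
                      retrieved_on, retrieval_agent="linkedin_html_parser"):
    """Build one person_observations.staff[] entry that holds references only."""
    return {
        "person_id": person_id,
        "person_name": staff_record["name"],
        # Assumption: the staff-list headline doubles as the role title.
        "role_title": staff_record.get("headline", ""),
        "heritage_relevant": staff_record.get("heritage_relevant", False),
        "heritage_type": staff_record.get("heritage_type"),
        # Assumption: appearing on the People page means currently employed.
        "current": True,
        "affiliation_provenance": {
            "source_url": source_url,
            "retrieved_on": retrieved_on,
            "retrieval_agent": retrieval_agent,
        },
        "linkedin_profile_url": staff_record["linkedin_url"],
        "linkedin_profile_path": entity_path,
    }


def add_affiliation(entity, affiliation):
    """Append affiliation unless that custodian_slug is already recorded.

    Returns True when the entity dict was changed.
    """
    affiliations = entity.setdefault("affiliations", [])
    slug = affiliation["custodian_slug"]
    if any(a.get("custodian_slug") == slug for a in affiliations):
        return False
    affiliations.append(affiliation)
    return True
```

Keeping `add_affiliation` idempotent means the linking step can be re-run against the same staff list without duplicating entries in the entity file.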
### fetch_linkedin_profiles_exa.py
**Purpose**: Extract full LinkedIn profiles using Exa API.
**Usage**:
```bash
python scripts/fetch_linkedin_profiles_exa.py \
--staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
--output-dir data/custodian/person/entity \
--limit 50
```
---
## Examples
### Example 1: Complete Person Entity File
```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "connections": "500+ connections",
    "about": "Experienced project manager specializing in digitization...",
    "experience": [
      {
        "title": "Projectmanager Digitalisering",
        "company": "Nationaal Archief",
        "duration": "3 years",
        "location": "The Hague, Netherlands"
      }
    ],
    "education": [
      {
        "school": "Leiden University",
        "degree": "Master",
        "field": "History"
      }
    ],
    "skills": ["Project Management", "Digitization", "Archives"]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```
### Example 2: Custodian YAML Staff Section
```yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
    - person_id: nationaal-archief_staff_0002_jan_de_vries
      person_name: Jan de Vries
      role_title: Senior Archivist
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
      linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json
```
### Example 3: Cross-Custodian Reference
Person works at two custodians:
**Entity file** (`sandra-den-hamer-66024510_20251209T190000Z.json`):
```json
{
  "affiliations": [
    {
      "custodian_name": "EYE Filmmuseum",
      "custodian_slug": "eye-filmmuseum",
      "role_title": "Director",
      "current": false,
      "observed_on": "2025-12-09T19:00:00Z"
    },
    {
      "custodian_name": "Netherlands Film Fund",
      "custodian_slug": "netherlands-filmfonds",
      "role_title": "Interim CEO",
      "current": true,
      "observed_on": "2025-12-14T10:00:00Z"
    }
  ]
}
```
**Custodian 1** (`NL-NH-AMS-U-EFM.yaml`):
```yaml
person_observations:
  staff:
    - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Director
      current: false
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```
**Custodian 2** (`NL-ZH-DHA-O-NFF.yaml`):
```yaml
person_observations:
  staff:
    - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Interim CEO
      current: true
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```
**Note**: Both custodians reference the SAME entity file!
---
## Migration Guide
### Migrating from Inline Web Claims
If you have custodian files with inline `web_claims`, migrate them:
**Before** (incorrect):
```yaml
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      web_claims:                  # WRONG - should not be here
        - claim_type: full_name
          claim_value: John Doe
```
**After** (correct):
```yaml
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/example/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json
```
**Migration steps**:
1. Create entity file with profile data + web claims
2. Remove `web_claims` from custodian YAML
3. Add `affiliation_provenance` block
4. Add `linkedin_profile_path` reference
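Steps 2 to 4 can be sketched as a pure function over one staff entry. This is an illustrative helper, not an existing migration script: `migrate_staff_entry` is a hypothetical name, it works on dicts that have already been loaded from YAML, and it returns the removed claims so the caller can write them into the entity file (step 1).

```python
def migrate_staff_entry(entry, entity_path, source_url, retrieved_on,
                        retrieval_agent="linkedin_html_parser"):
    """Return (migrated_entry, extracted_web_claims) without mutating entry."""
    # Step 2: drop inline web_claims from the custodian-side entry.
    migrated = {key: value for key, value in entry.items() if key != "web_claims"}
    claims = list(entry.get("web_claims", []))
    # Step 3: record how and when the affiliation itself was observed.
    migrated["affiliation_provenance"] = {
        "source_url": source_url,
        "retrieved_on": retrieved_on,
        "retrieval_agent": retrieval_agent,
    }
    # Step 4: point at the entity file that now owns the claims.
    migrated["linkedin_profile_path"] = entity_path
    return migrated, claims
```

Returning the claims instead of discarding them keeps the migration lossless: nothing is removed from the custodian file until it has somewhere else to live.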
---
## FAQ
### Q: Why separate entity files from custodian files?
**A**: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.
### Q: Where do web claims go?
**A**: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.
### Q: What if I don't have a LinkedIn URL?
**A**: You can still create an entity file using other sources (institutional website, manual research). Use a different slug pattern based on the available identifier.
### Q: Can a person have multiple entity files?
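**A**: Ideally no: one person = one entity file. If you create duplicates by accident, they can be merged later. Note that `person_id` is scoped to a custodian's staff list (in Example 3 the same person has a different `person_id` at each custodian), so match duplicates on the LinkedIn slug or `linkedin_profile_url` rather than on `person_id`.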
### Q: What timestamp format should I use?
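**A**: In filenames, use ISO 8601 basic format (no separators): `YYYYMMDDTHHMMSSZ` (e.g., `20251214T112147Z`). Timestamp fields inside JSON and YAML files use the extended format shown in the examples (e.g., `2025-12-14T11:21:47Z`).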
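The examples in this document write the same instant in two ways: compact in filenames (`20251214T112147Z`) and punctuated in JSON/YAML fields (`2025-12-14T11:21:47Z`). A minimal conversion sketch, assuming all timestamps are UTC with a literal `Z` suffix; `to_basic` and `to_extended` are hypothetical helper names.

```python
from datetime import datetime

BASIC = "%Y%m%dT%H%M%SZ"          # filename form
EXTENDED = "%Y-%m-%dT%H:%M:%SZ"   # field form used in the JSON/YAML examples


def to_basic(extended):
    """Convert '2025-12-14T11:21:47Z' to '20251214T112147Z'."""
    return datetime.strptime(extended, EXTENDED).strftime(BASIC)


def to_extended(basic):
    """Convert '20251214T112147Z' to '2025-12-14T11:21:47Z'."""
    return datetime.strptime(basic, BASIC).strftime(EXTENDED)
```

Parsing through `datetime.strptime` rather than string slicing means malformed values raise `ValueError` instead of silently producing a bad filename.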
---
## Related Documentation
- **Agent Rules**: See `AGENTS.md` Rule 27
- **Agent Rule File**: `.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md`
- **Person Reference Pattern**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md`
- **LinkedIn Extraction**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md`
- **Data Fabrication**: `.opencode/DATA_FABRICATION_PROHIBITION.md` (Rule 21)