# Person-Custodian Data Architecture

## Overview

This document describes the data architecture for managing person/staff information in the GLAM Heritage Custodian project. The architecture follows a **Single Source of Truth** pattern where person entity files contain all person-specific data, while custodian files contain only references and affiliation provenance.

## Table of Contents

1. [Architecture Principles](#architecture-principles)
2. [Directory Structure](#directory-structure)
3. [Data Model](#data-model)
4. [Person Entity Files](#person-entity-files)
5. [Custodian YAML Files](#custodian-yaml-files)
6. [Data Flow](#data-flow)
7. [Scripts and Tools](#scripts-and-tools)
8. [Examples](#examples)
9. [Migration Guide](#migration-guide)
10. [FAQ](#faq)

---

## Architecture Principles

### 1. Single Source of Truth

**Person entity files are the authoritative source for all person data.**

- Profile information (name, headline, about, experience, education, skills)
- Web claims (provenance for extracted data)
- Affiliations (all custodians this person is associated with)

### 2. Separation of Concerns

**Different data types live in different locations:**

| Concern | Location | Rationale |
|---------|----------|-----------|
| Who is this person? | Entity file | Reusable across custodians |
| What is their background? | Entity file | Belongs to the person, not the custodian |
| Where did we get this data? | Entity file (web_claims) | Provenance is per-claim |
| How are they affiliated? | Custodian file | Relationship-specific data |
| When did we observe this? | Both | Entity has claim timestamps; Custodian has affiliation timestamp |

### 3. No Data Duplication

**Same person appearing at multiple institutions → ONE entity file**

```
Person: Sandra den Hamer
├── Entity: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
│   └── affiliations: [EYE Filmmuseum, Netherlands Film Fund]
│
├── Reference: data/custodian/NL-NH-AMS-U-EFM.yaml
│   └── linkedin_profile_path: → entity file
│
└── Reference: data/custodian/NL-ZH-DHA-O-NFF.yaml
    └── linkedin_profile_path: → entity file (SAME file!)
```

### 4. Cross-Custodian Career Tracking

Entity files track all affiliations, enabling queries like:

- "Who has worked at multiple archives?"
- "Show career paths in the heritage sector"
- "Find people who moved from museums to archives"

---

## Directory Structure

```
data/custodian/
├── person/
│   │
│   ├── entity/                     # SINGLE SOURCE OF TRUTH
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   ├── sandra-den-hamer-66024510_20251209T190000Z.json
│   │   └── ...
│   │
│   ├── affiliated/                 # Staff lists by custodian
│   │   ├── manual/                 # Raw HTML/MD input files
│   │   │   └── nationaal-archief_staff_20251214.html
│   │   └── parsed/                 # Parsed JSON staff lists
│   │       ├── nationaal-archief_staff_20251214T112147Z.json
│   │       ├── noord-hollands-archief_staff_20251214T143055Z.json
│   │       └── ...
│   │
│   └── connection/                 # Professional network data
│       ├── manual/                 # Raw connection lists
│       │   └── giovanna-fossati_connections_20251211.md
│       └── parsed/                 # Parsed connection JSON
│           └── giovanna-fossati_connections_20251211T140000Z.json
│
├── NL-ZH-DHA-A-NA.yaml             # Custodian files reference entity/
├── NL-NH-HAA-A-NHA.yaml
├── NL-GE-ARN-A-GA.yaml
├── NL-UT-UTR-A-UA.yaml
└── ...
```

### File Naming Conventions

| File Type | Pattern | Example |
|-----------|---------|---------|
| Person entity | `{linkedin_slug}_{ISO_timestamp}.json` | `bibianvanreeken_20251211T000000Z.json` |
| Staff list (parsed) | `{custodian_slug}_staff_{ISO_timestamp}.json` | `nationaal-archief_staff_20251214T112147Z.json` |
| Connections | `{linkedin_slug}_connections_{ISO_timestamp}.json` | `giovanna-fossati_connections_20251211T140000Z.json` |

---

## Data Model

### Conceptual Model

```
┌──────────────────┐         ┌──────────────────┐
│  Person Entity   │         │    Custodian     │
│                  │   N:M   │                  │
│ - profile_data   │◄───────►│ - name           │
│ - web_claims     │         │ - ghcid          │
│ - affiliations   │         │ - staff[]        │
│                  │         │                  │
└──────────────────┘         └──────────────────┘
         │                            │
         │ 1:N                        │ 1:N
         ▼                            ▼
┌──────────────────┐         ┌──────────────────┐
│    Web Claim     │         │   Staff Entry    │
│                  │         │                  │
│ - claim_type     │         │ - person_id      │
│ - claim_value    │         │ - person_name    │
│ - source_url     │         │ - role_title     │
│ - retrieved_on   │         │ - affiliation_   │
│ - retrieval_     │         │   provenance     │
│   agent          │         │ - linkedin_      │
│                  │         │   profile_path   │
└──────────────────┘         └──────────────────┘
```

### Key Relationships

| Relationship | Cardinality | Description |
|--------------|-------------|-------------|
| Person ↔ Custodian | N:M | Person can work at multiple custodians; Custodian has multiple staff |
| Person → WebClaim | 1:N | One person has many provenance claims |
| Person → Affiliation | 1:N | One person has many affiliations (tracked in entity file) |
| Custodian → StaffEntry | 1:N | One custodian has many staff entries |

---

## Person Entity Files

### Location

`data/custodian/person/entity/{linkedin_slug}_{timestamp}.json`

### Complete Schema

```json
{
  "extraction_metadata": {
    "source_file": "string",           // Path to source staff list
    "staff_id": "string",              // Unique identifier
    "extraction_date": "ISO8601",      // When profile was extracted
    "extraction_method": "string",     // exa_contents, exa_crawling_exa, manual
    "extraction_agent": "string",      // claude-opus-4.5 for manual, empty for automated
    "linkedin_url": "string",          // Full LinkedIn profile URL
    "cost_usd": 0,                     // API cost (0 for Exa contents)
    "request_id": "string"             // Optional: Exa request ID
  },

  "linkedin_profile_url": "string",    // Canonical LinkedIn URL

  "profile_data": {
    "name": "string",                  // Full name
    "headline": "string",              // Current role/headline
    "location": "string",              // City, Region, Country
    "connections": "string",           // "500 connections • 2,135 followers"
    "about": "string",                 // Professional summary
    "experience": [                    // Work history
      {
        "title": "string",
        "company": "string",
        "duration": "string",
        "location": "string",
        "description": "string"
      }
    ],
    "education": [                     // Education history
      {
        "school": "string",
        "degree": "string",
        "field": "string",
        "years": "string"
      }
    ],
    "skills": ["string"],              // Skills list
    "languages": [                     // Languages
      {
        "language": "string",
        "proficiency": "string"
      }
    ],
    "profile_image_url": "string"      // CDN URL for profile photo
  },

  "web_claims": [                      // Provenance for extracted data
    {
      "claim_type": "string",          // full_name, role_title, location, etc.
      "claim_value": "string",         // The extracted value
      "source_url": "string",          // Where it was found
      "retrieved_on": "ISO8601",       // When it was retrieved
      "retrieval_agent": "string"      // linkedin_html_parser, exa_crawling_exa, etc.
    }
  ],

  "affiliations": [                    // All known custodian associations
    {
      "custodian_name": "string",      // Full custodian name
      "custodian_slug": "string",      // Normalized slug
      "role_title": "string",          // Role at this custodian
      "heritage_relevant": true,       // Is this a heritage role?
      "heritage_type": "A",            // GLAMORCUBESFIXPHDNT type code
      "current": true,                 // Currently employed?
      "observed_on": "ISO8601",        // When this affiliation was observed
      "source_url": "string"           // Where this was observed
    }
  ]
}
```

### Required Fields

| Field | Required | Notes |
|-------|----------|-------|
| `extraction_metadata.extraction_date` | YES | ISO 8601 timestamp |
| `extraction_metadata.linkedin_url` | YES | Full LinkedIn profile URL |
| `linkedin_profile_url` | YES | Canonical URL (may duplicate above) |
| `profile_data.name` | YES | Full name |
| `web_claims` | YES | At least one claim (usually full_name) |
| `affiliations` | NO | May be empty if no custodian association known |

---

## Custodian YAML Files

### Location

`data/custodian/{GHCID}.yaml`

### Staff Entry Schema

```yaml
person_observations:
  staff:
    - person_id: string               # Unique identifier (custodian_staff_NNNN_name_slug)
      person_name: string             # Full name (for display/search)
      role_title: string              # Current role at this custodian
      heritage_relevant: boolean      # Is this a heritage-relevant role?
      heritage_type: string           # GLAMORCUBESFIXPHDNT type code
      current: boolean                # Currently employed?

      # AFFILIATION PROVENANCE - when/how was this association observed?
      affiliation_provenance:
        source_url: string            # Where this association was found
        retrieved_on: string          # ISO 8601 timestamp
        retrieval_agent: string       # Tool used (linkedin_html_parser, etc.)

      # REFERENCES to person entity file
      linkedin_profile_url: string    # For quick access/linking
      linkedin_profile_path: string   # Path to entity JSON file
```

### What NOT to Include

**Never put these in custodian YAML:**

- `web_claims` - Belongs in entity file
- `profile_data` - Belongs in entity file
- `experience` - Belongs in entity file
- `education` - Belongs in entity file
- `skills` - Belongs in entity file
- `about` - Belongs in entity file
- Full profile content of any kind

---

## Data Flow

### Complete Pipeline

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA FLOW PIPELINE                                │
└─────────────────────────────────────────────────────────────────────────────┘

PHASE 1: DATA COLLECTION
─────────────────────────
LinkedIn Company Page
        │
        ▼ (Save HTML)
data/custodian/person/affiliated/manual/{slug}_staff_{date}.html

PHASE 2: PARSING
─────────────────
Manual HTML file
        │
        ▼ (parse_linkedin_html.py)
data/custodian/person/affiliated/parsed/{slug}_staff_{timestamp}.json
        │
        │ Contains: List of {name, headline, linkedin_url, heritage_relevant}
        │

PHASE 3: PROFILE EXTRACTION
───────────────────────────
Parsed staff list
        │
        ▼ (Exa crawling OR manual extraction)
data/custodian/person/entity/{person_slug}_{timestamp}.json
        │
        │ Contains: Full profile_data, web_claims, affiliations
        │

PHASE 4: LINKING
────────────────
Entity files + Custodian YAML
        │
        ▼ (link_person_observations.py)
        │
        ├──► Custodian YAML updated with:
        │      - person_observations.staff[] entries
        │      - affiliation_provenance
        │      - linkedin_profile_path references
        │
        └──► Entity files updated with:
               - web_claims (if not present)
               - affiliations array (new custodian added)
```

### Script Responsibilities

| Script | Input | Output | Purpose |
|--------|-------|--------|---------|
| `parse_linkedin_html.py` | Raw HTML | `affiliated/parsed/*.json` | Extract staff list |
| `fetch_linkedin_profiles_exa.py` | Staff list | `entity/*.json` | Extract full profiles |
| `link_person_observations.py` | Entity files + Staff list + Custodian YAML | Updated YAML + Entity | Create references |

---

## Scripts and Tools

### parse_linkedin_html.py

**Purpose**: Parse LinkedIn company "People" pages to extract staff lists.

**Usage**:

```bash
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Nationaal Archief_ People _ LinkedIn.html" \
  --custodian-name "Nationaal Archief" \
  --custodian-slug "nationaal-archief" \
  --output data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json
```

**Output**: JSON file with staff entries containing:

- `name`, `headline`, `linkedin_url`
- `heritage_relevant`, `heritage_type`
- `degree` (LinkedIn connection degree)

### link_person_observations.py

**Purpose**: Link person entity files to custodian YAML files.

**Usage**:

```bash
python scripts/link_person_observations.py \
  --custodian-file data/custodian/NL-ZH-DHA-A-NA.yaml \
  --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
  --entity-dir data/custodian/person/entity
```

**Actions**:

1. Reads staff list to get person identifiers
2. Finds matching entity files in `entity/`
3. Updates custodian YAML with `person_observations.staff[]`
4. Adds `affiliation_provenance` and `linkedin_profile_path`
5. Updates entity files with new affiliations and web_claims

### fetch_linkedin_profiles_exa.py

**Purpose**: Extract full LinkedIn profiles using the Exa API.
**Usage**:

```bash
python scripts/fetch_linkedin_profiles_exa.py \
  --staff-file data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json \
  --output-dir data/custodian/person/entity \
  --limit 50
```

---

## Examples

### Example 1: Complete Person Entity File

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "linkedin_profile_url": "https://www.linkedin.com/in/bibianvanreeken",
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "connections": "500+ connections",
    "about": "Experienced project manager specializing in digitization...",
    "experience": [
      {
        "title": "Projectmanager Digitalisering",
        "company": "Nationaal Archief",
        "duration": "3 years",
        "location": "The Hague, Netherlands"
      }
    ],
    "education": [
      {
        "school": "Leiden University",
        "degree": "Master",
        "field": "History"
      }
    ],
    "skills": ["Project Management", "Digitization", "Archives"]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```

### Example 2: Custodian YAML Staff Section

```yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json

    - person_id: nationaal-archief_staff_0002_jan_de_vries
      person_name: Jan de Vries
      role_title: Senior Archivist
      heritage_relevant: true
      heritage_type: A
      current: true
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_url: https://www.linkedin.com/in/jandevries12345
      linkedin_profile_path: data/custodian/person/entity/jandevries12345_20251214T150000Z.json
```

### Example 3: Cross-Custodian Reference

Person works at two custodians:

**Entity file** (`sandra-den-hamer-66024510_20251209T190000Z.json`):

```json
{
  "affiliations": [
    {
      "custodian_name": "EYE Filmmuseum",
      "custodian_slug": "eye-filmmuseum",
      "role_title": "Director",
      "current": false,
      "observed_on": "2025-12-09T19:00:00Z"
    },
    {
      "custodian_name": "Netherlands Film Fund",
      "custodian_slug": "netherlands-filmfonds",
      "role_title": "Interim CEO",
      "current": true,
      "observed_on": "2025-12-14T10:00:00Z"
    }
  ]
}
```

**Custodian 1** (`NL-NH-AMS-U-EFM.yaml`):

```yaml
person_observations:
  staff:
    - person_id: eye-filmmuseum_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Director
      current: false
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```

**Custodian 2** (`NL-ZH-DHA-O-NFF.yaml`):

```yaml
person_observations:
  staff:
    - person_id: netherlands-filmfonds_staff_0001_sandra_den_hamer
      person_name: Sandra den Hamer
      role_title: Interim CEO
      current: true
      linkedin_profile_path: data/custodian/person/entity/sandra-den-hamer-66024510_20251209T190000Z.json
```

**Note**: Both custodians reference the SAME entity file!

---

## Migration Guide

### Migrating from Inline Web Claims

If you have custodian files with inline `web_claims`, migrate them:

**Before** (incorrect):

```yaml
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      web_claims:          # WRONG - should not be here
        - claim_type: full_name
          claim_value: John Doe
```

**After** (correct):

```yaml
person_observations:
  staff:
    - person_id: example_staff_0001_john_doe
      person_name: John Doe
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/example/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_path: data/custodian/person/entity/johndoe_20251214T000000Z.json
```

**Migration steps**:

1. Create entity file with profile data + web claims
2. Remove `web_claims` from custodian YAML
3. Add `affiliation_provenance` block
4. Add `linkedin_profile_path` reference

---

## FAQ

### Q: Why separate entity files from custodian files?

**A**: To avoid data duplication. A person working at 3 custodians would otherwise have their profile data copied 3 times. With this architecture, there's ONE entity file referenced 3 times.

### Q: Where do web claims go?

**A**: Always in the person entity file, never in custodian YAML. Web claims are about the person, not about their affiliation.

### Q: What if I don't have a LinkedIn URL?

**A**: You can still create an entity file using other sources (institutional website, manual research).
Use a different slug pattern based on the available identifier.

### Q: Can a person have multiple entity files?

**A**: Ideally no - one person = one entity file. However, if you create duplicates by accident, they can be merged later. The `person_id` is the key identifier.

### Q: What timestamp format should I use?

**A**: For file names, use the ISO 8601 basic format without separators: `YYYYMMDDTHHMMSSZ` (e.g., `20251214T112147Z`). Timestamp fields inside the files use the extended format with separators, as in the examples in this document (e.g., `2025-12-14T11:21:47Z`).

---

## Related Documentation

- **Agent Rules**: See `AGENTS.md` Rule 27
- **Agent Rule File**: `.opencode/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md`
- **Person Reference Pattern**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md`
- **LinkedIn Extraction**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md`
- **Data Fabrication**: `.opencode/DATA_FABRICATION_PROHIBITION.md` (Rule 21)
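
---

## Appendix: Helper Sketch

Two conventions from this document lend themselves to a small illustration: the person entity file name pattern `{linkedin_slug}_{ISO_timestamp}.json`, and the rule that profile fields such as `web_claims` never belong in a custodian staff entry. The following is a minimal Python sketch of both, not part of the project's scripts. The function names (`linkedin_slug`, `filename_timestamp`, `entity_filename`, `misplaced_keys`) are hypothetical, and the sketch assumes profile URLs have the form `https://www.linkedin.com/in/<slug>` as in the examples.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Keys listed under "What NOT to Include": they belong in the entity file only.
ENTITY_ONLY_KEYS = {
    "web_claims", "profile_data", "experience", "education", "skills", "about",
}


def linkedin_slug(linkedin_url: str) -> str:
    """Return the profile slug, assuming a URL of the form .../in/<slug>."""
    parts = [p for p in urlparse(linkedin_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {linkedin_url!r}")
    return parts[1]


def filename_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYYMMDDTHHMMSSZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def entity_filename(linkedin_url: str, moment: datetime) -> str:
    """Build a person entity file name: {linkedin_slug}_{ISO_timestamp}.json."""
    return f"{linkedin_slug(linkedin_url)}_{filename_timestamp(moment)}.json"


def misplaced_keys(staff_entry: dict) -> list[str]:
    """List keys in a custodian staff entry that belong in the entity file."""
    return sorted(ENTITY_ONLY_KEYS.intersection(staff_entry))


if __name__ == "__main__":
    observed = datetime(2025, 12, 11, 0, 0, 0, tzinfo=timezone.utc)
    # Prints: bibianvanreeken_20251211T000000Z.json
    print(entity_filename("https://www.linkedin.com/in/bibianvanreeken", observed))

    # A staff entry in the "Before" shape from the Migration Guide.
    entry = {
        "person_id": "example_staff_0001_john_doe",
        "person_name": "John Doe",
        "web_claims": [{"claim_type": "full_name", "claim_value": "John Doe"}],
    }
    # Prints: ['web_claims']
    print(misplaced_keys(entry))
```

A check like `misplaced_keys` could be run over every entry in `person_observations.staff` to find custodian files that still need the migration described in the Migration Guide.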