# Person-Custodian Data Architecture

## Rule: Single Source of Truth for Person Data

**🚨 CRITICAL: Person entity files are the SINGLE SOURCE OF TRUTH for all person data. Custodian files store only references and affiliation provenance.**

This architecture ensures:
- No data duplication across custodian files
- Clean separation between person data and affiliation data
- Cross-custodian career tracking capability
- Single place to update person information

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                          DATA ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│      PERSON ENTITY FILES                 CUSTODIAN YAML FILES       │
│   (Single Source of Truth)                (References Only)         │
│                                                                     │
│  ┌─────────────────────────┐           ┌──────────────────────┐     │
│  │ person/entity/          │           │ NL-ZH-DHA-A-NA.yaml  │     │
│  │ {slug}_{timestamp}.json │◄──────────│                      │     │
│  │                         │    ref    │ person_observations: │     │
│  │ • profile_data          │           │   staff:             │     │
│  │ • web_claims ◄──────────┼───────────│   - person_id: ...   │     │
│  │ • affiliations ─────────┼──────────►│     affiliation_     │     │
│  │                         │   sync    │     provenance: ...  │     │
│  └─────────────────────────┘           └──────────────────────┘     │
│               │                                   │                 │
│               │ same file                         │ different files │
│               ▼                                   ▼                 │
│  ┌─────────────────────────┐           ┌──────────────────────┐     │
│  │ Multiple custodians     │           │ NL-NH-HAA-A-NHA.yaml │     │
│  │ reference SAME person   │◄──────────│                      │     │
│  │ entity file             │           │ (same person, diff   │     │
│  │                         │           │  custodian)          │     │
│  └─────────────────────────┘           └──────────────────────┘     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Directory Structure

```
data/custodian/
├── person/
│   ├── entity/                   # SINGLE SOURCE OF TRUTH for person data
│   │   ├── bibianvanreeken_20251211T000000Z.json
│   │   ├── giovanna-fossati_20251209T170000Z.json
│   │   └── ...
│   ├── affiliated/               # Custodian staff lists (parsed)
│   │   └── parsed/
│   │       └── nationaal-archief_staff_20251214T112147Z.json
│   └── connection/               # Professional network data
│       └── manual/
│           └── {slug}_connections_{timestamp}.json
│
├── NL-ZH-DHA-A-NA.yaml           # Custodian file - references persons
├── NL-NH-HAA-A-NHA.yaml          # Another custodian - may reference same persons
└── ...
```

---

## What Goes Where

### Person Entity Files (`data/custodian/person/entity/`)

**Store ALL person-specific data**:

```json
{
  "extraction_metadata": {
    "source_file": "data/custodian/person/affiliated/parsed/nationaal-archief_staff_20251214T112147Z.json",
    "staff_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "extraction_date": "2025-12-14T11:21:47Z",
    "extraction_method": "exa_contents",
    "extraction_agent": "claude-opus-4.5",
    "linkedin_url": "https://www.linkedin.com/in/bibianvanreeken",
    "cost_usd": 0
  },
  "profile_data": {
    "name": "Bibian van Reeken",
    "headline": "Projectmanager Digitalisering bij het Nationaal Archief",
    "location": "The Hague, South Holland, Netherlands",
    "about": "Professional summary...",
    "experience": [...],
    "education": [...],
    "skills": [...]
  },
  "web_claims": [
    {
      "claim_type": "full_name",
      "claim_value": "Bibian van Reeken",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    },
    {
      "claim_type": "role_title",
      "claim_value": "Projectmanager Digitalisering bij het Nationaal Archief",
      "source_url": "https://www.linkedin.com/in/bibianvanreeken",
      "retrieved_on": "2025-12-14T11:21:47Z",
      "retrieval_agent": "linkedin_html_parser"
    }
  ],
  "affiliations": [
    {
      "custodian_name": "Nationaal Archief",
      "custodian_slug": "nationaal-archief",
      "role_title": "Projectmanager Digitalisering bij het Nationaal Archief",
      "heritage_relevant": true,
      "heritage_type": "A",
      "current": true,
      "observed_on": "2025-12-14T11:21:47Z",
      "source_url": "https://www.linkedin.com/company/nationaal-archief/people/"
    }
  ]
}
```

### Custodian YAML Files (`data/custodian/NL-*.yaml`)

**Store ONLY references and affiliation provenance**:

```yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      # AFFILIATION PROVENANCE ONLY - when/how was this association observed?
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      # References to entity file (contains full profile + web claims)
      linkedin_profile_url: https://www.linkedin.com/in/bibianvanreeken
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```

---

## Data Flow

```
LinkedIn Company Page (staff list)
         │
         ▼
┌───────────────────────────────────────────┐
│ 1. Parse HTML → Staff list JSON           │
│    (parse_linkedin_html.py)               │
│    Output: affiliated/parsed/{slug}.json  │
└───────────────────────────────────────────┘
         │
         ▼
┌───────────────────────────────────────────┐
│ 2. Extract full profiles → Entity files   │
│    (Exa crawling for each person)         │
│    Output: entity/{slug}_{timestamp}.json │
└───────────────────────────────────────────┘
         │
         ▼
┌───────────────────────────────────────────┐
│ 3. Link to custodian YAML                 │
│    (link_person_observations.py)          │
│    - Adds affiliation_provenance          │
│    - Sets linkedin_profile_path           │
│    - Updates entity affiliations array    │
└───────────────────────────────────────────┘
```

---

## Correct vs Incorrect Patterns

### ❌ WRONG - Web Claims in Custodian File

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      # WRONG! Web claims don't belong in custodian file
      web_claims:
        - claim_type: full_name
          claim_value: Bibian van Reeken
          source_url: https://www.linkedin.com/in/bibianvanreeken
          retrieved_on: '2025-12-14T11:21:47Z'
```

### ✅ CORRECT - Affiliation Provenance Only

```yaml
# data/custodian/NL-ZH-DHA-A-NA.yaml
person_observations:
  staff:
    - person_id: nationaal-archief_staff_0001_bibian_van_reeken
      person_name: Bibian van Reeken
      role_title: Projectmanager Digitalisering bij het Nationaal Archief
      heritage_relevant: true
      heritage_type: A
      current: true
      # CORRECT! Only affiliation provenance in custodian file
      affiliation_provenance:
        source_url: https://www.linkedin.com/company/nationaal-archief/people/
        retrieved_on: '2025-12-14T11:21:47Z'
        retrieval_agent: linkedin_html_parser
      linkedin_profile_path: data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json
```

---

## Benefits of This Architecture

### 1. No Data Duplication

Same person at multiple institutions → ONE entity file, multiple references

### 2. Single Source of Truth

Update person's profile once → All custodian references automatically up-to-date

### 3. Clean Separation of Concerns

- **Entity file**: Who is this person? (profile, claims)
- **Custodian file**: How are they associated? (affiliation provenance)

### 4. Cross-Custodian Career Tracking

Query all affiliations for a person from their entity file:

```json
{
  "affiliations": [
    {"custodian_name": "Nationaal Archief", "current": true},
    {"custodian_name": "Noord-Hollands Archief", "current": false}
  ]
}
```

### 5. Network Analysis Ready

Entity files contain both profile data and affiliations → Easy to build relationship graphs

---

## Related Rules

- **Rule 5**: Never delete enriched data - additive only
- **Rule 12**: Person data reference pattern (file paths, not inline)
- **Rule 20**: Person entity profiles stored individually
- **Rule 22**: Custodian YAML is single source of truth for custodian data
- **Rule 26**: Person data provenance - web claims required

---

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/parse_linkedin_html.py` | Parse LinkedIn company staff pages |
| `scripts/link_person_observations.py` | Link entity files to custodian YAML |
| `scripts/fetch_linkedin_profiles_exa.py` | Extract full profiles via Exa |

---

## See Also

- `docs/PERSON_CUSTODIAN_DATA_ARCHITECTURE.md` - Detailed documentation
- `.opencode/PERSON_DATA_REFERENCE_PATTERN.md` - Reference pattern details
- `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` - Exa extraction rules
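
---

## Example: Checking a Custodian File Against This Rule

The WRONG and CORRECT patterns above lend themselves to a mechanical check. The following is a minimal sketch, not one of the project's scripts: the function names `check_staff_entry` and `check_custodian`, the set of forbidden keys, and the set of required keys are all assumptions inferred from the examples in this document. It works on already-parsed data (plain dicts), so it needs only the standard library.

```python
from typing import Any

# Assumption: these keys mark person data that must stay in entity files.
PERSON_ONLY_KEYS = {"profile_data", "web_claims"}
# Assumption: the minimum a staff reference needs, taken from the CORRECT
# example; the real linking script may require more or less.
REQUIRED_KEYS = {"person_id", "affiliation_provenance", "linkedin_profile_path"}


def check_staff_entry(entry: dict[str, Any]) -> list[str]:
    """Return rule violations for one `person_observations.staff` entry."""
    problems = [
        f"'{key}' belongs in the person entity file"
        for key in sorted(PERSON_ONLY_KEYS.intersection(entry))
    ]
    problems += [
        f"missing '{key}'" for key in sorted(REQUIRED_KEYS.difference(entry))
    ]
    return problems


def check_custodian(custodian: dict[str, Any]) -> dict[str, list[str]]:
    """Map person_id to its violations for every offending staff entry."""
    staff = custodian.get("person_observations", {}).get("staff", [])
    report: dict[str, list[str]] = {}
    for entry in staff:
        problems = check_staff_entry(entry)
        if problems:
            report[entry.get("person_id", "<no person_id>")] = problems
    return report


# Abbreviated copies of the WRONG and CORRECT examples from this document.
wrong = {
    "person_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "person_name": "Bibian van Reeken",
    "web_claims": [{"claim_type": "full_name", "claim_value": "Bibian van Reeken"}],
}
correct = {
    "person_id": "nationaal-archief_staff_0001_bibian_van_reeken",
    "person_name": "Bibian van Reeken",
    "affiliation_provenance": {
        "source_url": "https://www.linkedin.com/company/nationaal-archief/people/",
        "retrieved_on": "2025-12-14T11:21:47Z",
        "retrieval_agent": "linkedin_html_parser",
    },
    "linkedin_profile_path": (
        "data/custodian/person/entity/bibianvanreeken_20251211T000000Z.json"
    ),
}

report = check_custodian({"person_observations": {"staff": [wrong, correct]}})
for person_id, problems in report.items():
    print(person_id, problems)
```

To run this against a real custodian file, the YAML would first have to be parsed into a dict, for example with a YAML library's safe loader; that step is left out here to keep the sketch dependency-free.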