docs: add Rules 13-16 for custodian type annotations, Exa LinkedIn, connection data, photo CDN

This commit is contained in:
kempersc 2025-12-10 09:04:14 +01:00
parent c4b0f17a43
commit 162ca3ad79

AGENTS.md

@@ -839,6 +839,291 @@ collection_management_specialist:
---
### Rule 13: Custodian Type Annotations on LinkML Schema Elements
**All LinkML schema elements (classes, slots, enums) MUST be annotated with their applicable GLAMORCUBESFIXPHDNT custodian type codes using the `annotations` block.**
This convention enables:
- **Visual categorization** in UML diagrams (cube face highlighting)
- **Semantic filtering** by heritage institution type
- **Validation** of slot/enum applicability to custodian types
**Annotation Keys**:
| Key | Type | Required | Description |
|-----|------|----------|-------------|
| `custodian_types` | list[string] | YES | List of applicable type codes (e.g., `[A, L]`) |
| `custodian_types_rationale` | string | NO | Explanation of why these types apply |
| `custodian_types_primary` | string | NO | Primary type if multiple apply |
**Type Codes** (single letters from GLAMORCUBESFIXPHDNT):
| Code | Type | Code | Type |
|------|------|------|------|
| G | Gallery | F | Feature custodian |
| L | Library | I | Intangible heritage |
| A | Archive | X | Mixed types |
| M | Museum | P | Personal collection |
| O | Official institution | H | Holy/sacred site |
| R | Research center | D | Digital platform |
| C | Corporation | N | NGO |
| U | Unknown | T | Taste/smell heritage |
| B | Botanical/Zoo | | |
| E | Education provider | | |
| S | Collecting society | | |
**Example - Class Annotation**:
```yaml
classes:
ArchivalFonds:
class_uri: rico:RecordSet
description: An archival fonds representing a collection of records
annotations:
custodian_types: [A]
custodian_types_rationale: "Archival fonds are specific to archive institutions"
```
**Example - Slot Annotation**:
```yaml
slots:
arrangement_system:
description: The archival arrangement system used
range: ArrangementSystemEnum
annotations:
custodian_types: [A]
custodian_types_rationale: "Arrangement systems are an archival concept"
```
**Example - Enum Value Annotation**:
```yaml
enums:
CollectionTypeEnum:
permissible_values:
ARCHIVAL_FONDS:
description: A complete archival fonds
annotations:
custodian_types: [A]
```
**Universal Types**: Use `["*"]` for elements that apply to all custodian types:
```yaml
slots:
preferred_label:
annotations:
custodian_types: ["*"]
```
**Validation Rules**:
1. Values MUST be valid single-letter GLAMORCUBESFIXPHDNT codes
2. A child class's types MUST be a subset of (or equal to) its parent class's types
3. Slot annotations should align with the annotations of the classes that use the slot
**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation
---
### Rule 14: Exa MCP LinkedIn Profile Extraction
**When extracting LinkedIn profile data for heritage custodian staff, use the `exa_crawling_exa` tool with the direct profile URL for comprehensive extraction.**
**Tool Selection**:
| Scenario | Tool | Parameters |
|----------|------|------------|
| Profile URL known | `exa_crawling_exa` | `url`, `maxCharacters: 10000` |
| Profile URL unknown | `exa_linkedin_search_exa` | `query`, `searchType: "profiles"` |
| Fallback search | `exa_web_search_exa` | `query: "site:linkedin.com/in/ {name}"` |
**Preferred Workflow** (when LinkedIn URL is available):
```
1. Use exa_crawling_exa with direct URL
2. Extract comprehensive profile data
3. Save to data/custodian/person/{linkedin-slug}_{timestamp}.json
4. Update custodian file with person_profile_path reference
```
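Step 3 of the workflow can be sketched as follows. The extraction call itself (`exa_crawling_exa`) is made by the agent tooling, so the already-extracted profile is represented here as a plain `dict`; the function name is illustrative.

```python
# Sketch of step 3: save extracted profile data under the Rule 14
# naming convention {linkedin-slug}_{ISO-timestamp}.json.
import json
from datetime import datetime, timezone
from pathlib import Path

def save_person_profile(profile: dict, linkedin_slug: str,
                        base_dir: str = "data/custodian/person") -> Path:
    """Write the profile JSON and return the path for later referencing."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(base_dir) / f"{linkedin_slug}_{timestamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile, indent=2, ensure_ascii=False))
    return path
```

The returned path is what step 4 records in the custodian file as `person_profile_path`.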
**Example - Direct URL Extraction**:
```
Tool: exa_crawling_exa
Parameters:
url: "https://www.linkedin.com/in/alexandr-belov-bb547b46"
maxCharacters: 10000
```
**Why `exa_crawling_exa` over `exa_linkedin_search_exa`**:
- Returns **complete** career history with dates, durations, company metadata
- Includes **all** education entries with institutions and degrees
- Captures **full** about section text
- Returns **profile image URL**
- More reliable than search (which may return irrelevant profiles)
**Output Location**: `data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json`
**Required JSON Fields**:
- `exa_search_metadata` - Tool, timestamp, request ID, cost
- `linkedin_profile_url` - Source URL
- `profile_data.career_history[]` - Complete work history
- `profile_data.education[]` - All education entries
- `profile_data.heritage_relevant_experience[]` - Tagged heritage roles
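A minimal completeness check for these fields might look like the sketch below, assuming the bullet list above is the full required set (field names are taken verbatim from it).

```python
# Illustrative check that a person profile JSON carries the required
# top-level and nested fields listed above.
REQUIRED_PROFILE_FIELDS = {
    "exa_search_metadata",
    "linkedin_profile_url",
    "profile_data",
}
REQUIRED_PROFILE_DATA_FIELDS = {
    "career_history",
    "education",
    "heritage_relevant_experience",
}

def missing_profile_fields(doc: dict) -> set[str]:
    """Return the names of required fields absent from a profile JSON."""
    missing = REQUIRED_PROFILE_FIELDS - doc.keys()
    nested = doc.get("profile_data", {})
    missing |= {f"profile_data.{k}"
                for k in REQUIRED_PROFILE_DATA_FIELDS - nested.keys()}
    return missing
```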
**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation
---
### Rule 15: Connection Data Registration - Full Network Preservation
**🚨 CRITICAL: When connection data is manually recorded for a person, ALL connections MUST be fully registered in a dedicated connections file in `data/custodian/person/`.**
This rule ensures complete preservation of professional network data, enabling heritage sector network analysis and cross-custodian relationship discovery.
**Connection File Naming Convention**:
```
{linkedin-slug}_connections_{ISO-timestamp}.json
```
**Examples**:
```
alexandr-belov-bb547b46_connections_20251210T160000Z.json
giovanna-fossati_connections_20251211T140000Z.json
```
**Required Connection File Structure**:
```json
{
"source_metadata": {
"source_url": "https://www.linkedin.com/search/results/people/?...",
"scraped_timestamp": "2025-12-10T16:00:00Z",
"scrape_method": "manual_linkedin_browse",
"target_profile": "{linkedin-slug}",
"target_name": "Full Name",
"connections_extracted": 107
},
"connections": [
{
"name": "Connection Name",
"degree": "1st" | "2nd" | "3rd+",
"headline": "Current role or description",
"location": "City, Region, Country",
"organization": "Primary organization",
"heritage_relevant": true | false,
"heritage_type": "A" | "L" | "M" | "R" | "E" | "D" | "G" | etc.
}
],
"network_analysis": {
"total_connections_extracted": 107,
"heritage_relevant_count": 56,
"heritage_relevant_percentage": 52.3,
"connections_by_heritage_type": {...}
}
}
```
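The `network_analysis` block above can be derived mechanically from the `connections` array. This is a sketch of that aggregation; rounding the percentage to one decimal place matches the example value (52.3) but is an assumption.

```python
# Derive the network_analysis section from a list of connection records.
from collections import Counter

def build_network_analysis(connections: list[dict]) -> dict:
    """Aggregate connection records into the network_analysis structure."""
    total = len(connections)
    relevant = [c for c in connections if c.get("heritage_relevant")]
    by_type = Counter(c["heritage_type"]
                      for c in relevant if c.get("heritage_type"))
    return {
        "total_connections_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage":
            round(100 * len(relevant) / total, 1) if total else 0.0,
        "connections_by_heritage_type": dict(by_type),
    }
```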
**Minimum Required Fields per Connection**:
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Full name of connection |
| `degree` | string | Connection degree: `1st`, `2nd`, `3rd+` |
| `headline` | string | Current role/description |
| `heritage_relevant` | boolean | Is this person in heritage sector? |
**Heritage Type Codes**: Use single-letter GLAMORCUBESFIXPHDNT codes (G, L, A, M, O, R, C, U, B, E, S, F, I, X, P, H, D, N, T).
**Referencing from Custodian Files**:
```yaml
collection_management_specialist:
- name: Alexandr Belov
role: Collection/Information Specialist
linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
current: true
person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
person_connections_path: data/custodian/person/alexandr-belov-bb547b46_connections_20251210T160000Z.json
```
**Why This Rule Matters**:
1. **Complete Data Preservation**: Connections are expensive to extract (manual scraping, rate limits)
2. **Heritage Sector Mapping**: Understanding who knows whom in the heritage community
3. **Cross-Custodian Discovery**: Find staff who work at multiple institutions
4. **Network Analysis**: Identify key influencers and knowledge hubs
**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation including network analysis sections
---
### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages
**🚨 CRITICAL: When storing LinkedIn profile photo URLs, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**
The overlay page URL (`/overlay/photo/`) is trivially derivable from any LinkedIn profile URL and provides zero informational value. The actual CDN URL requires extraction effort and is the only URL that directly serves the image.
**URL Patterns**:
| Type | Pattern | Value |
|------|---------|-------|
| ❌ **WRONG** (overlay page) | `https://www.linkedin.com/in/{slug}/overlay/photo/` | Derivable, useless |
| ✅ **CORRECT** (CDN image) | `https://media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...` | Actual image |
**CDN URL Structure**:
```
https://media.licdn.com/dms/image/v2/{image_id}/profile-displayphoto-shrink_{size}/{encoded_params}
```
Where `{size}` can be: `100_100`, `200_200`, `400_400`, `800_800`
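Because the size variants differ only in the `profile-displayphoto-shrink_{size}` segment, a stored CDN URL can be rewritten to another size. This helper is a sketch that assumes the URL always contains that segment in the pattern shown above.

```python
# Rewrite a media.licdn.com photo URL to another square size variant
# (100_100, 200_200, 400_400, or 800_800).
import re

def cdn_url_at_size(cdn_url: str, size: int) -> str:
    """Swap the profile-displayphoto-shrink size segment in a CDN URL."""
    return re.sub(r"profile-displayphoto-shrink_\d+_\d+",
                  f"profile-displayphoto-shrink_{size}_{size}", cdn_url)
```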
**Person Profile JSON Schema**:
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/...",
"photo_urls": null
}
```
**When CDN URL is Not Available** (profile viewed without login, or no photo):
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/example-person",
"linkedin_photo_url": null,
"photo_urls": {
"primary": "https://example.org/staff/person-headshot.jpg",
"sources": [
{
"url": "https://example.org/staff/person-headshot.jpg",
"source": "Institutional website",
"retrieved_date": "2025-12-09"
}
]
}
}
```
**Extraction Methods** (in order of preference):
1. **Browser DevTools** - Inspect `<img>` element in profile photo overlay, copy `src` attribute
2. **Exa Crawling** - Use `exa_crawling_exa` tool with profile URL, extract from returned content
3. **Alternative Sources** - If LinkedIn CDN unavailable, use institutional websites, conference photos, etc.
**Rationale**:
- Overlay URL = `profile_url + "/overlay/photo/"` → No information gain
- CDN URL contains unique image identifiers → Actual data
- CDN URLs can be used to display images directly
- Multiple size variants available by changing size parameter
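A simple guard based on this rationale could reject overlay pages at write time. The check below is illustrative and only encodes the two URL patterns from the table above.

```python
# Accept only actual CDN image URLs, never derivable overlay pages.
def is_cdn_photo_url(url: str) -> bool:
    """True for media.licdn.com image URLs; False for overlay page URLs."""
    return url.startswith("https://media.licdn.com/") \
        and "/overlay/photo/" not in url
```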
**See**:
- `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for agent rule reference
- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` for complete extraction documentation
---
## Project Overview
**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.