docs: add Rules 13-16 for custodian type annotations, Exa LinkedIn, connection data, photo CDN

This commit is contained in:
kempersc 2025-12-10 09:04:14 +01:00
parent c4b0f17a43
commit 162ca3ad79

AGENTS.md

@@ -839,6 +839,291 @@ collection_management_specialist:
---
### Rule 13: Custodian Type Annotations on LinkML Schema Elements
**All LinkML schema elements (classes, slots, enums) MUST be annotated with their applicable GLAMORCUBESFIXPHDNT custodian type codes using the `annotations` block.**
This convention enables:
- **Visual categorization** in UML diagrams (cube face highlighting)
- **Semantic filtering** by heritage institution type
- **Validation** of slot/enum applicability to custodian types
**Annotation Keys**:
| Key | Type | Required | Description |
|-----|------|----------|-------------|
| `custodian_types` | list[string] | YES | List of applicable type codes (e.g., `[A, L]`) |
| `custodian_types_rationale` | string | NO | Explanation of why these types apply |
| `custodian_types_primary` | string | NO | Primary type if multiple apply |
**Type Codes** (single letters from GLAMORCUBESFIXPHDNT):
| Code | Type | Code | Type |
|------|------|------|------|
| G | Gallery | F | Feature custodian |
| L | Library | I | Intangible heritage |
| A | Archive | X | Mixed types |
| M | Museum | P | Personal collection |
| O | Official institution | H | Holy/sacred site |
| R | Research center | D | Digital platform |
| C | Corporation | N | NGO |
| U | Unknown | T | Taste/smell heritage |
| B | Botanical/Zoo | | |
| E | Education provider | | |
| S | Collecting society | | |
**Example - Class Annotation**:
```yaml
classes:
ArchivalFonds:
class_uri: rico:RecordSet
description: An archival fonds representing a collection of records
annotations:
custodian_types: [A]
custodian_types_rationale: "Archival fonds are specific to archive institutions"
```
**Example - Slot Annotation**:
```yaml
slots:
arrangement_system:
description: The archival arrangement system used
range: ArrangementSystemEnum
annotations:
custodian_types: [A]
custodian_types_rationale: "Arrangement systems are an archival concept"
```
**Example - Enum Value Annotation**:
```yaml
enums:
CollectionTypeEnum:
permissible_values:
ARCHIVAL_FONDS:
description: A complete archival fonds
annotations:
custodian_types: [A]
```
**Universal Types**: Use `["*"]` for elements that apply to all custodian types:
```yaml
slots:
preferred_label:
annotations:
custodian_types: ["*"]
```
**Validation Rules**:
1. Values MUST be valid single-letter GLAMORCUBESFIXPHDNT codes
2. A child class's types MUST be a subset of (or equal to) its parent class's types
3. Slot annotations should align with the annotations of the classes that use the slot
**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation
---
### Rule 14: Exa MCP LinkedIn Profile Extraction
**When extracting LinkedIn profile data for heritage custodian staff, use the `exa_crawling_exa` tool with the direct profile URL for comprehensive extraction.**
**Tool Selection**:
| Scenario | Tool | Parameters |
|----------|------|------------|
| Profile URL known | `exa_crawling_exa` | `url`, `maxCharacters: 10000` |
| Profile URL unknown | `exa_linkedin_search_exa` | `query`, `searchType: "profiles"` |
| Fallback search | `exa_web_search_exa` | `query: "site:linkedin.com/in/ {name}"` |
**Preferred Workflow** (when LinkedIn URL is available):
```
1. Use exa_crawling_exa with direct URL
2. Extract comprehensive profile data
3. Save to data/custodian/person/{linkedin-slug}_{timestamp}.json
4. Update custodian file with person_profile_path reference
```
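Step 3 of the workflow can be sketched as follows. The extraction call itself (`exa_crawling_exa`) is made by the agent tooling, so the already-extracted profile is represented here as a plain `dict`; the function name is illustrative.

```python
# Sketch of step 3: save extracted profile data under the Rule 14
# naming convention {linkedin-slug}_{ISO-timestamp}.json.
import json
from datetime import datetime, timezone
from pathlib import Path

def save_person_profile(profile: dict, linkedin_slug: str,
                        base_dir: str = "data/custodian/person") -> Path:
    """Write the profile JSON and return the path for later referencing."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(base_dir) / f"{linkedin_slug}_{timestamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(profile, indent=2, ensure_ascii=False))
    return path
```

The returned path is what step 4 records in the custodian file as `person_profile_path`.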
**Example - Direct URL Extraction**:
```
Tool: exa_crawling_exa
Parameters:
url: "https://www.linkedin.com/in/alexandr-belov-bb547b46"
maxCharacters: 10000
```
**Why `exa_crawling_exa` over `exa_linkedin_search_exa`**:
- Returns **complete** career history with dates, durations, company metadata
- Includes **all** education entries with institutions and degrees
- Captures **full** about section text
- Returns **profile image URL**
- More reliable than search (which may return irrelevant profiles)
**Output Location**: `data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json`
**Required JSON Fields**:
- `exa_search_metadata` - Tool, timestamp, request ID, cost
- `linkedin_profile_url` - Source URL
- `profile_data.career_history[]` - Complete work history
- `profile_data.education[]` - All education entries
- `profile_data.heritage_relevant_experience[]` - Tagged heritage roles
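A minimal completeness check for these fields might look like the sketch below, assuming the bullet list above is the full required set (field names are taken verbatim from it).

```python
# Illustrative check that a person profile JSON carries the required
# top-level and nested fields listed above.
REQUIRED_PROFILE_FIELDS = {
    "exa_search_metadata",
    "linkedin_profile_url",
    "profile_data",
}
REQUIRED_PROFILE_DATA_FIELDS = {
    "career_history",
    "education",
    "heritage_relevant_experience",
}

def missing_profile_fields(doc: dict) -> set[str]:
    """Return the names of required fields absent from a profile JSON."""
    missing = REQUIRED_PROFILE_FIELDS - doc.keys()
    nested = doc.get("profile_data", {})
    missing |= {f"profile_data.{k}"
                for k in REQUIRED_PROFILE_DATA_FIELDS - nested.keys()}
    return missing
```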
**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation
---
### Rule 15: Connection Data Registration - Full Network Preservation
**🚨 CRITICAL: When connection data is manually recorded for a person, ALL connections MUST be fully registered in a dedicated connections file in `data/custodian/person/`.**
This rule ensures complete preservation of professional network data, enabling heritage sector network analysis and cross-custodian relationship discovery.
**Connection File Naming Convention**:
```
{linkedin-slug}_connections_{ISO-timestamp}.json
```
**Examples**:
```
alexandr-belov-bb547b46_connections_20251210T160000Z.json
giovanna-fossati_connections_20251211T140000Z.json
```
**Required Connection File Structure**:
```json
{
"source_metadata": {
"source_url": "https://www.linkedin.com/search/results/people/?...",
"scraped_timestamp": "2025-12-10T16:00:00Z",
"scrape_method": "manual_linkedin_browse",
"target_profile": "{linkedin-slug}",
"target_name": "Full Name",
"connections_extracted": 107
},
"connections": [
{
"name": "Connection Name",
"degree": "1st" | "2nd" | "3rd+",
"headline": "Current role or description",
"location": "City, Region, Country",
"organization": "Primary organization",
"heritage_relevant": true | false,
"heritage_type": "A" | "L" | "M" | "R" | "E" | "D" | "G" | etc.
}
],
"network_analysis": {
"total_connections_extracted": 107,
"heritage_relevant_count": 56,
"heritage_relevant_percentage": 52.3,
"connections_by_heritage_type": {...}
}
}
```
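The `network_analysis` block above can be derived mechanically from the `connections` array. This is a sketch of that aggregation; rounding the percentage to one decimal place matches the example value (52.3) but is an assumption.

```python
# Derive the network_analysis section from a list of connection records.
from collections import Counter

def build_network_analysis(connections: list[dict]) -> dict:
    """Aggregate connection records into the network_analysis structure."""
    total = len(connections)
    relevant = [c for c in connections if c.get("heritage_relevant")]
    by_type = Counter(c["heritage_type"]
                      for c in relevant if c.get("heritage_type"))
    return {
        "total_connections_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage":
            round(100 * len(relevant) / total, 1) if total else 0.0,
        "connections_by_heritage_type": dict(by_type),
    }
```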
**Minimum Required Fields per Connection**:
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Full name of connection |
| `degree` | string | Connection degree: `1st`, `2nd`, `3rd+` |
| `headline` | string | Current role/description |
| `heritage_relevant` | boolean | Is this person in heritage sector? |
**Heritage Type Codes**: Use single-letter GLAMORCUBESFIXPHDNT codes (G, L, A, M, O, R, C, U, B, E, S, F, I, X, P, H, D, N, T).
**Referencing from Custodian Files**:
```yaml
collection_management_specialist:
- name: Alexandr Belov
role: Collection/Information Specialist
linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
current: true
person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
person_connections_path: data/custodian/person/alexandr-belov-bb547b46_connections_20251210T160000Z.json
```
**Why This Rule Matters**:
1. **Complete Data Preservation**: Connections are expensive to extract (manual scraping, rate limits)
2. **Heritage Sector Mapping**: Understanding who knows whom in the heritage community
3. **Cross-Custodian Discovery**: Find staff who work at multiple institutions
4. **Network Analysis**: Identify key influencers and knowledge hubs
**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation including network analysis sections
---
### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages
**🚨 CRITICAL: When storing LinkedIn profile photo URLs, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**
The overlay page URL (`/overlay/photo/`) is trivially derivable from any LinkedIn profile URL and provides zero informational value. The actual CDN URL requires extraction effort and is the only URL that directly serves the image.
**URL Patterns**:
| Type | Pattern | Value |
|------|---------|-------|
| ❌ **WRONG** (overlay page) | `https://www.linkedin.com/in/{slug}/overlay/photo/` | Derivable, useless |
| ✅ **CORRECT** (CDN image) | `https://media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...` | Actual image |
**CDN URL Structure**:
```
https://media.licdn.com/dms/image/v2/{image_id}/profile-displayphoto-shrink_{size}/{encoded_params}
```
Where `{size}` can be: `100_100`, `200_200`, `400_400`, `800_800`
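Because the size variants differ only in the `profile-displayphoto-shrink_{size}` segment, a stored CDN URL can be rewritten to another size. This helper is a sketch that assumes the URL always contains that segment in the pattern shown above.

```python
# Rewrite a media.licdn.com photo URL to another square size variant
# (100_100, 200_200, 400_400, or 800_800).
import re

def cdn_url_at_size(cdn_url: str, size: int) -> str:
    """Swap the profile-displayphoto-shrink size segment in a CDN URL."""
    return re.sub(r"profile-displayphoto-shrink_\d+_\d+",
                  f"profile-displayphoto-shrink_{size}_{size}", cdn_url)
```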
**Person Profile JSON Schema**:
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
"linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/...",
"photo_urls": null
}
```
**When CDN URL is Not Available** (profile viewed without login, or no photo):
```json
{
"linkedin_profile_url": "https://www.linkedin.com/in/example-person",
"linkedin_photo_url": null,
"photo_urls": {
"primary": "https://example.org/staff/person-headshot.jpg",
"sources": [
{
"url": "https://example.org/staff/person-headshot.jpg",
"source": "Institutional website",
"retrieved_date": "2025-12-09"
}
]
}
}
```
**Extraction Methods** (in order of preference):
1. **Browser DevTools** - Inspect `<img>` element in profile photo overlay, copy `src` attribute
2. **Exa Crawling** - Use `exa_crawling_exa` tool with profile URL, extract from returned content
3. **Alternative Sources** - If LinkedIn CDN unavailable, use institutional websites, conference photos, etc.
**Rationale**:
- Overlay URL = `profile_url + "/overlay/photo/"` → No information gain
- CDN URL contains unique image identifiers → Actual data
- CDN URLs can be used to display images directly
- Multiple size variants available by changing size parameter
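A simple guard based on this rationale could reject overlay pages at write time. The check below is illustrative and only encodes the two URL patterns from the table above.

```python
# Accept only actual CDN image URLs, never derivable overlay pages.
def is_cdn_photo_url(url: str) -> bool:
    """True for media.licdn.com image URLs; False for overlay page URLs."""
    return url.startswith("https://media.licdn.com/") \
        and "/overlay/photo/" not in url
```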
**See**:
- `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for agent rule reference
- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` for complete extraction documentation
---
## Project Overview
**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.