docs: add Rules 13-16 for custodian type annotations, Exa LinkedIn, connection data, photo CDN
parent c4b0f17a43
commit 162ca3ad79
1 changed file with 285 additions and 0 deletions
AGENTS.md

@@ -839,6 +839,291 @@ collection_management_specialist:

---

### Rule 13: Custodian Type Annotations on LinkML Schema Elements

**All LinkML schema elements (classes, slots, enums) MUST be annotated with their applicable GLAMORCUBESFIXPHDNT custodian type codes using the `annotations` block.**

This convention enables:
- **Visual categorization** in UML diagrams (cube face highlighting)
- **Semantic filtering** by heritage institution type
- **Validation** of slot/enum applicability to custodian types

**Annotation Keys**:

| Key | Type | Required | Description |
|-----|------|----------|-------------|
| `custodian_types` | list[string] | YES | List of applicable type codes (e.g., `[A, L]`) |
| `custodian_types_rationale` | string | NO | Explanation of why these types apply |
| `custodian_types_primary` | string | NO | Primary type if multiple apply |

**Type Codes** (single letters from GLAMORCUBESFIXPHDNT):

| Code | Type | Code | Type |
|------|------|------|------|
| G | Gallery | F | Feature custodian |
| L | Library | I | Intangible heritage |
| A | Archive | X | Mixed types |
| M | Museum | P | Personal collection |
| O | Official institution | H | Holy/sacred site |
| R | Research center | D | Digital platform |
| C | Corporation | N | NGO |
| U | Unknown | T | Taste/smell heritage |
| B | Botanical/Zoo | | |
| E | Education provider | | |
| S | Collecting society | | |

**Example - Class Annotation**:
```yaml
classes:
  ArchivalFonds:
    class_uri: rico:RecordSet
    description: An archival fonds representing a collection of records
    annotations:
      custodian_types: [A]
      custodian_types_rationale: "Archival fonds are specific to archive institutions"
```

**Example - Slot Annotation**:
```yaml
slots:
  arrangement_system:
    description: The archival arrangement system used
    range: ArrangementSystemEnum
    annotations:
      custodian_types: [A]
      custodian_types_rationale: "Arrangement systems are an archival concept"
```

**Example - Enum Value Annotation**:
```yaml
enums:
  CollectionTypeEnum:
    permissible_values:
      ARCHIVAL_FONDS:
        description: A complete archival fonds
        annotations:
          custodian_types: [A]
```

**Universal Types**: Use `["*"]` for elements that apply to all custodian types:
```yaml
slots:
  preferred_label:
    annotations:
      custodian_types: ["*"]
```

**Validation Rules**:
1. Values MUST be valid single-letter GLAMORCUBESFIXPHDNT codes
2. Child class types MUST be a subset of (or equal to) the parent class types
3. Slot annotations SHOULD align with the annotations of the classes that use them

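The validation rules above can be sketched in Python. This is a minimal illustration, not the project's validator: the names `validate_custodian_types` and `child_types_allowed` are hypothetical, and treating `"*"` as a superset of every code in the parent/child check is an assumption that the rules do not spell out.

```python
# All 19 single-letter codes, taken from the GLAMORCUBESFIXPHDNT mnemonic.
VALID_CODES = frozenset("GLAMORCUBESFIXPHDNT")


def validate_custodian_types(codes):
    """Return a list of error messages for one `custodian_types` value."""
    if not codes:
        return ["custodian_types is required and must not be empty"]
    if list(codes) == ["*"]:
        # Universal marker: applies to all custodian types.
        return []
    errors = []
    for code in codes:
        if not isinstance(code, str) or code not in VALID_CODES:
            errors.append(f"invalid custodian type code: {code!r}")
    return errors


def child_types_allowed(child_codes, parent_codes):
    """Rule 2: child types must be a subset of (or equal to) parent types.

    Assumption: a parent annotated with "*" accepts any child, while a
    child annotated with "*" is only allowed under a "*" parent.
    """
    if "*" in parent_codes:
        return True
    if "*" in child_codes:
        return False
    return set(child_codes) <= set(parent_codes)
```

For example, `validate_custodian_types(["A", "L"])` returns an empty list, while a child annotated `[A, M]` under a parent annotated `[A]` fails the subset check.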

**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation

---

### Rule 14: Exa MCP LinkedIn Profile Extraction

**When extracting LinkedIn profile data for heritage custodian staff, use the `exa_crawling_exa` tool with the direct profile URL for comprehensive extraction.**

**Tool Selection**:

| Scenario | Tool | Parameters |
|----------|------|------------|
| Profile URL known | `exa_crawling_exa` | `url`, `maxCharacters: 10000` |
| Profile URL unknown | `exa_linkedin_search_exa` | `query`, `searchType: "profiles"` |
| Fallback search | `exa_web_search_exa` | `query: "site:linkedin.com/in/ {name}"` |

**Preferred Workflow** (when LinkedIn URL is available):

```
1. Use exa_crawling_exa with direct URL
   ↓
2. Extract comprehensive profile data
   ↓
3. Save to data/custodian/person/{linkedin-slug}_{timestamp}.json
   ↓
4. Update custodian file with person_profile_path reference
```

**Example - Direct URL Extraction**:
```
Tool: exa_crawling_exa
Parameters:
  url: "https://www.linkedin.com/in/alexandr-belov-bb547b46"
  maxCharacters: 10000
```

**Why `exa_crawling_exa` over `exa_linkedin_search_exa`**:
- Returns **complete** career history with dates, durations, company metadata
- Includes **all** education entries with institutions and degrees
- Captures **full** about section text
- Returns **profile image URL**
- More reliable than search (which may return irrelevant profiles)

**Output Location**: `data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json`

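As a rough sketch, the output path can be derived from the profile URL and the extraction time. The helper names `linkedin_slug` and `profile_output_path` are hypothetical, and the compact UTC timestamp form (`YYYYMMDDTHHMMSSZ`) is inferred from the example file names elsewhere in this document rather than stated by the rule.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def linkedin_slug(profile_url):
    """Extract the slug from a URL of the form https://www.linkedin.com/in/{slug}."""
    parts = [p for p in urlparse(profile_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "in":
        raise ValueError(f"not a LinkedIn profile URL: {profile_url}")
    return parts[1]


def profile_output_path(profile_url, when):
    """Build data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json.

    `when` must be a timezone-aware datetime; it is converted to UTC.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    name = f"{linkedin_slug(profile_url)}_{stamp}.json"
    return str(PurePosixPath("data/custodian/person") / name)


path = profile_output_path(
    "https://www.linkedin.com/in/alexandr-belov-bb547b46",
    datetime(2025, 12, 10, 12, 0, 0, tzinfo=timezone.utc),
)
print(path)
```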

**Required JSON Fields**:
- `exa_search_metadata` - Tool, timestamp, request ID, cost
- `linkedin_profile_url` - Source URL
- `profile_data.career_history[]` - Complete work history
- `profile_data.education[]` - All education entries
- `profile_data.heritage_relevant_experience[]` - Tagged heritage roles

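A saved profile file could be checked against the required fields listed above before it is committed. This is only a sketch: the function name `missing_required_fields` is hypothetical, and it checks presence of keys, not the shape of their contents.

```python
# Dotted paths for the required fields of a person profile file.
REQUIRED_FIELDS = (
    "exa_search_metadata",
    "linkedin_profile_url",
    "profile_data.career_history",
    "profile_data.education",
    "profile_data.heritage_relevant_experience",
)


def missing_required_fields(record):
    """Return the dotted paths from REQUIRED_FIELDS that are absent in `record`."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = record
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

An empty return value means every required key is present; empty lists such as `"education": []` still count as present.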

**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation

---

### Rule 15: Connection Data Registration - Full Network Preservation

**🚨 CRITICAL: When connection data is manually recorded for a person, ALL connections MUST be fully registered in a dedicated connections file in `data/custodian/person/`.**

This rule ensures complete preservation of professional network data, enabling heritage sector network analysis and cross-custodian relationship discovery.

**Connection File Naming Convention**:

```
{linkedin-slug}_connections_{ISO-timestamp}.json
```

**Examples**:
```
alexandr-belov-bb547b46_connections_20251210T160000Z.json
giovanna-fossati_connections_20251211T140000Z.json
```

**Required Connection File Structure**:

```json
{
  "source_metadata": {
    "source_url": "https://www.linkedin.com/search/results/people/?...",
    "scraped_timestamp": "2025-12-10T16:00:00Z",
    "scrape_method": "manual_linkedin_browse",
    "target_profile": "{linkedin-slug}",
    "target_name": "Full Name",
    "connections_extracted": 107
  },
  "connections": [
    {
      "name": "Connection Name",
      "degree": "1st" | "2nd" | "3rd+",
      "headline": "Current role or description",
      "location": "City, Region, Country",
      "organization": "Primary organization",
      "heritage_relevant": true | false,
      "heritage_type": "A" | "L" | "M" | "R" | "E" | "D" | "G" | etc.
    }
  ],
  "network_analysis": {
    "total_connections_extracted": 107,
    "heritage_relevant_count": 56,
    "heritage_relevant_percentage": 52.3,
    "connections_by_heritage_type": {...}
  }
}
```

**Minimum Required Fields per Connection**:

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Full name of connection |
| `degree` | string | Connection degree: `1st`, `2nd`, `3rd+` |
| `headline` | string | Current role/description |
| `heritage_relevant` | boolean | Is this person in heritage sector? |

**Heritage Type Codes**: Use single-letter GLAMORCUBESFIXPHDNT codes (G, L, A, M, O, R, C, U, B, E, S, F, I, X, P, H, D, N, T).

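The `network_analysis` summary in the connection file can be computed from the `connections` list instead of being entered by hand. The sketch below uses the key names from the structure shown in this rule; the function name `summarize_connections` is hypothetical, and counting only heritage-relevant connections in `connections_by_heritage_type` is an assumption.

```python
from collections import Counter


def summarize_connections(connections):
    """Compute the network_analysis block from a list of connection dicts."""
    total = len(connections)
    relevant = [c for c in connections if c.get("heritage_relevant")]
    # Assumption: the per-type breakdown covers heritage-relevant entries only.
    by_type = Counter(c["heritage_type"] for c in relevant if c.get("heritage_type"))
    percentage = round(100 * len(relevant) / total, 1) if total else 0.0
    return {
        "total_connections_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage": percentage,
        "connections_by_heritage_type": dict(by_type),
    }
```

With 107 connections of which 56 are heritage-relevant, this yields the 52.3 percent figure used in the structure example.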

**Referencing from Custodian Files**:

```yaml
collection_management_specialist:
  - name: Alexandr Belov
    role: Collection/Information Specialist
    linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
    current: true
    person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
    person_connections_path: data/custodian/person/alexandr-belov-bb547b46_connections_20251210T160000Z.json
```

**Why This Rule Matters**:
1. **Complete Data Preservation**: Connections are expensive to extract (manual scraping, rate limits)
2. **Heritage Sector Mapping**: Understanding who knows whom in the heritage community
3. **Cross-Custodian Discovery**: Find staff who work at multiple institutions
4. **Network Analysis**: Identify key influencers and knowledge hubs

**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation including network analysis sections

---

### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages

**🚨 CRITICAL: When storing LinkedIn profile photo URLs, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**

The overlay page URL (`/overlay/photo/`) is trivially derivable from any LinkedIn profile URL and provides zero informational value. The actual CDN URL requires extraction effort and is the only URL that directly serves the image.

**URL Patterns**:

| Type | Pattern | Value |
|------|---------|-------|
| ❌ **WRONG** (overlay page) | `https://www.linkedin.com/in/{slug}/overlay/photo/` | Derivable, useless |
| ✅ **CORRECT** (CDN image) | `https://media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...` | Actual image |

**CDN URL Structure**:
```
https://media.licdn.com/dms/image/v2/{image_id}/profile-displayphoto-shrink_{size}/{encoded_params}
```

Where `{size}` can be: `100_100`, `200_200`, `400_400`, `800_800`

**Person Profile JSON Schema**:

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/...",
  "photo_urls": null
}
```

**When CDN URL is Not Available** (profile viewed without login, or no photo):

```json
{
  "linkedin_profile_url": "https://www.linkedin.com/in/example-person",
  "linkedin_photo_url": null,
  "photo_urls": {
    "primary": "https://example.org/staff/person-headshot.jpg",
    "sources": [
      {
        "url": "https://example.org/staff/person-headshot.jpg",
        "source": "Institutional website",
        "retrieved_date": "2025-12-09"
      }
    ]
  }
}
```

**Extraction Methods** (in order of preference):

1. **Browser DevTools** - Inspect `<img>` element in profile photo overlay, copy `src` attribute
2. **Exa Crawling** - Use `exa_crawling_exa` tool with profile URL, extract from returned content
3. **Alternative Sources** - If LinkedIn CDN unavailable, use institutional websites, conference photos, etc.

**Rationale**:
- Overlay URL = `profile_url + "/overlay/photo/"` → No information gain
- CDN URL contains unique image identifiers → Actual data
- CDN URLs can be used to display images directly
- Multiple size variants available by changing size parameter

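The distinction between the two URL kinds is easy to check mechanically before a value is written to `linkedin_photo_url`. The sketch below is illustrative only: the helper names are hypothetical, and the check looks at the host and path shape described in this rule, not at whether the image actually resolves.

```python
from urllib.parse import urlparse


def overlay_url(profile_url):
    """Derive the overlay page URL, showing why it carries no new information."""
    return profile_url.rstrip("/") + "/overlay/photo/"


def is_overlay_url(url):
    """True for https://www.linkedin.com/in/{slug}/overlay/photo/ style URLs."""
    return urlparse(url).path.rstrip("/").endswith("/overlay/photo")


def is_cdn_photo_url(url):
    """True when the URL is served over HTTPS from media.licdn.com."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname == "media.licdn.com"
```

A writer for person profile files could reject any `linkedin_photo_url` for which `is_cdn_photo_url` is false and fall back to `photo_urls` instead.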

**See**:
- `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for agent rule reference
- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` for complete extraction documentation

---

## Project Overview

**Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.