diff --git a/AGENTS.md b/AGENTS.md
index 7617b49c69..718af46db1 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -839,6 +839,291 @@ collection_management_specialist:
 
 ---
 
+### Rule 13: Custodian Type Annotations on LinkML Schema Elements
+
+**All LinkML schema elements (classes, slots, enums) MUST be annotated with their applicable GLAMORCUBESFIXPHDNT custodian type codes using the `annotations` block.**
+
+This convention enables:
+- **Visual categorization** in UML diagrams (cube face highlighting)
+- **Semantic filtering** by heritage institution type
+- **Validation** of slot/enum applicability to custodian types
+
+**Annotation Keys**:
+
+| Key | Type | Required | Description |
+|-----|------|----------|-------------|
+| `custodian_types` | list[string] | YES | List of applicable type codes (e.g., `[A, L]`) |
+| `custodian_types_rationale` | string | NO | Explanation of why these types apply |
+| `custodian_types_primary` | string | NO | Primary type if multiple apply |
+
+**Type Codes** (single letters from GLAMORCUBESFIXPHDNT):
+
+| Code | Type | Code | Type |
+|------|------|------|------|
+| G | Gallery | F | Feature custodian |
+| L | Library | I | Intangible heritage |
+| A | Archive | X | Mixed types |
+| M | Museum | P | Personal collection |
+| O | Official institution | H | Holy/sacred site |
+| R | Research center | D | Digital platform |
+| C | Corporation | N | NGO |
+| U | Unknown | T | Taste/smell heritage |
+| B | Botanical/Zoo | | |
+| E | Education provider | | |
+| S | Collecting society | | |
+
+**Example - Class Annotation**:
+```yaml
+classes:
+  ArchivalFonds:
+    class_uri: rico:RecordSet
+    description: An archival fonds representing a collection of records
+    annotations:
+      custodian_types: [A]
+      custodian_types_rationale: "Archival fonds are specific to archive institutions"
+```
+
+**Example - Slot Annotation**:
+```yaml
+slots:
+  arrangement_system:
+    description: The archival arrangement system used
+    range: ArrangementSystemEnum
+    annotations:
+      custodian_types: [A]
+      custodian_types_rationale: "Arrangement systems are an archival concept"
+```
+
+**Example - Enum Value Annotation**:
+```yaml
+enums:
+  CollectionTypeEnum:
+    permissible_values:
+      ARCHIVAL_FONDS:
+        description: A complete archival fonds
+        annotations:
+          custodian_types: [A]
+```
+
+**Universal Types**: Use `["*"]` for elements that apply to all custodian types:
+```yaml
+slots:
+  preferred_label:
+    annotations:
+      custodian_types: ["*"]
+```
+
+**Validation Rules**:
+1. Values MUST be valid single-letter GLAMORCUBESFIXPHDNT codes
+2. Child class types MUST be subset of or equal to parent class types
+3. Slot annotations should align with using class annotations
+
+**See**: `.opencode/CUSTODIAN_TYPE_ANNOTATION_CONVENTION.md` for complete documentation
+
+---
+
+### Rule 14: Exa MCP LinkedIn Profile Extraction
+
+**When extracting LinkedIn profile data for heritage custodian staff, use the `exa_crawling_exa` tool with direct profile URL for comprehensive extraction.**
+
+**Tool Selection**:
+
+| Scenario | Tool | Parameters |
+|----------|------|------------|
+| Profile URL known | `exa_crawling_exa` | `url`, `maxCharacters: 10000` |
+| Profile URL unknown | `exa_linkedin_search_exa` | `query`, `searchType: "profiles"` |
+| Fallback search | `exa_web_search_exa` | `query: "site:linkedin.com/in/ {name}"` |
+
+**Preferred Workflow** (when LinkedIn URL is available):
+
+```
+1. Use exa_crawling_exa with direct URL
+   ↓
+2. Extract comprehensive profile data
+   ↓
+3. Save to data/custodian/person/{linkedin-slug}_{timestamp}.json
+   ↓
+4. Update custodian file with person_profile_path reference
+```
+
+**Example - Direct URL Extraction**:
+```
+Tool: exa_crawling_exa
+Parameters:
+  url: "https://www.linkedin.com/in/alexandr-belov-bb547b46"
+  maxCharacters: 10000
+```
+
+**Why `exa_crawling_exa` over `exa_linkedin_search_exa`**:
+- Returns **complete** career history with dates, durations, company metadata
+- Includes **all** education entries with institutions and degrees
+- Captures **full** about section text
+- Returns **profile image URL**
+- More reliable than search (which may return irrelevant profiles)
+
+**Output Location**: `data/custodian/person/{linkedin-slug}_{ISO-timestamp}.json`
+
+**Required JSON Fields**:
+- `exa_search_metadata` - Tool, timestamp, request ID, cost
+- `linkedin_profile_url` - Source URL
+- `profile_data.career_history[]` - Complete work history
+- `profile_data.education[]` - All education entries
+- `profile_data.heritage_relevant_experience[]` - Tagged heritage roles
+
+**See**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md` for complete documentation
+
+---
+
+### Rule 15: Connection Data Registration - Full Network Preservation
+
+**🚨 CRITICAL: When connection data is manually recorded for a person, ALL connections MUST be fully registered in a dedicated connections file in `data/custodian/person/`.**
+
+This rule ensures complete preservation of professional network data, enabling heritage sector network analysis and cross-custodian relationship discovery.
+
+**Connection File Naming Convention**:
+
+```
+{linkedin-slug}_connections_{ISO-timestamp}.json
+```
+
+**Examples**:
+```
+alexandr-belov-bb547b46_connections_20251210T160000Z.json
+giovanna-fossati_connections_20251211T140000Z.json
+```
+
+**Required Connection File Structure**:
+
+```json
+{
+  "source_metadata": {
+    "source_url": "https://www.linkedin.com/search/results/people/?...",
+    "scraped_timestamp": "2025-12-10T16:00:00Z",
+    "scrape_method": "manual_linkedin_browse",
+    "target_profile": "{linkedin-slug}",
+    "target_name": "Full Name",
+    "connections_extracted": 107
+  },
+  "connections": [
+    {
+      "name": "Connection Name",
+      "degree": "1st" | "2nd" | "3rd+",
+      "headline": "Current role or description",
+      "location": "City, Region, Country",
+      "organization": "Primary organization",
+      "heritage_relevant": true | false,
+      "heritage_type": "A" | "L" | "M" | "R" | "E" | "D" | "G" | etc.
+    }
+  ],
+  "network_analysis": {
+    "total_connections_extracted": 107,
+    "heritage_relevant_count": 56,
+    "heritage_relevant_percentage": 52.3,
+    "connections_by_heritage_type": {...}
+  }
+}
+```
+
+**Minimum Required Fields per Connection**:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `name` | string | Full name of connection |
+| `degree` | string | Connection degree: `1st`, `2nd`, `3rd+` |
+| `headline` | string | Current role/description |
+| `heritage_relevant` | boolean | Is this person in heritage sector? |
+
+**Heritage Type Codes**: Use single-letter GLAMORCUBESFIXPHDNT codes (G, L, A, M, O, R, C, U, B, E, S, F, I, X, P, H, D, N, T).
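A connections file in the shape required by Rule 15 can be sanity-checked before it is saved. The following is a minimal Python sketch and not part of the repository: the helper names `validate_connection` and `summarize_network` are hypothetical, and treating `connections_by_heritage_type` as a mapping from type code to count is an assumption, because the structure in Rule 15 elides that object.

```python
from collections import Counter

# Single-letter codes spelled out by GLAMORCUBESFIXPHDNT.
VALID_CODES = set("GLAMORCUBESFIXPHDNT")
VALID_DEGREES = {"1st", "2nd", "3rd+"}
# Minimum required fields per connection and their expected Python types.
REQUIRED_FIELDS = {
    "name": str,
    "degree": str,
    "headline": str,
    "heritage_relevant": bool,
}


def validate_connection(conn):
    """Return a list of problems found in one connection record (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in conn:
            problems.append(f"missing field: {field}")
        elif not isinstance(conn[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    degree = conn.get("degree")
    if isinstance(degree, str) and degree not in VALID_DEGREES:
        problems.append(f"unknown degree: {degree}")
    # heritage_type is optional; when present it must be one valid letter.
    code = conn.get("heritage_type")
    if code is not None and not (isinstance(code, str) and code in VALID_CODES):
        problems.append(f"invalid heritage_type: {code!r}")
    return problems


def summarize_network(connections):
    """Build a network_analysis block from a list of connection records."""
    total = len(connections)
    relevant = [c for c in connections if c.get("heritage_relevant") is True]
    # ASSUMPTION: per-type counts; the rule does not spell out this object.
    by_type = Counter(c["heritage_type"] for c in relevant if c.get("heritage_type"))
    return {
        "total_connections_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage": round(100 * len(relevant) / total, 1) if total else 0.0,
        "connections_by_heritage_type": dict(by_type),
    }
```

Returning a list of problems rather than raising on the first one lets a manually recorded batch be reviewed in one pass, which matters when connection data is costly to re-extract.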
+
+**Referencing from Custodian Files**:
+
+```yaml
+collection_management_specialist:
+- name: Alexandr Belov
+  role: Collection/Information Specialist
+  linkedin_url: https://www.linkedin.com/in/alexandr-belov-bb547b46
+  current: true
+  person_profile_path: data/custodian/person/alexandr-belov-bb547b46_20251210T120000Z.json
+  person_connections_path: data/custodian/person/alexandr-belov-bb547b46_connections_20251210T160000Z.json
+```
+
+**Why This Rule Matters**:
+1. **Complete Data Preservation**: Connections are expensive to extract (manual scraping, rate limits)
+2. **Heritage Sector Mapping**: Understanding who knows whom in the heritage community
+3. **Cross-Custodian Discovery**: Find staff who work at multiple institutions
+4. **Network Analysis**: Identify key influencers and knowledge hubs
+
+**See**: `.opencode/CONNECTION_DATA_REGISTRATION_RULE.md` for complete documentation including network analysis sections
+
+---
+
+### Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages
+
+**🚨 CRITICAL: When storing LinkedIn profile photo URLs, store the ACTUAL CDN image URL from `media.licdn.com`, NOT the overlay page URL.**
+
+The overlay page URL (`/overlay/photo/`) is trivially derivable from any LinkedIn profile URL and provides zero informational value. The actual CDN URL requires extraction effort and is the only URL that directly serves the image.
+
+**URL Patterns**:
+
+| Type | Pattern | Value |
+|------|---------|-------|
+| ❌ **WRONG** (overlay page) | `https://www.linkedin.com/in/{slug}/overlay/photo/` | Derivable, useless |
+| ✅ **CORRECT** (CDN image) | `https://media.licdn.com/dms/image/v2/{ID}/profile-displayphoto-shrink_800_800/...` | Actual image |
+
+**CDN URL Structure**:
+```
+https://media.licdn.com/dms/image/v2/{image_id}/profile-displayphoto-shrink_{size}/{encoded_params}
+```
+
+Where `{size}` can be: `100_100`, `200_200`, `400_400`, `800_800`
+
+**Person Profile JSON Schema**:
+
+```json
+{
+  "linkedin_profile_url": "https://www.linkedin.com/in/giovannafossati",
+  "linkedin_photo_url": "https://media.licdn.com/dms/image/v2/C4D03AQHQCBcoih82SQ/profile-displayphoto-shrink_800_800/...",
+  "photo_urls": null
+}
+```
+
+**When CDN URL is Not Available** (profile viewed without login, or no photo):
+
+```json
+{
+  "linkedin_profile_url": "https://www.linkedin.com/in/example-person",
+  "linkedin_photo_url": null,
+  "photo_urls": {
+    "primary": "https://example.org/staff/person-headshot.jpg",
+    "sources": [
+      {
+        "url": "https://example.org/staff/person-headshot.jpg",
+        "source": "Institutional website",
+        "retrieved_date": "2025-12-09"
+      }
+    ]
+  }
+}
+```
+
+**Extraction Methods** (in order of preference):
+
+1. **Browser DevTools** - Inspect `<img>` element in profile photo overlay, copy `src` attribute
+2. **Exa Crawling** - Use `exa_crawling_exa` tool with profile URL, extract from returned content
+3. **Alternative Sources** - If LinkedIn CDN unavailable, use institutional websites, conference photos, etc.
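The CDN-versus-overlay distinction in the URL patterns of Rule 16 can be checked mechanically before a person profile is written. This is a rough Python sketch under stated assumptions: the function names `classify_photo_url` and `check_profile_photo` are hypothetical, and the check only looks at the host and path, not at whether the image actually loads.

```python
from urllib.parse import urlparse


def classify_photo_url(url):
    """Classify a photo URL as 'cdn', 'overlay', or 'other'."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "media.licdn.com":
        return "cdn"
    is_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    if is_linkedin and "/overlay/photo" in parsed.path:
        return "overlay"
    return "other"


def check_profile_photo(record):
    """Return problems with the linkedin_photo_url field of a profile record."""
    problems = []
    photo = record.get("linkedin_photo_url")
    # null is allowed (no CDN URL available); anything else must be a CDN URL.
    if photo is not None and classify_photo_url(photo) != "cdn":
        problems.append("linkedin_photo_url must be a media.licdn.com URL or null")
    return problems
```

Photos from institutional websites classify as `other`, which is why they belong in `photo_urls` rather than in `linkedin_photo_url`.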
+
+**Rationale**:
+- Overlay URL = `profile_url + "/overlay/photo/"` → No information gain
+- CDN URL contains unique image identifiers → Actual data
+- CDN URLs can be used to display images directly
+- Multiple size variants available by changing size parameter
+
+**See**:
+- `.opencode/LINKEDIN_PHOTO_CDN_RULE.md` for agent rule reference
+- `docs/LINKEDIN_PHOTO_URL_EXTRACTION.md` for complete extraction documentation
+
+---
+
 ## Project Overview
 
 **Goal**: Extract structured data about worldwide GLAMORCUBESFIXPHDNT (Galleries, Libraries, Archives, Museums, Official institutions, Research centers, Corporations, Unknown, Botanical gardens/zoos, Educational providers, Societies, Features, Intangible heritage groups, miXed, Personal collections, Holy sites, Digital platforms, NGOs, Taste/smell heritage) institutions from 139+ Claude conversation JSON files and integrate with authoritative CSV datasets.