# LinkedIn Connection Unique Identifiers Rule

## Summary

When parsing LinkedIn connections, **EVERY connection MUST receive a unique `connection_id`**, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").

This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.

---

## Connection ID Format

```
{target_slug}_conn_{index:04d}_{name_slug}
```

**Components**:
- `target_slug`: LinkedIn slug of the profile owner (e.g., `anne-gant-59908a18`)
- `conn`: Literal string indicating this is a connection
- `index`: 4-digit zero-padded index (0000-9999)
- `name_slug`: Normalized name of the connection (see Name Slug Generation below)

**Examples**:
```
anne-gant-59908a18_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
elif-rongen-kaynakci-35295a17_conn_0003_tina_m_bastajian
```
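
The format string maps directly onto Python's format specification (`:04d` gives the zero-padded index). A minimal sketch, assuming a hypothetical helper named `format_connection_id` that receives an already-normalized `name_slug`; the range check on `index` is also an assumption, derived from the 4-digit field width:

```python
def format_connection_id(target_slug: str, index: int, name_slug: str) -> str:
    """Assemble a connection ID from its three variable components.

    Hypothetical helper for illustration; the project script may differ.
    """
    if not 0 <= index <= 9999:
        # Assumption: indexes that do not fit in 4 digits are rejected
        raise ValueError(f"index {index} does not fit in 4 digits")
    return f"{target_slug}_conn_{index:04d}_{name_slug}"


# Two anonymous entries share a name slug but still get distinct IDs,
# because the index differs.
first = format_connection_id("giovannafossati", 156, "linkedin_member")
second = format_connection_id("giovannafossati", 157, "linkedin_member")
print(first)   # giovannafossati_conn_0156_linkedin_member
print(second)  # giovannafossati_conn_0157_linkedin_member
```

The index, not the name slug, is what guarantees uniqueness within one target's connection list.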

---

## Name Type Classification

Every connection MUST include a `name_type` field with one of three values:

| Type | Pattern | Example | `name_type` Value |
|------|---------|---------|-------------------|
| **Full Name** | First + Last name | "John Smith" | `full` |
| **Abbreviated** | Contains single initial | "Amy B.", "S. Buse Yildirim", "Tina M. Bastajian" | `abbreviated` |
| **Anonymous** | Privacy-hidden profile | "LinkedIn Member" | `anonymous` |

---

## Abbreviated Name Detection Patterns

A name is classified as `abbreviated` when it contains a single uppercase letter (A-Z) followed by a period:

```python
# Detection patterns
"Amy B."           # Last name abbreviated
"S. Buse Yildirim" # First name abbreviated  
"Tina M. Bastajian" # Middle initial (also abbreviated)
"I. Can Koc"       # Turkish first initial
"Elena K."         # Last name abbreviated
"Misato E."        # Last name abbreviated
```

**Detection Regex**:
```python
import re

def is_abbreviated_name(name: str) -> bool:
    """Check if name contains abbreviated components (single letter + period)."""
    # Pattern: single letter followed by period (e.g., "Amy B." or "S. Buse")
    abbreviated_pattern = r'\b[A-Z]\.'
    return bool(re.search(abbreviated_pattern, name))
```

---

## Anonymous Name Detection

A name is classified as `anonymous` when it matches privacy-hidden patterns:

```python
ANONYMOUS_PATTERNS = [
    "linkedin member",
    "member",  # Exact match only
]

def is_anonymous_name(name: str) -> bool:
    """Check if name indicates an anonymous/hidden profile."""
    normalized = name.lower().strip()
    # "linkedin member" is already listed in ANONYMOUS_PATTERNS
    return normalized in ANONYMOUS_PATTERNS
```
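
The two detectors can be combined into a single classifier for the `name_type` field. This is a sketch only: `classify_name_type` is a hypothetical name, and checking for anonymous entries first is an assumption about ordering, not something the rule prescribes. The detectors are restated so the block runs on its own.

```python
import re

ANONYMOUS_PATTERNS = ["linkedin member", "member"]


def is_anonymous_name(name: str) -> bool:
    """Exact match against the privacy-hidden placeholders."""
    return name.lower().strip() in ANONYMOUS_PATTERNS


def is_abbreviated_name(name: str) -> bool:
    """True when the name contains a single uppercase letter plus period."""
    return bool(re.search(r'\b[A-Z]\.', name))


def classify_name_type(name: str) -> str:
    """Return 'anonymous', 'abbreviated', or 'full' (hypothetical helper)."""
    if is_anonymous_name(name):      # assumed to take precedence
        return "anonymous"
    if is_abbreviated_name(name):
        return "abbreviated"
    return "full"


for sample in ["LinkedIn Member", "Amy B.", "Tina M. Bastajian", "John Smith"]:
    print(f"{sample!r}: {classify_name_type(sample)}")
```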

---

## Name Slug Generation Rules

The `name_slug` component is generated by normalizing the connection's name:

1. **Normalize unicode** (NFD decomposition)
2. **Remove diacritics** (e.g., é→e, ö→o, ñ→n, İ→i)
3. **Convert to lowercase**
4. **Replace non-alphanumeric** with underscores
5. **Collapse multiple underscores**
6. **Truncate to 30 characters** maximum

**Implementation**:
```python
import unicodedata
import re

def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Generate a URL-safe slug from a name."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', name)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase
    lowercase = ascii_text.lower()
    # Replace non-alphanumeric with underscores
    slug = re.sub(r'[^a-z0-9]+', '_', lowercase)
    # Remove leading/trailing underscores and collapse multiple
    slug = re.sub(r'_+', '_', slug).strip('_')
    # Truncate
    return slug[:max_length]
```

**Examples**:
| Name | Slug |
|------|------|
| "Tina M. Bastajian" | `tina_m_bastajian` |
| "Elena K." | `elena_k` |
| "LinkedIn Member" | `linkedin_member` |
| "Geraldine Vooren" | `geraldine_vooren` |
| "Nusta Nina" | `nusta_nina` |

---

## Required Fields per Connection

Every connection entry in the JSON output MUST include:

```json
{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `connection_id` | string | **YES** | Unique identifier |
| `name` | string | **YES** | Name exactly as displayed (may be abbreviated or anonymous) |
| `name_type` | string | **YES** | `full`, `abbreviated`, or `anonymous` |
| `degree` | string | **YES** | Connection degree: `1st`, `2nd`, `3rd+` |
| `headline` | string | NO | Current role/description |
| `location` | string | NO | Location as displayed |
| `heritage_relevant` | boolean | NO | Whether the person works in the heritage sector |
| `heritage_type` | string | NO | GLAMORCUBESFIXPHDNT code; set only when `heritage_relevant` is true |
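
The required-field table can be checked mechanically before writing the JSON output. A minimal sketch, assuming a hypothetical `validate_connection` helper that returns a list of problems; the allowed values come from the table, while the function name and message wording are illustrative:

```python
REQUIRED_FIELDS = ("connection_id", "name", "name_type", "degree")
VALID_NAME_TYPES = ("full", "abbreviated", "anonymous")
VALID_DEGREES = ("1st", "2nd", "3rd+")


def validate_connection(entry: dict) -> list:
    """Return a list of problems for one entry (empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    name_type = entry.get("name_type")
    if name_type and name_type not in VALID_NAME_TYPES:
        problems.append(f"invalid name_type: {name_type!r}")
    degree = entry.get("degree")
    if degree and degree not in VALID_DEGREES:
        problems.append(f"invalid degree: {degree!r}")
    return problems


valid = {
    "connection_id": "target-slug_conn_0042_amy_b",
    "name": "Amy B.",
    "name_type": "abbreviated",
    "degree": "2nd",
}
broken = {"name": "LinkedIn Member", "name_type": "hidden", "degree": "3rd+"}

print(validate_connection(valid))   # []
print(validate_connection(broken))
```

Optional fields are deliberately not checked here, since the table marks them as not required.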

---

## Implementation Script

**Location**: `scripts/parse_linkedin_connections.py`

**Key Functions**:
- `is_abbreviated_name(name)` - Detects single-letter initials
- `is_anonymous_name(name)` - Detects "LinkedIn Member" patterns
- `generate_connection_id(name, index, target_slug)` - Creates unique IDs
- `parse_connections_file(input_path, target_slug)` - Main parsing function

**Usage**:
```bash
python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"
```

---

## Statistics from Real Extractions

| Person | Total | Full Names | Abbreviated | Anonymous |
|--------|-------|------------|-------------|-----------|
| Elif Rongen-Kaynakci | 475 | 449 (94.5%) | 26 (5.5%) | 0 |
| Giovanna Fossati | 776 | 746 (96.1%) | 30 (3.9%) | 0 |

Typical abbreviated name rate: **3-6%** of connections.
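
Percentages like these can be recomputed from any parsed connection list. A small sketch, assuming each entry carries the `name_type` field described above; `name_type_breakdown` is a hypothetical helper name, and the sample list is synthetic, built only to mirror the counts in the first table row:

```python
from collections import Counter

NAME_TYPES = ("full", "abbreviated", "anonymous")


def name_type_breakdown(connections: list) -> dict:
    """Map each name_type to (count, percentage of all connections)."""
    counts = Counter(entry["name_type"] for entry in connections)
    total = len(connections)
    return {
        name_type: (
            counts[name_type],
            round(100 * counts[name_type] / total, 1) if total else 0.0,
        )
        for name_type in NAME_TYPES
    }


# Synthetic entries matching the 475-connection row above
sample = [{"name_type": "full"}] * 449 + [{"name_type": "abbreviated"}] * 26
print(name_type_breakdown(sample))
# {'full': (449, 94.5), 'abbreviated': (26, 5.5), 'anonymous': (0, 0.0)}
```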

---

## Understanding Duplicates in Raw Connection Data

**🚨 CRITICAL: Duplicates in raw manually-registered connection data are expected and NOT a data quality issue.**

### Why Duplicates Occur

When manually registering LinkedIn connections, the **same person can appear multiple times** in the raw data because:

1. **Connection degree is relative to the VIEWER, not the target profile**
2. In a social network graph, **multiple paths of different lengths can exist to the same node**
3. A single person can simultaneously be reachable as:
   - **1st degree** - Direct connection to the viewer
   - **2nd degree** - Also reachable via another 1st degree connection
   - **3rd+ degree** - Also reachable via even longer paths

### Example Scenario (Graph Theory)

```
Viewer: Alice (the person conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol

Carol may appear multiple times because multiple graph paths exist:

    Alice -----> Carol           (1st degree: direct edge)
    Alice -----> Dave -----> Carol    (2nd degree: path length 2)
    Alice -----> Eve -----> Frank -----> Carol  (3rd+: path length 3)

All three paths lead to the same node (Carol). LinkedIn may display
different degree values depending on which path it evaluates.
```
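
The scenario can be reproduced with a small adjacency list. The sketch below enumerates every simple path from Alice to Carol and reports the path lengths. It models the arrows as directed edges exactly as drawn, which is a simplification (real LinkedIn connections are mutual), and `simple_path_lengths` is a hypothetical helper written for this illustration:

```python
def simple_path_lengths(graph: dict, start: str, goal: str) -> list:
    """Return the sorted lengths of all simple paths from start to goal."""
    lengths = []

    def walk(node: str, visited: frozenset, depth: int) -> None:
        if node == goal:
            lengths.append(depth)
            return
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:  # never revisit a node on one path
                walk(neighbour, visited | {neighbour}, depth + 1)

    walk(start, frozenset({start}), 0)
    return sorted(lengths)


# Edges copied from the diagram above
graph = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Dave": ["Carol"],
    "Eve": ["Frank"],
    "Frank": ["Carol"],
}
print(simple_path_lengths(graph, "Alice", "Carol"))  # [1, 2, 3]
```

Each length corresponds to one of the degrees at which Carol can show up in the raw data.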

### Key Constraint

**The `degree` field (1st, 2nd, 3rd+) reflects the relationship between:**
- The **viewer** (person logged into LinkedIn) and the **connection**
- **NOT** the relationship between the target profile and the connection

This is a LinkedIn UI limitation: when browsing someone else's connections, LinkedIn shows YOUR connection degree to each of those people, not the profile owner's.

### Impact on Data Processing

When processing raw MD files manually registered from LinkedIn:

1. **Expect duplicates** - Same name appearing 2-3 times is normal
2. **Deduplicate by name** - Keep only unique names in final JSON
3. **Count unique names** - Not total lines in raw file
4. **Document discrepancy** - Raw file line count ≠ unique connections

### Real-World Example

| File | Raw Lines | Unique Names | Duplicates | Explanation |
|------|-----------|--------------|------------|-------------|
| Giovanna Fossati MD | 985 | 766 | 219 (22%) | Same people appearing at multiple degrees |

### Processing Recommendation

```python
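# When parsing raw MD connections file:
def normalize_name(name: str) -> str:
    """Assumed normalization: trim, collapse whitespace, and lowercase."""
    return " ".join(name.split()).lower()


def deduplicate_connections(raw_entries: list) -> list:
    """Keep the first occurrence of each name, preserving input order."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        normalized_name = normalize_name(entry['name'])
        if normalized_name not in names_seen:
            names_seen.add(normalized_name)
            unique_connections.append(entry)
        # else: skip duplicate (same person at different degree)
    return unique_connections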
```
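
To document the gap between raw line count and unique connections, the counts can be summarized in one pass. A sketch under stated assumptions: `summarize_duplicates` is a hypothetical helper, names are compared after trimming and lowercasing (the source does not define the exact normalization), and the sample entries are invented:

```python
from collections import Counter


def summarize_duplicates(raw_entries: list) -> dict:
    """Count raw entries, unique names, and surplus duplicate entries."""
    # Assumption: names match after trimming whitespace and lowercasing
    counts = Counter(entry["name"].strip().lower() for entry in raw_entries)
    raw_total = sum(counts.values())
    unique_total = len(counts)
    duplicates = raw_total - unique_total
    pct = round(100 * duplicates / raw_total, 1) if raw_total else 0.0
    return {
        "raw": raw_total,
        "unique": unique_total,
        "duplicates": duplicates,
        "duplicate_pct": pct,
    }


# Invented sample: Carol appears at three degrees, Dave once
raw_entries = [
    {"name": "Carol", "degree": "1st"},
    {"name": "Carol", "degree": "2nd"},
    {"name": "carol ", "degree": "3rd+"},
    {"name": "Dave", "degree": "2nd"},
]
print(summarize_duplicates(raw_entries))
# {'raw': 4, 'unique': 2, 'duplicates': 2, 'duplicate_pct': 50.0}
```

Because the key is the displayed name, distinct people who share one (two "Amy B." entries, or several "LinkedIn Member" placeholders) are also counted as duplicates; whether those should be kept apart is left open here.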

---

## Why This Rule Matters

1. **Deduplication**: Same abbreviated name across different connection lists can be linked
2. **Privacy Respect**: Preserves privacy while enabling analysis
3. **Complete Data**: No connections are silently dropped
4. **Network Analysis**: Enables heritage sector relationship mapping even with partial data
5. **Audit Trail**: Every connection can be traced back to its source

---

## Related Rules

- **Rule 15**: Connection Data Registration - Full Network Preservation (`AGENTS.md`)
- **Rule 16**: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages (`AGENTS.md`)
- **EXA LinkedIn Extraction**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md`
- **Person Data Reference Pattern**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md`

---

## Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-12-10 | 1.0 | Initial rule creation |
| 2025-12-11 | 1.1 | Added duplicate explanation section |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths to same node, not scroll position |