glam/docs/LINKEDIN_CONNECTION_DUPLICATES.md

# Understanding Duplicates in LinkedIn Connection Data

## Summary

When manually registering LinkedIn connection data, **duplicates are expected and normal**. This document explains why they occur and how to handle them.

---

## The Core Constraint

**The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.**

This is a fundamental LinkedIn UI limitation that affects all manual connection registration.

---

## Why Duplicates Occur

### Scenario

```
Viewer:     Alice (logged into LinkedIn, conducting the search)
Target:     Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)
```

### What LinkedIn Shows

When Alice views Bob's connections and sees Carol:

| Field Displayed | Value Shown | What It Means |
|-----------------|-------------|---------------|
| `name` | "Carol Smith" | Carol's name |
| `degree` | "2nd" | **Carol is Alice's 2nd-degree connection** (NOT Bob's!) |
| `headline` | "Film Archivist" | Carol's current role |

### The Problem

Carol may appear **multiple times** in the registered data because connection degrees are determined by **graph traversal paths**, and multiple paths of different lengths can exist to the same node in a social network:

1. **Carol is Alice's 1st-degree connection** - Alice is connected directly to Carol
2. **Carol is also Alice's 2nd-degree connection** - Carol is connected to Dave, who is one of Alice's 1st-degree connections
3. **Carol is also Alice's 3rd+-degree connection** - Carol is reachable via a longer path through other connections

This is a fundamental property of social network graphs: **multiple paths of different lengths can lead to the same node**. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.
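
The multiple-path property can be illustrated with a small sketch. The graph below is invented for illustration (Dave comes from the scenario above; Eve and Frank are hypothetical extra nodes), and the helper `simple_path_lengths` is not part of any project code. It only shows that one node can be reached over paths of several lengths; it makes no claim about how LinkedIn itself computes the displayed degree.

```python
from typing import Dict, List, Set

# Hypothetical undirected connection graph, stored as adjacency lists.
GRAPH: Dict[str, List[str]] = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Carol": ["Alice", "Dave", "Frank"],
    "Dave": ["Alice", "Carol"],
    "Eve": ["Alice", "Frank"],
    "Frank": ["Eve", "Carol"],
}


def simple_path_lengths(graph: Dict[str, List[str]], start: str, goal: str) -> Set[int]:
    """Return the lengths (in edges) of all simple paths from start to goal."""
    lengths: Set[int] = set()

    def walk(node: str, visited: Set[str], depth: int) -> None:
        if node == goal:
            lengths.add(depth)
            return
        for neighbour in graph[node]:
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, depth + 1)

    walk(start, {start}, 0)
    return lengths


path_lengths = simple_path_lengths(GRAPH, "Alice", "Carol")
print(sorted(path_lengths))  # [1, 2, 3]
```

In this toy graph Carol is reachable from Alice directly, via Dave, and via Eve and Frank, which corresponds to the three cases listed above.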

---

## Real-World Example

From the Giovanna Fossati connection extraction:

| Metric | Value |
|--------|-------|
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |

**Explanation**: The 219 duplicates were the same people registered more than once, at different connection degrees. The final JSON has 776 connections (766 unique names from the MD file plus 10 from other sources).
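
As a quick sanity check, the figures in the table can be re-derived from one another. The variable names below are illustrative only; the numbers are the ones reported above.

```python
raw_lines = 985           # raw lines in the MD file
unique_names = 766        # unique names after deduplication
from_other_sources = 10   # connections added from other sources

duplicates = raw_lines - unique_names                   # 219
duplicate_share = duplicates / raw_lines                # roughly 0.22
final_connections = unique_names + from_other_sources   # 776

print(duplicates, round(duplicate_share * 100), final_connections)
```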

---

## How to Handle Duplicates

### During Parsing

```python
def parse_connections_file(raw_entries: list[dict]) -> list[dict]:
    """Deduplicate raw registered entries by normalized name.

    The first occurrence of each person is kept; later occurrences
    (typically the same person at a different degree) are dropped.
    """
    names_seen: set[str] = set()
    unique_connections: list[dict] = []

    for entry in raw_entries:
        # normalize_name() is defined under "Name Normalization" below
        normalized_name = normalize_name(entry['name'])

        # Same person registered again (e.g. at a different degree): skip
        if normalized_name in names_seen:
            continue

        names_seen.add(normalized_name)
        unique_connections.append(entry)

    return unique_connections
```

### Name Normalization

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Normalize a name for deduplication comparison."""
    # NFD decomposition splits accented characters into base + combining mark
    decomposed = unicodedata.normalize('NFD', name)
    # Drop the combining marks (category 'Mn'), i.e. the diacritics
    without_diacritics = ''.join(
        c for c in decomposed if unicodedata.category(c) != 'Mn'
    )
    # Lowercase and strip surrounding whitespace
    return without_diacritics.lower().strip()
```
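
One possible variation, sketched here as an assumption rather than as the project's actual pipeline, is to merge duplicates instead of discarding them, so that every degree at which a person was registered is preserved. The function `merge_duplicates`, the `degrees_seen` key, and the sample entries are all hypothetical; `normalize_name` is repeated so the block runs on its own.

```python
import unicodedata
from typing import Dict, List


def normalize_name(name: str) -> str:
    """Same normalization as above: strip diacritics, lowercase, trim."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.lower().strip()


def merge_duplicates(raw_entries: List[dict]) -> List[dict]:
    """Keep the first entry per person and record every degree seen for them."""
    merged: Dict[str, dict] = {}
    for entry in raw_entries:
        key = normalize_name(entry["name"])
        if key not in merged:
            # First sighting: copy the entry and start the degree list
            merged[key] = {**entry, "degrees_seen": [entry["degree"]]}
        elif entry["degree"] not in merged[key]["degrees_seen"]:
            # Repeat sighting at a new degree: record it, keep the first entry
            merged[key]["degrees_seen"].append(entry["degree"])
    return list(merged.values())


# Hypothetical sample data for illustration only
sample_entries = [
    {"name": "Carol Smith", "degree": "2nd"},
    {"name": "carol smith ", "degree": "1st"},
    {"name": "José García", "degree": "3rd+"},
    {"name": "Jose Garcia", "degree": "3rd+"},
]

merged_entries = merge_duplicates(sample_entries)
print(len(merged_entries))  # 2
```

Note that deduplicating on name alone will also merge two different people who happen to share a name; a stricter key (for example name plus headline) would reduce that risk at the cost of missing some true duplicates.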

---

## Important Clarifications

### What the `degree` Field DOES Mean

The `degree` field in registered connection data represents:

- **1st**: The connection is directly connected to the **viewer** (Alice)
- **2nd**: The connection is connected to one of the **viewer's** 1st-degree connections
- **3rd+**: The connection is three or more steps away from the **viewer**

### What the `degree` Field Does NOT Mean

The `degree` field does NOT represent:

- The relationship between the **target profile** (Bob) and the connection
- Any meaningful network distance for analysis purposes
- A stable value (changes based on who is viewing)
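
One way to keep this distinction visible in stored data is to record the viewer and the target explicitly alongside each entry. The record type below is a hypothetical sketch; the class and field names (`RegisteredConnection`, `degree_to_viewer`, and so on) are assumptions for illustration and not an existing schema in this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredConnection:
    """One manually registered row from a target's connection list."""

    viewer: str            # person logged in and conducting the search
    target: str            # profile whose connection list was being viewed
    name: str              # the connection shown in the list
    degree_to_viewer: str  # "1st", "2nd" or "3rd+", relative to the viewer
    headline: str          # the connection's displayed headline


# The scenario from this document, expressed as a record
record = RegisteredConnection(
    viewer="Alice",
    target="Bob",
    name="Carol Smith",
    degree_to_viewer="2nd",
    headline="Film Archivist",
)
```

Naming the field `degree_to_viewer` rather than `degree` makes it harder to misread the value as a statement about the target.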

---

## Implications for Heritage Network Analysis

### Limitation

Because the `degree` field is viewer-relative, it cannot be used to determine:

- How closely connected two heritage professionals are
- The network centrality of a person within the heritage sector
- True 1st-degree connections of a target profile

### What We CAN Determine

Despite this limitation, we can still:

1. **Identify heritage professionals** - Names + headlines reveal sector affiliation
2. **Find cross-institutional connections** - Same person in multiple custodian connection lists
3. **Map organizational networks** - Which institutions' staff know each other
4. **Count sector presence** - Percentage of connections in heritage sector
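
The second and fourth points can be sketched in a few lines. Everything below is illustrative: the sample lists, the keyword tuple `HERITAGE_KEYWORDS`, and the helpers `shared_connections` and `heritage_share` are assumptions, and keyword matching on headlines is only a rough proxy for sector affiliation.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical registered connection lists, keyed by target profile
connection_lists: Dict[str, List[dict]] = {
    "Bob": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Dan Jones", "headline": "Software Engineer"},
    ],
    "Erin": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Fay Lee", "headline": "Museum Curator"},
    ],
}

# Assumed keyword list; a real one would need curation
HERITAGE_KEYWORDS = ("archiv", "museum", "curator", "heritage")


def shared_connections(lists: Dict[str, List[dict]]) -> Dict[str, Set[str]]:
    """Map each name found in more than one target's list to those targets."""
    seen_in: Dict[str, Set[str]] = defaultdict(set)
    for target, entries in lists.items():
        for entry in entries:
            seen_in[entry["name"].lower().strip()].add(target)
    return {name: targets for name, targets in seen_in.items() if len(targets) > 1}


def heritage_share(entries: List[dict]) -> float:
    """Fraction of entries whose headline contains a heritage keyword."""
    if not entries:
        return 0.0
    hits = sum(
        1
        for entry in entries
        if any(keyword in entry["headline"].lower() for keyword in HERITAGE_KEYWORDS)
    )
    return hits / len(entries)


shared = shared_connections(connection_lists)
bob_share = heritage_share(connection_lists["Bob"])
```

Neither helper reads the `degree` field, which is why these analyses remain valid despite the viewer-relative limitation.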

---

## Documentation Trail

This constraint is documented in:

1. **AGENTS.md** - Rule 17 (LinkedIn Connection Unique Identifiers)
2. **.opencode/LINKEDIN_CONNECTION_ID_RULE.md** - Complete implementation rules
3. **This document** - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)

---

## Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |