# Understanding Duplicates in LinkedIn Connection Data

## Summary

When manually registering LinkedIn connection data, **duplicates are expected and normal**. This document explains why they occur and how to handle them.

---

## The Core Constraint

**The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.**

This is a fundamental LinkedIn UI limitation that affects all manual connection registration.

---

## Why Duplicates Occur

### Scenario

```
Viewer: Alice (logged into LinkedIn, conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)
```

### What LinkedIn Shows

When Alice views Bob's connections and sees Carol:

| Field Displayed | Value Shown | What It Means |
|-----------------|-------------|---------------|
| `name` | "Carol Smith" | Carol's name |
| `degree` | "2nd" | **Carol is Alice's 2nd degree** (NOT Bob's!) |
| `headline` | "Film Archivist" | Carol's current role |
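
The viewer-relative reading of `degree` can be sketched with a breadth-first search over a toy graph. This is only an illustration: the graph below is made up to match the scenario, and treating the degree as a hop count from the starting person is an assumption of the sketch, not a description of LinkedIn's internals.

```python
from collections import deque

# Hypothetical toy graph for the scenario above; edges are mutual connections.
GRAPH = {
    "Alice": {"Bob"},
    "Bob": {"Alice", "Carol"},
    "Carol": {"Bob"},
}


def degrees_from(viewer: str, graph: dict[str, set[str]]) -> dict[str, int]:
    """Return the hop count from `viewer` to every reachable person (BFS)."""
    distances = {viewer: 0}
    queue = deque([viewer])
    while queue:
        person = queue.popleft()
        for neighbor in graph[person]:
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    return distances


alice_view = degrees_from("Alice", GRAPH)
bob_view = degrees_from("Bob", GRAPH)

print(alice_view["Carol"])  # 2: the value Alice sees next to Carol's name
print(bob_view["Carol"])    # 1: Carol's distance from Bob, which Alice is never shown
```

The same person gets a different number depending on who the search starts from, which is why the registered `degree` says nothing about Bob.
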

### The Problem

Carol may appear **multiple times** in the registered data because connection degrees are determined by **graph traversal paths**, and multiple paths of different lengths can exist to the same node in a social network:

1. **Carol is Alice's 1st degree** - Alice is connected directly to Carol
2. **Carol is also Alice's 2nd degree** - Carol is also connected to Dave, who is Alice's 1st degree connection
3. **Carol is also Alice's 3rd+ degree** - Carol is also reachable via a longer path through other connections

This is a fundamental property of social network graphs: **multiple paths of different lengths can lead to the same node**. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.
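
The graph property itself can be checked on a small example. The sketch below enumerates every simple path from Alice to Carol in a made-up graph that matches the three cases listed above; Erin and Frank are hypothetical intermediaries added for the longer path. It only demonstrates that several path lengths coexist and does not model how LinkedIn chooses the degree it displays.

```python
# Hypothetical toy graph matching the three cases listed above.
GRAPH = {
    "Alice": {"Carol", "Dave", "Erin"},
    "Carol": {"Alice", "Dave", "Frank"},
    "Dave": {"Alice", "Carol"},
    "Erin": {"Alice", "Frank"},
    "Frank": {"Erin", "Carol"},
}


def simple_paths(graph, start, goal, path=None):
    """Yield every path from `start` to `goal` that visits no person twice."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for neighbor in sorted(graph[start]):
        if neighbor not in path:
            yield from simple_paths(graph, neighbor, goal, path)


# Path length = number of hops = number of people on the path minus one.
path_lengths = sorted(len(p) - 1 for p in simple_paths(GRAPH, "Alice", "Carol"))
print(path_lengths)  # [1, 2, 3]
```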

---

## Real-World Example

From the Giovanna Fossati connection extraction:

| Metric | Value |
|--------|-------|
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |

**Explanation**: The 219 duplicate entries (985 raw lines minus 766 unique names) were the same people registered more than once at different connection degrees. The final JSON has 776 connections (766 from the MD file + 10 from other sources).
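
The figures in the table can be reproduced with a few lines of arithmetic. The variable names are chosen for this sketch only; the numbers are the ones reported above.

```python
raw_lines = 985      # raw lines in the MD file
unique_names = 766   # unique names after deduplication
other_sources = 10   # connections added from other sources

duplicates = raw_lines - unique_names
duplicate_share = duplicates / raw_lines
final_connections = unique_names + other_sources

print(duplicates)                # 219
print(f"{duplicate_share:.0%}")  # 22%
print(final_connections)         # 776
```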

---

## How to Handle Duplicates

### During Parsing

```python
def parse_connections_file(raw_entries: list[dict]) -> list[dict]:
    """Parse raw registered entries, deduplicating by normalized name.

    Relies on normalize_name(), defined under "Name Normalization".
    """
    names_seen = set()
    unique_connections = []

    for entry in raw_entries:
        # Normalize the name so case and diacritic variants compare equal
        normalized_name = normalize_name(entry['name'])

        # Keep only the first entry per person; later entries are the
        # same person registered again at a different degree
        if normalized_name not in names_seen:
            names_seen.add(normalized_name)
            unique_connections.append(entry)

    return unique_connections
```

### Name Normalization

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize a name for deduplication comparison."""
    # NFD decomposition splits accented characters into base letter + combining mark
    decomposed = unicodedata.normalize('NFD', name)
    # Drop the combining marks (category 'Mn') to remove diacritics
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Lowercase and trim surrounding whitespace
    return stripped.lower().strip()
```
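
As a possible variation on dropping later entries, duplicates could be merged while keeping a record of every degree registered for the same person. The sketch below is an assumption-laden illustration, not the project's implementation: `merge_duplicates` and the `degrees_seen` key are hypothetical names, the sample entries are invented, and `normalize_name` is repeated only so the block runs on its own.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Same normalization as above, repeated so this sketch is self-contained."""
    decomposed = unicodedata.normalize('NFD', name)
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return stripped.lower().strip()


def merge_duplicates(raw_entries: list[dict]) -> list[dict]:
    """Keep one record per person and collect every degree registered for them."""
    merged: dict[str, dict] = {}
    for entry in raw_entries:
        key = normalize_name(entry['name'])
        if key not in merged:
            # First sighting: copy the entry and start the hypothetical degrees_seen list
            merged[key] = {**entry, 'degrees_seen': []}
        degree = entry.get('degree')
        if degree is not None and degree not in merged[key]['degrees_seen']:
            merged[key]['degrees_seen'].append(degree)
    return list(merged.values())


# Invented sample entries: two spellings each of two people
entries = [
    {'name': 'Carol Smith', 'degree': '2nd', 'headline': 'Film Archivist'},
    {'name': 'carol smith ', 'degree': '1st', 'headline': 'Film Archivist'},
    {'name': 'José García', 'degree': '3rd+', 'headline': 'Curator'},
    {'name': 'Jose Garcia', 'degree': '3rd+', 'headline': 'Curator'},
]

result = merge_duplicates(entries)
print(len(result))                # 2
print(result[0]['degrees_seen'])  # ['2nd', '1st']
```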

---

## Important Clarifications

### What the `degree` Field DOES Mean

The `degree` field in registered connection data represents:

- **1st**: The connection is directly connected to the **viewer** (Alice)
- **2nd**: The connection is connected to someone the **viewer** is directly connected to
- **3rd+**: The connection is 3+ steps from the **viewer**

### What the `degree` Field Does NOT Mean

The `degree` field does NOT represent:

- The relationship between the **target profile** (Bob) and the connection
- Any meaningful network distance for analysis purposes
- A stable value (it changes depending on who is viewing)
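
One way to keep this distinction visible in the data is to store the viewer alongside the degree. The record type below is a hypothetical sketch; none of its field names come from the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredConnection:
    """One manually registered row; all field names here are hypothetical."""
    name: str
    headline: str
    viewer: str         # who was logged in when the row was registered
    target: str         # whose connection list was being viewed
    viewer_degree: str  # '1st', '2nd' or '3rd+', relative to `viewer`, not `target`


row = RegisteredConnection(
    name="Carol Smith",
    headline="Film Archivist",
    viewer="Alice",
    target="Bob",
    viewer_degree="2nd",
)
print(row.viewer_degree)  # 2nd
```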

---

## Implications for Heritage Network Analysis

### Limitation

Because the `degree` field is viewer-relative, we cannot determine:

- How closely connected two heritage professionals are
- The network centrality of a person within the heritage sector
- True 1st-degree connections of a target profile

### What We CAN Determine

Despite this limitation, we can still:

1. **Identify heritage professionals** - Names + headlines reveal sector affiliation
2. **Find cross-institutional connections** - Same person in multiple custodian connection lists
3. **Map organizational networks** - Which institutions' staff know each other
4. **Count sector presence** - Percentage of connections in heritage sector
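
Item 2 above, finding the same person in several custodian connection lists, can be sketched without using the `degree` field at all. The input layout, the custodian keys, and the names below are invented for illustration, and the names are assumed to be already normalized.

```python
from collections import defaultdict

# Hypothetical input: custodian (target profile) -> normalized connection names
connection_lists = {
    "custodian_a": ["carol smith", "dave jones", "erin lee"],
    "custodian_b": ["carol smith", "frank wu"],
    "custodian_c": ["erin lee", "carol smith"],
}


def cross_list_connections(lists: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each name found in two or more lists to the custodians listing it."""
    appearances = defaultdict(set)
    for custodian, names in lists.items():
        for name in names:
            appearances[name].add(custodian)
    return {
        name: sorted(custodians)
        for name, custodians in appearances.items()
        if len(custodians) >= 2
    }


shared = cross_list_connections(connection_lists)
print(sorted(shared))         # ['carol smith', 'erin lee']
print(shared["carol smith"])  # ['custodian_a', 'custodian_b', 'custodian_c']
```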

---

## Documentation Trail

This constraint is documented in:

1. **AGENTS.md** - Rule 17 (LinkedIn Connection Unique Identifiers)
2. **.opencode/LINKEDIN_CONNECTION_ID_RULE.md** - Complete implementation rules
3. **This document** - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)

---

## Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |