Understanding Duplicates in LinkedIn Connection Data
Summary
When manually registering LinkedIn connection data, duplicates are expected and normal. This document explains why they occur and how to handle them.
The Core Constraint
The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.
This is a fundamental LinkedIn UI limitation that affects all manual connection registration.
Why Duplicates Occur
Scenario
Viewer: Alice (logged into LinkedIn, conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)
What LinkedIn Shows
When Alice views Bob's connections and sees Carol:
| Field Displayed | Value Shown | What It Means |
|---|---|---|
| name | "Carol Smith" | Carol's name |
| degree | "2nd" | Carol is Alice's 2nd degree (NOT Bob's!) |
| headline | "Film Archivist" | Carol's current role |
The Problem
Carol may appear multiple times in the registered data because connection degrees are determined by graph traversal paths, and multiple paths of different lengths can exist to the same node in a social network:
- Carol is Alice's 1st degree - Alice connected directly to Carol
- Carol is also Alice's 2nd degree - Carol is also connected to Dave, who is Alice's 1st degree connection
- Carol is also Alice's 3rd+ degree - Carol is also reachable via a longer path through other connections
This is a fundamental property of social network graphs: several paths of different lengths can lead to the same node. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.
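The multiple-path property can be illustrated with a toy graph. This is a minimal sketch: the graph, the extra node Erin and the helper `simple_path_lengths` are hypothetical and only mirror the Alice/Dave/Carol scenario. It shows that paths of length 1, 2 and 3 can coexist; it makes no claim about which path LinkedIn evaluates.

```python
# Hypothetical undirected graph; "Erin" is an invented intermediate node.
graph = {
    "Alice": {"Carol", "Dave"},
    "Dave": {"Alice", "Carol", "Erin"},
    "Erin": {"Dave", "Carol"},
    "Carol": {"Alice", "Dave", "Erin"},
}


def simple_path_lengths(graph: dict, start: str, goal: str) -> list:
    """Return the sorted edge counts of all simple paths from start to goal."""
    lengths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for neighbour in graph[node]:
            if neighbour in path:
                continue  # keep paths simple: never revisit a node
            if neighbour == goal:
                # Stepping to the goal adds one edge: edges == len(path)
                lengths.append(len(path))
            else:
                stack.append((neighbour, path + [neighbour]))
    return sorted(lengths)


path_lengths = simple_path_lengths(graph, "Alice", "Carol")
print(path_lengths)  # [1, 2, 3]
```

Each length corresponds to one of the three bullets above: a direct edge, a path through Dave, and a longer path through Dave and Erin.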
Real-World Example
From the Giovanna Fossati connection extraction:
| Metric | Value |
|---|---|
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |
Explanation: The 219 duplicates were the same people appearing more than once, at different connection degrees. The final JSON has 776 connections (766 from the MD file + 10 from other sources).
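The figures in the table can be cross-checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are taken directly from the table.

```python
raw_lines = 985        # raw lines in the MD file
unique_names = 766     # unique names after deduplication
other_sources = 10     # connections added from other sources

duplicates = raw_lines - unique_names
duplicate_share = duplicates / raw_lines
final_connections = unique_names + other_sources

print(f"{duplicates} duplicates ({duplicate_share:.0%}), "
      f"{final_connections} final connections")
# 219 duplicates (22%), 776 final connections
```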
How to Handle Duplicates
During Parsing
def parse_connections_file(raw_entries: list[dict]) -> list[dict]:
    """Parse raw registered entries, deduplicating by name."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        # Normalize name for deduplication
        normalized_name = normalize_name(entry['name'])
        if normalized_name in names_seen:
            # Same person at different degree - skip duplicate
            continue
        names_seen.add(normalized_name)
        unique_connections.append(entry)
    return unique_connections
Name Normalization
import unicodedata

def normalize_name(name: str) -> str:
    """Normalize name for deduplication comparison."""
    # NFD decomposition splits accented letters into base letter + combining mark
    normalized = unicodedata.normalize('NFD', name)
    # Remove diacritics (Unicode category Mn = nonspacing combining marks)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase and strip surrounding whitespace
    return ascii_name.lower().strip()
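As a quick usage sketch, the normalization above maps spelling variants of the same name to one key. The sample names are invented for illustration, and the function is repeated so the snippet runs on its own.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Same logic as above: NFD-decompose, drop combining marks, lowercase, strip."""
    decomposed = unicodedata.normalize("NFD", name)
    without_marks = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    return without_marks.lower().strip()


# Invented sample names: three spellings of one person
samples = ["Zoë Müller", "zoe muller ", "ZOË MÜLLER"]
keys = {normalize_name(s) for s in samples}
print(keys)  # {'zoe muller'}

# Letters without a canonical decomposition are left unchanged
print(normalize_name("Søren"))  # søren
```

Two limits are worth keeping in mind: letters such as "ø" have no combining-mark decomposition and so are not folded to ASCII, and deduplicating by name alone will also merge two different people who share a name.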
Important Clarifications
What the degree Field DOES Mean
The degree field in registered connection data represents:
- 1st: The connection is directly connected to the viewer (Alice)
- 2nd: The connection is connected to one of the viewer's 1st-degree connections
- 3rd+: The connection is three or more steps from the viewer
What the degree Field Does NOT Mean
The degree field does NOT represent:
- The relationship between the target profile (Bob) and the connection
- Any meaningful network distance for analysis purposes
- A stable value (changes based on who is viewing)
Implications for Heritage Network Analysis
Limitation
Because the degree field is viewer-relative, we cannot determine:
- How closely connected two heritage professionals are
- The network centrality of a person within the heritage sector
- True 1st-degree connections of a target profile
What We CAN Determine
Despite this limitation, we can still:
- Identify heritage professionals - Names + headlines reveal sector affiliation
- Find cross-institutional connections - Same person in multiple custodian connection lists
- Map organizational networks - Which institutions' staff know each other
- Count sector presence - Percentage of connections in heritage sector
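Two of these analyses, cross-institutional connections and sector presence, can be sketched as follows. Everything here is an assumption for illustration: the custodian names, the sample entries other than Carol Smith, the `HERITAGE_KEYWORDS` list and both helper functions. A real keyword list would need to be curated for the sector.

```python
from collections import defaultdict

# Hypothetical data: custodian -> registered connection entries
connection_lists = {
    "Custodian A": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Dan Jones", "headline": "Software Engineer"},
    ],
    "Custodian B": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Eve Brown", "headline": "Museum Curator"},
    ],
}

# Assumed keyword stems, not a vetted vocabulary
HERITAGE_KEYWORDS = ("archiv", "museum", "curator", "heritage")


def cross_institutional(lists: dict) -> dict:
    """Map each name found in more than one custodian list to those custodians."""
    seen_in = defaultdict(set)
    for custodian, entries in lists.items():
        for entry in entries:
            seen_in[entry["name"].lower().strip()].add(custodian)
    return {name: sorted(c) for name, c in seen_in.items() if len(c) > 1}


def heritage_share(entries: list) -> float:
    """Fraction of entries whose headline contains a heritage keyword."""
    if not entries:
        return 0.0
    hits = sum(
        any(k in e["headline"].lower() for k in HERITAGE_KEYWORDS)
        for e in entries
    )
    return hits / len(entries)


shared = cross_institutional(connection_lists)
share_a = heritage_share(connection_lists["Custodian A"])
print(shared)   # {'carol smith': ['Custodian A', 'Custodian B']}
print(share_a)  # 0.5
```

Neither function reads the degree field, which is why these analyses remain valid despite the viewer-relative limitation.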
Documentation Trail
This constraint is documented in:
- AGENTS.md - Rule 17 (LinkedIn Connection Unique Identifiers)
- .opencode/LINKEDIN_CONNECTION_ID_RULE.md - Complete implementation rules
- This document - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)
Version History
| Date | Version | Changes |
|---|---|---|
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |