# Understanding Duplicates in LinkedIn Connection Data

## Summary

When manually registering LinkedIn connection data, **duplicates are expected and normal**. This document explains why they occur and how to handle them.

---

## The Core Constraint

**The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.**

This is a fundamental LinkedIn UI limitation that affects all manual connection registration.

---

## Why Duplicates Occur

### Scenario

```
Viewer: Alice (logged into LinkedIn, conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)
```

### What LinkedIn Shows

When Alice views Bob's connections and sees Carol:

| Field Displayed | Value Shown | What It Means |
|-----------------|-------------|---------------|
| `name` | "Carol Smith" | Carol's name |
| `degree` | "2nd" | **Carol is Alice's 2nd-degree connection** (NOT Bob's!) |
| `headline` | "Film Archivist" | Carol's current role |

### The Problem

Carol may appear **multiple times** in the registered data because connection degrees are determined by **graph traversal paths**, and multiple paths of different lengths can exist to the same node in a social network:

1. **Carol is Alice's 1st degree** - Alice is connected directly to Carol
2. **Carol is also Alice's 2nd degree** - Carol is also connected to Dave, who is Alice's 1st-degree connection
3. **Carol is also Alice's 3rd+ degree** - Carol is also reachable via a longer path through other connections

This is a fundamental property of social network graphs: **multiple paths of different lengths can lead to the same node**. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.

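
The graph property described above can be illustrated with a minimal sketch. The toy network below is hypothetical (Erin, Frank, and all edges are invented for illustration), and `simple_path_lengths` is an assumed helper name, not part of any project code. It only shows that one node can be reached over paths of several lengths. It does not model how LinkedIn itself chooses the degree it displays.

```python
# Hypothetical toy network; names and edges are illustrative only.
GRAPH = {
    "Alice": {"Carol", "Dave", "Erin"},
    "Bob": {"Carol"},
    "Carol": {"Alice", "Bob", "Dave", "Frank"},
    "Dave": {"Alice", "Carol"},
    "Erin": {"Alice", "Frank"},
    "Frank": {"Carol", "Erin"},
}


def simple_path_lengths(graph: dict, start: str, goal: str) -> list:
    """Return the sorted distinct lengths of all simple paths from start to goal."""
    lengths = set()
    # Each stack item: (current node, nodes already on this path, hops so far)
    stack = [(start, {start}, 0)]
    while stack:
        node, visited, hops = stack.pop()
        for neighbor in graph[node]:
            if neighbor == goal:
                # Reached the goal: record this path length, do not extend further
                lengths.add(hops + 1)
            elif neighbor not in visited:
                stack.append((neighbor, visited | {neighbor}, hops + 1))
    return sorted(lengths)


# Carol is reachable from the viewer (Alice) in 1, 2, and 3 hops:
# Alice-Carol, Alice-Dave-Carol, Alice-Erin-Frank-Carol
print(simple_path_lengths(GRAPH, "Alice", "Carol"))  # [1, 2, 3]
```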

---

## Real-World Example

From the Giovanna Fossati connection extraction:

| Metric | Value |
|--------|-------|
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |

**Explanation**: The 219 duplicates (985 raw lines minus 766 unique names) were the same people registered more than once at different connection degrees. The final JSON has 776 connections (766 from MD + 10 from other sources).

---

## How to Handle Duplicates

### During Parsing

```python
def parse_connections_file(raw_entries: list[dict]) -> list[dict]:
    """Parse raw registered entries, deduplicating by normalized name.

    The first occurrence of each person is kept; later ones are skipped.
    Relies on normalize_name(), defined under "Name Normalization".
    """
    names_seen: set[str] = set()
    unique_connections = []

    for entry in raw_entries:
        # Normalize the name so case and diacritic variants compare equal
        normalized_name = normalize_name(entry['name'])

        if normalized_name in names_seen:
            # Same person registered again at a different degree - skip duplicate
            continue

        names_seen.add(normalized_name)
        unique_connections.append(entry)

    return unique_connections
```

### Name Normalization

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize name for deduplication comparison."""
    # NFD decomposition splits accented characters into base letter + combining mark
    decomposed = unicodedata.normalize('NFD', name)
    # Remove diacritics by dropping combining marks (Unicode category 'Mn')
    without_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Lowercase and strip surrounding whitespace
    return without_marks.lower().strip()
```

---

## Important Clarifications

### What the `degree` Field DOES Mean

The `degree` field in registered connection data represents:

- **1st**: The connection is directly connected to the **viewer** (Alice)
- **2nd**: The connection is connected to someone the **viewer** knows
- **3rd+**: The connection is 3+ steps from the **viewer**

### What the `degree` Field Does NOT Mean

The `degree` field does NOT represent:

- The relationship between the **target profile** (Bob) and the connection
- Any meaningful network distance for analysis purposes
- A stable value (it changes based on who is viewing)

---

## Implications for Heritage Network Analysis

### Limitation

Because the `degree` field is viewer-relative, we cannot determine:

- How closely connected two heritage professionals are
- The network centrality of a person within the heritage sector
- True 1st-degree connections of a target profile

### What We CAN Determine

Despite this limitation, we can still:

1. **Identify heritage professionals** - Names + headlines reveal sector affiliation
2. **Find cross-institutional connections** - Same person in multiple custodian connection lists
3. **Map organizational networks** - Which institutions' staff know each other
4. **Count sector presence** - Percentage of connections in heritage sector

---

## Documentation Trail

This constraint is documented in:

1. **AGENTS.md** - Rule 17 (LinkedIn Connection Unique Identifiers)
2. **.opencode/LINKEDIN_CONNECTION_ID_RULE.md** - Complete implementation rules
3. **This document** - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)

---

## Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |