glam/docs/LINKEDIN_CONNECTION_DUPLICATES.md
2025-12-10 13:01:13 +01:00

Understanding Duplicates in LinkedIn Connection Data

Summary

When manually registering LinkedIn connection data, duplicates are expected and normal. This document explains why they occur and how to handle them.


The Core Constraint

The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.

This is a fundamental LinkedIn UI limitation that affects all manual connection registration.


Why Duplicates Occur

Scenario

Viewer:     Alice (logged into LinkedIn, conducting the search)
Target:     Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)

What LinkedIn Shows

When Alice views Bob's connections and sees Carol:

| Field Displayed | Value Shown | What It Means |
| --- | --- | --- |
| name | "Carol Smith" | Carol's name |
| degree | "2nd" | Carol is Alice's 2nd degree (NOT Bob's!) |
| headline | "Film Archivist" | Carol's current role |

The Problem

Carol may appear multiple times in the registered data because connection degrees are determined by graph traversal paths, and multiple paths of different lengths can exist to the same node in a social network:

  1. Carol is Alice's 1st degree - Alice connected directly to Carol
  2. Carol is also Alice's 2nd degree - Carol is also connected to Dave, who is Alice's 1st degree connection
  3. Carol is also Alice's 3rd+ degree - Carol is also reachable via a longer path through other connections

This is a fundamental property of social network graphs: multiple paths of different lengths can lead to the same node. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.
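The multiple-path property can be illustrated with a small sketch. The graph below is hypothetical (Erin and Frank are invented for the example, and the edges are assumptions, not registered data); it only shows that one person can be reached from the viewer over paths of length 1, 2, and 3 at the same time.

```python
from typing import Dict, List, Set

# Hypothetical undirected connection graph (illustrative names and edges only).
GRAPH: Dict[str, Set[str]] = {
    "Alice": {"Carol", "Dave", "Erin"},
    "Carol": {"Alice", "Dave", "Frank"},
    "Dave": {"Alice", "Carol"},
    "Erin": {"Alice", "Frank"},
    "Frank": {"Erin", "Carol"},
}


def simple_paths(graph: Dict[str, Set[str]], start: str, goal: str,
                 max_len: int = 3) -> List[List[str]]:
    """Return every path without repeated nodes from start to goal, up to max_len edges."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue  # do not extend paths beyond max_len edges
        for neighbour in sorted(graph[node]):
            if neighbour not in path:
                stack.append(path + [neighbour])
    return paths


# Distinct path lengths (in edges) from the viewer Alice to Carol.
path_lengths = sorted({len(p) - 1 for p in simple_paths(GRAPH, "Alice", "Carol")})
print(path_lengths)
```

In this toy graph Carol is reachable directly, via Dave, and via Erin and Frank, which mirrors the three cases listed above.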


Real-World Example

From the Giovanna Fossati connection extraction:

| Metric | Value |
| --- | --- |
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |

Explanation: The 219 duplicates (985 raw lines minus 766 unique names) were the same people registered more than once at different connection degrees. The final JSON has 776 connections (766 from the MD file + 10 from other sources).
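The figures in the table can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the `duplicate_stats` helper is a hypothetical illustration of how the same metrics could be derived from a list of already-normalized names.

```python
from collections import Counter
from typing import Dict, List


def duplicate_stats(names: List[str]) -> Dict[str, float]:
    """Count raw entries, unique names, and duplicates in a list of names."""
    counts = Counter(names)
    raw = len(names)
    unique = len(counts)
    duplicates = raw - unique
    share = duplicates / raw if raw else 0.0
    return {"raw": raw, "unique": unique, "duplicates": duplicates, "share": share}


# Figures taken from the table above.
RAW_LINES = 985
UNIQUE_NAMES = 766
FROM_OTHER_SOURCES = 10

duplicates = RAW_LINES - UNIQUE_NAMES             # entries dropped as duplicates
duplicate_share = duplicates / RAW_LINES          # fraction of raw lines
final_total = UNIQUE_NAMES + FROM_OTHER_SOURCES   # connections in the final JSON

print(duplicates, f"{duplicate_share:.0%}", final_total)

# Toy input for the helper (hypothetical names).
toy = duplicate_stats(["carol smith", "dave jones", "carol smith"])
```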


How to Handle Duplicates

During Parsing

def parse_connections_file(raw_entries: list[dict]) -> list[dict]:
    """Parse raw registered entries, deduplicating by normalized name.

    The first occurrence of each name is kept; later occurrences are dropped.
    """
    names_seen: set[str] = set()
    unique_connections: list[dict] = []

    for entry in raw_entries:
        # Normalize the name so spelling variants compare equal
        normalized_name = normalize_name(entry['name'])

        # Same person registered again (e.g. at a different degree): skip
        if normalized_name in names_seen:
            continue

        names_seen.add(normalized_name)
        unique_connections.append(entry)

    return unique_connections

Name Normalization

import unicodedata

def normalize_name(name: str) -> str:
    """Normalize name for deduplication comparison."""
    # NFD decomposition splits accented letters into base letter + combining mark
    decomposed = unicodedata.normalize('NFD', name)
    # Drop combining marks (category 'Mn') to remove diacritics
    without_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Lowercase and strip surrounding whitespace
    return without_marks.lower().strip()
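A short usage sketch may help show what this normalization does and does not catch. The entries below are hypothetical; the function body repeats the logic above so the example runs on its own.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Same logic as the function above, repeated so this sketch is self-contained."""
    decomposed = unicodedata.normalize("NFD", name)
    without_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return without_marks.lower().strip()


# Hypothetical registered entries: the first two are the same person.
entries = [
    {"name": "Renée Müller", "degree": "2nd"},
    {"name": "  renee muller ", "degree": "3rd+"},
    {"name": "Søren Berg", "degree": "2nd"},
]

seen = set()
unique = []
for entry in entries:
    key = normalize_name(entry["name"])
    if key not in seen:
        seen.add(key)
        unique.append(entry)

keys = [normalize_name(e["name"]) for e in unique]
print(keys)
```

Two limits are worth keeping in mind: letters such as "ø" have no Unicode decomposition and therefore survive unchanged, and repeated spaces inside a name are not collapsed. Matching on name alone also merges two different people who share a name.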

Important Clarifications

What the degree Field DOES Mean

The degree field in registered connection data represents:

  • 1st: The connection is directly connected to the viewer (Alice)
  • 2nd: The connection is connected to one of the viewer's 1st-degree connections
  • 3rd+: The connection is 3+ steps from the viewer

What the degree Field Does NOT Mean

The degree field does NOT represent:

  • The relationship between the target profile (Bob) and the connection
  • A network distance that is meaningful for analyzing the target's network
  • A stable value (changes based on who is viewing)
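The viewer-relative nature of the value can be sketched as follows. The graph is hypothetical, and the sketch assumes, purely for illustration, that a degree corresponds to the shortest-path distance from whoever is taken as the reference person; it makes no claim about how LinkedIn computes the displayed value.

```python
from collections import deque
from typing import Dict, Optional, Set

# Hypothetical undirected graph: Alice is the viewer, Bob is the target.
GRAPH: Dict[str, Set[str]] = {
    "Alice": {"Dave"},
    "Dave": {"Alice", "Carol"},
    "Bob": {"Carol"},
    "Carol": {"Bob", "Dave"},
}


def shortest_distance(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search: number of edges on a shortest path, or None if unreachable."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def degree_label(distance: int) -> str:
    """Map a distance of 1 or more to the labels used in the registered data."""
    return {1: "1st", 2: "2nd"}.get(distance, "3rd+")


seen_by_alice = degree_label(shortest_distance(GRAPH, "Alice", "Carol"))
seen_by_bob = degree_label(shortest_distance(GRAPH, "Bob", "Carol"))
print(seen_by_alice, seen_by_bob)
```

The same person gets a different label depending on the reference person, which is why a degree registered by Alice says nothing about the relationship between Bob and Carol.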

Implications for Heritage Network Analysis

Limitation

Because the degree field is viewer-relative, we cannot determine:

  • How closely connected two heritage professionals are
  • The network centrality of a person within the heritage sector
  • True 1st-degree connections of a target profile

What We CAN Determine

Despite this limitation, we can still:

  1. Identify heritage professionals - Names + headlines reveal sector affiliation
  2. Find cross-institutional connections - Same person in multiple custodian connection lists
  3. Map organizational networks - Which institutions' staff know each other
  4. Count sector presence - Percentage of connections in heritage sector
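Points 2 and 4 above can be sketched in a few lines. Everything here is an assumption made for illustration: the custodian names, the entries, the keyword list, and the helper functions are hypothetical, and keyword matching on headlines is only a rough stand-in for a real sector classification.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical deduplicated connection lists, keyed by custodian.
connection_lists: Dict[str, List[dict]] = {
    "Custodian A": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Dan Jones", "headline": "Software Engineer"},
    ],
    "Custodian B": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Eve Brown", "headline": "Museum Curator"},
    ],
}

# Assumed keyword list; a real project would need a curated vocabulary.
HERITAGE_KEYWORDS = ("archiv", "museum", "curator", "heritage", "conservator")


def is_heritage(headline: str) -> bool:
    """Rough check: does the headline contain any heritage-related keyword?"""
    lowered = headline.lower()
    return any(keyword in lowered for keyword in HERITAGE_KEYWORDS)


def cross_list_names(lists: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    """Return names that appear in more than one custodian's connection list."""
    owners = defaultdict(set)
    for custodian, entries in lists.items():
        for entry in entries:
            owners[entry["name"].lower().strip()].add(custodian)
    return {name: sorted(found) for name, found in owners.items() if len(found) > 1}


def heritage_share(entries: List[dict]) -> float:
    """Fraction of entries whose headline matches a heritage keyword."""
    if not entries:
        return 0.0
    return sum(is_heritage(e["headline"]) for e in entries) / len(entries)


shared = cross_list_names(connection_lists)
share_a = heritage_share(connection_lists["Custodian A"])
print(shared, share_a)
```

Neither helper reads the degree field, so both remain valid regardless of who registered the data.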

Documentation Trail

This constraint is documented in:

  1. AGENTS.md - Rule 17 (LinkedIn Connection Unique Identifiers)
  2. .opencode/LINKEDIN_CONNECTION_ID_RULE.md - Complete implementation rules
  3. This document - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)

Version History

| Date | Version | Changes |
| --- | --- | --- |
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |