glam/docs/LINKEDIN_CONNECTION_DUPLICATES.md
2025-12-10 13:01:13 +01:00

# Understanding Duplicates in LinkedIn Connection Data
## Summary
When manually registering LinkedIn connection data, **duplicates are expected and normal**. This document explains why they occur and how to handle them.
---
## The Core Constraint
**The connection degree (1st, 2nd, 3rd+) displayed in LinkedIn is ALWAYS relative to the VIEWER (the person logged in and conducting the search), NOT the target profile being analyzed.**
This is a fundamental LinkedIn UI limitation that affects all manual connection registration.
---
## Why Duplicates Occur
### Scenario
```
Viewer: Alice (logged into LinkedIn, conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol (appears in Bob's connection list)
```
### What LinkedIn Shows
When Alice views Bob's connections and sees Carol:
| Field Displayed | Value Shown | What It Means |
|-----------------|-------------|---------------|
| `name` | "Carol Smith" | Carol's name |
| `degree` | "2nd" | **Carol is Alice's 2nd-degree connection** (NOT Bob's!) |
| `headline` | "Film Archivist" | Carol's current role |
### The Problem
Carol may appear **multiple times** in the registered data because connection degrees are determined by **graph traversal paths**, and multiple paths of different lengths can exist to the same node in a social network:
1. **Carol is Alice's 1st degree** - Alice connected directly to Carol
2. **Carol is also Alice's 2nd degree** - Carol is also connected to Dave, who is Alice's 1st degree connection
3. **Carol is also Alice's 3rd+ degree** - Carol is also reachable via a longer path through other connections
This is a fundamental property of social network graphs: **multiple paths of different lengths can lead to the same node**. LinkedIn may display different degree values for the same person depending on which path it evaluates when rendering the connection list.
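The multiple-path property can be illustrated with a toy graph. The sketch below is only an illustration: the graph, the extra names (Erin, Frank), and the helper `simple_path_lengths` are hypothetical, and the code does not model how LinkedIn chooses the degree it displays.

```python
# Hypothetical undirected toy network; every edge is listed in both directions.
GRAPH = {
    "Alice": {"Carol", "Dave", "Erin"},
    "Carol": {"Alice", "Dave", "Frank"},
    "Dave": {"Alice", "Carol"},
    "Erin": {"Alice", "Frank"},
    "Frank": {"Erin", "Carol"},
}


def simple_path_lengths(graph: dict, start: str, goal: str) -> list:
    """Return the sorted lengths of all simple paths from start to goal."""
    lengths = set()
    # Each stack item: (current node, nodes already on this path, edges walked so far)
    stack = [(start, {start}, 0)]
    while stack:
        node, visited, depth = stack.pop()
        for neighbor in graph[node]:
            if neighbor == goal:
                lengths.add(depth + 1)
            elif neighbor not in visited:
                stack.append((neighbor, visited | {neighbor}, depth + 1))
    return sorted(lengths)


path_lengths = simple_path_lengths(GRAPH, "Alice", "Carol")
print(path_lengths)  # [1, 2, 3]
```

In this toy network Carol is reachable from Alice directly, via Dave, and via Erin and Frank, which mirrors the three cases listed above.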
---
## Real-World Example
From the Giovanna Fossati connection extraction:
| Metric | Value |
|--------|-------|
| Raw lines in MD file | 985 |
| Unique names (after deduplication) | 766 |
| Duplicate entries | 219 (22%) |
| Final JSON connections | 776 |
**Explanation**: The 219 duplicates were the same people registered more than once, for example at different connection degrees. The final JSON has 776 connections (766 unique names from the MD file + 10 from other sources).
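The figures in the table can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones reported above.

```python
raw_lines = 985        # raw lines in the MD file
unique_names = 766     # unique names after deduplication
other_sources = 10     # connections added from other sources

duplicates = raw_lines - unique_names
duplicate_share = duplicates / raw_lines
final_connections = unique_names + other_sources

print(duplicates)                 # 219
print(f"{duplicate_share:.1%}")   # 22.2%
print(final_connections)          # 776
```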
---
## How to Handle Duplicates
### During Parsing
```python
def parse_connections_file(raw_entries: list) -> list:
    """Parse raw registered entries, deduplicating by name."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        # Normalize name for deduplication
        normalized_name = normalize_name(entry['name'])
        if normalized_name in names_seen:
            # Same person at different degree - skip duplicate
            continue
        names_seen.add(normalized_name)
        unique_connections.append(entry)
    return unique_connections
```
### Name Normalization
```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize name for deduplication comparison."""
    # NFD decomposition splits accented letters into base letter + combining mark
    normalized = unicodedata.normalize('NFD', name)
    # Remove diacritics (combining marks have Unicode category 'Mn')
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase and strip surrounding whitespace
    return ascii_name.lower().strip()
```
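If the observed degree values are worth keeping, one possible variant merges duplicates instead of skipping them. This is a minimal sketch under assumptions: `merge_duplicates`, the private helper `_normalize`, and the `degrees_seen` field are hypothetical names, and the helper additionally collapses internal whitespace, which the normalization above does not do.

```python
import unicodedata


def _normalize(name: str) -> str:
    """Hypothetical helper: strip diacritics, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return " ".join(stripped.lower().split())


def merge_duplicates(raw_entries: list[dict]) -> list[dict]:
    """Keep the first entry per person and record every degree seen for them."""
    merged: dict[str, dict] = {}
    for entry in raw_entries:
        key = _normalize(entry["name"])
        if key not in merged:
            # Copy the first occurrence and attach an (assumed) degrees_seen list
            merged[key] = {**entry, "degrees_seen": []}
        degree = entry.get("degree")
        if degree is not None and degree not in merged[key]["degrees_seen"]:
            merged[key]["degrees_seen"].append(degree)
    return list(merged.values())


# Made-up sample entries for illustration only
sample = [
    {"name": "Carol Smith", "degree": "2nd", "headline": "Film Archivist"},
    {"name": "José Álvarez", "degree": "3rd+", "headline": "Curator"},
    {"name": "carol  smith ", "degree": "1st", "headline": "Film Archivist"},
    {"name": "Jose Alvarez", "degree": "3rd+", "headline": "Curator"},
]
result = merge_duplicates(sample)
print([(e["name"], e["degrees_seen"]) for e in result])
```

Because the degree is viewer-relative, the merged list is best treated as a record of what was observed, not as a network distance.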
---
## Important Clarifications
### What the `degree` Field DOES Mean
The `degree` field in registered connection data represents:
- **1st**: The connection is directly connected to the **viewer** (Alice)
- **2nd**: The connection is connected to someone the **viewer** knows
- **3rd+**: The connection is 3+ steps from the **viewer**
### What the `degree` Field Does NOT Mean
The `degree` field does NOT represent:
- The relationship between the **target profile** (Bob) and the connection
- Any meaningful network distance for analysis purposes
- A stable value (changes based on who is viewing)
---
## Implications for Heritage Network Analysis
### Limitation
Because the `degree` field is viewer-relative, we cannot determine:
- How closely connected two heritage professionals are
- The network centrality of a person within the heritage sector
- True 1st-degree connections of a target profile
### What We CAN Determine
Despite this limitation, we can still:
1. **Identify heritage professionals** - Names + headlines reveal sector affiliation
2. **Find cross-institutional connections** - Same person in multiple custodian connection lists
3. **Map organizational networks** - Which institutions' staff know each other
4. **Count sector presence** - Percentage of connections in heritage sector
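Items 2 and 4 above could be sketched as follows. Everything here is an assumption made for illustration: the function names, the keyword list used to guess sector affiliation from a headline, the plain lowercase name matching, and the sample data.

```python
from collections import defaultdict

# Assumed keyword stems for spotting heritage-sector headlines
HERITAGE_KEYWORDS = ("archiv", "museum", "curator", "heritage", "conservat")


def cross_list_people(lists_by_target: dict) -> dict:
    """Return people who appear in the connection lists of two or more targets."""
    seen_in = defaultdict(set)
    for target, connections in lists_by_target.items():
        for entry in connections:
            seen_in[entry["name"].strip().lower()].add(target)
    return {name: targets for name, targets in seen_in.items() if len(targets) > 1}


def heritage_share(connections: list) -> float:
    """Fraction of connections whose headline contains a heritage keyword."""
    if not connections:
        return 0.0
    hits = sum(
        1
        for entry in connections
        if any(k in entry.get("headline", "").lower() for k in HERITAGE_KEYWORDS)
    )
    return hits / len(connections)


# Made-up connection lists for two hypothetical target profiles
lists_by_target = {
    "Bob": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Eve Jones", "headline": "Software Engineer"},
    ],
    "Dana": [
        {"name": "Carol Smith", "headline": "Film Archivist"},
        {"name": "Frank Lee", "headline": "Museum Director"},
    ],
}
shared = cross_list_people(lists_by_target)
bob_share = heritage_share(lists_by_target["Bob"])
print(shared, bob_share)
```

Neither function reads the `degree` field, which is consistent with the limitation described above.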
---
## Documentation Trail
This constraint is documented in:
1. **AGENTS.md** - Rule 17 (LinkedIn Connection Unique Identifiers)
2. **.opencode/LINKEDIN_CONNECTION_ID_RULE.md** - Complete implementation rules
3. **This document** - Detailed explanation (docs/LINKEDIN_CONNECTION_DUPLICATES.md)
---
## Version History
| Date | Version | Changes |
|------|---------|---------|
| 2025-12-11 | 1.0 | Initial documentation |
| 2025-12-11 | 1.1 | Corrected: duplicates occur because same person appears at multiple degrees to viewer, not because of target list. Changed "scraped" to "manually registered" |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths, not scroll position/browse session |