LinkedIn Connection Unique Identifiers Rule
Summary
When parsing LinkedIn connections, EVERY connection MUST receive a unique connection_id, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").
This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.
Connection ID Format
{target_slug}_conn_{index:04d}_{name_slug}
Components:
- target_slug: LinkedIn slug of the profile owner (e.g., anne-gant-59908a18)
- conn: Literal string indicating this is a connection
- index: 4-digit zero-padded index (0000-9999)
- name_slug: Normalized name of the connection (see Name Slug Generation below)
Examples:
anne-gant-59908a18_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
elif-rongen-kaynakci-35295a17_conn_0003_tina_m_bastajian
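As a minimal sketch of how the format string above could be applied, the helper below assembles an ID from its three variable parts. The name `build_connection_id` and the range check are illustrative assumptions, not the project's actual `generate_connection_id`; `name_slug` is assumed to be already normalized.

```python
def build_connection_id(target_slug: str, index: int, name_slug: str) -> str:
    """Assemble {target_slug}_conn_{index:04d}_{name_slug} (illustrative helper)."""
    # The 4-digit field covers 0000-9999; rejecting other values is an assumption.
    if not 0 <= index <= 9999:
        raise ValueError(f"index out of range for 4-digit padding: {index}")
    return f"{target_slug}_conn_{index:04d}_{name_slug}"


print(build_connection_id("anne-gant-59908a18", 42, "amy_b"))
print(build_connection_id("giovannafossati", 156, "linkedin_member"))
```

Because the index is part of the ID, two connections whose names produce the same slug (for example two "LinkedIn Member" entries) still receive distinct identifiers, provided each gets its own index.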
Name Type Classification
Every connection MUST include a name_type field with one of three values:
| Type | Pattern | Example | name_type Value |
|---|---|---|---|
| Full Name | First + Last name | "John Smith" | full |
| Abbreviated | Contains single initial | "Amy B.", "S. Buse Yildirim", "Tina M. Bastajian" | abbreviated |
| Anonymous | Privacy-hidden profile | "LinkedIn Member" | anonymous |
Abbreviated Name Detection Patterns
A name is classified as abbreviated when it contains a single uppercase letter followed by a period:
# Detection patterns
"Amy B." # Last name abbreviated
"S. Buse Yildirim" # First name abbreviated
"Tina M. Bastajian" # Middle initial (also abbreviated)
"I. Can Koc" # Turkish first initial
"Elena K." # Last name abbreviated
"Misato E." # Last name abbreviated
Detection Regex:
import re

def is_abbreviated_name(name: str) -> bool:
    """Check if name contains abbreviated components (single letter + period)."""
    # Pattern: single letter followed by period (e.g., "Amy B." or "S. Buse")
    abbreviated_pattern = r'\b[A-Z]\.'
    return bool(re.search(abbreviated_pattern, name))
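The pattern `[A-Z]` only covers ASCII capitals, so an initial such as "İ." or "Ö." would not be detected. The following is a hedged sketch of a Unicode-aware variant; the name `has_unicode_initial` and the choice to NFC-normalize the input first are assumptions for illustration, not part of the existing script.

```python
import re
import unicodedata

# One letter (any script) not preceded by another word character, then a period.
_INITIAL = re.compile(r"(?<!\w)([^\W\d_])\.")


def has_unicode_initial(name: str) -> bool:
    """Return True if the name contains an uppercase single-letter initial."""
    # NFC composes a base letter and its combining mark into one code point where possible.
    composed = unicodedata.normalize("NFC", name)
    return any(m.group(1).isupper() for m in _INITIAL.finditer(composed))


for sample in ["Amy B.", "İ. Can Koç", "Ö. Yilmaz", "John Smith"]:
    print(sample, has_unicode_initial(sample))
```

The lookbehind keeps a trailing period after a full word ("Anna Maria.") from being read as an initial, which matches the intent of the `\b` in the ASCII pattern.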
Anonymous Name Detection
A name is classified as anonymous when it matches privacy-hidden patterns:
ANONYMOUS_PATTERNS = [
    "linkedin member",
    "member",  # Exact match only
]

def is_anonymous_name(name: str) -> bool:
    """Check if name indicates an anonymous/hidden profile."""
    normalized = name.lower().strip()
    # Exact match against the known privacy placeholders
    return normalized in ANONYMOUS_PATTERNS
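The two detectors can be combined into a single classifier that yields the name_type value. The sketch below is one possible arrangement; the function name `classify_name_type` and the order of the checks (anonymous first, then abbreviated, otherwise full) are assumptions rather than documented behavior of the parsing script.

```python
import re

ANONYMOUS_NAMES = {"linkedin member", "member"}
INITIAL_PATTERN = re.compile(r"\b[A-Z]\.")


def classify_name_type(name: str) -> str:
    """Return 'anonymous', 'abbreviated', or 'full' for a displayed name."""
    normalized = name.strip().lower()
    # Privacy placeholders are checked first so they never fall through to 'full'.
    if normalized in ANONYMOUS_NAMES:
        return "anonymous"
    # Any single capital letter followed by a period marks an abbreviation.
    if INITIAL_PATTERN.search(name):
        return "abbreviated"
    return "full"


for sample in ["John Smith", "Amy B.", "Tina M. Bastajian", "LinkedIn Member"]:
    print(sample, "->", classify_name_type(sample))
```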
Name Slug Generation Rules
The name_slug component is generated by normalizing the connection's name:
- Normalize unicode (NFD decomposition)
- Remove diacritics (e.g., é→e, ö→o, ñ→n, İ→i)
- Convert to lowercase
- Replace non-alphanumeric with underscores
- Collapse multiple underscores and strip leading/trailing underscores
- Truncate to 30 characters maximum
Implementation:
import unicodedata
import re

def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Generate a URL-safe slug from a name."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', name)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase
    lowercase = ascii_text.lower()
    # Replace non-alphanumeric with underscores
    slug = re.sub(r'[^a-z0-9]+', '_', lowercase)
    # Remove leading/trailing underscores and collapse multiple
    slug = re.sub(r'_+', '_', slug).strip('_')
    # Truncate
    return slug[:max_length]
Examples:
| Name | Slug |
|---|---|
| "Tina M. Bastajian" | tina_m_bastajian |
| "Elena K." | elena_k |
| "LinkedIn Member" | linkedin_member |
| "Geraldine Vooren" | geraldine_vooren |
| "Nusta Nina" | nusta_nina |
Required Fields per Connection
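Every connection entry in the JSON output MUST include the fields marked as required in the table below; the example shows a complete entry that also carries the optional fields: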
{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}
| Field | Type | Required | Description |
|---|---|---|---|
| connection_id | string | YES | Unique identifier |
| name | string | YES | Full name as displayed |
| name_type | string | YES | full, abbreviated, or anonymous |
| degree | string | YES | Connection degree: 1st, 2nd, 3rd+ |
| headline | string | NO | Current role/description |
| location | string | NO | Location as displayed |
| heritage_relevant | boolean | NO | Is this person in heritage sector? |
| heritage_type | string | NO | GLAMORCUBESFIXPHDNT code if heritage_relevant |
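The required-field rules in the table can be checked mechanically. The following is a minimal validation sketch; the function name `validate_connection`, the message wording, and the decision to treat empty strings as missing are assumptions for illustration and are not taken from the parsing script.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("connection_id", "name", "name_type", "degree")
NAME_TYPES = {"full", "abbreviated", "anonymous"}
DEGREES = {"1st", "2nd", "3rd+"}


def validate_connection(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    # A missing key and an empty value are both treated as missing (an assumption).
    problems = [f"missing required field: {field}"
                for field in REQUIRED_FIELDS if not entry.get(field)]
    # Enumerated fields are only checked when a value is present.
    if entry.get("name_type") and entry["name_type"] not in NAME_TYPES:
        problems.append(f"invalid name_type: {entry['name_type']}")
    if entry.get("degree") and entry["degree"] not in DEGREES:
        problems.append(f"invalid degree: {entry['degree']}")
    return problems


example = {
    "connection_id": "target-slug_conn_0042_amy_b",
    "name": "Amy B.",
    "name_type": "abbreviated",
    "degree": "2nd",
}
print(validate_connection(example))
```

Optional fields (headline, location, heritage_relevant, heritage_type) are deliberately not checked here.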
Implementation Script
Location: scripts/parse_linkedin_connections.py
Key Functions:
- is_abbreviated_name(name) - Detects single-letter initials
- is_anonymous_name(name) - Detects "LinkedIn Member" patterns
- generate_connection_id(name, index, target_slug) - Creates unique IDs
- parse_connections_file(input_path, target_slug) - Main parsing function
Usage:
python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"
Statistics from Real Extractions
| Person | Total | Full Names | Abbreviated | Anonymous |
|---|---|---|---|---|
| Elif Rongen-Kaynakci | 475 | 449 (94.5%) | 26 (5.5%) | 0 |
| Giovanna Fossati | 776 | 746 (96.1%) | 30 (3.9%) | 0 |
Typical abbreviated name rate: 3-6% of connections.
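Figures like those in the table can be derived from the parsed JSON by counting name_type values. The helper below is a small sketch of that calculation; the name `name_type_breakdown` and the one-decimal rounding are assumptions, and the sample data is synthetic, sized to match the first row of the table.

```python
from collections import Counter


def name_type_breakdown(connections):
    """Return {name_type: (count, percent)} for the three name types."""
    counts = Counter(c["name_type"] for c in connections)
    total = sum(counts.values())
    return {
        t: (counts[t], round(100 * counts[t] / total, 1) if total else 0.0)
        for t in ("full", "abbreviated", "anonymous")
    }


# Synthetic entries with the same totals as the Elif Rongen-Kaynakci row.
sample = [{"name_type": "full"}] * 449 + [{"name_type": "abbreviated"}] * 26
print(name_type_breakdown(sample))
```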
Understanding Duplicates in Raw Connection Data
🚨 CRITICAL: Duplicates in raw manually-registered connection data are expected and NOT a data quality issue.
Why Duplicates Occur
When manually registering LinkedIn connections, the same person can appear multiple times in the raw data because:
- Connection degree is relative to the VIEWER, not the target profile
- In a social network graph, multiple paths of different lengths can exist to the same node
- A single person can simultaneously be reachable as:
  - 1st degree - Direct connection to the viewer
  - 2nd degree - Also reachable via another 1st degree connection
  - 3rd+ degree - Also reachable via even longer paths
Example Scenario (Graph Theory)
Viewer: Alice (the person conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol
Carol may appear multiple times because multiple graph paths exist:
Alice -----> Carol (1st degree: direct edge)
Alice -----> Dave -----> Carol (2nd degree: path length 2)
Alice -----> Eve -----> Frank -----> Carol (3rd+: path length 3)
All three paths lead to the same node (Carol). LinkedIn may display
different degree values depending on which path it evaluates.
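The scenario above can be reproduced with a toy graph. The sketch below enumerates the lengths of all simple paths from the viewer to a connection; the adjacency data mirrors the diagram, while the function name `path_lengths` and the cap of three edges are illustrative assumptions. It only demonstrates that several path lengths coexist and makes no claim about how LinkedIn itself computes the displayed degree.

```python
from collections import deque

# Directed edges taken from the example diagram.
graph = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Dave": ["Carol"],
    "Eve": ["Frank"],
    "Frank": ["Carol"],
}


def path_lengths(graph, start, goal, max_len=3):
    """Return sorted edge counts of all simple paths from start to goal."""
    lengths = []
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            lengths.append(len(path) - 1)
            continue
        if len(path) - 1 >= max_len:
            continue  # do not extend paths beyond max_len edges
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                queue.append((neighbor, path + [neighbor]))
    return sorted(lengths)


print(path_lengths(graph, "Alice", "Carol"))  # → [1, 2, 3]
```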
Key Constraint
The degree field (1st, 2nd, 3rd+) reflects the relationship between:
- The viewer (person logged into LinkedIn) and the connection
- NOT the relationship between the target profile and the connection
This is a LinkedIn UI limitation - when browsing someone else's connections, LinkedIn shows YOUR connection degree to those people, not theirs.
Impact on Data Processing
When processing raw MD files manually registered from LinkedIn:
- Expect duplicates - Same name appearing 2-3 times is normal
- Deduplicate by name - Keep only unique names in final JSON
- Count unique names - Not total lines in raw file
- Document discrepancy - Raw file line count ≠ unique connections
Real-World Example
| File | Raw Lines | Unique Names | Duplicates | Explanation |
|---|---|---|---|---|
| Giovanna Fossati MD | 985 | 766 | 219 (22%) | Same people appearing at multiple degrees |
Processing Recommendation
# When parsing raw MD connections file
# (raw_entries and normalize_name are assumed to be provided by the parsing script):
names_seen = set()
unique_connections = []
for entry in raw_entries:
    normalized_name = normalize_name(entry['name'])
    if normalized_name not in names_seen:
        names_seen.add(normalized_name)
        unique_connections.append(entry)
    # else: skip duplicate (same person at different degree)
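To cover the "document discrepancy" step as well, the deduplication loop can return a small report alongside the unique entries. This is a hedged sketch: `normalize_name` here is a hypothetical stand-in (lowercase, collapsed whitespace) for whatever the script actually uses, and the report keys are illustrative.

```python
def normalize_name(name: str) -> str:
    """Hypothetical comparison key: lowercase with collapsed whitespace."""
    return " ".join(name.lower().split())


def deduplicate(raw_entries):
    """Keep the first entry per normalized name and report the discrepancy."""
    seen = set()
    unique = []
    for entry in raw_entries:
        key = normalize_name(entry["name"])
        if key in seen:
            continue  # same name seen earlier, presumably at a different degree
        seen.add(key)
        unique.append(entry)
    report = {
        "raw": len(raw_entries),
        "unique": len(unique),
        "duplicates": len(raw_entries) - len(unique),
    }
    return unique, report


entries = [
    {"name": "Amy B.", "degree": "1st"},
    {"name": "amy  b.", "degree": "2nd"},
    {"name": "John Smith", "degree": "2nd"},
]
unique, report = deduplicate(entries)
print(report)
```

Name-based deduplication also merges distinct people who happen to share a displayed name (for example two separate "LinkedIn Member" entries); whether such entries should be exempted is a project decision this sketch does not settle.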
Why This Rule Matters
- Deduplication: Same abbreviated name across different connection lists can be linked
- Privacy Respect: Preserves privacy while enabling analysis
- Complete Data: No connections are silently dropped
- Network Analysis: Enables heritage sector relationship mapping even with partial data
- Audit Trail: Every connection can be traced back to its source
Related Rules
- Rule 15: Connection Data Registration - Full Network Preservation (AGENTS.md)
- Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages (AGENTS.md)
- EXA LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
- Person Data Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md
Version History
| Date | Version | Changes |
|---|---|---|
| 2025-12-10 | 1.0 | Initial rule creation |
| 2025-12-11 | 1.1 | Added duplicate explanation section |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths to same node, not scroll position |