glam/.opencode/LINKEDIN_CONNECTION_ID_RULE.md
2025-12-10 13:01:13 +01:00

LinkedIn Connection Unique Identifiers Rule

Summary

When parsing LinkedIn connections, EVERY connection MUST receive a unique connection_id, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").

This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.


Connection ID Format

{target_slug}_conn_{index:04d}_{name_slug}

Components:

  • target_slug: LinkedIn slug of the profile owner (e.g., anne-gant-59908a18)
  • conn: Literal string indicating this is a connection
  • index: 4-digit zero-padded index (0000-9999)
  • name_slug: Normalized name of the connection (see Name Slug Generation below)

Examples:

anne-gant-59908a18_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
elif-rongen-kaynakci-35295a17_conn_0003_tina_m_bastajian
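The format above can be sketched as a small helper. This is a minimal illustration, not the code in scripts/parse_linkedin_connections.py: the function signature follows the one listed under Key Functions, the slug logic mirrors the steps under Name Slug Generation Rules, and the range check on the index is an assumption added here because the format reserves exactly four digits.

```python
import re
import unicodedata


def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Normalize a display name into a lowercase, underscore-separated slug."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    slug = re.sub(r"[^a-z0-9]+", "_", stripped.lower()).strip("_")
    return slug[:max_length].rstrip("_")


def generate_connection_id(name: str, index: int, target_slug: str) -> str:
    """Build {target_slug}_conn_{index:04d}_{name_slug}."""
    # Assumption: reject indexes that would overflow the 4-digit field
    if not 0 <= index <= 9999:
        raise ValueError(f"index {index} does not fit the 4-digit field")
    return f"{target_slug}_conn_{index:04d}_{generate_name_slug(name)}"


print(generate_connection_id("Amy B.", 42, "anne-gant-59908a18"))
# anne-gant-59908a18_conn_0042_amy_b
```

Because the index is part of the ID, two connections with the same name_slug (for example, several "LinkedIn Member" entries) still receive distinct identifiers.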

Name Type Classification

Every connection MUST include a name_type field with one of three values:

| Type | Pattern | Example | name_type Value |
| --- | --- | --- | --- |
| Full Name | First + Last name | "John Smith" | full |
| Abbreviated | Contains single initial | "Amy B.", "S. Buse Yildirim", "Tina M. Bastajian" | abbreviated |
| Anonymous | Privacy-hidden profile | "LinkedIn Member" | anonymous |

Abbreviated Name Detection Patterns

A name is classified as abbreviated when it contains a single capital letter, standing on its own, followed by a period:

# Detection patterns
"Amy B."           # Last name abbreviated
"S. Buse Yildirim" # First name abbreviated  
"Tina M. Bastajian" # Middle initial (also abbreviated)
"I. Can Koc"       # Turkish first initial
"Elena K."         # Last name abbreviated
"Misato E."        # Last name abbreviated

Detection Regex:

import re

def is_abbreviated_name(name: str) -> bool:
    """Check if name contains abbreviated components (single letter + period)."""
    # Pattern: single letter followed by period (e.g., "Amy B." or "S. Buse")
    abbreviated_pattern = r'\b[A-Z]\.'
    return bool(re.search(abbreviated_pattern, name))
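One limitation worth noting: the character class [A-Z] only covers ASCII capitals, so an initial that carries a diacritic would not be detected. The sketch below shows a possible Unicode-aware variant. The function name is hypothetical, the sample name is made up for illustration, and nothing here is confirmed to be part of the actual script.

```python
import re
import unicodedata

# One Unicode letter (a word character that is neither digit nor underscore)
# at a word boundary, immediately followed by a period.
_LETTER_DOT = re.compile(r"\b([^\W\d_])\.")


def is_abbreviated_name_unicode(name: str) -> bool:
    """Hypothetical Unicode-aware variant of is_abbreviated_name."""
    # NFC keeps an accented initial as one code point so the pattern can match it
    composed = unicodedata.normalize("NFC", name)
    # Only count uppercase letters, matching the intent of the ASCII pattern
    return any(m.group(1).isupper() for m in _LETTER_DOT.finditer(composed))


sample = "Ö. Demir"  # made-up name with a non-ASCII initial
print(bool(re.search(r"\b[A-Z]\.", sample)))  # ASCII-only pattern
print(is_abbreviated_name_unicode(sample))    # Unicode-aware variant
```

Both variants still flag any standalone capital followed by a period, so a trailing title such as "Ph.D." would also be classified as abbreviated.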

Anonymous Name Detection

A name is classified as anonymous when it matches privacy-hidden patterns:

ANONYMOUS_PATTERNS = [
    "linkedin member",
    "member",  # Exact match only
]

def is_anonymous_name(name: str) -> bool:
    """Check if name indicates an anonymous/hidden profile."""
    # Compare the whole lowercased, trimmed name; substrings do not count
    normalized = name.lower().strip()
    return normalized in ANONYMOUS_PATTERNS
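The two detectors can be combined to fill the name_type field. The sketch below is one way to do it; the function name classify_name_type and the order of the checks (anonymous first, then abbreviated, otherwise full) are assumptions, since the rule does not state a precedence.

```python
import re

ANONYMOUS_NAMES = {"linkedin member", "member"}
INITIAL_PATTERN = re.compile(r"\b[A-Z]\.")


def classify_name_type(name: str) -> str:
    """Return 'anonymous', 'abbreviated', or 'full' for a displayed name."""
    # Assumption: the anonymous check runs first so placeholders never fall through
    if name.strip().lower() in ANONYMOUS_NAMES:
        return "anonymous"
    if INITIAL_PATTERN.search(name):
        return "abbreviated"
    return "full"


for sample in ("John Smith", "Amy B.", "S. Buse Yildirim", "LinkedIn Member"):
    print(f"{sample!r} -> {classify_name_type(sample)}")
```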

Name Slug Generation Rules

The name_slug component is generated by normalizing the connection's name:

  1. Normalize unicode (NFD decomposition)
  2. Remove diacritics (e.g., é→e, ö→o, ñ→n, İ→i)
  3. Convert to lowercase
  4. Replace non-alphanumeric with underscores
  5. Collapse multiple underscores and strip leading/trailing underscores
  6. Truncate to 30 characters maximum

Implementation:

import unicodedata
import re

def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Generate a URL-safe slug from a name."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', name)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase
    lowercase = ascii_text.lower()
    # Replace non-alphanumeric with underscores
    slug = re.sub(r'[^a-z0-9]+', '_', lowercase)
    # Remove leading/trailing underscores and collapse multiple
    slug = re.sub(r'_+', '_', slug).strip('_')
    # Truncate, then drop any underscore left dangling at the cut point
    return slug[:max_length].rstrip('_')

Examples:

| Name | Slug |
| --- | --- |
| "Tina M. Bastajian" | tina_m_bastajian |
| "Elena K." | elena_k |
| "LinkedIn Member" | linkedin_member |
| "Geraldine Vooren" | geraldine_vooren |
| "Nusta Nina" | nusta_nina |
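A caveat on the diacritic step: NFD only separates letters that have a combining-mark decomposition. Letters such as ø, ł, ß, or the Turkish dotless ı have none, so the non-alphanumeric replacement turns them into underscores. The sketch below shows one possible workaround using a small fallback table applied before normalization. The table, the function name, and the sample name are all hypothetical and are not confirmed to be part of the actual script.

```python
import re
import unicodedata

# Hypothetical fallback map for letters that NFD leaves intact; extend as needed
FALLBACKS = str.maketrans({
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
    "ı": "i",
})


def generate_name_slug_with_fallbacks(name: str, max_length: int = 30) -> str:
    """Slug generation with a transliteration pass before NFD stripping."""
    mapped = name.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFD", mapped)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    slug = re.sub(r"[^a-z0-9]+", "_", stripped.lower()).strip("_")
    return slug[:max_length].rstrip("_")


print(generate_name_slug_with_fallbacks("Søren Łukasz"))  # made-up name
```

Changing the slug logic changes the IDs it produces, so a table like this would need to be in place before any IDs are published.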

Required Fields per Connection

Every connection entry in the JSON output MUST include:

{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| connection_id | string | YES | Unique identifier |
| name | string | YES | Name exactly as displayed (may be abbreviated or anonymous) |
| name_type | string | YES | full, abbreviated, or anonymous |
| degree | string | YES | Connection degree: 1st, 2nd, 3rd+ |
| headline | string | NO | Current role/description |
| location | string | NO | Location as displayed |
| heritage_relevant | boolean | NO | Is this person in heritage sector? |
| heritage_type | string | NO | GLAMORCUBESFIXPHDNT code if heritage_relevant |
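The required-field rules lend themselves to a quick check before writing the JSON output. The following is a minimal sketch; the function name validate_connection and the message wording are hypothetical, and only the four required fields and the two enumerated value sets stated above are checked.

```python
REQUIRED_FIELDS = ("connection_id", "name", "name_type", "degree")
NAME_TYPES = {"full", "abbreviated", "anonymous"}
DEGREES = {"1st", "2nd", "3rd+"}


def validate_connection(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not entry.get(field)
    ]
    name_type = entry.get("name_type")
    if name_type and name_type not in NAME_TYPES:
        problems.append(f"invalid name_type: {name_type!r}")
    degree = entry.get("degree")
    if degree and degree not in DEGREES:
        problems.append(f"invalid degree: {degree!r}")
    return problems


entry = {
    "connection_id": "target-slug_conn_0042_amy_b",
    "name": "Amy B.",
    "name_type": "abbreviated",
    "degree": "2nd",
}
print(validate_connection(entry))  # []
```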

Implementation Script

Location: scripts/parse_linkedin_connections.py

Key Functions:

  • is_abbreviated_name(name) - Detects single-letter initials
  • is_anonymous_name(name) - Detects "LinkedIn Member" patterns
  • generate_connection_id(name, index, target_slug) - Creates unique IDs
  • parse_connections_file(input_path, target_slug) - Main parsing function

Usage:

python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"

Statistics from Real Extractions

| Person | Total | Full Names | Abbreviated | Anonymous |
| --- | --- | --- | --- | --- |
| Elif Rongen-Kaynakci | 475 | 449 (94.5%) | 26 (5.5%) | 0 |
| Giovanna Fossati | 776 | 746 (96.1%) | 30 (3.9%) | 0 |

Typical abbreviated name rate: 3-6% of connections.
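Figures like these can be recomputed from any parsed JSON output by counting the name_type field. The sketch below assumes the connections are available as a list of dicts with a name_type key; the function name name_type_stats is hypothetical, and the demo list is synthetic, built only to match the counts in the first table row.

```python
from collections import Counter


def name_type_stats(connections: list) -> dict:
    """Map each name_type to a (count, percentage) pair."""
    counts = Counter(c["name_type"] for c in connections)
    total = sum(counts.values())
    return {
        name_type: (
            counts[name_type],
            round(100 * counts[name_type] / total, 1) if total else 0.0,
        )
        for name_type in ("full", "abbreviated", "anonymous")
    }


# Synthetic entries matching the 449 / 26 / 0 split reported above
demo = [{"name_type": "full"}] * 449 + [{"name_type": "abbreviated"}] * 26
print(name_type_stats(demo))
# {'full': (449, 94.5), 'abbreviated': (26, 5.5), 'anonymous': (0, 0.0)}
```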


Understanding Duplicates in Raw Connection Data

🚨 CRITICAL: Duplicates in raw manually-registered connection data are expected and NOT a data quality issue.

Why Duplicates Occur

When manually registering LinkedIn connections, the same person can appear multiple times in the raw data because:

  1. Connection degree is relative to the VIEWER, not the target profile
  2. In a social network graph, multiple paths of different lengths can exist to the same node
  3. A single person can simultaneously be reachable as:
    • 1st degree - Direct connection to the viewer
    • 2nd degree - Also reachable via another 1st degree connection
    • 3rd+ degree - Also reachable via even longer paths

Example Scenario (Graph Theory)

Viewer: Alice (the person conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol

Carol may appear multiple times because multiple graph paths exist:

    Alice -----> Carol           (1st degree: direct edge)
    Alice -----> Dave -----> Carol    (2nd degree: path length 2)
    Alice -----> Eve -----> Frank -----> Carol  (3rd+: path length 3)

All three paths lead to the same node (Carol). LinkedIn may display
different degree values depending on which path it evaluates.
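The scenario can be reproduced on a toy graph. The sketch below enumerates every simple path from the viewer to a connection and reports the path lengths; the adjacency list encodes only the three paths drawn above, and the function name path_lengths is hypothetical.

```python
def path_lengths(graph: dict, start: str, goal: str) -> list:
    """Return the sorted lengths of all simple paths from start to goal."""
    lengths = []

    def walk(node: str, visited: frozenset, depth: int) -> None:
        if node == goal:
            lengths.append(depth)
            return
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:  # never revisit a node on the same path
                walk(neighbor, visited | {neighbor}, depth + 1)

    walk(start, frozenset({start}), 0)
    return sorted(lengths)


# Edges taken from the three paths in the diagram
graph = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Dave": ["Carol"],
    "Eve": ["Frank"],
    "Frank": ["Carol"],
}
print(path_lengths(graph, "Alice", "Carol"))  # [1, 2, 3]
```

The three lengths correspond to the 1st, 2nd, and 3rd+ degree labels under which Carol can show up in the raw data.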

Key Constraint

The degree field (1st, 2nd, 3rd+) reflects the relationship between:

  • The viewer (person logged into LinkedIn) and the connection
  • NOT the relationship between the target profile and the connection

This is a LinkedIn UI limitation - when browsing someone else's connections, LinkedIn shows YOUR connection degree to those people, not theirs.

Impact on Data Processing

When processing raw MD files manually registered from LinkedIn:

  1. Expect duplicates - Same name appearing 2-3 times is normal
  2. Deduplicate by name - Keep only unique names in final JSON
  3. Count unique names - Not total lines in raw file
  4. Document discrepancy - Raw file line count ≠ unique connections

Real-World Example

| File | Raw Lines | Unique Names | Duplicates | Explanation |
| --- | --- | --- | --- | --- |
| Giovanna Fossati MD | 985 | 766 | 219 (22%) | Same people appearing at multiple degrees |

Processing Recommendation

# When parsing raw MD connections file:
names_seen = set()
unique_connections = []

for entry in raw_entries:
    normalized_name = normalize_name(entry['name'])
    if normalized_name not in names_seen:
        names_seen.add(normalized_name)
        unique_connections.append(entry)
    # else: skip duplicate (same person at different degree)
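The snippet above relies on a normalize_name helper and a raw_entries list that are not defined in this document. A self-contained version might look like the sketch below, where normalize_name is a hypothetical implementation (NFC normalization, whitespace collapsing, case folding) and the returned counts cover the "document discrepancy" step.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical normalizer: NFC, collapse whitespace, case-fold."""
    composed = unicodedata.normalize("NFC", name)
    return re.sub(r"\s+", " ", composed).strip().casefold()


def deduplicate(raw_entries: list) -> tuple:
    """Keep the first entry per normalized name and report the counts."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        key = normalize_name(entry["name"])
        if key not in names_seen:
            names_seen.add(key)
            unique_connections.append(entry)
    report = {
        "raw": len(raw_entries),
        "unique": len(unique_connections),
        "duplicates": len(raw_entries) - len(unique_connections),
    }
    return unique_connections, report


raw = [
    {"name": "Carol", "degree": "1st"},
    {"name": "Dave", "degree": "1st"},
    {"name": "carol ", "degree": "2nd"},
    {"name": "Carol", "degree": "3rd+"},
]
unique, report = deduplicate(raw)
print(report)  # {'raw': 4, 'unique': 2, 'duplicates': 2}
```

Two consequences of this approach are worth keeping in mind. The degree retained is whichever one appears first in the raw file. Also, every "LinkedIn Member" entry shares the same name, so plain name-based deduplication collapses them into one; this document does not say whether anonymous entries should be exempt, so that choice is left open here.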

Why This Rule Matters

  1. Deduplication: Same abbreviated name across different connection lists can be linked
  2. Privacy Respect: Preserves privacy while enabling analysis
  3. Complete Data: No connections are silently dropped
  4. Network Analysis: Enables heritage sector relationship mapping even with partial data
  5. Audit Trail: Every connection can be traced back to its source

Related Rules

  • Rule 15: Connection Data Registration - Full Network Preservation (AGENTS.md)
  • Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages (AGENTS.md)
  • EXA LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
  • Person Data Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md

Version History

| Date | Version | Changes |
| --- | --- | --- |
| 2025-12-10 | 1.0 | Initial rule creation |
| 2025-12-11 | 1.1 | Added duplicate explanation section |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths to same node, not scroll position |