kempersc 41959f0766 correct HCID!

2025-12-10 13:01:13 +01:00

LinkedIn Connection Unique Identifiers Rule

Summary

When parsing LinkedIn connections, EVERY connection MUST receive a unique connection_id, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").

This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.

Connection ID Format

{target_slug}_conn_{index:04d}_{name_slug}

Components:

target_slug: LinkedIn slug of the profile owner (e.g., anne-gant-59908a18)
conn: Literal string indicating this is a connection
index: 4-digit zero-padded index (0000-9999)
name_slug: Normalized name of the connection (see Name Slug Generation below)

Examples:

anne-gant-59908a18_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
elif-rongen-kaynakci-35295a17_conn_0003_tina_m_bastajian
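The format can be assembled with a short helper. The following is a minimal sketch, not the project's actual `generate_connection_id`: it assumes the argument order listed under Key Functions, inlines a private `_name_slug` helper that follows the Name Slug Generation rules, and (as a further assumption) rejects indexes outside the 4-digit range.

```python
import re
import unicodedata


def _name_slug(name: str, max_length: int = 30) -> str:
    """Slug helper following the Name Slug Generation rules (sketch)."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    slug = re.sub(r"[^a-z0-9]+", "_", no_marks.lower()).strip("_")
    return slug[:max_length].rstrip("_")


def generate_connection_id(name: str, index: int, target_slug: str) -> str:
    """Build {target_slug}_conn_{index:04d}_{name_slug}."""
    if not 0 <= index <= 9999:
        # Assumption: out-of-range indexes are rejected rather than widened.
        raise ValueError(f"index out of range 0000-9999: {index}")
    return f"{target_slug}_conn_{index:04d}_{_name_slug(name)}"


print(generate_connection_id("Amy B.", 42, "anne-gant-59908a18"))
print(generate_connection_id("LinkedIn Member", 156, "giovannafossati"))
```

Because the index is part of the identifier, two entries that share a name slug (for example two "LinkedIn Member" rows) still receive distinct IDs.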

Name Type Classification

Every connection MUST include a name_type field with one of three values:

Type	Pattern	Example	`name_type` Value
Full Name	First + Last name	"John Smith"	`full`
Abbreviated	Contains single initial	"Amy B.", "S. Buse Yildirim", "Tina M. Bastajian"	`abbreviated`
Anonymous	Privacy-hidden profile	"LinkedIn Member"	`anonymous`

Abbreviated Name Detection Patterns

A name is classified as abbreviated when it contains a single letter followed by a period:

# Detection patterns
"Amy B."           # Last name abbreviated
"S. Buse Yildirim" # First name abbreviated  
"Tina M. Bastajian" # Middle initial (also abbreviated)
"I. Can Koc"       # Turkish first initial
"Elena K."         # Last name abbreviated
"Misato E."        # Last name abbreviated

Detection Regex:

import re

def is_abbreviated_name(name: str) -> bool:
    """Check if name contains abbreviated components (single letter + period)."""
    # Pattern: single letter followed by period (e.g., "Amy B." or "S. Buse")
    abbreviated_pattern = r'\b[A-Z]\.'
    return bool(re.search(abbreviated_pattern, name))
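Note that the character class `[A-Z]` covers ASCII capitals only, so an initial such as "Ö." would not be detected. If such initials occur in the data, a Unicode-aware variant along these lines may help. This is a sketch under that assumption; the name `is_abbreviated_name_unicode` and the sample "Ö. Demir" are hypothetical and not taken from the script or the data.

```python
import re

# One letter of any script, not preceded by a word character, followed by a period.
_INITIAL_PATTERN = re.compile(r"(?<!\w)([^\W\d_])\.")


def is_abbreviated_name_unicode(name: str) -> bool:
    """Hypothetical variant: detect uppercase initials beyond ASCII."""
    return any(m.group(1).isupper() for m in _INITIAL_PATTERN.finditer(name))


for sample in ["Amy B.", "S. Buse Yildirim", "\u00d6. Demir", "John Smith"]:
    print(sample, is_abbreviated_name_unicode(sample))
```

The uppercase check is done in Python rather than in the pattern because the standard `re` module has no Unicode "uppercase letter" class.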

Anonymous Name Detection

A name is classified as anonymous when it matches privacy-hidden patterns:

ANONYMOUS_PATTERNS = [
    "linkedin member",
    "member",  # Exact match only
]

def is_anonymous_name(name: str) -> bool:
    """Check if name indicates an anonymous/hidden profile."""
    normalized = name.lower().strip()
    # "linkedin member" is already listed in ANONYMOUS_PATTERNS, so one membership test suffices
    return normalized in ANONYMOUS_PATTERNS
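Combining the two detectors yields the `name_type` value. The sketch below assumes the anonymous check runs first and that any name that is neither anonymous nor abbreviated counts as `full`; `classify_name_type` is an illustrative name, not one of the functions listed for the script.

```python
import re

ANONYMOUS_PATTERNS = {"linkedin member", "member"}


def is_anonymous_name(name: str) -> bool:
    """Exact match against the privacy-hidden placeholders."""
    return name.lower().strip() in ANONYMOUS_PATTERNS


def is_abbreviated_name(name: str) -> bool:
    """Single ASCII capital followed by a period, as in the rule above."""
    return bool(re.search(r"\b[A-Z]\.", name))


def classify_name_type(name: str) -> str:
    """Return 'anonymous', 'abbreviated', or 'full' (illustrative helper)."""
    if is_anonymous_name(name):
        return "anonymous"
    if is_abbreviated_name(name):
        return "abbreviated"
    return "full"


for sample in ["John Smith", "Amy B.", "LinkedIn Member"]:
    print(sample, "->", classify_name_type(sample))
```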

Name Slug Generation Rules

The name_slug component is generated by normalizing the connection's name:

1. Normalize Unicode (NFD decomposition)
2. Remove diacritics (e.g., é→e, ö→o, ñ→n, İ→i)
3. Convert to lowercase
4. Replace non-alphanumeric characters with underscores
5. Collapse multiple underscores
6. Truncate to 30 characters maximum

Implementation:

import unicodedata
import re

def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Generate a URL-safe slug from a name."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', name)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase
    lowercase = ascii_text.lower()
    # Replace each run of non-alphanumeric characters with a single underscore
    slug = re.sub(r'[^a-z0-9]+', '_', lowercase)
    # Remove leading/trailing underscores
    slug = slug.strip('_')
    # Truncate, then strip again in case the cut lands on an underscore
    return slug[:max_length].rstrip('_')

Examples:

Name	Slug
"Tina M. Bastajian"	`tina_m_bastajian`
"Elena K."	`elena_k`
"LinkedIn Member"	`linkedin_member`
"Geraldine Vooren"	`geraldine_vooren`
"Nusta Nina"	`nusta_nina`
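To see how each rule contributes, the pipeline can be traced step by step. The helper `slug_steps` below is illustrative only and the inputs "Géraldine" and "Yılmaz" are made-up test values. It also surfaces one limitation worth knowing: letters with no canonical decomposition, such as the dotless "ı", are not mapped to an ASCII letter and end up as underscores.

```python
import re
import unicodedata


def slug_steps(name: str, max_length: int = 30) -> dict:
    """Return intermediate values of the slug pipeline (illustrative)."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    lowered = stripped.lower()
    underscored = re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")
    return {
        "stripped": stripped,
        "lowered": lowered,
        "slug": underscored[:max_length].rstrip("_"),
    }


print(slug_steps("Tina M. Bastajian")["slug"])
print(slug_steps("G\u00e9raldine")["stripped"])  # combining accent removed
print(slug_steps("Y\u0131lmaz")["slug"])         # dotless i becomes "_"
```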

Required Fields per Connection

Every connection entry in the JSON output MUST include:

{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}

Field	Type	Required	Description
`connection_id`	string	YES	Unique identifier
`name`	string	YES	Name as displayed (may be abbreviated or anonymous)
`name_type`	string	YES	`full`, `abbreviated`, or `anonymous`
`degree`	string	YES	Connection degree: `1st`, `2nd`, `3rd+`
`headline`	string	NO	Current role/description
`location`	string	NO	Location as displayed
`heritage_relevant`	boolean	NO	Is this person in heritage sector?
`heritage_type`	string	NO	GLAMORCUBESFIXPHDNT code if heritage_relevant
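A small validator can check entries against the field table before they are written out. This is a sketch, not part of the script: `validate_connection` is a hypothetical name, and the check that `heritage_type` requires `heritage_relevant` is an assumption drawn from the table's description.

```python
REQUIRED_FIELDS = ("connection_id", "name", "name_type", "degree")
NAME_TYPES = {"full", "abbreviated", "anonymous"}
DEGREES = {"1st", "2nd", "3rd+"}


def validate_connection(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not entry.get(field)
    ]
    if entry.get("name_type") and entry["name_type"] not in NAME_TYPES:
        problems.append(f"invalid name_type: {entry['name_type']}")
    if entry.get("degree") and entry["degree"] not in DEGREES:
        problems.append(f"invalid degree: {entry['degree']}")
    if entry.get("heritage_type") and not entry.get("heritage_relevant"):
        # Assumption: heritage_type is only meaningful when heritage_relevant is true.
        problems.append("heritage_type set but heritage_relevant is not true")
    return problems


sample = {
    "connection_id": "target-slug_conn_0042_amy_b",
    "name": "Amy B.",
    "name_type": "abbreviated",
    "degree": "2nd",
}
print(validate_connection(sample))
```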

Implementation Script

Location: scripts/parse_linkedin_connections.py

Key Functions:

is_abbreviated_name(name) - Detects single-letter initials
is_anonymous_name(name) - Detects "LinkedIn Member" patterns
generate_connection_id(name, index, target_slug) - Creates unique IDs
parse_connections_file(input_path, target_slug) - Main parsing function

Usage:

python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"

Statistics from Real Extractions

Person	Total	Full Names	Abbreviated	Anonymous
Elif Rongen-Kaynakci	475	449 (94.5%)	26 (5.5%)	0
Giovanna Fossati	776	746 (96.1%)	30 (3.9%)	0

Typical abbreviated name rate: 3-6% of connections.
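The percentages in the table follow directly from the `name_type` values. As a rough sketch (the function name `name_type_shares` is an assumption), they can be recomputed from a parsed list like this:

```python
from collections import Counter


def name_type_shares(name_types: list[str]) -> dict[str, tuple[int, float]]:
    """Map each name_type to (count, percentage of total rounded to 0.1)."""
    counts = Counter(name_types)
    total = len(name_types)
    return {
        kind: (counts[kind], round(100 * counts[kind] / total, 1) if total else 0.0)
        for kind in ("full", "abbreviated", "anonymous")
    }


# Counts taken from the Elif Rongen-Kaynakci row above.
shares = name_type_shares(["full"] * 449 + ["abbreviated"] * 26)
print(shares)
```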

Understanding Duplicates in Raw Connection Data

🚨 CRITICAL: Duplicates in raw manually-registered connection data are expected and NOT a data quality issue.

Why Duplicates Occur

When manually registering LinkedIn connections, the same person can appear multiple times in the raw data because:

Connection degree is relative to the VIEWER, not the target profile
In a social network graph, multiple paths of different lengths can exist to the same node
A single person can simultaneously be reachable as:
- 1st degree - Direct connection to the viewer
- 2nd degree - Also reachable via another 1st degree connection
- 3rd+ degree - Also reachable via even longer paths

Example Scenario (Graph Theory)

Viewer: Alice (the person conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol

Carol may appear multiple times because multiple graph paths exist:

    Alice -----> Carol           (1st degree: direct edge)
    Alice -----> Dave -----> Carol    (2nd degree: path length 2)
    Alice -----> Eve -----> Frank -----> Carol  (3rd+: path length 3)

All three paths lead to the same node (Carol). LinkedIn may display
different degree values depending on which path it evaluates.
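The scenario can be reproduced with a few lines of graph code. The sketch below enumerates every cycle-free path from Alice to Carol using the edges from the diagram; `simple_paths` is an illustrative helper and says nothing about how LinkedIn itself computes degrees.

```python
def simple_paths(graph: dict, start: str, goal: str, path=None):
    """Yield every cycle-free path from start to goal (depth-first)."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for neighbour in graph.get(start, []):
        if neighbour not in path:
            yield from simple_paths(graph, neighbour, goal, path)


# Edges copied from the diagram above.
graph = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Dave": ["Carol"],
    "Eve": ["Frank"],
    "Frank": ["Carol"],
}

for p in simple_paths(graph, "Alice", "Carol"):
    print(len(p) - 1, " -> ".join(p))
```

Each printed line is one route to the same node, with path lengths 1, 2, and 3.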

Key Constraint

The degree field (1st, 2nd, 3rd+) reflects the relationship between:

The viewer (person logged into LinkedIn) and the connection
NOT the relationship between the target profile and the connection

This is a LinkedIn UI limitation - when browsing someone else's connections, LinkedIn shows YOUR connection degree to those people, not theirs.

Impact on Data Processing

When processing raw MD files manually registered from LinkedIn:

Expect duplicates - Same name appearing 2-3 times is normal
Deduplicate by name - Keep only unique names in final JSON
Count unique names - Not total lines in raw file
Document discrepancy - Raw file line count ≠ unique connections

Real-World Example

File	Raw Lines	Unique Names	Duplicates	Explanation
Giovanna Fossati MD	985	766	219 (22%)	Same people appearing at multiple degrees
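The figures in this row are simple arithmetic on the raw and unique counts. A hypothetical helper such as `duplicate_report` could record the discrepancy alongside the output:

```python
def duplicate_report(raw_count: int, unique_count: int) -> dict:
    """Summarize the gap between raw lines and unique names."""
    duplicates = raw_count - unique_count
    rate = round(100 * duplicates / raw_count, 1) if raw_count else 0.0
    return {
        "raw": raw_count,
        "unique": unique_count,
        "duplicates": duplicates,
        "duplicate_pct": rate,
    }


report = duplicate_report(985, 766)
print(report)
```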

Processing Recommendation

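# When parsing raw MD connections file:
def normalize_name(name: str) -> str:
    """Assumed helper: trimmed, whitespace-collapsed, case-insensitive key."""
    return ' '.join(name.split()).casefold()

def deduplicate_connections(raw_entries: list[dict]) -> list[dict]:
    """Keep the first entry seen for each normalized name."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        normalized_name = normalize_name(entry['name'])
        if normalized_name not in names_seen:
            names_seen.add(normalized_name)
            unique_connections.append(entry)
        # else: skip duplicate (same person at different degree)
    return unique_connections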
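Name-based deduplication treats every "LinkedIn Member" entry as the same person, which sits uneasily with the requirement that each anonymous entry receives its own connection_id. The source does not say how to reconcile the two. One possible approach, offered purely as an assumption, is to exempt anonymous names from deduplication; the helper names here are hypothetical.

```python
ANONYMOUS_NAMES = {"linkedin member", "member"}


def normalize_name(name: str) -> str:
    """Assumed comparison key: trimmed, whitespace-collapsed, case-insensitive."""
    return " ".join(name.split()).casefold()


def deduplicate_keep_anonymous(raw_entries: list[dict]) -> list[dict]:
    """Drop repeated names but keep every anonymous entry (assumed policy)."""
    names_seen = set()
    unique_connections = []
    for entry in raw_entries:
        key = normalize_name(entry["name"])
        if key in ANONYMOUS_NAMES:
            # Hidden profiles cannot be told apart, so none are discarded.
            unique_connections.append(entry)
            continue
        if key not in names_seen:
            names_seen.add(key)
            unique_connections.append(entry)
    return unique_connections


raw = [
    {"name": "Carol Example", "degree": "1st"},
    {"name": "Carol Example", "degree": "2nd"},
    {"name": "LinkedIn Member", "degree": "3rd+"},
    {"name": "LinkedIn Member", "degree": "3rd+"},
]
print(len(deduplicate_keep_anonymous(raw)))
```

The trade-off is that a hidden profile listed twice would be counted twice, so the anonymous count should be read as an upper bound.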

Why This Rule Matters

Deduplication: Same abbreviated name across different connection lists can be linked
Privacy Respect: Preserves privacy while enabling analysis
Complete Data: No connections are silently dropped
Network Analysis: Enables heritage sector relationship mapping even with partial data
Audit Trail: Every connection can be traced back to its source

Related Rules

Rule 15: Connection Data Registration - Full Network Preservation (AGENTS.md)
Rule 16: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages (AGENTS.md)
EXA LinkedIn Extraction: .opencode/EXA_LINKEDIN_EXTRACTION_RULES.md
Person Data Reference Pattern: .opencode/PERSON_DATA_REFERENCE_PATTERN.md

Version History

Date	Version	Changes
2025-12-10	1.0	Initial rule creation
2025-12-11	1.1	Added duplicate explanation section
2025-12-11	1.2	Fixed explanation: duplicates due to graph-theoretical multiple paths to same node, not scroll position
