# LinkedIn Connection Unique Identifiers Rule

## Summary

When parsing LinkedIn connections, **EVERY connection MUST receive a unique `connection_id`**, including abbreviated names (e.g., "Amy B.") and anonymous entries (e.g., "LinkedIn Member").

This rule ensures complete data preservation for heritage sector network analysis, even when LinkedIn privacy settings obscure full names.

---
## Connection ID Format

```
{target_slug}_conn_{index:04d}_{name_slug}
```

**Components**:
- `target_slug`: LinkedIn slug of the profile owner (e.g., `anne-gant-59908a18`)
- `conn`: Literal string indicating this is a connection
- `index`: 4-digit zero-padded index (0000-9999)
- `name_slug`: Normalized name of the connection (see Name Slug Generation below)

**Examples**:
```
anne-gant-59908a18_conn_0042_amy_b
giovannafossati_conn_0156_linkedin_member
elif-rongen-kaynakci-35295a17_conn_0003_tina_m_bastajian
```
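Because `target_slug` may contain hyphens and `name_slug` may contain underscores, splitting an ID on `_` alone is ambiguous. The following is a minimal sketch of reading an ID back into its components by anchoring on the literal `_conn_` marker and the 4-digit index. The names `CONNECTION_ID_RE`, `ConnectionIdParts`, and `parse_connection_id` are hypothetical and are not part of the documented script.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern mirroring {target_slug}_conn_{index:04d}_{name_slug}.
# The non-greedy target group stops at the first "_conn_NNNN_" marker.
CONNECTION_ID_RE = re.compile(
    r'^(?P<target_slug>.+?)_conn_(?P<index>\d{4})_(?P<name_slug>[a-z0-9_]*)$'
)


class ConnectionIdParts(NamedTuple):
    target_slug: str
    index: int
    name_slug: str


def parse_connection_id(connection_id: str) -> Optional[ConnectionIdParts]:
    """Split a connection_id into its parts, or return None if it does not fit the format."""
    match = CONNECTION_ID_RE.match(connection_id)
    if match is None:
        return None
    return ConnectionIdParts(
        target_slug=match['target_slug'],
        index=int(match['index']),
        name_slug=match['name_slug'],
    )


parts = parse_connection_id("anne-gant-59908a18_conn_0042_amy_b")
```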
---

## Name Type Classification

Every connection MUST include a `name_type` field with one of three values:

| Type | Pattern | Example | `name_type` Value |
|------|---------|---------|-------------------|
| **Full Name** | First + Last name | "John Smith" | `full` |
| **Abbreviated** | Contains single initial | "Amy B.", "S. Buse Yildirim", "Tina M. Bastajian" | `abbreviated` |
| **Anonymous** | Privacy-hidden profile | "LinkedIn Member" | `anonymous` |

---
## Abbreviated Name Detection Patterns

A name is classified as `abbreviated` when it contains a single letter followed by a period:

```python
# Detection patterns
"Amy B."              # Last name abbreviated
"S. Buse Yildirim"    # First name abbreviated
"Tina M. Bastajian"   # Middle initial (also abbreviated)
"I. Can Koc"          # Turkish first initial
"Elena K."            # Last name abbreviated
"Misato E."           # Last name abbreviated
```

**Detection Regex**:
```python
import re

def is_abbreviated_name(name: str) -> bool:
    """Check if name contains abbreviated components (single letter + period)."""
    # Pattern: single letter followed by period (e.g., "Amy B." or "S. Buse")
    abbreviated_pattern = r'\b[A-Z]\.'
    return bool(re.search(abbreviated_pattern, name))
```
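The character class `[A-Z]` only covers ASCII capitals, so an initial such as "Ö." or "İ." would not be detected by the pattern above. The sketch below shows one possible Unicode-aware variant. The function name `is_abbreviated_name_unicode` is hypothetical, and whether the real script needs this behavior is an assumption. It matches any single letter at a word boundary followed by a period, then keeps only uppercase matches, so suffixes like "Jr." are still ignored.

```python
import re
import unicodedata

# Any single Unicode letter (word character that is not a digit or underscore)
# at a word boundary, immediately followed by a period.
_INITIAL_RE = re.compile(r'\b([^\W\d_])\.')


def is_abbreviated_name_unicode(name: str) -> bool:
    """Hypothetical variant of is_abbreviated_name that also accepts non-ASCII initials."""
    # NFC keeps letters such as "İ" as one code point so the single-letter match applies.
    composed = unicodedata.normalize('NFC', name)
    return any(m.group(1).isupper() for m in _INITIAL_RE.finditer(composed))


results = {
    name: is_abbreviated_name_unicode(name)
    for name in ["Amy B.", "Ö. Yılmaz", "John Smith", "Martin Luther King Jr."]
}
```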
---

## Anonymous Name Detection

A name is classified as `anonymous` when, after lowercasing and trimming whitespace, it exactly matches one of the privacy placeholder strings:

```python
ANONYMOUS_PATTERNS = [
    "linkedin member",
    "member",  # Exact match only
]

def is_anonymous_name(name: str) -> bool:
    """Check if name indicates an anonymous/hidden profile."""
    normalized = name.lower().strip()
    # Exact match against the list, which already contains "linkedin member"
    return normalized in ANONYMOUS_PATTERNS
```
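The two detectors can be combined into a single classifier that produces the `name_type` value. This is a minimal sketch: the function name `classify_name_type` and the order of the checks (anonymous first, then abbreviated, otherwise full) are assumptions, not taken from the script. With the patterns documented here the two checks cannot both match, so the order mainly matters if more placeholder strings are added later.

```python
import re

# Same placeholder strings and regex as documented above, repeated so the sketch runs on its own.
ANONYMOUS_PATTERNS = {"linkedin member", "member"}
ABBREVIATED_RE = re.compile(r'\b[A-Z]\.')


def classify_name_type(name: str) -> str:
    """Return 'anonymous', 'abbreviated', or 'full' for a displayed name."""
    if name.lower().strip() in ANONYMOUS_PATTERNS:
        return "anonymous"
    if ABBREVIATED_RE.search(name):
        return "abbreviated"
    return "full"


name_types = {
    name: classify_name_type(name)
    for name in ["LinkedIn Member", "Amy B.", "Tina M. Bastajian", "John Smith"]
}
```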
---

## Name Slug Generation Rules

The `name_slug` component is generated by normalizing the connection's name:

1. **Normalize unicode** (NFD decomposition)
2. **Remove diacritics** (e.g., é→e, ö→o, ñ→n, İ→i)
3. **Convert to lowercase**
4. **Replace non-alphanumeric** with underscores
5. **Collapse multiple underscores**
6. **Truncate to 30 characters** maximum

**Implementation**:
```python
import unicodedata
import re

def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Generate a URL-safe slug from a name."""
    # NFD decomposition separates base characters from combining marks
    normalized = unicodedata.normalize('NFD', name)
    # Remove combining marks (category 'Mn' = Mark, Nonspacing)
    ascii_text = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    # Lowercase
    lowercase = ascii_text.lower()
    # Replace each run of non-alphanumeric characters with one underscore
    slug = re.sub(r'[^a-z0-9]+', '_', lowercase)
    # Remove leading/trailing underscores and collapse multiple
    slug = re.sub(r'_+', '_', slug).strip('_')
    # Truncate, then drop any underscore left dangling at the cut
    return slug[:max_length].rstrip('_')
```

**Examples**:
| Name | Slug |
|------|------|
| "Tina M. Bastajian" | `tina_m_bastajian` |
| "Elena K." | `elena_k` |
| "LinkedIn Member" | `linkedin_member` |
| "Geraldine Vooren" | `geraldine_vooren` |
| "Nusta Nina" | `nusta_nina` |
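
The slug rules and the ID format can be tied together as follows. This is a sketch of how `generate_connection_id(name, index, target_slug)` might compose the documented format. The body is an assumption (only the signature appears in this rule), and the range check that raises `ValueError` is a hypothetical safeguard for the 4-digit index. Uniqueness within one target comes from `index`, so two connections with the same displayed name still receive different IDs.

```python
import re
import unicodedata


def generate_name_slug(name: str, max_length: int = 30) -> str:
    """Compact restatement of the slug rules above, so the sketch is self-contained."""
    decomposed = unicodedata.normalize('NFD', name)
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    slug = re.sub(r'[^a-z0-9]+', '_', stripped.lower()).strip('_')
    return slug[:max_length].rstrip('_')


def generate_connection_id(name: str, index: int, target_slug: str) -> str:
    """Hypothetical composition of {target_slug}_conn_{index:04d}_{name_slug}."""
    if not 0 <= index <= 9999:
        # Assumed safeguard: the documented index is 4 digits wide (0000-9999).
        raise ValueError(f"index out of range for 4-digit format: {index}")
    return f"{target_slug}_conn_{index:04d}_{generate_name_slug(name)}"


example_ids = [
    generate_connection_id("Amy B.", 42, "anne-gant-59908a18"),
    generate_connection_id("LinkedIn Member", 156, "giovannafossati"),
    generate_connection_id("Tina M. Bastajian", 3, "elif-rongen-kaynakci-35295a17"),
]
```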

---

## Required Fields per Connection

Every connection entry in the JSON output MUST include:

```json
{
  "connection_id": "target-slug_conn_0042_amy_b",
  "name": "Amy B.",
  "name_type": "abbreviated",
  "degree": "2nd",
  "headline": "Film Archivist at EYE Filmmuseum",
  "location": "Amsterdam, Netherlands",
  "heritage_relevant": true,
  "heritage_type": "A"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `connection_id` | string | **YES** | Unique identifier |
| `name` | string | **YES** | Name exactly as displayed (may be abbreviated or anonymous) |
| `name_type` | string | **YES** | `full`, `abbreviated`, or `anonymous` |
| `degree` | string | **YES** | Connection degree: `1st`, `2nd`, `3rd+` |
| `headline` | string | NO | Current role/description |
| `location` | string | NO | Location as displayed |
| `heritage_relevant` | boolean | NO | Is this person in heritage sector? |
| `heritage_type` | string | NO | GLAMORCUBESFIXPHDNT code (e.g., `A`), set when `heritage_relevant` is true |
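
A validation pass over the output can catch entries that violate the table above. The sketch below is illustrative only: `validate_connection` and its error messages are hypothetical, and it checks only what the table states (the four required string fields, the allowed `name_type` and `degree` values, and the boolean type of `heritage_relevant`).

```python
from typing import Any

REQUIRED_FIELDS = ("connection_id", "name", "name_type", "degree")
NAME_TYPES = {"full", "abbreviated", "anonymous"}
DEGREES = {"1st", "2nd", "3rd+"}


def validate_connection(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one connection entry (empty list means valid)."""
    errors: list[str] = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty required field: {field}")
    name_type = entry.get("name_type")
    if isinstance(name_type, str) and name_type.strip() and name_type not in NAME_TYPES:
        errors.append(f"invalid name_type: {name_type!r}")
    degree = entry.get("degree")
    if isinstance(degree, str) and degree.strip() and degree not in DEGREES:
        errors.append(f"invalid degree: {degree!r}")
    if "heritage_relevant" in entry and not isinstance(entry["heritage_relevant"], bool):
        errors.append("heritage_relevant must be a boolean")
    return errors


sample = {
    "connection_id": "target-slug_conn_0042_amy_b",
    "name": "Amy B.",
    "name_type": "abbreviated",
    "degree": "2nd",
    "heritage_relevant": True,
    "heritage_type": "A",
}
sample_errors = validate_connection(sample)
```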

---

## Implementation Script

**Location**: `scripts/parse_linkedin_connections.py`

**Key Functions**:
- `is_abbreviated_name(name)` - Detects single-letter initials
- `is_anonymous_name(name)` - Detects "LinkedIn Member" patterns
- `generate_connection_id(name, index, target_slug)` - Creates unique IDs
- `parse_connections_file(input_path, target_slug)` - Main parsing function

**Usage**:
```bash
python scripts/parse_linkedin_connections.py \
    data/custodian/person/manual_register/{slug}_connections_{timestamp}.md \
    data/custodian/person/{slug}_connections_{timestamp}.json \
    --target-name "Full Name" \
    --target-slug "linkedin-slug"
```
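For readers who want to see how a command line of this shape could be declared, here is a hedged `argparse` sketch. It is not the script's actual code: the function name `build_arg_parser`, the positional argument names `input_path` and `output_path`, and the choice to mark both options as required are assumptions. Only the two option names come from the usage example above.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical parser matching the usage example: two paths plus two options."""
    parser = argparse.ArgumentParser(
        description="Parse a manually registered LinkedIn connections file into JSON."
    )
    parser.add_argument("input_path", help="Raw connections file (.md)")
    parser.add_argument("output_path", help="Destination JSON file")
    parser.add_argument("--target-name", required=True, help="Full name of the profile owner")
    parser.add_argument("--target-slug", required=True, help="LinkedIn slug of the profile owner")
    return parser


args = build_arg_parser().parse_args(
    ["in.md", "out.json", "--target-name", "Full Name", "--target-slug", "linkedin-slug"]
)
```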
---

## Statistics from Real Extractions

| Person | Total | Full Names | Abbreviated | Anonymous |
|--------|-------|------------|-------------|-----------|
| Elif Rongen-Kaynakci | 475 | 449 (94.5%) | 26 (5.5%) | 0 |
| Giovanna Fossati | 776 | 746 (96.1%) | 30 (3.9%) | 0 |

Typical abbreviated name rate: **3-6%** of connections.

---
## Understanding Duplicates in Raw Connection Data

**🚨 CRITICAL: Duplicates in raw manually-registered connection data are expected and NOT a data quality issue.**

### Why Duplicates Occur

When manually registering LinkedIn connections, the **same person can appear multiple times** in the raw data because:

1. **Connection degree is relative to the VIEWER, not the target profile**
2. In a social network graph, **multiple paths of different lengths can exist to the same node**
3. A single person can simultaneously be reachable as:
   - **1st degree** - Direct connection to the viewer
   - **2nd degree** - Also reachable via another 1st degree connection
   - **3rd+ degree** - Also reachable via even longer paths

### Example Scenario (Graph Theory)

```
Viewer: Alice (the person conducting the search)
Target: Bob (the profile whose connections are being analyzed)
Connection: Carol

Carol may appear multiple times because multiple graph paths exist:

  Alice -----> Carol                          (1st degree: direct edge)
  Alice -----> Dave -----> Carol              (2nd degree: path length 2)
  Alice -----> Eve -----> Frank -----> Carol  (3rd+: path length 3)

All three paths lead to the same node (Carol). LinkedIn may display
different degree values depending on which path it evaluates.
```
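The scenario above can be reproduced on a toy graph. The sketch below enumerates every simple path from the viewer to Carol and maps each path length to a degree label. The graph literal, `simple_paths`, and `degree_label` are illustrative names introduced here, and the code models only the path-counting argument, not how LinkedIn itself computes degrees.

```python
from typing import Dict, List, Optional

# Directed edges copied from the scenario. Bob (the target) is omitted because
# he does not lie on any of the three viewer-to-Carol paths shown.
GRAPH: Dict[str, List[str]] = {
    "Alice": ["Carol", "Dave", "Eve"],
    "Dave": ["Carol"],
    "Eve": ["Frank"],
    "Frank": ["Carol"],
}


def simple_paths(graph: Dict[str, List[str]], start: str, goal: str,
                 path: Optional[List[str]] = None) -> List[List[str]]:
    """Depth-first enumeration of all paths from start to goal that never revisit a node."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found: List[List[str]] = []
    for neighbour in graph.get(start, []):
        if neighbour not in path:
            found.extend(simple_paths(graph, neighbour, goal, path))
    return found


def degree_label(path_length: int) -> str:
    """Map a path length (number of edges) to the degree labels used in this rule."""
    return {1: "1st", 2: "2nd"}.get(path_length, "3rd+")


paths = simple_paths(GRAPH, "Alice", "Carol")
labels = [degree_label(len(p) - 1) for p in paths]
for label, p in zip(labels, paths):
    print(label, " -> ".join(p))
```

One node, three paths, three different labels: this is why the same name can legitimately show up with several `degree` values in a raw file.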
### Key Constraint

**The `degree` field (1st, 2nd, 3rd+) reflects the relationship between:**
- The **viewer** (person logged into LinkedIn) and the **connection**
- **NOT** the relationship between the target profile and the connection

This is a LinkedIn UI limitation: when browsing someone else's connections, LinkedIn shows YOUR connection degree to those people, not theirs.

### Impact on Data Processing

When processing raw MD files manually registered from LinkedIn:

1. **Expect duplicates** - Same name appearing 2-3 times is normal
2. **Deduplicate by name** - Keep only unique names in final JSON
3. **Count unique names** - Not total lines in raw file
4. **Document discrepancy** - Raw file line count ≠ unique connections

### Real-World Example

| File | Raw Lines | Unique Names | Duplicates | Explanation |
|------|-----------|--------------|------------|-------------|
| Giovanna Fossati MD | 985 | 766 | 219 (22%) | Same people appearing at multiple degrees |

### Processing Recommendation

```python
def normalize_name(name: str) -> str:
    """Assumed helper (the real script may normalize differently): case- and whitespace-insensitive key."""
    return " ".join(name.lower().split())


def deduplicate_connections(raw_entries: list[dict]) -> list[dict]:
    """Keep the first occurrence of each name when parsing a raw MD connections file."""
    names_seen = set()
    unique_connections = []

    for entry in raw_entries:
        normalized_name = normalize_name(entry['name'])
        if normalized_name not in names_seen:
            names_seen.add(normalized_name)
            unique_connections.append(entry)
        # else: skip duplicate (same person at different degree)
        # Note: entries sharing one displayed name (e.g. several "LinkedIn Member" rows) also collapse here.

    return unique_connections
```
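To document the discrepancy between raw lines and unique names, the counts can be summarized with a few lines of arithmetic. This is a sketch: `summarize_duplicates` and its output keys are hypothetical names. Applied to the Giovanna Fossati figures from the table above (985 raw lines, 766 unique names), it yields 219 duplicates, which is 22.2% of the raw lines.

```python
def summarize_duplicates(raw_count: int, unique_count: int) -> dict:
    """Report how many raw lines were duplicates, as a count and as a percentage of raw lines."""
    if unique_count > raw_count:
        raise ValueError("unique_count cannot exceed raw_count")
    duplicates = raw_count - unique_count
    # Guard against an empty raw file to avoid dividing by zero.
    rate = round(100 * duplicates / raw_count, 1) if raw_count else 0.0
    return {
        "raw_lines": raw_count,
        "unique_names": unique_count,
        "duplicates": duplicates,
        "duplicate_rate_pct": rate,
    }


summary = summarize_duplicates(985, 766)
```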
---

## Why This Rule Matters

1. **Deduplication**: Same abbreviated name across different connection lists can be linked
2. **Privacy Respect**: Preserves privacy while enabling analysis
3. **Complete Data**: No connections are silently dropped
4. **Network Analysis**: Enables heritage sector relationship mapping even with partial data
5. **Audit Trail**: Every connection can be traced back to its source

---

## Related Rules

- **Rule 15**: Connection Data Registration - Full Network Preservation (`AGENTS.md`)
- **Rule 16**: LinkedIn Photo URLs - Store CDN URLs, Not Overlay Pages (`AGENTS.md`)
- **EXA LinkedIn Extraction**: `.opencode/EXA_LINKEDIN_EXTRACTION_RULES.md`
- **Person Data Reference Pattern**: `.opencode/PERSON_DATA_REFERENCE_PATTERN.md`

---

## Version History

| Date | Version | Changes |
|------|---------|---------|
| 2025-12-10 | 1.0 | Initial rule creation |
| 2025-12-11 | 1.1 | Added duplicate explanation section |
| 2025-12-11 | 1.2 | Fixed explanation: duplicates due to graph-theoretical multiple paths to same node, not scroll position |