Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs
When extracting LinkedIn profile data where the profile is privacy-restricted (showing as "LinkedIn Member" or incorrectly showing the logged-in user's name), the name CAN be reliably derived from the LinkedIn slug if it contains hyphens.
The Problem
When saving LinkedIn HTML pages while logged in, privacy-restricted profiles may incorrectly capture the logged-in user's name instead of the actual profile owner's name. This creates "name contamination" where dozens of profiles have the wrong name.
Example contamination:
- File: willem-blok-b6a46648_20251211T000000Z.json
- Incorrect name: "Simon Kemper" (the logged-in user)
- Correct name: "Willem Blok" (derived from slug)
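One way to spot this pattern in existing files is to look for a single name stored under many different slugs. The sketch below assumes entity files are named `<slug>_<timestamp>.json`, as in the example above. The helper names, the `records` mapping, and the threshold of 3 are illustrative assumptions, not part of the scripts listed in this rule.

```python
from pathlib import Path


def slug_from_filename(filename: str) -> str:
    """Return the slug part of '<slug>_<timestamp>.json'."""
    stem = Path(filename).stem                  # drop the '.json' extension
    slug, _sep, _timestamp = stem.rpartition("_")
    return slug or stem                         # no underscore: whole stem is the slug


def suspicious_names(records: dict[str, str], threshold: int = 3) -> dict[str, int]:
    """Map each name stored under at least `threshold` distinct slugs to that slug count.

    `records` maps filename -> stored name (hypothetical input shape).
    """
    slugs_per_name: dict[str, set[str]] = {}
    for filename, name in records.items():
        slugs_per_name.setdefault(name, set()).add(slug_from_filename(filename))
    return {
        name: len(slugs)
        for name, slugs in slugs_per_name.items()
        if len(slugs) >= threshold
    }


# Made-up records for illustration; only the first filename comes from this rule.
records = {
    "willem-blok-b6a46648_20251211T000000Z.json": "Simon Kemper",
    "charlotte-van-beek-55370314_20251211T000000Z.json": "Simon Kemper",
    "jan-van-den-borre-3657211b3_20251211T000000Z.json": "Simon Kemper",
    "miriam-h-38b500b2_20251211T000000Z.json": "Miriam H",
}
print(suspicious_names(records))  # {'Simon Kemper': 3}
```

A flagged name is only a candidate: two different people can share a name, so each hit still needs a check against its slug.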
Key Principle: Slug-to-Name Derivation is NOT Fabrication
Deriving names from LinkedIn slugs is ALLOWED because:
- By default, LinkedIn generates slugs from the name the user entered on their profile
- The transformation is deterministic: the same slug always yields the same name
- This is data transformation, not data fabrication
Per Rule 21 (Data Fabrication Prohibition): Fabricating data is strictly prohibited. However, deriving names from existing data (the slug) is a reliable transformation, not fabrication.
Slug Types and Handling
1. Hyphenated Slugs (Reliable - CAN derive name)
Slugs with hyphens between name parts can be reliably converted to names:
| Slug | Derived Name |
|---|---|
| willem-blok-b6a46648 | Willem Blok |
| dave-van-den-nieuwenhof-4446b3146 | Dave van den Nieuwenhof |
| charlotte-van-beek-55370314 | Charlotte van Beek |
| jan-van-den-borre-3657211b3 | Jan van den Borre |
| josée-lunsingh-scheurleer-van-den-berg-00765415 | Josée Lunsingh Scheurleer van den Berg |
Algorithm:
- URL-decode the slug (e.g., %C3%AB → ë)
- Remove the trailing ID suffix (6+ hex characters or 5+ decimal digits)
- Split by hyphens
- Capitalize each part, EXCEPT Dutch particles when not first word
2. Compound Slugs Without Hyphens (Must Use Mapping)
Slugs without ANY hyphens cannot be reliably parsed because word boundaries are unknown:
| Slug | Correct Name | Why Unparseable |
|---|---|---|
| jponjee | J. Ponjee | Is it "J Ponjee", "JP Onjee", "Jpon Jee"? |
| sharellyemanuelson | Sharelly Emanuelson | Where does first name end? |
| addieroelofsen | Addie Roelofsen | Could be "Addier Oelofsen" |
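To make the ambiguity concrete, the sketch below lists every way a hyphen-free slug could be cut into two parts. The function name `candidate_splits` is hypothetical and exists only for illustration. Nothing in the slug itself favours one cut over another, which is why a mapping is needed.

```python
def candidate_splits(slug: str) -> list[tuple[str, str]]:
    """List every way to cut a hyphen-free slug into two non-empty parts."""
    return [(slug[:i], slug[i:]) for i in range(1, len(slug))]


splits = candidate_splits("jponjee")
print(len(splits))  # 6 possible boundaries for a 7-character slug
for first, last in splits:
    # Prints each candidate reading, e.g. "J Ponjee", "Jp Onjee", "Jpon Jee"
    print(first.capitalize(), last.capitalize())
```

This only counts two-part names. Allowing three or more parts makes the number of candidates grow much faster.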
Known Compound Slugs Mapping:
KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}
For UNKNOWN compound slugs: Set name to "Unknown" and preserve the original slug in metadata for future resolution.
3. Abbreviated Names (Keep as-is)
Some slugs indicate abbreviated names on the original profile:
| Slug | Derived Name | Notes |
|---|---|---|
| miriam-h-38b500b2 | Miriam H | User chose to show only last initial |
| simon-k-94938251 | Simon K | User chose to show only last initial |
| annegret-v-588b06197 | Annegret V | User chose to show only last initial |
These are correct - the user intentionally abbreviated their name on LinkedIn.
Dutch Name Particles
Dutch particles should stay lowercase when NOT the first word:
| Particle | Example |
|---|---|
| van | Charlotte van Beek |
| de | Rob de Jong |
| den | Jan van den Borre |
| der | Herman van der Berg |
| het | Jan van het Veld |
| 't | Jan van 't Hof |
Exception: When the particle is the FIRST word, capitalize it:
- de-jong-12345 → "De Jong" (particle is first)
- rob-de-jong-12345 → "Rob de Jong" (particle follows first name)
Implementation
Python Function
import re
from urllib.parse import unquote

# Hyphen-free slugs whose word boundaries were resolved by hand.
KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}

# Particles that stay lowercase when they are not the first word.
DUTCH_PARTICLES = {'van', 'de', 'den', 'der', 'het', "'t"}


def slug_to_name(slug: str) -> tuple[str, bool]:
    """Convert LinkedIn slug to name.

    Returns:
        tuple: (name, is_reliable)
    """
    decoded_slug = unquote(slug)

    # Check known compound slugs
    if decoded_slug in KNOWN_COMPOUND_SLUGS:
        return (KNOWN_COMPOUND_SLUGS[decoded_slug], True)

    # Unknown compound slug (no hyphens)
    if '-' not in decoded_slug:
        return ("Unknown", False)

    # Remove trailing ID: 6+ hex characters containing at least one digit
    # (so an all-letter name part such as "deface" is kept), or 5+ decimal digits
    clean_slug = re.sub(r'[-_](?=[\da-f]*\d)[\da-f]{6,}$', '', decoded_slug)
    clean_slug = re.sub(r'[-_]\d{5,}$', '', clean_slug)

    parts = [p for p in clean_slug.split('-') if p]
    if not parts:
        return ("Unknown", False)

    name_parts = []
    last = len(parts) - 1
    for i, part in enumerate(parts):
        lowered = part.lower()
        if lowered == 't' and 0 < i < last:
            # "van-t-hof" -> "van 't Hof"; a trailing "t" is treated as an initial
            name_parts.append("'t")
        elif lowered in DUTCH_PARTICLES and i > 0:
            name_parts.append(lowered)
        else:
            name_parts.append(part.capitalize())

    return (' '.join(name_parts), True)
Scripts
| Script | Purpose |
|---|---|
| scripts/fix_simon_kemper_contamination.py | Fix entity files with contaminated names |
| scripts/fix_missing_entity_profiles.py | Fix source data file with contaminated names |
| scripts/parse_linkedin_html.py | Parser that should use this logic for privacy-restricted profiles |
When to Apply This Rule
- Parsing new LinkedIn HTML: When a profile shows "LinkedIn Member" or the logged-in user's name
- Fixing existing data: When contamination is discovered in existing files
- Creating entity profiles: When profile data is incomplete but the slug is available
When NOT to Apply This Rule
- Profile has valid name: If LinkedIn returned the actual name, use it
- Unknown compound slugs: If slug has no hyphens AND is not in the known mapping, use "Unknown"
- Fabricating additional data: This rule ONLY covers name derivation, not other profile fields
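The two lists above amount to a small decision procedure, which could be sketched as follows. The function `resolve_name`, its parameters, and the `source` labels are hypothetical. The `derive` argument stands in for this rule's `slug_to_name`, passed in so the sketch runs on its own.

```python
from typing import Callable, Optional

PLACEHOLDER = "LinkedIn Member"


def resolve_name(
    extracted_name: Optional[str],
    slug: str,
    logged_in_user: str,
    derive: Callable[[str], tuple[str, bool]],
) -> tuple[str, str]:
    """Return (name, source), where source is 'profile', 'slug', or 'unknown'."""
    suspect = (
        not extracted_name
        or extracted_name == PLACEHOLDER
        or extracted_name == logged_in_user
    )
    if not suspect:
        return extracted_name, "profile"   # LinkedIn returned a real name: keep it
    name, reliable = derive(slug)
    if reliable:
        return name, "slug"                # hyphenated or known compound slug
    return "Unknown", "unknown"            # never guess word boundaries


def fake_derive(slug: str) -> tuple[str, bool]:
    # Stand-in for slug_to_name, used only to make this example self-contained.
    return ("Willem Blok", True) if "-" in slug else ("Unknown", False)


print(resolve_name("Simon Kemper", "willem-blok-b6a46648", "Simon Kemper", fake_derive))
# ('Willem Blok', 'slug')
```

The logged-in user's own profile would also match the `logged_in_user` check, so a caller would need to exclude that slug before calling.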
Related Rules
- Rule 21: Data Fabrication Prohibition - slug derivation is transformation, not fabrication
- Rule 19: HTML-Only LinkedIn Extraction - always use HTML source, not copy-paste
- Rule 20: Person Entity Profiles - individual file storage requirements
Audit Trail
When fixing contaminated names, add a note to the extraction metadata:
{
  "extraction_metadata": {
    "notes": "Name corrected from 'Simon Kemper' (contamination) to 'Willem Blok' (derived from slug) on 2025-12-15T10:00:00Z"
  }
}
For unknown compound slugs, preserve the original slug:
{
  "extraction_metadata": {
    "original_slug": "unknowncompoundslug",
    "notes": "Name set to 'Unknown' (was 'Simon Kemper' contamination). Compound slug cannot be reliably parsed."
  }
}
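Both note formats could be produced by one helper, sketched below. The function name `correction_metadata` and its parameters are assumptions. Only the two JSON shapes and the wording of the notes are taken from the examples above.

```python
from datetime import datetime, timezone


def correction_metadata(old_name: str, new_name: str, slug: str,
                        reliable: bool, when: datetime) -> dict:
    """Build the extraction_metadata block for a corrected name.

    `when` should be timezone-aware; it is rendered in UTC with a 'Z' suffix.
    """
    if reliable:
        stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return {"extraction_metadata": {
            "notes": f"Name corrected from '{old_name}' (contamination) to "
                     f"'{new_name}' (derived from slug) on {stamp}",
        }}
    # Unreliable derivation: keep the slug so the name can be resolved later.
    return {"extraction_metadata": {
        "original_slug": slug,
        "notes": f"Name set to 'Unknown' (was '{old_name}' contamination). "
                 "Compound slug cannot be reliably parsed.",
    }}


fixed_at = datetime(2025, 12, 15, 10, 0, 0, tzinfo=timezone.utc)
meta = correction_metadata("Simon Kemper", "Willem Blok",
                           "willem-blok-b6a46648", True, fixed_at)
print(meta["extraction_metadata"]["notes"])
```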