glam/.opencode/ANONYMOUS_PROFILE_NAME_RULE.md
2025-12-15 22:31:41 +01:00

Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs

When extracting LinkedIn profile data where the profile is privacy-restricted (showing as "LinkedIn Member" or incorrectly showing the logged-in user's name), the name CAN be reliably derived from the LinkedIn slug if it contains hyphens.

The Problem

When saving LinkedIn HTML pages while logged in, privacy-restricted profiles may incorrectly capture the logged-in user's name instead of the actual profile owner's name. This creates "name contamination" where dozens of profiles have the wrong name.

Example contamination:

  • File: willem-blok-b6a46648_20251211T000000Z.json
  • Incorrect name: "Simon Kemper" (the logged-in user)
  • Correct name: "Willem Blok" (derived from slug)

Key Principle: Slug-to-Name Derivation is NOT Fabrication

Deriving names from LinkedIn slugs is ALLOWED because:

  1. LinkedIn slugs are generated from the user's actual name
  2. The transformation is deterministic and reversible
  3. This is data transformation, not data fabrication

Per Rule 21 (Data Fabrication Prohibition): Fabricating data is strictly prohibited. However, deriving names from existing data (the slug) is a reliable transformation, not fabrication.

Slug Types and Handling

1. Hyphenated Slugs (Reliable - CAN derive name)

Slugs with hyphens between name parts can be reliably converted to names:

| Slug | Derived Name |
| --- | --- |
| willem-blok-b6a46648 | Willem Blok |
| dave-van-den-nieuwenhof-4446b3146 | Dave van den Nieuwenhof |
| charlotte-van-beek-55370314 | Charlotte van Beek |
| jan-van-den-borre-3657211b3 | Jan van den Borre |
| josée-lunsingh-scheurleer-van-den-berg-00765415 | Josée Lunsingh Scheurleer van den Berg |

Algorithm:

  1. URL-decode the slug (e.g., %C3%AB → ë)
  2. Remove the trailing ID suffix (a hex token of 6+ characters or a numeric token of 5+ digits)
  3. Split by hyphens
  4. Capitalize each part, EXCEPT Dutch particles when not first word
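
The four steps can be traced on a single slug. The snippet below is a minimal sketch, not the project's parser: the URL-encoded input is a hypothetical form of one slug from the table above, and the ID pattern is simplified to the numeric case only.

```python
import re
from urllib.parse import unquote

# Hypothetical input: URL-encoded form of a slug from the table above
raw_slug = "jos%C3%A9e-lunsingh-scheurleer-van-den-berg-00765415"

# Step 1: URL-decode (%C3%A9 is the UTF-8 encoding of "é")
decoded = unquote(raw_slug)

# Step 2: strip the trailing ID token (simplified: numeric IDs only)
without_id = re.sub(r"-\d{5,}$", "", decoded)

# Step 3: split on hyphens
parts = without_id.split("-")

# Step 4: capitalize each part, keeping Dutch particles lowercase
# unless they are the first word
particles = {"van", "de", "den", "der", "het"}
name = " ".join(
    p if (i > 0 and p in particles) else p.capitalize()
    for i, p in enumerate(parts)
)

print(name)  # Josée Lunsingh Scheurleer van den Berg
```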

2. Compound Slugs Without Hyphens (Must Use Mapping)

Slugs without ANY hyphens cannot be reliably parsed because word boundaries are unknown:

| Slug | Correct Name | Why Unparseable |
| --- | --- | --- |
| jponjee | J. Ponjee | Is it "J Ponjee", "JP Onjee", "Jpon Jee"? |
| sharellyemanuelson | Sharelly Emanuelson | Where does first name end? |
| addieroelofsen | Addie Roelofsen | Could be "Addier Oelofsen" |

Known Compound Slugs Mapping:

KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}

For UNKNOWN compound slugs: Set name to "Unknown" and preserve the original slug in metadata for future resolution.

3. Abbreviated Names (Keep as-is)

Some slugs indicate abbreviated names on the original profile:

| Slug | Derived Name | Notes |
| --- | --- | --- |
| miriam-h-38b500b2 | Miriam H | User chose to show only last initial |
| simon-k-94938251 | Simon K | User chose to show only last initial |
| annegret-v-588b06197 | Annegret V | User chose to show only last initial |

These are correct - the user intentionally abbreviated their name on LinkedIn.

Dutch Name Particles

Dutch particles should stay lowercase when NOT the first word:

| Particle | Example |
| --- | --- |
| van | Charlotte van Beek |
| de | Rob de Jong |
| den | Jan van den Borre |
| der | Herman van der Berg |
| het | Jan van het Veld |
| 't | Jan van 't Hof |

Exception: When the particle is the FIRST word, capitalize it:

  • de-jong-12345 → "De Jong" (particle is first)
  • rob-de-jong-12345 → "Rob de Jong" (particle follows first name)

Implementation

Python Function

import re
from urllib.parse import unquote

KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}

def slug_to_name(slug: str) -> tuple[str, bool]:
    """Convert LinkedIn slug to name.
    
    Returns:
        tuple: (name, is_reliable)
    """
    decoded_slug = unquote(slug)
    
    # Check known compound slugs
    if decoded_slug in KNOWN_COMPOUND_SLUGS:
        return (KNOWN_COMPOUND_SLUGS[decoded_slug], True)
    
    # Unknown compound slug (no hyphens)
    if '-' not in decoded_slug:
        return ("Unknown", False)
    
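    # Remove the trailing LinkedIn ID: a hex token of 6+ characters containing
    # at least one digit (so all-letter name parts such as "facade" survive),
    # or a purely numeric token of 5+ digits
    clean_slug = re.sub(r'[-_](?:(?=[\da-f]*\d)[\da-f]{6,}|\d{5,})$', '', decoded_slug)
    
    parts = [p for p in clean_slug.split('-') if p]
    if not parts:
        return ("Unknown", False)
    
    # Dutch particles stay lowercase unless they are the first word
    dutch_particles = {'van', 'de', 'den', 'der', 'het', "'t"}
    last = len(parts) - 1
    
    name_parts = []
    for i, part in enumerate(parts):
        token = part.lower()
        # Slugs normally drop the apostrophe, so "'t" arrives as a bare "t";
        # a trailing "t" is an abbreviated surname initial, not a particle
        if token == 't' and 0 < i < last:
            token = "'t"
        if token in dutch_particles and i > 0:
            name_parts.append(token)
        else:
            name_parts.append(part.capitalize())
    
    return (' '.join(name_parts), True)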

Scripts

| Script | Purpose |
| --- | --- |
| scripts/fix_simon_kemper_contamination.py | Fix entity files with contaminated names |
| scripts/fix_missing_entity_profiles.py | Fix source data file with contaminated names |
| scripts/parse_linkedin_html.py | Parser that should use this logic for privacy-restricted profiles |

When to Apply This Rule

  1. Parsing new LinkedIn HTML: When a profile shows "LinkedIn Member" or logged-in user's name
  2. Fixing existing data: When contamination is discovered in existing files
  3. Creating entity profiles: When profile data is incomplete but slug is available
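
The trigger conditions above can be expressed as a small check. This is a sketch under assumptions: the function name `needs_slug_derivation` and its parameters are hypothetical and do not come from the listed scripts. Note that it would also flag the logged-in user's own profile, so a real parser may need to exclude that slug.

```python
PLACEHOLDER_NAME = "LinkedIn Member"


def needs_slug_derivation(extracted_name: str, logged_in_user: str) -> bool:
    """Return True when the extracted name should not be trusted.

    Hypothetical helper: flags an empty name, the privacy placeholder,
    or a name equal to the logged-in user's name (contamination).
    """
    normalized = extracted_name.strip().casefold()
    if not normalized:
        # Incomplete profile data: no name was extracted at all
        return True
    untrusted = {PLACEHOLDER_NAME.casefold(), logged_in_user.strip().casefold()}
    return normalized in untrusted


print(needs_slug_derivation("LinkedIn Member", "Simon Kemper"))  # True
print(needs_slug_derivation("Willem Blok", "Simon Kemper"))      # False
```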

When NOT to Apply This Rule

  1. Profile has valid name: If LinkedIn returned the actual name, use it
  2. Unknown compound slugs: If slug has no hyphens AND is not in the known mapping, use "Unknown"
  3. Fabricating additional data: This rule ONLY covers name derivation, not other profile fields

Related Rules

  • Rule 21: Data Fabrication Prohibition - slug derivation is transformation, not fabrication
  • Rule 19: HTML-Only LinkedIn Extraction - always use HTML source, not copy-paste
  • Rule 20: Person Entity Profiles - individual file storage requirements

Audit Trail

When fixing contaminated names, add a note to the extraction metadata:

{
  "extraction_metadata": {
    "notes": "Name corrected from 'Simon Kemper' (contamination) to 'Willem Blok' (derived from slug) on 2025-12-15T10:00:00Z"
  }
}

For unknown compound slugs, preserve the original slug:

{
  "extraction_metadata": {
    "original_slug": "unknowncompoundslug",
    "notes": "Name set to 'Unknown' (was 'Simon Kemper' contamination). Compound slug cannot be reliably parsed."
  }
}
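
Both metadata shapes could be produced by one helper so that the note wording stays consistent across fixes. The sketch below is illustrative only: `build_audit_metadata` and its parameters are hypothetical names, and the timestamp is assumed to be UTC in the format shown above.

```python
from datetime import datetime, timezone
from typing import Optional


def build_audit_metadata(
    old_name: str,
    new_name: str,
    slug: str,
    reliable: bool,
    corrected_at: Optional[datetime] = None,
) -> dict:
    """Build the extraction_metadata entry for a corrected name (hypothetical helper)."""
    moment = corrected_at or datetime.now(timezone.utc)
    # Normalize to UTC so the literal "Z" suffix is accurate
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    if reliable:
        note = (
            f"Name corrected from '{old_name}' (contamination) "
            f"to '{new_name}' (derived from slug) on {stamp}"
        )
        return {"extraction_metadata": {"notes": note}}

    # Unknown compound slug: keep the slug for future resolution
    note = (
        f"Name set to 'Unknown' (was '{old_name}' contamination). "
        "Compound slug cannot be reliably parsed."
    )
    return {"extraction_metadata": {"original_slug": slug, "notes": note}}


fixed_time = datetime(2025, 12, 15, 10, 0, 0, tzinfo=timezone.utc)
print(build_audit_metadata("Simon Kemper", "Willem Blok",
                           "willem-blok-b6a46648", True, fixed_time))
```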