# Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs

**When extracting LinkedIn profile data where the profile is privacy-restricted (showing as "LinkedIn Member" or incorrectly showing the logged-in user's name), the name CAN be reliably derived from the LinkedIn slug if it contains hyphens.**

## The Problem

When saving LinkedIn HTML pages while logged in, privacy-restricted profiles may incorrectly capture the logged-in user's name instead of the actual profile owner's name. This creates "name contamination" where dozens of profiles have the wrong name.

**Example contamination:**

- File: `willem-blok-b6a46648_20251211T000000Z.json`
- Incorrect name: "Simon Kemper" (the logged-in user)
- Correct name: "Willem Blok" (derived from slug)

## Key Principle: Slug-to-Name Derivation is NOT Fabrication

Deriving names from LinkedIn slugs is **ALLOWED** because:

1. LinkedIn slugs are generated from the user's actual name
2. The transformation is deterministic and reversible
3. This is data transformation, not data fabrication

**Per Rule 21 (Data Fabrication Prohibition):** Fabricating data is strictly prohibited. However, deriving names from existing data (the slug) is a reliable transformation, not fabrication.

## Slug Types and Handling

### 1. Hyphenated Slugs (Reliable - CAN derive name)

Slugs with hyphens between name parts can be reliably converted to names:

| Slug | Derived Name |
|------|--------------|
| `willem-blok-b6a46648` | Willem Blok |
| `dave-van-den-nieuwenhof-4446b3146` | Dave van den Nieuwenhof |
| `charlotte-van-beek-55370314` | Charlotte van Beek |
| `jan-van-den-borre-3657211b3` | Jan van den Borre |
| `josée-lunsingh-scheurleer-van-den-berg-00765415` | Josée Lunsingh Scheurleer van den Berg |

**Algorithm:**

1. URL-decode the slug (e.g., `%C3%AB` → `ë`)
2. Remove trailing ID suffix (hex or numeric, 5+ digits)
3. Split by hyphens
4. Capitalize each part, EXCEPT Dutch particles when not first word

### 2. Compound Slugs Without Hyphens (Must Use Mapping)

Slugs without ANY hyphens cannot be reliably parsed because word boundaries are unknown:

| Slug | Correct Name | Why Unparseable |
|------|--------------|-----------------|
| `jponjee` | J. Ponjee | Is it "J Ponjee", "JP Onjee", "Jpon Jee"? |
| `sharellyemanuelson` | Sharelly Emanuelson | Where does first name end? |
| `addieroelofsen` | Addie Roelofsen | Could be "Addier Oelofsen" |

**Known Compound Slugs Mapping:**

```python
KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}
```

**For UNKNOWN compound slugs:** Set name to "Unknown" and preserve the original slug in metadata for future resolution.

### 3. Abbreviated Names (Keep as-is)

Some slugs indicate abbreviated names on the original profile:

| Slug | Derived Name | Notes |
|------|--------------|-------|
| `miriam-h-38b500b2` | Miriam H | User chose to show only last initial |
| `simon-k-94938251` | Simon K | User chose to show only last initial |
| `annegret-v-588b06197` | Annegret V | User chose to show only last initial |

These are **correct** - the user intentionally abbreviated their name on LinkedIn.

## Dutch Name Particles

Dutch particles should stay lowercase when NOT the first word:

| Particle | Example |
|----------|---------|
| van | Charlotte **van** Beek |
| de | Rob **de** Jong |
| den | Jan van **den** Borre |
| der | Herman van **der** Berg |
| het | Jan van **het** Veld |
| 't | Jan van **'t** Hof |

**Exception:** When the particle is the FIRST word, capitalize it:

- `de-jong-12345` → "De Jong" (particle is first)
- `rob-de-jong-12345` → "Rob de Jong" (particle follows first name)

## Implementation

### Python Function

```python
import re
from urllib.parse import unquote

KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}

def slug_to_name(slug: str) -> tuple[str, bool]:
    """Convert LinkedIn slug to name.

    Returns:
        tuple: (name, is_reliable)
    """
    decoded_slug = unquote(slug)

    # Check known compound slugs
    if decoded_slug in KNOWN_COMPOUND_SLUGS:
        return (KNOWN_COMPOUND_SLUGS[decoded_slug], True)

    # Unknown compound slug (no hyphens)
    if '-' not in decoded_slug:
        return ("Unknown", False)

    # Remove trailing ID
    clean_slug = re.sub(r'[-_][\da-f]{6,}$', '', decoded_slug)
    clean_slug = re.sub(r'[-_]\d{5,}$', '', clean_slug)

    parts = [p for p in clean_slug.split('-') if p]

    if not parts:
        return ("Unknown", False)

    # Dutch particles
    dutch_particles = {'van', 'de', 'den', 'der', 'het', 't', "'t"}

    name_parts = []
    for i, part in enumerate(parts):
        if part.lower() in dutch_particles and i > 0:
            name_parts.append(part.lower())
        else:
            name_parts.append(part.capitalize())

    return (' '.join(name_parts), True)
```

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/fix_simon_kemper_contamination.py` | Fix entity files with contaminated names |
| `scripts/fix_missing_entity_profiles.py` | Fix source data file with contaminated names |
| `scripts/parse_linkedin_html.py` | Parser that should use this logic for privacy-restricted profiles |

## When to Apply This Rule

1. **Parsing new LinkedIn HTML:** When a profile shows "LinkedIn Member" or logged-in user's name
2. **Fixing existing data:** When contamination is discovered in existing files
3. **Creating entity profiles:** When profile data is incomplete but slug is available

## When NOT to Apply This Rule

1. **Profile has valid name:** If LinkedIn returned the actual name, use it
2. **Unknown compound slugs:** If slug has no hyphens AND is not in the known mapping, use "Unknown"
3. **Fabricating additional data:** This rule ONLY covers name derivation, not other profile fields

## Related Rules

- **Rule 21:** Data Fabrication Prohibition - slug derivation is transformation, not fabrication
- **Rule 19:** HTML-Only LinkedIn Extraction - always use HTML source, not copy-paste
- **Rule 20:** Person Entity Profiles - individual file storage requirements

## Audit Trail

When fixing contaminated names, add a note to the extraction metadata:

```json
{
  "extraction_metadata": {
    "notes": "Name corrected from 'Simon Kemper' (contamination) to 'Willem Blok' (derived from slug) on 2025-12-15T10:00:00Z"
  }
}
```

For unknown compound slugs, preserve the original slug:

```json
{
  "extraction_metadata": {
    "original_slug": "unknowncompoundslug",
    "notes": "Name set to 'Unknown' (was 'Simon Kemper' contamination). Compound slug cannot be reliably parsed."
  }
}
```
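
The two audit-trail shapes above could be produced by a small helper along the following lines. This is a minimal sketch, not part of the existing scripts: the function name `build_correction_metadata` and its parameters are hypothetical, and it assumes the caller has already obtained the derived name and its reliability flag (for example from `slug_to_name`). Only the `notes` and `original_slug` fields and their wording are taken from the JSON examples in this rule.

```python
from datetime import datetime, timezone
from typing import Optional


def build_correction_metadata(
    slug: str,
    contaminated_name: str,
    derived_name: str,
    is_reliable: bool,
    corrected_at: Optional[datetime] = None,
) -> dict:
    """Build an extraction_metadata block recording a name correction.

    Hypothetical helper: field names follow the JSON examples in this
    rule, but the function itself is an illustration only.
    """
    # Default to "now" in UTC. A naive datetime passed by the caller is
    # interpreted as local time by astimezone().
    when = (corrected_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")

    if is_reliable:
        # Name came from a hyphenated slug or the known compound mapping.
        notes = (
            f"Name corrected from '{contaminated_name}' (contamination) "
            f"to '{derived_name}' (derived from slug) on {stamp}"
        )
        return {"extraction_metadata": {"notes": notes}}

    # Unknown compound slug: keep the slug so it can be resolved later.
    notes = (
        f"Name set to 'Unknown' (was '{contaminated_name}' contamination). "
        "Compound slug cannot be reliably parsed."
    )
    return {"extraction_metadata": {"original_slug": slug, "notes": notes}}
```

Passing the timestamp explicitly keeps the output reproducible when re-running a fix script. In this sketch the unreliable branch omits the timestamp only because the second JSON example above does not include one.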