glam/.opencode/ANONYMOUS_PROFILE_NAME_RULE.md
2025-12-15 22:31:41 +01:00

# Rule 29: Anonymous Profile Name Derivation from LinkedIn Slugs
**When extracting LinkedIn profile data where the profile is privacy-restricted (showing as "LinkedIn Member" or incorrectly showing the logged-in user's name), the name CAN be reliably derived from the LinkedIn slug if it contains hyphens.**
## The Problem
When saving LinkedIn HTML pages while logged in, privacy-restricted profiles may incorrectly capture the logged-in user's name instead of the actual profile owner's name. This creates "name contamination" where dozens of profiles have the wrong name.
**Example contamination:**
- File: `willem-blok-b6a46648_20251211T000000Z.json`
- Incorrect name: "Simon Kemper" (the logged-in user)
- Correct name: "Willem Blok" (derived from slug)
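The slug in the example above is the part of the filename before the timestamp. A minimal sketch of recovering it, assuming every entity file follows the `<slug>_<timestamp>.json` pattern seen in that one example (the helper name `slug_from_filename` is hypothetical):

```python
from pathlib import Path


def slug_from_filename(filename: str) -> str:
    """Return the slug part of a '<slug>_<timestamp>.json' filename."""
    stem = Path(filename).stem  # drop the '.json' extension
    # Assumption: the timestamp follows the LAST underscore,
    # so split from the right to tolerate underscores in the slug.
    slug, sep, _timestamp = stem.rpartition('_')
    return slug if sep else stem  # no underscore: treat the whole stem as the slug


print(slug_from_filename('willem-blok-b6a46648_20251211T000000Z.json'))
```

If filenames in the repository use a different layout, the split logic would need to change accordingly.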
## Key Principle: Slug-to-Name Derivation is NOT Fabrication
Deriving names from LinkedIn slugs is **ALLOWED** because:
1. By default, LinkedIn generates slugs from the name the user entered on their profile
2. The transformation is deterministic: the same slug always yields the same name
3. This is data transformation, not data fabrication
**Per Rule 21 (Data Fabrication Prohibition):** Fabricating data is strictly prohibited. However, deriving names from existing data (the slug) is a reliable transformation, not fabrication.
## Slug Types and Handling
### 1. Hyphenated Slugs (Reliable - CAN derive name)
Slugs with hyphens between name parts can be reliably converted to names:
| Slug | Derived Name |
|------|--------------|
| `willem-blok-b6a46648` | Willem Blok |
| `dave-van-den-nieuwenhof-4446b3146` | Dave van den Nieuwenhof |
| `charlotte-van-beek-55370314` | Charlotte van Beek |
| `jan-van-den-borre-3657211b3` | Jan van den Borre |
| `josée-lunsingh-scheurleer-van-den-berg-00765415` | Josée Lunsingh Scheurleer van den Berg |
**Algorithm:**
1. URL-decode the slug (e.g., `%C3%AB` → `ë`)
2. Remove trailing ID suffix (hex string of 6+ characters, or numeric string of 5+ digits)
3. Split by hyphens
4. Capitalize each part, EXCEPT Dutch particles when not first word
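Steps 1 to 3 above can be traced on a single slug. This is an illustrative sketch only: the helper name `split_slug` is hypothetical, and the two regular expressions are the ones used in the Implementation section of this rule.

```python
import re
from urllib.parse import unquote


def split_slug(slug: str) -> list[str]:
    """Apply steps 1-3: URL-decode, strip the trailing ID, split on hyphens."""
    decoded = unquote(slug)  # step 1: '%C3%A9' becomes 'é'
    # step 2: strip a hex-like ID (6+ chars), then a purely numeric ID (5+ digits)
    stripped = re.sub(r'[-_][\da-f]{6,}$', '', decoded)
    stripped = re.sub(r'[-_]\d{5,}$', '', stripped)
    # step 3: split on hyphens, dropping empty parts
    return [part for part in stripped.split('-') if part]


print(split_slug('jos%C3%A9e-lunsingh-scheurleer-van-den-berg-00765415'))
# ['josée', 'lunsingh', 'scheurleer', 'van', 'den', 'berg']
```

Step 4 (capitalization with the Dutch particle exception) then operates on the resulting list of parts.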
### 2. Compound Slugs Without Hyphens (Must Use Mapping)
Slugs without ANY hyphens cannot be reliably parsed because word boundaries are unknown:
| Slug | Correct Name | Why Unparseable |
|------|--------------|-----------------|
| `jponjee` | J. Ponjee | Is it "J Ponjee", "JP Onjee", "Jpon Jee"? |
| `sharellyemanuelson` | Sharelly Emanuelson | Where does first name end? |
| `addieroelofsen` | Addie Roelofsen | Could be "Addier Oelofsen" |
**Known Compound Slugs Mapping:**
```python
KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}
```
**For UNKNOWN compound slugs:** Set name to "Unknown" and preserve the original slug in metadata for future resolution.
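A minimal sketch of that fallback, assuming a record shaped like the `extraction_metadata` examples in the Audit Trail section. The function name `resolve_compound_slug` and the `name` / `name_is_reliable` keys are hypothetical, chosen only for illustration:

```python
def resolve_compound_slug(slug: str, known: dict[str, str]) -> dict:
    """Look up a hyphen-less slug; fall back to 'Unknown' and keep the slug."""
    if slug in known:
        return {"name": known[slug], "name_is_reliable": True}
    # Unknown compound slug: never guess word boundaries.
    # Preserve the slug so the name can be resolved manually later.
    return {
        "name": "Unknown",
        "name_is_reliable": False,
        "extraction_metadata": {"original_slug": slug},
    }


known = {"jponjee": "J. Ponjee"}
print(resolve_compound_slug("jponjee", known))
print(resolve_compound_slug("unknowncompoundslug", known))
```

Once a compound slug has been resolved by hand, adding it to the mapping makes later runs return the real name.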
### 3. Abbreviated Names (Keep as-is)
Some slugs indicate abbreviated names on the original profile:
| Slug | Derived Name | Notes |
|------|--------------|-------|
| `miriam-h-38b500b2` | Miriam H | User chose to show only last initial |
| `simon-k-94938251` | Simon K | User chose to show only last initial |
| `annegret-v-588b06197` | Annegret V | User chose to show only last initial |
These are **correct** - the user intentionally abbreviated their name on LinkedIn.
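If downstream code needs to recognize such names, for example to avoid "completing" them from other sources, a check along these lines may help. This is a sketch under the assumption that an abbreviated name ends in a single letter, optionally followed by a period; `is_abbreviated_name` is a hypothetical helper, not part of the existing scripts.

```python
def is_abbreviated_name(name: str) -> bool:
    """Return True when the last name part is a single initial, e.g. 'Miriam H'."""
    parts = name.split()
    if len(parts) < 2:
        return False  # a lone word is not a first name plus initial
    last = parts[-1].rstrip('.')  # accept both 'P' and 'P.'
    return len(last) == 1 and last.isalpha()


for candidate in ("Miriam H", "Adhelia P.", "Willem Blok"):
    print(candidate, is_abbreviated_name(candidate))
```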
## Dutch Name Particles
Dutch particles should stay lowercase when NOT the first word:
| Particle | Example |
|----------|---------|
| van | Charlotte **van** Beek |
| de | Rob **de** Jong |
| den | Jan van **den** Borre |
| der | Herman van **der** Berg |
| het | Jan van **het** Veld |
| 't | Jan van **'t** Hof |
**Exception:** When the particle is the FIRST word, capitalize it:
- `de-jong-12345` → "De Jong" (particle is first)
- `rob-de-jong-12345` → "Rob de Jong" (particle follows first name)
## Implementation
### Python Function
```python
import re
from urllib.parse import unquote

KNOWN_COMPOUND_SLUGS = {
    'jponjee': 'J. Ponjee',
    'sharellyemanuelson': 'Sharelly Emanuelson',
    'addieroelofsen': 'Addie Roelofsen',
    'adheliap': 'Adhelia P.',
    'anejanboomsma': 'Anejan Boomsma',
    'fredericlogghe': 'Frederic Logghe',
    'dirkjanheinen': 'Dirkjan Heinen',
}


def slug_to_name(slug: str) -> tuple[str, bool]:
    """Convert LinkedIn slug to name.

    Returns:
        tuple: (name, is_reliable)
    """
    decoded_slug = unquote(slug)

    # Check known compound slugs
    if decoded_slug in KNOWN_COMPOUND_SLUGS:
        return (KNOWN_COMPOUND_SLUGS[decoded_slug], True)

    # Unknown compound slug (no hyphens)
    if '-' not in decoded_slug:
        return ("Unknown", False)

    # Remove trailing ID: hex-like (6+ chars) first, then purely numeric (5+ digits)
    clean_slug = re.sub(r'[-_][\da-f]{6,}$', '', decoded_slug)
    clean_slug = re.sub(r'[-_]\d{5,}$', '', clean_slug)

    parts = [p for p in clean_slug.split('-') if p]
    if not parts:
        return ("Unknown", False)

    # Dutch particles stay lowercase unless they are the first word
    dutch_particles = {'van', 'de', 'den', 'der', 'het', 't', "'t"}

    name_parts = []
    for i, part in enumerate(parts):
        if part.lower() in dutch_particles and i > 0:
            name_parts.append(part.lower())
        else:
            name_parts.append(part.capitalize())

    return (' '.join(name_parts), True)
```
## Scripts
| Script | Purpose |
|--------|---------|
| `scripts/fix_simon_kemper_contamination.py` | Fix entity files with contaminated names |
| `scripts/fix_missing_entity_profiles.py` | Fix source data file with contaminated names |
| `scripts/parse_linkedin_html.py` | Parser that should use this logic for privacy-restricted profiles |
## When to Apply This Rule
1. **Parsing new LinkedIn HTML:** When a profile shows "LinkedIn Member" or logged-in user's name
2. **Fixing existing data:** When contamination is discovered in existing files
3. **Creating entity profiles:** When profile data is incomplete but slug is available
## When NOT to Apply This Rule
1. **Profile has valid name:** If LinkedIn returned the actual name, use it
2. **Unknown compound slugs:** If slug has no hyphens AND is not in the known mapping, use "Unknown"
3. **Fabricating additional data:** This rule ONLY covers name derivation, not other profile fields
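The two lists above amount to a simple decision on the parsed name. A minimal sketch, assuming the parser knows the logged-in user's name; the function name `needs_slug_derivation` and the `PLACEHOLDER_NAMES` constant are hypothetical:

```python
from typing import Optional

# Placeholder shown by LinkedIn for privacy-restricted profiles (from this rule)
PLACEHOLDER_NAMES = {"LinkedIn Member"}


def needs_slug_derivation(parsed_name: Optional[str], logged_in_user: str) -> bool:
    """Decide whether the parsed name should be replaced by a slug-derived name."""
    if parsed_name is None or not parsed_name.strip():
        return True  # incomplete profile data: fall back to the slug
    name = parsed_name.strip()
    if name in PLACEHOLDER_NAMES:
        return True  # privacy-restricted profile
    # Contamination: the page captured the logged-in user's name instead
    return name.casefold() == logged_in_user.strip().casefold()


print(needs_slug_derivation("LinkedIn Member", "Simon Kemper"))
print(needs_slug_derivation("Simon Kemper", "Simon Kemper"))
print(needs_slug_derivation("Willem Blok", "Simon Kemper"))
```

One caveat under this sketch: the logged-in user's own profile legitimately carries that name, so a match should be cross-checked against the slug before the name is overwritten.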
## Related Rules
- **Rule 21:** Data Fabrication Prohibition - slug derivation is transformation, not fabrication
- **Rule 19:** HTML-Only LinkedIn Extraction - always use HTML source, not copy-paste
- **Rule 20:** Person Entity Profiles - individual file storage requirements
## Audit Trail
When fixing contaminated names, add a note to the extraction metadata:
```json
{
  "extraction_metadata": {
    "notes": "Name corrected from 'Simon Kemper' (contamination) to 'Willem Blok' (derived from slug) on 2025-12-15T10:00:00Z"
  }
}
```
For unknown compound slugs, preserve the original slug:
```json
{
  "extraction_metadata": {
    "original_slug": "unknowncompoundslug",
    "notes": "Name set to 'Unknown' (was 'Simon Kemper' contamination). Compound slug cannot be reliably parsed."
  }
}
```
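
To keep these notes consistent across fix scripts, the wording can be generated rather than typed by hand. A minimal sketch that reproduces the first note format above; the helper name `build_correction_note` is hypothetical, and it assumes timestamps are timezone-aware and recorded in UTC:

```python
from datetime import datetime, timezone
from typing import Optional


def build_correction_note(old_name: str, new_name: str,
                          when: Optional[datetime] = None) -> dict:
    """Build the extraction_metadata note for a slug-derived name correction."""
    # Assumption: 'when' is timezone-aware; default to the current UTC time.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime('%Y-%m-%dT%H:%M:%SZ')
    note = (f"Name corrected from '{old_name}' (contamination) "
            f"to '{new_name}' (derived from slug) on {stamp}")
    return {"extraction_metadata": {"notes": note}}


fixed_at = datetime(2025, 12, 15, 10, 0, 0, tzinfo=timezone.utc)
print(build_correction_note("Simon Kemper", "Willem Blok", fixed_at))
```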