# HTML-Only LinkedIn Extraction Rule
**Rule ID**: HTML_ONLY_LINKEDIN_EXTRACTION
**Status**: ACTIVE
**Created**: 2025-12-11
**Updated**: 2025-12-11 (PYMK section corrected)
**Applies to**: All LinkedIn data extraction (custodian staff, person connections)
## Summary
**Use ONLY manually saved HTML files (not Markdown copy-paste) for extracting LinkedIn data.**
When extracting people data from LinkedIn, the saved HTML contains 100% of the data. Markdown (MD) copy-paste loses critical information, most notably profile URLs, and should NOT be used as a primary source.
## Rationale
| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|-------------|--------------|-----------|-------|-------------------|----------|
| **HTML (saved page)** | 100% | 100% | 100% | 100% | **100%** |
| **MD (copy-paste)** | 0% | 90% | 100% | 90% | ~90% |
**Key finding**: HTML files contain ALL data including LinkedIn profile URLs (`linkedin.com/in/{slug}`), which cannot be obtained from MD copy-paste.
## HTML Structure for LinkedIn People Pages
### Regular Profiles (with profile URL)
```html
John Smith
1st
Senior Curator at Museum
```
### Anonymous Profiles ("LinkedIn Member")
Privacy-protected profiles have a different structure:
```html
LinkedIn Member
Head Barista with SCA certification...
```
**Key difference**: Anonymous profiles have the `org-people-profile-card__profile-image-N` ID on an `<img>` tag (not an `<a>` tag), with no `href` attribute.
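The distinction above can be sketched with the standard-library HTML parser. This is a minimal illustration, not the logic of `scripts/parse_linkedin_html.py`: the `CardCollector` class, the `SAMPLE` markup, and the rule "no `href` means anonymous" are assumptions made for the example, and real saved pages are far more deeply nested.

```python
import re
from html.parser import HTMLParser

# ID pattern taken from this rule; N is the card index.
CARD_ID_RE = re.compile(r"^org-people-profile-card__profile-image-(\d+)$")
# Profile URLs have the form linkedin.com/in/{slug}.
SLUG_RE = re.compile(r"linkedin\.com/in/([^/?#]+)")


class CardCollector(HTMLParser):
    """Collect every element whose id matches the profile-card pattern."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        id_match = CARD_ID_RE.match(attr.get("id") or "")
        if id_match is None:
            return
        href = attr.get("href")
        slug_match = SLUG_RE.search(href) if href else None
        self.cards.append({
            "index": int(id_match.group(1)),
            "tag": tag,
            # Assumption: a card without a profile href is a "LinkedIn Member".
            "anonymous": slug_match is None,
            "linkedin_slug": slug_match.group(1) if slug_match else None,
        })


# Hypothetical, heavily simplified markup for illustration only.
SAMPLE = """
<a id="org-people-profile-card__profile-image-0"
   href="https://www.linkedin.com/in/john-smith-12345?miniProfileUrn=x">photo</a>
<img id="org-people-profile-card__profile-image-1" src="ghost.png">
"""

collector = CardCollector()
collector.feed(SAMPLE)
for card in collector.cards:
    print(card)
```

Keying on the ID pattern rather than on the tag name keeps both card variants in one pass, so no card is dropped because of its index or element type.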
## Extraction Scripts
| Script | Purpose | Input |
|--------|---------|-------|
| `scripts/parse_linkedin_html.py` | Extract staff from company People page | `*.html` |
| `scripts/batch_parse_linkedin_orgs.py` | Batch process multiple org HTML files | `directory/*.html` |
| `scripts/parse_linkedin_connections.py` | Extract connections from a person's connections list | `*.md` (fallback only) |
### Usage: Single Staff Extraction
```bash
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
  --custodian-name "Rijksmuseum" \
  --custodian-slug "rijksmuseum" \
  --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json
```
### Usage: Batch Processing
```bash
python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/
```
## File Naming Convention
### HTML Input Files
```
data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html
```
Example: `Rijksmuseum_ People _ LinkedIn.html`
### JSON Output Files
```
data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json
```
Example: `rijksmuseum_staff_20251211T000000Z.json`
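The two naming patterns can be applied mechanically. The helpers below are a hypothetical sketch (the function names `custodian_name_from_input` and `output_filename` are not part of the existing scripts); the compact UTC timestamp format is inferred from the example filename.

```python
from datetime import datetime, timezone
from pathlib import Path

# Suffix added by the browser when saving a LinkedIn People page.
INPUT_SUFFIX = "_ People _ LinkedIn.html"


def custodian_name_from_input(path):
    """Return the {CustodianName} part of a saved People page filename."""
    name = Path(path).name
    if not name.endswith(INPUT_SUFFIX):
        raise ValueError(f"unexpected input filename: {name}")
    return name[: -len(INPUT_SUFFIX)]


def output_filename(custodian_slug, when):
    """Build {custodian_slug}_staff_{ISO-timestamp}.json with a compact UTC stamp."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{custodian_slug}_staff_{stamp}.json"


print(custodian_name_from_input(
    "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html"))
print(output_filename("rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc)))
```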
## Output Structure
```json
{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}
```
## How to Save LinkedIn HTML
1. Navigate to the company's LinkedIn People page (e.g., `linkedin.com/company/rijksmuseum/people/`)
2. **Scroll down to load ALL profiles** (LinkedIn uses infinite scroll)
   - **CRITICAL**: Keep scrolling until no new profiles load
   - Large organizations (500+ members) may require several minutes of scrolling
3. **File > Save Page As...** > "Webpage, Complete"
4. Save to `data/custodian/person/affiliated/manual/`
**Tip**: If extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.
## Understanding "People you may know" Section
**IMPORTANT CLARIFICATION (2025-12-11)**: The element containing "People you may know" is the **section title for the actual affiliated members list**, NOT a separate recommendations section.
### Corrected Understanding
```html
People you may know
...
...
...
```
The parser does **NOT** filter any cards based on index. All profile cards (`org-people-profile-card__profile-image-N`) are affiliated members.
### Historical Bug (Fixed 2025-12-11)
A previous version of the parser incorrectly assumed cards 0-5 (the first six cards) were "recommendations" and filtered them out. This caused:
- Small orgs (6 members or fewer): **0 extracted**, because every card fell inside the filtered range
- Small-medium orgs: **-6 variance** (exactly 6 real members incorrectly filtered)
This bug has been fixed. The parser now extracts ALL profile cards.
## Count Discrepancies
The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:
### Typical Variance: +1 to +10 (More Extracted than Expected)
| Cause | Explanation |
|-------|-------------|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |
### Negative Variance (Fewer Extracted than Expected)
| Cause | Explanation | Solution |
|-------|-------------|----------|
| **Incomplete scroll** | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |
**Acceptable variance**: <5% is normal. Investigate if variance is >10% or consistently negative.
### Example Batch Results (2025-12-11)
| Organization | Expected | Extracted | Variance | Notes |
|--------------|----------|-----------|----------|-------|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |
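The threshold guidance above can be checked mechanically. The function below is a minimal sketch: `classify_variance` is a hypothetical helper, and the `"review"` label for the 5-10% band is an assumption, since this rule only defines "<5% is normal" and ">10%: investigate". It also does not track whether variance is consistently negative across runs.

```python
def classify_variance(expected, extracted):
    """Return (difference, percent, label) for one organization."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    diff = extracted - expected
    pct = abs(diff) / expected * 100
    if pct < 5:
        label = "normal"
    elif pct > 10:
        label = "investigate"
    else:
        # Assumption: the rule does not define the 5-10% band.
        label = "review"
    return diff, round(pct, 1), label


# (expected, extracted) pairs from the batch results table.
BATCH = {
    "Eye Filmmuseum": (250, 251),
    "Het Utrechts Archief": (77, 77),
    "KB nationale bibliotheek": (624, 561),
    "SURF": (947, 957),
    "Rijksmuseum": (797, 800),
}

for org, (expected, extracted) in BATCH.items():
    diff, pct, label = classify_variance(expected, extracted)
    print(f"{org}: {diff:+d} ({pct}%) -> {label}")
```

With these inputs only KB nationale bibliotheek exceeds the 10% threshold, which matches the "Incomplete scroll" note in the table.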
## Duplicate Profile Merging
Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them into a single staff entry.
### Detection
Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:
- First profile becomes the **primary** entry
- Additional profiles stored in `alternate_profiles` array
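A minimal sketch of this grouping step follows. It is not the parser's actual code: `merge_duplicates` is a hypothetical name, and whitespace collapsing plus `casefold()` is one assumed reading of "normalized name (case-insensitive)". Anonymous entries are passed through unmerged, as described under Anonymous Member Handling.

```python
def merge_duplicates(profiles):
    """Group named profiles by normalized name; the first one seen becomes primary."""
    merged = []
    primary_by_name = {}
    duplicates_merged = 0
    for profile in profiles:
        if profile.get("name_type") == "anonymous":
            merged.append(dict(profile))  # anonymous entries are never merged
            continue
        # Assumed normalization: collapse whitespace, compare case-insensitively.
        key = " ".join(profile["name"].split()).casefold()
        primary = primary_by_name.get(key)
        if primary is None:
            entry = dict(profile)
            primary_by_name[key] = entry
            merged.append(entry)
        else:
            primary.setdefault("alternate_profiles", []).append({
                "linkedin_profile_url": profile["linkedin_profile_url"],
                "linkedin_slug": profile["linkedin_slug"],
            })
            duplicates_merged += 1
    return merged, duplicates_merged


def _profile(name, slug):
    return {"name": name, "name_type": "full", "linkedin_slug": slug,
            "linkedin_profile_url": f"https://www.linkedin.com/in/{slug}"}


raw = [
    _profile("Mattie Boom", "mattie-boom-8346bb79"),
    _profile("John Smith", "john-smith-12345"),
    _profile("mattie  boom", "mattie-boom-76a122386"),
    {"name": "LinkedIn Member", "name_type": "anonymous"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
]
staff, merged_count = merge_duplicates(raw)
print(len(staff), merged_count)
```

Note that name-only matching would also merge two different people who share a name, so the retained `alternate_profiles` URLs are what make later verification possible.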
### Example Output
```json
{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}
```
### Why This Matters
- Prevents inflated staff counts
- Preserves all profile URLs for verification
- Enables future de-duplication with external sources
## Anonymous Member Handling
Anonymous "LinkedIn Member" entries:
- Are NOT deduplicated (each is unique)
- Receive unique IDs: `{slug}_staff_{index}_linkedin_member_{N}`
- Have `name_type: "anonymous"`
- Have headlines but NO profile URLs
- Are included in heritage-relevance detection
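The ID scheme can be illustrated as follows. This is a sketch under stated assumptions: the four-digit zero padding is inferred from the example IDs in this document, `N` is treated as a 1-based counter of anonymous entries, and the ASCII folding in the hypothetical `name_token` helper is a choice made for the example rather than confirmed parser behavior.

```python
import re
import unicodedata


def name_token(name):
    """Lowercase ASCII token, e.g. 'Mattie Boom' -> 'mattie_boom'."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", folded.lower()).strip("_")


def assign_staff_ids(custodian_slug, staff):
    """Attach a unique staff_id to every entry, in list order."""
    anonymous_seen = 0
    for index, person in enumerate(staff):
        staff_id = f"{custodian_slug}_staff_{index:04d}_{name_token(person['name'])}"
        if person.get("name_type") == "anonymous":
            # Assumption: N counts anonymous entries starting at 1.
            anonymous_seen += 1
            staff_id += f"_{anonymous_seen}"
        person["staff_id"] = staff_id
    return staff


people = assign_staff_ids("rijksmuseum", [
    {"name": "John Smith", "name_type": "full"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
])
for person in people:
    print(person["staff_id"])
```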
## Validation
After extraction, verify:
- [ ] Total staff is within ~5% of expected associated members count
- [ ] `with_linkedin_url` + `anonymous_members` approximately equals `total_staff_extracted`
- [ ] Number of staff entries with `alternate_profiles` matches the `duplicate_profiles_merged` count
- [ ] Heritage type detection working (staff_by_heritage_type populated)
- [ ] No duplicate staff IDs
- [ ] No UI element contamination (names like "Show more", "Load more", etc.)
- [ ] High data quality: >95% should have headlines
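Several of these checks can be automated against the output JSON. The sketch below is illustrative only: `validate_output` and the `UI_STRINGS` set are hypothetical, the field names follow the Output Structure section, and the heritage-type and URL-count checks are left out for brevity.

```python
UI_STRINGS = {"show more", "load more"}  # extend with other UI labels as needed


def validate_output(data):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    staff = data["staff"]
    expected = data["custodian_metadata"]["associated_members"]

    if expected and abs(len(staff) - expected) / expected > 0.05:
        problems.append(f"staff count {len(staff)} is more than 5% off expected {expected}")

    ids = [p["staff_id"] for p in staff]
    if len(ids) != len(set(ids)):
        problems.append("duplicate staff IDs")

    polluted = [p["name"] for p in staff if p["name"].strip().lower() in UI_STRINGS]
    if polluted:
        problems.append(f"UI text captured as names: {polluted}")

    with_headline = sum(1 for p in staff if p.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("95% or fewer of staff entries have headlines")

    with_alternates = sum(1 for p in staff if p.get("alternate_profiles"))
    if with_alternates != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count differs from duplicate_profiles_merged")

    return problems


# Tiny hypothetical dataset, not real extraction output.
good = {
    "custodian_metadata": {"associated_members": 2},
    "source_metadata": {"duplicate_profiles_merged": 0},
    "staff": [
        {"staff_id": "example_staff_0000_john_smith", "name": "John Smith",
         "headline": "Senior Curator at Museum"},
        {"staff_id": "example_staff_0001_linkedin_member_1", "name": "LinkedIn Member",
         "headline": "Head Barista with SCA certification..."},
    ],
}
bad = dict(good, staff=good["staff"] + [
    {"staff_id": "example_staff_0000_john_smith", "name": "Show more", "headline": ""},
])

print(validate_output(good))
for problem in validate_output(bad):
    print("-", problem)
```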
### Verification Script
```bash
# Run the extraction and review the printed summary
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
  --custodian-name "Example" \
  --custodian-slug "example"
# Expected output includes:
# Total staff: 251
# Expected (associated members): 250
# Difference: +1 (more than expected) # Small positive variance = normal
# Duplicate profiles merged: 3
```
## Related Rules
- **Rule 15**: Connection Data Registration (`.opencode/CONNECTION_DATA_REGISTRATION_RULE.md`)
- **Rule 17**: LinkedIn Connection Unique Identifiers (`.opencode/LINKEDIN_CONNECTION_ID_RULE.md`)
- **Rule 18**: Custodian Staff Parsing (`.opencode/CUSTODIAN_STAFF_PARSING_RULE.md`)
## Supersedes
This rule supersedes the previous approach of using MD copy-paste as primary source. MD files may still be used as fallback for connections parsing when HTML is not available, but HTML is always preferred.