# HTML-Only LinkedIn Extraction Rule

**Rule ID**: HTML_ONLY_LINKEDIN_EXTRACTION
**Status**: ACTIVE
**Created**: 2025-12-11
**Updated**: 2025-12-11 (PYMK section corrected)
**Applies to**: All LinkedIn data extraction (custodian staff, person connections)

## Summary

**Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.**

When extracting people data from LinkedIn, the HTML source contains 100% of the data. Markdown copy-paste loses critical information and should NOT be used as a primary source.

## Rationale

| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|-------------|--------------|-----------|-------|-------------------|----------|
| **HTML (saved page)** | 100% | 100% | 100% | 100% | **100%** |
| **MD (copy-paste)** | 0% | 90% | 100% | 90% | ~90% |

**Key finding**: HTML files contain ALL data including LinkedIn profile URLs (`linkedin.com/in/{slug}`), which cannot be obtained from MD copy-paste.

## HTML Structure for LinkedIn People Pages

### Regular Profiles (with profile URL)

```html
<a href="https://www.linkedin.com/in/john-smith-12345"
   id="org-people-profile-card__profile-image-42">
  <img alt="John Smith" ... />
</a>
<div class="artdeco-entity-lockup__title">John Smith</div>
<div class="artdeco-entity-lockup__badge">1st</div>
<div class="artdeco-entity-lockup__subtitle">Senior Curator at Museum</div>
```

### Anonymous Profiles ("LinkedIn Member")

Privacy-protected profiles have a different structure:

```html
<img id="org-people-profile-card__profile-image-268" ... />
<!-- Note: img element, NOT <a> element - no href -->
<div class="artdeco-entity-lockup__title">LinkedIn Member</div>
<div class="artdeco-entity-lockup__subtitle">Head Barista with SCA certification...</div>
```

**Key difference**: Anonymous profiles have the `org-people-profile-card__profile-image-N` ID on an `<img>` tag (not an `<a>` tag), with no `href` attribute.

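The distinction between regular and anonymous cards can be sketched with Python's standard `html.parser`. This is an illustration only, not the code in `scripts/parse_linkedin_html.py`; the names `CardParser` and `find_cards` are hypothetical, and the sketch assumes the card ID always ends in a numeric index.

```python
from html.parser import HTMLParser

CARD_ID_PREFIX = "org-people-profile-card__profile-image-"


class CardParser(HTMLParser):
    """Collect every element whose id starts with the profile-card prefix."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        element_id = attr_map.get("id") or ""
        if not element_id.startswith(CARD_ID_PREFIX):
            return
        suffix = element_id[len(CARD_ID_PREFIX):]
        if not suffix.isdigit():
            return  # assumption: real cards always end in a numeric index
        # Regular profiles carry the ID on an <a> with an href;
        # anonymous profiles carry it on an <img> without one.
        href = attr_map.get("href") if tag == "a" else None
        self.cards.append({
            "index": int(suffix),
            "anonymous": href is None,
            "linkedin_profile_url": href,
        })


def find_cards(html_text):
    """Return one dict per profile card found in the saved page."""
    parser = CardParser()
    parser.feed(html_text)
    return parser.cards
```

Keying on the ID prefix rather than on the tag name lets both card types be picked up in one pass, with no filtering by card index.
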
## Extraction Scripts

| Script | Purpose | Input |
|--------|---------|-------|
| `scripts/parse_linkedin_html.py` | Extract staff from company People page | `*.html` |
| `scripts/batch_parse_linkedin_orgs.py` | Batch process multiple org HTML files | `directory/*.html` |
| `scripts/parse_linkedin_connections.py` | Extract a person's connections | `*.md` (fallback only) |

### Usage: Single Staff Extraction

```bash
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
  --custodian-name "Rijksmuseum" \
  --custodian-slug "rijksmuseum" \
  --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json
```

### Usage: Batch Processing

```bash
python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/
```

## File Naming Convention

### HTML Input Files

```
data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html
```

Example: `Rijksmuseum_ People _ LinkedIn.html`

### JSON Output Files

```
data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json
```

Example: `rijksmuseum_staff_20251211T000000Z.json`

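As a minimal sketch of the naming convention above, the two paths could be built as follows. The helper names `input_path` and `output_path` are hypothetical, and the compact `YYYYMMDDTHHMMSSZ` timestamp format is inferred from the single example file name rather than from the scripts themselves.

```python
from datetime import datetime, timezone
from pathlib import Path

MANUAL_DIR = Path("data/custodian/person/affiliated/manual")
PARSED_DIR = Path("data/custodian/person/affiliated/parsed")


def input_path(custodian_name):
    """Path of the manually saved HTML file for a custodian."""
    return MANUAL_DIR / f"{custodian_name}_ People _ LinkedIn.html"


def output_path(custodian_slug, when):
    """Path of the parsed JSON file, stamped with a UTC timestamp."""
    # `when` must be timezone-aware so the conversion to UTC is unambiguous.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PARSED_DIR / f"{custodian_slug}_staff_{stamp}.json"
```

For example, `output_path("rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc))` yields a path ending in `rijksmuseum_staff_20251211T000000Z.json`.
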
## Output Structure

```json
{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}
```

## How to Save LinkedIn HTML

1. Navigate to the company's LinkedIn People page (e.g., `linkedin.com/company/rijksmuseum/people/`)
2. **Scroll down to load ALL profiles** (LinkedIn uses infinite scroll)
   - **CRITICAL**: Keep scrolling until no new profiles load
   - Large organizations (500+ members) may require several minutes of scrolling
3. **File > Save Page As...** > "Webpage, Complete"
4. Save to `data/custodian/person/affiliated/manual/`

**Tip**: If the extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.

## Understanding "People you may know" Section

**IMPORTANT CLARIFICATION (2025-12-11)**: The `<h2>` element containing "People you may know" is the **section title for the actual affiliated members list**, NOT a separate recommendations section.

### Corrected Understanding

```html
<h2 class="sZUfDGIO...">
  People you may know  <!-- This is a HEADING, not a filtering marker -->
</h2>
<!-- ALL cards below are affiliated members -->
<a id="org-people-profile-card__profile-image-0" href="...">...</a>  <!-- Real member -->
<a id="org-people-profile-card__profile-image-1" href="...">...</a>  <!-- Real member -->
...
```

The parser does **NOT** filter any cards based on index. All profile cards (`org-people-profile-card__profile-image-N`) are affiliated members.

### Historical Bug (Fixed 2025-12-11)

A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:
- Small orgs (6 members or fewer): **0 extracted**
- Small-medium orgs: **-6 variance** (exactly 6 real members incorrectly filtered)

This bug has been fixed. The parser now extracts ALL profile cards.

## Count Discrepancies

The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:

### Typical Variance: +1 to +10 (More Extracted than Expected)

| Cause | Explanation |
|-------|-------------|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |

### Negative Variance (Fewer Extracted than Expected)

| Cause | Explanation | Solution |
|-------|-------------|----------|
| **Incomplete scroll** | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |

**Acceptable variance**: <5% is normal. Investigate if variance is >10% or consistently negative.

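The thresholds above can be expressed as a small check. This is a sketch under stated assumptions: the function name `classify_variance` is hypothetical, and the "borderline" label for the 5% to 10% band is an assumption, since the rule leaves that band unspecified. The "consistently negative" condition needs several runs and is not covered here.

```python
def classify_variance(expected, extracted):
    """Return (difference, percent, verdict) for one extraction run."""
    if expected <= 0:
        raise ValueError("expected associated-member count must be positive")
    difference = extracted - expected
    percent = abs(difference) / expected * 100
    if percent < 5:
        verdict = "normal"
    elif percent > 10:
        verdict = "investigate"
    else:
        verdict = "borderline"  # assumption: the rule does not name this band
    return difference, round(percent, 1), verdict
```

With an expected count of 624 and 561 extracted, the difference is -63 and the variance is just over 10%, so the run is flagged for investigation.
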
### Example Batch Results (2025-12-11)

| Organization | Expected | Extracted | Variance | Notes |
|--------------|----------|-----------|----------|-------|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |

## Duplicate Profile Merging

Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them:

### Detection

Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:
- First profile becomes the **primary** entry
- Additional profiles stored in `alternate_profiles` array

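The grouping described above might look like the following sketch. It is not the parser's actual code: `merge_duplicates` is a hypothetical name, collapsing whitespace during normalization is an assumption (the rule only states case-insensitivity), and anonymous entries are passed through untouched because they are never deduplicated.

```python
def merge_duplicates(profiles):
    """Merge same-name profiles; the first seen becomes the primary entry."""
    result = []
    primary_by_name = {}
    for profile in profiles:
        if profile.get("name_type") == "anonymous":
            # Every "LinkedIn Member" card is treated as a distinct person.
            result.append(dict(profile))
            continue
        # Assumption: normalize by collapsing whitespace and ignoring case.
        key = " ".join(profile["name"].split()).casefold()
        primary = primary_by_name.get(key)
        if primary is None:
            primary = dict(profile)
            primary_by_name[key] = primary
            result.append(primary)
        else:
            primary.setdefault("alternate_profiles", []).append({
                "linkedin_profile_url": profile["linkedin_profile_url"],
                "linkedin_slug": profile["linkedin_slug"],
            })
    return result
```

Matching on name alone can also merge two different people who share a name, which is one reason to keep every URL in `alternate_profiles` for later verification.
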
### Example Output

```json
{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}
```

### Why This Matters

- Prevents inflated staff counts
- Preserves all profile URLs for verification
- Enables future de-duplication with external sources

## Anonymous Member Handling

Anonymous "LinkedIn Member" entries:
- Are NOT deduplicated (each is unique)
- Receive unique IDs: `{slug}_staff_{index}_linkedin_member_{N}`
- Have `name_type: "anonymous"`
- Have headlines but NO profile URLs
- Are included in heritage-relevance detection

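One way to generate IDs in the shape described above is sketched below. Several details are assumptions: the four-digit zero-padded index is inferred from the example `rijksmuseum_staff_0000_john_smith`, while the anonymous counter starting at 1 and being unpadded is a guess. The name `make_staff_ids` is hypothetical.

```python
import re


def make_staff_ids(custodian_slug, staff):
    """Build one ID per staff entry, keeping anonymous entries distinct."""
    ids = []
    anonymous_seen = 0
    for index, person in enumerate(staff):
        if person.get("name_type") == "anonymous":
            # Assumption: N counts anonymous members from 1, without padding.
            anonymous_seen += 1
            suffix = f"linkedin_member_{anonymous_seen}"
        else:
            # Lowercase the name and replace non-alphanumeric runs with "_".
            suffix = re.sub(r"[^a-z0-9]+", "_", person["name"].lower()).strip("_")
        ids.append(f"{custodian_slug}_staff_{index:04d}_{suffix}")
    return ids
```

Because the positional index is part of every ID, two anonymous members never collide even though they share the display name "LinkedIn Member". Note that this simple slugging drops non-ASCII letters, so a real implementation would likely need transliteration.
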
## Validation

After extraction, verify:
- [ ] Total staff is within ~5% of expected associated members count
- [ ] `with_linkedin_url` + `anonymous_members` approximately equals `total_staff_extracted`
- [ ] Staff with `alternate_profiles` matches `duplicate_profiles_merged` count
- [ ] Heritage type detection working (staff_by_heritage_type populated)
- [ ] No duplicate staff IDs
- [ ] No UI element contamination (names like "Show more", "Load more", etc.)
- [ ] High data quality: >95% should have headlines

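Most of the checklist can be automated against the output JSON. The following is a minimal sketch, assuming the field names shown in the Output Structure section; `validate_output` and `UI_NOISE` are hypothetical names, and the list of UI strings is limited to the two examples given in the checklist.

```python
UI_NOISE = {"show more", "load more"}


def validate_output(data, tolerance=0.05):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    staff = data["staff"]
    analysis = data["staff_analysis"]
    expected = data["custodian_metadata"]["associated_members"]
    total = analysis["total_staff_extracted"]

    if expected and abs(total - expected) / expected > tolerance:
        problems.append(f"count variance above {tolerance:.0%}")

    ids = [person["staff_id"] for person in staff]
    if len(ids) != len(set(ids)):
        problems.append("duplicate staff IDs")

    with_alternates = sum(1 for p in staff if p.get("alternate_profiles"))
    if with_alternates != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count mismatch")

    if any(p["name"].strip().lower() in UI_NOISE for p in staff):
        problems.append("UI element contamination in names")

    with_headline = sum(1 for p in staff if p.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("fewer than 95% of staff have headlines")

    if not analysis.get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    return problems
```

A typical call would load the file with `json.load` and print each returned problem; an empty list means the automated checks passed, while the "approximately equals" check on URL and anonymous counts is left for manual review.
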
### Verification Script

```bash
# Run extraction with verbose output
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
  --custodian-name "Example" \
  --custodian-slug "example"

# Expected output includes:
#   Total staff: 251
#   Expected (associated members): 250
#   Difference: +1 (more than expected)  # Small positive variance = normal
#   Duplicate profiles merged: 3
```

## Related Rules

- **Rule 15**: Connection Data Registration (`.opencode/CONNECTION_DATA_REGISTRATION_RULE.md`)
- **Rule 17**: LinkedIn Connection Unique Identifiers (`.opencode/LINKEDIN_CONNECTION_ID_RULE.md`)
- **Rule 18**: Custodian Staff Parsing (`.opencode/CUSTODIAN_STAFF_PARSING_RULE.md`)

## Supersedes

This rule supersedes the previous approach of using MD copy-paste as the primary source. MD files may still be used as a fallback for connections parsing when HTML is not available, but HTML is always preferred.