HTML-Only LinkedIn Extraction Rule
Rule ID: HTML_ONLY_LINKEDIN_EXTRACTION
Status: ACTIVE
Created: 2025-12-11
Updated: 2025-12-11 (PYMK section corrected)
Applies to: All LinkedIn data extraction (custodian staff, person connections)
Summary
Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.
When extracting people data from LinkedIn, the HTML source contains 100% of the data. Markdown copy-paste loses critical information and should NOT be used as a primary source.
Rationale
| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|---|---|---|---|---|---|
| HTML (saved page) | 100% | 100% | 100% | 100% | 100% |
| MD (copy-paste) | 0% | 90% | 100% | 90% | ~90% |
Key finding: HTML files contain ALL data including LinkedIn profile URLs (linkedin.com/in/{slug}), which cannot be obtained from MD copy-paste.
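As a rough illustration of the linkedin.com/in/{slug} pattern mentioned above, the following sketch pulls the slug out of a profile URL. The function name and the regular expression are assumptions made for this example; the actual parser scripts may handle URLs differently.

```python
import re

# Hypothetical pattern: optional subdomain (www, nl, ...), then /in/{slug}.
PROFILE_URL_RE = re.compile(r"^https?://(?:[a-z]{2,3}\.)?linkedin\.com/in/([^/?#]+)")


def slug_from_profile_url(url):
    """Return the profile slug, or None if the URL is not a personal profile URL."""
    match = PROFILE_URL_RE.match(url)
    return match.group(1) if match else None


print(slug_from_profile_url("https://www.linkedin.com/in/john-smith-12345"))
```

Company URLs such as linkedin.com/company/rijksmuseum/people/ do not match the pattern and yield None.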
HTML Structure for LinkedIn People Pages
Regular Profiles (with profile URL)
<a href="https://www.linkedin.com/in/john-smith-12345"
id="org-people-profile-card__profile-image-42">
<img alt="John Smith" ... />
</a>
<div class="artdeco-entity-lockup__title">John Smith</div>
<div class="artdeco-entity-lockup__badge">1st</div>
<div class="artdeco-entity-lockup__subtitle">Senior Curator at Museum</div>
Anonymous Profiles ("LinkedIn Member")
Privacy-protected profiles have a different structure:
<img id="org-people-profile-card__profile-image-268" ... />
<!-- Note: img element, NOT <a> element - no href -->
<div class="artdeco-entity-lockup__title">LinkedIn Member</div>
<div class="artdeco-entity-lockup__subtitle">Head Barista with SCA certification...</div>
Key difference: Anonymous profiles have the org-people-profile-card__profile-image-N ID on an <img> tag (not an <a> tag), with no href attribute.
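The distinction between the two card types can be sketched with the standard library's html.parser. This is a minimal illustration, not the logic of scripts/parse_linkedin_html.py: the class and function names are hypothetical, and it only classifies cards by tag and ID without reading names or headlines.

```python
from html.parser import HTMLParser

CARD_ID_PREFIX = "org-people-profile-card__profile-image-"


class ProfileCardParser(HTMLParser):
    """Collect one record per element whose id starts with the card prefix."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        card_id = attr.get("id") or ""
        suffix = card_id[len(CARD_ID_PREFIX):]
        if not card_id.startswith(CARD_ID_PREFIX) or not suffix.isdigit():
            return
        if tag == "a" and attr.get("href"):
            # Regular profile: the ID sits on an <a> that carries the profile URL.
            self.cards.append({"index": int(suffix), "anonymous": False, "url": attr["href"]})
        elif tag == "img":
            # Anonymous profile: the ID sits on an <img>, so there is no URL.
            self.cards.append({"index": int(suffix), "anonymous": True, "url": None})


def classify_cards(html):
    parser = ProfileCardParser()
    parser.feed(html)
    return parser.cards
```

Self-closing img tags are also routed through handle_starttag by HTMLParser, so both card types are seen by the same handler.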
Extraction Scripts
| Script | Purpose | Input |
|---|---|---|
| scripts/parse_linkedin_html.py | Extract staff from company People page | *.html |
| scripts/batch_parse_linkedin_orgs.py | Batch process multiple org HTML files | directory/*.html |
| scripts/parse_linkedin_connections.py | Extract connections from person | *.md (fallback only) |
Usage: Single Staff Extraction
python scripts/parse_linkedin_html.py \
"data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
--custodian-name "Rijksmuseum" \
--custodian-slug "rijksmuseum" \
--output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json
Usage: Batch Processing
python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/
File Naming Convention
HTML Input Files
data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html
Example: Rijksmuseum_ People _ LinkedIn.html
JSON Output Files
data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json
Example: rijksmuseum_staff_20251211T000000Z.json
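A small sketch of how an output path following this convention could be built. The helper name is hypothetical, and the compact UTC timestamp format (YYYYMMDDTHHMMSSZ) is inferred from the example filename rather than taken from the scripts.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

PARSED_DIR = PurePosixPath("data/custodian/person/affiliated/parsed")


def staff_output_path(custodian_slug, when):
    """Build {custodian_slug}_staff_{timestamp}.json; `when` must be timezone-aware."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PARSED_DIR / f"{custodian_slug}_staff_{stamp}.json"


print(staff_output_path("rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc)))
```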
Output Structure
{
"custodian_metadata": {
"custodian_name": "Rijksmuseum",
"custodian_slug": "rijksmuseum",
"industry": "Museums, Historical Sites, and Zoos",
"location": {"city": "Amsterdam", "region": "North Holland"},
"associated_members": 797
},
"source_metadata": {
"source_type": "linkedin_company_people_page_html",
"source_file": "Rijksmuseum_ People _ LinkedIn.html",
"registered_timestamp": "2025-12-11T00:00:00Z",
"registration_method": "html_parsing",
"staff_extracted": 800,
"duplicate_profiles_merged": 5
},
"staff": [
{
"staff_id": "rijksmuseum_staff_0000_john_smith",
"name": "John Smith",
"name_type": "full",
"degree": "1st",
"headline": "Senior Curator at Rijksmuseum",
"linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
"linkedin_slug": "john-smith-12345",
"heritage_relevant": true,
"heritage_type": "M"
},
{
"staff_id": "rijksmuseum_staff_0042_mattie_boom",
"name": "Mattie Boom",
"name_type": "full",
"degree": "2nd",
"headline": "Senior Curator of Photography",
"linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
"linkedin_slug": "mattie-boom-8346bb79",
"alternate_profiles": [
{
"linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
"linkedin_slug": "mattie-boom-76a122386"
}
],
"heritage_relevant": true,
"heritage_type": "M"
}
],
"staff_analysis": {
"total_staff_extracted": 800,
"with_linkedin_url": 703,
"with_alternate_profiles": 5,
"anonymous_members": 91,
"heritage_relevant_count": 643,
"staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
}
}
How to Save LinkedIn HTML
- Navigate to the company's LinkedIn People page (e.g., linkedin.com/company/rijksmuseum/people/)
- Scroll down to load ALL profiles (LinkedIn uses infinite scroll)
- CRITICAL: Keep scrolling until no new profiles load
- Large organizations (500+ members) may require several minutes of scrolling
- File > Save Page As... > "Webpage, Complete"
- Save to data/custodian/person/affiliated/manual/
Tip: If extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.
Understanding "People you may know" Section
IMPORTANT CLARIFICATION (2025-12-11): The <h2> element containing "People you may know" is the section title for the actual affiliated members list, NOT a separate recommendations section.
Corrected Understanding
<h2 class="sZUfDGIO...">
People you may know <!-- This is a HEADING, not filtering marker -->
</h2>
<!-- ALL cards below are affiliated members -->
<a id="org-people-profile-card__profile-image-0" href="...">...</a> <!-- Real member -->
<a id="org-people-profile-card__profile-image-1" href="...">...</a> <!-- Real member -->
...
The parser does NOT filter any cards based on index. All profile cards (org-people-profile-card__profile-image-N) are affiliated members.
Historical Bug (Fixed 2025-12-11)
A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:
- Small orgs (6 members or fewer, since cards 0-5 cover six cards): 0 extracted
- Small-medium orgs: -6 variance (exactly 6 real members incorrectly filtered)
This bug has been fixed. The parser now extracts ALL profile cards.
Count Discrepancies
The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:
Typical Variance: +1 to +10 (More Extracted than Expected)
| Cause | Explanation |
|---|---|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |
Negative Variance (Fewer Extracted than Expected)
| Cause | Explanation | Solution |
|---|---|---|
| Incomplete scroll | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |
Acceptable variance: <5% is normal. Investigate if variance is >10% or consistently negative.
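The thresholds above can be applied mechanically. The sketch below is an illustration only: the function name is hypothetical, and the "borderline" label for the 5-10% band is an assumption, since the rule only defines what is normal (<5%) and what needs investigation (>10%). Whether variance is consistently negative has to be judged across several runs and is not covered here.

```python
def classify_variance(expected, extracted):
    """Return (difference, percent variance, verdict) for one organization."""
    if expected <= 0:
        raise ValueError("expected must be a positive member count")
    diff = extracted - expected
    pct = abs(diff) / expected * 100
    if pct < 5:
        verdict = "normal"
    elif pct > 10:
        verdict = "investigate"
    else:
        verdict = "borderline"  # assumption: the rule does not name this band
    return diff, round(pct, 1), verdict


print(classify_variance(797, 800))
print(classify_variance(624, 561))
```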
Example Batch Results (2025-12-11)
| Organization | Expected | Extracted | Variance | Notes |
|---|---|---|---|---|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |
Duplicate Profile Merging
Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them:
Detection
Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:
- First profile becomes the primary entry
- Additional profiles are stored in the alternate_profiles array
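The grouping described above might look roughly like the following. This is a sketch under assumptions, not the parser's actual code: the function name is hypothetical, whitespace collapsing is added on top of the case-insensitive comparison, and entries marked name_type "anonymous" are passed through without merging.

```python
def merge_duplicate_profiles(profiles):
    """Merge profiles sharing a normalized name; return (merged list, merge count)."""
    primary_by_name = {}
    result = []
    merged_count = 0
    for profile in profiles:
        if profile.get("name_type") == "anonymous":
            result.append(dict(profile))  # anonymous entries are never merged
            continue
        # Assumed normalization: collapse whitespace, compare case-insensitively.
        key = " ".join(profile["name"].split()).casefold()
        primary = primary_by_name.get(key)
        if primary is None:
            primary = dict(profile)  # first profile becomes the primary entry
            primary_by_name[key] = primary
            result.append(primary)
        else:
            primary.setdefault("alternate_profiles", []).append({
                "linkedin_profile_url": profile["linkedin_profile_url"],
                "linkedin_slug": profile["linkedin_slug"],
            })
            merged_count += 1
    return result, merged_count
```

Name-only matching will also merge two different people who happen to share a name, which is one reason to keep every URL in alternate_profiles for later verification.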
Example Output
{
"name": "Mattie Boom",
"linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
"linkedin_slug": "mattie-boom-8346bb79",
"alternate_profiles": [
{
"linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
"linkedin_slug": "mattie-boom-76a122386"
}
]
}
Why This Matters
- Prevents inflated staff counts
- Preserves all profile URLs for verification
- Enables future de-duplication with external sources
Anonymous Member Handling
Anonymous "LinkedIn Member" entries:
- Are NOT deduplicated (each is unique)
- Receive unique IDs: {slug}_staff_{index}_linkedin_member_{N}
- Have name_type: "anonymous"
- Have headlines but NO profile URLs
- Are included in heritage-relevance detection
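To illustrate the ID pattern, here is a hedged sketch of an ID builder. The function name is hypothetical, and the four-digit zero padding and the lowercase underscore form of the name are inferred from example IDs such as rijksmuseum_staff_0000_john_smith. The real scripts may normalize names differently, particularly for non-ASCII characters, which this simple version drops.

```python
import re


def make_staff_id(custodian_slug, index, name, anonymous_seq=None):
    """Build a staff ID; pass anonymous_seq (N) for "LinkedIn Member" entries."""
    # Assumed normalization: lowercase, runs of non-alphanumerics become "_".
    name_part = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    staff_id = f"{custodian_slug}_staff_{index:04d}_{name_part}"
    if anonymous_seq is not None:
        staff_id += f"_{anonymous_seq}"
    return staff_id


print(make_staff_id("rijksmuseum", 0, "John Smith"))
print(make_staff_id("rijksmuseum", 268, "LinkedIn Member", anonymous_seq=1))
```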
Validation
After extraction, verify:
- Total staff is within ~5% of expected associated members count
- with_linkedin_url + anonymous_members approximately equals total_staff_extracted
- Staff with alternate_profiles matches duplicate_profiles_merged count
- Heritage type detection working (staff_by_heritage_type populated)
- No duplicate staff IDs
- No UI element contamination (names like "Show more", "Load more", etc.)
- High data quality: >95% should have headlines
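The checklist above could be automated along these lines. This is a sketch, not an existing script: the function name, the UI_STRINGS set, and the 1% tolerance used for "approximately equals" are assumptions, and the input is expected to be a full parsed output JSON with the structure shown earlier in this document.

```python
from collections import Counter

UI_STRINGS = {"show more", "load more"}  # assumed, non-exhaustive list


def validate_staff_output(data, url_sum_tolerance=0.01):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    staff = data["staff"]
    analysis = data["staff_analysis"]
    expected = data["custodian_metadata"]["associated_members"]
    total = analysis["total_staff_extracted"]

    if expected and abs(total - expected) / expected > 0.05:
        problems.append("total staff differs from associated members by more than 5%")

    accounted = analysis["with_linkedin_url"] + analysis["anonymous_members"]
    if total and abs(accounted - total) / total > url_sum_tolerance:
        problems.append("with_linkedin_url + anonymous_members does not match total")

    with_alternates = sum(1 for s in staff if s.get("alternate_profiles"))
    if with_alternates != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count differs from duplicate_profiles_merged")

    if not analysis.get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    id_counts = Counter(s["staff_id"] for s in staff)
    if any(count > 1 for count in id_counts.values()):
        problems.append("duplicate staff IDs")

    if any(s["name"].strip().lower() in UI_STRINGS for s in staff):
        problems.append("UI element text found in names")

    with_headline = sum(1 for s in staff if s.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("95% or fewer of staff have headlines")

    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to review all issues for a batch of organizations at once.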
Verification Script
# Run extraction with verbose output
python scripts/parse_linkedin_html.py \
"data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
--custodian-name "Example" \
--custodian-slug "example"
# Expected output includes:
# Total staff: 251
# Expected (associated members): 250
# Difference: +1 (more than expected) # Small positive variance = normal
# Duplicate profiles merged: 3
Related Rules
- Rule 15: Connection Data Registration (.opencode/CONNECTION_DATA_REGISTRATION_RULE.md)
- Rule 17: LinkedIn Connection Unique Identifiers (.opencode/LINKEDIN_CONNECTION_ID_RULE.md)
- Rule 18: Custodian Staff Parsing (.opencode/CUSTODIAN_STAFF_PARSING_RULE.md)
Supersedes
This rule supersedes the previous approach of using MD copy-paste as primary source. MD files may still be used as fallback for connections parsing when HTML is not available, but HTML is always preferred.