glam/.opencode/HTML_ONLY_LINKEDIN_EXTRACTION_RULE.md
2025-12-10 18:04:25 +01:00

HTML-Only LinkedIn Extraction Rule

Rule ID: HTML_ONLY_LINKEDIN_EXTRACTION
Status: ACTIVE
Created: 2025-12-11
Updated: 2025-12-11 (PYMK section corrected)
Applies to: All LinkedIn data extraction (custodian staff, person connections)

Summary

Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.

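When extracting people data from LinkedIn, the saved HTML source contains every field the parser needs: profile URLs, headlines, names, and connection degree. Markdown copy-paste drops profile URLs entirely and loses roughly 10% of headlines and connection degrees, so it should NOT be used as a primary source.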

Rationale

| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|---|---|---|---|---|---|
| HTML (saved page) | 100% | 100% | 100% | 100% | 100% |
| MD (copy-paste) | 0% | 90% | 100% | 90% | ~90% |

Key finding: HTML files contain ALL data including LinkedIn profile URLs (linkedin.com/in/{slug}), which cannot be obtained from MD copy-paste.

HTML Structure for LinkedIn People Pages

Regular Profiles (with profile URL)

<a href="https://www.linkedin.com/in/john-smith-12345" 
   id="org-people-profile-card__profile-image-42">
  <img alt="John Smith" ... />
</a>
<div class="artdeco-entity-lockup__title">John Smith</div>
<div class="artdeco-entity-lockup__badge">1st</div>
<div class="artdeco-entity-lockup__subtitle">Senior Curator at Museum</div>

Anonymous Profiles ("LinkedIn Member")

Privacy-protected profiles have a different structure:

<img id="org-people-profile-card__profile-image-268" ... />
<!-- Note: img element, NOT <a> element - no href -->
<div class="artdeco-entity-lockup__title">LinkedIn Member</div>
<div class="artdeco-entity-lockup__subtitle">Head Barista with SCA certification...</div>

Key difference: Anonymous profiles have the org-people-profile-card__profile-image-N ID on an <img> tag (not an <a> tag), with no href attribute.
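
A minimal sketch of how the two card shapes can be told apart, using only the standard library's html.parser. The class name ProfileCardParser and the dictionary keys are hypothetical and are not taken from scripts/parse_linkedin_html.py; the sketch only assumes the id prefix and the <a> versus <img> difference described above.

```python
from html.parser import HTMLParser

CARD_ID_PREFIX = "org-people-profile-card__profile-image-"


class ProfileCardParser(HTMLParser):
    """Collect profile cards: <a> with href = regular, <img> = anonymous."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        card_id = attr.get("id") or ""
        if not card_id.startswith(CARD_ID_PREFIX):
            return  # not a profile-card element
        suffix = card_id[len(CARD_ID_PREFIX):]
        if not suffix.isdigit():
            return  # unexpected id shape, skip rather than guess
        if tag == "a" and attr.get("href"):
            # Regular profile: the card id sits on the link itself.
            self.cards.append(
                {"index": int(suffix), "anonymous": False, "url": attr["href"]}
            )
        elif tag == "img":
            # Anonymous "LinkedIn Member": the id sits on the image, no href.
            self.cards.append(
                {"index": int(suffix), "anonymous": True, "url": None}
            )


sample = """
<a href="https://www.linkedin.com/in/john-smith-12345"
   id="org-people-profile-card__profile-image-42"><img alt="John Smith" /></a>
<img id="org-people-profile-card__profile-image-268" alt="" />
"""

parser = ProfileCardParser()
parser.feed(sample)
cards = parser.cards
for card in cards:
    print(card)
```

The inner <img alt="John Smith" /> carries no card id, so it is ignored and only the enclosing link is recorded.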

Extraction Scripts

| Script | Purpose | Input |
|---|---|---|
| scripts/parse_linkedin_html.py | Extract staff from company People page | *.html |
| scripts/batch_parse_linkedin_orgs.py | Batch process multiple org HTML files | directory/*.html |
| scripts/parse_linkedin_connections.py | Extract connections from person | *.md (fallback only) |

Usage: Single Staff Extraction

python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
    --custodian-name "Rijksmuseum" \
    --custodian-slug "rijksmuseum" \
    --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json

Usage: Batch Processing

python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/

File Naming Convention

HTML Input Files

data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html

Example: Rijksmuseum_ People _ LinkedIn.html

JSON Output Files

data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json

Example: rijksmuseum_staff_20251211T000000Z.json
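
The two naming patterns above can be sketched as small helpers. The function names input_path and output_path are hypothetical, and the timestamp layout (%Y%m%dT%H%M%SZ in UTC) is inferred from the example filename rather than confirmed by the scripts.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

MANUAL_DIR = PurePosixPath("data/custodian/person/affiliated/manual")
PARSED_DIR = PurePosixPath("data/custodian/person/affiliated/parsed")


def input_path(custodian_name: str) -> PurePosixPath:
    """Path of the manually saved HTML file for a custodian."""
    return MANUAL_DIR / f"{custodian_name}_ People _ LinkedIn.html"


def output_path(custodian_slug: str, when: datetime) -> PurePosixPath:
    """Path of the parsed JSON file; `when` must be timezone-aware."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PARSED_DIR / f"{custodian_slug}_staff_{stamp}.json"


html_file = input_path("Rijksmuseum")
json_file = output_path(
    "rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc)
)
print(html_file)
print(json_file)
```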

Output Structure

{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}

How to Save LinkedIn HTML

  1. Navigate to the company's LinkedIn People page (e.g., linkedin.com/company/rijksmuseum/people/)
  2. Scroll down to load ALL profiles (LinkedIn uses infinite scroll)
    • CRITICAL: Keep scrolling until no new profiles load
    • Large organizations (500+ members) may require several minutes of scrolling
  3. File > Save Page As... > "Webpage, Complete"
  4. Save to data/custodian/person/affiliated/manual/

Tip: If extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.

Understanding "People you may know" Section

IMPORTANT CLARIFICATION (2025-12-11): The <h2> element containing "People you may know" is the section title for the actual affiliated members list, NOT a separate recommendations section.

Corrected Understanding

<h2 class="sZUfDGIO...">
  People you may know  <!-- This is a HEADING, not a filtering marker -->
</h2>
<!-- ALL cards below are affiliated members -->
<a id="org-people-profile-card__profile-image-0" href="...">...</a>  <!-- Real member -->
<a id="org-people-profile-card__profile-image-1" href="...">...</a>  <!-- Real member -->
...

The parser does NOT filter any cards based on index. All profile cards (org-people-profile-card__profile-image-N) are affiliated members.

Historical Bug (Fixed 2025-12-11)

A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:

  • Small orgs (6 members or fewer, since cards 0-5 cover six entries): 0 extracted
  • Small-medium orgs: -6 variance (exactly 6 real members incorrectly filtered)

This bug has been fixed. The parser now extracts ALL profile cards.

Count Discrepancies

The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:

Typical Variance: +1 to +10 (More Extracted than Expected)

| Cause | Explanation |
|---|---|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |

Negative Variance (Fewer Extracted than Expected)

| Cause | Explanation | Solution |
|---|---|---|
| Incomplete scroll | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |

Acceptable variance: <5% is normal. Investigate if variance is >10% or consistently negative.

Example Batch Results (2025-12-11)

| Organization | Expected | Extracted | Variance | Notes |
|---|---|---|---|---|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |
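
As a rough illustration, the variance thresholds can be applied to the batch results above. The function variance_report and its status labels are hypothetical. The rule only defines "<5% is normal" and ">10% investigate", so the "review" label for the 5-10% band is an assumption of this sketch, and a single run cannot detect a "consistently negative" trend.

```python
def variance_report(expected: int, extracted: int) -> dict:
    """Compare the extracted staff count with LinkedIn's header count."""
    difference = extracted - expected
    percent = abs(difference) / expected * 100 if expected else 0.0
    if percent < 5:
        status = "normal"
    elif percent > 10:
        status = "investigate"
    else:
        status = "review"  # 5-10% band: label assumed, not defined by the rule
    return {
        "difference": difference,
        "percent": round(percent, 1),
        "status": status,
    }


# (expected, extracted) pairs taken from the batch results table.
batch = {
    "Eye Filmmuseum": (250, 251),
    "Het Utrechts Archief": (77, 77),
    "KB nationale bibliotheek": (624, 561),
    "SURF": (947, 957),
    "Rijksmuseum": (797, 800),
}

for name, (expected, extracted) in batch.items():
    report = variance_report(expected, extracted)
    print(f"{name}: {report['difference']:+d} "
          f"({report['percent']}%) {report['status']}")
```

Only KB nationale bibliotheek crosses the 10% line here (63 of 624 missing), which matches the "Incomplete scroll" note in the table.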

Duplicate Profile Merging

Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them into a single staff entry.

Detection

Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:

  • First profile becomes the primary entry
  • Additional profiles stored in alternate_profiles array
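
The grouping step might look like the sketch below. The function name merge_duplicate_profiles is hypothetical, and trimming surrounding whitespace before the case-insensitive comparison is an assumption; the rule only states that names are compared case-insensitively. Anonymous entries are passed through untouched, since they are never deduplicated.

```python
def merge_duplicate_profiles(profiles: list) -> list:
    """Merge profiles sharing a normalized name; first one stays primary."""
    result = []
    primary_by_name = {}
    for profile in profiles:
        if profile.get("name_type") == "anonymous":
            result.append(dict(profile))  # each anonymous member is unique
            continue
        key = profile["name"].strip().casefold()
        primary = primary_by_name.get(key)
        if primary is None:
            primary = dict(profile)
            primary_by_name[key] = primary
            result.append(primary)
        else:
            # Keep the extra account on the primary entry for verification.
            primary.setdefault("alternate_profiles", []).append(
                {
                    "linkedin_profile_url": profile.get("linkedin_profile_url"),
                    "linkedin_slug": profile.get("linkedin_slug"),
                }
            )
    return result


profiles = [
    {"name": "Mattie Boom", "name_type": "full",
     "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
     "linkedin_slug": "mattie-boom-8346bb79"},
    {"name": "mattie boom", "name_type": "full",
     "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
     "linkedin_slug": "mattie-boom-76a122386"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
]
merged = merge_duplicate_profiles(profiles)
print(len(merged))
```

Matching on name alone will also merge two different people who share a name, so the alternate_profiles array is worth spot-checking.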

Example Output

{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}

Why This Matters

  • Prevents inflated staff counts
  • Preserves all profile URLs for verification
  • Enables future de-duplication with external sources

Anonymous Member Handling

Anonymous "LinkedIn Member" entries:

  • Are NOT deduplicated (each is unique)
  • Receive unique IDs: {slug}_staff_{index}_linkedin_member_{N}
  • Have name_type: "anonymous"
  • Have headlines but NO profile URLs
  • Are included in heritage-relevance detection
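
A sketch of how staff IDs in both shapes could be built. The helper names are hypothetical. The four-digit zero padding and the lowercase underscore name slug are inferred from the examples in the output structure (rijksmuseum_staff_0000_john_smith); whether the anonymous pattern pads {index} the same way is an assumption.

```python
import re
from typing import Optional


def name_slug(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def staff_id(custodian_slug: str, index: int, name: str,
             anonymous_n: Optional[int] = None) -> str:
    """Build a staff ID; pass anonymous_n for "LinkedIn Member" entries."""
    base = f"{custodian_slug}_staff_{index:04d}"
    if anonymous_n is not None:
        # Anonymous members get a running counter instead of a name slug.
        return f"{base}_linkedin_member_{anonymous_n}"
    return f"{base}_{name_slug(name)}"


regular_id = staff_id("rijksmuseum", 0, "John Smith")
anonymous_id = staff_id("rijksmuseum", 268, "LinkedIn Member", anonymous_n=3)
print(regular_id)
print(anonymous_id)
```

This ASCII-only slug drops accented characters, so a real implementation would likely transliterate names first.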

Validation

After extraction, verify:

  • Total staff is within ~5% of expected associated members count
  • with_linkedin_url + anonymous_members approximately equals total_staff_extracted
  • Staff with alternate_profiles matches duplicate_profiles_merged count
  • Heritage type detection working (staff_by_heritage_type populated)
  • No duplicate staff IDs
  • No UI element contamination (names like "Show more", "Load more", etc.)
  • High data quality: >95% should have headlines
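
Most of the checklist can be automated against the output JSON. The sketch below is not part of the existing scripts: validate_output and UI_NOISE are hypothetical names, and the field names are those shown in the output structure. The "approximately equals" check on with_linkedin_url + anonymous_members is left out because the rule gives no tolerance for it.

```python
UI_NOISE = {"show more", "load more"}  # examples named in the checklist


def validate_output(data: dict) -> list[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    staff = data["staff"]
    analysis = data["staff_analysis"]
    expected = data["custodian_metadata"]["associated_members"]
    total = analysis["total_staff_extracted"]

    if expected and abs(total - expected) / expected > 0.05:
        problems.append(f"count variance above 5%: {total} vs {expected}")

    with_alt = sum(1 for s in staff if s.get("alternate_profiles"))
    if with_alt != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count != duplicate_profiles_merged")

    if not analysis.get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    ids = [s["staff_id"] for s in staff]
    if len(ids) != len(set(ids)):
        problems.append("duplicate staff IDs")

    noisy = [s["name"] for s in staff if s["name"].strip().lower() in UI_NOISE]
    if noisy:
        problems.append(f"UI element contamination: {noisy}")

    with_headline = sum(1 for s in staff if s.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("fewer than 95% of staff have headlines")

    return problems
```

Loading a parsed file with json.load and passing the result to validate_output gives a quick pass/fail summary after each extraction.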

Verification Script

# Run extraction with verbose output
python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
    --custodian-name "Example" \
    --custodian-slug "example"

# Expected output includes:
#   Total staff: 251
#   Expected (associated members): 250
#   Difference: +1 (more than expected)  # Small positive variance = normal
#   Duplicate profiles merged: 3

Related Rules

  • Rule 15: Connection Data Registration (.opencode/CONNECTION_DATA_REGISTRATION_RULE.md)
  • Rule 17: LinkedIn Connection Unique Identifiers (.opencode/LINKEDIN_CONNECTION_ID_RULE.md)
  • Rule 18: Custodian Staff Parsing (.opencode/CUSTODIAN_STAFF_PARSING_RULE.md)

Supersedes

This rule supersedes the previous approach of using MD copy-paste as primary source. MD files may still be used as fallback for connections parsing when HTML is not available, but HTML is always preferred.