# HTML-Only LinkedIn Extraction Rule

**Rule ID**: HTML_ONLY_LINKEDIN_EXTRACTION  
**Status**: ACTIVE  
**Created**: 2025-12-11  
**Updated**: 2025-12-11 (PYMK section corrected)  
**Applies to**: All LinkedIn data extraction (custodian staff, person connections)

## Summary

**Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.**

When extracting people data from LinkedIn, the HTML source contains 100% of the data. Markdown copy-paste loses critical information and should NOT be used as a primary source.

## Rationale

| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|-------------|--------------|-----------|-------|-------------------|----------|
| **HTML (saved page)** | 100% | 100% | 100% | 100% | **100%** |
| **MD (copy-paste)** | 0% | 90% | 100% | 90% | ~90% |

**Key finding**: HTML files contain ALL data including LinkedIn profile URLs (`linkedin.com/in/{slug}`), which cannot be obtained from MD copy-paste.

## HTML Structure for LinkedIn People Pages

### Regular Profiles (with profile URL)

```html
John Smith
John Smith
1st
Senior Curator at Museum
```

### Anonymous Profiles ("LinkedIn Member")

Privacy-protected profiles have a different structure:

```html
LinkedIn Member
Head Barista with SCA certification...
```

**Key difference**: Anonymous profiles have the `org-people-profile-card__profile-image-N` ID on an `<img>` tag (not an `<a>` tag), with no `href` attribute.

## Extraction Scripts

| Script | Purpose | Input |
|--------|---------|-------|
| `scripts/parse_linkedin_html.py` | Extract staff from company People page | `*.html` |
| `scripts/batch_parse_linkedin_orgs.py` | Batch process multiple org HTML files | `directory/*.html` |
| `scripts/parse_linkedin_connections.py` | Extract a person's connections | `*.md` (fallback only) |

### Usage: Single Staff Extraction

```bash
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
  --custodian-name "Rijksmuseum" \
  --custodian-slug "rijksmuseum" \
  --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json
```

### Usage: Batch Processing

```bash
python scripts/batch_parse_linkedin_orgs.py

# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/
```

## File Naming Convention

### HTML Input Files

```
data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html
```

Example: `Rijksmuseum_ People _ LinkedIn.html`

### JSON Output Files

```
data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json
```

Example: `rijksmuseum_staff_20251211T000000Z.json`

## Output Structure

```json
{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}
```

## How to Save LinkedIn HTML

1. Navigate to the company's LinkedIn page (e.g., `linkedin.com/company/rijksmuseum/people/`)
2. **Scroll down to load ALL profiles** (LinkedIn uses infinite scroll)
   - **CRITICAL**: Keep scrolling until no new profiles load
   - Large organizations (500+ members) may require several minutes of scrolling
3. **File > Save Page As...** > "Webpage, Complete"
4. Save to `data/custodian/person/affiliated/manual/`

**Tip**: If the extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.

## Understanding "People you may know" Section

**IMPORTANT CLARIFICATION (2025-12-11)**: The heading element containing "People you may know" is the **section title for the actual affiliated members list**, NOT a separate recommendations section.

### Corrected Understanding

```html
People you may know

... ... ...
```

The parser does **NOT** filter any cards based on index. All profile cards (`org-people-profile-card__profile-image-N`) are affiliated members.

### Historical Bug (Fixed 2025-12-11)

A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:

- Small orgs (5 members or fewer): **0 extracted**
- Small-medium orgs: **-6 variance** (exactly 6 real members incorrectly filtered)

This bug has been fixed. The parser now extracts ALL profile cards.

## Count Discrepancies

The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:

### Typical Variance: +1 to +10 (More Extracted than Expected)

| Cause | Explanation |
|-------|-------------|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |

### Negative Variance (Fewer Extracted than Expected)

| Cause | Explanation | Solution |
|-------|-------------|----------|
| **Incomplete scroll** | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |

**Acceptable variance**: <5% is normal. Investigate if variance is >10% or consistently negative.

### Example Batch Results (2025-12-11)

| Organization | Expected | Extracted | Variance | Notes |
|--------------|----------|-----------|----------|-------|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |

## Duplicate Profile Merging

Some people maintain multiple LinkedIn accounts.
The parser detects these by name and merges them:

### Detection

Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:

- First profile becomes the **primary** entry
- Additional profiles stored in `alternate_profiles` array

### Example Output

```json
{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}
```

### Why This Matters

- Prevents inflated staff counts
- Preserves all profile URLs for verification
- Enables future de-duplication with external sources

## Anonymous Member Handling

Anonymous "LinkedIn Member" entries:

- Are NOT deduplicated (each is unique)
- Receive unique IDs: `{slug}_staff_{index}_linkedin_member_{N}`
- Have `name_type: "anonymous"`
- Have headlines but NO profile URLs
- Are included in heritage-relevance detection

## Validation

After extraction, verify:

- [ ] Total staff is within ~5% of expected associated members count
- [ ] `with_linkedin_url` + `anonymous_members` approximately equals `total_staff_extracted`
- [ ] Staff with `alternate_profiles` matches `duplicate_profiles_merged` count
- [ ] Heritage type detection working (`staff_by_heritage_type` populated)
- [ ] No duplicate staff IDs
- [ ] No UI element contamination (names like "Show more", "Load more", etc.)
- [ ] High data quality: >95% of extracted entries should have headlines

### Verification Script

```bash
# Run extraction with verbose output
python scripts/parse_linkedin_html.py \
  "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
  --custodian-name "Example" \
  --custodian-slug "example"

# Expected output includes:
# Total staff: 251
# Expected (associated members): 250
# Difference: +1 (more than expected)
# Small positive variance = normal
# Duplicate profiles merged: 3
```

## Related Rules

- **Rule 15**: Connection Data Registration (`.opencode/CONNECTION_DATA_REGISTRATION_RULE.md`)
- **Rule 17**: LinkedIn Connection Unique Identifiers (`.opencode/LINKEDIN_CONNECTION_ID_RULE.md`)
- **Rule 18**: Custodian Staff Parsing (`.opencode/CUSTODIAN_STAFF_PARSING_RULE.md`)

## Supersedes

This rule supersedes the previous approach of using MD copy-paste as the primary source. MD files may still be used as a fallback for connections parsing when HTML is not available, but HTML is always preferred.
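
As a companion to the Validation checklist, the mechanical checks could be automated along the following lines. This is a minimal sketch, not part of the existing scripts: the function name `validate_staff_output` and the `UI_STRINGS` set are hypothetical, the field names are taken from the Output Structure example, and the 5% and 95% thresholds come from the checklist itself. It treats the length of the `staff` array as the extracted total, which is an assumption about how the parser writes its output.

```python
from collections import Counter

# Assumption: only the two UI strings the checklist names; extend as needed.
UI_STRINGS = {"show more", "load more"}


def validate_staff_output(data):
    """Return a list of checklist failures for one parsed output document."""
    staff = data.get("staff", [])
    total = len(staff)
    if total == 0:
        return ["no staff extracted"]
    problems = []

    # Total staff should be within ~5% of the associated-members count.
    expected = data.get("custodian_metadata", {}).get("associated_members")
    if expected:
        variance = (total - expected) / expected
        if abs(variance) > 0.05:
            problems.append(f"count variance {variance:+.1%} is outside 5%")

    # Entries with alternate_profiles should match duplicate_profiles_merged.
    merged = data.get("source_metadata", {}).get("duplicate_profiles_merged", 0)
    with_alternates = sum(1 for s in staff if s.get("alternate_profiles"))
    if with_alternates != merged:
        problems.append(
            f"{with_alternates} entries have alternate_profiles, "
            f"but duplicate_profiles_merged is {merged}"
        )

    # Heritage type detection should have produced a non-empty breakdown.
    if not data.get("staff_analysis", {}).get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    # Staff IDs must be unique.
    id_counts = Counter(s.get("staff_id") for s in staff)
    duplicate_ids = [str(i) for i, n in id_counts.items() if n > 1]
    if duplicate_ids:
        problems.append("duplicate staff IDs: " + ", ".join(duplicate_ids))

    # Names must not be leftover UI labels.
    contaminated = [
        s["name"] for s in staff
        if (s.get("name") or "").strip().lower() in UI_STRINGS
    ]
    if contaminated:
        problems.append("UI text captured as names: " + ", ".join(contaminated))

    # More than 95% of entries should carry a headline.
    with_headline = sum(1 for s in staff if (s.get("headline") or "").strip())
    if with_headline / total <= 0.95:
        problems.append(f"only {with_headline}/{total} entries have headlines")

    return problems
```

Loading a parsed file with `json.load` and passing the result to `validate_staff_output` returns an empty list when every check passes. The "approximately equals" check on `with_linkedin_url` + `anonymous_members` is left out because the rule does not define a tolerance for it.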