kempersc be3fbac601 enrich entries and persons

2025-12-10 18:04:25 +01:00

11 KiB

Raw Blame History

HTML-Only LinkedIn Extraction Rule

Rule ID: HTML_ONLY_LINKEDIN_EXTRACTION
Status: ACTIVE
Created: 2025-12-11
Updated: 2025-12-11 ("People you may know" section corrected)
Applies to: All LinkedIn data extraction (custodian staff, person connections)

Summary

Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.

When extracting people data from LinkedIn, the HTML source contains 100% of the data. Markdown copy-paste loses critical information and should NOT be used as a primary source.

Rationale

Data Source	Profile URLs	Headlines	Names	Connection Degree	Coverage
HTML (saved page)	100%	100%	100%	100%	100%
MD (copy-paste)	0%	90%	100%	90%	~90%

Key finding: HTML files contain ALL data including LinkedIn profile URLs (linkedin.com/in/{slug}), which cannot be obtained from MD copy-paste.

HTML Structure for LinkedIn People Pages

Regular Profiles (with profile URL)

<a href="https://www.linkedin.com/in/john-smith-12345" 
   id="org-people-profile-card__profile-image-42">
  <img alt="John Smith" ... />
</a>
<div class="artdeco-entity-lockup__title">John Smith</div>
<div class="artdeco-entity-lockup__badge">1st</div>
<div class="artdeco-entity-lockup__subtitle">Senior Curator at Museum</div>

Anonymous Profiles ("LinkedIn Member")

Privacy-protected profiles have a different structure:

<img id="org-people-profile-card__profile-image-268" ... />
<!-- Note: img element, NOT <a> element - no href -->
<div class="artdeco-entity-lockup__title">LinkedIn Member</div>
<div class="artdeco-entity-lockup__subtitle">Head Barista with SCA certification...</div>

Key difference: Anonymous profiles have the org-people-profile-card__profile-image-N ID on an <img> tag (not an <a> tag), with no href attribute.
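The two card shapes above can be told apart with nothing more than the tag name and the ID prefix. The following is a minimal sketch using Python's standard `html.parser`; it is not the code in `scripts/parse_linkedin_html.py`, and the names `CardAnchorParser` and `classify_cards` are hypothetical. It only classifies cards and does not read names or headlines.

```python
from html.parser import HTMLParser

CARD_ID_PREFIX = "org-people-profile-card__profile-image-"


class CardAnchorParser(HTMLParser):
    """Collect one record per element whose id starts with the card prefix."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        card_id = attr.get("id") or ""
        if not card_id.startswith(CARD_ID_PREFIX):
            return
        suffix = card_id[len(CARD_ID_PREFIX):]
        # <a> carries an href (regular profile); <img> has none (anonymous).
        href = attr.get("href") if tag == "a" else None
        self.cards.append({
            "card_index": int(suffix) if suffix.isdigit() else None,
            "anonymous": href is None,
            "linkedin_profile_url": href,
        })


def classify_cards(html_text):
    """Return the card records found in a saved People page (hypothetical helper)."""
    parser = CardAnchorParser()
    parser.feed(html_text)
    return parser.cards


sample = """
<a href="https://www.linkedin.com/in/john-smith-12345"
   id="org-people-profile-card__profile-image-42">
  <img alt="John Smith" />
</a>
<img id="org-people-profile-card__profile-image-268" />
"""
cards = classify_cards(sample)
```

Self-closing `<img ... />` tags are also routed through `handle_starttag` by `HTMLParser`, so both card shapes are seen by the same handler.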

Extraction Scripts

Script	Purpose	Input
`scripts/parse_linkedin_html.py`	Extract staff from company People page	`*.html`
`scripts/batch_parse_linkedin_orgs.py`	Batch process multiple org HTML files	`directory/*.html`
`scripts/parse_linkedin_connections.py`	Extract connections from a person's connections list	`*.md` (fallback only)

Usage: Single Staff Extraction

python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
    --custodian-name "Rijksmuseum" \
    --custodian-slug "rijksmuseum" \
    --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json

Usage: Batch Processing

python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/

File Naming Convention

HTML Input Files

data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html

Example: Rijksmuseum_ People _ LinkedIn.html

JSON Output Files

data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json

Example: rijksmuseum_staff_20251211T000000Z.json
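As an illustration of the two naming patterns above, the paths could be built as follows. This is a sketch only: `html_input_path` and `staff_output_path` are hypothetical helpers, and the compact UTC timestamp format is inferred from the example filename rather than taken from the scripts.

```python
from datetime import datetime, timezone
from pathlib import Path

MANUAL_DIR = Path("data/custodian/person/affiliated/manual")
PARSED_DIR = Path("data/custodian/person/affiliated/parsed")


def html_input_path(custodian_name):
    """Path of the manually saved HTML file for a custodian."""
    return MANUAL_DIR / f"{custodian_name}_ People _ LinkedIn.html"


def staff_output_path(custodian_slug, when=None):
    """Path of the JSON output; `when` must be timezone-aware if given."""
    when = when or datetime.now(timezone.utc)
    # Assumed format: compact ISO 8601 in UTC, e.g. 20251211T000000Z.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PARSED_DIR / f"{custodian_slug}_staff_{stamp}.json"


example_out = staff_output_path(
    "rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc)
)
```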

Output Structure

{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}

How to Save LinkedIn HTML

1. Navigate to the company's LinkedIn People page (e.g., linkedin.com/company/rijksmuseum/people/)
2. Scroll down to load ALL profiles (LinkedIn uses infinite scroll)
   - CRITICAL: Keep scrolling until no new profiles load
   - Large organizations (500+ members) may require several minutes of scrolling
3. In the browser, choose File > Save Page As... > "Webpage, Complete"
4. Save the file to data/custodian/person/affiliated/manual/

Tip: If extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.

Understanding "People you may know" Section

IMPORTANT CLARIFICATION (2025-12-11): The <h2> element containing "People you may know" is the section title for the actual affiliated members list, NOT a separate recommendations section.

Corrected Understanding

<h2 class="sZUfDGIO...">
  People you may know  <!-- This is a HEADING, not a filtering marker -->
</h2>
<!-- ALL cards below are affiliated members -->
<a id="org-people-profile-card__profile-image-0" href="...">...</a>  <!-- Real member -->
<a id="org-people-profile-card__profile-image-1" href="...">...</a>  <!-- Real member -->
...

The parser does NOT filter any cards based on index. All profile cards (org-people-profile-card__profile-image-N) are affiliated members.

Historical Bug (Fixed 2025-12-11)

A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:

Small orgs (6 members or fewer): 0 extracted, because cards 0-5 cover all six possible cards
Orgs with more than 6 members: -6 variance (exactly 6 real members incorrectly filtered)

This bug has been fixed. The parser now extracts ALL profile cards.

Count Discrepancies

The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:

Typical Variance: +1 to +10 (More Extracted than Expected)

Cause	Explanation
Header caching	LinkedIn updates "N associated members" asynchronously
Recent joins	New members may render before count updates
Page includes founder/CEO	Sometimes counted separately in header

Negative Variance (Fewer Extracted than Expected)

Cause	Explanation	Solution
Incomplete scroll	Page wasn't fully scrolled before saving	Re-save after complete scroll
Privacy changes	Members hid profiles after count update	Normal, accept variance
Deleted accounts	Members left LinkedIn	Normal, accept variance

Acceptable variance: <5% is normal. Investigate if variance is >10% or consistently negative.

Example Batch Results (2025-12-11)

Organization	Expected	Extracted	Variance	Notes
Eye Filmmuseum	250	251	+1	Normal
Het Utrechts Archief	77	77	0	Perfect
KB nationale bibliotheek	624	561	-63	Incomplete scroll
SURF	947	957	+10	Normal (large org)
Rijksmuseum	797	800	+3	Normal
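The variance thresholds can be applied mechanically to expected/extracted pairs like those in the table. The sketch below is illustrative: `variance_report` is a hypothetical helper, and the "review" label for the 5-10% band is an assumption, since the rule only names the <5% and >10% cases.

```python
def variance_report(expected, extracted):
    """Compare the header count with the extracted count."""
    diff = extracted - expected
    pct = 100.0 * diff / expected if expected else 0.0
    if abs(pct) < 5.0:
        status = "normal"
    elif abs(pct) > 10.0:
        status = "investigate"
    else:
        # Assumption: the rule does not name the 5-10% band.
        status = "review"
    return {"difference": diff, "percent": round(pct, 1), "status": status}


# Expected/extracted pairs taken from the batch results table.
batch = {
    "Eye Filmmuseum": (250, 251),
    "Het Utrechts Archief": (77, 77),
    "KB nationale bibliotheek": (624, 561),
    "SURF": (947, 957),
    "Rijksmuseum": (797, 800),
}
reports = {org: variance_report(*counts) for org, counts in batch.items()}
```

With these inputs, only KB nationale bibliotheek (-63 of 624, just over 10%) lands in the "investigate" band, which matches the "Incomplete scroll" note in the table.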

Duplicate Profile Merging

Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them:

Detection

Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:

First profile becomes the primary entry
Additional profiles stored in alternate_profiles array
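The grouping step described here might look like the sketch below. It is not the parser's actual code: `merge_duplicate_profiles` is a hypothetical name, and collapsing whitespace during normalization is an assumption on top of the case-insensitive match the rule states. Anonymous entries are passed through unmerged.

```python
def merge_duplicate_profiles(staff):
    """Merge entries that share a normalized name; keep the first as primary."""
    primaries = {}
    result = []
    for entry in staff:
        if entry.get("name_type") == "anonymous":
            # Anonymous "LinkedIn Member" entries are never deduplicated.
            result.append(dict(entry))
            continue
        # Case-insensitive key; whitespace collapsing is an assumption.
        key = " ".join(entry["name"].split()).casefold()
        primary = primaries.get(key)
        if primary is None:
            primary = dict(entry)
            primaries[key] = primary
            result.append(primary)
        else:
            primary.setdefault("alternate_profiles", []).append({
                "linkedin_profile_url": entry.get("linkedin_profile_url"),
                "linkedin_slug": entry.get("linkedin_slug"),
            })
    return result


raw = [
    {"name": "Mattie Boom", "name_type": "full",
     "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
     "linkedin_slug": "mattie-boom-8346bb79"},
    {"name": "mattie boom", "name_type": "full",
     "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
     "linkedin_slug": "mattie-boom-76a122386"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
    {"name": "LinkedIn Member", "name_type": "anonymous"},
]
merged = merge_duplicate_profiles(raw)
```

Note that name-only matching cannot distinguish one person with two accounts from two people who share a name, which is one reason the alternate URLs are preserved for verification.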

Example Output

{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}

Why This Matters

Prevents inflated staff counts
Preserves all profile URLs for verification
Enables future de-duplication with external sources

Anonymous Member Handling

Anonymous "LinkedIn Member" entries:

Are NOT deduplicated (each is unique)
Receive unique IDs: {slug}_staff_{index}_linkedin_member_{N}
Have name_type: "anonymous"
Have headlines but NO profile URLs
Are included in heritage-relevance detection
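A sketch of how staff IDs in this shape could be generated follows. The helper name `make_staff_id` is hypothetical, the four-digit zero padding is inferred from the example IDs in the output structure, and the exact format of the trailing `{N}` for anonymous members is an assumption.

```python
import re


def make_staff_id(custodian_slug, index, name, anonymous_seq=None):
    """Build a staff ID; pass anonymous_seq only for "LinkedIn Member" cards."""
    # Lowercase the name and replace runs of non-alphanumerics with "_".
    name_part = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    staff_id = f"{custodian_slug}_staff_{index:04d}_{name_part}"
    if anonymous_seq is not None:
        # Assumed: N is a plain counter that keeps anonymous IDs unique.
        staff_id += f"_{anonymous_seq}"
    return staff_id


regular_id = make_staff_id("rijksmuseum", 42, "Mattie Boom")
anonymous_id = make_staff_id("rijksmuseum", 268, "LinkedIn Member", anonymous_seq=3)
```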

Validation

After extraction, verify:

Total staff is within ~5% of expected associated members count
with_linkedin_url + anonymous_members approximately equals total_staff_extracted
Staff with alternate_profiles matches duplicate_profiles_merged count
Heritage type detection working (staff_by_heritage_type populated)
No duplicate staff IDs
No UI element contamination (names like "Show more", "Load more", etc.)
High data quality: >95% should have headlines
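Most of this checklist can be automated against the output JSON. The following is a minimal sketch, assuming the output structure shown earlier; `validate_extraction` is a hypothetical helper, not part of the scripts. The "approximately equals" check on `with_linkedin_url + anonymous_members` is left out because the rule gives no tolerance for it.

```python
UI_NAMES = {"show more", "load more"}  # examples named in the checklist


def validate_extraction(data, tolerance_pct=5.0):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    staff = data["staff"]
    analysis = data["staff_analysis"]
    expected = data["custodian_metadata"]["associated_members"]
    total = analysis["total_staff_extracted"]

    if expected and abs(total - expected) / expected * 100 > tolerance_pct:
        problems.append(f"count variance above {tolerance_pct}%")

    with_alternates = sum(1 for s in staff if s.get("alternate_profiles"))
    if with_alternates != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count differs from duplicate_profiles_merged")

    if not analysis.get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    ids = [s["staff_id"] for s in staff]
    if len(ids) != len(set(ids)):
        problems.append("duplicate staff IDs")

    if any(s["name"].strip().lower() in UI_NAMES for s in staff):
        problems.append("UI element contamination in names")

    with_headline = sum(1 for s in staff if s.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("95% or fewer of staff have headlines")

    return problems
```

Usage would be along the lines of loading a parsed file with `json.load` and treating any non-empty return value as a reason to re-check the saved HTML.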

Verification Script

# Run extraction and check the summary it prints
python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
    --custodian-name "Example" \
    --custodian-slug "example"

# Expected output includes:
#   Total staff: 251
#   Expected (associated members): 250
#   Difference: +1 (more than expected)  # Small positive variance = normal
#   Duplicate profiles merged: 3

Related Rules

Rule 15: Connection Data Registration (.opencode/CONNECTION_DATA_REGISTRATION_RULE.md)
Rule 17: LinkedIn Connection Unique Identifiers (.opencode/LINKEDIN_CONNECTION_ID_RULE.md)
Rule 18: Custodian Staff Parsing (.opencode/CUSTODIAN_STAFF_PARSING_RULE.md)

Supersedes

This rule supersedes the previous approach of using MD copy-paste as primary source. MD files may still be used as fallback for connections parsing when HTML is not available, but HTML is always preferred.
