glam/.opencode/HTML_ONLY_LINKEDIN_EXTRACTION_RULE.md

# HTML-Only LinkedIn Extraction Rule

**Rule ID**: HTML_ONLY_LINKEDIN_EXTRACTION
**Status**: ACTIVE
**Created**: 2025-12-11
**Updated**: 2025-12-11 ("People you may know" section corrected)
**Applies to**: All LinkedIn data extraction (custodian staff, person connections)

## Summary

**Use ONLY manually saved HTML files (not MD copy-paste) for extracting LinkedIn data.**

A saved HTML page preserves every field needed for people extraction, including profile URLs. Markdown (MD) copy-paste drops all profile URLs and some headlines and connection degrees, so it should NOT be used as a primary source.

## Rationale

| Data Source | Profile URLs | Headlines | Names | Connection Degree | Coverage |
|-------------|--------------|-----------|-------|-------------------|----------|
| **HTML (saved page)** | 100% | 100% | 100% | 100% | **100%** |
| **MD (copy-paste)** | 0% | 90% | 100% | 90% | ~90% |

**Key finding**: HTML files contain ALL data including LinkedIn profile URLs (`linkedin.com/in/{slug}`), which cannot be obtained from MD copy-paste.

## HTML Structure for LinkedIn People Pages

### Regular Profiles (with profile URL)

```html
<a href="https://www.linkedin.com/in/john-smith-12345"
   id="org-people-profile-card__profile-image-42">
  <img alt="John Smith" ... />
</a>
<div class="artdeco-entity-lockup__title">John Smith</div>
<div class="artdeco-entity-lockup__badge">1st</div>
<div class="artdeco-entity-lockup__subtitle">Senior Curator at Museum</div>
```

### Anonymous Profiles ("LinkedIn Member")

Privacy-protected profiles have a different structure:

```html
<img id="org-people-profile-card__profile-image-268" ... />
<!-- Note: img element, NOT <a> element - no href -->
<div class="artdeco-entity-lockup__title">LinkedIn Member</div>
<div class="artdeco-entity-lockup__subtitle">Head Barista with SCA certification...</div>
```

**Key difference**: Anonymous profiles have the `org-people-profile-card__profile-image-N` ID on an `<img>` tag (not an `<a>` tag), with no `href` attribute.
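
The distinction above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not the logic of `scripts/parse_linkedin_html.py`; the names `ProfileCardScanner` and `scan_cards` are hypothetical, and it reads only the card ID and `href`, not names or headlines.

```python
from html.parser import HTMLParser

CARD_ID_PREFIX = "org-people-profile-card__profile-image-"


class ProfileCardScanner(HTMLParser):
    """Collect one record per element whose id carries the card prefix."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        attr_map = dict(attrs)
        card_id = attr_map.get("id") or ""
        suffix = card_id[len(CARD_ID_PREFIX):]
        if not card_id.startswith(CARD_ID_PREFIX) or not suffix.isdigit():
            return
        # <a> with href -> regular profile; <img> -> anonymous "LinkedIn Member".
        is_regular = tag == "a" and bool(attr_map.get("href"))
        self.cards.append({
            "index": int(suffix),
            "anonymous": not is_regular,
            "linkedin_profile_url": attr_map["href"] if is_regular else None,
        })


def scan_cards(html_text):
    """Hypothetical helper: return the card records found in html_text."""
    scanner = ProfileCardScanner()
    scanner.feed(html_text)
    return scanner.cards
```

Feeding the two snippets above through `scan_cards` would yield one regular record (index 42, with URL) and one anonymous record (index 268, no URL).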

## Extraction Scripts

| Script | Purpose | Input |
|--------|---------|-------|
| `scripts/parse_linkedin_html.py` | Extract staff from company People page | `*.html` |
| `scripts/batch_parse_linkedin_orgs.py` | Batch process multiple org HTML files | `directory/*.html` |
| `scripts/parse_linkedin_connections.py` | Extract a person's connections | `*.md` (fallback only) |

### Usage: Single Staff Extraction

```bash
python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Rijksmuseum_ People _ LinkedIn.html" \
    --custodian-name "Rijksmuseum" \
    --custodian-slug "rijksmuseum" \
    --output data/custodian/person/affiliated/parsed/rijksmuseum_staff_20251211T000000Z.json
```

### Usage: Batch Processing

```bash
python scripts/batch_parse_linkedin_orgs.py
# Processes all HTML files in data/custodian/person/affiliated/manual/
# Outputs JSON to data/custodian/person/affiliated/parsed/
```

## File Naming Convention

### HTML Input Files

```
data/custodian/person/affiliated/manual/{CustodianName}_ People _ LinkedIn.html
```

Example: `Rijksmuseum_ People _ LinkedIn.html`

### JSON Output Files

```
data/custodian/person/affiliated/parsed/{custodian_slug}_staff_{ISO-timestamp}.json
```

Example: `rijksmuseum_staff_20251211T000000Z.json`
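
As an illustration of the two naming patterns, the sketch below builds both paths. The helper names `input_path` and `output_path` are hypothetical, and the compact timestamp format is inferred from the example filename above rather than taken from the scripts.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

MANUAL_DIR = PurePosixPath("data/custodian/person/affiliated/manual")
PARSED_DIR = PurePosixPath("data/custodian/person/affiliated/parsed")


def input_path(custodian_name):
    """Path of the manually saved HTML file for a custodian."""
    return MANUAL_DIR / f"{custodian_name}_ People _ LinkedIn.html"


def output_path(custodian_slug, when):
    """Path of the parsed JSON file; `when` must be timezone-aware."""
    # Assumption: compact UTC timestamp, as in 20251211T000000Z.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PARSED_DIR / f"{custodian_slug}_staff_{stamp}.json"


print(input_path("Rijksmuseum"))
print(output_path("rijksmuseum", datetime(2025, 12, 11, tzinfo=timezone.utc)))
```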

## Output Structure

```json
{
  "custodian_metadata": {
    "custodian_name": "Rijksmuseum",
    "custodian_slug": "rijksmuseum",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": {"city": "Amsterdam", "region": "North Holland"},
    "associated_members": 797
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page_html",
    "source_file": "Rijksmuseum_ People _ LinkedIn.html",
    "registered_timestamp": "2025-12-11T00:00:00Z",
    "registration_method": "html_parsing",
    "staff_extracted": 800,
    "duplicate_profiles_merged": 5
  },
  "staff": [
    {
      "staff_id": "rijksmuseum_staff_0000_john_smith",
      "name": "John Smith",
      "name_type": "full",
      "degree": "1st",
      "headline": "Senior Curator at Rijksmuseum",
      "linkedin_profile_url": "https://www.linkedin.com/in/john-smith-12345",
      "linkedin_slug": "john-smith-12345",
      "heritage_relevant": true,
      "heritage_type": "M"
    },
    {
      "staff_id": "rijksmuseum_staff_0042_mattie_boom",
      "name": "Mattie Boom",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Senior Curator of Photography",
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
      "linkedin_slug": "mattie-boom-8346bb79",
      "alternate_profiles": [
        {
          "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
          "linkedin_slug": "mattie-boom-76a122386"
        }
      ],
      "heritage_relevant": true,
      "heritage_type": "M"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 800,
    "with_linkedin_url": 703,
    "with_alternate_profiles": 5,
    "anonymous_members": 91,
    "heritage_relevant_count": 643,
    "staff_by_heritage_type": {"M": 537, "A": 1, "L": 1, "E": 45, "D": 16, "R": 10, "O": 2}
  }
}
```

## How to Save LinkedIn HTML

1. Navigate to the company's LinkedIn People page (e.g., `linkedin.com/company/rijksmuseum/people/`)
2. **Scroll down to load ALL profiles** (LinkedIn uses infinite scroll)
   - **CRITICAL**: Keep scrolling until no new profiles load
   - Large organizations (500+ members) may require several minutes of scrolling
3. **File > Save Page As...** > "Webpage, Complete"
4. Save to `data/custodian/person/affiliated/manual/`

**Tip**: If the extracted count is significantly lower than expected, the page was likely not fully scrolled before saving.

## Understanding "People you may know" Section

**IMPORTANT CLARIFICATION (2025-12-11)**: The `<h2>` element containing "People you may know" is the **section title for the actual affiliated members list**, NOT a separate recommendations section.

### Corrected Understanding

```html
<h2 class="sZUfDGIO...">
  People you may know  <!-- This is a HEADING, not a filtering marker -->
</h2>
<!-- ALL cards below are affiliated members -->
<a id="org-people-profile-card__profile-image-0" href="...">...</a>  <!-- Real member -->
<a id="org-people-profile-card__profile-image-1" href="...">...</a>  <!-- Real member -->
...
```

The parser does **NOT** filter any cards based on index. All profile cards (`org-people-profile-card__profile-image-N`) are affiliated members.

### Historical Bug (Fixed 2025-12-11)

A previous version of the parser incorrectly assumed cards 0-5 were "recommendations" and filtered them out. This caused:
- Small orgs (6 members or fewer, i.e. only cards 0-5): **0 extracted**
- Small-medium orgs: **-6 variance** (exactly 6 real members incorrectly filtered)

This bug has been fixed. The parser now extracts ALL profile cards.

## Count Discrepancies

The "associated members" count shown in the LinkedIn header may differ slightly from the extracted count. Common causes:

### Typical Variance: +1 to +10 (More Extracted than Expected)

| Cause | Explanation |
|-------|-------------|
| Header caching | LinkedIn updates "N associated members" asynchronously |
| Recent joins | New members may render before count updates |
| Page includes founder/CEO | Sometimes counted separately in header |

### Negative Variance (Fewer Extracted than Expected)

| Cause | Explanation | Solution |
|-------|-------------|----------|
| **Incomplete scroll** | Page wasn't fully scrolled before saving | Re-save after complete scroll |
| Privacy changes | Members hid profiles after count update | Normal, accept variance |
| Deleted accounts | Members left LinkedIn | Normal, accept variance |

**Acceptable variance**: A difference under 5% of the expected count is normal. Investigate when the difference exceeds 10% or when extractions consistently come in below the expected count.

### Example Batch Results (2025-12-11)

| Organization | Expected | Extracted | Variance | Notes |
|--------------|----------|-----------|----------|-------|
| Eye Filmmuseum | 250 | 251 | +1 | Normal |
| Het Utrechts Archief | 77 | 77 | 0 | Perfect |
| KB nationale bibliotheek | 624 | 561 | -63 | Incomplete scroll |
| SURF | 947 | 957 | +10 | Normal (large org) |
| Rijksmuseum | 797 | 800 | +3 | Normal |
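
The variance check can be sketched as a small function. This is an illustration only: `classify_variance` is a hypothetical name, and the "borderline" label for the 5-10% band is an assumption, since the rule defines only the under-5% and over-10% cases.

```python
def classify_variance(expected, extracted):
    """Return (difference, percent of expected, verdict) for one organization."""
    if expected <= 0:
        raise ValueError("expected must be a positive member count")
    diff = extracted - expected
    pct = abs(diff) / expected * 100
    if pct < 5:
        verdict = "normal"
    elif pct > 10:
        verdict = "investigate"
    else:
        verdict = "borderline"  # assumption: band not defined by the rule
    return diff, round(pct, 1), verdict


# Figures taken from the batch results table above.
batch = {
    "Eye Filmmuseum": (250, 251),
    "Het Utrechts Archief": (77, 77),
    "KB nationale bibliotheek": (624, 561),
    "SURF": (947, 957),
    "Rijksmuseum": (797, 800),
}
for org, (expected, extracted) in batch.items():
    print(org, classify_variance(expected, extracted))
```

With these figures, only KB nationale bibliotheek (-63, about 10.1%) crosses the investigate threshold, which matches the "Incomplete scroll" note in the table.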

## Duplicate Profile Merging

Some people maintain multiple LinkedIn accounts. The parser detects these by name and merges them into a single staff entry.

### Detection

Profiles are grouped by normalized name (case-insensitive). If multiple profiles share the same name:
- The first profile becomes the **primary** entry
- Additional profiles are stored in the `alternate_profiles` array
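
The grouping described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: `normalize_name` and `merge_duplicate_profiles` are hypothetical names, and collapsing whitespace in addition to case-folding is an assumption about what "normalized" means. Anonymous entries are passed through unmerged, as this rule requires.

```python
def normalize_name(name):
    """Case-insensitive key; whitespace collapsing is an assumption."""
    return " ".join(name.split()).casefold()


def merge_duplicate_profiles(profiles):
    """Return (merged staff list, number of profiles merged into a primary)."""
    primaries = {}
    merged = []
    merged_count = 0
    for profile in profiles:
        if profile.get("name_type") == "anonymous":
            merged.append(dict(profile))  # never deduplicated
            continue
        key = normalize_name(profile["name"])
        primary = primaries.get(key)
        if primary is None:
            primary = dict(profile)  # first profile becomes the primary entry
            primaries[key] = primary
            merged.append(primary)
        else:
            primary.setdefault("alternate_profiles", []).append({
                "linkedin_profile_url": profile["linkedin_profile_url"],
                "linkedin_slug": profile["linkedin_slug"],
            })
            merged_count += 1
    return merged, merged_count
```

Because matching is by name alone, two different people who share a name would also be merged under this scheme; keeping every URL in `alternate_profiles` is what makes such cases recoverable on review.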

### Example Output

```json
{
  "name": "Mattie Boom",
  "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-8346bb79",
  "linkedin_slug": "mattie-boom-8346bb79",
  "alternate_profiles": [
    {
      "linkedin_profile_url": "https://www.linkedin.com/in/mattie-boom-76a122386",
      "linkedin_slug": "mattie-boom-76a122386"
    }
  ]
}
```

### Why This Matters

- Prevents inflated staff counts
- Preserves all profile URLs for verification
- Enables future de-duplication with external sources

## Anonymous Member Handling

Anonymous "LinkedIn Member" entries:
- Are NOT deduplicated (each is unique)
- Receive unique IDs: `{slug}_staff_{index}_linkedin_member_{N}`
- Have `name_type: "anonymous"`
- Have headlines but NO profile URLs
- Are included in heritage-relevance detection
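
A sketch of how IDs in this shape could be generated is shown below. The function names are hypothetical, and several details are assumptions inferred from the examples in this document: the four-digit zero-padded index, the lowercase underscore-separated name token, and the anonymous counter `N` starting at 1.

```python
import re


def name_to_token(name):
    """Lowercase ASCII token; non-alphanumeric runs become underscores (assumption)."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def make_staff_ids(custodian_slug, names):
    """Build one staff ID per name, numbering anonymous members separately."""
    ids = []
    anonymous_seen = 0
    for index, name in enumerate(names):
        token = name_to_token(name)
        if name == "LinkedIn Member":
            anonymous_seen += 1  # assumption: N counts from 1
            token = f"{token}_{anonymous_seen}"
        ids.append(f"{custodian_slug}_staff_{index:04d}_{token}")
    return ids


print(make_staff_ids("rijksmuseum", ["John Smith", "LinkedIn Member", "LinkedIn Member"]))
```

Note that this simple token rule drops non-ASCII letters, so a real implementation would likely need transliteration for names with diacritics.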

## Validation

After extraction, verify:
- [ ] Total staff is within ~5% of expected associated members count
- [ ] `with_linkedin_url` + `anonymous_members` approximately equals `total_staff_extracted`
- [ ] Staff with `alternate_profiles` matches `duplicate_profiles_merged` count
- [ ] Heritage type detection is working (`staff_by_heritage_type` is populated)
- [ ] No duplicate staff IDs
- [ ] No UI element contamination (names like "Show more", "Load more", etc.)
- [ ] High data quality: >95% of staff entries have headlines
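
The checklist above could be automated along these lines. This is a hedged sketch against the JSON layout shown under "Output Structure", not an existing script: `validate_extraction` is a hypothetical name, the 1% tolerance for the "approximately equals" check is an assumption, and the UI-noise list contains only the two names mentioned in the checklist.

```python
UI_NOISE_NAMES = {"show more", "load more"}


def validate_extraction(data, url_tolerance=0.01):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    staff = data["staff"]
    analysis = data["staff_analysis"]
    expected = data["custodian_metadata"]["associated_members"]
    total = analysis["total_staff_extracted"]

    if expected and abs(total - expected) / expected > 0.05:
        problems.append(f"total {total} differs from expected {expected} by more than 5%")

    accounted = analysis["with_linkedin_url"] + analysis["anonymous_members"]
    if total and abs(accounted - total) / total > url_tolerance:
        problems.append(f"URL + anonymous count {accounted} does not match total {total}")

    with_alternates = sum(1 for s in staff if s.get("alternate_profiles"))
    if with_alternates != data["source_metadata"]["duplicate_profiles_merged"]:
        problems.append("alternate_profiles count does not match duplicate_profiles_merged")

    if not analysis.get("staff_by_heritage_type"):
        problems.append("staff_by_heritage_type is empty")

    ids = [s["staff_id"] for s in staff]
    if len(ids) != len(set(ids)):
        problems.append("duplicate staff IDs found")

    noisy = [s["name"] for s in staff if s["name"].strip().lower() in UI_NOISE_NAMES]
    if noisy:
        problems.append(f"UI element names found: {noisy}")

    with_headline = sum(1 for s in staff if s.get("headline"))
    if staff and with_headline / len(staff) <= 0.95:
        problems.append("95% or fewer staff entries have headlines")

    return problems
```

Loading a parsed file with `json.load` and passing the result to `validate_extraction` would then print nothing to fix when the list comes back empty.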

### Verification Script

```bash
# Run extraction and check the summary it prints
python scripts/parse_linkedin_html.py \
    "data/custodian/person/affiliated/manual/Example_ People _ LinkedIn.html" \
    --custodian-name "Example" \
    --custodian-slug "example"

# Expected output includes:
#   Total staff: 251
#   Expected (associated members): 250
#   Difference: +1 (more than expected)  # Small positive variance = normal
#   Duplicate profiles merged: 3
```

## Related Rules

- **Rule 15**: Connection Data Registration (`.opencode/CONNECTION_DATA_REGISTRATION_RULE.md`)
- **Rule 17**: LinkedIn Connection Unique Identifiers (`.opencode/LINKEDIN_CONNECTION_ID_RULE.md`)
- **Rule 18**: Custodian Staff Parsing (`.opencode/CUSTODIAN_STAFF_PARSING_RULE.md`)

## Supersedes

This rule supersedes the previous approach of using MD copy-paste as the primary source. MD files may still be used as a fallback for connections parsing when HTML is not available, but HTML is always preferred.