# Rule 18: Custodian Staff Parsing from LinkedIn Company Pages

**When manually registering heritage custodian staff from LinkedIn company "People" pages, use the `parse_custodian_staff.py` script to convert raw text files into structured JSON.**

## Overview

LinkedIn company pages have a "People" section that lists staff members with their names, headlines (job titles), connection degree, and mutual connections. This data helps in understanding the heritage sector workforce and supports network analysis.

## File Locations

| Type | Location |
|------|----------|
| **Raw input files** | `data/custodian/person/manual_hc/{slug}-{timestamp}.md` |
| **Parsed output files** | `data/custodian/person/{slug}_staff_{timestamp}.json` |
| **Parser script** | `scripts/parse_custodian_staff.py` |

## Input File Format

Raw files are created by copy-pasting from LinkedIn company "People" pages:

```
Collectie Overijssel logo
Collectie Overijssel

Museums, Historical Sites, and Zoos
Zwolle, Overijssel
2K followers
51-200 employees

58 associated members

Annelien Vos-Keen
Annelien Vos-Keen
2nd degree connection · 2nd
Data Analist / KPI- en procesexpert
Thomas van Maaren, Bob Coret, and 4 other mutual connections

Martine de Boer
Martine de Boer
2nd degree connection · 2nd
Collectiespecialist bij Collectie Overijssel
...
```

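The sample above suggests a layout a parser can exploit: entries are separated by blank lines, the name appears twice, and the next line states the connection degree. The following is a minimal sketch under those assumptions, not the actual logic of `parse_custodian_staff.py`; the function name `parse_staff_blocks` and the regular expression are hypothetical.

```python
import re

# Assumption: the degree line always starts with "<degree> degree connection".
DEGREE_RE = re.compile(r"^(\S+) degree connection")

SAMPLE = """Collectie Overijssel logo
Collectie Overijssel

58 associated members

Annelien Vos-Keen
Annelien Vos-Keen
2nd degree connection · 2nd
Data Analist / KPI- en procesexpert
Thomas van Maaren, Bob Coret, and 4 other mutual connections

Martine de Boer
Martine de Boer
2nd degree connection · 2nd
Collectiespecialist bij Collectie Overijssel
"""


def parse_staff_blocks(raw_text):
    """Split pasted text on blank lines and keep blocks that contain a degree line."""
    staff = []
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        degree_idx = next(
            (i for i, ln in enumerate(lines) if DEGREE_RE.match(ln)), None
        )
        if degree_idx is None or degree_idx == 0:
            continue  # header block (company name, industry, follower count, ...)
        rest = lines[degree_idx + 1:]
        entry = {
            "name": lines[degree_idx - 1],  # line directly above the degree line
            "degree": DEGREE_RE.match(lines[degree_idx]).group(1),
            "headline": rest[0] if rest else "",
            "mutual_connections": "",
        }
        for ln in rest[1:]:
            if "mutual connection" in ln:
                entry["mutual_connections"] = ln
        staff.append(entry)
    return staff


staff = parse_staff_blocks(SAMPLE)
for person in staff:
    print(person["name"], "|", person["degree"], "|", person["headline"])
```

Keying on the degree line rather than on a fixed line count keeps the sketch tolerant of entries that lack a mutual-connections line, as in the second entry of the sample.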
## Usage

```bash
python scripts/parse_custodian_staff.py <input_file> <output_file> \
    --custodian-name "Custodian Name" \
    --custodian-slug "custodian-slug"
```

**Example**:

```bash
python scripts/parse_custodian_staff.py \
    data/custodian/person/manual_hc/collectie_overijssel-20251210T0055.md \
    data/custodian/person/collectie_overijssel_staff_20251210T0055.json \
    --custodian-name "Collectie Overijssel" \
    --custodian-slug "collectie-overijssel"
```

**Dry-run mode** (parse without writing):

```bash
python scripts/parse_custodian_staff.py input.md output.json \
    --custodian-name "Name" --custodian-slug "slug" --dry-run
```

## Output Structure

```json
{
  "custodian_metadata": {
    "custodian_name": "Collectie Overijssel",
    "custodian_slug": "collectie-overijssel",
    "name": "Collectie Overijssel",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": { "city": "Zwolle", "region": "Overijssel" },
    "follower_count": "2K",
    "employee_count": "51-200",
    "associated_members": 58
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page",
    "registered_timestamp": "2025-12-10T00:55:00Z",
    "registration_method": "manual_linkedin_browse",
    "staff_extracted": 39
  },
  "staff": [
    {
      "staff_id": "collectie-overijssel_staff_0000_annelien_vos_keen",
      "name": "Annelien Vos-Keen",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Data Analist / KPI- en procesexpert",
      "mutual_connections": "Thomas van Maaren, Bob Coret, and 4 other mutual connections",
      "heritage_relevant": true,
      "heritage_type": "D"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 39,
    "heritage_relevant_count": 25,
    "heritage_relevant_percentage": 64.1,
    "staff_by_heritage_type": { "A": 4, "D": 1, "E": 1, "M": 18, "S": 1 },
    "staff_by_degree": { "1st": 2, "2nd": 37 },
    "staff_by_name_type": { "abbreviated": 1, "full": 38 },
    "common_roles": { "Medewerker": 7, "Coördinator": 5, "Beheerder": 4 }
  },
  "provenance": {
    "data_source": "LINKEDIN_MANUAL_REGISTER",
    "data_tier": "TIER_3_CROWD_SOURCED"
  }
}
```

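The counts in `staff_analysis` can be cross-checked against the `staff` array. The sketch below shows one plausible derivation; the function name `summarize_staff` is hypothetical, and rounding the percentage to one decimal is an assumption inferred from the example (25 of 39 gives 64.1). The `common_roles` field is left out because the document does not say how roles are extracted from headlines.

```python
from collections import Counter


def summarize_staff(staff):
    """Derive staff_analysis-style counts from a list of staff records."""
    total = len(staff)
    relevant = [s for s in staff if s.get("heritage_relevant")]
    # Assumption: percentage is rounded to one decimal place.
    percentage = round(100 * len(relevant) / total, 1) if total else 0.0
    by_type = Counter(s["heritage_type"] for s in relevant if s.get("heritage_type"))
    by_degree = Counter(s["degree"] for s in staff)
    by_name_type = Counter(s["name_type"] for s in staff)
    return {
        "total_staff_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage": percentage,
        "staff_by_heritage_type": dict(sorted(by_type.items())),
        "staff_by_degree": dict(sorted(by_degree.items())),
        "staff_by_name_type": dict(sorted(by_name_type.items())),
    }


# Made-up records for illustration only; they are not taken from a real output file.
example_staff = [
    {"heritage_relevant": True, "heritage_type": "D", "degree": "2nd", "name_type": "full"},
    {"heritage_relevant": True, "heritage_type": "M", "degree": "2nd", "name_type": "full"},
    {"heritage_relevant": False, "heritage_type": None, "degree": "1st", "name_type": "abbreviated"},
]
summary = summarize_staff(example_staff)
print(summary)
```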
## Staff ID Format

```
{custodian_slug}_staff_{index:04d}_{name_slug}
```

**Examples**:
- `collectie-overijssel_staff_0000_annelien_vos_keen`
- `nationaal-archief_staff_0042_afelonne_doek`

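A staff ID in this format can be assembled with a format string. The sketch below is an illustration rather than the script's own code: `name_to_slug` and `make_staff_id` are hypothetical helpers, and the slug rules (lowercase, accents stripped, every non-alphanumeric run replaced by `_`) are assumptions inferred from the two examples.

```python
import re
import unicodedata


def name_to_slug(name):
    """Lowercase the name and collapse non-alphanumeric runs into underscores."""
    # Assumption: accented characters are reduced to their ASCII base letters.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def make_staff_id(custodian_slug, index, name):
    """Build '{custodian_slug}_staff_{index:04d}_{name_slug}'."""
    return f"{custodian_slug}_staff_{index:04d}_{name_to_slug(name)}"


print(make_staff_id("collectie-overijssel", 0, "Annelien Vos-Keen"))
print(make_staff_id("nationaal-archief", 42, "Afelonne Doek"))
```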
## Heritage Type Detection

Staff headlines are analyzed for heritage relevance using GLAMORCUBESFIXPHDNT keywords:

| Code | Type | Keywords |
|------|------|----------|
| A | Archive | archief, archivist, archivaris, nationaal archief |
| M | Museum | museum, curator, conservator, collectie |
| L | Library | library, bibliotheek, librarian |
| D | Digital | digital, data, developer, digitalisering |
| E | Education | university, professor, docent, educatie |
| R | Research | research, onderzoek, historicus |
| ... | ... | ... |

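Keyword detection of this kind can be sketched as a case-insensitive substring lookup. The example below covers only the six types listed in the table; the function name `detect_heritage_type`, the first-match-wins ordering, and plain substring matching are assumptions, so the real script may resolve overlapping keywords differently.

```python
# Keywords copied from the table above; the remaining types are omitted.
HERITAGE_KEYWORDS = {
    "A": ["archief", "archivist", "archivaris", "nationaal archief"],
    "M": ["museum", "curator", "conservator", "collectie"],
    "L": ["library", "bibliotheek", "librarian"],
    "D": ["digital", "data", "developer", "digitalisering"],
    "E": ["university", "professor", "docent", "educatie"],
    "R": ["research", "onderzoek", "historicus"],
}


def detect_heritage_type(headline):
    """Return the first heritage code whose keyword occurs in the headline, else None."""
    text = headline.lower()
    # Assumption: types are checked in dictionary order and the first hit wins.
    for code, keywords in HERITAGE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return code
    return None


print(detect_heritage_type("Data Analist / KPI- en procesexpert"))
print(detect_heritage_type("Collectiespecialist bij Collectie Overijssel"))
print(detect_heritage_type("Receptionist"))
```

Substring matching is simple but can produce false positives when a keyword happens to occur inside an unrelated word, so matching on word boundaries may be worth considering.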
## Name Types

| Type | Description | Example |
|------|-------------|---------|
| `full` | Complete first + last name | "Vincent Robijn" |
| `abbreviated` | Contains single-letter initial | "Fairoesh N.", "A.J. Gevers" |
| `anonymous` | Privacy-hidden profile | "LinkedIn Member" |

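The three name types can be told apart with a small classifier. This is a sketch under stated assumptions: `classify_name_type` is a hypothetical helper, "LinkedIn Member" is treated as the only anonymous placeholder, and a name counts as abbreviated when any part of it, split on whitespace and periods, is a single letter.

```python
import re


def classify_name_type(name):
    """Classify a displayed name as 'anonymous', 'abbreviated', or 'full'."""
    if name.strip().lower() == "linkedin member":
        return "anonymous"
    # Split on whitespace and periods so "A.J." yields the tokens "A" and "J".
    tokens = [token for token in re.split(r"[\s.]+", name) if token]
    if any(len(token) == 1 and token.isalpha() for token in tokens):
        return "abbreviated"
    return "full"


for example in ["Vincent Robijn", "Fairoesh N.", "A.J. Gevers", "LinkedIn Member"]:
    print(example, "->", classify_name_type(example))
```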
## Existing Files

| Custodian | Staff Count | Output File |
|-----------|-------------|-------------|
| Collectie Overijssel | 39 | `collectie_overijssel_staff_20251210T0055.json` |
| Nationaal Archief | 373 | `nationaal_archief_staff_20251209T2354.json` |

## Integration with Other Scripts

This script complements `parse_linkedin_connections.py` (Rule 15):

| Script | Purpose | Input |
|--------|---------|-------|
| `parse_linkedin_connections.py` | Parse PERSON's connections | Individual profile connections |
| `parse_custodian_staff.py` | Parse ORGANIZATION's staff | Company "People" page |

## See Also

- Rule 15: Connection Data Registration
- Rule 14: Exa MCP LinkedIn Profile Extraction
- `AGENTS.md` - Complete agent instructions