glam/.opencode/CUSTODIAN_STAFF_PARSING_RULE.md
2025-12-10 13:01:13 +01:00


# Rule 18: Custodian Staff Parsing from LinkedIn Company Pages
**When manually registering heritage custodian staff from LinkedIn company "People" pages, use the `parse_custodian_staff.py` script to convert raw text files into structured JSON.**
## Overview
LinkedIn company pages have a "People" section that lists staff members with their names, headlines (job titles), connection degree, and mutual connections. This data helps map the heritage sector workforce and supports network analysis.
## File Locations
| Type | Location |
|------|----------|
| **Raw input files** | `data/custodian/person/manual_hc/{slug}-{timestamp}.md` |
| **Parsed output files** | `data/custodian/person/{slug}_staff_{timestamp}.json` |
| **Parser script** | `scripts/parse_custodian_staff.py` |
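The path patterns above can be assembled programmatically. The following is a minimal sketch, not part of `parse_custodian_staff.py`: the helper name `staff_file_paths` and the `file_slug` parameter are hypothetical, and the `%Y%m%dT%H%M` timestamp format is inferred from the example file names in this document (e.g. `20251210T0055`). Note that the example file names use underscores (`collectie_overijssel`), while `--custodian-slug` uses hyphens (`collectie-overijssel`).

```python
from datetime import datetime
from pathlib import Path

# Base directory taken from the File Locations table.
PERSON_DIR = Path("data/custodian/person")


def staff_file_paths(file_slug: str, when: datetime) -> tuple[Path, Path]:
    """Return (raw_input_path, parsed_output_path) for one registration.

    Hypothetical helper: the timestamp format is an assumption inferred
    from the example file names, not confirmed by the parser script.
    """
    stamp = when.strftime("%Y%m%dT%H%M")
    raw = PERSON_DIR / "manual_hc" / f"{file_slug}-{stamp}.md"
    parsed = PERSON_DIR / f"{file_slug}_staff_{stamp}.json"
    return raw, parsed


raw_path, parsed_path = staff_file_paths(
    "collectie_overijssel", datetime(2025, 12, 10, 0, 55)
)
print(raw_path.as_posix())
print(parsed_path.as_posix())
```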
## Input File Format
Raw files are created by copy-pasting from LinkedIn company "People" pages:
```
Collectie Overijssel logo
Collectie Overijssel
Museums, Historical Sites, and Zoos
Zwolle, Overijssel
2K followers
51-200 employees
58 associated members
Annelien Vos-Keen
Annelien Vos-Keen
2nd degree connection · 2nd
Data Analist / KPI- en procesexpert
Thomas van Maaren, Bob Coret, and 4 other mutual connections
Martine de Boer
Martine de Boer
2nd degree connection · 2nd
Collectiespecialist bij Collectie Overijssel
...
```
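To illustrate how a pasted block like this can be turned into records, here is a rough sketch that keys on the "degree connection" line: the line before it is taken as the name, the line after it as the headline, and the following line as mutual connections if it mentions them. This is an assumption about one workable approach, not a description of how `parse_custodian_staff.py` is implemented. The function name `split_staff_blocks` and the regular expression are hypothetical, and entries without a degree line or without a headline are not handled.

```python
import re

# Assumed shape of the degree line, based on the sample above
# ("2nd degree connection · 2nd"). Other variants are a guess.
DEGREE_RE = re.compile(r"^(\d(?:st|nd|rd)\+?) degree connection")


def split_staff_blocks(lines):
    """Group pasted lines into staff records, anchored on the degree line."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    records = []
    for i, line in enumerate(lines):
        match = DEGREE_RE.match(line)
        if not match or i == 0:
            continue
        # The name is repeated in the paste; the copy just above the
        # degree line is used here.
        record = {"name": lines[i - 1], "degree": match.group(1)}
        if i + 1 < len(lines):
            record["headline"] = lines[i + 1]
        if i + 2 < len(lines) and "mutual connection" in lines[i + 2]:
            record["mutual_connections"] = lines[i + 2]
        records.append(record)
    return records


sample = """Annelien Vos-Keen
Annelien Vos-Keen
2nd degree connection · 2nd
Data Analist / KPI- en procesexpert
Thomas van Maaren, Bob Coret, and 4 other mutual connections
Martine de Boer
Martine de Boer
2nd degree connection · 2nd
Collectiespecialist bij Collectie Overijssel""".splitlines()

for rec in split_staff_blocks(sample):
    print(rec["name"], "|", rec["degree"], "|", rec["headline"])
```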
## Usage
```bash
python scripts/parse_custodian_staff.py <input_file> <output_file> \
    --custodian-name "Custodian Name" \
    --custodian-slug "custodian-slug"
```
**Example**:
```bash
python scripts/parse_custodian_staff.py \
    data/custodian/person/manual_hc/collectie_overijssel-20251210T0055.md \
    data/custodian/person/collectie_overijssel_staff_20251210T0055.json \
    --custodian-name "Collectie Overijssel" \
    --custodian-slug "collectie-overijssel"
```
**Dry-run mode** (parse without writing):
```bash
python scripts/parse_custodian_staff.py input.md output.json \
--custodian-name "Name" --custodian-slug "slug" --dry-run
```
## Output Structure
```json
{
  "custodian_metadata": {
    "custodian_name": "Collectie Overijssel",
    "custodian_slug": "collectie-overijssel",
    "name": "Collectie Overijssel",
    "industry": "Museums, Historical Sites, and Zoos",
    "location": { "city": "Zwolle", "region": "Overijssel" },
    "follower_count": "2K",
    "employee_count": "51-200",
    "associated_members": 58
  },
  "source_metadata": {
    "source_type": "linkedin_company_people_page",
    "registered_timestamp": "2025-12-10T00:55:00Z",
    "registration_method": "manual_linkedin_browse",
    "staff_extracted": 39
  },
  "staff": [
    {
      "staff_id": "collectie-overijssel_staff_0000_annelien_vos_keen",
      "name": "Annelien Vos-Keen",
      "name_type": "full",
      "degree": "2nd",
      "headline": "Data Analist / KPI- en procesexpert",
      "mutual_connections": "Thomas van Maaren, Bob Coret, and 4 other mutual connections",
      "heritage_relevant": true,
      "heritage_type": "D"
    }
  ],
  "staff_analysis": {
    "total_staff_extracted": 39,
    "heritage_relevant_count": 25,
    "heritage_relevant_percentage": 64.1,
    "staff_by_heritage_type": { "A": 4, "D": 1, "E": 1, "M": 18, "S": 1 },
    "staff_by_degree": { "1st": 2, "2nd": 37 },
    "staff_by_name_type": { "abbreviated": 1, "full": 38 },
    "common_roles": { "Medewerker": 7, "Coördinator": 5, "Beheerder": 4 }
  },
  "provenance": {
    "data_source": "LINKEDIN_MANUAL_REGISTER",
    "data_tier": "TIER_3_CROWD_SOURCED"
  }
}
```
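The `staff_analysis` block is a summary of the `staff` list. As a consistency check, 25 heritage-relevant staff out of 39 gives 64.1%, and the per-type counts (4 + 1 + 1 + 18 + 1) also add up to 25. The sketch below shows one way such a summary could be derived. The function name `summarize_staff` is hypothetical, the rounding to one decimal is inferred from the example, and `common_roles` is left out because this document does not say how roles are extracted.

```python
from collections import Counter


def summarize_staff(staff):
    """Derive summary counts from a list of staff records (a sketch)."""
    total = len(staff)
    relevant = [s for s in staff if s.get("heritage_relevant")]
    # One-decimal rounding is an assumption based on the 64.1 example.
    pct = round(100 * len(relevant) / total, 1) if total else 0.0
    by_type = Counter(
        s["heritage_type"] for s in relevant if s.get("heritage_type")
    )
    return {
        "total_staff_extracted": total,
        "heritage_relevant_count": len(relevant),
        "heritage_relevant_percentage": pct,
        "staff_by_heritage_type": dict(sorted(by_type.items())),
        "staff_by_degree": dict(Counter(s["degree"] for s in staff)),
        "staff_by_name_type": dict(Counter(s["name_type"] for s in staff)),
    }


# Synthetic records for illustration only: 25 relevant out of 39.
relevant_person = {"degree": "2nd", "name_type": "full",
                   "heritage_relevant": True, "heritage_type": "M"}
other_person = {"degree": "2nd", "name_type": "full",
                "heritage_relevant": False, "heritage_type": None}
summary = summarize_staff([relevant_person] * 25 + [other_person] * 14)
print(summary["heritage_relevant_percentage"])
```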
## Staff ID Format
```
{custodian_slug}_staff_{index:04d}_{name_slug}
```
**Examples**:
- `collectie-overijssel_staff_0000_annelien_vos_keen`
- `nationaal-archief_staff_0042_afelonne_doek`
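The format can be reproduced with a small helper. This is a minimal sketch that matches the two examples above. The function name `make_staff_id` is hypothetical, and the name-slug rule (lowercase, runs of non-alphanumeric ASCII characters collapsed to a single underscore) is an assumption inferred from those examples. In particular, accented characters would also become underscores here, which may differ from what the actual script does.

```python
import re


def make_staff_id(custodian_slug: str, index: int, name: str) -> str:
    """Build a staff ID as {custodian_slug}_staff_{index:04d}_{name_slug}."""
    # Assumed slug rule: lowercase, then collapse anything that is not
    # a-z or 0-9 into one underscore and trim underscores at the ends.
    name_slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{custodian_slug}_staff_{index:04d}_{name_slug}"


print(make_staff_id("collectie-overijssel", 0, "Annelien Vos-Keen"))
print(make_staff_id("nationaal-archief", 42, "Afelonne Doek"))
```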
## Heritage Type Detection
Staff headlines are checked for heritage relevance by matching them against keywords for each GLAMORCUBESFIXPHDNT type code:
| Code | Type | Keywords |
|------|------|----------|
| A | Archive | archief, archivist, archivaris, nationaal archief |
| M | Museum | museum, curator, conservator, collectie |
| L | Library | library, bibliotheek, librarian |
| D | Digital | digital, data, developer, digitalisering |
| E | Education | university, professor, docent, educatie |
| R | Research | research, onderzoek, historicus |
| ... | ... | ... |
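A keyword lookup of this kind can be sketched as a case-insensitive substring match. The version below covers only the six rows listed in the table and returns the first matching code in table order. Both the matching strategy and the priority order are assumptions, and the names `HERITAGE_KEYWORDS` and `detect_heritage_type` are hypothetical. The actual script may weigh or order keywords differently.

```python
# Keywords copied from the table above; the remaining type codes are
# elided in this document and therefore not included.
HERITAGE_KEYWORDS = {
    "A": ["archief", "archivist", "archivaris", "nationaal archief"],
    "M": ["museum", "curator", "conservator", "collectie"],
    "L": ["library", "bibliotheek", "librarian"],
    "D": ["digital", "data", "developer", "digitalisering"],
    "E": ["university", "professor", "docent", "educatie"],
    "R": ["research", "onderzoek", "historicus"],
}


def detect_heritage_type(headline):
    """Return the first matching type code, or None if nothing matches."""
    text = headline.lower()
    # Dict order (table order) decides ties; this priority is an assumption.
    for code, keywords in HERITAGE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return code
    return None


print(detect_heritage_type("Data Analist / KPI- en procesexpert"))
print(detect_heritage_type("Collectiespecialist bij Collectie Overijssel"))
```

Plain substring matching is deliberately loose: it lets "collectie" match "Collectiespecialist", but it will also match unrelated words that merely contain a keyword.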
## Name Types
| Type | Description | Example |
|------|-------------|---------|
| `full` | Complete first + last name | "Vincent Robijn" |
| `abbreviated` | Contains single-letter initial | "Fairoesh N.", "A.J. Gevers" |
| `anonymous` | Privacy-hidden profile | "LinkedIn Member" |
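The three categories can be told apart with a simple rule set, sketched below. The function name `classify_name_type` and the exact pattern are hypothetical: a name counts as `abbreviated` when any whitespace-separated token is a single letter or a run of dotted initials, which covers the examples in the table but may not match every case the real parser handles.

```python
import re

# A token such as "N", "N." or "A.J." (letters of any alphabet).
INITIALS_RE = re.compile(r"(?:[^\W\d_]\.)+|[^\W\d_]")


def classify_name_type(name: str) -> str:
    """Classify a displayed name as 'anonymous', 'abbreviated' or 'full'."""
    cleaned = name.strip()
    if cleaned.lower() == "linkedin member":
        return "anonymous"
    if any(INITIALS_RE.fullmatch(token) for token in cleaned.split()):
        return "abbreviated"
    return "full"


for example in ["Vincent Robijn", "Fairoesh N.", "A.J. Gevers", "LinkedIn Member"]:
    print(example, "->", classify_name_type(example))
```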
## Existing Files
| Custodian | Staff Count | Output File |
|-----------|-------------|-------------|
| Collectie Overijssel | 39 | `collectie_overijssel_staff_20251210T0055.json` |
| Nationaal Archief | 373 | `nationaal_archief_staff_20251209T2354.json` |
## Integration with Other Scripts
This script complements `parse_linkedin_connections.py` (Rule 15):
| Script | Purpose | Input |
|--------|---------|-------|
| `parse_linkedin_connections.py` | Parse PERSON's connections | Individual profile connections |
| `parse_custodian_staff.py` | Parse ORGANIZATION's staff | Company "People" page |
## See Also
- Rule 15: Connection Data Registration
- Rule 14: Exa MCP LinkedIn Profile Extraction
- `AGENTS.md` - Complete agent instructions