# Person Data Quality Report

**Generated:** 2026-01-11
**Directory:** `/Users/kempersc/apps/glam/data/person/`

## Summary

| Category | Count | Percentage |
|----------|-------|------------|
| Total person files | 40,122 | 100% |
| Empty profiles (name + experience missing) | 35,723 | 89.0% |
| Experience only (no name) | 4,396 | 11.0% |
| Complete profiles (name + experience) | 3 | 0.0% |

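
The categories in the summary table come down to a decision on two fields. The following is a minimal sketch, assuming each person file is a JSON object that holds `profile_data.full_name` and `profile_data.experience` (the field names used elsewhere in this report). The function name and the category labels are illustrative, and `name_only` is a fallback label that does not appear in the table.

```python
from typing import Any


def classify_profile(record: dict[str, Any]) -> str:
    """Assign a person record to one of the summary categories (illustrative labels)."""
    profile = record.get("profile_data") or {}
    # A name counts as present only if it is non-empty after trimming whitespace.
    has_name = bool(str(profile.get("full_name") or "").strip())
    # An empty or missing experience array counts as "no experience".
    has_experience = bool(profile.get("experience"))

    if has_name and has_experience:
        return "complete"
    if has_experience:
        return "experience_only"
    if has_name:
        return "name_only"  # assumption: fallback, not a category in the table
    return "empty"
```

Applying such a function to every `ID_*.json` file and tallying the labels should, under these assumptions, reproduce the counts in the table.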
## Archived Organization Files

**593 files** were identified as misclassified organizations and moved to `_archived_orgs/`.

These were LinkedIn company pages incorrectly stored as person files. They shared the following characteristics:
- Empty `profile_data.experience` array
- No `profile_data.full_name`
- Filename matched the organization name given in the rationale
- Rationale of the form "Identified as staff at [ORGANIZATION]"

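
The characteristics listed for the archived files can be combined into a single check. This is only a sketch under several assumptions: the rationale is taken to live in a top-level `rationale` field (hypothetical), filenames are taken to be `ID_` followed by a slug of the name (hypothetical), and `slugify` and `looks_like_organization` are illustrative helpers, not part of the actual pipeline.

```python
import re
from pathlib import Path
from typing import Any

# Rationale wording taken from this report; the field that holds it is an assumption.
RATIONALE_PATTERN = re.compile(r"Identified as staff at (?P<org>.+)")


def slugify(text: str) -> str:
    """Case-fold and collapse runs of non-word characters so names can be compared."""
    return re.sub(r"[\W_]+", "-", text.casefold()).strip("-")


def looks_like_organization(path: Path, record: dict[str, Any]) -> bool:
    """Return True when a person file matches all listed characteristics."""
    profile = record.get("profile_data") or {}
    if profile.get("experience") or profile.get("full_name"):
        return False  # has person-level data, so not a company page
    match = RATIONALE_PATTERN.search(str(record.get("rationale") or ""))
    if match is None:
        return False
    org_slug = slugify(match.group("org"))
    file_slug = slugify(path.stem.removeprefix("ID_"))  # assumed filename scheme
    return bool(org_slug) and org_slug == file_slug
```

The filename comparison is the deciding test here, because the empty person profiles also lack a name and experience.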
## Incomplete Profile Analysis

The 35,723 empty profiles are **legitimate person records** for heritage institution staff, not misclassified organizations.
They have incomplete data because:
1. LinkedIn privacy settings prevented full profile extraction
2. Staff members were visible on company pages, but their profiles were not accessible

### Top Institutions with Incomplete Staff Profiles

| Count | Institution |
|-------|-------------|
| 581 | Reinwardt Academie |
| 525 | The National Archives, UK |
| 494 | Maastricht University Faculty of Arts and Social Sciences |
| 485 | The Metropolitan Museum of Art |
| 423 | وزارة الثقافة Ministry of Culture |
| 369 | University of Humanistic Studies |
| 356 | l'Institut national du patrimoine |
| 352 | École nationale des chartes |
| 343 | The Museum of Modern Art |
| 332 | Rijksmuseum |

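
A per-institution tally like the one above could be produced along the following lines. This is a hedged sketch, not the method actually used for this report: it assumes the institution name can be read from a top-level `rationale` field (hypothetical) that uses the "Identified as staff at ..." wording, and the function name is illustrative.

```python
import re
from collections import Counter
from typing import Any, Iterable

STAFF_AT = re.compile(r"Identified as staff at (.+)")


def count_incomplete_by_institution(records: Iterable[dict[str, Any]]) -> Counter:
    """Count profiles missing a name or experience, grouped by institution."""
    counts: Counter = Counter()
    for record in records:
        profile = record.get("profile_data") or {}
        if profile.get("full_name") and profile.get("experience"):
            continue  # complete profile, not part of the tally
        match = STAFF_AT.search(str(record.get("rationale") or ""))  # assumed field
        if match:
            counts[match.group(1).strip()] += 1
    return counts
```

Calling `most_common(10)` on the returned counter would give the ten largest groups in descending order.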
## Recommendations

1. **Keep incomplete profiles** - They represent real staff members at heritage institutions
2. **Consider re-extraction** - Some profiles may now be accessible
3. **Add metadata flag** - Mark incomplete profiles with `extraction_status: incomplete`
4. **Prioritize by institution** - Focus re-extraction on key institutions such as the Rijksmuseum and MoMA

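
The metadata flag from the third recommendation could be applied as sketched below. The flag name and value come from the recommendation. Placing the flag at the top level of the record is an assumption, and `flag_if_incomplete` is an illustrative helper rather than an existing pipeline function.

```python
from typing import Any


def flag_if_incomplete(record: dict[str, Any]) -> dict[str, Any]:
    """Return a shallow copy of the record, flagged when name or experience is missing."""
    profile = record.get("profile_data") or {}
    complete = bool(profile.get("full_name")) and bool(profile.get("experience"))
    flagged = dict(record)  # shallow copy so the caller's top-level dict is not modified
    if not complete:
        # Assumption: the flag is stored as a top-level key.
        flagged["extraction_status"] = "incomplete"
    return flagged
```

Returning a copy keeps the original record untouched, so the flagged version can be reviewed before anything is written back to disk.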
## File Structure After Cleanup

```
data/person/
├── ID_*.json                  # 40,122 person files (including incomplete)
├── _archived_orgs/            # 593 misclassified organization files
├── _manifest.json             # Auto-generated pipeline manifest
└── _data_quality_report.md    # This file
```