Person Data Quality Report
Generated: 2026-01-11
Directory: /Users/kempersc/apps/glam/data/person/
Summary
| Category | Count | Percentage |
|---|---|---|
| Total person files | 40,122 | 100% |
| Empty profiles (name + experience missing) | 35,723 | 89.0% |
| Experience only (no name) | 4,396 | 11.0% |
| Complete profiles (name + experience) | 3 | <0.1% |
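The three categories in the table can be reproduced with a small classifier. This is a minimal sketch, assuming each person file is a JSON object with a `profile_data` object holding `full_name` and `experience`. The function names and the `name_only` bucket are hypothetical: the table has no such category, so it is included only so that no record goes uncounted.

```python
from collections import Counter
from typing import Any


def classify_profile(record: dict[str, Any]) -> str:
    """Classify one person record by which profile fields are present."""
    profile = record.get("profile_data") or {}
    name = profile.get("full_name")
    has_name = isinstance(name, str) and bool(name.strip())
    has_experience = bool(profile.get("experience"))  # empty list counts as missing
    if has_name and has_experience:
        return "complete"
    if has_experience:
        return "experience_only"
    if has_name:
        return "name_only"  # hypothetical bucket, not listed in the report
    return "empty"


def summarize(records: list[dict[str, Any]]) -> dict[str, tuple[int, float]]:
    """Return {category: (count, percentage rounded to one decimal)}."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(classify_profile(r) for r in records)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}
```

Feeding `summarize` the parsed contents of every `ID_*.json` file should yield counts comparable to the table, provided the assumed field layout matches the real files.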
Archived Organization Files
593 files were identified as misclassified organizations and moved to _archived_orgs/.
These were LinkedIn company pages incorrectly stored as person files. Characteristics:
- Empty profile_data.experience array
- No profile_data.full_name
- Filename matched organization name in rationale
- Rationale: "Identified as staff at [ORGANIZATION]"
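The characteristics above can be combined into a single heuristic check. The sketch below is illustrative, not the script that produced the archive. It assumes the rationale is stored as a top-level string under a key named `rationale` and that the filename contains a slug of the organization name; both are assumptions, as are all function names.

```python
import re
import unicodedata
from typing import Any, Optional

# Pattern taken from the rationale text quoted in this report.
RATIONALE_RE = re.compile(r"Identified as staff at (.+)")


def slugify(text: str) -> str:
    """Lowercase ASCII slug; characters outside a-z and 0-9 become hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def org_from_rationale(rationale: str) -> Optional[str]:
    """Extract the organization name from the rationale, if present."""
    match = RATIONALE_RE.search(rationale)
    return match.group(1).strip() if match else None


def looks_like_org(file_stem: str, record: dict[str, Any],
                   rationale_key: str = "rationale") -> bool:
    """True when a record meets all listed criteria for a misfiled organization."""
    profile = record.get("profile_data") or {}
    if profile.get("experience") or profile.get("full_name"):
        return False  # any profile content suggests a real person
    org = org_from_rationale(str(record.get(rationale_key) or ""))
    if not org:
        return False
    org_slug = slugify(org)
    # Assumption: the filename embeds a slug of the organization name.
    return bool(org_slug) and org_slug in slugify(file_stem)
```

Because `slugify` drops non-ASCII characters, names written in non-Latin scripts reduce to their Latin part (or to nothing), so such files would need manual review.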
Incomplete Profile Analysis
The 35,723 empty profiles are legitimate person records for heritage institution staff, not misclassified organizations. Their data is incomplete because:
- LinkedIn privacy settings prevented full profile extraction
- Staff members were visible on company pages, but their individual profiles were not accessible
Top Institutions with Incomplete Staff Profiles
| Count | Institution |
|---|---|
| 581 | Reinwardt Academie |
| 525 | The National Archives, UK |
| 494 | Maastricht University Faculty of Arts and Social Sciences |
| 485 | The Metropolitan Museum of Art |
| 423 | وزارة الثقافة Ministry of Culture |
| 369 | University of Humanistic Studies |
| 356 | l'Institut national du patrimoine |
| 352 | École nationale des chartes |
| 343 | The Museum of Modern Art |
| 332 | Rijksmuseum |
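A ranking like the one above could be derived by tallying incomplete profiles per institution. This is a rough sketch under two assumptions: the institution name is recovered from a top-level `rationale` string of the form "Identified as staff at [ORGANIZATION]", and "incomplete" means the name or the experience list is missing. The key name and function name are hypothetical.

```python
import re
from collections import Counter
from typing import Any, Iterable

# Pattern taken from the rationale text quoted in this report.
STAFF_AT = re.compile(r"Identified as staff at (.+)")


def institution_counts(records: Iterable[dict[str, Any]],
                       rationale_key: str = "rationale") -> Counter:
    """Count incomplete profiles per institution named in the rationale."""
    counts: Counter = Counter()
    for record in records:
        profile = record.get("profile_data") or {}
        if profile.get("full_name") and profile.get("experience"):
            continue  # complete profiles are not part of the ranking
        match = STAFF_AT.search(str(record.get(rationale_key) or ""))
        if match:
            counts[match.group(1).strip()] += 1
    return counts
```

Calling `institution_counts(records).most_common(10)` returns the ten institutions with the most incomplete profiles, in descending order.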
Recommendations
- Keep incomplete profiles - They represent real staff members at heritage institutions
- Consider re-extraction - Some profiles may now be accessible
- Add metadata flag - Mark profiles as extraction_status: incomplete
- Prioritize by institution - Focus re-extraction on key institutions like Rijksmuseum, MoMA
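The metadata-flag recommendation could be applied with a short script. This is a minimal sketch that assumes the flag belongs at the top level of each JSON file; the report does not specify where it should live. Complete profiles are left untouched, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def mark_incomplete(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy flagged as incomplete when name or experience is missing."""
    profile = record.get("profile_data") or {}
    if profile.get("full_name") and profile.get("experience"):
        return record  # complete profile: no flag added
    updated = dict(record)
    updated["extraction_status"] = "incomplete"  # assumed top-level placement
    return updated


def flag_directory(directory: Path) -> int:
    """Flag every incomplete ID_*.json file in place; return files rewritten."""
    changed = 0
    # glob() is non-recursive, so _archived_orgs/ is not touched.
    for path in sorted(directory.glob("ID_*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        updated = mark_incomplete(record)
        if updated != record:
            text = json.dumps(updated, ensure_ascii=False, indent=2) + "\n"
            path.write_text(text, encoding="utf-8")
            changed += 1
    return changed
```

The script is idempotent: a second run rewrites nothing, because already-flagged records compare equal to their updated form. Running it on a copy of the directory first is advisable, since it rewrites files in place.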
File Structure After Cleanup
data/person/
├── ID_*.json # 40,122 person files (including incomplete)
├── _archived_orgs/ # 593 misclassified organization files
├── _manifest.json # Auto-generated pipeline manifest
└── _data_quality_report.md # This file