# Person Data Quality Report
**Generated:** 2026-01-11
**Directory:** `/Users/kempersc/apps/glam/data/person/`
## Summary
| Category | Count | Percentage |
|----------|-------|------------|
| Total person files | 40,122 | 100% |
| Empty profiles (name + experience missing) | 35,723 | 89.0% |
| Experience only (no name) | 4,396 | 11.0% |
| Complete profiles (name + experience) | 3 | < 0.1% |
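
The categories in this table can be reproduced with a small classifier. The following is a minimal sketch, assuming each person file parses to a dict whose `profile_data` object holds `full_name` and `experience` (the two fields named in this report). The function names `classify_profile` and `summarize` and the extra `name_only` bucket are hypothetical and are not taken from the pipeline that produced these counts.

```python
from collections import Counter


def classify_profile(record: dict) -> str:
    """Bucket one person record by which of the two key fields are present."""
    profile = record.get("profile_data") or {}
    has_name = bool(str(profile.get("full_name") or "").strip())
    has_experience = bool(profile.get("experience"))
    if has_name and has_experience:
        return "complete"
    if has_experience:
        return "experience_only"
    if has_name:
        return "name_only"  # assumption: not a category reported in the table
    return "empty"


def summarize(records) -> dict:
    """Return {category: (count, percentage of all records)}."""
    counts = Counter(classify_profile(r) for r in records)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}
```

Rounding to one decimal place is why a category holding only a handful of files out of tens of thousands shows up as a near-zero percentage.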
## Archived Organization Files
**593 files** were identified as misclassified organizations and moved to `_archived_orgs/`.
These were LinkedIn company pages incorrectly stored as person files. Characteristics:
- Empty `profile_data.experience` array
- No `profile_data.full_name`
- Filename matched the organization name given in the rationale
- Rationale of the form "Identified as staff at [ORGANIZATION]"
## Incomplete Profile Analysis
The 35,723 empty profiles are **legitimate person records** for heritage institution staff, not misclassified organizations.
They have incomplete data because:
1. LinkedIn privacy settings prevented full profile extraction
2. Staff members were visible on company pages, but their individual profiles were not accessible
### Top Institutions with Incomplete Staff Profiles
| Count | Institution |
|-------|-------------|
| 581 | Reinwardt Academie |
| 525 | The National Archives, UK |
| 494 | Maastricht University Faculty of Arts and Social Sciences |
| 485 | The Metropolitan Museum of Art |
| 423 | وزارة الثقافة Ministry of Culture |
| 369 | University of Humanistic Studies |
| 356 | l'Institut national du patrimoine |
| 352 | École nationale des chartes |
| 343 | The Museum of Modern Art |
| 332 | Rijksmuseum |
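
A ranking like the one above could be derived by tallying incomplete records per institution. The sketch below assumes the institution can be read from a top-level `rationale` string of the form "Identified as staff at [ORGANIZATION]" (the key name is hypothetical), and it treats a profile as incomplete when either `full_name` or `experience` is missing, which is also an assumption about how this table was built.

```python
import re
from collections import Counter

STAFF_AT = re.compile(r"Identified as staff at (.+)")


def top_institutions(records, n: int = 10, rationale_key: str = "rationale"):
    """Return the n institutions with the most incomplete staff profiles."""
    counts = Counter()
    for record in records:
        profile = record.get("profile_data") or {}
        if profile.get("full_name") and profile.get("experience"):
            continue  # complete profile, not part of this tally
        match = STAFF_AT.search(str(record.get(rationale_key) or ""))
        if match:
            counts[match.group(1).strip()] += 1
    return counts.most_common(n)
```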
## Recommendations
1. **Keep incomplete profiles** - They represent real staff members at heritage institutions
2. **Consider re-extraction** - Some profiles may have become accessible since the original extraction
3. **Add metadata flag** - Mark profiles as `extraction_status: incomplete`
4. **Prioritize by institution** - Focus re-extraction on key institutions such as the Rijksmuseum and The Museum of Modern Art (MoMA)
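
Recommendation 3 could be applied with a short script. This is a minimal sketch, assuming the flag is stored as a top-level `extraction_status` key in each `ID_*.json` file (the report does not say where the flag should live) and that "incomplete" means a missing `full_name` or an empty `experience` list. The function names are hypothetical.

```python
import json
from pathlib import Path


def flag_if_incomplete(record: dict) -> bool:
    """Set extraction_status to 'incomplete' if needed; return True if changed."""
    profile = record.get("profile_data") or {}
    if profile.get("full_name") and profile.get("experience"):
        return False  # complete profile, leave untouched
    if record.get("extraction_status") == "incomplete":
        return False  # already flagged
    record["extraction_status"] = "incomplete"
    return True


def flag_directory(person_dir) -> int:
    """Flag every incomplete ID_*.json file in person_dir; return files rewritten."""
    changed = 0
    for path in sorted(Path(person_dir).glob("ID_*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(record, dict) and flag_if_incomplete(record):
            # ensure_ascii=False keeps non-Latin institution names readable.
            path.write_text(
                json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
            )
            changed += 1
    return changed
```

The function is idempotent, so running it again after a partial re-extraction only rewrites files whose status actually changes. It does not clear the flag on profiles that have since become complete; that would need a separate step.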
## File Structure After Cleanup
```
data/person/
├── ID_*.json # 40,122 person files (including incomplete)
├── _archived_orgs/ # 593 misclassified organization files
├── _manifest.json # Auto-generated pipeline manifest
└── _data_quality_report.md # This file
```