# Person Data Quality Report

**Generated:** 2026-01-11

**Directory:** `/Users/kempersc/apps/glam/data/person/`

## Summary

| Category | Count | Percentage |
|----------|-------|------------|
| Total person files | 40,122 | 100% |
| Empty profiles (name + experience missing) | 35,723 | 89.0% |
| Experience only (no name) | 4,396 | 11.0% |
| Complete profiles (name + experience) | 3 | 0.0% |

## Archived Organization Files

**593 files** were identified as misclassified organizations and moved to `_archived_orgs/`. These were LinkedIn company pages incorrectly stored as person files.

Characteristics:
- Empty `profile_data.experience` array
- No `profile_data.full_name`
- Filename matched organization name in rationale
- Rationale: "Identified as staff at [ORGANIZATION]"

## Incomplete Profile Analysis

The 35,723 empty profiles are **legitimate person records** for heritage institution staff. They have incomplete data because:

1. LinkedIn privacy settings prevented full profile extraction
2. Staff members visible on company pages but profiles not accessible
3. These are NOT misclassified organizations

### Top Institutions with Incomplete Staff Profiles

| Count | Institution |
|-------|-------------|
| 581 | Reinwardt Academie |
| 525 | The National Archives, UK |
| 494 | Maastricht University Faculty of Arts and Social Sciences |
| 485 | The Metropolitan Museum of Art |
| 423 | وزارة الثقافة Ministry of Culture |
| 369 | University of Humanistic Studies |
| 356 | l'Institut national du patrimoine |
| 352 | École nationale des chartes |
| 343 | The Museum of Modern Art |
| 332 | Rijksmuseum |

## Recommendations

1. **Keep incomplete profiles** - They represent real staff members at heritage institutions
2. **Consider re-extraction** - Some profiles may now be accessible
3. **Add metadata flag** - Mark profiles as `extraction_status: incomplete`
4. **Prioritize by institution** - Focus re-extraction on key institutions like Rijksmuseum, MoMA

## File Structure After Cleanup

```
data/person/
├── ID_*.json                  # 40,122 person files (including incomplete)
├── _archived_orgs/            # 593 misclassified organization files
├── _manifest.json             # Auto-generated pipeline manifest
└── _data_quality_report.md    # This file
```
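
## Example: Classifying and Flagging Profiles

The Summary categories and the `extraction_status` recommendation can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code. It assumes each `ID_*.json` file is a JSON object with an optional `profile_data` key holding `full_name` and `experience`, as named in the report. The function names, the `name_only` fallback category, and the placement of `extraction_status` at the top level of the record are all hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def classify_profile(record: dict) -> str:
    """Map one parsed person record to a category from the Summary table."""
    profile = record.get("profile_data") or {}
    has_name = bool(profile.get("full_name"))
    has_experience = bool(profile.get("experience"))
    if has_name and has_experience:
        return "complete"
    if has_experience:
        return "experience_only"
    if has_name:
        # Not a category in the report; kept so no record is mislabeled.
        return "name_only"
    return "empty"


def flag_incomplete(record: dict) -> dict:
    """Return a shallow copy, marked incomplete unless the profile is complete.

    Assumption: the flag lives at the top level of the record.
    """
    flagged = dict(record)
    if classify_profile(record) != "complete":
        flagged["extraction_status"] = "incomplete"
    return flagged


def summarize(person_dir: Path) -> Counter:
    """Count categories over ID_*.json files directly inside person_dir."""
    counts = Counter()
    for path in sorted(person_dir.glob("ID_*.json")):
        with path.open(encoding="utf-8") as handle:
            counts[classify_profile(json.load(handle))] += 1
    return counts
```

Calling `summarize(Path("data/person"))` would return a `Counter` keyed by category. Because `glob("ID_*.json")` is non-recursive and matches only the `ID_` prefix, files under `_archived_orgs/` and `_manifest.json` are left out of the counts. `flag_incomplete` returns a copy rather than editing the record in place, so writing flags back to disk stays a separate, explicit step.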