Person Data Quality Report

Generated: 2026-01-11

Directory: /Users/kempersc/apps/glam/data/person/

Summary

| Category | Count | Percentage |
|---|---|---|
| Total person files | 40,122 | 100% |
| Empty profiles (name + experience missing) | 35,723 | 89.0% |
| Experience only (no name) | 4,396 | 11.0% |
| Complete profiles (name + experience) | 3 | 0.0% |
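The categories above can be reproduced with a small classifier. The following is a minimal sketch, assuming each person file is a JSON object with a `profile_data` object containing `full_name` and `experience` (the two field names this report mentions). The function names and the `name_only` label are hypothetical; the report lists no name-only category, and the three reported counts already add up to the total.

```python
from collections import Counter


def classify_profile(record: dict) -> str:
    """Classify one person record by which of name/experience is present."""
    profile = record.get("profile_data") or {}
    has_name = bool(str(profile.get("full_name") or "").strip())
    has_experience = bool(profile.get("experience"))  # empty list counts as missing
    if has_name and has_experience:
        return "complete"
    if has_experience:
        return "experience_only"
    if has_name:
        return "name_only"  # hypothetical label, not a category in this report
    return "empty"


def summarize(records) -> Counter:
    """Count records per category."""
    return Counter(classify_profile(r) for r in records)
```

Calling `summarize` on the loaded records returns a `Counter` keyed by category, from which the percentages can be derived by dividing by the total number of records.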

Archived Organization Files

593 files were identified as misclassified organizations and moved to _archived_orgs/.

These were LinkedIn company pages incorrectly stored as person files. Each archived file had all of the following characteristics:

  • Empty profile_data.experience array
  • No profile_data.full_name
  • Filename matched the organization name given in the rationale
  • Rationale of the form: "Identified as staff at [ORGANIZATION]"
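The characteristics above could be combined into a single check along these lines. This is only a sketch, not the procedure actually used for the cleanup: it assumes the rationale is stored under a top-level `rationale` key and that "filename matched organization name" can be approximated by comparing lowercase ASCII slugs. Both the key location and the helper names (`slugify`, `looks_like_organization`) are hypothetical.

```python
import re

# Matches the rationale wording quoted in this report.
RATIONALE_PATTERN = re.compile(r"Identified as staff at (?P<org>.+)")


def slugify(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric ASCII run into a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def looks_like_organization(filename: str, record: dict) -> bool:
    """Return True when a record has all the archived-file characteristics."""
    profile = record.get("profile_data") or {}
    if profile.get("experience") or profile.get("full_name"):
        return False
    # Assumption: the rationale lives under a top-level "rationale" key.
    match = RATIONALE_PATTERN.search(str(record.get("rationale") or ""))
    if match is None:
        return False
    org_slug = slugify(match.group("org"))
    return bool(org_slug) and org_slug in slugify(filename)
```

Because the slug keeps only ASCII letters and digits, organization names written entirely in another script would never match; a real implementation would need a Unicode-aware comparison.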

Incomplete Profile Analysis

The 35,723 empty profiles are legitimate person records for heritage institution staff, NOT misclassified organizations. Their data is incomplete because:

  1. LinkedIn privacy settings prevented full profile extraction
  2. Staff members were visible on company pages, but their individual profiles were not accessible

Top Institutions with Incomplete Staff Profiles

| Count | Institution |
|---|---|
| 581 | Reinwardt Academie |
| 525 | The National Archives, UK |
| 494 | Maastricht University Faculty of Arts and Social Sciences |
| 485 | The Metropolitan Museum of Art |
| 423 | وزارة الثقافة Ministry of Culture |
| 369 | University of Humanistic Studies |
| 356 | l'Institut national du patrimoine |
| 352 | École nationale des chartes |
| 343 | The Museum of Modern Art |
| 332 | Rijksmuseum |
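A ranking like the one above can be produced by tallying incomplete records per institution. The sketch below assumes that person records carry the institution in a top-level `rationale` string of the form "Identified as staff at [ORGANIZATION]", and it treats any record lacking a name or experience as incomplete. Both are assumptions, and `top_institutions` is a hypothetical helper name.

```python
import re
from collections import Counter

STAFF_RATIONALE = re.compile(r"Identified as staff at (.+)")


def top_institutions(records, limit=10):
    """Return (institution, count) pairs for incomplete profiles, largest first."""
    counts = Counter()
    for record in records:
        profile = record.get("profile_data") or {}
        if profile.get("full_name") and profile.get("experience"):
            continue  # complete profile, not part of this ranking
        # Assumption: the institution name is embedded in record["rationale"].
        match = STAFF_RATIONALE.search(str(record.get("rationale") or ""))
        if match:
            counts[match.group(1).strip()] += 1
    return counts.most_common(limit)
```

The same tally could feed the prioritization step: sorting by count, or filtering to a hand-picked list of key institutions, gives a re-extraction queue.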

Recommendations

  1. Keep incomplete profiles - They represent real staff members at heritage institutions
  2. Consider re-extraction - Some profiles may now be accessible
  3. Add metadata flag - Mark profiles as extraction_status: incomplete
  4. Prioritize by institution - Focus re-extraction on key institutions such as the Rijksmuseum and The Museum of Modern Art (MoMA)
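The metadata flag from recommendation 3 could be applied with a short script such as the following. It is a minimal sketch under stated assumptions: the flag is written as a top-level `extraction_status` key (the report names the flag but not where it should live), a profile counts as incomplete when either `profile_data.full_name` or `profile_data.experience` is missing, and `flag_incomplete` is a hypothetical helper name.

```python
import json
from pathlib import Path


def flag_incomplete(path: Path) -> bool:
    """Add extraction_status: incomplete to one file; return True if it was modified."""
    record = json.loads(path.read_text(encoding="utf-8"))
    profile = record.get("profile_data") or {}
    if profile.get("full_name") and profile.get("experience"):
        return False  # complete profile, leave untouched
    if record.get("extraction_status") == "incomplete":
        return False  # already flagged, avoid rewriting the file
    # Assumption: the flag is stored at the top level of the record.
    record["extraction_status"] = "incomplete"
    path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return True
```

To process the whole directory, the function can be called for each path returned by `Path("data/person").glob("ID_*.json")`; the glob does not descend into `_archived_orgs/`. Running it a second time changes nothing, since flagged files are skipped.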

File Structure After Cleanup

data/person/
├── ID_*.json           # 40,122 person files (including incomplete)
├── _archived_orgs/     # 593 misclassified organization files
├── _manifest.json      # Auto-generated pipeline manifest
└── _data_quality_report.md  # This file