# Data Fabrication Prohibition

## 🚨 CRITICAL RULE: NO DATA FABRICATION ALLOWED 🚨

**ALL DATA MUST BE REAL AND VERIFIABLE.** Fabricating any data, even as a placeholder, is strictly prohibited and violates project integrity.

## What Constitutes Data Fabrication

### ❌ FORBIDDEN (Never do this):

- Creating fake names for people
- Inventing job titles or companies
- Making up education history
- Generating placeholder skills or experience
- Creating fictional LinkedIn URLs
- Any "fallback" data when real data is unavailable

### ✅ ALLOWED (What you CAN do):

- Skip profiles that cannot be extracted
- Return `null` or empty fields for missing data
- Mark profiles with `extraction_error: true`
- Log why extraction failed
- Use "Unknown" or "Not specified" for display purposes (not in stored data)

## Real-World Example of Violation

**What happened**: The script created a fake profile for "Celyna Keates" when the Exa API failed.

**Why this is wrong**:

- The person does not exist
- The LinkedIn URL was fabricated
- All profile data was invented by the LLM
- This pollutes the dataset with false information

**Correct approach**:

- When the API fails, skip the profile
- Log the failure
- Do NOT create any fallback data

## Technical Implementation

### In Extraction Scripts

```python
# ❌ WRONG - Creating fallback data
if not profile_data:
    profile_data = {
        "name": "Unknown Person",
        "headline": "No data available",
        "experience": []
    }

# ✅ CORRECT - Skip when no real data (inside the per-URL extraction function)
if not profile_data or profile_data.get("extraction_error"):
    print(f"Skipping {url} - extraction failed")
    return None  # Do NOT save anything
```

### In Validation Functions

```python
def validate_profile(profile_data):
    # Must have a real name from LinkedIn (a non-empty string of 2+ characters)
    name = profile_data.get("name")
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False

    # Name must not be generic/fabricated
    generic_names = ["linkedin member", "unknown", "n/a", "not specified"]
    if name.strip().lower() in generic_names:
        return False

    return True
```

## User Requirements

From project leadership:

- **"ALL PROFILES SHOULD BE REAL!!!"**
- **"Fabricating data is strictly prohibited"**
- **"Better to have missing data than fake data"**

## Consequences of Violation

1. **Data Integrity**: Fabricated data corrupts research value
2. **Trust**: Undermines confidence in the entire dataset
3. **Legal**: May constitute misrepresentation
4. **Reputation**: Damages project credibility

## When in Doubt

1. **Don't save** if you cannot verify data authenticity
2. **Log the issue** for manual review
3. **Skip the record** - it's better to have no data than bad data
4. **Ask for clarification** if unsure about validation rules

## Reporting Violations

If you discover fabricated data:

1. Immediately remove the fabricated records
2. Document how it happened
3. Implement safeguards to prevent recurrence
4. Report to project maintainers

Remember: **Real data or no data.** There is no third option.
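
The skip-and-log rule from the Technical Implementation section can be combined with the name validation into one extraction loop. The following is a minimal sketch, not the project's actual script: `extract_profiles`, `is_real_profile`, and the `fetch` callable are hypothetical names introduced here for illustration, and `fetch` stands in for whatever API client the project really uses.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("extraction")

# Placeholder values that must never be accepted as a real name.
GENERIC_NAMES = {"linkedin member", "unknown", "n/a", "not specified"}


def is_real_profile(profile: Optional[dict]) -> bool:
    """Return True only if the profile has a usable, non-placeholder name."""
    if not profile or profile.get("extraction_error"):
        return False
    name = profile.get("name")
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False
    return name.strip().lower() not in GENERIC_NAMES


def extract_profiles(urls: Iterable[str],
                     fetch: Callable[[str], Optional[dict]]):
    """Collect only verified profiles; log and skip everything else.

    `fetch` is a hypothetical callable (assumption) wrapping the real API.
    Returns (kept_profiles, skipped_urls).
    """
    kept, skipped = [], []
    for url in urls:
        try:
            profile = fetch(url)
        except Exception as exc:
            # API failure: log and skip, never substitute fallback data.
            logger.warning("Skipping %s - extraction failed: %s", url, exc)
            skipped.append(url)
            continue
        if not is_real_profile(profile):
            logger.warning("Skipping %s - no verifiable profile data", url)
            skipped.append(url)
            continue
        kept.append(profile)
    return kept, skipped
```

Returning the skipped URLs alongside the kept profiles keeps failures visible for manual review without writing anything about them into the dataset.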