Data Fabrication Prohibition
🚨 CRITICAL RULE: NO DATA FABRICATION ALLOWED 🚨
ALL DATA MUST BE REAL AND VERIFIABLE. Fabricating any data - even as placeholders - is strictly prohibited and violates project integrity.
What Constitutes Data Fabrication
❌ FORBIDDEN (Never do this):
- Creating fake names for people
- Inventing job titles or companies
- Making up education history
- Generating placeholder skills or experience
- Creating fictional LinkedIn URLs
- Any "fallback" data when real data is unavailable
✅ ALLOWED (What you CAN do):
- Skip profiles that cannot be extracted
- Return null or empty fields for missing data
- Mark profiles with extraction_error: true
- Log why extraction failed
- Use "Unknown" or "Not specified" for display purposes (not in stored data)
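The allowed options above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names make_failed_record and display_value, and the fields source_url and error_reason, are assumptions introduced here. The point is that missing values are stored as null, and a label such as "Not specified" is applied only at display time.

```python
from typing import Optional


def make_failed_record(url: str, reason: str) -> dict:
    """Build a record for a profile that could not be extracted.

    Hypothetical helper: field names other than extraction_error are
    assumptions. No value is invented; everything unknown stays None or empty.
    """
    return {
        "source_url": url,          # the URL that was requested, never a made-up one
        "name": None,               # missing data is stored as null
        "headline": None,
        "experience": [],
        "extraction_error": True,   # marks the record as a failed extraction
        "error_reason": reason,     # why extraction failed, for later review
    }


def display_value(value: Optional[str]) -> str:
    """Return a label for the UI only; the stored record is left untouched."""
    return value if value else "Not specified"
```

A report or UI layer would call display_value(record["name"]) when rendering, while the stored record keeps name as None.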
Real-World Example of Violation
What happened: The script created a fake profile for "Celyna Keates" when the Exa API failed.
Why this is wrong:
- The person does not exist
- The LinkedIn URL was fabricated
- All profile data was invented by the LLM
- This pollutes the dataset with false information
Correct approach:
- When the API fails, skip the profile
- Log the failure
- Do NOT create any fallback data
Technical Implementation
In Extraction Scripts
# ❌ WRONG - Creating fallback data
if not profile_data:
    profile_data = {
        "name": "Unknown Person",
        "headline": "No data available",
        "experience": []
    }

# ✅ CORRECT - Skip when no real data
# (runs inside the per-profile extraction function; url is the profile being processed)
if not profile_data or profile_data.get("extraction_error"):
    print(f"Skipping {url} - extraction failed")
    return None  # Do NOT save anything
In Validation Functions
def validate_profile(profile_data):
    """Return True only if the profile carries a plausible real name."""
    # Must have real name from LinkedIn (a string of at least 2 characters)
    name = profile_data.get("name")
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False
    # Name must not be generic/fabricated
    generic_names = {"linkedin member", "unknown", "unknown person", "n/a", "not specified"}
    if name.strip().lower() in generic_names:
        return False
    return True
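One way to apply a validator such as validate_profile across a batch is sketched below. The function partition_profiles and the logger name "extraction" are hypothetical and not part of the project. The validator is passed in as a parameter, so the sketch makes no assumption about the exact validation rules.

```python
import logging

logger = logging.getLogger("extraction")  # logger name is an assumption


def partition_profiles(profiles, is_valid):
    """Split profiles into (kept, rejected) using the supplied validator.

    Rejected entries are logged for manual review and are never repaired
    or filled in with substitute values.
    """
    kept, rejected = [], []
    for profile in profiles:
        if profile is not None and is_valid(profile):
            kept.append(profile)
        else:
            rejected.append(profile)
            logger.warning("Rejected profile for manual review: %r", profile)
    return kept, rejected
```

Only the kept list would be written to storage. The rejected list exists so that failures can be counted and investigated.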
User Requirements
From project leadership:
- "ALL PROFILES SHOULD BE REAL!!!"
- "Fabricating data is strictly prohibited"
- "Better to have missing data than fake data"
Consequences of Violation
- Data Integrity: Fabricated data corrupts research value
- Trust: Undermines confidence in the entire dataset
- Legal: May constitute misrepresentation
- Reputation: Damages project credibility
When in Doubt
- Don't save if you cannot verify data authenticity
- Log the issue for manual review
- Skip the record - it's better to have no data than bad data
- Ask for clarification if unsure about validation rules
Reporting Violations
If you discover fabricated data:
- Immediately remove the fabricated records
- Document how it happened
- Implement safeguards to prevent recurrence
- Report to project maintainers
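The first two steps above (remove, then document) could look roughly like this. It is a sketch under stated assumptions: records are taken to be dicts keyed by a linkedin_url field, and both remove_fabricated and the report layout are hypothetical. How the fabricated records are identified in the first place is left to manual review.

```python
from datetime import datetime, timezone


def remove_fabricated(records, fabricated_urls, cause):
    """Drop records whose linkedin_url was confirmed as fabricated.

    Returns the cleaned list and a small incident report that can be
    handed to the project maintainers. Field names are assumptions.
    """
    flagged = set(fabricated_urls)
    clean = [r for r in records if r.get("linkedin_url") not in flagged]
    present = {r.get("linkedin_url") for r in records}
    report = {
        "removed_count": len(records) - len(clean),
        "removed_urls": sorted(flagged & present),  # only URLs actually found
        "cause": cause,                             # how the fabrication happened
        "reported_at": datetime.now(timezone.utc).isoformat(),
    }
    return clean, report
```

Returning the report alongside the cleaned data keeps removal and documentation in one step, so a cleanup cannot happen without a record of what was removed and why.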
Remember: Real data or no data. There is no third option.