# Data Fabrication Prohibition

## 🚨 CRITICAL RULE: NO DATA FABRICATION ALLOWED 🚨

**ALL DATA MUST BE REAL AND VERIFIABLE.** Fabricating any data, even as placeholders, is strictly prohibited and violates project integrity.

## What Constitutes Data Fabrication

### ❌ FORBIDDEN (Never do this):

- Creating fake names for people
- Inventing job titles or companies
- Making up education history
- Generating placeholder skills or experience
- Creating fictional LinkedIn URLs
- Any "fallback" data when real data is unavailable

### ✅ ALLOWED (What you CAN do):

- Skip profiles that cannot be extracted
- Return `null` or empty fields for missing data
- Mark profiles with `extraction_error: true`
- Log why extraction failed
- Use "Unknown" or "Not specified" for display purposes (not in stored data)

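
The allowed behaviours above can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names in `FIELDS`, the `source_url` key, and the helpers `build_record` and `display_value` are assumptions introduced here. Missing fields are stored as `None`, and a label such as "Not specified" is produced only at display time.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("extraction")

# Hypothetical field list; the real schema may differ.
FIELDS = ("name", "headline", "linkedin_url")


def build_record(raw: Optional[dict[str, Any]], url: str) -> dict[str, Any]:
    """Copy only values that were actually extracted; never invent any."""
    if not raw:
        # Log why extraction failed and mark the record instead of filling it in.
        logger.warning("Extraction failed for %s: no data returned", url)
        return {"source_url": url, "extraction_error": True}
    record: dict[str, Any] = {field: raw.get(field) for field in FIELDS}  # missing -> None
    record["source_url"] = url
    record["extraction_error"] = False
    return record


def display_value(record: dict[str, Any], field: str) -> str:
    """Return a display label without modifying the stored record."""
    value = record.get(field)
    return str(value) if value else "Not specified"
```

Keeping the display label in a separate function means a placeholder string never reaches storage, so it cannot later be mistaken for extracted data.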
## Real-World Example of Violation

**What happened**: The script created a fake profile for "Celyna Keates" when the Exa API failed.

**Why this is wrong**:

- The person does not exist
- The LinkedIn URL was fabricated
- All profile data was invented by the LLM
- This pollutes the dataset with false information

**Correct approach**:

- When the API fails, skip the profile
- Log the failure
- Do NOT create any fallback data

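
A minimal sketch of the correct approach, assuming a hypothetical `fetch_profile` callable that wraps the API request and raises an exception on failure. The function name, its signature, and the logger name are illustrative and do not reflect the real Exa client. On failure the URL is logged and skipped; nothing is substituted.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("extraction")


def extract_profiles(
    urls: Iterable[str],
    fetch_profile: Callable[[str], Optional[dict]],  # hypothetical API wrapper
) -> tuple[list[dict], list[str]]:
    """Return (real profiles, skipped URLs). No fallback data is created."""
    profiles: list[dict] = []
    skipped: list[str] = []
    for url in urls:
        try:
            data = fetch_profile(url)
        except Exception as exc:  # API failure: log it and move on
            logger.warning("Skipping %s: API call failed (%s)", url, exc)
            skipped.append(url)
            continue
        if not data:
            logger.warning("Skipping %s: API returned no data", url)
            skipped.append(url)
            continue
        profiles.append(data)
    return profiles, skipped
```

Returning the skipped URLs alongside the real profiles keeps the failures visible for a later retry or manual review without storing anything for them.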
## Technical Implementation

### In Extraction Scripts

```python
# Fragments from inside a per-profile extraction function;
# `profile_data` and `url` are defined by the surrounding code.

# ❌ WRONG - Creating fallback data
if not profile_data:
    profile_data = {
        "name": "Unknown Person",
        "headline": "No data available",
        "experience": []
    }

# ✅ CORRECT - Skip when no real data
if not profile_data or profile_data.get("extraction_error"):
    print(f"Skipping {url} - extraction failed")
    return None  # Do NOT save anything
```

### In Validation Functions

```python
def validate_profile(profile_data):
    """Return True only if the profile has a real, non-placeholder name."""
    # Must have a real name from LinkedIn
    name = profile_data.get("name")
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False

    # Name must not be generic/fabricated (compare case-insensitively,
    # ignoring surrounding whitespace)
    generic_names = ["linkedin member", "unknown", "n/a", "not specified"]
    if name.strip().lower() in generic_names:
        return False

    return True
```

## User Requirements

From project leadership:

- **"ALL PROFILES SHOULD BE REAL!!!"**
- **"Fabricating data is strictly prohibited"**
- **"Better to have missing data than fake data"**

## Consequences of Violation

1. **Data Integrity**: Fabricated data corrupts the dataset's research value
2. **Trust**: Undermines confidence in the entire dataset
3. **Legal**: May constitute misrepresentation
4. **Reputation**: Damages project credibility

## When in Doubt

1. **Don't save** if you cannot verify data authenticity
2. **Log the issue** for manual review
3. **Skip the record** - it's better to have no data than bad data
4. **Ask for clarification** if unsure about validation rules

## Reporting Violations

If you discover fabricated data:

1. Immediately remove the fabricated records
2. Document how it happened
3. Implement safeguards to prevent recurrence
4. Report to project maintainers

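
As a rough aid for finding records to remove, the sketch below splits a stored dataset into records to keep and records to flag for review. It is a heuristic only: it catches empty and placeholder-style names, and it cannot detect a plausible-looking invented profile such as the example earlier in this document. `PLACEHOLDER_NAMES` and `audit_records` are hypothetical names introduced for this illustration.

```python
from typing import Iterable

# Hypothetical placeholder list; extend it to match your own data.
PLACEHOLDER_NAMES = {
    "linkedin member",
    "unknown",
    "unknown person",
    "n/a",
    "not specified",
}


def audit_records(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, flagged) using a placeholder-name heuristic."""
    kept: list[dict] = []
    flagged: list[dict] = []
    for record in records:
        name = record.get("name")
        normalized = name.strip().lower() if isinstance(name, str) else ""
        if not normalized or normalized in PLACEHOLDER_NAMES:
            flagged.append(record)  # review, then remove and document
        else:
            kept.append(record)
    return kept, flagged
```
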
Remember: **Real data or no data.** There is no third option.