glam/.opencode/DATA_FABRICATION_PROHIBITION.md
2025-12-12 12:51:10 +01:00

# Data Fabrication Prohibition
## 🚨 CRITICAL RULE: NO DATA FABRICATION ALLOWED 🚨
**ALL DATA MUST BE REAL AND VERIFIABLE.** Fabricating any data, even as a placeholder, is strictly prohibited and undermines the integrity of the project.
## What Constitutes Data Fabrication
### ❌ FORBIDDEN (Never do this):
- Creating fake names for people
- Inventing job titles or companies
- Making up education history
- Generating placeholder skills or experience
- Creating fictional LinkedIn URLs
- Any "fallback" data when real data is unavailable
### ✅ ALLOWED (What you CAN do):
- Skip profiles that cannot be extracted
- Return `null` or empty fields for missing data
- Mark profiles with `extraction_error: true`
- Log why extraction failed
- Use "Unknown" or "Not specified" for display purposes (not in stored data)
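The allowed behaviors above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `failed_record` and `display_name`, the fields `source_url` and `error_reason`, and the `example.invalid` URL are assumptions made for the example. Only `name`, `headline`, `experience`, and `extraction_error` come from this document.

```python
import json
import logging

logger = logging.getLogger("extraction")


def failed_record(url, reason):
    """Build a record for a profile that could not be extracted.

    Hypothetical helper: every content field stays empty, nothing is invented.
    """
    logger.warning("Extraction failed for %s: %s", url, reason)
    return {
        "source_url": url,         # assumed field: the URL that was requested
        "name": None,              # missing data stays null
        "headline": None,
        "experience": [],
        "extraction_error": True,
        "error_reason": reason,    # assumed field: why extraction failed
    }


def display_name(record):
    """Return a label for display only; the stored record is not modified."""
    return record.get("name") or "Unknown"


# Placeholder input for the demo; .invalid is a reserved, non-resolving domain.
record = failed_record("https://example.invalid/profile", "API timeout")
print(json.dumps(record))      # "name" is serialized as null
print(display_name(record))    # the label exists only in the output
```

The "Unknown" label is computed at display time and never written back, so the stored record still shows that the name is missing.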
## Real-World Example of Violation
**What happened**: The script created a fake profile for "Celyna Keates" when the Exa API failed.
**Why this is wrong**:
- The person does not exist
- The LinkedIn URL was fabricated
- All profile data was invented by the LLM
- This pollutes the dataset with false information
**Correct approach**:
- When the API fails, skip the profile
- Log the failure
- Do NOT create any fallback data
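As a rough sketch of that approach in an extraction loop: `extract_profiles` and its `fetch_profile` argument are hypothetical names, with `fetch_profile` standing in for whatever client actually calls the API. A failed or empty response is logged and skipped, and no substitute record is built.

```python
import logging

logger = logging.getLogger("extraction")


def extract_profiles(urls, fetch_profile):
    """Return the profiles that were actually fetched, plus the skipped URLs.

    fetch_profile is a hypothetical callable (url -> dict or None).
    """
    profiles = []
    skipped = []
    for url in urls:
        try:
            data = fetch_profile(url)
        except Exception as exc:
            # API failure: log it and move on, never substitute data
            logger.error("Fetch failed for %s: %s", url, exc)
            skipped.append(url)
            continue
        if not data:
            logger.error("No data returned for %s", url)
            skipped.append(url)
            continue
        profiles.append(data)
    return profiles, skipped
```

Returning the skipped URLs alongside the results keeps the failures available for manual review without putting anything about them into the dataset.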
## Technical Implementation
### In Extraction Scripts
```python
# ❌ WRONG - Creating fallback data
if not profile_data:
    profile_data = {
        "name": "Unknown Person",
        "headline": "No data available",
        "experience": []
    }

# ✅ CORRECT - Skip when no real data
if not profile_data or profile_data.get("extraction_error"):
    print(f"Skipping {url} - extraction failed")
    return None  # Do NOT save anything
```
### In Validation Functions
```python
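def validate_profile(profile_data):
    """Return True only if the profile carries a real, non-generic name."""
    name = profile_data.get("name")

    # Must have a real name from LinkedIn (a string of at least 2 characters)
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False

    # Name must not be generic/fabricated; compare case-insensitively
    # and ignore surrounding whitespace
    generic_names = ["linkedin member", "unknown", "n/a", "not specified"]
    if name.strip().lower() in generic_names:
        return False

    return True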
```
## User Requirements
From project leadership:
- **"ALL PROFILES SHOULD BE REAL!!!"**
- **"Fabricating data is strictly prohibited"**
- **"Better to have missing data than fake data"**
## Consequences of Violation
1. **Data Integrity**: Fabricated data corrupts the dataset's research value
2. **Trust**: Undermines confidence in the entire dataset
3. **Legal**: May constitute misrepresentation
4. **Reputation**: Damages project credibility
## When in Doubt
1. **Don't save** if you cannot verify data authenticity
2. **Log the issue** for manual review
3. **Skip the record** - it's better to have no data than bad data
4. **Ask for clarification** if unsure about validation rules
## Reporting Violations
If you discover fabricated data:
1. Immediately remove the fabricated records
2. Document how it happened
3. Implement safeguards to prevent recurrence
4. Report to project maintainers
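Steps 1 and 2 could be sketched roughly like this. The function `remove_fabricated`, the `is_fabricated` predicate, and the report fields are all hypothetical. Deciding which records are fabricated remains a manual judgment that the predicate only encodes.

```python
from datetime import datetime, timezone


def remove_fabricated(records, is_fabricated, note):
    """Split records into kept and removed, and describe the removal.

    is_fabricated is a hypothetical predicate supplied by the reviewer;
    note is a free-text explanation of how the fabrication happened.
    """
    kept = [r for r in records if not is_fabricated(r)]
    removed = [r for r in records if is_fabricated(r)]
    report = {
        "removed_at": datetime.now(timezone.utc).isoformat(),
        "removed_count": len(removed),
        "how_it_happened": note,
        "removed_records": removed,  # kept in the report for the maintainers
    }
    return kept, report
```

The report keeps the removed records separate from the dataset, so maintainers can review what was deleted and why.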
Remember: **Real data or no data.** There is no third option.