glam/.opencode/DATA_FABRICATION_PROHIBITION.md
2025-12-12 12:51:10 +01:00

Data Fabrication Prohibition

🚨 CRITICAL RULE: NO DATA FABRICATION ALLOWED 🚨

ALL DATA MUST BE REAL AND VERIFIABLE. Fabricating any data, even as placeholders, is strictly prohibited and violates project integrity.

What Constitutes Data Fabrication

FORBIDDEN (Never do this):

  • Creating fake names for people
  • Inventing job titles or companies
  • Making up education history
  • Generating placeholder skills or experience
  • Creating fictional LinkedIn URLs
  • Substituting any "fallback" data when real data is unavailable

ALLOWED (What you CAN do):

  • Skip profiles that cannot be extracted
  • Return null or empty fields for missing data
  • Mark profiles with extraction_error: true
  • Log why extraction failed
  • Show "Unknown" or "Not specified" in the display layer only; never write these strings to stored data
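
The allowed behaviours above can be sketched in a few lines. This is a minimal illustration, not the project's actual schema: the helper names `failed_record` and `display_name` and the `url` and `error_reason` fields are assumptions made for the example. Only `extraction_error` comes from this document.

```python
def failed_record(url: str, reason: str) -> dict:
    """Build a stored record for a profile that could not be extracted."""
    return {
        "url": url,                # the URL that was requested, never an invented one
        "name": None,              # missing data stays null
        "headline": None,
        "experience": [],
        "extraction_error": True,  # marks the record as a failed extraction
        "error_reason": reason,    # hypothetical field: why extraction failed
    }


def display_name(record: dict) -> str:
    """Return a label for display; the placeholder is never written back."""
    return record.get("name") or "Unknown"
```

The placeholder string exists only in the return value of `display_name`; the stored record keeps `None`, so a later validation pass can still tell real names from missing ones.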

Real-World Example of Violation

What happened: The script created a fake profile for "Celyna Keates" when the Exa API failed.

Why this is wrong:

  • The person does not exist
  • The LinkedIn URL was fabricated
  • All profile data was invented by the LLM
  • This pollutes the dataset with false information

Correct approach:

  • When the API fails, skip the profile
  • Log the failure
  • Do NOT create any fallback data
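
A minimal sketch of that approach, assuming the API call is wrapped in a caller-supplied function. The names `extract_or_skip` and `fetch_profile` are hypothetical and do not refer to any real Exa client call; the logger name is also an assumption.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("extraction")  # logger name is an assumption


def extract_or_skip(
    url: str,
    fetch_profile: Callable[[str], Optional[dict]],  # hypothetical API wrapper
) -> Optional[dict]:
    """Return the fetched profile, or None if no real data could be obtained."""
    try:
        profile = fetch_profile(url)
    except Exception as exc:
        # API failure: log it and skip. No fallback profile is built here.
        logger.warning("Skipping %s: API call failed (%s)", url, exc)
        return None
    if not profile:
        logger.warning("Skipping %s: API returned no data", url)
        return None
    return profile
```

The caller treats `None` as "do not save anything", so a failed request leaves no trace in the dataset beyond the log entry.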

Technical Implementation

In Extraction Scripts

# Function names below are illustrative only.

# ❌ WRONG - Creating fallback data
def load_profile_wrong(profile_data):
    if not profile_data:
        profile_data = {
            "name": "Unknown Person",
            "headline": "No data available",
            "experience": []
        }
    return profile_data

# ✅ CORRECT - Skip when no real data
def load_profile(url, profile_data):
    if not profile_data or profile_data.get("extraction_error"):
        print(f"Skipping {url} - extraction failed")
        return None  # Do NOT save anything
    return profile_data

In Validation Functions

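def validate_profile(profile_data):
    """Return True only if the profile carries a usable, non-placeholder name."""
    # Reject missing profiles and profiles flagged as failed extractions
    if not profile_data or profile_data.get("extraction_error"):
        return False

    # Must have a real name from LinkedIn (a string of at least 2 characters)
    name = profile_data.get("name")
    if not isinstance(name, str) or len(name.strip()) < 2:
        return False

    # Name must not be generic/fabricated; compare case-insensitively
    generic_names = {"linkedin member", "unknown", "unknown person", "n/a", "not specified"}
    if name.strip().lower() in generic_names:
        return False

    return True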

User Requirements

From project leadership:

  • "ALL PROFILES SHOULD BE REAL!!!"
  • "Fabricating data is strictly prohibited"
  • "Better to have missing data than fake data"

Consequences of Violation

  1. Data Integrity: Fabricated data corrupts the dataset's research value
  2. Trust: Undermines confidence in the entire dataset
  3. Legal: May constitute misrepresentation
  4. Reputation: Damages project credibility

When in Doubt

  1. Don't save if you cannot verify data authenticity
  2. Log the issue for manual review
  3. Skip the record; it's better to have no data than bad data
  4. Ask for clarification if unsure about validation rules

Reporting Violations

If you discover fabricated data:

  1. Immediately remove the fabricated records
  2. Document how it happened
  3. Implement safeguards to prevent recurrence
  4. Report to project maintainers
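
As a starting point for the removal step, the sketch below separates stored records whose name is a known placeholder string from the rest, so the flagged ones can be removed and documented. The function name `audit_records` is hypothetical, and the placeholder list is taken from the examples in this document. Note the limitation: this check only catches placeholder strings. A plausible-looking invented profile, like the one in the violation example, can only be found by tracing each record back to its source.

```python
PLACEHOLDER_NAMES = {
    "linkedin member", "unknown", "unknown person", "n/a", "not specified",
}


def audit_records(records):
    """Split records into (kept, flagged) by placeholder-name check."""
    kept, flagged = [], []
    for record in records:
        name = record.get("name")
        if isinstance(name, str) and name.strip().lower() in PLACEHOLDER_NAMES:
            flagged.append(record)  # candidate for removal and reporting
        else:
            kept.append(record)
    return kept, flagged
```

Keeping the flagged records in a separate list, rather than deleting them in place, makes it possible to document what was removed before reporting to the maintainers.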

Remember: Real data or no data. There is no third option.