glam/.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md
2025-12-14 17:09:55 +01:00

Social Media Link Validation Rules

Rule: AGENTS.md Rule 23
Status: ACTIVE
Created: 2025-12-12

Summary

Social media links MUST point to the specific institution's page, NOT to a platform homepage or to the platform's own account. Generic links carry no information about the institution, so every link MUST be validated before it is stored, and generic links MUST be rejected.

Invalid Patterns

These patterns indicate generic or invalid social media links that MUST be rejected:

Facebook

Pattern                         Why Invalid
facebook.com/                   Generic homepage
facebook.com/facebook           Facebook's own page
facebook.com/home.php           User homepage redirect
facebook.com/profile.php        Generic profile page
facebook.com/watch              Generic watch page

Twitter/X

Pattern                         Why Invalid
twitter.com/                    Generic homepage
twitter.com/twitter             Twitter's own account
twitter.com/home                User home timeline
x.com/                          Generic X homepage
x.com/twitter                   Twitter's own account on X

Instagram

Pattern                         Why Invalid
instagram.com/                  Generic homepage
instagram.com/instagram         Instagram's own account
instagram.com/explore           Explore page

LinkedIn

Pattern                         Why Invalid
linkedin.com/                   Generic homepage
linkedin.com/company/linkedin   LinkedIn's own page
linkedin.com/feed               User feed
linkedin.com/jobs               Jobs page

YouTube

Pattern                         Why Invalid
youtube.com/                    Generic homepage
youtube.com/youtube             YouTube's own channel
youtube.com/feed                User feed
youtube.com/watch               Generic watch page (no video ID)

TikTok

Pattern                         Why Invalid
tiktok.com/                     Generic homepage
tiktok.com/tiktok               TikTok's own account
tiktok.com/foryou               For You page

Valid Patterns

Facebook

facebook.com/{page_name}/           # Page with name
facebook.com/pages/{name}/{id}      # Legacy pages format
facebook.com/{page_name}            # Page without trailing slash

Twitter/X

twitter.com/{username}              # User profile
x.com/{username}                    # User profile on X

Instagram

instagram.com/{username}/           # Profile with trailing slash
instagram.com/{username}            # Profile without trailing slash

LinkedIn

linkedin.com/company/{slug}/        # Company page
linkedin.com/school/{slug}/         # School/university page
linkedin.com/in/{username}/         # Personal profile (less common for institutions)

YouTube

youtube.com/@{handle}               # Channel with handle
youtube.com/c/{custom_url}          # Custom URL channel
youtube.com/channel/{channel_id}    # Channel ID format (UC...)
youtube.com/user/{username}         # Legacy user format

TikTok

tiktok.com/@{username}              # User profile
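
The valid formats above can also be written as an allow-list that captures the institution-specific part of the URL. The following is only a sketch: `VALID_PROFILE_PATTERNS` and `extract_profile_id` are hypothetical names that do not appear elsewhere in this document, and the patterns do not enforce each platform's allowed characters for handles.

```python
import re
from typing import Optional

# Hypothetical allow-list derived from the valid formats listed above.
# Each pattern captures the handle, slug, or ID in group 1.
_PREFIX = r'^https?://(?:www\.)?'
VALID_PROFILE_PATTERNS = {
    'facebook': [
        _PREFIX + r'facebook\.com/pages/[^/?#]+/([^/?#]+)/?$',  # legacy pages format
        _PREFIX + r'facebook\.com/([^/?#]+)/?$',
    ],
    'twitter': [_PREFIX + r'(?:twitter|x)\.com/([^/?#]+)/?$'],
    'instagram': [_PREFIX + r'instagram\.com/([^/?#]+)/?$'],
    'linkedin': [_PREFIX + r'linkedin\.com/(?:company|school|in)/([^/?#]+)/?$'],
    'youtube': [
        _PREFIX + r'youtube\.com/@([^/?#]+)/?$',
        _PREFIX + r'youtube\.com/(?:c|user)/([^/?#]+)/?$',
        _PREFIX + r'youtube\.com/channel/(UC[^/?#]+)/?$',
    ],
    'tiktok': [_PREFIX + r'tiktok\.com/@([^/?#]+)/?$'],
}


def extract_profile_id(platform: str, url: str) -> Optional[str]:
    """Return the captured handle, slug, or ID, or None if no format matches."""
    for pattern in VALID_PROFILE_PATTERNS.get(platform.lower().strip(), []):
        match = re.match(pattern, url.strip(), re.IGNORECASE)
        if match:
            return match.group(1)
    return None
```

An allow-list of this kind complements the invalid patterns rather than replacing them: `facebook.com/facebook` has a valid shape and would yield `facebook` here, so it still has to be rejected by the generic-link check.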

Python Implementation

import re
from typing import Optional

# Patterns that indicate INVALID (generic) links
INVALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'^https?://(www\.)?facebook\.com/?$',
        r'^https?://(www\.)?facebook\.com/facebook/?$',
        r'^https?://(www\.)?facebook\.com/home\.php',
        r'^https?://(www\.)?facebook\.com/profile\.php$',
        r'^https?://(www\.)?facebook\.com/watch/?$',
    ],
    'twitter': [
        r'^https?://(www\.)?(twitter|x)\.com/?$',
        r'^https?://(www\.)?(twitter|x)\.com/twitter/?$',
        r'^https?://(www\.)?(twitter|x)\.com/home/?$',
    ],
    'instagram': [
        r'^https?://(www\.)?instagram\.com/?$',
        r'^https?://(www\.)?instagram\.com/instagram/?$',
        r'^https?://(www\.)?instagram\.com/explore([/?#].*)?$',  # explore page and its subpaths only
    ],
    'linkedin': [
        r'^https?://(www\.)?linkedin\.com/?$',
        r'^https?://(www\.)?linkedin\.com/company/linkedin/?$',
        r'^https?://(www\.)?linkedin\.com/feed/?$',
        r'^https?://(www\.)?linkedin\.com/jobs([/?#].*)?$',  # jobs page and its subpaths only
    ],
    'youtube': [
        r'^https?://(www\.)?youtube\.com/?$',
        r'^https?://(www\.)?youtube\.com/youtube/?$',
        r'^https?://(www\.)?youtube\.com/feed([/?#].*)?$',  # feed page and its subpaths only
        r'^https?://(www\.)?youtube\.com/watch/?$',  # watch without video ID
    ],
    'tiktok': [
        r'^https?://(www\.)?tiktok\.com/?$',
        r'^https?://(www\.)?tiktok\.com/tiktok/?$',
        r'^https?://(www\.)?tiktok\.com/foryou/?$',
    ],
}


def is_valid_social_media_link(platform: str, url: Optional[str]) -> bool:
    """
    Check if a social media URL is valid (points to specific institution page).
    
    Args:
        platform: Platform name (facebook, twitter, instagram, linkedin, youtube, tiktok)
        url: The URL to validate
        
    Returns:
        True if the URL is valid (institution-specific), False if generic/invalid
    """
    if not url:
        return False
    
    url = url.strip()
    if not url:
        return False
    
    # Normalize platform name
    platform = platform.lower().strip()
    
    # Get invalid patterns for this platform
    patterns = INVALID_SOCIAL_MEDIA_PATTERNS.get(platform, [])
    
    # Check against invalid patterns
    for pattern in patterns:
        if re.match(pattern, url, re.IGNORECASE):
            return False  # Matches a generic/invalid pattern
    
    return True


def validate_social_media_dict(social_media: dict) -> dict:
    """
    Validate a dictionary of social media links, removing invalid ones.
    
    Args:
        social_media: Dict like {'facebook': 'url', 'twitter': 'url', ...}
        
    Returns:
        Dict with only valid social media links
    """
    validated = {}
    
    for platform, url in social_media.items():
        if is_valid_social_media_link(platform, url):
            validated[platform] = url
        else:
            print(f"WARNING: Removing invalid {platform} link: {url}")
    
    return validated


def extract_platform_from_url(url: str) -> Optional[str]:
    """
    Extract the platform name from a social media URL.
    
    Args:
        url: A social media URL
        
    Returns:
        Platform name or None if not recognized
    """
    if not url:
        return None
    
    # Compare against the hostname only. A plain substring test such as
    # 'x.com' in url would also match unrelated domains (e.g. netflix.com).
    match = re.match(r'^(?:https?://)?([^/?#:]+)', url.strip().lower())
    if not match:
        return None
    host = match.group(1)
    
    platform_domains = {
        'facebook.com': 'facebook',
        'twitter.com': 'twitter',
        'x.com': 'twitter',
        'instagram.com': 'instagram',
        'linkedin.com': 'linkedin',
        'youtube.com': 'youtube',
        'youtu.be': 'youtube',
        'tiktok.com': 'tiktok',
    }
    
    # Accept the bare domain and any subdomain (www., m., nl., ...)
    for domain, platform in platform_domains.items():
        if host == domain or host.endswith('.' + domain):
            return platform
    
    return None
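
The invalid patterns above expect an explicit `http(s)://` scheme and at most a `www.` prefix, so inputs such as `m.facebook.com/` or `facebook.com/facebook` (no scheme) would not be matched and would pass as valid. One possible pre-processing step is sketched below. The helper name `normalize_social_media_url`, the stripped prefixes, and the list of paths whose query string is kept are all assumptions, not part of the module above.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: on these paths the query string identifies the target
# (profile.php?id=..., watch?v=...), so it is kept.
KEEP_QUERY_PATHS = ('/profile.php', '/watch')


def normalize_social_media_url(url: str) -> str:
    """Return the URL with an https scheme, no www./m./mobile. prefix, and no fragment."""
    url = url.strip()
    if '://' not in url:
        url = 'https://' + url  # assume https when the scheme is missing
    parts = urlsplit(url)
    host = (parts.hostname or '').lower()
    for prefix in ('www.', 'm.', 'mobile.'):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Drop tracking parameters except where the query carries the identity
    query = parts.query if parts.path in KEEP_QUERY_PATHS else ''
    return urlunsplit(('https', host, parts.path, query, ''))
```

Calling this before `is_valid_social_media_link` would let the existing patterns also catch mobile, scheme-less, and tracking-parameter variants of the generic links.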

Usage in Enrichment Scripts

from social_media_validation import is_valid_social_media_link

def enrich_institution_social_media(ghcid: str, google_data: dict) -> dict:
    """Enrich institution with validated social media links."""
    
    social_media = {}
    
    # Extract social media from Google Maps data
    raw_social = {
        'facebook': google_data.get('facebook'),
        'twitter': google_data.get('twitter'),
        'instagram': google_data.get('instagram'),
        'youtube': google_data.get('youtube'),
    }
    
    # Validate each link
    for platform, url in raw_social.items():
        if url and is_valid_social_media_link(platform, url):
            social_media[platform] = url
        elif url:
            print(f"Skipping invalid {platform} for {ghcid}: {url}")
    
    return social_media

Real-World Example

Problem Detected

# API returned this for Agrarisch Museum Westerhem
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# {"facebook": "https://www.facebook.com/facebook"}

This is INVALID because:

  • facebook.com/facebook is Facebook's own corporate page
  • It provides zero information about Agrarisch Museum Westerhem
  • It was likely a default/fallback value from some enrichment source

Solution

  1. Validate before storing: is_valid_social_media_link('facebook', url) returns False
  2. Do not write invalid links to custodian YAML
  3. Clean up existing invalid data from database
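
Step 3 can be sketched as a pass over already-stored records. This is a minimal illustration under stated assumptions: the record layout (a dict with `ghcid` and `social_media` keys) is inferred from the examples in this document, and `remove_invalid_links` is a hypothetical helper. The validator is passed in as an argument, so `is_valid_social_media_link` can be supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple


def remove_invalid_links(
    records: List[dict],
    is_valid: Callable[[str, Optional[str]], bool],
) -> List[Tuple[Optional[str], str, Optional[str]]]:
    """
    Remove invalid social media links from each record in place.

    Returns a list of (ghcid, platform, url) tuples describing what was removed,
    so the cleanup can be reviewed or logged.
    """
    removed = []
    for record in records:
        links = record.get('social_media') or {}
        # Iterate over a copy because entries are deleted during the loop
        for platform, url in list(links.items()):
            if not is_valid(platform, url):
                removed.append((record.get('ghcid'), platform, url))
                del links[platform]
    return removed
```

Returning the removals instead of printing them keeps the function easy to test and leaves the reporting decision to the caller.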

Testing

def test_invalid_patterns():
    """Test that generic links are correctly identified as invalid."""
    
    # These should all return False (invalid)
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/')
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/facebook')
    assert not is_valid_social_media_link('twitter', 'https://twitter.com/')
    assert not is_valid_social_media_link('twitter', 'https://x.com/twitter')
    assert not is_valid_social_media_link('instagram', 'https://instagram.com/instagram')
    assert not is_valid_social_media_link('youtube', 'https://youtube.com/')
    
    # These should all return True (valid)
    assert is_valid_social_media_link('facebook', 'https://www.facebook.com/rijksmuseum/')
    assert is_valid_social_media_link('twitter', 'https://twitter.com/rijksmuseum')
    assert is_valid_social_media_link('instagram', 'https://instagram.com/rijksmuseum')
    assert is_valid_social_media_link('youtube', 'https://youtube.com/@Rijksmuseum')
    assert is_valid_social_media_link('linkedin', 'https://linkedin.com/company/rijksmuseum')

Related Rules

  • Rule 22: Custodian YAML Files Are the Single Source of Truth
  • Rule 5: Data enrichment is ADDITIVE ONLY
  • Rule 21: Data Fabrication is Strictly Prohibited

See Also

  • .opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md - Data source of truth rules
  • AGENTS.md - Complete agent instructions