Social Media Link Validation Rules
Rule: AGENTS.md Rule 23
Status: ACTIVE
Created: 2025-12-12
Summary
Social media links MUST point to the specific institution's page, NOT to a generic platform homepage or the platform's own account. Generic links carry no information about the institution, so every link must be validated before it is stored.
Invalid Patterns (Generic Links)
These patterns indicate generic or invalid social media links that MUST be rejected:
Facebook
| Pattern | Why Invalid |
|---------|-------------|
| facebook.com/ | Generic homepage |
| facebook.com/facebook | Facebook's own page |
| facebook.com/home.php | User homepage redirect |
| facebook.com/profile.php | Generic profile page |
| facebook.com/watch | Generic watch page |
Twitter / X
| Pattern | Why Invalid |
|---------|-------------|
| twitter.com/ | Generic homepage |
| twitter.com/twitter | Twitter's own account |
| twitter.com/home | User home timeline |
| x.com/ | Generic X homepage |
| x.com/twitter | Twitter's own account on X |
Instagram
| Pattern | Why Invalid |
|---------|-------------|
| instagram.com/ | Generic homepage |
| instagram.com/instagram | Instagram's own account |
| instagram.com/explore | Explore page |
LinkedIn
| Pattern | Why Invalid |
|---------|-------------|
| linkedin.com/ | Generic homepage |
| linkedin.com/company/linkedin | LinkedIn's own page |
| linkedin.com/feed | User feed |
| linkedin.com/jobs | Jobs page |
YouTube
| Pattern | Why Invalid |
|---------|-------------|
| youtube.com/ | Generic homepage |
| youtube.com/youtube | YouTube's own channel |
| youtube.com/feed | User feed |
| youtube.com/watch | Generic watch page (no video ID) |
TikTok
| Pattern | Why Invalid |
|---------|-------------|
| tiktok.com/ | Generic homepage |
| tiktok.com/tiktok | TikTok's own account |
| tiktok.com/foryou | For You page |
Valid Patterns
Facebook
facebook.com/{page_name}/ # Page with name
facebook.com/pages/{name}/{id} # Legacy pages format
facebook.com/{page_name} # Page without trailing slash
Twitter / X
twitter.com/{username} # User profile
x.com/{username} # User profile on X
Instagram
instagram.com/{username}/ # Profile with trailing slash
instagram.com/{username} # Profile without trailing slash
LinkedIn
linkedin.com/company/{slug}/ # Company page
linkedin.com/school/{slug}/ # School/university page
linkedin.com/in/{username}/ # Personal profile (less common for institutions)
YouTube
youtube.com/@{handle} # Channel with handle
youtube.com/c/{custom_url} # Custom URL channel
youtube.com/channel/{channel_id} # Channel ID format (UC...)
youtube.com/user/{username} # Legacy user format
TikTok
tiktok.com/@{username} # User profile
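The valid patterns above could also be checked positively, as an allow-list that complements the deny-list of generic links. The following is a minimal sketch, not part of the rule's implementation: the names `VALID_SOCIAL_MEDIA_PATTERNS` and `matches_valid_pattern` are hypothetical, and the character classes used for usernames, slugs, and IDs are assumptions rather than the platforms' documented rules.

```python
import re

# Hypothetical allow-list derived from the valid patterns listed above.
# The character classes are assumptions, not the platforms' official rules.
VALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'facebook\.com/pages/[^/]+/\d+/?',   # legacy pages format
        r'facebook\.com/[\w.\-]+/?',          # page name, optional trailing slash
    ],
    'twitter': [r'(twitter|x)\.com/\w+/?'],
    'instagram': [r'instagram\.com/[\w.]+/?'],
    'linkedin': [r'linkedin\.com/(company|school|in)/[\w\-%]+/?'],
    'youtube': [
        r'youtube\.com/@[\w.\-]+/?',          # handle
        r'youtube\.com/(c|user)/[\w.\-]+/?',  # custom URL or legacy user
        r'youtube\.com/channel/UC[\w\-]+/?',  # channel ID
    ],
    'tiktok': [r'tiktok\.com/@[\w.]+/?'],
}

# Same scheme and optional "www." prefix the invalid patterns use.
_PREFIX = r'^https?://(www\.)?'


def matches_valid_pattern(platform: str, url: str) -> bool:
    """Return True if url has the shape of one of the platform's valid patterns."""
    patterns = VALID_SOCIAL_MEDIA_PATTERNS.get(platform.lower().strip(), [])
    return any(
        re.fullmatch(_PREFIX + pattern, url.strip(), re.IGNORECASE)
        for pattern in patterns
    )
```

A shape check like this is not sufficient on its own: facebook.com/facebook has the same shape as a real page URL, so it would still need to be combined with the invalid patterns.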
Python Implementation
import re
from typing import Optional

# Patterns that indicate INVALID (generic) links
INVALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'^https?://(www\.)?facebook\.com/?$',
        r'^https?://(www\.)?facebook\.com/facebook/?$',
        r'^https?://(www\.)?facebook\.com/home\.php',
        r'^https?://(www\.)?facebook\.com/profile\.php$',
        r'^https?://(www\.)?facebook\.com/watch/?$',
    ],
    'twitter': [
        r'^https?://(www\.)?(twitter|x)\.com/?$',
        r'^https?://(www\.)?(twitter|x)\.com/twitter/?$',
        r'^https?://(www\.)?(twitter|x)\.com/home/?$',
    ],
    'instagram': [
        r'^https?://(www\.)?instagram\.com/?$',
        r'^https?://(www\.)?instagram\.com/instagram/?$',
        r'^https?://(www\.)?instagram\.com/explore/?',
    ],
    'linkedin': [
        r'^https?://(www\.)?linkedin\.com/?$',
        r'^https?://(www\.)?linkedin\.com/company/linkedin/?$',
        r'^https?://(www\.)?linkedin\.com/feed/?$',
        r'^https?://(www\.)?linkedin\.com/jobs/?',
    ],
    'youtube': [
        r'^https?://(www\.)?youtube\.com/?$',
        r'^https?://(www\.)?youtube\.com/youtube/?$',
        r'^https?://(www\.)?youtube\.com/feed/?',
        r'^https?://(www\.)?youtube\.com/watch/?$',  # watch without video ID
    ],
    'tiktok': [
        r'^https?://(www\.)?tiktok\.com/?$',
        r'^https?://(www\.)?tiktok\.com/tiktok/?$',
        r'^https?://(www\.)?tiktok\.com/foryou/?$',
    ],
}
def is_valid_social_media_link(platform: str, url: Optional[str]) -> bool:
    """
    Check if a social media URL is valid (points to specific institution page).

    Args:
        platform: Platform name (facebook, twitter, instagram, linkedin, youtube, tiktok)
        url: The URL to validate

    Returns:
        True if the URL is valid (institution-specific), False if generic/invalid
    """
    if not url:
        return False
    url = url.strip()
    if not url:
        return False

    # Normalize platform name
    platform = platform.lower().strip()

    # Get invalid patterns for this platform
    patterns = INVALID_SOCIAL_MEDIA_PATTERNS.get(platform, [])

    # Check against invalid patterns
    for pattern in patterns:
        if re.match(pattern, url, re.IGNORECASE):
            return False  # Matches a generic/invalid pattern
    return True
def validate_social_media_dict(social_media: dict) -> dict:
    """
    Validate a dictionary of social media links, removing invalid ones.

    Args:
        social_media: Dict like {'facebook': 'url', 'twitter': 'url', ...}

    Returns:
        Dict with only valid social media links
    """
    validated = {}
    for platform, url in social_media.items():
        if is_valid_social_media_link(platform, url):
            validated[platform] = url
        else:
            print(f"WARNING: Removing invalid {platform} link: {url}")
    return validated
def extract_platform_from_url(url: str) -> Optional[str]:
    """
    Extract the platform name from a social media URL.

    Args:
        url: A social media URL

    Returns:
        Platform name or None if not recognized
    """
    if not url:
        return None

    # Compare against the hostname only. A plain substring test would
    # misclassify hosts such as netflix.com, which contains "x.com".
    domains = {
        'facebook.com': 'facebook',
        'twitter.com': 'twitter',
        'x.com': 'twitter',
        'instagram.com': 'instagram',
        'linkedin.com': 'linkedin',
        'youtube.com': 'youtube',
        'youtu.be': 'youtube',
        'tiktok.com': 'tiktok',
    }
    match = re.match(r'^(?:https?://)?([^/?#:]+)', url.strip().lower())
    if not match:
        return None
    host = match.group(1)
    for domain, platform in domains.items():
        # Accept the bare domain and any subdomain (www., m., ...)
        if host == domain or host.endswith('.' + domain):
            return platform
    return None
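The invalid patterns above expect an explicit http(s) scheme and at most a www. prefix, so inputs such as facebook.com/facebook (no scheme) or https://m.facebook.com/ would not match any of them. One way to close that gap is to normalize URLs before validation. The sketch below is an illustration only: `normalize_social_media_url` is a hypothetical helper, and treating the `m.` and `mobile.` subdomains as equivalent to the bare domain is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of host prefixes that point at the same page as the bare domain.
_EQUIVALENT_PREFIXES = ('www.', 'm.', 'mobile.')


def normalize_social_media_url(url: str) -> str:
    """Return url with an https scheme, a lowercased bare host, and no fragment."""
    candidate = url.strip()
    if '://' not in candidate:
        # Scheme-less input such as "facebook.com/some-page"
        candidate = 'https://' + candidate
    parts = urlsplit(candidate)
    # .hostname is already lowercased and excludes any port or userinfo
    host = (parts.hostname or '').lower()
    for prefix in _EQUIVALENT_PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Path and query are kept as-is: profile.php?id=... and watch?v=...
    # identify a specific page only through their query string.
    return urlunsplit(('https', host, parts.path, parts.query, ''))
```

The normalized value could then be passed on, for example `is_valid_social_media_link('facebook', normalize_social_media_url(raw_url))`.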
Usage in Enrichment Scripts
from social_media_validation import is_valid_social_media_link, validate_social_media_dict


def enrich_institution_social_media(ghcid: str, google_data: dict) -> dict:
    """Enrich institution with validated social media links."""
    social_media = {}

    # Extract social media from Google Maps data
    raw_social = {
        'facebook': google_data.get('facebook'),
        'twitter': google_data.get('twitter'),
        'instagram': google_data.get('instagram'),
        'youtube': google_data.get('youtube'),
    }

    # Validate each link
    for platform, url in raw_social.items():
        if url and is_valid_social_media_link(platform, url):
            social_media[platform] = url
        elif url:
            print(f"Skipping invalid {platform} for {ghcid}: {url}")
    return social_media
Real-World Example
Problem Detected
# API returned this for Agrarisch Museum Westerhem
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# {"facebook": "https://www.facebook.com/facebook"}
This is INVALID because:
- facebook.com/facebook is Facebook's own corporate page
- It provides zero information about Agrarisch Museum Westerhem
- It was likely a default/fallback value from some enrichment source
Solution
- Validate before storing: is_valid_social_media_link('facebook', url) returns False
- Do not write invalid links to custodian YAML
- Clean up existing invalid data from database
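The cleanup step could be sketched as a pass over records that are already stored. This is a minimal sketch under stated assumptions: the record layout (a dict with a `social_media` mapping), the function name `remove_generic_links`, and the injected `is_valid` callable are all hypothetical, since the custodian YAML and database schemas are not shown in this document.

```python
from typing import Callable, Dict, List, Tuple

# Any callable taking (platform, url) and returning True for a valid link.
Validator = Callable[[str, str], bool]


def remove_generic_links(
    record: Dict, is_valid: Validator
) -> Tuple[Dict, List[Tuple[str, str]]]:
    """Return a cleaned copy of record plus the (platform, url) pairs removed."""
    social_media = record.get('social_media') or {}
    kept, removed = {}, []
    for platform, url in social_media.items():
        if url and is_valid(platform, url):
            kept[platform] = url
        else:
            removed.append((platform, url))
    # Build a new dict so the caller's record is left unmodified.
    cleaned = {**record, 'social_media': kept}
    return cleaned, removed
```

In practice `is_valid` would be `is_valid_social_media_link`, and the returned list of removed pairs can be logged or reviewed before anything is written back.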
Testing
from social_media_validation import is_valid_social_media_link


def test_invalid_patterns():
    """Test that generic links are rejected and institution-specific links are accepted."""
    # These should all return False (invalid)
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/')
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/facebook')
    assert not is_valid_social_media_link('twitter', 'https://twitter.com/')
    assert not is_valid_social_media_link('twitter', 'https://x.com/twitter')
    assert not is_valid_social_media_link('instagram', 'https://instagram.com/instagram')
    assert not is_valid_social_media_link('youtube', 'https://youtube.com/')

    # These should all return True (valid)
    assert is_valid_social_media_link('facebook', 'https://www.facebook.com/rijksmuseum/')
    assert is_valid_social_media_link('twitter', 'https://twitter.com/rijksmuseum')
    assert is_valid_social_media_link('instagram', 'https://instagram.com/rijksmuseum')
    assert is_valid_social_media_link('youtube', 'https://youtube.com/@Rijksmuseum')
    assert is_valid_social_media_link('linkedin', 'https://linkedin.com/company/rijksmuseum')
Related Rules
- Rule 22: Custodian YAML Files Are the Single Source of Truth
- Rule 5: Data enrichment is ADDITIVE ONLY
- Rule 21: Data Fabrication is Strictly Prohibited
See Also
- .opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md: Data source of truth rules
- AGENTS.md: Complete agent instructions