glam/.opencode/SOCIAL_MEDIA_LINK_VALIDATION.md
2025-12-14 17:09:55 +01:00

# Social Media Link Validation Rules
**Rule**: AGENTS.md Rule 23
**Status**: ACTIVE
**Created**: 2025-12-12
## Summary
Social media links MUST point to the specific institution's page, NOT to a generic platform homepage or the platform's own account. Generic links carry no information about the institution, so every link must be validated before storage and rejected if it matches a generic pattern.
## Invalid Patterns (Generic Links)
These patterns indicate generic or invalid social media links that MUST be rejected:
### Facebook
| Pattern | Why Invalid |
|---------|-------------|
| `facebook.com/` | Generic homepage |
| `facebook.com/facebook` | Facebook's own page |
| `facebook.com/home.php` | User homepage redirect |
| `facebook.com/profile.php` | Generic profile page (no `id` parameter) |
| `facebook.com/watch` | Generic watch page |
### Twitter/X
| Pattern | Why Invalid |
|---------|-------------|
| `twitter.com/` | Generic homepage |
| `twitter.com/twitter` | Twitter's own account |
| `twitter.com/home` | User home timeline |
| `x.com/` | Generic X homepage |
| `x.com/twitter` | Twitter's own account on X |
### Instagram
| Pattern | Why Invalid |
|---------|-------------|
| `instagram.com/` | Generic homepage |
| `instagram.com/instagram` | Instagram's own account |
| `instagram.com/explore` | Explore page |
### LinkedIn
| Pattern | Why Invalid |
|---------|-------------|
| `linkedin.com/` | Generic homepage |
| `linkedin.com/company/linkedin` | LinkedIn's own page |
| `linkedin.com/feed` | User feed |
| `linkedin.com/jobs` | Jobs page |
### YouTube
| Pattern | Why Invalid |
|---------|-------------|
| `youtube.com/` | Generic homepage |
| `youtube.com/youtube` | YouTube's own channel |
| `youtube.com/feed` | User feed |
| `youtube.com/watch` | Generic watch page (no video ID) |
### TikTok
| Pattern | Why Invalid |
|---------|-------------|
| `tiktok.com/` | Generic homepage |
| `tiktok.com/tiktok` | TikTok's own account |
| `tiktok.com/foryou` | For You page |
## Valid Patterns
### Facebook
```
facebook.com/{page_name}/ # Page with name
facebook.com/pages/{name}/{id} # Legacy pages format
facebook.com/{page_name} # Page without trailing slash
```
### Twitter/X
```
twitter.com/{username} # User profile
x.com/{username} # User profile on X
```
### Instagram
```
instagram.com/{username}/ # Profile with trailing slash
instagram.com/{username} # Profile without trailing slash
```
### LinkedIn
```
linkedin.com/company/{slug}/ # Company page
linkedin.com/school/{slug}/ # School/university page
linkedin.com/in/{username}/ # Personal profile (less common for institutions)
```
### YouTube
```
youtube.com/@{handle} # Channel with handle
youtube.com/c/{custom_url} # Custom URL channel
youtube.com/channel/{channel_id} # Channel ID format (UC...)
youtube.com/user/{username} # Legacy user format
```
### TikTok
```
tiktok.com/@{username} # User profile
```
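The valid formats above can also be checked positively, as a complement to rejecting generic links. The following is a minimal sketch under stated assumptions, not part of the rule's implementation: the names `VALID_SOCIAL_MEDIA_PATTERNS` and `matches_valid_pattern` are hypothetical, and the character classes used for usernames, slugs, and IDs are guesses rather than the platforms' documented limits.

```python
import re

# Hypothetical allow-list mirroring the valid formats listed above.
# The character classes are assumptions, not the platforms' official rules.
_PREFIX = r'^https?://(www\.)?'
VALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        _PREFIX + r'facebook\.com/pages/[^/?#]+/\d+/?$',  # legacy pages format
        _PREFIX + r'facebook\.com/[^/?#]+/?$',            # page name
    ],
    'twitter': [_PREFIX + r'(twitter|x)\.com/[^/?#]+/?$'],
    'instagram': [_PREFIX + r'instagram\.com/[^/?#]+/?$'],
    'linkedin': [_PREFIX + r'linkedin\.com/(company|school|in)/[^/?#]+/?$'],
    'youtube': [
        _PREFIX + r'youtube\.com/@[^/?#]+/?$',
        _PREFIX + r'youtube\.com/(c|channel|user)/[^/?#]+/?$',
    ],
    'tiktok': [_PREFIX + r'tiktok\.com/@[^/?#]+/?$'],
}


def matches_valid_pattern(platform: str, url: str) -> bool:
    """Return True if the URL has one of the listed shapes for the platform."""
    patterns = VALID_SOCIAL_MEDIA_PATTERNS.get(platform.lower().strip(), [])
    return any(re.match(p, url.strip(), re.IGNORECASE) for p in patterns)
```

A shape check alone is not sufficient: `https://www.facebook.com/facebook` has a valid shape but is still a generic link, so a URL should pass both this check and the generic-link check before it is stored.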
## Python Implementation
```python
import re
from typing import Optional

# Patterns that indicate INVALID (generic) links
INVALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'^https?://(www\.)?facebook\.com/?$',
        r'^https?://(www\.)?facebook\.com/facebook/?$',
        r'^https?://(www\.)?facebook\.com/home\.php',
        r'^https?://(www\.)?facebook\.com/profile\.php$',
        r'^https?://(www\.)?facebook\.com/watch/?$',
    ],
    'twitter': [
        r'^https?://(www\.)?(twitter|x)\.com/?$',
        r'^https?://(www\.)?(twitter|x)\.com/twitter/?$',
        r'^https?://(www\.)?(twitter|x)\.com/home/?$',
    ],
    'instagram': [
        r'^https?://(www\.)?instagram\.com/?$',
        r'^https?://(www\.)?instagram\.com/instagram/?$',
        r'^https?://(www\.)?instagram\.com/explore/?',
    ],
    'linkedin': [
        r'^https?://(www\.)?linkedin\.com/?$',
        r'^https?://(www\.)?linkedin\.com/company/linkedin/?$',
        r'^https?://(www\.)?linkedin\.com/feed/?$',
        r'^https?://(www\.)?linkedin\.com/jobs/?',
    ],
    'youtube': [
        r'^https?://(www\.)?youtube\.com/?$',
        r'^https?://(www\.)?youtube\.com/youtube/?$',
        r'^https?://(www\.)?youtube\.com/feed/?',
        r'^https?://(www\.)?youtube\.com/watch/?$',  # watch without video ID
    ],
    'tiktok': [
        r'^https?://(www\.)?tiktok\.com/?$',
        r'^https?://(www\.)?tiktok\.com/tiktok/?$',
        r'^https?://(www\.)?tiktok\.com/foryou/?$',
    ],
}


def is_valid_social_media_link(platform: str, url: Optional[str]) -> bool:
    """
    Check if a social media URL is valid (points to specific institution page).

    Args:
        platform: Platform name (facebook, twitter, instagram, linkedin, youtube, tiktok)
        url: The URL to validate

    Returns:
        True if the URL is valid (institution-specific), False if generic/invalid
    """
    if not url:
        return False
    url = url.strip()
    if not url:
        return False

    # Normalize platform name
    platform = platform.lower().strip()

    # Get invalid patterns for this platform
    patterns = INVALID_SOCIAL_MEDIA_PATTERNS.get(platform, [])

    # Check against invalid patterns
    for pattern in patterns:
        if re.match(pattern, url, re.IGNORECASE):
            return False  # Matches a generic/invalid pattern
    return True


def validate_social_media_dict(social_media: dict) -> dict:
    """
    Validate a dictionary of social media links, removing invalid ones.

    Args:
        social_media: Dict like {'facebook': 'url', 'twitter': 'url', ...}

    Returns:
        Dict with only valid social media links
    """
    validated = {}
    for platform, url in social_media.items():
        if is_valid_social_media_link(platform, url):
            validated[platform] = url
        else:
            print(f"WARNING: Removing invalid {platform} link: {url}")
    return validated


def extract_platform_from_url(url: str) -> Optional[str]:
    """
    Extract the platform name from a social media URL.

    Args:
        url: A social media URL

    Returns:
        Platform name or None if not recognized
    """
    if not url:
        return None

    # Compare against the hostname only. A plain substring test would
    # misclassify hosts such as 'dropbox.com', which contains 'x.com'.
    match = re.match(r'^(?:https?://)?([^/?#:]+)', url.strip().lower())
    if not match:
        return None
    host = match.group(1)

    domains = {
        'facebook.com': 'facebook',
        'twitter.com': 'twitter',
        'x.com': 'twitter',
        'instagram.com': 'instagram',
        'linkedin.com': 'linkedin',
        'youtube.com': 'youtube',
        'youtu.be': 'youtube',
        'tiktok.com': 'tiktok',
    }
    for domain, platform in domains.items():
        # Accept the bare domain and any subdomain of it (www., m., ...)
        if host == domain or host.endswith('.' + domain):
            return platform
    return None
```
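The invalid patterns above expect an `http://` or `https://` scheme and, at most, a `www.` subdomain, so inputs such as `m.facebook.com/` or a scheme-less `facebook.com/facebook` would not match any of them. One possible way to handle this is to normalize URLs before validating. The sketch below is an assumption layered on top of this rule, not part of it: `normalize_social_media_url` is a hypothetical helper, and the list of mobile prefixes it strips is illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_social_media_url(url: str) -> str:
    """Return the URL with a scheme, a lowercased host, and no fragment.

    Hypothetical helper: the path and query string are kept unchanged,
    because they can identify a specific page (e.g. profile.php?id=...).
    """
    url = url.strip()
    if not url:
        return url
    # Add a scheme to bare or scheme-relative inputs such as '//x.com/name'
    if '://' not in url:
        url = 'https://' + url.lstrip('/')
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Strip common mobile subdomains (an illustrative, incomplete list)
    for prefix in ('m.', 'mobile.'):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    return urlunsplit((parts.scheme.lower(), host, parts.path, parts.query, ''))


print(normalize_social_media_url('m.facebook.com/'))         # https://facebook.com/
print(normalize_social_media_url('https://x.com/home#top'))  # https://x.com/home
```

The normalized string can then be passed to `is_valid_social_media_link`, so that `m.facebook.com/` is rejected in the same way as `https://facebook.com/`.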
## Usage in Enrichment Scripts
```python
from social_media_validation import is_valid_social_media_link, validate_social_media_dict


def enrich_institution_social_media(ghcid: str, google_data: dict) -> dict:
    """Enrich institution with validated social media links."""
    social_media = {}

    # Extract social media from Google Maps data
    raw_social = {
        'facebook': google_data.get('facebook'),
        'twitter': google_data.get('twitter'),
        'instagram': google_data.get('instagram'),
        'youtube': google_data.get('youtube'),
    }

    # Validate each link
    for platform, url in raw_social.items():
        if url and is_valid_social_media_link(platform, url):
            social_media[platform] = url
        elif url:
            print(f"Skipping invalid {platform} for {ghcid}: {url}")

    return social_media
```
## Real-World Example
### Problem Detected
```bash
# API returned this for Agrarisch Museum Westerhem
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# {"facebook": "https://www.facebook.com/facebook"}
```
This is **INVALID** because:
- `facebook.com/facebook` is Facebook's own corporate page
- It provides zero information about Agrarisch Museum Westerhem
- It was likely a default/fallback value from some enrichment source
### Solution
1. Validate before storing: `is_valid_social_media_link('facebook', url)` returns `False`
2. Do not write invalid links to custodian YAML
3. Clean up existing invalid data from database
## Testing
```python
from social_media_validation import is_valid_social_media_link


def test_invalid_patterns():
    """Test that generic links are correctly identified as invalid."""
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/')
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/facebook')
    assert not is_valid_social_media_link('twitter', 'https://twitter.com/')
    assert not is_valid_social_media_link('twitter', 'https://x.com/twitter')
    assert not is_valid_social_media_link('instagram', 'https://instagram.com/instagram')
    assert not is_valid_social_media_link('youtube', 'https://youtube.com/')


def test_valid_patterns():
    """Test that institution-specific links are accepted as valid."""
    assert is_valid_social_media_link('facebook', 'https://www.facebook.com/rijksmuseum/')
    assert is_valid_social_media_link('twitter', 'https://twitter.com/rijksmuseum')
    assert is_valid_social_media_link('instagram', 'https://instagram.com/rijksmuseum')
    assert is_valid_social_media_link('youtube', 'https://youtube.com/@Rijksmuseum')
    assert is_valid_social_media_link('linkedin', 'https://linkedin.com/company/rijksmuseum')
```
## Related Rules
- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 5**: Data enrichment is ADDITIVE ONLY
- **Rule 21**: Data Fabrication is Strictly Prohibited
## See Also
- `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` - Data source of truth rules
- `AGENTS.md` - Complete agent instructions