# Social Media Link Validation Rules

**Rule**: AGENTS.md Rule 23
**Status**: ACTIVE
**Created**: 2025-12-12

## Summary

Social media links MUST point to the specific institution's page, NOT to a generic platform homepage or the platform's own account. Generic links carry no information about the institution, so every link must be validated before storage and generic ones rejected.

## Invalid Patterns (Generic Links)

These patterns indicate generic or invalid social media links that MUST be rejected:

### Facebook

| Pattern | Why Invalid |
|---------|-------------|
| `facebook.com/` | Generic homepage |
| `facebook.com/facebook` | Facebook's own page |
| `facebook.com/home.php` | User homepage redirect |
| `facebook.com/profile.php` | Generic profile page |
| `facebook.com/watch` | Generic watch page |

### Twitter/X

| Pattern | Why Invalid |
|---------|-------------|
| `twitter.com/` | Generic homepage |
| `twitter.com/twitter` | Twitter's own account |
| `twitter.com/home` | User home timeline |
| `x.com/` | Generic X homepage |
| `x.com/twitter` | Twitter's own account on X |

### Instagram

| Pattern | Why Invalid |
|---------|-------------|
| `instagram.com/` | Generic homepage |
| `instagram.com/instagram` | Instagram's own account |
| `instagram.com/explore` | Explore page |

### LinkedIn

| Pattern | Why Invalid |
|---------|-------------|
| `linkedin.com/` | Generic homepage |
| `linkedin.com/company/linkedin` | LinkedIn's own page |
| `linkedin.com/feed` | User feed |
| `linkedin.com/jobs` | Jobs page |

### YouTube

| Pattern | Why Invalid |
|---------|-------------|
| `youtube.com/` | Generic homepage |
| `youtube.com/youtube` | YouTube's own channel |
| `youtube.com/feed` | User feed |
| `youtube.com/watch` | Generic watch page (no video ID) |

### TikTok

| Pattern | Why Invalid |
|---------|-------------|
| `tiktok.com/` | Generic homepage |
| `tiktok.com/tiktok` | TikTok's own account |
| `tiktok.com/foryou` | For You page |

## Valid Patterns

### Facebook
```
facebook.com/{page_name}/          # Page with name
facebook.com/pages/{name}/{id}     # Legacy pages format
facebook.com/{page_name}           # Page without trailing slash
```

### Twitter/X
```
twitter.com/{username}             # User profile
x.com/{username}                   # User profile on X
```

### Instagram
```
instagram.com/{username}/          # Profile with trailing slash
instagram.com/{username}           # Profile without trailing slash
```

### LinkedIn
```
linkedin.com/company/{slug}/       # Company page
linkedin.com/school/{slug}/        # School/university page
linkedin.com/in/{username}/        # Personal profile (less common for institutions)
```

### YouTube
```
youtube.com/@{handle}              # Channel with handle
youtube.com/c/{custom_url}         # Custom URL channel
youtube.com/channel/{channel_id}   # Channel ID format (UC...)
youtube.com/user/{username}        # Legacy user format
```

### TikTok
```
tiktok.com/@{username}             # User profile
```

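
The valid shapes listed above could also be written as an allow-list of regular expressions. The following is a minimal sketch, not part of the rule's implementation: `VALID_SOCIAL_MEDIA_PATTERNS` and `has_valid_shape` are hypothetical names, and the character classes chosen for `{page_name}`, `{slug}` and similar placeholders are assumptions.

```python
import re

# Hypothetical allow-list mirroring the valid URL shapes listed above.
# Each placeholder is assumed to be a single path segment without '?' or '#'.
_PREFIX = r'^https?://(www\.)?'
_SEGMENT = r'[^/?#]+'

VALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        _PREFIX + r'facebook\.com/pages/' + _SEGMENT + '/' + _SEGMENT + r'/?$',
        _PREFIX + r'facebook\.com/' + _SEGMENT + r'/?$',
    ],
    'twitter': [_PREFIX + r'(twitter|x)\.com/' + _SEGMENT + r'/?$'],
    'instagram': [_PREFIX + r'instagram\.com/' + _SEGMENT + r'/?$'],
    'linkedin': [_PREFIX + r'linkedin\.com/(company|school|in)/' + _SEGMENT + r'/?$'],
    'youtube': [
        _PREFIX + r'youtube\.com/@' + _SEGMENT + r'/?$',
        _PREFIX + r'youtube\.com/(c|user)/' + _SEGMENT + r'/?$',
        _PREFIX + r'youtube\.com/channel/UC[\w-]+/?$',
    ],
    'tiktok': [_PREFIX + r'tiktok\.com/@' + _SEGMENT + r'/?$'],
}


def has_valid_shape(platform: str, url: str) -> bool:
    """Return True if the URL has one of the listed valid shapes for the platform."""
    patterns = VALID_SOCIAL_MEDIA_PATTERNS.get(platform.lower().strip(), [])
    return any(re.match(p, url.strip(), re.IGNORECASE) for p in patterns)
```

A shape check like this complements the invalid patterns rather than replacing them: `https://www.facebook.com/facebook` and `https://twitter.com/home` both have a valid shape, so the generic-link patterns still have to be rejected separately.
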
## Python Implementation

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Patterns that indicate INVALID (generic) links
INVALID_SOCIAL_MEDIA_PATTERNS = {
    'facebook': [
        r'^https?://(www\.)?facebook\.com/?$',
        r'^https?://(www\.)?facebook\.com/facebook/?$',
        r'^https?://(www\.)?facebook\.com/home\.php',
        r'^https?://(www\.)?facebook\.com/profile\.php$',
        r'^https?://(www\.)?facebook\.com/watch/?$',
    ],
    'twitter': [
        r'^https?://(www\.)?(twitter|x)\.com/?$',
        r'^https?://(www\.)?(twitter|x)\.com/twitter/?$',
        r'^https?://(www\.)?(twitter|x)\.com/home/?$',
    ],
    'instagram': [
        r'^https?://(www\.)?instagram\.com/?$',
        r'^https?://(www\.)?instagram\.com/instagram/?$',
        r'^https?://(www\.)?instagram\.com/explore/?',
    ],
    'linkedin': [
        r'^https?://(www\.)?linkedin\.com/?$',
        r'^https?://(www\.)?linkedin\.com/company/linkedin/?$',
        r'^https?://(www\.)?linkedin\.com/feed/?$',
        r'^https?://(www\.)?linkedin\.com/jobs/?',
    ],
    'youtube': [
        r'^https?://(www\.)?youtube\.com/?$',
        r'^https?://(www\.)?youtube\.com/youtube/?$',
        r'^https?://(www\.)?youtube\.com/feed/?',
        r'^https?://(www\.)?youtube\.com/watch/?$',  # watch without video ID
    ],
    'tiktok': [
        r'^https?://(www\.)?tiktok\.com/?$',
        r'^https?://(www\.)?tiktok\.com/tiktok/?$',
        r'^https?://(www\.)?tiktok\.com/foryou/?$',
    ],
}

# Registered domains mapped to platform names (used by extract_platform_from_url)
PLATFORM_DOMAINS = {
    'facebook.com': 'facebook',
    'twitter.com': 'twitter',
    'x.com': 'twitter',
    'instagram.com': 'instagram',
    'linkedin.com': 'linkedin',
    'youtube.com': 'youtube',
    'youtu.be': 'youtube',
    'tiktok.com': 'tiktok',
}


def is_valid_social_media_link(platform: str, url: Optional[str]) -> bool:
    """
    Check if a social media URL is valid (points to specific institution page).

    Args:
        platform: Platform name (facebook, twitter, instagram, linkedin, youtube, tiktok)
        url: The URL to validate

    Returns:
        True if the URL is valid (institution-specific), False if generic/invalid
    """
    if not url:
        return False

    url = url.strip()
    if not url:
        return False

    # Normalize platform name
    platform = platform.lower().strip()

    # Get invalid patterns for this platform
    patterns = INVALID_SOCIAL_MEDIA_PATTERNS.get(platform, [])

    # Check against invalid patterns
    for pattern in patterns:
        if re.match(pattern, url, re.IGNORECASE):
            return False  # Matches a generic/invalid pattern

    return True


def validate_social_media_dict(social_media: dict) -> dict:
    """
    Validate a dictionary of social media links, removing invalid ones.

    Args:
        social_media: Dict like {'facebook': 'url', 'twitter': 'url', ...}

    Returns:
        Dict with only valid social media links
    """
    validated = {}

    for platform, url in social_media.items():
        if is_valid_social_media_link(platform, url):
            validated[platform] = url
        else:
            print(f"WARNING: Removing invalid {platform} link: {url}")

    return validated


def extract_platform_from_url(url: str) -> Optional[str]:
    """
    Extract the platform name from a social media URL.

    Args:
        url: A social media URL

    Returns:
        Platform name or None if not recognized
    """
    if not url:
        return None

    url = url.strip()
    # urlparse only fills in the hostname when a scheme or leading '//' is present
    if '://' not in url:
        url = '//' + url
    host = (urlparse(url).hostname or '').lower()

    # Compare the hostname, not the whole URL: a substring test such as
    # 'x.com' in url would also match unrelated hosts like netflix.com.
    for domain, platform in PLATFORM_DOMAINS.items():
        if host == domain or host.endswith('.' + domain):
            return platform

    return None
```

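
The invalid patterns above anchor on `^https?://(www\.)?`, so a scheme-less or mobile-subdomain variant such as `facebook.com/facebook` or `https://m.facebook.com/` matches no pattern and is reported as valid. One way to close that gap is to canonicalize the URL before matching. The sketch below is an assumption, not part of the rule: `normalize_social_media_url` is a hypothetical helper, the list of subdomain prefixes is illustrative, and defaulting to `https` for scheme-less input is a design choice.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_social_media_url(url: str) -> str:
    """Hypothetical pre-processing step: canonicalize a URL before pattern matching."""
    url = url.strip()
    if '://' not in url:
        url = 'https://' + url  # assumption: treat scheme-less input as HTTPS

    parts = urlsplit(url)
    host = (parts.hostname or '').lower()  # port and credentials are dropped

    # Fold common subdomain variants onto the bare domain the patterns expect.
    for prefix in ('www.', 'm.', 'mobile.'):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break

    # Keep path and query (e.g. a video ID); drop only the fragment.
    return urlunsplit((parts.scheme.lower(), host, parts.path, parts.query, ''))


print(normalize_social_media_url('facebook.com/facebook'))    # https://facebook.com/facebook
print(normalize_social_media_url('https://m.facebook.com/'))  # https://facebook.com/
```

Calling `is_valid_social_media_link(platform, normalize_social_media_url(url))` would then reject both variants. The query string is deliberately preserved, so a generic URL carrying tracking parameters (for example `https://www.facebook.com/?ref=...`) still needs separate handling.
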
## Usage in Enrichment Scripts

```python
from social_media_validation import is_valid_social_media_link, validate_social_media_dict


def enrich_institution_social_media(ghcid: str, google_data: dict) -> dict:
    """Enrich institution with validated social media links."""

    social_media = {}

    # Extract social media from Google Maps data
    raw_social = {
        'facebook': google_data.get('facebook'),
        'twitter': google_data.get('twitter'),
        'instagram': google_data.get('instagram'),
        'youtube': google_data.get('youtube'),
    }

    # Validate each link
    for platform, url in raw_social.items():
        if url and is_valid_social_media_link(platform, url):
            social_media[platform] = url
        elif url:
            print(f"Skipping invalid {platform} for {ghcid}: {url}")

    return social_media
```

## Real-World Example

### Problem Detected

```bash
# API returned this for Agrarisch Museum Westerhem
curl -s "http://localhost:8002/institution/NL-NH-MID-M-AMW" | jq '.social_media'
# {"facebook": "https://www.facebook.com/facebook"}
```

This is **INVALID** because:
- `facebook.com/facebook` is Facebook's own corporate page
- It provides zero information about Agrarisch Museum Westerhem
- It was likely a default/fallback value from some enrichment source

### Solution

1. Validate before storing: `is_valid_social_media_link('facebook', url)` returns `False` for this URL
2. Do not write invalid links to custodian YAML
3. Clean up existing invalid data in the database

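
The cleanup in step 3 can be sketched as a pure function over one record. This is only an illustration under stated assumptions: the record shape (a dict with a `social_media` mapping, as in the API response above, plus a `ghcid` key), `GENERIC_LINK_RE` and `strip_generic_links` are hypothetical, and a simplified single-regex deny-list stands in for `is_valid_social_media_link` to keep the example self-contained.

```python
import re
from typing import Tuple

# Simplified stand-in for the per-platform invalid patterns:
# matches only a platform's root URL or its own account.
GENERIC_LINK_RE = re.compile(
    r'^https?://(www\.)?'
    r'(facebook\.com(/facebook)?|(twitter|x)\.com(/twitter)?|instagram\.com(/instagram)?)/?$',
    re.IGNORECASE,
)


def strip_generic_links(record: dict) -> Tuple[dict, dict]:
    """Return (cleaned_record, removed_links) without mutating the input record."""
    social = record.get('social_media') or {}
    kept = {p: u for p, u in social.items() if u and not GENERIC_LINK_RE.match(u)}
    removed = {p: u for p, u in social.items() if p not in kept}
    return {**record, 'social_media': kept}, removed


record = {
    'ghcid': 'NL-NH-MID-M-AMW',  # hypothetical record shape
    'social_media': {'facebook': 'https://www.facebook.com/facebook'},
}
cleaned, removed = strip_generic_links(record)
print(cleaned['social_media'])  # {}
print(removed)                  # {'facebook': 'https://www.facebook.com/facebook'}
```

Returning the removed links alongside the cleaned record makes it possible to log or review what was dropped before anything is written back.
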
## Testing

```python
from social_media_validation import is_valid_social_media_link


def test_invalid_patterns():
    """Test that generic links are rejected and institution-specific links are accepted."""

    # These should all return False (invalid)
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/')
    assert not is_valid_social_media_link('facebook', 'https://www.facebook.com/facebook')
    assert not is_valid_social_media_link('twitter', 'https://twitter.com/')
    assert not is_valid_social_media_link('twitter', 'https://x.com/twitter')
    assert not is_valid_social_media_link('instagram', 'https://instagram.com/instagram')
    assert not is_valid_social_media_link('youtube', 'https://youtube.com/')

    # These should all return True (valid)
    assert is_valid_social_media_link('facebook', 'https://www.facebook.com/rijksmuseum/')
    assert is_valid_social_media_link('twitter', 'https://twitter.com/rijksmuseum')
    assert is_valid_social_media_link('instagram', 'https://instagram.com/rijksmuseum')
    assert is_valid_social_media_link('youtube', 'https://youtube.com/@Rijksmuseum')
    assert is_valid_social_media_link('linkedin', 'https://linkedin.com/company/rijksmuseum')
```

## Related Rules

- **Rule 22**: Custodian YAML Files Are the Single Source of Truth
- **Rule 5**: Data enrichment is ADDITIVE ONLY
- **Rule 21**: Data Fabrication is Strictly Prohibited

## See Also

- `.opencode/CUSTODIAN_DATA_SOURCE_OF_TRUTH.md` - Data source of truth rules
- `AGENTS.md` - Complete agent instructions