# YouTube Enrichment for Heritage Custodians
This document explains how to enrich heritage custodian entries with YouTube channel and video data.
## Prerequisites
### 1. Get a YouTube API Key
1. **Go to Google Cloud Console**
- Visit: https://console.cloud.google.com/
2. **Create or Select a Project**
- Click on the project dropdown at the top
- Click "New Project" or select an existing one
- Name it something like "GLAM YouTube Enrichment"
3. **Enable YouTube Data API v3**
- Navigate to "APIs & Services" → "Library"
- Search for "YouTube Data API v3"
- Click on it and press **Enable**
4. **Create API Credentials**
- Go to "APIs & Services" → "Credentials"
- Click "Create Credentials" → "API Key"
- Copy the generated API key
5. **Restrict the API Key (Recommended)**
- Click on your new API key to edit it
- Under "API restrictions", select "Restrict key"
- Select only "YouTube Data API v3"
- Click Save
### 2. Set Environment Variable
```bash
export YOUTUBE_API_KEY='your-api-key-here'
```
Or add to your `.env` file:
```
YOUTUBE_API_KEY=your-api-key-here
```
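The enrichment script reads this variable at startup. As a minimal sketch (the function name here is illustrative, not the script's actual API), failing early when the key is missing avoids burning quota on half-configured runs:

```python
import os

def load_api_key() -> str:
    """Read the YouTube API key from the environment, failing early if absent."""
    key = os.environ.get("YOUTUBE_API_KEY")
    if not key:
        raise RuntimeError(
            "YOUTUBE_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```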
### 3. Install Dependencies
```bash
pip install httpx pyyaml
# For transcript extraction (optional but recommended)
brew install yt-dlp # macOS
# or
pip install yt-dlp
```
## Usage
### Basic Usage
```bash
# Process all entries with YouTube URLs
python scripts/enrich_youtube.py
# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run
# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10
# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
```
### Example Output
```
Processing: 0146_Q1663974.yaml
Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
Fetching channel info for UCxxxxx...
Fetching 10 recent videos...
Fetching comments for top videos...
Fetching transcripts for videos with captions...
Status: SUCCESS
Channel: Theologische Universiteit Apeldoorn
Subscribers: 1,234
Videos fetched: 10
```
## Data Collected
### Channel Information
- Channel ID and URL
- Channel title and description
- Custom URL (e.g., @channelname)
- Subscriber count
- Total video count
- Total view count
- Channel creation date
- Country
- Thumbnail and banner images
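All of the channel fields above come from a single `channels.list` call. A hedged sketch (function names are illustrative; the real script may structure this differently) using `httpx` from the dependency list:

```python
YOUTUBE_API = "https://www.googleapis.com/youtube/v3"

def channel_params(channel_id: str, api_key: str) -> dict:
    """Parameters for the `channels` endpoint: `snippet` carries title,
    description, custom URL, country, creation date and thumbnails;
    `statistics` carries subscriber/video/view counts; `brandingSettings`
    carries the banner image."""
    return {
        "part": "snippet,statistics,brandingSettings",
        "id": channel_id,
        "key": api_key,
    }

def fetch_channel(channel_id: str, api_key: str) -> dict:
    # httpx is a project dependency; imported here so the helper above stays stdlib-only
    import httpx
    resp = httpx.get(f"{YOUTUBE_API}/channels",
                     params=channel_params(channel_id, api_key))
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        raise LookupError(f"channel not found: {channel_id}")
    return items[0]
```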
### Video Information (per video)
- Video ID and URL
- Title and description
- Published date
- Duration
- View count
- Like count
- Comment count
- Tags
- Thumbnail
- Caption availability
- Default language
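Per-video fields map onto one `videos.list` call: `snippet` (title, description, tags, language), `contentDetails` (duration, caption availability) and `statistics` (view/like/comment counts). A sketch of the request shape, assuming the recent-video IDs were taken from the channel's uploads playlist (helper names are illustrative):

```python
def uploads_playlist_id(channel_item: dict) -> str:
    """The channel's uploads playlist id lives under
    contentDetails.relatedPlaylists in a `channels.list` response."""
    return channel_item["contentDetails"]["relatedPlaylists"]["uploads"]

def video_params(video_ids: list[str], api_key: str) -> dict:
    """One `videos.list` call covers several ids at once via a
    comma-separated id parameter."""
    return {
        "part": "snippet,contentDetails,statistics",
        "id": ",".join(video_ids),
        "key": api_key,
    }
```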
### Comments (per video)
- Comment ID
- Author name and channel URL
- Comment text
- Like count
- Reply count
- Published date
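These comment fields come from the `commentThreads.list` endpoint, where each item wraps a top-level comment plus its reply count. A sketch of the request parameters and the mapping onto the fields above (function names are illustrative):

```python
def comment_params(video_id: str, api_key: str, max_results: int = 100) -> dict:
    """`commentThreads.list` returns up to 100 top-level comments per page."""
    return {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,
        "order": "relevance",
        "key": api_key,
    }

def parse_comment(item: dict) -> dict:
    """Flatten one commentThreads item into the fields listed above."""
    top = item["snippet"]["topLevelComment"]
    s = top["snippet"]
    return {
        "comment_id": top["id"],
        "author": s["authorDisplayName"],
        "author_channel_url": s.get("authorChannelUrl", ""),
        "text": s["textDisplay"],
        "like_count": s["likeCount"],
        "reply_count": item["snippet"]["totalReplyCount"],
        "published_at": s["publishedAt"],
    }
```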
### Transcripts (when available)
- Full transcript text
- Language
- Transcript type (manual or auto-generated)
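Transcripts are fetched with `yt-dlp` rather than the Data API. A minimal sketch of the invocation (the real script may pass different languages or formats): `--skip-download` fetches only subtitles, and combining `--write-subs` with `--write-auto-subs` prefers manual captions while falling back to auto-generated ones.

```python
import subprocess

def transcript_cmd(video_url: str, lang: str = "nl") -> list[str]:
    """yt-dlp invocation that downloads captions only, no media."""
    return [
        "yt-dlp",
        "--skip-download",      # no video/audio download
        "--write-subs",         # manual captions when present
        "--write-auto-subs",    # fall back to auto-generated captions
        "--sub-langs", lang,
        "--sub-format", "vtt",
        video_url,
    ]

def fetch_transcript(video_url: str, lang: str = "nl") -> None:
    subprocess.run(transcript_cmd(video_url, lang), check=True)
```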
## Provenance Tracking
All extracted data includes full provenance:
```yaml
youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
```
## API Quota
The YouTube Data API has a default daily quota of **10,000 units**:
| Operation | Cost |
|-----------|------|
| Channel info | 1 unit |
| Video list | 1 unit |
| Video details | 1 unit per video |
| Comments | 1 unit per 100 comments |
| Search | 100 units |
**Estimated usage per custodian**: 15-50 units (depending on videos/comments)
For 100 custodians: ~1,500-5,000 units (well within daily quota)
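The per-custodian estimate above follows directly from the cost table. A quick sanity-check of the arithmetic (a sketch, not the script's actual accounting):

```python
def estimate_units(n_videos: int = 10, n_comment_pages: int = 1) -> int:
    """Per-custodian quota estimate from the cost table:
    1 unit channel info + 1 unit video list + 1 unit per video detail
    + 1 unit per page of up to 100 comments per video."""
    return 1 + 1 + n_videos + n_videos * n_comment_pages

# e.g. 10 videos with one comment page each -> 22 units
```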
## Troubleshooting
### "API key not valid"
- Check that the API key is correct
- Verify YouTube Data API v3 is enabled
- Check that the key isn't restricted to wrong APIs
### "Quota exceeded"
- Wait until the next day (quota resets at midnight Pacific Time)
- Or request a quota increase in Google Cloud Console
### "Channel not found"
- The channel may have been deleted
- The URL format may not be recognized
- Try using the channel ID directly
### "Comments disabled"
- Some videos have comments disabled
- The script handles this gracefully
### "No transcript available"
- Not all videos have captions
- Auto-generated captions may not be available for all languages
## Architecture
```
Entry YAML file
      │
      ▼
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
      │
      ▼
Resolve channel ID (handle → channel ID)
      │
      ▼
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
      │
      ▼
Fetch via yt-dlp:
  - Transcripts/captions
      │
      ▼
Add youtube_enrichment section
      │
      ▼
Update provenance
      │
      ▼
Save YAML file
```
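The "resolve channel ID" step handles the three common URL shapes, each of which maps onto a different `channels.list` lookup parameter: `/channel/UC…` is already an id, legacy `/user/NAME` resolves via `forUsername`, and `/@handle` resolves via `forHandle`. A hedged sketch of that dispatch (the function name is illustrative):

```python
from urllib.parse import urlparse

def resolve_params(youtube_url: str, api_key: str) -> dict:
    """Build `channels.list` parameters that resolve a YouTube URL to a channel id."""
    path = urlparse(youtube_url).path.strip("/")
    if path.startswith("channel/"):
        # Direct channel id, no lookup needed beyond validation
        return {"part": "id", "id": path.split("/", 1)[1], "key": api_key}
    if path.startswith("user/"):
        # Legacy username URL, e.g. /user/TUApeldoorn
        return {"part": "id", "forUsername": path.split("/", 1)[1], "key": api_key}
    if path.startswith("@"):
        # Handle URL, e.g. /@channelname
        return {"part": "id", "forHandle": path, "key": api_key}
    raise ValueError(f"unrecognised YouTube URL: {youtube_url}")
```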
## Related Scripts
- `scripts/enrich_wikidata.py` - Wikidata enrichment
- `scripts/enrich_google_maps.py` - Google Maps enrichment
- `scripts/fetch_website_playwright.py` - Website archiving
- `mcp_servers/social_media/server.py` - MCP server for social media
## Future Enhancements
- [ ] Track channel subscriber growth over time
- [ ] Extract video chapters/timestamps
- [ ] Analyze video categories and topics
- [ ] Cross-reference with other social media
- [ ] Detect playlists relevant to heritage