# YouTube Enrichment for Heritage Custodians

This document explains how to enrich heritage custodian entries with YouTube channel and video data.

## Prerequisites

### 1. Get a YouTube API Key

1. **Go to Google Cloud Console**
   - Visit: https://console.cloud.google.com/

2. **Create or Select a Project**
   - Click on the project dropdown at the top
   - Click "New Project" or select an existing one
   - Name it something like "GLAM YouTube Enrichment"

3. **Enable YouTube Data API v3**
   - Navigate to "APIs & Services" → "Library"
   - Search for "YouTube Data API v3"
   - Click on it and press **Enable**

4. **Create API Credentials**
   - Go to "APIs & Services" → "Credentials"
   - Click "Create Credentials" → "API Key"
   - Copy the generated API key

5. **Restrict the API Key (Recommended)**
   - Click on your new API key to edit it
   - Under "API restrictions", select "Restrict key"
   - Select only "YouTube Data API v3"
   - Click Save

### 2. Set Environment Variable

```bash
export YOUTUBE_API_KEY='your-api-key-here'
```

Or add to your `.env` file:

```
YOUTUBE_API_KEY=your-api-key-here
```

### 3. Install Dependencies

```bash
pip install httpx pyyaml

# For transcript extraction (optional but recommended)
brew install yt-dlp  # macOS
# or
pip install yt-dlp
```

## Usage

### Basic Usage

```bash
# Process all entries with YouTube URLs
python scripts/enrich_youtube.py

# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run

# Process only the first 10 entries
python scripts/enrich_youtube.py --limit 10

# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
```

### Example Output

```
Processing: 0146_Q1663974.yaml
  Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
  Fetching channel info for UCxxxxx...
  Fetching 10 recent videos...
  Fetching comments for top videos...
  Fetching transcripts for videos with captions...
  Status: SUCCESS
  Channel: Theologische Universiteit Apeldoorn
  Subscribers: 1,234
  Videos fetched: 10
```

## Data Collected

### Channel Information

- Channel ID and URL
- Channel title and description
- Custom URL (e.g., @channelname)
- Subscriber count
- Total video count
- Total view count
- Channel creation date
- Country
- Thumbnail and banner images

### Video Information (per video)

- Video ID and URL
- Title and description
- Published date
- Duration
- View count
- Like count
- Comment count
- Tags
- Thumbnail
- Caption availability
- Default language

### Comments (per video)

- Comment ID
- Author name and channel URL
- Comment text
- Like count
- Reply count
- Published date

### Transcripts (when available)

- Full transcript text
- Language
- Transcript type (manual or auto-generated)

## Provenance Tracking

All extracted data includes full provenance:

```yaml
youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
```

## API Quota

The YouTube Data API has a default daily quota of **10,000 units**:

| Operation | Cost |
|-----------|------|
| Channel info | 1 unit |
| Video list | 1 unit |
| Video details | 1 unit per video |
| Comments | 1 unit per 100 comments |
| Search | 100 units |

**Estimated usage per custodian**: 15-50 units (depending on videos/comments)

For 100 custodians: ~1,500-5,000 units (well within daily quota)

## Troubleshooting

### "API key not valid"

- Check that the API key is correct
- Verify YouTube Data API v3 is enabled
- Check that the key isn't restricted to the wrong APIs

### "Quota exceeded"

- Wait until the next day (quota resets at midnight Pacific Time)
- Or request a quota increase in Google Cloud Console

### "Channel not found"

- The channel may have been deleted
- The URL format may not be recognized
- Try using the channel ID directly

### "Comments disabled"

- Some videos have comments disabled
- The script handles this gracefully

### "No transcript available"

- Not all videos have captions
- Auto-generated captions may not be available for all languages

## Architecture

```
Entry YAML file
    ↓
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
    ↓
Resolve channel ID (handle → channel ID)
    ↓
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
    ↓
Fetch via yt-dlp:
  - Transcripts/captions
    ↓
Add youtube_enrichment section
Update provenance
    ↓
Save YAML file
```

## Related Scripts

- `scripts/enrich_wikidata.py` - Wikidata enrichment
- `scripts/enrich_google_maps.py` - Google Maps enrichment
- `scripts/fetch_website_playwright.py` - Website archiving
- `mcp_servers/social_media/server.py` - MCP server for social media

## Future Enhancements

- [ ] Track channel subscriber growth over time
- [ ] Extract video chapters/timestamps
- [ ] Analyze video categories and topics
- [ ] Cross-reference with other social media
- [ ] Detect playlists relevant to heritage
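
## Appendix: Estimating Quota Usage

The per-custodian estimate in the API Quota section can be reproduced with a little arithmetic. The sketch below is illustrative only and is not part of `scripts/enrich_youtube.py`: the function names (`estimate_units`, `custodians_per_day`) and the example comment counts are assumptions, and the unit costs are simply copied from the quota table in this document.

```python
import math

# Unit costs copied from the quota table in this document.
DAILY_QUOTA = 10_000
COST_CHANNEL_INFO = 1
COST_VIDEO_LIST = 1
COST_PER_VIDEO_DETAILS = 1
COMMENTS_PER_UNIT = 100  # 1 unit per 100 comments


def estimate_units(comments_per_video):
    """Estimate quota units for one custodian (hypothetical helper).

    comments_per_video holds, for each fetched video, the number of
    comments to retrieve; 0 means comments are skipped for that video.
    """
    n_videos = len(comments_per_video)
    comment_units = sum(
        math.ceil(count / COMMENTS_PER_UNIT) for count in comments_per_video
    )
    return (
        COST_CHANNEL_INFO
        + COST_VIDEO_LIST
        + n_videos * COST_PER_VIDEO_DETAILS
        + comment_units
    )


def custodians_per_day(units_per_custodian, daily_quota=DAILY_QUOTA):
    """Return how many custodians fit in one day's quota at a given cost."""
    if units_per_custodian <= 0:
        raise ValueError("units_per_custodian must be positive")
    return daily_quota // units_per_custodian


if __name__ == "__main__":
    # Assumed scenario: 10 recent videos, comments fetched for the top 3 only.
    light = estimate_units([100, 100, 100] + [0] * 7)
    # Assumed scenario: 10 videos with 250 comments fetched for each.
    heavy = estimate_units([250] * 10)
    print(f"light run: {light} units, {custodians_per_day(light)} custodians/day")
    print(f"heavy run: {heavy} units, {custodians_per_day(heavy)} custodians/day")
```

Under these assumed scenarios the light run costs 15 units and the heavy run 42 units, which matches the 15-50 unit range quoted above. Actual consumption depends on how the script batches its requests, so treat the numbers as a planning aid rather than a guarantee.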