# YouTube Enrichment for Heritage Custodians
This document explains how to enrich heritage custodian entries with YouTube channel and video data.
## Prerequisites
### 1. Get a YouTube API Key
1. **Go to Google Cloud Console**
- Visit: https://console.cloud.google.com/
2. **Create or Select a Project**
- Click on the project dropdown at the top
- Click "New Project" or select an existing one
- Name it something like "GLAM YouTube Enrichment"
3. **Enable YouTube Data API v3**
- Navigate to "APIs & Services" → "Library"
- Search for "YouTube Data API v3"
- Click on it and press **Enable**
4. **Create API Credentials**
- Go to "APIs & Services" → "Credentials"
- Click "Create Credentials" → "API Key"
- Copy the generated API key
5. **Restrict the API Key (Recommended)**
- Click on your new API key to edit it
- Under "API restrictions", select "Restrict key"
- Select only "YouTube Data API v3"
- Click Save
### 2. Set Environment Variable
```bash
export YOUTUBE_API_KEY='your-api-key-here'
```
Or add to your `.env` file:
```
YOUTUBE_API_KEY=your-api-key-here
```
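The enrichment script reads this variable at startup. As a minimal sketch (the function name here is illustrative, not the script's actual API), failing early when the key is missing avoids burning quota on half-configured runs:

```python
import os

def load_api_key() -> str:
    """Read the YouTube API key from the environment, failing early if absent."""
    key = os.environ.get("YOUTUBE_API_KEY")
    if not key:
        raise RuntimeError(
            "YOUTUBE_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```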
### 3. Install Dependencies
```bash
pip install httpx pyyaml
# For transcript extraction (optional but recommended)
brew install yt-dlp # macOS
# or
pip install yt-dlp
```
## Usage
### Basic Usage
```bash
# Process all entries with YouTube URLs
python scripts/enrich_youtube.py
# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run
# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10
# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
```
### Example Output
```
Processing: 0146_Q1663974.yaml
Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
Fetching channel info for UCxxxxx...
Fetching 10 recent videos...
Fetching comments for top videos...
Fetching transcripts for videos with captions...
Status: SUCCESS
Channel: Theologische Universiteit Apeldoorn
Subscribers: 1,234
Videos fetched: 10
```
## Data Collected
### Channel Information
- Channel ID and URL
- Channel title and description
- Custom URL (e.g., @channelname)
- Subscriber count
- Total video count
- Total view count
- Channel creation date
- Country
- Thumbnail and banner images
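All of the channel fields above come from a single `channels.list` call. A hedged sketch (function names are illustrative; the real script may structure this differently) using `httpx` from the dependency list:

```python
YOUTUBE_API = "https://www.googleapis.com/youtube/v3"

def channel_params(channel_id: str, api_key: str) -> dict:
    """Parameters for the `channels` endpoint: `snippet` carries title,
    description, custom URL, country, creation date and thumbnails;
    `statistics` carries subscriber/video/view counts; `brandingSettings`
    carries the banner image."""
    return {
        "part": "snippet,statistics,brandingSettings",
        "id": channel_id,
        "key": api_key,
    }

def fetch_channel(channel_id: str, api_key: str) -> dict:
    # httpx is a project dependency; imported here so the helper above stays stdlib-only
    import httpx
    resp = httpx.get(f"{YOUTUBE_API}/channels",
                     params=channel_params(channel_id, api_key))
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        raise LookupError(f"channel not found: {channel_id}")
    return items[0]
```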
### Video Information (per video)
- Video ID and URL
- Title and description
- Published date
- Duration
- View count
- Like count
- Comment count
- Tags
- Thumbnail
- Caption availability
- Default language
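Per-video fields map onto one `videos.list` call: `snippet` (title, description, tags, language), `contentDetails` (duration, caption availability) and `statistics` (view/like/comment counts). A sketch of the request shape, assuming the recent-video IDs were taken from the channel's uploads playlist (helper names are illustrative):

```python
def uploads_playlist_id(channel_item: dict) -> str:
    """The channel's uploads playlist id lives under
    contentDetails.relatedPlaylists in a `channels.list` response."""
    return channel_item["contentDetails"]["relatedPlaylists"]["uploads"]

def video_params(video_ids: list[str], api_key: str) -> dict:
    """One `videos.list` call covers several ids at once via a
    comma-separated id parameter."""
    return {
        "part": "snippet,contentDetails,statistics",
        "id": ",".join(video_ids),
        "key": api_key,
    }
```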
### Comments (per video)
- Comment ID
- Author name and channel URL
- Comment text
- Like count
- Reply count
- Published date
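These comment fields come from the `commentThreads.list` endpoint, where each item wraps a top-level comment plus its reply count. A sketch of the request parameters and the mapping onto the fields above (function names are illustrative):

```python
def comment_params(video_id: str, api_key: str, max_results: int = 100) -> dict:
    """`commentThreads.list` returns up to 100 top-level comments per page."""
    return {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,
        "order": "relevance",
        "key": api_key,
    }

def parse_comment(item: dict) -> dict:
    """Flatten one commentThreads item into the fields listed above."""
    top = item["snippet"]["topLevelComment"]
    s = top["snippet"]
    return {
        "comment_id": top["id"],
        "author": s["authorDisplayName"],
        "author_channel_url": s.get("authorChannelUrl", ""),
        "text": s["textDisplay"],
        "like_count": s["likeCount"],
        "reply_count": item["snippet"]["totalReplyCount"],
        "published_at": s["publishedAt"],
    }
```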
### Transcripts (when available)
- Full transcript text
- Language
- Transcript type (manual or auto-generated)
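Transcripts are fetched with `yt-dlp` rather than the Data API. A minimal sketch of the invocation (the real script may pass different languages or formats): `--skip-download` fetches only subtitles, and combining `--write-subs` with `--write-auto-subs` prefers manual captions while falling back to auto-generated ones.

```python
import subprocess

def transcript_cmd(video_url: str, lang: str = "nl") -> list[str]:
    """yt-dlp invocation that downloads captions only, no media."""
    return [
        "yt-dlp",
        "--skip-download",      # no video/audio download
        "--write-subs",         # manual captions when present
        "--write-auto-subs",    # fall back to auto-generated captions
        "--sub-langs", lang,
        "--sub-format", "vtt",
        video_url,
    ]

def fetch_transcript(video_url: str, lang: str = "nl") -> None:
    subprocess.run(transcript_cmd(video_url, lang), check=True)
```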
## Provenance Tracking
All extracted data includes full provenance:
```yaml
youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
```
## API Quota
The YouTube Data API has a default daily quota of **10,000 units**:
| Operation | Cost |
|-----------|------|
| Channel info | 1 unit |
| Video list | 1 unit |
| Video details | 1 unit per video |
| Comments | 1 unit per 100 comments |
| Search | 100 units |
**Estimated usage per custodian**: 15-50 units (depending on videos/comments)
For 100 custodians: ~1,500-5,000 units (well within daily quota)
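The per-custodian estimate above follows directly from the cost table. A quick sanity-check of the arithmetic (a sketch, not the script's actual accounting):

```python
def estimate_units(n_videos: int = 10, n_comment_pages: int = 1) -> int:
    """Per-custodian quota estimate from the cost table:
    1 unit channel info + 1 unit video list + 1 unit per video detail
    + 1 unit per page of up to 100 comments per video."""
    return 1 + 1 + n_videos + n_videos * n_comment_pages

# e.g. 10 videos with one comment page each -> 22 units
```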
## Troubleshooting
### "API key not valid"
- Check that the API key is correct
- Verify YouTube Data API v3 is enabled
- Check that the key isn't restricted to wrong APIs
### "Quota exceeded"
- Wait until the next day (quota resets at midnight Pacific Time)
- Or request a quota increase in Google Cloud Console
### "Channel not found"
- The channel may have been deleted
- The URL format may not be recognized
- Try using the channel ID directly
### "Comments disabled"
- Some videos have comments disabled
- The script handles this gracefully
### "No transcript available"
- Not all videos have captions
- Auto-generated captions may not be available for all languages
## Architecture
```
Entry YAML file
      │
      ▼
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
      │
      ▼
Resolve channel ID (handle → channel ID)
      │
      ▼
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
      │
      ▼
Fetch via yt-dlp:
  - Transcripts/captions
      │
      ▼
Add youtube_enrichment section
      │
      ▼
Update provenance
      │
      ▼
Save YAML file
```
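The "resolve channel ID" step handles the three common URL shapes, each of which maps onto a different `channels.list` lookup parameter: `/channel/UC…` is already an id, legacy `/user/NAME` resolves via `forUsername`, and `/@handle` resolves via `forHandle`. A hedged sketch of that dispatch (the function name is illustrative):

```python
from urllib.parse import urlparse

def resolve_params(youtube_url: str, api_key: str) -> dict:
    """Build `channels.list` parameters that resolve a YouTube URL to a channel id."""
    path = urlparse(youtube_url).path.strip("/")
    if path.startswith("channel/"):
        # Direct channel id, no lookup needed beyond validation
        return {"part": "id", "id": path.split("/", 1)[1], "key": api_key}
    if path.startswith("user/"):
        # Legacy username URL, e.g. /user/TUApeldoorn
        return {"part": "id", "forUsername": path.split("/", 1)[1], "key": api_key}
    if path.startswith("@"):
        # Handle URL, e.g. /@channelname
        return {"part": "id", "forHandle": path, "key": api_key}
    raise ValueError(f"unrecognised YouTube URL: {youtube_url}")
```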
## Related Scripts
- `scripts/enrich_wikidata.py` - Wikidata enrichment
- `scripts/enrich_google_maps.py` - Google Maps enrichment
- `scripts/fetch_website_playwright.py` - Website archiving
- `mcp_servers/social_media/server.py` - MCP server for social media
## Future Enhancements
- [ ] Track channel subscriber growth over time
- [ ] Extract video chapters/timestamps
- [ ] Analyze video categories and topics
- [ ] Cross-reference with other social media
- [ ] Detect playlists relevant to heritage