glam/docs/YOUTUBE_ENRICHMENT.md
YouTube Enrichment for Heritage Custodians

This document explains how to enrich heritage custodian entries with YouTube channel and video data.

Prerequisites

1. Get a YouTube API Key

  1. Go to Google Cloud Console

  2. Create or Select a Project

    • Click on the project dropdown at the top
    • Click "New Project" or select an existing one
    • Name it something like "GLAM YouTube Enrichment"
  3. Enable YouTube Data API v3

    • Navigate to "APIs & Services" → "Library"
    • Search for "YouTube Data API v3"
    • Click on it and press Enable
  4. Create API Credentials

    • Go to "APIs & Services" → "Credentials"
    • Click "Create Credentials" → "API Key"
    • Copy the generated API key
  5. Restrict the API Key (Recommended)

    • Click on your new API key to edit it
    • Under "API restrictions", select "Restrict key"
    • Select only "YouTube Data API v3"
    • Click Save

2. Set Environment Variable

export YOUTUBE_API_KEY='your-api-key-here'

Or add it to your .env file:

YOUTUBE_API_KEY=your-api-key-here
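
As a rough illustration of how a script could pick up the key from either place, the sketch below checks the environment first and then falls back to a .env file. The helper names `read_dotenv` and `load_api_key` are hypothetical and are not taken from `scripts/enrich_youtube.py`; only the variable name `YOUTUBE_API_KEY` comes from this document.

```python
import os
from pathlib import Path


def read_dotenv(path):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    dotenv = Path(path)
    if not dotenv.is_file():
        return values
    for line in dotenv.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in export KEY='value'
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_api_key(environ=None, dotenv_path=".env"):
    """Return YOUTUBE_API_KEY from the environment, else from a .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("YOUTUBE_API_KEY")
    if not key:
        key = read_dotenv(dotenv_path).get("YOUTUBE_API_KEY")
    if not key:
        raise RuntimeError("YOUTUBE_API_KEY is not set")
    return key
```

Failing early with a clear error when the key is missing avoids spending time on entries before the first API call is rejected.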

3. Install Dependencies

pip install httpx pyyaml

# For transcript extraction (optional but recommended)
brew install yt-dlp  # macOS
# or
pip install yt-dlp

Usage

Basic Usage

# Process all entries with YouTube URLs
python scripts/enrich_youtube.py

# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run

# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10

# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
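
The flags above could be wired up roughly as follows. This is a minimal sketch, assuming a standard `argparse` setup; the flag names come from the commands above, while `build_parser`, `select_entries`, and the sorting behaviour are illustrative assumptions, not the script's actual code.

```python
import argparse


def build_parser():
    """Command-line interface with the three flags shown in this section."""
    parser = argparse.ArgumentParser(
        description="Enrich custodian entries with YouTube data"
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be done without writing files")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N entries")
    parser.add_argument("--entry", default=None,
                        help="process a single entry file by name")
    return parser


def select_entries(filenames, limit=None, entry=None):
    """Apply --entry and --limit to a list of entry filenames (assumed logic)."""
    if entry is not None:
        return [name for name in filenames if name == entry]
    selected = sorted(filenames)
    return selected if limit is None else selected[:limit]
```

For example, `build_parser().parse_args(["--limit", "10"])` yields a namespace with `limit=10` and `dry_run=False`.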

Example Output

Processing: 0146_Q1663974.yaml
  Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
    Fetching channel info for UCxxxxx...
    Fetching 10 recent videos...
    Fetching comments for top videos...
    Fetching transcripts for videos with captions...
  Status: SUCCESS
    Channel: Theologische Universiteit Apeldoorn
    Subscribers: 1,234
    Videos fetched: 10

Data Collected

Channel Information

  • Channel ID and URL
  • Channel title and description
  • Custom URL (e.g., @channelname)
  • Subscriber count
  • Total video count
  • Total view count
  • Channel creation date
  • Country
  • Thumbnail and banner images
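
To show where these fields typically come from, here is a sketch that flattens one item of a `channels.list` response (requested with the `snippet`, `statistics`, and `brandingSettings` parts) into a plain dictionary. The function name `extract_channel_fields` and most output key names are assumptions; only `channel_id`, `channel_url`, `title`, and `subscriber_count` appear in this document's example output.

```python
def extract_channel_fields(item):
    """Flatten one channels.list item into the fields listed above."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    branding = item.get("brandingSettings", {})
    channel_id = item["id"]

    def to_int(value):
        # The API returns counts as strings; a hidden count is simply absent.
        return int(value) if value is not None else None

    return {
        "channel_id": channel_id,
        "channel_url": f"https://www.youtube.com/channel/{channel_id}",
        "title": snippet.get("title"),
        "description": snippet.get("description"),
        "custom_url": snippet.get("customUrl"),
        "published_at": snippet.get("publishedAt"),
        "country": snippet.get("country"),
        "subscriber_count": to_int(stats.get("subscriberCount")),
        "video_count": to_int(stats.get("videoCount")),
        "view_count": to_int(stats.get("viewCount")),
        "thumbnail_url": snippet.get("thumbnails", {}).get("high", {}).get("url"),
        "banner_url": branding.get("image", {}).get("bannerExternalUrl"),
    }
```

Using `.get()` throughout keeps the sketch tolerant of channels that hide their subscriber count or set no country.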

Video Information (per video)

  • Video ID and URL
  • Title and description
  • Published date
  • Duration
  • View count
  • Like count
  • Comment count
  • Tags
  • Thumbnail
  • Caption availability
  • Default language
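
The YouTube Data API reports video duration as an ISO 8601 string such as `PT15M33S`. If a numeric value is more convenient, a conversion along these lines may help. This is a sketch with a hypothetical helper name, `duration_to_seconds`; whether the script stores the raw string or converts it is not stated here.

```python
import re

# Matches durations such as PT15M33S, PT1H2M3S, or P1DT1S.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert an ISO 8601 duration (days to seconds only) into seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(num or 0) for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT15M33S"))  # 933
```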

Comments (per video)

  • Comment ID
  • Author name and channel URL
  • Comment text
  • Like count
  • Reply count
  • Published date

Transcripts (when available)

  • Full transcript text
  • Language
  • Transcript type (manual or auto-generated)
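
Transcripts are fetched with yt-dlp rather than the Data API. The sketch below shows one plausible way to build a subtitle-only yt-dlp command and to reduce a downloaded WebVTT file to plain text. The exact invocation used by the script is not documented here, so the option set, the output template, and both function names are assumptions.

```python
import re


def build_subtitle_command(video_id, language="nl", output_dir="."):
    """Build a yt-dlp command that downloads subtitles only, not the video."""
    return [
        "yt-dlp",
        "--skip-download",        # do not download the media itself
        "--write-subs",           # manual captions, if present
        "--write-auto-subs",      # auto-generated captions as a fallback
        "--sub-langs", language,
        "--sub-format", "vtt",
        "-o", f"{output_dir}/%(id)s.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]


def vtt_to_text(vtt):
    """Strip WebVTT headers, cue timings, and inline tags; join the rest."""
    lines = []
    for raw in vtt.splitlines():
        line = re.sub(r"<[^>]+>", "", raw).strip()
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        # Auto-generated captions often repeat the previous line.
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return " ".join(lines)
```

The command list can be passed to `subprocess.run(..., check=True)`; keeping it as a list avoids shell quoting problems.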

Provenance Tracking

All extracted data includes full provenance:

youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
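
A record with this shape could be assembled as in the following sketch before it is written back to the entry YAML. The key names mirror the example above; the function `build_enrichment` and its `now` parameter (added so the timestamp can be fixed in tests) are hypothetical.

```python
from datetime import datetime, timezone

API_ENDPOINT = "https://www.googleapis.com/youtube/v3"


def build_enrichment(source_url, channel, videos, status="SUCCESS", now=None):
    """Assemble the youtube_enrichment section with provenance fields."""
    now = now or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        # Timezone-aware UTC timestamp, e.g. 2025-12-01T15:30:00+00:00
        "fetch_timestamp": now.isoformat(timespec="seconds"),
        "api_endpoint": API_ENDPOINT,
        "api_version": "v3",
        "status": status,
        "channel": channel,
        "videos": videos,
    }
```

Recording the timestamp in UTC with an explicit offset keeps entries comparable regardless of where the script ran.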

API Quota

The YouTube Data API v3 has a default daily quota of 10,000 units per project:

Operation       Cost
Channel info    1 unit
Video list      1 unit
Video details   1 unit per video
Comments        1 unit per 100 comments
Search          100 units

Estimated usage per custodian: 15-50 units (depending on videos/comments)

For 100 custodians: ~1,500-5,000 units (well within daily quota)
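
The estimate above can be reproduced with a small calculation based on the cost table. The defaults below (10 videos, comments fetched for 3 of them, up to 100 comments each) are assumptions chosen to match the lower bound of 15 units; the script's real settings may differ.

```python
import math

DAILY_QUOTA = 10_000  # daily quota stated in this section


def estimate_units(videos=10, videos_with_comments=3, comments_per_video=100):
    """Estimate quota units for one custodian using the cost table above."""
    channel_info = 1
    video_list = 1
    video_details = videos  # 1 unit per video
    comment_calls = videos_with_comments * math.ceil(comments_per_video / 100)
    return channel_info + video_list + video_details + comment_calls


def custodians_per_day(units_per_custodian):
    """How many custodians fit into one day's quota at a given cost."""
    return DAILY_QUOTA // units_per_custodian


print(estimate_units())        # 15
print(custodians_per_day(50))  # 200
```

At the upper bound of 50 units per custodian, roughly 200 custodians fit into one day's quota, so larger batches need to be spread over several days.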

Troubleshooting

"API key not valid"

  • Check that the API key is correct
  • Verify YouTube Data API v3 is enabled
  • Check that the key isn't restricted to the wrong APIs

"Quota exceeded"

  • Wait until the next day (quota resets at midnight Pacific Time)
  • Or request a quota increase in Google Cloud Console

"Channel not found"

  • The channel may have been deleted
  • The URL format may not be recognized
  • Try using the channel ID directly
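
When a URL is not recognized, it can help to check which form it takes. The sketch below classifies common YouTube channel URL shapes and names the `channels.list` parameter that could resolve each one. The function `classify_channel_url` and the labels it returns are hypothetical, and the set of URL forms the script actually supports is not documented here.

```python
from urllib.parse import urlparse

# channels.list lookup parameter for each URL form; legacy /c/ custom URLs
# have no direct lookup parameter and are therefore left out.
LOOKUP_PARAM = {
    "channel_id": "id",
    "username": "forUsername",
    "handle": "forHandle",
}


def classify_channel_url(url):
    """Return (kind, value) for a YouTube channel URL."""
    path = urlparse(url).path.strip("/")
    parts = path.split("/") if path else []
    if not parts:
        return ("unknown", "")
    if parts[0] == "channel" and len(parts) > 1:
        return ("channel_id", parts[1])
    if parts[0] == "user" and len(parts) > 1:
        return ("username", parts[1])
    if parts[0] == "c" and len(parts) > 1:
        return ("custom", parts[1])
    if parts[0].startswith("@"):
        return ("handle", parts[0])
    return ("unknown", parts[0])


print(classify_channel_url("https://www.youtube.com/user/TUApeldoorn"))
# ('username', 'TUApeldoorn')
```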

"Comments disabled"

  • Some videos have comments disabled
  • The script handles this gracefully
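
One way such handling could look: the API answers a comment request for a video with disabled comments with HTTP 403 and the error reason `commentsDisabled`. The sketch below treats that case as an empty result and re-raises anything else. Both function names are hypothetical, and this is not necessarily how the script implements it.

```python
def is_comments_disabled(status_code, payload):
    """True if an API error response says comments are disabled."""
    if status_code != 403:
        return False
    errors = payload.get("error", {}).get("errors", [])
    return any(err.get("reason") == "commentsDisabled" for err in errors)


def comments_or_empty(status_code, payload):
    """Return comment items, or an empty list when comments are disabled."""
    if status_code == 200:
        return payload.get("items", [])
    if is_comments_disabled(status_code, payload):
        return []
    message = payload.get("error", {}).get("message", "unknown error")
    raise RuntimeError(f"comment request failed ({status_code}): {message}")
```

Checking the error reason rather than the status code alone keeps other 403 responses, such as an exhausted quota, from being silently swallowed.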

"No transcript available"

  • Not all videos have captions
  • Auto-generated captions may not be available for all languages

Architecture

Entry YAML file
    ↓
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
    ↓
Resolve channel ID (handle → channel ID)
    ↓
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
    ↓
Fetch via yt-dlp:
  - Transcripts/captions
    ↓
Add youtube_enrichment section
Update provenance
    ↓
Save YAML file
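
The first step of the flow above, locating a YouTube URL in an entry, might look like the following sketch. It assumes that `web_claims.social_youtube` holds a plain URL string and that `wikidata_enrichment.P2397` holds a plain channel ID string. The real entry files may nest these values differently, and `find_youtube_url` is a hypothetical name.

```python
def find_youtube_url(entry):
    """Return a YouTube URL for an entry dict, or None if none is recorded."""
    # Preferred source: a URL scraped from the custodian's own website.
    web_claims = entry.get("web_claims") or {}
    url = web_claims.get("social_youtube")
    if url:
        return url

    # Fallback: Wikidata property P2397 (YouTube channel ID).
    wikidata = entry.get("wikidata_enrichment") or {}
    channel_id = wikidata.get("P2397")
    if channel_id:
        return f"https://www.youtube.com/channel/{channel_id}"
    return None
```

Entries for which this returns `None` can be skipped without spending any API quota.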

Related Scripts

  • scripts/enrich_wikidata.py - Wikidata enrichment
  • scripts/enrich_google_maps.py - Google Maps enrichment
  • scripts/fetch_website_playwright.py - Website archiving
  • mcp_servers/social_media/server.py - MCP server for social media

Future Enhancements

  • Track channel subscriber growth over time
  • Extract video chapters/timestamps
  • Analyze video categories and topics
  • Cross-reference with other social media
  • Detect playlists relevant to heritage