YouTube Enrichment for Heritage Custodians
This document explains how to enrich heritage custodian entries with YouTube channel and video data.
Prerequisites
1. Get a YouTube API Key
- Go to Google Cloud Console
- Create or Select a Project
  - Click on the project dropdown at the top
  - Click "New Project" or select an existing one
  - Name it something like "GLAM YouTube Enrichment"
- Enable YouTube Data API v3
  - Navigate to "APIs & Services" → "Library"
  - Search for "YouTube Data API v3"
  - Click on it and press Enable
- Create API Credentials
  - Go to "APIs & Services" → "Credentials"
  - Click "Create Credentials" → "API Key"
  - Copy the generated API key
- Restrict the API Key (Recommended)
  - Click on your new API key to edit it
  - Under "API restrictions", select "Restrict key"
  - Select only "YouTube Data API v3"
  - Click Save
2. Set Environment Variable
export YOUTUBE_API_KEY='your-api-key-here'
Or add to your .env file:
YOUTUBE_API_KEY=your-api-key-here
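As a minimal sketch of how a script could pick up this variable, the helper below reads `YOUTUBE_API_KEY` and fails early with a readable message when it is missing. The function name `get_api_key` is hypothetical and is not taken from `scripts/enrich_youtube.py`; loading a `.env` file is assumed to happen elsewhere (for example through the shell or a dotenv loader).

```python
import os


def get_api_key(env=None):
    """Return the YouTube API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("YOUTUBE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "YOUTUBE_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```

Failing before any request is sent avoids spending time on entries that would all end in an "API key not valid" error.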
3. Install Dependencies
pip install httpx pyyaml
# For transcript extraction (optional but recommended)
brew install yt-dlp # macOS
# or
pip install yt-dlp
Usage
Basic Usage
# Process all entries with YouTube URLs
python scripts/enrich_youtube.py
# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run
# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10
# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
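The flags above could be declared with `argparse` roughly as follows. This is an illustrative sketch only: the flag names come from the commands shown here, but the defaults, help texts, and the `build_parser` name are assumptions rather than the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser for the three flags documented above (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Enrich custodian entries with YouTube data"
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be done without writing files")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N entries")
    parser.add_argument("--entry", default=None,
                        help="process a single entry file, e.g. 0146_Q1663974.yaml")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--limit", "10", "--dry-run"])
```

Note that `argparse` exposes `--dry-run` as the attribute `args.dry_run`.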
Example Output
Processing: 0146_Q1663974.yaml
Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
Fetching channel info for UCxxxxx...
Fetching 10 recent videos...
Fetching comments for top videos...
Fetching transcripts for videos with captions...
Status: SUCCESS
Channel: Theologische Universiteit Apeldoorn
Subscribers: 1,234
Videos fetched: 10
Data Collected
Channel Information
- Channel ID and URL
- Channel title and description
- Custom URL (e.g., @channelname)
- Subscriber count
- Total video count
- Total view count
- Channel creation date
- Country
- Thumbnail and banner images
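The channel fields listed above map onto the `snippet`, `statistics`, and `brandingSettings` parts of the YouTube Data API `channels` endpoint. The sketch below only builds the request URL; the helper name `channel_request_url` and the choice of parts are assumptions, and the enrichment script may structure its requests differently.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"


def channel_request_url(api_key, *, channel_id=None, username=None, handle=None):
    """Build a channels.list URL for a channel ID, a legacy username, or an @handle."""
    params = {"part": "snippet,statistics,brandingSettings"}
    if channel_id:
        params["id"] = channel_id
    elif username:
        params["forUsername"] = username  # legacy /user/<name> URLs
    elif handle:
        params["forHandle"] = handle      # e.g. "@channelname"
    else:
        raise ValueError("provide channel_id, username, or handle")
    params["key"] = api_key
    return f"{API_BASE}/channels?{urlencode(params)}"


url = channel_request_url("your-api-key-here", username="TUApeldoorn")
```

A single call of this kind costs 1 quota unit regardless of how many parts are requested.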
Video Information (per video)
- Video ID and URL
- Title and description
- Published date
- Duration
- View count
- Like count
- Comment count
- Tags
- Thumbnail
- Caption availability
- Default language
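The API reports video duration as an ISO 8601 string such as `PT1H2M3S`. If a numeric value is more convenient for analysis, a small converter like the one below can be used. This is a sketch; whether the enrichment script stores the raw string or a converted value is not stated here, and `duration_seconds` is a hypothetical name.

```python
import re

# Matches durations such as PT45S, PT1H2M3S, P1DT2H, or P0D.
_DURATION_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def duration_seconds(iso_duration):
    """Convert an ISO 8601 duration (days, hours, minutes, seconds) to seconds."""
    match = _DURATION_RE.match(iso_duration)
    if not match:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(duration_seconds("PT1H2M3S"))  # 3723
```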
Comments (per video)
- Comment ID
- Author name and channel URL
- Comment text
- Like count
- Reply count
- Published date
Transcripts (when available)
- Full transcript text
- Language
- Transcript type (manual or auto-generated)
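Transcripts are fetched with yt-dlp rather than the Data API. One way to request both manual and auto-generated subtitles without downloading the video is sketched below. The function name, the output template, and the choice of the `vtt` format are assumptions; the script's actual invocation may differ.

```python
def transcript_command(video_id, lang="nl", out_dir="transcripts"):
    """Return a yt-dlp argument list that downloads subtitles only."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    return [
        "yt-dlp",
        "--skip-download",       # do not download the video itself
        "--write-subs",          # manually created subtitles, if any
        "--write-auto-subs",     # auto-generated captions as a fallback
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        url,
    ]


cmd = transcript_command("abc123xyz")
```

The list can be passed to `subprocess.run(cmd, check=True)`; building it separately keeps the command easy to log during a dry run.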
Provenance Tracking
All extracted data includes full provenance:
youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
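The top-level provenance fields in this example could be assembled along the following lines. This is a minimal sketch: the field names are taken from the example above, while the `provenance_header` helper and its `now` parameter are hypothetical.

```python
from datetime import datetime, timezone


def provenance_header(source_url, status="SUCCESS", now=None):
    """Build the provenance fields that open a youtube_enrichment section."""
    now = now or datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        # Timezone-aware UTC timestamp, e.g. 2025-12-01T15:30:00+00:00
        "fetch_timestamp": now.isoformat(timespec="seconds"),
        "api_endpoint": "https://www.googleapis.com/youtube/v3",
        "api_version": "v3",
        "status": status,
    }


fixed = datetime(2025, 12, 1, 15, 30, tzinfo=timezone.utc)
header = provenance_header("https://www.youtube.com/user/TUApeldoorn", now=fixed)
```

Accepting an explicit `now` keeps the timestamp reproducible in tests while defaulting to the current UTC time in normal runs.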
API Quota
The YouTube Data API has a default daily quota of 10,000 units per project:
| Operation | Cost |
|---|---|
| Channel info | 1 unit |
| Video list | 1 unit |
| Video details | 1 unit per video |
| Comments | 1 unit per 100 comments |
| Search | 100 units |
Estimated usage per custodian: 15-50 units (depending on videos/comments)
For 100 custodians: ~1,500-5,000 units (well within daily quota)
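The estimate can be reproduced from the cost table with a short calculation. The sketch below assumes the per-operation costs listed above, one comment request of up to 100 comments per video, and no use of the Search operation; the function names and default values are illustrative.

```python
import math

DAILY_QUOTA = 10_000  # default daily quota in units


def units_for_custodian(videos=10, videos_with_comments=10, comments_per_video=100):
    """Estimate quota units for one custodian, using the costs in the table above."""
    channel_info = 1
    video_list = 1
    video_details = videos  # 1 unit per video
    comment_units = videos_with_comments * math.ceil(comments_per_video / 100)
    return channel_info + video_list + video_details + comment_units


def custodians_per_day(units_each):
    """How many custodians fit in one day's quota at the given cost each."""
    return DAILY_QUOTA // units_each


print(units_for_custodian())   # 22
print(custodians_per_day(50))  # 200
```

With 10 videos and one page of comments each, a custodian costs 22 units, which falls inside the 15-50 range given above; even at 50 units each, about 200 custodians fit in one day.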
Troubleshooting
"API key not valid"
- Check that the API key is correct
- Verify YouTube Data API v3 is enabled
- Check that the key isn't restricted to the wrong APIs
"Quota exceeded"
- Wait until the next day (quota resets at midnight Pacific Time)
- Or request a quota increase in Google Cloud Console
"Channel not found"
- The channel may have been deleted
- The URL format may not be recognized
- Try using the channel ID directly
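To check which URL format an entry uses, a small classifier such as the following can help. It is a sketch covering the common `/channel/`, `/user/`, `/c/`, and `@handle` forms; the function name and the returned labels are hypothetical, and the script's own URL recognition may cover more or fewer cases.

```python
from urllib.parse import urlparse


def classify_channel_url(url):
    """Return (kind, value) for a YouTube channel URL; kind is 'unknown' if unrecognized."""
    path = urlparse(url).path.strip("/")
    parts = path.split("/") if path else []
    if not parts:
        return ("unknown", "")
    if parts[0] == "channel" and len(parts) > 1:
        return ("channel_id", parts[1])
    if parts[0] == "user" and len(parts) > 1:
        return ("username", parts[1])
    if parts[0] == "c" and len(parts) > 1:
        return ("custom", parts[1])
    if parts[0].startswith("@"):
        return ("handle", parts[0])
    return ("unknown", parts[0])


print(classify_channel_url("https://www.youtube.com/user/TUApeldoorn"))
# ('username', 'TUApeldoorn')
```

Only the `channel_id` form can be used as-is; the other forms need an extra lookup to resolve the channel ID.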
"Comments disabled"
- Some videos have comments disabled
- The script handles this gracefully
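When comments are disabled, the API answers a comment request with an error whose reason is `commentsDisabled`. A check along these lines lets a caller skip the video instead of failing; it is a sketch of one possible approach, not necessarily how the script implements it, and `comments_disabled` is a hypothetical name.

```python
def comments_disabled(error_body):
    """Return True if a parsed API error response reports disabled comments."""
    errors = error_body.get("error", {}).get("errors", [])
    return any(err.get("reason") == "commentsDisabled" for err in errors)


# Example error payload, trimmed to the fields the check relies on.
sample = {
    "error": {
        "code": 403,
        "errors": [{"reason": "commentsDisabled", "domain": "youtube.commentThread"}],
    }
}
print(comments_disabled(sample))  # True
```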
"No transcript available"
- Not all videos have captions
- Auto-generated captions may not be available for all languages
Architecture
Entry YAML file
↓
Find YouTube URL from:
- web_claims.social_youtube
- wikidata_enrichment.P2397
↓
Resolve channel ID (handle → channel ID)
↓
Fetch via YouTube Data API v3:
- Channel info
- Recent videos
- Video details
- Comments
↓
Fetch via yt-dlp:
- Transcripts/captions
↓
Add youtube_enrichment section
Update provenance
↓
Save YAML file
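The first step of this flow, locating a YouTube URL in an entry, might look like the sketch below. It assumes the entry has been loaded into a dict, that `web_claims.social_youtube` holds a URL string, and that `wikidata_enrichment.P2397` holds a plain channel ID string (P2397 is Wikidata's YouTube channel ID property). The real entry layout may be more nested, and `find_youtube_url` is a hypothetical name.

```python
def find_youtube_url(entry):
    """Return a YouTube URL for an entry dict, or None if neither source has one."""
    # Prefer the URL scraped from the custodian's own website.
    url = (entry.get("web_claims") or {}).get("social_youtube")
    if url:
        return url
    # Fall back to the Wikidata channel ID and build a canonical channel URL.
    channel_id = (entry.get("wikidata_enrichment") or {}).get("P2397")
    if channel_id:
        return f"https://www.youtube.com/channel/{channel_id}"
    return None


entry = {"wikidata_enrichment": {"P2397": "UCxxxxxxxxxxxxx"}}
print(find_youtube_url(entry))
# https://www.youtube.com/channel/UCxxxxxxxxxxxxx
```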
Related Scripts
- scripts/enrich_wikidata.py - Wikidata enrichment
- scripts/enrich_google_maps.py - Google Maps enrichment
- scripts/fetch_website_playwright.py - Website archiving
- mcp_servers/social_media/server.py - MCP server for social media
Future Enhancements
- Track channel subscriber growth over time
- Extract video chapters/timestamps
- Analyze video categories and topics
- Cross-reference with other social media
- Detect playlists relevant to heritage