# YouTube Enrichment for Heritage Custodians

This document explains how to enrich heritage custodian entries with YouTube channel and video data.

## Prerequisites

### 1. Get a YouTube API Key

1. **Go to Google Cloud Console**
   - Visit: https://console.cloud.google.com/

2. **Create or Select a Project**
   - Click on the project dropdown at the top
   - Click "New Project" or select an existing one
   - Name it something like "GLAM YouTube Enrichment"

3. **Enable YouTube Data API v3**
   - Navigate to "APIs & Services" → "Library"
   - Search for "YouTube Data API v3"
   - Click on it and press **Enable**

4. **Create API Credentials**
   - Go to "APIs & Services" → "Credentials"
   - Click "Create Credentials" → "API Key"
   - Copy the generated API key

5. **Restrict the API Key (Recommended)**
   - Click on your new API key to edit it
   - Under "API restrictions", select "Restrict key"
   - Select only "YouTube Data API v3"
   - Click Save

### 2. Set Environment Variable

```bash
export YOUTUBE_API_KEY='your-api-key-here'
```

Or add it to your `.env` file:

```
YOUTUBE_API_KEY=your-api-key-here
```
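
As a rough illustration of how a script could pick up the key from either location, the sketch below checks the environment first and then falls back to a simple `.env` parser. The function name `load_api_key` and the parsing rules are assumptions made for this example; the actual script may load the key differently (for instance through a dotenv library).

```python
import os
from pathlib import Path


def load_api_key(env_file: str = ".env") -> str:
    """Return YOUTUBE_API_KEY from the environment, else from a .env file.

    Hypothetical helper: not taken from enrich_youtube.py.
    """
    key = os.environ.get("YOUTUBE_API_KEY")
    if key:
        return key

    path = Path(env_file)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            value = value.strip().strip("'\"")
            if name.strip() == "YOUTUBE_API_KEY" and value:
                return value

    raise RuntimeError("YOUTUBE_API_KEY is not set; export it or add it to .env")
```

Failing early with a clear message avoids spending time on file processing before the first API call is rejected.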

### 3. Install Dependencies

```bash
pip install httpx pyyaml

# For transcript extraction (optional but recommended)
brew install yt-dlp  # macOS
# or
pip install yt-dlp
```

## Usage

### Basic Usage

```bash
# Process all entries with YouTube URLs
python scripts/enrich_youtube.py

# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run

# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10

# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
```
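
For readers who want to see how these three flags fit together, here is a minimal sketch of a matching command-line parser. It assumes the script uses `argparse`, which this document does not confirm; `build_parser` and the help texts are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="enrich_youtube.py",
        description="Enrich heritage custodian entries with YouTube data.",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="show what would be done without writing any files",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="process only the first N entries",
    )
    parser.add_argument(
        "--entry",
        default=None,
        help="process a single entry file, e.g. 0146_Q1663974.yaml",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--dry-run", "--limit", "10"])
```

Note that `argparse` exposes `--dry-run` as `args.dry_run`.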

### Example Output

```
Processing: 0146_Q1663974.yaml
Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
Fetching channel info for UCxxxxx...
Fetching 10 recent videos...
Fetching comments for top videos...
Fetching transcripts for videos with captions...
Status: SUCCESS
Channel: Theologische Universiteit Apeldoorn
Subscribers: 1,234
Videos fetched: 10
```

## Data Collected

### Channel Information
- Channel ID and URL
- Channel title and description
- Custom URL (e.g., @channelname)
- Subscriber count
- Total video count
- Total view count
- Channel creation date
- Country
- Thumbnail and banner images

### Video Information (per video)
- Video ID and URL
- Title and description
- Published date
- Duration
- View count
- Like count
- Comment count
- Tags
- Thumbnail
- Caption availability
- Default language
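
The YouTube Data API reports video duration as an ISO 8601 string such as `PT1H2M3S`. Whether the script stores that raw string or a converted value is not stated here, so the following is only a sketch of one way to turn it into seconds; `duration_to_seconds` is a hypothetical helper name.

```python
import re

# Matches durations such as PT15M33S, PT1H2M3S or P1DT2H (days and time parts).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value: str) -> int:
    """Convert an ISO 8601 duration (days/hours/minutes/seconds) to seconds."""
    match = _DURATION_RE.match(value)
    if not match:
        raise ValueError(f"Unsupported duration: {value!r}")
    # Missing components come back as None and count as zero.
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT1H2M3S"))  # 3723
```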

### Comments (per video)
- Comment ID
- Author name and channel URL
- Comment text
- Like count
- Reply count
- Published date

### Transcripts (when available)
- Full transcript text
- Language
- Transcript type (manual or auto-generated)

## Provenance Tracking

All extracted data includes full provenance:

```yaml
youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
```
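
To make the provenance fields concrete, the sketch below assembles the top-level record as a plain dictionary. It assumes that `channel` and `videos` are nested under `youtube_enrichment`, and `build_enrichment` is a hypothetical name; the real script may build this structure differently.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

API_ENDPOINT = "https://www.googleapis.com/youtube/v3"


def build_enrichment(
    source_url: str,
    channel: Dict[str, Any],
    videos: List[Dict[str, Any]],
    status: str = "SUCCESS",
    fetched_at: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Wrap fetched data with the provenance fields shown above."""
    # Use a timezone-aware UTC timestamp so the offset is recorded explicitly.
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "youtube_enrichment": {
            "source_url": source_url,
            "fetch_timestamp": fetched_at.isoformat(timespec="seconds"),
            "api_endpoint": API_ENDPOINT,
            "api_version": "v3",
            "status": status,
            "channel": channel,
            "videos": videos,
        }
    }
```

The resulting dictionary can be merged into the loaded entry and written back with `yaml.safe_dump` from the `pyyaml` dependency.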

## API Quota

The YouTube Data API has a default daily quota of **10,000 units**:

| Operation | Cost |
|-----------|------|
| Channel info | 1 unit |
| Video list | 1 unit |
| Video details | 1 unit per video |
| Comments | 1 unit per 100 comments |
| Search | 100 units |

**Estimated usage per custodian**: 15-50 units (depending on videos/comments)

For 100 custodians: ~1,500-5,000 units (well within daily quota)
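
The arithmetic behind these estimates can be sketched as follows, using only the per-operation costs from the table. The function names are hypothetical, the Search operation is left out, and the figures are estimates rather than measured usage.

```python
import math

DAILY_QUOTA = 10_000  # default daily quota in units


def estimate_units(videos: int, comments_per_video: int) -> int:
    """Estimate quota units for one custodian from the cost table above."""
    channel_info = 1
    video_list = 1
    video_details = videos  # 1 unit per video
    # 1 unit per 100 comments, rounded up for each video.
    comment_units = videos * math.ceil(comments_per_video / 100)
    return channel_info + video_list + video_details + comment_units


def custodians_per_day(units_per_custodian: int) -> int:
    """How many custodians fit into one day's quota at a given cost."""
    return DAILY_QUOTA // units_per_custodian


units = estimate_units(videos=10, comments_per_video=100)
print(units)                      # 22
print(custodians_per_day(units))  # 454
```

With 10 videos and up to 100 comments each, one custodian costs about 22 units, which falls inside the 15-50 range quoted above.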

## Troubleshooting

### "API key not valid"
- Check that the API key is correct
- Verify that YouTube Data API v3 is enabled
- Check that the key isn't restricted to the wrong APIs

### "Quota exceeded"
- Wait until the next day (quota resets at midnight Pacific Time)
- Or request a quota increase in Google Cloud Console

### "Channel not found"
- The channel may have been deleted
- The URL format may not be recognized
- Try using the channel ID directly

### "Comments disabled"
- Some videos have comments disabled
- The script handles this gracefully

### "No transcript available"
- Not all videos have captions
- Auto-generated captions may not be available for all languages
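
Several of these problems can be told apart from the JSON error body that the API returns. The sketch below assumes the usual Google API error shape (an `error` object with a `message` and a list of `errors`, each carrying a `reason`) and the reason strings `quotaExceeded` and `commentsDisabled`; `classify_api_error` and its return labels are invented for this example and are not taken from the script.

```python
from typing import Any, Dict


def classify_api_error(body: Dict[str, Any]) -> str:
    """Map an API error response body to one of the cases listed above."""
    error = body.get("error", {})
    reasons = {item.get("reason") for item in error.get("errors", [])}
    message = error.get("message", "")

    if "quotaExceeded" in reasons:
        return "quota_exceeded"      # retry after the daily reset
    if "commentsDisabled" in reasons:
        return "comments_disabled"   # skip comments, keep the video
    if "API key not valid" in message:
        return "invalid_api_key"     # check the key and its restrictions
    return "unknown"


sample = {
    "error": {
        "code": 403,
        "message": "The request cannot be completed.",
        "errors": [{"reason": "quotaExceeded"}],
    }
}
print(classify_api_error(sample))  # quota_exceeded
```

Treating disabled comments as a skippable case, rather than a failure, matches the graceful handling described above.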

## Architecture

```
Entry YAML file
    ↓
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
    ↓
Resolve channel ID (handle → channel ID)
    ↓
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
    ↓
Fetch via yt-dlp:
  - Transcripts/captions
    ↓
Add youtube_enrichment section
Update provenance
    ↓
Save YAML file
```
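
The "resolve channel ID" step depends on which URL form an entry contains. As a minimal sketch, the function below maps three common forms to the `channels.list` filter parameter that could be used to look the channel up (`id`, `forUsername`, or `forHandle`). The function name is hypothetical, and other forms (such as `/c/` custom URLs) are deliberately rejected here; the script's actual resolution logic may differ.

```python
from typing import Tuple
from urllib.parse import urlparse


def classify_channel_url(url: str) -> Tuple[str, str]:
    """Return (channels.list parameter, value) for a YouTube channel URL."""
    path = urlparse(url).path.strip("/")
    parts = path.split("/") if path else []

    if len(parts) >= 2 and parts[0] == "channel":
        return ("id", parts[1])            # /channel/UC... is already an ID
    if len(parts) >= 2 and parts[0] == "user":
        return ("forUsername", parts[1])   # legacy /user/NAME form
    if parts and parts[0].startswith("@"):
        return ("forHandle", parts[0])     # /@handle form
    raise ValueError(f"Unrecognized YouTube channel URL: {url!r}")


print(classify_channel_url("https://www.youtube.com/user/TUApeldoorn"))
# ('forUsername', 'TUApeldoorn')
```

Raising on unknown forms keeps unrecognized URLs visible, which corresponds to the "Channel not found" case in the troubleshooting section.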

## Related Scripts

- `scripts/enrich_wikidata.py` - Wikidata enrichment
- `scripts/enrich_google_maps.py` - Google Maps enrichment
- `scripts/fetch_website_playwright.py` - Website archiving
- `mcp_servers/social_media/server.py` - MCP server for social media

## Future Enhancements

- [ ] Track channel subscriber growth over time
- [ ] Extract video chapters/timestamps
- [ ] Analyze video categories and topics
- [ ] Cross-reference with other social media
- [ ] Detect playlists relevant to heritage