

YouTube Enrichment for Heritage Custodians

This document explains how to enrich heritage custodian entries with YouTube channel and video data.

Prerequisites

1. Get a YouTube API Key

1. Go to Google Cloud Console
   - Visit: https://console.cloud.google.com/
2. Create or Select a Project
   - Click on the project dropdown at the top
   - Click "New Project" or select an existing one
   - Name it something like "GLAM YouTube Enrichment"
3. Enable YouTube Data API v3
   - Navigate to "APIs & Services" → "Library"
   - Search for "YouTube Data API v3"
   - Click on it and press Enable
4. Create API Credentials
   - Go to "APIs & Services" → "Credentials"
   - Click "Create Credentials" → "API Key"
   - Copy the generated API key
5. Restrict the API Key (Recommended)
   - Click on your new API key to edit it
   - Under "API restrictions", select "Restrict key"
   - Select only "YouTube Data API v3"
   - Click Save

2. Set Environment Variable

export YOUTUBE_API_KEY='your-api-key-here'

Or add it to your .env file:

YOUTUBE_API_KEY=your-api-key-here
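
A script that needs the key could look it up roughly as follows. This is a minimal sketch: the helper names `read_key_from_env_text` and `load_api_key` and the simple `KEY=value` parsing are assumptions for illustration, not necessarily what `scripts/enrich_youtube.py` does.

```python
import os
from pathlib import Path
from typing import Optional


def read_key_from_env_text(text: str, name: str = "YOUTUBE_API_KEY") -> Optional[str]:
    """Find NAME=value in .env-style text, ignoring comments and blank lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes, as in the export example.
            return raw.strip().strip("'\"") or None
    return None


def load_api_key(env_file: str = ".env", name: str = "YOUTUBE_API_KEY") -> Optional[str]:
    """Prefer the environment variable; fall back to the .env file if present."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if path.is_file():
        return read_key_from_env_text(path.read_text(encoding="utf-8"), name)
    return None
```

Checking the environment first means an exported key overrides whatever is stored in `.env`, which is convenient when testing with a temporary key.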

3. Install Dependencies

pip install httpx pyyaml

# For transcript extraction (optional but recommended)
brew install yt-dlp  # macOS
# or
pip install yt-dlp

Usage

Basic Usage

# Process all entries with YouTube URLs
python scripts/enrich_youtube.py

# Dry run (show what would be done)
python scripts/enrich_youtube.py --dry-run

# Process only first 10 entries
python scripts/enrich_youtube.py --limit 10

# Process a specific entry
python scripts/enrich_youtube.py --entry 0146_Q1663974.yaml
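
The three flags above could be wired up along these lines. This is a hedged sketch using `argparse`; `build_parser` and `select_entries` are hypothetical names, and the actual script may order or filter entries differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the --dry-run, --limit and --entry flags shown above."""
    parser = argparse.ArgumentParser(
        description="Enrich custodian entries with YouTube data"
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be done without writing files")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N entries")
    parser.add_argument("--entry", default=None,
                        help="process one entry file, e.g. 0146_Q1663974.yaml")
    return parser


def select_entries(filenames, limit=None, entry=None):
    """Apply --entry and then --limit to a sorted list of entry filenames."""
    names = sorted(filenames)
    if entry is not None:
        names = [name for name in names if name == entry]
    if limit is not None:
        names = names[:limit]
    return names
```

For example, `build_parser().parse_args(["--limit", "10"])` yields a namespace with `limit == 10` and `dry_run == False`.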

Example Output

Processing: 0146_Q1663974.yaml
  Found YouTube URL: https://www.youtube.com/user/TUApeldoorn
    Fetching channel info for UCxxxxx...
    Fetching 10 recent videos...
    Fetching comments for top videos...
    Fetching transcripts for videos with captions...
  Status: SUCCESS
    Channel: Theologische Universiteit Apeldoorn
    Subscribers: 1,234
    Videos fetched: 10

Data Collected

Video Information (per video)

Video ID and URL
Title and description
Published date
Duration
View count
Like count
Comment count
Tags
Thumbnail
Caption availability
Default language
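
The YouTube Data API reports a video's duration as an ISO 8601 string such as `PT15M33S`. If a numeric value is more convenient downstream, it could be converted as in the sketch below. The function name `duration_to_seconds` is illustrative, and whether the enrichment script stores the raw string or a converted value is not stated here.

```python
import re

# Matches durations such as PT45S, PT15M33S, PT1H2M3S and P1DT2H.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value: str) -> int:
    """Convert an ISO 8601 duration (days, hours, minutes, seconds) to seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {key: int(val) if val else 0 for key, val in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

The sketch deliberately rejects week, month and year designators, which are not expected for video lengths.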

Comments (per video)

Comment ID
Author name and channel URL
Comment text
Like count
Reply count
Published date

Transcripts (when available)

Full transcript text
Language
Transcript type (manual or auto-generated)
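
Transcripts are fetched with yt-dlp rather than through the Data API. One way to request captions without downloading the video is sketched below. The function name `build_ytdlp_command`, the language list, the VTT format and the output directory are all assumptions; how the script actually invokes yt-dlp is not documented here.

```python
def build_ytdlp_command(video_id, languages=("nl", "en"), output_dir="transcripts"):
    """Build a yt-dlp argument list that writes subtitles only."""
    return [
        "yt-dlp",
        "--skip-download",       # do not fetch the video itself
        "--write-subs",          # manually uploaded captions
        "--write-auto-subs",     # auto-generated captions
        "--sub-langs", ",".join(languages),
        "--sub-format", "vtt",
        "-o", f"{output_dir}/%(id)s.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Requesting both manual and auto-generated subtitles matches the two transcript types listed above.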

Provenance Tracking

All extracted data includes full provenance:

youtube_enrichment:
  source_url: https://www.youtube.com/user/TUApeldoorn
  fetch_timestamp: '2025-12-01T15:30:00+00:00'
  api_endpoint: https://www.googleapis.com/youtube/v3
  api_version: v3
  status: SUCCESS
  channel:
    channel_id: UCxxxxxxxxxxxxx
    channel_url: https://www.youtube.com/channel/UCxxxxxxxxxxxxx
    title: Theologische Universiteit Apeldoorn
    subscriber_count: 1234
    # ... more fields
  videos:
    - video_id: abc123xyz
      video_url: https://www.youtube.com/watch?v=abc123xyz
      title: Video Title
      view_count: 5678
      comments:
        - comment_id: xyz789
          text: Great video!
          like_count: 5
      transcript:
        transcript_text: "Full video transcript..."
        language: nl
        transcript_type: auto
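
The top-level provenance fields of such a section could be assembled as follows. This is an illustrative sketch only: `build_enrichment` is a hypothetical helper, and the `channel` and `videos` arguments are assumed to be already shaped like the example above.

```python
from datetime import datetime, timezone

API_ENDPOINT = "https://www.googleapis.com/youtube/v3"


def build_enrichment(source_url, channel, videos, status="SUCCESS", now=None):
    """Return a dict mirroring the youtube_enrichment layout shown above."""
    # Use a timezone-aware UTC timestamp truncated to whole seconds.
    fetched = (now or datetime.now(timezone.utc)).replace(microsecond=0)
    return {
        "source_url": source_url,
        "fetch_timestamp": fetched.isoformat(),
        "api_endpoint": API_ENDPOINT,
        "api_version": "v3",
        "status": status,
        "channel": channel,
        "videos": videos,
    }
```

The optional `now` argument exists so the timestamp can be fixed in tests. The returned dict can then be stored under the `youtube_enrichment` key of the entry and written with a YAML serializer such as pyyaml.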

API Quota

The YouTube Data API v3 has a default daily quota of 10,000 units per project:

Operation	Cost
Channel info	1 unit
Video list	1 unit
Video details	1 unit per video
Comments	1 unit per 100 comments
Search	100 units

Estimated usage per custodian: 15-50 units (depending on the number of videos and comments fetched)

For 100 custodians: ~1,500-5,000 units (well within the daily quota)
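
The estimate can be reproduced from the cost table with a few lines of arithmetic. The sketch below uses the table's unit costs; the function names and the assumption that comments are fetched for every video are illustrative.

```python
import math

DAILY_QUOTA = 10_000


def estimate_units(videos: int, comments_per_video: int, searches: int = 0) -> int:
    """Estimate quota units for one custodian using the cost table above."""
    channel_info = 1
    video_list = 1
    video_details = videos                                  # 1 unit per video
    comment_pages = videos * math.ceil(comments_per_video / 100)
    return channel_info + video_list + video_details + comment_pages + 100 * searches


def custodians_per_day(units_per_custodian: int) -> int:
    """How many custodians fit into the daily quota at a given cost each."""
    return DAILY_QUOTA // units_per_custodian
```

With 10 videos and up to 100 comments each, `estimate_units(10, 100)` gives 22 units, so roughly 454 custodians fit into one day's quota. A single search call adds 100 units, which is why resolving channels without search keeps usage low.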

Troubleshooting

"API key not valid"

Check that the API key is correct
Verify YouTube Data API v3 is enabled
Check that the key isn't restricted to the wrong APIs

"Quota exceeded"

Wait until the quota resets (midnight Pacific Time)
Alternatively, request a quota increase in the Google Cloud Console

"Channel not found"

The channel may have been deleted
The URL format may not be recognized
Try using the channel ID directly
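
To check which URL form an entry uses, a small classifier like the one below can help. It is a sketch under stated assumptions: `classify_channel_url` is a hypothetical helper, the channel ID pattern (`UC` followed by 22 characters) reflects the commonly observed format, and the formats the script itself recognizes may differ.

```python
import re
from urllib.parse import urlparse

_CHANNEL_ID_RE = re.compile(r"^UC[0-9A-Za-z_-]{22}$")
_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def classify_channel_url(url: str):
    """Return (kind, value); kind is 'channel_id', 'handle', 'user', 'custom' or None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in _HOSTS:
        return (None, None)
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return (None, None)
    first = segments[0]
    if first == "channel" and len(segments) > 1 and _CHANNEL_ID_RE.match(segments[1]):
        return ("channel_id", segments[1])
    if first.startswith("@") and len(first) > 1:
        return ("handle", first[1:])
    if first == "user" and len(segments) > 1:
        return ("user", segments[1])
    if first == "c" and len(segments) > 1:
        return ("custom", segments[1])
    return (None, None)
```

Only the `channel_id` form can be used with the API directly; the other forms need a resolution step first. A URL without a scheme (for example `www.youtube.com/user/...`) is returned as unrecognized by this sketch.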

"Comments disabled"

Some videos have comments disabled
The script handles this gracefully
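
One way to handle this case is to inspect the error body instead of failing the whole entry. The sketch below assumes the API reports disabled comments as an HTTP 403 whose error list contains the reason `commentsDisabled`; the function name is hypothetical and the script's own handling may differ.

```python
def is_comments_disabled(status_code: int, payload: dict) -> bool:
    """Return True if an API error response indicates comments are disabled."""
    if status_code != 403:
        return False
    errors = payload.get("error", {}).get("errors", [])
    return any(err.get("reason") == "commentsDisabled" for err in errors)
```

When this returns True, the caller can record an empty comment list for that video and continue with the next one.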

"No transcript available"

Not all videos have captions
Auto-generated captions may not be available for all languages

Architecture

Entry YAML file
    ↓
Find YouTube URL from:
  - web_claims.social_youtube
  - wikidata_enrichment.P2397
    ↓
Resolve channel ID (handle → channel ID)
    ↓
Fetch via YouTube Data API v3:
  - Channel info
  - Recent videos
  - Video details
  - Comments
    ↓
Fetch via yt-dlp:
  - Transcripts/captions
    ↓
Add youtube_enrichment section
Update provenance
    ↓
Save YAML file
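
The first step of the flow, locating a YouTube reference in an entry, might look like the sketch below. It assumes the entry is loaded as nested dicts, that `web_claims.social_youtube` holds a URL, and that `wikidata_enrichment.P2397` holds a plain channel ID string (P2397 is the Wikidata property for a YouTube channel ID). The function name and the lookup order are illustrative.

```python
def find_youtube_source(entry: dict):
    """Return (field_path, url) for the first YouTube reference found, else None."""
    web_claims = entry.get("web_claims") or {}
    url = web_claims.get("social_youtube")
    if url:
        return ("web_claims.social_youtube", url)

    wikidata = entry.get("wikidata_enrichment") or {}
    channel_id = wikidata.get("P2397")
    if channel_id:
        # A channel ID maps directly to a canonical channel URL.
        return ("wikidata_enrichment.P2397",
                f"https://www.youtube.com/channel/{channel_id}")
    return None
```

Returning the field path along with the URL makes it easy to record in the provenance section where the link came from.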

Related Scripts

scripts/enrich_wikidata.py - Wikidata enrichment
scripts/enrich_google_maps.py - Google Maps enrichment
scripts/fetch_website_playwright.py - Website archiving
mcp_servers/social_media/server.py - MCP server for social media

Future Enhancements

Track channel subscriber growth over time
Extract video chapters/timestamps
Analyze video categories and topics
Cross-reference with other social media
Detect playlists relevant to heritage
