
ScriptSmith/sponsorblock-youtube-metadata-2024

Hugging Face | Updated 2026-01-29 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ScriptSmith/sponsorblock-youtube-metadata-2024
Description:
---
pretty_name: SponsorBlock YouTube Metadata
license: cc-by-4.0
language:
- en
- ru
- hi
- pl
- pt
- de
- fr
- es
- vi
- multilingual
annotations_creators:
- machine-generated
language_creators:
- machine-generated
source_datasets:
- original
task_categories:
- text-classification
- feature-extraction
- text-generation
tags:
- youtube
- video-metadata
- sponsorblock
- subtitles
- engagement
- viewer-behavior
- captions
- transcripts
size_categories:
- 10M<n<100M
configs:
- config_name: video_metadata
  data_files:
  - split: train
    path: video_metadata/*.parquet
  default: true
- config_name: subtitles
  data_files:
  - split: train
    path: subtitles/*.parquet
- config_name: channel_playlists
  data_files:
  - split: train
    path: channel_playlists/*.parquet
- config_name: heatmaps
  data_files:
  - split: train
    path: heatmaps/*.parquet
- config_name: live_chat
  data_files:
  - split: train
    path: live_chat/*.parquet
dataset_info:
- config_name: video_metadata
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: fulltitle
    dtype: string
  - name: description
    dtype: string
  - name: upload_date
    dtype: string
  - name: timestamp
    dtype: int64
  - name: duration
    dtype: int64
  - name: view_count
    dtype: int64
  - name: like_count
    dtype: int64
  - name: comment_count
    dtype: int64
  - name: channel
    dtype: string
  - name: channel_id
    dtype: string
  - name: channel_url
    dtype: string
  - name: channel_follower_count
    dtype: int64
  - name: channel_is_verified
    dtype: bool
  - name: uploader
    dtype: string
  - name: uploader_id
    dtype: string
  - name: width
    dtype: int64
  - name: height
    dtype: int64
  - name: fps
    dtype: float64
  - name: vcodec
    dtype: string
  - name: acodec
    dtype: string
  - name: dynamic_range
    dtype: string
  - name: aspect_ratio
    dtype: float64
  - name: age_limit
    dtype: int64
  - name: availability
    dtype: string
  - name: live_status
    dtype: string
  - name: was_live
    dtype: bool
  - name: is_live
    dtype: bool
  - name: language
    dtype: string
  - name: playable_in_embed
    dtype: bool
  - name: media_type
    dtype: string
  - name: license
    dtype: string
  - name: location
    dtype: string
  - name: categories
    dtype: string
  - name: tags
    dtype: string
  - name: thumbnail
    dtype: string
  - name: chapters_json
    dtype: string
  splits:
  - name: train
    num_examples: 154536
- config_name: subtitles
  features:
  - name: video_id
    dtype: string
  - name: language
    dtype: string
  - name: full_text
    dtype: string
  - name: segments_json
    dtype: string
  splits:
  - name: train
    num_examples: 62819
- config_name: channel_playlists
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: description
    dtype: string
  - name: duration
    dtype: int64
  - name: view_count
    dtype: int64
  - name: timestamp
    dtype: int64
  - name: upload_date
    dtype: string
  - name: live_status
    dtype: string
  - name: availability
    dtype: string
  - name: channel
    dtype: string
  - name: channel_id
    dtype: string
  - name: channel_url
    dtype: string
  - name: channel_is_verified
    dtype: bool
  - name: uploader
    dtype: string
  - name: uploader_id
    dtype: string
  - name: playlist
    dtype: string
  - name: playlist_id
    dtype: string
  - name: playlist_title
    dtype: string
  - name: playlist_count
    dtype: int64
  - name: playlist_index
    dtype: int64
  - name: playlist_channel
    dtype: string
  - name: playlist_channel_id
    dtype: string
  - name: thumbnail
    dtype: string
  splits:
  - name: train
    num_examples: 20800000
- config_name: heatmaps
  features:
  - name: video_id
    dtype: string
  - name: start_time
    dtype: float64
  - name: end_time
    dtype: float64
  - name: value
    dtype: float64
  splits:
  - name: train
    num_examples: 15000000
- config_name: live_chat
  features:
  - name: video_id
    dtype: string
  - name: message_id
    dtype: string
  - name: timestamp_usec
    dtype: int64
  - name: video_offset_msec
    dtype: int64
  - name: author_name
    dtype: string
  - name: author_channel_id
    dtype: string
  - name: message
    dtype: string
  - name: message_type
    dtype: string
  splits:
  - name: train
    num_examples: 5000000
---

# SponsorBlock YouTube Metadata Dataset

A dataset of YouTube video metadata collected from a subset of videos in the [SponsorBlock](https://sponsor.ajay.app/) database. This dataset contains metadata, subtitles, engagement heatmaps, live chat, and channel playlist information for popular YouTube videos.

## Quick Stats

| Metric | Value |
|--------|-------|
| Total videos | 154,536 |
| Videos with subtitles | 62,819 (41%) |
| Videos with heatmaps | ~92% |
| Videos with chapters | ~32% |
| Livestream recordings | 4,300 |
| Unique channels | 20,061 |
| Channel playlist entries | ~20.8M |
| Average video duration | 28.6 minutes |
| Average view count | 1.6M views |

## Collection Period

| Data Type | Collection Date | Content Date Range |
|-----------|-----------------|-------------------|
| Video metadata | Aug 9 - Sep 30, 2025 | May 2019 - Dec 2024 |
| Channel playlists | Oct 2-5, 2025 | Varies by channel |

**Note**: All metrics (view counts, likes, subscriber counts) are snapshots from the collection date and do not represent historical data.

## Data Source

Videos were selected from a **subset** of the [SponsorBlock](https://sponsor.ajay.app/) database. SponsorBlock is a crowdsourced browser extension for skipping sponsor segments. The subset was created by sorting all videos by their SponsorBlock **vote count** (total upvotes on submitted segments) and selecting the top ~167K video IDs.

This selection method biases the dataset toward:

- **Highly-engaged SponsorBlock content**: videos with many user-submitted and upvoted segments
- **Popular videos** with large, active communities
- **Gaming, tech, and entertainment** content (common SponsorBlock categories)
- Videos with **sponsor segments** (brand deals, mid-roll ads)
- **English-language** content primarily, with significant Russian, Hindi, and European language representation

**Note**: This dataset does not represent a random sample of YouTube or even a random sample of SponsorBlock. Videos were specifically selected for having high engagement within the SponsorBlock community.

### Category Distribution

| Category | Percentage |
|----------|------------|
| Gaming | 15% |
| Entertainment | 11% |
| People & Blogs | 8% |
| Science & Technology | 6% |
| Education | 5% |
| Other | 55% |

### Language Distribution

| Language | Percentage |
|----------|------------|
| English (en, en-US) | 25% |
| Russian | 16% |
| Hindi | 2.5% |
| Polish | 2% |
| Other | 54.5% |

## Dataset Configurations

### `video_metadata`

Full metadata for 154,536 individual YouTube videos.

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | YouTube video ID (11 characters) |
| `title` | string | Video title |
| `fulltitle` | string | Full video title |
| `description` | string | Video description (may be truncated) |
| `upload_date` | string | Upload date (YYYYMMDD format) |
| `timestamp` | int64 | Unix timestamp of upload |
| `duration` | int64 | Duration in seconds |
| `view_count` | int64 | Number of views at collection time |
| `like_count` | int64 | Number of likes at collection time |
| `comment_count` | int64 | Number of comments at collection time |
| `channel` | string | Channel display name |
| `channel_id` | string | YouTube channel ID (24 characters, starts with UC) |
| `channel_url` | string | Full channel URL |
| `channel_follower_count` | int64 | Subscriber count at collection time |
| `channel_is_verified` | bool | Whether channel has verification badge |
| `uploader` | string | Uploader name (usually same as channel) |
| `uploader_id` | string | Uploader handle (@username) |
| `width` | int64 | Video width in pixels |
| `height` | int64 | Video height in pixels |
| `fps` | float64 | Frames per second |
| `vcodec` | string | Video codec (e.g., vp9, av01, avc1) |
| `acodec` | string | Audio codec (e.g., opus, mp4a) |
| `dynamic_range` | string | SDR or HDR |
| `aspect_ratio` | float64 | Width/height ratio |
| `age_limit` | int64 | Age restriction (0 = none, 18 = adult) |
| `availability` | string | public, unlisted, or private |
| `live_status` | string | not_live, is_live, was_live, is_upcoming |
| `was_live` | bool | True if video was a livestream |
| `is_live` | bool | True if currently live |
| `language` | string | Detected video language (ISO 639-1) |
| `playable_in_embed` | bool | Whether embedding is allowed |
| `categories` | string | JSON array of YouTube categories |
| `tags` | string | JSON array of video tags |
| `thumbnail` | string | URL of highest quality thumbnail |
| `chapters_json` | string | JSON array of `{title, start_time, end_time}` |

### `subtitles`

Extracted subtitles for 62,819 videos (41% coverage).

| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `language` | string | Subtitle language code |
| `full_text` | string | Concatenated plain text of all subtitles |
| `segments_json` | string | JSON array of `{start, end, text}` segments |

**Note**: These are auto-generated captions from YouTube, not human-created subtitles.

### `channel_playlists`

Video entries from 20,061 channel upload playlists (~20.8M total entries).

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | YouTube video ID |
| `title` | string | Video title |
| `description` | string | Video description |
| `duration` | int64 | Duration in seconds |
| `view_count` | int64 | Number of views |
| `timestamp` | int64 | Unix timestamp |
| `upload_date` | string | Upload date (YYYYMMDD) |
| `channel` | string | Channel name |
| `channel_id` | string | Channel ID |
| `playlist` | string | Playlist name (usually "Uploads from [Channel]") |
| `playlist_id` | string | Playlist ID |
| `playlist_count` | int64 | Total videos in channel's upload playlist |
| `playlist_index` | int64 | Position in playlist (1 = newest) |
| `thumbnail` | string | Thumbnail URL |

**Use cases**: Channel growth analysis, upload frequency patterns, content strategy analysis.

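As one illustration of the upload frequency use case, the sketch below estimates a channel's typical gap between uploads from its `upload_date` values. The function name `median_upload_gap_days` and the sample dates are hypothetical and not part of the dataset; the code assumes only that dates are YYYYMMDD strings, as the column tables state, and skips values that are missing or malformed.

```python
from datetime import datetime
from statistics import median


def median_upload_gap_days(upload_dates):
    """Median number of days between consecutive uploads.

    `upload_dates` is an iterable of YYYYMMDD strings, as stored in the
    `upload_date` column. Missing or malformed values are skipped.
    Returns None when fewer than two valid dates are available.
    """
    days = []
    for raw in upload_dates:
        try:
            days.append(datetime.strptime(raw, "%Y%m%d").date())
        except (TypeError, ValueError):
            continue  # None or a string that is not a valid YYYYMMDD date
    days.sort()
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return median(gaps) if gaps else None


# Hypothetical upload dates for a single channel (not real dataset rows).
sample = ["20240115", "20240101", None, "20240108", "20240129"]
print(median_upload_gap_days(sample))  # 7
```

In practice the dates would be grouped by `channel_id` first. The median is used rather than the mean so that a single long hiatus does not dominate the estimate.
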
### `heatmaps`

Viewer engagement heatmaps showing which parts of videos are most watched. ~92% of videos have heatmap data, with ~100 segments per video.

| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `start_time` | float64 | Segment start in seconds |
| `end_time` | float64 | Segment end in seconds |
| `value` | float64 | Relative engagement score (0.0 - 1.0) |

**How to interpret**: Values represent relative rewatch/skip behavior. Higher values indicate segments that viewers replay or don't skip. Values near 0 indicate frequently skipped segments (often intros, sponsor reads).

**Use cases**: Identifying engaging content patterns, sponsor segment detection, intro/outro detection, content quality analysis.

### `live_chat`

Chat messages from 4,300 livestream recordings.

| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `message_id` | string | Unique message ID |
| `timestamp_usec` | int64 | Message timestamp in microseconds |
| `video_offset_msec` | int64 | Video playback time when message appeared |
| `author_name` | string | Display name of message author |
| `author_channel_id` | string | Author's YouTube channel ID |
| `message` | string | Message text content |
| `message_type` | string | text, paid (Super Chat), membership, gift |

**Use cases**: Chat sentiment analysis, community engagement patterns, Super Chat analysis.

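A minimal sketch of the community engagement use case: bucketing chat messages by minute of playback using `video_offset_msec`. The helper `messages_per_minute` and the sample rows are hypothetical. Two assumptions are worth checking against the data: that `message_type` holds the literal strings listed in the table (for example `paid`), and that rows with a missing or negative offset can safely be skipped.

```python
from collections import Counter


def messages_per_minute(rows, message_type=None):
    """Count chat messages per minute of video playback.

    `rows` is an iterable of dicts with `video_offset_msec` and
    `message_type` keys (columns of the `live_chat` config). Rows with a
    missing or negative offset are skipped, which is an assumption about
    how such rows should be treated. If `message_type` is given, only
    rows with that exact type are counted.
    """
    counts = Counter()
    for row in rows:
        offset = row.get("video_offset_msec")
        if offset is None or offset < 0:
            continue
        if message_type is not None and row.get("message_type") != message_type:
            continue
        counts[offset // 60_000] += 1  # 60,000 ms per minute
    return dict(sorted(counts.items()))


# Hypothetical rows for one livestream (not real dataset rows).
sample_rows = [
    {"video_offset_msec": 5_000, "message_type": "text"},
    {"video_offset_msec": 59_000, "message_type": "paid"},
    {"video_offset_msec": 61_000, "message_type": "text"},
    {"video_offset_msec": 125_000, "message_type": "text"},
    {"video_offset_msec": None, "message_type": "text"},
]
print(messages_per_minute(sample_rows))          # {0: 2, 1: 1, 2: 1}
print(messages_per_minute(sample_rows, "paid"))  # {0: 1}
```

Rows should be filtered to a single `video_id` before calling the helper, since offsets are relative to each video's own timeline.
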
## Raw Data Archives

In addition to the processed parquet files, the complete raw yt-dlp output is available as compressed archives in the `raw/` directory:

| Archive | Contents | Compressed | Uncompressed |
|---------|----------|------------|--------------|
| `video_metadata.tar.zst` | 154K `.info.json` files | 1.6 GB | 56 GB |
| `subtitles.tar.zst` | 62K `.vtt` files | 2.5 GB | 12 GB |
| `live_chat.tar.zst` | 4K `.live_chat.json` files | 897 MB | 14 GB |
| `channel_playlists.tar.zst` | 20K channel `.json` files | 3.4 GB | 48 GB |
| **Total** | | **8.2 GB** | **130 GB** |

### Extracting Raw Data

```bash
# Extract with zstd
tar -I zstd -xvf raw/video_metadata.tar.zst

# Or with explicit zstd command
zstd -d raw/video_metadata.tar.zst -c | tar -xvf -
```

### Additional Fields in Raw Data

The raw JSON files contain fields not included in the parquet files:

- `formats` - All available streaming URLs and quality options
- `thumbnails` - All thumbnail resolutions
- `automatic_captions` - Caption availability for 150+ languages
- `subtitles` - Manual subtitle availability
- `requested_formats` - Selected format details
- `http_headers` - Request headers used

## Usage Examples

### Basic Loading

```python
from datasets import load_dataset

# Load video metadata
videos = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "video_metadata")
print(f"Loaded {len(videos['train']):,} videos")

# Load with streaming for large configs
playlists = load_dataset(
    "ScriptSmith/sponsorblock-youtube-metadata-2024",
    "channel_playlists",
    streaming=True
)
```

### Analyzing Engagement Patterns

```python
from datasets import load_dataset

videos = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "video_metadata")
heatmaps = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "heatmaps")

# Find videos with high early drop-off (bad intros):
# average the heatmap values of segments starting in the first 30 seconds,
# then join the per-video score to the metadata on the video ID.
early = heatmaps["train"].to_pandas().query("start_time < 30")
intro_score = early.groupby("video_id")["value"].mean().rename("intro_score")
meta = videos["train"].to_pandas()[["id", "title"]]
worst = meta.merge(intro_score, left_on="id", right_index=True)
print(worst.nsmallest(10, "intro_score"))
```

### Working with Subtitles

```python
import json
from datasets import load_dataset

subtitles = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "subtitles")

# Parse segments
for row in subtitles['train']:
    segments = json.loads(row['segments_json'])
    for seg in segments:
        print(f"{seg['start']:.1f}s: {seg['text']}")
```

### Channel Analysis

```python
from datasets import load_dataset
from collections import Counter

# Stream the config to avoid materializing ~20.8M rows in memory
playlists = load_dataset(
    "ScriptSmith/sponsorblock-youtube-metadata-2024",
    "channel_playlists",
    streaming=True
)

# Count videos per channel
channel_counts = Counter(row['channel_id'] for row in playlists['train'])
print(f"Most prolific: {channel_counts.most_common(10)}")
```

## Known Limitations

1. **Temporal snapshot**: All metrics are from collection time (Aug-Oct 2025) and don't reflect current values or historical trends.
2. **Selection bias**: Videos come from SponsorBlock, biasing toward sponsored content and certain genres (gaming, tech).
3. **Missing data**: Coverage is incomplete. Only 41% of videos have subtitles, ~92% have heatmaps, and ~32% have chapters.
4. **Deleted content**: Some videos may have been deleted or made private since collection.
5. **Auto-captions**: Subtitles are YouTube's auto-generated captions, which may contain transcription errors.
6. **Truncated descriptions**: Very long descriptions may be truncated.
7. **Regional availability**: Some videos may be region-locked; metadata reflects availability from the collection location.

## License

This dataset is released under **CC-BY-4.0**. The underlying video content belongs to the respective creators. This dataset contains only metadata, not the actual video/audio content.

## Citation

```bibtex
@dataset{sponsorblock_youtube_metadata_2025,
  title={SponsorBlock YouTube Metadata Dataset},
  author={Smith, Adam},
  year={2025},
  url={https://huggingface.co/datasets/ScriptSmith/sponsorblock-youtube-metadata-2024},
  note={Metadata for 154K YouTube videos from the SponsorBlock database}
}
```

## Acknowledgments

- [SponsorBlock](https://sponsor.ajay.app/) for the video ID database
- [yt-dlp](https://github.com/yt-dlp/yt-dlp) for metadata extraction
- YouTube creators whose content is represented in this dataset

Provider:
ScriptSmith