ScriptSmith/sponsorblock-youtube-metadata-2024
Source: Hugging Face. Updated 2026-01-29; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ScriptSmith/sponsorblock-youtube-metadata-2024
---
pretty_name: SponsorBlock YouTube Metadata
license: cc-by-4.0
language:
- en
- ru
- hi
- pl
- pt
- de
- fr
- es
- vi
- multilingual
annotations_creators:
- machine-generated
language_creators:
- machine-generated
source_datasets:
- original
task_categories:
- text-classification
- feature-extraction
- text-generation
tags:
- youtube
- video-metadata
- sponsorblock
- subtitles
- engagement
- viewer-behavior
- captions
- transcripts
size_categories:
- 10M<n<100M
configs:
- config_name: video_metadata
  data_files:
  - split: train
    path: video_metadata/*.parquet
  default: true
- config_name: subtitles
  data_files:
  - split: train
    path: subtitles/*.parquet
- config_name: channel_playlists
  data_files:
  - split: train
    path: channel_playlists/*.parquet
- config_name: heatmaps
  data_files:
  - split: train
    path: heatmaps/*.parquet
- config_name: live_chat
  data_files:
  - split: train
    path: live_chat/*.parquet
dataset_info:
- config_name: video_metadata
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: fulltitle
    dtype: string
  - name: description
    dtype: string
  - name: upload_date
    dtype: string
  - name: timestamp
    dtype: int64
  - name: duration
    dtype: int64
  - name: view_count
    dtype: int64
  - name: like_count
    dtype: int64
  - name: comment_count
    dtype: int64
  - name: channel
    dtype: string
  - name: channel_id
    dtype: string
  - name: channel_url
    dtype: string
  - name: channel_follower_count
    dtype: int64
  - name: channel_is_verified
    dtype: bool
  - name: uploader
    dtype: string
  - name: uploader_id
    dtype: string
  - name: width
    dtype: int64
  - name: height
    dtype: int64
  - name: fps
    dtype: float64
  - name: vcodec
    dtype: string
  - name: acodec
    dtype: string
  - name: dynamic_range
    dtype: string
  - name: aspect_ratio
    dtype: float64
  - name: age_limit
    dtype: int64
  - name: availability
    dtype: string
  - name: live_status
    dtype: string
  - name: was_live
    dtype: bool
  - name: is_live
    dtype: bool
  - name: language
    dtype: string
  - name: playable_in_embed
    dtype: bool
  - name: media_type
    dtype: string
  - name: license
    dtype: string
  - name: location
    dtype: string
  - name: categories
    dtype: string
  - name: tags
    dtype: string
  - name: thumbnail
    dtype: string
  - name: chapters_json
    dtype: string
  splits:
  - name: train
    num_examples: 154536
- config_name: subtitles
  features:
  - name: video_id
    dtype: string
  - name: language
    dtype: string
  - name: full_text
    dtype: string
  - name: segments_json
    dtype: string
  splits:
  - name: train
    num_examples: 62819
- config_name: channel_playlists
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: description
    dtype: string
  - name: duration
    dtype: int64
  - name: view_count
    dtype: int64
  - name: timestamp
    dtype: int64
  - name: upload_date
    dtype: string
  - name: live_status
    dtype: string
  - name: availability
    dtype: string
  - name: channel
    dtype: string
  - name: channel_id
    dtype: string
  - name: channel_url
    dtype: string
  - name: channel_is_verified
    dtype: bool
  - name: uploader
    dtype: string
  - name: uploader_id
    dtype: string
  - name: playlist
    dtype: string
  - name: playlist_id
    dtype: string
  - name: playlist_title
    dtype: string
  - name: playlist_count
    dtype: int64
  - name: playlist_index
    dtype: int64
  - name: playlist_channel
    dtype: string
  - name: playlist_channel_id
    dtype: string
  - name: thumbnail
    dtype: string
  splits:
  - name: train
    num_examples: 20800000
- config_name: heatmaps
  features:
  - name: video_id
    dtype: string
  - name: start_time
    dtype: float64
  - name: end_time
    dtype: float64
  - name: value
    dtype: float64
  splits:
  - name: train
    num_examples: 15000000
- config_name: live_chat
  features:
  - name: video_id
    dtype: string
  - name: message_id
    dtype: string
  - name: timestamp_usec
    dtype: int64
  - name: video_offset_msec
    dtype: int64
  - name: author_name
    dtype: string
  - name: author_channel_id
    dtype: string
  - name: message
    dtype: string
  - name: message_type
    dtype: string
  splits:
  - name: train
    num_examples: 5000000
---
# SponsorBlock YouTube Metadata Dataset
A dataset of YouTube video metadata collected from a subset of videos in the [SponsorBlock](https://sponsor.ajay.app/) database. This dataset contains metadata, subtitles, engagement heatmaps, live chat, and channel playlist information for popular YouTube videos.
## Quick Stats
| Metric | Value |
|--------|-------|
| Total videos | 154,536 |
| Videos with subtitles | 62,819 (41%) |
| Videos with heatmaps | ~92% |
| Videos with chapters | ~32% |
| Livestream recordings | 4,300 |
| Unique channels | 20,061 |
| Channel playlist entries | ~20.8M |
| Average video duration | 28.6 minutes |
| Average view count | 1.6M views |
## Collection Period
| Data Type | Collection Date | Content Date Range |
|-----------|-----------------|-------------------|
| Video metadata | Aug 9 - Sep 30, 2025 | May 2019 - Dec 2024 |
| Channel playlists | Oct 2-5, 2025 | Varies by channel |
**Note**: All metrics (view counts, likes, subscriber counts) are snapshots from the collection date and do not represent historical data.
## Data Source
Videos were selected from a **subset** of the [SponsorBlock](https://sponsor.ajay.app/) database, the crowdsourced database behind the SponsorBlock browser extension for skipping sponsor segments. The subset was created by sorting all videos by their SponsorBlock **vote count** (total upvotes on submitted segments) and selecting the top ~167K video IDs. This selection method biases the dataset toward:
- **Highly-engaged SponsorBlock content**: videos with many user-submitted and upvoted segments
- **Popular videos** with large, active communities
- **Gaming, tech, and entertainment** content (common SponsorBlock categories)
- Videos with **sponsor segments** (brand deals, mid-roll ads)
- **English-language** content primarily, with significant Russian, Hindi, and European language representation
**Note**: This dataset does not represent a random sample of YouTube or even a random sample of SponsorBlock. Videos were specifically selected for having high engagement within the SponsorBlock community.
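
The selection step described above (rank videos by total segment votes, keep the top N) can be sketched as follows. This is an illustration only, not the script used to build the dataset: the `(video_id, votes)` pair layout, the function name, and the rule of summing votes across a video's segments are assumptions.

```python
from collections import defaultdict
import heapq

def top_videos_by_votes(segments, k):
    """Return the k video IDs with the highest total segment votes.

    `segments` is an iterable of (video_id, votes) pairs, one per submitted
    segment. The pair layout and the summing rule are assumptions.
    """
    totals = defaultdict(int)
    for video_id, votes in segments:
        totals[video_id] += votes
    # Rank by descending vote total; ties are broken by video ID for determinism
    return heapq.nsmallest(k, totals, key=lambda vid: (-totals[vid], vid))

# Hypothetical sample data: "aaa" totals 9 votes, "ccc" 7, "bbb" 3
sample = [("aaa", 5), ("bbb", 2), ("aaa", 4), ("ccc", 7), ("bbb", 1)]
print(top_videos_by_votes(sample, 2))
```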
### Category Distribution
| Category | Percentage |
|----------|------------|
| Gaming | 15% |
| Entertainment | 11% |
| People & Blogs | 8% |
| Science & Technology | 6% |
| Education | 5% |
| Other | 55% |
### Language Distribution
| Language | Percentage |
|----------|------------|
| English (en, en-US) | 25% |
| Russian | 16% |
| Hindi | 2.5% |
| Polish | 2% |
| Other | 54.5% |
## Dataset Configurations
### `video_metadata`
Full metadata for 154,536 individual YouTube videos.
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | YouTube video ID (11 characters) |
| `title` | string | Video title |
| `fulltitle` | string | Full video title |
| `description` | string | Video description (may be truncated) |
| `upload_date` | string | Upload date (YYYYMMDD format) |
| `timestamp` | int64 | Unix timestamp of upload |
| `duration` | int64 | Duration in seconds |
| `view_count` | int64 | Number of views at collection time |
| `like_count` | int64 | Number of likes at collection time |
| `comment_count` | int64 | Number of comments at collection time |
| `channel` | string | Channel display name |
| `channel_id` | string | YouTube channel ID (24 characters, starts with UC) |
| `channel_url` | string | Full channel URL |
| `channel_follower_count` | int64 | Subscriber count at collection time |
| `channel_is_verified` | bool | Whether channel has verification badge |
| `uploader` | string | Uploader name (usually same as channel) |
| `uploader_id` | string | Uploader handle (@username) |
| `width` | int64 | Video width in pixels |
| `height` | int64 | Video height in pixels |
| `fps` | float64 | Frames per second |
| `vcodec` | string | Video codec (e.g., vp9, av01, avc1) |
| `acodec` | string | Audio codec (e.g., opus, mp4a) |
| `dynamic_range` | string | SDR or HDR |
| `aspect_ratio` | float64 | Width/height ratio |
| `age_limit` | int64 | Age restriction (0 = none, 18 = adult) |
| `availability` | string | public, unlisted, or private |
| `live_status` | string | not_live, is_live, was_live, is_upcoming |
| `was_live` | bool | True if video was a livestream |
| `is_live` | bool | True if currently live |
| `language` | string | Detected video language (ISO 639-1) |
| `playable_in_embed` | bool | Whether embedding is allowed |
| `categories` | string | JSON array of YouTube categories |
| `tags` | string | JSON array of video tags |
| `thumbnail` | string | URL of highest quality thumbnail |
| `chapters_json` | string | JSON array of `{title, start_time, end_time}` |
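
Because `categories`, `tags`, and `chapters_json` are stored as JSON strings and `upload_date` as a `YYYYMMDD` string, rows usually need decoding before analysis. The helper below is a minimal sketch; `parse_video_row` is a hypothetical name, the sample row is invented, and treating `None` or an empty string as "no value" is an assumption about how missing data is encoded.

```python
import json
from datetime import datetime

def parse_video_row(row):
    """Decode the string-encoded fields of one `video_metadata` row."""
    def loads_or_empty(value):
        # Assumption: missing values arrive as None or an empty string
        return json.loads(value) if value else []

    return {
        "id": row["id"],
        "upload_date": datetime.strptime(row["upload_date"], "%Y%m%d").date(),
        "categories": loads_or_empty(row.get("categories")),
        "tags": loads_or_empty(row.get("tags")),
        "chapters": loads_or_empty(row.get("chapters_json")),
    }

# Invented example row that follows the column layout in the table above
example_row = {
    "id": "xxxxxxxxxxx",
    "upload_date": "20240115",
    "categories": '["Gaming"]',
    "tags": '["speedrun", "retro"]',
    "chapters_json": '[{"title": "Intro", "start_time": 0.0, "end_time": 42.0}]',
}
parsed = parse_video_row(example_row)
print(parsed["upload_date"], parsed["categories"], parsed["chapters"][0]["title"])
```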
### `subtitles`
Extracted subtitles for 62,819 videos (41% coverage).
| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `language` | string | Subtitle language code |
| `full_text` | string | Concatenated plain text of all subtitles |
| `segments_json` | string | JSON array of `{start, end, text}` segments |
**Note**: These are auto-generated captions from YouTube, not human-created subtitles.
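
A common operation on `segments_json` is pulling the caption text for a time window, for example to inspect what is said during a suspected sponsor read. The sketch below assumes the `{start, end, text}` layout from the table, with times in seconds; `text_between` and the sample captions are hypothetical.

```python
import json

def text_between(segments_json, start, end):
    """Join the caption text of segments overlapping [start, end) seconds."""
    segments = json.loads(segments_json)
    return " ".join(
        seg["text"]
        for seg in segments
        if seg["start"] < end and seg["end"] > start
    )

# Invented sample in the same shape as the `segments_json` column
sample_segments = json.dumps([
    {"start": 0.0, "end": 4.0, "text": "welcome back"},
    {"start": 4.0, "end": 9.5, "text": "today's sponsor is"},
    {"start": 9.5, "end": 15.0, "text": "now on to the video"},
])
print(text_between(sample_segments, 3.0, 9.0))
```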
### `channel_playlists`
Video entries from 20,061 channel upload playlists (~20.8M total entries).
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | YouTube video ID |
| `title` | string | Video title |
| `description` | string | Video description |
| `duration` | int64 | Duration in seconds |
| `view_count` | int64 | Number of views |
| `timestamp` | int64 | Unix timestamp |
| `upload_date` | string | Upload date (YYYYMMDD) |
| `channel` | string | Channel name |
| `channel_id` | string | Channel ID |
| `playlist` | string | Playlist name (usually "Uploads from [Channel]") |
| `playlist_id` | string | Playlist ID |
| `playlist_count` | int64 | Total videos in channel's upload playlist |
| `playlist_index` | int64 | Position in playlist (1 = newest) |
| `thumbnail` | string | Thumbnail URL |
**Use cases**: Channel growth analysis, upload frequency patterns, content strategy analysis.
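
As one way to approach the upload-frequency use case, the sketch below computes the median gap between consecutive uploads per channel from the `channel_id` and `timestamp` columns. The function name and sample rows are hypothetical, and skipping rows with a missing timestamp is an assumption about the data.

```python
from collections import defaultdict
from statistics import median

SECONDS_PER_DAY = 86400

def median_upload_gap_days(rows):
    """Map channel_id to the median number of days between consecutive uploads."""
    uploads = defaultdict(list)
    for row in rows:
        if row["timestamp"] is not None:  # assumption: timestamps can be missing
            uploads[row["channel_id"]].append(row["timestamp"])
    gaps = {}
    for channel_id, stamps in uploads.items():
        stamps.sort()
        deltas = [(b - a) / SECONDS_PER_DAY for a, b in zip(stamps, stamps[1:])]
        if deltas:  # channels with a single entry have no gap to measure
            gaps[channel_id] = median(deltas)
    return gaps

# Invented rows: UC_a uploads on days 0, 7 and 10; UC_b has a single upload
sample_rows = [
    {"channel_id": "UC_a", "timestamp": 0},
    {"channel_id": "UC_a", "timestamp": 7 * SECONDS_PER_DAY},
    {"channel_id": "UC_a", "timestamp": 10 * SECONDS_PER_DAY},
    {"channel_id": "UC_b", "timestamp": 3 * SECONDS_PER_DAY},
]
print(median_upload_gap_days(sample_rows))
```

For the full ~20.8M-row config, the same function can be fed rows from a streaming load so the whole table never has to sit in memory as Python dicts.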
### `heatmaps`
Viewer engagement heatmaps showing which parts of videos are most watched. ~92% of videos have heatmap data, with ~100 segments per video.
| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `start_time` | float64 | Segment start in seconds |
| `end_time` | float64 | Segment end in seconds |
| `value` | float64 | Relative engagement score (0.0 - 1.0) |
**How to interpret**: Values represent relative rewatch/skip behavior. Higher values indicate segments that viewers replay or don't skip. Values near 0 indicate frequently skipped segments (often intros, sponsor reads).
**Use cases**: Identifying engaging content patterns, sponsor segment detection, intro/outro detection, content quality analysis.
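
To illustrate the interpretation above, the sketch below merges consecutive low-value heatmap segments of a single video into candidate "skipped" spans. The `0.1` threshold, the function name, and the sample values are arbitrary assumptions for illustration, not a validated sponsor-detection method.

```python
def low_engagement_spans(segments, threshold=0.1, tol=1e-6):
    """Merge consecutive heatmap segments whose value is below `threshold`.

    `segments` is a list of (start_time, end_time, value) tuples for one video.
    Returns a list of (start, end) spans in seconds.
    """
    spans = []
    for start, end, value in sorted(segments):
        if value >= threshold:
            continue
        if spans and abs(spans[-1][1] - start) <= tol:
            # Extend the previous span when this segment starts where it ended
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans

# Invented heatmap for one video: a weak first 20 seconds and a dip at 30-40s
sample_heatmap = [
    (0.0, 10.0, 0.05),
    (10.0, 20.0, 0.08),
    (20.0, 30.0, 0.60),
    (30.0, 40.0, 0.02),
    (40.0, 50.0, 1.00),
]
print(low_engagement_spans(sample_heatmap))
```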
### `live_chat`
Chat messages from 4,300 livestream recordings.
| Column | Type | Description |
|--------|------|-------------|
| `video_id` | string | YouTube video ID |
| `message_id` | string | Unique message ID |
| `timestamp_usec` | int64 | Message timestamp in microseconds |
| `video_offset_msec` | int64 | Video playback time when message appeared, in milliseconds |
| `author_name` | string | Display name of message author |
| `author_channel_id` | string | Author's YouTube channel ID |
| `message` | string | Message text content |
| `message_type` | string | text, paid (Super Chat), membership, gift |
**Use cases**: Chat sentiment analysis, community engagement patterns, Super Chat analysis.
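
For engagement-pattern analysis, a simple starting point is counting messages per minute of playback from `video_offset_msec`. The sketch below uses invented rows; the function name is hypothetical, and skipping missing or negative offsets (for example, chat sent before the stream started) is an assumption about the data.

```python
from collections import Counter

def messages_per_minute(rows):
    """Count chat messages per minute of video playback time."""
    counts = Counter()
    for row in rows:
        offset = row["video_offset_msec"]
        if offset is None or offset < 0:  # assumption: such offsets can occur
            continue
        counts[offset // 60_000] += 1
    return counts

# Invented chat rows for a single video
sample_chat = [
    {"video_offset_msec": 5_000, "message_type": "text"},
    {"video_offset_msec": 59_999, "message_type": "text"},
    {"video_offset_msec": 60_000, "message_type": "paid"},
    {"video_offset_msec": 185_000, "message_type": "text"},
]
per_minute = messages_per_minute(sample_chat)
print(per_minute.most_common(1))
```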
## Raw Data Archives
In addition to the processed parquet files, the complete raw yt-dlp output is available as compressed archives in the `raw/` directory:
| Archive | Contents | Compressed | Uncompressed |
|---------|----------|------------|--------------|
| `video_metadata.tar.zst` | 154K `.info.json` files | 1.6 GB | 56 GB |
| `subtitles.tar.zst` | 62K `.vtt` files | 2.5 GB | 12 GB |
| `live_chat.tar.zst` | 4K `.live_chat.json` files | 897 MB | 14 GB |
| `channel_playlists.tar.zst` | 20K channel `.json` files | 3.4 GB | 48 GB |
| **Total** | | **8.2 GB** | **130 GB** |
### Extracting Raw Data
```bash
# Extract with zstd
tar -I zstd -xvf raw/video_metadata.tar.zst
# Or with explicit zstd command
zstd -d raw/video_metadata.tar.zst -c | tar -xvf -
```
### Additional Fields in Raw Data
The raw JSON files contain fields not included in the parquet files:
- `formats` - All available streaming URLs and quality options
- `thumbnails` - All thumbnail resolutions
- `automatic_captions` - Caption availability for 150+ languages
- `subtitles` - Manual subtitle availability
- `requested_formats` - Selected format details
- `http_headers` - Request headers used
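
Once an archive is extracted, these extra fields can be read straight from each `.info.json` file. The sketch below lists caption languages, assuming `automatic_captions` and `subtitles` are dictionaries keyed by language code (the layout yt-dlp normally writes); the function name and the inline sample are hypothetical.

```python
import json

def caption_languages(info):
    """Return sorted language codes for automatic and manual captions."""
    auto = sorted(info.get("automatic_captions") or {})
    manual = sorted(info.get("subtitles") or {})
    return auto, manual

# In practice `info` would come from json.load() on an extracted `.info.json` file
info = json.loads(
    '{"id": "xxxxxxxxxxx", "automatic_captions": {"en": [], "de": []}, "subtitles": {}}'
)
print(caption_languages(info))
```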
## Usage Examples
### Basic Loading
```python
from datasets import load_dataset
# Load video metadata
videos = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "video_metadata")
print(f"Loaded {len(videos['train']):,} videos")
# Load with streaming for large configs
playlists = load_dataset(
    "ScriptSmith/sponsorblock-youtube-metadata-2024",
    "channel_playlists",
    streaming=True,
)
```
### Analyzing Engagement Patterns
```python
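from datasets import load_dataset

repo = "ScriptSmith/sponsorblock-youtube-metadata-2024"
videos = load_dataset(repo, "video_metadata", split="train").to_pandas()
heatmaps = load_dataset(repo, "heatmaps", split="train").to_pandas()

# Find videos with high early drop-off (bad intros):
# average the heatmap value over segments that start in the first 30 seconds
early = heatmaps[heatmaps["start_time"] < 30].groupby("video_id")["value"].mean()
# Join on video_id and list the videos with the lowest early engagement
worst = videos.merge(early.rename("early_value"), left_on="id", right_index=True)
print(worst.nsmallest(10, "early_value")[["id", "title", "early_value"]])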
```
### Working with Subtitles
```python
import json
from datasets import load_dataset
subtitles = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "subtitles")
# Parse segments
for row in subtitles['train']:
    segments = json.loads(row['segments_json'])
    for seg in segments:
        print(f"{seg['start']:.1f}s: {seg['text']}")
```
### Channel Analysis
```python
from datasets import load_dataset
from collections import Counter
playlists = load_dataset("ScriptSmith/sponsorblock-youtube-metadata-2024", "channel_playlists")
# Count videos per channel
channel_counts = Counter(row['channel_id'] for row in playlists['train'])
print(f"Most prolific: {channel_counts.most_common(10)}")
```
## Known Limitations
1. **Temporal snapshot**: All metrics are from collection time (Aug-Oct 2025) and don't reflect current values or historical trends.
2. **Selection bias**: Videos come from SponsorBlock, biasing toward sponsored content and certain genres (gaming, tech).
3. **Missing data**: Coverage is incomplete. Only 41% of videos have subtitles, ~92% have heatmaps, and ~32% have chapters.
4. **Deleted content**: Some videos may have been deleted or made private since collection.
5. **Auto-captions**: Subtitles are YouTube's auto-generated captions, which may contain transcription errors.
6. **Truncated descriptions**: Very long descriptions may be truncated.
7. **Regional availability**: Some videos may be region-locked; metadata reflects availability from the collection location.
## License
This dataset is released under **CC-BY-4.0**. The underlying video content belongs to the respective creators. This dataset contains only metadata, not the actual video/audio content.
## Citation
```bibtex
@dataset{sponsorblock_youtube_metadata_2025,
title={SponsorBlock YouTube Metadata Dataset},
author={Smith, Adam},
year={2025},
url={https://huggingface.co/datasets/ScriptSmith/sponsorblock-youtube-metadata-2024},
note={Metadata for 154K YouTube videos from the SponsorBlock database}
}
```
## Acknowledgments
- [SponsorBlock](https://sponsor.ajay.app/) for the video ID database
- [yt-dlp](https://github.com/yt-dlp/yt-dlp) for metadata extraction
- YouTube creators whose content is represented in this dataset
Provided by:
ScriptSmith



