vnkat/youtube-comment-sentiment
Source: Hugging Face. Updated 2026-02-16. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/vnkat/youtube-comment-sentiment
Resource description:
---
license: cc-by-sa-4.0
language:
- en
- hi
- ja
- es
task_categories:
- text-classification
tags:
- youtube
- sentiment
- comments
- multi-linguistic
size_categories:
- 1M<n<10M
---
# YouTube Comments Sentiment Analysis Dataset (1M+ Labeled Comments)
## Overview
This dataset comprises over one million YouTube comments, each annotated with one of three sentiment labels: **Positive**, **Neutral**, or **Negative**. The comments span a diverse range of topics, including programming, news, sports, and politics, and come with metadata on the video, author, engagement, time, and region to support NLP and sentiment analysis tasks.
## How to use:
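```Python
# Reading hf:// paths with pandas requires the huggingface_hub package.
import pandas as pd
# Load the full CSV from the Hugging Face Hub into a DataFrame.
df = pd.read_csv("hf://datasets/AmaanP314/youtube-comment-sentiment/youtube-comments-sentiment.csv")
```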
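
With over a million rows, it may be preferable to stream the file in chunks rather than load it all at once. The following is a minimal sketch of that idea using the `chunksize` and `usecols` parameters of `pandas.read_csv`. The inline CSV text is a made-up stand-in for the real file, and the label casing shown is an assumption.

```python
import io
from collections import Counter

import pandas as pd

# Made-up stand-in for the real CSV; only the column names follow the card.
csv_text = """CommentID,CommentText,Sentiment,Likes
c1,Great tutorial,Positive,10
c2,Not sure about this,Neutral,0
c3,Terrible audio,Negative,2
c4,Loved it,Positive,5
"""

# Count labels chunk by chunk so only a small slice is in memory at a time.
counts = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["Sentiment"], chunksize=2):
    counts.update(chunk["Sentiment"].value_counts().to_dict())

print(dict(counts))
```

For the real file, the `io.StringIO(...)` argument would be replaced by the dataset path, with a much larger `chunksize`.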
## Dataset Contents
Each record in the dataset includes the following fields:
- **CommentID:** A unique identifier assigned to each YouTube comment. This allows for individual tracking and analysis of comments.
- **VideoID:** The unique identifier of the YouTube video to which the comment belongs. This links each comment to its corresponding video.
- **VideoTitle:** The title of the YouTube video where the comment was posted. This provides context about the video's content.
- **AuthorName:** The display name of the user who posted the comment. This indicates the commenter's identity.
- **AuthorChannelID:** The unique identifier of the YouTube channel of the comment's author. This allows for tracking comments across different videos from the same author.
- **CommentText:** The actual text content of the YouTube comment. This is the raw data used for sentiment analysis.
- **Sentiment:** The sentiment classification of the comment, typically categorized as positive, negative, or neutral. This represents the emotional tone of the comment.
- **Likes:** The number of likes received by the comment. This indicates the comment's popularity or agreement from other users.
- **Replies:** The number of replies to the comment. This indicates the level of engagement and discussion generated by the comment.
- **PublishedAt:** The date and time when the comment was published. This allows for time-based analysis of comment trends.
- **CountryCode:** The two-letter country code of the user who posted the comment. This can be used to analyze regional sentiment.
- **CategoryID:** The category ID of the video that the comment was posted on. This allows for analysis of sentiment across video categories.
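
The field list above can be turned into a light schema check before analysis. The following is a minimal sketch, not part of the dataset's own tooling: the `prepare` helper and the sample rows are hypothetical, and the stored formats (timestamp layout, label casing, numeric types) are assumptions, since the card does not state them.

```python
import pandas as pd

# Column names as documented in the field list.
EXPECTED_COLUMNS = [
    "CommentID", "VideoID", "VideoTitle", "AuthorName", "AuthorChannelID",
    "CommentText", "Sentiment", "Likes", "Replies", "PublishedAt",
    "CountryCode", "CategoryID",
]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Check that all documented columns exist and coerce a few dtypes."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = df.copy()
    # Unparseable timestamps become NaT instead of raising.
    out["PublishedAt"] = pd.to_datetime(out["PublishedAt"], errors="coerce", utc=True)
    # Non-numeric or missing counts are treated as 0 (an assumption).
    for col in ("Likes", "Replies"):
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0).astype("int64")
    # Normalise label casing and whitespace, then store as a categorical.
    out["Sentiment"] = (
        out["Sentiment"].astype("string").str.strip().str.lower().astype("category")
    )
    return out


# Made-up rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame([
    {"CommentID": "c1", "VideoID": "v1", "VideoTitle": "Intro to Python",
     "AuthorName": "user_a", "AuthorChannelID": "ch_a",
     "CommentText": "Very clear explanation", "Sentiment": " Positive ",
     "Likes": "12", "Replies": "1", "PublishedAt": "2024-01-05T12:30:00Z",
     "CountryCode": "US", "CategoryID": 28},
    {"CommentID": "c2", "VideoID": "v2", "VideoTitle": "Match highlights",
     "AuthorName": "user_b", "AuthorChannelID": "ch_b",
     "CommentText": "The referee was awful", "Sentiment": "negative",
     "Likes": None, "Replies": "0", "PublishedAt": "2024-02-10T08:00:00Z",
     "CountryCode": "ES", "CategoryID": 17},
])

prepared = prepare(sample)
print(prepared.dtypes)
```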
## Key Features:
* **Sentiment Analysis:** Each comment has been categorized into positive, negative, or neutral sentiment, allowing for direct analysis of emotional tone.
* **Video and Author Metadata:** The dataset includes information about the videos (title, category, ID) and authors (channel ID, name), enabling contextual analysis.
* **Engagement Metrics:** Columns such as "Likes" and "Replies" provide insights into comment popularity and discussion levels.
* **Temporal and Geographical Data:** "PublishedAt" and "CountryCode" columns allow for time-based and regional sentiment analysis.
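
The kinds of analysis these features enable can be sketched with standard pandas operations. The example below is illustrative only: the rows are made up, and the numeric `CategoryID` values and lowercase label strings are assumptions about how the real columns are stored.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame({
    "CategoryID": [28, 28, 25, 25, 17, 17],
    "CountryCode": ["US", "IN", "US", "JP", "ES", "IN"],
    "Sentiment": ["positive", "neutral", "negative", "negative", "positive", "positive"],
    "Likes": [12, 0, 3, 7, 1, 40],
    "PublishedAt": pd.to_datetime(
        ["2024-01-03", "2024-01-15", "2024-01-20",
         "2024-02-02", "2024-02-10", "2024-02-25"],
        utc=True,
    ),
})

# Share of each sentiment label within every video category (rows sum to 1).
share_by_category = pd.crosstab(
    sample["CategoryID"], sample["Sentiment"], normalize="index"
)

# Engagement: average number of likes per sentiment label.
mean_likes = sample.groupby("Sentiment")["Likes"].mean()

# Temporal view: number of comments per month and sentiment label.
month = sample["PublishedAt"].dt.strftime("%Y-%m").rename("Month")
monthly = sample.groupby([month, "Sentiment"]).size()

print(share_by_category)
print(mean_likes)
print(monthly)
```

Swapping `CategoryID` for `CountryCode` in the `crosstab` call gives the regional breakdown in the same way.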
## Data Collection & Labeling Process
- **Extraction:**
Comments were gathered using the YouTube Data API, ensuring a rich and diverse collection from multiple channels and regions.
- **Sentiment Labeling:**
  Labels were assigned using a combination of AI models (such as Gemini) and manual validation.
- **Cleaning & Preprocessing:**
  Cleaning steps removed extraneous noise such as timestamps, code snippets, and special characters to produce high-quality, ready-to-use text.
- **Augmentation for Balance:**
To address class imbalances (especially for underrepresented negative and neutral sentiments), a comment augmentation process was implemented. This process generated synthetic variations of selected comments, increasing linguistic diversity while preserving the original sentiment, thus ensuring a more balanced dataset.
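
The cleaning step described above could look roughly like the following. This is a hypothetical sketch, not the pipeline actually used for the dataset: the patterns, the `clean_comment` helper, and the set of punctuation kept are all assumptions. Characters are filtered by Unicode category so that combining marks in scripts such as Devanagari survive.

```python
import re
import unicodedata

# Hypothetical patterns; the dataset's actual cleaning rules are not documented here.
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")    # e.g. 1:23 or 01:02:03
CODE_RE = re.compile(r"`{3}.*?`{3}|`[^`\n]*`", re.DOTALL)     # fenced or inline code
KEEP_PUNCT = set(".,!?'\"-")                                   # punctuation left in place


def _keep(ch: str) -> bool:
    """Keep whitespace, basic punctuation, and any letter, mark, or number."""
    return ch.isspace() or ch in KEEP_PUNCT or unicodedata.category(ch)[0] in "LMN"


def clean_comment(text: str) -> str:
    text = CODE_RE.sub(" ", text)        # drop code snippets
    text = TIMESTAMP_RE.sub(" ", text)   # drop video timestamps
    text = "".join(ch if _keep(ch) else " " for ch in text)  # drop other symbols
    return " ".join(text.split())        # collapse runs of whitespace


print(clean_comment("Loved the part at 12:05 thanks! `x = 1` #awesome"))
print(clean_comment("बहुत अच्छा 👍"))
```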
## Benefits for Users
- **Scale & Diversity:**
With over 1M comments from various domains, this dataset offers a rich resource for training and evaluating sentiment analysis models.
- **Quality & Consistency:**
Rigorous cleaning, preprocessing, and augmentation ensure that the data is both reliable and representative of real-world YouTube interactions.
- **Versatility:**
Ideal for researchers, data scientists, and developers looking to build or fine-tune large language models for sentiment analysis, content moderation, and other NLP applications.
## Uses:
* Sentiment analysis of YouTube comments.
* Analysis of viewer engagement and discussion patterns.
* Exploration of sentiment trends across different video categories.
* Regional sentiment analysis.
* Building machine learning models for sentiment prediction.
* Analyzing the impact of video content on viewer sentiment.
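
When building sentiment models on this data, comments from the same video are likely to be correlated, so one option is to split by `VideoID` rather than by row. The sketch below uses only the standard library. The `assign_split` helper and the sample IDs are hypothetical. Whether augmented comments share a `VideoID` with their source comment is not stated in the card, so this split may or may not also guard against leakage from augmentation.

```python
import hashlib


def assign_split(video_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a VideoID to 'train' or 'test'.

    Every comment on the same video lands in the same split, and the
    assignment is stable across runs because it depends only on the ID.
    """
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    # The first 4 bytes give a value in [0, 1) that is exact as a float.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "test" if bucket < test_fraction else "train"


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"VideoID": "vid_001", "CommentText": "Great video", "Sentiment": "positive"},
    {"VideoID": "vid_001", "CommentText": "Too long", "Sentiment": "negative"},
    {"VideoID": "vid_002", "CommentText": "Okay I guess", "Sentiment": "neutral"},
]
for row in rows:
    row["split"] = assign_split(row["VideoID"])

print([(r["VideoID"], r["split"]) for r in rows])
```

With a pandas DataFrame, the same helper could be applied as `df["VideoID"].map(assign_split)`.
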
This dataset is open-sourced to encourage collaboration and innovation. Detailed documentation and the code used for extraction, labeling, and augmentation are available in the accompanying GitHub repository.
Provider:
vnkat



