
vnkat/youtube-comment-sentiment

Source: Hugging Face. Updated 2026-02-16. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/vnkat/youtube-comment-sentiment
Resource description:
---
license: cc-by-sa-4.0
language:
- en
- hi
- ja
- es
task_categories:
- text-classification
tags:
- youtube
- sentiment
- comments
- multi-linguistic
size_categories:
- 1M<n<10M
---

# YouTube Comments Sentiment Analysis Dataset (1M+ Labeled Comments)

## Overview

This dataset comprises over one million YouTube comments, each annotated with a sentiment label: **Positive**, **Neutral**, or **Negative**. The comments span a diverse range of topics including programming, news, sports, politics and more, and are enriched with comprehensive metadata to facilitate various NLP and sentiment analysis tasks.

## How to use:

```Python
import pandas as pd

# Load the full CSV from the Hugging Face Hub (hf:// paths need the huggingface_hub package installed).
df = pd.read_csv("hf://datasets/AmaanP314/youtube-comment-sentiment/youtube-comments-sentiment.csv")
```

## Dataset Contents

Each record in the dataset includes the following fields:

- **CommentID:** A unique identifier assigned to each YouTube comment. This allows for individual tracking and analysis of comments.
- **VideoID:** The unique identifier of the YouTube video to which the comment belongs. This links each comment to its corresponding video.
- **VideoTitle:** The title of the YouTube video where the comment was posted. This provides context about the video's content.
- **AuthorName:** The display name of the user who posted the comment. This indicates the commenter's identity.
- **AuthorChannelID:** The unique identifier of the YouTube channel of the comment's author. This allows for tracking comments across different videos from the same author.
- **CommentText:** The actual text content of the YouTube comment. This is the raw data used for sentiment analysis.
- **Sentiment:** The sentiment classification of the comment, typically categorized as positive, negative, or neutral. This represents the emotional tone of the comment.
- **Likes:** The number of likes received by the comment. This indicates the comment's popularity or agreement from other users.
- **Replies:** The number of replies to the comment. This indicates the level of engagement and discussion generated by the comment.
- **PublishedAt:** The date and time when the comment was published. This allows for time-based analysis of comment trends.
- **CountryCode:** The two-letter country code of the user that posted the comment. This can be used to analyze regional sentiment.
- **CategoryID:** The category ID of the video that the comment was posted on. This allows for analysis of sentiment across video categories.

## Key Features:

* **Sentiment Analysis:** Each comment has been categorized into positive, negative, or neutral sentiment, allowing for direct analysis of emotional tone.
* **Video and Author Metadata:** The dataset includes information about the videos (title, category, ID) and authors (channel ID, name), enabling contextual analysis.
* **Engagement Metrics:** Columns such as "Likes" and "Replies" provide insights into comment popularity and discussion levels.
* **Temporal and Geographical Data:** "PublishedAt" and "CountryCode" columns allow for time-based and regional sentiment analysis.

## Data Collection & Labeling Process

- **Extraction:** Comments were gathered using the YouTube Data API, ensuring a rich and diverse collection from multiple channels and regions.
- **Sentiment Labeling:** A combination of advanced AI (using models such as Gemini) and manual validation was used to accurately label each comment.
- **Cleaning & Preprocessing:** Comprehensive cleaning steps were applied, removing extraneous noise like timestamps, code snippets, and special characters, to ensure high-quality, ready-to-use text.
- **Augmentation for Balance:** To address class imbalances (especially for underrepresented negative and neutral sentiments), a comment augmentation process was implemented. This process generated synthetic variations of selected comments, increasing linguistic diversity while preserving the original sentiment, thus ensuring a more balanced dataset.
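
The cleaning step described above (removing timestamps, code snippets, and special characters) might look roughly like the sketch below. The dataset card does not give the actual rules, so every pattern here, the set of punctuation that is kept, and the function name `clean_comment` are assumptions made for illustration. Characters are filtered by Unicode category rather than with `\w` so that combining marks in Hindi and other non-Latin scripts survive.

```python
import re
import unicodedata

# Illustrative patterns only; the dataset's real cleaning rules are not documented here.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)          # fenced code snippets
INLINE_CODE = re.compile(r"`[^`]*`")                        # inline code spans
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")     # video timestamps like 1:23 or 01:02:03
WHITESPACE = re.compile(r"\s+")

# Assumption: basic punctuation is worth keeping for sentiment.
KEEP_PUNCT = set(".,!?'-")


def _is_kept(ch: str) -> bool:
    """Keep whitespace, basic punctuation, letters, combining marks, and numbers."""
    if ch.isspace() or ch in KEEP_PUNCT:
        return True
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def clean_comment(text: str) -> str:
    """Return `text` with code, timestamps, and special characters removed."""
    text = FENCED_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = TIMESTAMP.sub(" ", text)
    # Replace every dropped character (emoji, symbols, other punctuation) with a space.
    text = "".join(ch if _is_kept(ch) else " " for ch in text)
    return WHITESPACE.sub(" ", text).strip()
```

Applied to a loaded frame, this could be used as `df["CommentText"].astype(str).map(clean_comment)`. Note that the published comments are described as already cleaned, so a function like this is mainly useful for preparing new, raw comments in a comparable way.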
## Benefits for Users

- **Scale & Diversity:** With over 1M comments from various domains, this dataset offers a rich resource for training and evaluating sentiment analysis models.
- **Quality & Consistency:** Rigorous cleaning, preprocessing, and augmentation ensure that the data is both reliable and representative of real-world YouTube interactions.
- **Versatility:** Ideal for researchers, data scientists, and developers looking to build or fine-tune large language models for sentiment analysis, content moderation, and other NLP applications.

## Uses:

* Sentiment analysis of YouTube comments.
* Analysis of viewer engagement and discussion patterns.
* Exploration of sentiment trends across different video categories.
* Regional sentiment analysis.
* Building machine learning models for sentiment prediction.
* Analyzing the impact of video content on viewer sentiment.

This dataset is open-sourced to encourage collaboration and innovation. Detailed documentation and the code used for extraction, labeling, and augmentation are available in the accompanying GitHub repository.
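
As a starting point for the category-level and regional analyses listed under Uses, the sketch below computes the share of each sentiment label per group. It runs on a tiny made-up frame, not on real dataset rows. The column names follow the field list, while the label spellings (`Positive`, `Neutral`, `Negative`), the example category IDs, and the helper name `sentiment_share` are assumptions.

```python
import pandas as pd

# Made-up sample using the documented column names; values are illustrative only.
sample = pd.DataFrame(
    {
        "CategoryID": [28, 28, 17, 17, 17],
        "CountryCode": ["US", "IN", "IN", "JP", "US"],
        "Sentiment": ["Positive", "Negative", "Neutral", "Positive", "Positive"],
        "Likes": [10, 0, 3, 7, 1],
    }
)


def sentiment_share(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Return the fraction of each sentiment label within every group of `by`."""
    # normalize="index" makes each row (one group) sum to 1.
    return pd.crosstab(df[by], df["Sentiment"], normalize="index")


by_category = sentiment_share(sample, "CategoryID")   # sentiment mix per video category
by_country = sentiment_share(sample, "CountryCode")   # sentiment mix per commenter country
mean_likes = sample.groupby("Sentiment")["Likes"].mean()  # engagement per sentiment label

print(by_category)
print(by_country)
print(mean_likes)
```

The same helper should work on the full frame returned by `pd.read_csv`, provided the column names and label values in the CSV match the ones assumed here.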
Provider:
vnkat