thefrankhsu/hate_speech_twitter

Name: thefrankhsu/hate_speech_twitter
Creator: thefrankhsu
Published: 2023-12-15 03:47:33
License: No description available

Hugging Face | Updated 2023-12-15 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/thefrankhsu/hate_speech_twitter

Resource description:

---
task_categories:
- text-classification
language:
- en
tags:
- health
- tweet
- hate speech
- mental health
- hate speech detection
- hate speech classification
- social media
- mobile health
size_categories:
- 1K<n<10K
---

## Dataset Card for Dataset Name

The dataset is designed to analyze and address hate speech on online platforms. It consists of two sets: a training set and a testing set. Both sets are labeled, and hate speech instances are sorted into nine distinct categories.

## Dataset Description

The dataset comprises three key features: tweets, labels (hate speech denoted as 1 and non-hate speech as 0), and categories (behavior, class, disability, ethnicity, gender, physical appearance, race, religion, sexual orientation).

* Training set: contains a total of 5679 tweets (Hate Speech: 1516 / Non Hate Speech: 4163). The number of hate speech tweets is not evenly distributed across categories.
* Testing set: contains a total of 1000 tweets (Hate Speech: 500 / Non Hate Speech: 500). The number of hate speech tweets is roughly even across categories.

## Uses

This dataset can be used for various purposes, including but not limited to:

* Developing and training machine learning models for hate speech detection.
* Analyzing the prevalence and patterns of hate speech across different categories.
* Understanding the challenges of categorizing hate speech on social media platforms.

See the example [project](https://github.com/Wei-Hsi/AI4health)!

## Source Data

The dataset used in this study is sourced from Kaggle and named the [Hate Speech and Offensive Language dataset](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/). Hate speech instances are identified by selecting tweets based on the "class" column.

## Annotations

Category labels were generated through OpenAI API calls using the GPT-3.5 model.

Note that category predictions are unstable when GPT-3.5 is used for label generation: the model tends to predict a different category on each call. However, we have confirmed that these tweets were labeled correctly. If you find any misclassified labels, please feel free to reach out. Thank you in advance for your assistance.

## Dataset Card Contact

Please feel free to contact me via wh476@cornell.edu!

Provider:

thefrankhsu

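Summary of Original Information

Dataset Overview

The dataset is designed to analyze and address hate speech on online platforms. It contains a training set and a testing set. Both sets are labeled, and hate speech instances are sorted into nine distinct categories.

Dataset Description

The dataset contains three key features: tweets, labels (hate speech marked as 1, non-hate speech marked as 0), and categories (behavior, class, disability, ethnicity, gender, physical appearance, race, religion, sexual orientation).

Training set: contains a total of 5679 tweets (hate speech: 1516 / non-hate speech: 4163). The number of hate speech tweets is unevenly distributed across categories.
Testing set: contains a total of 1000 tweets (hate speech: 500 / non-hate speech: 500). The number of hate speech tweets is roughly even across categories.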
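
The split sizes above imply a noticeable class imbalance in the training set. The following is a minimal sketch that derives the hate speech share and a set of inverse-frequency class weights from the stated counts only. The names `SPLITS`, `hate_share`, and `balanced_class_weights` are illustrative and not part of the dataset, and the weighting scheme is one common choice, not something the dataset card prescribes.

```python
# Counts exactly as stated in the dataset description.
SPLITS = {
    "train": {"hate": 1516, "non_hate": 4163},
    "test": {"hate": 500, "non_hate": 500},
}


def hate_share(counts):
    """Fraction of tweets in a split that carry label 1 (hate speech)."""
    total = counts["hate"] + counts["non_hate"]
    return counts["hate"] / total


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = counts["hate"] + counts["non_hate"]
    return {
        0: total / (2 * counts["non_hate"]),  # label 0 = non-hate speech
        1: total / (2 * counts["hate"]),      # label 1 = hate speech
    }


for name, counts in SPLITS.items():
    weights = balanced_class_weights(counts)
    total = counts["hate"] + counts["non_hate"]
    print(
        f"{name}: total={total}, hate share={hate_share(counts):.1%}, "
        f"weights={{0: {weights[0]:.3f}, 1: {weights[1]:.3f}}}"
    )
```

On the training counts the hate speech share comes out at roughly 26.7%, while the testing set is balanced by construction, so a model trained without reweighting or resampling may look weaker on the test set than its training metrics suggest.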

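Uses

The dataset can be used for various purposes, including but not limited to:

Developing and training machine learning models for hate speech detection.
Analyzing the prevalence and patterns of hate speech across different categories.
Understanding the challenges of categorizing hate speech on social media platforms.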

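Source Data

The dataset used in this study comes from Kaggle and is named the Hate Speech and Offensive Language dataset. Hate speech instances are identified by selecting tweets based on the "class" column.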
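
The description does not spell out how the "class" column was used, so the following is only one plausible reading, sketched under stated assumptions: that the source CSV exposes `class` and `tweet` columns, that class code `0` marks hate speech, and that every other class is treated as non-hate speech. The sample rows are placeholders, not real data, and `label_rows` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows standing in for the Kaggle CSV; the tweet text is dummy content.
SAMPLE_CSV = """class,tweet
0,placeholder tweet A
1,placeholder tweet B
2,placeholder tweet C
0,placeholder tweet D
"""

# Assumption: class code 0 marks hate speech in the source file.
HATE_CLASS = "0"


def label_rows(csv_text, hate_class=HATE_CLASS):
    """Return (tweet, label) pairs, with label 1 for hate speech and 0 otherwise."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["tweet"], 1 if row["class"] == hate_class else 0)
        for row in reader
    ]


labeled = label_rows(SAMPLE_CSV)
hate_count = sum(label for _, label in labeled)
print(f"{hate_count} of {len(labeled)} rows labeled as hate speech")
```

If the author mapped the classes differently (for example, by dropping some classes entirely), the `hate_class` argument and the `else 0` branch are the parts that would need to change.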

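Annotations

Category labels were generated through OpenAI API calls using the GPT-3.5 model. Note that category predictions are unstable when GPT-3.5 is used for label generation: the predicted category may differ from call to call. However, we have confirmed that these tweets were labeled correctly. If you find any misclassified labels, please feel free to get in touch.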
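
One common way to cope with a labeler that returns different categories on repeated calls is to query it several times and keep the majority answer, flagging low-agreement cases for manual review. The sketch below illustrates that idea; it is not the procedure the dataset author describes. `predict_category` is a hypothetical callable standing in for whatever model call is used, and `n_runs` and `min_agreement` are illustrative parameters.

```python
import itertools
from collections import Counter

# The nine categories listed in the dataset description.
CATEGORIES = (
    "behavior", "class", "disability", "ethnicity", "gender",
    "physical appearance", "race", "religion", "sexual orientation",
)


def majority_category(tweet, predict_category, n_runs=5, min_agreement=0.6):
    """Call the labeler n_runs times and return (category, agreement).

    category is None when no valid answer reaches min_agreement,
    which marks the tweet for manual review.
    """
    votes = Counter()
    for _ in range(n_runs):
        label = predict_category(tweet).strip().lower()
        if label in CATEGORIES:  # ignore answers outside the nine categories
            votes[label] += 1
    if not votes:
        return None, 0.0
    label, count = votes.most_common(1)[0]
    agreement = count / n_runs
    return (label if agreement >= min_agreement else None), agreement


# Stub labeler that imitates an unstable model by cycling through fixed answers.
_answers = itertools.cycle(["gender", "gender", "behavior", "gender", "Gender "])


def stub_labeler(tweet):
    return next(_answers)


print(majority_category("placeholder tweet text", stub_labeler))
```

With the stub above, four of the five answers normalize to `gender`, so the call reports that category with an agreement of 0.8. Tweets that come back as `None` are the ones where a human check would matter most.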
