thefrankhsu/hate_speech_twitter
Hugging Face | Updated 2023-12-15 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/thefrankhsu/hate_speech_twitter
---
task_categories:
- text-classification
language:
- en
tags:
- health
- tweet
- hate speech
- mental health
- hate speech detection
- hate speech classification
- social media
- mobile health
size_categories:
- 1K<n<10K
---
## Dataset Card for hate_speech_twitter
The dataset is designed to support the analysis and detection of hate speech on online platforms. It consists of a training set and a testing set. In both sets, each tweet is labeled as hate speech or non-hate speech, and hate speech instances are assigned to one of nine categories.
## Dataset Description
The dataset comprises three key features: tweets, labels (1 for hate speech, 0 for non-hate speech), and categories (behavior, class, disability, ethnicity, gender, physical appearance, race, religion, sexual orientation).
* Training set: 5679 tweets in total (Hate Speech: 1516 / Non Hate Speech: 4163). Hate speech tweets are unevenly distributed across the categories.
* Testing set: 1000 tweets in total (Hate Speech: 500 / Non Hate Speech: 500). Hate speech tweets are roughly evenly distributed across the categories.
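The training split is imbalanced (about one hate speech tweet for every 2.7 non-hate speech tweets), which may matter when training a classifier. As a minimal sketch, the snippet below derives "balanced" class weights from the counts stated above. The weighting formula is a common heuristic and an assumption here, not something this card prescribes.

```python
# Counts as stated in this card for the training split.
TRAIN_COUNTS = {0: 4163, 1: 1516}  # 0 = non-hate speech, 1 = hate speech


def balanced_class_weights(counts):
    """Return weights n_samples / (n_classes * count) for each class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(TRAIN_COUNTS)
hate_share = TRAIN_COUNTS[1] / sum(TRAIN_COUNTS.values())

print(f"hate speech share of training set: {hate_share:.1%}")
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

The minority class (hate speech) receives the larger weight, so errors on it count more during training. The testing split is already balanced, so no reweighting is needed when evaluating on it.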
## Uses
This dataset can be utilized for various purposes, including but not limited to:
* Developing and training machine learning models for hate speech detection.
* Analyzing the prevalence and patterns of hate speech across different categories.
* Understanding the challenges associated with categorizing hate speech on social media platforms.
See the example [project](https://github.com/Wei-Hsi/AI4health) for a sample application.
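As an illustration of the second use case above (prevalence across categories), here is a small sketch that counts hate speech rows per category. The rows are toy placeholders, and the keys `tweet`, `label`, and `categories` are assumed names for the three features described in this card; check the actual column names before adapting it.

```python
from collections import Counter

# Toy rows standing in for the real data; the keys "tweet", "label", and
# "categories" are assumed names for the three features described above.
rows = [
    {"tweet": "<tweet 1>", "label": 1, "categories": "religion"},
    {"tweet": "<tweet 2>", "label": 1, "categories": "gender"},
    {"tweet": "<tweet 3>", "label": 0, "categories": ""},
    {"tweet": "<tweet 4>", "label": 1, "categories": "religion"},
]


def category_shares(rows):
    """Count hate speech rows per category and return (counts, shares)."""
    counts = Counter(r["categories"] for r in rows if r["label"] == 1)
    total = sum(counts.values())
    shares = {c: n / total for c, n in counts.items()} if total else {}
    return counts, shares


counts, shares = category_shares(rows)
for category, n in counts.most_common():
    print(f"{category}: {n} ({shares[category]:.0%})")
```

Running the same function on the training and testing splits would show the uneven versus roughly even category distributions described earlier.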
## Source Data
The data is sourced from the [Hate Speech and Offensive Language dataset](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/) on Kaggle.
Hate speech instances were identified using the values in that dataset's "class" column.
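The selection step could look roughly like the sketch below. It is not the card authors' script: the column names `class` and `tweet`, the value `0` for hate speech, and the choice to map every other class to label 0 are all assumptions based on the upstream dataset's description, so verify them against the Kaggle page before use.

```python
import csv
import io

# Tiny in-memory stand-in for the upstream CSV. The column names "class"
# and "tweet" and the value 0 for hate speech are assumptions taken from
# the upstream dataset's description, not from this card.
SAMPLE_CSV = """class,tweet
0,<tweet a>
1,<tweet b>
2,<tweet c>
0,<tweet d>
"""

HATE_SPEECH_CLASS = "0"  # assumed upstream code for hate speech


def to_binary_rows(csv_text, hate_class=HATE_SPEECH_CLASS):
    """Map each row to label 1 if its class equals hate_class, else 0."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"tweet": row["tweet"], "label": int(row["class"] == hate_class)}
        for row in reader
    ]


rows = to_binary_rows(SAMPLE_CSV)
print(sum(r["label"] for r in rows), "of", len(rows), "rows labeled as hate speech")
```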
## Annotations
Category labels were generated through an OpenAI API call employing the GPT-3.5 model.
Note that GPT-3.5 category predictions are unstable: the model may predict a different category for the same tweet on each call. However, we have confirmed that the tweets in this dataset were labeled correctly. If you find any misclassified labels, please feel free to reach out. Thank you in advance for your assistance.
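One common way to cope with this kind of run-to-run instability is to query the labeler several times and keep the majority answer. The sketch below illustrates that idea only; it is not described in this card as the authors' procedure. `fake_classify` is a hypothetical stand-in for a model call, and `majority_category` and `n_calls` are illustrative names.

```python
from collections import Counter
from itertools import cycle

CATEGORIES = [
    "behavior", "class", "disability", "ethnicity", "gender",
    "physical appearance", "race", "religion", "sexual orientation",
]


def majority_category(tweet, classify, n_calls=5):
    """Call `classify` n_calls times and return (top category, agreement)."""
    votes = Counter(classify(tweet) for _ in range(n_calls))
    # Ignore answers outside the nine categories listed in this card.
    valid = Counter({c: n for c, n in votes.items() if c in CATEGORIES})
    if not valid:
        return None, 0.0
    category, n = valid.most_common(1)[0]
    return category, n / n_calls


# Hypothetical stand-in for a model call: it cycles through fixed answers
# to imitate a labeler that does not always agree with itself.
_answers = cycle(["race", "ethnicity", "race", "race", "ethnicity"])


def fake_classify(tweet):
    return next(_answers)


category, agreement = majority_category("<tweet text>", fake_classify)
print(category, agreement)
```

A low agreement score flags tweets whose category is ambiguous and may be worth reviewing by hand.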
## Dataset Card Contact
Please feel free to contact me via wh476@cornell.edu!
Provider:
thefrankhsu