legacy-datasets/hate_offensive
Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link: https://hf-mirror.com/datasets/legacy-datasets/hate_offensive
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- machine-generated
language:
- en
license:
- mit
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
paperswithcode_id: hate-speech-and-offensive-language
pretty_name: HateOffensive
tags:
- hate-speech-detection
dataset_info:
  features:
  - name: total_annotation_count
    dtype: int32
  - name: hate_speech_annotations
    dtype: int32
  - name: offensive_language_annotations
    dtype: int32
  - name: neither_annotations
    dtype: int32
  - name: label
    dtype:
      class_label:
        names:
          '0': hate-speech
          '1': offensive-language
          '2': neither
  - name: tweet
    dtype: string
  splits:
  - name: train
    num_bytes: 2811298
    num_examples: 24783
  download_size: 2546446
  dataset_size: 2811298
---
# Dataset Card for HateOffensive
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage** : https://arxiv.org/abs/1905.12516
- **Repository** : https://github.com/t-davidson/hate-speech-and-offensive-language
- **Paper** : https://arxiv.org/abs/1905.12516
- **Leaderboard** :
- **Point of Contact** : trd54 at cornell dot edu
### Dataset Summary
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English (`en`)
## Dataset Structure
### Data Instances
```
{
  "total_annotation_count": 3,
  "hate_speech_annotations": 0,
  "offensive_language_annotations": 0,
  "neither_annotations": 3,
  "label": 2, # "neither"
  "tweet": "!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. & as a man you should always take the trash out..."
}
```
### Data Fields
- `total_annotation_count`: (Integer) number of users who coded each tweet. The minimum is 3; more users coded a tweet when the judgments were determined to be unreliable.
- `hate_speech_annotations`: (Integer) number of users who judged the tweet to be hate speech.
- `offensive_language_annotations`: (Integer) number of users who judged the tweet to be offensive.
- `neither_annotations`: (Integer) number of users who judged the tweet to be neither hate speech nor offensive language.
- `label`: (Class Label) integer class label assigned by the majority of CrowdFlower (CF) users (0: `hate-speech`, 1: `offensive-language`, 2: `neither`).
- `tweet`: (String) the text of the tweet.
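As a rough illustration of how the per-class annotation counts relate to `label`, the sketch below takes a record shaped like the feature schema in the card metadata and returns the class with the most votes. The helper name `majority_label` and the tie-breaking rule (the lowest class id wins) are assumptions made for illustration; the card does not document how ties were resolved.

```python
# Class ids and names as listed in the card metadata.
LABEL_NAMES = {0: "hate-speech", 1: "offensive-language", 2: "neither"}


def majority_label(record: dict) -> int:
    """Return the class id that received the most annotator votes.

    Assumption: ties go to the lowest class id. The card does not say
    how ties were handled in the released labels.
    """
    votes = [
        record["hate_speech_annotations"],
        record["offensive_language_annotations"],
        record["neither_annotations"],
    ]
    return votes.index(max(votes))


# Record mirroring the example under "Data Instances" (tweet text omitted).
example = {
    "total_annotation_count": 3,
    "hate_speech_annotations": 0,
    "offensive_language_annotations": 0,
    "neither_annotations": 3,
    "label": 2,
}
predicted = majority_label(example)
print(predicted, LABEL_NAMES[predicted])
```

Comparing `majority_label(record)` with the stored `label` is one way to spot records where the released label does not follow a simple vote count.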
### Data Splits
This dataset is not split; only the `train` split is available (24,783 examples).
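Because only a `train` split ships with the dataset, users who need held-out data have to carve it out themselves. The following is a minimal sketch of a stratified split over the `label` column, written in plain Python so that it runs without downloading anything. The function name `stratified_split`, the 20% validation fraction, and the toy label counts are all illustrative assumptions, not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.1, seed=0):
    """Split example indices into train/validation lists, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_fraction)
        validation.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(validation)


# Toy labels standing in for the `label` column (not the real distribution).
toy_labels = [0] * 10 + [1] * 70 + [2] * 20
train_idx, val_idx = stratified_split(toy_labels, validation_fraction=0.2)
print(len(train_idx), len(val_idx))
```

Splitting per class keeps the class proportions of the validation set close to those of the full data, which matters when classes are imbalanced. If the data is loaded with the Hugging Face `datasets` library, its `Dataset.train_test_split` method can serve the same purpose.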
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
Usernames are not anonymized in the dataset.
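Users who want to reduce exposure of account names before sharing derived data could mask `@` mentions in the `tweet` field. The sketch below is one possible approach, not something the dataset authors provide: the pattern, the `mask_mentions` helper, and the `@USER` placeholder are assumptions, and masking mentions alone does not guarantee anonymity.

```python
import re

# Assumption: a mention is "@" followed by word characters. The lookbehind
# skips an "@" that directly follows a word character, so strings that look
# like e-mail addresses are left untouched.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def mask_mentions(text: str, placeholder: str = "@USER") -> str:
    """Replace every @mention in `text` with a fixed placeholder."""
    return MENTION_PATTERN.sub(placeholder, text)


print(mask_mentions("!!! RT @mayasolovely: As a woman you shouldn't complain"))
```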
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
MIT License
### Citation Information
@inproceedings{hateoffensive,
title = {Automated Hate Speech Detection and the Problem of Offensive Language},
author = {Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar},
booktitle = {Proceedings of the 11th International AAAI Conference on Web and Social Media},
series = {ICWSM '17},
year = {2017},
location = {Montreal, Canada},
pages = {512-515}
}
### Contributions
Thanks to [@MisbahKhan789](https://github.com/MisbahKhan789) for adding this dataset.
The HateOffensive dataset is a monolingual (English) dataset for detecting hate speech and offensive language. It contains 24,783 tweets, each annotated by at least three crowdsourced users as hate speech, offensive language, or neither. Each tweet carries one of three class labels: `hate-speech`, `offensive-language`, or `neither`. The dataset is released under the MIT license and is suited to text classification, in particular multi-class classification.
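Provider: legacy-datasets
Summary of Original Information
Dataset Overview
Basic Information
- Dataset name: HateOffensive
- Language: English (en)
- License: MIT
- Dataset size: 10K<n<100K
- Multilinguality: monolingual
- Source data: original
- Task category: text classification
- Task ID: multi-class classification
- Tags: hate-speech-detection
Dataset Structure
Features
- total_annotation_count: integer; total number of annotations for each tweet.
- hate_speech_annotations: integer; number of annotations judging the tweet to be hate speech.
- offensive_language_annotations: integer; number of annotations judging the tweet to be offensive language.
- neither_annotations: integer; number of annotations judging the tweet to be neither hate speech nor offensive language.
- label: class label with three classes: 0: hate-speech, 1: offensive-language, 2: neither.
- tweet: string; the text of the tweet.
Data Splits
- Train: 24,783 examples; 2,811,298 bytes.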
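Dataset Creation
Annotation Process
- Annotators: crowdsourced
- Language creators: machine-generated
Personal and Sensitive Information
- Usernames are not anonymized in the dataset.
Licensing Information
- MIT License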