
NoisyHate/Noisy_Hate_Data

Source: Hugging Face. Updated: 2023-05-20. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NoisyHate/Noisy_Hate_Data
Resource description:
# toxic-detection-testset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Citation Information](#citation-information)

## Dataset Description

### Dataset Summary

This dataset is a test set for toxicity detection that contains both clean text and its perturbed version, where the perturbations were written by humans online. In addition, the dataset can be used to benchmark misspelling correctors.

### Languages

English

## Dataset Structure

### Data Instances

```
{
  "clean_version": "this is pretty much exactly how i feel damn",
  "perturbed_version": "this is pretty much exactly how i feel daaammnn",
  "toxicity": 0.7,
  "obscene": 0.7,
  "sexual_explicit": 0,
  "identity_attack": 0,
  ...
  "insult": 0.2,
  "quality_mean": 4
}
```

### Data Fields

This dataset is derived from the [Jigsaw data](https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data). As a result, it retains all of the useful metrics and attributes from that dataset.

**Main**

* clean_version
* perturbed_version

**Metrics**

* toxicity
* severe_toxicity
* obscene
* threat
* insult
* identity_attack
* sexual_explicit

**Identity attributes**

* male
* female
* transgender
* other_gender
* heterosexual
* homosexual_gay_or_lesbian
* bisexual
* other_sexual_orientation
* christian
* jewish
* muslim
* hindu
* buddhist
* atheist
* other_religion
* black
* white
* asian
* latino
* other_race_or_ethnicity
* physical_disability
* intellectual_or_learning_disability
* psychiatric_or_mental_illness
* other_disability

### Data Splits

test: 1339

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Jigsaw is a well-known toxic speech classification dataset containing approximately 2 million public comments from the Civil Comments platform. In addition to the toxicity score labels used for toxicity classification, the Jigsaw dataset provides several toxicity sub-type dimensions as well as identity annotations that indicate the groups a comment targets, such as male, female, black, and Asian. Because of these extensive identity annotations and the large data volume, we adopt this dataset as our raw data source. Since the dataset has been used as a standard benchmark for content moderation tasks, this choice should also lower the barrier for the community to adopt NoisyHate.

Since the comments in the Jigsaw dataset contain many special characters, emojis, and informal language, data cleaning was necessary to ensure data quality. Following a typical text processing pipeline, we removed duplicated texts, special characters, special punctuation, hyperlinks, and numbers. Since we focused only on English texts, sentences containing non-standard English words were filtered out. 131,982 comments remained after this cleaning step.

#### Who are the source language producers?

The source data is provided by the Conversation AI team, a research initiative founded by Jigsaw and Google.

### Annotations

#### Annotation process

In the annotation process, we display a guideline that explains the definition of a human-generated perturbation and provides examples of both high-quality and low-quality perturbations. Such a training phase has been suggested to help ensure high-quality responses from human workers, especially for labeling tasks. Each MTurk worker is then presented with a pair consisting of a perturbed sentence and its clean version and is asked to rate the quality of the perturbed one (the guideline and UI can be found in our [paper](#citation-information)).

We recruited five different workers from the North America region through five assignments to assess each pair. A five-second countdown timer was also set for each task to ensure workers spent enough time on it. To ensure the quality of their responses, we designed an attention question that asks workers to click on the perturbed word in the given sentence before they provide their quality rating. Workers who cannot correctly identify the location of the perturbation in the given sentence are blocked from future batches.

We aimed to pay the workers an average rate of \$10 per hour, which is well above the federal minimum wage (\$7.25 per hour). The payment for each task was estimated from the average sentence length (around 25 words per pair) and the average reading speed of native speakers (around 228 words per minute).

#### Who are the annotators?

US Amazon MTurk workers with a HIT Approval Rate greater than 98% and a Number of HITs Approved greater than 1000.

### Personal and Sensitive Information

N/A

## Additional Information

### Dataset Curators

[More Information Needed]

### Citation Information

Paper is coming soon.
Provider:
NoisyHate
Summary of original information

toxic-detection-testset

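Dataset Description

Dataset Summary

This dataset is a test set for toxicity detection that contains both clean text and its perturbed version, where the perturbations were written by humans online. In addition, the dataset can be used to benchmark misspelling correctors.

Languages

English

Dataset Structure

Data Instances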

```json
{
  "clean_version": "this is pretty much exactly how i feel damn",
  "perturbed_version": "this is pretty much exactly how i feel daaammnn",
  "toxicity": 0.7,
  "obscene": 0.7,
  "sexual_explicit": 0,
  "identity_attack": 0,
  "insult": 0.2,
  "quality_mean": 4
}
```
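As a rough illustration of how the two text fields relate, the sketch below compares `clean_version` and `perturbed_version` token by token and reports where they differ. The helper name `find_perturbed_tokens` is hypothetical, and the sketch assumes that a perturbation changes characters inside a word without adding or removing whitespace-separated tokens; pairs that violate this would need a proper sequence alignment.

```python
from typing import List, Tuple


def find_perturbed_tokens(clean: str, perturbed: str) -> List[Tuple[int, str, str]]:
    """Return (position, clean_token, perturbed_token) for every differing token.

    Assumption (not stated in the dataset card): both versions split into the
    same number of whitespace-separated tokens.
    """
    clean_tokens = clean.split()
    perturbed_tokens = perturbed.split()
    if len(clean_tokens) != len(perturbed_tokens):
        raise ValueError("token counts differ; a sequence alignment would be needed")
    return [
        (i, c, p)
        for i, (c, p) in enumerate(zip(clean_tokens, perturbed_tokens))
        if c != p
    ]


# The record shown in the data instance above.
record = {
    "clean_version": "this is pretty much exactly how i feel damn",
    "perturbed_version": "this is pretty much exactly how i feel daaammnn",
}

diffs = find_perturbed_tokens(record["clean_version"], record["perturbed_version"])
print(diffs)  # [(8, 'damn', 'daaammnn')]
```

The same pairing is what makes the data usable for benchmarking misspelling correctors: the perturbed token is the input and the clean token is the expected correction.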

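Data Fields

This dataset is derived from the Jigsaw data. As a result, it retains all of the useful metrics and attributes from that dataset.

Main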

  • clean_version
  • perturbed_version

Metrics

  • toxicity
  • severe_toxicity
  • obscene
  • threat
  • insult
  • identity_attack
  • sexual_explicit

Identity attributes

  • male
  • female
  • transgender
  • other_gender
  • heterosexual
  • homosexual_gay_or_lesbian
  • bisexual
  • other_sexual_orientation
  • christian
  • jewish
  • muslim
  • hindu
  • buddhist
  • atheist
  • other_religion
  • black
  • white
  • asian
  • latino
  • other_race_or_ethnicity
  • physical_disability
  • intellectual_or_learning_disability
  • psychiatric_or_mental_illness
  • other_disability
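
The three field groups listed above can be collected into constants so that a record is easy to split by group. This is only a sketch: the constant names and the `group_record` helper are hypothetical, and any key outside the three documented groups (such as `quality_mean`, which appears in the sample record but not in the field lists) is placed in an `other` bucket.

```python
MAIN_FIELDS = ["clean_version", "perturbed_version"]

METRIC_FIELDS = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]

IDENTITY_FIELDS = [
    "male", "female", "transgender", "other_gender",
    "heterosexual", "homosexual_gay_or_lesbian", "bisexual",
    "other_sexual_orientation",
    "christian", "jewish", "muslim", "hindu", "buddhist", "atheist",
    "other_religion",
    "black", "white", "asian", "latino", "other_race_or_ethnicity",
    "physical_disability", "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness", "other_disability",
]


def group_record(record: dict) -> dict:
    """Split one record into the documented field groups plus an 'other' bucket."""
    groups = {
        "main": MAIN_FIELDS,
        "metrics": METRIC_FIELDS,
        "identity": IDENTITY_FIELDS,
    }
    known = set(MAIN_FIELDS) | set(METRIC_FIELDS) | set(IDENTITY_FIELDS)
    grouped = {
        name: {key: record[key] for key in fields if key in record}
        for name, fields in groups.items()
    }
    # Keys that are not in any documented group end up here.
    grouped["other"] = {k: v for k, v in record.items() if k not in known}
    return grouped


# Values copied from the sample record; fields elided there are left out.
sample = {
    "clean_version": "this is pretty much exactly how i feel damn",
    "perturbed_version": "this is pretty much exactly how i feel daaammnn",
    "toxicity": 0.7,
    "obscene": 0.7,
    "sexual_explicit": 0,
    "identity_attack": 0,
    "insult": 0.2,
    "quality_mean": 4,
}

grouped = group_record(sample)
print(grouped["metrics"])
print(grouped["other"])  # {'quality_mean': 4}
```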

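Data Splits

Test: 1339

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Jigsaw is a well-known toxic speech classification dataset containing approximately 2 million public comments from the Civil Comments platform. In addition to the toxicity score labels used for toxicity classification, the Jigsaw dataset provides several toxicity sub-type dimensions as well as identity annotations that indicate the groups a comment targets, such as male, female, black, and Asian. Because of these extensive identity annotations and the large data volume, we adopt this dataset as our raw data source. Since the dataset has been used as a standard benchmark for content moderation tasks, this choice should also lower the barrier for the community to adopt NoisyHate.

Since the comments in the Jigsaw dataset contain many special characters, emojis, and informal language, data cleaning was necessary to ensure data quality. Following a typical text processing pipeline, we removed duplicated texts, special characters, special punctuation, hyperlinks, and numbers. Since we focused only on English texts, sentences containing non-standard English words were filtered out. 131,982 comments remained after this cleaning step.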
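
The cleaning steps described above can be sketched roughly as follows. This is not the authors' actual pipeline, which the card does not publish. The function names, the regular expressions, the lowercasing step, and the tiny `vocabulary` set are all assumptions made for illustration, and the sketch simplifies "special characters and special punctuation" to "everything that is not a letter or whitespace".

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Simplification: drop digits, punctuation, emojis, and any other non-letter symbol.
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_comment(text: str) -> str:
    """Lowercase (an assumption), strip hyperlinks, then strip non-letters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def clean_corpus(comments, vocabulary):
    """Clean each comment, drop duplicates, and keep only in-vocabulary sentences."""
    seen = set()
    kept = []
    for comment in comments:
        cleaned = clean_comment(comment)
        if not cleaned or cleaned in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(cleaned)
        # Stand-in for "filter out sentences with non-standard English words".
        if all(token in vocabulary for token in cleaned.split()):
            kept.append(cleaned)
    return kept


# Toy inputs; a real run would use a full English word list.
vocabulary = {"check", "this", "out", "i", "have", "dogs", "is", "here"}
comments = [
    "Check this out: https://example.com !!!",
    "check this out",
    "I have 2 dogs \U0001F436",
    "zzxq is here",
]

print(clean_corpus(comments, vocabulary))  # ['check this out', 'i have dogs']
```

In this toy run the second comment is dropped as a duplicate of the first after cleaning, and the last one is dropped because `zzxq` is not in the vocabulary.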

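Who are the source language producers?

The source data is provided by the Conversation AI team, a research initiative founded by Jigsaw and Google.

Annotations

Annotation process

In the annotation process, we display a guideline that explains the definition of a human-generated perturbation and provides examples of both high-quality and low-quality perturbations. Such a training phase has been suggested to help ensure high-quality responses from human workers, especially for labeling tasks. Each MTurk worker is then presented with a pair consisting of a perturbed sentence and its clean version and is asked to rate the quality of the perturbed one (the guideline and UI can be found in our paper).

We recruited five different workers from the North America region through five assignments to assess each pair. A five-second countdown timer was also set for each task to ensure workers spent enough time on it. To ensure the quality of their responses, we designed an attention question that asks workers to click on the perturbed word in the given sentence before they provide their quality rating. Workers who cannot correctly identify the location of the perturbation in the given sentence are blocked from future batches. We aimed to pay the workers an average rate of $10 per hour, which is well above the federal minimum wage ($7.25 per hour). The payment for each task was estimated from the average sentence length (around 25 words per pair) and the average reading speed of native speakers (around 228 words per minute).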
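
The payment estimate above can be worked through numerically. The sketch below uses only the figures quoted in the text (25 words per pair, 228 words per minute, a $10 hourly target, and five assignments per pair). It accounts for reading time alone, so the result is a lower bound for illustration and not the amount the authors actually paid, which the card does not state.

```python
WORDS_PER_PAIR = 25             # average length quoted in the card
WORDS_PER_MINUTE = 228          # average native reading speed quoted in the card
TARGET_HOURLY_RATE_USD = 10.0   # target average pay rate
FEDERAL_MIN_WAGE_USD = 7.25     # comparison figure quoted in the card
ASSIGNMENTS_PER_PAIR = 5        # five workers assess each pair

# Time needed just to read one pair, in seconds.
reading_seconds = WORDS_PER_PAIR / WORDS_PER_MINUTE * 60

# Pay per assignment that would match the hourly target for reading time only.
pay_per_assignment = TARGET_HOURLY_RATE_USD * reading_seconds / 3600

# Reading-time cost of collecting all five ratings for one pair.
cost_per_pair = pay_per_assignment * ASSIGNMENTS_PER_PAIR

print(f"reading time per pair: {reading_seconds:.2f} s")   # 6.58 s
print(f"pay per assignment:    ${pay_per_assignment:.4f}")  # $0.0183
print(f"cost per pair (x5):    ${cost_per_pair:.4f}")       # $0.0914
print(f"target / minimum wage: {TARGET_HOURLY_RATE_USD / FEDERAL_MIN_WAGE_USD:.2f}x")
```

Under these figures the estimated reading time of about 6.6 seconds is slightly longer than the five-second countdown timer, so the timer acts as a floor and not as the expected task duration.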

Who are the annotators?

US Amazon MTurk workers with a HIT Approval Rate greater than 98% and a Number of HITs Approved greater than 1000.
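
The worker criteria and the attention check described above can be expressed as two simple predicates. This sketch does not call the MTurk API. The `Worker` dataclass, its field names, and both helper functions are hypothetical stand-ins, and the thresholds are the ones quoted in the text (approval rate above 98%, more than 1000 approved HITs, US locale).

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical record of the worker attributes the card mentions."""
    worker_id: str
    country: str
    approval_rate: float   # HIT approval rate, in percent
    hits_approved: int     # number of approved HITs


def is_eligible(worker: Worker) -> bool:
    """Apply the recruitment criteria; both thresholds are strict ('greater than')."""
    return (
        worker.country == "US"
        and worker.approval_rate > 98
        and worker.hits_approved > 1000
    )


def passes_attention_check(clicked_index: int, perturbed_index: int) -> bool:
    """The worker must click the token position that was actually perturbed."""
    return clicked_index == perturbed_index


workers = [
    Worker("w1", "US", 99.2, 1500),
    Worker("w2", "US", 98.0, 5000),   # rate is not strictly above 98
    Worker("w3", "US", 99.9, 1000),   # count is not strictly above 1000
]
eligible = [w.worker_id for w in workers if is_eligible(w)]
print(eligible)  # ['w1']

# In the sample pair the perturbed word is the last token (position 8).
blocked = set()
if not passes_attention_check(clicked_index=3, perturbed_index=8):
    blocked.add("w1")  # excluded from future batches
print(blocked)  # {'w1'}
```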

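Personal and Sensitive Information

N/A

Additional Information

Dataset Curators

[More Information Needed]

Citation Information

Paper is coming soon.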
