s-nlp/en_paradetox_toxicity

Name: s-nlp/en_paradetox_toxicity
Creator: s-nlp
Published: 2023-09-08 08:37:06
License: openrail++

Source: Hugging Face. Updated: 2023-09-08. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/s-nlp/en_paradetox_toxicity

Description:

---
license: openrail++
task_categories:
- text-classification
language:
- en
---

# ParaDetox: Detoxification with Parallel Data (English). Toxicity Task Results

This repository contains the **Toxicity Task** markup from the [English ParaDetox dataset](https://huggingface.co/datasets/s-nlp/paradetox) collection pipeline. The original paper, ["ParaDetox: Detoxification with Parallel Data"](https://aclanthology.org/2022.acl-long.469/), was presented at the ACL 2022 main conference.

## ParaDetox Collection Pipeline

The ParaDetox dataset was collected via the [Yandex.Toloka](https://toloka.yandex.com/) crowdsourcing platform in three steps:

* *Task 1:* **Generation of Paraphrases**: The first crowdsourcing task asks users to eliminate toxicity in a given sentence while keeping the content.
* *Task 2:* **Content Preservation Check**: We show users the generated paraphrases along with their original variants and ask them to indicate if they have close meanings.
* *Task 3:* **Toxicity Check**: Finally, we check if the workers succeeded in removing toxicity.

Specifically, this repo contains the results of **Task 3: Toxicity Check**. Only samples with markup confidence >= 90 are included. The input is a text, and the label shows whether the text is toxic. In total, the dataset contains 26,507 samples, of which a minority (4,009 pairs) are toxic examples.

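
As a minimal sketch of the confidence threshold described above, the snippet below filters a few made-up Task 3 records and keeps those with confidence >= 90. The field names `text`, `toxic`, and `confidence`, the 0-100 confidence scale, and the records themselves are all hypothetical; the actual column names in the repository may differ.

```python
from typing import TypedDict


class ToxicityRecord(TypedDict):
    """Hypothetical shape of one Task 3 markup record (field names are assumptions)."""
    text: str
    toxic: bool
    confidence: float  # assumed to be on a 0-100 scale, matching ">= 90"


CONFIDENCE_THRESHOLD = 90.0  # threshold stated in the dataset card


def keep_confident(records: list[ToxicityRecord],
                   threshold: float = CONFIDENCE_THRESHOLD) -> list[ToxicityRecord]:
    """Return only the records whose markup confidence meets the threshold."""
    return [r for r in records if r["confidence"] >= threshold]


# Made-up examples for illustration only; not taken from the dataset.
sample_records: list[ToxicityRecord] = [
    {"text": "example sentence A", "toxic": False, "confidence": 97.5},
    {"text": "example sentence B", "toxic": True, "confidence": 90.0},
    {"text": "example sentence C", "toxic": False, "confidence": 72.0},
]

confident = keep_confident(sample_records)
print(len(confident))
```

The comparison is inclusive, so a record with confidence exactly 90 is kept, which matches the ">= 90" wording of the card.
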
## Citation

```
@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara and Dementieva, Daryna and Ustyantsev, Sergey and Moskovskiy, Daniil and Dale, David and Krotova, Irina and Semenov, Nikita and Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}
```

## Contacts

For any questions, please contact: Daryna Dementieva (dardem96@gmail.com)

Provider:

s-nlp

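Summary of Original Information

ParaDetox Dataset Overview

Basic Dataset Information

License: openrail++
Task category: text classification
Language: English

Dataset Description

Name: ParaDetox: Detoxification with Parallel Data (English)
Task: toxicity detection
Data collection method: crowdsourced via the Yandex.Toloka platform in three steps:
- Task 1: Generation of paraphrases. Users are asked to eliminate toxicity in a sentence without changing its meaning.
- Task 2: Content preservation check. Users are shown the generated paraphrases alongside the original versions and asked whether the two have close meanings.
- Task 3: Toxicity check. Verifies whether the workers succeeded in removing toxicity.
Dataset contents: the results of Task 3 (toxicity check), limited to samples with markup confidence >= 90. The input is a text, and the label indicates whether the text is toxic.
Dataset size: 26,507 samples in total, of which 4,009 pairs are toxic.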
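
The size figures above imply a clear class imbalance. The sketch below works through that arithmetic and derives inverse-frequency class weights, one common way to compensate when training a text classifier. It assumes the 4,009 toxic items are counted within the 26,507 total; the weighting scheme is an illustration and is not something the dataset authors prescribe.

```python
TOTAL_SAMPLES = 26_507  # total reported in the dataset card
TOXIC_SAMPLES = 4_009   # toxic examples reported in the dataset card

# Assumption: the toxic examples are a subset of the total.
non_toxic_samples = TOTAL_SAMPLES - TOXIC_SAMPLES
toxic_share = TOXIC_SAMPLES / TOTAL_SAMPLES

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
n_classes = 2
class_weights = {
    "non_toxic": TOTAL_SAMPLES / (n_classes * non_toxic_samples),
    "toxic": TOTAL_SAMPLES / (n_classes * TOXIC_SAMPLES),
}

print(f"non-toxic samples: {non_toxic_samples}")
print(f"toxic share: {toxic_share:.1%}")
print({name: round(weight, 2) for name, weight in class_weights.items()})
```

Under that assumption, roughly 15% of the samples are toxic, so the toxic class receives a noticeably larger weight than the non-toxic class.
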

