s-nlp/paradetox

Name: s-nlp/paradetox
Creator: s-nlp
Published: 2025-04-02 15:20:04
License: no description available

Hugging Face, updated 2025-04-02, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/s-nlp/paradetox

Resource summary:

ParaDetox is the first parallel corpus for the English text detoxification task. It contains 19,766 detoxified versions of 11,939 toxic sentences, collected via the Yandex.Toloka crowdsourcing platform. To ensure high data quality, collection proceeds in three steps: generating detoxified versions, checking content preservation, and checking toxicity. The dataset also includes some samples labeled as "unrewritable". It can be used to train detoxification models, whose outputs can then be evaluated automatically with three metrics: style transfer accuracy (STA), content preservation (SIM), and fluency (FL).

Provider:

s-nlp

Summary of original information

Dataset overview

Dataset name

ParaDetox: Detoxification with Parallel Data (English)

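Dataset description

Contains the first parallel corpus for the English text detoxification task, along with the associated models and evaluation methods.
The original paper, "ParaDetox: Detoxification with Parallel Data", was published in the ACL 2022 main conference.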

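Data collection pipeline

Paraphrase generation: workers must remove the toxicity from a given sentence while keeping its content unchanged.
Content preservation check: workers are shown the generated paraphrase alongside the original and asked whether the two are close in meaning.
Toxicity check: verifies that the worker successfully removed the toxicity.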
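
The three steps can be read as a filter over crowd-written candidates. The following is a minimal sketch of that reading, not the authors' pipeline: the `Candidate` record, its field names, and the single boolean per check are all hypothetical, and how individual worker judgments are aggregated is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One crowd-written rewrite of a toxic sentence (hypothetical layout)."""
    toxic: str
    paraphrase: Optional[str]   # None when the worker marked it unrewritable
    same_meaning: bool = False  # assumed outcome of the content preservation check
    non_toxic: bool = False     # assumed outcome of the toxicity check


def split_candidates(candidates):
    """Route candidates into accepted pairs, unrewritable sources, and rejects."""
    accepted, unrewritable, rejected = [], [], []
    for c in candidates:
        if c.paraphrase is None:
            # Step 1 produced no rewrite; keep the source sentence on its own.
            unrewritable.append(c.toxic)
        elif c.same_meaning and c.non_toxic:
            # The rewrite passed both the content and the toxicity check.
            accepted.append((c.toxic, c.paraphrase))
        else:
            rejected.append(c)
    return accepted, unrewritable, rejected


# Placeholder strings stand in for real sentences.
candidates = [
    Candidate("source A", "rewrite A1", same_meaning=True, non_toxic=True),
    Candidate("source A", "rewrite A2", same_meaning=False, non_toxic=True),
    Candidate("source B", None),
]
accepted, unrewritable, rejected = split_candidates(candidates)
print(len(accepted), len(unrewritable), len(rejected))
```

Only candidates that pass both checks become parallel pairs in this sketch; a source sentence can contribute several accepted rewrites.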

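Dataset details

Contains paraphrases for 11,939 toxic sentences, averaging 1.66 paraphrases per sentence, for 19,766 paraphrases in total.
The samples marked as "unrewritable" in the first-stage task are also released.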
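
As a quick consistency check, the reported average follows directly from the two counts. The variable names below are illustrative only.

```python
# Counts as reported for the dataset.
toxic_sentences = 11_939
paraphrases = 19_766

# Average number of paraphrases per toxic sentence.
avg = paraphrases / toxic_sentences
print(f"{avg:.2f}")  # prints 1.66
```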

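Evaluation metrics

Style transfer accuracy (STA): the percentage of non-toxic outputs, as identified by a pre-trained toxicity classifier.
Content preservation (SIM): the cosine similarity between the original text and the output.
Fluency (FL): the percentage of fluent sentences, as identified by a RoBERTa-based linguistic acceptability classifier.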
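
The arithmetic behind the three metrics can be sketched as follows. This is an illustration under stated assumptions, not the official evaluation script: the classifier decisions are passed in as precomputed booleans, the sentence vectors are toy values standing in for whatever embedding model is used, and averaging SIM over all pairs is an assumption. The helper names `percentage`, `cosine_similarity`, and `evaluate` are hypothetical.

```python
import math


def percentage(flags):
    """Share of True flags, expressed as a percentage."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def evaluate(is_non_toxic, source_vecs, output_vecs, is_fluent):
    """Compute STA, SIM, and FL from precomputed per-sentence signals."""
    sims = [cosine_similarity(s, o) for s, o in zip(source_vecs, output_vecs)]
    return {
        "STA": percentage(is_non_toxic),              # % judged non-toxic
        "SIM": sum(sims) / len(sims) if sims else 0.0,  # assumed: mean over pairs
        "FL": percentage(is_fluent),                  # % judged fluent
    }


# Toy inputs: four outputs with made-up classifier decisions and 2-d vectors.
scores = evaluate(
    is_non_toxic=[True, True, False, True],
    source_vecs=[[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 4.0]],
    output_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 3.0]],
    is_fluent=[True, False, True, True],
)
print(scores)
```

In practice the boolean lists would come from the toxicity and acceptability classifiers named above, and the vectors from a sentence encoder.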

Models

State of the art (SOTA): a BART (base) model trained on the ParaDetox dataset, released on HuggingFace.

Citation

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara and Dementieva, Daryna and Ustyantsev, Sergey and Moskovskiy, Daniil and Dale, David and Krotova, Irina and Semenov, Nikita and Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "..."
}
