s-nlp/ru_non_detoxified

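Source: Hugging Face. Updated: 2023-09-08. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/s-nlp/ru_non_detoxified
Summary:
The ParaDetox dataset targets the detoxification of Russian text; this repository holds the negative results of the paraphrasing task. The data was collected on the Yandex.Toloka platform in three steps: paraphrase generation, a content preservation check, and a toxicity check. This repository covers the output of the first step, paraphrase generation, and contains about 11,446 samples. It gives the reasons a sample could not be detoxified: non-toxic text, toxic content, or unclear text.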

Provider:
s-nlp
Summary of the original dataset card

ParaDetox: Detoxification with Parallel Data (Russian)

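Dataset overview

  • Task category: text classification
  • Language: Russian
  • License: openrail++

Dataset contents

  • Dataset name: ParaDetox
  • Collection platform: Yandex.Toloka
  • Collection steps:
    • Task 1: Paraphrase generation. Workers rewrite a sentence to remove its toxicity without changing its meaning.
    • Task 2: Content preservation check. Workers see the generated paraphrase next to the original sentence and judge whether the two are close in meaning.
    • Task 3: Toxicity check. Workers verify that the paraphrase no longer contains toxicity.
  • Dataset size: about 11,446 samples
  • Special samples: sentences that annotators marked as impossible to detoxify. The possible reasons are non-toxic text, toxic content, and unclear text.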
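
The Task 1 outcome described above can be illustrated with a minimal sketch. It is not the authors' collection code: the `Task1Response` class, its field names, the reason labels, and the placeholder sentences are all hypothetical, chosen only to show how responses without a paraphrase end up in a "non-detoxified" set with a reason attached.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# The three reasons named in the dataset description.
# The string labels themselves are hypothetical.
REASONS = ("non_toxic_text", "toxic_content", "unclear_text")


@dataclass
class Task1Response:
    """One worker response in Task 1 (hypothetical schema)."""
    toxic_sentence: str
    paraphrase: Optional[str] = None  # set when the worker rewrote the sentence
    reason: Optional[str] = None      # set when the worker declined to rewrite


def split_task1(responses):
    """Separate rewritten sentences from those marked as impossible to detoxify."""
    rewritten, non_detoxified = [], []
    for r in responses:
        if r.paraphrase is not None:
            rewritten.append(r)       # would continue to Tasks 2 and 3
        elif r.reason in REASONS:
            non_detoxified.append(r)  # the kind of sample this repository holds
        else:
            raise ValueError(f"no paraphrase and no known reason: {r!r}")
    return rewritten, non_detoxified


# Placeholder data, not taken from the dataset.
responses = [
    Task1Response("placeholder sentence A", paraphrase="placeholder rewrite A"),
    Task1Response("placeholder sentence B", reason="non_toxic_text"),
    Task1Response("placeholder sentence C", reason="toxic_content"),
    Task1Response("placeholder sentence D", reason="unclear_text"),
    Task1Response("placeholder sentence E", reason="unclear_text"),
]

rewritten, non_detoxified = split_task1(responses)
reason_counts = Counter(r.reason for r in non_detoxified)
print(len(rewritten), dict(reason_counts))
```

Counting the reasons this way gives a quick view of why sentences were rejected; the real column names in the repository may differ and should be checked against the dataset card.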

Citation

@inproceedings{logacheva-etal-2022-study,
    title = "A Study on Manual and Automatic Evaluation for Text Style Transfer: The Case of Detoxification",
    author = "Logacheva, Varvara and Dementieva, Daryna and Krotova, Irina and Fenogenova, Alena and Nikishina, Irina and Shavrina, Tatiana and Panchenko, Alexander",
    booktitle = "Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.humeval-1.8",
    doi = "10.18653/v1/2022.humeval-1.8",
    pages = "90--101",
    abstract = "It is often difficult to reliably evaluate models which generate text. Among them, text style transfer is a particularly difficult to evaluate, because its success depends on a number of parameters. We conduct an evaluation of a large number of models on a detoxification task. We explore the relations between the manual and automatic metrics and find that there is only weak correlation between them, which is dependent on the type of model which generated text. Automatic metrics tend to be less reliable for better-performing models. However, our findings suggest that, ChrF and BertScore metrics can be used as a proxy for human evaluation of text detoxification to some extent.",
}
