textdetox/multilingual_paradetox

Name: textdetox/multilingual_paradetox
Creator: University of Virginia, Carnegie Mellon University, Allen Institute for Artificial Intelligence, Microsoft, Indian Institute of Technology Gandhinagar
Published: 2025-05-22 22:30:14
License: No description provided

arXiv · Updated 2025-05-22 · Indexed 2025-05-24

Download link:

https://huggingface.co/datasets/textdetox/multilingual_paradetox

Overview:

This dataset is a multilingual parallel detoxification dataset that provides parallel toxic and neutral text pairs across nine typologically distinct languages. These texts are carefully selected to ensure that toxic content is paired with its semantically equivalent neutral (non-toxic) samples. This parallel setup enables direct evaluation of cross-lingual detoxification performance. The dataset is hosted on Hugging Face to facilitate experiments and analysis for researchers.

Created:

2025-05-22

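Summary of Original Information

Dataset Overview

Basic Information

Name: Multilingual Text Detoxification with Parallel Data
Languages: English (en), Ukrainian (uk), Russian (ru), German (de), Chinese (zh), Amharic (am), Arabic (ar), Hindi (hi), Spanish (es), Italian (it), French (fr), Hebrew (he), Japanese (ja), Tatar (tt)
License: openrail++
Size: 10K<n<100K
Task category: text generation (text-generation)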

Dataset Structure

Features:
- toxic_sentence: string, the toxic text
- neutral_sentence: string, the detoxified text
Splits:
- English (en): 400 samples
- Russian (ru): 400 samples
- Ukrainian (uk): 400 samples
- German (de): 400 samples
- Spanish (es): 400 samples
- Amharic (am): 400 samples
- Chinese (zh): 400 samples
- Arabic (ar): 400 samples
- Hindi (hi): 400 samples
Download size: 489288 bytes
Dataset size: 764013 bytes
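
The structure above can be explored with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed and that each split is named by its language code as listed above. The helper names `load_split` and `pair_stats` are hypothetical and introduced only for illustration.

```python
from typing import Iterable, Mapping

# Split names and sizes as listed in this page (nine languages, 400 pairs each).
EXPECTED_SPLITS = {
    "en": 400, "ru": 400, "uk": 400, "de": 400, "es": 400,
    "am": 400, "zh": 400, "ar": 400, "hi": 400,
}


def load_split(lang: str):
    """Load one language split from the Hugging Face Hub (requires network access).

    Assumption: split names match the language codes listed on this page.
    """
    # Imported lazily so that the offline helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset("textdetox/multilingual_paradetox", split=lang)


def pair_stats(rows: Iterable[Mapping[str, str]]) -> dict:
    """Count the pairs and how many neutral sentences differ from their toxic source."""
    total = 0
    changed = 0
    for row in rows:
        toxic = row["toxic_sentence"]
        neutral = row["neutral_sentence"]
        total += 1
        # Compare after trimming surrounding whitespace only.
        if toxic.strip() != neutral.strip():
            changed += 1
    return {"pairs": total, "changed": changed}
```

With network access, `pair_stats(load_split("en"))` would report the number of English pairs, which can be compared against `EXPECTED_SPLITS["en"]`. Taken together, the listed splits add up to 9 × 400 = 3,600 pairs.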

Usage

A multilingual parallel dataset for the text detoxification task, prepared for the TextDetox shared task.

Data Sources

English: Jigsaw, Unitary AI Toxicity Dataset
Russian: Russian Language Toxic Comments, Toxic Russian Comments
Ukrainian: Ukrainian Twitter texts
Spanish: Detecting and Monitoring Hate Speech in Twitter, Detoxis, RoBERTuito
German: GermEval 2018, 2021
Amharic: Amharic Hate Speech
Arabic: OSACT4
Hindi: Hostility Detection Dataset in Hindi, HASOC track at FIRE 2019
Italian: AMI, HODI, Jigsaw Multilingual Toxic Comment
French: FrenchToxicityPrompts, Jigsaw Multilingual Toxic Comment
Hebrew: Hebrew Offensive Language Dataset
Hinglish: Hinglish Hate Detection
Japanese: 2chan posts
Tatar: own data

Citation

@inproceedings{dementieva-etal-2025-multilingual,
    title = "Multilingual and Explainable Text Detoxification with Parallel Corpora",
    author = "Dementieva, Daryna and Babakov, Nikolay and Ronen, Amit and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Moskovskiy, Daniil Alekhseevich and Stakovskii, Elisei and Kaufman, Eran and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander",
    editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.535/",
    pages = "7998--8025"
}

@inproceedings{dementieva2024overview,
    title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
    author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
    booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
    editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'\i}a Seco de Herrera},
    year={2024},
    organization={CEUR-WS.org}
}

@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
    author = {Janek Bevendorff and Xavier Bonet Casals and Berta Chulvi and Daryna Dementieva and Ashraf Elnagar and Dayne Freitag and Maik Fr{\"{o}}be and Damir Korencic and Maximilian Mayerl and Animesh Mukherjee and Alexander Panchenko and Martin Potthast and Francisco Rangel and Paolo Rosso and Alisa Smirnova and Efstathios Stamatatos and Benno Stein and Mariona Taul{\'{e}} and Dmitry Ustalov and Matti Wiegmann and Eva Zangerle},
    editor = {Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis},
    title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative {AI} Authorship Verification - Extended Abstract},
    booktitle = {Advances in Information Retrieval - 46th European Conference on Information Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part {VI}},
    series = {Lecture Notes in Computer Science},
    volume = {14613},
    pages = {3--10},
    publisher = {Springer},
    year = {2024},
    url = {https://doi.org/10.1007/978-3-031-56072-9_1},
    doi = {10.1007/978-3-031-56072-9_1}
}

Contact

Contact person: Daryna Dementieva
