
textdetox/multilingual_paradetox

Source: arXiv. Updated 2025-05-22; indexed 2025-05-24.
Download link:
https://huggingface.co/datasets/textdetox/multilingual_paradetox
Resource summary:
This dataset is a multilingual parallel detoxification dataset that provides parallel toxic and neutral text pairs across nine typologically distinct languages. These texts are carefully selected to ensure that toxic content is paired with its semantically equivalent neutral (non-toxic) samples. This parallel setup enables direct evaluation of cross-lingual detoxification performance. The dataset is hosted on Hugging Face to facilitate experiments and analysis for researchers.
Provided by:
University of Virginia, Carnegie Mellon University, Allen Institute for Artificial Intelligence, Microsoft, Indian Institute of Technology Gandhinagar
Created:
2025-05-22
Summary of original information

Dataset overview

Basic information

  • Name: Multilingual Text Detoxification with Parallel Data
  • Languages: English (en), Ukrainian (uk), Russian (ru), German (de), Chinese (zh), Amharic (am), Arabic (ar), Hindi (hi), Spanish (es), Italian (it), French (fr), Hebrew (he), Japanese (ja), Tatar (tt)
  • License: openrail++
  • Size: 10K<n<100K
  • Task category: text generation (text-generation)

Dataset structure

  • Features:
    • toxic_sentence: string, the toxic text
    • neutral_sentence: string, the detoxified text
  • Splits:
    • English (en): 400 samples
    • Russian (ru): 400 samples
    • Ukrainian (uk): 400 samples
    • German (de): 400 samples
    • Spanish (es): 400 samples
    • Amharic (am): 400 samples
    • Chinese (zh): 400 samples
    • Arabic (ar): 400 samples
    • Hindi (hi): 400 samples
  • Download size: 489288 bytes
  • Dataset size: 764013 bytes
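
The structure above can be read as one split per language code, each row holding a toxic sentence and its neutral counterpart. The following is a minimal Python sketch of loading one split and turning it into sentence pairs. It assumes the Hugging Face `datasets` library is installed and that the splits are named by language code as listed on this page. The helper names `to_pairs` and `load_pairs` are hypothetical and are not part of the dataset or of any library.

```python
from typing import Iterable, List, Mapping, Tuple

# Dataset identifier taken from the download link on this page.
DATASET_ID = "textdetox/multilingual_paradetox"

# Split sizes as listed on this page (assumed to match the hosted data).
SPLIT_SIZES = {
    "en": 400, "ru": 400, "uk": 400, "de": 400, "es": 400,
    "am": 400, "zh": 400, "ar": 400, "hi": 400,
}


def to_pairs(rows: Iterable[Mapping[str, str]]) -> List[Tuple[str, str]]:
    """Convert rows with the two documented features into (toxic, neutral) tuples."""
    return [(row["toxic_sentence"], row["neutral_sentence"]) for row in rows]


def load_pairs(lang: str) -> List[Tuple[str, str]]:
    """Load one language split and return its sentence pairs.

    Hypothetical helper: assumes each split is named by its language code.
    """
    if lang not in SPLIT_SIZES:
        raise ValueError(f"unknown language split: {lang!r}")
    # Imported lazily so the pure helpers above work without the library.
    from datasets import load_dataset

    split = load_dataset(DATASET_ID, split=lang)
    return to_pairs(split)
```

If the split names and sizes listed here hold for the hosted data, `load_pairs("en")` would return 400 pairs. Each pair can then be passed to whatever detoxification model or metric is under evaluation.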

Uses

  • A multilingual parallel dataset for the text detoxification task, prepared for the TextDetox shared task.

Data sources

  • English: Jigsaw, Unitary AI Toxicity Dataset
  • Russian: Russian Language Toxic Comments, Toxic Russian Comments
  • Ukrainian: Ukrainian Twitter texts
  • Spanish: Detecting and Monitoring Hate Speech in Twitter, Detoxis, RoBERTuito
  • German: GermEval 2018, 2021
  • Amharic: Amharic Hate Speech
  • Arabic: OSACT4
  • Hindi: Hostility Detection Dataset in Hindi, HASOC track at FIRE 2019
  • Italian: AMI, HODI, Jigsaw Multilingual Toxic Comment
  • French: FrenchToxicityPrompts, Jigsaw Multilingual Toxic Comment
  • Hebrew: Hebrew Offensive Language Dataset
  • Hinglish: Hinglish Hate Detection
  • Japanese: 2chan posts
  • Tatar: own data

Citation

@inproceedings{dementieva-etal-2025-multilingual,
    title = "Multilingual and Explainable Text Detoxification with Parallel Corpora",
    author = "Dementieva, Daryna and Babakov, Nikolay and Ronen, Amit and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Moskovskiy, Daniil Alekhseevich and Stakovskii, Elisei and Kaufman, Eran and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander",
    editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.535/",
    pages = "7998--8025"
}

@inproceedings{dementieva2024overview,
    title = {Overview of the Multilingual Text Detoxification Task at PAN 2024},
    author = {Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
    booktitle = {Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
    editor = {Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
    year = {2024},
    organization = {CEUR-WS.org}
}

@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
    author = {Janek Bevendorff and Xavier Bonet Casals and Berta Chulvi and Daryna Dementieva and Ashraf Elnagar and Dayne Freitag and Maik Fr{\"{o}}be and Damir Korencic and Maximilian Mayerl and Animesh Mukherjee and Alexander Panchenko and Martin Potthast and Francisco Rangel and Paolo Rosso and Alisa Smirnova and Efstathios Stamatatos and Benno Stein and Mariona Taul{\'{e}} and Dmitry Ustalov and Matti Wiegmann and Eva Zangerle},
    editor = {Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis},
    title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative {AI} Authorship Verification - Extended Abstract},
    booktitle = {Advances in Information Retrieval - 46th European Conference on Information Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part {VI}},
    series = {Lecture Notes in Computer Science},
    volume = {14613},
    pages = {3--10},
    publisher = {Springer},
    year = {2024},
    url = {https://doi.org/10.1007/978-3-031-56072-9_1},
    doi = {10.1007/978-3-031-56072-9_1}
}
