textdetox/multilingual_paradetox
Dataset Overview
Basic Information
- Name: Multilingual Text Detoxification with Parallel Data
- Languages: English (en), Ukrainian (uk), Russian (ru), German (de), Chinese (zh), Amharic (am), Arabic (ar), Hindi (hi), Spanish (es), Italian (it), French (fr), Hebrew (he), Japanese (ja), Tatar (tt)
- License: openrail++
- Size: 10K<n<100K
- Task category: text generation (text-generation)
Dataset Structure
- Features:
  - toxic_sentence: string, the toxic text
  - neutral_sentence: string, the detoxified text
- Data splits:
  - English (en): 400 samples
  - Russian (ru): 400 samples
  - Ukrainian (uk): 400 samples
  - German (de): 400 samples
  - Spanish (es): 400 samples
  - Amharic (am): 400 samples
  - Chinese (zh): 400 samples
  - Arabic (ar): 400 samples
  - Hindi (hi): 400 samples
- Download size: 489288 bytes
- Dataset size: 764013 bytes
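The structure above can be sketched in Python as follows. This is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed and that each split is named by its language code, as the split list suggests. The helper names `total_pairs`, `load_language`, and `is_valid_pair` are hypothetical and exist only for illustration.

```python
# Split sizes as listed on this card (language code -> number of sentence pairs).
SPLIT_SIZES = {
    "en": 400, "ru": 400, "uk": 400, "de": 400, "es": 400,
    "am": 400, "zh": 400, "ar": 400, "hi": 400,
}


def total_pairs(split_sizes):
    """Sum the per-language pair counts."""
    return sum(split_sizes.values())


def load_language(lang_code):
    """Load one language split from the Hugging Face Hub (needs network access).

    Assumption: splits are named by language code, e.g. "en" or "ru".
    """
    # Imported lazily so the offline helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("textdetox/multilingual_paradetox", split=lang_code)


def is_valid_pair(row):
    """Check that a row has both non-empty string fields listed under Features."""
    return all(
        isinstance(row.get(key), str) and row[key].strip()
        for key in ("toxic_sentence", "neutral_sentence")
    )
```

With network access, `load_language("en")` should return the English split, and each row can be checked with `is_valid_pair(row)` before use. `total_pairs(SPLIT_SIZES)` adds up the nine splits listed on this card.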
Uses
- A multilingual parallel dataset for the text detoxification task, prepared for the TextDetox shared task.
Data Sources
- English: Jigsaw, Unitary AI Toxicity Dataset
- Russian: Russian Language Toxic Comments, Toxic Russian Comments
- Ukrainian: Ukrainian Twitter texts
- Spanish: Detecting and Monitoring Hate Speech in Twitter, Detoxis, RoBERTuito
- German: GermEval 2018, 2021
- Amharic: Amharic Hate Speech
- Arabic: OSACT4
- Hindi: Hostility Detection Dataset in Hindi, HASOC track at FIRE 2019
- Italian: AMI, HODI, Jigsaw Multilingual Toxic Comment
- French: FrenchToxicityPrompts, Jigsaw Multilingual Toxic Comment
- Hebrew: Hebrew Offensive Language Dataset
- Hinglish: Hinglish Hate Detection
- Japanese: 2chan posts
- Tatar: in-house data
Citation
@inproceedings{dementieva-etal-2025-multilingual,
    title = "Multilingual and Explainable Text Detoxification with Parallel Corpora",
    author = "Dementieva, Daryna and Babakov, Nikolay and Ronen, Amit and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Moskovskiy, Daniil Alekhseevich and Stakovskii, Elisei and Kaufman, Eran and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander",
    editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.535/",
    pages = "7998--8025"
}
@inproceedings{dementieva2024overview,
    title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
    author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
    booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
    editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
    year={2024},
    organization={CEUR-WS.org}
}
@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
    author = {Janek Bevendorff and Xavier Bonet Casals and Berta Chulvi and Daryna Dementieva and Ashraf Elnagar and Dayne Freitag and Maik Fr{\"{o}}be and Damir Korencic and Maximilian Mayerl and Animesh Mukherjee and Alexander Panchenko and Martin Potthast and Francisco Rangel and Paolo Rosso and Alisa Smirnova and Efstathios Stamatatos and Benno Stein and Mariona Taul{\'{e}} and Dmitry Ustalov and Matti Wiegmann and Eva Zangerle},
    editor = {Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis},
    title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative {AI} Authorship Verification - Extended Abstract},
    booktitle = {Advances in Information Retrieval - 46th European Conference on Information Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part {VI}},
    series = {Lecture Notes in Computer Science},
    volume = {14613},
    pages = {3--10},
    publisher = {Springer},
    year = {2024},
    url = {https://doi.org/10.1007/978-3-031-56072-9_1},
    doi = {10.1007/978-3-031-56072-9_1}
}
Contact
- Contact person: Daryna Dementieva



