textdetox/multilingual_paradetox_test
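Dataset overview
Dataset name
- MultiParaDetox (Test)
Dataset description
- A multilingual parallel dataset for text detoxification, prepared for the CLEF TextDetox 2024 shared task.
- The dataset covers 9 languages. For each language, 1000 toxic<->detoxified text pairs were collected and divided into a development set (400 pairs) and a test set (600 pairs).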
Dataset configuration
- Default configuration
  - Data file paths
    - uk: data/uk-*
    - hi: data/hi-*
    - zh: data/zh-*
    - ar: data/ar-*
    - de: data/de-*
    - en: data/en-*
    - ru: data/ru-*
    - am: data/am-*
    - es: data/es-*
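The split-to-path mapping above can be written out programmatically. This is a minimal sketch, assuming the Hugging Face `datasets` library is the loader (the card itself does not say so). `LANGS`, `data_files`, and `load_language` are hypothetical names introduced for illustration. The import is deferred so the mapping can be inspected without the library installed.

```python
# Language codes listed in the default configuration.
LANGS = ["uk", "hi", "zh", "ar", "de", "en", "ru", "am", "es"]

# Each split is named after its language code and matches data/<code>-*.
data_files = {lang: f"data/{lang}-*" for lang in LANGS}


def load_language(lang: str):
    """Load one language split (assumes `datasets` is installed and network access)."""
    if lang not in data_files:
        raise ValueError(f"unknown language code: {lang!r}")
    # Deferred import: only needed when the data is actually downloaded.
    from datasets import load_dataset

    return load_dataset("textdetox/multilingual_paradetox_test", split=lang)
```

Calling `load_language("en")` would then request only the English split rather than all nine.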
Dataset features
- text: string
Dataset splits
- uk
  - Bytes: 64010
  - Examples: 600
- hi
  - Bytes: 84742
  - Examples: 600
- zh
  - Bytes: 51159
  - Examples: 600
- ar
  - Bytes: 67319
  - Examples: 600
- de
  - Bytes: 68242
  - Examples: 600
- en
  - Bytes: 37872
  - Examples: 600
- ru
  - Bytes: 73326
  - Examples: 600
- am
  - Bytes: 110756
  - Examples: 600
- es
  - Bytes: 40172
  - Examples: 600
Dataset size
- Download size: 377419 bytes
- Dataset size: 597598 bytes
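The per-split figures can be cross-checked against the reported totals. The sketch below uses only the numbers listed above; `SPLIT_BYTES` and `EXAMPLES_PER_SPLIT` are illustrative names, not part of the dataset.

```python
# Byte counts per language split, copied from the split listing.
SPLIT_BYTES = {
    "uk": 64010, "hi": 84742, "zh": 51159,
    "ar": 67319, "de": 68242, "en": 37872,
    "ru": 73326, "am": 110756, "es": 40172,
}
EXAMPLES_PER_SPLIT = 600

total_bytes = sum(SPLIT_BYTES.values())
total_examples = EXAMPLES_PER_SPLIT * len(SPLIT_BYTES)

print(total_bytes)     # 597598, equal to the reported dataset size
print(total_examples)  # 5400 test examples across the 9 languages
```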
Data sources
- English: Jigsaw; Unitary AI Toxicity Dataset
- Russian: Russian Language Toxic Comments; Toxic Russian Comments
- Ukrainian: Ukrainian Twitter texts
- Spanish: Detecting and Monitoring Hate Speech in Twitter; Detoxis; RoBERTuito: a pre-trained language model for social media text in Spanish
- German: GermEval 2018, 2021
- Amharic: Amharic Hate Speech
- Arabic: OSACT4
- Hindi: Hostility Detection Dataset in Hindi; Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages
Citation
If you use this dataset, please cite the following works:
@inproceedings{dementieva2024overview,
  title        = {Overview of the Multilingual Text Detoxification Task at PAN 2024},
  author       = {Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
  booktitle    = {Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
  editor       = {Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
  year         = {2024},
  organization = {CEUR-WS.org}
}
@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
  author    = {Janek Bevendorff and Xavier Bonet Casals and Berta Chulvi and Daryna Dementieva and Ashraf Elnagar and Dayne Freitag and Maik Fr{\"{o}}be and Damir Korencic and Maximilian Mayerl and Animesh Mukherjee and Alexander Panchenko and Martin Potthast and Francisco Rangel and Paolo Rosso and Alisa Smirnova and Efstathios Stamatatos and Benno Stein and Mariona Taul{\'{e}} and Dmitry Ustalov and Matti Wiegmann and Eva Zangerle},
  editor    = {Nazli Goharian and Nicola Tonellotto and Yulan He and Aldo Lipani and Graham McDonald and Craig Macdonald and Iadh Ounis},
  title     = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative {AI} Authorship Verification - Extended Abstract},
  booktitle = {Advances in Information Retrieval - 46th European Conference on Information Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part {VI}},
  series    = {Lecture Notes in Computer Science},
  volume    = {14613},
  pages     = {3--10},
  publisher = {Springer},
  year      = {2024},
  url       = {https://doi.org/10.1007/978-3-031-56072-9_1},
  doi       = {10.1007/978-3-031-56072-9_1},
  timestamp = {Fri, 29 Mar 2024 23:01:36 +0100},
  biburl    = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}