NiGuLa/Russian_Sensitive_Topics
收藏数据集概述
数据集语言
- 俄语
数据集标签
- 有毒评论分类
许可证
- 创意共享署名-非商业性使用-相同方式共享 4.0 国际许可协议
任务类别
- 文本分类
大小类别
- 10,000 < 数据集大小 < 100,000
模型概念
- 数据集关注于敏感话题,如恐同、政治、种族主义等,共涉及18个话题。
引用信息
@inproceedings{babakov-etal-2021-detecting, title = "Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company{}s Reputation", author = "Babakov, Nikolay and Logacheva, Varvara and Kozlova, Olga and Semenov, Nikita and Panchenko, Alexander", booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing", month = apr, year = "2021", address = "Kiyv, Ukraine", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2021.bsnlp-1.4", pages = "26--36", abstract = "Not all topics are equally {``}flammable{} in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labelling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian: a topic-labelled dataset and an appropriateness-labelled dataset. We also release pre-trained classification models trained on this data.", }



