NiGuLa/Russian_Inappropriate_Messages

Name: NiGuLa/Russian_Inappropriate_Messages
Creator: NiGuLa
Published: 2023-05-12 13:37:15
License: CC BY-NC-SA 4.0 (as stated in the dataset card below)

Source: Hugging Face. Updated 2023-05-12, indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/NiGuLa/Russian_Inappropriate_Messages

Resource description:

---
language:
- ru
tags:
- toxic comments classification
license: cc
task_categories:
- text-classification
size_categories:
- 100K<n<1M
---

## General concept

The **'inappropriateness'** substance we tried to collect in the dataset and detect with the model **is NOT a substitute for toxicity**; it is rather a derivative of toxicity. So the model based on our dataset could serve as **an additional layer of inappropriateness filtering after toxicity and obscenity filtration**. You can detect the exact sensitive topic by using [this model](https://huggingface.co/Skoltech/russian-sensitive-topics).

Generally, an inappropriate utterance is an utterance that contains no obscene words or any kind of toxic intent but can still harm the reputation of the speaker. Find some samples for more intuition in the table below. Learn more about the concept of inappropriateness [in this article](https://www.aclweb.org/anthology/2021.bsnlp-1.4/) presented at the workshop for Balto-Slavic NLP at the EACL-2021 conference.

Please note that this article describes the first version of the dataset, while the model is trained on the extended version of the dataset open-sourced on our [GitHub](https://github.com/skoltech-nlp/inappropriate-sensitive-topics/blob/main/Version2/appropriateness/Appropriateness.csv) or on [Kaggle](https://www.kaggle.com/nigula/russianinappropriatemessages). The properties of the dataset are the same as those described in the article; the only difference is the size.

| Text | Inappropriate | Comment | Unsafe topic |
|------|---------------|---------|--------------|
| Ладно бы видного деятеля завалили а тут какого то ноунейм нигру преступника | yes | Justification of murder | offline crime |
| Это нарушение УКРФ! | no | Condemnation of breaking the law | offline crime |
| Религия - это насилие высшего уровня. Верующего принуждают к вере еще в младенчестве. | yes | Insulting the feelings of believers | religion |
| Христиане, как вы относитесь к иноверцам, в частности к Гамалиилу и к уверовавшим в Иисуса фарисеям? | no | Ordinary questions to believers with no hint of insulting their feelings | religion |
| Хорошо что проституция запрещена | no | Condemnation of prostitution | prostitution |
| Возьму бабки и сниму шлюх на выходных | yes | Popularization of prostitution | prostitution |

## Licensing Information

[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png

## Citation

If you find this repository helpful, feel free to cite our publication:

```
@inproceedings{babakov-etal-2021-detecting,
    title = "Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company{'}s Reputation",
    author = "Babakov, Nikolay and
      Logacheva, Varvara and
      Kozlova, Olga and
      Semenov, Nikita and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.4",
    pages = "26--36",
    abstract = "Not all topics are equally {``}flammable{''} in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labelling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian: a topic-labelled dataset and an appropriateness-labelled dataset. We also release pre-trained classification models trained on this data.",
}
```

## Contacts

If you have any questions, please contact [Nikolay](mailto:bbkhse@gmail.com).

Provided by:

NiGuLa

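Summary of the original information

Dataset overview

Dataset name

The dataset card does not state a separate title; the repository name is NiGuLa/Russian_Inappropriate_Messages.

Dataset content

The dataset focuses on collecting and detecting "inappropriate" content, which is not a substitute for toxicity or obscenity but rather a derivative of toxicity.
The dataset is used to train a model that serves as an additional inappropriateness-filtering layer applied after toxicity and obscenity filtering.
The dataset contains samples of inappropriate utterances on a range of sensitive topics, such as religion, crime, and prostitution.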
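The layering described above (inappropriateness filtering applied after toxicity and obscenity filtering) can be sketched as an ordered chain of checks. This is only an illustrative sketch: the `moderate` function, the `Verdict` type, and the three keyword-based `toy_*` checks are hypothetical stand-ins, not the released models or any library API. In practice each layer would wrap a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A layer is a (name, predicate) pair; the predicate returns True to block a message.
Layer = Tuple[str, Callable[[str], bool]]


@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str]  # name of the first layer that rejected the message


def moderate(text: str, layers: List[Layer]) -> Verdict:
    """Run the layers in order and stop at the first one that blocks the text."""
    for name, should_block in layers:
        if should_block(text):
            return Verdict(allowed=False, blocked_by=name)
    return Verdict(allowed=True, blocked_by=None)


# Hypothetical keyword rules, used only so the sketch runs without any model.
def toy_toxicity(text: str) -> bool:
    return "idiot" in text.lower()


def toy_obscenity(text: str) -> bool:
    return "damn" in text.lower()


def toy_inappropriateness(text: str) -> bool:
    # Placeholder rule; it does not reflect how a real classifier decides.
    return "crime is fine" in text.lower()


# Order matters: the inappropriateness layer only sees what earlier layers let through.
LAYERS: List[Layer] = [
    ("toxicity", toy_toxicity),
    ("obscenity", toy_obscenity),
    ("inappropriateness", toy_inappropriateness),
]
```

Calling `moderate("Honestly, crime is fine", LAYERS)` passes the first two toy layers and is rejected by the third, which mirrors the idea that an inappropriate message need not be toxic or obscene.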

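Dataset characteristics

Inappropriate utterances in the dataset contain no obscene words or explicit toxic intent, but can still harm the speaker's reputation.
The extended version of the dataset is open-sourced and available on GitHub or Kaggle.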

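Dataset usage

Users can detect the specific sensitive topic of a message with the separate Skoltech/russian-sensitive-topics model.
The dataset is suited to text-classification tasks.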
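As a rough sketch of preparing such data for a text-classification task, the snippet below reads rows shaped like the sample table in the dataset card and maps the yes/no flag to a binary label. The column names `text`, `inappropriate`, and `topic`, the yes/no encoding, and the placeholder rows are assumptions made for illustration; the layout of the published CSV may differ and should be checked before use.

```python
import csv
import io
from collections import Counter

# Placeholder rows; column names and yes/no encoding are assumed, not taken from the real file.
SAMPLE_CSV = """text,inappropriate,topic
placeholder message one,yes,religion
placeholder message two,no,religion
placeholder message three,yes,prostitution
placeholder message four,no,offline crime
"""


def load_examples(handle):
    """Read rows and convert the yes/no flag into a 0/1 label."""
    rows = []
    for row in csv.DictReader(handle):
        label = 1 if row["inappropriate"].strip().lower() == "yes" else 0
        rows.append({"text": row["text"], "label": label, "topic": row["topic"]})
    return rows


def inappropriate_share_by_topic(rows):
    """Return the fraction of rows labelled inappropriate for each topic."""
    totals = Counter(r["topic"] for r in rows)
    flagged = Counter(r["topic"] for r in rows if r["label"] == 1)
    return {topic: flagged[topic] / totals[topic] for topic in totals}


examples = load_examples(io.StringIO(SAMPLE_CSV))
shares = inappropriate_share_by_topic(examples)
```

Checking the per-topic share of positive labels is a simple way to spot class imbalance before training; for a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle.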

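Dataset size

The dataset falls in the 100K<n<1M size category, that is, between 100,000 and 1,000,000 entries.

Language

The dataset is primarily in Russian.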

License

The dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

To cite the dataset, refer to the publication given in the Citation section of the dataset card above (Babakov et al., 2021, "Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company's Reputation").

Contact

If you have any questions, please contact Nikolay.
