ruanchaves/hatebr

Name: ruanchaves/hatebr
Creator: ruanchaves
Published: 2023-04-13 13:39:40
License: no description provided

Source: Hugging Face. Updated: 2023-04-13. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ruanchaves/hatebr

Resource description:

---
annotations_creators:
- expert-generated
language:
- pt
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese
size_categories:
- 1K<n<10K
source_datasets:
- original
tags:
- instagram
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---

# Dataset Card for HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese

## Dataset Description

- **Homepage:** http://143.107.183.175:14581/
- **Repository:** https://github.com/franciellevargas/HateBR
- **Paper:** https://aclanthology.org/2022.lrec-1.777/
- **Leaderboard:**
- **Point of Contact:** https://franciellevargas.github.io/

### Dataset Summary

HateBR is the first large-scale, expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The corpus was collected from Instagram comments on Brazilian politicians' accounts and manually annotated by specialists. It comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, with high inter-annotator agreement. Baseline experiments reached an F1-score of 85%, outperforming models previously reported in the literature for Portuguese. Accordingly, we hope that this expert-annotated corpus may foster research on hate speech and offensive language detection in Natural Language Processing.
**Relevant Links:**

* [**Demo: Brasil Sem Ódio**](http://143.107.183.175:14581/)
* [**MOL - Multilingual Offensive Lexicon Annotated with Contextual Information**](https://github.com/franciellevargas/MOL)

### Supported Tasks and Leaderboards

Hate Speech Detection

### Languages

Portuguese

## Dataset Structure

### Data Instances

```
{'instagram_comments': 'Hipocrita!!',
 'offensive_language': True,
 'offensiveness_levels': 2,
 'antisemitism': False,
 'apology_for_the_dictatorship': False,
 'fatphobia': False,
 'homophobia': False,
 'partyism': False,
 'racism': False,
 'religious_intolerance': False,
 'sexism': False,
 'xenophobia': False,
 'offensive_&_non-hate_speech': True,
 'non-offensive': False,
 'specialist_1_hate_speech': False,
 'specialist_2_hate_speech': False,
 'specialist_3_hate_speech': False
}
```

### Data Fields

* **instagram_comments**: Instagram comments.
* **offensive_language**: A classification of comments as either offensive (True) or non-offensive (False).
* **offensiveness_levels**: A classification of comments based on their level of offensiveness, including highly offensive (3), moderately offensive (2), slightly offensive (1) and non-offensive (0).
* **antisemitism**: A classification of whether or not the comment contains antisemitic language.
* **apology_for_the_dictatorship**: A classification of whether or not the comment praises the military dictatorship period in Brazil.
* **fatphobia**: A classification of whether or not the comment contains language that promotes fatphobia.
* **homophobia**: A classification of whether or not the comment contains language that promotes homophobia.
* **partyism**: A classification of whether or not the comment contains language that promotes partyism.
* **racism**: A classification of whether or not the comment contains racist language.
* **religious_intolerance**: A classification of whether or not the comment contains language that promotes religious intolerance.
* **sexism**: A classification of whether or not the comment contains sexist language.
* **xenophobia**: A classification of whether or not the comment contains language that promotes xenophobia.
* **offensive_&_non-hate_speech**: A classification of whether or not the comment is offensive but does not contain hate speech.
* **specialist_1_hate_speech**: A classification of whether or not the comment was annotated by the first specialist as hate speech.
* **specialist_2_hate_speech**: A classification of whether or not the comment was annotated by the second specialist as hate speech.
* **specialist_3_hate_speech**: A classification of whether or not the comment was annotated by the third specialist as hate speech.

### Data Splits

The original authors of the dataset did not propose a standard data split. To address this, we use the [multi-label data stratification technique](http://scikit.ml/stratification.html) implemented in the scikit-multilearn library to propose a train-validation-test split. This method considers all hate speech classes in the data and attempts to balance the representation of each class across the splits.

| name | train | validation | test |
|---------|----:|----:|----:|
| hatebr | 4480 | 1120 | 1400 |

## Considerations for Using the Data

### Discussion of Biases

Please refer to [the HateBR paper](https://aclanthology.org/2022.lrec-1.777/) for a discussion of biases.

### Licensing Information

The HateBR dataset, including all its components, is provided strictly for academic and research purposes. The use of the dataset for any commercial or non-academic purpose is expressly prohibited without the prior written consent of [SINCH](https://www.sinch.com/).
### Citation Information

```
@inproceedings{vargas2022hatebr,
  title={HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection},
  author={Vargas, Francielle and Carvalho, Isabelle and de G{\'o}es, Fabiana Rodrigues and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={7174--7183},
  year={2022}
}
```

### Contributions

Thanks to [@ruanchaves](https://github.com/ruanchaves) for adding this dataset.

Provider:

ruanchaves

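Summary of original information

Dataset Card for HateBR - Offensive Language and Hate Speech Dataset in Brazilian Portuguese

Dataset Description

Dataset Summary

HateBR is the first large-scale, expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The corpus was collected from Instagram comments on Brazilian politicians' accounts and manually annotated by specialists. It comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness level (highly, moderately, and slightly offensive messages), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, with high inter-annotator agreement. Baseline experiments reached an F1-score of 85%, outperforming models previously reported in the literature for Portuguese. Accordingly, we hope that this expert-annotated corpus may foster research on hate speech and offensive language detection in Natural Language Processing.

Supported Tasks and Leaderboards

Hate Speech Detection

Languages

Portuguese

Dataset Structure

Data Instances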

    {
      "instagram_comments": "Hipocrita!!",
      "offensive_language": True,
      "offensiveness_levels": 2,
      "antisemitism": False,
      "apology_for_the_dictatorship": False,
      "fatphobia": False,
      "homophobia": False,
      "partyism": False,
      "racism": False,
      "religious_intolerance": False,
      "sexism": False,
      "xenophobia": False,
      "offensive_&_non-hate_speech": True,
      "non-offensive": False,
      "specialist_1_hate_speech": False,
      "specialist_2_hate_speech": False,
      "specialist_3_hate_speech": False
    }
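
The example instance can be checked mechanically. The following is a minimal sketch, assuming the field names and Python types shown in that single instance (the published files may encode labels differently); `validate_record`, `HATE_GROUPS`, and `BOOL_FIELDS` are hypothetical names introduced here for illustration and are not part of the dataset's tooling.

```python
# Field layout inferred from the one example instance (an assumption, not a schema
# published by the dataset authors).
HATE_GROUPS = [
    "antisemitism", "apology_for_the_dictatorship", "fatphobia", "homophobia",
    "partyism", "racism", "religious_intolerance", "sexism", "xenophobia",
]
BOOL_FIELDS = [
    "offensive_language", *HATE_GROUPS,
    "offensive_&_non-hate_speech", "non-offensive",
    "specialist_1_hate_speech", "specialist_2_hate_speech", "specialist_3_hate_speech",
]


def validate_record(record):
    """Return a list of problems; an empty list means the record fits the assumed layout."""
    problems = []
    if not isinstance(record.get("instagram_comments"), str):
        problems.append("instagram_comments must be a string")
    for name in BOOL_FIELDS:
        if not isinstance(record.get(name), bool):
            problems.append(f"{name} must be a bool")
    level = record.get("offensiveness_levels")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or level not in range(4):
        problems.append("offensiveness_levels must be an int in 0..3")
    return problems


example = {
    "instagram_comments": "Hipocrita!!",
    "offensive_language": True,
    "offensiveness_levels": 2,
    "antisemitism": False,
    "apology_for_the_dictatorship": False,
    "fatphobia": False,
    "homophobia": False,
    "partyism": False,
    "racism": False,
    "religious_intolerance": False,
    "sexism": False,
    "xenophobia": False,
    "offensive_&_non-hate_speech": True,
    "non-offensive": False,
    "specialist_1_hate_speech": False,
    "specialist_2_hate_speech": False,
    "specialist_3_hate_speech": False,
}

print(validate_record(example))  # prints []
```

Returning a list of problems instead of raising on the first one makes it easier to report every mismatch in a record at once.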

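Data Fields

instagram_comments: Instagram comments.
offensive_language: A classification of comments as either offensive (True) or non-offensive (False).
offensiveness_levels: A classification of comments by level of offensiveness: highly offensive (3), moderately offensive (2), slightly offensive (1), and non-offensive (0).
antisemitism: Whether the comment contains antisemitic language.
apology_for_the_dictatorship: Whether the comment praises the military dictatorship period in Brazil.
fatphobia: Whether the comment contains fatphobic language.
homophobia: Whether the comment contains homophobic language.
partyism: Whether the comment contains partyist language.
racism: Whether the comment contains racist language.
religious_intolerance: Whether the comment contains religiously intolerant language.
sexism: Whether the comment contains sexist language.
xenophobia: Whether the comment contains xenophobic language.
offensive_&_non-hate_speech: Whether the comment is offensive but does not contain hate speech.
specialist_1_hate_speech: Whether the first specialist annotated the comment as hate speech.
specialist_2_hate_speech: Whether the second specialist annotated the comment as hate speech.
specialist_3_hate_speech: Whether the third specialist annotated the comment as hate speech.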
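
Because each hate speech group is its own boolean field, downstream code often needs to collapse a record into a list of active labels. The sketch below assumes records are plain Python dicts with the field names listed above; `active_groups`, `specialist_votes`, and `LEVEL_NAMES` are hypothetical helpers introduced here, and the vote tally is only a count, since this card does not state how the three specialist judgments were aggregated.

```python
# The nine hate speech group fields described in the field list.
HATE_GROUPS = (
    "antisemitism", "apology_for_the_dictatorship", "fatphobia", "homophobia",
    "partyism", "racism", "religious_intolerance", "sexism", "xenophobia",
)

# Human-readable names for the offensiveness_levels codes given in the field list.
LEVEL_NAMES = {
    0: "non-offensive",
    1: "slightly offensive",
    2: "moderately offensive",
    3: "highly offensive",
}


def active_groups(record):
    """List the hate speech groups flagged True in a record."""
    return [group for group in HATE_GROUPS if record.get(group) is True]


def specialist_votes(record):
    """Count how many of the three specialists marked the comment as hate speech."""
    return sum(bool(record.get(f"specialist_{i}_hate_speech")) for i in (1, 2, 3))


record = {
    "instagram_comments": "Hipocrita!!",
    "offensive_language": True,
    "offensiveness_levels": 2,
    "racism": False,
    "sexism": False,
    "specialist_1_hate_speech": False,
    "specialist_2_hate_speech": False,
    "specialist_3_hate_speech": False,
}

print(active_groups(record))                        # prints []
print(specialist_votes(record))                     # prints 0
print(LEVEL_NAMES[record["offensiveness_levels"]])  # prints moderately offensive
```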

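Data Splits

The original authors of the dataset did not propose a standard data split. To address this, we use the multi-label data stratification technique implemented in the scikit-multilearn library to propose a train-validation-test split. This method considers all hate speech classes in the data and attempts to balance the representation of each class across the splits.

| Name | Train | Validation | Test |
|---------|----:|----:|----:|
| hatebr | 4480 | 1120 | 1400 |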
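
The split sizes sum to the 7,000 annotated comments and correspond to 64%, 16%, and 20% of the corpus. These shares are consistent with holding out 20% for test and then 20% of the remainder for validation, although the card does not say how the fractions were chosen; that two-stage reading is an assumption. The sketch below only checks the arithmetic and does not perform the stratification itself.

```python
# Split sizes as reported in the table above.
splits = {"train": 4480, "validation": 1120, "test": 1400}

total = sum(splits.values())
shares = {name: size / total for name, size in splits.items()}
print(total)   # prints 7000
print(shares)  # prints {'train': 0.64, 'validation': 0.16, 'test': 0.2}

# Hypothetical two-stage 80/20 reading (an assumption, not stated in the card):
# hold out 20% for test, then 20% of what remains for validation.
test_size = total * 20 // 100
validation_size = (total - test_size) * 20 // 100
train_size = total - test_size - validation_size
print(train_size, validation_size, test_size)  # prints 4480 1120 1400
```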

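Considerations for Using the Data

Discussion of Biases

Please refer to the HateBR paper for a discussion of biases.

Licensing Information

The HateBR dataset, including all its components, is provided strictly for academic and research purposes. The use of the dataset for any commercial or non-academic purpose is expressly prohibited without the prior written consent of SINCH.

Citation Information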

    @inproceedings{vargas2022hatebr,
      title={HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection},
      author={Vargas, Francielle and Carvalho, Isabelle and de G{\'o}es, Fabiana Rodrigues and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio},
      booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
      pages={7174--7183},
      year={2022}
    }

Contributions

Thanks to @ruanchaves for adding this dataset.
