HateBR

arXiv | Updated 2022-12-27 | Indexed 2024-08-06

Download link:

http://arxiv.org/abs/2103.14972v6

Overview:

HateBR is a large, expert-annotated corpus of Brazilian Portuguese Instagram comments built by researchers at the University of São Paulo and the Federal University of Minas Gerais for offensive language and hate speech detection. It contains 7,000 comments collected from the Instagram accounts of Brazilian political figures, each manually annotated by experts with high inter-annotator agreement. Annotation is organized in three layers: a binary label (offensive vs. non-offensive), an offensiveness level (highly, moderately, or slightly offensive), and nine hate speech groups (for example, racism and sexism). The team followed strict annotation guidelines and annotator training steps to keep the labels consistent and of high quality. The dataset is intended mainly for automatic detection and classification of offensive language and hate speech on social media, with the aim of improving online safety and identifying individuals with malicious intent toward specific groups.
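The three annotation layers described above can be pictured as a small record type. The following is only an illustrative sketch: the class and field names (`AnnotatedComment`, `OffenseLevel`, `hate_groups`), the numeric level encoding, and the consistency rule between layers are assumptions made for this example, not the dataset's actual file schema. Only the two hate speech groups named in the description are used.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, Iterable, Optional


class OffenseLevel(Enum):
    """Second layer: offensiveness level (numeric values are hypothetical)."""
    SLIGHT = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class AnnotatedComment:
    """One comment carrying all three annotation layers (hypothetical layout)."""
    text: str
    offensive: bool                                  # layer 1: binary label
    level: Optional[OffenseLevel] = None             # layer 2: offensiveness level
    hate_groups: FrozenSet[str] = field(default_factory=frozenset)  # layer 3

    def __post_init__(self) -> None:
        # Assumed consistency rule: the finer layers only apply to
        # comments that the binary layer marks as offensive.
        if self.offensive and self.level is None:
            raise ValueError("an offensive comment needs an offensiveness level")
        if not self.offensive and (self.level is not None or self.hate_groups):
            raise ValueError("a non-offensive comment cannot carry finer labels")


def count_by_level(comments: Iterable[AnnotatedComment]) -> Counter:
    """Count offensive comments per offensiveness level name."""
    return Counter(c.level.name for c in comments if c.offensive)


if __name__ == "__main__":
    sample = [
        AnnotatedComment("exemplo 1", offensive=True,
                         level=OffenseLevel.HIGH,
                         hate_groups=frozenset({"racism"})),
        AnnotatedComment("exemplo 2", offensive=False),
    ]
    print(count_by_level(sample))
```

Keeping the binary label separate from the level and group fields mirrors the layered description: a classifier can be trained on the first layer alone, or on the finer layers restricted to offensive comments.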

Provided by:

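Institute of Mathematical and Computer Sciences, University of São Paulo; Department of Computer Science, Federal University of Minas Gerais, Brazil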

Created:

2021-03-28
