ISHate

Source: GitHub | Updated: 2024-05-09 | Indexed: 2024-05-31

Download link:

https://github.com/benjaminocampo/ISHate

Resource overview:

This dataset supports in-depth analysis of implicit and subtle hate speech messages. It provides training, development, and test sets for training and evaluating machine learning models that detect implicit and subtle hate speech.

Created:

2023-01-28

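Summary of original information

Dataset overview

Dataset name

ISHate dataset

Dataset contents

The dataset supports the analysis of implicit and subtle hate speech (HS) messages.
It contains training, development, and test sets, stored as compressed parquet files.
Messages are labeled as Explicit HS, Implicit HS, Non-Subtle, or Subtle.
Target groups are normalized, which makes their distribution easier to analyze and inspect.
The dataset is augmented with additional examples of the minority classes (Implicit HS and Subtle HS).

How to obtain the dataset

Direct download: the dataset files are in the ./data/ directory and can be read directly with pandas.
Via Hugging Face: download the dataset with the datasets library.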
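
The two access routes above can be sketched as follows. This is a minimal sketch rather than the repository's own loader: the file-name pattern `ishate_<split>.parquet.gzip` and the split name `dev` are assumptions to check against the actual contents of ./data/, and `pandas.read_parquet` needs a parquet engine such as pyarrow installed. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed split names; check them against the files in ./data/.
SPLITS = ("train", "dev", "test")


def split_path(data_dir, split, suffix=".parquet.gzip"):
    """Build the path of one split file (the file-name pattern is an assumption)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return Path(data_dir) / f"ishate_{split}{suffix}"


def load_split(data_dir, split):
    """Read one split into a pandas DataFrame from a local copy of the repository."""
    import pandas as pd  # imported lazily; also needs a parquet engine (e.g. pyarrow)

    return pd.read_parquet(split_path(data_dir, split))


def load_from_hub(repo_id="BenjaminOcampo/ISHate"):
    """Download the dataset with the Hugging Face datasets library."""
    from datasets import load_dataset  # imported lazily; requires network access

    return load_dataset(repo_id)
```

With a local clone, `load_split("./data", "train")` would return the training split as a DataFrame, while `load_from_hub()` fetches the dataset card named on this page.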

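Usage recommendations

The authors recommend training models on all the original data plus the augmented data (the union of all augmentation methods).
Implicit properties are annotated for all Implicit HS messages; the authors plan to extend this annotation to the augmented sentences.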
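
The recommendation to train on original plus augmented data could be applied with a filter like the one below. It is a sketch under stated assumptions: the column names `aug_method` and `implicit_layer`, and the convention that original rows carry no augmentation method, are not confirmed by this page, and the toy rows are placeholders rather than dataset content. Plain dictionaries are used so the example runs without extra dependencies; with pandas, the same selection becomes a boolean mask.

```python
from collections import Counter

# Assumed column names, not confirmed by this page.
AUG_COL = "aug_method"
LABEL_COL = "implicit_layer"


def select_training_rows(rows, methods=None):
    """Keep original rows plus rows produced by the chosen augmentation methods.

    methods=None keeps every augmentation method (the recommended union);
    an empty set keeps only the original rows.
    """
    selected = []
    for row in rows:
        method = row.get(AUG_COL)
        if method is None or methods is None or method in methods:
            selected.append(row)
    return selected


def label_counts(rows):
    """Count rows per label to inspect how balanced the selection is."""
    return Counter(row[LABEL_COL] for row in rows)


# Placeholder rows for illustration only.
rows = [
    {"text": "t1", LABEL_COL: "Explicit HS", AUG_COL: None},
    {"text": "t2", LABEL_COL: "Implicit HS", AUG_COL: None},
    {"text": "t3", LABEL_COL: "Implicit HS", AUG_COL: "EDA"},
    {"text": "t4", LABEL_COL: "Implicit HS", AUG_COL: "BT"},
]
original_only = select_training_rows(rows, methods=set())
recommended = select_training_rows(rows)  # originals plus all augmentations
```

Comparing `label_counts(original_only)` with `label_counts(recommended)` shows how much the augmented rows add to the minority class.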

Related links

Hugging Face dataset card: BenjaminOcampo/ISHate

Citation

@inproceedings{ocampo-etal-2023-depth,
    title = "An In-depth Analysis of Implicit and Subtle Hate Speech Messages",
    author = "Ocampo, Nicol{\'a}s Benjam{\'\i}n and Sviridova, Ekaterina and Cabrio, Elena and Villata, Serena",
    editor = "Vlachos, Andreas and Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.147",
    doi = "10.18653/v1/2023.eacl-main.147",
    pages = "1997--2013",
    abstract = "The research carried out so far in detecting abusive content in social media has primarily focused on overt forms of hate speech. While explicit hate speech (HS) is more easily identifiable by recognizing hateful words, messages containing linguistically subtle and implicit forms of HS (as circumlocution, metaphors and sarcasm) constitute a real challenge for automatic systems. While the sneaky and tricky nature of subtle messages might be perceived as less hurtful with respect to the same content expressed clearly, such abuse is at least as harmful as overt abuse. In this paper, we first provide an in-depth and systematic analysis of 7 standard benchmarks for HS detection, relying on a fine-grained and linguistically-grounded definition of implicit and subtle messages. Then, we experiment with state-of-the-art neural network architectures on two supervised tasks, namely implicit HS and subtle HS message classification. We show that while such models perform satisfactory on explicit messages, they fail to detect implicit and subtle content, highlighting the fact that HS detection is not a solved problem and deserves further investigation.",
}

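Compiled summary

Dataset introduction

Construction

ISHate was built through an in-depth analysis of implicit and subtle hate speech messages, with the aim of capturing hate speech that is hard to identify directly. It contains the original messages together with augmented versions produced by several data augmentation methods (such as AAV, BT, and EDA), which increase the number of implicit and subtle samples and so balance the class distribution. Target groups were also normalized to reduce overlap between groups and keep the analysis accurate.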
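
To illustrate what text-level augmentation of a minority class can look like, the sketch below implements two perturbations in the style of EDA, assuming EDA here refers to Easy Data Augmentation: random swap and random deletion. It is not the authors' pipeline. Synonym replacement and insertion are omitted because they need a thesaurus, and all function names and default parameters are hypothetical.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of words with two randomly chosen positions swapped n_swaps times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(list(words))]


def augment(text, n_variants=2, p_delete=0.1, seed=0):
    """Produce n_variants perturbed copies of text; each would keep the source label."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    words = text.split()
    variants = []
    for k in range(n_variants):
        if k % 2 == 0:
            new_words = random_swap(words, 1, rng)
        else:
            new_words = random_deletion(words, p_delete, rng)
        variants.append(" ".join(new_words))
    return variants
```

Applying `augment` only to messages of the under-represented classes, and appending the variants with the same label, is one way to move the class distribution toward balance.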

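Features

The distinguishing feature of ISHate is its fine-grained classification of implicit and subtle hate speech: it separates explicit from implicit hate speech and additionally labels messages as Non-Subtle or Subtle. The dataset includes a large share of augmented data, produced with several augmentation techniques and intended to improve model generalization. Target-group annotations are normalized, which simplifies analysis and research.

Usage

Users can read the dataset's parquet files directly with pandas, or download and load the dataset with the Hugging Face datasets library. The data is split into training, development (validation), and test sets and is suited to implicit and subtle hate speech detection tasks. For models, users can load pretrained BERT, DeBERTa, or HateBERT models with the Hugging Face Transformers library, or load an SVM model with Python's pickle module to make predictions.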
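
The two model-loading routes mentioned above can be sketched as follows. This page does not give model identifiers or file paths, so `model_id` and `path` are placeholders the reader must supply, and the helper names are hypothetical. The sketch assumes the transformers model was saved with a sequence-classification head.

```python
import pickle


def load_pickled_model(path):
    """Load a pickled estimator (for example an SVM pipeline) from disk.

    Only unpickle files from a source you trust: pickle can run arbitrary code.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def load_transformer_classifier(model_id):
    """Build a text-classification pipeline from a Hub model id or local directory."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    return pipeline("text-classification", model=model_id)


def predict_labels(classifier, texts):
    """Run a transformers pipeline and keep only the predicted label per text."""
    return [result["label"] for result in classifier(texts)]
```

A pickled scikit-learn style model would instead be used through its own `predict` method on the loaded object, with the same preprocessing it was trained with.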

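Background and challenges

Background overview

The ISHate dataset was created in 2023 by Nicolás Benjamín Ocampo and colleagues to support in-depth analysis of implicit and subtle hate speech messages. It complements existing hate speech detection research by focusing on hate speech expressed in veiled or subtle ways. In the paper "An In-depth Analysis of Implicit and Subtle Hate Speech Messages", published at EACL 2023, the team proposed a fine-grained classification of implicit and subtle hate speech and showed how difficult these message types are for automatic detection systems. The dataset gives the hate speech detection field a new research direction and gives researchers in related areas a useful resource.

Current challenges

The main challenges around ISHate concern identifying and classifying implicit and subtle hate speech. First, the definition of implicit hate speech and the criteria for recognizing it are fuzzy, and settling them requires careful linguistic analysis. Second, subtle hate speech is even harder to detect: it is expressed obliquely and is easy to miss, so existing automatic detection models struggle to capture it. Building the dataset also raises data augmentation and label extension problems, such as how to increase the number of implicit hate speech samples without distorting the structure of the original data, and how to assign accurate implicit and subtle labels to the augmented data. These challenges affect the quality of the dataset and raise the bar for subsequent model training and evaluation.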

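Common scenarios

Typical use cases

ISHate is mainly used to detect and classify implicit and subtle hate speech messages. By distinguishing explicit hate speech (Explicit HS), implicit hate speech (Implicit HS), and non-subtle versus subtle hate speech (Non-Subtle and Subtle), it gives researchers a broad basis for analyzing and understanding different forms of hate speech. Combining the original and augmented data, researchers can train models to recognize these complex, hard-to-spot forms of hate speech and so improve the accuracy and efficiency of social media content moderation.

Derived and related work

The release of ISHate has prompted related research, particularly on the detection of implicit and subtle hate speech. Researchers have used the dataset to build machine learning models, including BERT, DeBERTa, and HateBERT, to improve detection accuracy. The dataset has also encouraged the use of data augmentation techniques such as AAV, BT, and EDA to improve model performance. This work broadens hate speech detection research and serves as a reference for technical development in related fields.

Recent research on the dataset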