ISHate
Dataset Overview
Dataset Name
- ISHate dataset
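Dataset Contents
- The dataset supports the analysis of implicit and subtle hate speech (HS) messages.
- It includes training, development, and test splits, stored as compressed parquet files.
- Each message is labeled as Explicit HS, Implicit HS, Non-Subtle, or Subtle.
- Target groups have been normalized, which makes it easier to analyze and inspect their distribution.
- The dataset has been augmented with additional examples of the minority classes (Implicit HS and Subtle HS).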
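How to Obtain the Dataset
- Direct download: the dataset files are located in the ./data/ directory and can be read directly with pandas.
- Download via Hugging Face: use the datasets library to download the dataset from Hugging Face.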
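Both access routes can be sketched in a few lines of Python. This is a minimal sketch, not the authors' loading code: the exact file names under ./data/ are not given here, so the glob pattern is an assumption, and the helper names (find_parquet_files, load_local, load_from_hub) are hypothetical. The repository id is the one listed on the Hugging Face dataset card; whether it loads without an extra configuration name is also an assumption.

```python
from pathlib import Path


def find_parquet_files(data_dir="./data"):
    """Return the parquet files found directly under data_dir, sorted by name.

    The pattern also matches suffixed names such as "x.parquet.gzip",
    because the exact file names are not stated (assumption).
    """
    return sorted(Path(data_dir).glob("*.parquet*"))


def load_local(path):
    """Read one parquet file into a pandas DataFrame."""
    import pandas as pd  # requires a parquet engine such as pyarrow

    return pd.read_parquet(path)


def load_from_hub(repo_id="BenjaminOcampo/ISHate"):
    """Download the dataset from the Hugging Face Hub with the datasets library."""
    from datasets import load_dataset

    return load_dataset(repo_id)
```

A typical call would be `frames = {p.name: load_local(p) for p in find_parquet_files()}` for the local files, or `load_from_hub()` for the hosted copy. Parquet compression is handled inside the file format, so pandas.read_parquet needs no separate decompression step.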
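Usage Recommendations
- For training, it is recommended to use all of the original data plus the augmented data (the union of all augmentation methods).
- Implicit properties have been annotated for all Implicit HS messages; the authors plan to extend this annotation to the augmented sentences.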
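The recommended training set, original data plus the union of the augmentation methods, can be illustrated with a small merge helper. This is a sketch under stated assumptions: records are represented as plain dictionaries, the field name "text" is hypothetical (the actual column names are not given here), and the function union_training_set is not part of any released code.

```python
def union_training_set(original, *augmented_sets, key="text"):
    """Merge original records with several augmented sets, dropping repeats.

    Each record is a dict; `key` names the field holding the message text
    (a hypothetical field name). Original records come first, and the first
    occurrence of each message is the one that is kept.
    """
    seen = set()
    merged = []
    for records in (original, *augmented_sets):
        for record in records:
            value = record[key]
            if value not in seen:
                seen.add(value)
                merged.append(record)
    return merged
```

With pandas DataFrames, the same idea can be expressed as `pd.concat(frames, ignore_index=True).drop_duplicates(subset="text")`, where "text" again stands in for whichever column holds the message.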
Related Links
- Hugging Face dataset card: BenjaminOcampo/ISHate
Citation
@inproceedings{ocampo-etal-2023-depth,
    title = "An In-depth Analysis of Implicit and Subtle Hate Speech Messages",
    author = "Ocampo, Nicol{\'a}s Benjam{\'\i}n and Sviridova, Ekaterina and Cabrio, Elena and Villata, Serena",
    editor = "Vlachos, Andreas and Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.147",
    doi = "10.18653/v1/2023.eacl-main.147",
    pages = "1997--2013",
    abstract = "The research carried out so far in detecting abusive content in social media has primarily focused on overt forms of hate speech. While explicit hate speech (HS) is more easily identifiable by recognizing hateful words, messages containing linguistically subtle and implicit forms of HS (as circumlocution, metaphors and sarcasm) constitute a real challenge for automatic systems. While the sneaky and tricky nature of subtle messages might be perceived as less hurtful with respect to the same content expressed clearly, such abuse is at least as harmful as overt abuse. In this paper, we first provide an in-depth and systematic analysis of 7 standard benchmarks for HS detection, relying on a fine-grained and linguistically-grounded definition of implicit and subtle messages. Then, we experiment with state-of-the-art neural network architectures on two supervised tasks, namely implicit HS and subtle HS message classification. We show that while such models perform satisfactory on explicit messages, they fail to detect implicit and subtle content, highlighting the fact that HS detection is not a solved problem and deserves further investigation.",
}




