Mapping Multiclass-Targeted Hate Speech in Online Discourse: An Open Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Mapping_Multiclasss-Targeted_Hate_Speech_in_Online_Discourse_An_Open_Dataset/31292419
Description:
The Multiclass English Hate Speech Dataset is a resource for digital humanities research, enabling scholars to study how online hate speech reflects and shapes identity, power, and social boundaries in online communication. The dataset is an extended, fine-grained version of the binary-labelled hate-speech dataset released as part of the TweetEval [1] benchmark (Hate sub-task). While the original dataset contained English posts annotated only as Hate or Non-Hate, this work extends it through a detailed manual re-annotation process that assigns each post to a specific hate-speech category. The added granularity allows more accurate modelling of real-world online hate and supports sociolinguistic analysis.
All posts were manually reviewed and reclassified by trained annotators following a structured annotation guideline. The dataset introduces a comprehensive multiclass taxonomy capturing different forms of explicit and implicit hate, including:
Gender-Based Hate Speech (Misogyny)
Gender-Based Hate Speech (Misandry)
Racial and Ethnic Hate Speech (Anti-Black Hate Speech)
Racial and Ethnic Hate Speech (Anti-Hispanic Hate Speech)
Racial and Ethnic Hate Speech (Anti-Asian Hate Speech)
Racial and Ethnic Hate Speech (Anti-Semitic Hate Speech)
Immigration & Xenophobic Hate Speech (Anti-Immigrant)
Immigration & Xenophobic Hate Speech (Anti-Refugee)
Immigration & Xenophobic Hate Speech (Xenophobia)
Religious Hate Speech (Islamophobia)
Religious Hate Speech (Anti-Christian Hate Speech)
Profanity and General Abuse
Threats and Violence
Hate speech toward Countries

Through this re-annotation effort, the dataset transforms a simple binary classification problem into a 14-class fine-grained hate-speech categorization task. This enables robust research on model sensitivity, bias analysis, explainability, and safety evaluation.
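The 14 classes fall under a smaller number of broader families (gender, race and ethnicity, immigration, religion, plus three standalone classes). A minimal Python sketch of that hierarchy follows. The label text is copied from the description above, but the exact strings stored in the released files may differ, and the names `TAXONOMY`, `fine_labels`, and `coarse_label` are hypothetical helpers introduced only for illustration.

```python
# Taxonomy as written in the dataset description. The exact label strings used
# in the released files may differ; treat these as illustrative assumptions.
TAXONOMY = {
    "Gender-Based Hate Speech": ["Misogyny", "Misandry"],
    "Racial and Ethnic Hate Speech": [
        "Anti-Black Hate Speech",
        "Anti-Hispanic Hate Speech",
        "Anti-Asian Hate Speech",
        "Anti-Semitic Hate Speech",
    ],
    "Immigration & Xenophobic Hate Speech": [
        "Anti-Immigrant",
        "Anti-Refugee",
        "Xenophobia",
    ],
    "Religious Hate Speech": ["Islamophobia", "Anti-Christian Hate Speech"],
    # Standalone classes with no sub-category in the description.
    "Profanity and General Abuse": [],
    "Threats and Violence": [],
    "Hate speech toward Countries": [],
}


def fine_labels(taxonomy=TAXONOMY):
    """Return the flat list of fine-grained classes ("Family (Sub)" or "Family")."""
    labels = []
    for family, subs in taxonomy.items():
        if subs:
            labels.extend(f"{family} ({sub})" for sub in subs)
        else:
            labels.append(family)
    return labels


def coarse_label(fine_label):
    """Hypothetical helper: collapse a fine-grained label to its family."""
    return fine_label.split(" (", 1)[0]


FINE_LABELS = fine_labels()
COARSE_LABELS = sorted({coarse_label(label) for label in FINE_LABELS})
```

Collapsing to the family level in this way could be useful when some fine-grained classes turn out to have too few examples for a given experiment, though whether that applies here depends on the actual class distribution.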
The dataset is suitable for a range of applications, including:
Content moderation research and safety evaluation
Sociolinguistic and discourse analysis of targeted abuse
Digital humanities studies on identity, power, and social boundaries

All user identifiers and personally identifiable information (PII) have been removed or masked to ensure privacy and ethical compliance. Each record includes the anonymized text and the newly assigned multiclass label. The samples presented in the data are hate records extracted from TweetEval.
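Because each record pairs an anonymized text with a single label, a common first step is to check how the records are distributed across classes. The sketch below assumes a CSV layout with columns named `text` and `label`. That schema is an assumption, not something documented here, and the rows are placeholders, not real data.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for the real file; the column names "text" and
# "label" are assumptions, not the documented schema.
SAMPLE_CSV = """text,label
<anonymized post 1>,Threats and Violence
<anonymized post 2>,Religious Hate Speech (Islamophobia)
<anonymized post 3>,Threats and Violence
"""


def label_counts(handle, label_column="label"):
    """Count how many records carry each multiclass label."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(f"{n:>4}  {label}")
```

To run this on a downloaded file, the `io.StringIO` object could be swapped for a handle from `open(path, newline="", encoding="utf-8")`, adjusting `label_column` to whatever the released file actually uses.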
This resource aims to support researchers, practitioners, and policymakers in building safer and more responsible AI systems capable of detecting nuanced forms of online hate and targeted harassment.
[1] https://github.com/cardiffnlp/tweeteval/tree/main/datasets/hate
Created:
2026-02-09



