RUSHOLD (Roman Urdu Hate Speech and Offensive Language Dataset)

Name: RUSHOLD (Roman Urdu Hate Speech and Offensive Language Dataset)
Creator: OpenDataLab
Published: 2026-05-24 06:30:15
License: No description available

OpenDataLab, updated 2026-05-24, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/RUSHOLD

Resource overview:

HSOL is a dataset for hate speech detection. The authors started from a hate speech lexicon compiled by Hatebase.org, which contains words and phrases that internet users have identified as hate speech. Using the Twitter API, they searched for tweets containing lexicon terms, which yielded tweets from 33,458 Twitter users. They then extracted each user's timeline, producing a corpus of 85.4 million tweets. From this corpus they drew a random sample of 25,000 tweets containing lexicon terms, which CrowdFlower (CF) workers labeled manually. Each tweet was assigned to one of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech.
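The sampling step described above (keep tweets that contain a lexicon term, then draw a random sample for annotation) might look roughly like the following sketch. It is an illustration only, not the authors' code: the function names, the `Label` enum and its numeric values, the substring matching rule, and the placeholder lexicon and tweets are all assumptions made for the example.

```python
import random
from enum import Enum


class Label(Enum):
    # Hypothetical encoding of the three annotation categories;
    # the numeric values are an assumption, not taken from the dataset.
    HATE_SPEECH = 0
    OFFENSIVE_NOT_HATE = 1
    NEITHER = 2


def contains_lexicon_term(text, lexicon):
    """Return True if any lexicon term occurs in the text (case-insensitive).

    Plain substring matching is an assumption; the source does not say
    how terms were matched.
    """
    lowered = text.lower()
    return any(term.lower() in lowered for term in lexicon)


def sample_for_annotation(tweets, lexicon, k, seed=0):
    """Filter tweets by lexicon membership, then sample up to k of them."""
    candidates = [t for t in tweets if contains_lexicon_term(t, lexicon)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Placeholder data standing in for the real lexicon and tweet corpus.
lexicon = {"termA", "termB"}
tweets = [
    "this mentions termA once",
    "nothing relevant here",
    "TERMB appears in capitals",
    "another tweet with terma inside",
]

sample = sample_for_annotation(tweets, lexicon, k=2)
```

In the pipeline described above, `k` would be 25,000 and each sampled tweet would then be passed to human annotators, who assign one of the three `Label` values.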

Provider:

OpenDataLab

Created:

2022-06-07

Collection summary

Dataset introduction