RUSHOLD (Roman Urdu Hate Speech and Offensive Language Dataset)
Source: OpenDataLab. Updated 2026-05-24; indexed 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/RUSHOLD
Resource description:
HSOL is a dataset for hate speech detection. The authors started with a hate speech lexicon compiled by Hatebase.org, which contains words and phrases identified as hate speech by internet users. Using the Twitter API, they searched for tweets containing terms from this lexicon, yielding a sample of tweets from 33,458 Twitter users. They extracted the timeline of each of these users, producing a corpus of 85.4 million tweets. From this corpus they then randomly sampled 25,000 tweets containing lexicon terms, which were manually annotated by CrowdFlower (CF) workers. Annotators were instructed to assign each tweet to one of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech.
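The lexicon filtering and random sampling steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the label names, the substring matching rule, and the fixed seed are all assumptions made for the example.

```python
import random

# Hypothetical names for the three annotation categories described above.
LABELS = ("hate_speech", "offensive_not_hate", "neither")


def contains_lexicon_term(text, lexicon):
    """Return True if any lexicon term occurs in the text.

    Assumption: case-insensitive substring matching; the original
    matching rule is not specified in the description.
    """
    lowered = text.lower()
    return any(term.lower() in lowered for term in lexicon)


def sample_for_annotation(tweets, lexicon, k, seed=0):
    """Keep tweets containing a lexicon term, then draw up to k at random."""
    matches = [t for t in tweets if contains_lexicon_term(t, lexicon)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(matches, min(k, len(matches)))


if __name__ == "__main__":
    # Placeholder data, not drawn from the dataset.
    corpus = ["A BadWord here", "nothing to see", "another badword", "phrase two inside"]
    terms = {"badword", "phrase two"}
    print(sample_for_annotation(corpus, terms, k=2))
```

Each sampled tweet would then be passed to human annotators, who assign one of the three labels.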
Provider:
OpenDataLab
Created:
2022-06-07
Collected summary
Dataset introduction

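Background and challenges
Background overview
RUSHOLD is a Roman Urdu dataset for hate speech detection. Its tweets were collected from Twitter using the Hatebase.org lexicon and manually annotated into three classes: hate speech, offensive language, or non-offensive language. The dataset was released in 2020 by Lahore University of Management Sciences.
The above content was collected and summarized by 遇见数据集.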



