Wikipedia Talk Labels Toxicity

Indexed from arXiv on 2025-09-30
Download link:
https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973
Description:
This dataset contains over 160,000 comments from English Wikipedia, annotated with toxicity scores and with demographic information about the annotators. Each comment was labeled by approximately 10 annotators using explicit toxicity categories. The corpus has been widely used in recent research to develop deep learning methods for detecting toxic language and to study bias in annotation. It also records annotator gender, which is critical for studies analyzing gender representation in annotations. The associated tasks are toxicity classification and gender classification.
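Because each comment carries roughly 10 individual judgments plus annotator demographics, a typical first step is to aggregate the per-annotator labels into one label per comment and to compare labeling rates across annotator groups. The following is a minimal sketch of that idea on toy in-memory rows. The tuple layout, the worker IDs, the gender values, and the 0.5 threshold are all assumptions made for illustration and are not taken from the dataset files.

```python
from collections import defaultdict
from statistics import mean

# Toy stand-in rows (hypothetical values, not real dataset records):
# (comment_id, worker_id, toxic_label) with toxic_label in {0, 1}
annotations = [
    (1, "w1", 1), (1, "w2", 1), (1, "w3", 0),
    (2, "w1", 0), (2, "w2", 0), (2, "w3", 0),
]

# Hypothetical annotator demographics: worker_id -> gender
worker_gender = {"w1": "female", "w2": "male", "w3": "female"}


def toxic_fraction_per_comment(rows):
    """Return {comment_id: fraction of annotators who marked it toxic}."""
    votes = defaultdict(list)
    for comment_id, _worker_id, label in rows:
        votes[comment_id].append(label)
    return {cid: mean(labels) for cid, labels in votes.items()}


def majority_label(rows, threshold=0.5):
    """Return {comment_id: True if more than `threshold` of votes are toxic}."""
    fractions = toxic_fraction_per_comment(rows)
    return {cid: frac > threshold for cid, frac in fractions.items()}


def toxic_rate_by_gender(rows, genders):
    """Return {gender: share of that group's judgments that were 'toxic'}."""
    by_gender = defaultdict(list)
    for _comment_id, worker_id, label in rows:
        by_gender[genders.get(worker_id, "unknown")].append(label)
    return {gender: mean(labels) for gender, labels in by_gender.items()}


if __name__ == "__main__":
    print(majority_label(annotations))
    print(toxic_rate_by_gender(annotations, worker_gender))
```

With the real data, the same functions would be fed rows read from the downloaded files. The column names there should be checked against the files themselves rather than assumed from this sketch.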
Provider:
Wikipedia Detox project