
Dynamically Generated Hate Speech Dataset

Source: arXiv. Updated: 2021-06-03. Indexed: 2024-06-21.
Download link:
https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset
Description:
The Dynamically Generated Hate Speech Dataset was created by researchers at The Alan Turing Institute to improve the performance and robustness of online hate detection models through a dynamic data generation process. It contains approximately 40,000 entries, generated and labeled by trained annotators over four rounds of dynamic data creation. Each hateful entry carries fine-grained labels for the type and target of hate. Hateful entries make up 54% of the dataset, a much higher share than in comparable datasets. Models trained with this dynamic approach perform significantly better on the test sets and are harder for annotators to trick. They also perform better on the HATECHECK functional test suite, which indicates better generalization.
Provider:
The Alan Turing Institute
Created:
2021-01-01
Collected Summary
Dataset Introduction
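Background and Challenges
Background Overview
This is a synthetically generated dataset for hate speech detection, created by Vidgen et al. in 2021. It contains text labeled as hateful or not hateful, with hateful entries further broken down by type of hate and target group. The dataset is divided into training, development, and test splits, making it suitable for training and evaluating hate speech classifiers and for studying diverse forms of hate speech.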
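As a rough illustration of how the labels and splits described above might be handled, the sketch below groups rows by split and computes the share of hateful entries. The column names (text, label, type, target, split), the label values, and the toy rows are assumptions made for this example; they are not taken from the dataset's actual schema, which should be checked against the files in the repository.

```python
import csv
import io
from collections import Counter, defaultdict

# Toy rows standing in for the real file. Column names and values are
# assumptions for illustration only, not the dataset's confirmed schema.
SAMPLE_CSV = """text,label,type,target,split
example a,hate,derogation,group_x,train
example b,nothate,none,none,train
example c,hate,threatening,group_y,dev
example d,nothate,none,none,test
example e,hate,derogation,group_x,test
"""


def load_rows(handle):
    """Read rows from a CSV file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


def split_by(rows, column="split"):
    """Group rows by the value found in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def hate_share(rows, label_column="label", hate_value="hate"):
    """Return the fraction of rows whose label equals hate_value."""
    if not rows:
        return 0.0
    return sum(r[label_column] == hate_value for r in rows) / len(rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
splits = split_by(rows)

# Count target groups among hateful rows only.
target_counts = Counter(r["target"] for r in rows if r["label"] == "hate")
```

To try this on the real data, pass an opened file to load_rows in place of the in-memory sample and adjust the column names to match the downloaded CSV.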