
Multilingual Twitter Corpus

Source: arXiv. Updated: 2020-03-03. Indexed: 2024-06-21.
Download link:
https://github.com/xiaoleihuang/Multilingual_Fairness_LREC
Description:

This multilingual Twitter corpus for hate speech detection covers five languages: English, Italian, Polish, Portuguese, and Spanish. Four author demographic attributes (age, country, gender, and race/ethnicity) are inferred from user profiles so that the corpus can be used to evaluate the fairness of document classification models. The corpus was built by integrating and annotating hate speech data from previously published corpora and by inferring user attributes with tools such as Face++. It is suited to studying the relationship between language variation and demographic groups and to developing unbiased document classifiers.
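As an illustration of what evaluating classifier fairness across demographic groups can look like, the sketch below compares each group's false positive rate with the overall rate and sums the absolute differences. This is one common style of group fairness measure, not necessarily the metric used by the corpus authors. The function names, the binary label encoding (1 = hate speech, 0 = not), and the toy data are all assumptions made for this example; none of them come from the corpus or its repository.

```python
from collections import defaultdict


def false_positive_rate(y_true, y_pred):
    """Return FP / (FP + TN), or None when there are no negative examples."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        return None
    return sum(1 for p in negatives if p == 1) / len(negatives)


def fpr_by_group(y_true, y_pred, groups):
    """Compute the false positive rate separately for each demographic group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: false_positive_rate(ts, ps) for g, (ts, ps) in buckets.items()}


def fpr_gap(y_true, y_pred, groups):
    """Sum of |group FPR - overall FPR| over groups; 0 means equal rates."""
    overall = false_positive_rate(y_true, y_pred)
    if overall is None:
        return None
    per_group = fpr_by_group(y_true, y_pred, groups)
    return sum(abs(r - overall) for r in per_group.values() if r is not None)


# Toy data (hypothetical): labels, predictions, and an inferred group per author.
y_true = [0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(fpr_by_group(y_true, y_pred, groups))
print(fpr_gap(y_true, y_pred, groups))
```

A larger gap suggests the classifier wrongly flags non-hateful posts more often for some groups than for others. The same pattern applies to false negative rates or any other per-group score.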
Provider:
University of Colorado Boulder
Created:
2020-02-25