Multilingual Twitter Corpus

Name: Multilingual Twitter Corpus
Creator: University of Colorado Boulder
Published: 2020-03-03 21:34:59
License: No description available

arXiv; updated 2020-03-03; indexed 2024-06-21

Download link:

https://github.com/xiaoleihuang/Multilingual_Fairness_LREC

Description:

This multilingual Twitter corpus, released with the study, targets hate speech detection and covers five languages: English, Italian, Polish, Portuguese, and Spanish. Four author demographic attributes (age, country, gender, and race/ethnicity) are inferred from user profiles so that the fairness of document classification models can be evaluated. The corpus was built by integrating and annotating hate speech data from previously published corpora and by inferring user attributes with tools such as Face++. It is suited to studying the relationship between language variation and demographic groups, and to developing unbiased document classifiers.
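As a rough illustration of the fairness evaluation described above, the sketch below compares false positive rates across demographic groups for a binary hate speech classifier. The metric (the gap between the highest and lowest group-level false positive rate), the function names, and the toy data are all assumptions made for illustration; they are not taken from the corpus or its accompanying paper.

```python
from collections import defaultdict


def false_positive_rates(labels, preds, groups):
    """Return {group: FPR}, where FPR = FP / (FP + TN) over non-hate documents.

    labels, preds: sequences of 0 (not hate speech) or 1 (hate speech).
    groups: sequence of demographic group identifiers, one per document.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        if y == 0:  # only true negatives and false positives count toward FPR
            negatives[g] += 1
            if p == 1:
                false_pos[g] += 1
    # Groups with no negative examples are omitted to avoid division by zero.
    return {g: false_pos[g] / negatives[g] for g in negatives}


def fpr_gap(labels, preds, groups):
    """Largest difference in false positive rate between any two groups."""
    rates = false_positive_rates(labels, preds, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: gold labels, model predictions, inferred author group.
labels = [0, 0, 0, 0, 1, 1, 0, 0]
preds = [0, 1, 0, 0, 1, 0, 1, 1]
groups = ["f", "f", "f", "m", "m", "f", "m", "m"]

rates = false_positive_rates(labels, preds, groups)
gap = fpr_gap(labels, preds, groups)
print(rates, gap)
```

A gap near zero would suggest that the classifier wrongly flags non-hateful tweets at similar rates across groups; a larger gap would point to a disparity worth investigating.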

Provider:

University of Colorado Boulder

Created:

2020-02-25
