The 'Call me sexist but' Dataset (CMSB)

Name: The 'Call me sexist but' Dataset (CMSB)
Creator: OpenDataLab
Published: 2026-05-31 11:30:31
License: No description available

OpenDataLab | Updated 2026-05-31 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/The_Call_me_sexist_but_Dataset

Resource description:

Tweets and items from psychological scales, for sexism detection with counterfactual examples. The dataset contains three types of short-text content: (1) social media posts (tweets), (2) psychological survey items, and (3) synthetic adversarial modifications of the first two.

The tweet data can be further divided by source into three separate datasets: 1.1 the hostile sexism dataset, 1.2 the benevolent sexism dataset, and 1.3 the callme sexism dataset. Datasets 1.1 and 1.2 are pre-existing datasets obtained from Waseem, Z., & Hovy, D. (2016) and Jha, A., & Mamidi, R. (2017), which we re-annotated (see our paper and data statement for details). We included them because they contain a variety of sexist expressions in a real conversational (social media) setting. Their manifestations range from overt antagonism toward marginalized genders through negative stereotypes (1.1) to subtly framing them as less capable and fragile through positive stereotypes (1.2).

The callme sexism dataset (1.3) was collected by us based on the presence of the phrase "call me sexist but" in tweets. The rationale behind this query is that some Twitter users voice potentially sexist comments and signal as much with this phrase, which arguably serves as a disclaimer for sexist opinions.

The survey items (2) come from attitudinal surveys designed to measure participants' sexist attitudes and gender bias. Our selection procedure is detailed in our paper.

Finally, the adversarial examples were generated by crowdworkers on Amazon Mechanical Turk, who made minimal changes to tweets and scale items so that sexist examples became non-sexist ones. We hope that these examples will help us control for typical confounds in non-sexist data (e.g., topic, civility) and lead to datasets with fewer biases, enabling us to train more robust machine learning models. For ethical reasons, we only asked for sexist examples to be turned into non-sexist ones, and not the reverse.

The dataset was annotated to capture cases where a text is sexist because of its content (what the speaker believes) or its phrasing (the speaker's choice of words). The rationale for this codebook is explained in our paper.
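The relationship between original texts and their adversarial modifications can be sketched as follows. This is only an illustration: the `Example` class, its field names (`source`, `sexist`, `modified_from`), and the placeholder records are assumptions made for this sketch, not the schema of the released files, which should be checked against the dataset's own documentation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    """One short text. All field names are hypothetical, not the real schema."""
    id: str
    text: str
    source: str                          # e.g. "hostile", "benevolent", "callme", "scales"
    sexist: bool                         # annotated label
    modified_from: Optional[str] = None  # id of the original, if this is a modification


def counterfactual_pairs(examples: List[Example]) -> List[Tuple[Example, Example]]:
    """Pair each non-sexist modification with the sexist original it was derived from."""
    by_id = {e.id: e for e in examples}
    pairs = []
    for e in examples:
        if e.modified_from is None:
            continue  # not an adversarial modification
        original = by_id.get(e.modified_from)
        # Modifications only go in one direction: sexist -> non-sexist.
        if original is not None and original.sexist and not e.sexist:
            pairs.append((original, e))
    return pairs


# Placeholder records, invented for illustration only.
records = [
    Example("t1", "<original tweet text>", "callme", True),
    Example("t1-mod", "<minimally edited tweet text>", "callme", False, modified_from="t1"),
    Example("s1", "<scale item text>", "scales", True),
    Example("t2", "<unrelated tweet text>", "hostile", False),
]

pairs = counterfactual_pairs(records)
for original, modified in pairs:
    print(original.id, "->", modified.id)
```

Keeping each pair together (for example, placing both members in the same train or test split) is one way such minimally different examples could be used to control for topic and civility confounds.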

Provider:

OpenDataLab

Created:

2022-09-01

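Collected summary

Dataset introduction

Background and challenges

Background overview

This is a text classification dataset for sexism detection. It contains three types of data: social media posts, psychological survey items, and synthetic adversarial modifications of both. The dataset was annotated by trained annotators and is intended to help train less biased machine learning models.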

The content above was collected and summarized by 遇见数据集.