jigsaw_unintended_bias

Name: jigsaw_unintended_bias
Creator: maas
Published: 2025-11-07 16:30:58
License: No description available

ModelScope community. Updated 2025-11-07; indexed 2025-04-26.

Download link:

https://modelscope.cn/datasets/google/jigsaw_unintended_bias

Resource description:

# Dataset Card for Jigsaw Unintended Bias in Toxicity Classification

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
- **Repository:**
- **Paper:**
- **Leaderboard:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard
- **Point of Contact:**

### Dataset Summary

The Jigsaw Unintended Bias in Toxicity Classification dataset comes from the eponymous Kaggle competition.

Please see the original [data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) description for more information.

### Supported Tasks and Leaderboards

The main target for this dataset is toxicity prediction. Several toxicity subtypes are also available, so the dataset can be used for multi-attribute prediction.

See the original [leaderboard](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard) for reference.

### Languages

English

## Dataset Structure

### Data Instances

A data point consists of an id, a comment, the main target, the other toxicity subtypes, and identity attributes.

For instance, here is the first train example.

```
{
    "article_id": 2006,
    "asian": NaN,
    "atheist": NaN,
    "bisexual": NaN,
    "black": NaN,
    "buddhist": NaN,
    "christian": NaN,
    "comment_text": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!",
    "created_date": "2015-09-29 10:50:41.987077+00",
    "disagree": 0,
    "female": NaN,
    "funny": 0,
    "heterosexual": NaN,
    "hindu": NaN,
    "homosexual_gay_or_lesbian": NaN,
    "identity_annotator_count": 0,
    "identity_attack": 0.0,
    "insult": 0.0,
    "intellectual_or_learning_disability": NaN,
    "jewish": NaN,
    "latino": NaN,
    "likes": 0,
    "male": NaN,
    "muslim": NaN,
    "obscene": 0.0,
    "other_disability": NaN,
    "other_gender": NaN,
    "other_race_or_ethnicity": NaN,
    "other_religion": NaN,
    "other_sexual_orientation": NaN,
    "parent_id": NaN,
    "physical_disability": NaN,
    "psychiatric_or_mental_illness": NaN,
    "publication_id": 2,
    "rating": 0,
    "sad": 0,
    "severe_toxicity": 0.0,
    "sexual_explicit": 0.0,
    "target": 0.0,
    "threat": 0.0,
    "toxicity_annotator_count": 4,
    "transgender": NaN,
    "white": NaN,
    "wow": 0
}
```

### Data Fields

- `id`: id of the comment
- `target`: value between 0 (non-toxic) and 1 (toxic) classifying the comment
- `comment_text`: the text of the comment
- `severe_toxicity`: value between 0 (non-severe_toxic) and 1 (severe_toxic) classifying the comment
- `obscene`: value between 0 (non-obscene) and 1 (obscene) classifying the comment
- `identity_attack`: value between 0 (non-identity_hate) and 1 (identity_hate) classifying the comment
- `insult`: value between 0 (non-insult) and 1 (insult) classifying the comment
- `threat`: value between 0 (non-threat) and 1 (threat) classifying the comment
- For a subset of rows, columns indicating whether the comment mentions the following identities (they may contain NaNs):
  - `male`
  - `female`
  - `transgender`
  - `other_gender`
  - `heterosexual`
  - `homosexual_gay_or_lesbian`
  - `bisexual`
  - `other_sexual_orientation`
  - `christian`
  - `jewish`
  - `muslim`
  - `hindu`
  - `buddhist`
  - `atheist`
  - `other_religion`
  - `black`
  - `white`
  - `asian`
  - `latino`
  - `other_race_or_ethnicity`
  - `physical_disability`
  - `intellectual_or_learning_disability`
  - `psychiatric_or_mental_illness`
  - `other_disability`
- Other metadata related to the source of the comment, such as creation date, publication id, number of likes, number of annotators, etc.:
  - `created_date`
  - `publication_id`
  - `parent_id`
  - `article_id`
  - `rating`
  - `funny`
  - `wow`
  - `sad`
  - `likes`
  - `disagree`
  - `sexual_explicit`
  - `identity_annotator_count`
  - `toxicity_annotator_count`

### Data Splits

There are four splits:

- train: The train dataset as released during the competition. Contains labels and identity information for a subset of rows.
- test: The test dataset as released during the competition. Contains neither labels nor identity information.
- test_private_expanded: The private leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be reconstructed using a >=0.5 threshold.
- test_public_expanded: The public leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be reconstructed using a >=0.5 threshold.

## Dataset Creation

### Curation Rationale

The dataset was created to help in efforts to identify and curb instances of toxicity online.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This dataset is released under CC0, as is the underlying comment text.

### Citation Information

No citation is available for this dataset, though you may link to the [Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) competition.

### Contributions

Thanks to [@iwontbecreative](https://github.com/iwontbecreative) for adding this dataset.
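
### Example: Reconstructing the Binary Target

The Data Splits section notes that the competition target can be rebuilt from the toxicity score with a >=0.5 threshold, and the Data Fields section notes that identity columns may contain NaNs. The following is a minimal sketch of both points using only the Python standard library. The helper names (`binarize`, `mentioned_identities`), the sample rows, and the reuse of the 0.5 threshold for identity columns are assumptions made for illustration; they are not part of the dataset or the competition code.

```python
import math

# A few identity columns from the Data Fields list (subset chosen for brevity).
IDENTITY_COLUMNS = ["male", "female", "black", "white", "muslim"]


def binarize(score, threshold=0.5):
    """Return 1 if the score is at or above the threshold, else 0."""
    return int(score >= threshold)


def mentioned_identities(row, threshold=0.5):
    """List identity columns that are present (not NaN) and >= threshold.

    Applying the 0.5 threshold to identity columns is an assumption of
    this sketch; the card only states it for the toxicity target.
    """
    found = []
    for column in IDENTITY_COLUMNS:
        value = row.get(column, math.nan)  # treat absent columns as missing
        if not math.isnan(value) and value >= threshold:
            found.append(column)
    return found


# Hypothetical rows shaped like the example record; the values are made up.
rows = [
    {"target": 0.0, "male": math.nan, "female": math.nan},
    {"target": 0.5, "male": 0.0, "female": 1.0},
    {"target": 0.8, "male": 0.6, "female": math.nan},
]

labels = [binarize(row["target"]) for row in rows]
identities = [mentioned_identities(row) for row in rows]

print(labels)      # [0, 1, 1]
print(identities)  # [[], ['female'], ['male']]
```

Rows without identity annotations yield an empty list rather than being treated as "no identity mentioned", which keeps unannotated rows distinguishable only if the NaN check is preserved in downstream code.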

Provider:

maas

Created:

2025-04-21
