
jigsaw_unintended_bias

ModelScope community; updated 2025-11-07; indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/jigsaw_unintended_bias
Description:
# Dataset Card for Jigsaw Unintended Bias in Toxicity Classification

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
- **Repository:**
- **Paper:**
- **Leaderboard:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard
- **Point of Contact:**

### Dataset Summary

The Jigsaw Unintended Bias in Toxicity Classification dataset comes from the Kaggle competition of the same name. See the original [data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) description for more information.

### Supported Tasks and Leaderboards

The main target for this dataset is toxicity prediction. Several toxicity subtypes are also labeled, so the dataset can be used for multi-attribute prediction.

See the original [leaderboard](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard) for reference.

### Languages

English

## Dataset Structure

### Data Instances

A data point consists of an id, a comment, the main target, the other toxicity subtypes, and identity attributes. For instance, here is the first train example.

```
{
  "article_id": 2006,
  "asian": NaN,
  "atheist": NaN,
  "bisexual": NaN,
  "black": NaN,
  "buddhist": NaN,
  "christian": NaN,
  "comment_text": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!",
  "created_date": "2015-09-29 10:50:41.987077+00",
  "disagree": 0,
  "female": NaN,
  "funny": 0,
  "heterosexual": NaN,
  "hindu": NaN,
  "homosexual_gay_or_lesbian": NaN,
  "identity_annotator_count": 0,
  "identity_attack": 0.0,
  "insult": 0.0,
  "intellectual_or_learning_disability": NaN,
  "jewish": NaN,
  "latino": NaN,
  "likes": 0,
  "male": NaN,
  "muslim": NaN,
  "obscene": 0.0,
  "other_disability": NaN,
  "other_gender": NaN,
  "other_race_or_ethnicity": NaN,
  "other_religion": NaN,
  "other_sexual_orientation": NaN,
  "parent_id": NaN,
  "physical_disability": NaN,
  "psychiatric_or_mental_illness": NaN,
  "publication_id": 2,
  "rating": 0,
  "sad": 0,
  "severe_toxicity": 0.0,
  "sexual_explicit": 0.0,
  "target": 0.0,
  "threat": 0.0,
  "toxicity_annotator_count": 4,
  "transgender": NaN,
  "white": NaN,
  "wow": 0
}
```

### Data Fields

- `id`: id of the comment
- `target`: value between 0 (non-toxic) and 1 (toxic) classifying the comment
- `comment_text`: the text of the comment
- `severe_toxicity`: value between 0 (non-severe_toxic) and 1 (severe_toxic) classifying the comment
- `obscene`: value between 0 (non-obscene) and 1 (obscene) classifying the comment
- `identity_attack`: value between 0 (non-identity_hate) and 1 (identity_hate) classifying the comment
- `insult`: value between 0 (non-insult) and 1 (insult) classifying the comment
- `threat`: value between 0 (non-threat) and 1 (threat) classifying the comment
- For a subset of rows, columns indicating whether the comment mentions the following identities (they may contain NaNs):
  - `male`
  - `female`
  - `transgender`
  - `other_gender`
  - `heterosexual`
  - `homosexual_gay_or_lesbian`
  - `bisexual`
  - `other_sexual_orientation`
  - `christian`
  - `jewish`
  - `muslim`
  - `hindu`
  - `buddhist`
  - `atheist`
  - `other_religion`
  - `black`
  - `white`
  - `asian`
  - `latino`
  - `other_race_or_ethnicity`
  - `physical_disability`
  - `intellectual_or_learning_disability`
  - `psychiatric_or_mental_illness`
  - `other_disability`
- Other metadata related to the source of the comment, such as creation date, publication id, number of likes, and number of annotators:
  - `created_date`
  - `publication_id`
  - `parent_id`
  - `article_id`
  - `rating`
  - `funny`
  - `wow`
  - `sad`
  - `likes`
  - `disagree`
  - `sexual_explicit`
  - `identity_annotator_count`
  - `toxicity_annotator_count`

### Data Splits

There are four splits:

- train: The train dataset as released during the competition. Contains labels and identity information for a subset of rows.
- test: The test dataset as released during the competition. Contains neither labels nor identity information.
- test_private_expanded: The private leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be reconstructed by applying a >=0.5 threshold.
- test_public_expanded: The public leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be reconstructed by applying a >=0.5 threshold.

## Dataset Creation

### Curation Rationale

The dataset was created to support efforts to identify and curb toxicity online.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This dataset is released under CC0, as is the underlying comment text.

### Citation Information

No citation is available for this dataset, though you may link to the [Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) competition.

### Contributions

Thanks to [@iwontbecreative](https://github.com/iwontbecreative) for adding this dataset.
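
### Example: Binarizing the Target

The Data Splits section notes that the competition target is a binarized version of the toxicity column, recoverable with a >=0.5 threshold. The following is a minimal sketch of that step on a single row represented as a plain dictionary. The helper names `binarize` and `binarize_row` are hypothetical, the sample row values are made up for illustration, and returning `None` for missing (NaN) scores is an assumption rather than something the card specifies.

```python
import math


def binarize(score, threshold=0.5):
    """Return True if score >= threshold, False otherwise.

    Missing scores (None or NaN) return None; this handling is an
    assumption, since the card does not say how to treat them.
    """
    if score is None or math.isnan(score):
        return None
    return score >= threshold


def binarize_row(row, columns=("target",), threshold=0.5):
    """Return a copy of row with the named columns binarized.

    Columns not listed in `columns` are copied unchanged.
    """
    out = dict(row)
    for name in columns:
        if name in row:
            out[name] = binarize(row[name], threshold)
    return out


# Hypothetical row: the values below are illustrative, not taken from the data.
example = {
    "comment_text": "placeholder comment",
    "target": 0.0,
    "insult": 0.6,
    "male": float("nan"),
}

# Only `target` is binarized by default; passing extra columns is optional.
labels = binarize_row(example, columns=("target", "insult"))
print(labels["target"], labels["insult"])
```

Applying the same threshold to a subtype column such as `insult` is shown only as an option; the card states the >=0.5 rule for the main toxicity column alone.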

Provider:
maas
Created:
2025-04-21