jigsaw_unintended_bias
ModelScope community. Updated 2025-11-07; indexed 2025-04-26.
Download link: https://modelscope.cn/datasets/google/jigsaw_unintended_bias
# Dataset Card for Jigsaw Unintended Bias in Toxicity Classification
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
- **Repository:**
- **Paper:**
- **Leaderboard:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard
- **Point of Contact:**
### Dataset Summary
The Jigsaw Unintended Bias in Toxicity Classification dataset comes from the eponymous Kaggle competition.
Please see the original [data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data)
description for more information.
### Supported Tasks and Leaderboards
The main target for this dataset is toxicity prediction. Several toxicity subtypes are also available, so the dataset
can be used for multi-attribute prediction.
See the original [leaderboard](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/leaderboard)
for reference.
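For multi-attribute prediction, one option is to collect the main target and the toxicity subtypes of a row into a single fixed-order label vector. The snippet below is a minimal sketch of that idea. The helper `label_vector` and the example row are hypothetical and only illustrate the shape of the data; the column names are the ones listed under Data Fields.

```python
# Toxicity columns named in this card: the main target followed by its subtypes.
LABEL_COLUMNS = [
    "target",
    "severe_toxicity",
    "obscene",
    "identity_attack",
    "insult",
    "threat",
]


def label_vector(row):
    """Return the toxicity scores of one row as a fixed-order list of floats."""
    return [float(row[name]) for name in LABEL_COLUMNS]


# Made-up row for illustration only; it is not taken from the dataset.
row = {
    "comment_text": "example comment",
    "target": 0.6,
    "severe_toxicity": 0.0,
    "obscene": 0.1,
    "identity_attack": 0.0,
    "insult": 0.6,
    "threat": 0.0,
}

vector = label_vector(row)
print(vector)
```

Keeping the column order in one constant means the same order can be reused when model outputs are mapped back to attribute names.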
### Languages
English
## Dataset Structure
### Data Instances
A data point consists of an id, a comment, the main target, the other toxicity subtypes as well as identity attributes.
For instance, here's the first train example.
```
{
"article_id": 2006,
"asian": NaN,
"atheist": NaN,
"bisexual": NaN,
"black": NaN,
"buddhist": NaN,
"christian": NaN,
"comment_text": "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!",
"created_date": "2015-09-29 10:50:41.987077+00",
"disagree": 0,
"female": NaN,
"funny": 0,
"heterosexual": NaN,
"hindu": NaN,
"homosexual_gay_or_lesbian": NaN,
"identity_annotator_count": 0,
"identity_attack": 0.0,
"insult": 0.0,
"intellectual_or_learning_disability": NaN,
"jewish": NaN,
"latino": NaN,
"likes": 0,
"male": NaN,
"muslim": NaN,
"obscene": 0.0,
"other_disability": NaN,
"other_gender": NaN,
"other_race_or_ethnicity": NaN,
"other_religion": NaN,
"other_sexual_orientation": NaN,
"parent_id": NaN,
"physical_disability": NaN,
"psychiatric_or_mental_illness": NaN,
"publication_id": 2,
"rating": 0,
"sad": 0,
"severe_toxicity": 0.0,
"sexual_explicit": 0.0,
"target": 0.0,
"threat": 0.0,
"toxicity_annotator_count": 4,
"transgender": NaN,
"white": NaN,
"wow": 0
}
```
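The record above uses bare `NaN` tokens, which strict JSON does not allow. As a hedged illustration, Python's standard `json` module accepts them by default, so a record in this shape can be parsed and its missing fields listed as follows. The string `raw` is an abbreviated copy of the example, and `is_missing` is a hypothetical helper.

```python
import json
import math

# Abbreviated copy of the record shown above (only a few fields kept).
raw = (
    '{"asian": NaN, "parent_id": NaN, "target": 0.0, '
    '"identity_annotator_count": 0, "toxicity_annotator_count": 4}'
)

# json.loads parses the NaN token into float("nan") by default.
record = json.loads(raw)


def is_missing(value):
    """True when a field holds NaN, i.e. no value was recorded."""
    return isinstance(value, float) and math.isnan(value)


missing_fields = sorted(name for name, value in record.items() if is_missing(value))
print(missing_fields)  # ['asian', 'parent_id']
```

In this example the identity fields are `NaN` and `identity_annotator_count` is 0, which is consistent with identity information being available only for a subset of rows.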
### Data Fields
- `id`: id of the comment
- `target`: value between 0 (non-toxic) and 1 (toxic) classifying the comment
- `comment_text`: the text of the comment
- `severe_toxicity`: value between 0 (not severely toxic) and 1 (severely toxic) classifying the comment
- `obscene`: value between 0 (not obscene) and 1 (obscene) classifying the comment
- `identity_attack`: value between 0 (no identity attack) and 1 (identity attack) classifying the comment
- `insult`: value between 0 (no insult) and 1 (insult) classifying the comment
- `threat`: value between 0 (no threat) and 1 (threat) classifying the comment
- For a subset of rows, columns indicating whether the comment mentions the following identities (they may contain NaNs):
  - `male`
  - `female`
  - `transgender`
  - `other_gender`
  - `heterosexual`
  - `homosexual_gay_or_lesbian`
  - `bisexual`
  - `other_sexual_orientation`
  - `christian`
  - `jewish`
  - `muslim`
  - `hindu`
  - `buddhist`
  - `atheist`
  - `other_religion`
  - `black`
  - `white`
  - `asian`
  - `latino`
  - `other_race_or_ethnicity`
  - `physical_disability`
  - `intellectual_or_learning_disability`
  - `psychiatric_or_mental_illness`
  - `other_disability`
- Other metadata related to the source of the comment, such as creation date, publication id, number of likes, number of annotators, etc.:
  - `created_date`
  - `publication_id`
  - `parent_id`
  - `article_id`
  - `rating`
  - `funny`
  - `wow`
  - `sad`
  - `likes`
  - `disagree`
  - `sexual_explicit`
  - `identity_annotator_count`
  - `toxicity_annotator_count`
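Because the identity columns are filled in only for a subset of rows, a common first step is to separate rows that carry identity labels from rows that do not. The sketch below assumes the identity columns hold scores between 0 and 1, like the toxicity columns, and uses a 0.5 cut-off to decide that an identity is mentioned. Both points are assumptions, not statements from this card. The helpers and the toy rows are hypothetical.

```python
import math

# Identity columns as listed in this card.
IDENTITY_COLUMNS = [
    "male", "female", "transgender", "other_gender",
    "heterosexual", "homosexual_gay_or_lesbian", "bisexual",
    "other_sexual_orientation",
    "christian", "jewish", "muslim", "hindu", "buddhist", "atheist",
    "other_religion",
    "black", "white", "asian", "latino", "other_race_or_ethnicity",
    "physical_disability", "intellectual_or_learning_disability",
    "psychiatric_or_mental_illness", "other_disability",
]


def has_identity_labels(row):
    """True when at least one identity column holds a number rather than NaN."""
    return any(not math.isnan(row.get(col, math.nan)) for col in IDENTITY_COLUMNS)


def mentions(row, column, threshold=0.5):
    """True when the identity score is present and at or above the threshold.

    The 0.5 default is an assumption made for this sketch.
    """
    value = row.get(column, math.nan)
    return (not math.isnan(value)) and value >= threshold


# Toy rows for illustration only; they are not taken from the dataset.
rows = [
    {"id": 1, "female": math.nan, "muslim": math.nan},
    {"id": 2, "female": 0.8, "muslim": 0.0},
    {"id": 3, "female": 0.2, "muslim": 1.0},
]

labelled = [r for r in rows if has_identity_labels(r)]
female_ids = [r["id"] for r in labelled if mentions(r, "female")]
print([r["id"] for r in labelled], female_ids)  # [2, 3] [2]
```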
### Data Splits
There are four splits:
- train: The train dataset as released during the competition. Contains labels and identity information for a
subset of rows.
- test: The test dataset as released during the competition. Contains neither labels nor identity information.
- test_private_expanded: The private leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be easily reconstructed using a >=0.5 threshold.
- test_public_expanded: The public leaderboard test set, including toxicity labels and subgroups. The competition target was a binarized version of the toxicity column, which can be easily reconstructed using a >=0.5 threshold.
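The binarized competition target mentioned above can be reconstructed with the stated `>= 0.5` threshold. The following is a minimal sketch. The function name `binarize` and the sample scores are hypothetical, and the name of the score column in each split should be checked against the actual files.

```python
def binarize(score, threshold=0.5):
    """Map a toxicity score in [0, 1] to a 0/1 label using a >= threshold rule."""
    return int(score >= threshold)


# Made-up scores for illustration only.
scores = [0.0, 0.49, 0.5, 0.83]
labels = [binarize(s) for s in scores]
print(labels)  # [0, 0, 1, 1]
```

Note that a score of exactly 0.5 maps to the positive class, since the threshold is inclusive.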
## Dataset Creation
### Curation Rationale
The dataset was created to help in efforts to identify and curb instances of toxicity online.
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
This dataset is released under CC0, as is the underlying comment text.
### Citation Information
No citation is available for this dataset, though you may link to the [Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) competition.
### Contributions
Thanks to [@iwontbecreative](https://github.com/iwontbecreative) for adding this dataset.
Provider: maas
Created: 2025-04-21



