jigsaw_toxicity_pred
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/jigsaw_toxicity_pred
Resource description:
# Dataset Card for jigsaw_toxicity_pred
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Jigsaw Comment Toxicity Classification Kaggle Competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. This dataset consists of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior.
### Supported Tasks and Leaderboards
The dataset supports multi-label classification.
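Multi-label means the toxicity labels are not mutually exclusive, so a classifier usually scores each label on its own instead of picking a single class. The following is a minimal sketch, assuming a hypothetical model that emits one raw score (logit) per label and an illustrative decision threshold of 0.5. Neither the scores nor the threshold come from the dataset.

```python
import math
from typing import Dict, Sequence

# Label names follow the dataset's label columns.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits: Sequence[float], threshold: float = 0.5) -> Dict[str, int]:
    """Threshold each label independently; any number of labels may be 1."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(logits)}")
    return {name: int(sigmoid(z) >= threshold) for name, z in zip(LABELS, logits)}


# Hypothetical scores for one comment (illustrative values only).
example_logits = [2.1, -3.0, 1.2, -4.5, 0.3, -2.2]
prediction = predict_labels(example_logits)
print(prediction)
```

Because each label is thresholded separately, a single comment can come out as `toxic`, `obscene`, and `insult` at once, which a softmax over six classes could not express.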
### Languages
The comments are in English.
## Dataset Structure
### Data Instances
A data point consists of a comment and the binary labels associated with it:

    {'id': '02141412314',
     'comment_text': 'Sample comment text',
     'toxic': 0,
     'severe_toxic': 0,
     'obscene': 0,
     'threat': 0,
     'insult': 0,
     'identity_hate': 1}
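For training, a record like the one above is commonly turned into a fixed-order multi-hot vector. A small sketch, assuming the label order listed in `LABELS` (the order is a convention chosen here, not something the dataset prescribes) and hypothetical helper names:

```python
from typing import Dict, List

# Fixed label order used to build the vector (an assumption of this sketch).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(record: Dict) -> List[int]:
    """Return the labels of one record as a 0/1 vector in LABELS order."""
    return [int(record[name]) for name in LABELS]


def active_labels(record: Dict) -> List[str]:
    """Return the names of the labels set to 1 for this record."""
    return [name for name in LABELS if int(record[name]) == 1]


# The sample instance from this card.
record = {
    "id": "02141412314",
    "comment_text": "Sample comment text",
    "toxic": 0,
    "severe_toxic": 0,
    "obscene": 0,
    "threat": 0,
    "insult": 0,
    "identity_hate": 1,
}

vector = to_multi_hot(record)
print(vector, active_labels(record))
```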
### Data Fields
- `id`: id of the comment
- `comment_text`: the text of the comment
- `toxic`: binary label, 0 (non-toxic) or 1 (toxic)
- `severe_toxic`: binary label, 0 (not severely toxic) or 1 (severely toxic)
- `obscene`: binary label, 0 (not obscene) or 1 (obscene)
- `threat`: binary label, 0 (not a threat) or 1 (threat)
- `insult`: binary label, 0 (not an insult) or 1 (insult)
- `identity_hate`: binary label, 0 (no identity hate) or 1 (identity hate)
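The fields above can be read and checked with the standard library. This is a sketch that assumes the data is delivered as a CSV file whose header matches the field names; the two sample rows are invented placeholders, and `parse_rows` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List, TextIO

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def parse_rows(handle: TextIO) -> List[Dict]:
    """Read records, cast label columns to int, and reject non-binary values."""
    reader = csv.DictReader(handle)
    rows = []
    for raw in reader:
        row = {"id": raw["id"], "comment_text": raw["comment_text"]}
        for name in LABELS:
            value = int(raw[name])
            if value not in (0, 1):
                raise ValueError(
                    f"line {reader.line_num}: {name} must be 0 or 1, got {value}"
                )
            row[name] = value
        rows.append(row)
    return rows


# Invented placeholder rows, not taken from the dataset.
SAMPLE_CSV = (
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    'a1,"Thanks, that fixed the citation.",0,0,0,0,0,0\n'
    'b2,"Placeholder for a flagged comment",1,0,1,0,1,0\n'
)

rows = parse_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[1]["toxic"])
```

Using `csv.DictReader` rather than splitting on commas matters here because comment text can itself contain commas and line breaks inside quoted fields.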
### Data Splits
The data is split into a training set and a test set.
## Dataset Creation
### Curation Rationale
The dataset was created to help in efforts to identify and curb instances of toxicity online.
### Source Data
#### Initial Data Collection and Normalization
The dataset is a collection of Wikipedia comments.
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
Comments that contain words associated with swearing, insults, or profanity are likely to be classified as toxic regardless of the author's tone or intent (for example, humorous or self-deprecating use). This could introduce bias against already vulnerable minority groups.
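The concern can be made concrete with a deliberately naive sketch. The word list and the three comments below are invented for illustration and are not drawn from the dataset; the point is only that a decision driven by vocabulary cannot see intent.

```python
import re

# Hypothetical lexicon; a real system would not be this simple.
FLAGGED_WORDS = {"idiot", "stupid"}


def keyword_flag(comment: str) -> int:
    """Return 1 if any flagged word appears in the comment, else 0."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return int(any(token in FLAGGED_WORDS for token in tokens))


# Invented examples: an attack, a self-deprecating remark, a neutral reply.
attack = "You are an idiot and your edits are stupid."
self_deprecating = "I was an idiot to miss that typo, thanks for fixing it!"
neutral = "Thanks for fixing the typo."

flags = [keyword_flag(c) for c in (attack, self_deprecating, neutral)]
print(flags)
```

The attack and the self-deprecating remark receive the same flag, which is the failure mode described above. A model trained on labels that correlate strongly with such words may inherit similar behavior.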
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The "Toxic Comment Classification" dataset is released under [CC0], with the underlying comment text being governed by Wikipedia's [CC-SA-3.0].
### Citation Information
No citation information.
### Contributions
Thanks to [@Tigrex161](https://github.com/Tigrex161) for adding this dataset.
Provider: maas
Created: 2025-04-21
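Dataset introduction

Background overview
jigsaw_toxicity_pred is a dataset for toxic comment classification. It contains a large number of Wikipedia comments, each annotated with several toxicity categories (such as toxic and obscene), and supports multi-label classification. The comments are in English. The dataset is intended to help identify and reduce toxic content on online platforms.
The summary above was collected and generated by 遇见数据集.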



