
Siyke/jigsaw_toxicity_pred

Hugging Face | Updated 2025-12-06 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Siyke/jigsaw_toxicity_pred
Resource summary:
---
annotations_creators:
- crowdsourced
language_creators:
- other
language:
- en
license:
- cc0-1.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-label-classification
pretty_name: JigsawToxicityPred
dataset_info:
  features:
  - name: comment_text
    dtype: string
  - name: toxic
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  - name: severe_toxic
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  - name: obscene
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  - name: threat
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  - name: insult
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  - name: identity_hate
    dtype:
      class_label:
        names:
          '0': 'false'
          '1': 'true'
  splits:
  - name: train
    num_bytes: 71282358
    num_examples: 159571
  - name: test
    num_bytes: 28241991
    num_examples: 63978
  download_size: 0
  dataset_size: 99524349
train-eval-index:
- config: default
  task: text-classification
  task_id: binary_classification
  splits:
    train_split: train
    eval_split: test
  col_mapping:
    comment_text: text
    toxic: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
---

# Dataset Card for JigsawToxicityPred

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Jigsaw Comment Toxicity Classification Kaggle Competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. This dataset consists of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior.

### Supported Tasks and Leaderboards

The dataset supports multi-label classification.

### Languages

The comments are in English.

## Dataset Structure

### Data Instances

A data point consists of a comment followed by the labels that can be associated with it.

    {'id': '02141412314',
     'comment_text': 'Sample comment text',
     'toxic': 0,
     'severe_toxic': 0,
     'obscene': 0,
     'threat': 0,
     'insult': 0,
     'identity_hate': 1}

### Data Fields

- `id`: id of the comment
- `comment_text`: the text of the comment
- `toxic`: 0 (non-toxic) or 1 (toxic)
- `severe_toxic`: 0 (not severely toxic) or 1 (severely toxic)
- `obscene`: 0 (non-obscene) or 1 (obscene)
- `threat`: 0 (not a threat) or 1 (threat)
- `insult`: 0 (not an insult) or 1 (insult)
- `identity_hate`: 0 (no identity hate) or 1 (identity hate)

### Data Splits

The data is split into a training set (159,571 examples) and a test set (63,978 examples).

## Dataset Creation

### Curation Rationale

The dataset was created to help in efforts to identify and curb instances of toxicity online.

### Source Data

#### Initial Data Collection and Normalization

The dataset is a collection of Wikipedia comments.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

If a comment contains words associated with swearing, insults, or profanity, it is likely to be classified as toxic regardless of the author's tone or intent (for example, humorous or self-deprecating use). This could introduce bias against already vulnerable minority groups.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The "Toxic Comment Classification" dataset is released under [CC0], with the underlying comment text being governed by Wikipedia's [CC-SA-3.0].

### Citation Information

No citation information.

### Contributions

Thanks to [@Tigrex161](https://github.com/Tigrex161) for adding this dataset.
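
### Example: Encoding the Labels

The record layout under Data Fields and the macro/micro F1 metrics named in the card metadata can be illustrated with a small pure-Python sketch. It is not part of the dataset or of any library: the names `LABELS`, `to_multi_hot`, `f1`, and `multilabel_f1` are hypothetical, averaging F1 across the six label columns is an assumption (the card's evaluation config targets only `toxic`), and a label with no positives and no predictions is scored 0.0 here by convention.

```python
from typing import Dict, List, Sequence

# Label columns in the order listed under "Data Fields".
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(record: Dict[str, object]) -> List[int]:
    """Turn one record into a 0/1 vector with one slot per label column."""
    return [int(record[name]) for name in LABELS]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Macro F1 (mean of per-label F1) and micro F1 (F1 of pooled counts)."""
    counts = []
    for j in range(len(LABELS)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / len(counts)
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return {"macro": macro, "micro": micro}


# The sample record shown under "Data Instances".
sample = {
    "id": "02141412314",
    "comment_text": "Sample comment text",
    "toxic": 0,
    "severe_toxic": 0,
    "obscene": 0,
    "threat": 0,
    "insult": 0,
    "identity_hate": 1,
}
print(to_multi_hot(sample))  # [0, 0, 0, 0, 0, 1]
```

Because positive labels in toxicity data tend to be rare, macro and micro averages can diverge sharply; under the zero convention assumed above, any label column with no positives pulls the macro average down.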
Provider:
Siyke