OxAISH-AL-LLM/wiki_toxic

Name: OxAISH-AL-LLM/wiki_toxic
Creator: OxAISH-AL-LLM
Published: 2022-09-19 15:53:19
License: cc0-1.0 (per the dataset card metadata)

Source: Hugging Face | Updated: 2022-09-19 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/OxAISH-AL-LLM/wiki_toxic
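
The path of the link above is the repository ID `OxAISH-AL-LLM/wiki_toxic`. A minimal sketch of loading the data with the Hugging Face `datasets` library follows. It assumes the library is installed, that the Hub (or a mirror) is reachable, and that the installed `datasets` version can read this repository; the helper name `load_wiki_toxic` is hypothetical, and the split names are taken from the dataset card.

```python
from typing import Optional

REPO_ID = "OxAISH-AL-LLM/wiki_toxic"  # repository ID from the download link
SPLITS = ("train", "validation", "test")  # split names as listed in the dataset card


def load_wiki_toxic(split: Optional[str] = None):
    """Load one split, or every split when `split` is None (hypothetical helper)."""
    if split is not None and split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_wiki_toxic("train")` would return the training split if the assumptions above hold; the split name is validated before any network access is attempted.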


Resource description:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: Toxic Wikipedia Comments
size_categories:
- 100K<n<1M
source_datasets:
- extended|other
tags:
- wikipedia
- toxicity
- toxic comments
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---

# Dataset Card for Wiki Toxic

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The Wiki Toxic dataset is a modified, cleaned version of the dataset used in the [Kaggle Toxic Comment Classification challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview) from 2017/18. The dataset contains comments collected from Wikipedia forums and classifies them into two categories, `toxic` and `non-toxic`.

The Kaggle dataset was cleaned using the included `clean.py` file.

### Supported Tasks and Leaderboards

- Text Classification: the dataset can be used for training a model to recognise toxicity in sentences and classify them accordingly.

### Languages

The sole language used in the dataset is English.

## Dataset Structure

### Data Instances

For each data point, there is an id, the comment_text itself, and a label (0 for non-toxic, 1 for toxic).

```
{'id': 'a123a58f610cffbc',
 'comment_text': '"This article SUCKS. It may be poorly written, poorly formatted, or full of pointless crap that no one cares about, and probably all of the above. If it can be rewritten into something less horrible, please, for the love of God, do so, before the vacuum caused by its utter lack of quality drags the rest of Wikipedia down into a bottomless pit of mediocrity."',
 'label': 1}
```

### Data Fields

- `id`: A unique identifier string for each comment
- `comment_text`: A string containing the text of the comment
- `label`: An integer, either 0 if the comment is non-toxic, or 1 if the comment is toxic

### Data Splits

The Wiki Toxic dataset has three splits: *train*, *validation*, and *test*. The statistics for each split are below:

| Dataset Split | Number of data points in split |
| ----------- | ----------- |
| Train | 127,656 |
| Validation | 31,915 |
| Test | 63,978 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.

Provider:

OxAISH-AL-LLM

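Summary of Original Information

Dataset Overview

Dataset Name

Name: Wiki Toxic
Alias: Toxic Wikipedia Comments

Dataset Attributes

Language: English
License: CC0-1.0
Multilinguality: monolingual
Size: 100K<n<1M
Source: extended from another dataset
Tags: wikipedia, toxicity, toxic comments
Task category: text classification
Task ID: hate-speech-detection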

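Dataset Description

Summary: The Wiki Toxic dataset is a modified, cleaned version of the dataset used in the 2017/18 Kaggle Toxic Comment Classification challenge. It contains comments collected from Wikipedia forums, each classified as toxic or non-toxic.
Supported tasks: Text classification; the dataset can be used to train a model to recognise toxicity in sentences and classify them accordingly.
Languages: The dataset is English only.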

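Dataset Structure

Data instances: Each data point contains an id, the comment_text, and a label (0 for non-toxic, 1 for toxic).
Data fields:
- id: a unique identifier string for each comment
- comment_text: a string containing the text of the comment
- label: an integer, 0 for non-toxic or 1 for toxic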
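
The three fields listed above can be checked mechanically. The sketch below is illustrative only: `validate_record` and `LABEL_NAMES` are hypothetical names, and the example record reuses the id from the dataset card with its comment text shortened.

```python
from typing import Any, Mapping

# Mapping from the integer label to the category name used in the dataset card.
LABEL_NAMES = {0: "non-toxic", 1: "toxic"}


def validate_record(record: Mapping[str, Any]) -> str:
    """Check a record against the documented fields and return its label name."""
    for field in ("id", "comment_text"):
        if not isinstance(record.get(field), str):
            raise TypeError(f"{field!r} must be a string")
    label = record.get("label")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(label, bool) or not isinstance(label, int) or label not in LABEL_NAMES:
        raise ValueError(f"label must be the integer 0 or 1, got {label!r}")
    return LABEL_NAMES[label]


# Example record: id from the dataset card, comment text shortened for brevity.
example = {"id": "a123a58f610cffbc", "comment_text": "This article SUCKS.", "label": 1}
print(validate_record(example))  # prints "toxic"
```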
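Data splits: The dataset is divided into training, validation, and test sets, with the following sizes:
- Training set: 127,656 data points
- Validation set: 31,915 data points
- Test set: 63,978 data points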
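
As a quick arithmetic check on the split sizes above, the following sketch totals the three splits and computes each split's share. The function name `split_shares` is hypothetical; the counts are the ones stated in the list.

```python
# Split sizes as stated in the dataset statistics.
SPLIT_SIZES = {"train": 127_656, "validation": 31_915, "test": 63_978}


def split_shares(sizes):
    """Return the total number of data points and each split's fraction of it."""
    total = sum(sizes.values())
    return total, {name: count / total for name, count in sizes.items()}


total, shares = split_shares(SPLIT_SIZES)
print(f"total: {total:,}")  # prints "total: 223,549"
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The splits sum to 223,549 data points, of which roughly 57% are training data, 14% validation, and 29% test.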

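Dataset Creation

Annotation creators: crowdsourced
Data collection and normalization: not provided
Annotation process: not provided
Personal and sensitive information: not provided

Considerations for Using the Data

Social impact: not provided
Discussion of biases: not provided
Other known limitations: not provided

Additional Information

Dataset curators: not provided
Licensing information: not provided
Citation information: not provided
Contributors: Thanks to @github-username for adding this dataset.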
