kjj0/4chanpol-openaimod
Source: Hugging Face. Updated 2024-01-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kjj0/4chanpol-openaimod
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: sexual
    dtype: float64
  - name: hate
    dtype: float64
  - name: violence
    dtype: float64
  - name: self-harm
    dtype: float64
  - name: sexual/minors
    dtype: float64
  - name: hate/threatening
    dtype: float64
  - name: violence/graphic
    dtype: float64
  splits:
  - name: train
    num_bytes: 23614214277
    num_examples: 114647404
  download_size: 14061193653
  dataset_size: 23614214277
---
# Dataset Card for "kjj0/4chanpol-openaimod"
**Warning: offensive content.**
This dataset contains 114M unique posts made between June 2016 and November 2019.
This is a variant of the dataset provided by [Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board](https://arxiv.org/abs/2001.07487).
We have deduplicated posts and stripped metadata to create an easily accessible collection of unique texts.
We have also provided OpenAI moderation scores. A variant without these scores can be found at [kjj0/4chanpol](https://huggingface.co/datasets/kjj0/4chanpol).
We created this dataset, together with the OpenAI predictions (which are fairly accurate), to cheaply obtain a massive labeled text dataset (albeit with some unpleasant content) for research on data selection, active learning, label noise, and training curricula.
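As a minimal sketch of score-based data selection, the snippet below streams the train split and keeps only posts whose seven moderation scores all fall below a cutoff. The helper names `select_below_threshold` and `stream_posts` and the default cutoff of 0.5 are illustrative assumptions, not part of the dataset. The sketch also assumes the scores lie in [0, 1], as OpenAI moderation category scores do, and that the Hugging Face `datasets` package is installed.

```python
from typing import Dict, Iterable, Iterator

# Score columns as listed in the dataset_info features.
SCORE_COLUMNS = (
    "sexual", "hate", "violence", "self-harm",
    "sexual/minors", "hate/threatening", "violence/graphic",
)


def select_below_threshold(rows: Iterable[Dict], threshold: float = 0.5) -> Iterator[Dict]:
    """Yield rows whose scores are all below `threshold` (hypothetical helper)."""
    for row in rows:
        if all(row[col] < threshold for col in SCORE_COLUMNS):
            yield row


def stream_posts():
    """Stream the train split instead of downloading it in full first.

    Assumes the `datasets` package is installed and the Hub is reachable.
    """
    # Imported lazily so the filter above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset("kjj0/4chanpol-openaimod", split="train", streaming=True)
```

A typical use would be `for row in select_below_threshold(stream_posts(), threshold=0.1): ...`. Streaming avoids materializing all 114M rows before any filtering happens.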
```
@inproceedings{papasavva2020raiders,
  title={Raiders of the lost kek: 3.5 years of augmented 4chan posts from the politically incorrect board},
  author={Papasavva, Antonis and Zannettou, Savvas and De Cristofaro, Emiliano and Stringhini, Gianluca and Blackburn, Jeremy},
  booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
  volume={14},
  pages={885--894},
  year={2020}
}
```
Provided by:
kjj0
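Original information summary
Dataset overview
Dataset information
- Features:
  - text: the post text (string).
  - sexual: sexual content score (float64).
  - hate: hate score (float64).
  - violence: violence score (float64).
  - self-harm: self-harm score (float64).
  - sexual/minors: score for sexual content involving minors (float64).
  - hate/threatening: threatening hate score (float64).
  - violence/graphic: graphic violence score (float64).
Data splits
- Train split:
  - Name: train
  - Bytes: 23614214277
  - Examples: 114647404
Dataset size
- Download size: 14061193653 bytes
- Dataset size: 23614214277 bytes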
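The size figures above can be turned into two quick estimates: the average stored size of one example and the ratio of download size to prepared dataset size. The snippet is only a back-of-the-envelope sketch; the constant names are illustrative, and the values are copied from the metadata.

```python
# Figures copied from the dataset_info metadata (all sizes in bytes).
NUM_BYTES = 23_614_214_277      # num_bytes / dataset_size of the train split
NUM_EXAMPLES = 114_647_404      # num_examples of the train split
DOWNLOAD_SIZE = 14_061_193_653  # download_size

# Average stored size of one row (text plus seven float64 scores).
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Fraction of the prepared dataset size that must be downloaded.
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size / dataset size: {download_ratio:.3f}")
```

This works out to about 206 bytes per example, with the download a little under 60% of the prepared size.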
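Dataset description
- Contains 114M unique posts made between June 2016 and November 2019.
- A variant of the dataset from Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
- Posts were deduplicated and metadata stripped to create an easily accessible collection of unique texts.
- Includes OpenAI moderation scores.
- Purpose: research on data selection, active learning, label noise, and training curricula, although some of the content is unpleasant.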



