ylelauta/pol-4chan-augmented
Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ylelauta/pol-4chan-augmented
Resource description:
---
dataset_info:
  features:
  - name: thread_no
    dtype: int64
  - name: archived_on
    dtype: int64
  - name: semantic_url
    dtype: string
  - name: "no"
    dtype: int64
  - name: resto
    dtype: int64
  - name: time
    dtype: int64
  - name: now
    dtype: string
  - name: name
    dtype: string
  - name: trip
    dtype: string
  - name: sub
    dtype: string
  - name: com
    dtype: string
  - name: country
    dtype: string
  - name: country_name
    dtype: string
  - name: filename
    dtype: string
  - name: ext
    dtype: string
  - name: fsize
    dtype: int64
  - name: md5
    dtype: string
  - name: w
    dtype: int32
  - name: h
    dtype: int32
  - name: tn_w
    dtype: int32
  - name: tn_h
    dtype: int32
  - name: tim
    dtype: int64
  - name: replies
    dtype: int32
  - name: images
    dtype: int32
  - name: bumplimit
    dtype: int32
  - name: imagelimit
    dtype: int32
  - name: archived
    dtype: int32
  - name: closed
    dtype: int32
  - name: toxicity
    dtype: float32
  - name: severe_toxicity
    dtype: float32
  - name: inflammatory
    dtype: float32
  - name: profanity
    dtype: float32
  - name: insult
    dtype: float32
  - name: obscene
    dtype: float32
  - name: spam
    dtype: float32
  - name: entities
    dtype: string
  splits:
  - name: train
    num_bytes: 19440000000
    num_examples: 134529233
  download_size: 19440000000
  dataset_size: 134529233
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*-of-00270.parquet
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- 4chan
- toxicity
- perspective-api
- named-entities
- political
pretty_name: "/pol/ 4chan Augmented (Jun 2016 - Nov 2019)"
size_categories:
- 100M<n<1B
---
# /pol/ 4chan Augmented Dataset
**134.5M posts** from 3.4M threads on 4chan's /pol/ board (June 2016 - November 2019), augmented with Perspective API toxicity scores and named entity recognition.
## Dataset Description
This dataset contains posts from 4chan's Politically Incorrect (/pol/) board, collected between June 2016 and November 2019. Each post has been augmented with:
- **Perspective API toxicity scores** (7 dimensions): toxicity, severe_toxicity, inflammatory, profanity, insult, obscene, spam
- **Named Entity Recognition**: extracted entities stored as JSON arrays
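As a minimal sketch of how the two augmentation fields might be consumed, the helpers below read the seven score columns and decode the `entities` string. The names `parse_entities` and `flagged_dimensions`, the 0.8 threshold, and the handling of missing values are illustrative assumptions. The card does not describe the structure of individual entity items, so the decoded list is returned as-is.

```python
import json

# The seven Perspective API score columns listed above.
SCORE_FIELDS = (
    "toxicity", "severe_toxicity", "inflammatory",
    "profanity", "insult", "obscene", "spam",
)

def parse_entities(raw):
    """Decode the `entities` column; return [] for missing or malformed values."""
    if not raw:
        return []
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        return []
    # The card says the column holds a JSON array, so anything else is ignored.
    return value if isinstance(value, list) else []

def flagged_dimensions(post, threshold=0.8):
    """Return the score fields whose value meets the (assumed) threshold."""
    return [
        field for field in SCORE_FIELDS
        if post.get(field) is not None and post[field] >= threshold
    ]
```

Both helpers accept any dict-like row, so they can be tried on hand-written dictionaries before being applied to the real data.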
### Source
Original data from [Papasavva et al. (2020)](https://zenodo.org/records/3606810) — "Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board".
### Format
The data is stored as 270 Parquet shards with zstd compression (~70MB each). It can be loaded directly with the Hugging Face `datasets` library:
```python
from datasets import load_dataset
ds = load_dataset("ylelauta/pol-4chan-augmented")
```
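Because the card metadata lists a download size of roughly 19.4 GB, fetching everything may be impractical for a first look. The sketch below relies on the `streaming=True` option of `load_dataset` to iterate lazily and keep only highly toxic posts. The function names and the 0.9 cutoff are hypothetical, and the filtering logic is kept separate from the loader so it can be exercised on plain dictionaries without network access.

```python
from itertools import islice

def is_highly_toxic(post, cutoff=0.9):
    """True when the `toxicity` score is present and at or above the cutoff."""
    score = post.get("toxicity")
    return score is not None and score >= cutoff

def take_toxic(posts, limit=5, cutoff=0.9):
    """Return up to `limit` matching posts from any iterable of dict-like rows."""
    matches = (post for post in posts if is_highly_toxic(post, cutoff))
    return list(islice(matches, limit))

def stream_train_split():
    """Open the train split lazily instead of downloading every shard."""
    # Imported here so the helpers above work without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(
        "ylelauta/pol-4chan-augmented", split="train", streaming=True
    )
```

A call such as `take_toxic(stream_train_split(), limit=3)` would then stop reading as soon as three matching posts have been found.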
### Schema
| Field | Type | Description |
|-------|------|-------------|
| `thread_no` | int64 | Thread number (from OP) |
| `no` | int64 | Post number |
| `resto` | int64 | 0 for an OP; otherwise the number of the thread being replied to |
| `time` | int64 | Unix timestamp |
| `com` | string | Comment HTML |
| `country` / `country_name` | string | Poster's country code / country name (shown as a flag) |
| `sub` | string | Subject (OP only) |
| `name` / `trip` | string | Poster identity |
| `filename` / `ext` / `fsize` / `md5` / `w` / `h` / `tim` | mixed | Image metadata |
| `replies` / `images` | int32 | Thread stats (OP only) |
| `toxicity` | float32 | Perspective API toxicity score (0-1) |
| `severe_toxicity` | float32 | Severe toxicity score |
| `inflammatory` | float32 | Inflammatory score |
| `profanity` | float32 | Profanity score |
| `insult` | float32 | Insult score |
| `obscene` | float32 | Obscene score |
| `spam` | float32 | Spam score |
| `entities` | string | JSON array of named entities |
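The schema above implies a few routine conversions: telling OPs from replies via `resto`, turning `time` into a datetime, and reducing the HTML in `com` to text. The following is a rough sketch with hypothetical helper names. The regex-based tag stripping is an assumption about what is adequate for short comment markup, not a full HTML parser.

```python
import html
import re
from datetime import datetime, timezone

_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")

def is_op(post):
    """Per the schema, `resto == 0` marks the opening post of a thread."""
    return post["resto"] == 0

def posted_at(post):
    """Convert the Unix `time` field to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(post["time"], tz=timezone.utc)

def comment_text(post):
    """Approximate plain-text rendering of the HTML stored in `com`."""
    raw = post.get("com") or ""       # posts without a comment yield ""
    text = _BR.sub("\n", raw)         # keep line breaks as newlines
    text = _TAG.sub("", text)         # drop any remaining tags
    return html.unescape(text).strip()  # decode entities such as &gt;
```

For analyses where markup matters (for example, following quote links between posts), a proper HTML parser would be a safer choice than these regular expressions.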
### Statistics
- **134,529,233** posts
- **3,397,911** threads
- **270** parquet shards
- Date range: June 2016 - November 2019
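A quick sanity check on these figures, using only numbers stated in this card (the byte count is the `download_size` from the metadata) and assuming shards are roughly equal in size:

```python
POSTS = 134_529_233
THREADS = 3_397_911
SHARDS = 270
DOWNLOAD_BYTES = 19_440_000_000  # `download_size` from the card metadata

posts_per_thread = POSTS / THREADS
posts_per_shard = POSTS / SHARDS
bytes_per_shard = DOWNLOAD_BYTES / SHARDS

print(f"{posts_per_thread:.1f} posts per thread on average")
print(f"{posts_per_shard:,.0f} posts per shard on average")
print(f"{bytes_per_shard / 1e6:.0f} MB per shard on average")
```

This works out to about 39.6 posts per thread, roughly 498,000 posts per shard, and 72 MB per shard, which is consistent with the "~70MB each" stated under Format.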
### Citation
```bibtex
@inproceedings{papasavva2020raiders,
title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
author={Papasavva, Antonis and Zannettou, Savvas and De Cristofaro, Emiliano and Stringhini, Gianluca and Blackburn, Jeremy},
booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
year={2020}
}
```
### License
CC-BY-4.0 (following the original dataset license)
Provider:
ylelauta



