ylelauta/pol-4chan-augmented

Name: ylelauta/pol-4chan-augmented
Creator: ylelauta
Published: 2026-02-25 18:49:55
License: (no description provided)

Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ylelauta/pol-4chan-augmented


Description:

---
dataset_info:
  features:
  - name: thread_no
    dtype: int64
  - name: archived_on
    dtype: int64
  - name: semantic_url
    dtype: string
  - name: "no"
    dtype: int64
  - name: resto
    dtype: int64
  - name: time
    dtype: int64
  - name: now
    dtype: string
  - name: name
    dtype: string
  - name: trip
    dtype: string
  - name: sub
    dtype: string
  - name: com
    dtype: string
  - name: country
    dtype: string
  - name: country_name
    dtype: string
  - name: filename
    dtype: string
  - name: ext
    dtype: string
  - name: fsize
    dtype: int64
  - name: md5
    dtype: string
  - name: w
    dtype: int32
  - name: h
    dtype: int32
  - name: tn_w
    dtype: int32
  - name: tn_h
    dtype: int32
  - name: tim
    dtype: int64
  - name: replies
    dtype: int32
  - name: images
    dtype: int32
  - name: bumplimit
    dtype: int32
  - name: imagelimit
    dtype: int32
  - name: archived
    dtype: int32
  - name: closed
    dtype: int32
  - name: toxicity
    dtype: float32
  - name: severe_toxicity
    dtype: float32
  - name: inflammatory
    dtype: float32
  - name: profanity
    dtype: float32
  - name: insult
    dtype: float32
  - name: obscene
    dtype: float32
  - name: spam
    dtype: float32
  - name: entities
    dtype: string
  splits:
  - name: train
    num_bytes: 19440000000
    num_examples: 134529233
  download_size: 19440000000
  dataset_size: 134529233
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*-of-00270.parquet
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- 4chan
- toxicity
- perspective-api
- named-entities
- political
pretty_name: "/pol/ 4chan Augmented (Jun 2016 - Nov 2019)"
size_categories:
- 100M<n<1B
---

# /pol/ 4chan Augmented Dataset

**134.5M posts** from 3.4M threads on 4chan's /pol/ board (June 2016 - November 2019), augmented with Perspective API toxicity scores and named entity recognition.

## Dataset Description

This dataset contains posts from 4chan's Politically Incorrect (/pol/) board, collected between June 2016 and November 2019. Each post has been augmented with:

- **Perspective API toxicity scores** (7 dimensions): toxicity, severe_toxicity, inflammatory, profanity, insult, obscene, spam
- **Named Entity Recognition**: extracted entities stored as JSON arrays

### Source

Original data from [Papasavva et al. (2020)](https://zenodo.org/records/3606810), "Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board".

### Format

270 parquet shards with zstd compression (~70MB each). The dataset is directly loadable with HuggingFace `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("ylelauta/pol-4chan-augmented")
```

### Schema

| Field | Type | Description |
|-------|------|-------------|
| `thread_no` | int64 | Thread number (from OP) |
| `no` | int64 | Post number |
| `resto` | int64 | 0 = OP, >0 = reply-to thread number |
| `time` | int64 | Unix timestamp |
| `com` | string | Comment HTML |
| `country` / `country_name` | string | Poster's country flag |
| `sub` | string | Subject (OP only) |
| `name` / `trip` | string | Poster identity |
| `filename` / `ext` / `fsize` / `md5` / `w` / `h` / `tim` | mixed | Image metadata |
| `replies` / `images` | int32 | Thread stats (OP only) |
| `toxicity` | float32 | Perspective API toxicity score (0-1) |
| `severe_toxicity` | float32 | Severe toxicity score |
| `inflammatory` | float32 | Inflammatory score |
| `profanity` | float32 | Profanity score |
| `insult` | float32 | Insult score |
| `obscene` | float32 | Obscene score |
| `spam` | float32 | Spam score |
| `entities` | string | JSON array of named entities |

### Statistics

- **134,529,233** posts
- **3,397,911** threads
- **270** parquet shards
- Date range: June 2016 - November 2019

### Citation

```bibtex
@inproceedings{papasavva2020raiders,
  title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
  author={Papasavva, Antonis and Zannettou, Savvas and De Cristofaro, Emiliano and Stringhini, Gianluca and Blackburn, Jeremy},
  booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
  year={2020}
}
```

### License

CC-BY-4.0 (following the original dataset license)
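As a rough illustration of how the schema fields might be used together, the sketch below defines a few row-level helpers: telling an OP from a reply via `resto`, converting `time` to a UTC datetime, decoding the `entities` JSON string, and thresholding `toxicity`. The helper names, the 0.8 threshold, and the sample row are hypothetical and not part of the dataset card. The card does not document the structure of each entity element, so the elements are returned as-is. `stream_rows` assumes the `datasets` library's streaming mode, which avoids downloading all 270 shards up front.

```python
import json
from datetime import datetime, timezone


def is_op(row):
    """Per the schema, resto == 0 marks the opening post of a thread."""
    return row["resto"] == 0


def post_time(row):
    """Convert the Unix timestamp in `time` to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(row["time"], tz=timezone.utc)


def parse_entities(row):
    """Decode the `entities` JSON string; return [] if it is missing or malformed."""
    raw = row.get("entities")
    if not raw:
        return []
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return []
    return parsed if isinstance(parsed, list) else []


def is_toxic(row, threshold=0.8):
    """True if the toxicity score (0-1) meets the threshold (0.8 is an arbitrary choice)."""
    score = row.get("toxicity")
    return score is not None and score >= threshold


def stream_rows():
    """Yield rows lazily from the train split (needs network access and `datasets`)."""
    from datasets import load_dataset

    ds = load_dataset("ylelauta/pol-4chan-augmented", split="train", streaming=True)
    yield from ds


# Hypothetical row with made-up values, shaped like the documented schema.
sample = {
    "thread_no": 1,
    "no": 1,
    "resto": 0,
    "time": 1467331200,
    "toxicity": 0.91,
    "entities": '["Example Entity"]',
}

print(is_op(sample), post_time(sample).isoformat(), parse_entities(sample), is_toxic(sample))
```

To apply the helpers to real data, iterate over `stream_rows()` and stop after a fixed number of rows, for example with `itertools.islice`, since the full split holds 134,529,233 posts.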

