
ylelauta/pol-4chan-augmented

Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ylelauta/pol-4chan-augmented
Resource description:
---
dataset_info:
  features:
  - name: thread_no
    dtype: int64
  - name: archived_on
    dtype: int64
  - name: semantic_url
    dtype: string
  - name: "no"
    dtype: int64
  - name: resto
    dtype: int64
  - name: time
    dtype: int64
  - name: now
    dtype: string
  - name: name
    dtype: string
  - name: trip
    dtype: string
  - name: sub
    dtype: string
  - name: com
    dtype: string
  - name: country
    dtype: string
  - name: country_name
    dtype: string
  - name: filename
    dtype: string
  - name: ext
    dtype: string
  - name: fsize
    dtype: int64
  - name: md5
    dtype: string
  - name: w
    dtype: int32
  - name: h
    dtype: int32
  - name: tn_w
    dtype: int32
  - name: tn_h
    dtype: int32
  - name: tim
    dtype: int64
  - name: replies
    dtype: int32
  - name: images
    dtype: int32
  - name: bumplimit
    dtype: int32
  - name: imagelimit
    dtype: int32
  - name: archived
    dtype: int32
  - name: closed
    dtype: int32
  - name: toxicity
    dtype: float32
  - name: severe_toxicity
    dtype: float32
  - name: inflammatory
    dtype: float32
  - name: profanity
    dtype: float32
  - name: insult
    dtype: float32
  - name: obscene
    dtype: float32
  - name: spam
    dtype: float32
  - name: entities
    dtype: string
  splits:
  - name: train
    num_bytes: 19440000000
    num_examples: 134529233
  download_size: 19440000000
  dataset_size: 134529233
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*-of-00270.parquet
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- 4chan
- toxicity
- perspective-api
- named-entities
- political
pretty_name: "/pol/ 4chan Augmented (Jun 2016 - Nov 2019)"
size_categories:
- 100M<n<1B
---

# /pol/ 4chan Augmented Dataset

**134.5M posts** from 3.4M threads on 4chan's /pol/ board (June 2016 - November 2019), augmented with Perspective API toxicity scores and named entity recognition.

## Dataset Description

This dataset contains posts from 4chan's Politically Incorrect (/pol/) board, collected between June 2016 and November 2019. Each post has been augmented with:

- **Perspective API toxicity scores** (7 dimensions): toxicity, severe_toxicity, inflammatory, profanity, insult, obscene, spam
- **Named Entity Recognition**: extracted entities stored as JSON arrays

### Source

Original data from [Papasavva et al. (2020)](https://zenodo.org/records/3606810): "Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board".

### Format

270 parquet shards with zstd compression (~70MB each). The dataset is directly loadable with HuggingFace `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("ylelauta/pol-4chan-augmented")
```

### Schema

| Field | Type | Description |
|-------|------|-------------|
| `thread_no` | int64 | Thread number (from OP) |
| `no` | int64 | Post number |
| `resto` | int64 | 0 = OP, >0 = reply-to thread number |
| `time` | int64 | Unix timestamp |
| `com` | string | Comment HTML |
| `country` / `country_name` | string | Poster's country flag |
| `sub` | string | Subject (OP only) |
| `name` / `trip` | string | Poster identity |
| `filename` / `ext` / `fsize` / `md5` / `w` / `h` / `tim` | mixed | Image metadata |
| `replies` / `images` | int32 | Thread stats (OP only) |
| `toxicity` | float32 | Perspective API toxicity score (0-1) |
| `severe_toxicity` | float32 | Severe toxicity score |
| `inflammatory` | float32 | Inflammatory score |
| `profanity` | float32 | Profanity score |
| `insult` | float32 | Insult score |
| `obscene` | float32 | Obscene score |
| `spam` | float32 | Spam score |
| `entities` | string | JSON array of named entities |

### Statistics

- **134,529,233** posts
- **3,397,911** threads
- **270** parquet shards
- Date range: June 2016 - November 2019

### Citation

```bibtex
@inproceedings{papasavva2020raiders,
  title={Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board},
  author={Papasavva, Antonis and Zannettou, Savvas and De Cristofaro, Emiliano and Stringhini, Gianluca and Blackburn, Jeremy},
  booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
  year={2020}
}
```

### License

CC-BY-4.0 (following the original dataset license)
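
### Example: interpreting a row

The schema above leaves a few decoding steps to the reader: `resto` marks whether a post opens a thread, `time` is a Unix timestamp, and `entities` is a JSON array stored in a string column. The following is a minimal sketch of those steps, not part of the original card. The helper `summarize_post`, the `TOXICITY_THRESHOLD` cutoff, and the sample row are all hypothetical, and the shape of the entity items (plain strings here) is an assumption, since the card does not specify it.

```python
import json
from datetime import datetime, timezone

# Hypothetical cutoff for flagging a post; the card does not recommend one.
TOXICITY_THRESHOLD = 0.8


def summarize_post(row: dict) -> dict:
    """Derive a few convenience fields from one row of the dataset."""
    # `entities` is documented as a JSON array stored as a string.
    # Treating a missing or empty value as "no entities" is an assumption.
    entities = json.loads(row.get("entities") or "[]")
    toxicity = row.get("toxicity")
    return {
        "no": row["no"],
        "thread_no": row["thread_no"],
        # Per the schema, resto == 0 marks the opening post of a thread.
        "is_op": row["resto"] == 0,
        # `time` is a Unix timestamp; convert it to an aware UTC datetime.
        "posted_at": datetime.fromtimestamp(row["time"], tz=timezone.utc),
        "entities": entities,
        "is_toxic": toxicity is not None and toxicity >= TOXICITY_THRESHOLD,
    }


# Made-up row for illustration only; it is not taken from the dataset.
SAMPLE_ROW = {
    "thread_no": 1000,
    "no": 1000,
    "resto": 0,
    "time": 1467331200,
    "toxicity": 0.12,
    "entities": '["Brussels", "EU"]',
}

if __name__ == "__main__":
    print(summarize_post(SAMPLE_ROW))
```

With roughly 134.5M rows, it may be more practical to pass `streaming=True` to `load_dataset` and apply a helper like this to rows as they arrive, rather than materializing the whole train split first.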

Provider:
ylelauta