u84u/4chan-pol
Source: Hugging Face. Updated 2026-04-26. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/u84u/4chan-pol
Resource summary:
This dataset, Raiders Of The Lost Kek - Cleaned, is derived from the data described in the research paper https://arxiv.org/abs/2001.07487 and contains posts from 4chan's /pol/ board. "Cleaned" here refers to markup cleanup and restructuring; the posts themselves include toxic content. The dataset falls in the 100M to 1B entries size range and is in English. Processing: the raw dataset hosted on HuggingFace was extracted and decompressed, then recompressed at an adjusted compression level to produce a final parquet file of about 10GB. The HTML was cleaned by unescaping entities, stripping tags, converting <br> to newlines, and removing quotelinks. Posts were then paired with replies: a reply that explicitly references another post is paired with the post it references, and a reply with no reference is paired with the thread's original post (OP).
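The cleaning and pairing steps in the summary can be sketched roughly as follows. This is an illustration only, not the dataset author's actual script. It assumes each post is a dict with the 4chan API-style fields `no` (post id) and `com` (raw comment HTML), that a thread is a list of such posts with the OP first, and that quotelinks look like `<a href="#p123" class="quotelink">&gt;&gt;123</a>`. The function names and the order of the cleaning steps (tags stripped before entities are unescaped, so escaped angle brackets in the text survive) are choices made for this sketch.

```python
import html
import re

# Assumed quotelink markup: <a ... class="quotelink">&gt;&gt;123</a>
QUOTELINK_RE = re.compile(r'<a[^>]*class="quotelink"[^>]*>(?:&gt;|>){2}(\d+)</a>')
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')


def extract_refs(raw):
    """Return the post ids referenced by quotelinks, in order, without duplicates."""
    ids = [int(n) for n in QUOTELINK_RE.findall(raw)]
    return list(dict.fromkeys(ids))


def clean_comment(raw):
    """Remove quotelinks, turn <br> into newlines, strip tags, unescape entities."""
    text = QUOTELINK_RE.sub('', raw)
    text = BR_RE.sub('\n', text)
    text = TAG_RE.sub('', text)
    text = html.unescape(text)
    return text.strip()


def pair_posts(thread):
    """Build (post, reply) text pairs for one thread (hypothetical helper).

    A reply that references posts in this thread is paired with each of them;
    a reply with no usable reference is paired with the OP.
    """
    cleaned = {p['no']: clean_comment(p.get('com', '')) for p in thread}
    op_id = thread[0]['no']
    pairs = []
    for post in thread[1:]:
        refs = [r for r in extract_refs(post.get('com', '')) if r in cleaned]
        for target in refs or [op_id]:
            pairs.append((cleaned[target], cleaned[post['no']]))
    return pairs
```

References to posts outside the thread are ignored in this sketch, so such replies fall back to the OP; the real pipeline may handle them differently.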
Provider:
u84u



