recursal/OKReddit-ReleaseCandidate3
Source: Hugging Face | Updated 2025-06-23 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/recursal/OKReddit-ReleaseCandidate3
Description:
OKReddit is a filtered collection of Reddit submissions and comments from 2005 to 2023, totaling **6.5 TiB** (an estimated 600M rows). The dataset has been prepared for research or archival purposes and covers a filtered list of subreddits. Its primary language is English, since most Reddit users are English speakers, but posts in other languages may also be present in smaller quantities. The data is organized as submission threads within subreddits, each containing detailed information about the submission and its comments. The creation process involves filtering subreddits by quality, selecting valuable submissions, and further refining the selection of comments. The dataset may be used for a variety of natural language processing (NLP) tasks, including text classification, language modeling, sentiment analysis, and topic modeling.
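To give a feel for the scale quoted above, the two figures in the description (6.5 TiB and an estimated 600M rows) can be combined into a rough average size per row. This is only a back-of-the-envelope sketch: the helper name `average_row_bytes` is made up for illustration, and the description does not say whether the 6.5 TiB figure refers to compressed or uncompressed data, so the result is an order-of-magnitude estimate at best.

```python
# Rough average row size from the figures quoted in the description.
# Assumption: 6.5 TiB and 600M rows describe the same data; whether the
# size is compressed or uncompressed is not stated in the source.

TIB = 2**40  # bytes in one tebibyte


def average_row_bytes(total_tib: float, rows: int) -> float:
    """Return the mean number of bytes per row for a dataset of the given size."""
    return total_tib * TIB / rows


avg = average_row_bytes(6.5, 600_000_000)
print(f"{avg / 1024:.1f} KiB per row")
```

Under these assumptions the average comes out to roughly 11.6 KiB per row, which suggests that streaming or sharded access is likely more practical than a full local download for most uses.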
Provided by:
recursal



