arctic
Source: Hugging Face | Updated: 2026-03-15 | Indexed: 2026-03-20
Download link:
https://huggingface.co/datasets/open-index/arctic
Description:
Arctic Shift Reddit Archive is a dataset of comments and submissions from all public subreddits on Reddit from 2005 to 2010, stored as Parquet files sharded by month. It is derived from the Arctic Shift project's reprocessing of the PushShift Reddit archive. It contains 71.5 million comments and 14.8 million submissions, 86.3 million items in total, with a compressed size of 7.0 GB. Comments and submissions are stored separately, so either type can be loaded on demand or streamed. The dataset is suited to language model training, sentiment analysis, community research, and information retrieval. It documents its fields in detail: for comments, these include ID, author, subreddit, body, score, and creation time; for submissions, they include title, author, subreddit, and URL. The README also includes sample code for querying and downloading the data with DuckDB, the Hugging Face datasets library, and huggingface_hub.
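The description says either record type can be loaded on demand or streamed. The following is a minimal sketch of one way to do that with huggingface_hub and the datasets library; it is not the README's own sample code. The repository ID comes from the download link above. The helper names `select_parquet` and `stream_records` are hypothetical, and the sketch assumes that each shard's path contains the word "comments" or "submissions", which this listing does not confirm. Listing the repository files first avoids guessing the exact directory layout.

```python
from typing import Iterable, List

REPO_ID = "open-index/arctic"  # taken from the download link above


def select_parquet(paths: Iterable[str], kind: str) -> List[str]:
    """Keep Parquet files whose path mentions `kind`.

    Assumption: the record type ("comments" or "submissions") appears
    somewhere in each shard's path. Check the repository to confirm.
    """
    return sorted(p for p in paths if p.endswith(".parquet") and kind in p)


def stream_records(kind: str, limit: int = 5):
    """Yield the first `limit` rows of one record type without a full download."""
    # Imported lazily so select_parquet works without these packages installed.
    from datasets import load_dataset
    from huggingface_hub import list_repo_files

    files = select_parquet(list_repo_files(REPO_ID, repo_type="dataset"), kind)
    if not files:
        raise ValueError(f"no Parquet shards matched {kind!r}")

    # streaming=True reads rows lazily instead of downloading every shard.
    ds = load_dataset(REPO_ID, data_files=files, split="train", streaming=True)
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row


if __name__ == "__main__":
    # Requires network access plus the datasets and huggingface_hub packages.
    for row in stream_records("comments"):
        print(row)
```

Passing a shorter file list to `data_files` (for example, only the shards for a single month) limits how much data is read.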
Created:
2026-03-15



