HuggingFaceGECLM/REDDIT_submissions

Name: HuggingFaceGECLM/REDDIT_submissions
Creator: HuggingFaceGECLM
Published: 2023-03-17 07:44:37
License: No description available

Source: Hugging Face. Updated 2023-03-17. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/HuggingFaceGECLM/REDDIT_submissions

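Overview:

This dataset contains submissions from 50 high-quality Reddit subreddits, extracted from the Reddit PushShift data dumps (covering 2006 through January 2023). It is organized into splits, one per subreddit. Each record carries fields such as author, title, body text, and comment count, and every field is stored as a string. The main intended uses are text generation, language modeling, and dialogue modeling. The dataset was created to preserve how Reddit submissions have changed over time, with the information fields reduced to the key ones. The source is the Reddit PushShift data dumps; the initial collection and normalization process is described in the related paper. Before use, the data should be anonymized, and although the selected subreddits are considered high quality, they may still reflect the biases and toxicity found on the internet.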

Provided by:

HuggingFaceGECLM

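Summary of Original Information

Dataset Card for "REDDIT_submissions"

Dataset Description

Dataset Summary

Submissions from 50 high-quality subreddits, extracted from the REDDIT PushShift data dumps (from 2006 to January 2023).

Supported Tasks

These submissions can be used for text generation and language modeling, as well as dialogue modeling.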

Dataset Structure

Data Fields

allow_live_comments: string
archived: string
author: string
author_fullname: string
banned_by: string
category: string
content_categories: string
contest_mode: string
created_utc: string
discussion_type: string
distinguished: string
domain: string
edited: string
gilded: string
hidden: string
hide_score: string
id: string
is_created_from_ads_ui: string
is_crosspostable: string
is_meta: string
is_original_content: string
is_reddit_media_domain: string
is_robot_indexable: string
is_self: string
is_video: string
locked: string
media: string
media_embed: string
media_only: string
name: string
no_follow: string
num_comments: string
num_crossposts: string
over_18: string
parent_whitelist_status: string
permalink: string
pinned: string
post_hint: string
pwls: string
quarantine: string
removed_by: string
removed_by_category: string
retrieved_on: string
score: string
secure_media: string
secure_media_embed: string
selftext: string
send_replies: string
spoiler: string
stickied: string
subreddit_id: string
subreddit_name_prefixed: string
subreddit_subscribers: string
subreddit_type: string
suggested_sort: string
title: string
top_awarded_type: string
total_awards_received: string
treatment_tags: string
upvote_ratio: string
url: string
url_overridden_by_dest: string
view_count: string
whitelist_status: string
wls: string
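
Because every field listed above is typed as a string, numeric and boolean values usually need to be decoded before analysis. The following is a minimal sketch of that step, not part of the dataset's own tooling. It assumes that booleans are serialized as "True"/"False", that missing values appear as None or empty strings, and that `created_utc` holds epoch seconds; none of these encodings is confirmed by the card, and the helper names (`parse_int`, `parse_bool`, `decode_submission`) and the choice of fields are illustrative.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Illustrative subsets of the fields listed in the card.
INT_FIELDS = ("num_comments", "score", "num_crossposts")
BOOL_FIELDS = ("is_self", "over_18", "locked", "stickied")


def parse_int(value: Optional[str]) -> Optional[int]:
    """Return an int, or None when the string is missing or not numeric."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def parse_bool(value: Optional[str]) -> Optional[bool]:
    """Map assumed encodings ("True"/"False", "1"/"0") to bool, else None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    return None


def decode_submission(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with selected string fields converted to types."""
    decoded = dict(row)
    for name in INT_FIELDS:
        if name in decoded:
            decoded[name] = parse_int(decoded[name])
    for name in BOOL_FIELDS:
        if name in decoded:
            decoded[name] = parse_bool(decoded[name])
    # Assumption: created_utc is a Unix timestamp in seconds.
    timestamp = parse_int(row.get("created_utc"))
    decoded["created_at"] = (
        datetime.fromtimestamp(timestamp, tz=timezone.utc)
        if timestamp is not None
        else None
    )
    return decoded
```

Values that cannot be parsed become None rather than raising, so a single malformed record does not abort a pass over a large split.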

Data Splits

Each split corresponds to one specific subreddit from the following list:

tifu: 711926746 bytes, 526283 examples
explainlikeimfive: 1407570925 bytes, 1811324 examples
WritingPrompts: 883683696 bytes, 1001358 examples
changemyview: 366049867 bytes, 257332 examples
LifeProTips: 596724168 bytes, 715494 examples
todayilearned: 1882122179 bytes, 2153849 examples
science: 675817380 bytes, 872768 examples
askscience: 1180347707 bytes, 1562708 examples
ifyoulikeblank: 248876237 bytes, 221368 examples
Foodforthought: 56817554 bytes, 70647 examples
IWantToLearn: 97666128 bytes, 103347 examples
bestof: 230879506 bytes, 341029 examples
IAmA: 375534116 bytes, 436003 examples
socialskills: 327412682 bytes, 260354 examples
relationship_advice: 5050087947 bytes, 3284961 examples
philosophy: 230221165 bytes, 212792 examples
YouShouldKnow: 87706881 bytes, 94635 examples
history: 295389153 bytes, 284318 examples
books: 635450859 bytes, 692807 examples
Showerthoughts: 4859309870 bytes, 6358205 examples
personalfinance: 1813984142 bytes, 1347837 examples
buildapc: 4754190700 bytes, 3030207 examples
EatCheapAndHealthy: 95544413 bytes, 79694 examples
boardgames: 379980593 bytes, 287493 examples
malefashionadvice: 523741819 bytes, 548587 examples
femalefashionadvice: 131338068 bytes, 131110 examples
scifi: 148283250 bytes, 134568 examples
Fantasy: 265612464 bytes, 175866 examples
Games: 1112497898 bytes, 830997 examples
bodyweightfitness: 154845910 bytes, 144829 examples
SkincareAddiction: 908265410 bytes, 890421 examples
podcasts: 114495922 bytes, 113707 examples
suggestmeabook: 307022597 bytes, 300601 examples
AskHistorians: 586939915 bytes, 592242 examples
gaming: 7306865977 bytes, 6418305 examples
DIY: 612049815 bytes, 505769 examples
mildlyinteresting: 1497282377 bytes, 1971187 examples
sports: 866461524 bytes, 783890 examples
space: 413125181 bytes, 415629 examples
gadgets: 242359652 bytes, 284487 examples
Documentaries: 658519015 bytes, 300935 examples
GetMotivated: 458864553 bytes, 395894 examples
UpliftingNews: 294091853 bytes, 285339 examples
technology: 1562501874 bytes, 2112572 examples
Fitness: 939461866 bytes, 1035109 examples
travel: 988622317 bytes, 1012452 examples
lifehacks: 124628404 bytes, 116871 examples
Damnthatsinteresting: 536680874 bytes, 397143 examples
gardening: 652169745 bytes, 723267 examples
programming: 455470198 bytes, 571221 examples
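
Since each split is a single subreddit and the splits vary widely in size, it can help to inspect the size figures before deciding what to load. The sketch below is illustrative only: `parse_split_line` reads one line of the list above (the regular expression ignores the unit words, so it does not depend on how they are spelled), and `stream_split` shows one way a split might be opened with the Hugging Face `datasets` library. It assumes the split names match the subreddit names exactly as listed; streaming is a choice made here to avoid downloading several gigabytes at once, not something the card prescribes. Both helper names are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches lines shaped like "<name>: <N> <unit>, <M> <unit>".
_SPLIT_LINE = re.compile(r"^\s*(\w+):\s*(\d+)\s+\S+,\s*(\d+)\s+\S+\s*$")


@dataclass(frozen=True)
class SplitStats:
    name: str
    num_bytes: int
    num_examples: int

    @property
    def bytes_per_example(self) -> float:
        """Average stored size of one submission in this split."""
        return self.num_bytes / self.num_examples


def parse_split_line(line: str) -> SplitStats:
    """Parse one entry of the split list into a SplitStats record."""
    match = _SPLIT_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized split line: {line!r}")
    name, num_bytes, num_examples = match.groups()
    return SplitStats(name, int(num_bytes), int(num_examples))


def stream_split(subreddit: str):
    """Open one subreddit split lazily (assumes split name == subreddit name)."""
    # Imported here so the parsing helper works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceGECLM/REDDIT_submissions",
        split=subreddit,
        streaming=True,
    )
```

For instance, `parse_split_line("Foodforthought: 56817554 bytes, 70647 examples").bytes_per_example` gives the average record size for the smallest split, and `next(iter(stream_split("Foodforthought")))` would fetch a single record, given network access and an installed `datasets` package.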

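Dataset Creation

Source Data

The Reddit PushShift data dumps are part of a data collection effort that crawls Reddit at regular intervals to extract and preserve all of its data.

Personal and Sensitive Information

The data contains the Redditor usernames associated with the content.

Considerations for Using the Data

This dataset should be anonymized before any processing. Although the selected subreddits are considered to be of high quality, they can still reflect the biases and toxic expression found on the internet.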
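
One possible first step toward the anonymization mentioned above is to replace the username fields with keyed hashes. This is a hedged sketch, not a procedure recommended by the dataset authors: it only pseudonymizes the `author` and `author_fullname` fields, and usernames or other personal details that appear inside `title`, `selftext`, or `permalink` would need separate handling. The helper name `pseudonymize` and the `user_` prefix are hypothetical, and the secret key is assumed to be stored outside the dataset.

```python
import hashlib
import hmac
from typing import Any, Dict

# Fields from the card that directly identify the posting account.
IDENTITY_FIELDS = ("author", "author_fullname")


def pseudonymize(row: Dict[str, Any], secret_key: bytes) -> Dict[str, Any]:
    """Return a copy of `row` with identity fields replaced by keyed hashes."""
    cleaned = dict(row)
    for name in IDENTITY_FIELDS:
        value = cleaned.get(name)
        # Leave missing values and Reddit's "[deleted]" placeholder untouched.
        if not value or value == "[deleted]":
            continue
        digest = hmac.new(
            secret_key, str(value).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        # The same input and key always map to the same token, so posts by one
        # account stay linked without exposing the original username.
        cleaned[name] = "user_" + digest[:16]
    return cleaned
```

A keyed hash (HMAC) is used rather than a plain hash so that tokens cannot be reversed by hashing a list of known usernames without the key.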
