OKReddit-ReleaseCandidate3
OKReddit - Release Candidate 2023
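Dataset Summary
OKReddit is a filtered collection of Reddit submissions and comments, roughly 6.5 TiB in size (an estimated 600 million rows of Reddit submissions), covering 2005 through 2023. The dataset is intended primarily for research and archival purposes.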
Dataset Sources
- Source data: Academic Torrents (provided by stuck_in_the_matrix, Watchful1, RaiderBDev & pushshift folks)
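Supported Tasks and Leaderboards
The dataset can be used for a range of natural language processing (NLP) tasks, including:
- Text classification: classifying comments and posts by sentiment, topic, or subreddit.
- Language modeling: training language models to understand and generate conversational text.
- Sentiment analysis: analyzing the sentiment of comments and posts across subreddits and topics.
- Topic modeling: identifying and modeling the topics discussed in posts.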
Languages
The dataset is primarily in English. Posts in other languages are present, but in much smaller numbers.
Dataset Structure
Data Instances
Each data instance represents one submission thread within a subreddit.
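- thread_id: the submission thread ID, including the t3_ prefix that Reddit uses to mark threads.
- subreddit: the name of the subreddit. Case-insensitive.
- namedconversation: an OpenAI-compatible conversation, where each turn has:
  - from: the username of the author who posted the content.
  - content: the Reddit Markdown that was posted.
- submission/comments: the raw submission and comments.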
Data Example
```json
{
  "thread_id": "t3_of7h2",
  "subreddit": "Gaben",
  "namedconversation": [
    {"from": "[deleted]", "content": "[13 Jan 2012, 07:01:07] TIL Half-Life 2s source code was hacked because the hacker guessed Gabes password, which was \"gaben\"\nLink: half-life.wikia.com"},
    {"from": "clydethefrog", "content": "[15 Jan 2012, 18:01:06] Thats my password too"},
    {"from": "Dunge", "content": "[29 Feb 2012, 02:02:34] \"Gembe was led into believing that Valve wanted to employ him as an in-house security auditor. He was to be offered a flight to the USA and was to be arrested on arrival by the FBI.\"\nWow thats sad"},
    {"from": "captainregularr", "content": "[13 Jan 2012, 14:01:14] Did you know gaben makes me gaben my gaben?"},
    {"from": "Turellio", "content": "[13 Jan 2012, 17:01:53] thats what gaben gaben"},
    {"from": "captainregularr", "content": "[13 Jan 2012, 17:01:05] I gaben to gabens demands."},
    {"from": "RagingRetard", "content": "[13 Jan 2012, 17:01:49] Oh, quit your incessant gaben."}
  ],
  "submission": {
    "sub": {"name": "Gaben", "id": "2scx1", "subs": null, "type": null},
    "author": null,
    "title": "TIL Half-Life 2s source code was hacked because the hacker guessed Gabes password, which was \"gaben\"",
    "score": 23,
    "created": 1326440407.0,
    "id": "of7h2",
    "flags": "",
    "link_flair": null,
    "url": "http://half-life.wikia.com/wiki/Half-Life_2_Beta#Source_code_leak",
    "text": "",
    "removed": [],
    "cross": []
  },
  "comments": [
    {
      "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
      "author": {"name": "clydethefrog", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
      "text": "Thats my password too",
      "score": 1,
      "created": "1326652326",
      "id": "c3hge04",
      "parent_id": "t3_of7h2",
      "thread_id": "t3_of7h2",
      "flags": "A",
      "children": []
    },
    {
      "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
      "author": {"name": "Dunge", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
      "text": "\"Gembe was led into believing that Valve wanted to employ him as an in-house security auditor. He was to be offered a flight to the USA and was to be arrested on arrival by the FBI.\"\nWow thats sad",
      "score": 3,
      "created": "1330483894",
      "id": "c3w2ulz",
      "parent_id": "t3_of7h2",
      "thread_id": "t3_of7h2",
      "flags": "A",
      "children": []
    },
    {
      "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
      "author": {"name": "captainregularr", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
      "text": "Did you know gaben makes me gaben my gaben?",
      "score": 5,
      "created": "1326463514",
      "id": "c3gsfkx",
      "parent_id": "t3_of7h2",
      "thread_id": "t3_of7h2",
      "flags": "A",
      "children": [
        {
          "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
          "author": {"name": "Turellio", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
          "text": "thats what gaben gaben",
          "score": 3,
          "created": "1326476873",
          "id": "c3guihp",
          "parent_id": "t1_c3gsfkx",
          "thread_id": "t3_of7h2",
          "flags": "A",
          "children": [
            {
              "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
              "author": {"name": "captainregularr", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
              "text": "I gaben to gabens demands.",
              "score": 5,
              "created": "1326477005",
              "id": "c3guje0",
              "parent_id": "t1_c3guihp",
              "thread_id": "t3_of7h2",
              "flags": "AE",
              "children": [
                {
                  "sub": {"name": "Gaben", "id": "2scx1", "subs": -1, "type": ""},
                  "author": {"name": "RagingRetard", "uid": "", "create": -1, "flair": null, "patreon": false, "premium": false},
                  "text": "Oh, quit your incessant gaben.",
                  "score": 2,
                  "created": "1326477409",
                  "id": "c3gulzh",
                  "parent_id": "t1_c3guje0",
                  "thread_id": "t3_of7h2",
                  "flags": "A",
                  "children": []
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
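Because replies are nested under `children`, a depth-first walk is enough to recover the reply order of a thread. The following is a minimal sketch, assuming records shaped like the sample above; `iter_comments` is a hypothetical helper rather than part of any dataset tooling, and falling back to `[deleted]` for a missing author is likewise an assumption.

```python
from typing import Iterator


def iter_comments(comments: list[dict], depth: int = 0) -> Iterator[tuple[int, str, str]]:
    """Yield (depth, author name, text) for every comment, parents before replies."""
    for comment in comments:
        author = comment.get("author") or {}  # assumed: author may be null
        yield depth, author.get("name", "[deleted]"), comment["text"]
        yield from iter_comments(comment.get("children", []), depth + 1)


# A trimmed-down record that keeps only the fields this sketch reads.
record = {
    "comments": [
        {"author": {"name": "clydethefrog"}, "text": "Thats my password too", "children": []},
        {
            "author": {"name": "captainregularr"},
            "text": "Did you know gaben makes me gaben my gaben?",
            "children": [
                {"author": {"name": "Turellio"}, "text": "thats what gaben gaben", "children": []}
            ],
        },
    ]
}

for depth, name, text in iter_comments(record["comments"]):
    print("  " * depth + f"{name}: {text}")
```

Indenting by depth makes the reply structure visible without rebuilding the tree.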
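Additional Dataset Notes
Flags: Reddit exposes a number of boolean toggles that can be compacted into a single string. We have done so to reduce the number of boolean fields that need to be stored.
For submissions, the mapping from flag character to boolean name is as follows: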
```python
flag_map = {
    "!": "spoiler",
    "#": "stickied",
    ">": "pinned",
    "A": "archived",
    "C": "is_crosspostable",
    "c": "is_original_content",
    "E": "edited",
    "e": "is_meta",
    "G": "can_gild",
    "H": "hidden",
    "i": "is_robot_indexable",
    "L": "allow_live_comments",
    "l": "locked",
    "m": "is_reddit_media_domain",
    "M": "over_18",
    "O": "contest_mode",
    "q": "quarantine",
    "s": "is_self",
    "v": "is_video",
}
```
For comments:
```python
flag_map = {
    "#": "stickied",
    "A": "archived",
    "E": "edited",
    "G": "can_gild",
    "H": "hidden",
    "l": "locked",
    "=": "score_hidden",
    "P": "author_premium",
    "R": "send_replies",
    "O": "can_mod_post",
    "N": "no_follow",
}
```
In the named conversation, only the submission's over_18 flag is used.
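To read a compact flag string back into named booleans, each character can be looked up in the relevant map. The sketch below copies the comment map from above; `decode_flags` is a hypothetical helper, and treating an absent character as `False` is an assumption that follows from the compaction scheme rather than something stated explicitly.

```python
COMMENT_FLAG_MAP = {
    "#": "stickied",
    "A": "archived",
    "E": "edited",
    "G": "can_gild",
    "H": "hidden",
    "l": "locked",
    "=": "score_hidden",
    "P": "author_premium",
    "R": "send_replies",
    "O": "can_mod_post",
    "N": "no_follow",
}


def decode_flags(flags: str, flag_map: dict[str, str]) -> dict[str, bool]:
    """Expand a compact flag string into one boolean per known flag name."""
    # Assumption: a character missing from the string means the flag is False.
    return {name: char in flags for char, name in flag_map.items()}


decoded = decode_flags("AE", COMMENT_FLAG_MAP)
print([name for name, is_set in decoded.items() if is_set])
# prints ['archived', 'edited']
```

The same function works for submissions when given the submission map instead. The two maps are not interchangeable, since `O` means `contest_mode` for submissions and `can_mod_post` for comments.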
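Dataset Creation
Filtering Subreddit Quality
To build a more inclusive dataset while still maintaining standards, we implemented a pruning process that targets subreddits lacking valuable content, judged by three key metrics:
- Engagement: the ratio of total comments to total submissions, reflecting the subreddit's level of activity.
- Richness: the square of the proportion of media submissions among all submissions, indicating the density of multimedia content.
- Diversity: the sum of unique comment authors and unique submission authors, divided by the number of unique submission authors, indicating the breadth of community participation.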
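The three metrics can be sketched as a single function over per-subreddit counts. This is an illustration only: `subreddit_metrics` is a hypothetical name, the `media` key is an assumption (the `stats_data` layout otherwise follows the baseline check in this section), and how the resulting values are combined or thresholded is not specified here.

```python
def subreddit_metrics(stats_data: dict) -> dict[str, float]:
    """Compute engagement, richness, and diversity as described in the list above."""
    submissions = stats_data["submission"]["submissions"]
    comments = stats_data["comment"]["comments"]
    media = stats_data["submission"]["media"]  # hypothetical key: media submission count
    submission_authors = stats_data["submission"]["authors"]
    comment_authors = stats_data["comment"]["authors"]
    # Assumes submissions and submission_authors are non-zero.
    return {
        "engagement": comments / submissions,
        "richness": (media / submissions) ** 2,
        "diversity": (comment_authors + submission_authors) / submission_authors,
    }


example = {
    "submission": {"submissions": 1000, "media": 500, "authors": 100},
    "comment": {"comments": 4000, "authors": 300},
}
print(subreddit_metrics(example))
```

Squaring the media proportion means a subreddit where half the posts are media scores 0.25 on richness, so the metric grows slowly until media posts dominate.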
In addition, we set baseline thresholds on submission and author counts:
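```python
# Fragment: evaluated once per subreddit, with stats_data holding that subreddit's counts.
if (
    stats_data["submission"]["authors"] < 70  # total unique submission authors
    or stats_data["comment"]["authors"] < 20  # total unique commenters
    or stats_data["submission"]["submissions"] < 450  # total submissions
    or stats_data["comment"]["comments"] < 585  # total comments
):
    continue  # skip this subreddit
```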
Applying these criteria narrowed the set down to roughly 62,000 high-quality subreddits.
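Valuable Submissions
To eliminate subreddits with too few worthwhile submissions, we first identify "useful threads", which have one of the following characteristics:
- at least five replies,
- or, if the original post is text, more than 2,500 characters.
We then draw a random threshold between 5 and 20, and any subreddit with fewer useful threads than this randomly generated requirement is excluded.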
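The useful-thread rule and the randomized cutoff might look like the sketch below. The field names (`reply_count`, `is_text_post`, `body`) and both function names are hypothetical, and two details are assumptions: that the random threshold is compared against a subreddit's count of useful threads, and that both bounds are inclusive.

```python
import random


def is_useful_thread(reply_count: int, is_text_post: bool, body: str) -> bool:
    """Useful means at least five replies, or a text post longer than 2,500 characters."""
    return reply_count >= 5 or (is_text_post and len(body) > 2500)


def keep_subreddit(threads: list[dict], rng: random.Random) -> bool:
    """Keep a subreddit whose useful-thread count reaches a random threshold in [5, 20]."""
    threshold = rng.randint(5, 20)  # assumption: both bounds inclusive
    useful = sum(
        is_useful_thread(t["reply_count"], t["is_text_post"], t["body"]) for t in threads
    )
    return useful >= threshold


rng = random.Random(0)  # seeded only so the example is repeatable
busy = [{"reply_count": 6, "is_text_post": False, "body": ""}] * 25
quiet = [{"reply_count": 0, "is_text_post": True, "body": "short"}] * 25
print(keep_subreddit(busy, rng), keep_subreddit(quiet, rng))
```

With 25 useful threads a subreddit passes whatever threshold is drawn, while one with none fails every time. Subreddits with between 5 and 19 useful threads survive only some draws, which is what makes the cutoff soft.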
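Refined Comment Selection
After thread-level filtering, comments are screened further according to the following rules:
- Comments with a score below -4 are discarded.
- In threads with more than 50 comments, comments nested more than six levels deep are removed.
- If the cumulative score of a comment thread falls below zero, the remainder of that thread is pruned.
- Child comments of any parent comment pruned under the second or third rule are also removed.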
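One way to read these rules is as a recursive pass over the nested `children` lists. The sketch below is an interpretation, not the pipeline's actual code: `prune_comments` is a hypothetical name, top-level comments are counted as depth 1, the cumulative score is taken as the running sum of scores from the top-level comment down to the current one, and a comment dropped for its own low score takes its replies with it, which the rules above do not spell out.

```python
def prune_comments(comments, thread_size, depth=1, score_so_far=0):
    """Return a pruned copy of a nested comment list; the input is left untouched."""
    kept = []
    for comment in comments:
        if comment["score"] < -4:
            continue  # rule 1 (assumption: its replies are dropped as well)
        if thread_size > 50 and depth > 6:
            continue  # rule 2: too deep in a large thread
        cumulative = score_so_far + comment["score"]
        if cumulative < 0:
            continue  # rule 3: the rest of this branch is pruned
        # Rule 4 falls out of the recursion: skipped comments never visit their children.
        children = prune_comments(comment["children"], thread_size, depth + 1, cumulative)
        kept.append({**comment, "children": children})
    return kept


tree = [
    {"id": "a", "score": 3, "children": [
        {"id": "b", "score": -4, "children": [{"id": "c", "score": 10, "children": []}]},
        {"id": "d", "score": 2, "children": []},
    ]},
    {"id": "e", "score": -5, "children": []},
]
pruned = prune_comments(tree, thread_size=5)
print([c["id"] for c in pruned], [c["id"] for c in pruned[0]["children"]])
```

Here `e` is dropped for its score, and `b` is dropped because the running total along its branch reaches -1, which also removes `c` despite its high score.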
Dataset Creation
This dataset is a filtered collection of submissions and comments from the beginning of Reddit through the end of 2023.
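Considerations for Using the Data
Social Impact of Dataset
By releasing this dataset, we aim to make this development resource widely available to the community.
Discussion of Biases
We have decided not to censor NSFW or toxic content. This allows for better toxicity analysis and makes for a more varied dataset.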
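Additional Information
About RWKV
RWKV is an open-source, non-profit group under the Linux Foundation, focused on developing the RWKV AI architecture in pursuit of our vision.
About Recursal AI
Recursal AI is the commercial entity that supports RWKV model development and its users, while providing commercial services through its public cloud or through private-cloud and on-premise deployments.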
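Licensing Information
Because this dataset is derived from a public crawl of Reddit, the original content may be subject to copyright and other licensing terms. In addition, this dataset is intended for research and archival purposes only.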
Citation Information
If you use this dataset in your research or project, please cite it as follows:
```tex
@dataset{OKReddit,
  title = {OKReddit},
  year = {2024},
  publisher = {KaraKaraWitch},
  url = {https://huggingface.co/datasets/KaraKaraWitch/OKReddit}
}
```
Additionally, please cite the following source bibtex:
```tex
@article{,
  title = {Reddit comments/submissions 2005-06 to 2023-12},
  journal = {},
  author = {stuck_in_the_matrix, Watchful1, RaiderBDev},
  year = {},
  url = {},
  abstract = {Reddit comments and submissions from 2005-06 to 2023-09 collected by pushshift and u/RaiderBDev.

These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps

The more recent dumps are collected by u/RaiderBDev and questions can be submitted here https://github.com/ArthurHeitmann/arctic_shift},
  keywords = {reddit},
  terms = {},
  license = {},
  superseded = {}
}
```




