reddit-title-body
ModelScope (魔搭社区), updated 2025-11-12, indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/sentence-transformers/reddit-title-body
Resource description:
# Reddit (Title, Body)-Pairs
This dataset contains JSON Lines (`.jsonl`) files of (title, body) pairs from Reddit. Each line is a JSON object in the following format:
```
{"title": "The title of a thread", "body": "The longer body of the thread", "subreddit": "subreddit_name"}
```
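As a minimal sketch of how one of these files might be read, the snippet below streams a gzip-compressed JSON Lines file record by record using only the Python standard library. The helper name `iter_pairs` is hypothetical, and the sketch assumes each line is valid JSON with the three keys shown above.

```python
import gzip
import json
from typing import Iterator


def iter_pairs(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file.

    `iter_pairs` is an illustrative helper, not part of the dataset.
    Assumes every non-empty line is a valid JSON object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Streaming avoids loading a whole file into memory, which matters here because the larger yearly files hold tens of millions of lines. A typical call would look like `for row in iter_pairs("reddit_title_text_2010.jsonl.gz"): ...`, reading `row["title"]`, `row["body"]`, and `row["subreddit"]`.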
The 2021 file contains submissions up to and including 2021-06. Entries in each file are shuffled on a monthly basis.
The data has been filtered as follows:
- Threads with an upvote_ratio < 0.5 are removed.
- Only threads with a title longer than 25 characters and a body satisfying `len(title)+25 < len(body) < 4096` are included.
- Only threads with at least 3 comments or at least 3 upvotes are kept.
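The three rules above can be sketched as a single predicate. This is an illustration of the stated criteria, not the script that produced the dataset. The function name `keep_thread` and its parameter names are assumptions, and the published files contain only `title`, `body`, and `subreddit`, so the vote and comment counts would have to come from the raw Pushshift submissions.

```python
def keep_thread(title: str, body: str, upvote_ratio: float,
                num_comments: int, upvotes: int) -> bool:
    """Return True if a thread passes the three filters described above.

    All names here are hypothetical; only the thresholds come from the text.
    """
    # Rule 1: remove threads with an upvote_ratio below 0.5.
    if upvote_ratio < 0.5:
        return False
    # Rule 2: title longer than 25 characters, and body length strictly
    # between len(title) + 25 and 4096.
    if len(title) <= 25:
        return False
    if not (len(title) + 25 < len(body) < 4096):
        return False
    # Rule 3: at least 3 comments or at least 3 upvotes.
    return num_comments >= 3 or upvotes >= 3
```

Note that the length bounds are strict on both sides, so a body of exactly `len(title) + 25` or exactly 4096 characters is rejected under this reading.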
## Overview
| File | Lines |
| --- | :---: |
| reddit_title_text_2010.jsonl.gz | 431,782 |
| reddit_title_text_2011.jsonl.gz | 1,673,264 |
| reddit_title_text_2012.jsonl.gz | 3,727,526 |
| reddit_title_text_2013.jsonl.gz | 5,713,956 |
| reddit_title_text_2014.jsonl.gz | 8,538,976 |
| reddit_title_text_2015.jsonl.gz | 11,064,453 |
| reddit_title_text_2016.jsonl.gz | 12,224,789 |
| reddit_title_text_2017.jsonl.gz | 13,558,139 |
| reddit_title_text_2018.jsonl.gz | 15,552,110 |
| reddit_title_text_2019.jsonl.gz | 19,224,970 |
| reddit_title_text_2020.jsonl.gz | 23,030,988 |
| reddit_title_text_2021.jsonl.gz | 12,704,958 |
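
For planning downloads or sampling, it can help to have the per-file counts in machine-readable form. The sketch below copies the line counts from the table into a dictionary keyed by year and sums them. The names `LINE_COUNTS`, `filename_for`, and `total_lines` are hypothetical; the file name pattern is taken from the table.

```python
# Line counts per year, copied from the overview table.
LINE_COUNTS = {
    2010: 431_782,
    2011: 1_673_264,
    2012: 3_727_526,
    2013: 5_713_956,
    2014: 8_538_976,
    2015: 11_064_453,
    2016: 12_224_789,
    2017: 13_558_139,
    2018: 15_552_110,
    2019: 19_224_970,
    2020: 23_030_988,
    2021: 12_704_958,
}


def filename_for(year: int) -> str:
    """Build the file name for a given year, following the table's pattern."""
    return f"reddit_title_text_{year}.jsonl.gz"


# Total number of (title, body) pairs across all yearly files.
total_lines = sum(LINE_COUNTS.values())
print(f"{len(LINE_COUNTS)} files, {total_lines:,} lines in total")
```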
Note: The data comes from [Pushshift](https://files.pushshift.io/reddit/). Please have a look at the respective license of Reddit and Pushshift before using the data.
Be aware that this dataset is not filtered for biases, hate speech, spam, racial slurs, etc. It depicts the content as it was posted on Reddit.
Provider:
maas
Created:
2025-01-06



