
reddit-title-body

ModelScope community. Updated 2025-11-12; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/sentence-transformers/reddit-title-body
Description:
# Reddit (Title, Body)-Pairs

This dataset contains jsonl files of (title, body) pairs from Reddit. Each line is a JSON object in the following format:

```
{"title": "The title of a thread", "body": "The longer body of the thread", "subreddit": "subreddit_name"}
```

The 2021 file contains submissions up to and including 2021-06. Entries in the respective files are shuffled on a monthly basis.

The data has been filtered as follows:
- Threads with an upvote_ratio < 0.5 are removed.
- Only threads with a title longer than 25 characters and a body satisfying `len(title)+25 < len(body) < 4096` are included.
- Only threads with at least 3 comments or at least 3 upvotes are kept.

## Overview

| File | Lines |
| --- | :---: |
| reddit_title_text_2010.jsonl.gz | 431,782 |
| reddit_title_text_2011.jsonl.gz | 1,673,264 |
| reddit_title_text_2012.jsonl.gz | 3,727,526 |
| reddit_title_text_2013.jsonl.gz | 5,713,956 |
| reddit_title_text_2014.jsonl.gz | 8,538,976 |
| reddit_title_text_2015.jsonl.gz | 11,064,453 |
| reddit_title_text_2016.jsonl.gz | 12,224,789 |
| reddit_title_text_2017.jsonl.gz | 13,558,139 |
| reddit_title_text_2018.jsonl.gz | 15,552,110 |
| reddit_title_text_2019.jsonl.gz | 19,224,970 |
| reddit_title_text_2020.jsonl.gz | 23,030,988 |
| reddit_title_text_2021.jsonl.gz | 12,704,958 |

Note: The data comes from [Pushshift](https://files.pushshift.io/reddit/). Please review the respective licenses of Reddit and Pushshift before using the data.

Be aware that this dataset is not filtered for biases, hate speech, spam, racial slurs, etc. It depicts the content as it was posted on Reddit.
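The record format and the length rule described above can be illustrated with a minimal Python sketch. It assumes each line of a `.jsonl.gz` file parses with `json.loads` and that lengths are counted in characters of the stored strings. The helper names `iter_pairs` and `passes_length_filter` are hypothetical and not part of the dataset or any library. The upvote-ratio and comment-count rules are not reproduced, because the record format shown carries no such fields.

```python
import gzip
import json
from typing import Iterator


def iter_pairs(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file.

    Hypothetical helper: assumes every line is a valid JSON object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def passes_length_filter(title: str, body: str) -> bool:
    """Check the length rule stated in the card.

    Title longer than 25 characters, and len(title)+25 < len(body) < 4096.
    Assumption: lengths are measured in characters of the stored strings.
    """
    return len(title) > 25 and len(title) + 25 < len(body) < 4096
```

Under these assumptions, iterating `iter_pairs("reddit_title_text_2010.jsonl.gz")` streams records one at a time without loading the whole file into memory, which matters for the larger yearly files. Since the published files are already filtered, `passes_length_filter` is mainly useful as a sanity check or when applying the same rule to other data.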

Provider:
maas
Created:
2025-01-06