
reddit

ModelScope community · Updated 2025-11-27 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/reddit
Resource summary:
# Dataset Card for Reddit

This dataset contains titles and bodies of Reddit posts collected from the [Reddit-Title-Body dataset](https://huggingface.co/datasets/sentence-transformers/reddit-title-body). The data has been filtered as follows:

* Threads with an `upvote_ratio` < 0.5 are removed.
* Only threads with a title longer than 25 characters and a body satisfying `len(title) + 25 < len(body) < 4096` are included.
* Only threads with at least 3 comments or at least 3 upvotes are kept.

## Dataset Subsets

### `pair` subset

* Columns: "title", "body"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'title': 'Has anybody else watched Kings?',
      'body': "I know it's not SciFi per se, but I thought this kind of \"big concept\" show might appeal to the same group. I hadn't heard of it, but Hulu recommended it to me, and I ended up watching the entire thing over a couple of days. I thought it was absolutely fantastic, and I'm really bummed that it won't be coming back. I've been recommending it to everyone I know, but I haven't found anyone else who's watched it! Did anybody here? If so, what did people think? EDIT: P.S. It's all available on Hulu!",
    }
    ```
* Collection strategy: Concatenating all files from [Reddit-Title-Body dataset](https://huggingface.co/datasets/sentence-transformers/reddit-title-body).
* Deduplicated: No
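
The three filter rules above can be read as a single predicate over a thread. The following is a minimal sketch of that reading, not the code that was actually used to build the dataset. The `Thread` record and the field names `num_comments` and `upvotes` are hypothetical; only `upvote_ratio`, `title`, and `body` appear in the card, and the published `pair` subset exposes just `title` and `body`.

```python
from dataclasses import dataclass

# Thresholds quoted from the dataset card.
MIN_UPVOTE_RATIO = 0.5
MIN_TITLE_CHARS = 25      # the title must be strictly longer than this
BODY_MARGIN_CHARS = 25    # the body must exceed len(title) + this margin
MAX_BODY_CHARS = 4096     # the body must be strictly shorter than this
MIN_ENGAGEMENT = 3        # at least 3 comments or at least 3 upvotes


@dataclass
class Thread:
    """Hypothetical upstream record; not the schema of the published dataset."""
    title: str
    body: str
    upvote_ratio: float
    num_comments: int  # assumed field name
    upvotes: int       # assumed field name


def keep_thread(t: Thread) -> bool:
    """Return True if the thread passes all three rules listed in the card."""
    if t.upvote_ratio < MIN_UPVOTE_RATIO:
        return False
    if len(t.title) <= MIN_TITLE_CHARS:
        return False
    if not (len(t.title) + BODY_MARGIN_CHARS < len(t.body) < MAX_BODY_CHARS):
        return False
    return t.num_comments >= MIN_ENGAGEMENT or t.upvotes >= MIN_ENGAGEMENT


def to_pair(t: Thread) -> dict:
    """Project a thread onto the two columns of the `pair` subset."""
    return {"title": t.title, "body": t.body}
```

Under these assumptions, `[to_pair(t) for t in threads if keep_thread(t)]` would yield rows shaped like the example in the `pair` subset. Because the card states the data is not deduplicated, the sketch makes no attempt to drop repeated posts.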

Provider:
maas
Created:
2025-01-06