ModelScope community · Updated 2025-11-27 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/reddit
Description:
# Dataset Card for Reddit
This dataset contains titles and bodies of Reddit posts collected from the [Reddit-Title-Body dataset](https://huggingface.co/datasets/sentence-transformers/reddit-title-body).
The data has been filtered as follows:
* Threads with an upvote_ratio < 0.5 are removed.
* Only threads whose title is longer than 25 characters and whose body satisfies len(title)+25 < len(body) < 4096 are included.
* Only threads with at least 3 comments or at least 3 upvotes are kept.
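The three filter rules can be read as a single predicate over a thread. The following is a minimal sketch, not the code used to build the dataset: the function name `keep_thread` and the field names `num_comments` and `upvotes` are assumptions made for illustration (only `upvote_ratio`, `title`, and `body` appear in this card), and lengths are measured with Python's `len`, i.e. in characters.

```python
from typing import Any, Mapping

# Thresholds taken from the filter rules listed in this card.
MIN_UPVOTE_RATIO = 0.5
MIN_TITLE_CHARS = 25
BODY_MARGIN = 25
MAX_BODY_CHARS = 4096
MIN_ENGAGEMENT = 3


def keep_thread(thread: Mapping[str, Any]) -> bool:
    """Return True if a thread passes all three filter rules.

    `num_comments` and `upvotes` are hypothetical field names.
    """
    title, body = thread["title"], thread["body"]

    # Rule 1: drop threads with an upvote_ratio below 0.5.
    if thread["upvote_ratio"] < MIN_UPVOTE_RATIO:
        return False

    # Rule 2: title longer than 25 characters, and
    # len(title) + 25 < len(body) < 4096.
    if len(title) <= MIN_TITLE_CHARS:
        return False
    if not (len(title) + BODY_MARGIN < len(body) < MAX_BODY_CHARS):
        return False

    # Rule 3: at least 3 comments or at least 3 upvotes.
    return (
        thread["num_comments"] >= MIN_ENGAGEMENT
        or thread["upvotes"] >= MIN_ENGAGEMENT
    )
```

Note that both length bounds on the body are strict, so a body of exactly `len(title) + 25` or exactly 4096 characters is rejected.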
## Dataset Subsets
### `pair` subset
* Columns: "title", "body"
* Column types: `str`, `str`
* Examples:
```python
{
'title': 'Has anybody else watched Kings?',
'body': "I know it's not SciFi per se, but I thought this kind of \"big concept\" show might appeal to the same group. I hadn't heard of it, but Hulu recommended it to me, and I ended up watching the entire thing over a couple of days. I thought it was absolutely fantastic, and I'm really bummed that it won't be coming back. I've been recommending it to everyone I know, but I haven't found anyone else who's watched it! Did anybody here? If so, what did people think? EDIT: P.S. It's all available on Hulu!",
}
```
* Collection strategy: Concatenating all files from [Reddit-Title-Body dataset](https://huggingface.co/datasets/sentence-transformers/reddit-title-body).
* Deduplicated: No
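Because the subset is not deduplicated, the same (title, body) pair may appear more than once. If that matters for a downstream task, one possible approach is exact-match deduplication, sketched below. This is an assumption about what a user might want, not part of the dataset's own pipeline; the helper name `dedupe_pairs` is hypothetical, and near-duplicates (for example, posts differing only in whitespace) are not detected.

```python
from typing import Dict, Iterable, Iterator


def dedupe_pairs(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each row whose (title, body) pair has not been seen before.

    Hypothetical helper: keeps the first occurrence and preserves order.
    """
    seen = set()
    for row in rows:
        key = (row["title"], row["body"])
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        yield row


# Usage on a small in-memory sample with the columns of the `pair` subset.
sample = [
    {"title": "A", "body": "first"},
    {"title": "A", "body": "first"},   # exact duplicate, dropped
    {"title": "A", "body": "second"},  # same title, different body, kept
]
unique_rows = list(dedupe_pairs(sample))
```

The set of seen keys grows with the number of unique rows, so for very large inputs hashing each pair to a fixed-size digest first may be preferable.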
Provider:
maas
Created:
2025-01-06



