Reddit TL;DR summarization dataset
Source: arXiv; indexed 2025-09-30
Download link:
https://huggingface.co/datasets/openai/summarize_from_feedback
Overview:
This dataset contains Reddit posts, each paired with two candidate summaries, together with crowdsourced preferences between those summaries collected from multiple workers. It also includes the annotations of top workers, selected by the number of comparisons they annotated, which supports more efficient learning of individual user preferences. The training set contains 23,292 comparisons and the validation set contains 16,294 comparisons. The task is text summarization.
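The worker selection described above can be sketched as a simple count of comparisons per worker. This is a minimal illustration, not the dataset authors' procedure: the record layout (a `worker` field on each comparison) and the helper name `top_workers` are assumptions made for the example.

```python
from collections import Counter


def top_workers(comparisons, k):
    """Return the k worker IDs that annotated the most comparisons."""
    # Assumption: each comparison record carries a "worker" identifier.
    counts = Counter(c["worker"] for c in comparisons)
    return [worker for worker, _ in counts.most_common(k)]


# Toy records for illustration only; real records would come from the dataset.
toy = [
    {"worker": "w1", "choice": 0},
    {"worker": "w2", "choice": 1},
    {"worker": "w1", "choice": 1},
    {"worker": "w3", "choice": 0},
    {"worker": "w1", "choice": 0},
    {"worker": "w2", "choice": 0},
]

print(top_workers(toy, 2))  # ['w1', 'w2']
```

Restricting training data to the workers returned here would give each retained worker enough comparisons to model that worker's individual preferences.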
Provided by:
OpenAI
Collected Summary
Dataset Introduction

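Background and Challenges
Background Overview
This dataset contains data for learning to summarize from human feedback. It is split into two parts, 'comparisons' and 'axis', which are used to train and evaluate reward models. The data is drawn from the TL;DR dataset and from CNN and Daily Mail articles, and it is intended for training models to generate summaries that match human preferences.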
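To show how a 'comparisons' record might feed reward-model training, the sketch below turns one comparison into a chosen/rejected pair. The field names (`info.post`, `summaries[i].text`, `choice`) are assumptions based on the layout published on Hugging Face and should be checked against the actual data. The helper `to_preference_pair` is hypothetical.

```python
def to_preference_pair(example):
    """Convert one comparison record into a prompt/chosen/rejected dict."""
    # Assumption: "summaries" holds exactly two candidates, and "choice"
    # is the index (0 or 1) of the summary the annotator preferred.
    texts = [s["text"] for s in example["summaries"]]
    choice = example["choice"]
    return {
        "prompt": example["info"]["post"],
        "chosen": texts[choice],
        "rejected": texts[1 - choice],
    }


# Hand-written toy record for illustration only; not taken from the dataset.
example = {
    "info": {"post": "My roommate keeps eating my food. What should I do?"},
    "summaries": [
        {"text": "Roommate eats my food; need advice."},
        {"text": "Food."},
    ],
    "choice": 0,
}

pair = to_preference_pair(example)
print(pair["chosen"])  # Roommate eats my food; need advice.
```

With the Hugging Face `datasets` library installed, the real records could probably be loaded with `load_dataset("openai/summarize_from_feedback", "comparisons")` and mapped through this function. That call is an assumption based on the download link above.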
The content above was collected and summarized by 遇见数据集.



