tldr

Name: tldr
Creator: maas
Published: 2025-12-11 18:42:40
License: No description available

ModelScope community · Updated 2025-12-11 · Indexed 2025-02-22

Download link:

https://modelscope.cn/datasets/okwinds/tldr

Resource description:

# For a walkthrough of this dataset, see the WeChat article below

### 觉察流 - [Open-R1: An In-Depth Look at Progress on the Open-Source Reproduction of DeepSeek-R1](https://mp.weixin.qq.com/s/TxRaI8amE_N__1VU4XHvMg)

> Notice: this dataset is reposted in full from [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr) on Hugging Face. The text that follows is the description from the original dataset repository.

#### Download

The dataset's file metadata and data files are available on the "Dataset Files" page. The dataset can be downloaded with a Git clone command or with the ModelScope SDK.

# Dataset description

# TL;DR Dataset

## Summary

The TL;DR dataset is a processed version of Reddit posts, curated for training summarization models with the [TRL library](https://github.com/huggingface/trl). It relies on a common Reddit practice: users append a "TL;DR" (Too Long; Didn't Read) summary to a lengthy post, which yields paired text (full post, author-written summary) for training summarization models.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Prompt-completion](https://huggingface.co/docs/trl/main/dataset_formats#prompt-completion)

Columns:

- `"prompt"`: The unabridged Reddit post.
- `"completion"`: The concise "TL;DR" summary appended by the author.

This structure lets a model learn the mapping from detailed content to its abbreviated form, which is the skill a summarization model needs.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/tldr.py).
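The prompt-completion structure described under "Data Structure" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names (`is_prompt_completion`, `to_training_text`, `load_tldr`) are hypothetical, the sample record is invented rather than taken from the dataset, and the `TITLE:`/`POST:`/`TL;DR:` layout of the prompt is only a plausible guess at how a post might be laid out. The `load_tldr` helper assumes the `datasets` package is installed, that the Hugging Face Hub is reachable, and that the requested split exists.

```python
"""Sketch: handling prompt-completion records like those in the TL;DR dataset."""

REQUIRED_COLUMNS = ("prompt", "completion")


def is_prompt_completion(record: dict) -> bool:
    """Return True if both columns are present as non-empty strings."""
    return all(
        isinstance(record.get(name), str) and bool(record[name].strip())
        for name in REQUIRED_COLUMNS
    )


def to_training_text(record: dict) -> str:
    """Join the post and its summary into a single training string."""
    if not is_prompt_completion(record):
        raise ValueError(
            f"expected columns {REQUIRED_COLUMNS}, got {sorted(record)}"
        )
    return record["prompt"] + record["completion"]


def load_tldr(split: str = "train"):
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed and the split exists.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    return load_dataset("trl-lib/tldr", split=split)


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "prompt": (
        "TITLE: My bike broke\n\n"
        "POST: I spent the whole weekend taking it apart and "
        "putting it back together.\n\nTL;DR:"
    ),
    "completion": " Spent the weekend repairing my bike.",
}
```

Calling `to_training_text(example)` concatenates the two columns, so the summary follows directly after the post. A record whose column is misspelled (for instance `"pompt"` instead of `"prompt"`) fails the `is_prompt_completion` check, which is a cheap guard to run before training.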

Provider:

maas

Created:

2025-02-15

Collected summary

Dataset introduction