LongReward-10k

Name: LongReward-10k
Creator: maas
Published: 2026-01-06 16:40:32
License: No description provided

ModelScope community · Updated 2026-01-06 · Indexed 2025-08-02

Download link:

https://modelscope.cn/datasets/ZhipuAI/LongReward-10k


Resource description:

# LongReward-10k

<p align="center">
💻 <a href="https://github.com/THUDM/LongReward" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2410.21252" target="_blank">[LongReward Paper]</a>
</p>

The **LongReward-10k** dataset contains 10,000 long-context QA instances (in both English and Chinese, up to 64,000 words).

The `sft` split contains SFT data generated by [GLM-4-0520](https://bigmodel.cn/dev/api/normal-model/glm-4), following the self-instruct method in [LongAlign](https://github.com/THUDM/LongAlign). Using this split, we fine-tune two models with supervision: [LongReward-glm4-9b-SFT](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) and [LongReward-llama3.1-8b-SFT](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT), which are based on [GLM-4-9B](https://huggingface.co/THUDM/glm-4-9b) and [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), respectively.

The `dpo_glm4_9b` and `dpo_llama3.1_8b` splits are long-context preference datasets, in which the winning and losing responses are sampled from the corresponding SFT model above and ranked by our proposed [LongReward](https://github.com/THUDM/LongReward) method. Using these preference datasets, we train two DPO models (based on the SFT checkpoints): [LongReward-glm4-9b-DPO](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) and [LongReward-llama3.1-8b-DPO](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO). More details can be found in our paper.
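To make the preference-data description concrete, here is a minimal sketch of how one winning/losing pair could be formed from several sampled responses once each has a reward score. It is an illustration under assumptions, not the authors' pipeline: the `Candidate` class, the `build_preference_pair` function, the output keys `prompt`, `chosen`, and `rejected`, and the choice of pairing the highest-scored response with the lowest-scored one are all hypothetical. The scoring step itself (the LongReward method) is not reproduced here; see the paper and repository for the actual procedure and data schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """One response sampled from an SFT model (hypothetical container)."""
    response: str
    reward: float  # assumed to come from a reward method such as LongReward


def build_preference_pair(prompt: str, candidates: List[Candidate]) -> Dict[str, str]:
    """Rank candidates by reward and return a winning/losing pair.

    Assumption: the top-ranked response is the winner and the
    bottom-ranked response is the loser.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Sort from highest to lowest reward.
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    return {
        "prompt": prompt,
        "chosen": ranked[0].response,
        "rejected": ranked[-1].response,
    }


if __name__ == "__main__":
    samples = [
        Candidate("answer A", 0.2),
        Candidate("answer B", 0.9),
        Candidate("answer C", 0.5),
    ]
    print(build_preference_pair("a long-context question", samples))
```

Pairs of this shape are what DPO training generally consumes: a prompt together with a preferred and a dispreferred response.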
## All Released Models

Here is the full list of models we released:

| Model | HF Repo | Training Dataset |
|---|---|---|
| LongReward-glm4-9b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) | `sft` split |
| LongReward-glm4-9b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) | `dpo_glm4_9b` split |
| LongReward-llama3.1-8b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT) | `sft` split |
| LongReward-llama3.1-8b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO) | `dpo_llama3.1_8b` split |

## Citation

If you find our work useful, please consider citing LongReward:

```
@article{zhang2024longreward,
  title={LongReward: Improving Long-context Large Language Models with AI Feedback},
  author={Jiajie Zhang and Zhongni Hou and Xin Lv and Shulin Cao and Zhenyu Hou and Yilin Niu and Lei Hou and Yuxiao Dong and Ling Feng and Juanzi Li},
  journal={arXiv preprint arXiv:2410.21252},
  year={2024}
}
```


Provider: maas

Created: 2025-07-30
