LongReward-10k
Source: ModelScope community (updated 2026-01-06, indexed 2025-08-02)
Download link:
https://modelscope.cn/datasets/ZhipuAI/LongReward-10k
# LongReward-10k
<p align="center">
💻 <a href="https://github.com/THUDM/LongReward" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2410.21252" target="_blank">[LongReward Paper]</a>
</p>
The **LongReward-10k** dataset contains 10,000 long-context QA instances in English and Chinese, with contexts of up to 64,000 words.
The `sft` split contains SFT data generated by [GLM-4-0520](https://bigmodel.cn/dev/api/normal-model/glm-4), following the self-instruct method in [LongAlign](https://github.com/THUDM/LongAlign). Using this split, we trained two models with supervised fine-tuning: [LongReward-glm4-9b-SFT](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) and [LongReward-llama3.1-8b-SFT](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT), which are based on [GLM-4-9B](https://huggingface.co/THUDM/glm-4-9b) and [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), respectively.
The `dpo_glm4_9b` and `dpo_llama3.1_8b` splits are long-context preference datasets, where the winning and losing responses are sampled from the corresponding SFT model above and ranked by our proposed [LongReward](https://github.com/THUDM/LongReward) method. Using these preference datasets, we train two DPO models (based on the SFT checkpoints): [LongReward-glm4-9b-DPO](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) and [LongReward-llama3.1-8b-DPO](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO). More details can be found in our paper.
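As a rough illustration of how the three splits relate, the sketch below picks the preference split that matches an SFT model and loads a split by name. It is not the authors' loading code. The helper names `dpo_split_for` and `load_longreward_split` are hypothetical, and the Hugging Face repo id `THUDM/LongReward-10k` is an assumption, since this card only links the ModelScope copy (`ZhipuAI/LongReward-10k`). Record field names are not described here, so the sketch does not touch them.

```python
from typing import Any

# Split names as given in this dataset card.
SPLITS = ("sft", "dpo_glm4_9b", "dpo_llama3.1_8b")

# Assumption: a Hugging Face mirror of the dataset exists under this repo id.
# The card itself only links the ModelScope copy (ZhipuAI/LongReward-10k).
ASSUMED_HF_REPO = "THUDM/LongReward-10k"


def dpo_split_for(sft_model: str) -> str:
    """Return the preference split whose responses were sampled from this SFT model."""
    name = sft_model.lower()
    if "glm4-9b" in name:
        return "dpo_glm4_9b"
    if "llama3.1-8b" in name:
        return "dpo_llama3.1_8b"
    raise ValueError(f"no preference split for model {sft_model!r}")


def load_longreward_split(split: str, repo_id: str = ASSUMED_HF_REPO) -> Any:
    """Load one split with the Hugging Face `datasets` library (requires network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)
```

With `datasets` installed, `load_longreward_split("sft")` would fetch the SFT data, and `load_longreward_split(dpo_split_for("LongReward-glm4-9b-SFT"))` would fetch the matching preference data, provided the assumed repo id is correct.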
## All Released Models
Here is the full list of models we released:
| Model | HF Repo | Training Dataset |
|---|---|---|
| LongReward-glm4-9b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) | `sft` split |
| LongReward-glm4-9b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) | `dpo_glm4_9b` split |
| LongReward-llama3.1-8b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT) | `sft` split |
| LongReward-llama3.1-8b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO) | `dpo_llama3.1_8b` split |
## Citation
If you find our work useful, please consider citing LongReward:
```
@article{zhang2024longreward,
title={LongReward: Improving Long-context Large Language Models with AI Feedback},
author={Jiajie Zhang and Zhongni Hou and Xin Lv and Shulin Cao and Zhenyu Hou and Yilin Niu and Lei Hou and Yuxiao Dong and Ling Feng and Juanzi Li},
journal={arXiv preprint arXiv:2410.21252},
year={2024}
}
```
Provider: maas
Created: 2025-07-30



