
LongReward-10k

ModelScope community · Updated 2026-01-06 · Indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/ZhipuAI/LongReward-10k
# LongReward-10k

<p align="center">
💻 <a href="https://github.com/THUDM/LongReward" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2410.21252" target="_blank">[LongReward Paper]</a>
</p>

The **LongReward-10k** dataset contains 10,000 long-context QA instances in English and Chinese, with contexts of up to 64,000 words.

The `sft` split contains SFT data generated by [GLM-4-0520](https://bigmodel.cn/dev/api/normal-model/glm-4), following the self-instruct method in [LongAlign](https://github.com/THUDM/LongAlign). Using this split, we fine-tuned two models with supervision: [LongReward-glm4-9b-SFT](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) and [LongReward-llama3.1-8b-SFT](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT), which are based on [GLM-4-9B](https://huggingface.co/THUDM/glm-4-9b) and [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), respectively.

The `dpo_glm4_9b` and `dpo_llama3.1_8b` splits are long-context preference datasets, where the winning and losing responses are sampled from the corresponding SFT model above and ranked by our proposed [LongReward](https://github.com/THUDM/LongReward) method. Using these preference datasets, we trained two DPO models, starting from the SFT checkpoints: [LongReward-glm4-9b-DPO](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) and [LongReward-llama3.1-8b-DPO](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO). More details can be found in our paper.

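To make the preference-data description concrete, here is a minimal sketch of turning several sampled responses into one winning/losing pair by ranking them with a scoring function. It rests on assumptions that are not taken from the dataset card: the pair is formed from the highest- and lowest-scored samples, the field names `chosen` and `rejected` are illustrative rather than the dataset's actual schema, and `toy_score` is a hypothetical stand-in for the LongReward scorer, which in practice relies on an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One preference example; field names are illustrative assumptions."""
    prompt: str
    chosen: str    # highest-scored ("winning") response
    rejected: str  # lowest-scored ("losing") response


def build_preference_pair(
    prompt: str,
    responses: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> PreferencePair:
    """Rank sampled responses by score and keep the two extremes."""
    if len(responses) < 2:
        raise ValueError("need at least two sampled responses to form a pair")
    # Sort ascending by score: first item is the worst, last is the best.
    ranked = sorted(responses, key=lambda r: score_fn(prompt, r))
    return PreferencePair(prompt=prompt, chosen=ranked[-1], rejected=ranked[0])


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical placeholder scorer: counts prompt words reused in the response.

    This is NOT the LongReward method; it only makes the sketch runnable.
    """
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


pair = build_preference_pair(
    "capital of France",
    ["I do not know.", "The capital of France is Paris."],
    toy_score,
)
print(pair.chosen)
print(pair.rejected)
```

Passing the scorer in as `score_fn` keeps the ranking step independent of how the reward is computed, so a real reward model could replace the placeholder without changing the pairing logic.
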
## All Released Models

Here is the full list of models we released:

| Model | HF Repo | Training Dataset |
|---|---|---|
| LongReward-glm4-9b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-glm4-9b-SFT) | `sft` split |
| LongReward-glm4-9b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-glm4-9b-DPO) | `dpo_glm4_9b` split |
| LongReward-llama3.1-8b-SFT | [🤗 HF Repo](https://huggingface.co/NeoZ123/LongReward-llama3.1-8b-SFT) | `sft` split |
| LongReward-llama3.1-8b-DPO | [🤗 HF Repo](https://huggingface.co/THUDM/LongReward-llama3.1-8b-DPO) | `dpo_llama3.1_8b` split |

## Citation

If you find our work useful, please consider citing LongReward:

```
@article{zhang2024longreward,
  title = {LongReward: Improving Long-context Large Language Models with AI Feedback},
  author={Jiajie Zhang and Zhongni Hou and Xin Lv and Shulin Cao and Zhenyu Hou and Yilin Niu and Lei Hou and Yuxiao Dong and Ling Feng and Juanzi Li},
  journal={arXiv preprint arXiv:2410.21252},
  year={2024}
}
```

Provider: maas
Created: 2025-07-30