
ChatQA2-Long-SFT-data

Source: ModelScope (魔搭社区) · Updated: 2026-01-02 · Indexed: 2025-01-25
Download link:
https://modelscope.cn/datasets/nv-community/ChatQA2-Long-SFT-data
Resource description:
## Data Description

Here, we release the full long SFT training dataset of [ChatQA2](https://arxiv.org/abs/2407.14482). It consists of two parts: **long_sft** and **NarrativeQA_131072**.

- The long_sft dataset is derived from existing datasets: [LongAlpaca12k](https://github.com/dvlab-research/LongLoRA), GPT-4 samples from [Open Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca), and [Long Data Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections).
- The NarrativeQA_131072 dataset is synthetically generated from NarrativeQA by adding related paragraphs to the given ground-truth summary.

For the first two training stages of ChatQA-2, we follow [ChatQA1.5](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data). For the continued-pretraining dataset, we follow [Long-Context Data Engineering](https://github.com/FranxYao/Long-Context-Data-Engineering) to generate 10B tokens.

**For more information about ChatQA-2, check the [website](https://chatqa2-project.github.io/)!**

## Other Resources

[Llama3-ChatQA-2-8B](https://huggingface.co/nvidia/Llama3-ChatQA-2-8B) &ensp; [Llama3-ChatQA-1.5-70B](https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B) &ensp; [Evaluation Data](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B/tree/main/data) &ensp; [Website](https://chatqa2-project.github.io/) &ensp; [Paper](https://arxiv.org/abs/2407.14482)

## Training Details

Training follows a three-stage instruction-tuning process. The first two stages follow ChatQA-1.5: stage 1 uses the SFT data, and stage 2 uses a blend of the SFT data and other datasets. The dataset blending ratio for stage 2 is as follows:

- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2

Stage 3 adds the full long SFT dataset to the blend. The dataset blending ratio for stage 3 is as follows:

- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2
- long_sft: 2.5
- NarrativeQA_131072: 5.0

## License

The dataset is released for non-commercial use only, subject to the [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI.

## Correspondence to

Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)

## Citation

<pre>
@article{xu2024chatqa,
  title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
  author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2407.14482},
  year={2024}
}
</pre>
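The stage-3 blending ratios under Training Details do not sum to 1, so they read most naturally as relative weights. The card does not say how the training code consumes them; the sketch below assumes they are per-dataset sampling weights that get normalized into probabilities. The names `STAGE3_WEIGHTS`, `normalize`, and `sample_sources` are hypothetical and not part of any ChatQA release.

```python
import random

# Stage-3 blending ratios copied from the Training Details list.
# Assumption: these are relative sampling weights, not fractions of a whole.
STAGE3_WEIGHTS = {
    "drop": 0.069,
    "narrativeqa": 0.095,
    "quoref": 0.026,
    "ropes": 0.026,
    "squad1.1": 0.095,
    "squad2.0": 0.095,
    "newsqa": 0.095,
    "tatqa-arithmetic": 0.15,
    "tatqa-others": 0.08,
    "synthetic_convqa": 0.3,
    "sft": 0.2,
    "long_sft": 2.5,
    "NarrativeQA_131072": 5.0,
}


def normalize(weights):
    """Scale relative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, k, seed=0):
    """Draw k dataset names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


probs = normalize(STAGE3_WEIGHTS)

if __name__ == "__main__":
    # Print the implied sampling probability of each source, largest first.
    for name, p in sorted(probs.items(), key=lambda item: -item[1]):
        print(f"{name:20s} {p:.4f}")
```

Under this reading, the two long-context sets dominate stage 3: NarrativeQA_131072 would account for about 57% of draws and long_sft for about 29%, with the eleven stage-2 sources sharing the rest.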

Provider: maas
Created: 2025-01-20