ChatQA2-Long-SFT-data

Name: ChatQA2-Long-SFT-data
Creator: maas
Published: 2026-01-02 16:21:02
License: No description provided

ModelScope community. Updated: 2026-01-02. Indexed: 2025-01-25.

Download link:

https://modelscope.cn/datasets/nv-community/ChatQA2-Long-SFT-data


Resource description:

## Data Description

Here, we release the full long SFT training dataset of [ChatQA2](https://arxiv.org/abs/2407.14482). It consists of two parts: **long_sft** and **NarrativeQA_131072**.

The long_sft dataset is derived from existing datasets: [LongAlpaca12k](https://github.com/dvlab-research/LongLoRA), GPT-4 samples from [Open Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca), and [Long Data Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections).

The NarrativeQA_131072 dataset is synthetically generated from NarrativeQA by adding related paragraphs to the given ground truth summary.

For the first two training stages of ChatQA-2, we follow [ChatQA1.5](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data). For the continued-pretraining dataset, we follow [Long-Context Data Engineering](https://github.com/FranxYao/Long-Context-Data-Engineering) to generate 10B tokens.

**For more information about ChatQA-2, check the [website](https://chatqa2-project.github.io/)!**

## Other Resources

[Llama3-ChatQA-2-8B](https://huggingface.co/nvidia/Llama3-ChatQA-2-8B) &ensp; [Llama3-ChatQA-1.5-70B](https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B) &ensp; [Evaluation Data](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B/tree/main/data) &ensp; [Website](https://chatqa2-project.github.io/) &ensp; [Paper](https://arxiv.org/abs/2407.14482)

## Training Details

Training follows a three-stage instruction tuning process. For the first two stages, we follow ChatQA-1.5: stage-1 uses the SFT data, and stage-2 uses a blend of the SFT data and other datasets. The dataset blending ratio for stage-2 is as follows:

- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2

Stage-3 adds the full long SFT dataset to the blend. The new dataset blending ratio for stage-3 is as follows:

- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2
- long_sft: 2.5
- NarrativeQA_131072: 5.0

## License

The dataset is released for non-commercial use only, subject to the [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI.

## Correspondence to

Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)

## Citation

<pre>
@article{xu2024chatqa,
  title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
  author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2407.14482},
  year={2024}
}
</pre>
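
The stage-3 blending ratios under Training Details do not sum to 1 (they total 8.731), so they read most naturally as relative weights. The card does not say how the training code consumes them. The sketch below assumes they are relative per-sample sampling weights and normalizes them into probabilities. The names `STAGE3_WEIGHTS`, `normalize`, and `sample_sources` are hypothetical and not part of the ChatQA 2 codebase.

```python
import random

# Stage-3 blending ratios copied from the dataset card.
# ASSUMPTION: they are relative sampling weights, not fractions of a whole.
STAGE3_WEIGHTS = {
    "drop": 0.069,
    "narrativeqa": 0.095,
    "quoref": 0.026,
    "ropes": 0.026,
    "squad1.1": 0.095,
    "squad2.0": 0.095,
    "newsqa": 0.095,
    "tatqa-arithmetic": 0.15,
    "tatqa-others": 0.08,
    "synthetic_convqa": 0.3,
    "sft": 0.2,
    "long_sft": 2.5,
    "NarrativeQA_131072": 5.0,
}


def normalize(weights):
    """Turn relative weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, k, seed=0):
    """Draw k dataset names in proportion to their weights (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


probs = normalize(STAGE3_WEIGHTS)
long_share = probs["long_sft"] + probs["NarrativeQA_131072"]
batch_sources = sample_sources(STAGE3_WEIGHTS, k=8)
```

Under this reading, the two long-context parts together account for roughly 86% of stage-3 samples (7.5 of 8.731), with the stage-2 datasets sharing the rest.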


Provider: maas

Created: 2025-01-20
