ChatQA2-Long-SFT-data
Source: ModelScope community. Updated 2026-01-02; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/nv-community/ChatQA2-Long-SFT-data
## Data Description
Here, we release the full long SFT training dataset of [ChatQA2](https://arxiv.org/abs/2407.14482). It consists of two parts: **long_sft** and **NarrativeQA_131072**. The long_sft dataset is derived from existing datasets: [LongAlpaca12k](https://github.com/dvlab-research/LongLoRA), GPT-4 samples from [Open Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca), and [Long Data Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections). The NarrativeQA_131072 dataset is synthetically generated from NarrativeQA by adding related paragraphs to the given ground-truth summary. For the first two training stages of ChatQA-2, we follow [ChatQA1.5](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data).
For the continued pretraining dataset, we follow [Long-Context Data Engineering](https://github.com/FranxYao/Long-Context-Data-Engineering) to generate 10B tokens. **For more information about ChatQA-2, check the [website](https://chatqa2-project.github.io/)!**
## Other Resources
[Llama3-ChatQA-2-8B](https://huggingface.co/nvidia/Llama3-ChatQA-2-8B)   [Llama3-ChatQA-1.5-70B](https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B)   [Evaluation Data](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B/tree/main/data)   [Website](https://chatqa2-project.github.io/)   [Paper](https://arxiv.org/abs/2407.14482)
## Training Details
The training follows a three-stage instruction tuning process. For the first two stages, we follow ChatQA-1.5: stage-1 uses the SFT data, and stage-2 uses a blend of the SFT data and other datasets. The dataset blending ratio for stage-2 is as follows:
- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2
Stage-3 adds the full long SFT dataset to the blend. The new dataset blending ratio for stage-3 is as follows:
- drop: 0.069
- narrativeqa: 0.095
- quoref: 0.026
- ropes: 0.026
- squad1.1: 0.095
- squad2.0: 0.095
- newsqa: 0.095
- tatqa-arithmetic: 0.15
- tatqa-others: 0.08
- synthetic_convqa: 0.3
- sft: 0.2
- long_sft: 2.5
- NarrativeQA_131072: 5.0
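The blending ratios above do not sum to 1, so one plausible reading is that they are relative sampling weights. The following is a minimal sketch under that assumption; it is not taken from the ChatQA-2 training code, and the helper name `normalize` is hypothetical. It converts the listed weights into sampling probabilities and reports how much of the stage-3 blend the two long-context datasets would take.

```python
# Blend weights copied from the two lists above. Treating them as relative
# sampling weights is an assumption of this sketch, not a documented fact.
STAGE2_BLEND = {
    "drop": 0.069,
    "narrativeqa": 0.095,
    "quoref": 0.026,
    "ropes": 0.026,
    "squad1.1": 0.095,
    "squad2.0": 0.095,
    "newsqa": 0.095,
    "tatqa-arithmetic": 0.15,
    "tatqa-others": 0.08,
    "synthetic_convqa": 0.3,
    "sft": 0.2,
}
# Stage-3 keeps the stage-2 weights and adds the two long-context datasets.
STAGE3_BLEND = {**STAGE2_BLEND, "long_sft": 2.5, "NarrativeQA_131072": 5.0}


def normalize(blend):
    """Divide each weight by the total so the values sum to 1 (hypothetical helper)."""
    total = sum(blend.values())
    return {name: weight / total for name, weight in blend.items()}


stage2_probs = normalize(STAGE2_BLEND)
stage3_probs = normalize(STAGE3_BLEND)

# Combined share of the two long-context datasets in the stage-3 blend.
long_share = stage3_probs["long_sft"] + stage3_probs["NarrativeQA_131072"]

print(f"stage-2 weight total: {sum(STAGE2_BLEND.values()):.3f}")
print(f"stage-3 weight total: {sum(STAGE3_BLEND.values()):.3f}")
print(f"long-context share in stage-3: {long_share:.1%}")
```

Under this reading, the raw weights total 1.231 for stage-2 and 8.731 for stage-3, and the two long-context datasets would account for roughly 86% of stage-3 samples. If the ratios are applied differently in the actual training pipeline (for example, per-epoch repetition counts), these proportions would not hold.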
## License
The dataset is released for non-commercial use only, subject to the [Terms of Use](https://openai.com/policies/terms-of-use) for data generated by OpenAI.
## Correspondence to
Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)
## Citation
<pre>
@article{xu2024chatqa,
title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2407.14482},
year={2024}
}
</pre>
Provider:
maas
Created:
2025-01-20