Synth-Long-SFT32K

Name: Synth-Long-SFT32K
Creator: maas
Published: 2025-12-05 16:55:16
License: No description provided

ModelScope community; updated 2025-12-05; indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/cerebras/Synth-Long-SFT32K

Resource description:

# Dataset Information

This repository contains augmented versions of several datasets:

- [Synthetic-ConvQA](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data/tree/main/synthetic_convqa)
- [NarrativeQA](https://huggingface.co/datasets/deepmind/narrativeqa)
- [RAG-TGE](https://huggingface.co/datasets/tdolega/rag-tge_finetuning-dataset)

For more information, refer to our [blogpost](https://cerebras.ai/blog/extending-llm-context-with-99-less-training-tokens).

We used these datasets for long instruction-following training. The maximum sequence length of the examples is 32,768.

1. **Synthetic-ConvQA with RAFT-style augmentation.** Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each example in the dataset, we convert `(passage, question, answer)` into `(true_passage, distractor_passage_0, …, distractor_passage_k, question, answer)`. The distractors are the passages with the highest similarity to the true passage, as measured by their embeddings. We shuffle the true passage into a random position in the context, so the model has to work hard to distinguish between the similar passages and select the right information.

2. **Synthetic-ConvQA with RAFT-style augmentation + syntactic questions.** We took our augmented Synthetic-ConvQA dataset and created five synthetic question/answer pairs for each example: (1) Does the word `X` occur in the passage? (2) How often does the word `X` occur in the passage? (3) Does the phrase `X` occur in the passage? (4) How often does the phrase `X` occur in the passage? and (5) Where does the word `X` occur in the passage? Phrases in this context are 4-grams. To create the questions, we randomly select words and phrases that make up less than 10% of the total words or phrases. For the positional information in the fifth question, we bin the answer by the third of the passage in which the word appears.

3. **Augmented NarrativeQA.** In the variation for NarrativeQA, we create two clustering assignments, one based on the questions and one based on the passages. For each example in the dataset, we add other examples from the same passage cluster, and also examples from the same question cluster. When we add the examples, we add both the passages and the question/answer pairs. The original RAFT methodology uses only one question/answer pair with all the passages, so the additional question/answer pairs in this variation allow more training signal to come from one example.

4. **RAG-TGE with RAFT-style augmentation.** The same augmentation strategy as we used for the Synthetic-ConvQA dataset, applied to the RAG-TGE dataset.

5. **RAG-TGE with RAFT-style augmentation (Chinese translation).** We also translated the RAFT-augmented RAG-TGE dataset to Chinese. To do this, we prompted Llama3.1-70B-Instruct to translate the data to Chinese.

# License

The "RAG-TGE with RAFT-style augmentation (Chinese translation)" dataset is for non-commercial use only, subject to the [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blob/main/LICENSE), including without limitation Section 1(b) with regard to the use of outputs. The rest of the datasets are built on and derived from existing datasets. Please refer to the original licenses accompanying each dataset.

# Acknowledgement

```
@article{zhang2024raft,
  title={Raft: Adapting language model to domain specific rag},
  author={Zhang, Tianjun and Patil, Shishir G and Jain, Naman and Shen, Sheng and Zaharia, Matei and Stoica, Ion and Gonzalez, Joseph E},
  journal={arXiv preprint arXiv:2403.10131},
  year={2024}
}

@article{liu2025chatqa,
  title={Chatqa: Surpassing gpt-4 on conversational qa and rag},
  author={Liu, Zihan and Ping, Wei and Roy, Rajarshi and Xu, Peng and Lee, Chankyu and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={15416--15459},
  year={2025}
}
```
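Returning to the RAFT-style augmentation described under Dataset Information, the following is a minimal sketch of how a `(passage, question, answer)` triple could be expanded with its most similar distractor passages. It is not the code used to build this dataset. The function names, the dictionary keys, the parameter `k`, the use of cosine similarity, and the assumption that passage embeddings are already computed are all illustrative choices; the description above only states that distractors are chosen by embedding similarity and that the true passage is placed at a random position.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def raft_augment(examples, embeddings, k, rng):
    """Build one long-context example per (passage, question, answer) triple.

    examples:   list of dicts with "passage", "question", "answer" keys (hypothetical schema)
    embeddings: list of vectors; embeddings[i] belongs to examples[i]["passage"]
    k:          number of distractor passages to add (hypothetical parameter)
    rng:        random.Random instance used to place the true passage
    """
    augmented = []
    for i, ex in enumerate(examples):
        # Rank every other passage by similarity to the true passage.
        others = [j for j in range(len(examples)) if j != i]
        others.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
        distractors = [examples[j]["passage"] for j in others[:k]]

        # Place the true passage at a random position among the distractors.
        position = rng.randint(0, len(distractors))
        context = distractors[:position] + [ex["passage"]] + distractors[position:]

        augmented.append({
            "context": context,
            "true_index": position,
            "question": ex["question"],
            "answer": ex["answer"],
        })
    return augmented
```

The all-pairs ranking here is quadratic in the number of passages, which is fine for illustration; a corpus of realistic size would more plausibly use an approximate nearest-neighbor index.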

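The five syntactic question types described for the second dataset can be sketched as follows. This is an illustration under stated assumptions, not the authors' generation code: whitespace tokenization, reading the 10% rule as "the selected word or phrase accounts for less than 10% of all words or phrases", taking the position from the first occurrence, and the answer wording are all assumptions, and every function name is hypothetical. The sketch only samples items that occur in the passage, because the description does not say how absent words or phrases were chosen.

```python
import random
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token phrases, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pick_rare(items, rng, max_share=0.10):
    """Randomly pick an item whose occurrences are under max_share of all items."""
    counts = Counter(items)
    rare = sorted(x for x, c in counts.items() if c / len(items) < max_share)
    if not rare:
        raise ValueError("no item falls below the share threshold")
    return rng.choice(rare), counts


def third_of(index, length):
    """Bin a token index into the first, second, or final third of the passage."""
    return ("first third", "second third", "final third")[min(3 * index // length, 2)]


def syntactic_questions(passage, rng):
    """Return five (question, answer) pairs for one passage."""
    tokens = passage.split()  # whitespace tokenization is an assumption
    word, word_counts = pick_rare(tokens, rng)
    phrase, phrase_counts = pick_rare(ngrams(tokens, 4), rng)  # phrases are 4-grams
    # Position is taken from the first occurrence of the word (an assumption).
    position = third_of(tokens.index(word), len(tokens))
    return [
        (f"Does the word `{word}` occur in the passage?", "Yes"),
        (f"How often does the word `{word}` occur in the passage?", str(word_counts[word])),
        (f"Does the phrase `{phrase}` occur in the passage?", "Yes"),
        (f"How often does the phrase `{phrase}` occur in the passage?", str(phrase_counts[phrase])),
        (f"Where does the word `{word}` occur in the passage?", position),
    ]
```

Calling `syntactic_questions(passage, random.Random(0))` on a passage of a few dozen words returns one pair per question type; very short passages raise `ValueError` because no word can fall under the 10% share.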

Provider: maas

Created: 2025-10-23
