QwQ-LongCoT-130K
Source: ModelScope community (updated 2026-01-06, indexed 2024-12-14)
Download link: https://modelscope.cn/datasets/AI-ModelScope/QwQ-LongCoT-130K
Description:
<span style="color:red">Also have a look at the second version here:</span> [QwQ-LongCoT-2](https://huggingface.co/datasets/amphora/QwQ-LongCoT-130K-2)
<div style="text-align: left;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60d3e619b8448e1785bbda2a/ThfNc45SlzfGHOvxSOefF.png" width="200px" height="150px" title="kmmlu" alt="kmmlu" style="display: block; margin-left: 0;" />
<p><em>Figure 1: Just a cute picture generated with [Flux](https://huggingface.co/Shakker-Labs/FLUX.1-dev-LoRA-Logo-Design)</em></p>
</div>
Today, I’m excited to release **QwQ-LongCoT-130K**, an SFT dataset designed for training O1-like large language models (LLMs). This dataset includes about 130k instances, each with a response generated using **[QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)**. The dataset is available under the **Apache 2.0 license**, so feel free to use it as you like.
### Dataset Construction
The challenging part of creating **QwQ-LongCoT-130K** was curating seed instructions that are truly worth longer chain-of-thought reasoning. Simply put, I didn’t want to generate lengthy responses, spanning thousands of tokens, for simple prompts like *“What color is the sky?”* At the same time, I wanted the instructions to be free of licensing issues. Accordingly, I collected seed instructions using the following two methods.
First, I sourced data from the **[NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)** dataset, which contains a collection of 860K math questions and their corresponding answers. This dataset is licensed under Apache 2.0. To add diversity and include categories beyond math, I used the **Magpie** approach to extract questions from the QwQ-32B-Preview model. A common way to run Magpie is to input a blank space, sometimes with a user token, and expect the model to generate a user query. With QwQ-32B-Preview, however, I observed that this method often leads the model to refuse to respond, frequently replying with something like: *“I’d be able to assist better if you provided more details.”* This approach also gives little or no control over the instruction that is generated. So, in my experiments I used the following template:
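```python
import random
# Names follow the original card: `adjective` holds the subject area and
# `subject` holds the difficulty word. Both lists are truncated in the original.
adjective = random.choice(["Physics", "Chemistry", "Biology"])  # ...
subject = random.choice(["difficult", "tough", "long", "challenging", "tricky"])  # ...
# Prompt template used to elicit a question from QwQ-32B-Preview.
template = f"This is one {adjective} {subject} question. I'll first repeat the question word-by-word before I start to solve."
```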
After collecting the seed instructions, I used QwQ-32B-Preview to generate one response for each instruction. Once the generation was complete, I applied simple rule-based filtering to remove responses containing phrases like *"Sorry"* or *"As an AI model."* I also filtered out instances with excessive repetition of sentences and attempted to exclude those containing Chinese characters—though some may still remain. In any case, there is still room for further refinement.
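The filtering rules described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the dataset: the function names, the repetition threshold (`max_repeats=3`), the sentence-splitting regex, and the use of the CJK Unified Ideographs range to detect Chinese characters are all assumptions. Only the two example phrases come from the text.

```python
import re
from collections import Counter

# Phrases named in the text; the real list may have been longer.
REFUSAL_PHRASES = ("Sorry", "As an AI model")
# Assumption: "Chinese characters" is approximated by the CJK Unified Ideographs block.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def has_refusal(response: str) -> bool:
    """True if the response contains any refusal-style phrase."""
    return any(phrase in response for phrase in REFUSAL_PHRASES)


def has_excessive_repetition(response: str, max_repeats: int = 3) -> bool:
    """True if any single sentence occurs more than `max_repeats` times (hypothetical threshold)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    counts = Counter(sentences)
    return any(n > max_repeats for n in counts.values())


def keep_response(response: str) -> bool:
    """Apply all three rules; keep the response only if none of them fires."""
    return not (
        has_refusal(response)
        or has_excessive_repetition(response)
        or CJK_PATTERN.search(response)
    )
```

Substring rules like these are deliberately blunt, which fits the note above that some unwanted samples may still remain.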
### Dataset Analysis
The dataset consists of 90k samples from NuminaMath and about 43k generated via Magpie. In my first effort with Magpie, I accidentally forgot to log the subjects used to generate each instruction, but in the figure below you can see the distributions of the ones I didn't forget (oops). I'm planning to add more Magpie data if I find some more computing resources.
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60d3e619b8448e1785bbda2a/rTOd3gfqaN3rYbMQ0wmcm.png" width="600px" height="450px" title="kmmlu" alt="kmmlu" style="display: block; margin: auto;" />
<p><em>Figure 2: Dataset distribution</em></p>
</div>
Below is a comparison of the length distribution of instances in the QwQ-LongCoT dataset, using the top_300k_longer_conversations subset from Magpie-Ultra as a baseline. For the readability of the plot, I excluded some outliers exceeding 20k characters from the QwQ-LongCoT dataset, although the longest sample had over 170k characters. From the plot, it is evident that QwQ-LongCoT generally contains longer instances.
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60d3e619b8448e1785bbda2a/h0pIZf4Uo04I0SFTiMG4X.png" width="600px" height="450px" title="kmmlu" alt="kmmlu" style="display: block; margin: auto;" />
<p><em>Figure 3: Length comparison</em></p>
</div>
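A length comparison of this kind can be computed along the following lines. This is a sketch only: it assumes length is measured in characters of the response text (the text reports character counts), and the helper name `length_stats` and the toy inputs are hypothetical. The 20k cutoff mirrors the outlier threshold mentioned above.

```python
from statistics import mean, median

OUTLIER_CUTOFF = 20_000  # characters; samples above this were left out of the plot


def length_stats(responses: list[str], cutoff: int = OUTLIER_CUTOFF) -> dict:
    """Character-length summary; samples longer than `cutoff` are excluded from the plotted range."""
    lengths = [len(r) for r in responses]
    plotted = [n for n in lengths if n <= cutoff]
    return {
        "n_total": len(lengths),
        "n_plotted": len(plotted),
        "max_length": max(lengths),
        "mean_plotted": mean(plotted),
        "median_plotted": median(plotted),
    }


# Toy data standing in for real dataset rows.
stats = length_stats(["a" * 100, "b" * 300, "c" * 25_000])
print(stats)
```

Reporting the maximum over all samples while plotting only the truncated range keeps the outliers visible in the summary without distorting the histogram.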
### Lessons learned from training with **QwQ-LongCoT-130K**
Well, I initially tried training with the dataset in a simple SFT setting, only to find that it does not work well. My random guess is that the thinking traces in QwQ-LongCoT include intentionally generating wrong statements and then fixing them. This "intentionally generating wrong stuff" seems to be bad for the model, though I don’t have any evidence for this. I probably need a different approach, either masking out the wrong traces during SFT or using RL. The model is still too embarrassing to share, and I'm still trying some more training runs, so I hope to get a decent, shareable model soon.
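One way to picture "masking out the wrong traces" is to exclude flagged token spans from the SFT loss. The sketch below is purely illustrative and is not something the source reports having tried: how wrong spans would be detected is left open, `mask_spans` and the span format are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_spans(labels: list[int], wrong_spans: list[tuple[int, int]]) -> list[int]:
    """Return a copy of `labels` with every [start, end) span replaced by IGNORE_INDEX.

    `wrong_spans` is a hypothetical list of token-index ranges judged to be
    erroneous reasoning; producing it is outside the scope of this sketch.
    """
    masked = list(labels)
    for start, end in wrong_spans:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = IGNORE_INDEX
    return masked


# Toy example: tokens 2..4 belong to a step that is later retracted.
labels = [11, 12, 13, 14, 15, 16]
masked = mask_spans(labels, [(2, 5)])
```

With labels masked this way, the model would still see the mistaken steps as context but would not be trained to reproduce them.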
### ETC
Big thanks to the Qwen Team and Project-Numina.
If you're interested in exploring the dataset further or collaborating with me, please feel free to reach out at: spthsrbwls123@yonsei.ac.kr.
Provider: maas
Created: 2024-12-09
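Summary:
QwQ-LongCoT-130K is an SFT dataset designed for training large language models. It contains about 130k instances that combine math problems with long chain-of-thought reasoning questions from other subjects. The responses were generated by QwQ-32B-Preview and passed through quality filtering.
(Summary collected and generated by 遇见数据集.)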