Bespoke-Stratos-17k
Source: ModelScope community. Updated 2026-05-23; indexed 2025-01-25.
Download link: https://modelscope.cn/datasets/bespokelabs/Bespoke-Stratos-17k
<p align="center">
<a href="https://bespokelabs.ai"><img src="Bespoke-Labs-Logo-on-Mint.png" width="550"></a>
</p>
## Bespoke-Stratos-17k
[We](https://bespokelabs.ai) replicated and improved the [Berkeley Sky-T1](https://novasky-ai.github.io/posts/sky-t1/) data pipeline using SFT distillation data
from [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) to create Bespoke-Stratos-17k -- a reasoning dataset of questions, reasoning traces, and answers.
This data was used to train:
1. [Bespoke-Stratos-32B](https://huggingface.co/bespokelabs/Bespoke-Stratos-32B), a 32B reasoning model fine-tuned from [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).
2. [Bespoke-Stratos-7B](https://huggingface.co/bespokelabs/Bespoke-Stratos-7B), a 7B reasoning model fine-tuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).
<a href="https://github.com/bespokelabsai/curator/">
<img src="https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k/resolve/main/made_with_curator.png" alt="Made with Curator" width=200px>
</a>
## Metrics for Bespoke-Stratos-32B
| Metric | Bespoke-Stratos-32B | Sky-T1-32B | o1-preview | DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B (Ours) | DeepSeek-R1-Distill-Qwen-32B (Reported) |
|---|---|---|---|---|---|---|
| AIME2024 | 63.3 | 43.3 | 40.0 | 79.8 | 66.7 | 72.6 |
| MATH500 | 93.0 | 82.4 | 81.4 | 97.3 | 89.8 | 94.3 |
| GPQA-Diamond | 58.1 | 56.8 | 75.2 | 71.5 | 61.1 | 62.1 |
| LCB v2 Easy | 96.7 | 86.3 | 92.9 | - | 91.2 | - |
| LCB v2 Medium | 75.2 | 56.8 | 54.9 | - | 75.7 | - |
| LCB v2 Hard | 26.2 | 17.9 | 16.3 | - | 38.2 | - |
| LCB v2 All | 71.1 | 57.9 | 59.1 | - | 72.2 | - |
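
To make the comparison with Sky-T1-32B concrete, the short sketch below computes per-benchmark differences in percentage points. The scores are copied from the table above; the variable names are illustrative only.

```python
# Scores copied from the table above: (Bespoke-Stratos-32B, Sky-T1-32B).
scores = {
    "AIME2024": (63.3, 43.3),
    "MATH500": (93.0, 82.4),
    "GPQA-Diamond": (58.1, 56.8),
    "LCB v2 All": (71.1, 57.9),
}

# Difference in percentage points, rounded to hide floating-point noise.
deltas = {name: round(ours - sky, 1) for name, (ours, sky) in scores.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
```
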
## Metrics for Bespoke-Stratos-7B
|Metric|Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B (Ours)|DeepSeek-R1-Distill-Qwen-7B (Reported)|
|---|---|---|---|---|
|AIME2024|20.0|10.0|43.3|55.5|
|MATH500|82.0|74.2|89.4|92.8|
|GPQA-Diamond|37.8|33.3|44.9|49.1|
|LiveCodeBench v2 Easy|71.4|65.9|81.3|-|
|LiveCodeBench v2 Medium|25.5|18.9|42.2|-|
|LiveCodeBench v2 Hard|1.6|3.3|2.4|-|
|LiveCodeBench v2 All|36.1|31.9|46.6|-|
## Details
The code for curating the data is [here](https://github.com/bespokelabsai/curator/tree/main/examples/bespoke-stratos-data-generation).
Please also refer to [Sky-T1’s codebase](https://github.com/NovaSky-AI/SkyThought) for the training and evaluation code.
Like [Sky-T1_data_17k](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k), this dataset contains 5k coding problems from APPS and TACO, 10k math problems from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, and 1k science and puzzle problems from STILL-2. Note that the exact problems included may differ from those in Sky-T1_data_17k because of the rejection sampling process.
We used Bespoke Curator to create the synthetic reasoning dataset. We ported the Sky-T1 data pipeline into Curator, which helped us generate the reasoning dataset within 1.5 hours with DeepSeek-R1 at a cost of $800 without hiccups.
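
As a rough back-of-the-envelope reading of these figures, the sketch below derives a cost per example and a throughput. It assumes about 17,000 retained examples, a nominal size taken from the dataset name rather than an exact count. It also ignores traces that were generated and later rejected, so the cost per generated trace would be lower and the generation throughput higher than shown.

```python
total_cost_usd = 800.0      # from the text
wall_clock_hours = 1.5      # from the text ("within 1.5 hours", so an upper bound)
retained_examples = 17_000  # assumption: nominal size implied by the dataset name

cost_per_example = total_cost_usd / retained_examples
examples_per_hour = retained_examples / wall_clock_hours

print(f"~${cost_per_example:.3f} per retained example")
print(f"~{examples_per_hour:,.0f} retained examples per hour")
```
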
Rejection sampling filters out reasoning traces with incorrect solutions. This is challenging for code, where verification means executing the generated solutions; we speed it up using a Ray cluster. We are currently integrating a code execution verifier directly into Curator, so stay tuned.
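
A minimal sketch of rejection sampling for code follows. It is not the pipeline used for this dataset: it swaps the Ray cluster for Python's standard-library `ThreadPoolExecutor` so that it runs on a single machine, and the names `Candidate`, `passes_tests`, and `rejection_sample` are hypothetical, introduced here for illustration. Each candidate program runs in a subprocess with a timeout, and only candidates whose output matches the expected output on every test case are kept.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record: a reasoning trace plus the program it produced."""
    trace: str
    program: str


def passes_tests(program, tests, timeout_s=5.0):
    """Run the program once per (stdin, expected stdout) pair.

    Any mismatch, crash, or timeout rejects the candidate. A subprocess is
    not a sandbox; untrusted code needs stronger isolation than this.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


def rejection_sample(candidates, tests, max_workers=4):
    """Verify candidates concurrently and keep only those that pass every test."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda c: passes_tests(c.program, tests), candidates))
    return [c for c, ok in zip(candidates, verdicts) if ok]


if __name__ == "__main__":
    tests = [("2 3\n", "5"), ("10 -4\n", "6")]
    candidates = [
        Candidate("adds the two numbers", "a, b = map(int, input().split()); print(a + b)"),
        Candidate("multiplies by mistake", "a, b = map(int, input().split()); print(a * b)"),
    ]
    kept = rejection_sample(candidates, tests)
    print([c.trace for c in kept])
```

Threads are enough here because the work happens in child processes; a cluster scheduler such as Ray distributes the same per-candidate check across many machines.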
We followed the same recipe as Sky-T1, with the following differences:
- We used DeepSeek-R1 as the teacher reasoning model instead of QwQ.
- The Sky-T1 recipe used gpt-4o-mini to reformat QwQ’s traces, whereas we did not reformat DeepSeek-R1’s. We found that DeepSeek-R1’s reasoning traces were sufficiently well-formatted and coherent for parsing and finetuning even without an intermediate reformatting step.
- We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions. This reduced the number of false negatives, raising the share of correct solutions retained from 25% to 73%.
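
The false-negative problem in the last point can be illustrated with a toy sketch. It does not call gpt-4o-mini and does not reproduce Sky-T1's parsing logic: `strict_match` stands in for a rigid parser, `numeric_match` stands in for a more lenient judge, and both names, along with the sample data, are assumptions made for illustration. The point is only that a verifier which compares surface strings rejects correct answers written in a different form.

```python
from fractions import Fraction


def strict_match(predicted, reference):
    """Exact string comparison: rejects equivalent answers written differently."""
    return predicted.strip() == reference.strip()


def numeric_match(predicted, reference):
    """Stand-in for a more lenient judge: compares values, not strings."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Not parseable as a number: fall back to the strict comparison.
        return strict_match(predicted, reference)


def retention_rate(pairs, is_correct):
    """Fraction of (predicted, reference) pairs that the verifier accepts."""
    kept = sum(is_correct(p, r) for p, r in pairs)
    return kept / len(pairs)


# Toy data: four of the five predictions are mathematically correct.
pairs = [("0.5", "1/2"), ("3", "3"), ("2.50", "5/2"), ("7", "8"), ("12", "12")]

print(retention_rate(pairs, strict_match))   # 0.4
print(retention_rate(pairs, numeric_match))  # 0.8
```

The strict verifier discards two correct answers (`0.5` and `2.50`), which are exactly the false negatives a more flexible judge recovers.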
## Citation
```bibtex
@misc{bespoke_stratos,
author = {Bespoke Labs},
title = {Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation},
howpublished = {https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation},
note = {Accessed: 2025-01-22},
year = {2025}
}
```
## Acknowledgement
We are standing on the shoulders of giants. [Bespoke Labs](https://bespokelabs.ai) would like to thank [Berkeley Sky Computing Lab](https://sky.cs.berkeley.edu/) for their work on [Sky-T1](https://novasky-ai.github.io/posts/sky-t1/) and for releasing the [code](https://github.com/NovaSky-AI/SkyThought) and [data](https://github.com/NovaSky-AI/SkyThought), [DeepSeek](https://www.google.com/search?q=deepseek&oq=deepseek&gs_lcrp=EgZjaHJvbWUyDwgAEEUYORiDARixAxiABDIGCAEQRRg8Mg8IAhBFGDsYgwEYsQMYgAQyDQgDEAAYgwEYsQMYgAQyDQgEEAAYgwEYsQMYgAQyBggFEEUYPDIGCAYQRRg8MgYIBxBFGDzSAQg1MTE3ajBqN6gCALACAA&sourceid=chrome&ie=UTF-8) for releasing the [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) [model](https://huggingface.co/deepseek-ai/DeepSeek-R1), and the [Datacomp](https://datacomp.ai/) community for insightful discussions.
To stay in the loop, sign up for notifications at https://bespokelabs.ai/newsletter
Provider: maas
Created: 2025-01-25



