smoltalk
收藏魔搭社区2026-05-04 更新2024-11-23 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/smoltalk
下载链接
链接失效反馈官方服务:
资源简介:
# SmolTalk

## Dataset description
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build [SmolLM2-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) family of models and contains 1M samples. More details in our paper https://arxiv.org/abs/2502.02737
During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets that improve instruction following while covering diverse tasks including text editing, rewriting, summarization, and reasoning.
Through a series of data ablations at 1.7B scale, we enhanced our SFT mix by incorporating public datasets to strengthen specific capabilities such as mathematics, coding, system prompt following and long-context understanding.
All the new datasets were generated with [distilabel](https://github.com/argilla-io/distilabel) and you can find the generation code here https://github.com/huggingface/smollm/tree/main/text/data/smoltalk.
You can load a dataset using
```python
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
# to load the train split of a specific subset such as smol-magpie-ultra, you can do
ds = load_dataset("HuggingFaceTB/smoltalk", "smol-magpie-ultra", split="train")
```
## Dataset composition
The mix consists of:
**New datasets**
- *Smol-Magpie-Ultra*: the core component of our mix, consisting of 400K samples generated using the Magpie pipeline with /Llama-3.1-405B-Instruct. We also heavily curate and filter this dataset compared to the original Magpie-Pro pipeline. SmolLM models trained on this dataset alone outperform those trained on popular public datasets like OpenHermes and Magpie Pro across key benchmarks including IFEval and MT-Bench.
- Smol-contraints: a 36K-sample dataset that trains models to follow specific constraints, such as generating responses with a fixed number of sentences or words, or incorporating specified words in the output. The dataset has been decontaminated against IFEval to prevent overlap.
- Smol-rewrite: an 50k-sample collection focused on text rewriting tasks, such as adjusting tone to be more friendly or professional. Note that Smol-Magpie-Ultra also includes some rewriting, editing, and summarization examples.
- Smol-summarize: an 100k-sample dataset specialized in email and news summarization.
**Existing public datasets**
To enhance capabilities in mathematics, coding, system prompts, and long-context understanding, we fine-tuned SmolLM2-1.7B on various public SFT datasets and included subsets of the best performing ones using tuned ratios. These include:
- OpenHermes2.5: we added 100k samples from [OpenHermes2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5), since we found that it helps preserve and boost benchmarks such as MMLU and WinoGrande, and BBH.
- MetaMathQA: we add this [dataset](https://huggingface.co/datasets/meta-math/MetaMathQA?) to improve the model on mathematics and reasoning, we include 50k random samples.
- NuminaMath-CoT: we find that this [dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) helps on mathematics, especially hard problems found in benchmarks such as MATH.
- Self-Oss-Starcoder2-Instruct: we use this [dataset](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) to improve coding capabilities.
- SystemChats2.0: to make the model support a variety of system prompt formats we add 30k samples from the [SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0) dataset. Note that Smol-rewrite and and Smol-summarize datasets also include system prompts.
- LongAlign: we find that finetuning the model on only short samples makes it loose long context abilities beyond 2048 tokens, so we add english samples (with less than 16k tokens) from the [LongAlign-10k](https://huggingface.co/datasets/THUDM/LongAlign-10k) dataset and train with a 8192 sequence.
- Everyday-conversations: this [dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k) includes multi-turn everyday conversations such as greeting and was used in SmolLM v1 post-training.
- APIGen-Function-Calling: we use 80k samples from [apigen-function-calling](https://huggingface.co/datasets/argilla/apigen-function-calling) which is a mix of [Synth-APIGen-v0.1](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1) and [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) datasets.
- Explore-Instruct-Rewriting: 30k samples from this rewriting [dataset](https://huggingface.co/datasets/Wanfq/Explore_Instruct_Rewriting_32k).
You can find the code for generating the new datasets with [distilabel](https://github.com/argilla-io/distilabel) here: https://github.com/huggingface/smollm. The ablation details will be included in an upcoming blog post.
## License
All the new datasets (Smol-Magpie-Ultra, Smol-contraints, Smol-rewrite, Smol-summarize) are licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). For the existing public datasets, please refer to the original dataset for the license [Dataset composition](#dataset-composition)
## Evaluation
We compare SmolTalk to the recent [Orca AgentInstruct 1M](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1) dataset by finetuning SmolLM2 on both datasets using the same training setup (we train for 2 epochs, using a learning rate of 3e-04, a sequence length of 8192 and a global batch size of 16).

We also observe significant improvements at 7B scale when fine-tuning [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.3) on SmolTalk, notably on IFEval, BBH, GS8Mk and MATH.

## Smol-SmolTalk
For SmolLM2-135M-Instruct and SmolLM2-360M-Instruct, we use a subset of the dataset that is more suitable for these smaller models. For instance, we only include samples from Smol-Magpie-Ultra with more concise conversations and exclude advanced math datasets. You can find the dataset here: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk
The training code is available here https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2
## Citation
```bash
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
```
# SmolTalk

## 数据集说明
本数据集为面向大语言模型(LLM)监督微调(SFT)的合成数据集,被用于构建SmolLM2-Instruct系列模型,共包含100万条样本。更多细节可参阅我们的学术论文:https://arxiv.org/abs/2502.02737
在SmolLM2的研发过程中,我们发现基于公开监督微调数据集训练的模型,性能劣于使用专有指令数据集训练的同类模型。为填补这一性能缺口,我们构建了全新的合成数据集,该数据集可提升模型的指令遵循能力,并覆盖文本编辑、改写、摘要生成与推理等多样任务。
我们在17亿参数规模下开展了一系列数据消融实验,并通过引入公开数据集优化了监督微调数据集混合比例,以强化模型在数学、编码、系统提示遵循与长上下文理解等方面的特定能力。
所有全新数据集均通过[distilabel](https://github.com/argilla-io/distilabel)生成,数据集生成代码可在以下链接获取:https://github.com/huggingface/smollm/tree/main/text/data/smoltalk。
你可以通过以下代码加载本数据集:
python
from datasets import load_dataset
# 加载全量训练子集
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
# 若需加载特定子集(如smol-magpie-ultra)的训练拆分,可使用如下代码:
ds = load_dataset("HuggingFaceTB/smoltalk", "smol-magpie-ultra", split="train")
## 数据集构成
本数据集混合包含以下两类数据:
**全新自研数据集**
- *Smol-Magpie-Ultra*:为本数据集混合的核心组件,包含40万条通过Magpie流水线结合Llama-3.1-405B-Instruct生成的样本。相较于原始Magpie-Pro流水线,我们对该数据集进行了大规模的精选与过滤。仅基于该数据集训练的SmolLM模型,在IFEval、MT-Bench等关键基准测试中的表现,优于基于OpenHermes、Magpie Pro等主流公开数据集训练的同类模型。
- Smol-constraints:包含3.6万条样本的数据集,用于训练模型遵循特定约束的能力,例如生成固定句数/词数的回复,或在输出中嵌入指定词汇。本数据集已针对IFEval进行了去重处理,以避免数据重叠。
- Smol-rewrite:包含5万条样本的文本改写任务数据集,用于调整文本语气至更友好或专业的风格。需注意,Smol-Magpie-Ultra中也包含部分改写、编辑与摘要生成示例。
- Smol-summarize:包含10万条样本的专属数据集,专注于电子邮件与新闻的摘要生成任务。
**现有公开数据集**
为强化模型在数学、编码、系统提示遵循与长上下文理解等方面的能力,我们基于多款公开监督微调数据集对SmolLM2-1.7B进行了微调,并通过调优后的比例引入了其中表现最优的子集,具体包括:
- OpenHermes2.5:从[OpenHermes2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)引入10万条样本,经测试该数据集可有效保留并提升模型在MMLU、WinoGrande与BBH等基准测试中的性能。
- MetaMathQA:引入该[数据集](https://huggingface.co/datasets/meta-math/MetaMathQA?)以提升模型的数学与推理能力,本次引入5万条随机采样样本。
- NuminaMath-CoT:该[数据集](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)可有效提升模型的数学解题能力,尤其针对MATH等基准测试中的高难度题目。
- Self-Oss-Starcoder2-Instruct:使用该[数据集](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k)以强化模型的编码能力。
- SystemChats2.0:从[SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0)数据集引入3万条样本,以支持模型适配多种系统提示格式。需注意,Smol-rewrite与Smol-summarize数据集同样包含系统提示相关样本。
- LongAlign:我们发现仅基于短样本进行微调会导致模型丧失2048令牌以上的长上下文理解能力,因此从[LongAlign-10k](https://huggingface.co/datasets/THUDM/LongAlign-10k)数据集引入长度低于16k令牌的英文样本,并以8192的序列长度进行训练。
- Everyday-conversations:该[数据集](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k)包含多轮日常对话(如问候交互),曾被用于SmolLM v1的后训练阶段。
- APIGen-Function-Calling:从[apigen-function-calling](https://huggingface.co/datasets/argilla/apigen-function-calling)引入8万条样本,该数据集由[Synth-APIGen-v0.1](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1)与[xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)数据集混合而成。
- Explore-Instruct-Rewriting:从该改写[数据集](https://huggingface.co/datasets/Wanfq/Explore_Instruct_Rewriting_32k)引入3万条样本。
全新数据集的生成代码可通过[distilabel](https://github.com/argilla-io/distilabel)在以下链接获取:https://github.com/huggingface/smollm。本次数据消融实验的详细结果将在后续的博客文章中发布。
## 许可证
所有全新自研数据集(Smol-Magpie-Ultra、Smol-constraints、Smol-rewrite、Smol-summarize)均采用[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)许可证。现有公开数据集的许可证请参阅对应原始数据集的说明,详见[数据集构成](#dataset-composition)部分。
## 模型评估
我们将SmolTalk与近期发布的[Orca AgentInstruct 1M](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1)数据集进行了对比评估:在完全相同的训练配置下(训练2轮,学习率3e-4,序列长度8192,全局批次大小16),分别基于两个数据集对SmolLM2进行微调。

此外,我们发现基于SmolTalk微调[Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)模型(70亿参数规模)时,模型在IFEval、BBH、GS8Mk与MATH等基准测试中取得了显著性能提升。

## Smol-SmolTalk 子集
针对SmolLM2-135M-Instruct与SmolLM2-360M-Instruct两款小参数模型,我们使用了适配其规模的数据集子集:仅保留Smol-Magpie-Ultra中对话更简洁的样本,并剔除了高难度数学相关数据集。该子集可通过以下链接获取:https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk
对应的训练代码可在以下链接获取:https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2
## 引用格式
bash
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
提供机构:
maas
创建时间:
2024-11-22



