smol-smoltalk
收藏魔搭社区2026-01-08 更新2025-11-03 收录
下载链接:
https://modelscope.cn/datasets/HuggingFaceTB/smol-smoltalk
下载链接
链接失效反馈官方服务:
资源简介:
# Smol-SmalTalk
This is a subset of [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/) dataset adapted for smol models with less than 1B parameters. We used it to build [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and
[SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/). We do SFT on this dataset and then DPO on UltraFeedback.
Compared to SmolTalk:
- The conversations from Smol-Magpie-Ultra are shorter in this dataset
- We include less task specific data compared to SmolTalk (e.g no function calling and less rewriting and summarization examples) since these smaller models have limited capacity
- We don't include any advanced math datasets
```python
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smol-smoltalk", split="train")
```
## Citation
```bash
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
```
# Smol-SmalTalk
本数据集是[SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/)数据集的一个子集,专为参数规模小于10亿的小型模型适配。我们使用该数据集构建了[SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)与[SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/)。我们在此数据集上完成监督微调(Supervised Fine-Tuning, SFT),随后基于UltraFeedback数据集进行直接偏好优化(Direct Preference Optimization, DPO)。
相较于SmolTalk,本数据集存在以下差异:
- 源自Smol-Magpie-Ultra的对话篇幅更短
- 相较于SmolTalk,我们剔除了更多任务专用数据(例如无函数调用相关内容,且大幅减少改写与摘要示例),原因在于此类小型模型的参数容量有限
- 未包含任何高等数学相关数据集
python
from datasets import load_dataset
ds = load_dataset("HuggingFaceTB/smol-smoltalk", split="train")
## 引用
bibtex
@misc{allal2025smollm2smolgoesbig,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
year={2025},
eprint={2502.02737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02737},
}
提供机构:
maas
创建时间:
2025-09-09



