
HuggingFaceTB/smoltalk

Source: Hugging Face. Updated: 2025-02-10. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/HuggingFaceTB/smoltalk
Overview:
SmolTalk is a synthetic dataset for supervised fine-tuning (SFT) of large language models (LLMs), containing 1 million samples. It was used to build the SmolLM2-Instruct family of models, and a series of data ablations was used to improve the models' instruction-following ability. The dataset is made up of several subsets: newly generated synthetic datasets, which cover tasks such as text editing, rewriting, summarization, and reasoning, and existing public datasets, which are included to strengthen mathematics, coding, system prompt following, and long-context understanding. The new datasets were generated with distilabel. The README describes the dataset's composition and how to load it. The new datasets are licensed under Apache 2.0; the licenses of the existing public datasets are specified in their respective sources.
Provided by:
HuggingFaceTB
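
Since the listing only points to the README for loading instructions, here is a minimal sketch of how the dataset might be loaded with the Hugging Face `datasets` library. The configuration name `"all"`, the split name `"train"`, and the per-row `source` field naming the originating subset are assumptions not stated on this page and should be checked against the README. The helper names `load_smoltalk` and `count_by_source` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping

DATASET_ID = "HuggingFaceTB/smoltalk"


def load_smoltalk(config: str = "all", split: str = "train", streaming: bool = True):
    """Load SmolTalk from the Hugging Face Hub (needs network access).

    Assumption: a configuration named "all" and a split named "train" exist.
    Streaming avoids downloading all ~1 million samples up front.
    """
    # Imported lazily so the rest of this module works without `datasets`
    # installed (pip install datasets).
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split=split, streaming=streaming)


def count_by_source(rows: Iterable[Mapping], key: str = "source") -> Counter:
    """Count rows per subset.

    Assumption: each row carries a `source` field naming its subset;
    rows without that field are counted under "unknown".
    """
    return Counter(row.get(key, "unknown") for row in rows)


if __name__ == "__main__":
    # Inspect the subset mix of the first 1000 streamed rows.
    sample = load_smoltalk().take(1000)
    for name, n in count_by_source(sample).most_common():
        print(f"{name}: {n}")
```

Counting by subset is one quick way to see the mix of new synthetic data and existing public datasets described in the overview, without materializing the full dataset locally.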