HuggingFaceTB/smoltalk

Name: HuggingFaceTB/smoltalk
Creator: HuggingFaceTB
Published: 2025-02-10 16:36:16
License: 暂无描述

Hugging Face2025-02-10 更新2024-12-14 收录

下载链接：

https://hf-mirror.com/datasets/HuggingFaceTB/smoltalk

下载链接

链接失效反馈

官方服务：

资源简介：

SmolTalk是一个用于监督微调（SFT）大语言模型（LLMs）的合成数据集，包含100万个样本。它被用于构建SmolLM2-Instruct系列模型，并通过一系列数据消融实验增强了模型的指令遵循能力。数据集由多个子集组成，包括新生成的合成数据集和现有的公开数据集，涵盖了文本编辑、重写、摘要和推理等多种任务。新数据集使用distilabel生成，现有数据集则用于增强模型在数学、编码、系统提示和长上下文理解等方面的能力。

SmolTalk is a synthetic dataset designed for supervised finetuning (SFT) of large language models (LLMs), containing 1 million samples. This dataset was used to build the SmolLM2-Instruct family of models and covers various tasks including text editing, rewriting, summarization, reasoning, mathematics, coding, system prompt following, and long-context understanding. The new datasets were generated using the distilabel tool, and the README provides information on how to load the dataset and its composition. The dataset is licensed under Apache 2.0 for the new datasets, while the licenses for the existing public datasets are specified in their respective sources.

提供机构：

HuggingFaceTB

5,000+

优质数据集

54 个

任务类型

进入经典数据集