microsoft/orca-agentinstruct-1M-v1

Name: microsoft/orca-agentinstruct-1M-v1
Creator: microsoft
Published: 2024-11-01 00:14:29
License: 暂无描述

Hugging Face2024-11-01 更新2024-12-14 收录

下载链接：

https://hf-mirror.com/datasets/microsoft/orca-agentinstruct-1M-v1

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是一个完全合成的指令对集合，使用AgentInstruct框架生成。它包含约100万条指令对，涵盖文本编辑、创意写作、编码、阅读理解等多种能力。数据集中的指令对是从网络上公开可用的原始文本内容中合成的。该数据集适用于任何基础大语言模型（LLM）的指令调优。AgentInstruct框架还生成了一个包含约2500万条指令对的超集，用于对Mistral-7b模型进行后训练，结果在多个基准测试中表现出显著的性能提升。该数据集不适用于教育系统、组织或卫生系统，仅用于研究目的。需要注意的是，该数据集是合成的，可能包含不反映现实世界现象的不准确之处。

This dataset is a fully synthetic set of instruction pairs generated using the AgentInstruct framework. It contains approximately 1 million instruction pairs, covering various capabilities such as text editing, creative writing, coding, and reading comprehension. The data is synthetically generated from publicly available raw text content on the Web. The dataset is intended for instruction tuning of any base large language model (LLM). The AgentInstruct framework also generates a superset of this dataset with approximately 25 million instruction pairs, which was used to post-train the Mistral-7b model, resulting in significant improvements in performance across multiple benchmarks. The dataset is not intended for use in educational systems, organizations, or health systems, and is shared for research purposes only. It is important to note that the dataset is synthetic and may contain inaccuracies that do not reflect real-world phenomena.

提供机构：

microsoft

5,000+

优质数据集

54 个

任务类型

进入经典数据集