LaMini-instruction
ModelScope community · updated 2025-11-07 · indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/MBZUAI/LaMini-instruction
Resource description:
# Dataset Card for "LaMini-Instruction"
<p align="center" width="100%">
<a><img src="https://raw.githubusercontent.com/mbzuai-nlp/lamini-lm/main/images/lamini.png" alt="Title" style="width: 100%; min-width: 300px; display: block; margin: auto;"></a>
</p>
<p align="center"> <a href="https://twitter.com/WuMinghao_nlp" target="_blank">Minghao Wu</a>, <a href="https://twitter.com/MacAbdul9" target="_blank">Abdul Waheed</a>, <a href="https://chiyuzhang94.github.io/" target="_blank">Chiyu Zhang</a>, <a href="https://mageed.arts.ubc.ca/" target="_blank">Muhammad Abdul-Mageed</a>, <a href="https://afaji.github.io/" target="_blank">Alham Fikri Aji</a></p>
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Instances](#data-instances)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Additional Information](#additional-information)
## Dataset Description
We distill knowledge from large language models by performing sequence-level (offline) distillation (Kim and Rush, 2016). We generate a total of **2.58M** pairs of instructions and responses using [`gpt-3.5-turbo`](https://openai.com/api/), based on several existing prompt resources, including [self-instruct](https://github.com/yizhongw/self-instruct) (Wang et al., 2022), [P3](https://huggingface.co/datasets/bigscience/P3) (Sanh et al., 2022), [FLAN](https://github.com/google-research/FLAN) (Longpre et al., 2023), and [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) (Taori et al., 2023). For more information about how we generated the instruction dataset, please refer to [our paper](https://arxiv.org/abs/2304.14402).
We use LaMini-Instruction to fine-tune various small language models, resulting in the [LaMini-LM model series](https://github.com/mbzuai-nlp/laMini-LM#models).
Each sample contains the following three fields:
* `instruction`: `string`, describes the task that the model is expected to perform.
* `response`: `string`, the response to the instruction as generated by `gpt-3.5-turbo`.
* `instruction_source`: `string`, the source of an instruction. The possible values are `{'alpaca', 'generated_flan', 'generated_p3', 'original_flan', 'original_p3', 'self_instruct_with_topic', 'self_instruct_without_topic'}`.
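The field list above can be checked mechanically. The following is a minimal sketch, assuming each record arrives as a Python `dict`; the names `validate_record`, `REQUIRED_FIELDS`, and `ALLOWED_SOURCES` are hypothetical and not part of any official tooling for this dataset.

```python
from typing import Any, Dict, List

# Allowed values of `instruction_source`, copied from the field description.
ALLOWED_SOURCES = frozenset({
    "alpaca",
    "generated_flan",
    "generated_p3",
    "original_flan",
    "original_p3",
    "self_instruct_with_topic",
    "self_instruct_without_topic",
})

# The three fields every sample is expected to carry (all strings).
REQUIRED_FIELDS = ("instruction", "response", "instruction_source")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(
                f"{field}: expected a string, got {type(value).__name__}"
            )
    source = record.get("instruction_source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"instruction_source: unexpected value {source!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to audit a large file in a single pass.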
Two examples from the LaMini-instruction data:
```json
{
"instruction": "What is the capital of Australia?\n\nIs Canberra the only city chosen to serve as a capital due to a compromise between Sydney and Melbourne?",
"response": "The capital of Australia is Canberra. Yes, Canberra was chosen as the capital of Australia in 1908 due to a compromise between Sydney and Melbourne, as neither city wanted the other to become the capital.",
"instruction_source": "generated_flan"
}
```
```json
{
"instruction": "Create a poem that has the word \"ocean\" in every other line.",
"response": "The ocean's depth is a mystery,\nIts waves crashing with energy.\nSalt air fills the coastal scene,\nThe ocean's blue hues pure and clean.\n\nSeagulls cry out a melancholy song,\nThe ocean's rhythm steady and strong.\nSunsets paint the water gold,\nThe ocean's secrets forever untold.",
"instruction_source": "self_instruct_without_topic"
}
```
## Considerations for Using the Data
As the output is distilled from ChatGPT, this data contains errors and biases produced by ChatGPT. Models trained with this dataset will inherit the errors and biases. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.
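As one possible starting point for such filtering, the sketch below drops records with an empty instruction or response and removes exact duplicate instruction/response pairs. Both heuristics are illustrative assumptions rather than the authors' method, and `basic_filter` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def basic_filter(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep records with non-empty text, dropping exact duplicate pairs."""
    seen = set()
    kept = []
    for record in records:
        instruction = record.get("instruction", "").strip()
        response = record.get("response", "").strip()
        if not instruction or not response:
            continue  # heuristic 1: skip empty instructions or responses
        key = (instruction, response)
        if key in seen:
            continue  # heuristic 2: skip exact duplicate pairs
        seen.add(key)
        kept.append(record)
    return kept
```

Surface checks like these cannot detect factual errors or bias in a response; they only remove obviously unusable rows before any more careful review.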
### Licensing Information
The dataset is available under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.
### Citation Information
Please cite us if you use our data or models.
```bibtex
@article{lamini-lm,
author = {Minghao Wu and
Abdul Waheed and
Chiyu Zhang and
Muhammad Abdul-Mageed and
Alham Fikri Aji
},
title = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
journal = {CoRR},
volume = {abs/2304.14402},
year = {2023},
url = {https://arxiv.org/abs/2304.14402},
eprinttype = {arXiv},
eprint = {2304.14402}
}
```
Provider:
maas
Created:
2025-03-17



