ChiyuSONG/dynamics-of-instruction-tuning
Source: Hugging Face · Updated: 2024-02-26 · Indexed: 2024-03-04
Download link: https://hf-mirror.com/datasets/ChiyuSONG/dynamics-of-instruction-tuning
---
license: mit
viewer: false
task_categories:
- text-generation
language:
- zh
---
<p align="center">
💻 <a href="https://github.com/ChiyuSONG/dynamics-of-instruction-tuning" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2310.19651" target="_blank">[Paper]</a> • 👀 <a href="https://huggingface.co/datasets/ChiyuSONG/dynamics-of-instruction-tuning/blob/main/preview.json" target="_blank">[Preview]</a>
</p>
#### Update
12/01/23: Corrected ambiguous choices in the validation and test sets of the role-play chat data.
## Overview
We introduce *DoIT*, a collection of over 40k human-curated instruction-output pairs in Chinese. This dataset is organized into ten representative ability categories: (1) STEM subject - Biology, (2) Humanity subject - History, (3) Code Generation, (4) Creative Writing, (5) Language proficiency - Chinese, (6) Dialogue Understanding, (7) Role-play Chat, (8) Logical Reasoning, (9) Chain of Thought, and (10) Ethics.
| Ability | Data Source | Data Size |
|---|---|---|
|STEM - Biology|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,242|
|Humanity - History|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|2,093|
|Code Generation|[Leetcode](https://leetcode.cn/)|5,168|
|Creative Writing|User Queries from In-House Data|1,200|
|Chinese|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,650|
|Dialogue Understanding|[C3-D](https://dataset.org/c3/)|5,085|
|Role-play Chat|[BELLE](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)|1,200|
|Logical Reasoning|[LogiQA2.0](https://github.com/csitfun/LogiQA2.0)|12,951|
|COT for Grad-Math|[PRM800K](https://github.com/openai/prm800k)|11,701|
|Ethics|[COIG - Human Value](https://github.com/BAAI-Zlab/COIG#human-value-alignment-instructions-34471)|1,200|
Every data instance is reviewed by human annotators after collection for quality control. For details on the annotation process and on how each ability develops during instruction tuning, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [Github Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).
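As a quick sanity check on the "over 40k" figure, the per-category sizes from the table above can be summed. The sketch below simply copies the numbers from the table; the name `DATA_SIZES` is illustrative and not part of the dataset or its repository.

```python
# Per-category sizes copied from the table in this README.
DATA_SIZES = {
    "STEM - Biology": 1242,
    "Humanity - History": 2093,
    "Code Generation": 5168,
    "Creative Writing": 1200,
    "Chinese": 1650,
    "Dialogue Understanding": 5085,
    "Role-play Chat": 1200,
    "Logical Reasoning": 12951,
    "COT for Grad-Math": 11701,
    "Ethics": 1200,
}

total = sum(DATA_SIZES.values())
largest = max(DATA_SIZES, key=DATA_SIZES.get)

print(total)    # prints 43490
print(largest)  # prints Logical Reasoning
```

The total of 43,490 is consistent with the "over 40k" description; Logical Reasoning and COT for Grad-Math together account for more than half of the instances.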
## Data Format
```javascript
// As demonstrated in the preview
{
  // "messages" contains the instruction-output pairs.
  "messages": [{"role": "user", "content": "xxxxx"}, {"role": "assistant", "content": "xxxxx"}],
  // Data id; ids are independent for each ability category.
  "idx": 100,
  // Name of the ability category.
  "type": "role-play",
  // "0" means it is an exact-match question, "1" means it is an open-ended question.
  "question_format": 1,
  // Optional; only for evaluating open-ended questions in the validation and test sets.
  "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
```
For more details on data usage in model training and evaluation, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [Github Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).
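The field descriptions above can be turned into a small validation routine. The following Python sketch is illustrative only: `validate_record` and `split_by_format` are hypothetical helper names that do not come from the DoIT repository, and the code assumes that `question_format` is stored as the integer 0 or 1, as in the example record.

```python
REQUIRED_FIELDS = ("messages", "idx", "type", "question_format")


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")

    # Each message should be a dict with a known role and string content.
    for i, message in enumerate(record.get("messages", [])):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}]: not an object")
            continue
        if message.get("role") not in ("user", "assistant"):
            problems.append(f"messages[{i}]: unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}]: content is not a string")

    # Assumption: 0 = exact-match question, 1 = open-ended question.
    if "question_format" in record and record["question_format"] not in (0, 1):
        problems.append("question_format must be 0 or 1")

    # "choices" is optional, so it is only checked when present.
    if "choices" in record and not isinstance(record["choices"], list):
        problems.append("choices must be a list")
    return problems


def split_by_format(records):
    """Split records into (exact_match, open_ended) by question_format."""
    exact_match, open_ended = [], []
    for record in records:
        target = open_ended if record["question_format"] == 1 else exact_match
        target.append(record)
    return exact_match, open_ended


example = {
    "messages": [
        {"role": "user", "content": "xxxxx"},
        {"role": "assistant", "content": "xxxxx"},
    ],
    "idx": 100,
    "type": "role-play",
    "question_format": 1,
}

print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a file at once. How the records are stored on disk is not specified here, so loading is left to the reader.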
## Citation
```
@article{song2023dynamics,
title={Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace},
author={Song, Chiyu and Zhou, Zhanchao and Yan, Jianhao and Fei, Yuejiao and Lan, Zhenzhong and Zhang, Yue},
journal={arXiv preprint arXiv:2310.19651},
year={2023}
}
```
Provider: ChiyuSONG