ChiyuSONG/dynamics-of-instruction-tuning

Name: ChiyuSONG/dynamics-of-instruction-tuning
Creator: ChiyuSONG
Published: 2024-02-26 09:23:23
License: MIT

Source: Hugging Face. Updated 2024-02-26; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ChiyuSONG/dynamics-of-instruction-tuning

Resource description:

---
license: mit
viewer: false
task_categories:
- text-generation
language:
- zh
---

<p align="center"> 💻 <a href="https://github.com/ChiyuSONG/dynamics-of-instruction-tuning" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2310.19651" target="_blank">[Paper]</a> • 👀 <a href="https://huggingface.co/datasets/ChiyuSONG/dynamics-of-instruction-tuning/blob/main/preview.json" target="_blank">[Preview]</a> </p>

#### Update
12/01/23: Corrected ambiguous choices in the validation and test sets of the role-play chat data.

## Overview
We introduce *DoIT*, a collection of over 40k human-curated instruction-output pairs in Chinese. This dataset is organized into ten representative ability categories: (1) STEM subject - Biology, (2) Humanity subject - History, (3) Code Generation, (4) Creative Writing, (5) Language proficiency - Chinese, (6) Dialogue Understanding, (7) Role-play Chat, (8) Logical Reasoning, (9) Chain of Thought, and (10) Ethics.

| Ability | Data Source | Data Size |
|---|---|---|
|STEM - Biology|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,242|
|Humanity - History|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|2,093|
|Code Generation|[Leetcode](https://leetcode.cn/)|5,168|
|Creative Writing|User Queries from In-House Data|1,200|
|Chinese|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,650|
|Dialogue Understanding|[C3-D](https://dataset.org/c3/)|5,085|
|Role-play Chat|[BELLE](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)|1,200|
|Logical Reasoning|[LogiQA2.0](https://github.com/csitfun/LogiQA2.0)|12,951|
|COT for Grad-Math|[PRM800K](https://github.com/openai/prm800k)|11,701|
|Ethics|[COIG - Human Value](https://github.com/BAAI-Zlab/COIG#human-value-alignment-instructions-34471)|1,200|

Each data instance is meticulously reviewed by human annotators after collection to maintain quality control.
For in-depth information on the annotation process and the variations in the development of each ability during instruction tuning, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [Github Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).

## Data Format
```javascript
// As demonstrated in the preview
{
    // "messages" contains the instruction-output pairs.
    "messages": [{"role": "user", "content": "xxxxx"}, {"role": "assistant", "content": "xxxxx"}],
    // Data id; ids are independent for each ability category.
    "idx": 100,
    // Name of its ability category.
    "type": "role-play",
    // "0" means it is an exact-match question, "1" means it is an open-ended question.
    "question_format": 1,
    // Optional, only for evaluating open-ended questions in the validation and test sets.
    "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
```

For more details on data usage in model training and evaluation, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [Github Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).

## Citation
```
@article{song2023dynamics,
  title={Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace},
  author={Song, Chiyu and Zhou, Zhanchao and Yan, Jianhao and Fei, Yuejiao and Lan, Zhenzhong and Zhang, Yue},
  journal={arXiv preprint arXiv:2310.19651},
  year={2023}
}
```
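The record layout described under Data Format can be checked mechanically before training or evaluation. The following is a minimal sketch, not part of the dataset's own tooling: the function name `validate_record`, the assumption that only the `user` and `assistant` roles occur, and the assumption that `question_format` is stored as the integer 0 or 1 are all illustrative choices based on the example record above.

```python
import json

# Keys that the Data Format section shows on every record ("choices" is optional).
REQUIRED_KEYS = {"messages", "idx", "type", "question_format"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record; an empty list means it looks valid."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    for i, msg in enumerate(record["messages"]):
        # Assumption: only the two roles shown in the example occur.
        if msg.get("role") not in ("user", "assistant"):
            problems.append(f"messages[{i}]: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}]: content is not a string")

    # Assumption: 0 = exact-match, 1 = open-ended, stored as integers.
    if record["question_format"] not in (0, 1):
        problems.append("question_format must be 0 or 1")

    # "choices" is optional; when present it lists the gold answer plus two corruptions.
    choices = record.get("choices")
    if choices is not None and len(choices) != 3:
        problems.append("choices must contain exactly 3 entries")
    return problems


sample = json.loads("""
{
  "messages": [{"role": "user", "content": "xxxxx"},
               {"role": "assistant", "content": "xxxxx"}],
  "idx": 100,
  "type": "role-play",
  "question_format": 1,
  "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
""")

print(validate_record(sample))  # []
```

Returning a list of problems instead of raising on the first one makes it easier to audit a whole file in one pass. If the real files use other roles or store `question_format` differently, the corresponding checks should be relaxed.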

Provided by:

ChiyuSONG

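Summary of original information

Dataset overview

We introduce DoIT, a dataset of over 40k human-curated instruction-output pairs in Chinese. The dataset is organized into ten representative ability categories: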

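STEM subject - Biology
Humanity subject - History
Code Generation
Creative Writing
Language proficiency - Chinese
Dialogue Understanding
Role-play Chat
Logical Reasoning
Chain of Thought
Ethics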

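| Ability | Data Source | Data Size |
|---|---|---|
|STEM - Biology|COIG - Exam|1,242|
|Humanity - History|COIG - Exam|2,093|
|Code Generation|Leetcode|5,168|
|Creative Writing|User Queries from In-House Data|1,200|
|Chinese|COIG - Exam|1,650|
|Dialogue Understanding|C3-D|5,085|
|Role-play Chat|BELLE|1,200|
|Logical Reasoning|LogiQA2.0|12,951|
|COT for Grad-Math|PRM800K|11,701|
|Ethics|COIG - Human Value|1,200|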
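
The "over 40k" figure can be cross-checked against the per-category sizes in the table above. This is a small sketch that only re-adds the published numbers; the dictionary name `DATA_SIZES` is illustrative and does not come from the dataset.

```python
# Per-category sizes copied from the table above.
DATA_SIZES = {
    "STEM - Biology": 1242,
    "Humanity - History": 2093,
    "Code Generation": 5168,
    "Creative Writing": 1200,
    "Chinese": 1650,
    "Dialogue Understanding": 5085,
    "Role-play Chat": 1200,
    "Logical Reasoning": 12951,
    "COT for Grad-Math": 11701,
    "Ethics": 1200,
}

total = sum(DATA_SIZES.values())
largest = max(DATA_SIZES, key=DATA_SIZES.get)

print(total)    # 43490
print(largest)  # Logical Reasoning
```

The ten categories add up to 43,490 instances, consistent with the stated "over 40k". Sizes are uneven across categories: Logical Reasoning and COT for Grad-Math together account for more than half of the total.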