
ChiyuSONG/dynamics-of-instruction-tuning

Hugging Face | Updated 2024-02-26 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ChiyuSONG/dynamics-of-instruction-tuning
Resource description:
---
license: mit
viewer: false
task_categories:
- text-generation
language:
- zh
---

<p align="center">
💻 <a href="https://github.com/ChiyuSONG/dynamics-of-instruction-tuning" target="_blank">[Github Repo]</a> • 📃 <a href="https://arxiv.org/abs/2310.19651" target="_blank">[Paper]</a> • 👀 <a href="https://huggingface.co/datasets/ChiyuSONG/dynamics-of-instruction-tuning/blob/main/preview.json" target="_blank">[Preview]</a>
</p>

#### Update
12/01/23: Corrected ambiguous choices in the validation and test sets of the role-play chat data.

## Overview
We introduce *DoIT*, a collection of over 40k human-curated instruction-output pairs in Chinese. This dataset is organized into ten representative ability categories: (1) STEM subject - Biology, (2) Humanity subject - History, (3) Code Generation, (4) Creative Writing, (5) Language proficiency - Chinese, (6) Dialogue Understanding, (7) Role-play Chat, (8) Logical Reasoning, (9) Chain of Thought, and (10) Ethics.

| Ability | Data Source | Data Size |
|---|---|---|
|STEM - Biology|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,242|
|Humanity - History|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|2,093|
|Code Generation|[Leetcode](https://leetcode.cn/)|5,168|
|Creative Writing|User Queries from In-House Data|1,200|
|Chinese|[COIG - Exam](https://github.com/BAAI-Zlab/COIG#exam-instructions-63532)|1,650|
|Dialogue Understanding|[C3-D](https://dataset.org/c3/)|5,085|
|Role-play Chat|[BELLE](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)|1,200|
|Logical Reasoning|[LogiQA2.0](https://github.com/csitfun/LogiQA2.0)|12,951|
|COT for Grad-Math|[PRM800K](https://github.com/openai/prm800k)|11,701|
|Ethics|[COIG - Human Value](https://github.com/BAAI-Zlab/COIG#human-value-alignment-instructions-34471)|1,200|

Each data instance is reviewed by human annotators after collection to maintain quality control.

For in-depth information on the annotation process and on how each ability develops during instruction tuning, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [GitHub Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).

## Data Format
```javascript
// As demonstrated in the preview
{
  // "messages" contains the instruction-output pairs.
  "messages": [{"role": "user", "content": "xxxxx"}, {"role": "assistant", "content": "xxxxx"}],
  // Data id; ids are independent for each ability category.
  "idx": 100,
  // Name of its ability category.
  "type": "role-play",
  // 0 means it is an exact-match question, 1 means it is an open-ended question.
  "question_format": 1,
  // Optional, only for evaluating open-ended questions in the validation and test sets.
  "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
```

For more details on data usage in model training and evaluation, please refer to our [Paper](https://arxiv.org/abs/2310.19651) and [GitHub Repo](https://github.com/ChiyuSONG/dynamics-of-instruction-tuning).

## Citation
```
@article{song2023dynamics,
  title={Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace},
  author={Song, Chiyu and Zhou, Zhanchao and Yan, Jianhao and Fei, Yuejiao and Lan, Zhenzhong and Zhang, Yue},
  journal={arXiv preprint arXiv:2310.19651},
  year={2023}
}
```
Provided by:
ChiyuSONG
Summary of original information

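Dataset overview

We introduce DoIT, a collection of over 40k human-curated instruction-output pairs in Chinese. The dataset is organized into ten representative ability categories:

  1. STEM subject - Biology
  2. Humanity subject - History
  3. Code Generation
  4. Creative Writing
  5. Language proficiency - Chinese
  6. Dialogue Understanding
  7. Role-play Chat
  8. Logical Reasoning
  9. Chain of Thought
  10. Ethics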
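
| Ability | Data Source | Data Size |
|---|---|---|
| STEM - Biology | COIG - Exam | 1,242 |
| Humanity - History | COIG - Exam | 2,093 |
| Code Generation | Leetcode | 5,168 |
| Creative Writing | User Queries from In-House Data | 1,200 |
| Chinese | COIG - Exam | 1,650 |
| Dialogue Understanding | C3-D | 5,085 |
| Role-play Chat | BELLE | 1,200 |
| Logical Reasoning | LogiQA2.0 | 12,951 |
| COT for Grad-Math | PRM800K | 11,701 |
| Ethics | COIG - Human Value | 1,200 |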
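
As a quick sanity check on the "over 40k" figure, the per-category sizes in the table can be summed. This is only a minimal sketch: the name `DATA_SIZES` and its English keys are illustrative labels, and the values are copied from the table rather than read from the dataset files.

```python
# Per-category sizes copied by hand from the table above.
# DATA_SIZES is an illustrative name, not something defined by the dataset.
DATA_SIZES = {
    "STEM - Biology": 1242,
    "Humanity - History": 2093,
    "Code Generation": 5168,
    "Creative Writing": 1200,
    "Chinese": 1650,
    "Dialogue Understanding": 5085,
    "Role-play Chat": 1200,
    "Logical Reasoning": 12951,
    "COT for Grad-Math": 11701,
    "Ethics": 1200,
}

total = sum(DATA_SIZES.values())
largest = max(DATA_SIZES, key=DATA_SIZES.get)

print(f"{len(DATA_SIZES)} categories, {total:,} pairs in total")
print(f"largest category: {largest} ({DATA_SIZES[largest] / total:.1%} of all pairs)")
```

The ten sizes add up to 43,490, which agrees with the stated "over 40k". Logical Reasoning and COT for Grad-Math together account for more than half of the pairs, which may be worth keeping in mind when sampling across categories.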

Each data instance is reviewed by human annotators after collection to maintain quality control.

Data format

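```javascript
// As demonstrated in the preview
{
  // "messages" contains the instruction-output pairs.
  "messages": [{"role": "user", "content": "xxxxx"}, {"role": "assistant", "content": "xxxxx"}],
  // Data id; ids are independent for each ability category.
  "idx": 100,
  // Name of its ability category.
  "type": "role-play",
  // 0 means it is an exact-match question, 1 means it is an open-ended question.
  "question_format": 1,
  // Optional, only for evaluating open-ended questions in the validation and test sets.
  "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
```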
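
The record layout above can be checked mechanically before training or evaluation. The sketch below is not the authors' tooling: `SAMPLE`, `REQUIRED_FIELDS`, and `validate_record` are hypothetical names, the sample record just mirrors the placeholder values shown in the format description, and the rules (four required fields, `choices` only on open-ended questions, three `choices` entries) are assumptions read off that description.

```python
import json
from typing import List

# Fields shown as always present in the format description (assumption).
REQUIRED_FIELDS = {"messages", "idx", "type", "question_format"}

# Placeholder record mirroring the format description; not real dataset content.
SAMPLE = """
{
  "messages": [{"role": "user", "content": "xxxxx"},
               {"role": "assistant", "content": "xxxxx"}],
  "idx": 100,
  "type": "role-play",
  "question_format": 1,
  "choices": ["gold_answer", "fine-grained corruption", "coarse-grained corruption"]
}
"""


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    for i, message in enumerate(record["messages"]):
        if not {"role", "content"} <= set(message):
            problems.append(f"message {i} lacks 'role' or 'content'")

    # 0 = exact-match question, 1 = open-ended question.
    if record["question_format"] not in (0, 1):
        problems.append("question_format must be 0 or 1")

    # "choices" is optional and described only for open-ended questions.
    choices = record.get("choices")
    if choices is not None:
        if record["question_format"] != 1:
            problems.append("choices present on a non-open-ended question")
        if len(choices) != 3:
            problems.append("choices should hold 3 entries")
    return problems


record = json.loads(SAMPLE)
print(validate_record(record))
```

Returning a list of problems, rather than raising on the first one, makes it easy to run the check over a whole split and tally the issues per ability category.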

Collected summary
Dataset introduction
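Background and challenges
Background overview
This is a Chinese instruction-tuning dataset containing over 40,000 human-curated instruction-output pairs across 10 ability categories, including STEM, humanities, and code generation. The data is manually reviewed for quality, is intended for text-generation tasks, and is released under the MIT license.
The content above was collected and summarized by 遇见数据集.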