COIG

Name: COIG
Creator: maas
Published: 2025-12-04 16:15:02
License: No description available

ModelScope Community. Updated: 2025-12-04. Indexed: 2024-06-01.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/COIG

Resource description:

# This is the Chinese Open Instruction Generalist project

We propose the Chinese Open Instruction Generalist (**COIG**) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. We welcome all researchers in the community to contribute to the corpus set and collaborate with us. This release is only the first part of COIG; it is intended to support the development of Chinese LLMs in the exploration stage and to encourage more researchers to join us in building COIG.

We introduce a manually verified translated general instruction corpus, a manually annotated exam instruction corpus, a human value alignment instruction corpus, a multi-round counterfactual correction chat corpus, and a Leetcode instruction corpus. We provide these new instruction corpora to assist the community with instruction tuning on Chinese LLMs. These instruction corpora are also template workflows for how new Chinese instruction corpora can be built and expanded effectively.

We recommend downloading the individual data files you need directly rather than loading them through the Hugging Face `datasets` loader (`load_dataset`). All datasets can be downloaded from: https://huggingface.co/datasets/BAAI/COIG/tree/main

This dataset card is modified from [OIG](https://huggingface.co/datasets/laion/OIG).

### Translated Instructions (66,858)

There are 66,858 instructions in total, which are composed of 1,616 task descriptions in [Super-NaturalInstructions](https://arxiv.org/abs/2204.07705) along with a single instance for each of them, 175 seed tasks in [Self-Instruct](https://arxiv.org/abs/2212.10560), and 66,007 instructions from [Unnatural Instructions](https://arxiv.org/abs/2212.09689). To reduce the cost and further improve the quality of the instruction corpus, we separate the translation procedure into three phases: automatic translation, manual verification, and manual correction. These strict quality verification procedures ensure the reliability of the translated corpus.
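Because the card recommends downloading individual data files directly, a downloaded JSON Lines file can be read with the Python standard library alone. The following is a minimal sketch, not the project's official loader: the helper name `read_jsonl` is hypothetical, and no record field names are assumed because the schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report the offending line so a damaged download is easy to locate.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({error})"
                ) from error
```

For example, `records = list(read_jsonl("translated_instructions.jsonl"))` would load a local copy of that file (the path is an assumption about where you saved it). Inspecting `records[0].keys()` before relying on any field is a sensible first step.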
### Exam Instructions (63,532)

The Chinese National College Entrance Examination, Middle School Entrance Examinations, and Civil Servant Examination are the main Chinese commonsense tests. These exams contain various question formats and detailed analysis that can be used as a Chain-of-Thought (**CoT**) corpus. We extract six informative elements from the original exam questions: instruction, question context, question, answer, answer analysis, and coarse-grained subject. There are six main coarse-grained subjects: Chinese, English, Politics, Biology, History, and Geography. There are very few Math, Physics, and Chemistry questions in the corpus because these questions often contain complex symbols that are hard to annotate. For the many multiple-choice questions, we recommend that researchers post-process them with prompts or convert them into fill-in-the-blank questions to further increase the diversity of the instructions.

### Human Value Alignment Instructions (34,471)

To respect and reflect the major differences caused by different cultural backgrounds, and unlike other tasks in COIG that use one unified collection of instruction-following samples, we categorize the value alignment data into two separate series:

- A set of samples that present shared human values in the Chinese-speaking world. In total, we choose 50 instructions as the augmentation seeds and produce 3k instruction-following samples for general-purpose value alignment in the Chinese-speaking world.
- Some additional sets of samples that present regional-culture or country-specific human values.

### Counterfactual Correction Multi-round Chat (13,653)

The Counterfactual Correction Multi-round Chat dataset (CCMC) is constructed based on the [CN-DBpedia knowledge graph dataset](https://link.springer.com/chapter/10.1007/978-3-319-60045-1_44) with the aim of alleviating hallucination and factual inconsistency in current LLMs.
The CCMC dataset includes 5 rounds of role-playing chat between a student and a teacher, and the corresponding knowledge they refer to. The dataset contains ~13,000 dialogues with an average of 5 rounds per dialogue, resulting in ~65,000 rounds of chat.

### Leetcode Instructions (11,737)

Given that code-related tasks potentially contribute to the emergence of abilities in LLMs, we argue that code-related tasks aligned with Chinese natural language should be included in our datasets. Therefore, we build the Leetcode instructions from a **CC-BY-SA-4.0**-licensed [collection](https://github.com/doocs/leetcode) of 2,589 programming questions. The questions contain problem descriptions, solutions in multiple programming languages, and explanations (834 questions do not have explanations).

## Support this project

Your contributions and feedback support the open source ecosystem, improve the bot and provide datasets for future AI research. To participate, you can submit GitHub issues, track issues, and help create datasets that need improvement: https://github.com/BAAI-Zlab/COIG

## Update: May 27, 2023

- v0.3: Update `counterfactural_correction_multi_round_chat.tar.gz` and make sure all round responses can be decoded as JSON.
- v0.2: Update `exam_instructions.jsonl`, `translated_instructions.jsonl` and `human_value_alignment_instructions_part2.json`.
- v0.1: Release the five datasets of COIG.

## Disclaimer

These datasets contain synthetic data and, in some cases, data that includes humans trying to get the language model to say toxic, offensive, or trolling things. If you are concerned about the presence of this type of material in the dataset, please make sure you carefully inspect each of the entries and filter appropriately. Our goal is for the model to be as helpful and non-toxic as possible, and we are actively evaluating ways to reduce or eliminate undesirable content from the instruction tuning datasets.
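The inspect-and-filter step suggested in the disclaimer could be sketched as follows. This is only an illustration under stated assumptions: the function `filter_records`, the blocklist, and the field names passed to it are hypothetical, and plain substring matching is a crude stand-in for whatever manual review or classifier you actually trust.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def filter_records(
    records: Iterable[Dict[str, Any]],
    blocked_terms: Sequence[str],
    text_fields: Sequence[str],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into (kept, flagged) by case-insensitive substring match."""
    lowered_terms = [term.lower() for term in blocked_terms if term]
    kept: List[Dict[str, Any]] = []
    flagged: List[Dict[str, Any]] = []
    for record in records:
        # Only the caller-supplied fields are checked; missing fields count as empty.
        text = " ".join(str(record.get(field, "")) for field in text_fields).lower()
        if any(term in text for term in lowered_terms):
            flagged.append(record)  # set aside for manual inspection
        else:
            kept.append(record)
    return kept, flagged
```

Returning the flagged records instead of silently dropping them keeps the manual inspection that the disclaimer asks for possible.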
## License

The COIG dataset that is authored by BAAI is released under an Apache 2.0 license. However, the data also includes content licensed under other permissive licenses such as unnatural instructions data which is licensed under MIT License, or web-crawled data which is used under fair use principles.

## BibTeX & Citation

```
@misc{zhang2023chinese,
      title={Chinese Open Instruction Generalist: A Preliminary Release},
      author={Ge Zhang and Yemin Shi and Ruibo Liu and Ruibin Yuan and Yizhi Li and Siwei Dong and Yu Shu and Zhaoqun Li and Zekun Wang and Chenghua Lin and Wenhao Huang and Jie Fu},
      year={2023},
      eprint={2304.07987},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Provider:

maas

Created:

2024-05-09
