BAAI_COIG
Source: OpenCSG · Updated: 2024-07-19 · Indexed: 2025-05-03
Download link:
https://www.opencsg.com/datasets/BAAI/BAAI_COIG
Resource description:
COIG aims to maintain a harmless, beneficial and diverse Chinese instruction dataset to advance the development of Chinese large language models. It encompasses multiple sub-datasets:
1. Translation Instruction Set: 66,858 entries, sourced from Super-NaturalInstructions, Self-Instruct and Unnatural Instructions, which have undergone automatic translation, manual verification and correction;
2. Exam Instruction Set: 63,532 entries, originating from China's National College Entrance Examination (Gaokao), High School Entrance Examination (Zhongkao) and civil service examinations, including instructions, question backgrounds, questions, answers, answer analyses and coarse-grained subject categories;
3. Human Values Alignment Instruction Set: 34,471 entries, divided into general values and specific regional cultural values;
4. Counterfactual Revision Multi-turn Dialogue Dataset: 13,653 entries, based on the CN-DBpedia knowledge graph, containing five-round conversations between students and teachers;
5. Leetcode Instruction Set: 11,737 entries, sourced from a collection of programming problems under the CC-BY-SA-4.0 license.
COIG is released under the Apache 2.0 license, and portions of its data also incorporate content covered by other license agreements.
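To put the subset sizes listed above in proportion, here is a minimal Python sketch that totals them and prints each subset's share of the whole. The counts are copied from the description; the dictionary keys are illustrative labels chosen for this example and are not official subset or file names.

```python
# Entry counts per COIG subset, copied from the description above.
# The keys are illustrative labels, not official subset names.
SUBSET_SIZES = {
    "translation": 66_858,
    "exam": 63_532,
    "human_values_alignment": 34_471,
    "counterfactual_multi_turn": 13_653,
    "leetcode": 11_737,
}


def subset_shares(sizes):
    """Return (total, {name: fraction of total}) for the given counts."""
    total = sum(sizes.values())
    return total, {name: count / total for name, count in sizes.items()}


total, shares = subset_shares(SUBSET_SIZES)
print(f"total entries: {total:,}")
# List subsets from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {share:6.1%}")
```

The listed counts sum to 190,251 entries, with the translation and exam sets together making up roughly two thirds of the collection.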
Created:
2024-07-19



