
BAAI_COIG

Source: OpenCSG | Updated: 2024-07-19 | Indexed: 2025-05-03
Download link:
https://www.opencsg.com/datasets/BAAI/BAAI_COIG
Description:
COIG aims to maintain a harmless, helpful, and diverse Chinese instruction dataset to support the development of Chinese large language models. It comprises five sub-datasets:

1. Translation Instruction Set: 66,858 entries, sourced from Super-NaturalInstructions, Self-Instruct, and Unnatural Instructions, then automatically translated, manually verified, and corrected.
2. Exam Instruction Set: 63,532 entries, drawn from China's National College Entrance Examination (Gaokao), High School Entrance Examination (Zhongkao), and civil service examinations. Each entry includes an instruction, question background, question, answer, answer analysis, and a coarse-grained subject label.
3. Human Values Alignment Instruction Set: 34,471 entries, divided into general values and region-specific cultural values.
4. Counterfactual Revision Multi-turn Dialogue Dataset: 13,653 entries, built on the CN-DBpedia knowledge graph, each containing a five-round dialogue between a student and a teacher.
5. LeetCode Instruction Set: 11,737 entries, sourced from a collection of programming problems released under the CC-BY-SA-4.0 license.

COIG is released under the Apache 2.0 license; some of its data is also subject to other license agreements.
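To put the five entry counts quoted above in proportion, the following is a minimal Python sketch that tallies them and reports each subset's share. It assumes the five subsets are disjoint, which the description does not state explicitly, and the dictionary keys are shorthand labels invented for this sketch rather than official subset or file names.

```python
# Entry counts as quoted in the description; the keys are shorthand labels
# chosen for this sketch (an assumption), not official subset names.
SUBSET_SIZES = {
    "translation": 66_858,
    "exam": 63_532,
    "human_values_alignment": 34_471,
    "counterfactual_dialogue": 13_653,
    "leetcode": 11_737,
}


def summarize(sizes):
    """Return (total, {name: share_of_total}) for a mapping of entry counts."""
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = summarize(SUBSET_SIZES)
    print(f"total entries: {total:,}")
    # List subsets from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<26} {SUBSET_SIZES[name]:>7,}  {share:6.1%}")
```

Under that assumption the counts sum to 190,251 entries, with the translation and exam subsets together making up roughly two thirds of the total.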
Created:
2024-07-19