
CoT-Collection

ModelScope Community | Updated 2026-01-02 | Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/AI-ModelScope/CoT-Collection
Resource description:
# Dataset Card for CoT-Collection

## Dataset Description

- **Homepage:** https://github.com/kaistAI/CoT-Collection
- **Repository:** https://github.com/kaistAI/CoT-Collection
- **Paper:** https://arxiv.org/abs/2305.14045
- **Point of Contact:** seungone@kaist.ac.kr

### Dataset Summary

![plot](./cot_collection.JPG)

The CoT Collection is a dataset designed to induce Chain-of-Thought (CoT) capabilities in language models. While proprietary LLMs excel at generating chains of thought when prompted, smaller LMs do not have this capability. Fine-tuning smaller LMs to generate chains of thought lets them acquire it. The CoT Collection provides 1.84 million Chain-of-Thought rationales augmented across 1060 tasks from the Flan Collection.

Experimental results show that fine-tuning on the CoT Collection results in (1) better zero-shot performance and (2) a better base model for few-shot learning.

We also provide a multilingual version of the CoT Collection at this [link](https://huggingface.co/datasets/kaist-ai/Multilingual-CoT-Collection).

### Supported Tasks and Leaderboards

1060 tasks chosen from the Flan Collection. The categories within the CoT Collection are:

* Natural Language Inference
* Extractive Question Answering
* Closed Book Question Answering
* Science
* Toxic Classification
* Arithmetic
* Program Execution
* Dialogue
* Ethics
* Commonsense Reasoning
* Multiple Choice Question Answering

### Languages

English

## Dataset Structure

* `source`: The input that is given to the language model (LM).
* `target`: The ground truth answer to the source.
* `rationale`: The Chain of Thought (CoT) that explains how the target could be derived from the source.
* `task`: A category that shows which dataset the source and target were extracted from.

In our paper, we trained the underlying language model to generate in the following format:

```
{rationale} [RESULT] {target}
```

Then, during evaluation, we parsed the prediction after the phrase `[RESULT]`.
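The output format and the `[RESULT]` parsing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training or evaluation code: the `CoTExample` class, the helper names, and the sample record are hypothetical, and taking the text after the *last* `[RESULT]` marker (rather than the first) is an assumption, since the card does not say how repeated markers are handled.

```python
from dataclasses import dataclass
from typing import Optional

RESULT_MARKER = "[RESULT]"


@dataclass
class CoTExample:
    """One record with the four fields listed in the dataset card."""
    source: str     # input given to the language model
    target: str     # ground truth answer
    rationale: str  # chain of thought leading from source to target
    task: str       # dataset/category the pair was drawn from


def build_training_target(example: CoTExample) -> str:
    """Join rationale and answer as `{rationale} [RESULT] {target}`."""
    return f"{example.rationale} {RESULT_MARKER} {example.target}"


def parse_prediction(generated: str) -> Optional[str]:
    """Return the text after the last `[RESULT]` marker, or None if absent.

    Using the last occurrence is an assumption made for this sketch.
    """
    _, marker, answer = generated.rpartition(RESULT_MARKER)
    if not marker:
        return None
    return answer.strip()


if __name__ == "__main__":
    # Hypothetical record, made up for illustration only.
    example = CoTExample(
        source="What is 2 + 3?",
        target="5",
        rationale="Adding 2 and 3 gives 5.",
        task="arithmetic",
    )
    text = build_training_target(example)
    print(text)
    print(parse_prediction(text))
```

Returning `None` when the marker is missing lets an evaluation loop count unparseable generations separately instead of silently scoring them as wrong answers.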
### Data Splits

| name           |   train |
|----------------|--------:|
| CoT-Collection | 1837928 |

### Citation Information

If you find this dataset helpful, please consider citing our paper!

```
@article{kim2023cot,
  title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},
  author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},
  journal={arXiv preprint arXiv:2305.14045},
  year={2023}
}
```

Provider:
maas
Created:
2024-05-09