Multilingual-CoT-Collection
ModelScope community; updated 2025-12-04, indexed 2024-06-08
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Multilingual-CoT-Collection
# Dataset Card for Multilingual-CoT-Collection
## Dataset Description
- **Homepage:** https://github.com/kaistAI/CoT-Collection
- **Repository:** https://github.com/kaistAI/CoT-Collection
- **Paper:** https://arxiv.org/abs/2305.14045
- **Point of Contact:** seungone@kaist.ac.kr
### Dataset Summary

The Multilingual CoT Collection is a dataset designed to induce Chain-of-Thought (CoT) capabilities into multilingual language models.
While proprietary LLMs excel at generating Chain-of-Thoughts when prompted, smaller LMs do not have this capability. Fine-tuning smaller LMs to generate Chain-of-Thoughts could let them acquire it.
The Multilingual CoT Collection provides 1.84 million Chain-of-Thoughts augmented across 1060 tasks from the Flan Collection.
Experimental results show that fine-tuning on the CoT Collection results in (1) better zero-shot performance and (2) a better base model for few-shot learning.
We also provide a multilingual version of CoT Collection at this [link](https://huggingface.co/datasets/kaist-ai/Multilingual-CoT-Collection).
### Supported Tasks and Leaderboards
1060 tasks chosen from the Flan Collection.
The categories within the CoT Collection are:
* Natural Language Inference
* Extractive Question Answering
* Closed Book Question Answering
* Science
* Toxic Classification
* Arithmetic
* Program Execution
* Dialogue
* Ethics
* Commonsense Reasoning
* Multiple Choice Question Answering
### Languages
English
## Dataset Structure
* source: The input that is given to the language model (LM).
* target: The ground truth answer to the source.
* rationale: The Chain of Thought (CoT) that explains how the target could be derived from the source.
* task: A category that shows which dataset the source and target were extracted from.
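As a rough illustration of the four fields above, a row can be modelled as a typed dictionary. This is only a sketch: the names `CoTRecord`, `REQUIRED_FIELDS`, and `is_valid_record` are hypothetical helpers, the assumption that every field is a non-empty string is ours, and the example row is made up rather than taken from the dataset.

```python
from typing import TypedDict


class CoTRecord(TypedDict):
    """One row, using the four field names listed in this card."""
    source: str     # input given to the language model
    target: str     # ground truth answer to the source
    rationale: str  # chain of thought that leads from source to target
    task: str       # dataset the source/target pair was extracted from


REQUIRED_FIELDS = ("source", "target", "rationale", "task")


def is_valid_record(row: dict) -> bool:
    """Return True if every required field is present as a non-empty string.

    Assumption: all four fields are plain strings; the card does not state this.
    """
    return all(
        isinstance(row.get(key), str) and bool(row[key].strip())
        for key in REQUIRED_FIELDS
    )


# Made-up row for illustration only; not an actual dataset entry.
example: CoTRecord = {
    "source": "What is 2 + 2?",
    "target": "4",
    "rationale": "Adding 2 and 2 gives 4.",
    "task": "hypothetical_arithmetic_task",
}
```

A check like this can be useful before fine-tuning, since rows with a missing rationale or target would produce malformed training strings.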
In our paper, we trained the underlying language model to generate in the following format:
```
{rationale}
[RESULT]
{target}
```
Then during evaluation, we parsed the prediction after the phrase `[RESULT]`.
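The format and the parsing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the parts are assumed to be joined by single newlines, and when the marker appears more than once the text after the first occurrence is taken.

```python
from typing import Optional

RESULT_MARKER = "[RESULT]"


def build_training_target(rationale: str, target: str) -> str:
    """Join rationale and target in the rationale / [RESULT] / target layout.

    Assumption: the three parts are separated by single newlines.
    """
    return f"{rationale}\n{RESULT_MARKER}\n{target}"


def parse_prediction(generated: str) -> Optional[str]:
    """Return the text after the first [RESULT] marker, or None if it is absent.

    Assumption: surrounding whitespace is not part of the answer.
    """
    _, marker, after = generated.partition(RESULT_MARKER)
    if not marker:
        return None  # the model never emitted the marker
    return after.strip()


# Made-up strings for illustration only.
text = build_training_target("Adding 2 and 2 gives 4.", "4")
answer = parse_prediction(text)
```

Returning `None` when the marker is missing keeps unparseable generations distinguishable from empty answers, which matters when counting failures during evaluation.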
### Data Splits
| name | train |
|-------------------|------:|
|CoT-Collection|1837928|
### Citation Information
If you find this dataset helpful, please consider citing our paper!
```
@article{kim2023cot,
title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},
author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},
journal={arXiv preprint arXiv:2305.14045},
year={2023}
}
```
Provided by:
maas
Created:
2024-05-09



