pharaouk/CoT-Collection
收藏Hugging Face2024-04-10 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/pharaouk/CoT-Collection
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
language:
- en
size_categories:
- 1M<n<10M
---
# Dataset Card for Dataset Name
## Dataset Description
- **Homepage:https://github.com/kaistAI/CoT-Collection**
- **Repository:https://github.com/kaistAI/CoT-Collection**
- **Paper:https://arxiv.org/abs/2305.14045**
- **Point of Contact:seungone@kaist.ac.kr**
### Dataset Summary

The CoT Collection is a dataset designed to induce Chain-of-Thought (CoT) capabilities into language models.
While proprietary LLMs excel at generating Chain-of-Thoughts based on prompting, smaller LMs do not have this capability. Thus, by fine-tuning to generate Chain-of-Thoughts, it could acquire such abilities.
The CoT Collection provides 1.84 million Chain-of-Thoughts augmented across 1060 tasks from the Flan Collection.\\
Experimental results show that fine-tuning on the CoT Collection results in (1) better zero-shot performance and (2) a better base model for few-shot learning.
We also provide a multilingual version of CoT Collection at this [link](https://huggingface.co/datasets/kaist-ai/Multilingual-CoT-Collection).
### Supported Tasks and Leaderboards
1060 tasks chosen from the Flan Collection.
The list of categories within the CoT Collection are:
* Natural Language Inference
* Extractive Question Answering
* Closed Book Question Answering
* Science
* Toxic Classification
* Arithmetic
* Program Execution
* Dialogue
* Ethics
* Commonsense Reasoning
* Multiple Choice Question Answering
### Languages
English
## Dataset Structure
* source: The input that is given to the language model (LM).
* target: The ground truth answer to the source.
* rationale: The Chain of Thought (CoT) that explains how the target could be derived from the source.
* task: A category that shows which dataset the source and target was extracted from.
In our paper, we trained the underlying language model to generate in the following format:
```
\{rationale\}
[RESULT]
\{target\}
```
Then during evaluation, we parsed the prediction after the phrase ```[RESULT]```.
### Data Splits
| name | train |
|-------------------|------:|
|CoT-Collection|1837928|
### Citation Information
If you find the following model helpful, please considering citing our paper!
```
@article{kim2023cot,
title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},
author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},
journal={arXiv preprint arXiv:2305.14045},
year={2023}
}
```
提供机构:
pharaouk
原始信息汇总
数据集概述
数据集名称
- 名称: CoT Collection
数据集描述
- 目的: 旨在通过微调使语言模型具备Chain-of-Thought (CoT)能力。
- 规模: 包含1.84百万个CoT,覆盖1060个任务。
- 效果: 微调后,模型在零样本学习和少样本学习中表现更佳。
支持的任务
- 任务数量: 1060个任务
- 任务类型:
- 自然语言推理
- 提取式问答
- 闭卷问答
- 科学
- 毒性分类
- 算术
- 程序执行
- 对话
- 伦理
- 常识推理
- 多选题问答
语言
- 主要语言: 英语
数据集结构
- 数据字段:
- source: 输入给语言模型的数据
- target: 源数据的真实答案
- rationale: 解释如何从源数据推导出目标答案的CoT
- task: 显示源数据和目标数据来自哪个数据集的类别
数据分割
- 分割:
- 训练集: 1837928条数据
引用信息
-
引用文献:
@article{kim2023cot, title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning}, author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon}, journal={arXiv preprint arXiv:2305.14045}, year={2023} }
搜集汇总
数据集介绍

以上内容由遇见数据集搜集并总结生成



