pharaouk/CoT-Collection

Source: Hugging Face. Updated: 2024-04-10. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/pharaouk/CoT-Collection
Resource overview:
---
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
language:
- en
size_categories:
- 1M<n<10M
---

# Dataset Card for the CoT Collection

## Dataset Description

- **Homepage:** https://github.com/kaistAI/CoT-Collection
- **Repository:** https://github.com/kaistAI/CoT-Collection
- **Paper:** https://arxiv.org/abs/2305.14045
- **Point of Contact:** seungone@kaist.ac.kr

### Dataset Summary

![plot](./cot_collection.JPG)

The CoT Collection is a dataset designed to induce Chain-of-Thought (CoT) capabilities in language models. While proprietary LLMs excel at generating chains of thought when prompted, smaller LMs lack this capability. Fine-tuning a smaller LM to generate chains of thought could therefore give it this ability. The CoT Collection provides 1.84 million Chain-of-Thought rationales augmented across 1,060 tasks from the Flan Collection.

Experimental results show that fine-tuning on the CoT Collection results in (1) better zero-shot performance and (2) a better base model for few-shot learning.

We also provide a multilingual version of the CoT Collection at this [link](https://huggingface.co/datasets/kaist-ai/Multilingual-CoT-Collection).

### Supported Tasks and Leaderboards

The 1,060 tasks are chosen from the Flan Collection. The categories within the CoT Collection are:

* Natural Language Inference
* Extractive Question Answering
* Closed Book Question Answering
* Science
* Toxic Classification
* Arithmetic
* Program Execution
* Dialogue
* Ethics
* Commonsense Reasoning
* Multiple Choice Question Answering

### Languages

English

## Dataset Structure

* `source`: The input that is given to the language model (LM).
* `target`: The ground-truth answer to the source.
* `rationale`: The Chain of Thought (CoT) that explains how the target could be derived from the source.
* `task`: A category that shows which dataset the source and target were extracted from.
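As a rough illustration of the four fields listed above, the sketch below models one record as a Python `TypedDict` and checks that a dictionary carries every field. The example record, its task name, and the helper `missing_fields` are invented for illustration and are not taken from the dataset.

```python
from typing import TypedDict


class CoTRecord(TypedDict):
    """One row of the CoT Collection, following the field list above."""
    source: str     # input given to the language model
    target: str     # ground-truth answer to the source
    rationale: str  # chain of thought leading from source to target
    task: str       # dataset the source/target pair was extracted from


REQUIRED_FIELDS = ("source", "target", "rationale", "task")


def missing_fields(row: dict) -> list[str]:
    """Return the required field names that are absent from `row`."""
    return [name for name in REQUIRED_FIELDS if name not in row]


# Hypothetical record, invented for illustration only.
example: CoTRecord = {
    "source": "If Tom has 3 apples and buys 2 more, how many apples does he have?",
    "target": "5",
    "rationale": "Tom starts with 3 apples and adds 2, so 3 + 2 = 5.",
    "task": "hypothetical_arithmetic_task",
}

print(missing_fields(example))                  # []
print(missing_fields({"source": "only this"}))  # ['target', 'rationale', 'task']
```

A check like this can be useful before fine-tuning, since a row without a `rationale` cannot be turned into a Chain-of-Thought training example.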

In our paper, we trained the underlying language model to generate output in the following format:

```
{rationale} [RESULT] {target}
```

Then, during evaluation, we parsed the prediction after the phrase `[RESULT]`.

### Data Splits

| name           | train   |
|----------------|--------:|
| CoT-Collection | 1837928 |

### Citation Information

If you find this dataset helpful, please consider citing our paper!

```
@article{kim2023cot,
  title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},
  author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},
  journal={arXiv preprint arXiv:2305.14045},
  year={2023}
}
```
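
The `{rationale} [RESULT] {target}` training format and the evaluation-time parsing described in the card can be sketched as follows. This is a minimal sketch, not the authors' code: the function names `build_training_target` and `parse_prediction` are hypothetical, and two behaviors are assumptions the card does not specify, namely returning `None` when the marker is missing and splitting on the last occurrence of the marker.

```python
from typing import Optional

RESULT_MARKER = "[RESULT]"


def build_training_target(rationale: str, target: str) -> str:
    """Join a rationale and its answer in the '{rationale} [RESULT] {target}' format."""
    return f"{rationale.strip()} {RESULT_MARKER} {target.strip()}"


def parse_prediction(prediction: str) -> Optional[str]:
    """Return the text after the last '[RESULT]' marker, or None if it is absent.

    Assumption: the last marker is used, in case the rationale itself
    happens to contain the marker string.
    """
    _, marker, answer = prediction.rpartition(RESULT_MARKER)
    if not marker:  # rpartition yields an empty separator when nothing matched
        return None
    return answer.strip()


text = build_training_target("Tom starts with 3 apples and adds 2, so 3 + 2 = 5.", "5")
print(text)                                # ends with "[RESULT] 5"
print(parse_prediction(text))              # 5
print(parse_prediction("no marker here"))  # None
```

Returning `None` for a missing marker lets an evaluation loop count unparseable generations separately from wrong answers; whether the paper's evaluation did so is not stated in the card.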
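Provided by:
pharaouk
Summary of Original Information

Dataset Overview

Dataset Name

  • Name: CoT Collection

Dataset Description

  • Purpose: Give language models Chain-of-Thought (CoT) capabilities through fine-tuning.
  • Scale: 1.84 million CoT rationales covering 1,060 tasks.
  • Effect: After fine-tuning, models perform better in zero-shot and few-shot learning.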

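Supported Tasks

  • Number of tasks: 1,060
  • Task types:
    • Natural language inference
    • Extractive question answering
    • Closed-book question answering
    • Science
    • Toxic classification
    • Arithmetic
    • Program execution
    • Dialogue
    • Ethics
    • Commonsense reasoning
    • Multiple-choice question answering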

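Languages

  • Primary language: English

Dataset Structure

  • Data fields:
    • source: The input given to the language model
    • target: The ground-truth answer to the source
    • rationale: The CoT explaining how the target can be derived from the source
    • task: A category showing which dataset the source and target were extracted from

Data Splits

  • Splits:
    • Training set: 1,837,928 examples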

Citation Information

  • Reference:

    @article{kim2023cot,
      title={The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning},
      author={Kim, Seungone and Joo, Se June and Kim, Doyoung and Jang, Joel and Ye, Seonghyeon and Shin, Jamin and Seo, Minjoon},
      journal={arXiv preprint arXiv:2305.14045},
      year={2023}
    }

The content above was collected and summarized by 遇见数据集.