BBH
Source: ModelScope community. Updated: 2025-11-04. Indexed: 2024-08-31.
Download link:
https://modelscope.cn/datasets/OmniData/BBH
Resource overview:
displayName: BBH
license:
- MIT
taskTypes: []
mediaTypes: []
labelTypes: []
tags:
- attrs: null
  id: 11864
  name:
    en: ''
    zh: 文本检索
publisher:
- Stanford University
- Google Research
publishDate: '2022-01-01'
publishUrl: https://github.com/suzgunmirac/BIG-Bench-Hard
paperUrl: https://arxiv.org/pdf/2210.09261.pdf
---
# Dataset Introduction
## Overview
BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?
In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks on which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which are better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.
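The contrast between answer-only few-shot prompting and CoT few-shot prompting can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Exemplar` class, the `build_prompt` helper, the `Q:`/`A:` layout, and the toy arithmetic exemplar are hypothetical and are not taken from the BBH prompt files or the dataset itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot demonstration (hypothetical structure, not the BBH file format)."""
    question: str
    rationale: str  # step-by-step reasoning; only shown in CoT prompts
    answer: str


def build_prompt(exemplars: List[Exemplar], question: str, use_cot: bool) -> str:
    """Assemble a few-shot prompt, with or without chain-of-thought rationales."""
    blocks = []
    for ex in exemplars:
        if use_cot:
            # CoT: the demonstration spells out the reasoning before the answer.
            target = f"Let's think step by step. {ex.rationale} So the answer is {ex.answer}."
        else:
            # Answer-only: the demonstration shows just the final answer.
            target = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {target}")
    # The test question ends with an open answer slot for the model to complete.
    tail = "A: Let's think step by step." if use_cot else "A:"
    blocks.append(f"Q: {question}\n{tail}")
    return "\n\n".join(blocks)


# Toy exemplar invented for this sketch; it is not a BBH task instance.
EXEMPLARS = [
    Exemplar(
        question="Is 7 + 5 greater than 10?",
        rationale="7 + 5 = 12, and 12 is greater than 10.",
        answer="Yes",
    )
]

if __name__ == "__main__":
    print(build_prompt(EXEMPLARS, "Is 3 + 4 greater than 10?", use_cot=False))
    print("-----")
    print(build_prompt(EXEMPLARS, "Is 3 + 4 greater than 10?", use_cot=True))
```

Both prompts use the same exemplars and the same test question; only the CoT variant exposes the intermediate reasoning, which is the difference the abstract credits for the improved results on multi-step tasks.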
## Citation
```
@article{suzgun2022challenging,
title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and Wei, Jason},
journal={arXiv preprint arXiv:2210.09261},
year={2022}
}
```
## Download dataset
:modelscope-code[]{type="git"}
Provider:
maas
Created:
2024-07-01



