BBH Dataset

Name: BBH Dataset
Creator: Papers with Code
License: No description available

Indexed from paperswithcode.com on 2025-01-15

Download link:

https://paperswithcode.com/dataset/bbh

Resource description:

BIG-Bench Hard (BBH) is a subset of BIG-Bench, a diverse evaluation suite for language models. BBH comprises 23 challenging BIG-Bench tasks that were found to be beyond the capabilities of the language models evaluated at the time: on these tasks, prior language model evaluations did not outperform the average human rater. The tasks require multi-step reasoning, and few-shot prompting without chain-of-thought (CoT), as done in the BIG-Bench evaluations, was found to substantially underestimate the best performance and capabilities of language models. With CoT prompting, PaLM surpassed average human-rater performance on 10 of the 23 tasks, and Codex surpassed it on 17 of the 23 tasks.
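To make the distinction between answer-only few-shot prompting and CoT prompting concrete, the following is a minimal Python sketch. It is not the prompt format used by the BBH authors. The `Exemplar` class, the `build_prompt` helper, the "So the answer is" phrasing, and the toy boolean question are all assumptions invented for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot demonstration (hypothetical structure, not the BBH file format)."""
    question: str
    rationale: str  # step-by-step reasoning, only used in CoT prompts
    answer: str


def build_prompt(exemplars, question, use_cot):
    """Assemble a few-shot prompt, with or without chain-of-thought rationales."""
    parts = []
    for ex in exemplars:
        if use_cot:
            # CoT: show the reasoning before the final answer.
            parts.append(
                f"Q: {ex.question}\nA: {ex.rationale} So the answer is {ex.answer}."
            )
        else:
            # Answer-only: the demonstration contains just the final answer.
            parts.append(f"Q: {ex.question}\nA: {ex.answer}")
    # The target question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Toy exemplar written for this sketch; it is NOT drawn from BBH.
shots = [
    Exemplar(
        question="Is 'not (True and False)' True or False?",
        rationale="'True and False' is False. 'not False' is True.",
        answer="True",
    )
]
target = "Is 'False or not True' True or False?"

answer_only_prompt = build_prompt(shots, target, use_cot=False)
cot_prompt = build_prompt(shots, target, use_cot=True)

print(answer_only_prompt)
print("-----")
print(cot_prompt)
```

Both prompts end with an open `A:` for the model to complete. The only difference is whether each demonstration includes intermediate reasoning, which is the variable the description above credits for the performance gap.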

Provided by:

Papers with Code
