Think-Bench
ModelScope community · Updated 2026-01-05 · Indexed 2025-07-19
Download link:
https://modelscope.cn/datasets/zhiyuan218/Think-Bench
# THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
Official repository for "THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models".
For more details, please refer to the project page with dataset exploration and visualization tools.
[[Paper](https://arxiv.org/abs/2505.22113)] [[Github](https://github.com/ZhiyuanLi218/Think-Bench)] [[Huggingface Dataset](https://huggingface.co/datasets/zhiyuan218/Think-Bench)] [[Visualization](https://www.modelscope.cn/datasets/zhiyuan218/Think-Bench/dataPeview)]
## 👀 About Think-Bench
Reasoning models have made remarkable progress on complex tasks, outperforming traditional large language models. However, overthinking is prevalent: models generate excessive, redundant tokens that contribute little to answer accuracy, especially on simple tasks. This severely limits computational efficiency and wastes significant resources.
<p align="center">
<img src="https://raw.githubusercontent.com/ZhiyuanLi218/Think-Bench/main/image/pipeline.png" width="90%"> <br>
</p>
To address this issue systematically, we introduce Think-Bench, a benchmark designed to evaluate the thinking efficiency of large reasoning models (LRMs). We propose a new efficiency metric and conduct a comprehensive analysis of LRMs from multiple aspects, including the reasoning process and chain-of-thought (CoT) characteristics.
<p align="center">
<img src="https://raw.githubusercontent.com/ZhiyuanLi218/Think-Bench/main/image/dataset overview.png" width="90%"> <br>
</p>
Using the Think-Bench benchmark and a novel evaluation strategy, we analyze large reasoning models (LRMs) and uncover several key insights: (1) Most LRMs tend to **overthink on simple tasks**, generating unnecessarily long reasoning chains, while they are more efficient on hard problems; (2) **There is a significant trade-off between efficiency and CoT quality across models.** Grok-3-mini-beta achieves the highest efficiency score, while models such as Qwen3-235b-a22b and Ernie-x1-turbo-32k stand out in CoT quality; (3) **Models behave heterogeneously across disciplines.** Mathematical tasks generally have high token consumption and low reasoning efficiency, while chemistry and physics tasks show higher reasoning efficiency and a lower token occupancy rate. We hope Think-Bench serves as an important benchmark for optimizing the performance of large reasoning models in the future.
<p align="center">
<img src="https://raw.githubusercontent.com/ZhiyuanLi218/Think-Bench/main/image/radar.png" width="60%"> <br>
</p>
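To make the notion of overthinking concrete, here is a minimal sketch of a token-level proxy. It is not the efficiency metric proposed in the paper, which is defined there and not reproduced in this README. The function name `tokens_before_answer_ratio`, the whitespace tokenization, and the exact-match check are all assumptions made for illustration only.

```python
def tokens_before_answer_ratio(reasoning_tokens, answer):
    """Hypothetical proxy (NOT the paper's metric).

    Returns the share of reasoning tokens produced up to and including
    the first token equal to `answer`, or None if it never appears.
    A low value means most tokens came after the answer was reached.
    """
    if not reasoning_tokens:
        raise ValueError("reasoning_tokens must be non-empty")
    for index, token in enumerate(reasoning_tokens):
        if token == answer:
            return (index + 1) / len(reasoning_tokens)
    return None


# Toy chain of thought: the answer "4" first appears at token 5 of 14,
# and the remaining tokens only re-check it.
chain = "2 + 2 = 4 . wait , let me check again : 4".split()
ratio = tokens_before_answer_ratio(chain, "4")
print(f"share of tokens needed to reach the answer: {ratio:.3f}")
```

Under this proxy, a model that keeps verifying an answer it has already produced scores low, which matches the overthinking behavior described above for simple tasks.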
## 📚 Dataset
We release the Think-Bench dataset on [Hugging Face](https://huggingface.co/datasets/zhiyuan218/Think-Bench) and [ModelScope](https://www.modelscope.cn/datasets/zhiyuan218/Think-Bench).
### Data Usage
You can load the dataset from [🤗 Hugging Face](https://huggingface.co/datasets/zhiyuan218/Think-Bench) with the following code (make sure you have installed the [related packages](https://huggingface.co/docs/datasets/quickstart)):
```python
from datasets import load_dataset
dataset = load_dataset("zhiyuan218/Think-Bench")
```
Alternatively, you can load the dataset from [ModelScope](https://www.modelscope.cn/datasets/zhiyuan218/Think-Bench) with the following code (make sure you have installed the [related packages](https://www.modelscope.cn/docs/intro/quickstart)):
```python
from modelscope.msdatasets import MsDataset
dataset = MsDataset.load('zhiyuan218/Think-Bench')
```
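Once loaded, it can help to check which splits and columns are present before writing evaluation code. The sketch below assumes the loaded object behaves like a Hugging Face `DatasetDict`, that is, a mapping from split name to an indexable sequence of row dictionaries. This README does not document the split or column names, so the helper `describe_splits` and the toy fields `question` and `answer` are hypothetical stand-ins.

```python
def describe_splits(dataset):
    """Summarize a split-name -> rows mapping.

    Returns {split_name: (num_rows, sorted column names of the first row)}.
    Empty splits are reported with an empty column list.
    """
    summary = {}
    for name, split in dataset.items():
        num_rows = len(split)
        columns = sorted(split[0].keys()) if num_rows else []
        summary[name] = (num_rows, columns)
    return summary


# Stand-in data with hypothetical field names, so the sketch runs offline.
toy_dataset = {"test": [{"question": "1 + 1 = ?", "answer": "2"}], "empty": []}
for split_name, (num_rows, columns) in describe_splits(toy_dataset).items():
    print(split_name, num_rows, columns)

# With the real data (requires network access and the `datasets` package):
# from datasets import load_dataset
# print(describe_splits(load_dataset("zhiyuan218/Think-Bench")))
```

Inspecting the structure first avoids hard-coding field names that may differ from what the hosted dataset actually provides.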
## Citation
If you find **Think-Bench** useful for your research and applications, please cite it using this BibTeX entry:
```bibtex
@misc{li2025thinkbench,
title={THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models},
author={Zhiyuan Li and Yi Chang and Yuan Wu},
year={2025},
eprint={2505.22113},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.22113},
}
```
## Acknowledgements
This project draws on the following repositories:
- [MME-CoT](https://github.com/MME-Benchmarks/MME-CoT)
- [evalscope](https://github.com/modelscope/evalscope)
Provider: maas
Created: 2025-05-14



