DeltaBench
Source: ModelScope community (updated 2025-12-05, indexed 2025-06-14)
Download link: https://modelscope.cn/datasets/OpenStellarTeam/DeltaBench
# Overview
<p align="center">
🌐 <a href="https://openstellarteam.github.io/DeltaBench" target="_blank">Website</a> • 🤗 <a href="https://huggingface.co/datasets/OpenStellarTeam/DeltaBench" target="_blank">Hugging Face</a> • ⏬ <a href="https://huggingface.co/datasets/OpenStellarTeam/DeltaBench/resolve/main/Deltabench_v1.jsonl?download=true" target="_blank">Data</a> • 📃 <a href="https://arxiv.org/abs/2502.19361" target="_blank">Paper</a> • 🖥️ <a href="https://github.com/OpenStellarTeam/DeltaBench" target="_blank">Github</a>
</p>
## 💥 DeltaBench
**DeltaBench** is the first dataset to analyze the quality of long CoTs generated by o1-like models and to evaluate how well existing critic models and process reward models (PRMs) **D**etect **E**rrors in **L**ong Co**T** Re**A**soning. DeltaBench comprises 1,236 samples across diverse domains: **Math**, **Programming**, **PCB** (physics, chemistry and biology), and **General Reasoning**. Each sample contains a problem, its corresponding long CoT solution, and comprehensive human annotations.
Please visit our [website](https://openstellarteam.github.io/DeltaBench) or check our [paper](https://arxiv.org/abs/2502.19361) for more details.
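
The Data link above points to a single `Deltabench_v1.jsonl` file. The following is a minimal sketch of reading it, assuming it is standard JSON Lines (one JSON object per line), as the extension suggests. The helper name `read_jsonl` and the local file path are illustrative, and since the record fields are not documented in this README, the sketch only lists the keys it finds rather than assuming any.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def read_jsonl(path: str | Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Assumption: the file from the Data link was saved locally under this name.
    local_file = Path("Deltabench_v1.jsonl")
    if local_file.exists():
        samples = list(read_jsonl(local_file))
        print(f"{len(samples)} samples loaded")
        if samples:
            # Inspect the real schema instead of guessing field names.
            print("fields:", sorted(samples[0].keys()))
```

Reading line by line keeps memory use low if only a subset of samples is needed; wrapping the generator in `list(...)` is only for the count shown here.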
## 🆕 News
- **[Soon]** We plan to release more of our labeled datasets, which will be available for training and research purposes. **Stay tuned** 🔥🔥🔥
- **\[2025.03.05\]** We have released the DeltaBench dataset on 🤗 [Hugging Face](https://huggingface.co/datasets/OpenStellarTeam/DeltaBench) 🚀🚀🚀.
## 💫 Introduction
We introduce DeltaBench, the first dataset to analyze the quality of long CoTs generated by o1-like models and to evaluate how well existing critic models and process reward models (PRMs) **D**etect **E**rrors in **L**ong Co**T** Re**A**soning. In DeltaBench, we first collect a diverse set of long CoTs generated by various o1-like models (i.e., QwQ, DeepSeek-R1, and Gemini-2.0 Flash Thinking) across different reasoning tasks such as **Math**, **Programming**, **PCB** (physics, chemistry and biology), and **General Reasoning**.
Then, we divide each long CoT into sections, where each section denotes an independent subtask.
Finally, each section is annotated with the following tags:
* 1️⃣**Strategy Shift:** whether this section introduces a new method or strategy attempt. If a new strategy is introduced, the specific step is annotated.
* 2️⃣**Reasoning Usefulness:** whether the reasoning in this section is useful.
* 3️⃣**Reasoning Correctness:** whether this section contains any errors. If an error is present, additional error-related fields are annotated: the number of the first step at which the error occurs, an explanation, and a correction.
* 4️⃣**Reflection Efficiency:** whether this section contains reflection and whether the reflection is correct. If reflection is present, the step at which the reflection begins is annotated.
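
The four tags above can be pictured as one record per section. The sketch below is only an illustration of that shape: every field name is hypothetical and is not taken from the released file, so check the actual JSONL keys before relying on any of them. The helper `first_error_section` shows one way such annotations could be queried.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class SectionAnnotation:
    """One annotated section of a long CoT. All field names are hypothetical."""

    # Strategy Shift: does the section try a new method, and at which step?
    strategy_shift: bool = False
    strategy_shift_step: Optional[int] = None
    # Reasoning Usefulness
    reasoning_useful: bool = True
    # Reasoning Correctness, with the extra fields annotated when an error exists
    has_error: bool = False
    first_error_step: Optional[int] = None
    error_explanation: Optional[str] = None
    error_correction: Optional[str] = None
    # Reflection Efficiency
    has_reflection: bool = False
    reflection_correct: Optional[bool] = None
    reflection_start_step: Optional[int] = None


def first_error_section(sections: list[SectionAnnotation]) -> Optional[int]:
    """Return the 0-based index of the first section marked erroneous, or None."""
    for index, section in enumerate(sections):
        if section.has_error:
            return index
    return None


# Toy data, invented for illustration only.
example = [
    SectionAnnotation(strategy_shift=True, strategy_shift_step=1),
    SectionAnnotation(
        has_error=True,
        first_error_step=7,
        error_explanation="sign error when expanding the square",
        error_correction="(a - b)^2 = a^2 - 2ab + b^2",
    ),
]
print(first_error_section(example))  # prints 1
```

Keeping the error-related fields optional mirrors the annotation rule that they are filled in only when a section actually contains an error.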
## 📊 Leaderboard
See the [📊 leaderboard](https://github.com/OpenStellarTeam/DeltaBench) on GitHub.
## 📜 Citation
Please cite our paper if you use our dataset.
```
@misc{he2025largelanguagemodelsdetect,
title={Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?},
author={Yancheng He and Shilong Li and Jiaheng Liu and Weixun Wang and Xingyuan Bu and Ge Zhang and Zhongyuan Peng and Zhaoxiang Zhang and Zhicheng Zheng and Wenbo Su and Bo Zheng},
year={2025},
eprint={2502.19361},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.19361},
}
```
Provider: maas
Created: 2025-03-19



