Libra-Bench

Name: Libra-Bench
Creator: maas
Published: 2025-12-04 16:43:26
License: No description available

ModelScope community. Updated 2025-12-04; indexed 2025-08-02.

Download link:

https://modelscope.cn/datasets/meituan/Libra-Bench

Resource description:

# Libra Bench

## Overview

[Libra Bench](https://arxiv.org/abs/2507.21645) is a reasoning-oriented reward model (RM) benchmark, systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models. It is specifically designed to evaluate pointwise judging accuracy with respect to correctness. These attributes keep Libra Bench well aligned with contemporary research, where reasoning models are primarily assessed and optimized for correctness on complex reasoning tasks.

## Dataset Structure

Libra Bench consists of 3,740 samples and includes the following fields:

- **`index`**: the sample ID
- **`question`**: the mathematical problem
- **`response`**: an LLM-generated response to the problem
- **`label`**: a binary value indicating whether the response is correct
- **`model`**: the generator of the response
- **`reference`**: the reference answer to the problem
- **`subset`**: the source of the problem
- **`response_with_cot`**: the full version of the response, including the chain-of-thought (CoT) content

## Usage

Run the reward model to judge the correctness of each `response` given its `question`. Accuracy is computed separately for each subset, and the per-subset accuracies are then averaged to obtain the final score.
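The scoring rule described under Usage (per-subset accuracy, then an unweighted mean over subsets) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `libra_bench_score` is hypothetical, the reward model's verdicts are assumed to be already collected into a `predictions` list, and `label` is assumed to be encoded as `0`/`1` or a boolean.

```python
from collections import defaultdict


def libra_bench_score(samples, predictions):
    """Return (per-subset accuracy, unweighted mean over subsets).

    samples:     iterable of dicts with at least `subset` and `label` fields.
    predictions: one truthy/falsy verdict per sample, in the same order
                 (assumed to come from the reward model under test).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        subset = sample["subset"]
        total[subset] += 1
        # Assumption: label is 0/1 or bool, so bool() normalizes both sides.
        if bool(pred) == bool(sample["label"]):
            correct[subset] += 1

    per_subset = {name: correct[name] / total[name] for name in total}
    # Each subset counts equally, regardless of how many samples it holds.
    final_score = sum(per_subset.values()) / len(per_subset)
    return per_subset, final_score


# Toy data; the subset names "A" and "B" are placeholders.
toy_samples = [
    {"subset": "A", "label": 1},
    {"subset": "A", "label": 0},
    {"subset": "B", "label": 1},
]
toy_predictions = [1, 1, 1]  # the second verdict is wrong

per_subset, final_score = libra_bench_score(toy_samples, toy_predictions)
print(per_subset, final_score)
```

Because subsets are weighted equally, the result can differ from plain accuracy over all samples: in the toy data above, the subset-averaged score is 0.75, while pooled accuracy would be 2/3.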

## Models Used

- [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
- [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)
- [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)
- [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)

## Citation

```
@misc{zhou2025libraassessingimprovingreward,
  title={Libra: Assessing and Improving Reward Model by Learning to Think},
  author={Meng Zhou and Bei Li and Jiahao Liu and Xiaowen Shi and Yang Bai and Rongxiang Weng and Jingang Wang and Xunliang Cai},
  year={2025},
  eprint={2507.21645},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.21645},
}
```

Provider:

maas

Created:

2025-07-30
