CriticLeanBench
收藏魔搭社区2025-11-27 更新2025-11-03 收录
下载链接:
https://modelscope.cn/datasets/m-a-p/CriticLeanBench
下载链接
链接失效反馈官方服务:
资源简介:
# CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics
## 🔍 Overview
CriticLeanBench is a comprehensive benchmark designed to evaluate the critical reasoning capabilities of models in translating natural language mathematical statements into formally verified theorem declarations in Lean 4. This benchmark focuses on assessing whether generated Lean 4 code accurately reflects the semantic intent of original mathematical statements, addressing a key gap in existing evaluation frameworks for mathematical autoformalization.
- **Dual Focus**: Integrates both critic capabilities and Lean 4 formal verification, distinguishing itself from existing benchmarks
- **Diverse Coverage**: Includes problems from various mathematical domains and difficulty levels
- **Structured Errors**: Incorrect samples are designed to exhibit common error patterns in mathematical formalization
- **Human-Verified**: All samples undergo rigorous human validation to ensure quality and accuracy
<p align="center">
<img src="images/criticleanbench.png" width="1000">
<br>
<em>Figure 1: An overview for the CriticLeanBench construction.
</em>
</p>
## 📖 Statistics
With 500 carefully curated problem pairs (250 correct and 250 incorrect formalizations), CriticLeanBench provides a rigorous testbed for models tasked with evaluating the correctness of mathematical formalizations. Each problem features relatively lengthy inputs (average 700.94 tokens), challenging models to process and understand complex mathematical reasoning.
CriticLeanBench is unique in its integration of Lean 4 specific evaluation and its focus on mathematical formalization correctness.
<p align="center">
<img src="images/data_statistics.png" width="1000">
<br>
<em>Table 1 : Dataset statistics and comparison of various code benchmark datasets.
</em>
</p>
### Dataset Structure
Each entry in CriticLeanBench contains:
- A natural language mathematical statement
- A corresponding Lean 4 formalization
- A label indicating whether the formalization is semantically correct
- Detailed error annotations for incorrect samples (where applicable)
### Evaluation Metrics
We recommend using the following metrics when evaluating models on CriticLeanBench:
- Accuracy (ACC)
- True Positive Rate (TPR)
- False Positive Rate (FPR)
- True Negative Rate (TNR)
- False Negative Rate (FNR)
## Evaluation of SOTA LLMs
<p align="center">
<img src="images/results.png" width="1000">
<br>
<em>
Table 2:
<strong>Performance on CriticLeanBench.</strong>
The best, the second-best and the third-best scores for each indicator are shown in
<span style="border:1px solid;padding:2px;">box</span>,
<strong>bold</strong> and
<u>underlined</u>, respectively.
</em>
</p>
<p align="center">
<img src="images/scaling_analysis.png" width="1000">
<br>
<em>Figure 2: Scaling Analysis of LLMs on CriticLeanBench. <span style="border:1px solid;padding:4px;">ˆ</span> denoted closed-source LLMs.</em>
</p>
## ⛏️ Usage
### Loading the Dataset
```python
from datasets import load_dataset
dataset = load_dataset("m-a-p/CriticLeanBench")
```
## 📜 License
Apache-2.0 license
## ☕️ Citation
If you use CriticLeanBench in your research, please cite our paper:
```
@misc{peng2025criticleancriticguidedreinforcementlearning,
title={CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization},
author={Zhongyuan Peng and Yifan Yao and Kaijing Ma and Shuyue Guo and Yizhe Li and Yichi Zhang and Chenchen Zhang and Yifan Zhang and Zhouliang Yu and Luming Li and Minghao Liu and Yihang Xia and Jiawei Shen and Yuchen Wu and Yixin Cao and Zhaoxiang Zhang and Wenhao Huang and Jiaheng Liu and Ge Zhang},
year={2025},
eprint={2507.06181},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.06181},
}
```
# CriticLeanBench:面向数学形式化校验器的评测基准
## 🔍 概述
CriticLeanBench是一款综合性基准数据集,旨在评测模型将自然语言数学语句转换为Lean 4中经过形式化验证的定理声明的批判性推理能力。该基准聚焦于评估生成的Lean 4代码是否准确还原了原始数学语句的语义意图,填补了当前数学自动形式化评测框架中的一项关键空白。
- **双任务聚焦**:同时集成校验能力与Lean 4形式化验证功能,区别于现有基准数据集
- **覆盖广泛**:涵盖多个数学领域与不同难度层级的问题
- **结构化错误集**:错误样本被设计为呈现数学形式化过程中常见的错误模式
- **人工校验**:所有样本均经过严格的人工验证,以确保质量与准确性
<p align="center">
<img src="images/criticleanbench.png" width="1000">
<br>
<em>图1:CriticLeanBench构建流程概览。
</em>
</p>
## 📖 统计信息
该基准共包含500组精心筛选的问题对(250组正确形式化结果与250组错误形式化结果),为负责评估数学形式化正确性的模型提供了严谨的测试平台。每个问题的输入文本相对较长(平均700.94个Token),对模型处理与理解复杂数学推理的能力提出了挑战。
CriticLeanBench的独特之处在于其集成了针对Lean 4的专属评测机制,并聚焦于数学形式化的正确性评估。
<p align="center">
<img src="images/data_statistics.png" width="1000">
<br>
<em>表1:数据集统计信息与各类代码基准数据集的对比。
</em>
</p>
### 数据集结构
CriticLeanBench的每个数据条目包含以下内容:
- 自然语言数学语句
- 对应的Lean 4形式化代码
- 用于标注该形式化结果是否符合语义正确性的标签
- 针对错误样本的详细错误标注(如适用)
### 评测指标
我们建议在基于CriticLeanBench评测模型时使用以下指标:
- 准确率(ACC)
- 真正例率(TPR)
- 假正例率(FPR)
- 真反例率(TNR)
- 假反例率(FNR)
## 前沿大语言模型评测结果
<p align="center">
<img src="images/results.png" width="1000">
<br>
<em>
表2:<strong>CriticLeanBench上的模型性能</strong>。各指标的第一、第二、第三名成绩分别用<span style="border:1px solid;padding:2px;">边框标注</span>、<strong>粗体</strong>和<u>下划线</u>标示。
</em>
</p>
<p align="center">
<img src="images/scaling_analysis.png" width="1000">
<br>
<em>图2:大语言模型在CriticLeanBench上的缩放性分析。<span style="border:1px solid;padding:4px;">ˆ</span> 代表闭源大语言模型。</em>
</p>
## ⛏️ 使用方法
### 数据集加载
python
from datasets import load_dataset
dataset = load_dataset("m-a-p/CriticLeanBench")
## 📜 许可证
Apache-2.0许可证
## ☕️ 引用
若您在研究中使用CriticLeanBench,请引用我们的论文:
@misc{peng2025criticleancriticguidedreinforcementlearning,
title={CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization},
author={Zhongyuan Peng and Yifan Yao and Kaijing Ma and Shuyue Guo and Yizhe Li and Yichi Zhang and Chenchen Zhang and Yifan Zhang and Zhouliang Yu and Luming Li and Minghao Liu and Yihang Xia and Jiawei Shen and Yuchen Wu and Yixin Cao and Zhaoxiang Zhang and Wenhao Huang and Jiaheng Liu and Ge Zhang},
year={2025},
eprint={2507.06181},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.06181},
}
提供机构:
maas
创建时间:
2025-08-28



