EvolMathEval

Name: EvolMathEval
Creator: Sun Yat-sen University, Fudan University
Published: 2025-08-18 23:24:10
License: Not specified

Source: arXiv (updated 2025-08-18; indexed 2025-11-26)

Download link:

https://github.com/SYSUSELab/EvolMathEval


Description:

EvolMathEval is an automated framework, based on evolutionary testing, for generating and evolving mathematical reasoning benchmarks. Because it dynamically generates unique evaluation instances, it fundamentally eliminates the risk of data contamination and keeps the benchmark challenging for future models. Its core mechanisms are: (1) seed problem generation by reverse engineering, with algebraic guarantees; (2) multi-dimensional genetic operators that inject diverse cognitive challenges; and (3) a composite fitness function that assesses problem difficulty quickly and accurately. Experimental results show that the proposed composite fitness function quantifies the difficulty of mathematical problems effectively and precisely. Beyond generating large numbers of high-difficulty problems through continuous self-iteration, EvolMathEval can also substantially raise the complexity of public datasets such as GSM8K through evolution, reducing model accuracy by an average of 48%. Further analysis shows that, when solving these evolved problems, LLMs tend to use non-rigorous heuristics to bypass complex multi-step logical reasoning, which leads to incorrect solutions. We define this phenomenon as the "pseudo-Aha moment". The finding reveals cognitive shortcut behavior in the deep reasoning of current LLMs, and this phenomenon accounts for 77% to 100% of the incorrect solutions to the targeted problems.
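The description does not say how reverse-engineered seed generation works in detail. One plausible reading, given here only as a minimal sketch and not as the authors' implementation, is to choose the answer first and then build equations around it, so that a unique solution exists by construction. The function names `make_seed_problem` and `solve`, the choice of linear systems, and all parameter ranges below are assumptions made for illustration.

```python
import random
from fractions import Fraction


def solve(coeffs, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination.

    Returns the solution as a list of Fractions, or None if the system
    is singular (no unique solution).
    """
    n = len(coeffs)
    # Augmented matrix with exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(b)]
         for row, b in zip(coeffs, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def make_seed_problem(n_vars=3, coeff_range=5, value_range=10, rng=None):
    """Hypothetical seed generator: pick the answer, then derive equations.

    The right-hand sides are computed from the chosen solution, and
    singular coefficient matrices are rejected, so every returned system
    has exactly one solution, which is known in advance.
    """
    rng = rng or random.Random()
    while True:
        solution = [rng.randint(1, value_range) for _ in range(n_vars)]
        coeffs = [[rng.randint(-coeff_range, coeff_range)
                   for _ in range(n_vars)]
                  for _ in range(n_vars)]
        rhs = [sum(a * x for a, x in zip(row, solution)) for row in coeffs]
        if solve(coeffs, rhs) is not None:  # reject singular systems
            return coeffs, rhs, solution


if __name__ == "__main__":
    coeffs, rhs, solution = make_seed_problem(rng=random.Random(0))
    for row, b in zip(coeffs, rhs):
        print(row, "=", b)
    print("known solution:", solution)
```

Generating from the answer backwards means no solver is needed to label the problem, and the check against `solve` is what would stand in for the "algebraic guarantee" in this sketch.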

Provided by:

Sun Yat-sen University, Fudan University

Created:

2025-08-18
