AMO-Bench

Name: AMO-Bench
Creator: maas
Published: 2026-01-07 03:41:21
License: No description provided

ModelScope community · Updated 2026-01-07 · Indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/meituan-longcat/AMO-Bench


Resource description:

# 📐 AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

- 📄 [Paper](https://huggingface.co/papers/2510.26768)
- 🌐 [Project Page](https://amo-bench.github.io/)
- 💻 [GitHub Repo](https://github.com/meituan-longcat/AMO-Bench)

## Updates

- 2025.12.01: We have added a [Token Efficiency](#-token-efficiency) section showing the number of output tokens used by the models on the leaderboard. **[Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves the highest token efficiency among top-performing models!**
- 2025.11.24: **[Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves 63.1%, setting a new SOTA and breaking 60% for the first time!** We have updated the [Leaderboard](#-leaderboard) with the results of Gemini 3 Pro and [Qwen3-Max-Thinking (Preview)](https://qwen.ai/blog?id=qwen3-max).
- 2025.11.19: [Kimi-K2-Thinking](https://moonshotai.github.io/Kimi-K2/thinking.html) achieves 56.0%, a new SOTA on the [Leaderboard](#-leaderboard)!
- 2025.11.05: The problem statement of **Problem 35** has been revised: (1) the five integers that sum to $k$ should be **non-negative** rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this [discussion](https://huggingface.co/datasets/meituan-longcat/AMO-Bench/discussions/4) for details). Thanks to [@applesilicon](https://huggingface.co/datasets/meituan-longcat/AMO-Bench/discussions/4) for the feedback!
- 2025.10.31: We released the dataset, evaluation code, and technical report of AMO-Bench.

## 📊 Leaderboard

<img src="https://github.com/meituan-longcat/AMO-Bench/raw/main/figures/leaderboard.png" width="800">

## 📈 Token Efficiency

<img src="https://github.com/meituan-longcat/AMO-Bench/raw/main/figures/acc_vs_len.png" width="800">

## 📝 Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models.

<img src="https://github.com/meituan-longcat/AMO-Bench/raw/main/figures/cover_fig.png" width="800">

## ⭐ Key Features

<img src="https://github.com/meituan-longcat/AMO-Bench/raw/main/figures/pipeline.png" width="800">

- **Original problems.** To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.
- **Guaranteed difficulty.** Each problem has undergone rigorous cross-validation by multiple experts to ensure it meets at least the difficulty standards of IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.
- **Final-answer based grading.** Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.
- **Human-annotated reasoning paths.** In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.

## 📖 Dataset Description

**AMO-Bench** is a curated test set designed to benchmark advanced mathematical reasoning. Its creation was guided by the following core features to ensure a fair and challenging evaluation:

- **Original Problems**: To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.
- **Guaranteed Difficulty**: Each problem has undergone rigorous cross-validation by multiple experts to ensure it meets at least the difficulty standards of IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.
- **Final-Answer Based Grading**: Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.
- **Human-Annotated Reasoning Paths**: In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.

## 📊 Data Fields

Each entry in the dataset consists of five fields:

- `question_id`: (`int64`) - The ID of the mathematical problem.
- `prompt`: (`string`) - The full statement of the mathematical problem, formatted in Markdown with LaTeX support for mathematical expressions.
- `solution`: (`string`) - A detailed reasoning path written by human experts.
- `answer`: (`string`) - The final answer to the problem.
- `answer_type`: (`string`) - The type of the final answer; there are four main types.

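The field list above can be turned into a quick sanity check on a record. The following is a minimal sketch, not part of the AMO-Bench repository: the helper name `validate_entry` is hypothetical, and it assumes an entry has already been loaded into a plain Python `dict` (so `int64` appears as `int` and `string` as `str`).

```python
from typing import Any

# Field names and types as listed in the Data Fields section.
# Assumption: int64 maps to Python int and string maps to Python str.
EXPECTED_FIELDS: dict[str, type] = {
    "question_id": int,
    "prompt": str,
    "solution": str,
    "answer": str,
    "answer_type": str,
}


def validate_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one entry (empty if it looks well-formed)."""
    problems: list[str] = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # Report any keys that are not among the five documented fields.
    extra = set(entry) - set(EXPECTED_FIELDS)
    problems.extend(f"unexpected field: {name}" for name in sorted(extra))
    return problems
```

Returning a list of messages rather than raising on the first mismatch makes it easier to report every issue in a record at once.
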
Here is an example of a data instance:

```json
{
  "question_id": 4,
  "prompt": "Let $x_1,x_2,\\ldots,x_{2025}$ be real numbers in the interval $[0,1]$, and let $f(x)$ be a real-valued function ...",
  "solution": "Let\n$$\nm=\\max_{x_1,x_2,\\dots,x_{2025}}\\left|\\sum_{k=1}^{2025}f(x_k)-\\left(\\sum_{k=1}^{2025}x_k\\right)^2\\right|.\n$$\nWe have ...",
  "answer": "\\boxed{512578}",
  "answer_type": "number"
}
```

## ✍️ Dataset Construction

**AMO-Bench** is built with a comprehensive multi-stage construction pipeline that covers the entire process from question creation to final inclusion. This pipeline comprises four major stages:

**(1) Data creation.** All problems are independently designed by mathematics experts from top universities and educational institutions. Beyond the final answer, each problem author must provide a detailed step-by-step solution.

**(2) Quality review.** Each candidate problem undergoes blind review by at least three experts to assess its quality.

**(3) Originality review.** The originality review stage aims to ensure that these newly created problems are not mere rewrites of publicly available materials, but demonstrate genuine originality.

**(4) Difficulty review.** We implement a difficulty review stage to filter out problems lacking adequate complexity, to ensure that AMO-Bench presents a sufficient challenge to state-of-the-art LLMs.

## 🚀 Sample Usage (and Evaluation)

This section outlines how to quickly get started with the AMO-Bench dataset and evaluate model performance. For more detailed guidelines, please refer to the [GitHub repository](https://github.com/meituan-longcat/AMO-Bench).

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/meituan-longcat/AMO-Bench.git
   cd AMO-Bench
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

### Running evaluations

#### Step 1: Format Model Response File

After obtaining model responses, format them as follows (one JSON object per line):

```data
{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...
```

Save this file in the `./model_responses/` directory.

#### Step 2: Grading Responses

Set your API key and URL in lines 13-14 of `utils.py`. Then run:

```bash
python grading.py --response_file example.jsonl
```

Evaluation results will be saved under the `./grading_results/` directory.

#### Step 3 (Optional): Grade on AMO-Bench-P Subset

For a quick evaluation using only the parser-based subset (39 problems), run:

```bash
python grading.py --response_file example.jsonl --only_parser True
```

## Discussions and Feedback

Here we summarize the discussions and feedback on AMO-Bench from the open-source community. We will regularly update the dataset to address urgent data issues. We welcome any feedback you may have!

- Problem 26 appears to be effectively the same as an existing contest problem. Thanks to [@applesilicon](https://huggingface.co/datasets/meituan-longcat/AMO-Bench/discussions/3) for pointing this out!
- The problem statement for Problem 35 should be further clarified: (1) the five integers that sum to $k$ should be **non-negative** rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this [discussion](https://huggingface.co/datasets/meituan-longcat/AMO-Bench/discussions/4) for details). Thanks to [@applesilicon](https://huggingface.co/datasets/meituan-longcat/AMO-Bench/discussions/4) for the suggestions!
- Four problems involve complex numerical expressions (Problems 12, 13, 15, and 21).
  When tackling these problems, LLMs may struggle to perform accurate calculations without calling external tools. Thanks to [@prnake](https://github.com/meituan-longcat/AMO-Bench/issues/1) for the feedback!
- Problems 38 and 39 appear to be similar in content to two arXiv papers [[1]](https://arxiv.org/pdf/2508.19413) [[2]](https://arxiv.org/pdf/2508.03927).

## 📎 Citation

If you use the AMO-Bench dataset in your research or work, please consider citing it:

```bibtex
@misc{an2025amobench,
  title={AMO-Bench: Large Language Models Still Struggle in High School Math Competitions},
  author={Shengnan An and Xunliang Cai and Xuezhi Cao and Xiaoyu Li and Yehao Lin and Junlin Liu and Xinxuan Lv and Dan Ma and Xuanlin Wang and Ziwen Wang and Shuang Zhou},
  year={2025},
  eprint={2510.26768},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.26768},
}
```

## ✅ License

AMO-Bench is released under the [MIT LICENSE](LICENSE).

## 🤝 Acknowledgement

The evaluation script utilizes [Math-Verify](https://github.com/huggingface/Math-Verify) to parse and verify model outputs. We greatly appreciate the contributors' efforts in providing this valuable tool.

## 📩 Support

For questions and support, please open an issue on GitHub or contact the maintainers.
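Returning to Step 1 of the evaluation workflow described in this README, the response file is one JSON object per line with the keys `question_id` and `model_response`. The sketch below shows one way to produce such a file from Python. The helper name `write_response_file` and the in-memory `responses` mapping are assumptions made for illustration; they are not part of the AMO-Bench repository.

```python
import json
from pathlib import Path


def write_response_file(responses: dict[int, str], path: str) -> Path:
    """Write model responses as JSON Lines: one object per line.

    `responses` maps a question ID to the raw model output (hypothetical input shape).
    """
    out = Path(path)
    # Create the target directory (e.g. ./model_responses/) if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for question_id in sorted(responses):
            record = {
                "question_id": question_id,
                "model_response": responses[question_id],
            }
            # ensure_ascii=False keeps any non-ASCII math symbols readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out
```

For example, `write_response_file({1: "...", 2: "..."}, "./model_responses/example.jsonl")` would write a two-line file in the directory named in Step 1.
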


Provider: maas

Created: 2025-11-03
