meituan-longcat/AMO-Bench
Source: Hugging Face · Updated 2026-02-05 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/meituan-longcat/AMO-Bench
Description:
**AMO-Bench** is a curated test set designed to benchmark advanced mathematical reasoning. Its creation was guided by the following core features to ensure a fair and challenging evaluation:
- **Original Problems**: To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.
- **Guaranteed Difficulty**: Each problem has undergone rigorous cross-validation by multiple experts to ensure that it at least meets the difficulty standard of the IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.
- **Final-Answer Based Grading**: Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.
- **Human-Annotated Reasoning Paths**: In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.
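
To make the final-answer grading idea concrete, here is a minimal sketch of what a parser-based check for numeric answers might look like. It is not the official AMO-Bench grader. The `\boxed{...}` answer convention, the function names (`extract_final_answer`, `parse_number`, `grade`), and the choice to return `None` for answers the parser cannot handle (so that they can be routed to an LLM-based grader) are all assumptions made for illustration.

```python
import re
from fractions import Fraction
from typing import Optional

# Hypothetical convention: the model writes its final answer inside \boxed{...}.
_BOXED = r"\boxed{"
# Matches \frac{a}{b} or \dfrac{a}{b} with an optional leading minus sign.
_FRAC = re.compile(r"^(-?)\\d?frac\{(\d+)\}\{(\d+)\}$")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None."""
    start = response.rfind(_BOXED)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in response[start + len(_BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def parse_number(text: str) -> Optional[Fraction]:
    """Parse an integer, decimal, 'a/b', or \\frac{a}{b} into an exact Fraction."""
    cleaned = text.strip()
    match = _FRAC.match(cleaned)
    if match:
        sign, num, den = match.groups()
        cleaned = f"{sign}{num}/{den}"
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def grade(response: str, reference: str) -> Optional[bool]:
    """True/False for a parseable numeric comparison, None if the parser cannot decide."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return False  # no final answer was given
    p, r = parse_number(predicted), parse_number(reference)
    if p is None or r is None:
        return None  # non-numeric answer type: defer to another grader
    return p == r
```

In this sketch, exact rational arithmetic lets `0.75` and `\frac{3}{4}` compare as equal, while symbolic answers such as `x^2+1` fall through with `None` instead of being marked wrong. That fallback is one simple way to express the cost versus generalizability trade-off described above.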
Provider:
meituan-longcat



