SKYLENAGE-ReasoningMATH

Source: ModelScope (魔搭社区). Updated 2026-05-12; indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/Alibaba-DT/SKYLENAGE-ReasoningMATH
Resource description:

[![GITHUB Code](https://img.shields.io/badge/GITHUB-💻-black.svg)](https://github.com/alibaba/SKYLENAGE-ReasoningMath) [![Dataset-HF](https://img.shields.io/badge/Data-Huggingface-orange.svg)](https://huggingface.co/datasets/alibabagroup/SKYLENAGE-ReasoningMath) [![Paper](https://img.shields.io/badge/Paper-📄-b31b1b.svg)](https://arxiv.org/abs/2510.01241)

# SKYLENAGE-ReasoningMath

## I. Benchmark Introduction

SKYLENAGE-ReasoningMath (Reasoning Math Evaluation Benchmark) covers multi-level reasoning tasks ranging from basic arithmetic to advanced mathematics, spanning logical deduction, algebraic transformation, geometric analysis, probability and statistics, and other dimensions. It aims to evaluate a model's ability to understand, reason about, and solve structured mathematical problems. The dataset is designed around graded difficulty and cognitive levels, which supports fine-grained capability assessment.

## Dataset Introduction

- `ReasoningMath.jsonl` is the data file. It contains 100 open-source questions, each with a serial number (`id`), difficulty (`difficulty`), subject classification (`subject`), question (`question`), and final answer (`final_answer`).
- `Selected Problem Appendix.xlsx` is an appendix of solutions for 30 open-source questions, each with a serial number, a standard chain-of-thought analysis, and a standard problem-solving analysis.
- `SKYLENAGE Technical Report.pdf` is the technical report.

## II. Benchmark Features

The core goal of this benchmark is to break through the limitations of current mainstream evaluations, which "prioritize computation over logic" and "prioritize results over process". It focuses on three types of high-order reasoning tasks that are underrepresented in general training corpora and difficult to complete through memorization or pattern matching: intuitive logic, number theory and combinatorics, and spatial reasoning.
These three abilities not only challenge a model's abstract thinking, but also sidestep typical problems in existing evaluations.

1. To address data contamination and memorized answers, we deliberately avoid high-frequency question types and select task types that are rare or even absent in internet text. Intuitive logic tasks involve non-standard multi-role, multi-constraint reasoning with diverse phrasings and no fixed solution path, which makes them hard to capture fully in training data.
2. Number theory and combinatorics problems emphasize constructive thinking and mathematical intuition rather than formula application, significantly reducing the chance that a model answers directly because it has "seen a similar question".
3. Spatial reasoning tasks rely on internal modeling of 2D/3D structures. The problems are presented as text fragments that the large language model must solve.

## III. Leaderboard

Evaluation leaderboard:

| Model Name           | Accuracy (%) |
|----------------------|--------------|
| GPT-5-20250807       | 81           |
| Qwen3-235B-A22B-2507 | 79           |
| Grok-4-0709          | 75           |
| GPT-oss-120b         | 69           |
| Gemini2.5-Pro-0617   | 69           |
| GPT-5 mini           | 68           |
| DeepSeek-V3.1        | 68           |
| DeepSeek-R1-0528     | 67           |
| GPT-5-Chat-0807      | 65           |
| DeepSeek-V3-0324     | 64           |
| Gemini2.5-flash-0617 | 63           |
| Kimi-k2-Turbo        | 60           |
| GLM-4.5              | 56           |
| Llama 4 Maverick     | 45           |
| Ernie-4.5-424B-A47B  | 42           |

# SKYLENAGE-Math

## I. Benchmark Introduction

SKYLENAGE-Math (Mathematics Competition Innovation Evaluation Benchmark) is a closed-source, systematically designed, tiered set of original problems for assessing mathematical ability. It covers four academic stages: high school, undergraduate, master's, and doctoral. Each stage contributes representative problems, for a total of 150.
The benchmark integrates the core knowledge points of each stage, draws on the problem-setting approach of high-level mathematics competitions, and balances consolidation of fundamentals with more demanding reasoning.

## II. Benchmark Features

- High school stage: Focuses on core competition modules such as algebra, geometry, number theory, and combinatorics, and emphasizes logical reasoning and problem transformation. Question difficulty covers the level of the National High School Mathematics League.
- Undergraduate stage: Covers main courses such as mathematical analysis, advanced algebra, and probability and statistics, draws on difficult competition and postgraduate entrance examination problems, and emphasizes theoretical depth and comprehensive application.
- Master's stage: Goes deeper into fields such as real analysis, abstract algebra, and topology, and introduces research-oriented problems and open-ended questions, in line with graduate qualifying examinations and academic competitions.
- Doctoral stage: Focuses on frontier directions such as advanced problems in algebraic geometry, representation theory, and number theory. The questions are intended to be suggestive of research.

## III. Leaderboard

Evaluation leaderboard:

| Model Name                      | Accuracy (%) |
|---------------------------------|--------------|
| GPT-5-20250807                  | 44.0         |
| Grok-4-0709                     | 37.3         |
| Qwen3-235B-A22B-Thinking-2507 * | 31.3         |
| Gemini2.5-pro-0617              | 28.7         |
| GPT-5 mini                      | 28.7         |
| GPT-oss-120b *                  | 23.3         |
| Gemini2.5-flash-0617            | 22.7         |
| DeepSeek-R1-0528 *              | 22.0         |
| DeepSeek-V3.1 *                 | 18.7         |
| GPT-5-Chat-0807                 | 16.0         |
| Kimi-K2-Turbo *                 | 13.3         |
| DeepSeek-V3-0324 *              | 13.3         |
| GLM-4.5 *                       | 12.7         |
| Ernie-4.5-424B-A47B *           | 11.3         |
| Llama 4 Maverick *              | 10.7         |
| Average                         | 22.3         |

Note: * indicates open-source models.
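As a quick sanity check on the SKYLENAGE-Math leaderboard, the sketch below recomputes the "Average" row from the fifteen listed accuracies. It also tests one assumption that the source does not state: that each accuracy is a whole number of correct answers out of the 150 questions, rounded to one decimal place. The names `SCORES`, `mean_accuracy`, and `implied_correct` are illustrative only.

```python
# Accuracy values (percent) copied from the SKYLENAGE-Math leaderboard.
SCORES = {
    "GPT-5-20250807": 44.0,
    "Grok-4-0709": 37.3,
    "Qwen3-235B-A22B-Thinking-2507": 31.3,
    "Gemini2.5-pro-0617": 28.7,
    "GPT-5 mini": 28.7,
    "GPT-oss-120b": 23.3,
    "Gemini2.5-flash-0617": 22.7,
    "DeepSeek-R1-0528": 22.0,
    "DeepSeek-V3.1": 18.7,
    "GPT-5-Chat-0807": 16.0,
    "Kimi-K2-Turbo": 13.3,
    "DeepSeek-V3-0324": 13.3,
    "GLM-4.5": 12.7,
    "Ernie-4.5-424B-A47B": 11.3,
    "Llama 4 Maverick": 10.7,
}

TOTAL_QUESTIONS = 150  # stated size of SKYLENAGE-Math


def mean_accuracy(scores: dict) -> float:
    """Unweighted mean of the listed accuracies, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


def implied_correct(accuracy: float, total: int = TOTAL_QUESTIONS) -> int:
    """Number of correct answers implied by a percentage.

    Assumption (not stated in the source): accuracy = correct / total,
    rounded to one decimal place.
    """
    return round(accuracy * total / 100)


if __name__ == "__main__":
    print("mean accuracy:", mean_accuracy(SCORES))
    for model, acc in SCORES.items():
        print(f"{model}: about {implied_correct(acc)} of {TOTAL_QUESTIONS}")
```

Under that assumption, every listed percentage round-trips exactly (for example, 44.0 corresponds to 66 of 150), and the unweighted mean reproduces the reported 22.3.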
## Contact Us

For more details, please visit the official website of the Xiaotianhengyu (晓天衡宇) Large Model Evaluation Platform: https://skylenage.alibaba-inc.com/sla/home

Contact: skylenage@service.alibaba.com
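For readers who want to work with `ReasoningMath.jsonl`, the following is a minimal loading sketch, not an official loader. It relies only on the five field names given in the Dataset Introduction (`id`, `difficulty`, `subject`, `question`, `final_answer`). The helper names and the sample rows are hypothetical, and the value types of `difficulty` and `subject` are not specified in the source, so the sketch treats them as opaque labels.

```python
import json
from collections import Counter
from typing import Iterable

# Field names as listed in the dataset description.
EXPECTED_FIELDS = ("id", "difficulty", "subject", "question", "final_answer")


def parse_jsonl(lines: Iterable[str]) -> list:
    """Parse JSON Lines text, skipping blank lines and checking the documented fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [f for f in EXPECTED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def load_problems(path: str) -> list:
    """Read a local copy of the data file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as fh:
        return parse_jsonl(fh)


def count_by(records: list, field: str) -> Counter:
    """Count records per value of one field, e.g. per difficulty or subject."""
    return Counter(str(r[field]) for r in records)


# Hypothetical rows for illustration only; they are NOT taken from the dataset.
SAMPLE = [
    '{"id": 1, "difficulty": "easy", "subject": "logic", '
    '"question": "placeholder question", "final_answer": "placeholder"}',
    '{"id": 2, "difficulty": "hard", "subject": "logic", '
    '"question": "placeholder question", "final_answer": "placeholder"}',
]

if __name__ == "__main__":
    rows = parse_jsonl(SAMPLE)
    print(count_by(rows, "difficulty"))
    print(count_by(rows, "subject"))
```

With a downloaded copy of the file, `load_problems("ReasoningMath.jsonl")` should return one dictionary per question, and `count_by` gives a quick view of how the 100 questions are distributed across difficulty levels and subjects.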
Provider:
maas
Created:
2025-09-22
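Collected summary
Dataset introduction
Background and challenges
Background overview
SKYLENAGE-ReasoningMATH is a dataset focused on evaluating mathematical reasoning. It contains 100 difficulty-graded problems covering tasks from basic arithmetic to advanced mathematics. Its distinguishing feature is an emphasis on higher-order reasoning tasks, namely intuitive logic, number theory and combinatorics, and spatial reasoning, with the aim of assessing a model's abstract thinking and problem-solving ability.
The content above was collected and summarized by 遇见数据集.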