Guji-Math: An Evaluation Benchmark for Assessing the Effectiveness of Reasoning Models in Solving Ancient Chinese Mathematical Problems

Name: Guji-Math: An Evaluation Benchmark for Assessing the Effectiveness of Reasoning Models in Solving Ancient Chinese Mathematical Problems
Creator: Science Data Bank
Published: 2025-06-23 00:37:22
License: No description available

Source: DataCite Commons. Updated 2025-06-23; indexed 2026-05-05.

Download link:

https://www.scidb.cn/detail?dataSetId=d5a0bfefae1b4a8a92cb1a4fbc316949

Description:

As one of the earliest countries in the world to develop mathematics, China has accumulated a wealth of mathematical resources over its long history. To promote the revitalization and use of ancient Chinese mathematical resources in the era of generative AI, this study designs Guji-Math, a benchmark of ancient Chinese mathematical problems for evaluating reasoning models. Guji-Math is constructed from the Suanjing Shishu (《算经十书》), the most renowned compendium of ancient Chinese mathematics. Leveraging the texts' distinctive "Question-Answer-Solution" structure, the benchmark creates verifiable question-answer pairs. Through semi-automatic annotation, each problem is assigned one of four difficulty levels and one of 15 problem types, resulting in a collection of 538 mathematical questions and 511 solution methods. The benchmark provides two evaluation modes: closed-book, which measures a reasoning model's accuracy without external assistance, and open-book, in which the model may reference only the original solution methods from the texts.
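
The two evaluation modes described above could be scored roughly as in the following minimal sketch. Everything in it is an assumption made for illustration: the `Problem` record layout, the prompt wording, the exact-match scoring rule, and the `toy_solver` stand-in for a reasoning model are all hypothetical and are not taken from the dataset files. The sample item is the well-known remainder problem from the Sunzi Suanjing, paraphrased in English. It is included only to exercise the code and is not claimed to match any entry in Guji-Math.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Problem:
    # Hypothetical record layout; the dataset's real field names may differ.
    question: str
    answer: str
    method: Optional[str] = None  # original solution method from the text, if any


def build_prompt(problem: Problem, open_book: bool) -> str:
    """Closed-book: question only. Open-book: question plus the original method."""
    if open_book and problem.method:
        return f"Method:\n{problem.method}\n\nQuestion:\n{problem.question}"
    return f"Question:\n{problem.question}"


def accuracy(problems: Iterable[Problem],
             solve: Callable[[str], str],
             open_book: bool) -> float:
    """Fraction of problems whose predicted answer exactly matches the reference."""
    items = list(problems)
    if not items:
        return 0.0
    correct = sum(
        solve(build_prompt(p, open_book)).strip() == p.answer.strip()
        for p in items
    )
    return correct / len(items)


# Illustrative item only (paraphrased in English; not copied from the dataset).
sample = [
    Problem(
        question=("An unknown number of things leaves remainder 2 when counted "
                  "by threes, 3 by fives, and 2 by sevens. How many are there?"),
        answer="23",
        method=("Take 140 for the remainder by threes, 63 for the remainder by "
                "fives, 30 for the remainder by sevens; add them and subtract 210."),
    )
]


def toy_solver(prompt: str) -> str:
    # Stand-in for a reasoning model: it only "succeeds" when a method is supplied.
    return "23" if prompt.startswith("Method:") else "unknown"


closed_book_score = accuracy(sample, toy_solver, open_book=False)
open_book_score = accuracy(sample, toy_solver, open_book=True)
```

Exact string matching is the simplest possible scoring rule. Answers that carry units or are written in Chinese numerals would likely need normalization before comparison.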

Provider:

Science Data Bank

Created:

2025-06-23
