AlgGeoBench

Name: AlgGeoBench
Creator: maas
Published: 2025-08-05 16:32:26
License: No description available

Source: ModelScope community. Updated 2025-08-05; indexed 2025-06-21.

Download link:

https://modelscope.cn/datasets/PKU-DS-LAB/AlgGeoBench

Description:

# Welcome to AlgGeoBench created by PKU-DS-LAB!

## Dataset Description

AlgGeoBench is a frontier math benchmark consisting of 843 8-choice problems adapted from definitions and propositions in the open-source Algebraic Geometry textbook and reference work [The Stacks project](https://stacks.math.columbia.edu/). It is designed to measure the frontier math ability of LLMs through Algebraic Geometry.

Key characteristics of AlgGeoBench include:

- **Frontier Knowledge**: Questions are based on The Stacks project, which covers Algebraic Geometry and related topics ranging from undergraduate level to research level.
- **Moderate Difficulty**: State-of-the-art LLMs reach around 60% accuracy on AlgGeoBench (see the results table below), which suggests the difficulty is suitable for current LLMs.
- **Easy Evaluation**: Questions are multiple-choice and therefore easy to score, bypassing the challenge of evaluating model-generated mathematical proofs.
- **Scalability**: The benchmark is created using an LLM-based methodology that requires no human verification, so it is scalable and can be applied to domains other than math.

Each question of AlgGeoBench is adapted either from a math definition or from a math proposition-proof pair. Questions adapted from definitions consist of the original definition and 7 similar but incorrect definitions. Questions adapted from proposition-proof pairs consist of the original proposition, the original proof, and 7 similar but incorrect proofs.

## Dataset Structure

Each entry in the benchmark contains the following fields:

- **tag**: The tag in The Stacks project from which the question is adapted.
- **type**: Indicates whether the question is adapted from a math definition or from a math proposition-proof pair.
- **proposition**: The original proposition if the question is adapted from a math proposition-proof pair; empty if the question is adapted from a math definition.
- **correct_text & incorrect_texts**: The original definition/proof and the 7 similar but incorrect definitions/proofs.
- **A-H**: The choices.
- **answer**: The correct answer.

The dataset is provided as a parquet file.

## Evaluation Result

| **Model** | **Acc (%)** |
|-----------|:-------------:|
| DeepSeek-R1 | 65.2 |
| Qwen3-235B-A22B | 61.3 |
| o3 | 61.2 |
| Claude 3.7 Sonnet Thinking | 58.4 |
| Gemini 2.5 Pro | 56.8 |
| DeepSeek-Prover-V2-671B | 53.5 |
| QwQ-32B | 49.1 |
| GPT-4.1 | 46.7 |
| Gemini 2.0 Flash Thinking | 44.1 |
| DeepSeek-V3 | 43.4 |
| o4-mini | 42.1 |
| Grok-3 | 40.6 |
| DeepSeek-R1-Distill-Qwen-32B | 33.8 |
| Qwen2.5-72B-Instruct | 32.1 |
| GPT-4o | 26.7 |
| Llama-4-Scout-17B-16E-Instruct | 23.7 |

## Citation Information

The accompanying paper will soon be published on arXiv for open access.
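
## Example: formatting and scoring an entry

The following is a minimal sketch of how the fields listed under Dataset Structure could be turned into an 8-choice prompt and scored. It is not the authors' evaluation code. It assumes that the choices are stored in eight separate fields named `A` through `H` and that `answer` holds a single letter; the helper names `build_prompt` and `accuracy`, the prompt wording, and the toy entry are all hypothetical.

```python
from string import ascii_uppercase

# Assumption: the eight choices live in fields named "A" .. "H".
CHOICES = ascii_uppercase[:8]


def build_prompt(entry: dict) -> str:
    """Format one entry as an 8-choice question (hypothetical wording)."""
    lines = []
    if entry.get("proposition"):
        # Proposition-proof question: show the proposition, ask for the proof.
        lines.append("Proposition: " + entry["proposition"])
        lines.append("Which of the following proofs is correct?")
    else:
        # Definition question: the proposition field is empty.
        lines.append("Which of the following definitions is correct?")
    for letter in CHOICES:
        lines.append(f"{letter}. {entry[letter]}")
    lines.append("Answer with a single letter from A to H.")
    return "\n".join(lines)


def accuracy(entries: list, predictions: list) -> float:
    """Percentage of predicted letters that match the `answer` field."""
    if not entries:
        return 0.0
    correct = sum(
        pred.strip().upper() == entry["answer"].strip().upper()
        for entry, pred in zip(entries, predictions)
    )
    return 100.0 * correct / len(entries)


# Toy entry with placeholder text; not taken from the dataset.
toy = {
    "tag": "TAG0",          # placeholder, not a real Stacks project tag
    "type": "definition",   # assumed value; the real encoding may differ
    "proposition": "",
    "answer": "C",
    **{letter: f"candidate definition {letter}" for letter in CHOICES},
}

prompt = build_prompt(toy)
score = accuracy([toy, toy], ["C", "a"])  # one right, one wrong
```

With real data, the entries could be obtained with something like `pandas.read_parquet(path).to_dict("records")`, where `path` points to the downloaded parquet file. For reference, uniform random guessing over eight options gives 12.5% accuracy.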


Provider:

maas

Created:

2025-06-19
