AlgGeoBench
Source: ModelScope community. Updated: 2025-08-05. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/PKU-DS-LAB/AlgGeoBench
Description:
# Welcome to AlgGeoBench created by PKU-DS-LAB!
## Dataset Description
AlgGeoBench is a frontier math benchmark consisting of 843 eight-choice problems adapted from definitions and propositions in [The Stacks project](https://stacks.math.columbia.edu/), an open-source Algebraic Geometry textbook and reference work. It is designed to measure the frontier math ability of LLMs through Algebraic Geometry.
Key characteristics of AlgGeoBench include:
- **Frontier Knowledge**: Questions are based on The Stacks project, which contains knowledge of Algebraic Geometry and related topics ranging from undergraduate level to research level.
- **Moderate Difficulty**: State-of-the-art LLMs reach around 60% accuracy on AlgGeoBench, which suggests that its difficulty is suitable for current LLMs.
- **Easy Evaluation**: AlgGeoBench questions are multiple-choice and therefore easy to evaluate, bypassing the challenge of evaluating model-generated mathematical proofs.
- **Scalability**: The benchmark is created using an LLM-based methodology that requires no human verification, so it is scalable and can be applied to domains other than math.
Each AlgGeoBench question is adapted either from a math definition or from a math proposition-proof pair. A question adapted from a definition consists of the original definition and 7 similar but incorrect definitions. A question adapted from a proposition-proof pair consists of the original proposition, the original proof, and 7 similar but incorrect proofs.
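As an illustration of this layout, the sketch below assembles one 8-choice item from a correct text and seven distractors. The helper name `build_item`, the uniform shuffle, and recording the answer as a letter are assumptions made for illustration; this card does not describe how the authors order the choices.

```python
import random
import string

LETTERS = string.ascii_uppercase[:8]  # "ABCDEFGH"

def build_item(correct_text, incorrect_texts, rng):
    """Shuffle one correct text with seven distractors into choices A-H.

    Hypothetical helper: the real construction pipeline is not documented here.
    """
    if len(incorrect_texts) != 7:
        raise ValueError("expected exactly 7 incorrect texts")
    texts = [correct_text] + list(incorrect_texts)
    rng.shuffle(texts)
    item = dict(zip(LETTERS, texts))
    # Assumption: the answer is recorded as the letter of the correct choice.
    item["answer"] = LETTERS[texts.index(correct_text)]
    return item

# Toy texts, not taken from the benchmark.
item = build_item(
    "original definition",
    [f"variant {i}" for i in range(1, 8)],
    random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible, which matters if the same item must be regenerated later.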
## Dataset Structure
Each entry in the benchmark contains the following fields:
- **tag**: The tag of The Stacks project from which the question is adapted.
- **type**: Whether the question is adapted from a math definition or from a math proposition-proof pair.
- **proposition**: The original proposition if the question is adapted from a math proposition-proof pair, and empty if the question is adapted from a math definition.
- **correct_text & incorrect_texts**: The original definition/proof and 7 similar but incorrect definitions/proofs.
- **A-H**: The eight answer choices.
- **answer**: The correct answer.
The dataset is provided as a parquet file.
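A minimal sketch of reading the file and turning one entry into a prompt, assuming pandas with a parquet engine such as pyarrow is installed. The file name `AlgGeoBench.parquet`, the choices being stored in separate columns `A` through `H`, and the prompt wording are all assumptions; check the downloaded file for the actual layout.

```python
LETTERS = "ABCDEFGH"

def load_entries(path):
    """Read the parquet file into a list of dicts, one per question."""
    import pandas as pd  # imported lazily; needs pyarrow or fastparquet
    return pd.read_parquet(path).to_dict(orient="records")

def format_prompt(entry):
    """Render one entry as an 8-choice question (wording is an assumption)."""
    lines = []
    prop = entry.get("proposition")
    # The proposition field is empty for definition-based questions.
    if isinstance(prop, str) and prop.strip():
        lines.append(f"Proposition: {prop}")
        lines.append("Which of the following proofs is correct?")
    else:
        lines.append("Which of the following definitions is correct?")
    for letter in LETTERS:
        lines.append(f"{letter}. {entry[letter]}")
    lines.append("Answer with a single letter from A to H.")
    return "\n".join(lines)

# Hypothetical usage; the file name is an assumption:
# entries = load_entries("AlgGeoBench.parquet")
# print(format_prompt(entries[0]))
```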
## Evaluation Result
| **Model** | **Acc (%)** |
|-----------|:-------------:|
| DeepSeek-R1 | 65.2 |
| Qwen3-235B-A22B | 61.3 |
| o3 | 61.2 |
| Claude 3.7 Sonnet Thinking | 58.4 |
| Gemini 2.5 Pro | 56.8 |
| DeepSeek-Prover-V2-671B | 53.5 |
| QwQ-32B | 49.1 |
| GPT-4.1 | 46.7 |
| Gemini 2.0 Flash Thinking | 44.1 |
| DeepSeek-V3 | 43.4 |
| o4-mini | 42.1 |
| Grok-3 | 40.6 |
| DeepSeek-R1-Distill-Qwen-32B | 33.8 |
| Qwen2.5-72B-Instruct | 32.1 |
| GPT-4o | 26.7 |
| Llama-4-Scout-17B-16E-Instruct | 23.7 |
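Because every question has eight choices, uniform random guessing gives an expected accuracy of 12.5%, a useful floor when reading the table. The sketch below scores predicted letters against the `answer` field. The function name `accuracy` and the letter-valued answers are assumptions for illustration; this is not the authors' evaluation script.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that match the reference letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no answers to score")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)

CHANCE_ACCURACY = 100.0 / 8  # 12.5% for uniform guessing over 8 choices

# Toy data, not real model output.
print(accuracy(["A", "c", "H", "B"], ["A", "C", "G", "B"]))  # 75.0
```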
## Citation Information
The accompanying paper will soon be published on arXiv for open access.
Provider:
maas
Created:
2025-06-19



