
AlgGeoBench

Source: ModelScope community | Updated: 2025-08-05 | Indexed: 2025-06-21
Download link:
https://modelscope.cn/datasets/PKU-DS-LAB/AlgGeoBench
Description:
# Welcome to AlgGeoBench created by PKU-DS-LAB!

## Dataset Description

AlgGeoBench is a frontier math benchmark consisting of 843 8-choice problems adapted from definitions and propositions in [The Stacks project](https://stacks.math.columbia.edu/), an open-source Algebraic Geometry textbook and reference work. It is designed to measure the frontier math ability of LLMs through Algebraic Geometry.

Key characteristics of AlgGeoBench include:

- **Frontier Knowledge**: Questions are based on The Stacks project, which covers Algebraic Geometry and related topics from undergraduate level to research level.
- **Moderate Difficulty**: State-of-the-art LLMs reach around 60% accuracy on AlgGeoBench (see the results table below), which suggests the difficulty is suitable for current LLMs.
- **Easy Evaluation**: AlgGeoBench questions are multiple-choice and therefore easy to evaluate, bypassing the challenge of evaluating model-generated mathematical proofs.
- **Scalability**: The benchmark is created using an LLM-based methodology that requires no human verification, so it is scalable and can be applied to domains other than math.

Each question in AlgGeoBench is adapted either from a math definition or from a math proposition-proof pair. Questions adapted from definitions consist of the original definition and 7 similar but incorrect definitions. Questions adapted from proposition-proof pairs consist of the original proposition, the original proof, and 7 similar but incorrect proofs.

## Dataset Structure

Each entry in the benchmark contains the following fields:

- **tag**: The tag in The Stacks project from which the question is adapted.
- **type**: Whether the question is adapted from a math definition or from a math proposition-proof pair.
- **proposition**: The original proposition if the question is adapted from a math proposition-proof pair; empty if the question is adapted from a math definition.
- **correct_text & incorrect_texts**: The original definition/proof and the 7 similar but incorrect definitions/proofs.
- **A-H**: The choices.
- **answer**: The correct answer.

The dataset is provided as a parquet file.

## Evaluation Result

| **Model** | **Acc (%)** |
|-----------|:-------------:|
| DeepSeek-R1 | 65.2 |
| Qwen3-235B-A22B | 61.3 |
| o3 | 61.2 |
| Claude 3.7 Sonnet Thinking | 58.4 |
| Gemini 2.5 Pro | 56.8 |
| DeepSeek-Prover-V2-671B | 53.5 |
| QwQ-32B | 49.1 |
| GPT-4.1 | 46.7 |
| Gemini 2.0 Flash Thinking | 44.1 |
| DeepSeek-V3 | 43.4 |
| o4-mini | 42.1 |
| Grok-3 | 40.6 |
| DeepSeek-R1-Distill-Qwen-32B | 33.8 |
| Qwen2.5-72B-Instruct | 32.1 |
| GPT-4o | 26.7 |
| Llama-4-Scout-17B-16E-Instruct | 23.7 |

## Citation Information

This paper will soon be published on arXiv for open access.
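
## Example Usage (sketch)

The fields listed under Dataset Structure suggest a simple evaluation loop: turn each entry into an 8-choice prompt, collect one letter per question from the model, and report the percentage of matches with `answer`. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes that the choices are stored in eight separate fields named `A` through `H` and that `answer` holds a single letter; the dataset card does not confirm either detail. The helper names `build_prompt` and `accuracy`, the prompt wording, and the toy entry are all hypothetical.

```python
CHOICE_LABELS = list("ABCDEFGH")


def build_prompt(entry):
    """Format one benchmark entry as an 8-choice question.

    Assumption: `entry` is a dict-like row with keys "proposition" and "A".."H".
    """
    # The card says `proposition` is empty for definition-based questions.
    if entry.get("proposition"):
        header = (
            "Proposition:\n" + entry["proposition"]
            + "\n\nWhich of the following proofs is correct?"
        )
    else:
        header = "Which of the following definitions is correct?"
    options = "\n\n".join(f"{label}. {entry[label]}" for label in CHOICE_LABELS)
    return f"{header}\n\n{options}\n\nAnswer with a single letter (A-H)."


def accuracy(predictions, answers):
    """Return the percentage of predictions that match the reference letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy entry with placeholder text (hypothetical, not taken from the dataset).
toy_entry = {"proposition": "", "answer": "C"}
for label in CHOICE_LABELS:
    toy_entry[label] = f"Candidate definition {label}"

prompt = build_prompt(toy_entry)
score = accuracy(["A", "b", "C"], ["A", "B", "D"])  # two of three match
```

In practice the rows would come from the parquet file, for example via `pandas.read_parquet`; the file name is not given in the card, so it is left out here. For context when reading the results table, uniform random guessing over eight choices gives an expected accuracy of 12.5%.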

Provider: maas
Created: 2025-06-19