u-math

Name: u-math
Creator: maas
Published: 2025-12-05 16:50:35
License: no description provided

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/toloka/u-math

Resource overview:

**U-MATH** is a comprehensive benchmark of 1,100 unpublished university-level problems sourced from real teaching materials. It is designed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs).

The dataset is balanced across six core mathematical topics, and 20% of its problems are multimodal (involving visual elements such as graphs and diagrams).

For fine-grained performance evaluation results and detailed discussion, check out our [paper](LINK).

* 📊 [U-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/umath)
* 🔎 [μ-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/mumath)
* 🗞️ [Paper](https://arxiv.org/abs/2412.03205)
* 👾 [Evaluation Code at GitHub](https://github.com/Toloka/u-math/)

### Key Features

* **Topics Covered**: Precalculus, Algebra, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series.
* **Problem Format**: Free-form answer, graded by an LLM judge.
* **Evaluation Metrics**: Accuracy, reported overall, per subject, and separately for text-only and multimodal problems.
* **Curation**: Original problems composed by math professors and used in university curricula; samples validated by math experts at [Toloka AI](https://toloka.ai) and [Gradarius](https://www.gradarius.com).

### Use it

```python
from datasets import load_dataset

ds = load_dataset('toloka/u-math', split='test')
```

### Dataset Fields

* `uuid`: problem id
* `has_image`: a boolean flag indicating whether the problem is multimodal
* `image`: binary data encoding the accompanying image, empty for text-only problems
* `subject`: subject tag marking the topic that the problem belongs to
* `problem_statement`: problem formulation, written in natural language
* `golden_answer`: a correct solution for the problem, written in natural language

For meta-evaluation (evaluating the quality of LLM judges), refer to the [μ-MATH dataset](https://huggingface.co/datasets/toloka/mu-math).
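The evaluation metrics listed under Key Features (accuracy overall, per subject, and per text-only versus multimodal type) can be sketched from the dataset fields alone. The following is a minimal illustration, not the official evaluation code: the function name `accuracy_splits` and the `verdicts` mapping (problem `uuid` to a judge's correct/incorrect decision) are assumptions introduced here, and the records are expected to carry the `uuid`, `subject`, and `has_image` fields described above.

```python
from collections import defaultdict


def accuracy_splits(records, verdicts):
    """Compute accuracy overall, per subject, and per modality.

    records  -- iterable of dicts with 'uuid', 'subject' and 'has_image' keys
    verdicts -- hypothetical mapping uuid -> bool (True if the judge accepted
                the answer); problems without a verdict count as incorrect
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        correct = bool(verdicts.get(rec["uuid"], False))
        modality = "multimodal" if rec["has_image"] else "text-only"
        keys = ("overall", f"subject:{rec['subject']}", f"modality:{modality}")
        for key in keys:
            totals[key] += 1
            hits[key] += correct
    # Every key in totals has at least one record, so division is safe.
    return {key: hits[key] / totals[key] for key in totals}
```

Rows of the loaded `test` split can be passed as `records` directly, since each row is a dict keyed by field name. Counting missing verdicts as incorrect is a design choice of this sketch; a stricter harness might raise an error instead.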
### Evaluation Results

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/beMyOikpKfp3My2vu5Mjc.png" alt="umath-table" width="800"/>
</div>

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/7_VZXidxMHG7PiDM983lS.png" alt="umath-bar" width="950"/>
</div>

The prompt used for inference:

```
{problem_statement}
Please reason step by step, and put your final answer within \boxed{}
```

### Licensing Information

All the dataset contents are available under the MIT license.

### Citation

If you use U-MATH or μ-MATH in your research, please cite the paper:

```bibtex
@inproceedings{umath2024,
  title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
  author={Konstantin Chernyshev and Vitaliy Polshkov and Ekaterina Artemova and Alex Myasnikov and Vlad Stepanov and Alexei Miasnikov and Sergei Tilga},
  year={2024}
}
```

### Contact

For inquiries, please contact kchernyshev@toloka.ai
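The inference prompt shown under Evaluation Results can be filled in programmatically, and the `\boxed{}` convention makes it possible to pull a final answer out of a model response. The sketch below is illustrative only: `build_prompt` and `extract_last_boxed` are hypothetical helper names, the newline between the problem statement and the instruction is an assumption about the template's layout, and the benchmark itself grades free-form answers with an LLM judge rather than by string extraction.

```python
# Template mirroring the documented prompt; the newline separator is assumed.
PROMPT_TEMPLATE = (
    "{problem_statement}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}"
)


def build_prompt(problem_statement: str) -> str:
    """Insert a problem statement into the inference prompt."""
    return PROMPT_TEMPLATE.format(problem_statement=problem_statement)


def extract_last_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2}
    is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: no complete boxed answer found
```

Taking the last `\boxed{}` rather than the first is a design choice of this sketch, on the assumption that a step-by-step response states its final answer at the end.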

Provider:

maas

Created:

2025-09-15

Collected summary

Dataset introduction