u-math
Source: ModelScope community. Updated 2025-12-05, indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/toloka/u-math
Description:
**U-MATH** is a comprehensive benchmark of 1,100 unpublished university-level problems sourced from real teaching materials.
It is designed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs). \
The dataset is balanced across six core mathematical topics, and 20% of the problems are multimodal (they involve visual elements such as graphs and diagrams).
For fine-grained evaluation results and a detailed discussion, see our [paper](https://arxiv.org/abs/2412.03205).
* 📊 [U-MATH benchmark at Hugging Face](https://huggingface.co/datasets/toloka/u-math)
* 🔎 [μ-MATH benchmark at Hugging Face](https://huggingface.co/datasets/toloka/mu-math)
* 🗞️ [Paper](https://arxiv.org/abs/2412.03205)
* 👾 [Evaluation Code at GitHub](https://github.com/Toloka/u-math/)
### Key Features
* **Topics Covered**: Precalculus, Algebra, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series.
* **Problem Format**: Free-form answer with LLM judgement
* **Evaluation Metrics**: Accuracy; splits by subject and text-only vs multimodal problem type.
* **Curation**: Original problems composed by math professors and used in university curricula; samples validated by math experts at [Toloka AI](https://toloka.ai) and [Gradarius](https://www.gradarius.com).
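
The evaluation metrics listed above (accuracy, split by subject and by text-only vs multimodal) can be sketched as follows. This is a minimal illustration, not the official evaluation code: the `is_correct` key is a hypothetical field holding an LLM judge's verdict, and the subject tags in the usage note are made up for the example.

```python
from collections import defaultdict


def accuracy_splits(results):
    """Return overall accuracy plus per-subject and per-modality accuracy.

    Each item in `results` is a dict with the dataset fields `subject` and
    `has_image`, plus a hypothetical boolean `is_correct` (the judge verdict).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        modality = "multimodal" if r["has_image"] else "text-only"
        # Every problem counts toward three buckets: overall, its subject, its modality.
        for key in ("overall", f"subject:{r['subject']}", f"modality:{modality}"):
            totals[key] += 1
            correct[key] += int(r["is_correct"])
    return {key: correct[key] / totals[key] for key in totals}
```

Calling `accuracy_splits` on a list of judged records returns a flat dictionary keyed by `overall`, `subject:<tag>`, and `modality:<type>`, which is easy to turn into a results table.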
### Use it
```python
from datasets import load_dataset
ds = load_dataset('toloka/u-math', split='test')
```
### Dataset Fields
`uuid`: problem id \
`has_image`: a boolean flag on whether the problem is multimodal or not \
`image`: binary data encoding the accompanying image, empty for text-only problems \
`subject`: subject tag marking the topic that the problem belongs to \
`problem_statement`: problem formulation, written in natural language \
`golden_answer`: a correct solution for the problem, written in natural language \
For meta-evaluation (evaluating the quality of LLM judges), refer to the [µ-MATH dataset](https://huggingface.co/datasets/toloka/mu-math).
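
As a quick sanity check of the fields described above, the sketch below counts problems per `subject` and computes the share of multimodal problems from `has_image`. It assumes only that each record behaves like a dict with those two keys; the `summarize` helper is illustrative and not part of the dataset tooling.

```python
from collections import Counter


def summarize(records):
    """Count problems per subject and the share flagged as multimodal."""
    per_subject = Counter()
    multimodal = 0
    total = 0
    # Single pass, so `records` may be any iterable of dict-like rows.
    for r in records:
        per_subject[r["subject"]] += 1
        multimodal += bool(r["has_image"])
        total += 1
    share = multimodal / total if total else 0.0
    return {
        "total": total,
        "per_subject": dict(per_subject),
        "multimodal_share": share,
    }
```

Rows yielded by iterating a Hugging Face `Dataset` are dicts, so passing the `ds` object from the loading snippet to `summarize` should work directly, although iterating the full dataset also decodes the images.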
### Evaluation Results
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/beMyOikpKfp3My2vu5Mjc.png" alt="umath-table" width="800"/>
</div>
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/7_VZXidxMHG7PiDM983lS.png" alt="umath-bar" width="950"/>
</div>
The prompt used for inference:
```
{problem_statement}
Please reason step by step, and put your final answer within \boxed{}
```
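
A minimal sketch of how this prompt could be filled in, and how the final `\boxed{}` answer could be pulled out of a model response before it is passed to a judge. Both helper names are hypothetical, and the extraction step is an assumption for illustration: the benchmark itself relies on LLM judgement of free-form answers.

```python
PROMPT_TEMPLATE = (
    "{problem_statement}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}"
)


def build_prompt(problem_statement: str) -> str:
    """Fill the inference prompt with a problem statement."""
    return PROMPT_TEMPLATE.format(problem_statement=problem_statement)


def extract_last_boxed(text: str):
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} survives.
    Escaped braces (\\{ and \\}) are counted like ordinary ones, which is fine
    as long as they come in balanced pairs.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces: no complete boxed answer
```

Matching braces by depth rather than with a regular expression keeps nested expressions intact, which matters for answers containing fractions, roots, or sets.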
### Licensing Information
All the dataset contents are available under the MIT license.
### Citation
If you use U-MATH or μ-MATH in your research, please cite the paper:
```bibtex
@inproceedings{umath2024,
title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
author={Konstantin Chernyshev and Vitaliy Polshkov and Ekaterina Artemova and Alex Myasnikov and Vlad Stepanov and Alexei Miasnikov and Sergei Tilga},
year={2024}
}
```
### Contact
For inquiries, please contact kchernyshev@toloka.ai
Provider: maas
Created: 2025-09-15



