
u-math

ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link: https://modelscope.cn/datasets/toloka/u-math
Resource description:
**U-MATH** is a comprehensive benchmark of 1,100 unpublished university-level problems sourced from real teaching materials.

It is designed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs). \
The dataset is balanced across six core mathematical topics and includes 20% of multimodal problems (involving visual elements such as graphs and diagrams).

For fine-grained performance evaluation results and detailed discussion, check out our [paper](LINK).

* 📊 [U-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/umath)
* 🔎 [μ-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/mumath)
* 🗞️ [Paper](https://arxiv.org/abs/2412.03205)
* 👾 [Evaluation Code at GitHub](https://github.com/Toloka/u-math/)

### Key Features
* **Topics Covered**: Precalculus, Algebra, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series.
* **Problem Format**: Free-form answer with LLM judgement
* **Evaluation Metrics**: Accuracy; splits by subject and text-only vs multimodal problem type.
* **Curation**: Original problems composed by math professors and used in university curricula, samples validated by math experts at [Toloka AI](https://toloka.ai), [Gradarius](https://www.gradarius.com)

### Use it
```python
from datasets import load_dataset
ds = load_dataset('toloka/u-math', split='test')
```

### Dataset Fields
`uuid`: problem id \
`has_image`: a boolean flag on whether the problem is multimodal or not \
`image`: binary data encoding the accompanying image, empty for text-only problems \
`subject`: subject tag marking the topic that the problem belongs to \
`problem_statement`: problem formulation, written in natural language \
`golden_answer`: a correct solution for the problem, written in natural language

For meta-evaluation (evaluating the quality of LLM judges), refer to the [µ-MATH dataset](https://huggingface.co/datasets/toloka/mu-math).
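The accuracy splits listed under Key Features (overall, by subject, and text-only vs multimodal) can be sketched as follows. This is a minimal illustration, not the official evaluation code: the `is_correct` key stands in for an LLM judge's verdict and is an assumption, as are the subject tag values in the toy records; only `subject` and `has_image` come from the dataset fields above.

```python
from collections import defaultdict


def accuracy_splits(records):
    """Return accuracy overall, per subject, and per modality.

    Each record is a dict with the dataset's `subject` and `has_image`
    fields plus a hypothetical boolean `is_correct` (the judge's verdict).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for rec in records:
        modality = "multimodal" if rec["has_image"] else "text-only"
        keys = ("overall", f"subject:{rec['subject']}", f"modality:{modality}")
        for key in keys:
            totals[key][0] += int(rec["is_correct"])
            totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Toy records made up for illustration; these are not real benchmark results,
# and the subject tags are placeholders rather than the dataset's actual values.
toy_records = [
    {"subject": "algebra", "has_image": False, "is_correct": True},
    {"subject": "algebra", "has_image": True, "is_correct": False},
    {"subject": "integral_calc", "has_image": False, "is_correct": True},
    {"subject": "integral_calc", "has_image": False, "is_correct": False},
]

scores = accuracy_splits(toy_records)
for name, value in sorted(scores.items()):
    print(f"{name}: {value:.2f}")
```

Keeping one counter per split key means a single pass over the judged records yields every breakdown at once.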
### Evaluation Results

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/beMyOikpKfp3My2vu5Mjc.png" alt="umath-table" width="800"/>
</div>

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/7_VZXidxMHG7PiDM983lS.png" alt="umath-bar" width="950"/>
</div>

The prompt used for inference:
```
{problem_statement}
Please reason step by step, and put your final answer within \boxed{}
```

### Licensing Information
All the dataset contents are available under the MIT license.

### Citation
If you use U-MATH or μ-MATH in your research, please cite the paper:
```bibtex
@inproceedings{umath2024,
  title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
  author={Konstantin Chernyshev and Vitaliy Polshkov and Ekaterina Artemova and Alex Myasnikov and Vlad Stepanov and Alexei Miasnikov and Sergei Tilga},
  year={2024}
}
```

### Contact
For inquiries, please contact kchernyshev@toloka.ai
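As an illustration of the inference prompt given under Evaluation Results, the sketch below fills the template with a problem statement and pulls the final `\boxed{...}` answer out of a model response. The helper names `build_prompt` and `extract_boxed` are hypothetical, and the extraction step is an assumption for convenience: answers are free-form and scored by an LLM judge, so the official evaluation code may handle responses differently.

```python
# Template copied from the dataset card; doubled braces survive str.format as "{}".
PROMPT_TEMPLATE = (
    "{problem_statement}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}"
)


def build_prompt(problem_statement: str) -> str:
    """Fill the inference prompt with one problem statement."""
    return PROMPT_TEMPLATE.format(problem_statement=problem_statement)


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: no complete boxed answer


prompt = build_prompt("Compute the derivative of x^2.")
answer = extract_boxed(r"Differentiating gives \boxed{\frac{1}{2}}.")
print(prompt)
print(answer)
```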

Provider: maas
Created: 2025-09-15