mu-math
Source: ModelScope community. Updated 2025-12-05, indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/toloka/mu-math
**μ-MATH** (**M**eta **U-MATH**) is a meta-evaluation dataset derived from the [U-MATH](https://huggingface.co/datasets/toloka/umath) benchmark.
It is intended to assess the ability of LLMs to judge free-form mathematical solutions. \
The dataset includes 1,084 labeled samples generated from 271 U-MATH tasks, covering problems of varying assessment complexity.
For fine-grained performance evaluation results, in-depth analyses and detailed discussions on behaviors and biases of LLM judges, check out our [paper](LINK).
* 📊 [U-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/umath)
* 🔎 [μ-MATH benchmark at Huggingface](https://huggingface.co/datasets/toloka/mumath)
* 🗞️ [Paper](https://arxiv.org/abs/2412.03205)
* 👾 [Evaluation Code at GitHub](https://github.com/Toloka/u-math/)
### Key Features
* **Dataset Construction**:
  - Subset of U-MATH problems (25%).
  - Includes solutions generated by four top-performing LLMs: Llama-3.1 70B, Qwen2.5 72B, GPT-4o, and Gemini 1.5 Pro (271 tasks × 4 models = 1,084 samples).
  - Solutions are labeled as correct or incorrect using both math experts and formal auto-verification.
  - Samples are validated by math experts at [Toloka AI](https://toloka.ai) and [Gradarius](https://www.gradarius.com).
* **Focus**: Meta-evaluation of LLMs as evaluators, testing their accuracy in judging free-form solutions.
* **Primary Metric**: Macro F1-score.
* **Secondary Metrics**: True Positive Rate, True Negative Rate, Positive Predictive Value, Negative Predictive Value.
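The metrics listed above can be computed from a judge's boolean verdicts and the ground-truth `label` values. The following is a minimal sketch, not the official evaluation code: the function name `judge_metrics` and the choice to report `0.0` for an empty denominator are assumptions made for illustration. It treats "solution is correct" as the positive class, so precision and recall for the negative class are NPV and TNR respectively.

```python
from typing import Dict, Sequence


def judge_metrics(labels: Sequence[bool], preds: Sequence[bool]) -> Dict[str, float]:
    """Compute macro F1, TPR, TNR, PPV and NPV for binary judge verdicts.

    `labels` are ground-truth correctness flags, `preds` are the judge's verdicts.
    (Hypothetical helper, not part of the U-MATH codebase.)
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    # Confusion-matrix counts with "correct solution" as the positive class.
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)

    def ratio(num: float, den: float) -> float:
        # Assumption: report 0.0 when the denominator is empty.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)  # recall of the positive class
    tnr = ratio(tn, tn + fp)  # recall of the negative class
    ppv = ratio(tp, tp + fp)  # precision of the positive class
    npv = ratio(tn, tn + fn)  # precision of the negative class

    # Per-class F1 is the harmonic mean of that class's precision and recall.
    f1_pos = ratio(2 * ppv * tpr, ppv + tpr)
    f1_neg = ratio(2 * npv * tnr, npv + tnr)

    return {
        "macro_f1": (f1_pos + f1_neg) / 2,  # unweighted mean over both classes
        "tpr": tpr,
        "tnr": tnr,
        "ppv": ppv,
        "npv": npv,
    }
```

Macro averaging weights the "correct" and "incorrect" classes equally, so a judge that always answers "Yes" is penalized even if most samples are correct.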
For original tasks on mathematical problem-solving, refer to the [U-MATH dataset](https://huggingface.co/datasets/toloka/umath).
### Use it
```python
from datasets import load_dataset
ds = load_dataset('toloka/mu-math', split='test')
```
### Dataset Fields
`uuid`: problem id \
`problem_statement`: problem formulation, written in natural language \
`golden_answer`: a correct solution for the problem to compare the generated solutions against, written in natural language \
`model`: name of the instruction-finetuned LLM that generated the solution \
`model_output`: the LLM's solution \
`label`: boolean flag on whether the generated solution is correct or not
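
As an illustration of how these fields fit together, the sketch below groups rows by `model` and reports the share of each model's solutions whose `label` is true. The helper name `share_correct_by_model` and the sample rows are made up for this example; only the field names come from the list above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def share_correct_by_model(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Return {model name: fraction of its solutions labeled correct}.

    Hypothetical helper; expects rows with the `model` and `label` fields.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        totals[row["model"]] += 1
        correct[row["model"]] += int(bool(row["label"]))
    return {model: correct[model] / totals[model] for model in totals}


# Made-up rows for illustration only; real rows also carry `uuid`,
# `problem_statement`, `golden_answer` and `model_output`.
sample_rows = [
    {"model": "model-a", "label": True},
    {"model": "model-a", "label": False},
    {"model": "model-b", "label": True},
]
print(share_correct_by_model(sample_rows))
```

Iterating over a loaded `datasets.Dataset` yields one dictionary per row, so the `ds` object from the loading snippet can be passed in place of `sample_rows`.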
### Evaluation Results
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/lz_ylYOUd6BSK8yFn3K77.png" alt="mumath-table" width="1000"/>
</div>
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/650238063e61bc019201e3e2/Ook-JXum03E0UdWBIW0qB.png" alt="mumath-scatter" width="800"/>
</div>
The prompt used for judgment:
```
You'll be provided with a math problem, a correct answer for it and a solution for evaluation.
You have to answer whether the solution is correct or not.
---
PROBLEM STATEMENT:
{problem_statement}
CORRECT ANSWER:
{golden_answer}
SOLUTION TO EVALUATE:
{model_output}
---
Now please compare the answer obtained in the solution with the provided correct answer to evaluate whether the solution is correct or not.
Think step-by-step, following these steps, don't skip any:
1. Extract the answer from the provided solution
2. Make any derivations or transformations that may be necessary to compare the provided correct answer with the extracted answer
3. Perform the comparison
4. Conclude with your final verdict — put either "Yes" or "No" on a separate line
```
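A minimal sketch of how the prompt above might be filled in and how the judge's reply might be parsed follows. Both helper names (`build_judge_prompt`, `parse_verdict`) and the parsing rule are assumptions for illustration, not the official evaluation code: the parser scans the reply from the last line upward and accepts the first line that consists only of "Yes" or "No" after stripping quotes, asterisks, and periods.

```python
from typing import Mapping, Optional


def build_judge_prompt(template: str, row: Mapping[str, str]) -> str:
    """Fill the {problem_statement}, {golden_answer} and {model_output} slots.

    `template` is the judgment prompt text; `row` is one dataset record.
    Braces inside the field values (e.g. LaTeX) are inserted verbatim.
    """
    return template.format(
        problem_statement=row["problem_statement"],
        golden_answer=row["golden_answer"],
        model_output=row["model_output"],
    )


def parse_verdict(judge_reply: str) -> Optional[bool]:
    """Return True for "Yes", False for "No", None if no verdict line is found.

    Assumption: the verdict sits alone on its own line, as the prompt requests.
    """
    for line in reversed(judge_reply.splitlines()):
        token = line.strip().strip("\"'*.` ").lower()
        if token == "yes":
            return True
        if token == "no":
            return False
    return None


# Made-up reply for illustration only.
reply = "1. Extracted answer: 1/2\n2. 0.5 equals 1/2\n3. They match\n4. Verdict:\nYes"
print(parse_verdict(reply))
```

Returning `None` for replies without a clear verdict keeps such cases visible, so they can be counted separately rather than silently scored as "No".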
### Licensing Information
* The contents of the machine-generated `model_output` column are subject to the underlying LLMs' licensing terms.
* Contents of all the other fields are available under the MIT license.
### Citation
If you use U-MATH or μ-MATH in your research, please cite the paper:
```bibtex
@inproceedings{umath2024,
title={U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs},
author={Konstantin Chernyshev and Vitaliy Polshkov and Ekaterina Artemova and Alex Myasnikov and Vlad Stepanov and Alexei Miasnikov and Sergei Tilga},
year={2024}
}
```
### Contact
For inquiries, please contact kchernyshev@toloka.ai
Provider: maas
Created: 2025-09-15



