toloka/mu-math
Source: Hugging Face. Updated: 2026-01-30. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/toloka/mu-math
Description:
μ-MATH is a meta-evaluation dataset for assessing how well large language models (LLMs) judge free-form mathematical solutions. It contains 1,084 labeled samples derived from 271 U-MATH tasks, covering problems of varying assessment complexity. The solutions were generated by four top-performing LLMs (Llama-3.1 70B, Qwen2.5 72B, GPT-4o, and Gemini 1.5 Pro) and labeled by math experts and formal auto-verification. The benchmark treats the LLM as an evaluator and tests the accuracy of its verdicts on these solutions. The primary evaluation metric is the macro F1-score; the secondary metrics are True Positive Rate, True Negative Rate, Positive Predictive Value, and Negative Predictive Value.
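The metrics named in the description can be illustrated with a minimal sketch. It assumes each sample carries a gold label (whether the solution is actually correct) and a judge verdict, both encoded as booleans; the function name `judge_metrics` and this encoding are hypothetical and are not taken from the dataset's own scoring code. Macro F1 is computed here as the unweighted mean of the F1-scores of the positive and negative classes.

```python
from typing import Dict, Sequence


def judge_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Compute binary judgment metrics from gold labels and judge verdicts.

    Hypothetical helper: True means "solution is correct".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    # Per-class F1, then the unweighted (macro) average over both classes.
    f1_pos = ratio(2 * tp, 2 * tp + fp + fn)
    f1_neg = ratio(2 * tn, 2 * tn + fn + fp)

    return {
        "macro_f1": (f1_pos + f1_neg) / 2,
        "tpr": ratio(tp, tp + fn),  # True Positive Rate
        "tnr": ratio(tn, tn + fp),  # True Negative Rate
        "ppv": ratio(tp, tp + fp),  # Positive Predictive Value
        "npv": ratio(tn, tn + fn),  # Negative Predictive Value
    }


if __name__ == "__main__":
    gold = [True, True, True, True, False, False]
    pred = [True, True, True, False, True, False]
    for name, value in judge_metrics(gold, pred).items():
        print(f"{name}: {value:.3f}")
```

Averaging the two per-class F1-scores without weighting keeps a judge from scoring well simply by favoring the more frequent verdict, which is one plausible reason to prefer macro F1 as the headline number.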
Provider:
toloka



