MathArena/brokenarxiv-0326_outputs

Name: MathArena/brokenarxiv-0326_outputs
Creator: MathArena
Published: 2026-04-03 17:55:01
License: no description provided

Hugging Face, updated 2026-04-03, indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/MathArena/brokenarxiv-0326_outputs

Resource overview:

---
dataset_info:
  features:
  - name: problem_idx
    dtype: string
  - name: problem
    dtype: string
  - name: model_name
    dtype: string
  - name: model_config
    dtype: string
  - name: idx_answer
    dtype: int64
  - name: all_messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
    - name: type
      dtype: string
  - name: user_message
    dtype: string
  - name: answer
    dtype: string
  - name: input_tokens
    dtype: int64
  - name: output_tokens
    dtype: int64
  - name: cost
    dtype: float64
  - name: input_cost_per_tokens
    dtype: float64
  - name: output_cost_per_tokens
    dtype: float64
  - name: source
    dtype: string
  - name: points_judge_1
    dtype: int64
  - name: grading_details_judge_1
    list:
    - name: desc
      dtype: string
    - name: grading_scheme_desc
      dtype: string
    - name: max_points
      dtype: float64
    - name: points
      dtype: int64
    - name: title
      dtype: string
  - name: error_judge_1
    dtype: 'null'
  - name: max_points_judge_1
    dtype: float64
  splits:
  - name: train
    num_bytes: 55493712
    num_examples: 560
  download_size: 24743507
  dataset_size: 55493712
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-sa-4.0
language:
- en
pretty_name: Model Outputs BrokenArXiv March 2026
size_categories:
- 1K<n<10K
---

### Homepage and repository

- **Homepage:** [https://matharena.ai/](https://matharena.ai/)
- **Repository:** [https://github.com/eth-sri/matharena](https://github.com/eth-sri/matharena)

### Dataset Summary

This dataset contains model answers to the questions from BrokenArXiv March 2026, generated using the MathArena GitHub repository.

### Data Fields

Below is a description of each field in the dataset.

- `problem_idx` (str): Index of the problem in the competition
- `problem` (str): Full problem statement
- `gold_answer` (str): Ground-truth answer to the question
- `model_name` (str): Name of the model as presented on the MathArena website
- `model_config` (str): Path to the config file in the MathArena GitHub repo
- `idx_answer` (int): Each model answered every question multiple times. This index indicates which attempt this is
- `user_message` (str): User message presented to the model. Contains a competition-specific instruction along with the problem statement
- `answer` (str): Full model answer
- `parsed_answer` (str): Answer as it was parsed by the MathArena parser. Note: a direct string comparison between the `parsed_answer` and the `gold_answer` will give false negatives when measuring correctness.
- `correct` (bool): Indicates whether the answer is correct as evaluated by the MathArena parser
- `input_tokens` (int): Number of input tokens. Is 0 when this value is missing
- `output_tokens` (int): Number of output tokens. Is 0 when this value is missing
- `cost` (float): Total cost. Is 0 when this value is missing
- `input_cost_per_tokens` (float): Cost per one million input tokens
- `output_cost_per_tokens` (float): Cost per one million output tokens

### Licensing Information

This dataset is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. Please abide by the license when using the provided data.

### Citation Information

```
@misc{balunovic_srimatharena_2025,
  title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
  author = {Mislav Balunović and Jasper Dekoninck and Ivo Petrov and Nikola Jovanović and Martin Vechev},
  copyright = {MIT},
  url = {https://matharena.ai/},
  publisher = {SRI Lab, ETH Zurich},
  month = feb,
  year = {2025},
}
```
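As a rough illustration of how the schema in the card could be used, the following Python sketch aggregates a normalized score per model from the `points_judge_1` and `max_points_judge_1` features. The sample rows and the helper name `score_by_model` are hypothetical and invented for this example; treating the ratio of the two fields as a score is an assumption, not a documented MathArena metric.

```python
from collections import defaultdict

# Hypothetical rows that mimic the feature names in the card; all values are invented.
# With network access, real rows could instead be loaded with the `datasets` library:
#   from datasets import load_dataset
#   rows = load_dataset("MathArena/brokenarxiv-0326_outputs", split="train")
rows = [
    {"model_name": "model-a", "problem_idx": "1", "idx_answer": 0,
     "points_judge_1": 1, "max_points_judge_1": 1.0},
    {"model_name": "model-a", "problem_idx": "1", "idx_answer": 1,
     "points_judge_1": 0, "max_points_judge_1": 1.0},
    {"model_name": "model-b", "problem_idx": "1", "idx_answer": 0,
     "points_judge_1": 1, "max_points_judge_1": 1.0},
]


def score_by_model(rows):
    """Return {model_name: earned points / maximum points} over all rows."""
    earned = defaultdict(float)
    possible = defaultdict(float)
    for row in rows:
        earned[row["model_name"]] += row["points_judge_1"]
        possible[row["model_name"]] += row["max_points_judge_1"]
    # Skip models whose maximum is zero to avoid dividing by zero.
    return {m: earned[m] / possible[m] for m in earned if possible[m] > 0}


scores = score_by_model(rows)
print(scores)
```

Summing points before dividing weights every attempt equally; averaging per problem first would be an equally reasonable choice.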

Provided by:

MathArena

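Collected summary

Dataset introduction

Construction method

In the field of math-competition evaluation, the BrokenArXiv-0326_outputs dataset is built on the MathArena platform and draws on the mathematical problems of the March 2026 BrokenArXiv competition. It systematically collects multiple answers from several large language models to the same set of problems, recording the full conversation history, model configuration, and cost metrics of each interaction. Construction follows the competition setting closely so that problem statements and evaluation criteria stay consistent, and an automatic parser judges the correctness of model answers, yielding a structured collection of model outputs.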

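Features

The dataset's distinguishing feature is its multi-dimensional evaluation information. Each record contains both the raw model answer and the parsed answer, along with input and output token counts, compute cost, and a correctness label from the MathArena parser. A dedicated attempt index field tracks repeated answers by the same model to the same problem, which provides fine-grained data for studying model stability and consistency. The field structure therefore serves both competition evaluation and resource-consumption analysis.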
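The attempt index can be used to check how stable a model is on a given problem. The sketch below is one possible way to do that, assuming rows shaped like the card's schema; the sample data and the helper `attempt_consistency` are hypothetical, and `points_judge_1` stands in for whichever correctness signal is available.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical attempts by one model on two problems; values are invented.
attempts = [
    {"model_name": "model-a", "problem_idx": "1", "idx_answer": 0, "points_judge_1": 1},
    {"model_name": "model-a", "problem_idx": "1", "idx_answer": 1, "points_judge_1": 0},
    {"model_name": "model-a", "problem_idx": "2", "idx_answer": 0, "points_judge_1": 1},
    {"model_name": "model-a", "problem_idx": "2", "idx_answer": 1, "points_judge_1": 1},
]


def attempt_consistency(rows):
    """Summarize, per (model, problem), the mean score and whether all attempts agree."""
    grouped = defaultdict(dict)
    for row in rows:
        key = (row["model_name"], row["problem_idx"])
        grouped[key][row["idx_answer"]] = row["points_judge_1"]
    summary = {}
    for key, by_attempt in grouped.items():
        # Order the scores by attempt index so the list is reproducible.
        points = [by_attempt[i] for i in sorted(by_attempt)]
        summary[key] = {
            "mean_points": mean(points),
            "all_agree": len(set(points)) == 1,
        }
    return summary


consistency = attempt_consistency(attempts)
for key, stats in consistency.items():
    print(key, stats)
```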

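Usage

Researchers can analyze the dataset in several ways. Joining on `problem_idx` and `model_name` allows a side-by-side comparison of different models on a specific problem; `idx_answer` together with `correct` exposes how a model's performance fluctuates and trends across repeated attempts. The token-count and cost fields give a quantitative basis for assessing cost-effectiveness. Results should be verified with the original MathArena parser, and because a direct string comparison between `parsed_answer` and `gold_answer` can misjudge answers, the platform's specified evaluation procedure should be followed.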
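For the cost fields, the card states that `input_cost_per_tokens` and `output_cost_per_tokens` are prices per one million tokens. Under the assumption that the total is simply the sum of the two token charges (the card does not spell out the formula), an estimate is

$$\text{cost} \approx \frac{\text{input\_tokens} \cdot \text{input\_cost\_per\_tokens} + \text{output\_tokens} \cdot \text{output\_cost\_per\_tokens}}{10^{6}}.$$

A minimal Python sketch of that estimate follows; the helper `estimated_cost` and the sample row are hypothetical.

```python
TOKENS_PER_PRICE_UNIT = 1_000_000  # the card quotes prices per one million tokens


def estimated_cost(row):
    """Estimate the cost of one row, or return None when a token count is missing.

    Assumption: total cost is the sum of input and output token charges.
    """
    if row["input_tokens"] == 0 or row["output_tokens"] == 0:
        return None  # the card says 0 marks a missing value
    charge = (row["input_tokens"] * row["input_cost_per_tokens"]
              + row["output_tokens"] * row["output_cost_per_tokens"])
    return charge / TOKENS_PER_PRICE_UNIT


# Invented example: 2,000 input tokens at 1.0 and 10,000 output tokens at 4.0 per million.
sample = {"input_tokens": 2000, "output_tokens": 10000,
          "input_cost_per_tokens": 1.0, "output_cost_per_tokens": 4.0}
estimate = estimated_cost(sample)
print(estimate)
```

Comparing such an estimate with the stored `cost` field is one way to spot rows where pricing metadata is missing or inconsistent.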

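Background and challenges

Background overview

With large language models (LLMs) developing rapidly, accurate and contamination-free evaluation of their mathematical reasoning has become a key research topic in artificial intelligence. The MathArena project was created in 2025 by researchers at the Secure, Reliable, and Intelligent Systems Lab (SRI Lab) at ETH Zurich. Its core goal is to build a clean benchmark platform dedicated to evaluating LLMs on previously unpublished math competition problems. Datasets derived from the project, such as brokenarxiv-0326_outputs, collect multiple rounds of model outputs for the March 2026 BrokenArXiv competition problems. They are intended to support systematic analysis of model behavior and performance bottlenecks on complex mathematical problems, and they provide empirical data for improving models' logical reasoning and problem-solving abilities.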

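Current challenges

The dataset addresses a core challenge in evaluating mathematical reasoning: designing an evaluation framework that separates a model's genuine reasoning ability from memorization of training data. Concretely, this means building a test set of novel, unleaked competition problems so that pretraining data does not bias the evaluation. Constructing the dataset raises several technical difficulties. An automated parser is needed to extract and normalize the mathematical answers in model outputs and to handle differences in form between free-text answers and reference answers. An efficient mechanism for recording cost and token counts is also needed to quantify the economic and computational overhead of generation, which gives a reliable basis for comparing model efficiency.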
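The mismatch between answer forms can be illustrated with a toy comparison. The sketch below is not the MathArena parser; it is a hypothetical helper, `same_rational`, that only handles rational numbers written as fractions or decimals, and it shows why comparing strings directly can reject an equivalent answer.

```python
from fractions import Fraction


def same_rational(parsed_answer, gold_answer):
    """Compare two answers as exact rationals, falling back to stripped strings.

    Toy stand-in for a real answer parser: it understands only inputs that
    Fraction accepts, such as "1/2" or "0.5".
    """
    try:
        return Fraction(parsed_answer) == Fraction(gold_answer)
    except (ValueError, ZeroDivisionError):
        # Not a plain rational number; compare the text itself.
        return parsed_answer.strip() == gold_answer.strip()


# The strings differ even though the values are equal.
print("1/2" == "0.5")
print(same_rational("1/2", "0.5"))
```

A production parser has to cover far more cases (symbolic expressions, LaTeX, sets, intervals), which is why the card recommends relying on the MathArena evaluation rather than string equality.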

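Common scenarios

Classic use cases

For work on mathematical reasoning and LLM evaluation, the BrokenArXiv-0326_outputs dataset offers a standardized benchmark. It records detailed outputs from multiple models on the BrokenArXiv math competition problems, including problem statements, model answers, parsed results, and correctness labels. The classic use case is a systematic comparison of models on complex mathematical problems, using quantitative metrics such as accuracy and cost to reveal differences in mathematical reasoning, logical derivation, and symbolic computation.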

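Practical applications

In practice, the BrokenArXiv-0326_outputs dataset is used to guide the development of math-assistance tools and educational technology. Based on evaluation results from the dataset, engineers can select models that perform well on mathematical problem solving and integrate them into intelligent tutoring systems, automated problem-solving platforms, or research-assistance tools. Such applications can make mathematical study and research more efficient, and cost-benefit analysis helps organizations tune their deployment strategy to balance performance against resource consumption.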

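Derived and related work

A body of research has grown around the dataset. For example, building on its evaluation framework, researchers have explored fine-tuning and reinforcement-learning strategies for mathematical reasoning models and proposed methods to improve symbolic computation and logical consistency. The dataset has also prompted work on cross-model ensembling and adversarial testing, which analyzes how different models fail on the same problems to identify common error types in mathematical reasoning and ways to address them. Together, this work adds to the theory and practice of AI for mathematics and lays groundwork for more complex evaluation benchmarks.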

The content above was collected and summarized by 遇见数据集.
