MATH_Difficulty

Name: MATH_Difficulty
Creator: maas
Published: 2025-11-27 16:34:46
License: No description available

ModelScope community · Updated 2025-11-27 · Indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/lime-nlp/MATH_Difficulty


Description:

# Difficulty Estimation on MATH

We annotate the entire [**MATH**](https://huggingface.co/datasets/DigitalLearningGmbH/MATH-lighteval) dataset with a **difficulty score** based on the performance of the [Qwen 2.5-MATH-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model. This provides an adaptive signal for curriculum construction and model evaluation.

The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of problems from mathematics competitions, including the AMC 10, AMC 12, AIME, and more. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations.

## Difficulty Scoring Method

Difficulty scores are estimated using the **Qwen 2.5-MATH-7B** model with the following generation settings:

- `temperature = 0.6`
- `top_p = 0.9`
- `max_tokens = 4096`
- Inference performed using [vLLM](https://github.com/vllm-project/vllm)
- Each problem is attempted **128 times**

The difficulty score `d_i` for each problem is computed as:

$$d_i = 100 \times \left(1 - \frac{\#\,\text{successes}}{128}\right)$$

This approach balances the evaluation signal:

- A **strong model** would trivially solve easy problems, compressing the difficulty scale.
- A **weak model** would fail uniformly, providing poor resolution.
- Qwen 2.5-MATH-7B was selected for its **mid-range capabilities**, offering meaningful gradients across a wide spectrum of problems.

## Difficulty Estimation on Other Datasets

We also apply the same difficulty estimation procedure to the following datasets:

- [Open Reasoner Zero](https://huggingface.co/datasets/lime-nlp/orz_math_difficulty)
- [DeepScaleR](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty)
- [MATH](https://huggingface.co/datasets/lime-nlp/MATH_difficulty)
- [GSM8K](https://huggingface.co/datasets/lime-nlp/GSM8K_difficulty)

## 📬 Contact

For questions or feedback, feel free to reach out to [**Taiwei Shi**](https://maksimstw.github.io/) at [taiweish@usc.edu](mailto:taiweish@usc.edu).

## 📚 Citations

GitHub: https://github.com/uscnlp-lime/verl

If you find our dataset useful, please cite [Efficient Reinforcement Finetuning via Adaptive Curriculum Learning](https://huggingface.co/papers/2504.05520):

```bibtex
@misc{shi2025efficientreinforcementfinetuningadaptive,
  title={Efficient Reinforcement Finetuning via Adaptive Curriculum Learning},
  author={Taiwei Shi and Yiyang Wu and Linxin Song and Tianyi Zhou and Jieyu Zhao},
  year={2025},
  eprint={2504.05520},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.05520},
}
```
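As a worked illustration of the difficulty score defined in the scoring method, the following minimal Python sketch computes `d_i` from a success count. The function name `difficulty_score` and the example counts are hypothetical and chosen only for illustration; they are not taken from the dataset or from the authors' code.

```python
def difficulty_score(successes: int, attempts: int = 128) -> float:
    """Return d_i = 100 * (1 - successes / attempts).

    `attempts` defaults to 128, the number of samples per problem
    stated in the dataset description.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie in [0, attempts]")
    return 100.0 * (1.0 - successes / attempts)


# Hypothetical success counts out of 128 attempts, for illustration only.
example_counts = {"always_solved": 128, "half_solved": 64, "never_solved": 0}
scores = {name: difficulty_score(k) for name, k in example_counts.items()}
print(scores)
# {'always_solved': 0.0, 'half_solved': 50.0, 'never_solved': 100.0}
```

A score of 0 means the model solved the problem in every attempt, and a score of 100 means it never did, so higher values correspond to harder problems for this particular model.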


Provider: maas

Created: 2025-05-23
