orz_math_difficulty
Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/lime-nlp/orz_math_difficulty
## Difficulty Estimation on Open Reasoner Zero
We annotate the entire [**Open Reasoner Zero**](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B) dataset with a **difficulty score** based on the performance of the [Qwen 2.5-MATH-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model. The score provides a per-problem signal for adaptive curriculum construction.
Open Reasoner Zero is a curated dataset of 57,000 reasoning-intensive problems used to train and evaluate reinforcement learning-based methods for large language models.
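As an illustration of how a per-problem difficulty score could feed a curriculum, the sketch below picks the problems whose scores lie closest to a target difficulty on the 0 to 100 scale. The function name `select_batch`, the `(problem_id, difficulty)` record layout, and the sample values are assumptions made for this example; it is not a description of the authors' training code.

```python
from typing import List, Tuple


def select_batch(problems: List[Tuple[str, float]], target: float, k: int) -> List[str]:
    """Return the ids of the k problems whose difficulty is closest to `target`."""
    # Rank by absolute distance from the target difficulty (stable sort keeps
    # the input order among ties).
    ranked = sorted(problems, key=lambda item: abs(item[1] - target))
    return [problem_id for problem_id, _ in ranked[:k]]


# Hypothetical (problem_id, difficulty) records; not taken from the dataset.
problems = [("p1", 5.0), ("p2", 48.0), ("p3", 52.5), ("p4", 97.0)]

batch = select_batch(problems, target=50.0, k=2)
print(batch)  # ['p2', 'p3']
```

Raising or lowering `target` over the course of training would shift the selection toward harder or easier problems.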
## Difficulty Scoring Method
Difficulty scores are estimated using the **Qwen 2.5-MATH-7B** model with the following generation settings:
- `temperature = 0.6`
- `top_p = 0.9`
- `max_tokens = 4096`
- Inference performed using [vLLM](https://github.com/vllm-project/vllm)
- Each problem is attempted **128 times**
The difficulty score `d_i` for each problem is computed as:
`d_i = 100 × (1 - (# successes / 128))`
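The formula can be written as a small helper. This is a minimal sketch: the function name `difficulty_score` and the input validation are assumptions added for illustration, while the default of 128 attempts follows the settings listed above.

```python
def difficulty_score(successes: int, attempts: int = 128) -> float:
    """Compute d_i = 100 * (1 - successes / attempts).

    0 means every attempt succeeded; 100 means every attempt failed.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return 100.0 * (1.0 - successes / attempts)


print(difficulty_score(128))  # 0.0   (solved on every attempt)
print(difficulty_score(96))   # 25.0  (solved on 75% of attempts)
print(difficulty_score(0))    # 100.0 (never solved)
```

With 128 attempts, adjacent score values are 100 / 128 ≈ 0.78 points apart.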
The choice of scoring model determines how informative the resulting scale is:
- A **strong model** would trivially solve easy problems, compressing the difficulty scale.
- A **weak model** would fail uniformly, providing poor resolution.
- Qwen 2.5-MATH-7B was selected for its **mid-range capabilities**, offering meaningful gradients across a wide spectrum of problems.
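The resolution argument above can be made concrete with a toy calculation. The success rates below are invented purely for illustration (they are not measurements of any model), and the helper names are assumptions; the sketch only shows how the range of expected scores shrinks when a model succeeds or fails almost everywhere.

```python
from typing import List, Sequence


def expected_scores(success_rates: Sequence[float]) -> List[float]:
    """Expected difficulty score 100 * (1 - p) for each success rate p."""
    return [100.0 * (1.0 - p) for p in success_rates]


def score_range(success_rates: Sequence[float]) -> float:
    """Width of the interval covered by the expected scores."""
    scores = expected_scores(success_rates)
    return max(scores) - min(scores)


# Hypothetical success rates on the same four problems, easiest to hardest.
strong_model = [1.00, 0.99, 0.97, 0.95]  # solves nearly everything
weak_model = [0.05, 0.02, 0.01, 0.00]    # fails nearly everything
mid_model = [0.95, 0.65, 0.30, 0.05]     # spread across the scale

for name, rates in [("strong", strong_model), ("weak", weak_model), ("mid", mid_model)]:
    print(f"{name}: range of about {score_range(rates):.1f} points")
```

In this toy setup the strong and weak models squeeze all four problems into a band of roughly 5 points, while the mid-range model spreads them over roughly 90.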
## Difficulty Estimation on Other Datasets
We also apply the same difficulty estimation procedure to the following datasets:
- [Open Reasoner Zero](https://huggingface.co/datasets/lime-nlp/orz_math_difficulty)
- [DeepScaleR](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty)
- [MATH](https://huggingface.co/datasets/lime-nlp/MATH_difficulty)
- [GSM8K](https://huggingface.co/datasets/lime-nlp/GSM8K_difficulty)
## 📬 Contact
For questions or feedback, feel free to reach out to [**Taiwei Shi**](https://maksimstw.github.io/) at [taiweish@usc.edu](mailto:taiweish@usc.edu).
## 📚 Citations
GitHub: https://github.com/uscnlp-lime/verl
If you find our dataset useful, please cite [Efficient Reinforcement Finetuning via Adaptive Curriculum Learning](https://huggingface.co/papers/2504.05520):
```bibtex
@misc{shi2025efficientreinforcementfinetuningadaptive,
title={Efficient Reinforcement Finetuning via Adaptive Curriculum Learning},
author={Taiwei Shi and Yiyang Wu and Linxin Song and Tianyi Zhou and Jieyu Zhao},
year={2025},
eprint={2504.05520},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.05520},
}
```
Provider: maas
Created: 2025-05-23
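## Background Overview
This dataset assigns difficulty scores to Open Reasoner Zero (57,000 reasoning-intensive problems) using the Qwen 2.5-MATH-7B model, with the aim of providing an adaptive signal for curriculum construction. Each score is derived from the model's success rate over repeated attempts, and the same method is extended to other mathematical reasoning datasets such as DeepScaleR, MATH, and GSM8K.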



