orz_math_difficulty

ModelScope community · Updated 2025-12-05 · Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/lime-nlp/orz_math_difficulty
Resource description:
## Difficulty Estimation on Open Reasoner Zero

We annotate the entire [**Open Reasoner Zero**](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B) dataset with a **difficulty score** based on the performance of the [Qwen 2.5-MATH-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model. This provides an adaptive signal for curriculum construction.

Open Reasoner Zero is a curated dataset of 57,000 reasoning-intensive problems used to train and evaluate reinforcement learning-based methods for large language models.

## Difficulty Scoring Method

Difficulty scores are estimated using the **Qwen 2.5-MATH-7B** model with the following generation settings:

- `temperature = 0.6`
- `top_p = 0.9`
- `max_tokens = 4096`
- Inference performed using [vLLM](https://github.com/vllm-project/vllm)
- Each problem is attempted **128 times**

The difficulty score `d_i` for each problem is computed as:

`d_i = 100 × (1 - (# successes / 128))`

A problem solved in all 128 attempts therefore scores 0, and a problem never solved scores 100.

The choice of scoring model balances the evaluation signal:

- A **strong model** would trivially solve easy problems, compressing the difficulty scale.
- A **weak model** would fail uniformly, providing poor resolution.
- Qwen 2.5-MATH-7B was selected for its **mid-range capabilities**, offering meaningful gradients across a wide spectrum of problems.

## Difficulty Estimation on Other Datasets

The same difficulty estimation procedure is applied to the following datasets:

- [Open Reasoner Zero](https://huggingface.co/datasets/lime-nlp/orz_math_difficulty)
- [DeepScaleR](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty)
- [MATH](https://huggingface.co/datasets/lime-nlp/MATH_difficulty)
- [GSM8K](https://huggingface.co/datasets/lime-nlp/GSM8K_difficulty)

## 📬 Contact

For questions or feedback, feel free to reach out to [**Taiwei Shi**](https://maksimstw.github.io/) at [taiweish@usc.edu](mailto:taiweish@usc.edu).

## 📚 Citations

GitHub: https://github.com/uscnlp-lime/verl

If you find our dataset useful, please cite [Efficient Reinforcement Finetuning via Adaptive Curriculum Learning](https://huggingface.co/papers/2504.05520):

```bibtex
@misc{shi2025efficientreinforcementfinetuningadaptive,
  title={Efficient Reinforcement Finetuning via Adaptive Curriculum Learning},
  author={Taiwei Shi and Yiyang Wu and Linxin Song and Tianyi Zhou and Jieyu Zhao},
  year={2025},
  eprint={2504.05520},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.05520},
}
```
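
## Example: Difficulty Score Computation

The scoring formula from the Difficulty Scoring Method section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `sort_easy_to_hard` helper, and the problem ids are hypothetical and only show how a per-problem success count out of 128 attempts maps to a score between 0 and 100, and how such scores could order problems for a curriculum.

```python
ATTEMPTS_PER_PROBLEM = 128  # attempts per problem, as stated in the dataset card


def difficulty_score(successes: int, attempts: int = ATTEMPTS_PER_PROBLEM) -> float:
    """Return d_i = 100 * (1 - successes / attempts), a value in [0, 100]."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return 100.0 * (1.0 - successes / attempts)


def sort_easy_to_hard(success_counts: dict[str, int]) -> list[tuple[str, float]]:
    """Hypothetical helper: order problem ids by ascending difficulty score."""
    scored = [(pid, difficulty_score(n)) for pid, n in success_counts.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up success counts for four illustrative problems.
    counts = {"p1": 128, "p2": 64, "p3": 0, "p4": 96}
    for pid, score in sort_easy_to_hard(counts):
        print(pid, score)
```

With these made-up counts, a problem solved 96 times out of 128 gets a score of 25.0, while one that is never solved gets 100.0.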

Provider:
maas
Created:
2025-05-23
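Background overview
This dataset assigns difficulty scores to Open Reasoner Zero (57,000 reasoning-intensive problems) based on the Qwen 2.5-MATH-7B model, with the aim of providing an adaptive signal for curriculum construction. The score is derived from the model's success rate over repeated attempts at each problem, and the same procedure is extended to other mathematical reasoning datasets such as DeepScaleR, MATH, and GSM8K.
The summary above was collected and generated by 遇见数据集.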