GSM8K_Difficulty
Hosted on the ModelScope community. Updated 2025-11-27; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/lime-nlp/GSM8K_Difficulty
Description:
# Difficulty Estimation on GSM8K
We annotate the entire [**GSM8K**](https://huggingface.co/datasets/openai/gsm8k) dataset with a **difficulty score** based on the performance of the [Qwen 2.5-MATH-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model. This provides an adaptive signal for curriculum construction and model evaluation.
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. It was created to support question answering on basic mathematical problems that require multi-step reasoning.
## Difficulty Scoring Method
Difficulty scores are estimated using the **Qwen 2.5-MATH-7B** model with the following generation settings:
- `temperature = 0.6`
- `top_p = 0.9`
- `max_tokens = 4096`
- Inference performed using [vLLM](https://github.com/vllm-project/vllm)
- Each problem is attempted **128 times**
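The settings above map onto vLLM's `SamplingParams`. The following is a minimal sketch, not the authors' script: the helper name `sample_attempts`, the use of `n=128` to draw all attempts in a single call, and passing the raw problem text as the prompt are assumptions made for illustration.

```python
# Generation settings from the list above; `n` is the number of attempts per problem.
GENERATION_SETTINGS = {
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 4096,
    "n": 128,
}


def sample_attempts(prompts, model_name="Qwen/Qwen2.5-Math-7B"):
    """Return, for each prompt, the list of sampled completion strings.

    Hypothetical helper: prompt formatting and batching are assumptions.
    """
    # Imported lazily so the settings can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    params = SamplingParams(**GENERATION_SETTINGS)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput carries `n` CompletionOutput objects in `.outputs`.
    return [[completion.text for completion in out.outputs] for out in outputs]
```

Each returned inner list would then be checked against the reference answer to count successes; the source does not specify how answers are matched, so that step is left out here.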
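The difficulty score `d_i` for each problem is the percentage of failed attempts:
$$d_i = 100 \times \left(1 - \frac{\text{number of successes}}{128}\right)$$
A score of 0 means the model solved the problem on all 128 attempts; a score of 100 means it never succeeded.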
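The formula can be written as a small function. This is a sketch for illustration; the function name `difficulty_score` and the input validation are assumptions, not part of the released pipeline.

```python
def difficulty_score(successes: int, attempts: int = 128) -> float:
    """Difficulty in [0, 100]: 0 = always solved, 100 = never solved."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return 100 * (1 - successes / attempts)


print(difficulty_score(128))  # 0.0   (solved every time)
print(difficulty_score(96))   # 25.0  (solved on three quarters of attempts)
print(difficulty_score(0))    # 100.0 (never solved)
```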
The choice of scoring model determines how informative these scores are:
- A **strong model** would trivially solve easy problems, compressing the difficulty scale.
- A **weak model** would fail uniformly, providing poor resolution.
- Qwen 2.5-MATH-7B was selected for its **mid-range capabilities**, offering meaningful gradients across a wide spectrum of problems.
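The effect of model strength on the scale can be illustrated with a toy calculation. The success counts below are invented for illustration and are not measurements from any model; `score_spread` is a hypothetical helper that reports the gap between the hardest and easiest problem.

```python
def difficulty_score(successes, attempts=128):
    """Difficulty in [0, 100] from the number of successful attempts."""
    return 100 * (1 - successes / attempts)


def score_spread(success_counts, attempts=128):
    """Range of difficulty scores across a set of problems."""
    scores = [difficulty_score(s, attempts) for s in success_counts]
    return max(scores) - min(scores)


# Hypothetical success counts (out of 128) for the same three problems.
HYPOTHETICAL_COUNTS = {
    "strong": [128, 127, 126],   # nearly always correct
    "weak": [0, 1, 0],           # nearly always wrong
    "mid_range": [115, 64, 13],  # varies with the problem
}

for name, counts in HYPOTHETICAL_COUNTS.items():
    print(name, score_spread(counts))
# strong 1.5625
# weak 0.78125
# mid_range 79.6875
```

With the strong and weak models, all three problems land within about two points of each other, so they are nearly indistinguishable; the mid-range model spreads them across most of the 0 to 100 scale.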
## Difficulty Estimation on Other Datasets
We also apply the same difficulty estimation procedure to the following datasets:
- [Open Reasoner Zero](https://huggingface.co/datasets/lime-nlp/orz_math_difficulty)
- [DeepScaleR](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty)
- [MATH](https://huggingface.co/datasets/lime-nlp/MATH_difficulty)
- [GSM8K](https://huggingface.co/datasets/lime-nlp/GSM8K_difficulty)
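Any of these annotated datasets can be filtered into difficulty bands, for example when building a curriculum. The sketch below assumes each record is a dict with a numeric `difficulty` field and that the data lives in a `train` split; both names are guesses that should be checked against the dataset viewer. `select_by_difficulty` and `load_annotated` are hypothetical helpers.

```python
def select_by_difficulty(records, low, high, key="difficulty"):
    """Keep records whose difficulty score lies in [low, high].

    `key` is an assumed column name, not confirmed by the dataset card.
    """
    return [r for r in records if low <= r[key] <= high]


def load_annotated(repo_id="lime-nlp/GSM8K_difficulty", split="train"):
    """Load one of the annotated datasets from the Hugging Face Hub.

    The split name is an assumption; requires the `datasets` package.
    """
    from datasets import load_dataset  # lazy import: only needed for real data

    return load_dataset(repo_id, split=split)


# Usage with in-memory stand-in records (made-up values).
toy_records = [
    {"problem": "p1", "difficulty": 3.125},
    {"problem": "p2", "difficulty": 50.0},
    {"problem": "p3", "difficulty": 96.875},
]
medium = select_by_difficulty(toy_records, 25, 75)
print([r["problem"] for r in medium])  # ['p2']
```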
## 📬 Contact
For questions or feedback, feel free to reach out to [**Taiwei Shi**](https://maksimstw.github.io/) at [taiweish@usc.edu](mailto:taiweish@usc.edu).
## 📚 Citations
If you find our dataset useful, please cite [Efficient Reinforcement Finetuning via Adaptive Curriculum Learning](https://huggingface.co/papers/2504.05520):
```bibtex
@misc{shi2025efficientreinforcementfinetuningadaptive,
title={Efficient Reinforcement Finetuning via Adaptive Curriculum Learning},
author={Taiwei Shi and Yiyang Wu and Linxin Song and Tianyi Zhou and Jieyu Zhao},
year={2025},
eprint={2504.05520},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.05520},
}
```
Provider:
maas
Created:
2025-05-23
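Dataset overview
GSM8K_Difficulty extends the GSM8K dataset (8.5K grade school math word problems) by annotating each problem with a difficulty score. The scores are computed from the success rate of the Qwen 2.5-MATH-7B model over 128 attempts per problem. They are intended to support curriculum construction and model evaluation by providing a meaningful difficulty gradient across problems.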



