KlearReasoner-MathSub-30K
收藏魔搭社区2026-01-02 更新2025-09-13 收录
下载链接:
https://modelscope.cn/datasets/Kwai-Klear/KlearReasoner-MathSub-30K
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Summary
This dataset is a subset of the Klear-Reasoner Math RL dataset.
The full dataset contains approximately 88K entries, while this release includes a 30K-entry subset.
The subset was obtained by filtering the outputs of DeepSeek-R1-0120. For each prompt, DeepSeek-R1-0120 generated 16 responses, and we retained only those responses where the majority of completions passed a rule-based validator designed for mathematical correctness and format compliance.
You can load the dataset using:
```python
from datasets import load_dataset
dataset = load_dataset("Kwai-Klear/KlearReasoner-MathSub-30K")
```
See our paper and GitHub repository for more details.
| Resource | Link |
|---|---|
| 📝 Preprints | [Paper](https://arxiv.org/pdf/2508.07629) |
| 🤗 Daily Paper | [Paper](https://huggingface.co/papers/2508.07629) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 Contact | suzhenpeng13@163.com |
## Data Fields
- **data_source** (string) — The source identifier for the sample.
- **prompt** (list of dict) — The input prompt, stored as a list of message objects in chat format.
- **ability** (string) — The skill or task category associated with the sample.
- **reward_model** (dict) — Information about the ground truth or reward signal.
- **ground_truth** (string) — The expected correct answer (may include LaTeX formatting).
- **style** (string) — The method or type of evaluation, e.g., "rule".
- **index_level_0** (int) — An internal index or unique identifier for the sample.
## Demonstration of Data Quality
This dataset contains exclusively high-quality, filtered samples.
All samples have been selected to ensure accurate reward signals for reinforcement learning, following the gradient-preserving clipping policy optimization (GPPO) method introduced in our paper. Models trained using this dataset achieve strong generalization and reliable performance on a range of math reasoning tasks.
## Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2509.20712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.20712},
}
```
```bibtex
@article{DBLP:journals/corr/abs-2508-07629,
author = {Zhenpeng Su and
Leiyu Pan and
Xue Bai and
Dening Liu and
Guanting Dong and
Jiaming Huang and
Wenping Hu and
Fuzheng Zhang and
Kun Gai and
Guorui Zhou},
title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
Clipping Policy Optimization},
journal = {CoRR},
volume = {abs/2508.07629},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2508.07629},
doi = {10.48550/ARXIV.2508.07629},
eprinttype = {arXiv},
eprint = {2508.07629},
timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
# 数据集概览
本数据集为Klear-Reasoner数学强化学习(Klear-Reasoner Math RL)数据集的子集。完整数据集包含约8.8万条数据条目,本次发布的子集共包含3万条数据。
该子集通过对DeepSeek-R1-0120的模型输出进行筛选得到:针对每个输入提示,DeepSeek-R1-0120生成了16条回复,我们仅保留其中绝大多数生成结果通过了专为数学正确性与格式合规性设计的基于规则的验证器的回复。
可通过以下代码加载本数据集:
python
from datasets import load_dataset
dataset = load_dataset("Kwai-Klear/KlearReasoner-MathSub-30K")
更多细节可查阅我们的论文与GitHub仓库。
| 资源类型 | 链接 |
|---|---|
| 📝 预印本 | [论文](https://arxiv.org/pdf/2508.07629) |
| 🤗 论文页面 | [论文](https://huggingface.co/papers/2508.07629) |
| 🤗 模型仓库 | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 数据集仓库 | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 数据集仓库 | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 🐛 问题与讨论 | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 联系方式 | suzhenpeng13@163.com |
## 数据字段
- **data_source**(字符串类型)—— 样本的来源标识符。
- **prompt**(字典列表类型)—— 输入提示,以对话格式的消息对象列表形式存储。
- **ability**(字符串类型)—— 与样本关联的技能或任务类别。
- **reward_model**(字典类型)—— 关于真实标签或奖励信号的信息。
- **ground_truth**(字符串类型)—— 预期的正确答案(可能包含LaTeX格式)。
- **style**(字符串类型)—— 评估方法或类型,例如"rule(基于规则)"。
- **index_level_0**(整数类型)—— 样本的内部索引或唯一标识符。
## 数据质量说明
本数据集仅包含经过筛选的高质量样本。所有样本均经过筛选,以确保为强化学习提供准确的奖励信号,该筛选流程遵循我们论文中提出的梯度保留裁剪策略优化(gradient-preserving clipping policy optimization, GPPO)方法。使用本数据集训练的模型在各类数学推理任务中均展现出优异的泛化能力与可靠的性能表现。
## 引用
若您认为本工作对您有所帮助,请引用我们的论文:
bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2509.20712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.20712},
}
bibtex
@article{DBLP:journals/corr/abs-2508.07629,
author = {Zhenpeng Su and
Leiyu Pan and
Xue Bai and
Dening Liu and
Guanting Dong and
Jiaming Huang and
Wenping Hu and
Fuzheng Zhang and
Kun Gai and
Guorui Zhou},
title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
Clipping Policy Optimization},
journal = {CoRR},
volume = {abs/2508.07629},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2508.07629},
doi = {10.48550/ARXIV.2508.07629},
eprinttype = {arXiv},
eprint = {2508.07629},
timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2508.07629.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
提供机构:
maas
创建时间:
2025-09-06



