KlearReasoner-CodeSub-15K
收藏魔搭社区2025-12-05 更新2025-09-13 收录
下载链接:
https://modelscope.cn/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Summary
This dataset is a high-quality subset of the Klear-Reasoner Code RL dataset, derived from the RL data used in the [rllm project](https://github.com/agentica-project/rllm). Part of this data contributed to training Klear-Reasoner’s code reasoning models.
The dataset is carefully cleaned and filtered to include only reliable samples suitable for reinforcement learning. Models trained with this dataset have shown substantial performance improvements across various code reasoning benchmarks.
You can load the dataset via the Hugging Face datasets library:
```python
from datasets import load_dataset
dataset = load_dataset("Kwai-Klear/KlearReasoner-CodeSub-15K")
```
| Resource | Link |
|---|---|
| 📝 Preprints | [Paper](https://arxiv.org/pdf/2508.07629) |
| 🤗 Daily Paper | [Paper](https://huggingface.co/papers/2508.07629) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 Contact | suzhenpeng13@163.com |
## Data Fields
- **data_source** (string) — The source identifier for the sample.
- **prompt** (list of dict) — The input prompt, stored as a list of message objects in chat format.
- **ability** (string) — The skill or task category associated with the sample.
- **reward_model** (dict) — Information about the ground truth or reward signal.
- **ground_truth** (string) — The expected correct answer (may include LaTeX formatting).
- **style** (string) — The method or type of evaluation, e.g., "rule".
- **index_level_0** (int) — An internal index or unique identifier for the sample.
## Demonstration of Data Quality
This dataset contains exclusively high-quality, filtered samples.
All samples have been selected to ensure accurate reward signals for reinforcement learning, following the gradient-preserving clipping policy optimization (GPPO) method introduced in our paper. Models trained using this dataset achieve strong generalization and reliable performance on a range of math reasoning tasks.
## Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2509.20712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.20712},
}
```
```bibtex
@article{DBLP:journals/corr/abs-2508-07629,
author = {Zhenpeng Su and
Leiyu Pan and
Xue Bai and
Dening Liu and
Guanting Dong and
Jiaming Huang and
Wenping Hu and
Fuzheng Zhang and
Kun Gai and
Guorui Zhou},
title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
Clipping Policy Optimization},
journal = {CoRR},
volume = {abs/2508.07629},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2508.07629},
doi = {10.48550/ARXIV.2508.07629},
eprinttype = {arXiv},
eprint = {2508.07629},
timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
# 数据集概述
本数据集是Klear-Reasoner代码强化学习(Reinforcement Learning, RL)数据集的高质量子集,源自[rllm项目](https://github.com/agentica-project/rllm)所使用的强化学习数据。其中部分数据用于训练Klear-Reasoner的代码推理模型。
本数据集经过精心清洗与筛选,仅保留适用于强化学习的可靠样本。基于该数据集训练的模型在各类代码推理基准测试中均展现出显著的性能提升。
您可通过Hugging Face数据集库加载本数据集:
python
from datasets import load_dataset
dataset = load_dataset("Kwai-Klear/KlearReasoner-CodeSub-15K")
| 资源类型 | 链接 |
|---|---|
| 📝 预印本 | [论文](https://arxiv.org/pdf/2508.07629) |
| 🤗 每日论文 | [论文](https://huggingface.co/papers/2508.07629) |
| 🤗 模型仓库 | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 数据集仓库 | [数学强化学习(Math RL)](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 数据集仓库 | [代码强化学习(Code RL)](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 🐛 问题与讨论 | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 联系方式 | suzhenpeng13@163.com |
# 数据字段
- **data_source**(字符串类型)—— 样本的来源标识符。
- **prompt**(字典列表类型)—— 输入提示,以聊天格式的消息对象列表形式存储。
- **ability**(字符串类型)—— 该样本对应的技能或任务类别。
- **reward_model**(字典类型)—— 关于基准真值或奖励信号的信息。
- **ground_truth**(字符串类型)—— 预期的正确答案(可能包含LaTeX格式)。
- **style**(字符串类型)—— 评估方法或类型,例如"rule"(规则式)。
- **index_level_0**(整数类型)—— 样本的内部索引或唯一标识符。
# 数据质量示例
本数据集仅包含经过筛选的高质量样本。所有样本均经过严格挑选,以确保强化学习所需的奖励信号准确无误,且遵循了论文中提出的梯度保留裁剪策略优化(GPPO)方法。基于该数据集训练的模型在各类数学推理任务中均具备出色的泛化能力与可靠性能。
# 引用方式
如果您认为本工作对您有所帮助,请引用我们的论文:
bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2509.20712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.20712},
}
bibtex
@article{DBLP:journals/corr/abs-2508.07629,
author = {Zhenpeng Su and
Leiyu Pan and
Xue Bai and
Dening Liu and
Guanting Dong and
Jiaming Huang and
Wenping Hu and
Fuzheng Zhang and
Kun Gai and
Guorui Zhou},
title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
Clipping Policy Optimization},
journal = {CoRR},
volume = {abs/2508.07629},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2508.07629},
doi = {10.48550/ARXIV.2508.07629},
eprinttype = {arXiv},
eprint = {2508.07629},
timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2508.07629.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
提供机构:
maas
创建时间:
2025-09-06



