TMLR-Group-HF/Co-rewarding-RephrasedDAPO-14k
Source: Hugging Face. Updated 2025-10-11; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/TMLR-Group-HF/Co-rewarding-RephrasedDAPO-14k
Resource description:
---
license: mit
task_categories:
- text-generation
language: en
tags:
- mathematical-reasoning
- reinforcement-learning
- self-supervised-learning
- llm
- llm-reasoning
- question-rewriting
---
# Co-rewarding: Rephrased DAPO-14k Training Set
This repository contains the DAPO-14k training set, rephrased by the Qwen3-32B model, as used in the **Co-rewarding-I** method. The dataset accompanies the paper [Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models](https://huggingface.co/papers/2508.00410).
Code: [https://github.com/tmlr-group/Co-rewarding](https://github.com/tmlr-group/Co-rewarding)
The rephrased questions were generated using the following prompt:
```
You are given a math problem. Please rewrite it using different wording and a different real-world scenario, while keeping the underlying mathematical meaning and answer exactly the same.
Guidelines:
1. Do not change the math logic or the final answer.
2. Use different words and a new context to make it look like a different problem.
3. Avoid copying phrases or sentence structures from the original.
4. Make sure the rewritten question is natural, clear, and solvable.
5. Output ONLY between the following markers, and strictly in this format (no extra explanation):
### RESULT_START
ORIGINAL:
<original question>
REWRITE:
<rewritten question>
### RESULT_END
```
This dataset contains each original math problem from the DAPO-14k dataset together with its rephrased version, which keeps the same solution as the original.
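A response that follows the marker format requested by the prompt can be split back into the original and rewritten question. The following is a minimal sketch under that assumption; `parse_rewrite` is a hypothetical helper written for illustration and is not taken from the project's `rewrite_questions.py`, and the sample response is made up.

```python
import re
from typing import Optional, Tuple

# Marker strings are copied from the prompt; everything between them is captured.
RESULT_PATTERN = re.compile(
    r"### RESULT_START\s*ORIGINAL:\s*(?P<original>.*?)\s*"
    r"REWRITE:\s*(?P<rewrite>.*?)\s*### RESULT_END",
    re.DOTALL,
)


def parse_rewrite(response: str) -> Optional[Tuple[str, str]]:
    """Return (original, rewrite) from a model response, or None if malformed."""
    match = RESULT_PATTERN.search(response)
    if match is None:
        return None  # markers missing or out of order
    original = match.group("original").strip()
    rewrite = match.group("rewrite").strip()
    if not rewrite:
        return None  # an empty rewrite is treated as a failed generation
    return original, rewrite


# Made-up response, for illustration only.
sample = (
    "### RESULT_START\n"
    "ORIGINAL:\n"
    "What is 2 + 3?\n"
    "REWRITE:\n"
    "A basket holds 2 apples and 3 more are added. How many apples are there?\n"
    "### RESULT_END\n"
)
print(parse_rewrite(sample))
```

Returning `None` for malformed output lets a caller skip or retry such a question rather than store a broken pair; whether the original pipeline does the same is not stated here.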
## Sample Usage (Rephrasing Data)
This dataset was created by rephrasing existing math problems. To produce similar rephrased data from other datasets for Co-rewarding-I training, use the `rewrite_questions.py` script from the project's GitHub repository. The example below rephrases the DAPO-14k data and assumes that you have already set up the environment and copied the preprocessed dataset as described in the [GitHub repository](https://github.com/tmlr-group/Co-rewarding#install-environment):
```bash
# Assuming you are in the Co-rewarding-I directory and data is preprocessed
python rewrite_questions.py \
--input_path data/dapo/train.parquet \
--output_jsonl data/dapo/train_rewrite_Qwen3-32B.jsonl \
--output_parquet data/dapo/train_rewrite_Qwen3-32B.parquet \
--output_original_parquet data/dapo/train_original.parquet \
--model_path $YOUR_Qwen3-32B_MODEL_PATH \
--tokenizer_path $YOUR_Qwen3-32B_TOKENIZER_PATH \
--question_column prompt \
--batch_size 128
```
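Replace `$YOUR_Qwen3-32B_MODEL_PATH` and `$YOUR_Qwen3-32B_TOKENIZER_PATH` with the actual paths to your downloaded Qwen3-32B model and its tokenizer. These are placeholders, not usable shell variables: a shell variable name cannot contain a hyphen, so substitute the paths literally before running the command.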
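If you prefer to launch the script from Python, the same command can be assembled as an argument list, which avoids shell quoting and placeholder substitution. This is a sketch only: the flags are copied from the command above, while `build_rewrite_command` and its parameters (`data_dir`, `tag`) are hypothetical names introduced for illustration.

```python
import sys
from pathlib import Path
from typing import List


def build_rewrite_command(
    model_path: str,
    tokenizer_path: str,
    data_dir: str = "data/dapo",
    tag: str = "Qwen3-32B",
    batch_size: int = 128,
) -> List[str]:
    """Assemble the rewrite_questions.py invocation shown above as an argv list."""
    data = Path(data_dir)
    return [
        sys.executable, "rewrite_questions.py",
        "--input_path", (data / "train.parquet").as_posix(),
        "--output_jsonl", (data / f"train_rewrite_{tag}.jsonl").as_posix(),
        "--output_parquet", (data / f"train_rewrite_{tag}.parquet").as_posix(),
        "--output_original_parquet", (data / "train_original.parquet").as_posix(),
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--question_column", "prompt",
        "--batch_size", str(batch_size),
    ]


# Example paths are placeholders; point them at your local Qwen3-32B download.
cmd = build_rewrite_command("/path/to/Qwen3-32B", "/path/to/Qwen3-32B")
print(" ".join(cmd))
# To actually run it from the Co-rewarding-I directory:
#   import subprocess; subprocess.run(cmd, check=True)
```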
## Citation
If you use this dataset, please cite our paper!
```bibtex
@article{zhang2025co,
title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
author={Zhang, Zizhuo and Zhu, Jianing and Ge, Xinmu and Zhao, Zihua and Zhou, Zhanke and Li, Xuan and Feng, Xiao and Yao, Jiangchao and Han, Bo},
journal={arXiv preprint arXiv:2508.00410},
year={2025}
}
```
Provider:
TMLR-Group-HF



