TMLR-Group-HF/Co-rewarding-RephrasedDAPO-14k

Name: TMLR-Group-HF/Co-rewarding-RephrasedDAPO-14k
Creator: TMLR-Group-HF
Published: 2025-10-11 06:48:10
License: MIT

Source: Hugging Face. Updated 2025-10-11; indexed 2026-02-07.

Download link:

https://hf-mirror.com/datasets/TMLR-Group-HF/Co-rewarding-RephrasedDAPO-14k


Resource description:

---
license: mit
task_categories:
- text-generation
language: en
tags:
- mathematical-reasoning
- reinforcement-learning
- self-supervised-learning
- llm
- llm-reasoning
- question-rewriting
---

# Co-rewarding: Rephrased DAPO-14k Training Set

This repository contains the DAPO-14k training set used by the **Co-rewarding-I** method, with the questions rephrased by the Qwen3-32B model.

This dataset is associated with the paper [Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models](https://huggingface.co/papers/2508.00410).

Code: [https://github.com/tmlr-group/Co-rewarding](https://github.com/tmlr-group/Co-rewarding)

The rephrased questions were generated using the following prompt:

```
You are given a math problem. Please rewrite it using different wording and a different real-world scenario, while keeping the underlying mathematical meaning and answer exactly the same.

Guidelines:
1. Do not change the math logic or the final answer.
2. Use different words and a new context to make it look like a different problem.
3. Avoid copying phrases or sentence structures from the original.
4. Make sure the rewritten question is natural, clear, and solvable.
5. Output ONLY between the following markers, and strictly in this format (no extra explanation):

### RESULT_START
ORIGINAL:
<original question>
REWRITE:
<rewritten question>
### RESULT_END
```

This dataset contains the original math problems from the DAPO-14k dataset and their rephrased versions, which keep the same solutions as the originals.

## Sample Usage (Rephrasing Data)

This dataset was created by rephrasing existing math problems. To produce similar rephrased data from other datasets for training Co-rewarding-I, you can use the `rewrite_questions.py` script from the project's GitHub repository.
Below is an example demonstrating how to rephrase the DAPO-14k data, assuming you have already set up the environment and copied the preprocessed dataset as described in the [GitHub repository](https://github.com/tmlr-group/Co-rewarding#install-environment):

```bash
# Assuming you are in the Co-rewarding-I directory and data is preprocessed
python rewrite_questions.py \
  --input_path data/dapo/train.parquet \
  --output_jsonl data/dapo/train_rewrite_Qwen3-32B.jsonl \
  --output_parquet data/dapo/train_rewrite_Qwen3-32B.parquet \
  --output_original_parquet data/dapo/train_original.parquet \
  --model_path $YOUR_Qwen3-32B_MODEL_PATH \
  --tokenizer_path $YOUR_Qwen3-32B_TOKENIZER_PATH \
  --question_column prompt \
  --batch_size 128
```

Replace `$YOUR_Qwen3-32B_MODEL_PATH` and `$YOUR_Qwen3-32B_TOKENIZER_PATH` with the actual paths to your downloaded Qwen3-32B model and its tokenizer.

## Citation

If you use this dataset, please cite our paper!

```bibtex
@article{zhang2025co,
  title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zhang, Zizhuo and Zhu, Jianing and Ge, Xinmu and Zhao, Zihua and Zhou, Zhanke and Li, Xuan and Feng, Xiao and Yao, Jiangchao and Han, Bo},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025}
}
```
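As a supplement to the rephrasing prompt quoted in the dataset card, the sketch below shows one way the marker-delimited model output could be parsed into an original/rewrite pair. It is not taken from `rewrite_questions.py`; the function name `parse_rewrite`, the regular expression, and the choice to keep the last matching block are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern (an assumption, not the project's own code): it matches
# the "### RESULT_START ... ### RESULT_END" block that the prompt asks for.
RESULT_BLOCK = re.compile(
    r"### RESULT_START\s*"
    r"ORIGINAL:\s*(?P<original>.*?)\s*"
    r"REWRITE:\s*(?P<rewrite>.*?)\s*"
    r"### RESULT_END",
    re.DOTALL,
)


def parse_rewrite(model_output: str) -> Optional[Dict[str, str]]:
    """Return the original and rewritten question, or None if no block is found."""
    matches = list(RESULT_BLOCK.finditer(model_output))
    if not matches:
        return None
    # Assumption: if the model emits several blocks, the last one is the answer.
    last = matches[-1]
    return {
        "original": last.group("original").strip(),
        "rewrite": last.group("rewrite").strip(),
    }


if __name__ == "__main__":
    sample = (
        "Some free-form text before the markers.\n"
        "### RESULT_START\n"
        "ORIGINAL: What is 2 + 3?\n"
        "REWRITE: A basket holds 2 apples and 3 pears. How many fruits are in it?\n"
        "### RESULT_END\n"
    )
    print(parse_rewrite(sample))
```

Returning `None` for malformed output lets a caller skip or retry a sample rather than store a partial rewrite; whether the actual pipeline handles failures this way is not stated in the dataset card.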

Provider:

TMLR-Group-HF
