OmniAI-ZJU/NuminaMath-Cot-Distillation-100K
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/OmniAI-ZJU/NuminaMath-Cot-Distillation-100K
Resource description:
# NuminaMath-Cot-Distillation-100K: A Distilled Reasoning Dataset for Group Fine-Tuning
## 💡 Dataset Summary
**NuminaMath-Cot-Distillation-100K** is a high-quality instruction-tuning dataset specifically designed for mathematical reasoning tasks. It is the official dataset released for the paper **"GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification,"** which has been accepted to **ACL 2026 Findings**.
This dataset is designed to address the "single-path dependency" and "entropy collapse" issues inherent in standard Supervised Fine-Tuning (SFT) by providing diverse reasoning trajectories for each mathematical problem. It serves as the foundational data for training models within the **Group Fine-Tuning (GFT)** framework.
## 📚 Dataset Lineage
This dataset is derived from, and extends, the open-source [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) dataset. NuminaMath-CoT provides a wide range of mathematical challenges, from high school exercises to international olympiad-level problems.
## 🛠️ Construction Method
Following the methodology described in the GFT paper, we applied the following modifications to create this 100K version:
* **Data Sampling**: We randomly sampled **100,000 (100K)** unique mathematical problems from the original NuminaMath-CoT corpus.
* **Expert Trace Preservation**: For each problem, we retained the original chain-of-thought (CoT) reasoning path to serve as the "Expert Demonstration" ($y_{exp}$).
* **Multi-Path Teacher Distillation**:
  * **Teacher Model**: We utilized **Qwen-2.5-Math-72B** as the powerful teacher model to introduce diverse reasoning paradigms.
  * **Response Generation**: For every problem in the 100K subset, we generated **8 distilled responses** using the teacher model.
  * **Rationale**: These diverse teacher outputs ($y_{demo}$) are integrated into a hybrid response group to break single-path dependency and provide comparative signals for advantage-based learning.
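
The construction steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helpers `sample_problems` and `build_hybrid_group`, the `teacher_generate` callable, and the record field names (`problem`, `solution`, `y_exp`, `y_demo`) are hypothetical, and a stub function stands in for the teacher model.

```python
import random
from typing import Callable, Dict, List

NUM_TEACHER_RESPONSES = 8  # the card states 8 distilled responses per problem


def sample_problems(corpus: List[Dict[str, str]], k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Randomly sample k records with unique problem text (hypothetical helper)."""
    # Deduplicate on the problem text; later duplicates overwrite earlier ones.
    unique = list({rec["problem"]: rec for rec in corpus}.values())
    return random.Random(seed).sample(unique, k)


def build_hybrid_group(
    record: Dict[str, str],
    teacher_generate: Callable[[str], str],
    n: int = NUM_TEACHER_RESPONSES,
) -> Dict[str, object]:
    """Keep the original CoT as y_exp and attach n teacher responses as y_demo."""
    return {
        "problem": record["problem"],
        "y_exp": record["solution"],  # assumed field name for the original CoT
        "y_demo": [teacher_generate(record["problem"]) for _ in range(n)],
    }


def stub_teacher(problem: str) -> str:
    """Placeholder for a real teacher-model call (assumption for this sketch)."""
    return f"[teacher response to: {problem}]"


corpus = [
    {"problem": "1+1=?", "solution": "2"},
    {"problem": "2*3=?", "solution": "6"},
    {"problem": "1+1=?", "solution": "2"},  # duplicate, removed before sampling
]
sampled = sample_problems(corpus, k=2)
groups = [build_hybrid_group(rec, stub_teacher) for rec in sampled]
```

In a real run, `stub_teacher` would be replaced by sampling from the teacher model with a non-zero temperature so that the eight responses actually differ.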
## 🚀 Usage & Purpose
This dataset is optimized for:
* **GFT Training**: Supporting the construction of hybrid response groups to perform Group Advantage Learning (GAL) and Dynamic Coefficient Rectification (DCR).
* **RL Alignment**: Providing a superior cold-start initialization for subsequent reinforcement learning (e.g., GRPO), raising the attainable performance ceiling.
* **Diversity Analysis**: Enabling researchers to analyze solution coverage and solution variety in mathematical reasoning.
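
To make the idea of comparative signals within a response group more concrete, the sketch below scores one group and computes mean-centered, standard-deviation-normalized advantages in the style commonly used with GRPO. It is not the GFT paper's unbiased group advantage or its Dynamic Coefficient Rectification, whose formulas are not given in this card. The function names, the exact-match reward, and the use of population standard deviation are assumptions for illustration. The last helper counts distinct final answers as a crude diversity measure.

```python
from statistics import mean, pstdev
from typing import List


def exact_match_rewards(final_answers: List[str], reference: str) -> List[float]:
    """Reward 1.0 when a final answer matches the reference string, else 0.0."""
    return [1.0 if a.strip() == reference.strip() else 0.0 for a in final_answers]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


def distinct_answer_count(final_answers: List[str]) -> int:
    """Number of distinct final answers in one response group."""
    return len({a.strip() for a in final_answers})


# Toy group of four responses to a problem whose reference answer is "6".
answers = ["6", "6", "5", "6"]
rewards = exact_match_rewards(answers, reference="6")
advantages = group_relative_advantages(rewards)
```

Correct responses receive positive advantages and the incorrect one a negative advantage. When every response in a group earns the same reward, all advantages are zero and the group carries no comparative signal.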
## 📖 Citation
If you use this dataset in your research, please cite our ACL paper:
```bibtex
@article{gan2026gft,
title={GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification},
author={Gan, Wangjie and Pan, Miao and Xi, Linbo and Zhang, Wenqi and Chen, Jintao and Yin, Jianwei and Zhang, Xuhong},
journal={arXiv preprint arXiv:2604.14258},
year={2026}
}
```
Provider:
OmniAI-ZJU



