CoopReason/TESSY-SuperGPQA-3K
收藏Hugging Face2026-04-20 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/CoopReason/TESSY-SuperGPQA-3K
下载链接
链接失效反馈官方服务:
资源简介:
---
task_categories:
- text-generation
tags:
- GPQA
license: apache-2.0
language:
- en
---
## TESSY-SuperGPQA-3K
<img src="https://cdn-uploads.huggingface.co/production/uploads/656d9eb2b40203890228a4f8/HjD75pCOxK6s0VZxSzbwo.png" alt="Logo" width="600" style="display: block; margin: 0 auto;" />
<p align="center">
📄 <a href="https://arxiv.org/pdf/2604.14164">Paper Link</a>
|
🔗 <a href="https://github.com/CoopReason/TESSY/tree/main">GitHub Repository</a>
</p>
---
## 📣 Paper
[How to Fine-Tune a Reasoning Model? A Teacher–Student Cooperation Framework to Synthesize Student-Consistent SFT Data](https://arxiv.org/pdf/2604.14164)
## 🚀 Overview
We construct a programming contest training dataset for Qwen3-8B by leveraging GPT-OSS-120B as the teacher model. The synthesized data preserves the strong reasoning capabilities of GPT-OSS-120B, while being aligned with the data distribution of Qwen3-8B. This enables more effective on-policy SFT, ensuring that the student model learns from samples that are both high-quality and distribution-consistent with its own generation behavior.
> **Note:** This dataset is specifically tailored and optimized for **Qwen/Qwen3-8B**. We use GPT-OSS-120B as teacher and Qwen3-8B as student to synthesize this dataset. The question is sampled from m-a-p/SuperGPQA.
## 💡 Motivation
Training reasoning models (e.g., Qwen3) is highly sensitive to the data distribution. We observe that:
> ❗ Using off-policy data (e.g., directly from a strong teacher model) for SFT can lead to **severe catastrophic forgetting**, especially for complex reasoning tasks.
---
## 🔦 Key Idea
To address this critical issue, we propose **TESSY**, a novel **Teacher–Student Cooperative Data Synthesis framework** designed to generate *on-policy* training data. Instead of relying on a teacher model to fully generate training samples, TESSY **decouples the generation process into two distinct parts**:
- 🧠 **Teacher model** → specializes in generating *capability tokens*.
- ✍️ **Student model** → focuses on generating *style tokens* (e.g., Hmm, Wait...).
This cooperative approach ensures:
- **Alignment with student distribution (on-policy)**: The synthesized data is tailored to the student model's own generation patterns.
- **Preservation of teacher reasoning quality**: The teacher's advanced reasoning capabilities are effectively leveraged and maintained.
---
## 🧩 Method
<img src="https://cdn-uploads.huggingface.co/production/uploads/656d9eb2b40203890228a4f8/93LZKxa1cafsyLHdm-cl9.png" alt="TESSY Method Overview" width="800" style="display: block; margin: 0 auto;" />
TESSY performs **iterative cooperative generation** through the following steps:
1. **Predict Reasoning Boundaries**: The process begins by identifying the boundaries between reasoning steps and non-reasoning content within a given problem.
2. **Alternate Generation**: The teacher and student models then alternate in generating parts of the solution.
3. **Construct Full Trajectories**: By combining these collaboratively generated segments, TESSY constructs complete, high-quality reasoning trajectories that are aligned with the student model's distribution.
---
## ⚙️ Download
Install package `datasets`:
```bash
pip install datasets
```
Load this dataset:
```python
from datasets import load_dataset
dataset = load_dataset("CoopReason/TESSY-SuperGPQA-3K", split="train")
print(dataset[0]["dialogs"][0]["content"])
# Output: question
print(dataset[0]["dialogs"][1]["content"])
# Output: Synthetic reasoning trajectory and final answer
```
## 📌 Citation
If this work is useful to you, please cite:
```bibtex
@article{TESSY,
title={How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data},
author={Huang, Zixian and Yang, Kaichen and Huang, Xu and Hao, Feiyang and Ge, Qiming and Li, Bowen and Du, He and Chen, Kai and Guo, Qipeng},
journal={arXiv preprint arXiv:2604.14164},
year={2026}
}
```
---
任务类别:
- 文本生成
标签:
- GPQA
许可协议:Apache-2.0
语言:
- 英语
---
# TESSY-SuperGPQA-3K
<p align="center">
📄 <a href="https://arxiv.org/pdf/2604.14164">论文链接</a>
|
🔗 <a href="https://github.com/CoopReason/TESSY/tree/main">GitHub 仓库</a>
</p>
## 📣 论文
[《如何微调推理模型?一种师生协作框架以合成符合学生模型一致性的监督微调数据》](https://arxiv.org/pdf/2604.14164)
## 🚀 数据集概览
我们以GPT-OSS-120B作为教师模型,为Qwen3-8B构建了编程竞赛训练数据集。所合成的数据既保留了GPT-OSS-120B强大的推理能力,又与Qwen3-8B的数据分布对齐。这能够实现更高效的同策略监督微调,确保学生模型从兼具高质量且与其自身生成行为分布一致的样本中学习。
> **注意:** 本数据集专为**Qwen/Qwen3-8B**定制优化。我们以GPT-OSS-120B为教师模型、Qwen3-8B为学生模型合成了本数据集。问题样本取自`m-a-p/SuperGPQA`。
## 💡 研究动机
训练推理模型(如Qwen3)对数据分布极为敏感。我们观察到:
> ❗ 使用离线策略数据(例如直接取自强教师模型的数据)进行监督微调,可能会导致**严重的灾难性遗忘**,在复杂推理任务中尤为明显。
---
## 🔦 核心思路
为解决这一关键问题,我们提出**TESSY**——一种新颖的**师生协作数据合成框架**,用于生成*同策略*训练数据。与依赖教师模型完全生成训练样本不同,TESSY将生成过程解耦为两个独立阶段:
- 🧠 **教师模型**:专注生成*能力Token (Token)*。
- ✍️ **学生模型**:专注生成*风格Token (Token)*(例如“嗯”“稍等……”)。
这种协作方式能够确保:
- **与学生模型分布对齐(同策略)**:合成数据适配学生模型自身的生成模式。
- **保留教师模型的推理质量**:教师模型的先进推理能力得到有效利用与保留。
---
## 🧩 方法流程

TESSY通过以下步骤执行**迭代式协作生成**:
1. **预测推理边界**:首先识别给定问题中推理步骤与非推理内容的边界。
2. **交替生成**:随后教师与学生模型交替生成解决方案的各个部分。
3. **构建完整轨迹**:通过组合这些协作生成的片段,TESSY构建出与学生模型分布对齐的完整、高质量推理轨迹。
---
## ⚙️ 数据集下载
安装`datasets`库:
bash
pip install datasets
加载本数据集:
python
from datasets import load_dataset
dataset = load_dataset("CoopReason/TESSY-SuperGPQA-3K", split="train")
print(dataset[0]["dialogs"][0]["content"])
# 输出:问题文本
print(dataset[0]["dialogs"][1]["content"])
# 输出:合成的推理轨迹与最终答案
## 📌 引用信息
如果本工作对你有所帮助,请引用:
bibtex
@article{TESSY,
title={How to Fine-Tune a Reasoning Model? A Teacher–Student Cooperation Framework to Synthesize Student-Consistent SFT Data},
author={Huang, Zixian and Yang, Kaichen and Huang, Xu and Hao, Feiyang and Ge, Qiming and Li, Bowen and Du, He and Chen, Kai and Guo, Qipeng},
journal={arXiv preprint arXiv:2604.14164},
year={2026}
}
提供机构:
CoopReason



