wrmedford/Gemma-4-E4B-it-SSD
收藏Hugging Face2026-04-07 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/wrmedford/Gemma-4-E4B-it-SSD
下载链接
链接失效反馈官方服务:
资源简介:
# SSD Dataset Replication (Gemma-4-E4B-it)
This dataset is a replication of the **"Embarrassingly Simple Self-Distillation Improves Code Generation" (SSD)** paper ([arXiv:2604.01193](https://arxiv.org/pdf/2604.01193)).
## Overview
The dataset contains coding problems and their corresponding solutions generated by **Gemma-4-E4B-it** using high-temperature sampling (T=1.1) to explore the model's latent capabilities. This approach, known as SSD, focuses on "self-distillation" where a model's own correct but non-greedy outputs are used for fine-tuning to improve its standard (greedy) performance.
## Dataset Construction
- **Seed Prompts:** Samples from `deepmind/code_contests`.
- **Generation:** High-temperature sampling (T=1.1, Top-K=20, Top-P=0.95) with **Gemma-4-E4B-it**.
- **Reasoning:** Reasoning chains (thinking) were enabled during generation to capture the model's logical process.
## Contents
- `ssd_dataset.jsonl`: The generated samples (13,328 entries) in OpenAI-style message format.
- `generate_dataset.py`: The replication script used to synthesize the data (vLLM-based, data-parallel).
## Usage
To replicate the generation process:
```bash
python generate_dataset.py --model google/gemma-4-E4B-it --dp <NUM_GPUS>
```
## Citation
```bibtex
@misc{zhang2026embarrassinglysimpleselfdistillationimproves,
title={Embarrassingly Simple Self-Distillation Improves Code Generation},
author={Ruixiang Zhang and Richard He Bai and Huangjie Zheng and Navdeep Jaitly and Ronan Collobert and Yizhe Zhang},
year={2026},
eprint={2604.01193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.01193},
}
```
# SSD 数据集复现(Gemma-4-E4B-it)
本数据集是对**《极简自蒸馏优化代码生成》(SSD)**论文的复现,该论文的arXiv预印本可访问:[arXiv:2604.01193](https://arxiv.org/pdf/2604.01193)。
## 数据集概述
该数据集包含编程问题及其对应的生成解决方案,生成模型为**Gemma-4-E4B-it**,采用高温采样方案(温度参数T=1.1)以探索模型的潜在能力。本方法即SSD,核心为“自蒸馏”:利用模型自身生成的正确非贪心解码输出进行微调,以提升模型的标准贪心解码性能。
## 数据集构建
- **种子提示词**:样本取自`deepmind/code_contests`数据集。
- **生成流程**:使用**Gemma-4-E4B-it**执行高温采样(T=1.1,Top-K=20,Top-P=0.95)生成样本。
- **推理过程捕获**:生成过程中启用推理链(思考步骤),以完整记录模型的逻辑推导过程。
## 数据内容
- `ssd_dataset.jsonl`:生成的样本集,共13328条数据,采用OpenAI风格的消息格式存储。
- `generate_dataset.py`:用于复现数据合成的脚本(基于vLLM框架,支持数据并行)。
## 使用方法
若需复现生成流程,请执行以下命令:
bash
python generate_dataset.py --model google/gemma-4-E4B-it --dp <GPU数量>
## 引用格式
bibtex
@misc{zhang2026embarrassinglysimpleselfdistillationimproves,
title={极简自蒸馏优化代码生成},
author={Ruixiang Zhang and Richard He Bai and Huangjie Zheng and Navdeep Jaitly and Ronan Collobert and Yizhe Zhang},
year={2026},
eprint={2604.01193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.01193},
}
提供机构:
wrmedford



