wrmedford/Gemma-4-E4B-it-SSD

Name: wrmedford/Gemma-4-E4B-it-SSD
Creator: wrmedford
Published: 2026-04-07 02:39:49
License: No description provided

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/wrmedford/Gemma-4-E4B-it-SSD

Description:

# SSD Dataset Replication (Gemma-4-E4B-it)

This dataset is a replication of the **"Embarrassingly Simple Self-Distillation Improves Code Generation" (SSD)** paper ([arXiv:2604.01193](https://arxiv.org/pdf/2604.01193)).

## Overview

The dataset contains coding problems and their corresponding solutions generated by **Gemma-4-E4B-it** using high-temperature sampling (T=1.1) to explore the model's latent capabilities. This approach, known as SSD, focuses on "self-distillation", where a model's own correct but non-greedy outputs are used for fine-tuning to improve its standard (greedy) performance.

## Dataset Construction

- **Seed Prompts:** Samples from `deepmind/code_contests`.
- **Generation:** High-temperature sampling (T=1.1, Top-K=20, Top-P=0.95) with **Gemma-4-E4B-it**.
- **Reasoning:** Reasoning chains (thinking) were enabled during generation to capture the model's logical process.

## Contents

- `ssd_dataset.jsonl`: The generated samples (13,328 entries) in OpenAI-style message format.
- `generate_dataset.py`: The replication script used to synthesize the data (vLLM-based, data-parallel).

## Usage

To replicate the generation process:

```bash
python generate_dataset.py --model google/gemma-4-E4B-it --dp <NUM_GPUS>
```

## Citation

```bibtex
@misc{zhang2026embarrassinglysimpleselfdistillationimproves,
  title={Embarrassingly Simple Self-Distillation Improves Code Generation},
  author={Ruixiang Zhang and Richard He Bai and Huangjie Zheng and Navdeep Jaitly and Ronan Collobert and Yizhe Zhang},
  year={2026},
  eprint={2604.01193},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01193},
}
```
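As a quick sanity check on `ssd_dataset.jsonl`, the sketch below reads the file line by line and tallies message roles. It assumes each line is a JSON object with a top-level `messages` list of `{"role", "content"}` entries, which is the usual OpenAI-style layout. The card does not spell out the schema, so the key names and the helper names `iter_records` and `role_counts` are illustrative, not part of the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def role_counts(records: Iterable[dict]) -> Counter:
    """Count message roles.

    Assumption: each record carries a "messages" list whose items have a
    "role" key. Records without that key are simply skipped.
    """
    counts = Counter()
    for record in records:
        for message in record.get("messages", []):
            counts[message.get("role", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Assumes ssd_dataset.jsonl was downloaded to the working directory.
    with open("ssd_dataset.jsonl", encoding="utf-8") as handle:
        print(role_counts(iter_records(handle)))
```

Streaming the file keeps memory use flat regardless of entry count. If the actual records use different keys, only `role_counts` needs adjusting.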

Provider:

wrmedford
