
swallow-code-v2

ModelScope Community | Updated 2026-01-08 | Indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/tokyotech-llm/swallow-code-v2
Resource overview:
# SwallowCode-v2

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath-v2 Icon" width="500">

### Resources

- 📑 **arXiv**: Read our paper for detailed methodology and results at [arXiv:2505.02881](https://arxiv.org/abs/2505.02881).
- 🤗 **Sister Dataset**: Discover [SwallowMath-v2](https://huggingface.co/datasets/tokyotech-llm/swallow-math-v2), our companion dataset for mathematical reasoning.

## 💻 What is it?

[SwallowCode-v1](https://huggingface.co/datasets/tokyotech-llm/swallow-code) was a high-quality Python code dataset generated through an LLM-based rewriting pipeline. However, it had two significant limitations: (1) it was distributed under the **Llama 3.3 Community License**, and (2) its size was limited to **16.1 B** tokens, restricting large-scale pre-training.

To address these issues, we built **SwallowCode-v2**, a fully rewritten Python corpus derived from [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2), using [Qwen3-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507). The resulting dataset contains **49.8 billion** tokens and is released under the **Apache 2.0 License**, ensuring both open accessibility and reproducibility for research and commercial use.

As shown in the figure below, SwallowCode-v2 demonstrates stronger performance than other open-source code datasets on downstream code-generation benchmarks.<br>
<sub>† Note: While datasets such as [OpenCoder](https://huggingface.co/datasets/OpenCoder-LLM/RefineCode-code-corpus-meta) and [NVIDIA/Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1) are labeled “open,” they only release metadata, not the actual training samples. Unlike The-Stack-v2, they cannot be directly downloaded from public storage (e.g., S3) and instead require large-scale re-crawling of GitHub repositories based on metadata.
For smaller open-source LLM projects, this reconstruction process is prohibitively expensive, making it impractical to reproduce or directly compare those datasets. Hence, results for those corpora are omitted in our comparison.</sub>

<img src="swallow_code.png" width="800"/>

## 📊 Dataset Comparison

| Dataset | Token Count (Llama-3 Tokenizer) | License |
| :-------------------------------- | :-----------------------------: | :--------------------------------- |
| **[Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1)** | metadata release | [NVIDIA Open Data License Agreement](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample/raw/main/LICENSE.md) |
| [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu) (python) | 17.9 B tokens | - |
| **SwallowCode-v1 (our previous)** | 16.1 B tokens | Llama-3.3 Community License |
| **SwallowCode-v2 (this work)** | 49.8 B tokens | **Apache 2.0 License** |

## 📦 What is being released?

**SwallowCode-v2**: A **49.8 B**-token Apache-2.0-licensed Python code dataset rewritten from The-Stack-v2, designed for scalable LLM pre-training. All samples are auto-formatted, style-normalized, and enhanced for algorithmic clarity via an LLM rewriting pipeline.

## 🧩 Dataset curation

1. **Auto-Formatting** – Standardize code style using the [ruff formatter](https://docs.astral.sh/ruff/).
2. **Length Filtering** – Remove excessively long or truncated samples.
3. **LLM Quality Scoring** – Rate each snippet for readability and style compliance (0–10 scale) using the [SeedCoder](https://arxiv.org/abs/2506.03524) quality-scoring prompt.
4. **LLM Rewriting Phase** – Use Qwen3-235B-A22B-Instruct to rewrite and enhance code for clarity, structure, and algorithmic soundness.
5. **Post-Formatting** – Apply a final ruff pass to ensure uniform formatting and compliance.
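
The five curation steps above can be sketched as a chain of small functions. This is only an illustrative outline, not the project's actual pipeline: the `Sample` class, the `MAX_CHARS` limit, the `MEDIUM_BAND` score range, and the `formatter`, `scorer`, and `rewriter` callables are all hypothetical stand-ins (in practice they would wrap ruff, the SeedCoder-style scoring prompt, and the rewriting LLM). Restricting the rewrite to a "medium" score band is an assumption inferred from the `medium` directory names of the released stages.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical values for illustration only; the source does not state them.
MAX_CHARS = 20_000
MEDIUM_BAND: Tuple[int, int] = (4, 7)  # inclusive range on the 0-10 scale


@dataclass
class Sample:
    code: str
    score: Optional[int] = None


def length_filter(samples: Iterable[Sample], max_chars: int = MAX_CHARS) -> Iterator[Sample]:
    """Stage 2: drop empty samples and samples longer than max_chars."""
    for s in samples:
        if 0 < len(s.code) <= max_chars:
            yield s


def score_samples(samples: Iterable[Sample], scorer: Callable[[str], int]) -> Iterator[Sample]:
    """Stage 3: attach a 0-10 quality score produced by the scorer callable."""
    for s in samples:
        yield Sample(s.code, scorer(s.code))


def keep_band(samples: Iterable[Sample], band: Tuple[int, int] = MEDIUM_BAND) -> Iterator[Sample]:
    """Keep only samples whose score falls inside the (assumed) band."""
    low, high = band
    for s in samples:
        if s.score is not None and low <= s.score <= high:
            yield s


def run_pipeline(
    samples: Iterable[Sample],
    formatter: Callable[[str], str],
    scorer: Callable[[str], int],
    rewriter: Callable[[str], str],
) -> list:
    """Chain stages 1-5: format, length-filter, score, rewrite, re-format."""
    stage1 = (Sample(formatter(s.code)) for s in samples)
    stage2 = length_filter(stage1)
    stage3 = score_samples(stage2, scorer)
    stage4 = (Sample(rewriter(s.code), s.score) for s in keep_band(stage3))
    stage5 = (Sample(formatter(s.code), s.score) for s in stage4)
    return list(stage5)
```

Passing the formatter, scorer, and rewriter in as plain callables keeps the sketch runnable without ruff or an LLM endpoint; any real implementation would substitute its own components and thresholds.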
<img src="pipeline.png" width="800"/>

### 🗂️ Dataset structure

- **Stage 1** - auto-format: [stage1-auto-format/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage1-auto-format/python)
- **Stage 2** - length-filter: [stage2-length-filter/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage2-length-filter/python)
- **Stage 3** - llm-score: [stage3-llm-score/python](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage3-llm-score/python)
- **Stage 4** - llm-rewrite: [stage4-llm-rewrite/python/medium](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage4-llm-rewrite/python/medium)
- **Stage 5** - auto-format: [stage5-auto-format/python/medium](https://huggingface.co/datasets/tokyotech-llm/swallow-code-v2/tree/main/stage5-auto-format/python/medium) (**SwallowCode-v2**)

### 🧪 Rewriting ablation experiments

To investigate how different LLM-based rewriting strategies affect the quality of generated code data, we conducted the following ablation experiments.
All experiments involved **50B-token continual pre-training of Llama-3.1-8B**, and performance was tracked by measuring **HumanEval** and **HumanEval+** pass@1 scores over the course of training.
By using datasets created with different rewriting strategies as the training corpus, we compared the effectiveness of each method.
Insights obtained from these ablations directly informed the construction of **SwallowCode-v2**.

#### Instruct vs Thinking model

We compared the effectiveness of using an **Instruct** model and a **Thinking** model (both from Qwen-3-235B-A22B) for rewriting.
As shown in the figure below, there was **no significant difference** in performance between data rewritten by the Instruct model and that by the Thinking model.
However, the Thinking model outputs a `<think>...</think>` reasoning trajectory before producing the final rewritten code, leading to higher GPU cost per rewritten sample.
Based on these findings, we adopted the **Instruct model for rewriting**, as it provides comparable quality at a substantially lower computational cost.

<img src="swallow_code_instruct.png" width="800"/>

#### 1-stage Rewriting vs 2-stage Rewriting

In [SwallowCode-v1](https://huggingface.co/datasets/tokyotech-llm/swallow-code), we employed a **2-stage rewriting process**.
For SwallowCode-v2, we revisited the prompt design used in v1 to test whether a single-stage (1-stage) rewriting could achieve the same quality.
Specifically, we combined the two stages of v1 into a single instruction, asking the LLM to perform the same overall rewriting within one step.
However, since LLMs are known to ignore parts of overly complex prompts, we could not rule out that the act of explicitly separating the rewriting into two stages was itself beneficial.
Therefore, we directly compared 1-stage and 2-stage rewriting.
The results showed that **2-stage rewriting required nearly twice the GPU hours** but produced **similar downstream performance** to 1-stage rewriting.
Consequently, we adopted the 1-stage rewriting strategy for SwallowCode-v2 construction.

<img src="swallow_code_stage.png" width="800"/>

#### High Quality vs Medium Quality

Using the [SeedCoder](https://arxiv.org/abs/2506.03524) quality-scoring prompt, we evaluated and categorized source code data into **High**, **Medium**, and **Low** quality groups.
Intuitively, one might expect that higher-quality inputs would yield better rewritten data.
However, when we tested this hypothesis through HumanEval and HumanEval+ performance, the results showed the opposite trend: **rewriting from Medium-quality data slightly outperformed rewriting from High-quality data**, as shown below.
We hypothesize that this may be due to distributional differences: High-quality code often includes complex, class-based implementations or heavy library use, whereas Medium-quality code tends to resemble the **simpler, problem-oriented** format of HumanEval tasks.
This qualitative observation, while informative, remains a preliminary analysis and has not yet been verified through deeper experimentation.

<img src="swallow_code_quality.png" width="800"/>

## 📊 Results and Performance

SwallowCode-v2 achieved **+20.7** and **+21.9** higher pass@1 scores on HumanEval and HumanEval+, respectively, compared to [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu).
These experiments were conducted using Llama-3.1-8B.

<img src="swallow_code.png" width="800"/>

## 📝 Note

The SwallowCode-v2 project was originally designed to build a multilingual code dataset covering 13 programming languages.
However, due to the substantial GPU hours and development effort required, and since SwallowCode-v2 and SwallowMath-v2 were both developed by three students in parallel with their main research, completing all subsets proved infeasible.
We therefore decided to release the Python subset, which was fully constructed, as SwallowCode-v2.

Future versions (SwallowCode-v3 / SwallowMath-v3) are planned to be larger and higher-quality, and may incorporate Thinking-Augmentation and other advanced methodologies.
However, the continuation of this project depends on strong demand from the open community or the potential for clear academic contribution.

## ⚖️ Licensing Information

SwallowCode-v2 is released under the **Apache-2.0 License**.
Usage is subject to [The-Stack-v2’s licensing terms](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).
## 👥 Contributors

The dataset was primarily developed by the following contributors:

- [Kazuki Fujii](https://www.linkedin.com/in/kazuki-fujii/): Designed the experiments, implemented the data pipeline, and conducted the experiments.
- [Yukito Tajima](https://www.linkedin.com/in/yukito-tajima-51bbb2299/): Implemented the data pipeline and optimized the inference pipeline (vLLM, TensorRT-LLM).
- [Masaki Kawamura](https://www.linkedin.com/in/masaki-kawamura-0806a7361/): Co-designed the experiments, evaluated the models, and performed visualization and analysis.

## 📖 Citation

```
@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881},
}
```

Provider:
maas
Created:
2025-11-07