swallow-code

ModelScope community. Updated: 2026-01-09. Indexed: 2025-05-10.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/swallow-code
Description:
# SwallowCode

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowMath Icon" width="600">

### Notice

- **May 21, 2025**: We have deleted `ablation/exp1-the-stack-v2-train-smol-ids-python` because it was flagged as potentially containing unsafe data collected from the Python subset of https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids. However, since this dataset can be reconstructed from the-stack-v2-train-smol-ids, there is no issue in terms of reproducibility.
- **May 21, 2025**: ClamAV has flagged “Win.Trojan.MSShellcode-88” in `ablation/exp10-direct-sgcr/jsonl/train-00005-of-00005.jsonl` (a dataset directly rewritten from the-stack-v2-train-smol-ids). While loading it as JSONL for LLM training poses an extremely low risk, please be aware.

### Resources

- 🐙 **GitHub**: Explore the project repository, including pipeline code and prompts at [rioyokotalab/swallow-code-math](https://github.com/rioyokotalab/swallow-code-math).
- 📑 **arXiv**: Read our paper for detailed methodology and results at [arXiv:2505.02881](https://arxiv.org/abs/2505.02881).
- 🤗 **Sister Dataset**: Discover [SwallowMath](https://huggingface.co/datasets/tokyotech-llm/swallow-math), our companion dataset for mathematical reasoning.

## What is it?

💻 SwallowCode is a high-quality code dataset comprising approximately 16.1 billion tokens of Python code, derived from [The-Stack-v2-train-smol-ids](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids) through a four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process using [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct). The pipeline enforces style conformity (Style-Guided Code Rewriting, SGCR) and transforms snippets into self-contained, algorithmically efficient examples (Self-Contained Optimization Rewriting, SCOR).
SwallowCode is designed to enhance large language model (LLM) performance in program synthesis and code generation. More details are available in our paper: https://arxiv.org/abs/2505.02881.

<img src="assets/code_dataset_compare.png" width="800"/>

## What is being released?

The dataset is released as:

**SwallowCode**: Approximately 16.1 billion tokens of Python code, processed through syntax validation, pylint filtering, SGCR, and SCOR, formatted as JSONL files. (`ablation/exp11-scor/jsonl`)

Additionally, intermediate datasets from ablation experiments are released in the `ablation/` directory.

All data is publicly available under the [Llama 3.3 Community License](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/resolve/main/LICENSE).

## Dataset curation

SwallowCode refines Python code from The-Stack-v2-train-smol-ids through a four-stage pipeline to eliminate noise, ensure stylistic consistency, and enhance semantic quality.

Pipeline Overview

1. **Syntax Error Filtering**: Removes invalid Python code using the `compile()` function, reducing samples by 9.7% (from 41M to 37M).
2. **Linter-Based Filtering**: Applies pylint with a threshold score of 7.0 and a custom comment penalty heuristic, reducing samples by 34.3% (to 24.1M).
3. **Style-Guided Code Rewriting (SGCR)**: Uses [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to enforce [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) criteria, improving readability and consistency.
4. **Self-Contained Optimization Rewriting (SCOR)**: Ensures self-containment, optimizes algorithms, and transforms trivial snippets into educational examples.

The full pipeline and prompts are available at [https://github.com/rioyokotalab/swallow-code-math](https://github.com/rioyokotalab/swallow-code-math).
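
As an illustration of the first pipeline stage, the sketch below shows one way a `compile()`-based syntax check could be written. It is a minimal example under stated assumptions, not the project's actual filtering code (see the repository for that): the helper names `is_valid_python` and `filter_valid`, the set of exceptions caught, and the sample strings are all hypothetical.

```python
import warnings


def is_valid_python(source: str) -> bool:
    """Return True if `source` compiles as a Python module.

    compile() only parses and byte-compiles the text; it does not run it.
    """
    try:
        with warnings.catch_warnings():
            # Assumption: warnings (e.g. odd escape sequences) are not
            # treated as syntax errors in this sketch.
            warnings.simplefilter("ignore")
            compile(source, "<sample>", "exec")
        return True
    except (SyntaxError, ValueError, RecursionError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on some
        # Python versions; RecursionError: pathologically nested input.
        return False


def filter_valid(samples):
    """Keep only the samples that pass the syntax check."""
    return [s for s in samples if is_valid_python(s)]


# Hypothetical samples: the second one has a syntax error.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
    "print('hi')\n",
]
kept = filter_valid(samples)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Because `compile()` never executes the sample, a check of this kind can be applied to untrusted code, although the result depends on the Python version doing the parsing.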
<img src="assets/data-pipeline.png" width="800"/>

## Ablation Experiments

The `ablation/` directory contains JSONL files for intermediate datasets from ablation experiments. These datasets correspond to the experiments described in the paper:

- `exp1-the-stack-v2-train-smol-ids-python`: Baseline Python subset from The-Stack-v2.
- `exp2-syntax-error-filtered`: After syntax error filtering.
- `exp3-linter-filtered`: After pylint-based filtering (score ≥ 7).
- `exp4-code_comment_ja_or_en`: Restricted to English or Japanese comments.
- `exp5-sgcr`: After SGCR rewriting.
- `exp6-llm-based-scoring`: Filtered by LLM-based scoring (score ≥ 6).
- `exp7`: Mixed data (1:1 ratio of exp3 and exp5).
- `exp10-direct-sgcr`: Python subset from The-Stack-v2 (exp1) with direct SGCR applied, skipping syntax error and pylint-based filtering.
- `exp11-scor`: **Final SwallowCode** dataset after SCOR rewriting.

Each directory contains JSONL files with processed code samples. For details, see the paper’s Appendix (Tables 6–18) or the repository at [https://github.com/rioyokotalab/swallow-code-math](https://github.com/rioyokotalab/swallow-code-math).

<img src="assets/experiments.png" width="1000"/>

## Results and Performance

Continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) for approximately 50 billion tokens, with SwallowCode as the code subset (16% of the mixture), yields significant improvements:

- HumanEval: **+17.0** points pass@1 compared to [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu).
- HumanEval+: **+17.7** points pass@1 compared to [Stack-Edu](https://huggingface.co/datasets/HuggingFaceTB/stack-edu).

SwallowCode outperforms other datasets (e.g., CodeParrot-Clean, The-Stack-v1/v2, Stack-Edu) on code generation benchmarks, as shown in the paper (Figure 1, Tables 6–18).
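
To put the training setup above in perspective, the following back-of-the-envelope sketch estimates how many code tokens the model sees. It assumes the 16% share is measured in tokens, which the text does not state explicitly; the variable and function names are illustrative only.

```python
# Figures quoted in the text (all approximate).
TOTAL_TRAINING_TOKENS = 50e9  # continual pre-training budget
CODE_FRACTION = 0.16          # SwallowCode share of the mixture
SWALLOWCODE_TOKENS = 16.1e9   # size of the released dataset


def code_token_budget(total_tokens: float, code_fraction: float) -> float:
    """Tokens of code seen, assuming the mixture share is token-based."""
    return total_tokens * code_fraction


budget = code_token_budget(TOTAL_TRAINING_TOKENS, CODE_FRACTION)
coverage = budget / SWALLOWCODE_TOKENS

print(f"code tokens seen: {budget / 1e9:.1f}B")
print(f"fraction of SwallowCode covered: {coverage:.2f}")
```

Under that assumption the model sees roughly 8 billion code tokens, about half of the dataset, so the reported gains would come from less than one full pass over SwallowCode.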
## Considerations for Using the Data

### Social Impact of the Dataset

SwallowCode aims to advance LLM capabilities in code generation by providing a high-quality, openly licensed dataset. We:

- Promote transparency in code dataset curation.
- Reduce barriers for training code-proficient models.
- Offer a benchmark for code quality enhancement.

### Discussion of Biases

The dataset may inherit biases from The-Stack-v2 or Llama-3.3-70B-Instruct, including:

- Over-representation of certain coding patterns (e.g., dynamic programming (DP)).
- Influence of Llama-3.3-70B-Instruct’s preferences in naming and style.

### Other Known Limitations

- Limited to Python code in current experiments.
- May not capture all edge cases in code dependencies.
- Rewriting may introduce model-specific stylistic preferences.

## Licensing Information

SwallowCode is released under the Llama 3.3 Community License. Usage is subject to [The-Stack-v2’s licensing terms](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).

## Future work

Potential directions include:

- Extending the pipeline to other programming languages (e.g., Java, C++).
- Enhancing dependency handling for complex codebases.
- Exploring larger pre-training budgets to assess scalability.
- Integrating multilingual code comments beyond English and Japanese.

## Citation information

```
@misc{fujii2025rewritingpretrainingdataboosts,
  title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
  author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
  year={2025},
  eprint={2505.02881},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.02881},
}
```

Provider:
maas
Created:
2025-05-09