code_bagel
ModelScope community | Updated 2025-06-05 | Indexed 2024-06-22
Download link:
https://modelscope.cn/datasets/thomas/code_bagel
Resource description:
## A coding bagel, with everything coding related
- Around 800 million tokens of unique coding data
- 10,000 max tokens per line
- Support for over 100 coding languages (a list of the languages, and how often each appears in the dataset, is at the bottom of this dataset card)

## Want to train your own coding model with this dataset? Just follow the doc and instructions at the bottom of this dataset card.
This dataset contains more than 3.2 million lines of high-quality, filtered, uncensored, deduplicated coding data.
It combines the largest and highest-quality instruction-based coding datasets on Hugging Face and is big enough to continue pretraining a new coding model.
The process to create this dataset was as follows:
1. Download all the individual datasets
2. Use Meta.ai to create code to extract the data from the dataset into alpaca format, and add an instruction to most of them
3. Use the same extraction method to combine all the datasets into one
4. Use Claude.ai to create the code to dedupe and uncensor the data
(Note: the glaiveai/glaive-function-calling-v2 dataset was not uncensored because it contains function-calling data, where the model is sometimes required to refuse incorrect function calls.)
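
The scripts generated with Meta.ai and Claude.ai are not included in this card, so the following is only a minimal sketch of what the extract, combine, and dedupe steps might look like. The source field names (`prompt`, `response`, `question`, `answer`), the helper names, and the whitespace-normalized hashing are all assumptions made for illustration; the uncensoring step is not shown.

```python
import hashlib
import json


def to_alpaca(record, instruction_key, output_key, default_instruction=""):
    """Map one source record onto the Alpaca fields (instruction, input, output)."""
    return {
        "instruction": record.get(instruction_key) or default_instruction,
        "input": record.get("input", ""),
        "output": record.get(output_key, ""),
    }


def dedupe(rows):
    """Keep the first occurrence of each row, comparing whitespace-normalized text."""
    seen = set()
    unique = []
    for row in rows:
        normalized = {k: " ".join(str(v).split()) for k, v in row.items()}
        digest = hashlib.sha256(
            json.dumps(normalized, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Two hypothetical sources that use different field names.
source_a = [{"prompt": "Reverse a string in Python.", "response": "s[::-1]"}]
source_b = [
    {"question": "Reverse a string in Python.", "answer": "s[::-1]"},
    {"question": "Add two numbers.", "answer": "a + b"},
]

merged = [to_alpaca(r, "prompt", "response") for r in source_a]
merged += [to_alpaca(r, "question", "answer") for r in source_b]
unique_rows = dedupe(merged)
print(len(merged), len(unique_rows))
```

Converting every source to the same three fields before deduplicating means identical examples from different upstream datasets collapse to a single row, regardless of how each source named its columns.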
_______________________________________________________________________________________________
The following datasets were used in the merger of this dataset:
- https://huggingface.co/datasets/layoric/tiny-codes-alpaca
- https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3
- https://huggingface.co/datasets/ajibawa-2023/Code-290k-ShareGPT
- https://huggingface.co/datasets/TIGER-Lab/MathInstruct
- https://huggingface.co/datasets/chargoddard/commitpack-ft-instruct-rated
- https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca
- https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K
- https://huggingface.co/datasets/cognitivecomputations/dolphin-coder
- https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
- https://huggingface.co/datasets/coseal/CodeUltraFeedback_binarized
- https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2
- https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO
_________________________________________________________________________________________
## How to train your llama (or other AI model):
1. Go to this Google Colab: https://colab.research.google.com/drive/1bX4BsjLcdNJnoAf7lGXmWOgaY8yekg8p?usp=sharing#scrollTo=LjY75GoYUCB8
2. Click File -> Download -> Download .ipynb
3. Go to tensordock.com (make an account)
4. Deploy a server (an A5000 24 GB offers very good price-to-performance) and start a Jupyter lab
5. Drag and drop your downloaded .ipynb (the Colab notebook file) into your Jupyter lab
6. Edit the notebook to match your Hugging Face username and add your Hugging Face token
7. Run the code
8. Enjoy!
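
Because the data is stored in Alpaca format, a training notebook typically has to turn each row into a single prompt string before tokenizing. The sketch below uses the widely used Stanford Alpaca prompt template; whether the linked Colab notebook uses this exact template is an assumption, and `format_row` is a hypothetical helper name.

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)

ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)


def format_row(row):
    """Render one Alpaca-format row as a single training string."""
    # Rows with an empty or missing "input" field use the shorter template.
    template = ALPACA_WITH_INPUT if row.get("input") else ALPACA_NO_INPUT
    return template.format(
        instruction=row["instruction"],
        input=row.get("input", ""),
        output=row["output"],
    )


example = {"instruction": "Reverse a string in Python.", "input": "", "output": "s[::-1]"}
text = format_row(example)
print(text)
```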
_________________________________________________________________________________________
Thank you to jondurbin for the bagel-v0.5 dataset, which inspired this dataset, and for the image used for this dataset, which I edited. You can find his dataset below.
- https://huggingface.co/datasets/jondurbin/bagel-v0.5
__________________________________________________________________________________________
## Join the Replete-Ai Discord! We are a great and loving community!
- https://discord.gg/ZZbnsmVnjD
_________________________________________________________________________________________
## SUPPORTED CODING LANGUAGES (BY LINE)
Note: some language names overlap with common words. For example, "self" appears in the dataset far more often as an ordinary identifier than as a reference to the Self language, so its count is inflated.
| Language | Occurrences | Percentage |
|--------------|------------|------------|
| python | 1311720 | 3.29% |
| c | 1975101 | 4.95% |
| self | 923505 | 2.31% |
| java | 631756 | 1.58% |
| javascript | 589796 | 1.48% |
| ruby | 562800 | 1.41% |
| sql | 527178 | 1.32% |
| go | 488987 | 1.23% |
| bash | 461695 | 1.16% |
| rust | 455318 | 1.14% |
| typescript | 377306 | 0.95% |
| julia | 357836 | 0.90% |
| clean | 297606 | 0.75% |
| q | 284196 | 0.71% |
| php | 226355 | 0.57% |
| io | 154093 | 0.39% |
| xml | 138704 | 0.35% |
| red | 105092 | 0.26% |
| factor | 95553 | 0.24% |
| assembly | 86287 | 0.22% |
| alice | 82390 | 0.21% |
| blue | 73990 | 0.19% |
| shell | 57488 | 0.14% |
| dart | 54459 | 0.14% |
| curl | 53624 | 0.13% |
| swift | 49298 | 0.12% |
| scala | 45305 | 0.11% |
| icon | 44932 | 0.11% |
| batch | 43222 | 0.11% |
| inform | 42218 | 0.11% |
| clojure | 40807 | 0.10% |
| scheme | 39851 | 0.10% |
| perl | 39366 | 0.10% |
| verilog | 37458 | 0.09% |
| bc | 37017 | 0.09% |
| lua | 36977 | 0.09% |
| sas | 33938 | 0.09% |
| powershell | 33766 | 0.08% |
| haskell | 33054 | 0.08% |
| kotlin | 32468 | 0.08% |
| elixir | 32400 | 0.08% |
| fortran | 31288 | 0.08% |
| erlang | 29807 | 0.07% |
| lisp | 28644 | 0.07% |
| vhdl | 28002 | 0.07% |
| abc | 26873 | 0.07% |
| ml | 24625 | 0.06% |
| tcl | 23951 | 0.06% |
| zig | 22801 | 0.06% |
| sed | 22645 | 0.06% |
| xslt | 19771 | 0.05% |
| latex | 19566 | 0.05% |
| ring | 18498 | 0.05% |
| racket | 18396 | 0.05% |
| groovy | 17520 | 0.04% |
| whitespace | 15258 | 0.04% |
| ocaml | 15184 | 0.04% |
| logo | 14739 | 0.04% |
| sol | 13969 | 0.04% |
| spark | 13751 | 0.03% |
| matlab | 12689 | 0.03% |
| delphi | 12688 | 0.03% |
| scratch | 12461 | 0.03% |
| stata | 11721 | 0.03% |
| gap | 10940 | 0.03% |
| pascal | 9735 | 0.02% |
| llvm | 9534 | 0.02% |
| objective-c | 9359 | 0.02% |
| forth | 7683 | 0.02% |
| tex | 7233 | 0.02% |
| common lisp | 6954 | 0.02% |
| smalltalk | 6813 | 0.02% |
| visual basic | 6509 | 0.02% |
| prolog | 6284 | 0.02% |
| c++ | 5946 | 0.02% |
| mathematica | 5524 | 0.01% |
| emacs lisp | 5288 | 0.01% |
| ada | 3459 | 0.01% |
| webassembly | 3320 | 0.01% |
| jade | 3084 | 0.01% |
| mercury | 2808 | 0.01% |
| gml | 2794 | 0.01% |
| squirrel | 2773 | 0.01% |
| clips | 2744 | 0.01% |
| coffeescript | 2546 | 0.01% |
| arduino | 2390 | 0.01% |
| dylan | 2266 | 0.01% |
| eiffel | 2263 | 0.01% |
| cocoa | 2193 | 0.01% |
| opencl | 2190 | 0.01% |
| slip | 2096 | 0.01% |
| m4 | 2082 | 0.01% |
| idris | 474 | 0.01% |
| purescript | 345 | 0.01% |
| c# | 396 | 0.01% |
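
The card does not say how these counts were produced. As a rough illustration of a per-line count, and of why a name like "self" picks up hits that have nothing to do with the language, here is a minimal sketch that counts the lines mentioning each name as a whole word. The function name, the sample lines, and the matching rule are assumptions, not the author's method.

```python
import re
from collections import Counter


def count_language_lines(lines, languages):
    """Count, per language name, how many lines mention it as a whole word."""
    patterns = {
        # Lookarounds instead of \b so that "c" does not match inside "c++" or "c#".
        lang: re.compile(
            r"(?<![\w+#])" + re.escape(lang) + r"(?![\w+#])", re.IGNORECASE
        )
        for lang in languages
    }
    counts = Counter()
    for line in lines:
        for lang, pattern in patterns.items():
            if pattern.search(line):
                counts[lang] += 1
    return counts


sample_lines = [
    "Write a python function that sorts a list.",
    "def __init__(self): pass",  # "self" here is a parameter, not the Self language
    "Compile this c++ program.",
    "Port the c code to rust.",
]
counts = count_language_lines(sample_lines, ["python", "c", "c++", "self", "rust"])
print(dict(counts))
```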
Provider:
maas
Created:
2024-06-06



