Replete-AI/code_bagel_hermes-2.5
Source: Hugging Face. Updated 2024-10-09. Indexed 2024-05-18.
Download link:
https://hf-mirror.com/datasets/Replete-AI/code_bagel_hermes-2.5
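Overview:
This dataset merges two datasets, code_bagel and Open-Hermes-2.5. It contains 900,000 lines of high-quality non-coding instruction data and 3,000,000 lines of high-quality coding instruction data. Each line holds at most 10,000 tokens, and more than 100 programming languages are covered. The dataset is intended for fine-tuning coding models so that they can handle a wide range of programming tasks. It was built by downloading multiple datasets, extracting their data, merging it, deduplicating it, and uncensoring it.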
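The merge and deduplication steps mentioned above can be sketched as follows. This is a minimal illustration, not the maintainers' actual pipeline: the field names `instruction` and `output`, the helper names, and the choice of exact-match hashing are all assumptions made for the example.

```python
import hashlib
from typing import Iterable


def row_key(row: dict) -> str:
    """Hash the trimmed instruction/output pair so identical rows collide.

    The field names are hypothetical; the page does not describe the schema.
    """
    text = row["instruction"].strip() + "\x1f" + row["output"].strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merge_and_dedupe(*sources: Iterable[dict]) -> list[dict]:
    """Concatenate several row sources, keeping the first copy of each row."""
    seen: set[str] = set()
    merged: list[dict] = []
    for source in sources:
        for row in source:
            key = row_key(row)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


# Toy stand-ins for the coding and non-coding source datasets.
code_rows = [
    {"instruction": "Reverse a string in Python.", "output": "s[::-1]"},
    {"instruction": "Sum a list.", "output": "sum(xs)"},
]
general_rows = [
    {"instruction": "Sum a list.", "output": "sum(xs)"},  # exact duplicate
    {"instruction": "Name a prime number.", "output": "7"},
]

combined = merge_and_dedupe(code_rows, general_rows)
print(len(combined))
```

Exact-match hashing only removes verbatim repeats. Near-duplicate detection would need a different technique and is outside this sketch.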
Provider:
Replete-AI
Summary of original information
Dataset overview
Dataset name
code_bagel + Open-Hermes-2.5 Datasets combined
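Dataset contents
- Non-coding instruction data: 900,000 lines
- Coding instruction data: 3,000,000 lines
- Maximum tokens per line: 10,000
- Programming languages supported: more than 100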
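As a rough illustration of the per-line token cap listed above, the sketch below drops rows that exceed 10,000 tokens. The page does not say which tokenizer was used, so whitespace splitting stands in for it here; the field names and the `within_cap` helper are hypothetical.

```python
MAX_TOKENS = 10_000  # per-line cap stated in the dataset summary


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): count whitespace-separated words."""
    return len(text.split())


def within_cap(row: dict, limit: int = MAX_TOKENS) -> bool:
    """Return True when all fields of the row together fit under the cap."""
    total = sum(count_tokens(str(value)) for value in row.values())
    return total <= limit


rows = [
    {"instruction": "Explain recursion.", "output": "A function that calls itself."},
    {"instruction": "Repeat a word.", "output": "word " * 10_001},  # too long
]

kept = [row for row in rows if within_cap(row)]
print(len(kept))
```

A real model tokenizer would usually give different counts than whitespace splitting, so `count_tokens` is the piece to swap out in practice.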
Programming language distribution
| Language | Occurrences | Percentage |
|---|---|---|
| python | 1,311,720 | 3.29% |
| c | 1,975,101 | 4.95% |
| self | 923,505 | 2.31% |
| java | 631,756 | 1.58% |
| javascript | 589,796 | 1.48% |
| ruby | 562,800 | 1.41% |
| sql | 527,178 | 1.32% |
| go | 488,987 | 1.23% |
| bash | 461,695 | 1.16% |
| rust | 455,318 | 1.14% |
| typescript | 377,306 | 0.95% |
| julia | 357,836 | 0.90% |
| clean | 297,606 | 0.75% |
| q | 284,196 | 0.71% |
| php | 226,355 | 0.57% |
| io | 154,093 | 0.39% |
| xml | 138,704 | 0.35% |
| red | 105,092 | 0.26% |
| factor | 95,553 | 0.24% |
| assembly | 86,287 | 0.22% |
| alice | 82,390 | 0.21% |
| blue | 73,990 | 0.19% |
| shell | 57,488 | 0.14% |
| dart | 54,459 | 0.14% |
| curl | 53,624 | 0.13% |
| swift | 49,298 | 0.12% |
| scala | 45,305 | 0.11% |
| icon | 44,932 | 0.11% |
| batch | 43,222 | 0.11% |
| inform | 42,218 | 0.11% |
| clojure | 40,807 | 0.10% |
| scheme | 39,851 | 0.10% |
| perl | 39,366 | 0.10% |
| verilog | 37,458 | 0.09% |
| bc | 37,017 | 0.09% |
| lua | 36,977 | 0.09% |
| sas | 33,938 | 0.09% |
| powershell | 33,766 | 0.08% |
| haskell | 33,054 | 0.08% |
| kotlin | 32,468 | 0.08% |
| elixir | 32,400 | 0.08% |
| fortran | 31,288 | 0.08% |
| erlang | 29,807 | 0.07% |
| lisp | 28,644 | 0.07% |
| vhdl | 28,002 | 0.07% |
| abc | 26,873 | 0.07% |
| ml | 24,625 | 0.06% |
| tcl | 23,951 | 0.06% |
| zig | 22,801 | 0.06% |
| sed | 22,645 | 0.06% |
| xslt | 19,771 | 0.05% |
| latex | 19,566 | 0.05% |
| ring | 18,498 | 0.05% |
| racket | 18,396 | 0.05% |
| groovy | 17,520 | 0.04% |
| whitespace | 15,258 | 0.04% |
| ocaml | 15,184 | 0.04% |
| logo | 14,739 | 0.04% |
| sol | 13,969 | 0.04% |
| spark | 13,751 | 0.03% |
| matlab | 12,689 | 0.03% |
| delphi | 12,688 | 0.03% |
| scratch | 12,461 | 0.03% |
| stata | 11,721 | 0.03% |
| gap | 10,940 | 0.03% |
| pascal | 9,735 | 0.02% |
| llvm | 9,534 | 0.02% |
| objective-c | 9,359 | 0.02% |
| forth | 7,683 | 0.02% |
| tex | 7,233 | 0.02% |
| common lisp | 6,954 | 0.02% |
| smalltalk | 6,813 | 0.02% |
| visual basic | 6,509 | 0.02% |
| prolog | 6,284 | 0.02% |
| c++ | 5,946 | 0.02% |
| mathematica | 5,524 | 0.01% |
| emacs lisp | 5,288 | 0.01% |
| ada | 3,459 | 0.01% |
| webassembly | 3,320 | 0.01% |
| jade | 3,084 | 0.01% |
| mercury | 2,808 | 0.01% |
| gml | 2,794 | 0.01% |
| squirrel | 2,773 | 0.01% |
| clips | 2,744 | 0.01% |
| coffeescript | 2,546 | 0.01% |
| arduino | 2,390 | 0.01% |
| dylan | 2,266 | 0.01% |
| eiffel | 2,263 | 0.01% |
| cocoa | 2,193 | 0.01% |
| opencl | 2,190 | 0.01% |
| slip | 2,096 | 0.01% |
| m4 | 2,082 | 0.01% |
| idris | 474 | 0.01% |
| purescript | 345 | 0.01% |
| c# | 396 | 0.01% |
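The page does not state what the percentages in the table are measured against. As a quick check, the sketch below assumes each percentage is simply the row's count divided by one common total, and back-computes that total from the four largest rows. That reading is an assumption, not something the source confirms.

```python
# (count, percentage) pairs copied from the first four rows of the table.
rows = {
    "python": (1_311_720, 3.29),
    "c": (1_975_101, 4.95),
    "self": (923_505, 2.31),
    "java": (631_756, 1.58),
}

# Under the assumption percentage = count / total * 100,
# each row implies total = count / (percentage / 100).
implied_totals = {
    lang: count / (pct / 100) for lang, (count, pct) in rows.items()
}

for lang, total in implied_totals.items():
    print(f"{lang}: implied total of about {total / 1e6:.1f} million")
```

If the implied totals agree closely across rows, the percentages are consistent with a single shared denominator. Because the published percentages are rounded to two decimals, small differences between rows are expected.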
Dataset sources
- code_bagel: provided by Replete-AI
- Open-Hermes-2.5: provided by teknium
Dataset usage
In theory, this dataset should support a definitive coding fine-tune that can handle almost any coding task.



