bigcode/commitpackft
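Dataset Card for CommitPackFT
Dataset Description
- Overview: CommitPackFT is a 2GB filtered version of CommitPack that contains only high-quality commit messages that resemble natural language instructions.
- Creation: The dataset can be recreated using the instructions provided here.
- Languages: 277
Dataset Structure
Data Instances
An example looks as follows: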
    {
      "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
      "old_file": "main.py",
      "new_file": "main.py",
      "old_contents": "import numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-5, 5, 20)\ny_data = np.random.normal(0.0, 1.0, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n",
      "new_contents": "import math\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-math.pi, math.pi, 30)\ny_data = np.sin(x_data) + np.random.normal(0.0, 0.1, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n\n",
      "subject": "Change to sin() function with noise",
      "message": "Change to sin() function with noise\n",
      "lang": "Python",
      "license": "mit",
      "repos": "MorganR/basic-gaussian-process"
    }
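Since the commit subject reads like an instruction, a record of this shape can be mapped to an instruction-tuning example. The following is a minimal sketch, not the authors' pipeline: the function names `to_instruction_example` and `record_diff` and the `instruction`/`input`/`output` layout are assumptions made for illustration, and the record is the sample shown above.

```python
import difflib

# Sample record, copied from the data instance shown in this card.
record = {
    "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
    "old_file": "main.py",
    "new_file": "main.py",
    "old_contents": (
        "import numpy as np\nimport matplotlib.pyplot as plt\n\n"
        "# generate sample data\n"
        "x_data = np.linspace(-5, 5, 20)\n"
        "y_data = np.random.normal(0.0, 1.0, x_data.size)\n\n"
        "plt.plot(x_data, y_data, 'o')\nplt.show()\n"
    ),
    "new_contents": (
        "import math\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n"
        "# generate sample data\n"
        "x_data = np.linspace(-math.pi, math.pi, 30)\n"
        "y_data = np.sin(x_data) + np.random.normal(0.0, 0.1, x_data.size)\n\n"
        "plt.plot(x_data, y_data, 'o')\nplt.show()\n\n"
    ),
    "subject": "Change to sin() function with noise",
    "message": "Change to sin() function with noise\n",
    "lang": "Python",
    "license": "mit",
    "repos": "MorganR/basic-gaussian-process",
}


def to_instruction_example(rec):
    """Map a commit record to a hypothetical instruction/input/output triple."""
    return {
        "instruction": rec["subject"].strip(),  # the commit subject acts as the instruction
        "input": rec["old_contents"],           # file contents before the commit
        "output": rec["new_contents"],          # file contents after the commit
    }


def record_diff(rec):
    """Render the change between old_contents and new_contents as a unified diff."""
    old_lines = rec["old_contents"].splitlines(keepends=True)
    new_lines = rec["new_contents"].splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=rec["old_file"], tofile=rec["new_file"]
        )
    )


example = to_instruction_example(record)
diff_text = record_diff(record)
print(example["instruction"])
print(diff_text)
```

Printing the diff next to the subject is a quick way to check by eye whether a commit message actually describes the change it accompanies.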
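Data Fields
The data fields are the same among all splits:
- commit: unique commit ID
- old_file: name of the file before the commit
- new_file: name of the file after the commit
- old_contents: contents of the file before the commit
- new_contents: contents of the file after the commit
- subject: subject of the commit (this is used for all experiments in the paper)
- message: message of the commit (commonly the same as the subject)
- lang: programming language
- license: license of the repository the code stems from, one of [mit, artistic-2.0, isc, cc0-1.0, epl-1.0, mpl-2.0, unlicense, unknown, apache-2.0, bsd-3-clause, agpl-3.0, lgpl-2.1, bsd-2-clause]
- repos: name of the repository the code stems from (if multiple, they are comma-separated)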
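The field descriptions above can be turned into a small sanity check for individual records. This is a sketch under stated assumptions: `check_record` and `repo_list` are hypothetical helpers, not part of the dataset's tooling, and the sample record is a shortened stand-in with placeholder contents.

```python
# Field names and license values as listed in this card.
EXPECTED_FIELDS = {
    "commit", "old_file", "new_file", "old_contents", "new_contents",
    "subject", "message", "lang", "license", "repos",
}
KNOWN_LICENSES = {
    "mit", "artistic-2.0", "isc", "cc0-1.0", "epl-1.0", "mpl-2.0", "unlicense",
    "unknown", "apache-2.0", "bsd-3-clause", "agpl-3.0", "lgpl-2.1", "bsd-2-clause",
}


def check_record(rec):
    """Return a list of problems found in a record (empty if it looks fine)."""
    problems = []
    missing = sorted(EXPECTED_FIELDS - set(rec))
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    if "license" in rec and rec["license"] not in KNOWN_LICENSES:
        problems.append("unexpected license: " + rec["license"])
    return problems


def repo_list(rec):
    """Split the comma-separated repos field into a list of repository names."""
    return [name.strip() for name in rec["repos"].split(",") if name.strip()]


# Shortened stand-in record; the file contents are placeholders.
sample = {
    "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
    "old_file": "main.py",
    "new_file": "main.py",
    "old_contents": "print('old')\n",
    "new_contents": "print('new')\n",
    "subject": "Change to sin() function with noise",
    "message": "Change to sin() function with noise\n",
    "lang": "Python",
    "license": "mit",
    "repos": "MorganR/basic-gaussian-process",
}

print(check_record(sample))
print(repo_list(sample))
```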
Data Splits
| Name | Megabytes | % of total (size) | Samples | % of total (samples) |
|---|---|---|---|---|
| total | 1545.02 | 100.0% | 702062 | 100.0% |
| ruby | 195.292 | 12.6401% | 69413 | 9.887% |
| yaml | 190.876 | 12.3543% | 114320 | 16.2835% |
| python | 132.68 | 8.5876% | 56025 | 7.9801% |
| markdown | 131.152 | 8.4887% | 62518 | 8.9049% |
| javascript | 125.008 | 8.091% | 52989 | 7.5476% |
| json | 86.744 | 5.6144% | 39777 | 5.6657% |
| shell | 66.864 | 4.3277% | 31217 | 4.4465% |
| text | 66.664 | 4.3148% | 46588 | 6.6359% |
| php | 60.22 | 3.8977% | 24791 | 3.5312% |
| java | 56.284 | 3.6429% | 20635 | 2.9392% |
| html | 48.42 | 3.1339% | 20214 | 2.8792% |
| c# | 26.84 | 1.7372% | 9346 | 1.3312% |
| xml | 23.676 | 1.5324% | 9337 | 1.3299% |
| html+erb | 23.104 | 1.4954% | 10910 | 1.554% |
| c | 21.08 | 1.3644% | 8506 | 1.2116% |
| ini | 21.04 | 1.3618% | 11360 | 1.6181% |
| coffeescript | 16.96 | 1.0977% | 5513 | 0.7853% |
| swift | 16.272 | 1.0532% | 4849 | 0.6907% |
| restructuredtext | 15.728 | 1.018% | 6560 | 0.9344% |
| typescript | 14.284 | 0.9245% | 5868 | 0.8358% |
| c++ | 14.136 | 0.9149% | 4992 | 0.711% |
| scss | 13.208 | 0.8549% | 6829 | 0.9727% |
| go | 12.132 | 0.7852% | 5004 | 0.7128% |
| scala | 11.184 | 0.7239% | 5040 | 0.7179% |
| haml | 10.74 | 0.6951% | 4415 | 0.6289% |
| css | 9.364 | 0.6061% | 5049 | 0.7192% |
| rust | 7.244 | 0.4689% | 2996 | 0.4267% |
| toml | 5.584 | 0.3614% | 3424 | 0.4877% |
| jsx | 5.5 | 0.356% | 2199 | 0.3132% |
| kotlin | 5.368 | 0.3474% | 2214 | 0.3154% |
| clojure | 5.068 | 0.328% | 2403 | 0.3423% |
| perl | 4.988 | 0.3228% | 2288 | 0.3259% |
| bitbake | 4.464 | 0.2889% | 1308 | 0.1863% |
| groovy | 4.168 | 0.2698% | 1486 | 0.2117% |
| twig | 3.956 | 0.256% | 1610 | 0.2293% |
| nix | 3.84 | 0.2485% | 1593 | 0.2269% |
| sql | 3.74 | 0.2421% | 2069 | 0.2947% |
| less | 3.724 | 0.241% | 1360 | 0.1937% |
| haskell | 3.308 | 0.2141% | 1389 | 0.1978% |
| handlebars | 3.292 | 0.2131% | 1429 | 0.2035% |
| unknown | 3.048 | 0.1973% | 1597 | 0.2275% |
| batchfile | 2.984 | 0.1931% | 1466 | 0.2088% |
| cucumber | 2.588 | 0.1675% | 976 | 0.139% |
| makefile | 2.528 | 0.1636% | 960 | 0.1367% |
| elixir | 2.348 | 0.152% | 1150 | 0.1638% |
| jade | 2.348 | 0.152% | 1119 | 0.1594% |
| cmake | 2.268 | 0.1468% | 981 | 0.1397% |
| powershell | 2.064 | 0.1336% | 991 | 0.1412% |
| slim | 2.056 | 0.1331% | 1052 | 0.1498% |
| emacs-lisp | 1.972 | 0.1276% | 1015 | 0.1446% |
| dart | 1.96 | 0.1269% | 765 | 0.109% |
| viml | 1.956 | 0.1266% | 1063 | 0.1514% |
| asciidoc | 1.864 | 0.1206% | 523 | 0.0745% |
| lua | 1.852 | 0.1199% | 920 | 0.131% |
| llvm | 1.6 | 0.1036% | 780 | 0.1111% |
| smarty | 1.588 | 0.1028% | 737 | 0.105% |
| diff | 1.48 | 0.0958% | 680 | 0.0969% |
| common-lisp | 1.448 | 0.0937% | 778 | 0.1108% |
| saltstack | 1.412 | 0.0914% | 617 | 0.0879% |
| vue | 1.384 | 0.0896% | 587 | 0.0836% |
| sass | 1.364 | 0.0883% | 705 | 0.1004% |
| fish | 1.328 | 0.086% | 813 | 0.1158% |
| erlang | 1.192 | 0.0772% | 480 | 0.0684% |
| freemarker | 1.028 | 0.0665% | 510 | 0.0726% |
| stylus | 0.948 | 0.0614% | 480 | 0.0684% |
| qml | 0.936 | 0.0606% | 368 | 0.0524% |
| hcl | 0.912 | 0.059% | 421 | 0.06% |
| html+django | 0.848 | 0.0549% | 399 | 0.0568% |
| mako | 0.756 | 0.0489% | 170 | 0.0242% |
| ada | 0.728 | 0.0471% | 265 | 0.0377% |
| ocaml | 0.704 | 0.0456% | 333 | 0.0474% |
| f# | 0.656 | 0.0425% | 254 | 0.0362% |
| elm | 0.62 | 0.0401% | 265 | 0.0377% |
| tex | 0.564 | 0.0365% | 307 | 0.0437% |
| rdoc | 0.552 | 0.0357% | 270 | 0.0385% |
| csv | 0.532 | 0.0344% | 375 | 0.0534% |
| protocol-buffer | 0.524 | 0.0339% | 181 | 0.0258% |
| smalltalk | 0.46 | 0.0298% | 284 | 0.0405% |
| arduino | 0.456 | 0.0295% | 225 | 0.032% |
| java-server-pages | 0.452 | 0.0293% | 173 | 0.0246% |
| scheme | 0.42 | 0.0272% | 213 | 0.0303% |
| groff | 0.396 | 0.0256% | 192 | 0.0273% |
| objective-c++ | 0.376 | 0.0243% |




