bigcode/commitpack
Source: Hugging Face · Updated: 2023-08-20 · Indexed: 2024-03-04
Download link: https://hf-mirror.com/datasets/bigcode/commitpack
---
license: mit
pretty_name: CommitPack
language:
- code
---

# Dataset Card for CommitPack
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/bigcode-project/octopack
- **Paper:** [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124)
- **Point of Contact:** [Niklas Muennighoff](mailto:n.muennighoff@gmail.com)
### Dataset Summary
> CommitPack is a 4TB dataset of commits scraped from permissively licensed GitHub repositories.
- **Creation:** The dataset can be recreated using instructions available [here](https://github.com/bigcode-project/octopack).
- **Languages:** 350
- **OctoPack🐙🎒:**
<table>
<tr>
<th>Data</th>
<td><a href=https://huggingface.co/datasets/bigcode/commitpack>CommitPack</a></td>
<td>4TB of GitHub commits across 350 programming languages</td>
</tr>
<tr>
<th></th>
<td><a href=https://huggingface.co/datasets/bigcode/commitpackft>CommitPackFT</a></td>
<td>Filtered version of CommitPack for high-quality commit messages that resemble instructions</td>
</tr>
<tr>
<th>Model</th>
<td><a href=https://huggingface.co/bigcode/octocoder>OctoCoder</a></td>
<td>StarCoder (16B parameters) instruction tuned on CommitPackFT + OASST</td>
</tr>
<tr>
<th></th>
<td><a href=https://huggingface.co/bigcode/octogeex>OctoGeeX</a></td>
<td>CodeGeeX2 (6B parameters) instruction tuned on CommitPackFT + OASST</td>
</tr>
<tr>
<th>Evaluation</th>
<td><a href=https://huggingface.co/datasets/bigcode/humanevalpack>HumanEvalPack</a></td>
<td>Extension of OpenAI's HumanEval to cover 3 scenarios across 6 languages</td>
</tr>
</table>
## Dataset Structure
### Data Instances
An example looks as follows:
```json
{
  "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
  "old_file": "main.py",
  "new_file": "main.py",
  "old_contents": "import numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-5, 5, 20)\ny_data = np.random.normal(0.0, 1.0, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n",
  "new_contents": "import math\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-math.pi, math.pi, 30)\ny_data = np.sin(x_data) + np.random.normal(0.0, 0.1, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n\n",
  "subject": "Change to sin() function with noise",
  "message": "Change to sin() function with noise\n",
  "lang": "Python",
  "license": "mit",
  "repos": "MorganR/basic-gaussian-process",
  "returncode": 0,
  "stderr": ""
}
```
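Because each record stores the full file before and after the commit, the change itself can be recovered by diffing the two fields. The following is a minimal sketch using Python's standard `difflib`; the helper name `commit_diff` and the shortened sample record are illustrative assumptions, not part of the dataset tooling.

```python
import difflib


def commit_diff(sample: dict) -> str:
    """Return a unified diff between old_contents and new_contents.

    `commit_diff` is a hypothetical helper written for this example.
    """
    old_lines = sample["old_contents"].splitlines(keepends=True)
    new_lines = sample["new_contents"].splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=sample["old_file"],
            tofile=sample["new_file"],
        )
    )


# Shortened stand-in for the example record above (assumption: only the
# four fields needed for diffing are included).
sample = {
    "old_file": "main.py",
    "new_file": "main.py",
    "old_contents": "import numpy as np\nx_data = np.linspace(-5, 5, 20)\n",
    "new_contents": (
        "import math\n"
        "import numpy as np\n"
        "x_data = np.linspace(-math.pi, math.pi, 30)\n"
    ),
}

diff_text = commit_diff(sample)
print(diff_text)
```

Storing full file contents rather than patches keeps each record usable on its own, at the cost of larger rows; a diff like this one is only a derived view.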
### Data Fields
The data fields are the same among all splits:
- `commit`: unique commit id
- `old_file`: name of the file before the commit
- `new_file`: name of the file after the commit
- `old_contents`: contents of the file before the commit
- `new_contents`: contents of the file after the commit
- `subject`: subject of the commit (this is used for all experiments in the paper)
- `message`: message of the commit (commonly the same as the subject)
- `lang`: programming language
- `license`: license of the repository the code stems from, one of `['mit', 'artistic-2.0', 'isc', 'cc0-1.0', 'epl-1.0', 'mpl-2.0', 'unlicense', 'unknown', 'apache-2.0', 'bsd-3-clause', 'agpl-3.0', 'lgpl-2.1', 'bsd-2-clause']`
- `repos`: name of the repository the code stems from (if multiple, they are comma separated)
- `returncode`: error code returned during scraping, if applicable (0 = no error)
- `stderr`: error message that occurred during scraping, if applicable (empty = no error)
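The field descriptions above suggest a few simple record-level operations: checking that scraping succeeded, splitting the comma-separated `repos` value, and pairing the commit subject with the before and after contents. The sketch below illustrates these on hand-written records; the helper names and the `instruction`/`input`/`output` layout are assumptions made for this example, not a format defined by the dataset.

```python
def scraped_cleanly(sample: dict) -> bool:
    """True when the record reports no scraping error (0 and empty stderr)."""
    return sample["returncode"] == 0 and sample["stderr"] == ""


def repo_names(sample: dict) -> list[str]:
    """Split the comma-separated `repos` field into individual names."""
    return [name.strip() for name in sample["repos"].split(",") if name.strip()]


def to_instruction_triple(sample: dict) -> dict:
    """Pair the commit subject with the file before and after the change.

    The key names here are hypothetical, chosen only for illustration.
    """
    return {
        "instruction": sample["subject"],
        "input": sample["old_contents"],
        "output": sample["new_contents"],
    }


# Hand-written records for illustration (not taken from the dataset).
records = [
    {
        "subject": "Fix typo in greeting",
        "old_contents": "print('helo')\n",
        "new_contents": "print('hello')\n",
        "repos": "alice/demo,bob/demo-fork",
        "returncode": 0,
        "stderr": "",
    },
    {
        "subject": "Update config",
        "old_contents": "",
        "new_contents": "",
        "repos": "carol/tools",
        "returncode": 1,
        "stderr": "fatal: bad object",
    },
]

clean = [r for r in records if scraped_cleanly(r)]
triples = [to_instruction_triple(r) for r in clean]
print(len(clean), repo_names(clean[0]), triples[0]["instruction"])
```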
### Data Splits
| Name | Megabytes | % of total size | Samples | % of total samples |
| --- | --- | --- | --- | --- |
| total | 3709175.78 | 100.0% | 57700105 | 100.0% |
| json | 583293.816 | 15.7257% | 3495038 | 6.0572% |
| xml | 279208.676 | 7.5275% | 1923159 | 3.333% |
| text | 270662.596 | 7.2971% | 1389525 | 2.4082% |
| javascript | 262824.844 | 7.0858% | 5401937 | 9.3621% |
| objective-c++ | 239009.3 | 6.4437% | 32227 | 0.0559% |
| python | 234311.564 | 6.3171% | 6189601 | 10.7272% |
| c | 200876.804 | 5.4157% | 2779478 | 4.8171% |
| c++ | 186585.256 | 5.0304% | 2402294 | 4.1634% |
| markdown | 171849.952 | 4.6331% | 7645354 | 13.2502% |
| java | 127103.448 | 3.4267% | 3744377 | 6.4894% |
| html | 105305.284 | 2.839% | 2366841 | 4.102% |
| yaml | 100466.64 | 2.7086% | 2592787 | 4.4936% |
| go | 86444.624 | 2.3306% | 1183612 | 2.0513% |
| csv | 82946.192 | 2.2362% | 79268 | 0.1374% |
| php | 74961.64 | 2.021% | 2555419 | 4.4288% |
| jupyter-notebook | 66854.08 | 1.8024% | 94000 | 0.1629% |
| gettext-catalog | 62296.88 | 1.6795% | 168327 | 0.2917% |
| sql | 56802.764 | 1.5314% | 132772 | 0.2301% |
| unity3d-asset | 39535.008 | 1.0659% | 17867 | 0.031% |
| typescript | 39254.804 | 1.0583% | 572136 | 0.9916% |
| web-ontology-language | 36435.464 | 0.9823% | 7458 | 0.0129% |
| ruby | 35830.74 | 0.966% | 2928702 | 5.0757% |
| c# | 33669.652 | 0.9077% | 923157 | 1.5999% |
| nix | 33547.92 | 0.9045% | 221281 | 0.3835% |
| shell | 25109.952 | 0.677% | 1017977 | 1.7643% |
| perl | 21148.928 | 0.5702% | 374266 | 0.6486% |
| tex | 17471.108 | 0.471% | 89283 | 0.1547% |
| css | 16306.632 | 0.4396% | 548818 | 0.9512% |
| restructuredtext | 15613.888 | 0.421% | 494037 | 0.8562% |
| rust | 15011.296 | 0.4047% | 296214 | 0.5134% |
| groff | 12020.188 | 0.3241% | 32923 | 0.0571% |
| ini | 8375.164 | 0.2258% | 297100 | 0.5149% |
| scala | 8325.96 | 0.2245% | 316064 | 0.5478% |
| coffeescript | 6795.14 | 0.1832% | 292446 | 0.5068% |
| haskell | 6306.12 | 0.17% | 217325 | 0.3766% |
| swift | 5902.716 | 0.1591% | 319289 | 0.5534% |
| lua | 5763.12 | 0.1554% | 139091 | 0.2411% |
| svg | 5645.44 | 0.1522% | 27095 | 0.047% |
| gas | 5585.384 | 0.1506% | 15121 | 0.0262% |
| ocaml | 5355.4 | 0.1444% | 81360 | 0.141% |
| erlang | 5043.32 | 0.136% | 93685 | 0.1624% |
| makefile | 4238.512 | 0.1143% | 343379 | 0.5951% |
| asciidoc | 4138.588 | 0.1116% | 96671 | 0.1675% |
| emacs-lisp | 3988.652 | 0.1075% | 83228 | 0.1442% |
| scss | 3944.936 | 0.1064% | 288190 | 0.4995% |
| clojure | 3523.408 | 0.095% | 158674 | 0.275% |
| org | 3126.22 | 0.0843% | 30198 | 0.0523% |
| common-lisp | 2954.904 | 0.0797% | 74628 | 0.1293% |
| diff | 2586.048 | 0.0697% | 21021 | 0.0364% |
| groovy | 2569.14 | 0.0693% | 110057 | 0.1907% |
| html+erb | 2450.676 | 0.0661% | 225379 | 0.3906% |
| nesc | 2439.564 | 0.0658% | 473 | 0.0008% |
| dart | 2395.796 | 0.0646% | 56873 | 0.0986% |
| powershell | 2289.276 | 0.0617% | 55381 | 0.096% |
| f# | 2289.236 | 0.0617% | 66840 | 0.1158% |
| dm | 2223.144 | 0.0599% | 55584 | 0.0963% |
| kotlin | 2219.248 | 0.0598% | 124266 | 0.2154% |
| pascal | 2194.676 | 0.0592% | 42511 | 0.0737% |
| jsx | 2124.744 | 0.0573% | 139148 | 0.2412% |
| viml | 1948.208 | 0.0525% | 74062 | 0.1284% |
| actionscript | 1844.148 | 0.0497% | 28819 | 0.0499% |
| cython | 1736.588 | 0.0468% | 25927 | 0.0449% |
| turtle | 1698.948 | 0.0458% | 3882 | 0.0067% |
| less | 1616.564 | 0.0436% | 88634 | 0.1536% |
| mathematica | 1475.044 | 0.0398% | 925 | 0.0016% |
| xslt | 1441.456 | 0.0389% | 27956 | 0.0485% |
| scheme | 1249.244 | 0.0337% | 30546 | 0.0529% |
| perl6 | 1223.16 | 0.033% | 12167 | 0.0211% |
| edn | 1186.94 | 0.032% | 2289 | 0.004% |
| fortran | 1178.548 | 0.0318% | 13463 | 0.0233% |
| java-server-pages | 1173.072 | 0.0316% | 53574 | 0.0928% |
| standard-ml | 1133.476 | 0.0306% | 20097 | 0.0348% |
| cmake | 1132.068 | 0.0305% | 58446 | 0.1013% |
| json5 | 1108.2 | 0.0299% | 1827 | 0.0032% |
| vala | 1104.512 | 0.0298% | 14822 | 0.0257% |
| vue | 1093.8 | 0.0295% | 68967 | 0.1195% |
| freemarker | 1032.332 | 0.0278% | 36216 | 0.0628% |
| graphql | 1004.844 | 0.0271% | 2009 | 0.0035% |
| twig | 958.96 | 0.0259% | 39588 | 0.0686% |
| tcl | 869.832 | 0.0235% | 16407 | 0.0284% |
| pod | 859.016 | 0.0232% | 14922 | 0.0259% |
| dockerfile | 849.728 | 0.0229% | 259379 | 0.4495% |
| yacc | 845.704 | 0.0228% | 8230 | 0.0143% |
| postscript | 800.728 | 0.0216% | 903 | 0.0016% |
| racket | 796.64 | 0.0215% | 16615 | 0.0288% |
| eagle | 785.684 | 0.0212% | 2237 | 0.0039% |
| haxe | 772.896 | 0.0208% | 28447 | 0.0493% |
| julia | 752.068 | 0.0203% | 22695 | 0.0393% |
| handlebars | 740.816 | 0.02% | 49842 | 0.0864% |
| smarty | 720.944 | 0.0194% | 41065 | 0.0712% |
| visual-basic | 681.516 | 0.0184% | 10511 | 0.0182% |
| literate-haskell | 673.74 | 0.0182% | 10729 | 0.0186% |
| smalltalk | 665.892 | 0.018% | 11741 | 0.0203% |
| isabelle | 655.82 | 0.0177% | 8359 | 0.0145% |
| nimrod | 652.86 | 0.0176% | 12023 | 0.0208% |
| zig | 621.384 | 0.0168% | 4290 | 0.0074% |
| m4 | 603.584 | 0.0163% | 12465 | 0.0216% |
| max | 603.56 | 0.0163% | 2259 | 0.0039% |
| elixir | 558.116 | 0.015% | 35473 | 0.0615% |
| mako | 543.012 | 0.0146% | 8943 | 0.0155% |
| arduino | 534.176 | 0.0144% | 32350 | 0.0561% |
| jade | 531.4 | 0.0143% | 46993 | 0.0814% |
| haml | 502.012 | 0.0135% | 74792 | 0.1296% |
| elm | 481.968 | 0.013% | 18542 | 0.0321% |
| purebasic | 474.276 | 0.0128% | 36 | 0.0001% |
| coldfusion | 470.78 | 0.0127% | 9263 | 0.0161% |
| lean | 470.032 | 0.0127% | 7507 | 0.013% |
| r | 454.32 | 0.0122% | 12858 | 0.0223% |
| cuda | 437.668 | 0.0118% | 11450 | 0.0198% |
| textile | 425.116 | 0.0115% | 18491 | 0.032% |
| robotframework | 421.612 | 0.0114% | 9211 | 0.016% |
| abap | 409.62 | 0.011% | 1955 | 0.0034% |
| rdoc | 397.028 | 0.0107% | 38760 | 0.0672% |
| llvm | 382.2 | 0.0103% | 10727 | 0.0186% |
| ada | 380.7 | 0.0103% | 13258 | 0.023% |
| batchfile | 372.16 | 0.01% | 43674 | 0.0757% |
| qml | 361.452 | 0.0097% | 19360 | 0.0336% |
| jasmin | 359.82 | 0.0097% | 4782 | 0.0083% |
| assembly | 343.62 | 0.0093% | 8126 | 0.0141% |
| g-code | 334.964 | 0.009% | 3690 | 0.0064% |
| cucumber | 331.38 | 0.0089% | 26677 | 0.0462% |
| html+php | 323.348 | 0.0087% | 18381 | 0.0319% |
| kicad | 321.936 | 0.0087% | 759 | 0.0013% |
| api-blueprint | 317.852 | 0.0086% | 4765 | 0.0083% |
| eiffel | 311.48 | 0.0084% | 373 | 0.0006% |
| toml | 292.676 | 0.0079% | 63517 | 0.1101% |
| modelica | 284.616 | 0.0077% | 2611 | 0.0045% |
| bitbake | 277.576 | 0.0075% | 43239 | 0.0749% |
| lex | 275.96 | 0.0074% | 705 | 0.0012% |
| stylus | 273.056 | 0.0074% | 21967 | 0.0381% |
| protocol-buffer | 254.124 | 0.0069% | 9202 | 0.0159% |
| unknown | 252.228 | 0.0068% | 30570 | 0.053% |
| nit | 244.54 | 0.0066% | 4951 | 0.0086% |
| factor | 241.192 | 0.0065% | 15378 | 0.0267% |
| xs | 239.04 | 0.0064% | 3215 | 0.0056% |
| sass | 230.648 | 0.0062% | 23144 | 0.0401% |
| parrot-internal-representation | 230.196 | 0.0062% | 6231 | 0.0108% |
| html+django | 217.04 | 0.0059% | 10535 | 0.0183% |
| mediawiki | 214.324 | 0.0058% | 10188 | 0.0177% |
| logos | 212.296 | 0.0057% | 1733 | 0.003% |
| genshi | 209.3 | 0.0056% | 956 | 0.0017% |
| coldfusion-cfc | 208.164 | 0.0056% | 4410 | 0.0076% |
| xtend | 179.544 | 0.0048% | 7775 | 0.0135% |
| sqf | 168.656 | 0.0045% | 7778 | 0.0135% |
| vhdl | 155.948 | 0.0042% | 2185 | 0.0038% |
| antlr | 143.548 | 0.0039% | 3651 | 0.0063% |
| systemverilog | 140.192 | 0.0038% | 3944 | 0.0068% |
| hcl | 136.752 | 0.0037% | 13379 | 0.0232% |
| asp | 136.104 | 0.0037% | 4286 | 0.0074% |
| nsis | 129.124 | 0.0035% | 4048 | 0.007% |
| inform-7 | 120.188 | 0.0032% | 184 | 0.0003% |
| slim | 119.036 | 0.0032% | 18726 | 0.0325% |
| groovy-server-pages | 117.368 | 0.0032% | 6695 | 0.0116% |
| ceylon | 116.144 | 0.0031% | 7256 | 0.0126% |
| fish | 111.28 | 0.003% | 15351 | 0.0266% |
| processing | 108.58 | 0.0029% | 5912 | 0.0102% |
| component-pascal | 105.5 | 0.0028% | 43 | 0.0001% |
| lasso | 104.168 | 0.0028% | 67 | 0.0001% |
| glsl | 99.488 | 0.0027% | 9478 | 0.0164% |
| saltstack | 98.196 | 0.0026% | 12314 | 0.0213% |
| xbase | 94.424 | 0.0025% | 1670 | 0.0029% |
| autohotkey | 94.22 | 0.0025% | 1452 | 0.0025% |
| liquid | 93.792 | 0.0025% | 2651 | 0.0046% |
| purescript | 92.412 | 0.0025% | 5024 | 0.0087% |
| agda | 92.06 | 0.0025% | 4956 | 0.0086% |
| inno-setup | 91.36 | 0.0025% | 3014 | 0.0052% |
| oz | 90.476 | 0.0024% | 1551 | 0.0027% |
| chapel | 89.62 | 0.0024% | 26447 | 0.0458% |
| arc | 87.212 | 0.0024% | 758 | 0.0013% |
| opencl | 86.432 | 0.0023% | 2489 | 0.0043% |
| graphviz-dot | 85.804 | 0.0023% | 1525 | 0.0026% |
| pawn | 85.424 | 0.0023% | 580 | 0.001% |
| jsoniq | 75.152 | 0.002% | 1343 | 0.0023% |
| bluespec | 72.38 | 0.002% | 2500 | 0.0043% |
| smali | 71.38 | 0.0019% | 174 | 0.0003% |
| krl | 69.868 | 0.0019% | 1879 | 0.0033% |
| maple | 68.284 | 0.0018% | 1311 | 0.0023% |
| unrealscript | 67.668 | 0.0018% | 585 | 0.001% |
| ooc | 63.188 | 0.0017% | 3416 | 0.0059% |
| pure-data | 62.624 | 0.0017% | 603 | 0.001% |
| xquery | 61.956 | 0.0017% | 2237 | 0.0039% |
| digital-command-language | 59.644 | 0.0016% | 833 | 0.0014% |
| moonscript | 59.208 | 0.0016% | 1951 | 0.0034% |
| awk | 57.176 | 0.0015% | 2206 | 0.0038% |
| pike | 52.872 | 0.0014% | 1262 | 0.0022% |
| livescript | 51.228 | 0.0014% | 5194 | 0.009% |
| solidity | 50.856 | 0.0014% | 3689 | 0.0064% |
| monkey | 48.256 | 0.0013% | 1367 | 0.0024% |
| jsonld | 48.012 | 0.0013% | 462 | 0.0008% |
| zephir | 42.684 | 0.0012% | 1265 | 0.0022% |
| crystal | 41.924 | 0.0011% | 4217 | 0.0073% |
| rhtml | 41.02 | 0.0011% | 4551 | 0.0079% |
| stata | 40.684 | 0.0011% | 1344 | 0.0023% |
| idris | 39.896 | 0.0011% | 3025 | 0.0052% |
| raml | 39.388 | 0.0011% | 948 | 0.0016% |
| openscad | 37.732 | 0.001% | 2178 | 0.0038% |
| red | 35.26 | 0.001% | 1108 | 0.0019% |
| c2hs-haskell | 34.472 | 0.0009% | 1021 | 0.0018% |
| cycript | 33.96 | 0.0009% | 197 | 0.0003% |
| applescript | 33.512 | 0.0009% | 1304 | 0.0023% |
| mupad | 32.488 | 0.0009% | 178 | 0.0003% |
| literate-agda | 31.384 | 0.0008% | 567 | 0.001% |
| boo | 31.172 | 0.0008% | 26289 | 0.0456% |
| sourcepawn | 29.528 | 0.0008% | 717 | 0.0012% |
| qmake | 29.508 | 0.0008% | 3632 | 0.0063% |
| ragel-in-ruby-host | 28.296 | 0.0008% | 888 | 0.0015% |
| io | 27.952 | 0.0008% | 1247 | 0.0022% |
| desktop | 27.648 | 0.0007% | 5021 | 0.0087% |
| propeller-spin | 26.772 | 0.0007% | 625 | 0.0011% |
| thrift | 26.748 | 0.0007% | 1007 | 0.0017% |
| volt | 25.052 | 0.0007% | 1660 | 0.0029% |
| xproc | 24.212 | 0.0007% | 914 | 0.0016% |
| igor-pro | 23.748 | 0.0006% | 388 | 0.0007% |
| lolcode | 23.74 | 0.0006% | 24861 | 0.0431% |
| html+eex | 21.412 | 0.0006% | 2100 | 0.0036% |
| logtalk | 20.428 | 0.0006% | 1035 | 0.0018% |
| mirah | 20.104 | 0.0005% | 706 | 0.0012% |
| gnuplot | 19.676 | 0.0005% | 889 | 0.0015% |
| literate-coffeescript | 19.016 | 0.0005% | 1041 | 0.0018% |
| jflex | 18.608 | 0.0005% | 555 | 0.001% |
| emberscript | 18.392 | 0.0005% | 1024 | 0.0018% |
| cobol | 17.0 | 0.0005% | 24953 | 0.0432% |
| yang | 16.94 | 0.0005% | 597 | 0.001% |
| rebol | 16.468 | 0.0004% | 239 | 0.0004% |
| linker-script | 16.084 | 0.0004% | 1604 | 0.0028% |
| cartocss | 15.916 | 0.0004% | 555 | 0.001% |
| urweb | 13.068 | 0.0004% | 304 | 0.0005% |
| rmarkdown | 13.032 | 0.0004% | 750 | 0.0013% |
| darcs-patch | 13.008 | 0.0004% | 80 | 0.0001% |
| csound | 12.852 | 0.0003% | 229 | 0.0004% |
| squirrel | 12.844 | 0.0003% | 531 | 0.0009% |
| apl | 12.56 | 0.0003% | 586 | 0.001% |
| hlsl | 12.168 | 0.0003% | 1529 | 0.0026% |
| latte | 11.888 | 0.0003% | 1380 | 0.0024% |
| pony | 11.836 | 0.0003% | 624 | 0.0011% |
| ioke | 10.86 | 0.0003% | 373 | 0.0006% |
| hy | 10.512 | 0.0003% | 879 | 0.0015% |
| uno | 10.356 | 0.0003% | 628 | 0.0011% |
| pan | 10.336 | 0.0003% | 637 | 0.0011% |
| xojo | 10.308 | 0.0003% | 642 | 0.0011% |
| papyrus | 10.256 | 0.0003% | 130 | 0.0002% |
| stan | 10.252 | 0.0003% | 540 | 0.0009% |
| slash | 9.904 | 0.0003% | 640 | 0.0011% |
| supercollider | 9.796 | 0.0003% | 318 | 0.0006% |
| vcl | 9.456 | 0.0003% | 747 | 0.0013% |
| smt | 9.032 | 0.0002% | 117 | 0.0002% |
| glyph | 8.948 | 0.0002% | 7 | 0.0% |
| wisp | 8.736 | 0.0002% | 262 | 0.0005% |
| renpy | 8.3 | 0.0002% | 421 | 0.0007% |
| clips | 7.728 | 0.0002% | 450 | 0.0008% |
| dns-zone | 7.56 | 0.0002% | 54 | 0.0001% |
| sas | 7.536 | 0.0002% | 269 | 0.0005% |
| rouge | 7.196 | 0.0002% | 396 | 0.0007% |
| ec | 7.032 | 0.0002% | 94 | 0.0002% |
| dylan | 6.82 | 0.0002% | 280 | 0.0005% |
| tcsh | 6.524 | 0.0002% | 748 | 0.0013% |
| aspectj | 6.332 | 0.0002% | 451 | 0.0008% |
| netlogo | 6.304 | 0.0002% | 140 | 0.0002% |
| gap | 6.096 | 0.0002% | 46 | 0.0001% |
| fancy | 5.952 | 0.0002% | 675 | 0.0012% |
| coq | 5.744 | 0.0002% | 80 | 0.0001% |
| click | 5.74 | 0.0002% | 9 | 0.0% |
| capn-proto | 5.644 | 0.0002% | 330 | 0.0006% |
| flux | 5.572 | 0.0002% | 47 | 0.0001% |
| forth | 5.512 | 0.0001% | 265 | 0.0005% |
| ats | 5.424 | 0.0001% | 383 | 0.0007% |
| netlinx | 5.172 | 0.0001% | 144 | 0.0002% |
| clean | 5.068 | 0.0001% | 171 | 0.0003% |
| parrot-assembly | 4.664 | 0.0001% | 227 | 0.0004% |
| alloy | 4.644 | 0.0001% | 203 | 0.0004% |
| lfe | 4.576 | 0.0001% | 287 | 0.0005% |
| gdscript | 4.488 | 0.0001% | 460 | 0.0008% |
| augeas | 4.444 | 0.0001% | 395 | 0.0007% |
| sparql | 4.404 | 0.0001% | 1036 | 0.0018% |
| lilypond | 4.308 | 0.0001% | 265 | 0.0005% |
| scilab | 4.088 | 0.0001% | 375 | 0.0006% |
| autoit | 4.06 | 0.0001% | 279 | 0.0005% |
| myghty | 3.864 | 0.0001% | 105 | 0.0002% |
| blitzmax | 3.74 | 0.0001% | 220 | 0.0004% |
| creole | 3.416 | 0.0001% | 337 | 0.0006% |
| harbour | 3.336 | 0.0001% | 107 | 0.0002% |
| piglatin | 3.168 | 0.0001% | 513 | 0.0009% |
| opa | 3.164 | 0.0001% | 211 | 0.0004% |
| sage | 3.032 | 0.0001% | 414 | 0.0007% |
| ston | 2.848 | 0.0001% | 414 | 0.0007% |
| maxscript | 2.8 | 0.0001% | 47 | 0.0001% |
| lsl | 2.68 | 0.0001% | 74 | 0.0001% |
| gentoo-ebuild | 2.576 | 0.0001% | 601 | 0.001% |
| nu | 2.38 | 0.0001% | 170 | 0.0003% |
| bro | 2.34 | 0.0001% | 333 | 0.0006% |
| xc | 2.02 | 0.0001% | 88 | 0.0002% |
| j | 1.808 | 0.0% | 142 | 0.0002% |
| metal | 1.724 | 0.0% | 151 | 0.0003% |
| module-management-system | 1.544 | 0.0% | 91 | 0.0002% |
| webidl | 1.508 | 0.0% | 96 | 0.0002% |
| tea | 1.468 | 0.0% | 29 | 0.0001% |
| redcode | 1.272 | 0.0% | 149 | 0.0003% |
| shen | 1.2 | 0.0% | 71 | 0.0001% |
| pov-ray-sdl | 1.136 | 0.0% | 104 | 0.0002% |
| x10 | 1.008 | 0.0% | 33 | 0.0001% |
| brainfuck | 0.964 | 0.0% | 167 | 0.0003% |
| ninja | 0.952 | 0.0% | 187 | 0.0003% |
| golo | 0.896 | 0.0% | 115 | 0.0002% |
| webassembly | 0.86 | 0.0% | 83 | 0.0001% |
| self | 0.824 | 0.0% | 15 | 0.0% |
| labview | 0.808 | 0.0% | 61 | 0.0001% |
| octave | 0.804 | 0.0% | 12 | 0.0% |
| pogoscript | 0.804 | 0.0% | 74 | 0.0001% |
| d | 0.796 | 0.0% | 20 | 0.0% |
| http | 0.736 | 0.0% | 140 | 0.0002% |
| ecl | 0.664 | 0.0% | 48 | 0.0001% |
| chuck | 0.584 | 0.0% | 99 | 0.0002% |
| gosu | 0.524 | 0.0% | 60 | 0.0001% |
| parrot | 0.52 | 0.0% | 17 | 0.0% |
| opal | 0.472 | 0.0% | 69 | 0.0001% |
| objective-j | 0.456 | 0.0% | 37 | 0.0001% |
| kit | 0.412 | 0.0% | 48 | 0.0001% |
| gams | 0.376 | 0.0% | 18 | 0.0% |
| prolog | 0.276 | 0.0% | 35 | 0.0001% |
| clarion | 0.268 | 0.0% | 13 | 0.0% |
| mask | 0.252 | 0.0% | 37 | 0.0001% |
| brightscript | 0.244 | 0.0% | 28 | 0.0% |
| scaml | 0.184 | 0.0% | 31 | 0.0001% |
| matlab | 0.164 | 0.0% | 29 | 0.0001% |
| idl | 0.148 | 0.0% | 1 | 0.0% |
| ags-script | 0.124 | 0.0% | 31 | 0.0001% |
| lookml | 0.12 | 0.0% | 10 | 0.0% |
| apacheconf | 0.108 | 0.0% | 59 | 0.0001% |
| oxygene | 0.104 | 0.0% | 9 | 0.0% |
| txl | 0.096 | 0.0% | 3 | 0.0% |
| grammatical-framework | 0.088 | 0.0% | 39 | 0.0001% |
| renderscript | 0.064 | 0.0% | 54 | 0.0001% |
| mtml | 0.052 | 0.0% | 13 | 0.0% |
| unified-parallel-c | 0.052 | 0.0% | 6 | 0.0% |
| dogescript | 0.04 | 0.0% | 10 | 0.0% |
| gentoo-eclass | 0.04 | 0.0% | 6 | 0.0% |
| zimpl | 0.04 | 0.0% | 7 | 0.0% |
| irc-log | 0.036 | 0.0% | 9 | 0.0% |
| fantom | 0.028 | 0.0% | 11 | 0.0% |
| numpy | 0.028 | 0.0% | 1 | 0.0% |
| cirru | 0.024 | 0.0% | 4 | 0.0% |
| xpages | 0.024 | 0.0% | 7 | 0.0% |
| nginx | 0.02 | 0.0% | 6 | 0.0% |
| objdump | 0.02 | 0.0% | 1 | 0.0% |
| python-traceback | 0.02 | 0.0% | 10 | 0.0% |
| realbasic | 0.012 | 0.0% | 1 | 0.0% |
| befunge | 0.008 | 0.0% | 2 | 0.0% |
| bison | 0.008 | 0.0% | 1 | 0.0% |
| m | 0.008 | 0.0% | 1 | 0.0% |
| omgrofl | 0.008 | 0.0% | 1 | 0.0% |
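The two percentage columns are each row's share of the `total` row, one by size and one by sample count. As a small worked example, the sketch below recomputes the shares for a few rows copied from the table and derives an average size per sample, which shows why a language can rank high by megabytes but low by samples. The function names are illustrative assumptions.

```python
# Totals and a few rows copied from the table above: (megabytes, samples).
TOTAL_MB = 3709175.78
TOTAL_SAMPLES = 57700105
rows = {
    "python": (234311.564, 6189601),
    "objective-c++": (239009.3, 32227),
    "markdown": (171849.952, 7645354),
}


def share(part: float, total: float) -> float:
    """Percentage of `total` represented by `part`, rounded to 4 decimals."""
    return round(100 * part / total, 4)


def mb_per_sample(megabytes: float, samples: int) -> float:
    """Average size of one sample in megabytes."""
    return megabytes / samples


for name, (mb, n) in rows.items():
    print(
        f"{name}: {share(mb, TOTAL_MB)}% of size, "
        f"{share(n, TOTAL_SAMPLES)}% of samples, "
        f"{mb_per_sample(mb, n):.4f} MB per sample"
    )
```

For Python this reproduces the table's 6.3171% of size and 10.7272% of samples. Objective-C++ averages about 7.4 MB per sample, which is how 32,227 samples account for more data than Python's 6.2 million.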
## Additional Information
### Licensing Information
Each sample comes from a code repository with a permissive license. The license is provided by the `license` field for each sample.
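Since the license travels with every record, downstream users can tally or restrict samples by license themselves. The following is a minimal sketch over hand-written records; the helper names and the allow-list are assumptions chosen for illustration, not a recommendation about which licenses to keep.

```python
from collections import Counter


def count_by_license(samples: list[dict]) -> Counter:
    """Count how many samples carry each `license` value."""
    return Counter(s["license"] for s in samples)


def keep_licenses(samples: list[dict], allowed: set[str]) -> list[dict]:
    """Keep only samples whose license is in the caller's allow-list."""
    return [s for s in samples if s["license"] in allowed]


# Hand-written records for illustration (not taken from the dataset).
samples = [
    {"license": "mit"},
    {"license": "apache-2.0"},
    {"license": "mit"},
    {"license": "unknown"},
]

counts = count_by_license(samples)
kept = keep_licenses(samples, allowed={"mit", "apache-2.0"})
print(counts.most_common())
print(len(kept))
```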
### Citation Information
```bibtex
@article{muennighoff2023octopack,
title={OctoPack: Instruction Tuning Code Large Language Models},
author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
journal={arXiv preprint arXiv:2308.07124},
year={2023}
}
```
Provider: bigcode



