bigcode/commitpackft | Code commit dataset | Data analysis dataset
Dataset Card for CommitPackFT
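Dataset Description
- Summary: CommitPackFT is a 2GB filtered version of CommitPack that contains only high-quality commit messages resembling natural language instructions.
- Creation: The dataset can be recreated using the instructions provided here.
- Languages: 277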
Dataset Structure
Data Instances
An example looks as follows:
```json
{
  "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
  "old_file": "main.py",
  "new_file": "main.py",
  "old_contents": "import numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-5, 5, 20)\ny_data = np.random.normal(0.0, 1.0, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n",
  "new_contents": "import math\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# generate sample data\nx_data = np.linspace(-math.pi, math.pi, 30)\ny_data = np.sin(x_data) + np.random.normal(0.0, 0.1, x_data.size)\n\nplt.plot(x_data, y_data, 'o')\nplt.show()\n\n",
  "subject": "Change to sin() function with noise",
  "message": "Change to sin() function with noise\n",
  "lang": "Python",
  "license": "mit",
  "repos": "MorganR/basic-gaussian-process"
}
```
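Data Fields
The data fields are the same among all splits:
- `commit`: unique commit ID
- `old_file`: name of the file before the commit
- `new_file`: name of the file after the commit
- `old_contents`: contents of the file before the commit
- `new_contents`: contents of the file after the commit
- `subject`: subject of the commit (used for all experiments in the paper)
- `message`: message of the commit (commonly the same as the subject)
- `lang`: programming language
- `license`: license of the repository the code stems from, one of [mit, artistic-2.0, isc, cc0-1.0, epl-1.0, mpl-2.0, unlicense, unknown, apache-2.0, bsd-3-clause, agpl-3.0, lgpl-2.1, bsd-2-clause]
- `repos`: name of the repository the code stems from (comma-separated if there are several)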
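As a rough illustration of how these fields fit together, the sketch below models one row as a Python dataclass, splits the comma-separated `repos` value, and derives a unified diff from `old_contents` and `new_contents`. The class name `CommitRecord`, its helper methods, and the sample row are assumptions made for this example and are not part of the dataset or its tooling; the file contents are shortened and the second repository name is invented.

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class CommitRecord:
    """One CommitPackFT row; attribute names mirror the data fields."""
    commit: str
    old_file: str
    new_file: str
    old_contents: str
    new_contents: str
    subject: str
    message: str
    lang: str
    license: str
    repos: str  # comma-separated when there are several repositories

    def repo_list(self) -> List[str]:
        """Split the comma-separated `repos` field into repository names."""
        return [name.strip() for name in self.repos.split(",") if name.strip()]

    def unified_diff(self) -> str:
        """Render the change from old_contents to new_contents as a unified diff."""
        old_lines = self.old_contents.splitlines(keepends=True)
        new_lines = self.new_contents.splitlines(keepends=True)
        return "".join(
            difflib.unified_diff(
                old_lines, new_lines, fromfile=self.old_file, tofile=self.new_file
            )
        )


# Hypothetical row: contents are shortened and the second repository is made up.
row = {
    "commit": "0c17311f7fd511f5dae8f8e4acc2dce1a2de3cf5",
    "old_file": "main.py",
    "new_file": "main.py",
    "old_contents": "y_data = np.random.normal(0.0, 1.0, x_data.size)\nplt.show()\n",
    "new_contents": "y_data = np.sin(x_data)\nplt.show()\n",
    "subject": "Change to sin() function with noise",
    "message": "Change to sin() function with noise\n",
    "lang": "Python",
    "license": "mit",
    "repos": "MorganR/basic-gaussian-process,example-user/example-fork",
}

record = CommitRecord(**row)
print(record.repo_list())
print(record.unified_diff())
```

Pairing `subject` with the diff (or with the full before and after contents) is one way to treat a row as an instruction plus a code change; whether that matches any particular training setup is not stated in this card.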
Data Splits
Name | Megabytes | % of total size | Samples | % of total samples |
---|---|---|---|---|
total | 1545.02 | 100.0% | 702062 | 100.0% |
ruby | 195.292 | 12.6401% | 69413 | 9.887% |
yaml | 190.876 | 12.3543% | 114320 | 16.2835% |
python | 132.68 | 8.5876% | 56025 | 7.9801% |
markdown | 131.152 | 8.4887% | 62518 | 8.9049% |
javascript | 125.008 | 8.091% | 52989 | 7.5476% |
json | 86.744 | 5.6144% | 39777 | 5.6657% |
shell | 66.864 | 4.3277% | 31217 | 4.4465% |
text | 66.664 | 4.3148% | 46588 | 6.6359% |
php | 60.22 | 3.8977% | 24791 | 3.5312% |
java | 56.284 | 3.6429% | 20635 | 2.9392% |
html | 48.42 | 3.1339% | 20214 | 2.8792% |
c# | 26.84 | 1.7372% | 9346 | 1.3312% |
xml | 23.676 | 1.5324% | 9337 | 1.3299% |
html+erb | 23.104 | 1.4954% | 10910 | 1.554% |
c | 21.08 | 1.3644% | 8506 | 1.2116% |
ini | 21.04 | 1.3618% | 11360 | 1.6181% |
coffeescript | 16.96 | 1.0977% | 5513 | 0.7853% |
swift | 16.272 | 1.0532% | 4849 | 0.6907% |
restructuredtext | 15.728 | 1.018% | 6560 | 0.9344% |
typescript | 14.284 | 0.9245% | 5868 | 0.8358% |
c++ | 14.136 | 0.9149% | 4992 | 0.711% |
scss | 13.208 | 0.8549% | 6829 | 0.9727% |
go | 12.132 | 0.7852% | 5004 | 0.7128% |
scala | 11.184 | 0.7239% | 5040 | 0.7179% |
haml | 10.74 | 0.6951% | 4415 | 0.6289% |
css | 9.364 | 0.6061% | 5049 | 0.7192% |
rust | 7.244 | 0.4689% | 2996 | 0.4267% |
toml | 5.584 | 0.3614% | 3424 | 0.4877% |
jsx | 5.5 | 0.356% | 2199 | 0.3132% |
kotlin | 5.368 | 0.3474% | 2214 | 0.3154% |
clojure | 5.068 | 0.328% | 2403 | 0.3423% |
perl | 4.988 | 0.3228% | 2288 | 0.3259% |
bitbake | 4.464 | 0.2889% | 1308 | 0.1863% |
groovy | 4.168 | 0.2698% | 1486 | 0.2117% |
twig | 3.956 | 0.256% | 1610 | 0.2293% |
nix | 3.84 | 0.2485% | 1593 | 0.2269% |
sql | 3.74 | 0.2421% | 2069 | 0.2947% |
less | 3.724 | 0.241% | 1360 | 0.1937% |
haskell | 3.308 | 0.2141% | 1389 | 0.1978% |
handlebars | 3.292 | 0.2131% | 1429 | 0.2035% |
unknown | 3.048 | 0.1973% | 1597 | 0.2275% |
batchfile | 2.984 | 0.1931% | 1466 | 0.2088% |
cucumber | 2.588 | 0.1675% | 976 | 0.139% |
makefile | 2.528 | 0.1636% | 960 | 0.1367% |
elixir | 2.348 | 0.152% | 1150 | 0.1638% |
jade | 2.348 | 0.152% | 1119 | 0.1594% |
cmake | 2.268 | 0.1468% | 981 | 0.1397% |
powershell | 2.064 | 0.1336% | 991 | 0.1412% |
slim | 2.056 | 0.1331% | 1052 | 0.1498% |
emacs-lisp | 1.972 | 0.1276% | 1015 | 0.1446% |
dart | 1.96 | 0.1269% | 765 | 0.109% |
viml | 1.956 | 0.1266% | 1063 | 0.1514% |
asciidoc | 1.864 | 0.1206% | 523 | 0.0745% |
lua | 1.852 | 0.1199% | 920 | 0.131% |
llvm | 1.6 | 0.1036% | 780 | 0.1111% |
smarty | 1.588 | 0.1028% | 737 | 0.105% |
diff | 1.48 | 0.0958% | 680 | 0.0969% |
common-lisp | 1.448 | 0.0937% | 778 | 0.1108% |
saltstack | 1.412 | 0.0914% | 617 | 0.0879% |
vue | 1.384 | 0.0896% | 587 | 0.0836% |
sass | 1.364 | 0.0883% | 705 | 0.1004% |
fish | 1.328 | 0.086% | 813 | 0.1158% |
erlang | 1.192 | 0.0772% | 480 | 0.0684% |
freemarker | 1.028 | 0.0665% | 510 | 0.0726% |
stylus | 0.948 | 0.0614% | 480 | 0.0684% |
qml | 0.936 | 0.0606% | 368 | 0.0524% |
hcl | 0.912 | 0.059% | 421 | 0.06% |
html+django | 0.848 | 0.0549% | 399 | 0.0568% |
mako | 0.756 | 0.0489% | 170 | 0.0242% |
ada | 0.728 | 0.0471% | 265 | 0.0377% |
ocaml | 0.704 | 0.0456% | 333 | 0.0474% |
f# | 0.656 | 0.0425% | 254 | 0.0362% |
elm | 0.62 | 0.0401% | 265 | 0.0377% |
tex | 0.564 | 0.0365% | 307 | 0.0437% |
rdoc | 0.552 | 0.0357% | 270 | 0.0385% |
csv | 0.532 | 0.0344% | 375 | 0.0534% |
protocol-buffer | 0.524 | 0.0339% | 181 | 0.0258% |
smalltalk | 0.46 | 0.0298% | 284 | 0.0405% |
arduino | 0.456 | 0.0295% | 225 | 0.032% |
java-server-pages | 0.452 | 0.0293% | 173 | 0.0246% |
scheme | 0.42 | 0.0272% | 213 | 0.0303% |
groff | 0.396 | 0.0256% | 192 | 0.0273% |
objective-c++ | 0.376 | 0.0243% |
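The two percentage columns appear to be each row's share of the `total` row, rounded to four decimal places. A minimal sketch of that arithmetic, using three rows copied from the table, is shown below; the helper name `share` and the rounding rule are assumptions inferred from the printed values.

```python
# Totals taken from the "total" row of the table.
TOTAL_MB = 1545.02
TOTAL_SAMPLES = 702062

# (megabytes, samples) for a few rows copied from the table.
rows = {
    "ruby": (195.292, 69413),
    "yaml": (190.876, 114320),
    "python": (132.68, 56025),
}


def share(part: float, total: float) -> float:
    """Return part as a percentage of total, rounded to four decimals."""
    return round(100.0 * part / total, 4)


for name, (megabytes, samples) in rows.items():
    size_pct = share(megabytes, TOTAL_MB)
    sample_pct = share(samples, TOTAL_SAMPLES)
    print(f"{name}: {size_pct}% of size, {sample_pct}% of samples")
```

Comparing the two shares shows which languages have larger or smaller files on average: ruby holds a bigger share of bytes than of samples, while yaml shows the opposite pattern.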
