bigcode/the-stack-inspection-data
Dataset Description
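This dataset is a subset of the-stack dataset, covering 87 programming languages and 295 extensions. Each language has its own folder under data/, which in turn contains one folder per extension. Samples are selected from 20,000 random files of the original dataset, keeping at most 1,000 files per extension.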
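The per-extension cap described above can be illustrated with a minimal sketch. It is not the script used to build this dataset: the function name `cap_per_extension`, the `seed` parameter, and the assumption that each file record is a dict with an `ext` key are all hypothetical, chosen only to show the idea of keeping at most 1,000 files per extension from a random pool.

```python
import random
from collections import defaultdict


def cap_per_extension(files, max_per_ext=1000, seed=0):
    """Keep at most `max_per_ext` records per extension.

    `files` is assumed to be a list of dicts that each carry an "ext" key
    (a hypothetical record layout, used here for illustration only).
    Returns a dict mapping each extension to the records kept for it.
    """
    rng = random.Random(seed)
    shuffled = list(files)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)    # random order, so the kept files are a random subset

    kept = defaultdict(list)
    for record in shuffled:
        bucket = kept[record["ext"]]
        if len(bucket) < max_per_ext:
            bucket.append(record)
    return dict(kept)
```

Extensions with fewer files than the cap keep all of them, which is why not every extension folder necessarily holds exactly 1,000 files.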
Languages
The dataset contains 87 programming languages:
ada, agda, alloy, antlr, applescript, assembly, augeas, awk, batchfile, bison, bluespec, c, c++, c-sharp, clojure, cmake, coffeescript, common-lisp, css, cuda, dart, dockerfile, elixir, elm, emacs-lisp, erlang, f-sharp, fortran, glsl, go, groovy, haskell, html, idris, isabelle, java, java-server-pages, javascript, julia, kotlin, lean, literate-agda, literate-coffeescript, literate-haskell, lua, makefile, maple, markdown, mathematica, matlab, ocaml, pascal, perl, php, powershell, prolog, protocol-buffer, python, r, racket, restructuredtext, rmarkdown, ruby, rust, sas, scala, scheme, shell, smalltalk, solidity, sparql, sql, stan, standard-ml, stata, systemverilog, tcl, tcsh, tex, thrift, typescript, verilog, vhdl, visual-basic, xslt, yacc, zig
Dataset Structure
You can specify which language and extension to load:
# Load the py extension of python
from datasets import load_dataset
load_dataset("bigcode/the-stack-inspection-data", data_dir="data/python/py")
DatasetDict({
    train: Dataset({
        features: [content, lang, size, ext, max_stars_count, avg_line_length, max_line_length, alphanum_fraction],
        num_rows: 1000
    })
})
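As a small extension of the loading call above, the following sketch builds the `data_dir` path from a language and an extension and then inspects one sample. The helper names `extension_data_dir` and `load_extension` are hypothetical and not part of the dataset or the `datasets` library; the column names come from the feature list shown above. It assumes the `datasets` library is installed and that the dataset is reachable from your environment.

```python
def extension_data_dir(lang: str, ext: str) -> str:
    """Build the data_dir path, following the data/<language>/<extension> layout."""
    return f"data/{lang}/{ext}"


def load_extension(lang: str, ext: str):
    """Load the train split for one language/extension pair."""
    # Imported here so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/the-stack-inspection-data",
        data_dir=extension_data_dir(lang, ext),
        split="train",  # return the Dataset directly instead of a DatasetDict
    )


if __name__ == "__main__":
    ds = load_extension("python", "py")
    sample = ds[0]
    # Metadata columns listed in the features above.
    print(sample["lang"], sample["ext"], sample["size"])
    # First 200 characters of the file content.
    print(sample["content"][:200])
```

Passing `split="train"` is a convenience choice here; without it, `load_dataset` returns the `DatasetDict` shown above and the samples are under its `train` key.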




