jordiae/exebench
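ExeBench: an ML-scale dataset of executable C functions
ExeBench is a dataset of millions of C functions paired with their dependencies and metadata, such that at least a subset of them can be executed with IO pairs. It is mainly intended for machine learning applications, but it is application-agnostic enough to have other uses.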
Usage
Option 1: Use the helpers in this repo
```bash
git clone https://github.com/jordiae/exebench.git
cd exebench/
python -m venv venv
source venv/bin/activate
pip install -r requirements_examples.txt
# $(pwd) expands to the current directory, so the repo's modules are importable
PYTHONPATH="${PYTHONPATH}:$(pwd)" python examples/basic.py
```
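Option 2: Directly use the Huggingface Datasets library
```python
# In a notebook cell; from a shell, run `pip install datasets zstandard` instead.
!pip install datasets zstandard

from datasets import load_dataset

# Load a dataset split, in this case the synthetic test split.
dataset = load_dataset('jordiae/exebench', split='test_synth')
for e in dataset:
    ...
```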
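The loop body above is left open, so a natural first step is to look at which fields each entry carries. The following is a minimal sketch of that idea. The helper `summarize_fields` and the toy rows are hypothetical and not part of ExeBench; the field names shown are made up and say nothing about the real schema.

```python
from itertools import islice


def summarize_fields(rows, n=3):
    """Map each field name seen in the first n rows to its Python type name."""
    fields = {}
    for row in islice(rows, n):
        for key, value in row.items():
            # Keep the first type observed for each field.
            fields.setdefault(key, type(value).__name__)
    return fields


# Toy rows standing in for dataset entries; these field names are invented.
toy_rows = [
    {"name": "f", "source": "int f(void) { return 0; }"},
    {"name": "g", "source": "int g(void) { return 1; }", "extra": None},
]
print(summarize_fields(toy_rows))
```

With the real data, passing the object returned by `load_dataset('jordiae/exebench', split='test_synth', streaming=True)` to the same helper should avoid downloading the whole split before inspecting it, assuming the installed version of `datasets` can stream this dataset.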
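Option 3: Directly download the dataset
The dataset consists of directories compressed with TAR. Inside each TAR, there is a series of jsonline files compressed with zstandard.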
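As a rough sketch of how such an archive could be read without unpacking it to disk, the function below walks a TAR file, decompresses each member on the fly, and yields one parsed JSON object per line. The function names and the `.jsonl.zst` suffix are assumptions made for illustration; check the actual file names in the archive you downloaded. Decompression relies on the third-party `zstandard` package.

```python
import io
import json
import tarfile


def _open_zst(raw):
    """Wrap a binary file object in a streaming zstandard decompressor."""
    import zstandard  # third-party: pip install zstandard
    return zstandard.ZstdDecompressor().stream_reader(raw)


def iter_jsonl_from_tar(tar_path, open_member=_open_zst, suffix=".jsonl.zst"):
    """Yield one dict per JSON line from every matching file in the TAR.

    `suffix` is an assumed naming convention, not confirmed by the README.
    """
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(open_member(raw), encoding="utf-8")
            for line in text:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

A call such as `for row in iter_jsonl_from_tar("test_synth.tar"): ...` would then stream records one at a time; the archive name here is only a placeholder. The `open_member` parameter exists so that a different decompressor can be swapped in if the files turn out to use another format.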
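Statistics and versions
This release corresponds to ExeBench v1.01, which includes some improvements over the original version presented in the paper. The final splits of the new version contain the following numbers of functions:
train_not_compilable: 2.357M
train_synth_compilable: 2.308373M
train_real_compilable: 0.675074M
train_synth_simple_io: 0.550116M
train_real_simple_io: 0.043769M
train_synth_rich_io: 0.097250M
valid_synth: 5k
valid_real: 2.133k
test_synth: 5k
test_real: 2.134k
The original dataset (v1.00) can be accessed on request at: https://huggingface.co/datasets/jordiae/exebench_legacy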
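The split sizes listed in this section can double as a sanity check after a download. The sketch below converts the abbreviated counts into integers; the `parse_count` helper is hypothetical, and it is an assumption that the names in the list are exactly the values accepted by `split=`.

```python
from decimal import Decimal

# Split sizes exactly as listed in this README (v1.01).
SPLIT_SIZES = {
    "train_not_compilable": "2.357M",
    "train_synth_compilable": "2.308373M",
    "train_real_compilable": "0.675074M",
    "train_synth_simple_io": "0.550116M",
    "train_real_simple_io": "0.043769M",
    "train_synth_rich_io": "0.097250M",
    "valid_synth": "5k",
    "valid_real": "2.133k",
    "test_synth": "5k",
    "test_real": "2.134k",
}

_MULTIPLIER = {"k": 1_000, "M": 1_000_000}


def parse_count(text):
    """Turn an abbreviated count such as '2.133k' into an integer."""
    number, suffix = text[:-1], text[-1]
    # Decimal avoids floating-point rounding when scaling the value.
    return int(Decimal(number) * _MULTIPLIER[suffix])


expected_rows = {name: parse_count(size) for name, size in SPLIT_SIZES.items()}
```

After loading a split, comparing `len(dataset)` with `expected_rows[split_name]` gives a quick completeness check. The `train_not_compilable` figure appears to be rounded, so treat it as approximate rather than exact.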
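License
All C functions keep their original license. All ExeBench contributions (I/O examples, boilerplate to run functions, etc.) are released under the MIT license.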
Citation
@inproceedings{10.1145/3520312.3534867,
  author = {Armengol-Estap\'{e}, Jordi and Woodruff, Jackson and Brauckmann, Alexander and Magalh\~{a}es, Jos\'{e} Wesley de Souza and O'Boyle, Michael F. P.},
  title = {ExeBench: An ML-Scale Dataset of Executable C Functions},
  year = {2022},
  isbn = {9781450392730},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3520312.3534867},
  doi = {10.1145/3520312.3534867},
  abstract = {Machine-learning promises to transform compilation and software engineering, yet is frequently limited by the scope of available datasets. In particular, there is a lack of runnable, real-world datasets required for a range of tasks ranging from neural program synthesis to machine learning-guided program optimization. We introduce a new dataset, ExeBench, which attempts to address this. It tackles two key issues with real-world code: references to external types and functions and scalable generation of IO examples. ExeBench is the first publicly available dataset that pairs real-world C code taken from GitHub with IO examples that allow these programs to be run. We develop a toolchain that scrapes GitHub, analyzes the code, and generates runnable snippets of code. We analyze our benchmark suite using several metrics, and show it is representative of real-world code. ExeBench contains 4.5M compilable and 700k executable C functions. This scale of executable, real functions will enable the next generation of machine learning-based programming tasks.},
  booktitle = {Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming},
  pages = {50–59},
  numpages = {10},
  keywords = {Code Dataset, Program Synthesis, Mining Software Repositories, C, Machine Learning for Code, Compilers},
  location = {San Diego, CA, USA},
  series = {MAPS 2022}
}




