jordiae/exebench

Name: jordiae/exebench
Creator: jordiae
Published: 2023-03-09 16:06:06
License: no description available

Hugging Face · Updated 2023-03-09 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/jordiae/exebench

Overview:

ExeBench is a dataset comprising millions of C functions paired with their dependencies and metadata, enabling at least a subset of them to be executed alongside their corresponding input-output pairs. This dataset is primarily designed for machine learning applications, yet it is sufficiently general to be used for other purposes.

Provided by:

jordiae

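Summary of original information

ExeBench: an ML-scale dataset of executable C functions

ExeBench is a dataset of millions of C functions paired with dependencies and metadata such that at least a subset of them can be executed with IO pairs. It is mainly intended for machine learning applications, but it is general enough to have other uses.

Usage

Option 1: Use the helpers in this repository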

    git clone https://github.com/jordiae/exebench.git
    cd exebench/
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements_examples.txt
    # Add the repository root to PYTHONPATH so the example can import the helpers
    PYTHONPATH="${PYTHONPATH}:$(pwd)" python examples/basic.py

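Option 2: Use the Hugging Face Datasets library directly

    # Notebook syntax; from a shell, drop the leading "!"
    !pip install datasets zstandard

    from datasets import load_dataset

    # Load a dataset split, in this case the synthetic test split
    dataset = load_dataset('jordiae/exebench', split='test_synth')
    for e in dataset:
        ...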

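Option 3: Download the dataset directly

The dataset consists of directories compressed with TAR. Inside each TAR there is a series of jsonline files compressed with zstandard.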
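A minimal sketch of reading such an archive without the Datasets library, assuming the members end in `.jsonl.zst` (the suffix is an assumption, as are the helper names `zstd_reader` and `iter_records`). It uses the standard `tarfile` module and the third-party `zstandard` package, and makes no claim about which fields each record contains.

```python
import io
import json
import tarfile


def zstd_reader(raw):
    """Wrap a binary stream in a zstandard decompressing reader."""
    import zstandard  # third-party: pip install zstandard
    return zstandard.ZstdDecompressor().stream_reader(raw)


def iter_records(tar_fileobj, decompress=zstd_reader, suffix=".jsonl.zst"):
    """Yield one parsed JSON object per non-empty line of every matching member.

    tar_fileobj: binary file object for the TAR archive.
    decompress:  callable turning a member's raw stream into a decompressed one.
    suffix:      assumed file-name suffix of the compressed jsonline members.
    """
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as tar:
        for member in tar:
            # Skip directories and unrelated files.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(decompress(raw), encoding="utf-8")
            for line in text:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Usage would look like `with open(path, "rb") as f: for record in iter_records(f): ...`, where `path` points at one of the downloaded TAR files. Records are streamed one at a time, so the whole archive never has to fit in memory.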

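Statistics and versions

This release corresponds to ExeBench v1.01, a version with some improvements over the original one presented in the paper. The final splits of the new version contain the following numbers of functions: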

    train_not_compilable: 2.357M
    train_synth_compilable: 2.308373M
    train_real_compilable: 0.675074M
    train_synth_simple_io: 0.550116M
    train_real_simple_io: 0.043769M
    train_synth_rich_io: 0.097250M
    valid_synth: 5k
    valid_real: 2.133k
    test_synth: 5k
    test_real: 2.134k
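The abbreviated sizes can be turned into exact function counts with a few lines of Python. This is only an illustrative sketch: `parse_count` and `SPLIT_SIZES` are hypothetical names, and the values are copied from the listing in this section. `Decimal` is used so that figures such as `0.043769M` convert without floating-point rounding.

```python
from decimal import Decimal

# Multipliers for the suffixes used in the split listing.
SUFFIXES = {"k": 1_000, "M": 1_000_000}

# Split sizes as written in this section.
SPLIT_SIZES = {
    "train_not_compilable": "2.357M",
    "train_synth_compilable": "2.308373M",
    "train_real_compilable": "0.675074M",
    "train_synth_simple_io": "0.550116M",
    "train_real_simple_io": "0.043769M",
    "train_synth_rich_io": "0.097250M",
    "valid_synth": "5k",
    "valid_real": "2.133k",
    "test_synth": "5k",
    "test_real": "2.134k",
}


def parse_count(text):
    """Convert a string such as '2.133k' or '0.675074M' to an integer count."""
    text = text.strip()
    multiplier = SUFFIXES.get(text[-1])
    if multiplier is None:
        return int(Decimal(text))
    return int(Decimal(text[:-1]) * multiplier)


counts = {name: parse_count(size) for name, size in SPLIT_SIZES.items()}
for name, count in counts.items():
    print(f"{name}: {count:,}")
```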

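The original dataset (v1.00) can be accessed on request at: https://huggingface.co/datasets/jordiae/exebench_legacy

License

All C functions keep their original license. All ExeBench contributions (I/O examples, boilerplate to run functions, etc.) are released under the MIT license.

Citation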

@inproceedings{10.1145/3520312.3534867,
  author = {Armengol-Estap\'{e}, Jordi and Woodruff, Jackson and Brauckmann, Alexander and Magalh\~{a}es, Jos\'{e} Wesley de Souza and O'Boyle, Michael F. P.},
  title = {ExeBench: An ML-Scale Dataset of Executable C Functions},
  year = {2022},
  isbn = {9781450392730},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3520312.3534867},
  doi = {10.1145/3520312.3534867},
  abstract = {Machine-learning promises to transform compilation and software engineering, yet is frequently limited by the scope of available datasets. In particular, there is a lack of runnable, real-world datasets required for a range of tasks ranging from neural program synthesis to machine learning-guided program optimization. We introduce a new dataset, ExeBench, which attempts to address this. It tackles two key issues with real-world code: references to external types and functions and scalable generation of IO examples. ExeBench is the first publicly available dataset that pairs real-world C code taken from GitHub with IO examples that allow these programs to be run. We develop a toolchain that scrapes GitHub, analyzes the code, and generates runnable snippets of code. We analyze our benchmark suite using several metrics, and show it is representative of real-world code. ExeBench contains 4.5M compilable and 700k executable C functions. This scale of executable, real functions will enable the next generation of machine learning-based programming tasks.},
  booktitle = {Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming},
  pages = {50--59},
  numpages = {10},
  keywords = {Code Dataset, Program Synthesis, Mining Software Repositories, C, Machine Learning for Code, Compilers},
  location = {San Diego, CA, USA},
  series = {MAPS 2022}
}

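Collected summary

Dataset introduction

Construction

ExeBench was built by scraping real-world C code from GitHub, analyzing it with a toolchain, and generating runnable code snippets. The dataset contains millions of C functions, each paired with its dependencies and metadata, and at least a subset of them can be executed with input-output pairs. The aim is to give machine learning applications a large-scale dataset of executable code.

Usage

ExeBench can be used in three ways: with the helper utilities in the repository, which load the dataset and run examples from Python scripts; through the Hugging Face Datasets library, which loads a specific split; or by downloading the dataset directly and decompressing and processing it yourself.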

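Background and challenges

Background overview

In computer science, and particularly where machine learning is applied to compilation and software engineering, datasets of executable code are especially important. ExeBench, published in 2022 by Jordi Armengol-Estapé and colleagues, targets two key issues with real-world code: references to external types and functions, and scalable generation of IO examples. According to its authors, it is the first publicly available dataset that pairs real-world C code taken from GitHub with IO examples that allow the programs to be run. With 4.5M compilable and 700k executable C functions, it is a substantial resource for machine-learning-based programming tasks.

Current challenges

The main challenges in building ExeBench were scraping code from GitHub efficiently and analyzing it statically to produce executable snippets, and generating IO examples at a scale that suits different machine learning applications. Applying the dataset to domain problems such as neural program synthesis and machine-learning-guided program optimization raises further challenges: ensuring that the generated snippets are diverse and representative enough, and accurately assessing how well the dataset trains machine learning models.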

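Common scenarios

Typical use cases

Because it contains a large number of executable C functions together with their dependencies and metadata, ExeBench is a valuable resource for machine learning researchers. Its typical use is as training and test data for machine learning models, especially in program synthesis, code optimization, and software engineering tasks, where it can support work such as function behavior prediction and bug detection.

Academic problems addressed

ExeBench addresses the lack of runnable, real-world code datasets for machine learning applied to compilation and software engineering. Its input-output examples let researchers build and evaluate models that generate working code, which supports research in areas such as automatic program generation and code optimization.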
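Evaluating generated code against input-output examples can be sketched as follows. This is a minimal illustration, not ExeBench's own tooling: `passes_io_pairs`, the `(args, expected)` pair layout, and the Python callables standing in for compiled C functions are all assumptions made for the example.

```python
def passes_io_pairs(candidate, io_pairs):
    """Return True if candidate reproduces every expected output.

    candidate: a callable standing in for a compiled function (assumption).
    io_pairs:  iterable of (args_tuple, expected_output) pairs (assumed layout).
    """
    for args, expected in io_pairs:
        try:
            actual = candidate(*args)
        except Exception:
            # A crash on any input counts as a failure.
            return False
        if actual != expected:
            return False
    return True


# Hypothetical reference behavior: clamp x into the range [lo, hi].
io_pairs = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)]


def good_candidate(x, lo, hi):
    return max(lo, min(x, hi))


def bad_candidate(x, lo, hi):
    return min(x, hi)  # forgets the lower bound


print(passes_io_pairs(good_candidate, io_pairs))
print(passes_io_pairs(bad_candidate, io_pairs))
```

The check treats a candidate as correct only if it matches on every pair, which is a behavioral test rather than a textual comparison with a reference implementation.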

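Practical applications

In practice, ExeBench has a broad range of possible applications. It may help improve compilers and raise software quality, for example through automated testing and code review. It also makes machine-learning-assisted programming tools possible, with potential gains in development efficiency and software reliability.

Recent research on the dataset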