FormAI dataset
Source: arXiv. Updated: 2024-03-28. Indexed: 2024-06-21.
Download link:
https://github.com/FormAI-Dataset
Description:
The FormAI dataset is the first large-scale AI-generated dataset, containing 112,000 independent, compilable C programs that perform a variety of computational tasks. Each program is labeled using formal verification, specifically the ESBMC (Efficient SMT-based Bounded Model Checker) tool, to identify the vulnerabilities it contains. The programs range from simple string manipulation to complex network management, table games, and encryption tasks. Every vulnerability found in a program's source code is labeled with its type, line number, and the name of the vulnerable function. The dataset is suited to evaluating the effectiveness of static and dynamic analysis tools, and it can be used to train LLMs and other machine learning algorithms to address software security problems.
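As a rough illustration of how labels of this kind (type, line number, function name) might be handled once loaded, here is a minimal Python sketch. The record layout, the field names, the `program_id` key, and the sample values are all hypothetical and chosen for illustration; they are not the dataset's actual schema or contents, so check the files at the download link before relying on any of them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnerabilityLabel:
    """One annotation. Field names are hypothetical, not the dataset's schema."""
    program_id: str     # hypothetical identifier for the C program
    vuln_type: str      # vulnerability type reported for the program
    line_number: int    # line in the C source where the issue was reported
    function_name: str  # function that contains the vulnerable line


def count_by_type(labels):
    """Return a Counter mapping each vulnerability type to its label count."""
    return Counter(label.vuln_type for label in labels)


def vulnerable_lines(labels, program_id):
    """Return the sorted, de-duplicated line numbers flagged for one program."""
    return sorted({l.line_number for l in labels if l.program_id == program_id})


# Made-up sample records, for illustration only.
SAMPLE_LABELS = [
    VulnerabilityLabel("prog_0001", "buffer overflow", 17, "main"),
    VulnerabilityLabel("prog_0001", "dereference failure", 42, "parse_input"),
    VulnerabilityLabel("prog_0002", "buffer overflow", 8, "copy_name"),
]
```

Grouping by type gives a quick view of class balance before training a model, and the per-program line list is the sort of ground truth a static analyzer's findings could be compared against.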
Provided by:
Technology Innovation Institute, Abu Dhabi, UAE
Created:
2023-07-05



