OSVBench
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/lishangyu-hkust/OSVBench
Resource overview:
OSVBench is a benchmark for evaluating how well large language models (LLMs) generate complete specification code for operating system kernel verification. It consists of 245 complex specification generation tasks built on Hyperkernel, a real-world operating system kernel, and each task comes with a long context of roughly 20k to 30k tokens. The benchmark also reports a comprehensive evaluation of 12 LLMs.



