Artifact for "Beyond Benchmarks: A Case Study of LLM-Generated Verus Specification Failures on Asterinas Vostd"

Name: Artifact for "Beyond Benchmarks: A Case Study of LLM-Generated Verus Specification Failures on Asterinas Vostd"
Creator: Zenodo
Published: 2026-05-06 06:24:48
License: 暂无描述

DataCite Commons2026-05-06 更新2026-05-07 收录

下载链接：

https://zenodo.org/doi/10.5281/zenodo.20047479

下载链接

链接失效反馈

官方服务：

资源简介：

Large language models (LLMs) have recently shown promise in automating formal specification generation, but their effectiveness has mainly been demonstrated on self-contained benchmark tasks. This paper examines the changes that occur when LLM-generated specifications are applied to an industrial verification project. To answer this question, we conducted experiments on Asterinas OSTD's Vostd repository, a Rust-based OS-kernel verification codebase built around cross-file dependencies and Entity–-Model proof abstractions. We analyze six formal verification targets spanning 156 task instances, 50 Rust source files, and 7,149 lines of code, together with 1,587 reviewed failure episodes produced by DeepSeek-V4-Flash, GPT-4o, and Qwen-Coder during generation and repair. Our results show that existing methods effective on benchmarks face significant challenges on Vostd, and that framework customization for industrial validation is both necessary and has considerable potential. LLM-generated specification breaks in systematic rather than random ways: they cluster across four verification stages (i.e., repository integration, Verus legality, semantic modeling, and proof construction), and their distributions vary across models and targets. We further show that verifier diagnostics do not always identify the true source of failure, and that verifier-guided repair can regress or stall rather than converge. These findings indicate that LLM-generated specifications fail not merely because industrial code is harder, but because failures are structured, stage-dependent, and often repaired in the wrong direction. Future evaluation and tooling should therefore move beyond final pass/fail outcomes and account for where specifications break, whether diagnostics identify the real cause, and whether repair makes progress or simply cycles through new errors.

提供机构：

Zenodo

创建时间：

2026-05-06