"From Miscontextion to Misconceptions: Rethinking LLM-Based Vulnerability Detection through Context-Rich Evaluation"

Name: "From Miscontextion to Misconceptions: Rethinking LLM-Based Vulnerability Detection through Context-Rich Evaluation"
Creator: IEEE DataPort
Published: 2026-02-06 02:01:44
License: No description available

DataCite Commons. Updated: 2026-02-06. Indexed: 2026-05-03.

Download link:

https://ieee-dataport.org/documents/miscontextion-misconceptions-rethinking-llm-based-vulnerability-detection-through-context


Description:

"Large Language Models have become promising tools for automated vulnerability detection, supported by their success in code generation, repair, and integration into developer workflows. Yet a key question remains: Are LLMs truly effective at detecting real-world vulnerabilities? Current evaluations, typically limited to isolated functions or files, neglect the broader execution and data-flow context essential for understanding real vulnerabilities. This leads to misleading conclusions and flawed rationales, reducing the reliability of prior studies. Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations. To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically integrates contextual data into LLM-based vulnerability detection. We build a dataset of 2,000 vulnerable–patched program pairs across 99 CWEs and evaluate 13 LLMs from four model families. Our framework elicits both binary predictions and natural-language rationales, further validated with LLM-as-a-judge methods. Our findings overturn existing misconceptions. When provided with sufficient context, state-of-the-art LLMs achieve substantially better performance (e.g., 67% accuracy and an F1 score above 70% on key CWEs), with precision nearing 0.8. We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases."
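
The description reports accuracy, precision, recall, and F1 over binary predictions on vulnerable–patched pairs. As a minimal sketch of how such scores could be computed, assuming each pair yields one verdict for the vulnerable version and one for the patched version: the `PairResult` record and `score_pairs` helper below are hypothetical names for illustration, not part of the dataset or the CORRECT framework.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model verdicts for one vulnerable/patched pair (hypothetical layout)."""
    flagged_vulnerable_version: bool  # model called the vulnerable code vulnerable
    flagged_patched_version: bool     # model called the patched code vulnerable


def score_pairs(results):
    """Compute accuracy, precision, recall, and F1 with 'vulnerable' as the positive class."""
    tp = sum(r.flagged_vulnerable_version for r in results)      # vulnerable, flagged
    fn = sum(not r.flagged_vulnerable_version for r in results)  # vulnerable, missed
    fp = sum(r.flagged_patched_version for r in results)         # patched, still flagged
    tn = sum(not r.flagged_patched_version for r in results)     # patched, cleared
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Scoring both halves of each pair keeps the two classes balanced, so a model that flags everything as vulnerable reaches only 50% accuracy under this assumed setup.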

Provider:

IEEE DataPort

Created:

2026-02-06
