What is wrong with your code generated by large language models? An extensive study

Source: 中国科学数据 (China Scientific Data) · Updated 2026-01-04 · Indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.1007/s11432-025-4632-8

Resource description:

The rapid development of large language models (LLMs) for code generation has drawn significant attention from researchers. To improve LLM-based code generation, current efforts focus predominantly on collecting high-quality datasets and applying diverse training techniques. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. Our investigation, which measured the length, cyclomatic complexity, and number of API calls of the generated code, revealed that these LLMs struggle to generate correct code for more complex problems and tend to produce code that is shorter yet more complicated than the canonical solutions. Additionally, we developed a taxonomy of bugs in incorrect code that comprises three categories and ten sub-categories, and analyzed the root causes of common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark, RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between real-world scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs, achieving a repair success rate of 29.2% after two iterations, which indicates substantial potential for LLMs to handle more complex problems. Our study provides insights into the current limitations of LLM-based code generation and into opportunities for improving the accuracy and quality of the generated code.
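To make the three code measurements named in the abstract (length, cyclomatic complexity, and API count) concrete, the following is a minimal Python sketch of how such metrics could be computed for a single snippet. It is an illustration under stated assumptions, not the study's tooling: the function name `code_metrics`, the choice of decision-point nodes, and the use of the call-expression count as a rough stand-in for "API number" are all hypothetical.

```python
import ast

# Statement/expression types treated as decision points.
# This choice is an assumption; real complexity tools differ in the details.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor,
                 ast.ExceptHandler, ast.IfExp)


def code_metrics(source: str) -> dict:
    """Return rough length, cyclomatic complexity, and call count for Python source."""
    tree = ast.parse(source)
    # Length: number of non-blank source lines.
    lines = [ln for ln in source.splitlines() if ln.strip()]
    complexity = 1  # one path through straight-line code
    calls = 0
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.comprehension):
            # One loop plus one branch per `if` filter in the comprehension.
            complexity += 1 + len(node.ifs)
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.Call):
            # Hypothetical proxy for "API number": every call expression.
            calls += 1
    return {"lines": len(lines),
            "cyclomatic_complexity": complexity,
            "calls": calls}


if __name__ == "__main__":
    canonical = (
        "def total(xs):\n"
        "    s = 0\n"
        "    for x in xs:\n"
        "        if x > 0:\n"
        "            s += x\n"
        "    return s\n"
    )
    generated = (
        "def total(xs):\n"
        "    return sum(x for x in xs if x > 0)\n"
    )
    print(code_metrics(canonical))
    print(code_metrics(generated))
```

In this toy pair, the second snippet is shorter but packs the same branching into a single line, which is the kind of contrast the length and complexity measurements are meant to surface.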

Created:

2025-10-21
