Dataset for Evaluating Software Engineering Quality Metrics on LLM-Generated Code
Source: Zenodo. Updated: 2026-02-09. Indexed: 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.18210383
Description:
This dataset supports the empirical study “From Correctness to Code Quality: Formalizing Software Engineering Metrics for Evaluating General LLMs,” accepted at ISEC 2026.
The dataset contains 240 code solutions generated by four general-purpose large language models (ChatGPT, Gemini, LLaMA, and DeepSeek) across 15 publicly available LeetCode problems. Each problem is solved in four programming languages: C++, Java, C#, and Python.
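The count of 240 follows from the study design: 4 models × 15 problems × 4 languages. A minimal sketch of that enumeration is shown here; the problem indices 1 to 15 are placeholders (assumed for illustration), not actual LeetCode problem numbers, and the dataset's real file layout may differ.

```python
from itertools import product

# Models and languages as named in the dataset description.
MODELS = ["ChatGPT", "Gemini", "LLaMA", "DeepSeek"]
LANGUAGES = ["C++", "Java", "C#", "Python"]

# Placeholder indices for the 15 problems (assumption: the real
# LeetCode problem identifiers are not listed in this description).
PROBLEM_IDS = list(range(1, 16))

# One entry per (model, problem, language) combination.
solution_grid = list(product(MODELS, PROBLEM_IDS, LANGUAGES))

print(len(solution_grid))  # 4 * 15 * 4 = 240
```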
For each solution, the dataset provides:
- Functional correctness results from the LeetCode online judge
- Runtime and memory usage statistics
- Five software engineering (SE) quality metrics: (i) Error Handling Score (EHS), (ii) Input Validation Score (IVS), (iii) Maintainability Score (MS), (iv) Style & Structure Score (S3), and (v) Documentation Score (DS)
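As a rough illustration of the per-solution fields listed above, one record could be modeled as follows. This is only a sketch: the field names, the units (milliseconds, megabytes), and the numeric score type are assumptions, not the dataset's documented schema, and the example values are placeholders rather than real measurements.

```python
from dataclasses import dataclass


@dataclass
class SolutionRecord:
    """Hypothetical container for one solution; not the dataset's actual schema."""
    model: str          # e.g. "ChatGPT"
    problem: str        # problem identifier (format assumed)
    language: str       # "C++", "Java", "C#", or "Python"
    accepted: bool      # functional correctness per the LeetCode online judge
    runtime_ms: float   # runtime (unit assumed)
    memory_mb: float    # memory usage (unit assumed)
    ehs: float          # Error Handling Score
    ivs: float          # Input Validation Score
    ms: float           # Maintainability Score
    s3: float           # Style & Structure Score
    ds: float           # Documentation Score

    def se_metrics(self) -> dict[str, float]:
        """Return the five SE quality metrics keyed by abbreviation."""
        return {"EHS": self.ehs, "IVS": self.ivs, "MS": self.ms,
                "S3": self.s3, "DS": self.ds}


# Placeholder values for illustration only, not taken from the dataset.
example = SolutionRecord("ChatGPT", "problem-01", "Python", True,
                         0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
print(sorted(example.se_metrics()))
```

Keeping the five scores as separate fields, rather than a single combined value, mirrors the description, which lists them as distinct metrics.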
All code solutions were generated using zero-shot prompting with fixed decoding parameters and without manual modification. The dataset is intended to support reproducible evaluation of LLM-generated code beyond correctness, emphasizing robustness, maintainability, and documentation quality.
This release is designed for research and academic use. Runtime and memory measurements depend on the execution environment of the LeetCode platform and may vary over time.
Provider:
Zenodo
Created:
2026-01-11



