ClassEval Dataset
Source: paperswithcode.com (indexed 2025-01-15)
Download link:
https://paperswithcode.com/dataset/classeval
Description:
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct ClassEval, the first class-level code generation benchmark, which consists of 100 class-level Python code generation tasks and took approximately 500 person-hours to build. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our results lead to the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability cannot equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 remain dominantly superior to other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, which show very similar performance. Third, generating the entire class all at once (i.e., the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e., the incremental and compositional strategies) is better for the other models, which have limited ability to understand long instructions and utilize middle information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes.
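The three generation strategies named in the description can be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's actual harness: the `Generator` type, the prompt wording, and the function names are all hypothetical. The way incremental and compositional generation are distinguished here (whether each call sees previously generated methods) is an assumption that the description itself does not spell out.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to generated code.
Generator = Callable[[str], str]


def holistic(skeleton: str, generate: Generator) -> str:
    """Ask for the entire class in a single call."""
    return generate(f"Complete the whole class:\n{skeleton}")


def incremental(skeleton: str, methods: List[str], generate: Generator) -> str:
    """One call per method; each call also sees the methods produced so far (assumed)."""
    done: List[str] = []
    for name in methods:
        context = "\n".join(done)
        done.append(generate(f"{skeleton}\n{context}\nNow write method: {name}"))
    return "\n".join(done)


def compositional(skeleton: str, methods: List[str], generate: Generator) -> str:
    """One call per method, each independent of the others; results are assembled (assumed)."""
    return "\n".join(
        generate(f"{skeleton}\nWrite only method: {name}") for name in methods
    )


def fake_generate(prompt: str) -> str:
    # Stand-in for a real model call: echoes the last line of the prompt as a comment.
    return "# " + prompt.splitlines()[-1]
```

With a real model, `fake_generate` would be replaced by a function that sends the prompt to the model and returns its completion. The stub only makes the control flow of each strategy observable.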
Provider:
Papers with Code



