CodeResearch/CodeJudge-Eval

Name: CodeResearch/CodeJudge-Eval
Creator: CodeResearch
Published: 2024-08-21 01:11:41
License: MIT (per the dataset card metadata)

Source: Hugging Face. Updated: 2024-08-21. Indexed: 2025-04-19.

Download link:

https://hf-mirror.com/datasets/CodeResearch/CodeJudge-Eval


Resource description:

---
license: mit
language:
- code
- en
---

<h3 align="center"><a href="https://arxiv.org/abs/2408.10718" style="color:#9C276A">CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?</a></h3>
<h5 align="center">If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏</h5>

## Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce **CodeJudge-Eval (CJ-Eval)**, a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.

## Experiment Results

Please check our [Paper](https://arxiv.org/abs/2408.10718) and [GitHub](https://github.com/CodeLLM-Research/CodeJudge-Eval).

## More Details

This work is still in progress. More details will be released in the coming month.

## 📑 Citation

If you find **CodeJudge-Eval** useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhao2024codejudgeevallargelanguagemodels,
  title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
  author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
  year={2024},
  eprint={2408.10718},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2408.10718},
}
```
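
To make the code-judging setup from the introduction more concrete, here is a minimal sketch of how a model's verdicts could be scored against reference verdicts. This is not the authors' evaluation code: the verdict labels in `VERDICTS`, the helper names `accuracy` and `macro_f1`, and the choice of these two metrics are all assumptions made for illustration. The actual label set and scoring protocol are described in the paper and the GitHub repository.

```python
# Illustrative scoring sketch for a code-judging task.
# NOTE: the label names below are hypothetical placeholders, not the
# benchmark's real verdict categories.
VERDICTS = ("accepted", "wrong_answer", "compile_error")


def accuracy(gold, pred):
    """Fraction of solutions whose predicted verdict matches the reference."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=VERDICTS):
    """Unweighted mean of per-label F1 scores over the given labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A label that never appears and is never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    gold = ["accepted", "wrong_answer", "compile_error", "accepted"]
    pred = ["accepted", "accepted", "compile_error", "accepted"]
    print(f"accuracy = {accuracy(gold, pred):.2f}")
    print(f"macro F1 = {macro_f1(gold, pred):.2f}")
```

Macro-averaging is used in this sketch so that rare verdicts count as much as common ones. A judge that labels every solution as correct can still reach a reasonable accuracy on a skewed label distribution, but its macro F1 will be low.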


Provided by:

CodeResearch
