CodeResearch/CodeJudge-Eval
Source: Hugging Face · Updated: 2024-08-21 · Indexed: 2025-04-19
Download link:
https://hf-mirror.com/datasets/CodeResearch/CodeJudge-Eval
Resource description:
---
license: mit
language:
- code
- en
---
<!-- <p align="center">
<img src="https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/logo.png" width="150" style="margin-bottom: 0.2;"/>
<p>
-->
<h3 align="center"><a href="https://arxiv.org/abs/2408.10718" style="color:#9C276A">
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>
<!-- <h5 align="center">
[](https://huggingface.co/datasets/CodeResearch/CodeJudge-Eval)
[](https://arxiv.org/abs/2408.10718)
[](https://github.com/CodeLLM-Research/CodeJudge-Eval/LICENSE.txt)
</h5>
-->
## Introduction
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce **CodeJudge-Eval (CJ-Eval)**, a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.
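To make the judging setup concrete, here is a minimal sketch of how a model's verdicts on candidate solutions could be scored against gold verdicts. The label names in `VERDICTS`, the helper names `accuracy` and `macro_f1`, and the choice of metrics are assumptions made for illustration; they are not taken from the dataset or the paper, and the actual label set and evaluation protocol may differ.

```python
from collections import Counter

# Hypothetical verdict labels; the actual CJ-Eval label set may differ.
VERDICTS = (
    "accepted",
    "compile_error",
    "runtime_error",
    "wrong_answer",
    "time_limit_exceeded",
)


def accuracy(gold, pred):
    """Fraction of solutions whose predicted verdict matches the gold verdict."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=VERDICTS):
    """Unweighted mean of per-label F1, so rare verdicts count as much as common ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is not the gold verdict
            fn[g] += 1  # g was the gold verdict but was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        if denom == 0:
            continue  # label appears in neither gold nor pred: skip it
        scores.append(2 * tp[label] / denom)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: four judged solutions, one of them misjudged as "accepted".
gold = ["accepted", "wrong_answer", "compile_error", "wrong_answer"]
pred = ["accepted", "accepted", "compile_error", "wrong_answer"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred))
```

Macro-averaging is used in this sketch because a judge that always answers "accepted" can still reach a deceptively high accuracy when correct solutions dominate the data.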
## Experiment Results
For experimental results, see our [paper](https://arxiv.org/abs/2408.10718) and [GitHub repository](https://github.com/CodeLLM-Research/CodeJudge-Eval).
<!-- <p align="center">
<img src="https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/experiments.png" width="1550" style="margin-bottom: 0.2;"/>
<p> -->
## More Details
This work is still in progress. More details will be released in the coming month.
## 📑 Citation
If you find **CodeJudge-Eval** useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{zhao2024codejudgeevallargelanguagemodels,
title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
year={2024},
eprint={2408.10718},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2408.10718},
}
```
Provided by:
CodeResearch



