
CodeResearch/CodeJudge-Eval

Source: Hugging Face. Updated: 2024-08-21. Indexed: 2025-04-19.
Download link:
https://hf-mirror.com/datasets/CodeResearch/CodeJudge-Eval
Resource overview:
---
license: mit
language:
- code
- en
---

<!-- <p align="center"> <img src="https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/logo.png" width="150" style="margin-bottom: 0.2;"/> <p> -->

<h3 align="center"><a href="https://arxiv.org/abs/2408.10718" style="color:#9C276A">CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?</a></h3>

<h5 align="center">If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏</h5>

<!-- <h5 align="center"> [![hf_data](https://img.shields.io/badge/🤗-Datasets-9C276A.svg)](https://huggingface.co/datasets/CodeResearch/CodeJudge-Eval) [![arXiv](https://img.shields.io/badge/Arxiv-2408.10718-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2408.10718) [![License](https://img.shields.io/badge/License-MIT-yellow)](https://github.com/CodeLLM-Research/CodeJudge-Eval/LICENSE.txt) </h5> -->

## Introduction

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce **CodeJudge-Eval (CJ-Eval)**, a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities.

## Experiment Results

Please check our [Paper](https://arxiv.org/abs/2408.10718) and [GitHub](https://github.com/CodeLLM-Research/CodeJudge-Eval).
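
The judging task described in the introduction (a model reads a candidate solution and predicts its verdict) can be scored roughly as follows. This is only a minimal sketch: the verdict labels, the function names `accuracy` and `macro_f1`, and the choice of metrics are illustrative assumptions, not taken from the dataset or the paper's evaluation code.

```python
# Hypothetical verdict labels, chosen for illustration only; the dataset's
# actual label set and answer format may differ.
VERDICTS = (
    "Accepted",
    "Compilation Error",
    "Runtime Error",
    "Wrong Answer",
    "Time Limit Exceeded",
)


def accuracy(gold, pred):
    """Fraction of solutions whose predicted verdict matches the gold verdict."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=VERDICTS):
    """Unweighted mean of per-verdict F1, skipping verdicts that never occur."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        if tp + fp + fn == 0:
            continue  # verdict absent from both gold and pred
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: four judged solutions, one of them misjudged.
    gold = ["Accepted", "Wrong Answer", "Wrong Answer", "Runtime Error"]
    pred = ["Accepted", "Accepted", "Wrong Answer", "Runtime Error"]
    print(f"accuracy = {accuracy(gold, pred):.2f}")
    print(f"macro F1 = {macro_f1(gold, pred):.2f}")
```

Macro-averaging is used here because verdict classes are likely to be imbalanced, so plain accuracy alone could reward a judge that always predicts the most common verdict.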
<!-- <p align="center"> <img src="https://github.com/CodeLLM-Research/CodeJudge-Eval/blob/main/experiments.png" width="1550" style="margin-bottom: 0.2;"/> <p> -->

## More Details

This work is still in progress. More details will be released in the coming month.

## 📑 Citation

If you find **CodeJudge-Eval** useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhao2024codejudgeevallargelanguagemodels,
      title={CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?},
      author={Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma},
      year={2024},
      eprint={2408.10718},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.10718},
}
```

Provider:
CodeResearch