Listed on ModelScope (魔搭社区), indexed 2024-05-15, updated 2025-11-12. Download link: https://modelscope.cn/datasets/m-a-p/CodeEditorBench
# CodeEditorBench
[**🌐 Homepage**](https://codeeditorbench.github.io/) | [**🤗 Dataset**](https://huggingface.co/datasets/m-a-p/CodeEditorBench) | [**📖 arXiv**](https://arxiv.org/pdf/2404.03543.pdf) | [**GitHub**](https://github.com/CodeEditorBench/CodeEditorBench)
## Introduction
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models in CodeEditorBench, and that model performance differs by problem type and by sensitivity to the prompt.
CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

## Results
<div style="display: flex; justify-content: space-around; align-items: center;">
<img src="Models_Zero_Shot.png" alt="Radar chart of zero-shot model performance" style="width: 48%;" />
<img src="win_rate_zero.png" alt="Zero-shot win rate comparison of models" style="width: 48%;" />
</div>
We propose evaluating LLMs across four scenarios that capture different code editing capabilities: code debug, code translate, code polish, and code requirement switch. The left figure is a radial plot of model performance across the four scenarios in CodeEditorBench\_Plus, highlighting how the relative differences between models change from scenario to scenario. The right figure shows the zero-shot performance of open-source and closed-source models on CodeEditorBench\_Plus, measured by win\_rate.
🎯 All model results are generated with greedy decoding.
✨ Code Debug, Code Translate, and Code Requirement Switch are evaluated with pass@1, while Code Polish is evaluated with Mean OptScore.
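
As a rough illustration of the pass@1 metric mentioned above: with greedy decoding there is one generated solution per problem, so pass@1 reduces to the fraction of problems whose single solution passes all of its test cases. The sketch below assumes that reading; the function name `pass_at_1`, the problem ids, and the input layout are hypothetical and are not taken from the CodeEditorBench codebase. Mean OptScore is not sketched because its definition is not given here.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems whose single (greedy) solution passes every test.

    `results` maps a hypothetical problem id to the per-test outcomes of the
    one solution generated for that problem.
    """
    if not results:
        raise ValueError("no problems to score")
    # A problem with no recorded test outcomes is counted as unsolved.
    solved = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return solved / len(results)


# Hypothetical outcomes for three problems, one greedy sample each.
demo_results = {
    "debug_001": [True, True, True],   # all tests pass -> solved
    "debug_002": [True, False],        # one failing test -> unsolved
    "translate_001": [True],           # solved
}
demo_score = pass_at_1(demo_results)   # 2 of 3 problems solved
```

Under this assumption a problem counts as solved only when every test passes, so partial credit is not awarded.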
## Disclaimers
The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution.
Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to [contact](#contact) us. Upon verification, such samples will be promptly removed.
## Contact
<!-- - Jiawei Guo: moriatysss152@gmail.com
- Ziming Li :
- Xueling Liu:
- Kaijing Ma: -->
- Ge Zhang: zhangge@01.ai
- Wenhu Chen: wenhuchen@uwaterloo.ca
- Jie Fu: jiefu@ust.hk
## Citation
**BibTeX:**
```bibtex
@misc{guo2024codeeditorbench,
title={CodeEditorBench: Evaluating Code Editing Capability of Large Language Models},
author={Jiawei Guo and Ziming Li and Xueling Liu and Kaijing Ma and Tianyu Zheng and Zhouliang Yu and Ding Pan and Yizhi LI and Ruibo Liu and Yue Wang and Shuyue Guo and Xingwei Qu and Xiang Yue and Ge Zhang and Wenhu Chen and Jie Fu},
year={2024},
eprint={2404.03543},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
```
Provider: maas
Created: 2024-04-13



