MdEval
Source: ModelScope community (updated 2025-12-05, indexed 2025-12-06)
Download link:
https://modelscope.cn/datasets/m-a-p/MdEval
<!-- # MDEVAL: Massively Multilingual Code Debugging -->
# MDEVAL: Massively Multilingual Code Debugging
<div align="center" style="line-height: 1;">
<a href="https://www.python.org/">
<img alt="Build" src="https://img.shields.io/badge/Python-3.9+-1f425f.svg?color=purple" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="" style="margin: 2px;">
<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53%3F?color=green" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="" style="margin: 2px;">
<img alt="Data License" src="https://img.shields.io/badge/Data_License-CC--BY--SA--4.0-f5de53%3F?color=blue" style="display: inline-block; vertical-align: middle;"/>
</a>
<!-- <a href="" style="margin: 2px;">
<img alt="Data License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
</a> -->
</div>
<hr>
Official repository for our paper "MDEVAL: Massively Multilingual Code Debugging"
<p align="left">
<a href="https://mdeval-code.github.io/">🏠 Home Page </a> •
<a href="https://huggingface.co/datasets/Multilingual-Multimodal-NLP/MDEVAL">📊 Benchmark Data </a> •
<a href="https://mdeval-code.github.io/leaderboard.html">🏆 Leaderboard </a>
</p>
## Table of contents
- [MDEVAL: Massively Multilingual Code Debugging](#mdeval-massively-multilingual-code-debugging)
- [📌 Introduction](#introduction)
- [🏆 Results](#results)
- [📋 Task Examples](#task-examples)
- [📚 Data](#data)
- [💻 Usage](#usage)
- [📖 Citation](#citation)
## Introduction
**MDEVAL** is a massively multilingual debugging benchmark covering **20** programming languages with **3.9K** test samples and three tasks focused on bug fixing. It substantially pushes the limits of code LLMs in multilingual scenarios.
<p align="center">
<img src="assets/intro.png" width="50%" alt="MDEVAL" />
</p>
### Task Examples
MDEVAL covers the automated program repair (APR) task, the bug localization (BL) task, and the bug identification (BI) task. Here is a visualization example from MDEVAL, where the model is required to address all three tasks.
<p align="center">
<img src="assets/bench_cases.png" width="80%" alt="MDEVAL" />
</p>
### Error types in MDEVAL
MDEVAL covers **47** distinct error types, including both generic errors that occur across all programming languages and language-specific errors such as "Missing Mut" in Rust and "Misused Macro Definition" in C.
<p align="center">
<img src="assets/error_type.png" width="80%" alt="MDEVAL" />
</p>
## Results
We systematically evaluate the multilingual code debugging capabilities of **40** models on MDEVAL and maintain a leaderboard that dynamically evaluates them on **20** programming languages. Notably, extensive experiments suggest that comprehensive multilingual multitask evaluation can realistically measure the gap between open-source and closed-source models.
<p align="center">
<img src="assets/result.png" width="100%" alt="MDEVAL" />
</p>
<!-- <p align="center">
<img src="assets/radar.png" width="100%" alt="McEval" />
</p> -->
Refer to our <a href="https://mdeval-code.github.io/leaderboard.html">🏆 Leaderboard </a> for more results.
## Data
<div align="center">
| **Dataset** | **Download** |
| :------------: | :------------: |
| MDEVAL Evaluation Dataset | [🤗 HuggingFace](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/McEval) |
</div>
### Data File Structure
```
.
|-- bug : APR tasks providing only buggy code
|-- doc : APR tasks providing functional descriptions of programs
|-- example : APR tasks providing demonstration examples
|-- ident : Bug Identification
|-- loc : Bug Localization
|-- loc_apr : APR tasks providing bug location information
|-- raw : Raw data
`-- review : Code Review
```
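As a rough illustration of the layout above, the sketch below checks a local copy of the data for the listed task directories. The directory names and descriptions are copied from the listing; the helper name `missing_task_dirs` and the idea of a local `data_root` are assumptions made for this example and are not part of the repository.

```python
from pathlib import Path

# Directory names and descriptions copied from the listing above.
TASK_DIRS = {
    "bug": "APR tasks providing only buggy code",
    "doc": "APR tasks providing functional descriptions of programs",
    "example": "APR tasks providing demonstration examples",
    "ident": "Bug Identification",
    "loc": "Bug Localization",
    "loc_apr": "APR tasks providing bug location information",
    "raw": "Raw data",
    "review": "Code Review",
}


def missing_task_dirs(data_root):
    """Return the expected task directories that are absent under data_root.

    Hypothetical helper: data_root is wherever you unpacked the dataset.
    """
    root = Path(data_root)
    return sorted(name for name in TASK_DIRS if not (root / name).is_dir())
```

An empty list from `missing_task_dirs(...)` suggests the local copy matches the structure shown above.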
## Usage
### Environment
We recommend using Docker for evaluation. We provide a Docker image with all the necessary environments pre-installed.
<!-- Docker images will be released soon. -->
Directly pull the image from Docker Hub:
```bash
# Docker hub:
docker pull multilingualnlp/mdeval
# Replace <image-name> with the image pulled above (multilingualnlp/mdeval)
docker run -it -d --restart=always --name mdeval_dev --workdir / <image-name> /bin/bash
docker attach mdeval_dev
```
<!-- ### Inference
We provide the standard format for JSON files obtained after model inference.
```json
{
"question_id": "",
"category": "",
"subtype": "",
"level": "",
"example": "",
"docstring": "",
"canonical_solution": "",
"buggy_code": "",
"test": "",
"instruction": "",
"fix_code":"" //model output
}
``` -->
### Evaluation
#### Data Format
**🛎️ Please prepare the inference results of the model in the following format and use them for the next evaluation step.**
We provide a concise inference code example to help you get started quickly. The code is located under the path `inference/chat.py`, and you can initiate the inference process using the following bash script:
```bash
sh inference/chat.sh
```
##### Notes ⚠️
1. **Model and Task Configuration**: Before use, please ensure that the inference model and evaluation tasks are correctly configured in the `chat.sh` script.
2. **Flexible Customization**: You can flexibly modify the `chat` function in `inference/chat.py` according to your actual needs to accommodate different inference scenarios.
(1) Folder Structure
Place the data in the following folder structure, where each file holds the test results for one language.
```bash
\data\chat_result\${model}\${task}
- CPP.jsonl
- Python.jsonl
- Java.jsonl
...
```
Here, `model` is the model being tested and `task` is the evaluation task, for example `doc`, `bug`, `example`, `review`, `ident`, or `loc`.
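The folder convention above can be sketched as a small path helper. This is only an illustration: the function name `result_file`, the relative root `data/chat_result`, and the model name in the usage line are assumptions, and the evaluation scripts may resolve paths differently.

```python
from pathlib import Path


def result_file(model, task, language, root="data/chat_result"):
    """Build <root>/<model>/<task>/<language>.jsonl for one language's results.

    Hypothetical helper mirroring the folder structure described above.
    """
    return Path(root) / model / task / f"{language}.jsonl"


def ensure_result_dir(model, task, root="data/chat_result"):
    """Create the per-model, per-task directory if it does not exist yet."""
    directory = Path(root) / model / task
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

For instance, `result_file("my-model", "doc", "Python")` points at the `Python.jsonl` file for the `doc` task of a model named `my-model` (a placeholder name).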
(2) File Format
Each line in the file for each test language has the following format.
The *llm_response* field is the generated code.
<!-- More examples can be found in [Evualute Data Format Examples](examples/evaluate/) -->
```json
{
"question_id": "",
"category": "",
"subtype": "",
"level": "",
"example": "",
"docstring": "",
"canonical_solution": "",
"buggy_code": "",
"test": "",
"instruction": "",
"llm_response":"" //model output
}
```
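A minimal sketch of producing one such line, assuming each benchmark sample is already loaded as a Python dict with the fields shown above. The helper name `to_jsonl_line` is hypothetical. Note that the `//model output` annotation in the format is explanatory only, since JSON itself does not allow comments.

```python
import json

# Field names copied from the format shown above.
REQUIRED_FIELDS = (
    "question_id", "category", "subtype", "level", "example", "docstring",
    "canonical_solution", "buggy_code", "test", "instruction", "llm_response",
)


def to_jsonl_line(sample, llm_response):
    """Attach the model output to a sample and serialize it as a single line.

    Hypothetical helper: raises KeyError if any expected field is missing.
    """
    record = dict(sample, llm_response=llm_response)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    # ensure_ascii=False keeps non-ASCII characters in the code readable.
    return json.dumps(record, ensure_ascii=False)
```

Writing one returned line per sample, each followed by a newline, yields a `.jsonl` file in the format described above.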
#### Evaluate APR Task
Taking the evaluation of the generation (APR) task as an example, run:
```bash
sh excute/apr.sh
```
<!-- ## More Examples
More examples could be found in [Examples](docs/Examples.md) -->
## License
This code repository is licensed under [the MIT License](LICENSE-CODE). The use of MDEVAL data is subject to the [CC-BY-SA-4.0](LICENSE-DATA) license.
## Citation
If you find our work helpful, please use the following citation.
```bibtex
@misc{liu2024mdevalmassivelymultilingualcode,
title={MdEval: Massively Multilingual Code Debugging},
author={Shukai Liu and Linzheng Chai and Jian Yang and Jiajun Shi and He Zhu and Liran Wang and Ke Jin and Wei Zhang and Hualei Zhu and Shuyue Guo and Tao Sun and Jiaheng Liu and Yunlong Duan and Yu Hao and Liqun Yang and Guanglin Niu and Ge Zhang and Zhoujun Li},
year={2024},
eprint={2411.02310},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.02310},
}
```
<!-- ## Contact -->
Provider:
maas
Created:
2025-08-28



