MultiPL-E
Source: ModelScope community (updated 2026-05-15, indexed 2025-12-13)
Download link: https://modelscope.cn/datasets/evalscope/MultiPL-E
# Dataset Card for MultiPL-E
## Dataset Description
- **Repository:** https://github.com/nuprl/MultiPL-E
- **Paper:** https://ieeexplore.ieee.org/abstract/document/10103177
- **Point of Contact:** carolyn.anderson@wellesley.edu, mfeldman@oberlin.edu, a.guha@northeastern.edu
## Dataset Summary
MultiPL-E is a dataset for evaluating large language models for code
generation that supports 22 programming languages. It takes the OpenAI
HumanEval and the Mostly Basic Python Programs (MBPP) benchmarks and uses little compilers to
translate them to other languages. It is easy to add support for new languages
and benchmarks.
The dataset is divided into several configurations named *SRCDATA-LANG*, where
*SRCDATA* is either "humaneval" or "mbpp" and *LANG* is one of the supported
languages. We use the canonical file extension for each language to identify
the language, e.g., "cpp" for C++, "lua" for Lua, "clj" for Clojure, and so on.
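The naming scheme above can be sketched in Python as follows. This is a minimal illustration, not part of MultiPL-E itself: the helper names `config_name` and `load_problems` are hypothetical, and the Hugging Face repository id `nuprl/MultiPL-E` and the `test` split are assumptions that do not appear in this card.

```python
SOURCES = ("humaneval", "mbpp")


def config_name(srcdata: str, lang: str) -> str:
    """Build a SRCDATA-LANG configuration name, e.g. 'humaneval-lua'."""
    if srcdata not in SOURCES:
        raise ValueError(f"SRCDATA must be one of {SOURCES}, got {srcdata!r}")
    if not lang:
        raise ValueError("LANG must be a non-empty file extension such as 'cpp'")
    return f"{srcdata}-{lang}"


def load_problems(srcdata: str, lang: str, repo_id: str = "nuprl/MultiPL-E"):
    """Load one configuration. The repo id and split name are assumptions."""
    # Imported lazily so config_name() works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, config_name(srcdata, lang), split="test")


print(config_name("humaneval", "lua"))
print(config_name("mbpp", "clj"))
```

Calling `load_problems("humaneval", "lua")` requires network access and the `datasets` package, so it is left out of the snippet's top-level code.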
## Using MultiPL-E
- MultiPL-E is part of the [BigCode Code Generation LM Harness]. This
is the easiest way to use MultiPL-E.
- MultiPL-E has its own evaluation framework that supports proprietary models,
the prompt ablations, more source benchmarks, and more recently added
programming languages. See the [MultiPL-E tutorial] on how to use this
framework directly.
## The MultiPL-E Ablations
The MultiPL-E paper presented several ablations of the prompt for the original
set of programming languages. We do not include them in the current version of
MultiPL-E, but they are still available in this repository from revision
`d23b094` or earlier. (You can optionally pass the revision to
`datasets.load_dataset`.)
These are the prompt variations:
- *SRCDATA-LANG-keep* is the same as *SRCDATA-LANG*, but the text of the prompt
is totally unchanged. If the original prompt had Python doctests, they remain
as Python instead of being translated to *LANG*. If the original prompt had
Python-specific terminology, e.g., "list", it remains "list", instead of
being translated, e.g., to "vector" for C++.
- *SRCDATA-LANG-transform* transforms the doctests to *LANG* but leaves
the natural language text of the prompt unchanged.
- *SRCDATA-LANG-removed* removes the doctests from the prompt.
Note that MBPP does not have any doctests, so the "removed" and "transform"
variations are not available for MBPP.
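The rules for the ablation configurations can be sketched in Python as follows. The helper names are hypothetical, and the repository id `nuprl/MultiPL-E` and the `test` split are assumptions; only the revision `d23b094`, the three suffixes, and the MBPP restriction come from the text above.

```python
ABLATION_REVISION = "d23b094"  # revision named in this card
VARIANTS = ("keep", "transform", "removed")


def ablation_config(srcdata: str, lang: str, variant: str) -> str:
    """Build a SRCDATA-LANG-VARIANT name, rejecting combinations that do not exist."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    # MBPP has no doctests, so only the "keep" variation applies to it.
    if srcdata == "mbpp" and variant != "keep":
        raise ValueError(f"the {variant!r} variation is not available for MBPP")
    return f"{srcdata}-{lang}-{variant}"


def load_ablation(srcdata: str, lang: str, variant: str,
                  repo_id: str = "nuprl/MultiPL-E"):
    """Load an ablation configuration from the pinned revision (repo id assumed)."""
    # Imported lazily so ablation_config() works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        ablation_config(srcdata, lang, variant),
        revision=ABLATION_REVISION,
        split="test",
    )


print(ablation_config("humaneval", "cpp", "keep"))
```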
## Changelog
### Version 3.3
This update fixes a Lua bug. We had a spurious stop token that would have negatively
impacted all Lua results. Re-evaluating models on Lua with this fix should produce
a result that is identical or slightly higher. See [Issue 165](https://github.com/nuprl/MultiPL-E/issues/165)
for more information.
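To see why an extra stop token can only hurt scores, consider this minimal Python sketch of stop-token truncation. The stop tokens and the Lua completion are made up for illustration; the actual token fixed in Issue 165 is not reproduced here.

```python
def truncate_at_stop(completion: str, stop_tokens: list[str]) -> str:
    """Cut the completion at the earliest occurrence of any stop token."""
    positions = [completion.find(tok) for tok in stop_tokens]
    positions = [p for p in positions if p != -1]
    return completion[: min(positions)] if positions else completion


# A hypothetical model completion for the body of a Lua function.
completion = (
    "  local total = 0\n"
    "  for i = 1, n do\n"
    "    total = total + i\n"
    "  end\n"
    "  return total\n"
    "end\n"
    "\n"
    "print(sum_to(3))"
)

INTENDED_STOPS = ["\nfunction", "\n\n"]  # illustrative, not the real list
SPURIOUS_STOP = "\n  return"             # hypothetical spurious token

good = truncate_at_stop(completion, INTENDED_STOPS)
bad = truncate_at_stop(completion, INTENDED_STOPS + [SPURIOUS_STOP])

print(good)  # the full function body, ending at the closing `end`
print(bad)   # cut off before the return statement, so the tests would fail
```

A completion that never contains the spurious token is unaffected, which is consistent with results being identical or slightly higher after the fix.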
### Version 3.2
MultiPL-E now supports Ada, thanks to [Rowan Walshe](https://github.com/rowan-walshe).
Rowan identified some issues that likely have a small negative impact on the benchmark
scores for existing languages. We have not updated the prompts for those languages
at this time. See the discussions [PR 162](https://github.com/nuprl/MultiPL-E/pull/162)
and [PR 163](https://github.com/nuprl/MultiPL-E/pull/163).
### Version 3.1.1
This version fixes a bug that affected some TypeScript problems, thanks to
[Niels Mündler](https://github.com/nielstron). The issue impacts MBPP-based problems. The fix also
changes whitespace in a few HumanEval-based problems, which should be insignificant. These
are the relevant changes:
```diff
=== mbpp-ts_prompt_mbpp_253_count_integer.diff ===
- function count_integer(list1: number| string| number[]): number {
+ function count_integer(list1: (number | string | number)[]): number {
=== mbpp-ts_prompt_mbpp_278_count_first_elements.diff ===
- function count_first_elements(test_tup: number| [number, number][]): number {
+ function count_first_elements(test_tup: (number | [number, number])[]): number {
=== mbpp-ts_prompt_mbpp_294_max_val.diff ===
- function max_val(listval: string| number[]): number {
+ function max_val(listval: (string | number)[]): number {
=== mbpp-ts_prompt_mbpp_297_flatten_list.diff ===
- function flatten_list(list1: number| number[][]): number[] {
+ function flatten_list(list1: (number | number[])[]): number[] {
=== mbpp-ts_prompt_mbpp_405_check_tuplex.diff ===
- function check_tuplex(tuplex: string| number[], tuple1: any): boolean {
+ function check_tuplex(tuplex: (string | number)[], tuple1: any): boolean {
=== mbpp-ts_prompt_mbpp_410_min_val.diff ===
- function min_val(listval: string| number[]): number {
+ function min_val(listval: (string | number)[]): number {
=== mbpp-ts_prompt_mbpp_419_round_and_sum.diff ===
- function round_and_sum(list1: number| number[]): number {
+ function round_and_sum(list1: (number | number)[]): number {
=== mbpp-ts_prompt_mbpp_65_recursive_list_sum.diff ===
- function recursive_list_sum(data_list: number| number[][]): number {
+ function recursive_list_sum(data_list: (number | number[])[]): number {
=== mbpp-ts_prompt_mbpp_755_second_smallest.diff ===
- function second_smallest(numbers: number| number[]): number | undefined {
+ function second_smallest(numbers: (number | number)[]): number | undefined {
```
See [GitHub Issue 160](https://github.com/nuprl/MultiPL-E/issues/160) for more
information.
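The diffs above share one pattern. In TypeScript the `[]` suffix binds more tightly than `|`, so `string| number[]` means "a string or an array of numbers" rather than "an array of strings or numbers". The following Python sketch shows a type renderer that parenthesizes unions before appending `[]`. The function names are hypothetical and this is not the MultiPL-E translator's actual code.

```python
def ts_union(members: list[str]) -> str:
    """Render a TypeScript union type from its member types."""
    return " | ".join(members)


def ts_array(element_type: str) -> str:
    """Render an array type, wrapping union element types in parentheses."""
    if "|" in element_type:
        return f"({element_type})[]"
    return f"{element_type}[]"


# Without parentheses the array suffix attaches only to the last member.
broken = ts_union(["string", "number"]) + "[]"
fixed = ts_array(ts_union(["string", "number"]))

print(broken)
print(fixed)
```

The sketch does not deduplicate union members, which matches the fixed prompts above, where types such as `(number | number)[]` appear.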
### Version 3.1
MultiPL-E now supports Dart, thanks to [Devon Carew](https://github.com/devoncarew).
### Version 3.0
This is the first significant update since MultiPL-E was used in StarCoder 1.
1. The dataset was versioned at 3.0, and we are bumping the software version to stay in sync.
2. We no longer publish the MultiPL-E ablations, but they are available in
   revision `d23b094` and earlier.
3. New programming languages supported:
   - Clojure, thanks to [Alex Miller](https://github.com/puredanger)
   - Elixir, thanks to [Marko Vukovic](https://github.com/mvkvc)
   - Haskell, thanks to [Thomas Dwyer](https://github.com/Cajunvoodoo)
   - OCaml, thanks to [John Gouwar](https://johngouwar.github.io)
4. Changes to existing HumanEval-based problems:
   - Four Scala problems have fixed prompts/tests (12, 90, 128, 162).
   - Some whitespace-only changes to problems for Racket (18 problems),
     R (36 problems), Julia (159 problems), and D (156 problems). We will try to
     avoid these kinds of changes in the future.
5. The MBPP-based problems have changes analogous to the HumanEval-based problems.
See the directory `diffs_v3.0` in the dataset repository for the diffs to
each prompt.
### Version 0.5.0
Instruction-following support and new languages
- New languages: Luau, Elixir, Lean, Coq, Dafny
- Support for instruction-following prompts
- vLLM support for faster evaluation
### Version 0.4.0
Quality-of-life improvements and new languages
- New languages: OCaml, MATLAB
- Using `.jsonl` instead of `.json` for prompts
- Several bugfixes to prompts
### Version 0.3.0
- This version was used to evaluate [StarCoder]
- This version corrects several bugs in prompts and test cases that resulted in lower
pass@k rates for some of the statically typed languages. The most significant difference
is that the pass@k for Java increases by about 2% on HumanEval.
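For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with `n` samples per problem of which `c` pass the tests, the estimate is `1 - C(n-c, k) / C(n, k)`. The sketch below assumes that definition. The function name is hypothetical and the sample counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem (n, c) pairs; the benchmark score is the mean over problems.
results = [(20, 0), (20, 5), (20, 20)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(round(score, 4))
```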
### Version 0.2.0
This version was used to evaluate [SantaCoder]
[SantaCoder]: https://arxiv.org/abs/2301.03988
[StarCoder]: https://arxiv.org/abs/2305.06161
[BigCode Code Generation LM Harness]: https://github.com/bigcode-project/bigcode-evaluation-harness
[MultiPL-E tutorial]: https://nuprl.github.io/MultiPL-E/
Provider: maas
Created: 2025-12-04



