evalplus-arabic
Source: ModelScope community · Updated 2025-12-04 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/tiiuae/evalplus-arabic
Description:
# 3LM Code Arabic Benchmark
## Dataset Summary
This dataset contains Arabic translations of two widely used code evaluation benchmarks, HumanEval+ and MBPP+, adapted into Arabic for the first time as part of the 3LM project. It covers both the base versions and the plus versions, which extend unit test coverage.
## Motivation
Arabic LLMs lack meaningful benchmarks to assess code generation abilities. This dataset bridges that gap by providing high-quality Arabic natural language descriptions aligned with formal Python test cases.
## Dataset Structure
### `humanevalplus-arabic`
- `task_id`: Unique identifier (e.g., HumanEval/18)
- `prompt`: Task description in Arabic
- `entry_point`: Function name
- `canonical_solution`: Reference Python implementation
- `test`: Test cases
```json
{
  "task_id": "HumanEval/3",
  "prompt": "لديك قائمة من عمليات الإيداع والسحب في حساب بنكي يبدأ برصيد صفري. مهمتك هي اكتشاف إذا في أي لحظة انخفض رصيد الحساب إلى ما دون الصفر، وفي هذه اللحظة يجب أن تعيد الدالة True. وإلا فيجب أن تعيد False.",
  "entry_point": "below_zero",
  "canonical_solution": "...",
  "test": "..."
}
```
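As a rough illustration of how these fields fit together, the sketch below executes a solution and a record's tests, then calls the checker on the function named by `entry_point`. The record is a hypothetical stand-in, not dataset content: it assumes `canonical_solution` holds a complete function definition and that `test` defines `check(candidate)`, as in the original HumanEval. Neither assumption is confirmed by this card, and the real field values are elided above.

```python
# Hypothetical record for illustration only; these values are NOT taken from the dataset.
record = {
    "task_id": "Example/0",
    "entry_point": "below_zero",
    # Assumption: a complete function definition.
    "canonical_solution": (
        "def below_zero(operations):\n"
        "    balance = 0\n"
        "    for op in operations:\n"
        "        balance += op\n"
        "        if balance < 0:\n"
        "            return True\n"
        "    return False\n"
    ),
    # Assumption: tests are wrapped in check(candidate), as in HumanEval.
    "test": (
        "def check(candidate):\n"
        "    assert candidate([]) is False\n"
        "    assert candidate([1, 2, -4, 5]) is True\n"
        "    assert candidate([1, 2, -3]) is False\n"
    ),
}


def passes(record: dict, solution: str) -> bool:
    """Return True if `solution` passes the record's tests."""
    namespace: dict = {}
    try:
        exec(solution, namespace)        # defines the candidate function
        exec(record["test"], namespace)  # defines check(candidate)
        namespace["check"](namespace[record["entry_point"]])
    except Exception:
        return False
    return True


print(passes(record, record["canonical_solution"]))
```

`exec` runs arbitrary code, so model-generated solutions should only be evaluated this way inside a sandbox.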
<br>
### `mbppplus-arabic`
- `task_id`: Unique identifier (e.g., 2)
- `prompt`: Task description in Arabic
- `code`: Canonical Python solution
- `source_file`: Path of the original MBPP problem file
- `test_imports`: Import statements required by the tests
- `test_list`: Three Python `assert` statements for the task
- `test`: Test cases
```json
{
  "task_id": "2",
  "code": "def similar_elements(test_tup1, test_tup2):\n return tuple(set(test_tup1) & set(test_tup2))",
  "prompt": "اكتب دالة للعثور على العناصر المشتركة من القائمتين المعطاتين.",
  "source_file": "Benchmark Questions Verification V2.ipynb",
  "test_imports": "[]",
  "test_list": "...",
  "test": "..."
}
```
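A minimal sketch of how `test_imports`, `code`, and `test_list` might be combined to check a solution. It assumes list-valued fields may arrive as stringified Python lists, as the `"[]"` value of `test_imports` above suggests. The `assert` statements are written for this sketch and are not the dataset's elided `test_list` values; `as_list` and `run_test_list` are hypothetical helper names.

```python
import ast

# Hypothetical record; only `code` and `test_imports` mirror the example above.
record = {
    "task_id": "2",
    "code": "def similar_elements(test_tup1, test_tup2):\n return tuple(set(test_tup1) & set(test_tup2))",
    "test_imports": "[]",
    # Illustrative asserts written for this sketch, not the dataset's values.
    "test_list": str([
        "assert set(similar_elements((1, 2, 3), (2, 3, 4))) == {2, 3}",
        "assert similar_elements((1,), (2,)) == ()",
    ]),
}


def as_list(value):
    """Accept either a real list or its string form, e.g. "[]"."""
    return ast.literal_eval(value) if isinstance(value, str) else list(value)


def run_test_list(record: dict, solution: str) -> list[bool]:
    """Run each assert against `solution`; return one pass/fail flag per assert."""
    namespace: dict = {}
    for line in as_list(record["test_imports"]):
        exec(line, namespace)      # imports the tests rely on
    exec(solution, namespace)      # defines the function under test
    results = []
    for assertion in as_list(record["test_list"]):
        try:
            exec(assertion, namespace)
            results.append(True)
        except Exception:
            results.append(False)
    return results


print(run_test_list(record, record["code"]))
```

Running each assertion separately gives a per-test result rather than stopping at the first failure, which can help when comparing base and plus test coverage.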
## Data Sources
- Original datasets: [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus)
- Translated with GPT-4o
- Validated via backtranslation with a ROUGE-L F1 threshold of at least 0.8, followed by human review
## Translation Methodology
- **Backtranslation** to ensure fidelity
- **Threshold-based filtering** and **manual review**
- **Arabic prompts only**, with code/test logic unchanged to preserve function behavior
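The threshold-based filter can be sketched as follows. This is not the project's implementation: it computes ROUGE-L F1 from the longest common subsequence of the original English prompt and its backtranslation, using lowercase whitespace tokenization as an assumption (the card does not state the tokenizer). The function names are hypothetical, and the card does not say how prompts below the threshold were handled.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference text and a candidate text."""
    # Assumption: lowercase whitespace tokenization.
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def keep_translation(original_en: str, backtranslated_en: str, threshold: float = 0.8) -> bool:
    """Accept a translation when its backtranslation stays close to the original."""
    return rouge_l_f1(original_en, backtranslated_en) >= threshold


print(keep_translation(
    "Write a function to find the shared elements from two lists.",
    "Write a function to find the shared elements from two lists.",
))
```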
## Code and Paper
- EvalPlus-Arabic dataset on GitHub: https://github.com/tiiuae/3LM-benchmark/frameworks/evalplus-arabic/evalplus/data/data_files
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850
## Licensing
[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)
## Citation
```bibtex
@article{boussaha2025threeLM,
title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
journal={arXiv preprint arXiv:2507.15850},
year={2025}
}
```
Provider: maas
Created: 2025-10-02



