tiiuae/evalplus-arabic

Name: tiiuae/evalplus-arabic
Creator: tiiuae
Published: 2026-02-14 19:18:59
License: No description available

Hugging Face, updated 2026-02-14, indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/tiiuae/evalplus-arabic

Description:

---
dataset_info:
- config_name: humanevalplus-arabic
  features:
  - name: task_id
    dtype: string
  - name: prompt
    dtype: string
  - name: canonical_solution
    dtype: string
  - name: entry_point
    dtype: string
  - name: test
    dtype: string
  splits:
  - name: test
    num_bytes: 10978353
    num_examples: 164
  download_size: 2907286
  dataset_size: 10978353
- config_name: mbppplus-arabic
  features:
  - name: task_id
    dtype: int64
  - name: code
    dtype: string
  - name: prompt
    dtype: string
  - name: source_file
    dtype: string
  - name: test_imports
    dtype: string
  - name: test_list
    dtype: string
  - name: test
    dtype: string
  splits:
  - name: test
    num_bytes: 4855903
    num_examples: 378
  download_size: 1132190
  dataset_size: 4855903
configs:
- config_name: humanevalplus-arabic
  data_files:
  - split: test
    path: humanevalplus-arabic/test-*
- config_name: mbppplus-arabic
  data_files:
  - split: test
    path: mbppplus-arabic/test-*
---

# 3LM Code Arabic Benchmark

## Dataset Summary

This dataset contains Arabic translations of two widely used code evaluation benchmarks, HumanEval+ and MBPP+, adapted into Arabic for the first time as part of the 3LM project. It includes both the base and plus versions with extended unit test coverage.

## Motivation

Arabic LLMs lack meaningful benchmarks for assessing code generation abilities. This dataset bridges that gap by providing high-quality Arabic natural language descriptions aligned with formal Python test cases.

## Dataset Structure

### `humanevalplus-arabic`

- `task_id`: Unique identifier (e.g., HumanEval/18)
- `prompt`: Task description in Arabic
- `entry_point`: Function name
- `canonical_solution`: Reference Python implementation
- `test`: Test cases

```json
{
  "task_id": "HumanEval/3",
  "prompt": "لديك قائمة من عمليات الإيداع والسحب في حساب بنكي يبدأ برصيد صفري. مهمتك هي اكتشاف إذا في أي لحظة انخفض رصيد الحساب إلى ما دون الصفر، وفي هذه اللحظة يجب أن تعيد الدالة True. وإلا فيجب أن تعيد False.",
  "entry_point": "below_zero",
  "canonical_solution": "...",
  "test": "..."
}
```

<br>

### `mbppplus-arabic`

- `task_id`: Unique identifier (e.g., 2)
- `prompt`: Task description in Arabic
- `code`: Canonical Python solution
- `source_file`: Path of the original MBPP problem file
- `test_imports`: Import statements required by the tests
- `test_list`: Three Python `assert` statements for the task
- `test`: Test cases

```json
{
  "task_id": 2,
  "code": "def similar_elements(test_tup1, test_tup2):\n return tuple(set(test_tup1) & set(test_tup2))",
  "prompt": "اكتب دالة للعثور على العناصر المشتركة من القائمتين المعطاتين.",
  "source_file": "Benchmark Questions Verification V2.ipynb",
  "test_imports": "[]",
  "test_list": "...",
  "test": "..."
}
```

## Data Sources

- Original datasets: [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus)
- Translated with GPT-4o
- Validated via backtranslation with a ROUGE-L F1 threshold (0.8 or higher), followed by human review

## Translation Methodology

- **Backtranslation** to ensure fidelity
- **Threshold-based filtering** and **manual review**
- **Arabic prompts only**, with code and test logic unchanged to preserve function behavior

## Code and Paper

- EvalPlus-Arabic dataset on GitHub: https://github.com/tiiuae/3LM-benchmark/frameworks/evalplus-arabic/evalplus/data/data_files
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper: https://aclanthology.org/2025.arabicnlp-main.4/

## Licensing

[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)

## Citation

```bibtex
@inproceedings{boussaha-etal-2025-3lm,
    title = "3{LM}: Bridging {A}rabic, {STEM}, and Code through Benchmarking",
    author = "Boussaha, Basma El Amel and Al Qadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim",
    booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.arabicnlp-main.4/",
    doi = "10.18653/v1/2025.arabicnlp-main.4",
    pages = "42--63",
    ISBN = "979-8-89176-352-4",
}
```
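The backtranslation check mentioned under Data Sources compares an original English prompt with the English backtranslation of its Arabic version and keeps pairs whose ROUGE-L F1 is at least 0.8. The sketch below illustrates that metric in plain Python. It is not the authors' pipeline: the lowercased whitespace tokenization and the function names (`lcs_length`, `rouge_l_f1`, `passes_backtranslation_check`) are assumptions made for illustration, and the card does not say which ROUGE implementation was used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def passes_backtranslation_check(original_en, backtranslated_en, threshold=0.8):
    """True when the backtranslation is close enough to the original prompt."""
    return rouge_l_f1(original_en, backtranslated_en) >= threshold
```

For instance, calling `passes_backtranslation_check` on an original prompt and a backtranslation that only swaps a synonym or reorders two adjacent words should clear the 0.8 threshold, while an unrelated sentence should not. Per the card, prompts are also reviewed by humans after this automatic filter.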

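Because only the prompts are translated and the code and test logic are unchanged, a generated solution can be scored by executing it against a record's assertions. The sketch below shows one way to do that for a record shaped like the `mbppplus-arabic` fields in the dataset card. Several things here are assumptions: `test_imports` and `test_list` are treated as string-encoded Python lists (the card only states that they are strings and shows `"[]"` for `test_imports`), the assertion in `example_record` is illustrative rather than copied from the dataset, and `run_mbpp_record` is a hypothetical helper, not part of EvalPlus.

```python
import ast


def run_mbpp_record(record, candidate_code=None):
    """Return True if the code passes every assertion in record['test_list'].

    Falls back to the record's canonical `code` when no candidate is given.
    """
    namespace = {}
    try:
        # Assumed encoding: a string holding a Python list of import lines.
        for import_line in ast.literal_eval(record["test_imports"]):
            exec(import_line, namespace)
        exec(candidate_code or record["code"], namespace)
        # Assumed encoding: a string holding a Python list of assert statements.
        for assertion in ast.literal_eval(record["test_list"]):
            exec(assertion, namespace)
    except Exception:
        # Any failed assertion or runtime error counts as a failed task.
        return False
    return True


# Illustrative record; the assertion is an example, not dataset content.
example_record = {
    "task_id": 2,
    "code": (
        "def similar_elements(test_tup1, test_tup2):\n"
        "    return tuple(set(test_tup1) & set(test_tup2))"
    ),
    "test_imports": "[]",
    "test_list": (
        '["assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}"]'
    ),
}
```

`exec` runs arbitrary code in the current process, so model-generated solutions should only be evaluated this way inside a sandbox with time and resource limits.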
