ai-for-good-lab/Global-MMLU-Lite

Name: ai-for-good-lab/Global-MMLU-Lite
Creator: ai-for-good-lab
Published: 2026-04-15 05:57:50
License: apache-2.0

Source: Hugging Face. Updated 2026-04-15; indexed 2026-05-10.

Download link:

https://hf-mirror.com/datasets/ai-for-good-lab/Global-MMLU-Lite

Description:

---
license: apache-2.0
task_categories:
- question-answering
- multiple-choice
language:
- iu
- ny
- mi
tags:
- byol
- mmlu
- global-mmlu
- global-mmlu-lite
- human-translated
- low-resource-languages
- evaluation
- benchmark
pretty_name: Global MMLU-Lite (Chichewa, Māori, Inuktitut)
source_datasets:
- CohereForAI/Global-MMLU-Lite
size_categories:
- 1K<n<10K
configs:
- config_name: nya
  data_files:
  - split: test
    path: nya/test.jsonl
  - split: dev
    path: nya/dev.jsonl
- config_name: mri
  data_files:
  - split: test
    path: mri/test.jsonl
  - split: dev
    path: mri/dev.jsonl
- config_name: iku
  data_files:
  - split: test
    path: iku/test.jsonl
  - split: dev
    path: iku/dev.jsonl
---

# Global MMLU-Lite: Human Translated

[Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) is a multilingual evaluation benchmark for LLMs covering 18 languages. This dataset extends it with **professional human translations** for three additional low-resource languages that are not in the original: **Chichewa (nya)**, **Māori (mri)**, and **Inuktitut (iku)**.

Released as part of the [BYOL: Bring Your Own Language Into LLMs](https://github.com/microsoft/byol) project ([paper](https://arxiv.org/abs/2601.10804)).

## What's New

The [original Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) by Cohere covers 18 languages: Arabic, Bengali, Chinese, Burmese, Welsh, German, English, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Albanian, Swahili, and Yoruba.

This dataset adds **3 new languages** with professional human translations (not machine-translated):

| Language | ISO 639-3 | Subset | Test | Dev |
|---|---|---|---|---|
| Chichewa | nya | `nya` | 400 | 215 |
| Māori | mri | `mri` | 400 | 215 |
| Inuktitut | iku | `iku` | 390 | 211 |

## Usage

```python
from datasets import load_dataset

# Same API as the original CohereForAI/Global-MMLU-Lite
ds = load_dataset("ai-for-good-lab/Global-MMLU-Lite", "mri", split="test")
print(ds[0])
```

## Format

All subsets share the same 17-column schema as the [original dataset](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite), including `sample_id`, `subject`, `subject_category`, `question`, `option_a`–`option_d`, `answer`, and annotation metadata fields.

> **Note:** Annotation metadata (e.g., `required_knowledge`, `cultural_sensitivity_label`) for
> Māori and Inuktitut was populated from the Chichewa annotations by matching on `sample_id`,
> since all three languages translate the same English source questions.

## License

This dataset is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), consistent with the [original Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) license.

## Citation

```bibtex
@article{zamir2026byolbringlanguagellms,
  title={BYOL: Bring Your Own Language Into LLMs},
  author={Syed Waqas Zamir and Wassim Hamidouche and Boulbaba Ben Amor and Luana Marotti and Inbal Becker-Reshef and Juan Lavista Ferres},
  year={2026},
  journal={arXiv:2601.10804},
  url={https://arxiv.org/abs/2601.10804},
}
```

## Acknowledgments

We thank the [Government of Nunavut](https://www.gov.nu.ca/en) for providing the professional human translations of Global MMLU-Lite into Inuktitut.

This dataset builds on [Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) by [Cohere Labs](https://huggingface.co/CohereLabs).
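
## Example: Turning a Row into a Prompt

The Format section names the columns `question`, `option_a`–`option_d`, and `answer`. The following is a minimal sketch of how one row might be rendered as a zero-shot multiple-choice prompt and scored. It is not part of the dataset or the BYOL code: `format_prompt`, `is_correct`, and `sample_row` are hypothetical names, the sample row is invented for illustration, and the code assumes `answer` holds a single letter from A to D.

```python
from typing import Mapping

# Option letters, in the order of the option_a ... option_d columns.
LETTERS = ("A", "B", "C", "D")


def format_prompt(row: Mapping[str, str]) -> str:
    """Render one row as a zero-shot multiple-choice prompt."""
    lines = [row["question"].strip()]
    for letter in LETTERS:
        option = row["option_" + letter.lower()].strip()
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(row: Mapping[str, str], prediction: str) -> bool:
    """Compare the first character of a prediction to the gold answer.

    Assumption: row["answer"] is a single letter (A-D).
    """
    return prediction.strip().upper()[:1] == row["answer"].strip().upper()


# Invented row that mimics the column names listed above; not real data.
sample_row = {
    "question": "What is 2 + 2?",
    "option_a": "3",
    "option_b": "4",
    "option_c": "5",
    "option_d": "22",
    "answer": "B",
}

print(format_prompt(sample_row))
print(is_correct(sample_row, " b"))
```

In practice, `sample_row` would be replaced by a row from the loaded dataset, for example `ds[0]` from the Usage snippet, after confirming that the `answer` column uses the letter encoding assumed here.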
