ai-for-good-lab/Global-MMLU-Lite
Source: Hugging Face · Updated 2026-04-15 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/ai-for-good-lab/Global-MMLU-Lite
Description:
---
license: apache-2.0
task_categories:
- question-answering
- multiple-choice
language:
- iu
- ny
- mi
tags:
- byol
- mmlu
- global-mmlu
- global-mmlu-lite
- human-translated
- low-resource-languages
- evaluation
- benchmark
pretty_name: Global MMLU-Lite (Chichewa, Māori, Inuktitut)
source_datasets:
- CohereForAI/Global-MMLU-Lite
size_categories:
- 1K<n<10K
configs:
- config_name: nya
  data_files:
  - split: test
    path: nya/test.jsonl
  - split: dev
    path: nya/dev.jsonl
- config_name: mri
  data_files:
  - split: test
    path: mri/test.jsonl
  - split: dev
    path: mri/dev.jsonl
- config_name: iku
  data_files:
  - split: test
    path: iku/test.jsonl
  - split: dev
    path: iku/dev.jsonl
---
# Global MMLU-Lite — Human Translated
[Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) is a multilingual
evaluation benchmark for LLMs covering 18 languages. This dataset extends it with **professional
human translations** for three additional low-resource languages that are not in the original:
**Chichewa (nya)**, **Māori (mri)**, and **Inuktitut (iku)**.
Released as part of the [BYOL: Bring Your Own Language Into LLMs](https://github.com/microsoft/byol)
project ([paper](https://arxiv.org/abs/2601.10804)).
## What's New
The [original Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) by Cohere
covers 18 languages: Arabic, Bengali, Chinese, Burmese, Welsh, German, English, Spanish, French, Hindi,
Indonesian, Italian, Japanese, Korean, Portuguese, Albanian, Swahili, and Yoruba.
This dataset adds **3 new languages** with professional human translations (not machine-translated):
| Language | ISO 639-3 | Subset | Test | Dev |
|---|---|---|---|---|
| Chichewa | nya | `nya` | 400 | 215 |
| Māori | mri | `mri` | 400 | 215 |
| Inuktitut | iku | `iku` | 390 | 211 |
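
As a sanity check after downloading, the row counts in the table can be compared against what is actually loaded. The sketch below is illustrative only: `EXPECTED_ROWS`, `check_row_counts`, and `hub_loader` are hypothetical helper names, and the comparison logic is kept separate from the network call so it can be exercised without the `datasets` library installed.

```python
# Row counts copied from the table above.
EXPECTED_ROWS = {
    "nya": {"test": 400, "dev": 215},
    "mri": {"test": 400, "dev": 215},
    "iku": {"test": 390, "dev": 211},
}


def check_row_counts(load_split):
    """Return (config, split, expected, actual) tuples for every mismatch.

    `load_split(config, split)` is any callable returning a sized collection.
    """
    mismatches = []
    for config, splits in EXPECTED_ROWS.items():
        for split, expected in splits.items():
            actual = len(load_split(config, split))
            if actual != expected:
                mismatches.append((config, split, expected, actual))
    return mismatches


def hub_loader(config, split):
    """Load one split from the Hub (requires network access and `datasets`)."""
    # Imported lazily so check_row_counts stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("ai-for-good-lab/Global-MMLU-Lite", config, split=split)


# Typical use (not executed here because it downloads data):
# print(check_row_counts(hub_loader))
```

An empty list from `check_row_counts(hub_loader)` would indicate that every subset and split matches the table.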
## Usage
```python
from datasets import load_dataset
# Same API as the original CohereForAI/Global-MMLU-Lite
ds = load_dataset("ai-for-good-lab/Global-MMLU-Lite", "mri", split="test")
print(ds[0])
```
## Format
All subsets share the same 17-column schema as the
[original dataset](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite),
including `sample_id`, `subject`, `subject_category`, `question`, `option_a`–`option_d`,
`answer`, and annotation metadata fields.
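
To illustrate how these columns fit together, here is a minimal sketch that renders one row as a lettered multiple-choice prompt and checks a predicted letter. The helper names (`format_prompt`, `is_correct`) and the example row are hypothetical, the prompt layout is just one possible choice, and the code assumes `answer` stores the gold choice as a letter (`A` to `D`), which this card does not state explicitly.

```python
def format_prompt(row):
    """Render one row as a lettered multiple-choice prompt."""
    options = [row[f"option_{letter}"] for letter in "abcd"]
    lines = [row["question"]]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", options)]
    lines.append("Answer:")
    return "\n".join(lines)


def is_correct(row, predicted_letter):
    """Compare a predicted letter with the gold answer, ignoring case and spaces."""
    # Assumption: `answer` holds a letter such as "B", not an index or option text.
    return predicted_letter.strip().upper() == str(row["answer"]).strip().upper()


# Hypothetical row for illustration only; it is not taken from the dataset.
example = {
    "sample_id": "example/test/0",
    "subject": "example_subject",
    "subject_category": "example_category",
    "question": "Which number is even?",
    "option_a": "1",
    "option_b": "2",
    "option_c": "3",
    "option_d": "5",
    "answer": "B",
}

print(format_prompt(example))
```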
> **Note:** Annotation metadata (e.g., `required_knowledge`, `cultural_sensitivity_label`) for
> Māori and Inuktitut was populated from the Chichewa annotations by matching on `sample_id`,
> since all three languages translate the same English source questions.
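
The matching step described in the note can be sketched as a simple keyed lookup. This is not the authors' actual script: `copy_annotations` is a hypothetical helper, the toy rows are invented, and only the two metadata fields named in the note are copied.

```python
METADATA_FIELDS = ("required_knowledge", "cultural_sensitivity_label")


def copy_annotations(source_rows, target_rows, fields=METADATA_FIELDS):
    """Copy annotation fields from source rows to target rows sharing a sample_id."""
    by_id = {row["sample_id"]: row for row in source_rows}
    merged = []
    for row in target_rows:
        out = dict(row)  # leave the input rows unmodified
        source = by_id.get(row["sample_id"])
        if source is not None:
            for field in fields:
                out[field] = source.get(field)
        merged.append(out)
    return merged


# Toy rows for illustration only; the values are invented.
chichewa_rows = [
    {"sample_id": "s1", "required_knowledge": "example-value",
     "cultural_sensitivity_label": "example-label"},
]
maori_rows = [
    {"sample_id": "s1", "question": "translated question"},
    {"sample_id": "s2", "question": "row without a Chichewa match"},
]

merged_rows = copy_annotations(chichewa_rows, maori_rows)
print(merged_rows[0])
```

Rows without a matching `sample_id` are passed through unchanged in this sketch; a real pipeline might prefer to raise an error in that case.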
## License
This dataset is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0),
consistent with the [original Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite) license.
## Citation
```bibtex
@article{zamir2026byolbringlanguagellms,
title={BYOL: Bring Your Own Language Into LLMs},
author={Syed Waqas Zamir and Wassim Hamidouche and Boulbaba Ben Amor and Luana Marotti and Inbal Becker-Reshef and Juan Lavista Ferres},
year={2026},
journal={arXiv:2601.10804},
url={https://arxiv.org/abs/2601.10804},
}
```
## Acknowledgments
We thank the [Government of Nunavut](https://www.gov.nu.ca/en) for providing the
professional human translations of Global MMLU-Lite into Inuktitut.
This dataset builds on [Global MMLU-Lite](https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite)
by [Cohere Labs](https://huggingface.co/CohereLabs).
Provider: ai-for-good-lab