mmlu-indic
Source: ModelScope community · Updated: 2025-12-05 · Indexed: 2025-05-31
Download link: https://modelscope.cn/datasets/sarvamai/mmlu-indic
Resource description:
# Indic MMLU Dataset
A multilingual version of the [Massive Multitask Language Understanding (MMLU) benchmark](https://huggingface.co/datasets/cais/mmlu), translated from English into 10 Indian languages.
This version contains the translations of the development and test sets only.
## Languages Covered
The dataset includes translations in the following languages:
- Bengali (bn)
- Gujarati (gu)
- Hindi (hi)
- Kannada (kn)
- Marathi (mr)
- Malayalam (ml)
- Oriya (or)
- Punjabi (pa)
- Tamil (ta)
- Telugu (te)
## Task Format
Each example is a multiple-choice question containing:
- `question`: Question text in target language
- `choices`: List of four possible answers (A, B, C, D) in target language
- `answer`: Zero-based index of the correct choice (0-3, corresponding to A-D)
- `language`: ISO 639-1 language code
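The fields above can be turned into a lettered prompt with a small helper. The following is a minimal sketch, not part of the dataset's tooling: the helper names `format_prompt` and `answer_letter` are hypothetical, and the sample record is made up for illustration rather than taken from the dataset.

```python
from typing import List, TypedDict


class MMLUIndicExample(TypedDict):
    """Record shape as described in the field list above."""
    question: str
    choices: List[str]
    answer: int      # zero-based index into `choices`
    language: str    # ISO 639-1 code, e.g. "hi"


LETTERS = "ABCD"


def format_prompt(example: MMLUIndicExample) -> str:
    """Render the question followed by one lettered line per choice."""
    lines = [example["question"]]
    for letter, choice in zip(LETTERS, example["choices"]):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def answer_letter(example: MMLUIndicExample) -> str:
    """Map the zero-based `answer` index to its letter (0 -> A, ..., 3 -> D)."""
    return LETTERS[example["answer"]]


# Hypothetical record for illustration only (not drawn from the dataset).
example: MMLUIndicExample = {
    "question": "भारत की राजधानी क्या है?",
    "choices": ["मुंबई", "नई दिल्ली", "कोलकाता", "चेन्नई"],
    "answer": 1,
    "language": "hi",
}

print(format_prompt(example))
print("Answer:", answer_letter(example))
```

Keeping the index-to-letter mapping in one place avoids off-by-one mistakes when comparing a model's lettered output against the integer `answer` field.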
## Dataset Statistics
- Validation (dev in the original): ~280 examples per language
- Test: ~14k examples per language
## Usage
```python
from datasets import load_dataset

# Subject groupings from the original MMLU are not maintained in this dataset.
dataset = load_dataset("sarvamai/mmlu-indic")
```
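Because every example carries a `language` field, a single language can be selected after loading. The sketch below works on plain dictionaries so that it runs without downloading anything; the helper names and the sample rows are hypothetical. It assumes the loaded rows expose the `language` field as described under Task Format. With a split loaded through `datasets`, the equivalent would likely be `split.filter(lambda ex: ex["language"] == "hi")`, though the exact split and configuration names of this dataset are not confirmed here.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_language(rows: Iterable[Dict]) -> Counter:
    """Count how many examples exist for each ISO 639-1 language code."""
    return Counter(row["language"] for row in rows)


def select_language(rows: Iterable[Dict], code: str) -> List[Dict]:
    """Keep only the examples whose `language` field equals `code`."""
    return [row for row in rows if row["language"] == code]


# Hypothetical rows for illustration only (not drawn from the dataset).
rows = [
    {"question": "q1", "choices": ["a", "b", "c", "d"], "answer": 0, "language": "hi"},
    {"question": "q2", "choices": ["a", "b", "c", "d"], "answer": 3, "language": "ta"},
    {"question": "q3", "choices": ["a", "b", "c", "d"], "answer": 2, "language": "hi"},
]

counts = count_by_language(rows)
hindi_rows = select_language(rows, "hi")
print(counts)
print(len(hindi_rows))
```

Counting per language first is a cheap sanity check against the per-language sizes listed under Dataset Statistics.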
## Known Limitations
- Technical terminology may be challenging to translate precisely
- Some subjects (like US Law) may have concepts without direct equivalents
- Cultural and educational system differences may affect question relevance
## License
This dataset follows the same license as the original MMLU dataset.
## Acknowledgments
- Original MMLU dataset creators.
Provider: maas
Created: 2025-05-26



