
Global-MMLU-Lite

ModelScope community | Updated 2026-05-21 | Indexed 2024-12-21
Download link:
https://modelscope.cn/datasets/CohereForAI/Global-MMLU-Lite
Resource description:
![GlobalMMLU Lite 3.0](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite/resolve/main/gmmlulite3.jpeg)

# Releases:
- Version 3.0 (May 2026): GMMLU Lite 3.0 release with 5 new languages (Czech, Hungarian, Oriya, Slovak and Tajik), plus updated Italian
- Version 2.0 (Dec 2025): GMMLU Lite 2.0 release with 3 new languages: Albanian, Burmese and Welsh
- Version 1.0 (Dec 2024): GMMLU Lite initial release with 15 languages.

![GlobalMMLU Header](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite/resolve/main/language_coverage.png)

# Dataset Summary
Global-MMLU-Lite is a multilingual evaluation set spanning 23 languages, including English. It is a "lite" version of the original [Global-MMLU dataset](https://huggingface.co/datasets/CohereLabs/Global-MMLU) 🌍.
It includes 200 Culturally Sensitive (CS) and 200 Culturally Agnostic (CA) samples per language. All samples in Global-MMLU-Lite are fully human-translated or post-edited.

**NOTE:** Of the 23 languages presently included in Global-MMLU-Lite, 15 are taken from the [Global-MMLU dataset](https://huggingface.co/datasets/CohereLabs/Global-MMLU), whereas Albanian, Burmese and Welsh were contributed by external collaborators or members of the Cohere Labs Open Science Community. The evaluation results reported in our [paper](https://arxiv.org/abs/2412.03304) correspond to the 15 languages originally released as part of Global-MMLU-Lite.
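The sizes quoted in the summary can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original card: the variable names are illustrative, and the per-release language counts are simply the ones stated in the release notes above.

```python
# Languages added per release, as stated in the release notes (15, then +3, then +5).
new_languages_per_release = {"1.0": 15, "2.0": 3, "3.0": 5}
total_languages = sum(new_languages_per_release.values())

# Each language has 200 Culturally Sensitive and 200 Culturally Agnostic samples.
CS_PER_LANGUAGE = 200
CA_PER_LANGUAGE = 200

total_cs = total_languages * CS_PER_LANGUAGE
total_ca = total_languages * CA_PER_LANGUAGE
total_samples = total_cs + total_ca

print(total_languages, total_cs, total_ca, total_samples)
```

Under these counts the total comes to 23 languages and 9,200 samples, which is the size the card reports for the `test` split.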
- **Curated by:** Professional annotators and contributors of [Cohere Labs Community](https://cohere.com/research)
- **Language(s):** 23 languages
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)

### **Global-MMLU Dataset Family:**

| Name | Explanation |
|------|--------------|
| [Global-MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) | Full Global-MMLU set with translations for all 14K samples, including CS and CA subsets |
| [Global-MMLU-Lite](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite) | Lite version of Global-MMLU with human-translated samples in 23 languages, containing 200 samples each for the CS and CA subsets per language |

## Load with Datasets
To load this dataset with `datasets`, first install it using `pip install datasets`, then use the following code:

```python
from datasets import load_dataset

# load HF dataset
gmmlu_lite = load_dataset("CohereLabs/Global-MMLU-Lite", 'en')

# can also be used as pandas dataframe
gmmlu_lite.set_format("pandas")
gmmlu_lite_test = gmmlu_lite['test'][:]
gmmlu_lite_dev = gmmlu_lite['dev'][:]
```

<details>
<summary> additional details </summary>

The columns corresponding to annotations collected from our cultural bias study (i.e. 'required_knowledge', 'time_sensitive', 'reference', 'culture', 'region', 'country') contain a list of values representing annotations from different annotators.
However, to avoid conversion issues with the HF dataset format, these columns are provided as strings in the final dataset.
You can convert these columns back to lists of values for easier manipulation as follows:

```python
import ast

# convert string values to list
gmmlu_lite_test['required_knowledge'] = gmmlu_lite_test['required_knowledge'].apply(lambda x: ast.literal_eval(x))
```
</details>
<br>

## Data Fields

The data fields are the same among all splits. A brief description of each field is provided below.
<details>
<summary> data field description </summary>

- `sample_id`: A unique identifier for the question.
- `subject`: The main topic the question falls under.
- `subject_category`: The high-level category the subject falls under, i.e. STEM/Humanities/Social Sciences/Medical/Business/Other.
- `question`: Translated question from MMLU.
- `option_a`: One of the possible option choices.
- `option_b`: One of the possible option choices.
- `option_c`: One of the possible option choices.
- `option_d`: One of the possible option choices.
- `answer`: The correct answer (A/B/C/D).
- `required_knowledge`: Annotator votes for the knowledge needed to answer the question correctly. Possible values include: "cultural", "regional", "dialect" or "none".
- `time_sensitive`: Annotator votes indicating whether the question's answer is time-dependent. Possible values include: Yes/No.
- `reference`: Annotations for which part of the question contains cultural/regional/dialect references. The different items in the list are annotations from different annotators.
- `culture`: Annotations for which culture the question belongs to. The different items in the list correspond to annotations from different annotators.
- `region`: Geographic region the question is relevant to. Each item in the list corresponds to an annotation from a different annotator.
- `country`: Specific country the question pertains to. Each item in the list corresponds to an annotation from a different annotator.
- `cultural_sensitivity_label`: Label indicating whether the question is culturally sensitive (CS) or culturally agnostic (CA), based on annotator votes.
- `is_annotated`: True/False flag indicating whether the sample contains any annotations from our cultural bias study.

</details>
<br>

## Data Splits
The following are the splits of the data:

| Split | No. of instances | Language Coverage |
|-------|------------------|-------------------|
| test | 9,200 | 23 |
| dev | 4,655 | 22 |

## Data Instances

An example from the `test` set looks as follows:

```json
{'sample_id': 'astronomy/test/58',
 'subject': 'astronomy',
 'subject_category': 'STEM',
 'question': 'When traveling north from the United States into Canada you’ll see the North Star (Polaris) getting _________.',
 'option_a': 'Brighter',
 'option_b': 'Dimmer',
 'option_c': 'Higher in the sky',
 'option_d': 'Lower in the sky',
 'answer': 'C',
 'required_knowledge': "['regional', 'regional', 'regional', 'regional']",
 'time_sensitive': "['No', 'No', 'No', 'No']",
 'reference': "[{'end': 55, 'label': 'Geographic', 'score': None, 'start': 5}, {'end': 43, 'label': 'Geographic', 'score': None, 'start': 30}, {'end': 55, 'label': 'Geographic', 'score': None, 'start': 5}, {'end': 43, 'label': 'Geographic', 'score': None, 'start': 30}]",
 'culture': '[]',
 'region': "['North America', 'North America', 'North America', 'North America']",
 'country': "['United States of America (USA)', 'United States of America (USA)', 'United States of America (USA)', 'United States of America (USA)']",
 'cultural_sensitivity_label': 'CS',
 'is_annotated': True
}
```

## Statistics
### Annotation Types
The following is the breakdown of CS🗽, CA⚖️ and MA📝 samples in the final dataset.

| Type of Annotation | Instances per language | No. of languages | Total instances |
|--------------------|------------------------|------------------|----------------|
| Culturally Sensitive 🗽 | 200 | 23 | 4,600 |
| Culturally Agnostic ⚖️ | 200 | 23 | 4,600 |
| MMLU Annotated 📝 | 400 | 23 | 9,200 |

### Languages
The dataset covers 23 languages. Details of the languages included in the dataset are given below.
<details>
<summary> Languages Info </summary>

| ISO Code | Language | Resources |
|----------|----------|-----------|
| `ar` | Arabic (Standard) | High |
| `bn` | Bengali | Mid |
| `cs` | Czech | Mid |
| `cy` | Welsh | Low |
| `de` | German | High |
| `en` | English | High |
| `fr` | French | High |
| `hi` | Hindi | High |
| `hu` | Hungarian | Mid |
| `id` | Indonesian | Mid |
| `it` | Italian | High |
| `ja` | Japanese | High |
| `ko` | Korean | Mid |
| `my` | Burmese | Low |
| `or` | Oriya | Low |
| `pt` | Portuguese | High |
| `es` | Spanish | High |
| `sk` | Slovak | Mid |
| `sq` | Albanian | Low |
| `sw` | Swahili | Low |
| `tg` | Tajik | Low |
| `yo` | Yorùbá | Low |
| `zh` | Chinese (Simplified) | High |

</details>
<br>

**Note:** Albanian, Burmese and Welsh are new languages added as part of the v2 release of Global MMLU Lite.

# Known Limitations
A brief overview of the limitations of this dataset is provided below.
<details>
<summary> show limitations </summary>

- **Language and dialect coverage:** Global-MMLU focuses on 42 languages. However, this is still only a tiny fraction of the world’s linguistic diversity. Future work is needed to continue to improve evaluations beyond these 42 languages and to take into account how technology serves different dialects.
- **Uneven distribution of contributions:** The dataset contains translation post-edits from community volunteers, with a 'long tail' of volunteers making only one or two contributions. Similarly, there is a huge gap between the languages with the highest number of contributions and those with the lowest.
- **Toxic or offensive speech:** Our annotation process did not focus on flagging toxic, harmful, or offensive speech, so it is possible that Global-MMLU contains some data that could be considered harmful. We believe this is of relatively low risk because of the nature of the original MMLU and its focus on examination material.
- **Region Category Assignment:** For the annotation of geographically sensitive questions, we classified regions into six geographic regions (Africa, Asia, Europe, North America, Oceania, and South America). However, based upon discussions, going forward we would recommend switching to the taxonomy proposed by the World Bank, which is more granular and includes separate designations for Central America and Sub-Saharan Africa.
- **Identifying cultural sensitivity does not guarantee cultural inclusion:** While Global-MMLU highlights important limitations in current datasets by identifying gaps in non-Western cultural representation, future work must prioritize the integration of diverse culturally grounded knowledge to achieve true inclusivity and fairness in multilingual AI evaluation.

</details>
<br>

# Additional Information

## Provenance
- **Methods Used:** Professional annotations as well as crowd-sourced volunteer annotations.
- **Methodology Details:** We collected cultural bias annotations as well as post-edits of translations for different MMLU questions.
  - [Cultural Sensitivity Annotation Platform](https://huggingface.co/spaces/CohereLabs/MMLU-evaluation)
  - [Translation Quality Annotation Platform](https://huggingface.co/spaces/CohereLabs/review-mmlu-translations)
  - Dates of Collection: May 2024 - Aug 2024

## Dataset Version and Maintenance
- **Maintenance Status:** Actively Maintained
- **Version Details:**
  - *Current version:* 3.0
  - *Last Update:* 05/2026
  - *First Release:* 12/2024

## Authorship
- **Publishing Organization:** [Cohere Labs](https://cohere.com/research)
- **Industry Type:** Not-for-profit - Tech

## Licensing Information
This dataset can be used for any purpose, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
## Continuous Improvement:
If you want to help improve the quality of translations in Global-MMLU-Lite, please contribute using our [annotation UI](https://huggingface.co/spaces/CohereLabs/review-global-mmlu-lite).
You can also help review and edit machine translations in additional languages using our annotation interface to help improve the language coverage of Global-MMLU-Lite.

## Additional Details
For any additional details, please check our paper, [Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation](https://arxiv.org/abs/2412.03304).

## Citation Information
```bibtex
@misc{singh2024globalmmluunderstandingaddressing,
      title={Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation},
      author={Shivalika Singh and Angelika Romanou and Clémentine Fourrier and David I. Adelani and Jian Gang Ngui and Daniel Vila-Suero and Peerat Limkonchotiwat and Kelly Marchisio and Wei Qi Leong and Yosephine Susanto and Raymond Ng and Shayne Longpre and Wei-Yin Ko and Madeline Smith and Antoine Bosselut and Alice Oh and Andre F. T. Martins and Leshem Choshen and Daphne Ippolito and Enzo Ferrante and Marzieh Fadaee and Beyza Ermis and Sara Hooker},
      year={2024},
      eprint={2412.03304},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03304},
}
```
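As a rough illustration of how the `answer` and `cultural_sensitivity_label` fields described in this card could be used in an evaluation, the sketch below scores predictions separately on the CS and CA subsets. It is not taken from the card or the paper: `accuracy_by_label` is a hypothetical helper, the two `hypothetical/...` records and all predictions are made-up placeholders, and only `astronomy/test/58` corresponds to the example instance shown in the card.

```python
from collections import defaultdict


def accuracy_by_label(records, predictions):
    """Return {label: accuracy}, with label taken from `cultural_sensitivity_label`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        label = rec["cultural_sensitivity_label"]
        total[label] += 1
        # A missing prediction counts as wrong.
        if predictions.get(rec["sample_id"]) == rec["answer"]:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}


# Minimal records carrying only the fields the helper needs.
records = [
    {"sample_id": "astronomy/test/58", "answer": "C", "cultural_sensitivity_label": "CS"},
    {"sample_id": "hypothetical/test/1", "answer": "A", "cultural_sensitivity_label": "CS"},
    {"sample_id": "hypothetical/test/2", "answer": "B", "cultural_sensitivity_label": "CA"},
]

# Placeholder model outputs keyed by sample_id.
predictions = {
    "astronomy/test/58": "C",
    "hypothetical/test/1": "D",
    "hypothetical/test/2": "B",
}

scores = accuracy_by_label(records, predictions)
print(scores)
```

Keeping the two subsets apart, rather than reporting one pooled accuracy, makes it easier to see whether a model's performance differs between culturally sensitive and culturally agnostic questions.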

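![GlobalMMLU Lite 2.0](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite/resolve/main/gmmlulite2.png)

# Releases
- Version 2.0 (Dec 2025): GMMLU Lite 2.0 released, adding 3 new languages: Albanian, Burmese and Welsh
- Version 1.0 (Dec 2024): GMMLU Lite initial release with 15 languages

![GlobalMMLU Header](https://huggingface.co/datasets/CohereLabs/Global-MMLU/resolve/main/global_mmlu.jpg)

# Dataset Summary
Global-MMLU-Lite is a multilingual evaluation set covering 18 languages (including English). It is a lightweight version of the original [Global-MMLU dataset](https://huggingface.co/datasets/CohereLabs/Global-MMLU) 🌍.
It includes 200 **Culturally Sensitive (CS)** samples and 200 **Culturally Agnostic (CA)** samples per language. All samples in Global MMLU Lite are human-translated or post-edited.

**Note:** Of the 18 languages currently included in Global-MMLU-Lite, 15 come from the [Global-MMLU dataset](https://huggingface.co/datasets/CohereLabs/Global-MMLU), whereas Albanian, Burmese and Welsh were contributed by external collaborators or members of the Cohere Labs Open Science Community.
The evaluation results reported in our [paper](https://arxiv.org/abs/2412.03304) correspond to the 15 languages of the initial Global-MMLU-Lite release.

- **Curated by:** Professional annotators and contributors of the [Cohere Labs Community](https://cohere.com/research)
- **Language coverage:** 18 languages
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)

### Global-MMLU Dataset Family:

| Name | Explanation |
|------|------|
| [Global-MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) | Full Global-MMLU dataset with translations for all 14,000 samples, covering the CS and CA subsets |
| [Global-MMLU-Lite](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite) | Lightweight version of Global-MMLU with human-translated samples in 18 languages, containing 200 samples each for the CS and CA subsets per language |

## Load with Datasets
To load this dataset with the `datasets` library, first install it using `pip install datasets`, then use the following code:

```python
from datasets import load_dataset

# load the Hugging Face dataset
gmmlu_lite = load_dataset("CohereLabs/Global-MMLU-Lite", 'en')

# can also be converted to pandas DataFrame format
gmmlu_lite.set_format("pandas")
gmmlu_lite_test = gmmlu_lite['test'][:]
gmmlu_lite_dev = gmmlu_lite['dev'][:]
```

<details>
<summary> additional details </summary>

The annotation columns from the cultural bias study (i.e. `required_knowledge`, `time_sensitive`, `reference`, `culture`, `region`, `country`) hold lists of annotations from multiple annotators. To avoid conversion conflicts with the Hugging Face dataset format, these columns are stored as strings in the final dataset. You can convert them back to lists for further processing as follows:

```python
import ast

# convert string values to lists
gmmlu_lite_test['required_knowledge'] = gmmlu_lite_test['required_knowledge'].apply(lambda x: ast.literal_eval(x))
```
</details>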
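The string-to-list conversion described above can be tried offline on a single record. The sketch below is an illustration rather than part of the official card: `parse_annotations` is a hypothetical helper, the record is abridged from the `test` example shown in this card (only two of the four `reference` entries are kept), and reading `start`/`end` as Python slice offsets into `question` is an assumption that happens to hold for this sample.

```python
import ast
from collections import Counter

# Abridged record; annotation columns are stored as strings, as in the dataset.
record = {
    "question": "When traveling north from the United States into Canada "
                "you’ll see the North Star (Polaris) getting _________.",
    "required_knowledge": "['regional', 'regional', 'regional', 'regional']",
    "reference": "[{'end': 55, 'label': 'Geographic', 'score': None, 'start': 5}, "
                 "{'end': 43, 'label': 'Geographic', 'score': None, 'start': 30}]",
    "culture": "[]",
}

ANNOTATION_COLUMNS = ("required_knowledge", "reference", "culture")


def parse_annotations(rec):
    """Return a copy of `rec` with the stringified annotation columns parsed into lists."""
    parsed = dict(rec)
    for col in ANNOTATION_COLUMNS:
        parsed[col] = ast.literal_eval(rec[col])
    return parsed


parsed = parse_annotations(record)

# Tally the annotator votes for the knowledge type.
votes = Counter(parsed["required_knowledge"])

# Recover the annotated text spans (assumes start/end are slice offsets).
spans = [parsed["question"][ref["start"]:ref["end"]] for ref in parsed["reference"]]

print(votes)
print(spans)
```

`ast.literal_eval` only evaluates Python literals, so it parses these strings (including the `None` scores) without executing arbitrary code, which is why it is preferred over `eval` here.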
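<br>

## Data Fields
The field format is the same across all splits. A brief description of each field is given below:

<details>
<summary> data field description </summary>

- `sample_id`: A unique identifier for the question.
- `subject`: The main topic the question falls under.
- `subject_category`: The high-level category the subject falls under, e.g. STEM/Humanities/Social Sciences/Medical/Business/Other.
- `question`: Translated question from MMLU.
- `option_a`: One of the possible answer options.
- `option_b`: One of the possible answer options.
- `option_c`: One of the possible answer options.
- `option_d`: One of the possible answer options.
- `answer`: The correct answer (A/B/C/D).
- `required_knowledge`: Annotator votes for the type of knowledge needed to answer the question. Possible values: "cultural", "regional", "dialect" or "none".
- `time_sensitive`: Annotator votes on whether the question's answer changes over time. Possible values: Yes/No.
- `reference`: Annotations marking which part of the question contains cultural/regional/dialect references. Each item in the list is the annotation of one annotator.
- `culture`: Annotations marking which culture the question belongs to. Each item in the list is the annotation of one annotator.
- `region`: Geographic region the question is relevant to. Each item in the list is the annotation of one annotator.
- `country`: Specific country the question pertains to. Each item in the list is the annotation of one annotator.
- `cultural_sensitivity_label`: Label based on annotator votes, indicating whether the question is culturally sensitive (CS) or culturally agnostic (CA).
- `is_annotated`: Boolean (True/False) indicating whether the sample contains annotations from the cultural bias study.

</details>
<br>

## Data Splits
The dataset contains the following splits:

| Split | No. of instances | Language Coverage |
|------|----------|----------|
| test | 7200 | 18 |
| dev | 3655 | 17 |

## Data Instances
An example from the `test` split looks as follows:

```json
{'sample_id': 'astronomy/test/58',
 'subject': 'astronomy',
 'subject_category': 'STEM',
 'question': "When traveling north from the United States into Canada you’ll see the North Star (Polaris) getting _________.",
 'option_a': 'Brighter',
 'option_b': 'Dimmer',
 'option_c': 'Higher in the sky',
 'option_d': 'Lower in the sky',
 'answer': 'C',
 'required_knowledge': "['regional', 'regional', 'regional', 'regional']",
 'time_sensitive': "['No', 'No', 'No', 'No']",
 'reference': "[{'end': 55, 'label': 'Geographic', 'score': None, 'start': 5}, {'end': 43, 'label': 'Geographic', 'score': None, 'start': 30}, {'end': 55, 'label': 'Geographic', 'score': None, 'start': 5}, {'end': 43, 'label': 'Geographic', 'score': None, 'start': 30}]",
 'culture': '[]',
 'region': "['North America', 'North America', 'North America', 'North America']",
 'country': "['United States of America (USA)', 'United States of America (USA)', 'United States of America (USA)', 'United States of America (USA)']",
 'cultural_sensitivity_label': 'CS',
 'is_annotated': True
}
```

## Statistics
### Annotation Types
The breakdown of CS🗽, CA⚖️ and MA📝 samples in the dataset is as follows:

| Type of Annotation | Instances per language | No. of languages | Total instances |
|----------|--------------|----------|----------|
| Culturally Sensitive (CS) 🗽 | 200 | 18 | 3600 |
| Culturally Agnostic (CA) ⚖️ | 200 | 18 | 3600 |
| MMLU Annotated (MA) 📝 | 400 | 18 | 7200 |

### Languages
The dataset covers 18 languages. Details of each language are given below:

<details>
<summary> Languages Info </summary>

| ISO Code | Language | Resources |
|----------|------|----------|
| `ar` | Arabic (Standard) | High |
| `bn` | Bengali | Mid |
| `cy` | Welsh | Low |
| `de` | German | High |
| `en` | English | High |
| `fr` | French | High |
| `hi` | Hindi | High |
| `id` | Indonesian | Mid |
| `it` | Italian | High |
| `ja` | Japanese | High |
| `ko` | Korean | Mid |
| `my` | Burmese | Low |
| `pt` | Portuguese | High |
| `es` | Spanish | High |
| `sq` | Albanian | Low |
| `sw` | Swahili | Low |
| `yo` | Yorùbá | Low |
| `zh` | Chinese (Simplified) | High |

</details>
<br>

**Note:** Albanian, Burmese and Welsh are new languages added in the v2 release of Global MMLU Lite.

# Known Limitations
An overview of the limitations of this dataset is given below:

<details>
<summary> show limitations </summary>

- **Language and dialect coverage:** Global-MMLU focuses on 42 languages, but this is only a tiny fraction of the world's linguistic diversity. Future work is needed to extend evaluation to more languages and to address the evaluation needs of different dialects.
- **Uneven distribution of contributions:** Translation post-edits in this dataset were made by community volunteers, with a 'long tail' of volunteers contributing only one or two samples. There is also a large gap in contribution volume between languages.
- **Harmful or offensive content:** Our annotation process did not screen for toxic, harmful, or offensive content, so Global-MMLU may contain some potentially harmful data. Given that the original MMLU consists of examination material, this risk is relatively low.
- **Region category assignment:** For the annotation of geographically sensitive questions, we divided regions into six geographic regions (Africa, Asia, Europe, North America, Oceania, South America). After discussion, we recommend switching in future to the more granular taxonomy proposed by the World Bank, which has separate designations for Central America and Sub-Saharan Africa.
- **Identifying cultural sensitivity does not guarantee cultural inclusion:** Although Global-MMLU reveals important limitations in current datasets by identifying gaps in non-Western cultural representation, future work must still prioritize integrating diverse, culturally grounded knowledge to achieve true inclusivity and fairness in multilingual AI evaluation.

</details>
<br>

# Additional Information

## Provenance
- **Methods Used:** A combination of professional annotation and crowd-sourced volunteer annotation
- **Methodology Details:** We collected cultural bias annotations as well as post-edits of translations for different MMLU questions:
  - [Cultural Sensitivity Annotation Platform](https://huggingface.co/spaces/CohereLabs/MMLU-evaluation)
  - [Translation Quality Annotation Platform](https://huggingface.co/spaces/CohereLabs/review-mmlu-translations)
  - Dates of collection: May 2024 - Aug 2024

## Dataset Version and Maintenance
- **Maintenance Status:** Actively maintained
- **Version Details:**
  - Current version: 2.0
  - Last update: December 2025
  - First release: December 2024

## Authorship
- **Publishing Organization:** [Cohere Labs](https://cohere.com/research)
- **Industry Type:** Not-for-profit technology organization

## Licensing Information
This dataset can be used for any purpose under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) license.

## Continuous Improvement
If you would like to help improve the quality of translations in Global-MMLU-Lite, please contribute through our [annotation UI](https://huggingface.co/spaces/CohereLabs/review-global-mmlu-lite). You can also use the annotation interface to review and edit machine translations in additional languages, to extend the language coverage of Global-MMLU-Lite.

## Additional Details
For more information, please see our paper, "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation": https://arxiv.org/abs/2412.03304.

## Citation Information
```bibtex
@misc{singh2024globalmmluunderstandingaddressing,
      title={Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation},
      author={Shivalika Singh and Angelika Romanou and Clémentine Fourrier and David I. Adelani and Jian Gang Ngui and Daniel Vila-Suero and Peerat Limkonchotiwat and Kelly Marchisio and Wei Qi Leong and Yosephine Susanto and Raymond Ng and Shayne Longpre and Wei-Yin Ko and Madeline Smith and Antoine Bosselut and Alice Oh and Andre F. T. Martins and Leshem Choshen and Daphne Ippolito and Enzo Ferrante and Marzieh Fadaee and Beyza Ermis and Sara Hooker},
      year={2024},
      eprint={2412.03304},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03304},
}
```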
Provider: maas
Created: 2024-12-15
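Background overview
Global-MMLU-Lite is a multilingual evaluation dataset covering 23 languages. As a lightweight version of the original Global-MMLU, it contains 200 culturally sensitive and 200 culturally agnostic samples per language, all of them human-translated or post-edited. The dataset is released under the Apache 2.0 license and is actively maintained; the latest version is 3.0.
The above content was collected and summarized by 遇见数据集.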