
Global-MMLU

ModelScope Community · Updated 2026-01-06 · Indexed 2024-12-14
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Global-MMLU
Resource description:
![GlobalMMLU Header](https://huggingface.co/datasets/CohereLabs/Global-MMLU/resolve/main/global_mmlu.jpg)

# Dataset Summary

[Global-MMLU](https://arxiv.org/abs/2412.03304) 🌍 is a multilingual evaluation set spanning 42 languages, including English. This dataset combines machine translations for [MMLU](https://huggingface.co/datasets/cais/mmlu) questions along with professional translations and crowd-sourced post-edits. It also includes cultural sensitivity annotations for a subset of the questions (2850 questions per language) and classifies them as *Culturally Sensitive* (CS) 🗽 or *Culturally Agnostic* (CA) ⚖️. These annotations were collected as part of an open science initiative led by Cohere Labs in collaboration with many external collaborators from both industry and academia.

- **Curated by:** Professional annotators and contributors of [Cohere Labs Community](https://cohere.com/research).
- **Language(s):** 42 languages.
- **License:** [Apache 2.0](https://opensource.org/license/apache-2-0)

**Note:** We also provide a "lite" version of Global MMLU called ["Global-MMLU-Lite"](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite). This dataset is more balanced, containing 200 samples each for the CA and CS subsets in each language, and provides coverage for 15 languages with human translations.

### **Global-MMLU Dataset Family:**

| Name | Explanation |
|------|--------------|
| [Global-MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) | Full Global-MMLU set with translations for all 14K samples including CS and CA subsets |
| [Global-MMLU-Lite](https://huggingface.co/datasets/CohereLabs/Global-MMLU-Lite) | Lite version of Global-MMLU with human translated samples in 15 languages and containing 200 samples each for CS and CA subsets per language. |

## Load with Datasets

To load this dataset with `datasets`, first install it with `pip install datasets`, then use the following code:

```python
from datasets import load_dataset

# load HF dataset (the second argument selects the language configuration)
global_mmlu = load_dataset("CohereLabs/Global-MMLU", 'en')

# can also be used as pandas dataframe
global_mmlu.set_format("pandas")
global_mmlu_test = global_mmlu['test'][:]
global_mmlu_dev = global_mmlu['dev'][:]
```

<details>
<summary> additional details </summary>

The columns corresponding to annotations collected from our cultural bias study (i.e. 'required_knowledge', 'time_sensitive', 'reference', 'culture', 'region', 'country') contain a list of values representing annotations from different annotators. However, to avoid conversion issues to HF dataset, these columns are provided as strings in the final dataset. You can convert these columns back to lists of values for easier manipulation as follows:

```python
import ast

# global_mmlu_df is a pandas DataFrame, e.g. global_mmlu_test from the loading snippet
global_mmlu_df = global_mmlu_test

# convert string values to list
global_mmlu_df['required_knowledge'] = global_mmlu_df['required_knowledge'].apply(ast.literal_eval)
```

</details>
<br>

## Data Fields

The data fields are the same among all splits. A brief description of each field is provided below.

<details>
<summary> data field description </summary>

- `sample_id`: A unique identifier for the question.
- `subject`: The main topic the question falls under.
- `subject_category`: The high-level category the subject falls under, i.e. STEM/Humanities/Social Sciences/Medical/Business/Other.
- `question`: Translated question from MMLU.
- `option_a`: One of the possible option choices.
- `option_b`: One of the possible option choices.
- `option_c`: One of the possible option choices.
- `option_d`: One of the possible option choices.
- `answer`: The correct answer (A/B/C/D).
- `required_knowledge`: Annotator votes for the knowledge needed to answer the question correctly. Possible values include: "cultural", "regional", "dialect" or "none".
- `time_sensitive`: Annotator votes indicating whether the question's answer is time-dependent. Possible values include: Yes/No.
- `reference`: Annotations for which part of the question contains cultural/regional/dialect references. The different items in the list are annotations from different annotators.
- `culture`: Annotations for which culture the question belongs to. The different items in the list correspond to annotations from different annotators.
- `region`: Geographic region the question is relevant to. Each item in the list corresponds to an annotation from a different annotator.
- `country`: Specific country the question pertains to. Each item in the list corresponds to an annotation from a different annotator.
- `cultural_sensitivity_label`: Label indicating whether the question is culturally sensitive (CS) or culturally agnostic (CA), based on annotator votes.
- `is_annotated`: True/False flag indicating whether the sample contains any annotations from our cultural bias study.

</details>
<br>

## Data Splits

The following are the splits of the data:

| Split | No. of instances | Language Coverage |
|-------|------------------|-------------------|
| test | 589,764 | 42 |
| dev | 11,970 | 42 |

## Data Instances

An example from the `test` set looks as follows:

```python
{'sample_id': 'world_religions/test/170',
 'subject': 'world_religions',
 'subject_category': 'Humanities',
 'question': ' The numen of Augustus referred to which of the following characteristics?',
 'option_a': 'Divine power',
 'option_b': 'Sexual virility',
 'option_c': 'Military acumen',
 'option_d': 'Philosophical intellect',
 'answer': 'A',
 'required_knowledge': "['none', 'cultural', 'cultural', 'cultural']",
 'time_sensitive': "['No', 'No', 'No', 'No']",
 'reference': "['-', '-', {'end': 22, 'label': 'Cultural', 'score': None, 'start': 5}, {'end': 22, 'label': 'Cultural', 'score': None, 'start': 5}]",
 'culture': "['Western Culture', 'Western Culture', 'Western Culture']",
 'region': "['North America', 'Europe']",
 'country': "['Italy']",
 'cultural_sensitivity_label': 'CS',
 'is_annotated': True,
}
```

## Statistics

### Annotation Types

The following is the breakdown of CS🗽, CA⚖️ and MA📝 samples in the final dataset.

| Type of Annotation | Instances per language | No. of languages | Total instances |
|--------------------|------------------------|------------------|----------------|
| Culturally Sensitive 🗽 | 792 | 42 | 33,264 |
| Culturally Agnostic ⚖️ | 2058 | 42 | 86,436 |
| MMLU Annotated 📝 | 2850 | 42 | 119,700 |

### Languages

The dataset covers 42 languages: 20 high-resource, 9 mid-resource, and 13 low-resource languages. Details about each language are listed in the collapsible table in this section.

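The figures reported in the Data Splits and Annotation Types tables, together with the language tiers stated above, can be cross-checked with simple arithmetic. The following is a minimal sketch that only re-uses numbers copied from this card; the variable names are illustrative and it does not load the dataset.

```python
NUM_LANGUAGES = 42

# Figures copied from the "Data Splits" and "Annotation Types" tables.
split_totals = {"test": 589_764, "dev": 11_970}
per_language = {"CS": 792, "CA": 2_058, "MA": 2_850}
reported_totals = {"CS": 33_264, "CA": 86_436, "MA": 119_700}

# Each split total divides evenly across the 42 languages.
assert all(total % NUM_LANGUAGES == 0 for total in split_totals.values())
per_language_split = {
    name: total // NUM_LANGUAGES for name, total in split_totals.items()
}
print(per_language_split)  # {'test': 14042, 'dev': 285}

# The annotated subset is the sum of the CS and CA subsets.
assert per_language["CS"] + per_language["CA"] == per_language["MA"]

# Per-language counts times 42 languages give the reported totals.
for label, count in per_language.items():
    assert count * NUM_LANGUAGES == reported_totals[label]

# The stated resource tiers (high + mid + low) add up to 42 languages.
assert 20 + 9 + 13 == NUM_LANGUAGES
```

The per-language test size of 14,042 matches the "14K samples" mentioned in the dataset family table.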
<details>
<summary> Languages Info </summary>

| ISO Code | Language | Resources |
|----------|----------|-----------|
| `am` | Amharic | Low |
| `ar` | Arabic (Standard) | High |
| `bn` | Bengali | Mid |
| `cs` | Czech | High |
| `de` | German | High |
| `el` | Greek | Mid |
| `en` | English | High |
| `fil` | Filipino | Mid |
| `fr` | French | High |
| `ha` | Hausa | Low |
| `he` | Hebrew | Mid |
| `hi` | Hindi | High |
| `ig` | Igbo | Low |
| `id` | Indonesian | Mid |
| `it` | Italian | High |
| `ja` | Japanese | High |
| `ky` | Kyrgyz | Low |
| `ko` | Korean | Mid |
| `lt` | Lithuanian | Mid |
| `mg` | Malagasy | Low |
| `ms` | Malay | Mid |
| `ne` | Nepali | Low |
| `nl` | Dutch | High |
| `ny` | Chichewa | Low |
| `fa` | Persian | High |
| `pl` | Polish | High |
| `pt` | Portuguese | High |
| `ro` | Romanian | Mid |
| `ru` | Russian | High |
| `si` | Sinhala | Low |
| `sn` | Shona | Low |
| `so` | Somali | Low |
| `es` | Spanish | High |
| `sr` | Serbian | High |
| `sw` | Swahili | Low |
| `sv` | Swedish | High |
| `te` | Telugu | Low |
| `tr` | Turkish | High |
| `uk` | Ukrainian | Mid |
| `vi` | Vietnamese | High |
| `yo` | Yorùbá | Low |
| `zh` | Chinese (Simplified) | High |

</details>
<br>

# Known Limitations

A brief overview of limitations of this dataset is provided below.

<details>
<summary> show limitations </summary>

- **Language and dialect coverage:** Global-MMLU focusses on 42 languages. However, this is still only a tiny fraction of the world’s linguistic diversity. Future work is needed to continue to improve evaluations beyond these 42 languages and take into account how technology serves different dialects.
- **Uneven distribution of contributions:** The dataset contains translation post-edits from community volunteers, with a 'long tail' of volunteers making only one or two contributions. Similarly, there is a huge gap between languages with the highest number of contributions and ones with the lowest number of contributions.
- **Toxic or offensive speech:** Our annotation process did not focus on flagging toxic, harmful, or offensive speech, so it is possible that Global-MMLU contains some data that could be considered harmful. We believe this is of relatively low risk because of the nature of the original MMLU and the focus on examination material.
- **Region Category Assignment:** For the annotation of geographically sensitive questions, we classified regions into six geographic regions (Africa, Asia, Europe, North America, Oceania, and South America). However, based on subsequent discussions, we recommend switching going forward to the taxonomy proposed by the World Bank, which is more granular and includes separate designations for Central America and Sub-Saharan Africa.
- **Identifying cultural sensitivity does not guarantee cultural inclusion:** While Global-MMLU highlights important limitations in current datasets by identifying gaps in non-Western cultural representation, future work must prioritize the integration of diverse culturally grounded knowledge to achieve true inclusivity and fairness in multilingual AI evaluation.

</details>
<br>

# Additional Information

## Provenance

- **Methods Used:** Professional annotations as well as crowd-sourced volunteer annotations.
- **Methodology Details:** We collected cultural bias annotations as well as post-edits of translations for different MMLU questions.
  - [Cultural Sensitivity Annotation Platform](https://huggingface.co/spaces/CohereLabs/MMLU-evaluation)
  - [Translation Quality Annotation Platform](https://huggingface.co/spaces/CohereLabs/review-mmlu-translations)
- Dates of Collection: May 2024 - Aug 2024

## Dataset Version and Maintenance

- **Maintenance Status:** Actively Maintained
- **Version Details:**
  - *Current version:* 1.0
  - *Last Update:* 12/2024
  - *First Release:* 12/2024

## Authorship

- **Publishing Organization:** [Cohere Labs](https://cohere.com/research)
- **Industry Type:** Not-for-profit - Tech

## Licensing Information

This dataset can be used for any purpose, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.

## Additional Details

For any additional details, please check our paper, [Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation](https://arxiv.org/abs/2412.03304).

## Citation Information

```bibtex
@misc{singh2024globalmmluunderstandingaddressing,
      title={Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation},
      author={Shivalika Singh and Angelika Romanou and Clémentine Fourrier and David I. Adelani and Jian Gang Ngui and Daniel Vila-Suero and Peerat Limkonchotiwat and Kelly Marchisio and Wei Qi Leong and Yosephine Susanto and Raymond Ng and Shayne Longpre and Wei-Yin Ko and Madeline Smith and Antoine Bosselut and Alice Oh and Andre F. T. Martins and Leshem Choshen and Daphne Ippolito and Enzo Ferrante and Marzieh Fadaee and Beyza Ermis and Sara Hooker},
      year={2024},
      eprint={2412.03304},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03304},
}
```
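
## Example: Decoding Annotation Columns

The annotation columns described under Data Fields are stored as strings. The following is a minimal sketch, using only the Python standard library, of decoding them for a single record and summarizing annotator votes. The helper names `parse_annotations` and `majority_vote` are hypothetical and are not part of the dataset or the `datasets` library. The majority vote is shown for illustration only and is not claimed to be the rule used to assign `cultural_sensitivity_label`. Leaving undecodable values untouched is also an assumption, since this card does not state what unannotated rows contain.

```python
import ast
from collections import Counter

# Columns that this card says are stored as stringified lists.
ANNOTATION_COLUMNS = (
    "required_knowledge", "time_sensitive", "reference",
    "culture", "region", "country",
)


def parse_annotations(record):
    """Return a copy of `record` with annotation columns decoded into lists.

    Values that are not valid Python literals are left unchanged (assumption).
    """
    parsed = dict(record)
    for column in ANNOTATION_COLUMNS:
        value = parsed.get(column)
        if isinstance(value, str):
            try:
                parsed[column] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # keep the raw string
    return parsed


def majority_vote(votes):
    """Return the most common vote, or None for an empty list."""
    return Counter(votes).most_common(1)[0][0] if votes else None


# Annotation fields copied from the example in "Data Instances".
example = {
    "sample_id": "world_religions/test/170",
    "required_knowledge": "['none', 'cultural', 'cultural', 'cultural']",
    "time_sensitive": "['No', 'No', 'No', 'No']",
    "reference": "['-', '-', {'end': 22, 'label': 'Cultural', 'score': None, 'start': 5}, "
                 "{'end': 22, 'label': 'Cultural', 'score': None, 'start': 5}]",
    "culture": "['Western Culture', 'Western Culture', 'Western Culture']",
    "region": "['North America', 'Europe']",
    "country": "['Italy']",
    "cultural_sensitivity_label": "CS",
    "is_annotated": True,
}

parsed = parse_annotations(example)
print(majority_vote(parsed["required_knowledge"]))  # cultural
print(parsed["region"])  # ['North America', 'Europe']
```

With a pandas DataFrame, the same decoding could be applied per column with `.apply`, as in the snippet under "Load with Datasets".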

Provider:
maas
Created:
2024-12-07