SeaLLMs/SeaExam

Name: SeaLLMs/SeaExam
Creator: SeaLLMs
Published: 2024-05-31 09:27:49
License: Apache-2.0

Hugging Face · Updated 2024-05-31 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/SeaLLMs/SeaExam

Resource description:

---
license: apache-2.0
configs:
- config_name: m3exam-chinese
  data_files:
  - split: dev
    path: m3exam-chinese/dev.json
  - split: test
    path: m3exam-chinese/test.json
- config_name: m3exam-english
  data_files:
  - split: dev
    path: m3exam-english/dev.json
  - split: test
    path: m3exam-english/test.json
- config_name: m3exam-thai
  data_files:
  - split: dev
    path: m3exam-thai/dev.json
  - split: test
    path: m3exam-thai/test.json
- config_name: m3exam-vietnamese
  data_files:
  - split: dev
    path: m3exam-vietnamese/dev.json
  - split: test
    path: m3exam-vietnamese/test.json
- config_name: m3exam-indonesian
  data_files:
  - split: dev
    path: m3exam-indonesian/dev.json
  - split: test
    path: m3exam-indonesian/test.json
- config_name: mmlu-english
  data_files:
  - split: dev
    path: mmlu-english/dev.json
  - split: test
    path: mmlu-english/test.json
- config_name: mmlu-chinese
  data_files:
  - split: dev
    path: mmlu-chinese/dev.json
  - split: test
    path: mmlu-chinese/test.json
- config_name: mmlu-thai
  data_files:
  - split: dev
    path: mmlu-thai/dev.json
  - split: test
    path: mmlu-thai/test.json
- config_name: mmlu-vietnamese
  data_files:
  - split: dev
    path: mmlu-vietnamese/dev.json
  - split: test
    path: mmlu-vietnamese/test.json
- config_name: mmlu-indonesian
  data_files:
  - split: dev
    path: mmlu-indonesian/dev.json
  - split: test
    path: mmlu-indonesian/test.json
task_categories:
- multiple-choice
language:
- en
- id
- vi
- th
- zh
tags:
- exam
---

> Check the 🏆 [leaderboard](https://huggingface.co/spaces/SeaLLMs/SeaExam_leaderboard) constructed with this dataset and the corresponding 👨🏻‍💻 [evaluation code](https://github.com/DAMO-NLP-SG/SeaExam).

# SeaExam dataset

The SeaExam dataset aims to evaluate Large Language Models (LLMs) on a diverse set of Southeast Asian (SEA) languages including English, Chinese, Indonesian, Thai, and Vietnamese. Our goal is to ensure a fair and consistent comparison across different LLMs on those languages while mitigating the risk of data contamination.

It consists of the following two parts:

### M3Exam (with adjustments)

The original [M3Exam](https://github.com/DAMO-NLP-SG/M3Exam) dataset is constructed with real human exam questions collected from different countries. As a result, the dataset retains the diverse cultural characteristics inherent in the questions. We further process the original dataset with the following operations:

- We standardized the total number of answer options to four. This involved removing questions with fewer than four options and eliminating one incorrect option from questions that initially had more than four options.
- All answers have been mapped to a numerical value within the range [0, 1, 2, 3] for consistency.
- We removed the option index from each answer choice (e.g., changing "A. good" to "good") to simplify the format.
- Randomly shuffle the options.

### Translated MMLU

The [MMLU](https://github.com/hendrycks/test) dataset contains English questions from 57 subjects. We translate the original English questions to different languages to measure the cross-lingual alignment:

- We randomly selected 50 questions from each subject, totaling 2850 questions.
- These questions have been translated from English into Chinese, Indonesian, Thai, and Vietnamese using Google Translate to ensure linguistic diversity.
- Randomly shuffle the options.

# Usage

To load a particular subset of the dataset, you need to specify the sub-dataset name of the language. For example,

```python
from datasets import load_dataset

ds_name = "m3exam"
lang = "english"
dataset = load_dataset(f"SeaLLMs/SeaExam", f"{ds_name}-{lang}")
```

To load the whole dataset:

```python
from datasets import load_dataset

for ds_name in ['m3exam','mmlu']:
    for lang in ['english', 'chinese', 'thai', 'vietnamese', 'indonesian']:
        dataset = load_dataset(f"SeaLLMs/SeaExam", f"{ds_name}-{lang}")
        print(dataset)
```

Provider:

SeaLLMs

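Summary of the original information

SeaExam dataset

The SeaExam dataset is designed to evaluate large language models (LLMs) on a range of Southeast Asian languages, including English, Chinese, Indonesian, Thai, and Vietnamese. It consists of two parts:

M3Exam (adjusted)

The original M3Exam dataset is built from real human exam questions collected in different countries, so it retains the cultural diversity inherent in the questions. The original dataset was processed as follows:

- The number of answer options was standardized to four: questions with fewer than four options were removed, and one incorrect option was deleted from questions that originally had more than four.
- All answers were mapped to a numerical value in the range [0, 1, 2, 3] for consistency.
- The option index was removed from each answer choice (for example, "A. good" became "good") to simplify the format.
- The options were randomly shuffled.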
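
The four processing steps above can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the function name `normalize_question`, the input format (options written like "A. good", the answer given as a letter), and the choice to drop incorrect options at random are all assumptions made for illustration.

```python
import random
import re

# Assumed input format: each option starts with an index such as "A. " or "B) ".
_INDEX_PREFIX = re.compile(r"^\s*[A-Za-z][.)]\s*")


def normalize_question(options, answer_letter, rng):
    """Return (options, answer_index) with exactly four options, or None.

    Hypothetical helper: mirrors the described steps, not the original pipeline.
    """
    if len(options) < 4:
        return None  # questions with fewer than four options are removed

    correct = ord(answer_letter.upper()) - ord("A")
    if not 0 <= correct < len(options):
        raise ValueError("answer letter does not match any option")

    # Strip the option index ("A. good" -> "good") and tag the correct option.
    tagged = [
        (_INDEX_PREFIX.sub("", option), i == correct)
        for i, option in enumerate(options)
    ]

    # Drop incorrect options until four remain (one drop for a five-option question).
    while len(tagged) > 4:
        wrong = [j for j, (_, is_correct) in enumerate(tagged) if not is_correct]
        tagged.pop(rng.choice(wrong))

    # Shuffle, then express the answer as an index in [0, 1, 2, 3].
    rng.shuffle(tagged)
    answer_index = next(j for j, (_, is_correct) in enumerate(tagged) if is_correct)
    return [text for text, _ in tagged], answer_index


example_options, example_answer = normalize_question(
    ["A. good", "B. bad", "C. ugly", "D. fine", "E. okay"], "B", random.Random(0)
)
print(example_options, example_answer)
```

Passing the random generator in explicitly keeps the shuffle reproducible, which matters when the same question set has to be regenerated for a later evaluation run.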

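Translated MMLU

The MMLU dataset contains English questions from 57 subjects. The original English questions were translated into other languages to measure cross-lingual alignment:

- 50 questions were randomly selected from each subject, for a total of 2850 questions.
- These questions were translated from English into Chinese, Indonesian, Thai, and Vietnamese using Google Translate to ensure linguistic diversity.
- The options were randomly shuffled.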
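
As a rough check of the arithmetic above (57 subjects × 50 questions = 2850), the per-subject sampling can be sketched as follows. The helper `sample_per_subject`, the seed, and the toy question pool are hypothetical; the source does not say how the random selection was implemented.

```python
import random


def sample_per_subject(questions_by_subject, k=50, seed=0):
    """Draw k questions without replacement from every subject (illustrative only)."""
    rng = random.Random(seed)
    return {
        subject: rng.sample(questions, k)
        for subject, questions in questions_by_subject.items()
    }


# Toy pool standing in for MMLU: 57 subjects with 100 placeholder questions each.
pool = {
    f"subject_{i}": [f"q{i}_{j}" for j in range(100)]
    for i in range(57)
}
subset = sample_per_subject(pool)
total = sum(len(questions) for questions in subset.values())
print(total)  # 2850
```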

Usage

To load a particular subset of the dataset, specify the sub-dataset name for the language. For example:

```python
from datasets import load_dataset

ds_name = "m3exam"
lang = "english"
dataset = load_dataset("SeaLLMs/SeaExam", f"{ds_name}-{lang}")
```

To load the whole dataset:

```python
from datasets import load_dataset

# Iterate over both sub-datasets and all five languages.
for ds_name in ["m3exam", "mmlu"]:
    for lang in ["english", "chinese", "thai", "vietnamese", "indonesian"]:
        dataset = load_dataset("SeaLLMs/SeaExam", f"{ds_name}-{lang}")
        print(dataset)
```

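Collected summary

Dataset introduction

Construction

Within cross-lingual NLP evaluation, SeaExam is built with the linguistic diversity of Southeast Asia in mind. It combines original M3Exam exam questions with translated MMLU questions and standardizes both to keep evaluation consistent. The M3Exam part comes from real human exams spanning several countries and cultures; the number of options was unified, answers were mapped to numeric values, and the format was simplified, while the cultural character of the questions was preserved. The MMLU part randomly samples questions from 57 subjects and translates them with Google Translate into Chinese, Indonesian, Thai, and Vietnamese to measure cross-lingual alignment. All options are randomly reordered to eliminate bias.

Usage

Researchers can load specific subsets of SeaExam with the Hugging Face datasets library. Specifying a sub-dataset name such as 'm3exam-english' or 'mmlu-chinese' loads the evaluation data for that language version. To evaluate a model's multilingual performance as a whole, loop over all subsets, covering the two categories 'm3exam' and 'mmlu' in combination with the five languages. This design supports both targeted tests and overall analysis, and helps establish cross-lingual benchmarks. The accompanying leaderboard and evaluation code further simplify performance comparison and support deeper research on large models in Southeast Asian language settings.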
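
The loop over every subset can be sketched by first building the ten configuration names and then loading them into a dictionary. The helper names `config_names` and `load_all` are illustrative, not part of the dataset or of the `datasets` library; `load_all` needs the `datasets` package and network access, so it is defined here but not called.

```python
DS_NAMES = ["m3exam", "mmlu"]
LANGS = ["english", "chinese", "thai", "vietnamese", "indonesian"]


def config_names():
    """Return all sub-dataset names of the form '<ds_name>-<lang>'."""
    return [f"{ds_name}-{lang}" for ds_name in DS_NAMES for lang in LANGS]


def load_all():
    """Load every sub-dataset into a dict keyed by its configuration name."""
    # Imported lazily so the name list above can be used without the library.
    from datasets import load_dataset

    return {name: load_dataset("SeaLLMs/SeaExam", name) for name in config_names()}


print(config_names())
```

Keying the result by configuration name keeps each language's `dev` and `test` splits separate, which is convenient when reporting per-language scores.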

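Background and challenges

Background overview

Multilingual evaluation of large language models has become a key way to measure how well a model generalizes. The SeaExam dataset, built by the SeaLLMs team, is designed to systematically evaluate large language models on several Southeast Asian languages, covering English, Chinese, Indonesian, Thai, and Vietnamese. It combines real human exam questions from M3Exam with cross-disciplinary questions translated from MMLU, and standardizes them so that evaluation is consistent and fair. Its core research question is how well models understand and reason about knowledge in culturally diverse contexts, which makes it a useful benchmark for multilingual NLP research.

Current challenges

SeaExam addresses the challenge of evaluating large language models comprehensively in multilingual settings, where linguistic diversity makes semantic alignment and knowledge transfer difficult. During construction, the team had to deal with the heterogeneous formats of real exam questions, which required standardizing the answer options and removing option indices to unify the data. Cross-lingual translation may also introduce semantic deviations that affect the accuracy of the evaluation. In addition, keeping the data free from contamination, so that the evaluation remains fair, is a key consideration in building the dataset.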

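Common scenarios

Typical use cases

Evaluating multilingual large language models is an active research topic in natural language processing. By combining real exam questions in Southeast Asian languages with translated academic knowledge questions, SeaExam provides a standardized test bed for a model's multilingual understanding and reasoning. Its typical use is to systematically evaluate models on English, Chinese, Indonesian, Thai, and Vietnamese, with particular attention to cross-lingual knowledge alignment and adaptation to cultural context, making it a benchmark for how well a model generalizes across language environments.

Academic problems addressed

The dataset addresses two problems in evaluating multilingual large language models: a high risk of data contamination and the lack of a benchmark for cross-lingual comparison. Normalizing the option structure and the answer mapping keeps evaluation fair and consistent. Its combination of real exam questions and translated academic questions also makes it possible to study how models balance culture-specific context against the transfer of general knowledge, providing empirical evidence on multilingual representation learning and knowledge transfer.

Practical applications

In practice, SeaExam serves as an evaluation basis for developing intelligent education systems, multilingual customer-service assistants, and cross-lingual information retrieval tools aimed at Southeast Asia. Education technology companies can use it to improve language models for localized exam tutoring, and multinational companies can use its multilingual evaluation results to tailor dialogue systems to different linguistic and cultural backgrounds, broadening service coverage and improving user experience.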

Recent research on the dataset