HAERAE-HUB/KMMMU

Name: HAERAE-HUB/KMMMU
Creator: HAERAE-HUB
Published: 2026-04-16 05:13:23
License: CC BY-NC 4.0

Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-05

Download link:

https://hf-mirror.com/datasets/HAERAE-HUB/KMMMU


Dataset description:

---
language:
- ko
size_categories:
- 1K<n<10K
license: cc-by-nc-4.0
---

# KMMMU (Korean MMMU)

> Technical report: https://arxiv.org/abs/2604.13058

> Evaluation tutorial: https://github.com/HAE-RAE/KMMMU

KMMMU is a Korean version of MMMU: a multimodal benchmark designed to evaluate **college-/exam-level reasoning** that requires combining **images + Korean text**.

This dataset contains **3,466** questions collected from Korean exam sources, including:

- Civil service recruitment exams
- National Technical Qualifications
- National Competency Standard (NCS) exams
- Academic Olympiads

## Key statistics

- **Total questions:** 3,466
- **Total images:** 3,628
- **Questions with in-image text:** 2,550 (images contain text in Korean or other languages)
- **Questions without in-image text:** 1,078
- **Korean-specific questions:** 300

---

# Load the dataset

```python
from datasets import load_dataset

# Load the single CSV file that holds the benchmark.
ds = load_dataset(
    "HAERAE-HUB/KMMMU",
    data_files="kmmmu.csv",
)

# A lone data file is exposed as the "train" split.
df = ds["train"].to_pandas()
df.head()
```

---

# Dataset Structure

Each row in the dataset contains:

- question: The problem statement (Korean)
- answer: The gold answer
- question_type: Question type category
- image_link: A list (string format) of image URLs associated with the question (some questions contain multiple images)

---

# Loading Images

The image_link field stores a list of full image URLs in string format. It must be parsed before use.

```python
import ast
from io import BytesIO

import requests
from PIL import Image

df_images = []
for _, row in df.iterrows():
    images = []
    # Parse the stringified list with ast.literal_eval instead of eval,
    # so that cell contents are never executed as code.
    for link in ast.literal_eval(row.image_link):
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail early on HTTP errors
        image = Image.open(BytesIO(response.content)).convert("RGB")
        images.append(image)
    df_images.append(images)
```

### Point of Contact

For any questions, contact us at the following email addresses:

```
naa012@cau.ac.kr, guijin.son@snu.ac.kr
```
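The parsing step described under Loading Images can be separated from the download step so it can be checked without network access. The sketch below assumes each `image_link` cell is a Python-style list literal stored as a string, as the card states. The helper name `parse_image_links`, its fallback behaviour for bare URLs and missing cells, and the example URLs are illustrative assumptions, not part of the dataset or its tooling.

```python
import ast


def parse_image_links(raw):
    """Turn an image_link cell into a list of URL strings without eval()."""
    # Assumption: a missing cell shows up as None or a pandas NaN (a float).
    if raw is None or isinstance(raw, float):
        return []
    # Already a list or tuple: nothing to parse.
    if isinstance(raw, (list, tuple)):
        return [str(u) for u in raw]
    text = str(raw).strip()
    if not text:
        return []
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the cell as one bare URL.
        return [text]
    if isinstance(parsed, (list, tuple)):
        return [str(u) for u in parsed]
    return [str(parsed)]


# Hypothetical cell value, not taken from the dataset.
example = "['https://example.com/a.png', 'https://example.com/b.png']"
print(parse_image_links(example))
```

With the DataFrame from the loading step, something like `df["image_link"].map(parse_image_links)` would yield one list of URLs per question, which can then be passed to whatever download loop is in use.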


Provider:

HAERAE-HUB
