
HAERAE-HUB/KMMMU

Hugging Face | Updated 2026-04-16 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/HAERAE-HUB/KMMMU
Resource description:
---
language:
- ko
size_categories:
- 1K<n<10K
license: cc-by-nc-4.0
---

# KMMMU (Korean MMMU)

> Technical report: https://arxiv.org/abs/2604.13058
> Evaluation tutorial: https://github.com/HAE-RAE/KMMMU

KMMMU is a Korean version of MMMU: a multimodal benchmark designed to evaluate **college-/exam-level reasoning** that requires combining **images + Korean text**.

This dataset contains **3,466** questions collected from Korean exam sources including:

- Civil service recruitment exams
- National Technical Qualifications
- National Competency Standard (NCS) exams
- Academic Olympiads

## Key statistics

- **Total questions:** 3,466
- **Total images:** 3,628
- **Questions with in-image text:** 2,550 (images contain text such as Korean or other languages)
- **Questions without in-image text:** 1,078
- **Korean-specific questions:** 300

---

# Load the dataset

```python
from datasets import load_dataset

# Load the single CSV file; it is exposed as the "train" split.
ds = load_dataset(
    "HAERAE-HUB/KMMMU",
    data_files="kmmmu.csv",
)

# Convert to a pandas DataFrame for inspection.
df = ds["train"].to_pandas()
df.head()
```

---

# Dataset Structure

Each row in the dataset contains:

- question: The problem statement (Korean)
- answer: The gold answer
- question_type: Question type category
- image_link: A list (string format) of image URLs associated with the question (some questions contain multiple images)

---

# Loading Images

The image_link field stores a list of full image URLs in string format. It must be parsed before use.

```python
import ast
import requests
from PIL import Image
from io import BytesIO

# df is the DataFrame created in "Load the dataset".
df_images = []
for _, row in df.iterrows():
    images = []
    # ast.literal_eval parses the stringified list without executing code,
    # unlike eval.
    for link in ast.literal_eval(row.image_link):
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail early on HTTP errors
        image = Image.open(BytesIO(response.content)).convert("RGB")
        images.append(image)
    df_images.append(images)
```

### Point of Contact

For any questions, contact us at the following email addresses:

```
naa012@cau.ac.kr, guijin.son@snu.ac.kr
```
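A note on the image_link parsing step described under Loading Images: because the field arrives as a string, it may help to separate parsing from downloading. The following is a minimal sketch, not part of the KMMMU repository. The helper name `parse_image_links` is hypothetical, and the handling of missing cells and bare URLs is an assumption about what the CSV might contain rather than documented behavior.

```python
import ast
import math


def parse_image_links(raw):
    """Return a list of URL strings parsed from one image_link cell.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    # Assumption: missing cells may appear as None or NaN in pandas.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    text = str(raw).strip()
    if not text:
        return []
    try:
        # literal_eval accepts only Python literals, so it cannot run code.
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Assumption: a cell that is not a literal is a single bare URL.
        return [text]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    # Any other literal (for example a quoted single URL) becomes one entry.
    return [str(value)]
```

With the DataFrame from the loading step, this could be applied as `df["image_link"].map(parse_image_links)` before any network requests are made, so that malformed rows can be inspected first.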

Provided by:
HAERAE-HUB