HAERAE-HUB/KMMMU
Source: Hugging Face (updated 2026-04-16, indexed 2026-04-05)
Download link: https://hf-mirror.com/datasets/HAERAE-HUB/KMMMU
---
language:
- ko
size_categories:
- 1K<n<10K
license: cc-by-nc-4.0
---
# KMMMU (Korean MMMU)
> Technical report: https://arxiv.org/abs/2604.13058
> Evaluation tutorial: https://github.com/HAE-RAE/KMMMU
KMMMU is a Korean version of MMMU: a multimodal benchmark designed to evaluate **college-/exam-level reasoning** that requires combining **images + Korean text**.
This dataset contains **3,466** questions collected from Korean exam sources including:
- Civil service recruitment exams
- National Technical Qualifications
- National Competency Standard (NCS) exams
- Academic Olympiads
## Key statistics
- **Total questions:** 3,466
- **Total images:** 3,628
- **Questions with in-image text:** 2,550 (the images contain text in Korean or other languages)
- **Questions without in-image text:** 1,078
- **Korean-specific questions:** 300
---
# Load the dataset
```python
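from datasets import load_dataset

# Load the single CSV file from the dataset repository
ds = load_dataset(
    "HAERAE-HUB/KMMMU",
    data_files="kmmmu.csv",
)
df = ds["train"].to_pandas()  # with no split name given, the data lands in "train"
df.head()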
```
---
# Dataset Structure
Each row in the dataset contains:
- question: The problem statement (Korean)
- answer: The gold answer
- question_type: Question type category
- image_link: A list (string format) of image URLs associated with the question
(Some questions contain multiple images.)
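
Because `image_link` holds a list serialized as a string, each cell has to be converted back into a real list before the URLs can be used. The sketch below shows one way to do that with `ast.literal_eval`, which parses Python literals without executing code. The helper name `parse_image_links` and the example URLs are hypothetical, and the sketch assumes the cells use Python list-literal syntax.

```python
import ast


def parse_image_links(raw):
    """Turn the string form of a URL list into a real list of strings.

    Hypothetical helper; assumes the cell is a Python list literal.
    """
    parsed = ast.literal_eval(raw)  # parses literals only, never runs code
    if isinstance(parsed, str):     # tolerate a single quoted URL
        parsed = [parsed]
    if not isinstance(parsed, (list, tuple)) or not all(
        isinstance(url, str) for url in parsed
    ):
        raise ValueError(f"unexpected image_link value: {raw!r}")
    return list(parsed)


# Hypothetical cell values; the real URLs in the dataset will differ.
single = "['https://example.com/q1.png']"
multi = "['https://example.com/q2_a.png', 'https://example.com/q2_b.png']"

links_single = parse_image_links(single)
links_multi = parse_image_links(multi)
print(len(links_single), len(links_multi))
```

On a loaded DataFrame, the same helper could be applied column-wide with `df["image_link"].apply(parse_image_links)`.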
---
# Loading Images
The image_link field stores a list of full image URLs in string format.
It must be parsed before use.
```python
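import ast
from io import BytesIO

import requests
from PIL import Image

df_images = []  # one list of PIL images per question, in row order
for _, row in df.iterrows():
    images = []
    # image_link is the string form of a list; literal_eval parses it without executing code
    for link in ast.literal_eval(row.image_link):
        response = requests.get(link, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        image = Image.open(BytesIO(response.content)).convert("RGB")
        images.append(image)
    df_images.append(images)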
```
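
The loop above downloads every image on each run. With several thousand images, it may be worth caching the files locally. The following is a minimal sketch of one way to do that, using only the standard library. The directory name `kmmmu_images` and the helpers `cache_path` and `fetch_cached` are hypothetical and not part of the dataset or its tutorial.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

CACHE_DIR = Path("kmmmu_images")  # hypothetical local cache directory


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable local file path (hypothetical helper)."""
    # Keep the original extension if the URL has one.
    suffix = Path(urlparse(url).path).suffix or ".img"
    # Hash the full URL so that distinct URLs get distinct file names.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}{suffix}"


def fetch_cached(url, cache_dir=CACHE_DIR, timeout=30):
    """Download the URL once and reuse the local copy on later calls."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # urlopen raises an HTTPError for 4xx/5xx responses.
        with urlopen(url, timeout=timeout) as response:
            path.write_bytes(response.read())
    return path
```

A cached file can then be opened with `Image.open(fetch_cached(link)).convert("RGB")` in place of the in-memory download.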
### Point of Contact
For any questions, contact us at the following email addresses:
```
naa012@cau.ac.kr, guijin.son@snu.ac.kr
```
Provider: HAERAE-HUB



