HAERAE-HUB/KMMLU
Source: Hugging Face | Updated: 2024-03-05 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/HAERAE-HUB/KMMLU
Description:
We introduce KMMLU, a new Korean benchmark comprising 35,030 expert-level multiple-choice questions spanning 45 disciplines, ranging from humanities to STEM. Unlike prior Korean benchmarks translated from existing English benchmarks, KMMLU is collected from Korean-language examinations, designed to capture Korean linguistic and cultural characteristics. We evaluated 26 public and proprietary LLMs, and found that existing models have significant room for improvement in Korean language understanding. The top-performing public model achieves a score of 50.54% on KMMLU, far below the human average of 62.6%. This model is primarily trained on English and Chinese data rather than Korean. Current Korean-tailored LLMs, such as Polyglot-Ko, perform even worse. Surprisingly, even the most capable proprietary LLMs, such as GPT-4 and HyperCLOVA X, only attain scores of 59.95% and 53.40% respectively. This demonstrates that further improvements to Korean LLMs are necessary, and KMMLU provides a suitable tool to track such progress. We publicly release the dataset on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.
Provided by:
HAERAE-HUB
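Summary of original information
Dataset overview
Dataset name: KMMLU
Dataset size: 35,030 expert-level Korean multiple-choice questions.
Coverage: 45 subjects, ranging from the humanities to STEM (science, technology, engineering, and mathematics).
Dataset characteristics:
- Unlike earlier Korean benchmarks that were translated from English benchmarks, KMMLU is collected from original Korean-language exams and is designed to capture the linguistic and cultural characteristics of Korean.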
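Model evaluation results:
- 26 publicly available and proprietary large language models (LLMs) were evaluated.
- The best publicly available model scores 50.54% on KMMLU, well below the average human score of 62.6%.
- LLMs built specifically for Korean, such as Polyglot-Ko, perform even worse.
- The most capable proprietary LLMs, such as GPT-4 and HyperCLOVA X, reach 59.95% and 53.40%, respectively.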
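To make the distance from the human baseline concrete, the reported scores can be compared with a few lines of arithmetic. This is a minimal sketch that uses only the percentages quoted above; the dictionary labels and the function name `gap_to_human` are illustrative and do not come from the KMMLU release.

```python
# Accuracy figures (percent) exactly as quoted in the results above.
HUMAN_AVERAGE = 62.6

REPORTED_SCORES = {
    "best publicly available model": 50.54,
    "GPT-4": 59.95,
    "HyperCLOVA X": 53.40,
}


def gap_to_human(score: float, human: float = HUMAN_AVERAGE) -> float:
    """Return how many percentage points a score falls short of the human average."""
    # Round to two decimals to hide binary floating-point noise.
    return round(human - score, 2)


GAPS = {name: gap_to_human(score) for name, score in REPORTED_SCORES.items()}

for name, gap in GAPS.items():
    print(f"{name}: {gap:.2f} points below the human average")
```

On these figures GPT-4 comes within 2.65 points of the human average, HyperCLOVA X trails by 9.20 points, and the best publicly available model trails by 12.06 points.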
Dataset availability: Publicly released on the Hugging Face Hub and integrated into EleutherAI's Language Model Evaluation Harness.
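Because the dataset is hosted on the Hugging Face Hub, it can be read with the `datasets` library. The following is a minimal sketch, not an official loader: the repository id `HAERAE-HUB/KMMLU` comes from this listing, but the subset name `"Accounting"`, the `test` split, and the column names (`question`, `A` to `D`, and a 1-based integer `answer`) are assumptions that should be checked against the dataset card. `format_question`, `answer_letter`, and `load_subject` are hypothetical helper names.

```python
from typing import Mapping

REPO_ID = "HAERAE-HUB/KMMLU"  # repository id from this listing
OPTION_KEYS = ("A", "B", "C", "D")  # assumed option column names


def format_question(row: Mapping[str, object]) -> str:
    """Render one row as a plain-text multiple-choice prompt."""
    lines = [str(row["question"])]
    lines += [f"{key}. {row[key]}" for key in OPTION_KEYS]
    lines.append("Answer:")
    return "\n".join(lines)


def answer_letter(row: Mapping[str, object]) -> str:
    """Map an assumed 1-based integer answer index to its option letter."""
    return OPTION_KEYS[int(row["answer"]) - 1]


def load_subject(subject: str = "Accounting", split: str = "test"):
    """Download one subject subset; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(REPO_ID, subject, split=split)


# Hand-written toy row (not taken from KMMLU) that mimics the assumed schema.
example = {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": 2}
print(format_question(example))
print("Gold:", answer_letter(example))
```

Calling `load_subject()` would return the rows of one subject, which can then be passed through `format_question`. For full benchmark runs, the listing points to EleutherAI's Language Model Evaluation Harness; the task name used there is not stated here, so consult the harness's task list.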



