HAERAE-HUB/KMMLU-HARD
收藏Hugging Face2024-03-09 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/HAERAE-HUB/KMMLU-HARD
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: maritime_engineering
data_files:
- split: dev
path: data/maritime_engineering-dev.csv
- split: test
path: data/maritime_engineering-hard-test.csv
- config_name: materials_engineering
data_files:
- split: dev
path: data/materials_engineering-dev.csv
- split: test
path: data/materials_engineering-hard-test.csv
- config_name: railway_and_automotive_engineering
data_files:
- split: dev
path: data/railway_and_automotive_engineering-dev.csv
- split: test
path: data/railway_and_automotive_engineering-hard-test.csv
- config_name: biology
data_files:
- split: dev
path: data/biology-dev.csv
- split: test
path: data/biology-hard-test.csv
- config_name: public_safety
data_files:
- split: dev
path: data/public_safety-dev.csv
- split: test
path: data/public_safety-hard-test.csv
- config_name: criminal_law
data_files:
- split: dev
path: data/criminal_law-dev.csv
- split: test
path: data/criminal_law-hard-test.csv
- config_name: information_technology
data_files:
- split: dev
path: data/information_technology-dev.csv
- split: test
path: data/information_technology-hard-test.csv
- config_name: geomatics
data_files:
- split: dev
path: data/geomatics-dev.csv
- split: test
path: data/geomatics-hard-test.csv
- config_name: management
data_files:
- split: dev
path: data/management-dev.csv
- split: test
path: data/management-hard-test.csv
- config_name: math
data_files:
- split: dev
path: data/math-dev.csv
- split: test
path: data/math-hard-test.csv
- config_name: accounting
data_files:
- split: dev
path: data/accounting-dev.csv
- split: test
path: data/accounting-hard-test.csv
- config_name: chemistry
data_files:
- split: dev
path: data/chemistry-dev.csv
- split: test
path: data/chemistry-hard-test.csv
- config_name: nondestructive_testing
data_files:
- split: dev
path: data/nondestructive_testing-dev.csv
- split: test
path: data/nondestructive_testing-hard-test.csv
- config_name: computer_science
data_files:
- split: dev
path: data/computer_science-dev.csv
- split: test
path: data/computer_science-hard-test.csv
- config_name: ecology
data_files:
- split: dev
path: data/ecology-dev.csv
- split: test
path: data/ecology-hard-test.csv
- config_name: health
data_files:
- split: dev
path: data/health-dev.csv
- split: test
path: data/health-hard-test.csv
- config_name: political_science_and_sociology
data_files:
- split: dev
path: data/political_science_and_sociology-dev.csv
- split: test
path: data/political_science_and_sociology-hard-test.csv
- config_name: patent
data_files:
- split: dev
path: data/patent-dev.csv
- split: test
path: data/patent-hard-test.csv
- config_name: electrical_engineering
data_files:
- split: dev
path: data/electrical_engineering-dev.csv
- split: test
path: data/electrical_engineering-hard-test.csv
- config_name: electronics_engineering
data_files:
- split: dev
path: data/electronics_engineering-dev.csv
- split: test
path: data/electronics_engineering-hard-test.csv
- config_name: korean_history
data_files:
- split: dev
path: data/korean_history-dev.csv
- split: test
path: data/korean_history-hard-test.csv
- config_name: gas_technology_and_engineering
data_files:
- split: dev
path: data/gas_technology_and_engineering-dev.csv
- split: test
path: data/gas_technology_and_engineering-hard-test.csv
- config_name: machine_design_and_manufacturing
data_files:
- split: dev
path: data/machine_design_and_manufacturing-dev.csv
- split: test
path: data/machine_design_and_manufacturing-hard-test.csv
- config_name: chemical_engineering
data_files:
- split: dev
path: data/chemical_engineering-dev.csv
- split: test
path: data/chemical_engineering-hard-test.csv
- config_name: telecommunications_and_wireless_technology
data_files:
- split: dev
path: data/telecommunications_and_wireless_technology-dev.csv
- split: test
path: data/telecommunications_and_wireless_technology-hard-test.csv
- config_name: food_processing
data_files:
- split: dev
path: data/food_processing-dev.csv
- split: test
path: data/food_processing-hard-test.csv
- config_name: social_welfare
data_files:
- split: dev
path: data/social_welfare-dev.csv
- split: test
path: data/social_welfare-hard-test.csv
- config_name: real_estate
data_files:
- split: dev
path: data/real_estate-dev.csv
- split: test
path: data/real_estate-hard-test.csv
- config_name: marketing
data_files:
- split: dev
path: data/marketing-dev.csv
- split: test
path: data/marketing-hard-test.csv
- config_name: mechanical_engineering
data_files:
- split: dev
path: data/mechanical_engineering-dev.csv
- split: test
path: data/mechanical_engineering-hard-test.csv
- config_name: fashion
data_files:
- split: dev
path: data/fashion-dev.csv
- split: test
path: data/fashion-hard-test.csv
- config_name: psychology
data_files:
- split: dev
path: data/psychology-dev.csv
- split: test
path: data/psychology-hard-test.csv
- config_name: taxation
data_files:
- split: dev
path: data/taxation-dev.csv
- split: test
path: data/taxation-hard-test.csv
- config_name: environmental_science
data_files:
- split: dev
path: data/environmental_science-dev.csv
- split: test
path: data/environmental_science-hard-test.csv
- config_name: refrigerating_machinery
data_files:
- split: dev
path: data/refrigerating_machinery-dev.csv
- split: test
path: data/refrigerating_machinery-hard-test.csv
- config_name: education
data_files:
- split: dev
path: data/education-dev.csv
- split: test
path: data/education-hard-test.csv
- config_name: industrial_engineer
data_files:
- split: dev
path: data/industrial_engineer-dev.csv
- split: test
path: data/industrial_engineer-hard-test.csv
- config_name: civil_engineering
data_files:
- split: dev
path: data/civil_engineering-dev.csv
- split: test
path: data/civil_engineering-hard-test.csv
- config_name: energy_management
data_files:
- split: dev
path: data/energy_management-dev.csv
- split: test
path: data/energy_management-hard-test.csv
- config_name: law
data_files:
- split: dev
path: data/law-dev.csv
- split: test
path: data/law-hard-test.csv
- config_name: agricultural_sciences
data_files:
- split: dev
path: data/agricultural_sciences-dev.csv
- split: test
path: data/agricultural_sciences-hard-test.csv
- config_name: interior_architecture_and_design
data_files:
- split: dev
path: data/interior_architecture_and_design-dev.csv
- split: test
path: data/interior_architecture_and_design-hard-test.csv
- config_name: aviation_engineering_and_maintenance
data_files:
- split: dev
path: data/aviation_engineering_and_maintenance-dev.csv
- split: test
path: data/aviation_engineering_and_maintenance-hard-test.csv
- config_name: construction
data_files:
- split: dev
path: data/construction-dev.csv
- split: test
path: data/construction-hard-test.csv
- config_name: economics
data_files:
- split: dev
path: data/economics-dev.csv
- split: test
path: data/economics-hard-test.csv
license: cc-by-nd-4.0
task_categories:
- question-answering
language:
- ko
tags:
- haerae
- mmlu
size_categories:
- 100K<n<1M
---
### KMMLU (Korean-MMLU)
We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM.
Unlike previous Korean benchmarks that are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language.
We test 26 publically available and proprietary LLMs, identifying significant room for improvement.
The best publicly available model achieves 50.54% on KMMLU, far below the average human performance of 62.6%.
This model was primarily trained for English and Chinese, not Korean.
Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X, achieve 59.95% and 53.40%, respectively.
This suggests that further work is needed to improve Korean LLMs, and KMMLU offers the right tool to track this progress.
We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.
Link to Paper: [KMMLU: Measuring Massive Multitask Language Understanding in Korean](https://arxiv.org/abs/2402.11548)
### KMMLU Statistics
| Category | # Questions |
|------------------------------|-------------|
| **Prerequisites** | |
| None | 59,909 |
| 1 Prerequisite Test | 12,316 |
| 2 Prerequisite Tests | 776 |
| 2+ Years of Experience | 65,135 |
| 4+ Years of Experience | 98,678 |
| 9+ Years of Experience | 6,963 |
| **Question Type** | |
| Positive | 207,030 |
| Negation | 36,777 |
| **Split** | |
| Train | 208,522 |
| Validation | 225 |
| Test | 35,030 |
| **Total** | 243,777 |
### Categories
To reimplement the categories in the paper, refer to the following:
```
supercategories = {
"accounting": "HUMSS",
"agricultural_sciences": "Other",
"aviation_engineering_and_maintenance": "Applied Science",
"biology": "STEM",
"chemical_engineering": "STEM",
"chemistry": "STEM",
"civil_engineering": "STEM",
"computer_science": "STEM",
"construction": "Other",
"criminal_law": "HUMSS",
"ecology": "STEM",
"economics": "HUMSS",
"education": "HUMSS",
"electrical_engineering": "STEM",
"electronics_engineering": "Applied Science",
"energy_management": "Applied Science",
"environmental_science": "Applied Science",
"fashion": "Other",
"food_processing": "Other",
"gas_technology_and_engineering": "Applied Science",
"geomatics": "Applied Science",
"health": "Other",
"industrial_engineer": "Applied Science",
"information_technology": "STEM",
"interior_architecture_and_design": "Other",
"law": "HUMSS",
"machine_design_and_manufacturing": "Applied Science",
"management": "HUMSS",
"maritime_engineering": "Applied Science",
"marketing": "Other",
"materials_engineering": "STEM",
"mechanical_engineering": "STEM",
"nondestructive_testing": "Applied Science",
"patent": "Other",
"political_science_and_sociology": "HUMSS",
"psychology": "HUMSS",
"public_safety": "Other",
"railway_and_automotive_engineering": "Applied Science",
"real_estate": "Other",
"refrigerating_machinery": "Other",
"social_welfare": "HUMSS",
"taxation": "HUMSS",
"telecommunications_and_wireless_technology": "Applied Science",
"korean_history": "HUMSS",
"math": "STEM"
}
```
### Point of Contact
For any questions contact us via the following email:)
```
spthsrbwls123@yonsei.ac.kr
```
提供机构:
HAERAE-HUB
原始信息汇总
数据集概述
数据集名称
KMMLU (Korean-MMLU)
数据集描述
KMMLU是一个包含35,030个专家级别的多项选择题的韩国基准测试,涵盖从人文到STEM的45个学科。与之前的韩国基准测试不同,KMMLU是从原始的韩国考试中收集的,捕捉了韩语的语言和文化特点。
数据集配置
数据集包含以下配置:
maritime_engineeringmaterials_engineeringrailway_and_automotive_engineeringbiologypublic_safetycriminal_lawinformation_technologygeomaticsmanagementmathaccountingchemistrynondestructive_testingcomputer_scienceecologyhealthpolitical_science_and_sociologypatentelectrical_engineeringelectronics_engineeringkorean_historygas_technology_and_engineeringmachine_design_and_manufacturingchemical_engineeringtelecommunications_and_wireless_technologyfood_processingsocial_welfarereal_estatemarketingmechanical_engineeringfashionpsychologytaxationenvironmental_sciencerefrigerating_machineryeducationindustrial_engineercivil_engineeringenergy_managementlawagricultural_sciencesinterior_architecture_and_designaviation_engineering_and_maintenanceconstructioneconomics
数据文件
每个配置包含两个数据文件:
dev文件:用于开发集hard-test文件:用于测试集
数据集统计
| 类别 | 问题数量 |
|---|---|
| 前置条件 | |
| 无 | 59,909 |
| 1个前置测试 | 12,316 |
| 2个前置测试 | 776 |
| 2年以上经验 | 65,135 |
| 4年以上经验 | 98,678 |
| 9年以上经验 | 6,963 |
| 问题类型 | |
| 正面 | 207,030 |
| 否定 | 36,777 |
| 分割 | |
| 训练 | 208,522 |
| 验证 | 225 |
| 测试 | 35,030 |
| 总计 | 243,777 |
类别分类
数据集的类别分为以下超类别:
HUMSSSTEMApplied ScienceOther
许可证
cc-by-nd-4.0
任务类别
- 问答
语言
- 韩语
标签
- haerae
- mmlu
数据集大小
- 100K<n<1M



