NYTK/hu-mmlu
Source: Hugging Face. Updated 2026-02-11; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/NYTK/hu-mmlu
Description:
---
pretty_name: "Hu-MMLU"
language:
- hu
license: mit
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
tags:
- mmlu
- benchmark
- evaluation
- multiple-choice
- hungarian
size_categories:
- 1K<n<10K
---
# Hu-MMLU
This dataset is a Hungarian translation/alignment of the MMLU (Massive Multitask Language Understanding) benchmark, organized **per subject** (one Hugging Face *config* per subject), mirroring the subject structure of the Hub-hosted MMLU distribution.
**Upstream reference dataset:** `cais/mmlu`
**License:** MIT (kept consistent with the upstream distribution)
> ⚠️ Translation note: This is a translated benchmark. Residual artifacts (formatting, terminology drift, or occasional awkward phrasing) may exist. Use primarily for evaluation and analysis.
---
## Repository structure
### Configs (subsets)
- **One config per subject**, e.g.:
  - `high_school_biology`
  - `college_medicine`
  - `abstract_algebra`
  - …
- An additional **`all`** config that concatenates all subjects into one dataset.
### Splits
Each config contains:
- `dev`
- `validation`
- `test`
Split naming follows the upstream MMLU convention.
---
## Data format
### Columns (schema)
Each split contains:
- `id` *(string)*: unique example identifier
- `subject` *(string)*: subject name (also equals the config name for per-subject configs)
- `question` *(string)*: Hungarian question prompt
- `choices` *(list[string], length = 4)*: answer options in order `[A, B, C, D]`
- `answer` *(ClassLabel: A/B/C/D)*: correct option label
### Example record
```python
{
    "id": "test_260",
    "subject": "high_school_biology",
    "question": "Két személynek, akik közül az egyik B, a másik AB vércsoportú, gyermeke születik. Annak valószínűsége, hogy a gyermek O vércsoportú,",
    "choices": [
        "0%",
        "25%",
        "50%",
        "100%"
    ],
    "answer": "A"
}
```
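The schema above can be checked mechanically before running an evaluation. The following is a minimal sketch, not part of the dataset tooling: `validate_record` is a hypothetical helper, and it assumes that `answer` may surface either as a letter (as in the example record) or as an integer index, depending on how the `ClassLabel` column is read.

```python
from typing import Any

LABELS = ("A", "B", "C", "D")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []

    # id, subject and question must be non-empty strings.
    for field in ("id", "subject", "question"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: expected a non-empty string")

    # choices must be exactly four strings, in [A, B, C, D] order.
    choices = record.get("choices")
    if not (
        isinstance(choices, list)
        and len(choices) == 4
        and all(isinstance(c, str) for c in choices)
    ):
        problems.append("choices: expected a list of exactly 4 strings")

    # Assumption: the label is either a letter A-D or an index 0-3.
    answer = record.get("answer")
    is_letter = isinstance(answer, str) and answer in LABELS
    is_index = (
        isinstance(answer, int)
        and not isinstance(answer, bool)
        and 0 <= answer <= 3
    )
    if not (is_letter or is_index):
        problems.append("answer: expected one of A/B/C/D or an index 0-3")

    return problems
```

Calling `validate_record` on each row of a split and collecting the non-empty results gives a quick sanity report keyed by `id`.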
---
## How to load
Replace `ORG_NAME/DATASET_NAME` with the actual repo id (e.g. `NYTK/hu-mmlu`).
### Load a single subject
```python
from datasets import load_dataset
repo_id = "ORG_NAME/DATASET_NAME"
ds = load_dataset(repo_id, "high_school_biology", split="test")
print(ds[0])
```
### List all available subject configs
```python
from datasets import get_dataset_config_names
repo_id = "ORG_NAME/DATASET_NAME"
print(get_dataset_config_names(repo_id))
```
### Load all subjects (if `all` exists)
```python
from datasets import load_dataset
repo_id = "ORG_NAME/DATASET_NAME"
ds_all = load_dataset(repo_id, "all", split="test")
```
---
## Evaluation protocol
This dataset is intended for **multiple-choice accuracy** evaluation.
### Recommended scoring
1. For each example, produce one of `{A, B, C, D}` (or index `{0,1,2,3}` corresponding to the `choices` order).
2. Compute accuracy against the `answer` label.
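The two scoring steps can be sketched as follows. This is an illustrative implementation, not an official scorer: `to_index` and `accuracy` are hypothetical helper names, and counting unparseable predictions as wrong is an assumption made here rather than a rule stated by the dataset.

```python
LABELS = "ABCD"


def to_index(label):
    """Map 'A'-'D' (any case, surrounding whitespace ignored) or 0-3 to an index.

    Returns None for anything else.
    """
    if isinstance(label, int) and not isinstance(label, bool):
        return label if 0 <= label < len(LABELS) else None
    if isinstance(label, str):
        cleaned = label.strip().upper()
        if len(cleaned) == 1 and cleaned in LABELS:
            return LABELS.index(cleaned)
    return None


def accuracy(predictions, references):
    """Fraction of predictions that match the gold labels.

    Assumption: a prediction that cannot be parsed counts as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")

    correct = 0
    for pred, gold in zip(predictions, references):
        gold_idx = to_index(gold)
        if gold_idx is None:
            raise ValueError(f"invalid gold label: {gold!r}")
        correct += to_index(pred) == gold_idx
    return correct / len(references)


# Letters and indices can be mixed on either side.
score = accuracy(["A", "b", 2, "maybe"], ["A", 1, "C", "D"])
print(score)
```

Normalizing both sides to an index keeps the comparison independent of whether the `answer` column was read as a letter or as an integer.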
### Typical prompting format (plain)
Present the model with:
- the Hungarian question (`question`)
- the four options (`choices`)
and ask it to return **only** `A/B/C/D`.
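A plain prompt of this shape could be assembled as shown below. This is only a sketch: `build_prompt` and `extract_choice` are hypothetical helpers, the English instruction line is an assumed wording (a Hungarian instruction may suit some models better), and the strict parsing rule is one possible choice among several.

```python
import re


def build_prompt(question, choices):
    """Render one item as a plain A/B/C/D prompt."""
    if len(choices) != 4:
        raise ValueError("expected exactly four answer options")
    lines = [question.strip(), ""]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("")
    # Assumed instruction wording; not prescribed by the dataset.
    lines.append("Answer with a single letter: A, B, C, or D.")
    return "\n".join(lines)


def extract_choice(output):
    """Return the letter if the reply is a lone A-D, optionally written
    as '(B)', 'B.' or 'B:'; return None for anything else."""
    match = re.fullmatch(r"\s*\(?([ABCD])\)?[.:]?\s*", output)
    return match.group(1) if match else None


prompt = build_prompt(
    "Annak valószínűsége, hogy a gyermek O vércsoportú,",
    ["0%", "25%", "50%", "100%"],
)
print(prompt)
```

Strict parsing makes instruction-following failures visible as unparsed replies instead of silently guessing a letter from free text.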
---
## Quality control
- The dataset is translated to Hungarian and **manually reviewed where possible**.
- During publishing, common formatting artifacts (e.g. ratio/decimal notation) can be normalized.
- A publishing/QC script can generate a local `qc_report.tsv` to flag rows with applied safe fixes and/or duplicate answer options.
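To illustrate the duplicate-option check, here is a minimal sketch of how such rows might be flagged and written to a TSV file. It is not the project's actual QC script: the function names, the normalization rule (whitespace collapsing plus case folding), and the report columns are all assumptions.

```python
import csv
import io

LABELS = "ABCD"


def find_duplicate_options(choices):
    """Return index pairs (i, j), i < j, whose options are equal after
    collapsing whitespace and folding case."""
    normalized = [" ".join(c.split()).casefold() for c in choices]
    return [
        (i, j)
        for i in range(len(normalized))
        for j in range(i + 1, len(normalized))
        if normalized[i] == normalized[j]
    ]


def qc_rows(records):
    """Yield (subject, id, issue) for every duplicate pair found."""
    for rec in records:
        for i, j in find_duplicate_options(rec["choices"]):
            issue = f"duplicate options {LABELS[i]}/{LABELS[j]}"
            yield rec["subject"], rec["id"], issue


def write_report(records, handle):
    """Write a tab-separated report with a header row (assumed columns)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "id", "issue"])
    writer.writerows(qc_rows(records))


# Usage with made-up records; pass an open file handle to write to disk.
sample = [
    {"subject": "s", "id": "x1", "choices": ["igen", "Igen ", "nem", "talán"]},
    {"subject": "s", "id": "x2", "choices": ["1", "2", "3", "4"]},
]
buffer = io.StringIO()
write_report(sample, buffer)
print(buffer.getvalue())
```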
## Known limitations
- Some items may still contain minor stylistic differences from preferred Hungarian domain usage.
- A subset of items is inherently US-centric (especially civics/economics), which may affect “naturalness” in Hungarian.
## Intended uses
- Benchmarking Hungarian-capable LLMs on a broad, multi-domain multiple-choice suite.
- Cross-lingual robustness analysis (English vs Hungarian performance).
- Error analysis on terminology sensitivity and instruction-following for MCQ tasks.
---
## Not recommended uses
- High-stakes decision-making (education, medicine, law).
- Using translated questions as primary pedagogical material without review.
- Treating model performance on this dataset as a direct measure of real-world competence.
---
## Ethics and safety
This dataset includes general-knowledge questions across many domains, including medicine and law. Evaluate models responsibly; do not present benchmark performance as a substitute for professional judgment.
---
## Versioning
- The Hub commit history serves as the source of truth for dataset revisions.
- If you make systematic fixes (terminology sweeps, formatting normalization), document changes here (or in release tags).
---
## License
- MIT (kept consistent with the upstream Hub distribution of MMLU, `cais/mmlu`).
---
## Citation
### MMLU
```bibtex
@article{hendrycks2020measuring,
title={Measuring Massive Multitask Language Understanding},
author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
journal={arXiv preprint arXiv:2009.03300},
year={2020}
}
```
### HuGME (Hungarian benchmark reference)
```bibtex
@inproceedings{ligeti-nagy-etal-2025-hugme,
title = "{H}u{GME}: A benchmark system for evaluating {H}ungarian generative {LLM}s",
author = "Ligeti-Nagy, No{\'e}mi and
Madarasz, Gabor and
Foldesi, Flora and
Lengyel, Mariann and
Osvath, Matyas and
Sarossy, Bence and
Varga, Kristof and
Yang, Gy{\H{o}}z{\H{o}} Zijian and
H{\'e}ja, Enik{\H{o}} and
V{\'a}radi, Tam{\'a}s and
Pr{\'o}sz{\'e}ky, G{\'a}bor",
booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
month = jul,
year = "2025",
address = "Vienna, Austria and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.gem-1.32/",
pages = "385--403",
isbn = "979-8-89176-261-9"
}
```
### This dataset
If you use this Hungarian MMLU dataset as part of the HuGME benchmark ecosystem, please cite the **HuGME paper** above in addition to the original MMLU paper.
---
## Contact / contributions
Issues and PRs are welcome for:
- mistranslations
- terminology alignment
- formatting fixes
- duplicate option corrections
- split/config consistency
When reporting an issue, include:
- `subject` (config name)
- `split`
- `id`
- the problematic `question` / `choices`
- suggested correction
Provider:
NYTK



