
NYTK/hu-mmlu

Source: Hugging Face. Updated 2026-02-11; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/NYTK/hu-mmlu
Resource description:
---
pretty_name: "Hu-MMLU"
language:
- hu
license: mit
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
tags:
- mmlu
- benchmark
- evaluation
- multiple-choice
- hungarian
size_categories:
- 1K<n<10K
---

# Hu-MMLU

This dataset is a Hungarian translation/alignment of the MMLU (Massive Multitask Language Understanding) benchmark, organized **per subject** (one Hugging Face *config* per subject), mirroring the subject structure of the Hub-hosted MMLU distribution.

**Upstream reference dataset:** `cais/mmlu`

**License:** MIT (kept consistent with the upstream distribution)

> ⚠️ Translation note: This is a translated benchmark. Residual artifacts (formatting, terminology drift, or occasional awkward phrasing) may exist. Use primarily for evaluation and analysis.

---

## Repository structure

### Configs (subsets)

- **One config per subject**, e.g.:
  - `high_school_biology`
  - `college_medicine`
  - `abstract_algebra`
  - …
- an additional **`all`** config that concatenates all subjects into one dataset.

### Splits

Each config contains:

- `dev`
- `validation`
- `test`

Split naming follows the upstream MMLU convention.

---

## Data format

### Columns (schema)

Each split contains:

- `id` *(string)*: unique example identifier
- `subject` *(string)*: subject name (equal to the config name for per-subject configs)
- `question` *(string)*: Hungarian question prompt
- `choices` *(list[string], length = 4)*: answer options in order `[A, B, C, D]`
- `answer` *(ClassLabel: A/B/C/D)*: correct option label

### Example record

```python
{
    "id": "test_260",
    "subject": "high_school_biology",
    "question": "Két személynek, akik közül az egyik B, a másik AB vércsoportú, gyermeke születik. Annak valószínűsége, hogy a gyermek O vércsoportú,",
    "choices": [
        "0%",
        "25%",
        "50%",
        "100%"
    ],
    "answer": "A"
}
```

---

## How to load

Replace `ORG_NAME/DATASET_NAME` with your actual repo id (e.g. `NYTK/hu-mmlu`).

### Load a single subject

```python
from datasets import load_dataset

repo_id = "ORG_NAME/DATASET_NAME"
ds = load_dataset(repo_id, "high_school_biology", split="test")
print(ds[0])
```

### List all available subject configs

```python
from datasets import get_dataset_config_names

repo_id = "ORG_NAME/DATASET_NAME"
print(get_dataset_config_names(repo_id))
```

### Load all subjects (if `all` exists)

```python
from datasets import load_dataset

repo_id = "ORG_NAME/DATASET_NAME"
ds_all = load_dataset(repo_id, "all", split="test")
```

---

## Evaluation protocol

This dataset is intended for **multiple-choice accuracy** evaluation.

### Recommended scoring

1. For each example, produce one of `{A, B, C, D}` (or index `{0,1,2,3}` corresponding to the `choices` order).
2. Compute accuracy against the `answer` label.

### Typical prompting format (plain)

Present the model with:

- the Hungarian question (`question`)
- the four options (`choices`)

and ask it to return **only** `A/B/C/D`.

---

## Quality control

- The dataset is translated to Hungarian and **manually reviewed where possible**.
- During publishing, common formatting artifacts (e.g. ratio/decimal notation) can be normalized.
- A publishing/QC script can generate a local `qc_report.tsv` to flag rows with applied safe fixes and/or duplicate answer options.

## Known limitations

- Some items may still contain minor stylistic differences from preferred Hungarian domain usage.
- A subset of items is inherently US-centric (especially civics/economics), which may affect “naturalness” in Hungarian.

## Intended uses

- Benchmarking Hungarian-capable LLMs on a broad, multi-domain multiple-choice suite.
- Cross-lingual robustness analysis (English vs Hungarian performance).
- Error analysis on terminology sensitivity and instruction-following for MCQ tasks.

---

## Not recommended uses

- High-stakes decision-making (education, medicine, law).
- Using translated questions as primary pedagogical material without review.
- Treating model performance on this dataset as a direct measure of real-world competence.

---

## Ethics and safety

This dataset includes general-knowledge questions across many domains, including medicine and law. Evaluate models responsibly; do not present benchmark performance as a substitute for professional judgment.

---

## Versioning

- The Hub commit history serves as the source of truth for dataset revisions.
- If you make systematic fixes (terminology sweeps, formatting normalization), document changes here (or in release tags).

---

## License

- MIT (kept consistent with the upstream Hub distribution of MMLU, `cais/mmlu`).

---

## Citation

### MMLU

```bibtex
@article{hendrycks2020measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  journal={arXiv preprint arXiv:2009.03300},
  year={2020}
}
```

### HuGME (Hungarian benchmark reference)

```bibtex
@inproceedings{ligeti-nagy-etal-2025-hugme,
  title = "{H}u{GME}: A benchmark system for evaluating {H}ungarian generative {LLM}s",
  author = "Ligeti-Nagy, No{\'e}mi and Madarasz, Gabor and Foldesi, Flora and Lengyel, Mariann and Osvath, Matyas and Sarossy, Bence and Varga, Kristof and Yang, Gy{\H{o}}z{\H{o}} Zijian and H{\'e}ja, Enik{\H{o}} and V{\'a}radi, Tam{\'a}s and Pr{\'o}sz{\'e}ky, G{\'a}bor",
  booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
  month = jul,
  year = "2025",
  address = "Vienna, Austria and virtual meeting",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.gem-1.32/",
  pages = "385--403",
  isbn = "979-8-89176-261-9"
}
```

### This dataset

If you use this Hungarian MMLU dataset as part of the HuGME benchmark ecosystem, please cite the **HuGME paper** above in addition to the original MMLU paper.


---

## Contact / contributions

Issues and PRs are welcome for:

- mistranslations
- terminology alignment
- formatting fixes
- duplicate option corrections
- split/config consistency

When reporting an issue, include:

- `subject` (config name)
- `split`
- `id`
- the problematic `question` / `choices`
- suggested correction
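
---

## Appendix: scoring sketch

The scoring steps listed under "Evaluation protocol" can be sketched in plain Python as follows. This is a minimal illustration, not an official evaluation harness: the helper names (`format_prompt`, `parse_prediction`, `normalize_gold`, `accuracy`), the prompt wording, and the strict parsing rule are all assumptions made for this example. It also assumes the gold `answer` may arrive either as a letter (as in the example record) or as an integer index `0..3`, which is how `ClassLabel` columns are typically returned when loaded with `datasets`.

```python
import re

LETTERS = ["A", "B", "C", "D"]


def format_prompt(record):
    """Render one record as a plain multiple-choice prompt.

    The final instruction line is illustrative wording, not a prescribed format.
    """
    lines = [record["question"]]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer (A/B/C/D):")
    return "\n".join(lines)


def parse_prediction(text):
    """Return the predicted letter, or None if the output is not a bare label.

    Accepts only a single letter A-D (any case), optionally wrapped in
    punctuation such as "B." or "(b)". Anything longer is treated as unparseable.
    """
    match = re.fullmatch(r"\(?([A-Da-d])[.)]?", text.strip())
    return match.group(1).upper() if match else None


def normalize_gold(answer):
    """Map a gold label given as a letter or as a 0-3 index to a letter."""
    if isinstance(answer, int):
        return LETTERS[answer]
    return answer


def accuracy(records, predictions):
    """Fraction of records whose parsed prediction equals the gold label."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(
        parse_prediction(pred) == normalize_gold(rec["answer"])
        for rec, pred in zip(records, predictions)
    )
    return correct / len(records)


# Tiny demo using the example record from this card.
demo = {
    "id": "test_260",
    "subject": "high_school_biology",
    "question": "Két személynek, akik közül az egyik B, a másik AB vércsoportú, "
                "gyermeke születik. Annak valószínűsége, hogy a gyermek O vércsoportú,",
    "choices": ["0%", "25%", "50%", "100%"],
    "answer": "A",
}
print(format_prompt(demo))
print(accuracy([demo], ["A"]))
```

The parser is deliberately strict. Because "a" is also the Hungarian definite article, searching free-form output for the first standalone letter would misread a sentence such as "A válasz B" as option A; counting any non-bare output as unparseable (and therefore incorrect) avoids that, at the cost of penalizing models that ignore the "return only the letter" instruction.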
Provided by:
NYTK