Metric-AI/ArmBench-LLM-data

Name: Metric-AI/ArmBench-LLM-data
Creator: Metric-AI
Published: 2026-04-01 11:40:20
License: MIT

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Metric-AI/ArmBench-LLM-data

Resource overview:

---
dataset_info:
- config_name: belebele-in-context-mcqa
  features:
  - name: flores_passage
    dtype: string
  - name: question
    dtype: string
  - name: mc_answer1
    dtype: string
  - name: mc_answer2
    dtype: string
  - name: mc_answer3
    dtype: string
  - name: mc_answer4
    dtype: string
  - name: correct_answer_num
    dtype: int64
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 68068
    num_examples: 50
  download_size: 44784
  dataset_size: 68068
- config_name: conversation-in-context-qa
  features:
  - name: label
    dtype: int64
  - name: dialogue
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 57968
    num_examples: 50
  download_size: 37784
  dataset_size: 57968
- config_name: conversational-sum
  features:
  - name: dialogue
    dtype: string
  - name: summary
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 87533
    num_examples: 100
  download_size: 49163
  dataset_size: 87533
- config_name: email-sum
  features:
  - name: email
    dtype: string
  - name: summary
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 168313
    num_examples: 100
  download_size: 79580
  dataset_size: 168313
- config_name: exam_history
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 60877
    num_examples: 70
  download_size: 33961
  dataset_size: 60877
- config_name: exam_literature
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 50137
    num_examples: 69
  download_size: 30131
  dataset_size: 50137
- config_name: exam_math
  features:
  - name: task
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 16378
    num_examples: 65
  download_size: 9130
  dataset_size: 16378
- config_name: finer
  features:
  - name: text
    dtype: string
  - name: gold_entities
    list:
      list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 377941
    num_examples: 100
  download_size: 167198
  dataset_size: 377941
- config_name: include-mcqa
  features:
  - name: question
    dtype: string
  - name: option_a
    dtype: string
  - name: option_b
    dtype: string
  - name: option_c
    dtype: string
  - name: option_d
    dtype: string
  - name: answer
    dtype: int64
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 26576
    num_examples: 50
  download_size: 18680
  dataset_size: 26576
- config_name: mmlu_pro
  features:
  - name: question_id
    dtype: int64
  - name: question
    dtype: string
  - name: options
    list: string
  - name: answer
    dtype: string
  - name: answer_index
    dtype: int64
  - name: cot_content
    dtype: string
  - name: category
    dtype: string
  - name: src
    dtype: string
  - name: question_arm
    dtype: string
  - name: options_arm
    list: string
  splits:
  - name: train
    num_bytes: 1862353
    num_examples: 999
  download_size: 885104
  dataset_size: 1862353
- config_name: ms-marco-in-context-qa
  features:
  - name: armenian
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 47281
    num_examples: 50
  download_size: 21979
  dataset_size: 47281
- config_name: paraphrase
  features:
  - name: text
    dtype: string
  - name: paraphrases
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 139251
    num_examples: 100
  download_size: 51659
  dataset_size: 139251
- config_name: pioner
  features:
  - name: tokens
    list: string
  - name: ner_tags
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 49385
    num_examples: 100
  download_size: 20356
  dataset_size: 49385
- config_name: pos
  features:
  - name: form
    dtype: string
  - name: upos_en
    dtype: string
  - name: upos_hy
    dtype: string
  splits:
  - name: train
    num_bytes: 3937
    num_examples: 100
  download_size: 2682
  dataset_size: 3937
- config_name: public-services-mcqa
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: distractors
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 50622
    num_examples: 45
  download_size: 30293
  dataset_size: 50622
- config_name: punctuation
  features:
  - name: orig_index
    dtype: int64
  - name: gold
    dtype: string
  - name: corrupted_punctuation
    dtype: string
  splits:
  - name: train
    num_bytes: 28053
    num_examples: 100
  download_size: 19398
  dataset_size: 28053
- config_name: scientific-in-context-mcqa
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: correct_answer
    dtype: string
  - name: distractor1
    dtype: string
  - name: distractor2
    dtype: string
  - name: distractor3
    dtype: string
  - name: orig_index
    dtype: int64
  - name: choices
    list: string
  - name: gold_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 62239
    num_examples: 50
  download_size: 40010
  dataset_size: 62239
- config_name: sentiment
  features:
  - name: text
    dtype: string
  - name: sentiment_categories
    list: string
  splits:
  - name: train
    num_bytes: 26089
    num_examples: 100
  download_size: 15020
  dataset_size: 26089
- config_name: simpleqa
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 413394
    num_examples: 50
  download_size: 184583
  dataset_size: 413394
- config_name: space_fix
  features:
  - name: orig_index
    dtype: int64
  - name: gold
    dtype: string
  - name: corrupted_spaces
    dtype: string
  splits:
  - name: train
    num_bytes: 308468
    num_examples: 100
  download_size: 165606
  dataset_size: 308468
- config_name: squad-in-context-qa
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 87951
    num_examples: 50
  download_size: 52797
  dataset_size: 87951
- config_name: syndarin-in-context-mcqa
  features:
  - name: paragraph
    dtype: string
  - name: question
    dtype: string
  - name: answer_candidate_1
    dtype: string
  - name: answer_candidate_2
    dtype: string
  - name: answer_candidate_3
    dtype: string
  - name: answer_candidate_4
    dtype: string
  - name: correct_answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 65246
    num_examples: 50
  download_size: 43091
  dataset_size: 65246
- config_name: topic-14class
  features:
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 82361
    num_examples: 280
  download_size: 44064
  dataset_size: 82361
- config_name: translation_short_sentences
  features:
  - name: eng
    dtype: string
  - name: hy
    dtype: string
  splits:
  - name: train
    num_bytes: 6373
    num_examples: 100
  download_size: 5931
  dataset_size: 6373
configs:
- config_name: belebele-in-context-mcqa
  data_files:
  - split: train
    path: belebele-in-context-mcqa/train-*
- config_name: conversation-in-context-qa
  data_files:
  - split: train
    path: conversation-in-context-qa/train-*
- config_name: conversational-sum
  data_files:
  - split: train
    path: conversational-sum/train-*
- config_name: email-sum
  data_files:
  - split: train
    path: email-sum/train-*
- config_name: exam_history
  data_files:
  - split: train
    path: exam_history/train-*
- config_name: exam_literature
  data_files:
  - split: train
    path: exam_literature/train-*
- config_name: exam_math
  data_files:
  - split: train
    path: exam_math/train-*
- config_name: finer
  data_files:
  - split: train
    path: finer/train-*
- config_name: include-mcqa
  data_files:
  - split: train
    path: include-mcqa/train-*
- config_name: mmlu_pro
  data_files:
  - split: train
    path: mmlu_pro/train-*
- config_name: ms-marco-in-context-qa
  data_files:
  - split: train
    path: ms-marco-in-context-qa/train-*
- config_name: paraphrase
  data_files:
  - split: train
    path: paraphrase/train-*
- config_name: pioner
  data_files:
  - split: train
    path: pioner/train-*
- config_name: pos
  data_files:
  - split: train
    path: pos/train-*
- config_name: public-services-mcqa
  data_files:
  - split: train
    path: public-services-mcqa/train-*
- config_name: punctuation
  data_files:
  - split: train
    path: punctuation/train-*
- config_name: scientific-in-context-mcqa
  data_files:
  - split: train
    path: scientific-in-context-mcqa/train-*
- config_name: sentiment
  data_files:
  - split: train
    path: sentiment/train-*
- config_name: simpleqa
  data_files:
  - split: train
    path: simpleqa/train-*
- config_name: space_fix
  data_files:
  - split: train
    path: space_fix/train-*
- config_name: squad-in-context-qa
  data_files:
  - split: train
    path: squad-in-context-qa/train-*
- config_name: syndarin-in-context-mcqa
  data_files:
  - split: train
    path: syndarin-in-context-mcqa/train-*
- config_name: topic-14class
  data_files:
  - split: train
    path: topic-14class/train-*
- config_name: translation_short_sentences
  data_files:
  - split: train
    path: translation_short_sentences/train-*
license: mit
language:
- hy
---

# lighteval-armenian

**Armenian LLM Evaluation Benchmark for LightEval**

## Dataset Description

This is a multi-task benchmark created specifically to evaluate Large Language Models on **Armenian** (`hy`) language capabilities. It was developed to add full native Armenian support to the [LightEval](https://github.com/huggingface/lighteval) framework by Hugging Face.

The benchmark contains only the tasks currently used in the official Armenian evaluation suite. It mixes:

- Translated/adapted versions of popular benchmarks (MMLU-Pro, Belebele, SQuAD, MS MARCO, INCLUDE, etc.)
- Native Armenian datasets (pioNER, national exams, public-services style tasks, punctuation/space normalization, etc.)
- Custom or newly created tasks for summarization, generation, and text processing

**Languages**: Primarily Armenian. Some configs are bilingual (English + Armenian) or contain parallel data.

**Intended Use**: Fast, reliable zero-shot / few-shot evaluation inside LightEval. Tasks are grouped into categories (see below).

## Task Categories & Metrics

The benchmark is organized into the following evaluation categories:

| Category | Tasks (config names) |
|---------------------------|----------------------|
| **NER** | finer, pioner |
| **POS** | pos |
| **Reading Comprehension** | squad-in-context-qa, belebele-in-context-mcqa, conversation-in-context-qa, public-services-mcqa, ms-marco-in-context-qa |
| **Classification** | include-mcqa, syndarin-in-context-mcqa, topic-14class, scientific-in-context-mcqa, sentiment |
| **Generation** | email-sum, conversational-sum, simpleqa, paraphrase |
| **Translation** | translation_short_sentences |
| **Exams** | exam_math, exam_literature, exam_history |
| **Text Processing** | punctuation, space_fix |
| **MMLU** | mmlu_pro |

## Configurations / Subsets

All configs use the `train` split and are sized for fast evaluation: most have 45 to 100 examples, while `topic-14class` has 280 and `mmlu_pro` has 999. Exact config names you can load:

### NER

- **finer**: Fine-grained / nested Named Entity Recognition task (`text` + `gold_entities` list of lists).
- **pioner**: **pioNER**, a gold-standard Named Entity Recognition dataset for Armenian (`tokens` + `ner_tags`).

### POS Tagging

- **pos**: Part-of-Speech tagging using Universal Dependencies tags (`form`, `upos_en`, `upos_hy`).

### Reading Comprehension

- **squad-in-context-qa**: In-context extractive QA adapted from SQuAD (`context`, `question`, `answer`).
- **belebele-in-context-mcqa**: In-context multiple-choice QA from the multilingual **Belebele** benchmark (FLORES passages).
- **conversation-in-context-qa**: Multiple-choice QA from conversations.
- **public-services-mcqa**: Question answering adapted from the Armenian public service **Hartak.am**.
- **ms-marco-in-context-qa**: In-context question answering adapted from MS MARCO.

### Classification

- **include-mcqa**: Subset of the **INCLUDE** benchmark: real multilingual exam-style multiple-choice questions (Armenian version).
- **syndarin-in-context-mcqa**: In-context MCQA from **SynDARin** (high-quality synthesized reasoning dataset for low-resource languages).
- **topic-14class**: Text classification into 14 topic categories (`category` + `text`).
- **scientific-in-context-mcqa**: Scientific-domain in-context multiple-choice reading comprehension.
- **sentiment**: Multi-category sentiment analysis (`text` + `sentiment_categories`).

### Generation / Summarization

- **email-sum**: Summarization of email content (`email` + `summary`).
- **conversational-sum**: Conversation/dialogue summarization task.
- **simpleqa**: Simple question-answering task.
- **paraphrase**: Paraphrase generation or detection (`text` + `paraphrases` list).

### Translation

- **translation_short_sentences**: Parallel English ↔ Armenian short sentences for translation evaluation (`eng` + `hy`).

### Exams (Armenian National / Educational)

- **exam_math**: Mathematics questions from Armenian exams (`task`, `question`, `choices`, `label`).
- **exam_literature**: Literature questions from Armenian exams.
- **exam_history**: History questions from Armenian exams.

### Text Processing / Normalization

- **punctuation**: Punctuation restoration (`gold` vs `corrupted_punctuation`).
- **space_fix**: Correction of spacing/tokenization errors (`gold` vs `corrupted_spaces`).

### Advanced Knowledge

- **mmlu_pro**: Challenging **MMLU-Pro** benchmark fully adapted to Armenian (`question_arm`, `options_arm` available).

## Data Fields

Fields vary by config (see original `dataset_info` or load a config to inspect).

## Loading the Dataset

```python
from datasets import load_dataset

# Load any task by passing its config name as the second argument
mmlu_pro = load_dataset("Metric-AI/ArmBench-LLM-data", "mmlu_pro")
pioner = load_dataset("Metric-AI/ArmBench-LLM-data", "pioner")
public_services = load_dataset("Metric-AI/ArmBench-LLM-data", "public-services-mcqa")
```

## Dataset Creation & Sources

- Translated benchmarks (MMLU-Pro, Belebele, SQuAD, MS MARCO, INCLUDE, SynDARin, etc.): professionally translated and culturally validated.
- Native Armenian resources: pioNER, national exam questions, punctuation/space tasks, and custom generation/summarization data collected from public sources.

## Ethical Considerations & Limitations

- Small evaluation-sized subsets (mostly 45 to 100 examples) for speed and reproducibility.
- Translation and adaptation quality has been prioritized; minor cultural nuances may remain.
- Exam data reflects real Armenian educational content.
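
As an illustration of how the per-config fields listed above might be turned into an evaluation prompt, here is a minimal sketch for `belebele-in-context-mcqa`. The helper name `format_belebele_prompt`, the prompt layout, and the sample row are hypothetical and are not taken from LightEval; the code also assumes `correct_answer_num` is 1-based, as in the upstream Belebele data, which the card does not state.

```python
from typing import Dict, Tuple

LETTERS = "ABCD"


def format_belebele_prompt(row: Dict) -> Tuple[str, str]:
    """Return (prompt, gold_letter) for one belebele-in-context-mcqa row."""
    # The four answer options live in mc_answer1 .. mc_answer4.
    options = [row[f"mc_answer{i}"] for i in range(1, 5)]
    lines = [row["flores_passage"], "", f"Question: {row['question']}"]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    # Assumption: correct_answer_num is 1-based (1..4), as in upstream Belebele.
    gold_letter = LETTERS[row["correct_answer_num"] - 1]
    return "\n".join(lines), gold_letter


# Hypothetical row with placeholder text; real rows contain Armenian passages.
example_row = {
    "flores_passage": "A short passage.",
    "question": "What is the passage about?",
    "mc_answer1": "first option",
    "mc_answer2": "second option",
    "mc_answer3": "third option",
    "mc_answer4": "fourth option",
    "correct_answer_num": 2,
    "orig_index": 0,
}

prompt, gold = format_belebele_prompt(example_row)
print(prompt)
print("gold:", gold)
```

The same pattern should carry over to the other multiple-choice configs, though their field names differ (for example `option_a` to `option_d` in `include-mcqa`), so each would need its own small adapter.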

Provided by:

Metric-AI

Curated summary

Dataset introduction

Construction

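In Armenian natural language processing, ArmBench-LLM-data is built by combining multiple sources with professional adaptation. It integrates translated and adapted versions of mainstream international benchmarks with native Armenian corpora, covering task dimensions such as named entity recognition, reading comprehension, and text classification. Specifically, construction involved professional translation and cultural validation of well-known benchmarks such as MMLU-Pro and Belebele, alongside the pioNER entity recognition dataset, national exam questions, and question-answering data based on public services, which keeps the language idiomatic and the tasks diverse. Most subsets are limited to between 50 and 100 examples (per the dataset metadata, the exceptions are public-services-mcqa with 45, topic-14class with 280, and mmlu_pro with 999), which keeps evaluation fast while aiming to preserve representativeness.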
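
The subset sizes can be checked against the `num_examples` values in the dataset card metadata. The following is a minimal sketch, with counts copied by hand from that metadata, that lists the configs falling outside the 50 to 100 range; the helper name `outside_range` is made up for this example.

```python
# num_examples per config, copied from the dataset_info metadata on this page.
NUM_EXAMPLES = {
    "belebele-in-context-mcqa": 50,
    "conversation-in-context-qa": 50,
    "conversational-sum": 100,
    "email-sum": 100,
    "exam_history": 70,
    "exam_literature": 69,
    "exam_math": 65,
    "finer": 100,
    "include-mcqa": 50,
    "mmlu_pro": 999,
    "ms-marco-in-context-qa": 50,
    "paraphrase": 100,
    "pioner": 100,
    "pos": 100,
    "public-services-mcqa": 45,
    "punctuation": 100,
    "scientific-in-context-mcqa": 50,
    "sentiment": 100,
    "simpleqa": 50,
    "space_fix": 100,
    "squad-in-context-qa": 50,
    "syndarin-in-context-mcqa": 50,
    "topic-14class": 280,
    "translation_short_sentences": 100,
}


def outside_range(counts, low=50, high=100):
    """Return the configs whose example count is not within [low, high]."""
    return {name: n for name, n in counts.items() if not (low <= n <= high)}


outliers = outside_range(NUM_EXAMPLES)
print(len(NUM_EXAMPLES), "configs,", sum(NUM_EXAMPLES.values()), "examples in total")
print("outside 50..100:", outliers)
```

Three configs fall outside the range, so per-task scores on `mmlu_pro` rest on far more examples than scores on the 50-example tasks; that is worth keeping in mind when averaging across tasks.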

Features

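The dataset's defining feature is that it is designed specifically, and comprehensively, for evaluating large language models on Armenian. It spans task types from basic language processing to higher-level knowledge reasoning, including text generation, translation, exam question answering, and text normalization. The dataset uses a modular configuration in which each subset corresponds to a specific evaluation scenario: mmlu_pro targets complex knowledge question answering, pioner supports entity recognition, and punctuation and space_fix focus on text repair. This structure supports efficient zero-shot and few-shot evaluation, and its combination of bilingual parallel data with localized content provides a multi-dimensional test of model capability in a low-resource language setting.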
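
For the text-repair configs (`punctuation` and `space_fix`), a model's output can be compared with the `gold` field. The sketch below uses exact match plus a character-level similarity from Python's standard `difflib`; this is an illustrative choice of metric, not the metric LightEval actually applies, and the sample strings are English placeholders rather than real dataset rows.

```python
from difflib import SequenceMatcher


def repair_scores(prediction: str, gold: str) -> dict:
    """Score a repaired sentence against the gold sentence."""
    exact = prediction.strip() == gold.strip()
    # ratio() is 2 * matches / total length, so 1.0 means identical strings.
    similarity = SequenceMatcher(None, prediction, gold).ratio()
    return {"exact_match": exact, "char_similarity": similarity}


# Placeholder strings standing in for the `gold` and
# `corrupted_punctuation` fields of one row.
gold = "Hello, world. How are you?"
corrupted = "Hello world How are you"

# Scoring the corrupted input itself gives a "do nothing" baseline.
baseline = repair_scores(corrupted, gold)
perfect = repair_scores(gold, gold)
print("baseline:", baseline)
print("perfect:", perfect)
```

Reporting the untouched corrupted input as a baseline makes it easier to see whether a model improves the text or merely copies it.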

Usage

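In practice, the dataset can be loaded directly with the Hugging Face datasets library and evaluated in a standardized way with the LightEval framework. Users select a config name as needed, for example loading mmlu_pro to test knowledge-intensive question answering or pioner to analyze named entity recognition performance. Each subset has a clear field structure: reading comprehension tasks contain a context, a question, and an answer, while summarization tasks pair a source text with its summary. Prompt templates and sampling strategies can be set flexibly during evaluation, so that a model's generalization and robustness across Armenian tasks can be measured systematically, giving a reliable baseline for optimizing and comparing language models.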
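
A minimal sketch of selecting and loading a config by name follows. The category mapping is copied from the dataset card's category table; `find_category` and `load_task` are hypothetical helper names, and `load_task` needs network access and the `datasets` package, so its call is left commented out.

```python
from typing import Dict, List

REPO_ID = "Metric-AI/ArmBench-LLM-data"

# Category -> config names, as listed in the dataset card.
CATEGORIES: Dict[str, List[str]] = {
    "NER": ["finer", "pioner"],
    "POS": ["pos"],
    "Reading Comprehension": [
        "squad-in-context-qa", "belebele-in-context-mcqa",
        "conversation-in-context-qa", "public-services-mcqa",
        "ms-marco-in-context-qa",
    ],
    "Classification": [
        "include-mcqa", "syndarin-in-context-mcqa", "topic-14class",
        "scientific-in-context-mcqa", "sentiment",
    ],
    "Generation": ["email-sum", "conversational-sum", "simpleqa", "paraphrase"],
    "Translation": ["translation_short_sentences"],
    "Exams": ["exam_math", "exam_literature", "exam_history"],
    "Text Processing": ["punctuation", "space_fix"],
    "MMLU": ["mmlu_pro"],
}


def find_category(config_name: str) -> str:
    """Return the category a config belongs to, or raise KeyError."""
    for category, configs in CATEGORIES.items():
        if config_name in configs:
            return category
    raise KeyError(f"unknown config: {config_name}")


def load_task(config_name: str):
    """Download one config's train split from the Hub."""
    find_category(config_name)  # fail early on a misspelled config name
    from datasets import load_dataset  # imported lazily; requires `datasets`
    return load_dataset(REPO_ID, config_name, split="train")


print(find_category("pioner"))  # NER
# ds = load_task("pioner")   # needs network access
# print(ds.features)         # inspect the fields before writing a prompt
```

Passing `split="train"` returns a single `Dataset` rather than a `DatasetDict`, which is convenient here because every config has only a `train` split.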

Background and challenges

Background overview

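In natural language processing, building evaluation benchmarks for low-resource languages is key to making language technology broadly accessible. ArmBench-LLM-data was recently created by the Metric-AI team to provide a comprehensive, multi-task evaluation benchmark for large language models in Armenian (hy). It combines tasks translated from well-known international benchmarks (such as MMLU-Pro and Belebele) with native corpora (such as pioNER and national exam questions), and covers nine task categories across 24 configs, including named entity recognition, reading comprehension, and text generation. Its core aim is to fill the gap in standardized model evaluation for Armenian: by integrating with the LightEval framework, it provides infrastructure for quantifying model performance on a low-resource language and supports more equitable development of language technology in multilingual settings.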

Current challenges

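The dataset addresses the challenge of standardized evaluation for Armenian as a low-resource language, in particular the difficulty of measuring model performance on complex semantic understanding, adaptation to cross-cultural context, and multi-domain knowledge reasoning. Building it required overcoming several obstacles. First, high-quality bilingual data is scarce, so international benchmarks had to be professionally translated and culturally adapted to keep the language accurate and locally relevant. Second, annotations of native data must be consistent, especially for named entity recognition and exam questions, which calls for strict annotation guidelines to keep the data reliable. Third, task diversity has to be balanced: with limited sample sizes, the benchmark must still represent reading, generation, classification, and other tasks without biasing the evaluation.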

Common scenarios

Typical use cases

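In Armenian natural language processing, ArmBench-LLM-data serves as a multi-task evaluation benchmark whose typical use is zero-shot and few-shot performance evaluation of large language models. The dataset combines diverse tasks such as reading comprehension, classification, and generation: for example, the belebele-in-context-mcqa config evaluates multilingual passage comprehension, and the pioner config evaluates named entity recognition. These tasks are designed to approximate real language-processing conditions and test a model's abilities across domains, and they are particularly suited to efficient automated evaluation within the LightEval framework.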
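
For the `pioner` config, each row pairs `tokens` with string `ner_tags`. The sketch below groups tags into entity spans, assuming the tags follow a BIO scheme (`B-PER`, `I-PER`, `O`, and so on); the card does not state the tagging scheme, so that is an assumption, and the tokens shown are English placeholders rather than dataset content.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag, or an I- tag of a different type, opens a new span.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" (or any other tag) ends the open span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Placeholder example, assuming BIO tags.
tokens = ["Anna", "Petrosyan", "lives", "in", "Yerevan", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('PER', 'Anna Petrosyan'), ('LOC', 'Yerevan')]
```

Comparing predicted spans with gold spans as sets then gives span-level precision and recall, which is a common way to score NER, though it may differ from the scoring used in the official suite.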

Derived and related work

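Work building on this dataset mainly focuses on extending the capabilities of Armenian language models and on benchmark design. For example, named entity recognition research based on the pioner config has advanced information extraction methods for low-resource languages, while exam tasks such as exam_math have prompted exploration of adaptive evaluation models in education. The dataset's multi-task structure has also encouraged the development of cross-task transfer learning frameworks, laying methodological groundwork for a more complete Armenian evaluation ecosystem.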

Recent research on the dataset