Metric-AI/ArmBench-LLM-data
Source: Hugging Face · Updated 2026-04-01 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/Metric-AI/ArmBench-LLM-data
Resource description:
---
dataset_info:
- config_name: belebele-in-context-mcqa
  features:
  - name: flores_passage
    dtype: string
  - name: question
    dtype: string
  - name: mc_answer1
    dtype: string
  - name: mc_answer2
    dtype: string
  - name: mc_answer3
    dtype: string
  - name: mc_answer4
    dtype: string
  - name: correct_answer_num
    dtype: int64
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 68068
    num_examples: 50
  download_size: 44784
  dataset_size: 68068
- config_name: conversation-in-context-qa
  features:
  - name: label
    dtype: int64
  - name: dialogue
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 57968
    num_examples: 50
  download_size: 37784
  dataset_size: 57968
- config_name: conversational-sum
  features:
  - name: dialogue
    dtype: string
  - name: summary
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 87533
    num_examples: 100
  download_size: 49163
  dataset_size: 87533
- config_name: email-sum
  features:
  - name: email
    dtype: string
  - name: summary
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 168313
    num_examples: 100
  download_size: 79580
  dataset_size: 168313
- config_name: exam_history
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 60877
    num_examples: 70
  download_size: 33961
  dataset_size: 60877
- config_name: exam_literature
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 50137
    num_examples: 69
  download_size: 30131
  dataset_size: 50137
- config_name: exam_math
  features:
  - name: task
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: label
    list: string
  - name: task_type
    dtype: int64
  splits:
  - name: train
    num_bytes: 16378
    num_examples: 65
  download_size: 9130
  dataset_size: 16378
- config_name: finer
  features:
  - name: text
    dtype: string
  - name: gold_entities
    list:
      list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 377941
    num_examples: 100
  download_size: 167198
  dataset_size: 377941
- config_name: include-mcqa
  features:
  - name: question
    dtype: string
  - name: option_a
    dtype: string
  - name: option_b
    dtype: string
  - name: option_c
    dtype: string
  - name: option_d
    dtype: string
  - name: answer
    dtype: int64
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 26576
    num_examples: 50
  download_size: 18680
  dataset_size: 26576
- config_name: mmlu_pro
  features:
  - name: question_id
    dtype: int64
  - name: question
    dtype: string
  - name: options
    list: string
  - name: answer
    dtype: string
  - name: answer_index
    dtype: int64
  - name: cot_content
    dtype: string
  - name: category
    dtype: string
  - name: src
    dtype: string
  - name: question_arm
    dtype: string
  - name: options_arm
    list: string
  splits:
  - name: train
    num_bytes: 1862353
    num_examples: 999
  download_size: 885104
  dataset_size: 1862353
- config_name: ms-marco-in-context-qa
  features:
  - name: armenian
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 47281
    num_examples: 50
  download_size: 21979
  dataset_size: 47281
- config_name: paraphrase
  features:
  - name: text
    dtype: string
  - name: paraphrases
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 139251
    num_examples: 100
  download_size: 51659
  dataset_size: 139251
- config_name: pioner
  features:
  - name: tokens
    list: string
  - name: ner_tags
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 49385
    num_examples: 100
  download_size: 20356
  dataset_size: 49385
- config_name: pos
  features:
  - name: form
    dtype: string
  - name: upos_en
    dtype: string
  - name: upos_hy
    dtype: string
  splits:
  - name: train
    num_bytes: 3937
    num_examples: 100
  download_size: 2682
  dataset_size: 3937
- config_name: public-services-mcqa
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: distractors
    list: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 50622
    num_examples: 45
  download_size: 30293
  dataset_size: 50622
- config_name: punctuation
  features:
  - name: orig_index
    dtype: int64
  - name: gold
    dtype: string
  - name: corrupted_punctuation
    dtype: string
  splits:
  - name: train
    num_bytes: 28053
    num_examples: 100
  download_size: 19398
  dataset_size: 28053
- config_name: scientific-in-context-mcqa
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: correct_answer
    dtype: string
  - name: distractor1
    dtype: string
  - name: distractor2
    dtype: string
  - name: distractor3
    dtype: string
  - name: orig_index
    dtype: int64
  - name: choices
    list: string
  - name: gold_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 62239
    num_examples: 50
  download_size: 40010
  dataset_size: 62239
- config_name: sentiment
  features:
  - name: text
    dtype: string
  - name: sentiment_categories
    list: string
  splits:
  - name: train
    num_bytes: 26089
    num_examples: 100
  download_size: 15020
  dataset_size: 26089
- config_name: simpleqa
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 413394
    num_examples: 50
  download_size: 184583
  dataset_size: 413394
- config_name: space_fix
  features:
  - name: orig_index
    dtype: int64
  - name: gold
    dtype: string
  - name: corrupted_spaces
    dtype: string
  splits:
  - name: train
    num_bytes: 308468
    num_examples: 100
  download_size: 165606
  dataset_size: 308468
- config_name: squad-in-context-qa
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 87951
    num_examples: 50
  download_size: 52797
  dataset_size: 87951
- config_name: syndarin-in-context-mcqa
  features:
  - name: paragraph
    dtype: string
  - name: question
    dtype: string
  - name: answer_candidate_1
    dtype: string
  - name: answer_candidate_2
    dtype: string
  - name: answer_candidate_3
    dtype: string
  - name: answer_candidate_4
    dtype: string
  - name: correct_answer
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 65246
    num_examples: 50
  download_size: 43091
  dataset_size: 65246
- config_name: topic-14class
  features:
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: orig_index
    dtype: int64
  splits:
  - name: train
    num_bytes: 82361
    num_examples: 280
  download_size: 44064
  dataset_size: 82361
- config_name: translation_short_sentences
  features:
  - name: eng
    dtype: string
  - name: hy
    dtype: string
  splits:
  - name: train
    num_bytes: 6373
    num_examples: 100
  download_size: 5931
  dataset_size: 6373
configs:
- config_name: belebele-in-context-mcqa
  data_files:
  - split: train
    path: belebele-in-context-mcqa/train-*
- config_name: conversation-in-context-qa
  data_files:
  - split: train
    path: conversation-in-context-qa/train-*
- config_name: conversational-sum
  data_files:
  - split: train
    path: conversational-sum/train-*
- config_name: email-sum
  data_files:
  - split: train
    path: email-sum/train-*
- config_name: exam_history
  data_files:
  - split: train
    path: exam_history/train-*
- config_name: exam_literature
  data_files:
  - split: train
    path: exam_literature/train-*
- config_name: exam_math
  data_files:
  - split: train
    path: exam_math/train-*
- config_name: finer
  data_files:
  - split: train
    path: finer/train-*
- config_name: include-mcqa
  data_files:
  - split: train
    path: include-mcqa/train-*
- config_name: mmlu_pro
  data_files:
  - split: train
    path: mmlu_pro/train-*
- config_name: ms-marco-in-context-qa
  data_files:
  - split: train
    path: ms-marco-in-context-qa/train-*
- config_name: paraphrase
  data_files:
  - split: train
    path: paraphrase/train-*
- config_name: pioner
  data_files:
  - split: train
    path: pioner/train-*
- config_name: pos
  data_files:
  - split: train
    path: pos/train-*
- config_name: public-services-mcqa
  data_files:
  - split: train
    path: public-services-mcqa/train-*
- config_name: punctuation
  data_files:
  - split: train
    path: punctuation/train-*
- config_name: scientific-in-context-mcqa
  data_files:
  - split: train
    path: scientific-in-context-mcqa/train-*
- config_name: sentiment
  data_files:
  - split: train
    path: sentiment/train-*
- config_name: simpleqa
  data_files:
  - split: train
    path: simpleqa/train-*
- config_name: space_fix
  data_files:
  - split: train
    path: space_fix/train-*
- config_name: squad-in-context-qa
  data_files:
  - split: train
    path: squad-in-context-qa/train-*
- config_name: syndarin-in-context-mcqa
  data_files:
  - split: train
    path: syndarin-in-context-mcqa/train-*
- config_name: topic-14class
  data_files:
  - split: train
    path: topic-14class/train-*
- config_name: translation_short_sentences
  data_files:
  - split: train
    path: translation_short_sentences/train-*
license: mit
language:
- hy
---
# lighteval-armenian
**Armenian LLM Evaluation Benchmark for LightEval**
## Dataset Description
This is a multi-task benchmark created specifically to evaluate Large Language Models on **Armenian** (`hy`) language capabilities. It was developed to add full native Armenian support to the [LightEval](https://github.com/huggingface/lighteval) framework by Hugging Face.
The benchmark contains only the tasks currently used in the official Armenian evaluation suite. It mixes:
- Translated/adapted versions of popular benchmarks (MMLU-Pro, Belebele, SQuAD, MS MARCO, INCLUDE, etc.)
- Native Armenian datasets (pioNER, national exams, public-services style tasks, punctuation/space normalization, etc.)
- Custom or newly created tasks for summarization, generation, and text processing
**Languages**: Primarily Armenian. Some configs are bilingual (English + Armenian) or contain parallel data.
**Intended Use**
Fast, reliable zero-shot / few-shot evaluation inside LightEval. Tasks are grouped into categories (see below).
## Task Categories & Metrics
The benchmark is organized into the following evaluation categories:
| Category | Tasks (config names) |
|-----------------------|-----------------------------------------------------------|
| **NER** | finer, pioner |
| **POS** | pos |
| **Reading Comprehension** | squad-in-context-qa, belebele-in-context-mcqa, conversation-in-context-qa, public-services-mcqa, ms-marco-in-context-qa |
| **Classification** | include-mcqa, syndarin-in-context-mcqa, topic-14class, scientific-in-context-mcqa, sentiment |
| **Generation** | email-sum, conversational-sum, simpleqa, paraphrase |
| **Translation** | translation_short_sentences |
| **Exams** | exam_math, exam_literature, exam_history |
| **Text Processing** | punctuation, space_fix |
| **MMLU** | mmlu_pro |
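
The table above can also be written as a plain mapping, which is convenient when running one category at a time. This is an illustrative sketch only: `TASK_CATEGORIES`, `ALL_CONFIGS`, and `category_of` are hypothetical names and are not part of the dataset or of LightEval.

```python
# Category -> config names, copied from the table above.
TASK_CATEGORIES = {
    "NER": ["finer", "pioner"],
    "POS": ["pos"],
    "Reading Comprehension": [
        "squad-in-context-qa",
        "belebele-in-context-mcqa",
        "conversation-in-context-qa",
        "public-services-mcqa",
        "ms-marco-in-context-qa",
    ],
    "Classification": [
        "include-mcqa",
        "syndarin-in-context-mcqa",
        "topic-14class",
        "scientific-in-context-mcqa",
        "sentiment",
    ],
    "Generation": ["email-sum", "conversational-sum", "simpleqa", "paraphrase"],
    "Translation": ["translation_short_sentences"],
    "Exams": ["exam_math", "exam_literature", "exam_history"],
    "Text Processing": ["punctuation", "space_fix"],
    "MMLU": ["mmlu_pro"],
}

# Flat list of every config name, in table order.
ALL_CONFIGS = [name for configs in TASK_CATEGORIES.values() for name in configs]


def category_of(config_name: str) -> str:
    """Return the category a config belongs to, or raise KeyError."""
    for category, configs in TASK_CATEGORIES.items():
        if config_name in configs:
            return category
    raise KeyError(config_name)
```

Each name in `ALL_CONFIGS` can be passed as the second argument to `load_dataset`.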
## Configurations / Subsets
All configs ship a single `train` split sized for fast evaluation: most hold 45–100 examples, while `topic-14class` has 280 and `mmlu_pro` has 999. The exact config names you can load are:
### NER
- **finer**: Fine-grained / nested Named Entity Recognition task (`text` + `gold_entities` list of lists).
- **pioner**: **pioNER** — Gold-standard Named Entity Recognition dataset for Armenian (`tokens` + `ner_tags`).
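
For token-level data such as `pioner`, it is often easier to score entity spans than individual tags. The sketch below groups tokens into entities under the assumption that `ner_tags` follows a BIO scheme such as `B-PER` / `I-PER` / `O`; the card does not state the tag format, so inspect a few rows before relying on it. `bio_to_spans` and the example row are hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes tags look like "B-PER", "I-PER", "O"; adjust if the data differs.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts (a stray I- tag is treated like B-).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" or anything unrecognised closes the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical row shaped like a `pioner` example (not real data).
row = {
    "tokens": ["Հովհաննես", "Թումանյանը", "ծնվել", "է", "Դսեղում"],
    "ner_tags": ["B-PER", "I-PER", "O", "O", "B-LOC"],
}
print(bio_to_spans(row["tokens"], row["ner_tags"]))
```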
### POS Tagging
- **pos**: Part-of-Speech tagging using Universal Dependencies tags (`form`, `upos_en`, `upos_hy`).
### Reading Comprehension
- **squad-in-context-qa**: In-context extractive QA adapted from SQuAD (`context`, `question`, `answer`).
- **belebele-in-context-mcqa**: In-context multiple-choice QA from the multilingual **Belebele** benchmark (FLORES passages).
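- **conversation-in-context-qa**: Multiple-choice QA over conversations (`dialogue`, `question`, `choices`, `label`).
- **public-services-mcqa**: Multiple-choice QA adapted from the Armenian public service **Hartak.am** (`question`, `answer`, `distractors`).
- **ms-marco-in-context-qa**: In-context question answering adapted from MS MARCO (each example is stored in a single `armenian` text field).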
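
Configs such as `public-services-mcqa` store the correct `answer` separately from the `distractors`, so the options have to be merged and shuffled before prompting. The following is a minimal sketch of one way to do that reproducibly; the prompt layout, `build_mcqa`, and the example row are assumptions for illustration and do not describe how LightEval formats this task.

```python
import random
from typing import Dict, List, Tuple

LETTERS = "ABCDEFGHIJ"


def build_mcqa(row: Dict, seed: int = 0) -> Tuple[str, str]:
    """Return (prompt, gold_letter) for a row with `answer` and `distractors`."""
    choices: List[str] = [row["answer"], *row["distractors"]]
    # Seed per row so the shuffle is reproducible but the gold position varies.
    rng = random.Random(f"{seed}-{row['orig_index']}")
    rng.shuffle(choices)
    gold_letter = LETTERS[choices.index(row["answer"])]
    lines = [row["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines), gold_letter


# Hypothetical row with the same field names as `public-services-mcqa`.
example = {
    "question": "Which document is needed?",
    "answer": "Passport",
    "distractors": ["Library card", "Bus ticket", "Receipt"],
    "orig_index": 7,
}
prompt, gold = build_mcqa(example)
print(prompt)
print("gold:", gold)
```

Seeding on `orig_index` keeps the option order stable across runs, which matters when comparing models on so few examples.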
### Classification
- **include-mcqa**: Subset of the **INCLUDE** benchmark — real multilingual exam-style multiple-choice questions (Armenian version).
- **syndarin-in-context-mcqa**: In-context MCQA from **SynDARin** (high-quality synthesized reasoning dataset for low-resource languages).
- **topic-14class**: Text classification into 14 topic categories (`category` + `text`).
- **scientific-in-context-mcqa**: Scientific-domain in-context multiple-choice reading comprehension.
- **sentiment**: Multi-category sentiment analysis (`text` + `sentiment_categories`).
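
For single-label configs such as `topic-14class`, scoring reduces to comparing a predicted label with the gold `category`. The sketch below shows plain accuracy plus a label count for checking class balance. The helper names and the label strings in the example are hypothetical, and the metric actually used by the evaluation suite may differ.

```python
from collections import Counter
from typing import Dict, Iterable, List


def accuracy(rows: Iterable[Dict], predictions: List[str]) -> float:
    """Share of rows whose predicted label equals the gold `category`."""
    rows = list(rows)
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    if not rows:
        return 0.0
    hits = sum(pred.strip() == row["category"] for row, pred in zip(rows, predictions))
    return hits / len(rows)


def label_counts(rows: Iterable[Dict]) -> Counter:
    """Count gold labels, e.g. to check how balanced the classes are."""
    return Counter(row["category"] for row in rows)


# Hypothetical rows; the real label set is defined by the dataset.
rows = [
    {"category": "sports", "text": "..."},
    {"category": "politics", "text": "..."},
    {"category": "sports", "text": "..."},
]
print(accuracy(rows, ["sports", "sports", "sports "]))
print(label_counts(rows))
```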
### Generation / Summarization
- **email-sum**: Summarization of email content (`email` + `summary`).
- **conversational-sum**: Conversation/dialogue summarization (`dialogue` + `summary`).
- **simpleqa**: Simple question answering (`question` + `answer`).
- **paraphrase**: Paraphrase generation or detection (`text` + `paraphrases` list).
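
As a quick sanity check on summarization output, a generated summary can be compared with the reference `summary` by unigram overlap. The sketch below is a simplified ROUGE-1-style F1 using whitespace tokenization. It is an assumption for illustration, not the metric the benchmark reports, and it treats punctuation attached to a word as part of that word.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace-separated, lowercased tokens (ROUGE-1-like)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("a b c", "a b d"))
```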
### Translation
- **translation_short_sentences**: Parallel English ↔ Armenian short sentences for translation evaluation (`eng` + `hy`).
### Exams (Armenian National / Educational)
- **exam_math**: Mathematics questions from Armenian exams (`task`, `question`, `choices`, `label`, `task_type`).
- **exam_literature**: Literature questions from Armenian exams (`question`, `context`, `choices`, `label`, `task_type`).
- **exam_history**: History questions from Armenian exams (same fields as `exam_literature`).
### Text Processing / Normalization
- **punctuation**: Punctuation restoration (`gold` vs `corrupted_punctuation`).
- **space_fix**: Correction of spacing/tokenization errors (`gold` vs `corrupted_spaces`).
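
Both text-processing configs pair a corrupted input with a `gold` target, so a model's output can be scored directly against `gold`. The sketch below reports exact match and a character-level similarity from the standard library's `difflib`. The helper names are hypothetical, and the check that a `space_fix` pair differs only in whitespace is an assumption about the data that should be verified on real rows.

```python
from difflib import SequenceMatcher


def score_restoration(prediction: str, gold: str) -> dict:
    """Compare a restored string with the gold string."""
    return {
        "exact_match": prediction.strip() == gold.strip(),
        "char_similarity": SequenceMatcher(None, prediction, gold).ratio(),
    }


def same_ignoring_spaces(a: str, b: str) -> bool:
    """True if two strings differ only in whitespace."""
    return "".join(a.split()) == "".join(b.split())


# Hypothetical pair shaped like a `space_fix` row (not real data).
pair = {"gold": "Բարև, աշխարհ։", "corrupted_spaces": "Բար և,աշխարհ։"}
print(same_ignoring_spaces(pair["gold"], pair["corrupted_spaces"]))
print(score_restoration(pair["corrupted_spaces"], pair["gold"]))
```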
### MMLU (Advanced Knowledge)
- **mmlu_pro**: Challenging **MMLU-Pro** benchmark adapted to Armenian. The Armenian text is in `question_arm` and `options_arm`, alongside the original `question` and `options`.
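
Because `mmlu_pro` carries both the original and the Armenian text, the same row can be rendered in either language. The sketch below builds a lettered prompt and the gold letter. It assumes `answer_index` is the 0-based position of the correct option, as in the upstream MMLU-Pro release; `format_mmlu_pro`, the prompt layout, and the example row are hypothetical.

```python
from typing import Dict, Tuple

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have at most ten options


def format_mmlu_pro(row: Dict, armenian: bool = True) -> Tuple[str, str]:
    """Return (prompt, gold_letter) using the Armenian or the original fields."""
    question = row["question_arm"] if armenian else row["question"]
    options = row["options_arm"] if armenian else row["options"]
    lines = [question]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Answer:")
    # Assumption: `answer_index` is the 0-based position of the correct option.
    return "\n".join(lines), LETTERS[row["answer_index"]]


# Hypothetical row with the same field names as `mmlu_pro` (not real data).
sample = {
    "question": "2 + 2 = ?",
    "options": ["3", "4", "5"],
    "question_arm": "2 + 2 = ?",
    "options_arm": ["3", "4", "5"],
    "answer": "B",
    "answer_index": 1,
}
prompt, gold = format_mmlu_pro(sample)
print(prompt)
print("gold:", gold)
```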
## Data Fields
Fields vary by config. See the `dataset_info` block in the card metadata above, or load a config and inspect its features.
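
Since each config has its own schema, a small check before evaluation can catch field-name mismatches early. The sketch below lists expected fields for a few configs, copied from the `dataset_info` metadata, and compares them with a row. `EXPECTED_FIELDS` and `check_row` are hypothetical helpers, and only five configs are shown.

```python
# Expected field names per config, taken from the card's `dataset_info` block.
EXPECTED_FIELDS = {
    "squad-in-context-qa": {"context", "question", "answer", "orig_index"},
    "pioner": {"tokens", "ner_tags", "orig_index"},
    "email-sum": {"email", "summary", "orig_index"},
    "translation_short_sentences": {"eng", "hy"},
    "punctuation": {"orig_index", "gold", "corrupted_punctuation"},
}


def check_row(config_name: str, row: dict) -> list:
    """Return a list of problems; an empty list means the row looks as expected."""
    expected = EXPECTED_FIELDS[config_name]
    present = set(row)
    problems = []
    missing = expected - present
    extra = present - expected
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    return problems


print(check_row("translation_short_sentences", {"eng": "Hello", "hy": "Բարև"}))
```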
## Loading the Dataset
```python
from datasets import load_dataset

REPO = "Metric-AI/ArmBench-LLM-data"

# Each task is a separate config, and every config ships a single "train" split.
mmlu_pro = load_dataset(REPO, "mmlu_pro", split="train")
pioner = load_dataset(REPO, "pioner", split="train")
public_services = load_dataset(REPO, "public-services-mcqa", split="train")
```
## Dataset Creation & Sources
- Translated benchmarks (MMLU-Pro, Belebele, SQuAD, MS MARCO, INCLUDE, SynDARin, etc.): professionally translated and culturally validated.
- Native Armenian resources: pioNER, national exam questions, punctuation/space tasks, and custom generation/summarization data collected from public sources.
## Ethical Considerations & Limitations
- Subsets are deliberately small for speed and reproducibility: most hold 45–100 examples (`topic-14class` has 280 and `mmlu_pro` has 999).
- Translation and adaptation quality was prioritized, but minor cultural mismatches may remain.
- Exam data reflects real Armenian educational content.
Provider:
Metric-AI
Collected summary
Dataset introduction
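Construction
ArmBench-LLM-data is built by combining multiple sources with professional adaptation for Armenian natural language processing. It merges translated and adapted versions of mainstream international benchmarks with native Armenian corpora, covering task types such as named entity recognition, reading comprehension, and text classification. Concretely, benchmarks such as MMLU-Pro and Belebele were professionally translated and culturally validated, and the dataset also includes the pioNER entity recognition dataset, national exam questions, and question-answering data based on public services, which keeps the language idiomatic and the tasks varied. Each subset is curated and kept small, mostly 45 to 100 examples (topic-14class has 280 and mmlu_pro has 999), so that evaluation stays fast while the data remains representative.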
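Features
The dataset is designed specifically for evaluating large language models on Armenian and aims to be both broad and targeted. It spans tasks from basic language processing to higher-level knowledge reasoning, including text generation, translation, exam question answering, and text normalization. It is organized into modular configs, each matching one evaluation scenario: mmlu_pro covers complex knowledge QA, pioner supports entity recognition, and punctuation and space_fix focus on text repair. This structure supports efficient zero-shot and few-shot evaluation and, by combining bilingual parallel data with localized content, tests model ability in a low-resource language along several dimensions.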
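Usage
In practice, the dataset can be loaded directly with the Hugging Face datasets library and evaluated in a standardized way through the LightEval framework. Users pick a config name as needed, for example loading mmlu_pro to test knowledge-intensive question answering, or pioner to analyze named entity recognition performance. Each subset has a clear field structure: reading comprehension tasks provide a context, a question, and an answer, while summarization tasks pair a source text with its summary. Prompt templates and sampling strategies can be set flexibly during evaluation, so that a model's generalization and robustness on Armenian tasks can be measured systematically and models can be compared against a common benchmark.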
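Background and challenges
Background overview
In natural language processing, building evaluation benchmarks for low-resource languages is a key step toward making language technology broadly accessible. ArmBench-LLM-data was created in recent years by the Metric-AI team to give Armenian (hy) a comprehensive, multi-task evaluation benchmark for large language models. It combines tasks translated from well-known international benchmarks (such as MMLU-Pro and Belebele) with native corpora (such as pioNER and national exam questions), covering named entity recognition, reading comprehension, text generation, and other abilities across 24 task configs in nine categories. Its central aim is to fill the gap in standardized model evaluation for Armenian. By integrating with the LightEval framework, it provides infrastructure for quantifying model performance in a low-resource language and supports fairer development of language technology in multilingual settings.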
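Current challenges
The dataset addresses the lack of standardized evaluation for Armenian as a low-resource language, including how to measure model performance on complex semantic understanding, adaptation to cross-cultural context, and multi-domain knowledge reasoning. Building it meant overcoming several obstacles. First, high-quality bilingual data is scarce, so international benchmarks had to be professionally translated and culturally adapted to keep the language accurate and locally relevant. Second, native data needs consistent annotation, especially for named entity recognition and exam questions, which calls for strict annotation guidelines. Third, task diversity has to be balanced: with limited sample sizes, reading, generation, classification, and other tasks must all be represented well enough to avoid evaluation bias.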
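Common use cases
Typical use cases
As a multi-task benchmark for Armenian natural language processing, ArmBench-LLM-data is typically used for zero-shot and few-shot evaluation of large language models. It brings together reading comprehension, classification, generation, and other tasks: for example, the belebele-in-context-mcqa config evaluates multilingual passage comprehension, and the pioner config evaluates named entity recognition. The tasks are meant to approximate real language-processing settings and test a model's abilities across domains, and they are particularly suited to efficient, automated evaluation within the LightEval framework.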
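Derived and related work
Work building on the dataset centers on extending Armenian model capabilities and on benchmark design. For example, named entity recognition research based on the pioner config advances information extraction methods for low-resource languages, while exam tasks such as exam_math have prompted exploration of adaptive evaluation models in education. The dataset's multi-task structure also encourages the development of cross-task transfer learning frameworks and lays methodological groundwork for a more complete Armenian evaluation ecosystem.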
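Recent research on the dataset
Latest research directions
In the evaluation of large models for low-resource languages, ArmBench-LLM-data supports current work on assessing Armenian capabilities. By combining translated benchmarks with native tasks, it gives Armenian large language models a multi-dimensional evaluation framework. Current research focuses on cross-lingual transfer learning and few-shot evaluation, using localized versions of complex knowledge tasks such as MMLU-Pro to study how well models' reasoning generalizes to a low-resource language. Text-processing tasks such as punctuation restoration and space correction are also becoming an active direction for making Armenian NLP more practical. This research supports the Armenian digital ecosystem and offers a model for evaluating large language models in other low-resource languages.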
The content above was collected and summarized by 遇见数据集.