masakhane/uhura-arc-easy
Source: Hugging Face. Updated 2024-12-03; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/masakhane/uhura-arc-easy
Resource description:
---
license: mit
language:
- am
- en
- nso
- ha
- sw
- yo
- zu
size_categories:
- 1K<n<10K
multilinguality:
- multilingual
pretty_name: Uhura-Arc-Easy
language_details: am, en, ha, nso, sw, yo, zu
tags:
- uhura
- arc-easy
- arc
task_categories:
- multiple-choice
- question-answering
task_ids:
- multiple-choice-qa
configs:
- config_name: am_multiple_choice
  data_files:
  - split: train
    path: am_train.json
  - split: test
    path: am_test.json
  - split: validation
    path: am_dev.json
- config_name: en_multiple_choice
  data_files:
  - split: train
    path: en_train.json
  - split: test
    path: en_test.json
  - split: validation
    path: en_dev.json
- config_name: ha_multiple_choice
  data_files:
  - split: train
    path: ha_train.json
  - split: test
    path: ha_test.json
  - split: validation
    path: ha_dev.json
- config_name: nso_multiple_choice
  data_files:
  - split: train
    path: nso_train.json
  - split: test
    path: nso_test.json
  - split: validation
    path: nso_dev.json
- config_name: nso_multiple_choice_unmatched
  data_files:
  - split: train
    path: nso_train_unmatched.json
  - split: test
    path: nso_test_unmatched.json
- config_name: sw_multiple_choice
  data_files:
  - split: train
    path: sw_train.json
  - split: test
    path: sw_test.json
  - split: validation
    path: sw_dev.json
- config_name: yo_multiple_choice
  data_files:
  - split: train
    path: yo_train.json
  - split: test
    path: yo_test.json
  - split: validation
    path: yo_dev.json
- config_name: zu_multiple_choice
  data_files:
  - split: train
    path: zu_train.json
  - split: test
    path: zu_test.json
---
# Dataset Card for Uhura-Arc-Easy
## Dataset Summary
Uhura-ARC-Easy is a multilingual version of ARC-Easy, a widely recognized scientific question answering benchmark composed of multiple-choice science questions derived from grade-school examinations that test various styles of knowledge and reasoning.
The original English version of the benchmark originates from [Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457) (Clark et al., 2018) and is divided into "Challenge" and "Easy" subsets, with 2,590 and 5,197 questions, respectively.
We translated a subset of ARC-Easy into six low-resource African languages using professional human translators. Relying on human translators for this evaluation increases confidence in the accuracy of the translations.
You can find more details about the dataset in our paper [Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages](https://arxiv.org/abs/2412.00948).
## Languages
Uhura includes six widely spoken Sub-Saharan African languages, representing millions of speakers across the continent: Amharic, Hausa, Northern Sotho (Sepedi), Swahili, Yoruba, and Zulu.
## Dataset Structure
### Data Instances
In each `*_multiple_choice` configuration, an instance contains an id, a question, a set of answer choices with their labels, and an answer key.
```python
{
    "id": "Mercury_7072328",
    "question": "Ìdí ago ẹnu ọ̀nà ní pàtó ni láti sọ agbára iná ẹ̀lẹ̀tírìkì di?",
    "choices": {
        "label": [ "A", "B", "C", "D" ],
        "text": [ "Ohùn", "Ìrìn", "Agbára Iná", "Agbára Kẹ́míkà" ]
    },
    "answerKey": "A",
}
```
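As a minimal sketch of how such an instance might be consumed, the snippet below renders the question with its labelled choices and looks up the text of the gold answer. The helper names `format_prompt` and `gold_answer_text` are hypothetical and not part of the dataset or any library; the prompt layout is only one possible choice.

```python
# The instance shown above, copied from the card.
instance = {
    "id": "Mercury_7072328",
    "question": "Ìdí ago ẹnu ọ̀nà ní pàtó ni láti sọ agbára iná ẹ̀lẹ̀tírìkì di?",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": ["Ohùn", "Ìrìn", "Agbára Iná", "Agbára Kẹ́míkà"],
    },
    "answerKey": "A",
}


def format_prompt(example: dict) -> str:
    """Render the question followed by one 'LABEL. text' line per choice.

    Hypothetical helper: the layout is an illustrative assumption.
    """
    labels = example["choices"]["label"]
    texts = example["choices"]["text"]
    lines = [example["question"]]
    lines += [f"{label}. {text}" for label, text in zip(labels, texts)]
    return "\n".join(lines)


def gold_answer_text(example: dict) -> str:
    """Return the text of the choice whose label equals `answerKey`."""
    index = example["choices"]["label"].index(example["answerKey"])
    return example["choices"]["text"][index]


print(format_prompt(instance))
print("Gold answer:", gold_answer_text(instance))
```

Because `label` and `text` are parallel lists, the answer is resolved by position rather than by assuming the labels are always `A` to `D`.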
### Data Fields
- `id`: a `string` feature.
- `question`: a `string` feature.
- `choices`: a dictionary feature containing:
  - `text`: a list of `string` features, one per answer choice.
  - `label`: a list of `string` features, parallel to `text`.
- `answerKey`: a `string` feature matching one entry of `label`.
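The field layout above can be checked programmatically. The following is a sketch only: `validate_record` is a hypothetical helper, and it assumes that `label` and `text` are parallel lists of strings, as in the example instance.

```python
from typing import List


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none).

    Hypothetical helper; assumes `choices.label` and `choices.text`
    are parallel lists of strings.
    """
    problems = []
    for field in ("id", "question", "answerKey"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field} should be a string")

    choices = record.get("choices")
    if not isinstance(choices, dict):
        problems.append("choices should be a dictionary")
        return problems

    labels = choices.get("label")
    texts = choices.get("text")
    for name, value in (("label", labels), ("text", texts)):
        is_str_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if not is_str_list:
            problems.append(f"choices.{name} should be a list of strings")

    if isinstance(labels, list) and isinstance(texts, list):
        if len(labels) != len(texts):
            problems.append("choices.label and choices.text differ in length")
        answer = record.get("answerKey")
        if isinstance(answer, str) and answer not in labels:
            problems.append("answerKey is not one of choices.label")
    return problems


# Illustrative record with placeholder content (not taken from the dataset).
sample = {
    "id": "example-1",
    "question": "placeholder question",
    "choices": {"label": ["A", "B"], "text": ["first", "second"]},
    "answerKey": "B",
}
print(validate_record(sample))
```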
### Data Splits
| name |train|dev|test|
|-------------|----:|--:|---:|
|am | 656| 92| 491|
|ha | 655| 93| 452|
|nso | 440| 3| 509|
|sw | 650| 90| 491|
|yo | 659| 93| 494|
|zu | 909| 0| 300|
*Note: Numbers vary across languages due to differences in the number of questions that can be translated for each language.*
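Not every configuration declares every split: in the card's YAML header the `dev` column corresponds to the `validation` split, and `zu_multiple_choice` and `nso_multiple_choice_unmatched` list only `train` and `test`. The sketch below records that mapping and checks it before loading. `available_splits` and `load_split` are hypothetical helper names; the loading step assumes the Hugging Face `datasets` package is installed and that the repository is reachable.

```python
REPO_ID = "masakhane/uhura-arc-easy"

# Splits declared for each config in the card's YAML header.
FULL = ("train", "test", "validation")
CONFIG_SPLITS = {
    "am_multiple_choice": FULL,
    "en_multiple_choice": FULL,
    "ha_multiple_choice": FULL,
    "nso_multiple_choice": FULL,
    "nso_multiple_choice_unmatched": ("train", "test"),
    "sw_multiple_choice": FULL,
    "yo_multiple_choice": FULL,
    "zu_multiple_choice": ("train", "test"),
}


def available_splits(language: str) -> tuple:
    """Return the splits declared for `<language>_multiple_choice`."""
    config = f"{language}_multiple_choice"
    if config not in CONFIG_SPLITS:
        raise KeyError(f"no config named {config!r} in the card")
    return CONFIG_SPLITS[config]


def load_split(language: str, split: str = "test"):
    """Load one split, failing early if the card does not declare it."""
    if split not in available_splits(language):
        raise ValueError(
            f"{language}_multiple_choice has no {split!r} split; "
            f"available: {available_splits(language)}"
        )
    # Imported lazily so the split checks work without the package installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, f"{language}_multiple_choice", split=split)


# Example usage (requires network access):
# yoruba_test = load_split("yo", "test")
# print(len(yoruba_test))
```

Checking the declared splits first gives a clearer error than a failed download when, for example, a validation split is requested for Zulu.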
## Dataset Creation
You can find more details about the dataset creation in our paper [Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages](https://arxiv.org/abs/2412.00948).
### Curation Rationale
From the paper:
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
The Uhura-Arc-Easy dataset is licensed under the [MIT License](https://opensource.org/licenses/MIT).
### Citation
To cite Uhura, please use the following BibTeX entry:
```bibtex
@article{bayes2024uhurabenchmarkevaluatingscientific,
title={Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages},
author={Edward Bayes and Israel Abebe Azime and Jesujoba O. Alabi and Jonas Kgomo and Tyna Eloundou and Elizabeth Proehl and Kai Chen and Imaan Khadir and Naome A. Etori and Shamsuddeen Hassan Muhammad and Choice Mpanza and Igneciah Pocia Thete and Dietrich Klakow and David Ifeoluwa Adelani},
year={2024},
eprint={2412.00948},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.00948},
}
```
### Acknowledgements
This work was supported by OpenAI. We also want to thank our translators, whose contributions made this work possible.
Provider:
masakhane



