large-traversaal/openbookqa_urdu_final
Source: Hugging Face · Updated 2026-03-05 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/large-traversaal/openbookqa_urdu_final
Resource overview:
# Dataset Card: OpenBookQA Urdu
## Dataset Summary
`openbookqa_urdu_cleaned` is a cleaned Urdu translation of the **OpenBookQA** dataset, a multiple-choice question answering benchmark designed to test **elementary science understanding combined with commonsense reasoning**. Each example consists of a question and four answer options, with exactly one correct answer.
The dataset provides Urdu translations of questions and answer choices, enabling evaluation and training of **Urdu and multilingual language models** on scientific reasoning tasks in a low-resource language setting.
## Dataset Details
* **Dataset Name:** openbookqa_urdu_cleaned
* **Maintained by:** large-traversaal (Traversaal.ai)
* **Task Type:** Multiple-choice question answering
* **Domain:** Elementary science and commonsense reasoning
* **Languages:** Urdu (primary), English (where original fields are retained)
* **Format:** Parquet
* **Answer Choices:** 4 per question
## Dataset Structure
Each record typically contains the following fields:
* `id`: Unique example identifier
* `question`: Urdu translation of the question
* `choices`: Answer options (four choices, labeled A–D)
* `answerKey`: Correct answer label (A, B, C, or D)
* `english_question` (optional): Original English question
* `english_choices` (optional): Original English answer options
Exact field names may vary slightly depending on split and preprocessing version.
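Because field names may vary, it can help to check a record against the documented fields before using it. The following is a minimal sketch, not the maintainers' code: the helper names `missing_fields` and `format_prompt` are hypothetical, the sample record is a placeholder rather than real data, and the layout of `choices` (a dict with parallel `label` and `text` lists, as in the original English OpenBookQA release) is an assumption that should be confirmed against the actual data.

```python
# Required fields as documented in this card; optional English fields are not checked.
EXPECTED_FIELDS = ("id", "question", "choices", "answerKey")


def missing_fields(record):
    """Return the documented field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def format_prompt(record):
    """Render a record as a question followed by one labeled option per line.

    Assumes record["choices"] is a dict with parallel "label" and "text" lists
    (an assumption based on the original OpenBookQA layout).
    """
    missing = missing_fields(record)
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    choices = record["choices"]
    lines = [record["question"]]
    for label, text in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {text}")
    return "\n".join(lines)


# Placeholder record for illustration only; not taken from the dataset.
sample = {
    "id": "example-0",
    "question": "<Urdu question text>",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": ["<option 1>", "<option 2>", "<option 3>", "<option 4>"],
    },
    "answerKey": "B",
}

print(format_prompt(sample))
```

If `missing_fields` reports absent names for real records, the field list or the `choices` handling would need to be adjusted to the schema actually published.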
## Intended Uses
This dataset is intended for:
* Training and evaluating Urdu and multilingual QA models
* Benchmarking reasoning performance on science-based questions
* Cross-lingual transfer learning from English to Urdu
* Research in low-resource language understanding and reasoning
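For the evaluation and benchmarking uses listed above, the natural metric is accuracy against `answerKey`. With four options and exactly one correct answer, uniform random guessing yields 25% in expectation, which gives a floor for interpreting scores. The sketch below is illustrative only: the function name `accuracy` is hypothetical, and it assumes model predictions have already been reduced to a single letter label per question.

```python
def accuracy(predictions, records):
    """Fraction of predicted labels that match each record's answerKey.

    predictions: list of label strings such as "A" or "b" (hypothetical model output).
    records: list of dicts that each contain an "answerKey" field.
    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(records):
        raise ValueError("predictions and records must have the same length")
    if not records:
        return 0.0
    correct = sum(
        pred.strip().upper() == rec["answerKey"].strip().upper()
        for pred, rec in zip(predictions, records)
    )
    return correct / len(records)


# Placeholder gold labels and predictions for illustration only.
gold = [{"answerKey": "A"}, {"answerKey": "B"}, {"answerKey": "D"}]
preds = ["A", "b", "C"]
print(f"accuracy = {accuracy(preds, gold):.3f}")
```

Normalizing case and whitespace before comparison avoids penalizing a model for formatting differences rather than for a wrong choice.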
## Loading the Dataset
```python
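from datasets import load_dataset

# Repository ID as written in this card. The page hosting this card lists the
# repository as "large-traversaal/openbookqa_urdu_final", so confirm the ID on
# the Hugging Face Hub before running.
ds = load_dataset("large-traversaal/openbookqa_urdu_cleaned")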
```
## Licensing and Usage
Licensing follows the terms of the original OpenBookQA dataset. Users should verify license details on the Hugging Face dataset page before redistribution or commercial use.
## 📄 Citation
If you use this dataset in your research, please cite the **UrduBench paper**:
```bibtex
@misc{shafique2026urdubenchurdureasoningbenchmark,
title={UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop},
author={Muhammad Ali Shafique and Areej Mehboob and Layba Fiaz and Muhammad Usman Qadeer and Hamza Farooq},
year={2026},
eprint={2601.21000},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.21000}
}
```
Provider:
large-traversaal



