NativeQA-RDP
ModelScope community. Updated 2025-12-04; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/tiiuae/NativeQA-RDP
Resource description:
# 3LM Native STEM Arabic Benchmark - RDP version
## Dataset Summary
The 3LM Native STEM dataset contains 865 multiple-choice questions (MCQs) curated from real Arabic educational sources. It targets middle- to high-school-level content in Biology, Chemistry, Physics, Mathematics, and Geography. This benchmark is designed to evaluate Arabic large language models on structured, domain-specific knowledge.
In this **RDP (Robustness under Distractor Perturbation)** version, 25% of the [Native Benchmark](https://huggingface.co/datasets/tiiuae/NativeQA) samples were modified using targeted distractor strategies. In 20% of the samples, the correct answer was removed and replaced with one of several Arabic equivalents of "none of the above." In another 5%, such a phrase replaced an incorrect option, so it acts as a distractor. The detailed approach is described in [the paper](#code-and-paper).
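The two perturbation strategies can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the phrase list `NOTA_PHRASES`, the function `perturb`, and the per-sample random draw are all assumptions made here, and the benchmark may instead select exact fractions of the dataset. It also assumes that when the correct option is replaced, the "none of the above" option takes over as the right answer and the `correct_choice` label stays the same.

```python
import random

LABELS = ["أ", "ب", "ج", "د"]

# Hypothetical paraphrases of "none of the above"; the benchmark's real list is not given here.
NOTA_PHRASES = ["لا شيء مما سبق", "ليس أي مما سبق", "جميع ما سبق خطأ"]


def perturb(record, rng, p_replace_correct=0.20, p_replace_distractor=0.05):
    """Return a copy of `record` with at most one option swapped for a NOTA phrase."""
    out = dict(record, choices=list(record["choices"]))
    correct_idx = LABELS.index(record["correct_choice"])
    roll = rng.random()
    if roll < p_replace_correct:
        # Strategy 1: overwrite the correct option; NOTA is now the right answer.
        target = correct_idx
    elif roll < p_replace_correct + p_replace_distractor:
        # Strategy 2: overwrite one incorrect option; NOTA is only a distractor.
        wrong = [i for i in range(len(LABELS)) if i != correct_idx]
        target = rng.choice(wrong)
    else:
        return out  # sample left untouched
    out["choices"][target] = f"{LABELS[target]}. {rng.choice(NOTA_PHRASES)}"
    return out


record = {
    "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
    "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
    "correct_choice": "أ",
    "domain": "biology",
    "difficulty": 3,
}
perturbed = perturb(record, random.Random(0))
```

Keeping the label fixed in the first strategy means a model can only score well by recognizing that none of the remaining concrete options is correct.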
## Motivation
While Arabic NLP has seen growth in cultural and linguistic tasks, scientific reasoning remains underrepresented. This dataset fills that gap by using authentic, in-domain Arabic materials to evaluate factual and conceptual understanding.
## Dataset Structure
- `question_text`: Arabic text of the MCQ (fully self-contained)
- `choices`: List of four choices labeled "أ", "ب", "ج", "د"
- `correct_choice`: Correct answer (letter only)
- `domain`: Subject area (e.g., biology, physics)
- `difficulty`: Score from 1 (easy) to 10 (hard)
```json
{
"question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
"choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
"correct_choice": "أ",
"domain": "biology",
"difficulty": 3
}
```
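As a rough sketch of how a record with these fields might be checked and turned into a prompt, the helpers below validate one example and join the question with its options. The names `validate` and `format_prompt` are hypothetical, and the check that every option begins with its label followed by a period is inferred from the example above rather than from a documented rule.

```python
import json

LABELS = ["أ", "ب", "ج", "د"]
REQUIRED = {
    "question_text": str,
    "choices": list,
    "correct_choice": str,
    "domain": str,
    "difficulty": int,
}


def validate(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"{field}: expected {expected.__name__}"
        for field, expected in REQUIRED.items()
        if not isinstance(record.get(field), expected)
    ]
    if problems:
        return problems  # skip deeper checks when basic types are wrong
    if len(record["choices"]) != len(LABELS):
        problems.append("choices: expected four options")
    else:
        for label, choice in zip(LABELS, record["choices"]):
            # Assumption: options are prefixed "<label>." as in the card's example.
            if not (isinstance(choice, str) and choice.startswith(label + ".")):
                problems.append(f"choices: option should start with '{label}.'")
    if record["correct_choice"] not in LABELS:
        problems.append("correct_choice: not one of the four labels")
    if not 1 <= record["difficulty"] <= 10:
        problems.append("difficulty: outside 1-10")
    return problems


def format_prompt(record):
    """Join the question and its options, one per line."""
    return "\n".join([record["question_text"], *record["choices"]])


sample = json.loads("""
{
  "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
  "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
  "correct_choice": "أ",
  "domain": "biology",
  "difficulty": 3
}
""")
print(validate(sample))
print(format_prompt(sample))
```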
## Data Sources
Questions were collected from open-access Arabic textbooks, worksheets, and question banks, which were sourced through web crawling and regex-based filtering.
## Data Curation
1. **OCR Processing**: Dual-stage OCR (text and math), using Pix2Tex to produce LaTeX for mathematical content.
2. **Extraction Pipeline**: LLMs were used to extract Q&A pairs.
3. **Classification**: Questions were tagged by type, domain, and difficulty.
4. **Standardization**: Questions were reformatted as MCQs, and the positions of the correct answers were randomized.
5. **Manual Verification**: All questions were reviewed by Arabic speakers with a STEM background.
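The standardization step, randomizing where the correct answer appears, might look something like the sketch below. It is illustrative only: `shuffle_choices` is a hypothetical helper, and the actual 3LM pipeline may implement the shuffle differently.

```python
import random

LABELS = ["أ", "ب", "ج", "د"]


def shuffle_choices(option_texts, correct_index, rng):
    """Shuffle unlabeled option texts; return (labeled choices, new correct label)."""
    order = list(range(len(option_texts)))
    rng.shuffle(order)  # order[pos] is the source index placed at position pos
    choices = [f"{LABELS[pos]}. {option_texts[src]}" for pos, src in enumerate(order)]
    correct_label = LABELS[order.index(correct_index)]
    return choices, correct_label


texts = ["الأكسجين", "ثاني أكسيد الكربون", "النيتروجين", "الهيدروجين"]
choices, correct = shuffle_choices(texts, correct_index=0, rng=random.Random(42))
print(choices, correct)
```

Shuffling before the labels are attached keeps the labels in their fixed order, so only the content of each position changes and the correct label is recomputed from the permutation.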
## Code and Paper
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850
## Licensing
[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)
## Citation
```bibtex
@article{boussaha2025threeLM,
title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
journal={arXiv preprint arXiv:2507.15850},
year={2025}
}
```
Provider: maas
Created: 2025-10-03



