NativeQA-RDP

Name: NativeQA-RDP
Creator: maas
Published: 2025-12-04 16:51:14
License: No description provided

ModelScope community. Updated 2025-12-04; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/tiiuae/NativeQA-RDP

Description:

# 3LM Native STEM Arabic Benchmark - RDP version

## Dataset Summary

The 3LM Native STEM dataset contains 865 multiple-choice questions (MCQs) curated from real Arabic educational sources. It targets mid- to high-school level content in Biology, Chemistry, Physics, Mathematics, and Geography. This benchmark is designed to evaluate Arabic large language models on structured, domain-specific knowledge.

In this **"RDP - Robustness under Distractor Perturbation"** version, 25% of the [Native Benchmark](https://huggingface.co/datasets/tiiuae/NativeQA) samples were modified using targeted distractor strategies. In 20% of the cases, correct answers were removed and replaced with varied Arabic equivalents of "none of the above." In another 5%, these phrases were inserted as distractors by replacing incorrect options. [Detailed approach can be found in the paper](#code-and-paper).

## Motivation

While Arabic NLP has seen growth in cultural and linguistic tasks, scientific reasoning remains underrepresented. This dataset fills that gap by using authentic, in-domain Arabic materials to evaluate factual and conceptual understanding.

## Dataset Structure

- `question_text`: Arabic text of the MCQ (fully self-contained)
- `choices`: List of four choices labeled "أ", "ب", "ج", "د"
- `correct_choice`: Correct answer (letter only)
- `domain`: Subject area (e.g., biology, physics)
- `difficulty`: Score from 1 (easy) to 10 (hard)

```json
{
  "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
  "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
  "correct_choice": "أ",
  "domain": "biology",
  "difficulty": 3
}
```

## Data Sources

Collected from open-access Arabic textbooks, worksheets, and question banks sourced through web crawling and regex-based filtering.

## Data Curation

1. **OCR Processing**: Dual-stage OCR (text + math) using Pix2Tex for LaTeX support.
2. **Extraction Pipeline**: Used LLMs to extract Q&A pairs.
3. **Classification**: Questions tagged by type, domain, and difficulty.
4. **Standardization**: Reformatted to MCQ and randomized correct answer positions.
5. **Manual Verification**: All questions reviewed by Arabic speakers with STEM background.

## Code and Paper

- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850

## Licensing

[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)

## Citation

```bibtex
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}
```
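
## Usage Sketch

The record layout described under Dataset Structure can be checked and scored with a few lines of Python. The following is a minimal sketch, not code from the 3LM repository: the helper names `validate_record` and `accuracy_by_domain`, the constant names, and the assumption that each choice string begins with its Arabic letter label (as in the sample record) are all illustrative.

```python
from collections import defaultdict

# Labels and field names as listed in the dataset card.
ARABIC_LABELS = ["أ", "ب", "ج", "د"]
REQUIRED_FIELDS = ("question_text", "choices", "correct_choice", "domain", "difficulty")

# The sample record shown in the dataset card.
EXAMPLE_RECORD = {
    "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
    "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
    "correct_choice": "أ",
    "domain": "biology",
    "difficulty": 3,
}


def validate_record(record):
    """Hypothetical helper: return a list of schema problems (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems

    choices = record["choices"]
    if len(choices) != len(ARABIC_LABELS):
        problems.append("expected exactly four choices")
    else:
        # Assumption: every choice string starts with its letter label.
        for label, choice in zip(ARABIC_LABELS, choices):
            if not choice.startswith(label):
                problems.append(f"choice does not start with label {label}")

    if record["correct_choice"] not in ARABIC_LABELS:
        problems.append("correct_choice is not one of the four letters")

    difficulty = record["difficulty"]
    if not (isinstance(difficulty, int) and 1 <= difficulty <= 10):
        problems.append("difficulty outside 1..10")
    return problems


def accuracy_by_domain(records, predictions):
    """Hypothetical helper: fraction of predicted letters matching correct_choice, per domain."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record, predicted in zip(records, predictions):
        totals[record["domain"]] += 1
        if predicted == record["correct_choice"]:
            hits[record["domain"]] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}


if __name__ == "__main__":
    print(validate_record(EXAMPLE_RECORD))
    print(accuracy_by_domain([EXAMPLE_RECORD], ["أ"]))
```

Scoring on the letter alone mirrors the `correct_choice` field, which stores only the label. For the RDP version this means a model is credited only when it picks the letter holding the "none of the above" phrase in perturbed items, so no special handling is needed in the scorer itself.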

Provider:

maas

Created:

2025-10-03
