
NativeQA-RDP

ModelScope Community. Updated 2025-12-04; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/tiiuae/NativeQA-RDP
Resource description:
# 3LM Native STEM Arabic Benchmark - RDP version

## Dataset Summary

The 3LM Native STEM dataset contains 865 multiple-choice questions (MCQs) curated from real Arabic educational sources. It targets mid- to high-school level content in Biology, Chemistry, Physics, Mathematics, and Geography. This benchmark is designed to evaluate Arabic large language models on structured, domain-specific knowledge.

In this **"RDP - Robustness under Distractor Perturbation"** version, 25% of the [Native Benchmark](https://huggingface.co/datasets/tiiuae/NativeQA) samples were modified using targeted distractor strategies. In 20% of the cases, correct answers were removed and replaced with varied Arabic equivalents of “none of the above.” In another 5%, these phrases were inserted as distractors by replacing incorrect options. [Detailed approach can be found in the paper](#code-and-paper).

## Motivation

While Arabic NLP has seen growth in cultural and linguistic tasks, scientific reasoning remains underrepresented. This dataset fills that gap by using authentic, in-domain Arabic materials to evaluate factual and conceptual understanding.

## Dataset Structure

- `question_text`: Arabic text of the MCQ (fully self-contained)
- `choices`: List of four choices labeled "أ", "ب", "ج", "د"
- `correct_choice`: Correct answer (letter only)
- `domain`: Subject area (e.g., biology, physics)
- `difficulty`: Score from 1 (easy) to 10 (hard)

```json
{
  "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
  "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
  "correct_choice": "أ",
  "domain": "biology",
  "difficulty": 3
}
```

## Data Sources

Collected from open-access Arabic textbooks, worksheets, and question banks sourced through web crawling and regex-based filtering.

## Data Curation

1. **OCR Processing**: Dual-stage OCR (text + math) using Pix2Tex for LaTeX support.
2. **Extraction Pipeline**: Used LLMs to extract Q&A pairs.
3. **Classification**: Questions tagged by type, domain, and difficulty.
4. **Standardization**: Reformatted to MCQ and randomized correct answer positions.
5. **Manual Verification**: All questions reviewed by Arabic speakers with STEM background.

## Code and Paper

- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850

## Licensing

[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)

## Citation

```bibtex
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}
```
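
To make the distractor perturbation from the Dataset Summary more concrete, here is a minimal Python sketch of how a 20% / 5% split could be applied to records in the documented schema. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`perturb`, `build_rdp`), the two Arabic phrases in `NOTA_PHRASES`, the random selection of samples, and the choice to place the "none of the above" phrase in the slot of the option it replaces are all hypothetical. The 3LM paper and repository describe the actual procedure.

```python
import random

LABELS = ["أ", "ب", "ج", "د"]

# Hypothetical stand-ins for the "varied Arabic equivalents of none of the above".
NOTA_PHRASES = ["لا شيء مما سبق", "جميع الإجابات خاطئة"]

# Example record copied from the Dataset Structure section.
SAMPLE = {
    "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
    "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
    "correct_choice": "أ",
    "domain": "biology",
    "difficulty": 3,
}


def strip_label(choice):
    """Return the option text without its leading 'letter. ' label."""
    return choice.split(". ", 1)[1]


def perturb(record, mode, rng):
    """Return a perturbed copy of one record; the input is not mutated."""
    texts = [strip_label(c) for c in record["choices"]]
    correct_idx = LABELS.index(record["correct_choice"])
    phrase = rng.choice(NOTA_PHRASES)
    if mode == "replace_correct":
        # Assumption: the phrase takes the slot of the removed correct answer,
        # so correct_choice keeps the same letter and now points at the phrase.
        texts[correct_idx] = phrase
    elif mode == "replace_distractor":
        # The phrase overwrites one incorrect option; the gold answer is unchanged.
        wrong = [i for i in range(len(texts)) if i != correct_idx]
        texts[rng.choice(wrong)] = phrase
    else:
        raise ValueError(f"unknown mode: {mode}")
    out = dict(record)
    out["choices"] = [f"{label}. {text}" for label, text in zip(LABELS, texts)]
    return out


def build_rdp(records, seed=0):
    """Perturb about 20% of records by replacing the correct answer and 5% by replacing a distractor."""
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    n_correct = round(0.20 * len(records))
    n_distractor = round(0.05 * len(records))
    modes = {i: "replace_correct" for i in order[:n_correct]}
    for i in order[n_correct:n_correct + n_distractor]:
        modes[i] = "replace_distractor"
    return [perturb(r, modes[i], rng) if i in modes else r
            for i, r in enumerate(records)]


if __name__ == "__main__":
    rdp = build_rdp([dict(SAMPLE) for _ in range(20)], seed=1)
    for rec in rdp:
        print(rec["correct_choice"], rec["choices"])
```

Under these assumptions, a model that always avoids "none of the above" style options would fail the 20% group, while a model that always picks them would fail the 5% group, which is the kind of robustness the RDP version is meant to probe.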

Provider:
maas
Created:
2025-10-03