
NativeQA

ModelScope Community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/tiiuae/NativeQA
Resource description:
# 3LM Native STEM Arabic Benchmark

## Dataset Summary

The 3LM Native STEM dataset contains 865 multiple-choice questions (MCQs) curated from real Arabic educational sources. It targets middle- and high-school-level content in Biology, Chemistry, Physics, Mathematics, and Geography. The benchmark is designed to evaluate Arabic large language models on structured, domain-specific knowledge.

## Motivation

While Arabic NLP has seen growth in cultural and linguistic tasks, scientific reasoning remains underrepresented. This dataset fills that gap by using authentic, in-domain Arabic materials to evaluate factual and conceptual understanding.

## Dataset Structure

- `question_text`: Arabic text of the MCQ (fully self-contained)
- `choices`: List of four choices labeled "أ", "ب", "ج", "د"
- `correct_choice`: Correct answer (letter only)
- `domain`: Subject area (e.g., biology, physics)
- `difficulty`: Score from 1 (easy) to 10 (hard)

```json
{
  "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
  "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
  "correct_choice": "أ",
  "domain": "biology",
  "difficulty": 3
}
```

## Data Sources

Collected from open-access Arabic textbooks, worksheets, and question banks sourced through web crawling and regex-based filtering.

## Data Curation

1. **OCR Processing**: Dual-stage OCR (text + math) using Pix2Tex for LaTeX support.
2. **Extraction Pipeline**: LLMs were used to extract Q&A pairs.
3. **Classification**: Questions were tagged by type, domain, and difficulty.
4. **Standardization**: Questions were reformatted as MCQs, and the positions of the correct answers were randomized.
5. **Manual Verification**: All questions were reviewed by Arabic speakers with a STEM background.
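The record schema under Dataset Structure and the answer-position randomization in the Standardization step can be illustrated with a short Python sketch. This is not the 3LM authors' pipeline: the function names `validate_record` and `shuffle_choices` are hypothetical, and the assumption that every choice string starts with its label followed by `". "` is inferred only from the single example record in this card.

```python
import random

LABELS = ["أ", "ب", "ج", "د"]  # choice labels listed under Dataset Structure


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    choices = record.get("choices", [])
    if not str(record.get("question_text", "")).strip():
        problems.append("question_text is empty")
    if len(choices) != len(LABELS):
        problems.append("expected four choices")
    # Assumption: each choice is formatted as "<label>. <text>".
    for label, choice in zip(LABELS, choices):
        if not choice.startswith(label + ". "):
            problems.append(f"choice does not start with label {label}")
    if record.get("correct_choice") not in LABELS:
        problems.append("correct_choice is not one of the four labels")
    difficulty = record.get("difficulty")
    if not isinstance(difficulty, int) or not 1 <= difficulty <= 10:
        problems.append("difficulty is outside 1..10")
    return problems


def shuffle_choices(record, rng):
    """Return a copy of the record with choices reordered and relabeled."""
    # Strip the "<label>. " prefix so only the answer text is shuffled.
    texts = [choice.split(". ", 1)[1] for choice in record["choices"]]
    correct_text = texts[LABELS.index(record["correct_choice"])]
    rng.shuffle(texts)
    shuffled = dict(record)
    shuffled["choices"] = [f"{label}. {text}" for label, text in zip(LABELS, texts)]
    # Point correct_choice at wherever the correct text ended up.
    shuffled["correct_choice"] = LABELS[texts.index(correct_text)]
    return shuffled


# The example record from this dataset card.
example = {
    "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
    "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
    "correct_choice": "أ",
    "domain": "biology",
    "difficulty": 3,
}

if __name__ == "__main__":
    print(validate_record(example))
    print(shuffle_choices(example, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible, which is useful when a benchmark needs a fixed answer order across evaluation runs.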
## Code and Paper

- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850

## Licensing

[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)

## Citation

```bibtex
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}
```

Provider:
maas
Created:
2025-10-03