NativeQA
ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/tiiuae/NativeQA
Resource description:
# 3LM Native STEM Arabic Benchmark
## Dataset Summary
The 3LM Native STEM dataset contains 865 multiple-choice questions (MCQs) curated from real Arabic educational sources. It targets middle- to high-school-level content in Biology, Chemistry, Physics, Mathematics, and Geography. The benchmark is designed to evaluate Arabic large language models on structured, domain-specific knowledge.
## Motivation
While Arabic NLP has seen growth in cultural and linguistic tasks, scientific reasoning remains underrepresented. This dataset fills that gap by using authentic, in-domain Arabic materials to evaluate factual and conceptual understanding.
## Dataset Structure
- `question_text`: Arabic text of the MCQ (fully self-contained)
- `choices`: List of four choices labeled "أ", "ب", "ج", "د"
- `correct_choice`: Correct answer (letter only)
- `domain`: Subject area (e.g., biology, physics)
- `difficulty`: Score from 1 (easy) to 10 (hard)
```json
{
  "question_text": "ما هو الغاز الذي يتنفسه الإنسان؟",
  "choices": ["أ. الأكسجين", "ب. ثاني أكسيد الكربون", "ج. النيتروجين", "د. الهيدروجين"],
  "correct_choice": "أ",
  "domain": "biology",
  "difficulty": 3
}
```
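The field list above can be turned into a small schema check and a per-domain accuracy tally. The following is a minimal sketch, not part of the 3LM tooling: the helper names `validate_record`, `format_prompt`, and `accuracy_by_domain` are hypothetical, and the prompt layout (question followed by one choice per line) is an assumption rather than the format used in the paper.

```python
from collections import defaultdict

# Arabic choice labels as listed in the dataset structure.
LABELS = ["أ", "ب", "ج", "د"]


def validate_record(rec):
    """Return a list of schema problems; an empty list means the record looks well-formed."""
    problems = []
    question = rec.get("question_text")
    if not isinstance(question, str) or not question.strip():
        problems.append("question_text must be a non-empty string")
    choices = rec.get("choices")
    if not isinstance(choices, list) or len(choices) != 4:
        problems.append("choices must be a list of four items")
    if rec.get("correct_choice") not in LABELS:
        problems.append("correct_choice must be one of the four labels")
    domain = rec.get("domain")
    if not isinstance(domain, str) or not domain:
        problems.append("domain must be a non-empty string")
    difficulty = rec.get("difficulty")
    is_int = isinstance(difficulty, int) and not isinstance(difficulty, bool)
    if not is_int or not 1 <= difficulty <= 10:
        problems.append("difficulty must be an integer from 1 to 10")
    return problems


def format_prompt(rec):
    """Build a plain-text prompt (assumed layout): question, then one choice per line."""
    return "\n".join([rec["question_text"], *rec["choices"]])


def accuracy_by_domain(records, predictions):
    """Map each domain to the fraction of records whose predicted label is correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["domain"]] += 1
        if pred == rec["correct_choice"]:
            correct[rec["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```

Reporting accuracy per domain rather than a single aggregate makes it easier to see whether a model is weaker in, say, physics than in biology.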
## Data Sources
Questions were collected from open-access Arabic textbooks, worksheets, and question banks, which were gathered through web crawling and regex-based filtering.
## Data Curation
1. **OCR Processing**: Dual-stage OCR (text + math) using Pix2Tex for LaTeX support.
2. **Extraction Pipeline**: Used LLMs to extract Q&A pairs.
3. **Classification**: Questions tagged by type, domain, and difficulty.
4. **Standardization**: Reformatted to MCQ and randomized correct answer positions.
5. **Manual Verification**: All questions reviewed by Arabic speakers with STEM background.
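The randomization in step 4 can be illustrated with a short sketch. This is not the authors' pipeline: the function names `strip_label` and `shuffle_choices` are hypothetical, and the code assumes each choice string carries a `label. ` prefix, as in the example record.

```python
import random

# Arabic choice labels as listed in the dataset structure.
LABELS = ["أ", "ب", "ج", "د"]


def strip_label(choice):
    """Remove a leading 'label. ' prefix if present (assumed format)."""
    for label in LABELS:
        prefix = label + ". "
        if choice.startswith(prefix):
            return choice[len(prefix):]
    return choice


def shuffle_choices(rec, rng):
    """Return a copy of rec with choices shuffled, relabeled, and correct_choice updated."""
    texts = [strip_label(c) for c in rec["choices"]]
    correct_idx = LABELS.index(rec["correct_choice"])
    # order[new_position] = old_position
    order = list(range(len(texts)))
    rng.shuffle(order)
    new_rec = dict(rec)
    new_rec["choices"] = [
        f"{label}. {texts[old]}" for label, old in zip(LABELS, order)
    ]
    # The correct answer now sits wherever its old index landed.
    new_rec["correct_choice"] = LABELS[order.index(correct_idx)]
    return new_rec
```

Passing a seeded `random.Random` instance keeps the shuffle reproducible, and tracking the correct answer by index rather than by text avoids ambiguity if two choices happen to share the same wording.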
## Code and Paper
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850
## Licensing
[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)
## Citation
```bibtex
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}
```
Provider: maas
Created: 2025-10-03



