tiiuae/SyntheticQA
Source: Hugging Face · Updated 2026-02-14 · Indexed 2026-02-07
Download link: https://hf-mirror.com/datasets/tiiuae/SyntheticQA
---
dataset_info:
- config_name: Biology
  features:
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: self_answer
    dtype: string
  - name: estimated_difficulty
    dtype: int64
  - name: self_assessed_question_type
    dtype: string
  splits:
  - name: test
    num_bytes: 151342
    num_examples: 272
  download_size: 68603
  dataset_size: 151342
- config_name: Chemistry
  features:
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: self_answer
    dtype: string
  - name: estimated_difficulty
    dtype: int64
  - name: self_assessed_question_type
    dtype: string
  splits:
  - name: test
    num_bytes: 199731
    num_examples: 414
  download_size: 85996
  dataset_size: 199731
- config_name: General_Science
  features:
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: self_answer
    dtype: string
  - name: estimated_difficulty
    dtype: int64
  - name: self_assessed_question_type
    dtype: string
  splits:
  - name: test
    num_bytes: 191398
    num_examples: 370
  download_size: 84749
  dataset_size: 191398
- config_name: Math
  features:
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: self_answer
    dtype: string
  - name: estimated_difficulty
    dtype: int64
  - name: self_assessed_question_type
    dtype: string
  splits:
  - name: test
    num_bytes: 134795
    num_examples: 314
  download_size: 60601
  dataset_size: 134795
- config_name: Physics
  features:
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: self_answer
    dtype: string
  - name: estimated_difficulty
    dtype: int64
  - name: self_assessed_question_type
    dtype: string
  splits:
  - name: test
    num_bytes: 196728
    num_examples: 374
  download_size: 84723
  dataset_size: 196728
configs:
- config_name: Biology
  data_files:
  - split: test
    path: Biology/test-*
- config_name: Chemistry
  data_files:
  - split: test
    path: Chemistry/test-*
- config_name: General_Science
  data_files:
  - split: test
    path: General_Science/test-*
- config_name: Math
  data_files:
  - split: test
    path: Math/test-*
- config_name: Physics
  data_files:
  - split: test
    path: Physics/test-*
---
# 3LM Synthetic STEM Arabic Benchmark
## Dataset Summary
The 3LM Synthetic STEM dataset contains 1,744 automatically generated MCQs in Arabic covering STEM subjects: Biology, Chemistry, Physics, Mathematics, and General Science. These questions were generated using the YourBench framework, adapted for Arabic content.
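As a quick consistency check, the per-subject sizes listed in the card's metadata (`num_examples` of each `test` split) can be summed and compared with the stated total. The sketch below only restates those numbers; the variable names are illustrative.

```python
# Per-config test-split sizes, copied from the card's metadata (num_examples).
NUM_EXAMPLES = {
    "Biology": 272,
    "Chemistry": 414,
    "General_Science": 370,
    "Math": 314,
    "Physics": 374,
}

total = sum(NUM_EXAMPLES.values())

# Share of each subject as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in NUM_EXAMPLES.items()}

print(f"total questions: {total}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The total matches the 1,744 questions quoted in the summary, with Chemistry the largest subject and Biology the smallest.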
## Motivation
Arabic LLMs lack access to native, diverse, and high-difficulty STEM datasets. This synthetic benchmark addresses that gap with carefully curated, LLM-generated questions evaluated for challenge, clarity, and subject balance.
## Dataset Structure
- `question`: Arabic MCQ text (self-contained)
- `choices`: Four options labeled with Arabic letters ("أ", "ب", "ج", "د"); declared as `string` in the dataset metadata
- `self_answer`: Correct choice (letter only)
- `estimated_difficulty`: Integer from 6 to 10, focusing on mid-to-high challenge
- `self_assessed_question_type`: Question type: conceptual, factual, analytical, or application
```json
{
  "question": "ما هو التفاعل الكيميائي الذي يمتص الحرارة؟",
  "choices": ["أ. احتراق", "ب. تبخر", "ج. تحليل", "د. تفاعل ماص للحرارة"],
  "self_answer": "د",
  "estimated_difficulty": 7,
  "self_assessed_question_type": "conceptual"
}
```
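A minimal sketch of how a record with these fields might be loaded, turned into a prompt, and scored. Several things here are assumptions rather than part of the card: the helper names (`load_subject`, `parse_choices`, `format_prompt`, `accuracy`) are hypothetical, the prompt template with a trailing "الإجابة:" ("Answer:") line is illustrative, and, since the metadata declares `choices` as a string while the example shows a list, the parser assumes the string is a JSON array or Python list literal. The repository id and config names are taken from the card; loading requires the `datasets` library and network access.

```python
import ast
import json


def load_subject(config_name):
    """Load one subject's test split, e.g. load_subject("Biology")."""
    # Imported lazily so the offline helpers below work without `datasets`.
    from datasets import load_dataset
    return load_dataset("tiiuae/SyntheticQA", config_name, split="test")


def parse_choices(raw):
    """Return the options as a list of strings.

    Assumption: a string value is a JSON array or a Python list literal.
    Adjust this if the actual serialization differs.
    """
    if isinstance(raw, (list, tuple)):
        return [str(c) for c in raw]
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, (list, tuple)):
        raise ValueError("choices did not parse to a list")
    return [str(c) for c in parsed]


def format_prompt(record):
    """Question, one option per line, then an 'Answer:' cue (hypothetical template)."""
    return "\n".join([record["question"], *parse_choices(record["choices"]), "الإجابة:"])


def accuracy(predictions, records):
    """Fraction of predicted letters that exactly match `self_answer`."""
    if not records:
        return 0.0
    correct = sum(
        pred.strip() == rec["self_answer"].strip()
        for pred, rec in zip(predictions, records)
    )
    return correct / len(records)


# The example record from the card, used offline.
example = {
    "question": "ما هو التفاعل الكيميائي الذي يمتص الحرارة؟",
    "choices": ["أ. احتراق", "ب. تبخر", "ج. تحليل", "د. تفاعل ماص للحرارة"],
    "self_answer": "د",
    "estimated_difficulty": 7,
    "self_assessed_question_type": "conceptual",
}

prompt = format_prompt(example)
print(prompt)
print(accuracy(["د"], [example]))
```

Exact letter matching is the simplest scoring choice; a model that answers with the full option text would need an extra normalization step before comparison.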
## Data Generation
- Source material: Arabic STEM textbooks and exams
- Pipeline: [YourBench](https://huggingface.co/spaces/HuggingFaceH4/YourBench) adapted for Arabic
- Stages: preprocessing → summarization → chunking → question generation → filtering
- Filtering: Visually dependent questions were removed, and question quality was checked through LLM and human review
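The filtering described above relies on LLM and human review, which cannot be reproduced in a few lines. As a much simpler, separate illustration, a consumer of the published records might run a structural sanity check against the documented schema. This is not the authors' filter; the function names are hypothetical, and the allowed values are assumptions taken from the field descriptions in this card.

```python
# Allowed values, assumed from the field descriptions in this card.
VALID_ANSWERS = {"أ", "ب", "ج", "د"}
QUESTION_TYPES = {"conceptual", "factual", "analytical", "application"}


def structural_issues(record):
    """Return a list of schema problems found in one record (empty if none)."""
    issues = []
    if not str(record.get("question", "")).strip():
        issues.append("empty question")
    if record.get("self_answer") not in VALID_ANSWERS:
        issues.append("answer is not one of the four Arabic labels")
    difficulty = record.get("estimated_difficulty")
    if not isinstance(difficulty, int) or not 6 <= difficulty <= 10:
        issues.append("difficulty outside the 6 to 10 range")
    if record.get("self_assessed_question_type") not in QUESTION_TYPES:
        issues.append("unknown question type")
    return issues


def keep_well_formed(records):
    """Keep only records with no structural issues."""
    return [r for r in records if not structural_issues(r)]


good = {
    "question": "ما هو التفاعل الكيميائي الذي يمتص الحرارة؟",
    "self_answer": "د",
    "estimated_difficulty": 7,
    "self_assessed_question_type": "conceptual",
}
# Deliberately malformed: invalid answer label and out-of-range difficulty.
bad = dict(good, self_answer="هـ", estimated_difficulty=3)

kept = keep_well_formed([good, bad])
print(len(kept), structural_issues(bad))
```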
## Code and Paper
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper: https://aclanthology.org/2025.arabicnlp-main.4/
## Licensing
[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)
## Citation
```bibtex
@inproceedings{boussaha-etal-2025-3lm,
    title = "3{LM}: Bridging {A}rabic, {STEM}, and Code through Benchmarking",
    author = "Boussaha, Basma El Amel and
      Al Qadi, Leen and
      Farooq, Mugariya and
      Alsuwaidi, Shaikha and
      Campesan, Giulia and
      Alzubaidi, Ahmed and
      Alyafeai, Mohammed and
      Hacid, Hakim",
    booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.arabicnlp-main.4/",
    doi = "10.18653/v1/2025.arabicnlp-main.4",
    pages = "42--63",
    ISBN = "979-8-89176-352-4",
}
```
Provider: tiiuae



