SyntheticQA
Source: ModelScope community. Updated 2025-12-04; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/tiiuae/SyntheticQA
# 3LM Synthetic STEM Arabic Benchmark
## Dataset Summary
The 3LM Synthetic STEM dataset contains 1,744 automatically generated multiple-choice questions (MCQs) in Arabic covering five STEM subjects: Biology, Chemistry, Physics, Mathematics, and General Science. The questions were generated with the YourBench framework, adapted for Arabic content.
## Motivation
Arabic LLMs lack access to native, diverse, and high-difficulty STEM datasets. This synthetic benchmark addresses that gap with carefully curated, LLM-generated questions evaluated for challenge, clarity, and subject balance.
## Dataset Structure
- `question`: Arabic MCQ text (self-contained)
- `choices`: Four Arabic-labeled options ("أ", "ب", "ج", "د")
- `self_answer`: Correct choice (letter only)
- `estimated_difficulty`: Rating from 6 to 10, focusing on mid-to-high challenge
- `self_assessed_question_type`: Question type: conceptual, factual, analytical, or application
```json
{
  "question": "ما هو التفاعل الكيميائي الذي يمتص الحرارة؟",
  "choices": ["أ. احتراق", "ب. تبخر", "ج. تحليل", "د. تفاعل ماص للحرارة"],
  "self_answer": "د",
  "estimated_difficulty": 7,
  "self_assessed_question_type": "conceptual"
}
```
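The field descriptions above can be turned into a small sanity check. The following is a minimal sketch, not part of the 3LM tooling: the function names `validate_record` and `accuracy` are hypothetical, and the checks assume that every option starts with its Arabic label and that the difficulty is stored as an integer, as in the example record.

```python
# Field names come from the dataset card; the checks themselves are assumptions.
ARABIC_LABELS = ("أ", "ب", "ج", "د")
QUESTION_TYPES = ("conceptual", "factual", "analytical", "application")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []

    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be a non-empty string")

    choices = record.get("choices")
    if not isinstance(choices, list) or len(choices) != len(ARABIC_LABELS):
        problems.append("choices must be a list of four options")
    else:
        for label, choice in zip(ARABIC_LABELS, choices):
            # Assumes each option starts with its label, as in the card's example.
            if not isinstance(choice, str) or not choice.startswith(label):
                problems.append(f"choice {choice!r} does not start with {label!r}")

    if record.get("self_answer") not in ARABIC_LABELS:
        problems.append("self_answer must be one of the four labels")

    difficulty = record.get("estimated_difficulty")
    if not isinstance(difficulty, int) or not 6 <= difficulty <= 10:
        problems.append("estimated_difficulty must be an integer from 6 to 10")

    if record.get("self_assessed_question_type") not in QUESTION_TYPES:
        problems.append("self_assessed_question_type is not a known type")

    return problems


def accuracy(records: list[dict], predictions: list[str]) -> float:
    """Fraction of predicted labels that match each record's self_answer."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(r["self_answer"] == p for r, p in zip(records, predictions))
    return correct / len(records)
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in one pass. `accuracy` compares bare labels only, so any model output would first need to be reduced to a single letter.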
## Data Generation
- Source material: Arabic STEM textbooks and exams
- Pipeline: [YourBench](https://huggingface.co/spaces/HuggingFaceH4/YourBench) adapted for Arabic
- Stages: preprocessing → summarization → chunking → question generation → filtering
- Filtering: Removed visually dependent questions and ensured question quality via LLM and human review
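The card says that visually dependent questions were removed but does not describe how they were detected. As a rough illustration of what a first-pass heuristic could look like, the sketch below flags questions that mention a figure, image, or drawing. The keyword list and both function names are hypothetical and are not taken from the 3LM or YourBench code.

```python
# Hypothetical keyword list: Arabic words for "figure", "image", and "drawing".
# The card does not say how visual dependence was detected; illustration only.
VISUAL_MARKERS = ("الشكل", "الصورة", "الرسم")


def is_visually_dependent(question: str) -> bool:
    """Heuristic: flag questions that mention a figure, image, or drawing."""
    return any(marker in question for marker in VISUAL_MARKERS)


def drop_visual_questions(records: list[dict]) -> list[dict]:
    """Keep only records whose question text does not reference visual material."""
    return [r for r in records if not is_visually_dependent(r["question"])]
```

Plain substring matching is crude: it misses inflected or diacritized forms and can flag questions that merely mention a shape. A heuristic like this would at most be a pre-filter ahead of the LLM and human review the card describes.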
## Code and Paper
- 3LM repo on GitHub: https://github.com/tiiuae/3LM-benchmark
- 3LM paper on arXiv: https://arxiv.org/pdf/2507.15850
## Licensing
[Falcon LLM Licence](https://falconllm.tii.ae/falcon-terms-and-conditions.html)
## Citation
```bibtex
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}
```
Provider: maas
Created: 2025-10-04



