ARB

Name: ARB
Creator: maas
Published: 2025-12-05 16:35:46
License: No description provided

ModelScope Community, updated 2025-12-05, indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/MBZUAI/ARB


Description:

<div align="center">
<img src="assets/arab_logo.png" width="12%" align="left"/>
</div>

<div style="margin-top:50px;">
<h1 style="font-size: 30px; margin: 0;"> ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark</h1>
</div>

<div align="center" style="margin-top:10px;">

[Sara Ghaboura](https://huggingface.co/SLMLAH) <sup> * </sup>  
[Ketan More](https://github.com/ketanmore2002) <sup> * </sup>  
[Wafa Alghallabi](https://huggingface.co/SLMLAH)  
[Omkar Thawakar](https://omkarthawakar.github.io)  
<br>
[Jorma Laaksonen](https://scholar.google.com/citations?user=qQP6WXIAAAAJ&hl=en)  
[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)  
[Salman Khan](https://scholar.google.com/citations?hl=en&user=M59O9lkAAAAJ)  
[Rao M. Anwer](https://scholar.google.com/citations?hl=en&user=_KlvMVoAAAAJ)<br>
<em> <sup> *Equal Contribution </sup> </em>
<br>

</div>

<div align="center" style="margin-top:10px;">

[![arXiv](https://img.shields.io/badge/arXiv-2505.17021-C0DAD9)](https://arxiv.org/abs/2505.17021)
[![Our Page](https://img.shields.io/badge/Visit-Our%20Page-D4EBDB?style=flat)](https://mbzuai-oryx.github.io/ARB/)

## 🪔✨ ARB Scope and Diversity

<p align="left">
ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities. It covers 11 diverse domains spanning science, culture, OCR, and historical interpretation. <br>
</p>

<p align="center">
<img src="assets/arb_sample_intro.png" width="600px" height="125px" alt="Figure: ARB Dataset Coverage"/>
</p>

</div>

## 🌟 Key Features

- Includes **1,356** multimodal samples with **5,119** curated reasoning steps.
- Spans **11 diverse domains**, from visual reasoning to historical and scientific analysis.
- Emphasizes **step-by-step reasoning**, beyond just final answer prediction.
- Each sample contains a **chain of 2–6+ reasoning steps** aligned to human logic.
- Curated and verified by **native Arabic speakers** and **domain experts** for linguistic and cultural fidelity.
- Built from **hybrid sources**: original Arabic data, high-quality translations, and synthetic samples.
- Features a **robust evaluation framework** measuring both final answer accuracy and reasoning quality.
- Fully **open-source dataset** and toolkit to support research in **Arabic reasoning and multimodal AI**.

## 🏗️ ARB Construction Pipeline

<p align="center">
<img src="assets/arb_pipeline.png" width="750px" height="180px" alt="Figure: ARB Pipeline Overview"/>
</p>

## 🗂️ ARB Collection

<p align="center">
<img src="assets/arb_collection.png" width="750px" height="180px" alt="Figure: ARB Collection"/>
</p>

## 🗂️ ARB Distribution

<p align="center">
<img src="assets/arb_dist.png" width="400px" height="100px" alt="Figure: ARB dist"/>
</p>

## 🧪 Evaluation Protocol

<div>
<p align="left">

We evaluated 12 open- and closed-source LMMs using:

- **Lexical and Semantic Similarity Scores**: BLEU, ROUGE, BERTScore, LaBSE
- **Stepwise Evaluation Using LLM-as-Judge**: Our curated metric includes 10 factors such as faithfulness, interpretive depth, coherence, and hallucination.

</p>
</div>

## 🏆 Evaluation Results

- Stepwise Evaluation Using LLM-as-Judge for Closed-Source Models:

| Metric ↓ / Model → | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
|----------------------------|--------|-------------|---------|---------|----------------|------------------|
| Final Answer (%) | **60.22** | 52.22 | 59.43 | 58.93 | 56.70 | 57.80 |
| Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | **80.75** | 64.34 | 64.09 |

- Stepwise Evaluation Using LLM-as-Judge for Open-Source Models:

| Metric ↓ / Model → | Qwen2.5-VL | LLaMA-3.2 | AIN | LLaMA-4 Scout | Aya-Vision | InternVL3 |
|----------------------------|------------|-----------|-------|----------------|-------------|-----------|
| Final Answer (%) | 37.02 | 25.58 | 27.35 | **48.52** | 28.81 | 31.04 |
| Reasoning Steps (%) | 64.03 | 53.20 | 52.77 | **77.70** | 63.64 | 54.50 |

## 📂 Dataset Structure

<div>
<p align="left">

Each sample includes:

- `image_id`: Visual input
- `question`: Arabic question grounded in image reasoning
- `choices`: The answer options for the multiple-choice question
- `steps`: Ordered reasoning chain
- `answer`: Final solution (Arabic)
- `category`: One of 11 categories (e.g., OCR, Scientific, Visual, Math)

</p>

Example JSON:

```json
{
  "image_id": "Chart_2.png",
  "question": "من خلال الرسم البياني لعدد القطع لكل عضو في الكشف عن السرطان، إذا جمعنا نسبة 'أخرى' مع نسبة 'الرئة'، فكيف يقاربان نسبة 'الكلى' تقريبًا؟",
  "answer": "ج",
  "choices": "['أ. مجموعهما أكبر بكثير من نسبة الكلى', 'ب. مجموعهما يساوي تقريبًا نسبة الكلى', 'ج. مجموعهما أقل بشكل ملحوظ من نسبة الكلى']",
  "steps": "الخطوة 1: تحديد النسب المئوية لكل من 'أخرى' و'الرئة' و'الكلى' من الرسم البياني.\nالإجراء 1: 'أخرى' = 0.7%، 'الرئة' = 1.8%، 'الكلى' = 4.3%.\n\nالخطوة 2: حساب مجموع النسب المئوية لـ 'أخرى' و'الرئة'.\nالإجراء 2: 0.7% + 1.8% = 2.5%.\n\nالخطوة 3: مقارنة مجموع النسب المئوية لـ 'أخرى' و'الرئة' مع نسبة 'الكلى'.\nالإجراء 3: 2.5% (مجموع 'أخرى' و'الرئة') أقل من 4.3% (نسبة 'الكلى').\n\nالخطوة 4: اختيار الإجابة الصحيحة بناءً على المقارنة.\nالإجراء 4: اختيار 'ج' لأن مجموعهما أقل بشكل ملحوظ من نسبة 'الكلى'.",
  "category": "CDT"
}
```

</div>

<div align="left">

## 📚 Citation

If you use the ARB dataset in your research, please consider citing:

```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
  title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
  author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
  year={2025},
  eprint={2505.17021},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17021},
}
```

</div>
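
In the example record under Dataset Structure, `choices` is stored as a stringified list and `steps` as a single string in which each "الخطوة" (step) line is followed by an "الإجراء" (action) line, with a blank line between pairs. The sketch below shows one way to unpack both fields in Python. The helper names `parse_choices`, `split_steps`, and `final_answer_accuracy` are hypothetical and not part of the ARB toolkit, and the exact-match accuracy is a simplified stand-in for the LLM-as-Judge scoring reported above, not a reproduction of it.

```python
import ast

# Record copied from the example in the dataset card (question omitted for brevity).
sample = {
    "image_id": "Chart_2.png",
    "answer": "ج",
    "choices": "['أ. مجموعهما أكبر بكثير من نسبة الكلى', "
               "'ب. مجموعهما يساوي تقريبًا نسبة الكلى', "
               "'ج. مجموعهما أقل بشكل ملحوظ من نسبة الكلى']",
    "steps": "الخطوة 1: تحديد النسب المئوية لكل من 'أخرى' و'الرئة' و'الكلى' من الرسم البياني.\n"
             "الإجراء 1: 'أخرى' = 0.7%، 'الرئة' = 1.8%، 'الكلى' = 4.3%.\n\n"
             "الخطوة 2: حساب مجموع النسب المئوية لـ 'أخرى' و'الرئة'.\n"
             "الإجراء 2: 0.7% + 1.8% = 2.5%.\n\n"
             "الخطوة 3: مقارنة مجموع النسب المئوية لـ 'أخرى' و'الرئة' مع نسبة 'الكلى'.\n"
             "الإجراء 3: 2.5% (مجموع 'أخرى' و'الرئة') أقل من 4.3% (نسبة 'الكلى').\n\n"
             "الخطوة 4: اختيار الإجابة الصحيحة بناءً على المقارنة.\n"
             "الإجراء 4: اختيار 'ج' لأن مجموعهما أقل بشكل ملحوظ من نسبة 'الكلى'.",
}


def parse_choices(choices_field: str) -> dict[str, str]:
    """Turn the stringified list into a {letter: option text} mapping."""
    items = ast.literal_eval(choices_field)  # safe parsing of a Python-style list literal
    parsed = {}
    for item in items:
        # Each option looks like "<letter>. <text>"; split on the first ". ".
        letter, _, text = item.partition(". ")
        parsed[letter.strip()] = text.strip()
    return parsed


def split_steps(steps_field: str) -> list[tuple[str, str]]:
    """Split the reasoning chain into (step, action) pairs on blank lines."""
    pairs = []
    for block in steps_field.strip().split("\n\n"):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        step = lines[0]
        action = lines[1] if len(lines) > 1 else ""
        pairs.append((step, action))
    return pairs


def final_answer_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy in percent (assumption: answers are option letters)."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


choices = parse_choices(sample["choices"])
steps = split_steps(sample["steps"])
print(len(choices), "choices;", len(steps), "reasoning steps")
print("Gold option text:", choices[sample["answer"]])
```

`ast.literal_eval` is used instead of `json.loads` because the list is written with single quotes, which is not valid JSON. If other records format `choices` or `steps` differently, the splitting rules here would need adjusting.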


Provider:

maas

Created:

2025-05-23
