ALM-Bench

Name: ALM-Bench
Creator: maas
Published: 2025-10-09 16:26:32
License: No description provided

ModelScope Community; updated 2025-10-09; indexed 2025-03-22

Download link:

https://modelscope.cn/datasets/MBZUAI/ALM-Bench

Resource description:

# All Languages Matter Benchmark (ALM-Bench) [CVPR 2025 🔥]

# Summary

Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-Bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.

ALM-Bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. The ALM-Bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning.

To capture the rich tapestry of global cultures, ALM-Bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-Bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark and code are publicly available.

[Arxiv Link](https://arxiv.org/abs/2411.16508), [Project Page](https://mbzuai-oryx.github.io/ALM-Bench/), [GitHub Page](https://github.com/mbzuai-oryx/ALM-Bench)

---

# Dataset Structure

## Data Instances

An example of `test` looks as follows:

```
{
  'file_name': ,
  'ID': '031_31_01_001',
  'Language': 'Italian',
  'Category': 'Lifestyle',
  'Question_Type': 'Short Questions',
  'English_Question': 'What type of clothing are the people in the image wearing?',
  'English_Answer': 'The people in the image are wearing professional clothing.',
  'Translated_Question': " Che tipo di abbigliamento indossano le persone nell'immagine?",
  'Translated_Answer': " Le persone nell'immagine indossano abiti professionali.",
  'Image_Url': 'https://assets.vogue.com/photos/650c97c9e5c5af360f4668ac/master/w_2560%2Cc_limit/GettyImages-1499571723.jpg'
}
```

## Data Fields

The data fields are:

```
- 'file_name': ,
- 'ID': A unique ID in the language#_cat#_img# format.
- 'Language': One of the 100 languages.
- 'Category': One of the 19 categories.
- 'Question_Type': One of four question types: MCQs, T/F, SVQAs, and LVQAs.
- 'English_Question': The original question in English.
- 'English_Answer': The original answer in English.
- 'Translated_Question': The translated and annotated question in the native language.
- 'Translated_Answer': The translated and annotated answer in the native language.
- 'Image_Url': The URL of the image, as retrieved from the internet.
```

---

# Data Statistics

Data statistics of ALM-Bench, showing the diversity of scripts, global coverage, comprehensive categories, and various question types. The dataset contains 22.7K high-quality question-answer pairs in total, covering 100 languages and 24 scripts. All samples are manually verified by native speakers.

---

# Dataset Benchmark Comparison

Comparison of various LMM benchmarks with a focus on multilingual and cultural understanding.
The Domains indicate the range of aspects covered by the dataset for each language. Question Form is categorized as "Diverse" if the question phrasing varies, and "Fixed" otherwise. Annotation Types are classified as "Manual" if questions were originally in the local language, "Manual+Auto" if questions were generated or translated using GPT-4/Google API and subsequently validated by human experts, and "Auto" if generated or translated automatically without human validation. Bias Correction reflects whether the dataset is balanced across cultures and countries, while Diversity indicates whether the dataset includes both Western and non-Western minority cultures. ‘-’ means information not available.

---

# Experimental Results

Performance comparison of different open- and closed-source models (y-axis) on the 100 languages (x-axis) of ALM-Bench. Performance is reported as the average accuracy across all questions in a language. Each box shows the model's actual performance on that language, with higher accuracy shown at higher color intensity.

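
The per-language metric described above (average accuracy across all questions in a language) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_language`, the `(language, is_correct)` input shape, and the sample judgments are all assumptions made for the example, and the sketch assumes each answer has already been judged as simply correct or incorrect.

```python
from collections import defaultdict


def accuracy_by_language(results):
    """Return {language: mean accuracy} for (language, is_correct) pairs.

    Hypothetical helper: the input shape is an assumption for this sketch,
    not a format defined by ALM-Bench.
    """
    totals = defaultdict(int)   # questions seen per language
    correct = defaultdict(int)  # correct answers per language
    for language, is_correct in results:
        totals[language] += 1
        correct[language] += int(bool(is_correct))
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Made-up judgments for a single model; the language labels are illustrative.
sample_results = [
    ("Italian", True),
    ("Italian", False),
    ("Italian", True),
    ("Amharic", False),
    ("Amharic", True),
]

scores = accuracy_by_language(sample_results)
for language, score in sorted(scores.items()):
    print(f"{language}: {score:.2f}")
```

Running the helper once per model would give one row of values per model, which is the layout the comparison above describes (models on one axis, languages on the other).
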
---

# Citation

**BibTeX:**

```bibtex
@misc{vayani2024alm,
  title={All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages},
  author={Ashmal Vayani and Dinura Dissanayake and Hasindri Watawana and Noor Ahsan and Nevasini Sasikumar and Omkar Thawakar and Henok Biadglign Ademtew and Yahya Hmaiti and Amandeep Kumar and Kartik Kuckreja and Mykola Maslych and Wafa Al Ghallabi and Mihail Mihaylov and Chao Qin and Abdelrahman M Shaker and Mike Zhang and Mahardika Krisna Ihsani and Amiel Esplana and Monil Gokani and Shachar Mirkin and Harsh Singh and Ashay Srivastava and Endre Hamerlik and Fathinah Asma Izzati and Fadillah Adamsyah Maani and Sebastian Cavada and Jenny Chim and Rohit Gupta and Sanjay Manjunath and Kamila Zhumakhanova and Feno Heriniaina Rabevohitra and Azril Amirudin and Muhammad Ridzuan and Daniya Kareem and Ketan More and Kunyang Li and Pramesh Shakya and Muhammad Saad and Amirpouya Ghasemaghaei and Amirbek Djanibekov and Dilshod Azizov and Branislava Jankovic and Naman Bhatia and Alvaro Cabrera and Johan Obando-Ceron and Olympiah Otieno and Fabian Farestam and Muztoba Rabbani and Sanoojan Baliah and Santosh Sanjeev and Abduragim Shtanchaev and Maheen Fatima and Thao Nguyen and Amrin Kareem and Toluwani Aremu and Nathan Xavier and Amit Bhatkal and Hawau Toyin and Aman Chadha and Hisham Cholakkal and Rao Muhammad Anwer and Michael Felsberg and Jorma Laaksonen and Thamar Solorio and Monojit Choudhury and Ivan Laptev and Mubarak Shah and Salman Khan and Fahad Khan},
  year={2024},
  eprint={2411.16508},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.16508},
}
```

---

## Licensing Information

We release our work under the [CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/). The CC BY-NC 4.0 license allows others to share, remix, and adapt the work, as long as it's for non-commercial purposes and proper attribution is given to the original creator.


Provider:

maas

Created:

2025-03-17
