kaleidoscope

ModelScope community · Updated 2025-12-05 · Indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/CohereLabs/kaleidoscope
Resource description:
# <span style="font-variant: small-caps;">Kaleidoscope</span> <img src="https://cdn-uploads.huggingface.co/production/uploads/5e4b943a37cb5b49818287b5/_fCLWAuX8sl93viDFgTsY.png" style="vertical-align: middle; width: auto; height: 1em; display: inline-block;"> <span>(18 Languages)</span>

## Dataset Description

The <span style="font-variant: small-caps;">Kaleidoscope</span> Benchmark is a global collection of multiple-choice questions sourced from real-world exams, with the goal of evaluating multimodal and multilingual understanding in vision-language models (VLMs). The collected exams use a multiple-choice question answering (MCQA) format, which provides a structured framework for evaluation by prompting models with predefined answer choices, closely mimicking conventional human testing methodologies.

**📄 Paper**: https://arxiv.org/abs/2504.07072 <br>
**🌐 Website**: http://cohere.com/research/kaleidoscope

### Dataset Summary

The <span style="font-variant: small-caps;">Kaleidoscope</span> benchmark contains 20,911 questions across 18 languages belonging to 8 language families. A total of 11,459 questions (55%) require an image to be answered, while the remaining 9,452 (45%) are text-only. The dataset covers 14 different subjects, grouped into 6 broad domains.

### Languages

Arabic, Bengali, Croatian, Dutch, English, French, German, Hindi, Hungarian, Lithuanian, Nepali, Persian, Portuguese, Russian, Serbian, Spanish, Telugu, Ukrainian

### Topics

- **Humanities & Social Sciences**: Economics, Geography, History, Language, Social Sciences, Sociology
- **STEM**: Biology, Chemistry, Engineering, Mathematics, Physics
- **Reasoning, Health Science, and Practical Skills**: Reasoning, Medicine, Driving License

### Data schema

An example from a UNICAMP question looks as follows:

```json
{
  "question": "Em uma xícara que já contém certa quantidade de açúcar, despeja-se café. A curva abaixo representa a função exponencial $\\mathrm{M}(\\mathrm{t})$, que fornece a quantidade de açúcar não dissolvido (em gramas), t minutos após o café ser despejado. Pelo gráfico, podemos concluir que",
  "options": [
    "$\\mathrm{m}(\\mathrm{t})=2^{(4-\\mathrm{t} / 75)}$.",
    "$m(t)=2^{(4-t / 50)}$.",
    "$m(t)=2^{(5-t / 50)}$",
    "$m(t)=2^{(5-t / 150)}$"
  ],
  "answer": 0,
  "question_image": "unicamp_2011_30_0.png",
  "image_information": "essential",
  "image_type": "graph",
  "language": "pt",
  "country": "Brazil",
  "contributor_country": "Brazil",
  "file_name": "Unicamp2011_1fase_prova.pdf",
  "source": "https://www.curso-objetivo.br/vestibular/resolucao-comentada/unicamp/2011_1fase/unicamp2011_1fase_prova.pdf",
  "license": "Unknown",
  "level": "University Entrance",
  "category_en": "Mathematics",
  "category_source_lang": "Matemática",
  "original_question_num": 30
}
```

Here `unicamp_2011_30_0.png` contains:

<img src="https://cdn-uploads.huggingface.co/production/uploads/5e4b943a37cb5b49818287b5/SszvTTTPqXszB6hUk53_e.png" width="400">

### Model Performance

Model performance on the <span style="font-variant: small-caps;">Kaleidoscope</span> benchmark:

| Model | Overall | | | Multimodal | | | Text-only | | |
|------------------|---------|-------|-------|------------|-------|-------|-----------|-------|-------|
| | Total Acc. | Format Err. | Valid Acc. | Total Acc. | Format Err. | Valid Acc. | Total Acc. | Format Err. | Valid Acc. |
| Claude 3.5 Sonnet | **62.91** | 1.78 | **63.87** | **55.63** | 3.24 | **57.24** | **73.54** | 0.02 | **73.57** |
| Gemini 1.5 Pro | 62.10 | 1.62 | 62.95 | 55.01 | 1.46 | 55.71 | 72.35 | 1.81 | 73.45 |
| GPT-4o | 58.32 | 6.52 | 62.10 | 49.80 | 10.50 | 55.19 | 71.40 | 1.71 | 72.39 |
| Qwen2.5-VL-72B | 52.94 | 0.02 | 53.00 | 48.40 | 0.03 | 48.41 | 60.00 | 0.02 | 60.01 |
| Aya Vision 32B | 39.27 | 1.05 | 39.66 | 35.74 | 1.49 | 36.28 | 44.73 | 0.51 | 45.00 |
| Qwen2.5-VL-32B | 48.21 | 0.88 | 48.64 | 44.90 | 0.28 | 45.05 | 53.77 | 1.61 | 54.60 |
| Aya Vision 8B | 35.09 | 0.07 | 35.11 | 32.35 | 0.05 | 32.36 | 39.27 | 0.10 | 39.30 |
| Molmo-7B-D | 32.87 | 0.04 | 32.88 | 31.43 | 0.06 | 31.44 | 35.12 | 0.01 | 35.13 |
| Pangea-7B | 31.31 | 7.42 | 34.02 | 27.15 | 13.52 | 31.02 | 37.84 | 0.03 | 37.86 |
| Qwen2.5-VL-7B | 39.56 | 0.08 | 39.60 | 36.85 | 0.04 | 36.88 | 43.91 | 0.11 | 43.96 |
| Qwen2.5-VL-3B | 35.56 | 0.19 | 35.63 | 33.67 | 0.32 | 33.79 | 38.51 | 0.03 | 38.53 |

## Citation

```
@misc{salazar2025kaleidoscopeinlanguageexamsmassively,
  title={Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation},
  author={Israfel Salazar and Manuel Fernández Burda and Shayekh Bin Islam and Arshia Soltani Moakhar and Shivalika Singh and Fabian Farestam and Angelika Romanou and Danylo Boiko and Dipika Khullar and Mike Zhang and Dominik Krzemiński and Jekaterina Novikova and Luísa Shimabucoro and Joseph Marvin Imperial and Rishabh Maheshwary and Sharad Duwal and Alfonso Amayuelas and Swati Rajwal and Jebish Purbey and Ahmed Ruby and Nicholas Popovič and Marek Suppa and Azmine Toushik Wasi and Ram Mohan Rao Kadiyala and Olga Tsymboi and Maksim Kostritsya and Bardia Soltani Moakhar and Gabriel da Costa Merlin and Otávio Ferracioli Coletti and Maral Jabbari Shiviari and MohammadAmin farahani fard and Silvia Fernandez and María Grandury and Dmitry Abulkhanov and Drishti Sharma and Andre Guarnier De Mitri and Leticia Bossatto Marchezi and Johan Obando-Ceron and Nazar Kohut and Beyza Ermis and Desmond Elliott and Enzo Ferrante and Sara Hooker and Marzieh Fadaee},
  year={2025},
  eprint={2504.07072},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.07072},
}
```
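
To make the data schema above more concrete, here is a minimal Python sketch of how a single record might be turned into a lettered multiple-choice prompt. It rests on assumptions the card does not state: that `answer` is a 0-based index into `options`, and that `image_information == "essential"` marks questions that cannot be answered without the image. The helper names `build_prompt`, `answer_letter`, and `needs_image` are hypothetical, the prompt layout is illustrative rather than the one used in the paper, and the record is abridged from the UNICAMP example.

```python
from string import ascii_uppercase

# Record abridged from the UNICAMP example in the dataset card.
record = {
    "question": "Pelo gráfico, podemos concluir que",
    "options": [
        "$m(t)=2^{(4-t / 75)}$.",
        "$m(t)=2^{(4-t / 50)}$.",
        "$m(t)=2^{(5-t / 50)}$",
        "$m(t)=2^{(5-t / 150)}$",
    ],
    "answer": 0,
    "question_image": "unicamp_2011_30_0.png",
    "image_information": "essential",
}


def build_prompt(rec: dict) -> str:
    """Render the question followed by lettered options (hypothetical layout)."""
    lines = [rec["question"]]
    for letter, option in zip(ascii_uppercase, rec["options"]):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def answer_letter(rec: dict) -> str:
    """Map `answer` to a letter, ASSUMING it is a 0-based index into `options`."""
    return ascii_uppercase[rec["answer"]]


def needs_image(rec: dict) -> bool:
    """ASSUMPTION: 'essential' means the image is required to answer."""
    return rec.get("image_information") == "essential"


prompt = build_prompt(record)
print(prompt)
print("gold:", answer_letter(record), "| image required:", needs_image(record))
```

A model's reply could then be compared against `answer_letter(record)`; replies that do not contain a recognisable option letter would presumably be the ones counted under "Format Err." in the table, though the card does not spell out that procedure.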

Provider:
maas
Created:
2025-08-01