AyaVisionBench

Name: AyaVisionBench
Creator: maas
Published: 2025-12-18 16:25:39
License: No description provided

ModelScope community. Updated: 2025-12-18. Indexed: 2025-03-08.

Download link:

https://modelscope.cn/datasets/CohereForAI/AyaVisionBench

Resource description:

## Dataset Card for Aya Vision Benchmark

<img src="ayavisionbench.png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

## Dataset Details

The Aya Vision Benchmark is designed to evaluate vision-language models in real-world multilingual scenarios. It spans 23 languages and 9 distinct task categories, with 15 samples per category, resulting in 135 image-question pairs per language. Each question requires visual context to answer, and the benchmark covers languages spoken by half of the world's population, making this dataset particularly suited for comprehensive assessment of cross-lingual and multimodal understanding.

The tasks span:

- Image captioning
- Chart and figure understanding
- Finding differences between two images
- General visual question answering
- OCR
- Document understanding
- Text transcription
- Visual reasoning (including logic and math)
- Converting screenshots to code

To create this dataset, we first selected images from the [Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) held-out test set, a large collection derived from 50 high-quality datasets, ensuring they had not been seen during training. For each image, we then generated a corresponding question that explicitly required visual context for an answer. These questions were synthetically generated and subsequently refined through a two-stage verification process. First, human annotators reviewed and validated each question to ensure it was clear, relevant, and truly dependent on the image. Then, an automated filtering step was applied to further verify consistency and quality across languages.

## Languages

To ensure multilingual coverage, the non-English portion of the dataset was generated by translating the English subset into 22 additional languages using Google Translate API v3. The dataset includes a diverse range of language families and scripts, ensuring a comprehensive evaluation of model generalizability and robustness.
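The counts quoted in this card fit together as simple arithmetic. The short sketch below works them through; the overall total is derived here rather than stated in the card, and it assumes every language subset contains the full set of prompts.

```python
# Figures taken from the dataset card.
NUM_TASK_CATEGORIES = 9
SAMPLES_PER_CATEGORY = 15
NUM_TRANSLATED_LANGUAGES = 22  # languages translated from the English subset

# 9 categories x 15 samples each
samples_per_language = NUM_TASK_CATEGORIES * SAMPLES_PER_CATEGORY

# English plus the 22 translated languages
num_languages = 1 + NUM_TRANSLATED_LANGUAGES

# Assumption: every language subset holds the full set of prompts.
total_samples = samples_per_language * num_languages

print(samples_per_language, num_languages, total_samples)
```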
The languages included are: Arabic (arb_Arab), Chinese (zho_Hans), Czech (ces_Latn), Dutch (nld_Latn), English (eng_Latn), French (fra_Latn), German (deu_Latn), Greek (ell_Grek), Hebrew (heb_Hebr), Hindi (hin_Deva), Indonesian (ind_Latn), Italian (ita_Latn), Japanese (jpn_Jpan), Korean (kor_Hang), Persian (fas_Arab), Polish (pol_Latn), Portuguese (por_Latn), Romanian (ron_Latn), Russian (rus_Cyrl), Spanish (spa_Latn), Turkish (tur_Latn), Ukrainian (ukr_Cyrl), and Vietnamese (vie_Latn).

By incorporating languages from different families and scripts, this benchmark enables a comprehensive assessment of vision-language models, particularly their ability to generalize across diverse languages.

## Load with Datasets

To load this dataset with Datasets, install the library with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

dataset = load_dataset("CohereLabs/AyaVisionBench", "kor_Hang")
```

The code above loads only the Korean subset of the dataset. You can load other subsets by specifying another supported language code, or the entire dataset by leaving that argument blank.

## Dataset Fields

The following are the fields in the dataset:

- **image:** The raw image data in .jpg format.
- **image_source:** The original dataset from which the image was sourced.
- **image_source_category:** The category of the image source, as defined in Cauldron.
- **index:** A unique identifier for each sample. Identifiers are consistent across different language subsets.
- **question:** The text of the prompt, which may be a question or an instruction.
- **language:** The language of the sample, indicating the subset to which it belongs.
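Because `index` values are consistent across language subsets, the same prompt can be lined up in several languages. The following is a minimal sketch of that idea, not part of the original card: `align_by_index` is a hypothetical helper, and the toy records are illustrative stand-ins rather than real samples.

```python
from collections import defaultdict


def align_by_index(subsets):
    """Group questions from several language subsets by their shared index.

    `subsets` maps a language code (e.g. "eng_Latn") to an iterable of
    records that each carry at least the "index" and "question" fields.
    """
    aligned = defaultdict(dict)
    for language, records in subsets.items():
        for record in records:
            aligned[record["index"]][language] = record["question"]
    return dict(aligned)


# Hypothetical stand-in records; real records also carry the image fields.
toy_subsets = {
    "eng_Latn": [
        {"index": "17", "question": "What does the chart show?", "language": "eng_Latn"},
    ],
    "deu_Latn": [
        {"index": "17", "question": "Was zeigt das Diagramm?", "language": "deu_Latn"},
    ],
}

parallel = align_by_index(toy_subsets)
print(parallel["17"])
```

With the real data, each value in `subsets` could come from `load_dataset("CohereLabs/AyaVisionBench", code)` for a given language code. The card does not say how splits are exposed, so you may need to select a split from the returned object before iterating over records.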
### Dataset Structure

An instance of the data from the English subset looks as follows:

<img src="example.png" width="300" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

```python
{'image': [PIL.Image],
 'image_source': 'VisText',
 'image_source_category': 'Chart/figure understanding',
 'index': '17',
 'question': 'If the top three parties by vote percentage formed a coalition, what percentage of the total votes would they collectively represent, and how does this compare to the combined percentage of all other parties shown in the chart?',
 'language': 'eng_Latn'
}
```

### Authorship

- Publishing Organization: [Cohere Labs](https://cohere.com/research)
- Industry Type: Not-for-profit - Tech
- Contact Details: https://cohere.com/research/aya

### Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the Apache 2.0 License.

Provider: maas

Created: 2025-03-05
