Download link:

https://modelscope.cn/datasets/MMInstruction/M3IT-80

Resource description:

# Dataset Card for M3IT-80

Project Page: [https://m3-it.github.io/](https://m3-it.github.io/)

## Dataset Description

- **Homepage: https://huggingface.co/datasets/MMInstruction/M3IT-80**
- **Repository: https://huggingface.co/datasets/MMInstruction/M3IT-80**
- **Paper: https://huggingface.co/papers/2306.04387**
- **Leaderboard:**
- **Point of Contact:**

### Languages

80 languages translated from English.

## Dataset Metainfo

The [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning and classification.
**M3IT-80** is the 80-language translated version of M3IT.

### Languages

```python
_LAN_CODES = [
    "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca",
    "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr",
    "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id",
    "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko",
    "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi",
    "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl",
    "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta",
    "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh",
]
```

### Dataset Statistics

We report the number of train/validation/test instances of each dataset per language.

| Task                      | Dataset      | #Train | #Val | #Test |
|---------------------------|--------------|--------|------|-------|
| Classification            | `imagenet`   | 500    | 500  | 0     |
| Visual Question Answering | `vqa-v2`     | 500    | 500  | 0     |
| Knowledgeable Visual QA   | `okvqa`      | 500    | 500  | 0     |
| Reasoning                 | `winoground` | 0      | 0    | 800   |
| Generation                | `vist`       | 500    | 500  | 500   |
| Video                     | `msrvtt`     | 500    | 500  | 0     |
|                           | `msrvtt-qa`  | 500    | 500  | 0     |

### Source Data

Source language: English

| Task                      | Dataset [Citation] | Source                                                                             |
|---------------------------|--------------------|------------------------------------------------------------------------------------|
| Classification            | `imagenet` [1]     | [Source](https://www.image-net.org/)                                               |
| Visual Question Answering | `vqa-v2` [2]       | [Source](https://visualqa.org/)                                                    |
| Knowledgeable Visual QA   | `okvqa` [3]        | [Source](https://okvqa.allenai.org/)                                               |
| Reasoning                 | `winoground` [4]   | [Source](https://huggingface.co/datasets/facebook/winoground)                      |
| Generation                | `vist` [5]         | [Source](https://visionandlanguage.net/VIST/)                                      |
| Video                     | `msrvtt` [6]       | [Source](https://paperswithcode.com/dataset/msr-vtt)                               |
|                           | `msrvtt-qa` [7]    | [Source](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1) |

### Translation

We use the free [Alibaba Translate](https://www.alibabacloud.com/product/machine-translation) service, a neural machine translation (NMT) system, to translate the data.

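As a quick sanity check on the Dataset Statistics table, the per-language split sizes can be summed in a few lines. This is a minimal sketch: the `SPLIT_SIZES` dictionary and the helper name `totals_per_language` are illustrative and not part of the dataset's code, and the overall figure assumes all 80 languages are populated with exactly the counts listed in the table.

```python
# Split sizes per language, copied from the Dataset Statistics table.
SPLIT_SIZES = {
    "imagenet":   {"train": 500, "validation": 500, "test": 0},
    "vqa-v2":     {"train": 500, "validation": 500, "test": 0},
    "okvqa":      {"train": 500, "validation": 500, "test": 0},
    "winoground": {"train": 0,   "validation": 0,   "test": 800},
    "vist":       {"train": 500, "validation": 500, "test": 500},
    "msrvtt":     {"train": 500, "validation": 500, "test": 0},
    "msrvtt-qa":  {"train": 500, "validation": 500, "test": 0},
}
NUM_LANGUAGES = 80  # stated in the card


def totals_per_language(sizes=SPLIT_SIZES):
    """Sum the instance counts of every dataset, split by split."""
    totals = {"train": 0, "validation": 0, "test": 0}
    for counts in sizes.values():
        for split, n in counts.items():
            totals[split] += n
    return totals


per_language = totals_per_language()
# Assumption: every language carries the same counts as the table.
overall = {split: n * NUM_LANGUAGES for split, n in per_language.items()}

print(per_language)
print(overall)
```

Under these assumptions each language contributes 3,000 training, 3,000 validation and 1,300 test instances.
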
## Dataset Structure

### HuggingFace Login (Optional)

```python
# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```

### Data Loading

```python
from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
```

### Data Splits

```python
from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```

### Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    # Decode the first image of the instance into a PIL image.
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```

### Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```

### Licensing Information

The original datasets remain under their original licenses.
For tasks with an Unknown/Custom license, we suggest checking the original project or contacting the dataset owner for detailed license information.

Our annotated instruction data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

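The Data Loading and Data Instances snippets can be wrapped in small helpers. The sketch below is illustrative only: the function names are hypothetical, the `<dataset>-<language code>` config-name pattern is inferred from the card's single example `"okvqa-zh"` rather than documented, and the record is synthetic so the code runs without downloading anything. Pillow is imported lazily so that only `open_images` needs it.

```python
from base64 import b64decode, b64encode
from io import BytesIO


def config_name(dataset, lang):
    """Build a config name such as "okvqa-zh".

    Assumption: configs follow "<dataset>-<language code>", as the
    card's example suggests.
    """
    return f"{dataset}-{lang}"


def decode_image_bytes(instance):
    """Return the raw bytes of every image in instance["image_base64_str"]."""
    return [b64decode(s) for s in instance["image_base64_str"]]


def open_images(instance):
    """Open every decoded image with Pillow (third-party dependency)."""
    from PIL import Image  # imported here so the rest works without Pillow

    return [Image.open(BytesIO(raw)) for raw in decode_image_bytes(instance)]


# Synthetic record shaped like the Data Fields schema; the payload is
# placeholder bytes, not a real image.
fake_instance = {
    "instruction": "Answer the question.",
    "inputs": "What is shown?",
    "outputs": "A placeholder.",
    "image_base64_str": [b64encode(b"not-a-real-image").decode("ascii")],
}

raw_images = decode_image_bytes(fake_instance)
print(config_name("okvqa", "zh"), len(raw_images))
```

With a real record from `load_dataset("MMInstruction/M3IT-80", config_name("okvqa", "zh"))`, `open_images(record)` would return one PIL image per entry in the list.
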
### Citation Information

```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```

### Contributions

M3IT-80 is the translated version of M3IT, an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

## References

- [1] ImageNet Large Scale Visual Recognition Challenge
- [2] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- [3] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- [4] Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
- [5] Visual Storytelling
- [6] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
- [7] Video Question Answering via Gradually Refined Attention over Appearance and Motion

