
M3IT

ModelScope Community. Updated 2026-05-14; indexed 2024-05-25.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/M3IT
Resource description:
# Dataset Card for M3IT

Project Page: [M3IT](https://m3-it.github.io/)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/MMInstruction/M3IT
- **Repository:** https://huggingface.co/datasets/MMInstruction/M3IT
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Languages

English and Chinese. A version translated into 80 languages is available at [M3IT-80](https://huggingface.co/datasets/MMInstruction/M3IT-80).

## Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning, and classification.

### Instruction Statistics

| Task | #Instructions |
|---------------------------|---------------|
| Image Captioning | 52 |
| Classification | 113 |
| Visual Question Answering | 95 |
| Knowledgeable Visual QA | 40 |
| Reasoning | 60 |
| Generation | 40 |
| Total | 400 |

### Task Statistics

| Task | Description | #Train | #Val | #Test |
|---------------------------|-----------------------------------------------------------------|---------|---------|---------|
| Image Captioning | Given an image, write a description for the image. | 679,087 | 41,462 | 27,499 |
| Classification | Given an image, classify the image into pre-defined categories. | 238,303 | 100,069 | 21,206 |
| Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
| Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
| Reasoning | Given an image, conduct reasoning over the image. | 99,372 | 11,500 | 10,000 |
| Generation | Given an image, make compositions with certain requirements. | 145,000 | 11,315 | 17,350 |
| Chinese | CAP, CLS, VQA, and GEN tasks in Chinese. | 192,076 | 77,306 | 4,100 |
| Video | CAP, CLS, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
| Multi-lingual | Translated tasks in 80 languages. | 0 | 240,000 | 184,000 |

### Detailed Dataset Statistics

| Task | Dataset | #Train | #Val | #Test |
|---------------------------|------------------------------|---------|--------|--------|
| Image Captioning | `coco` | 566,747 | 25,010 | 25,010 |
| | `textcap` | 97,765 | 13,965 | 0 |
| | `image-paragraph-captioning` | 14,575 | 2,487 | 2,489 |
| Classification | `coco-goi` | 30,000 | 2,000 | 0 |
| | `coco-text` | 118,312 | 27,550 | 0 |
| | `imagenet` | 30,000 | 50,000 | 0 |
| | `coco-itm` | 30,000 | 5,000 | 5,000 |
| | `snli-ve` | 20,000 | 14,339 | 14,740 |
| | `mocheg` | 4,991 | 180 | 466 |
| | `iqa` | 5,000 | 1,000 | 1,000 |
| Visual Question Answering | `vqa-v2` | 30,000 | 30,000 | 0 |
| | `shapes` | 13,568 | 1,024 | 1,024 |
| | `docvqa` | 39,463 | 5,349 | 0 |
| | `ocr-vqa` | 11,414 | 4,940 | 0 |
| | `st-vqa` | 26,074 | 0 | 4,070 |
| | `text-vqa` | 27,113 | 0 | 5,734 |
| | `gqa` | 30,001 | 5,001 | 0 |
| Knowledgeable Visual QA | `okvqa` | 9,009 | 5,046 | 0 |
| | `a-okvqa` | 17,056 | 1,145 | 0 |
| | `science-qa` | 12,726 | 4,241 | 4,241 |
| | `viquae` | 1,190 | 1,250 | 1,236 |
| Reasoning | `clevr` | 30,000 | 2,000 | 0 |
| | `nlvr` | 29,372 | 2,000 | 0 |
| | `vcr` | 25,000 | 5,000 | 5,000 |
| | `visual-mrc` | 15,000 | 2,500 | 5,000 |
| | `winoground` | 0 | 0 | 800 |
| Generation | `vist` | 5,000 | 4,315 | 4,350 |
| | `visual-dialog` | 50,000 | 1,000 | 1,000 |
| | `multi30k` | 90,000 | 6,000 | 12,000 |
| Chinese | `fm-iqa` | 164,735 | 75,206 | 0 |
| | `coco-cn` | 18,341 | 1,000 | 1,000 |
| | `flickr8k-cn` | 6,000 | 1,000 | 1,000 |
| | `chinese-food` | 0 | 0 | 1,100 |
| | `mmchat` | 3,000 | 1,000 | 1,000 |
| Video | `ss` | 2,000 | 2,000 | 2,000 |
| | `ivqa` | 5,994 | 2,000 | 2,000 |
| | `msvd-qa` | 1,161 | 245 | 504 |
| | `activitynet-qa` | 3,200 | 1,800 | 800 |
| | `msrvtt` | 6,513 | 497 | 2,990 |
| | `msrvtt-qa` | 2,000 | 1,000 | 1,000 |

## Dataset Structure

### HuggingFace Login (Optional)

```python
# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```

### Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

### Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```

### Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    # Decode the first base64 string into a PIL image
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```

### Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

| Task | Dataset [Citation] | Source |
|---------------------------|----------------------------------|------------------------------------------------------------------------------------|
| Image Captioning | `coco` [1] | [Source](https://cocodataset.org/#home) |
| | `textcap` [2] | [Source](https://textvqa.org/textcaps/) |
| | `image-paragraph-captioning` [3] | [Source](https://cs.stanford.edu/people/ranjaykrishna/im2p/index.html) |
| Classification | `coco-goi` [1] | [Source](https://cocodataset.org/#home) |
| | `coco-text` [4] | [Source](https://bgshih.github.io/cocotext/) |
| | `imagenet` [5] | [Source](https://www.image-net.org/) |
| | `coco-itm` [1] | [Source](https://cocodataset.org/#home) |
| | `snli-ve` [6] | [Source](https://github.com/necla-ml/SNLI-VE) |
| | `mocheg` [7] | [Source](https://github.com/VT-NLP/Mocheg) |
| | `iqa` [8] | [Source](https://github.com/icbcbicc/IQA-Dataset) |
| Visual Question Answering | `vqa-v2` [9] | [Source](https://visualqa.org/) |
| | `shapes` [10] | [Source](https://github.com/ronghanghu/n2nmn) |
| | `docvqa` [11] | [Source](https://www.docvqa.org/) |
| | `ocr-vqa` [12] | [Source](https://ocr-vqa.github.io/) |
| | `st-vqa` [13] | [Source](https://rrc.cvc.uab.es/?ch=11) |
| | `text-vqa` [14] | [Source](https://textvqa.org/) |
| | `gqa` [15] | [Source](https://cs.stanford.edu/people/dorarad/gqa/about.html) |
| Knowledgeable Visual QA | `okvqa` [16] | [Source](https://okvqa.allenai.org/) |
| | `a-okvqa` [17] | [Source](https://allenai.org/project/a-okvqa/home) |
| | `science-qa` [18] | [Source](https://scienceqa.github.io/) |
| | `viquae` [19] | [Source](https://github.com/PaulLerner/ViQuAE) |
| Reasoning | `clevr` [20] | [Source](https://cs.stanford.edu/people/jcjohns/clevr/) |
| | `nlvr` [21] | [Source](https://lil.nlp.cornell.edu/nlvr/) |
| | `vcr` [22] | [Source](https://visualcommonsense.com/) |
| | `visual-mrc` [23] | [Source](https://github.com/nttmdlab-nlp/VisualMRC) |
| | `winoground` [24] | [Source](https://huggingface.co/datasets/facebook/winoground) |
| Generation | `vist` [25] | [Source](https://visionandlanguage.net/VIST/) |
| | `visual-dialog` [26] | [Source](https://visualdialog.org/) |
| | `multi30k` [27] | [Source](https://github.com/multi30k/dataset) |
| Chinese | `fm-iqa` [28] | [Source](https://paperswithcode.com/dataset/fm-iqa) |
| | `coco-cn` [29] | [Source](https://github.com/li-xirong/coco-cn) |
| | `flickr8k-cn` [30] | [Source](https://github.com/li-xirong/flickr8kcn) |
| | `chinese-food` [31] | [Source](https://sites.google.com/view/chinesefoodnet) |
| | `mmchat` [32] | [Source](https://github.com/silverriver/MMChat) |
| Video | `ss` [33] | [Source](https://developer.qualcomm.com/software/ai-datasets/something-something) |
| | `ivqa` [34] | [Source](https://antoyang.github.io/just-ask.html) |
| | `msvd-qa` [35] | [Source](https://paperswithcode.com/dataset/msvd) |
| | `activitynet-qa` [36] | [Source](https://github.com/MILVLG/activitynet-qa) |
| | `msrvtt` [35] | [Source](https://paperswithcode.com/dataset/msr-vtt) |
| | `msrvtt-qa` [37] | [Source](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1) |

### Annotations

#### Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into a multimodal-to-text dialog format. The annotation process includes four steps:

- (1) **Stage I: Instruction Writing**: writing instructions for each task;
- (2) **Stage II: Data Format Unification**: structuring images and texts into a unified schema;
- (3) **Stage III: Quality Check**: checking the overall dataset quality;
- (4) **Stage IV: Key Datasets Translation**: building multilingual sets.

#### Who are the annotators?

Eight authors of this work served as human annotators, each of whom is a graduate student familiar with the relevant literature.

## Additional Information

### Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

### Citation Information

```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```

### Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

## References

- [1] Microsoft COCO: Common Objects in Context
- [2] TextCaps: a dataset for image captioning with reading comprehension
- [3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
- [4] COCO-Text: Dataset and benchmark for text detection and recognition in natural images
- [5] Imagenet large scale visual recognition challenge
- [6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
- [7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
- [8] Quantifying visual image quality: A Bayesian view
- [9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- [10] Neural Module Networks
- [11] DocVQA: A dataset for vqa on document images
- [12] OCR-VQA: Visual Question Answering by Reading Text in Images
- [13] Scene Text Visual Question Answering
- [14] Towards VQA Models That Can Read
- [15] GQA: A new dataset for real-world visual reasoning and compositional question answering
- [16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- [17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
- [18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- [19] ViQuAE: a dataset for knowledge-based visual question answering about named entities
- [20] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
- [21] A Corpus of Natural Language for Visual Reasoning
- [22] From recognition to cognition: Visual Commonsense Reasoning
- [23] VisualMRC: Machine reading comprehension on document images
- [24] WinoGround: Probing vision and language models for visio-linguistic compositionality
- [25] Visual Storytelling
- [26] Visual Dialog
- [27] Multi30k: Multilingual english-german image descriptions
- [28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
- [29] COCO-CN for cross-lingual image tagging, captioning, and retrieval
- [30] Adding Chinese Captions to Images
- [31] ChineseFoodNet: A large-scale image dataset for chinese food recognition
- [32] MMChat: Multi-Modal Chat Dataset on Social Media
- [33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
- [34] Just Ask: Learning to answer questions from millions of narrated videos
- [35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
- [36] ActivityNet-qa: A dataset for understanding complex web videos via question answering
- [37] MSR-VTT: A large video description dataset for bridging video and language
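
As a rough illustration of the unified schema listed under Data Fields (`instruction`, `inputs`, `image_base64_str`, `outputs`), the sketch below packs one example into that shape and decodes it again. It is not the authors' conversion pipeline: the helper names `to_unified_instance` and `decode_images` are hypothetical, and the image bytes and question text are placeholders rather than real dataset content. It uses only the Python standard library.

```python
import base64
from typing import Dict, List, Union

# One instance: three string fields plus a list of base64-encoded images.
Instance = Dict[str, Union[str, List[str]]]


def to_unified_instance(
    instruction: str, inputs: str, outputs: str, images: List[bytes]
) -> Instance:
    """Pack one example into the four fields listed under Data Fields (hypothetical helper)."""
    return {
        "instruction": instruction,
        "inputs": inputs,
        # Each image is stored as a base64 string, so the field is a list of str.
        "image_base64_str": [base64.b64encode(img).decode("ascii") for img in images],
        "outputs": outputs,
    }


def decode_images(instance: Instance) -> List[bytes]:
    """Recover the raw image bytes from an instance (hypothetical helper)."""
    return [base64.b64decode(s) for s in instance["image_base64_str"]]


# Placeholder bytes stand in for a real encoded image file (assumed data).
fake_image = b"\x89PNG\r\n\x1a\n placeholder"
example = to_unified_instance(
    instruction="Answer the question about the image.",
    inputs="What color is the bus?",
    outputs="Red",
    images=[fake_image],
)

# Round trip: decoding returns exactly the bytes that were encoded.
assert decode_images(example) == [fake_image]
```

With real data, the decoded bytes would be passed to an image library, as the Data Instances snippet does with `PIL.Image.open(BytesIO(...))`.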
Provider:
maas
Created:
2024-05-20
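Background overview
M3IT is a large-scale multimodal, multilingual instruction-tuning dataset covering image captioning, visual question answering, and other tasks. It supports Chinese and English plus 80 other languages and aims to facilitate the development of general-purpose multimodal agents. The dataset offers a wide range of task types and a large volume of data, suiting diverse vision-language research needs.
The content above was collected and summarized by 遇见数据集.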