Download link:

https://modelscope.cn/datasets/AI-ModelScope/M3IT

Resource description:

# Dataset Card for M3IT

Project Page: [M3IT](https://m3-it.github.io/)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/MMInstruction/M3IT
- **Repository:** https://huggingface.co/datasets/MMInstruction/M3IT
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Languages

English and Chinese. Versions translated into 80 languages can be found at [M3IT-80](https://huggingface.co/datasets/MMInstruction/M3IT-80).

## Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning and classification.

### Instruction Statistics

| Task                      | #Instructions |
|---------------------------|---------------|
| Image Captioning          | 52            |
| Classification            | 113           |
| Visual Question Answering | 95            |
| Knowledgeable Visual QA   | 40            |
| Reasoning                 | 60            |
| Generation                | 40            |
| Total                     | 400           |

### Task Statistics

| Task                      | Description                                                          | #Train  | #Val    | #Test   |
|---------------------------|----------------------------------------------------------------------|---------|---------|---------|
| Image Captioning          | Given an image, write a description for the image.                   | 679,087 | 41,462  | 27,499  |
| Classification            | Given an image, classify the image into pre-defined categories.      | 238,303 | 100,069 | 21,206  |
| Visual Question Answering | Given an image, answer a question relevant to the image.             | 177,633 | 46,314  | 10,828  |
| Knowledgeable Visual QA   | Given an image, answer a question that requires outside knowledge.   | 39,981  | 11,682  | 5,477   |
| Reasoning                 | Given an image, conduct reasoning over the images.                   | 99,372  | 11,500  | 10,000  |
| Generation                | Given an image, make compositions with certain requirements.         | 145,000 | 11,315  | 17,350  |
| Chinese                   | CAP, CLS, VQA, and GEN tasks in Chinese.                             | 192,076 | 77,306  | 4,100   |
| Video                     | CAP, CLS, and VQA tasks on video-language datasets.                  | 20,868  | 7,542   | 9,294   |
| Multi-lingual             | Translated tasks in 80 languages.                                    | 0       | 240,000 | 184,000 |

### Detailed Dataset Statistics

| Task                      | Dataset                      | #Train  | #Val   | #Test  |
|---------------------------|------------------------------|---------|--------|--------|
| Image Captioning          | `coco`                       | 566,747 | 25,010 | 25,010 |
|                           | `textcap`                    | 97,765  | 13,965 | 0      |
|                           | `image-paragraph-captioning` | 14,575  | 2,487  | 2,489  |
| Classification            | `coco-goi`                   | 30,000  | 2,000  | 0      |
|                           | `coco-text`                  | 118,312 | 27,550 | 0      |
|                           | `imagenet`                   | 30,000  | 50,000 | 0      |
|                           | `coco-itm`                   | 30,000  | 5,000  | 5,000  |
|                           | `snli-ve`                    | 20,000  | 14,339 | 14,740 |
|                           | `mocheg`                     | 4,991   | 180    | 466    |
|                           | `iqa`                        | 5,000   | 1,000  | 1,000  |
| Visual Question Answering | `vqa-v2`                     | 30,000  | 30,000 | 0      |
|                           | `shapes`                     | 13,568  | 1,024  | 1,024  |
|                           | `docvqa`                     | 39,463  | 5,349  | 0      |
|                           | `ocr-vqa`                    | 11,414  | 4,940  | 0      |
|                           | `st-vqa`                     | 26,074  | 0      | 4,070  |
|                           | `text-vqa`                   | 27,113  | 0      | 5,734  |
|                           | `gqa`                        | 30,001  | 5,001  | 0      |
| Knowledgeable Visual QA   | `okvqa`                      | 9,009   | 5,046  | 0      |
|                           | `a-okvqa`                    | 17,056  | 1,145  | 0      |
|                           | `science-qa`                 | 12,726  | 4,241  | 4,241  |
|                           | `viquae`                     | 1,190   | 1,250  | 1,236  |
| Reasoning                 | `clevr`                      | 30,000  | 2,000  | 0      |
|                           | `nlvr`                       | 29,372  | 2,000  | 0      |
|                           | `vcr`                        | 25,000  | 5,000  | 5,000  |
|                           | `visual-mrc`                 | 15,000  | 2,500  | 5,000  |
|                           | `winoground`                 | 0       | 0      | 800    |
| Generation                | `vist`                       | 5,000   | 4,315  | 4,350  |
|                           | `visual-dialog`              | 50,000  | 1,000  | 1,000  |
|                           | `multi30k`                   | 90,000  | 6,000  | 12,000 |
| Chinese                   | `fm-iqa`                     | 164,735 | 75,206 | 0      |
|                           | `coco-cn`                    | 18,341  | 1,000  | 1,000  |
|                           | `flickr8k-cn`                | 6,000   | 1,000  | 1,000  |
|                           | `chinese-food`               | 0       | 0      | 1,100  |
|                           | `mmchat`                     | 3,000   | 1,000  | 1,000  |
| Video                     | `ss`                         | 2,000   | 2,000  | 2,000  |
|                           | `ivqa`                       | 5,994   | 2,000  | 2,000  |
|                           | `msvd-qa`                    | 1,161   | 245    | 504    |
|                           | `activitynet-qa`             | 3,200   | 1,800  | 800    |
|                           | `msrvtt`                     | 6,513   | 497    | 2,990  |
|                           | `msrvtt-qa`                  | 2,000   | 1,000  | 1,000  |

## Dataset Structure

### HuggingFace Login (Optional)

```python
# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```

### Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

### Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```

### Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    # Decode the first base64 string into a PIL image
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```

### Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

| Task                      | Dataset [Citation]               | Source                                                                             |
|---------------------------|----------------------------------|------------------------------------------------------------------------------------|
| Image Captioning          | `coco` [1]                       | [Source](https://cocodataset.org/#home)                                            |
|                           | `textcap` [2]                    | [Source](https://textvqa.org/textcaps/)                                            |
|                           | `image-paragraph-captioning` [3] | [Source](https://cs.stanford.edu/people/ranjaykrishna/im2p/index.html)             |
| Classification            | `coco-goi` [1]                   | [Source](https://cocodataset.org/#home)                                            |
|                           | `coco-text` [4]                  | [Source](https://bgshih.github.io/cocotext/)                                       |
|                           | `imagenet` [5]                   | [Source](https://www.image-net.org/)                                               |
|                           | `coco-itm` [1]                   | [Source](https://cocodataset.org/#home)                                            |
|                           | `snli-ve` [6]                    | [Source](https://github.com/necla-ml/SNLI-VE)                                      |
|                           | `mocheg` [7]                     | [Source](https://github.com/VT-NLP/Mocheg)                                         |
|                           | `iqa` [8]                        | [Source](https://github.com/icbcbicc/IQA-Dataset)                                  |
| Visual Question Answering | `vqa-v2` [9]                     | [Source](https://visualqa.org/)                                                    |
|                           | `shapes` [10]                    | [Source](https://github.com/ronghanghu/n2nmn)                                      |
|                           | `docvqa` [11]                    | [Source](https://www.docvqa.org/)                                                  |
|                           | `ocr-vqa` [12]                   | [Source](https://ocr-vqa.github.io/)                                               |
|                           | `st-vqa` [13]                    | [Source](https://rrc.cvc.uab.es/?ch=11)                                            |
|                           | `text-vqa` [14]                  | [Source](https://textvqa.org/)                                                     |
|                           | `gqa` [15]                       | [Source](https://cs.stanford.edu/people/dorarad/gqa/about.html)                    |
| Knowledgeable Visual QA   | `okvqa` [16]                     | [Source](https://okvqa.allenai.org/)                                               |
|                           | `a-okvqa` [17]                   | [Source](https://allenai.org/project/a-okvqa/home)                                 |
|                           | `science-qa` [18]                | [Source](https://scienceqa.github.io/)                                             |
|                           | `viquae` [19]                    | [Source](https://github.com/PaulLerner/ViQuAE)                                     |
| Reasoning                 | `clevr` [20]                     | [Source](https://cs.stanford.edu/people/jcjohns/clevr/)                            |
|                           | `nlvr` [21]                      | [Source](https://lil.nlp.cornell.edu/nlvr/)                                        |
|                           | `vcr` [22]                       | [Source](https://visualcommonsense.com/)                                           |
|                           | `visual-mrc` [23]                | [Source](https://github.com/nttmdlab-nlp/VisualMRC)                                |
|                           | `winoground` [24]                | [Source](https://huggingface.co/datasets/facebook/winoground)                      |
| Generation                | `vist` [25]                      | [Source](https://visionandlanguage.net/VIST/)                                      |
|                           | `visual-dialog` [26]             | [Source](https://visualdialog.org/)                                                |
|                           | `multi30k` [27]                  | [Source](https://github.com/multi30k/dataset)                                      |
| Chinese                   | `fm-iqa` [28]                    | [Source](https://paperswithcode.com/dataset/fm-iqa)                                |
|                           | `coco-cn` [29]                   | [Source](https://github.com/li-xirong/coco-cn)                                     |
|                           | `flickr8k-cn` [30]               | [Source](https://github.com/li-xirong/flickr8kcn)                                  |
|                           | `chinese-food` [31]              | [Source](https://sites.google.com/view/chinesefoodnet)                             |
|                           | `mmchat` [32]                    | [Source](https://github.com/silverriver/MMChat)                                    |
| Video                     | `ss` [33]                        | [Source](https://developer.qualcomm.com/software/ai-datasets/something-something)  |
|                           | `ivqa` [34]                      | [Source](https://antoyang.github.io/just-ask.html)                                 |
|                           | `msvd-qa` [35]                   | [Source](https://paperswithcode.com/dataset/msvd)                                  |
|                           | `activitynet-qa` [36]            | [Source](https://github.com/MILVLG/activitynet-qa)                                 |
|                           | `msrvtt` [35]                    | [Source](https://paperswithcode.com/dataset/msr-vtt)                               |
|                           | `msrvtt-qa` [37]                 | [Source](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1) |

### Annotations

#### Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into a multimodal-to-text dialog format. The annotation process includes four steps:

- (1) **Stage I: Instruction Writing**: writing instructions for each task;
- (2) **Stage II: Data Format Unification**: structuring images and texts into a unified schema;
- (3) **Stage III: Quality Check**: checking the overall dataset quality;
- (4) **Stage IV: Key Datasets Translation**: building multilingual sets.

#### Who are the annotators?

Eight authors of this work served as human annotators, each of whom is a graduate student familiar with the relevant literature.

## Additional Information

### Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

### Citation Information

```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```

### Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

## References

- [1] Microsoft COCO: Common Objects in Context
- [2] TextCaps: A Dataset for Image Captioning with Reading Comprehension
- [3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
- [4] COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images
- [5] ImageNet Large Scale Visual Recognition Challenge
- [6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
- [7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
- [8] Quantifying Visual Image Quality: A Bayesian View
- [9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- [10] Neural Module Networks
- [11] DocVQA: A Dataset for VQA on Document Images
- [12] OCR-VQA: Visual Question Answering by Reading Text in Images
- [13] Scene Text Visual Question Answering
- [14] Towards VQA Models That Can Read
- [15] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- [16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- [17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
- [18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- [19] ViQuAE: A Dataset for Knowledge-based Visual Question Answering about Named Entities
- [20] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
- [21] A Corpus of Natural Language for Visual Reasoning
- [22] From Recognition to Cognition: Visual Commonsense Reasoning
- [23] VisualMRC: Machine Reading Comprehension on Document Images
- [24] Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
- [25] Visual Storytelling
- [26] Visual Dialog
- [27] Multi30K: Multilingual English-German Image Descriptions
- [28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
- [29] COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval
- [30] Adding Chinese Captions to Images
- [31] ChineseFoodNet: A Large-Scale Image Dataset for Chinese Food Recognition
- [32] MMChat: Multi-Modal Chat Dataset on Social Media
- [33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
- [34] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- [35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
- [36] ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
- [37] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
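
As a supplementary illustration of the record layout listed under Data Fields, the sketch below shows one way a single instance might be unpacked without any third-party dependency. The helper names `decode_images` and `build_prompt`, the blank-line prompt layout, and the sample record are all hypothetical and not part of the M3IT release; the only things taken from the card are the four field names and the fact that `image_base64_str` is a list of base64 strings.

```python
from base64 import b64decode, b64encode


def decode_images(instance):
    """Return the raw image bytes for every base64 string in the record."""
    return [b64decode(s) for s in instance["image_base64_str"]]


def build_prompt(instance):
    """Join instruction and inputs, skipping empty parts.

    The blank-line separator is an illustrative choice, not a format
    prescribed by the dataset.
    """
    parts = [instance["instruction"].strip(), instance["inputs"].strip()]
    return "\n\n".join(p for p in parts if p)


# Hypothetical record that mirrors the four documented fields.
example = {
    "instruction": "Write a short description of the image.",
    "inputs": "",
    "image_base64_str": [b64encode(b"fake-image-bytes").decode("ascii")],
    "outputs": "A placeholder caption.",
}

image_bytes = decode_images(example)
prompt = build_prompt(example)
```

With a real instance, each element of `image_bytes` could be wrapped in `BytesIO` and opened with `PIL.Image.open`, as the Data Instances snippet does for the first image.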

Application scenarios: