MMInstruction/M3IT

Name: MMInstruction/M3IT
Creator: MMInstruction
Published: 2023-11-24 08:23:25
License: No description available

Hugging Face · Updated 2023-11-24 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/MMInstruction/M3IT

Resource description:

---
license: other
task_categories:
- image-to-text
- image-classification
size_categories:
- 1M<n<10M
language:
- en
- zh
---

# Dataset Card for M3IT

Project Page: [M3IT](https://m3-it.github.io/)

## Dataset Description

- **Homepage: https://huggingface.co/datasets/MMInstruction/M3IT**
- **Repository: https://huggingface.co/datasets/MMInstruction/M3IT**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Languages

English and Chinese. Versions translated into 80 languages can be found at [M3IT-80](https://huggingface.co/datasets/MMInstruction/M3IT-80).

## Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification.

### Instruction Statistics

| Task                      | #Instructions |
|---------------------------|---------------|
| Image Captioning          | 52            |
| Classification            | 113           |
| Visual Question Answering | 95            |
| Knowledgeable Visual QA   | 40            |
| Reasoning                 | 60            |
| Generation                | 40            |
| Total                     | 400           |

### Task Statistics

| Task                      | Description                                                        | #Train  | #Val    | #Test   |
|---------------------------|--------------------------------------------------------------------|---------|---------|---------|
| Image Captioning          | Given an image, write a description for the image.                 | 679,087 | 41,462  | 27,499  |
| Classification            | Given an image, classify the image into pre-defined categories.    | 238,303 | 100,069 | 21,206  |
| Visual Question Answering | Given an image, answer a question relevant to the image.           | 177,633 | 46,314  | 10,828  |
| Knowledgeable Visual QA   | Given an image, answer a question that requires outside knowledge. | 39,981  | 11,682  | 5,477   |
| Reasoning                 | Given an image, conduct reasoning over the image.                  | 99,372  | 11,500  | 10,000  |
| Generation                | Given an image, make compositions with certain requirements.       | 145,000 | 11,315  | 17,350  |
| Chinese                   | CAP, CLS, VQA, and GEN tasks in Chinese.                           | 192,076 | 77,306  | 4,100   |
| Video                     | CAP, CLS, and VQA tasks on video-language datasets.                | 20,868  | 7,542   | 9,294   |
| Multi-lingual             | Translated tasks in 80 languages.                                  | 0       | 240,000 | 184,000 |

### Detailed Dataset Statistics

| Task                      | Dataset                      | #Train  | #Val   | #Test  |
|---------------------------|------------------------------|---------|--------|--------|
| Image Captioning          | `coco`                       | 566,747 | 25,010 | 25,010 |
|                           | `textcap`                    | 97,765  | 13,965 | 0      |
|                           | `image-paragraph-captioning` | 14,575  | 2,487  | 2,489  |
| Classification            | `coco-goi`                   | 30,000  | 2,000  | 0      |
|                           | `coco-text`                  | 118,312 | 27,550 | 0      |
|                           | `imagenet`                   | 30,000  | 50,000 | 0      |
|                           | `coco-itm`                   | 30,000  | 5,000  | 5,000  |
|                           | `snli-ve`                    | 20,000  | 14,339 | 14,740 |
|                           | `mocheg`                     | 4,991   | 180    | 466    |
|                           | `iqa`                        | 5,000   | 1,000  | 1,000  |
| Visual Question Answering | `vqa-v2`                     | 30,000  | 30,000 | 0      |
|                           | `shapes`                     | 13,568  | 1,024  | 1,024  |
|                           | `docvqa`                     | 39,463  | 5,349  | 0      |
|                           | `ocr-vqa`                    | 11,414  | 4,940  | 0      |
|                           | `st-vqa`                     | 26,074  | 0      | 4,070  |
|                           | `text-vqa`                   | 27,113  | 0      | 5,734  |
|                           | `gqa`                        | 30,001  | 5,001  | 0      |
| Knowledgeable Visual QA   | `okvqa`                      | 9,009   | 5,046  | 0      |
|                           | `a-okvqa`                    | 17,056  | 1,145  | 0      |
|                           | `science-qa`                 | 12,726  | 4,241  | 4,241  |
|                           | `viquae`                     | 1,190   | 1,250  | 1,236  |
| Reasoning                 | `clevr`                      | 30,000  | 2,000  | 0      |
|                           | `nlvr`                       | 29,372  | 2,000  | 0      |
|                           | `vcr`                        | 25,000  | 5,000  | 5,000  |
|                           | `visual-mrc`                 | 15,000  | 2,500  | 5,000  |
|                           | `winoground`                 | 0       | 0      | 800    |
| Generation                | `vist`                       | 5,000   | 4,315  | 4,350  |
|                           | `visual-dialog`              | 50,000  | 1,000  | 1,000  |
|                           | `multi30k`                   | 90,000  | 6,000  | 12,000 |
| Chinese                   | `fm-iqa`                     | 164,735 | 75,206 | 0      |
|                           | `coco-cn`                    | 18,341  | 1,000  | 1,000  |
|                           | `flickr8k-cn`                | 6,000   | 1,000  | 1,000  |
|                           | `chinese-food`               | 0       | 0      | 1,100  |
|                           | `mmchat`                     | 3,000   | 1,000  | 1,000  |
| Video                     | `ss`                         | 2,000   | 2,000  | 2,000  |
|                           | `ivqa`                       | 5,994   | 2,000  | 2,000  |
|                           | `msvd-qa`                    | 1,161   | 245    | 504    |
|                           | `activitynet-qa`             | 3,200   | 1,800  | 800    |
|                           | `msrvtt`                     | 6,513   | 497    | 2,990  |
|                           | `msrvtt-qa`                  | 2,000   | 1,000  | 1,000  |

## Dataset Structure

### HuggingFace Login (Optional)

```python
# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```

### Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

### Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```

### Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    # Decode the first image of the instance into a PIL image
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```

### Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

| Task                      | Dataset [Citation]               | Source                                                                              |
|---------------------------|----------------------------------|-------------------------------------------------------------------------------------|
| Image Captioning          | `coco` [1]                       | [Source](https://cocodataset.org/#home)                                             |
|                           | `textcap` [2]                    | [Source](https://textvqa.org/textcaps/)                                             |
|                           | `image-paragraph-captioning` [3] | [Source](https://cs.stanford.edu/people/ranjaykrishna/im2p/index.html)              |
| Classification            | `coco-goi` [1]                   | [Source](https://cocodataset.org/#home)                                             |
|                           | `coco-text` [4]                  | [Source](https://bgshih.github.io/cocotext/)                                        |
|                           | `imagenet` [5]                   | [Source](https://www.image-net.org/)                                                |
|                           | `coco-itm` [1]                   | [Source](https://cocodataset.org/#home)                                             |
|                           | `snli-ve` [6]                    | [Source](https://github.com/necla-ml/SNLI-VE)                                       |
|                           | `mocheg` [7]                     | [Source](https://github.com/VT-NLP/Mocheg)                                          |
|                           | `iqa` [8]                        | [Source](https://github.com/icbcbicc/IQA-Dataset)                                   |
| Visual Question Answering | `vqa-v2` [9]                     | [Source](https://visualqa.org/)                                                     |
|                           | `shapes` [10]                    | [Source](https://github.com/ronghanghu/n2nmn)                                       |
|                           | `docvqa` [11]                    | [Source](https://www.docvqa.org/)                                                   |
|                           | `ocr-vqa` [12]                   | [Source](https://ocr-vqa.github.io/)                                                |
|                           | `st-vqa` [13]                    | [Source](https://rrc.cvc.uab.es/?ch=11)                                             |
|                           | `text-vqa` [14]                  | [Source](https://textvqa.org/)                                                      |
|                           | `gqa` [15]                       | [Source](https://cs.stanford.edu/people/dorarad/gqa/about.html)                     |
| Knowledgeable Visual QA   | `okvqa` [16]                     | [Source](https://okvqa.allenai.org/)                                                |
|                           | `a-okvqa` [17]                   | [Source](https://allenai.org/project/a-okvqa/home)                                  |
|                           | `science-qa` [18]                | [Source](https://scienceqa.github.io/)                                              |
|                           | `viquae` [19]                    | [Source](https://github.com/PaulLerner/ViQuAE)                                      |
| Reasoning                 | `clevr` [20]                     | [Source](https://cs.stanford.edu/people/jcjohns/clevr/)                             |
|                           | `nlvr` [21]                      | [Source](https://lil.nlp.cornell.edu/nlvr/)                                         |
|                           | `vcr` [22]                       | [Source](https://visualcommonsense.com/)                                            |
|                           | `visual-mrc` [23]                | [Source](https://github.com/nttmdlab-nlp/VisualMRC)                                 |
|                           | `winoground` [24]                | [Source](https://huggingface.co/datasets/facebook/winoground)                       |
| Generation                | `vist` [25]                      | [Source](https://visionandlanguage.net/VIST/)                                       |
|                           | `visual-dialog` [26]             | [Source](https://visualdialog.org/)                                                 |
|                           | `multi30k` [27]                  | [Source](https://github.com/multi30k/dataset)                                       |
| Chinese                   | `fm-iqa` [28]                    | [Source](https://paperswithcode.com/dataset/fm-iqa)                                 |
|                           | `coco-cn` [29]                   | [Source](https://github.com/li-xirong/coco-cn)                                      |
|                           | `flickr8k-cn` [30]               | [Source](https://github.com/li-xirong/flickr8kcn)                                   |
|                           | `chinese-food` [31]              | [Source](https://sites.google.com/view/chinesefoodnet)                              |
|                           | `mmchat` [32]                    | [Source](https://github.com/silverriver/MMChat)                                     |
| Video                     | `ss` [33]                        | [Source](https://developer.qualcomm.com/software/ai-datasets/something-something)   |
|                           | `ivqa` [34]                      | [Source](https://antoyang.github.io/just-ask.html)                                  |
|                           | `msvd-qa` [35]                   | [Source](https://paperswithcode.com/dataset/msvd)                                   |
|                           | `activitynet-qa` [36]            | [Source](https://github.com/MILVLG/activitynet-qa)                                  |
|                           | `msrvtt` [35]                    | [Source](https://paperswithcode.com/dataset/msr-vtt)                                |
|                           | `msrvtt-qa` [37]                 | [Source](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1)  |

### Annotations

#### Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into a multimodal-to-text dialog format. The annotation process includes four steps:

- (1) **Stage I: Instruction Writing**: writing instructions for each task;
- (2) **Stage II: Data Format Unification**: structuring images and texts into a unified schema;
- (3) **Stage III: Quality Check**: checking the overall dataset quality;
- (4) **Stage IV: Key Datasets Translation**: building multilingual sets.

#### Who are the annotators?

Eight authors of this work served as human annotators; each is a graduate student familiar with the relevant literature.

## Additional Information

### Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

### Citation Information

```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```

### Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

## References

- [1] Microsoft COCO: Common Objects in Context
- [2] TextCaps: a dataset for image captioning with reading comprehension
- [3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
- [4] COCO-Text: Dataset and benchmark for text detection and recognition in natural images
- [5] Imagenet large scale visual recognition challenge
- [6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
- [7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
- [8] Quantifying visual image quality: A Bayesian view
- [9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- [10] Neural Module Networks
- [11] DocVQA: A dataset for vqa on document images
- [12] OCR-VQA: Visual Question Answering by Reading Text in Images
- [13] Scene Text Visual Question Answering
- [14] Towards VQA Models That Can Read
- [15] GQA: A new dataset for real-world visual reasoning and compositional question answering
- [16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- [17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
- [18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- [19] ViQuAE: a dataset for knowledge-based visual question answering about named entities
- [20] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
- [21] A Corpus of Natural Language for Visual Reasoning
- [22] From recognition to cognition: Visual Commonsense Reasoning
- [23] VisualMRC: Machine reading comprehension on document images
- [24] WinoGround: Probing vision and language models for visio-linguistic compositionality
- [25] Visual Storytelling
- [26] Visual Dialog
- [27] Multi30k: Multilingual english-german image descriptions
- [28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
- [29] COCO-CN for cross-lingual image tagging, captioning, and retrieval
- [30] Adding Chinese Captions to Images
- [31] ChineseFoodNet: A large-scale image dataset for chinese food recognition
- [32] MMChat: Multi-Modal Chat Dataset on Social Media
- [33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
- [34] Just Ask: Learning to answer questions from millions of narrated videos
- [35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
- [36] ActivityNet-qa: A dataset for understanding complex web videos via question answering
- [37] MSR-VTT: A large video description dataset for bridging video and language

Provider:

MMInstruction

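Summary of Original Information

M3IT Dataset Overview

Dataset Description

Name: M3IT
Categories:
- Task categories:
  - Image-to-text
  - Image classification
- Size category: 1M<n<10M
Languages: English, Chinese
License: Other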

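Dataset Statistics

Instruction Statistics

| Task                      | #Instructions |
|---------------------------|---------------|
| Image Captioning          | 52            |
| Classification            | 113           |
| Visual Question Answering | 95            |
| Knowledgeable Visual QA   | 40            |
| Reasoning                 | 60            |
| Generation                | 40            |
| Total                     | 400           |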

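Task Statistics

| Task                      | Description                                                                 | #Train  | #Val    | #Test   |
|---------------------------|-----------------------------------------------------------------------------|---------|---------|---------|
| Image Captioning          | Given an image, write a description for it.                                 | 679,087 | 41,462  | 27,499  |
| Classification            | Given an image, classify it into pre-defined categories.                    | 238,303 | 100,069 | 21,206  |
| Visual Question Answering | Given an image, answer a question relevant to the image.                    | 177,633 | 46,314  | 10,828  |
| Knowledgeable Visual QA   | Given an image, answer a question that requires outside knowledge.          | 39,981  | 11,682  | 5,477   |
| Reasoning                 | Given an image, conduct reasoning over the image.                           | 99,372  | 11,500  | 10,000  |
| Generation                | Given an image, make compositions with certain requirements.                | 145,000 | 11,315  | 17,350  |
| Chinese                   | Captioning, classification, VQA, and generation tasks in Chinese.           | 192,076 | 77,306  | 4,100   |
| Video                     | Captioning, classification, and VQA tasks on video-language datasets.       | 20,868  | 7,542   | 9,294   |
| Multi-lingual             | Translated tasks in 80 languages.                                           | 0       | 240,000 | 184,000 |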
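
As a quick consistency check on the task statistics, the per-split totals can be tallied from the table. The following is a minimal sketch: the counts are copied from the table above, while the names `TASK_STATS` and `split_totals` are illustrative and not part of the dataset.

```python
# Per-task example counts copied from the task statistics table,
# stored as (train, validation, test).
TASK_STATS = {
    "Image Captioning": (679_087, 41_462, 27_499),
    "Classification": (238_303, 100_069, 21_206),
    "Visual Question Answering": (177_633, 46_314, 10_828),
    "Knowledgeable Visual QA": (39_981, 11_682, 5_477),
    "Reasoning": (99_372, 11_500, 10_000),
    "Generation": (145_000, 11_315, 17_350),
    "Chinese": (192_076, 77_306, 4_100),
    "Video": (20_868, 7_542, 9_294),
    "Multi-lingual": (0, 240_000, 184_000),
}


def split_totals(stats):
    """Sum the train, validation, and test columns across all tasks."""
    train = sum(row[0] for row in stats.values())
    val = sum(row[1] for row in stats.values())
    test = sum(row[2] for row in stats.values())
    return train, val, test


train_total, val_total, test_total = split_totals(TASK_STATS)
print(f"train={train_total:,} val={val_total:,} test={test_total:,}")
```

The training total comes to 1,592,320 examples, which falls inside the listed 1M<n<10M size category.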

Dataset Structure

Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```
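
The detailed statistics list zero examples for some splits (for example, `textcap` has no test examples), so indexing every split unconditionally may not work for every subset. The sketch below is one defensive way to collect only the usable splits. Whether such a subset omits the split key or exposes an empty split is an assumption not confirmed by the dataset card, so the hypothetical helper `available_splits` handles both cases; it accepts any mapping from split name to a sized collection, which includes the object returned by `load_dataset`.

```python
from typing import Any, Dict, Iterable, Mapping


def available_splits(
    dataset: Mapping[str, Any],
    wanted: Iterable[str] = ("train", "validation", "test"),
) -> Dict[str, Any]:
    """Return only the requested splits that exist and contain at least one row.

    `dataset` is any mapping from split name to a sized collection.
    This helper is illustrative and not part of the M3IT repository.
    """
    usable = {}
    for name in wanted:
        # Skip split names the subset does not provide at all.
        if name not in dataset:
            continue
        # Skip splits that are present but empty.
        if len(dataset[name]) == 0:
            continue
        usable[name] = dataset[name]
    return usable
```

With a loaded subset, `available_splits(dataset)` returns a plain dictionary that can be iterated without first checking which splits the subset provides.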

Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    # Decode the first image of the instance into a PIL image
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```
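
Because `image_base64_str` holds a list of base64 strings, an instance may carry more than one image. The following sketch wraps the decoding step from the loop above into two small helpers that process the whole list. The function names `decode_image_bytes` and `open_images` are hypothetical; Pillow is imported lazily so that the byte-level helper runs without it.

```python
from base64 import b64decode
from io import BytesIO
from typing import List


def decode_image_bytes(image_base64_str_list: List[str]) -> List[bytes]:
    """Decode every base64 string in the list into raw image bytes."""
    return [b64decode(encoded) for encoded in image_base64_str_list]


def open_images(image_base64_str_list: List[str]) -> list:
    """Decode every base64 string and open each one as a PIL image."""
    # Lazy import: Pillow is only needed when images are actually opened.
    from PIL import Image

    return [
        Image.open(BytesIO(raw))
        for raw in decode_image_bytes(image_base64_str_list)
    ]
```

Inside the loop, `open_images(train_instance["image_base64_str"])` would then replace the single-image decoding line.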

Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```
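
The schema above can also be checked on individual records in plain Python, which may help when building or filtering custom instruction data in the same format. This is a minimal sketch using only the four field names from the schema; `validate_instance` is a hypothetical helper, not an API of the `datasets` library.

```python
from typing import Any, Mapping

# Fields declared as plain strings in the schema.
STRING_FIELDS = ("instruction", "inputs", "outputs")


def validate_instance(instance: Mapping[str, Any]) -> None:
    """Raise TypeError if the record does not match the M3IT field types."""
    for field in STRING_FIELDS:
        if not isinstance(instance.get(field), str):
            raise TypeError(f"field {field!r} must be a string")

    # image_base64_str is declared as a list of strings.
    images = instance.get("image_base64_str")
    if not isinstance(images, list):
        raise TypeError("field 'image_base64_str' must be a list")
    if not all(isinstance(item, str) for item in images):
        raise TypeError("every entry of 'image_base64_str' must be a string")
```

Calling `validate_instance(train_instance)` returns `None` for a well-formed record and raises `TypeError` otherwise.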