
MMInstruction/M3IT

Hugging Face | Updated 2023-11-24 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/MMInstruction/M3IT
Resource description:
---
license: other
task_categories:
- image-to-text
- image-classification
size_categories:
- 1M<n<10M
language:
- en
- zh
---

# Dataset Card for M3IT

Project Page: [M3IT](https://m3-it.github.io/)

## Dataset Description

- **Homepage: https://huggingface.co/datasets/MMInstruction/M3IT**
- **Repository: https://huggingface.co/datasets/MMInstruction/M3IT**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Languages

English and Chinese. Versions translated into 80 languages can be found at [M3IT-80](https://huggingface.co/datasets/MMInstruction/M3IT-80).

## Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning and classification.

### Instruction Statistics

| Task | #Instructions |
|---|---|
| Image Captioning | 52 |
| Classification | 113 |
| Visual Question Answering | 95 |
| Knowledgeable Visual QA | 40 |
| Reasoning | 60 |
| Generation | 40 |
| Total | 400 |

### Task Statistics

| Task | Description | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | Given an image, write a description for the image. | 679,087 | 41,462 | 27,499 |
| Classification | Given an image, classify the image into pre-defined categories. | 238,303 | 100,069 | 21,206 |
| Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
| Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
| Reasoning | Given an image, conduct reasoning over the images. | 99,372 | 11,500 | 10,000 |
| Generation | Given an image, make compositions with certain requirements. | 145,000 | 11,315 | 17,350 |
| Chinese | CAP, CLS, VQA, and GEN tasks in Chinese. | 192,076 | 77,306 | 4,100 |
| Video | CAP, CLS, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
| Multi-lingual | Translated tasks in 80 languages | 0 | 240,000 | 184,000 |

### Detailed Dataset Statistics

| Task | Dataset | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | `coco` | 566,747 | 25,010 | 25,010 |
| | `textcap` | 97,765 | 13,965 | 0 |
| | `image-paragraph-captioning` | 14,575 | 2,487 | 2,489 |
| Classification | `coco-goi` | 30,000 | 2,000 | 0 |
| | `coco-text` | 118,312 | 27,550 | 0 |
| | `imagenet` | 30,000 | 50,000 | 0 |
| | `coco-itm` | 30,000 | 5,000 | 5,000 |
| | `snli-ve` | 20,000 | 14,339 | 14,740 |
| | `mocheg` | 4,991 | 180 | 466 |
| | `iqa` | 5,000 | 1,000 | 1,000 |
| Visual Question Answering | `vqa-v2` | 30,000 | 30,000 | 0 |
| | `shapes` | 13,568 | 1,024 | 1,024 |
| | `docvqa` | 39,463 | 5,349 | 0 |
| | `ocr-vqa` | 11,414 | 4,940 | 0 |
| | `st-vqa` | 26,074 | 0 | 4,070 |
| | `text-vqa` | 27,113 | 0 | 5,734 |
| | `gqa` | 30,001 | 5,001 | 0 |
| Knowledgeable Visual QA | `okvqa` | 9,009 | 5,046 | 0 |
| | `a-okvqa` | 17,056 | 1,145 | 0 |
| | `science-qa` | 12,726 | 4,241 | 4,241 |
| | `viquae` | 1,190 | 1,250 | 1,236 |
| Reasoning | `clevr` | 30,000 | 2,000 | 0 |
| | `nlvr` | 29,372 | 2,000 | 0 |
| | `vcr` | 25,000 | 5,000 | 5,000 |
| | `visual-mrc` | 15,000 | 2,500 | 5,000 |
| | `winoground` | 0 | 0 | 800 |
| Generation | `vist` | 5,000 | 4,315 | 4,350 |
| | `visual-dialog` | 50,000 | 1,000 | 1,000 |
| | `multi30k` | 90,000 | 6,000 | 12,000 |
| Chinese | `fm-iqa` | 164,735 | 75,206 | 0 |
| | `coco-cn` | 18,341 | 1,000 | 1,000 |
| | `flickr8k-cn` | 6,000 | 1,000 | 1,000 |
| | `chinese-food` | 0 | 0 | 1,100 |
| | `mmchat` | 3,000 | 1,000 | 1,000 |
| Video | `ss` | 2,000 | 2,000 | 2,000 |
| | `ivqa` | 5,994 | 2,000 | 2,000 |
| | `msvd-qa` | 1,161 | 245 | 504 |
| | `activitynet-qa` | 3,200 | 1,800 | 800 |
| | `msrvtt` | 6,513 | 497 | 2,990 |
| | `msrvtt-qa` | 2,000 | 1,000 | 1,000 |

## Dataset Structure

### HuggingFace Login (Optional)

```python
# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```

### Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

### Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```

### Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    # Decode the first image of the instance into a PIL image.
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```

### Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

| Task | Dataset [Citation] | Source |
|---|---|---|
| Image Captioning | `coco` [1] | [Source](https://cocodataset.org/#home) |
| | `textcap` [2] | [Source](https://textvqa.org/textcaps/) |
| | `image-paragraph-captioning` [3] | [Source](https://cs.stanford.edu/people/ranjaykrishna/im2p/index.html) |
| Classification | `coco-goi` [1] | [Source](https://cocodataset.org/#home) |
| | `coco-text` [4] | [Source](https://bgshih.github.io/cocotext/) |
| | `imagenet` [5] | [Source](https://www.image-net.org/) |
| | `coco-itm` [1] | [Source](https://cocodataset.org/#home) |
| | `snli-ve` [6] | [Source](https://github.com/necla-ml/SNLI-VE) |
| | `mocheg` [7] | [Source](https://github.com/VT-NLP/Mocheg) |
| | `iqa` [8] | [Source](https://github.com/icbcbicc/IQA-Dataset) |
| Visual Question Answering | `vqa-v2` [9] | [Source](https://visualqa.org/) |
| | `shapes` [10] | [Source](https://github.com/ronghanghu/n2nmn) |
| | `docvqa` [11] | [Source](https://www.docvqa.org/) |
| | `ocr-vqa` [12] | [Source](https://ocr-vqa.github.io/) |
| | `st-vqa` [13] | [Source](https://rrc.cvc.uab.es/?ch=11) |
| | `text-vqa` [14] | [Source](https://textvqa.org/) |
| | `gqa` [15] | [Source](https://cs.stanford.edu/people/dorarad/gqa/about.html) |
| Knowledgeable Visual QA | `okvqa` [16] | [Source](https://okvqa.allenai.org/) |
| | `a-okvqa` [17] | [Source](https://allenai.org/project/a-okvqa/home) |
| | `science-qa` [18] | [Source](https://scienceqa.github.io/) |
| | `viquae` [19] | [Source](https://github.com/PaulLerner/ViQuAE) |
| Reasoning | `clevr` [20] | [Source](https://cs.stanford.edu/people/jcjohns/clevr/) |
| | `nlvr` [21] | [Source](https://lil.nlp.cornell.edu/nlvr/) |
| | `vcr` [22] | [Source](https://visualcommonsense.com/) |
| | `visual-mrc` [23] | [Source](https://github.com/nttmdlab-nlp/VisualMRC) |
| | `winoground` [24] | [Source](https://huggingface.co/datasets/facebook/winoground) |
| Generation | `vist` [25] | [Source](https://visionandlanguage.net/VIST/) |
| | `visual-dialog` [26] | [Source](https://visualdialog.org/) |
| | `multi30k` [27] | [Source](https://github.com/multi30k/dataset) |
| Chinese | `fm-iqa` [28] | [Source](https://paperswithcode.com/dataset/fm-iqa) |
| | `coco-cn` [29] | [Source](https://github.com/li-xirong/coco-cn) |
| | `flickr8k-cn` [30] | [Source](https://github.com/li-xirong/flickr8kcn) |
| | `chinese-food` [31] | [Source](https://sites.google.com/view/chinesefoodnet) |
| | `mmchat` [32] | [Source](https://github.com/silverriver/MMChat) |
| Video | `ss` [33] | [Source](https://developer.qualcomm.com/software/ai-datasets/something-something) |
| | `ivqa` [34] | [Source](https://antoyang.github.io/just-ask.html) |
| | `msvd-qa` [35] | [Source](https://paperswithcode.com/dataset/msvd) |
| | `activitynet-qa` [36] | [Source](https://github.com/MILVLG/activitynet-qa) |
| | `msrvtt` [35] | [Source](https://paperswithcode.com/dataset/msr-vtt) |
| | `msrvtt-qa` [37] | [Source](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1) |

### Annotations

#### Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into a multimodal-to-text dialog format. The annotation process includes four steps:

- (1) **Stage I: Instruction Writing**: writing instructions for each task;
- (2) **Stage II: Data Format Unification**: structuring images and texts into a unified schema;
- (3) **Stage III: Quality Check**: checking the overall dataset quality;
- (4) **Stage IV: Key Datasets Translation**: building multilingual sets.

#### Who are the annotators?

Eight authors of this work are employed as human annotators, each of whom is a graduate student familiar with the relevant literature.

## Additional Information

### Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

### Citation Information

```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```

### Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

## References

- [1] Microsoft COCO: Common Objects in Context
- [2] TextCaps: a dataset for image captioning with reading comprehension
- [3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
- [4] COCO-Text: Dataset and benchmark for text detection and recognition in natural images
- [5] Imagenet large scale visual recognition challenge
- [6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
- [7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
- [8] Quantifying visual image quality: A Bayesian view
- [9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- [10] Neural Module Networks
- [11] DocVQA: A dataset for vqa on document images
- [12] OCR-VQA: Visual Question Answering by Reading Text in Images
- [13] Scene Text Visual Question Answering
- [14] Towards VQA Models That Can Read
- [15] GQA: A new dataset for real-world visual reasoning and compositional question answering
- [16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- [17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
- [18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- [19] ViQuAE: a dataset for knowledge-based visual question answering about named entities
- [20] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
- [21] A Corpus of Natural Language for Visual Reasoning
- [22] From recognition to cognition: Visual Commonsense Reasoning
- [23] VisualMRC: Machine reading comprehension on document images
- [24] WinoGround: Probing vision and language models for visio-linguistic compositionality
- [25] Visual Storytelling
- [26] Visual Dialog
- [27] Multi30k: Multilingual english-german image descriptions
- [28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
- [29] COCO-CN for cross-lingual image tagging, captioning, and retrieval
- [30] Adding Chinese Captions to Images
- [31] ChineseFoodNet: A large-scale image dataset for chinese food recognition
- [32] MMChat: Multi-Modal Chat Dataset on Social Media
- [33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
- [34] Just Ask: Learning to answer questions from millions of narrated videos
- [35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
- [36] ActivityNet-qa: A dataset for understanding complex web videos via question answering
- [37] MSR-VTT: A large video description dataset for bridging video and language
Provider:
MMInstruction
Summary of original information

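M3IT Dataset Overview

Dataset Description

  • Name: M3IT
  • Categories:
    • Task categories:
      • Image-to-text
      • Image classification
    • Size category: 1M<n<10M
  • Languages: English, Chinese
  • License: Other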

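Dataset Statistics

Instruction Statistics

| Task | #Instructions |
|---|---|
| Image Captioning | 52 |
| Classification | 113 |
| Visual Question Answering | 95 |
| Knowledgeable Visual QA | 40 |
| Reasoning | 60 |
| Generation | 40 |
| Total | 400 |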
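
As a quick arithmetic check on the instruction counts, the per-task numbers can be summed and compared against the stated total. The snippet below is only a sketch: the dictionary name and the derived `shares` are illustrative and not part of the dataset card.

```python
# Per-task instruction counts copied from the instruction statistics above.
instruction_counts = {
    "Image Captioning": 52,
    "Classification": 113,
    "Visual Question Answering": 95,
    "Knowledgeable Visual QA": 40,
    "Reasoning": 60,
    "Generation": 40,
}

# Sum of the individual rows, to compare with the "Total" row.
total_instructions = sum(instruction_counts.values())

# Fraction of all instructions contributed by each task (illustrative only).
shares = {
    task: count / total_instructions
    for task, count in instruction_counts.items()
}
```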

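Task Statistics

| Task | Description | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | Given an image, write a description for it. | 679,087 | 41,462 | 27,499 |
| Classification | Given an image, classify it into pre-defined categories. | 238,303 | 100,069 | 21,206 |
| Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
| Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
| Reasoning | Given an image, conduct reasoning over the image. | 99,372 | 11,500 | 10,000 |
| Generation | Given an image, make compositions with certain requirements. | 145,000 | 11,315 | 17,350 |
| Chinese | Captioning, classification, VQA, and generation tasks in Chinese. | 192,076 | 77,306 | 4,100 |
| Video | Captioning, classification, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
| Multi-lingual | Translated tasks in 80 languages. | 0 | 240,000 | 184,000 |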
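
The task statistics can be cross-checked against the declared size category (1M<n<10M) by summing each split. This is a minimal sketch using only the numbers listed above; the helper name `split_totals` is made up for illustration.

```python
# (train, val, test) instance counts per task group, copied from the task statistics.
task_counts = {
    "Image Captioning": (679_087, 41_462, 27_499),
    "Classification": (238_303, 100_069, 21_206),
    "Visual Question Answering": (177_633, 46_314, 10_828),
    "Knowledgeable Visual QA": (39_981, 11_682, 5_477),
    "Reasoning": (99_372, 11_500, 10_000),
    "Generation": (145_000, 11_315, 17_350),
    "Chinese": (192_076, 77_306, 4_100),
    "Video": (20_868, 7_542, 9_294),
    "Multi-lingual": (0, 240_000, 184_000),
}


def split_totals(counts):
    """Sum the train, val, and test columns over all task groups."""
    train, val, test = (sum(column) for column in zip(*counts.values()))
    return {"train": train, "val": val, "test": test}


totals = split_totals(task_counts)
grand_total = sum(totals.values())

# True when the overall instance count falls inside the declared 1M<n<10M bucket.
in_declared_bucket = 1_000_000 < grand_total < 10_000_000
```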

Dataset Structure

Data Loading

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```

Data Splits

```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```
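
Several datasets list zero examples for some split in the dataset card's detailed statistics (for example, `textcap` has no test examples). Whether such a split is missing or merely empty after loading is not stated, so the sketch below guards against both cases. `available_splits` is a hypothetical helper, and a plain dict stands in for the loaded dataset so the snippet runs offline.

```python
from typing import Mapping, Sequence


def available_splits(
    dataset: Mapping[str, Sequence],
    wanted: Sequence[str] = ("train", "validation", "test"),
) -> list:
    """Return the wanted split names that exist and hold at least one row."""
    return [name for name in wanted if name in dataset and len(dataset[name]) > 0]


# Stand-in for load_dataset("MMInstruction/M3IT", "textcap"); the rows are made up.
fake_textcap = {
    "train": [{"outputs": "placeholder caption"}],
    "validation": [{"outputs": "placeholder caption"}],
    "test": [],
}

usable = available_splits(fake_textcap)
```

Iterating over `usable` instead of hard-coding the three split names avoids a `KeyError` or an empty loop for such configurations.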

Data Instances

```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    # Decode the first image of the instance into a PIL image.
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```
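
Because `image_base64_str` is a list, an instance may carry more than one image. The following standard-library sketch decodes every entry and guesses the container format from its leading bytes. The helper names and the sample record are hypothetical; the sample payload contains only a PNG signature, not a real image.

```python
import base64
from io import BytesIO

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def decode_images(instance: dict) -> list:
    """Decode every base64 string of an instance into an in-memory buffer."""
    return [BytesIO(base64.b64decode(s)) for s in instance["image_base64_str"]]


def sniff_format(buffer: BytesIO) -> str:
    """Guess the image format from the leading bytes without moving the read pointer."""
    head = buffer.getvalue()[:8]
    if head.startswith(PNG_SIGNATURE):
        return "png"
    if head.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


# Made-up record following the field layout shown above.
sample = {
    "instruction": "Describe the image.",
    "inputs": "",
    "outputs": "A placeholder caption.",
    "image_base64_str": [
        base64.b64encode(PNG_SIGNATURE + b"\x00" * 8).decode("ascii")
    ],
}

buffers = decode_images(sample)
formats = [sniff_format(b) for b in buffers]
```

Each buffer can then be passed to `Image.open` exactly as in the loop above.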

Data Fields

```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```
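
When writing custom records in the same layout, a lightweight check against the four fields above can catch mistakes early. This is an illustrative sketch that does not use the `datasets` library; `schema_errors` and the sample record are hypothetical.

```python
# Fields declared as plain strings in the feature definition above.
EXPECTED_STRING_FIELDS = ("instruction", "inputs", "outputs")


def schema_errors(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record fits."""
    errors = []
    for field in EXPECTED_STRING_FIELDS:
        if not isinstance(record.get(field), str):
            errors.append(f"{field}: expected str")
    images = record.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(s, str) for s in images):
        errors.append("image_base64_str: expected list of str")
    return errors


# Made-up record; the base64 payload is a placeholder, not a real image.
good_record = {
    "instruction": "Answer the question about the image.",
    "inputs": "What color is the car?",
    "outputs": "Red.",
    "image_base64_str": ["aGVsbG8="],
}

problems = schema_errors(good_record)
```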

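Dataset Creation

Annotation Process

  • Stage I: Instruction writing
  • Stage II: Data format unification
  • Stage III: Quality check
  • Stage IV: Key datasets translation

Annotators

Eight authors of the work served as human annotators; each is a graduate student familiar with the relevant literature.

Additional Information

Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, users are advised to check the original project or contact the dataset owner for detailed license information.

The annotated instruction data is licensed under CC BY 4.0.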
