Infinity-MM
Source: ModelScope community (updated 2026-05-14, indexed 2024-11-02)
Download link: https://modelscope.cn/datasets/BAAI/Infinity-MM
## **Introduction**
<p align="center">
<img src="infinity-mm-logo.jpeg" width="300">
</p>
<p align="center">
<em>Beijing Academy of Artificial Intelligence (BAAI)</em><br/>
</p>
We have collected, organized, and open-sourced **Infinity-MM**, a large-scale multimodal instruction dataset consisting of tens of millions of samples. Quality filtering and deduplication keep the dataset both high-quality and diverse.
We also propose a synthetic data generation method built on open-source models and a labeling system, which uses detailed image annotations and diverse question generation.
Using Infinity-MM, we trained a 2-billion-parameter VLM, **Aquila-VL-2B**, which achieves SOTA performance among models of the same scale.
## **News**
- `2024/11/19` We have released [**Aquila-VL-2B**](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen/) and all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-VL-2B-Intermediate) obtained during different stages of training. Please feel free to use these models for analysis and experimentation.
- `2024/11/05` The data in stage2/7M_0712_math_plus_system_release_0802 was incomplete. We have now updated it, and the new data is placed in stage2/7M_0712_math_plus_system_release. Please replace the previous data with this updated version.
- `2024/10/28` All the data has been uploaded.
- `2024/10/24` The data for stages 2, 3, and 4 has been transferred. The stage 1 data will finish transferring next Friday (2024/11/1).
## **Data Category of Infinity-MM Data**
| Data Folder | Data Category | Size | Data Composition |
|:---:|:---:|:---:|:---:|
| Stage 1 | Image-Caption Data | 10M | Caption Data 10M |
| Stage 2 | General Visual Instruction Data | 25.8M | General Data 7.1M <br> General OCR Data 2.6M <br> Doc/Chart/Screen Data 5.8M <br> Math/Reasoning Data 1.3M <br> Text Instruct Data 9M |
| Stage 3 | Selective Visual Instruction Data | 6M | LLaVA-onevision Data 3.5M <br> Infinity-Instruct 1.3M <br> Docmatix Data 1.2M |
| Stage 4 | GPT4 & Synthetic Data | 3M | Data Generated by GPT4 1.7M <br> Synthetic Data 0.8M <br> Specific Task Data 0.4M <br> Infinity-Preference Data 0.1M |
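
As a quick sanity check, the component sizes in the Data Composition column add up to the size reported for each stage. The sketch below copies the figures from the table; the dictionary layout and variable names are only illustrative.

```python
# Sizes in millions of samples, copied from the table above.
stage_components = {
    "Stage 1": [10.0],
    "Stage 2": [7.1, 2.6, 5.8, 1.3, 9.0],
    "Stage 3": [3.5, 1.3, 1.2],
    "Stage 4": [1.7, 0.8, 0.4, 0.1],
}

# Round to one decimal place to avoid floating-point noise in the sums.
stage_totals = {
    name: round(sum(parts), 1) for name, parts in stage_components.items()
}
overall_total = round(sum(stage_totals.values()), 1)

for name, total in stage_totals.items():
    print(f"{name}: {total}M")
print(f"All stages: {overall_total}M")
```

Summed over the four stages, the table accounts for 44.8M samples.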
## **An example of the formatted data**
```json
{
    "id": "10",
    "source": "llava",
    "conversations": [
        {
            "from": "human",
            "value": "<image> Can you create a descriptive poem about the scene?"
        },
        {
            "from": "gpt",
            "value": "Upon a plate of verdant hue,\nA cluster of roasted nuts we view.\nCashews golden, shining bright,\nAlmonds dark, a roasted delight.\nSprinkled seeds of sesame light,\nEnhancing flavors, a gentle bite.\nOn white wood they rest in grace,\nA rustic charm, a peaceful place.\nSimple joys in a vibrant array,\nA perfect treat to start the day."
        }
    ],
    "image": "/path/of/the/image",
    "ram++_tags": ["wall", "dry", "grassy", "hill", "stone", "sun", "sunset"],
    "ram++_tags_score": [9.56411075592041, 2.3733813762664795, 1.4329272508621216, 1.9840935468673706, 1.9766467809677124, 2.255882501602173, 2.575751781463623],
    "phash": [12512305226191801180],
    "qw2vl_loss": 3.0559005737304688
}
```
The meaning of each key:
* **'id'**: The id of the record.
* **'source'**: The source of the record.
* **'conversations'**: The conversations of the record.
* **'image'**: The absolute path of the image.
* **'ram++_tags' & 'ram++_tags_score'**: These two values are produced by the [Ram++] model. 'ram++_tags' holds the tags of the image, and 'ram++_tags_score' holds the score of each tag.
* **'phash'**: The phash value of the image.
* **'qw2vl_loss'**: This value is computed by [Qwen2-VL-2B].
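
The fields above can be inspected with a few lines of Python. This is a minimal sketch, not part of the official tooling: it assumes that `ram++_tags` and `ram++_tags_score` are parallel lists, and it compares `phash` values by Hamming distance, which is a common way to compare perceptual hashes but is not confirmed here as the method used for deduplication. The helper `phash_hamming` is a hypothetical name.

```python
import json

# Abbreviated copy of the example record above (conversations omitted).
record = json.loads("""
{
  "id": "10",
  "image": "/path/of/the/image",
  "ram++_tags": ["wall", "dry", "grassy", "hill", "stone", "sun", "sunset"],
  "ram++_tags_score": [9.56411075592041, 2.3733813762664795, 1.4329272508621216,
                       1.9840935468673706, 1.9766467809677124, 2.255882501602173,
                       2.575751781463623],
  "phash": [12512305226191801180],
  "qw2vl_loss": 3.0559005737304688
}
""")

# Assumption: the i-th score belongs to the i-th tag.
tag_scores = sorted(
    zip(record["ram++_tags"], record["ram++_tags_score"]),
    key=lambda pair: pair[1],
    reverse=True,
)
top_tag, top_score = tag_scores[0]


def phash_hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes (hypothetical helper)."""
    return bin(a ^ b).count("1")


h = record["phash"][0]
print(top_tag, round(top_score, 2))
print(phash_hamming(h, h))          # identical hashes differ in 0 bits
print(phash_hamming(h, h ^ 0b101))  # flipping two bits gives a distance of 2
```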
## How to use
You can download the dataset and then follow the steps below:
* **Save the following code as `revert_wds_shards.py`:**
```python
import argparse
import glob
import json
import os
import sys

import webdataset as wds
from tqdm import tqdm

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--wds-path', type=str, required=True,
                        help="glob pattern matching the WebDataset .tar shards")
    parser.add_argument('--output-path', type=str, required=True,
                        help="directory for the extracted images and the JSON file")
    parser.add_argument('--output-prefix', type=str, required=True,
                        help="name of the output JSON file, without the .json extension")
    args = parser.parse_args()

    output = args.output_path
    if not os.path.exists(output):
        os.makedirs(output)
    else:
        print(f"Dir: {output} already existed.")

    tar_files = glob.glob(args.wds_path)
    if not tar_files:
        print(f"No files found matching the pattern: {args.wds_path}")
        sys.exit(1)

    # Metadata fields that are removed from every record before saving.
    dropped_fields = ('source', 'ram++_tags', 'ram++_tags_score', 'phash')

    json_list = []
    dataset = wds.WebDataset(tar_files)
    filtered = 0       # samples skipped because the image could not be written
    batch_size = 1000  # images per output subdirectory
    lines = 0          # records kept so far
    for sample in tqdm(dataset):
        entry = json.loads(sample['json'])
        for field in dropped_fields:
            entry.pop(field, None)

        img_data = sample['jpg']
        if img_data:
            # Keep the original file extension; name the file after the sample key.
            _, file_extension = os.path.splitext(entry['image'])
            img_filename = f"{sample['__key__']}{file_extension}"
            try:
                target_dir = os.path.join(output, f"{lines // batch_size:05d}")
                os.makedirs(target_dir, exist_ok=True)
                with open(os.path.join(target_dir, img_filename), 'wb') as img_file:
                    img_file.write(img_data)
            except Exception as exn:
                print(exn)
                filtered += 1
                continue
            # Point the record at the extracted image.
            entry['image'] = os.path.join(os.path.abspath(target_dir), img_filename)
        # Records with empty image bytes are kept unchanged.
        json_list.append(entry)
        lines += 1

    json_file = os.path.join(output, f"{args.output_prefix}.json")
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(json_list, f, ensure_ascii=False, indent=4)
    print(f"Filtered {filtered} samples.", flush=True)
```
* **Then run the following commands for each sub-dataset:**
```shell
export wds_path='/the/actual/path/of/each/dataset/*.tar'
export output_path='/the/path/you/want/to/save/the/dataset/'
export output_prefix='the json name of dataset you want to save'
python revert_wds_shards.py --wds-path "$wds_path" --output-path "$output_path" --output-prefix "$output_prefix"
```
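
Once the script has written `<output_prefix>.json`, it can be worth checking that every record points at an image that actually exists on disk. The sketch below is one possible check, not part of the released tooling; `summarize_converted` is a hypothetical helper, and the demo builds a tiny two-record file in a temporary directory so that it runs on its own.

```python
import json
import os
import tempfile


def summarize_converted(json_file):
    """Return (record count, number of records whose image file is missing)."""
    with open(json_file, encoding="utf-8") as f:
        records = json.load(f)
    missing = sum(
        1 for r in records
        if r.get("image") and not os.path.exists(r["image"])
    )
    return len(records), missing


# Self-contained demo: one record with an existing image, one without.
with tempfile.TemporaryDirectory() as tmp:
    img_path = os.path.join(tmp, "00000", "sample.jpg")
    os.makedirs(os.path.dirname(img_path))
    with open(img_path, "wb") as f:
        f.write(b"fake image bytes")

    demo_file = os.path.join(tmp, "demo.json")
    with open(demo_file, "w", encoding="utf-8") as f:
        json.dump(
            [
                {"id": "1", "image": img_path},
                {"id": "2", "image": os.path.join(tmp, "gone.jpg")},
            ],
            f,
        )

    total, missing = summarize_converted(demo_file)
    print(f"{total} records, {missing} with a missing image")
```

For real data, pass the path of the JSON file produced by `revert_wds_shards.py` to `summarize_converted` instead of the demo file.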
## **Data Source of Infinity-MM Dataset**
| Data Source | Size |
|:---:|:---:|
| Emu2 | 10M |
| LVIS-Instruct | 223K |
| LLaVA-CC3M-Pretrain-595K | 595K |
| Visdial | 116K |
| Sharegpt4 | 3.2M |
| STVQA | 43K |
| MMC-INST | 500K |
| MathV360K | 338K |
| MMC-Alignment | 250K |
| DocReason | 26K |
| ALLaVA | 1.7M |
| Cocotext | 163K |
| Docvqa | 16K |
| Geoqa+ | 72K |
| DocDownstream | 700K |
| Cambrian | 8.3M |
| DocStruct4M | 4M |
| LLaVA-onevision | 4M |
| Docmatix | 1.2M |
| Infinity-Instruct | 7M |
| Our Synthetic Data | 0.8M |
## **Model**
Our **[Aquila-VL-2B]** model, a VLM with 2 billion parameters, achieves state-of-the-art (SOTA) performance among models of the same scale.
## **Citation**
If you find this dataset useful, please cite the following work:
```
@misc{gu2024infinitymmscalingmultimodalperformance,
title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
year={2024},
eprint={2410.18558},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.18558},
}
```
[Ram++]: https://github.com/xinyu1205/recognize-anything?tab=readme-ov-file
[Qwen2-VL-2B]: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
[Aquila-VL-2B]: https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen
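Provider: maas
Created: 2024-10-25

Summary (compiled by 遇见数据集): Infinity-MM is a high-quality, diverse, large-scale multimodal instruction dataset containing tens of millions of samples. It is divided into four stages and covers multiple data types. The Aquila-VL-2B model trained on this dataset performs strongly among models of the same scale.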



