Mono-InternVL-2B-Synthetic-Data
Source: ModelScope community. Updated 2025-12-10. Indexed 2025-03-15.
Download link:
https://modelscope.cn/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data
Description:
# Mono-InternVL-2B Synthetic Data
This dataset is used for training the S1.2 stage of Mono-InternVL-2B, as described in the paper [Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models](https://huggingface.co/papers/2507.12566).
- **Project Page:** [https://internvl.github.io/blog/2024-10-10-Mono-InternVL/](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/)
- **Code:** [https://github.com/OpenGVLab/Mono-InternVL](https://github.com/OpenGVLab/Mono-InternVL)
## Dataset Description
### Purpose
This dataset is used for training the S1.2 stage of Mono-InternVL-2B.
### Data Source
We utilize the pre-trained InternVL-8B to produce short captions for 258 million images sampled from Laion-2B, Coyo-700M and SAM(en).
### Size
- Total records: 259,064,832
- Files: 3,072 JSONL files, each containing 84,331 records.
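As a quick consistency check, the per-file record count multiplied by the number of files reproduces the stated total. The variable names in this sketch are illustrative only.

```python
# Figures taken from the size list above.
num_files = 3_072
records_per_file = 84_331

total_records = num_files * records_per_file
print(f"{total_records:,}")  # 259,064,832
```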
## Dataset Structure
### File Format
- Each file is in JSON Lines (`.jsonl`) format, with one JSON object per line.
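A minimal sketch of reading this format with Python's standard `json` module follows. The helper name `iter_jsonl` and the keys in the sample records are hypothetical placeholders: this card does not document the actual record schema, so inspect a real file before relying on any field name.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed JSON object per non-empty line of a text handle."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


# Placeholder records: the keys are invented for illustration and are NOT
# the dataset's actual schema.
sample = io.StringIO('{"key": "a", "text": "first"}\n\n{"key": "b", "text": "second"}\n')
records = list(iter_jsonl(sample))
print(len(records))  # 2
```

For a file on disk, the same helper could be used as `with open(path, encoding="utf-8") as f: ...`, iterating over `iter_jsonl(f)` so that only one line is held in memory at a time.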
## Sample Usage
You can download the dataset directly using `git lfs`:
```bash
git lfs install
git clone https://huggingface.co/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data
```
Alternatively, you can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load the dataset. By default, all JSONL files will be loaded into a 'train' split.
dataset = load_dataset("OpenGVLab/Mono-InternVL-2B-Synthetic-Data")
# Print the dataset info
print(dataset)
# Access an example from the 'train' split
print(dataset["train"][0])
```
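With roughly 259 million records, materializing the full `train` split may be impractical on many machines. One option is the streaming mode of the `datasets` library, sketched below. The helper names `take` and `peek_remote` are hypothetical, and the sketch assumes that `datasets` is installed, that network access is available, and that the repository loads with the default JSON builder.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def peek_remote(n=3):
    """Stream the first n records instead of downloading every file."""
    from datasets import load_dataset  # imported lazily; only needed here

    stream = load_dataset(
        "OpenGVLab/Mono-InternVL-2B-Synthetic-Data",
        split="train",
        streaming=True,  # yields records lazily as an IterableDataset
    )
    return take(stream, n)
```

Calling `peek_remote(3)` should then return three records as dictionaries, which is a cheap way to inspect the field names before committing to a full download.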
For custom dataset preparation for supervised finetuning, refer to the [Dataset Preparation section in the official GitHub repository](https://github.com/OpenGVLab/Mono-InternVL#dataset-preparation). An example JSONL entry for fine-tuning conversations looks like this:
```json
{
  "id": "000000120375",
  "image": "coco/train2017/000000120375.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat type of vehicle is driving down the street in the image?"
    },
    {
      "from": "gpt",
      "value": "A red sports utility vehicle (SUV) is driving down the street in the image."
    },
    {
      "from": "human",
      "value": "Is the street crowded with people?"
    },
    {
      "from": "gpt",
      "value": "Yes, the street is filled with a considerable number of people, which indicates that the area is busy."
    }
  ]
}
```
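Before fine-tuning on custom data, it can help to sanity-check each entry. The sketch below is one possible validator. Its rules (required keys, alternating `human`/`gpt` speakers, an `<image>` placeholder in the first turn) are assumptions inferred from the single example above, not requirements stated by the repository, and the function name `validate_entry` is hypothetical.

```python
import json


def validate_entry(entry):
    """Return a list of problems found in one conversation-style entry."""
    problems = []
    for key in ("id", "image", "conversations"):
        if key not in entry:
            problems.append(f"missing key: {key}")
    turns = entry.get("conversations", [])
    if not turns:
        problems.append("no conversation turns")
    for index, turn in enumerate(turns):
        # Assumed convention: speakers alternate, starting with "human".
        expected = "human" if index % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {index}: expected speaker {expected!r}")
    if turns and "<image>" not in turns[0].get("value", ""):
        problems.append("first turn lacks the <image> placeholder")
    return problems


# One JSONL line built from the first two turns of the example entry.
line = json.dumps({
    "id": "000000120375",
    "image": "coco/train2017/000000120375.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat type of vehicle is driving down the street in the image?"},
        {"from": "gpt", "value": "A red sports utility vehicle (SUV) is driving down the street in the image."},
    ],
})
print(validate_entry(json.loads(line)))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large file in a single pass.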
Provider:
maas
Created:
2025-03-13



