Mono-InternVL-2B-Synthetic-Data

Name: Mono-InternVL-2B-Synthetic-Data
Creator: maas
Published: 2025-12-10 16:26:21
License: No description available

ModelScope community; updated 2025-12-10, indexed 2025-03-15

Download link:

https://modelscope.cn/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data


Resource description:

# Mono-InternVL-2B Synthetic Data

This dataset is used for training the S1.2 stage of Mono-InternVL-2B, as described in the paper [Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models](https://huggingface.co/papers/2507.12566).

- **Project Page:** [https://internvl.github.io/blog/2024-10-10-Mono-InternVL/](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/)
- **Code:** [https://github.com/OpenGVLab/Mono-InternVL](https://github.com/OpenGVLab/Mono-InternVL)

## Dataset Description

### Purpose

This dataset is used for training the S1.2 stage of Mono-InternVL-2B.

### Data Source

We utilize the pre-trained InternVL-8B to produce short captions for 258 million images sampled from Laion-2B, Coyo-700M and SAM(en).

### Size

- Total records: 259,064,832
- Files: 3,072 JSONL files, each containing 84,331 records.

## Dataset Structure

### File Format

- Each file is in JSON Lines (`.jsonl`) format, with one JSON object per line.

## Sample Usage

You can download the dataset directly using `git lfs`:

```bash
git lfs install
git clone https://huggingface.co/datasets/OpenGVLab/Mono-InternVL-2B-Synthetic-Data
```

Alternatively, you can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset. By default, all JSONL files will be loaded into a 'train' split.
dataset = load_dataset("OpenGVLab/Mono-InternVL-2B-Synthetic-Data")

# Print the dataset info
print(dataset)

# Access an example from the 'train' split
print(dataset["train"][0])
```

For custom dataset preparation for supervised finetuning, refer to the [Dataset Preparation section in the official GitHub repository](https://github.com/OpenGVLab/Mono-InternVL#dataset-preparation).
An example JSONL entry for fine-tuning conversations looks like this (pretty-printed here for readability; in a `.jsonl` file each record occupies a single line):

```json
{
  "id": "000000120375",
  "image": "coco/train2017/000000120375.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat type of vehicle is driving down the street in the image?"
    },
    {
      "from": "gpt",
      "value": "A red sports utility vehicle (SUV) is driving down the street in the image."
    },
    {
      "from": "human",
      "value": "Is the street crowded with people?"
    },
    {
      "from": "gpt",
      "value": "Yes, the street is filled with a considerable number of people, which indicates that the area is busy."
    }
  ]
}
```
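
The conversation format in the example entry can be read with the standard library alone. The following is a minimal sketch, not code from the official repository: the helper names `iter_jsonl` and `qa_pairs` are hypothetical, the field names (`conversations`, `from`, `value`) are taken from the example entry, and it assumes turns strictly alternate between `human` and `gpt`. Whether the synthetic caption files in this dataset follow the same schema is not stated here, so inspect a record before relying on it.

```python
import io
import json
from typing import Dict, Iterator, List, TextIO, Tuple


def iter_jsonl(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc


def qa_pairs(record: Dict) -> List[Tuple[str, str]]:
    """Pair each 'human' turn with the 'gpt' turn that follows it.

    Assumption: turns alternate human, gpt, human, gpt, ...
    """
    turns = record["conversations"]
    pairs = []
    for question, answer in zip(turns[0::2], turns[1::2]):
        if question["from"] != "human" or answer["from"] != "gpt":
            raise ValueError("expected alternating human/gpt turns")
        pairs.append((question["value"], answer["value"]))
    return pairs


# The example entry, rebuilt as a Python dict; the newline after "<image>"
# is an assumption about the intended escape sequence.
SAMPLE_RECORD = {
    "id": "000000120375",
    "image": "coco/train2017/000000120375.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat type of vehicle is driving down the street in the image?"},
        {"from": "gpt", "value": "A red sports utility vehicle (SUV) is driving down the street in the image."},
        {"from": "human", "value": "Is the street crowded with people?"},
        {"from": "gpt", "value": "Yes, the street is filled with a considerable number of people, which indicates that the area is busy."},
    ],
}

# json.dumps emits a single line, which is what a .jsonl file stores per record.
sample_line = json.dumps(SAMPLE_RECORD)
records = list(iter_jsonl(io.StringIO(sample_line + "\n")))
pairs = qa_pairs(records[0])

for question, answer in pairs:
    print(repr(question), "->", repr(answer))
```

To read a downloaded shard instead of the in-memory sample, pass an open file handle, for example `with open(path, encoding="utf-8") as f: ...`, where `path` points at one of the `.jsonl` files. Iterating lazily avoids holding all 84,331 records of a file in memory at once.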


Provider: maas
Created: 2025-03-13
